Friday, June 02, 2006

Filling test tables quickly

Let's say that you are building a new application, and you need to test it against a large set of data. You would need either to borrow the data from some known source or to create it yourself.

If you have such collection at your disposal, good for you. But more often than not you need some sort of data that is not ready in your repositories, and then you need to have a quick method to create it.

I will tell you three quick tricks to create large datasets of simple data types, namely numbers, words, and dates.

Le's assume that we need at least one million records. Here's how to go for it.

numbers

Creating a large table with just a numeric field is quite simple.

You may be tempted to run a loop in your favorite language, or even in a SQL stored procedure, but this approach would run for quite a long time.
There is a better solution, that will fill your large table in a few seconds

drop table if exists numbers;
create table numbers ( id int not null primary key);

delimiter $$

drop procedure if exists fill_numbers $$
create procedure fill_numbers()
deterministic
begin
declare counter int default 1;
insert into numbers values (1);
while counter < 1000000
do
insert into numbers (id)
select id + counter
from numbers;
select count(*) into counter from numbers;
select counter;
end while;
end $$
delimiter ;

call fill_numbers();
Rather than inserting 1,000,000 lines, this procedure will insert just one record, and then it will double the table 20 times, until we end up with a table containing 1,048,576 lines (220). This operation runs in less than 8 seconds in my laptop, which is quite slow if compared to my usual servers.
Even without a stored procedure, you could just insert a line manually and then execute 20 times this query:
insert into numbers (id) select id + (select count(*) from numbers) from numbers; select count(*) from numbers;

And it should not take you more than 30 seconds.

words

If your case calls for a large list of unique words, you could of course build it with a program, but then again the insertion process would be quite slow to complete. A faster method is to load an existing list from as file.

All Unix system include a word list, whose size ranges from a few thousand to half a million. If you don't have such a list available in your box, you can get one (or build it from several ones) from several places. Good places to start looking for words are this and this.
At the end of the exercise, let's say that you've got a list of about half a million distinct words in /usr/share/dict/words. Thus, you can proceed to build your table.

drop table if exists words;
create table words (
id int not null auto_increment primary key,
t varchar(50) not null
);

load data local infile '/usr/share/dict/words'
into table words (t);

Query OK, 518584 rows affected (4.94 sec)
Records: 518584 Deleted: 0 Skipped: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 518584 |
+----------+
1 row in set (0.04 sec)
That was quick, but we got only about half of what we needed. We said that we wanted one million records, and here we only have a little more than five hundred thousand.
Since we need unique words, we can ask the database to produce the missing half by reversing the existing ones.
insert into words (t) select reverse(t) from words;
Query OK, 518584 rows affected (3.98 sec)
Records: 518584 Duplicates: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1037168 |
+----------+
Almost there. Now we passed the million records mark, but we are not sure that they are unique, since the reverse of one word could be the duplicate of an existing word (think about mood and doom, for example). Thus, to complete the task, let's add a unique index with the IGNORE clause, so that we'll get rid of any duplicate.
alter ignore table words add unique key (t);
Query OK, 1037168 rows affected (46.69 sec)
Records: 1037168 Duplicates: 5791 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1031377 |
+----------+
That's it. A million words, without much sweat.

dates

Finally, let's see how to create a large list of dates. Actually, you would seldom need a million dates, since a million days cover more than 2,700 years. A table of just dates, thus, will usually range from 1,000 to 10,000 records. When you need a million records, you are more likely talking about DATETIME values, with intervals of hours, minutes, or seconds. Nothing will prevent you from using this technique to create one thousand DATE values, but let's keep to our original goal of filling a large table.
Then, if we want records with one minute interval, we can proceed like this:
drop table if exists dates;
create table dates (
id int(11) not null auto_increment primary key,
dt datetime not NULL
) engine=myisam;

delimiter $$

drop procedure if exists make_dates $$
CREATE PROCEDURE make_dates( max_recs int)
begin
declare start_dt datetime;
declare numrecs int default 1;
set start_dt = date_format( now() - interval max_recs minute, '%Y-%m-%d %H:%i:00');

insert into dates (dt) values (start_dt );

while numrecs < max_recs
do
insert into dates (dt)
select dt + interval ( numrecs ) minute
from dates;
select count(*) into numrecs from dates;
select numrecs;
end while;
end $$

delimiter ;
Does it look familiar? It should, since it's the same technique we've used to put a set of numbers into a table.

Only this time we use the record count to calculate the interval in minutes between the existing records and the ones being inserted.
Also in this case, we double the table 20 times, to pass the one million records mark.

call make_dates( 1000000 );
+---------+
| numrecs |
+---------+
| 2 |
+---------+
1 row in set (0.02 sec)

+---------+
| numrecs |
+---------+
| 4 |
+---------+
1 row in set (0.02 sec)

# ... 16 more counts

+---------+
| numrecs |
+---------+
| 524288 |
+---------+
1 row in set (5.99 sec)

+---------+
| numrecs |
+---------+
| 1048576 |
+---------+
1 row in set (10.18 sec)

select count(*), min(dt), max(dt) from dates;
+----------+---------------------+---------------------+
| count(*) | min(dt) | max(dt) |
+----------+---------------------+---------------------+
| 1048576 | 2004-07-07 13:57:00 | 2006-07-05 18:12:00 |
+----------+---------------------+---------------------+
It took a little more than the numbers, because of the overhead of calculating one million date intervals, but it was about ten seconds in total, which is a reasonable time to get a test table.
There are other techniques as well, but these three are techniques that you can use to get the job done even without dedicated tools.

3 comments:

Frank said...

I've used the code in your post as a basis for writing my own 'insert a range of dates' SQL. Thank you! This is what I made. (Please note that this is for SQL Server though)

SET DATEFIRST 1

declare @counter as int
set @counter = 1
declare @numberofdays int
set @numberofdays = 365
declare @newdate datetime
set @newdate = GETDATE()

set nocount on

DELETE FROM days

WHILE (@counter <= @numberofdays)
BEGIN
INSERT INTO days ([date], [day], [month], [year], [weekday], [week])
VALUES (
@newdate,
Datepart(dd, @newdate),
Datepart(mm, @newdate),
Datepart(yyyy, @newdate),
Datepart(dw, @newdate),
Datepart(wk, @newdate)
)
;
set @counter = @counter + 1
set @newdate = DATEADD(d, 1, @newdate)
END

set nocount off

select * from days

Chris said...

I often use a table of numbers to split long strings and delimited lists and the procedures here a very helpful. Thanks. On my slow computer it took 45 seconds using fill_numbers().

Another way is to just use self-joins on a table of digits.

create table digits(digit int);
insert into digits values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);

## need auto_increment
create table numbers ( id int not null auto_increment primary key );

insert into numbers select NULL from digits d1, digits d2, digits d3, digits d4, digits d5, digits d6;

Query OK, 1000000 rows affected (12.44 sec)


If you just need a short sequence of numbers, you can often get by without the table, but you do have to sort the column.

select concat(tens.digit, ones.digit)+0 as number from digits tens, digits ones order by 1;
+--------+
| number |
+--------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
| 11 |
| 12 |
..


Chris

Ronald Bradford said...

With your SQL only numbers line:

insert into numbers (id) select id + (select count(*) from numbers) from numbers; select count(*) from numbers;

you need to include you still have to seed the table first with

insert into numbers values (1);