The Data Charmer: June 2006

Let's say that you are building a new application, and you need to test it against a large set of data. You would need either to borrow the data from some known source or to create it yourself.

If you have such collection at your disposal, good for you. But more often than not you need some sort of data that is not ready in your repositories, and then you need to have a quick method to create it.

I will tell you three quick tricks to create large datasets of simple data types, namely numbers, words, and dates.

Le's assume that we need at least one million records. Here's how to go for it.

numbers

Creating a large table with just a numeric field is quite simple.

You may be tempted to run a loop in your favorite language, or even in a SQL stored procedure, but this approach would run for quite a long time.
There is a better solution, that will fill your large table in a few seconds

drop table if exists numbers;
create table numbers ( id int not null primary key);

delimiter $$

drop procedure if exists fill_numbers $$
create procedure fill_numbers()
deterministic
begin
  declare counter int default 1;
  insert into numbers values (1);
  while counter < 1000000
  do
      insert into numbers (id)
          select id + counter
          from numbers;
      select count(*) into counter from numbers;
      select counter;
  end while;
end $$
delimiter ;

call fill_numbers();

Rather than inserting 1,000,000 lines, this procedure will insert just one record, and then it will double the table 20 times, until we end up with a table containing 1,048,576 lines (2²⁰). This operation runs in less than 8 seconds in my laptop, which is quite slow if compared to my usual servers.
Even without a stored procedure, you could just insert a line manually and then execute 20 times this query:

insert into numbers (id) select id + (select count(*) from numbers) from numbers; select count(*) from numbers;

And it should not take you more than 30 seconds.

words

If your case calls for a large list of unique words, you could of course build it with a program, but then again the insertion process would be quite slow to complete. A faster method is to load an existing list from as file.

All Unix system include a word list, whose size ranges from a few thousand to half a million. If you don't have such a list available in your box, you can get one (or build it from several ones) from several places. Good places to start looking for words are this and this.
At the end of the exercise, let's say that you've got a list of about half a million distinct words in /usr/share/dict/words. Thus, you can proceed to build your table.

drop table if exists words;
create table words (
  id int not null auto_increment primary key,
  t varchar(50) not null
);

load data local infile '/usr/share/dict/words'
  into table words (t);

Query OK, 518584 rows affected (4.94 sec)
Records: 518584  Deleted: 0  Skipped: 0  Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
|   518584 |
+----------+
1 row in set (0.04 sec)

That was quick, but we got only about half of what we needed. We said that we wanted one million records, and here we only have a little more than five hundred thousand.
Since we need unique words, we can ask the database to produce the missing half by reversing the existing ones.

insert into words (t) select reverse(t) from words;
Query OK, 518584 rows affected (3.98 sec)
Records: 518584  Duplicates: 0  Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
|  1037168 |
+----------+

Almost there. Now we passed the million records mark, but we are not sure that they are unique, since the reverse of one word could be the duplicate of an existing word (think about mood and doom, for example). Thus, to complete the task, let's add a unique index with the IGNORE clause, so that we'll get rid of any duplicate.

alter ignore table words add unique key (t);
Query OK, 1037168 rows affected (46.69 sec)
Records: 1037168  Duplicates: 5791  Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
|  1031377 |
+----------+

That's it. A million words, without much sweat.

dates

Finally, let's see how to create a large list of dates. Actually, you would seldom need a million dates, since a million days cover more than 2,700 years. A table of just dates, thus, will usually range from 1,000 to 10,000 records. When you need a million records, you are more likely talking about DATETIME values, with intervals of hours, minutes, or seconds. Nothing will prevent you from using this technique to create one thousand DATE values, but let's keep to our original goal of filling a large table.
Then, if we want records with one minute interval, we can proceed like this:

drop table if exists dates;
create table dates (
  id int(11) not null auto_increment primary key,
  dt datetime not NULL
) engine=myisam;

delimiter $$

drop procedure if exists make_dates $$
CREATE PROCEDURE make_dates( max_recs int)
begin
  declare start_dt datetime;
  declare numrecs int default 1;
  set start_dt = date_format( now() - interval max_recs minute, '%Y-%m-%d %H:%i:00');

  insert into dates (dt) values (start_dt );

  while numrecs < max_recs
  do
      insert into dates (dt)
          select dt + interval ( numrecs ) minute
          from dates;
      select count(*) into numrecs from dates;
      select numrecs;
  end while;
end $$

delimiter ;

Does it look familiar? It should, since it's the same technique we've used to put a set of numbers into a table.

Only this time we use the record count to calculate the interval in minutes between the existing records and the ones being inserted.
Also in this case, we double the table 20 times, to pass the one million records mark.

call make_dates( 1000000 );
+---------+
| numrecs |
+---------+
|       2 |
+---------+
1 row in set (0.02 sec)

+---------+
| numrecs |
+---------+
|       4 |
+---------+
1 row in set (0.02 sec)

# ... 16 more counts

+---------+
| numrecs |
+---------+
|  524288 |
+---------+
1 row in set (5.99 sec)

+---------+
| numrecs |
+---------+
| 1048576 |
+---------+
1 row in set (10.18 sec)

select count(*), min(dt), max(dt) from dates;
+----------+---------------------+---------------------+
| count(*) | min(dt)             | max(dt)             |
+----------+---------------------+---------------------+
|  1048576 | 2004-07-07 13:57:00 | 2006-07-05 18:12:00 |
+----------+---------------------+---------------------+

It took a little more than the numbers, because of the overhead of calculating one million date intervals, but it was about ten seconds in total, which is a reasonable time to get a test table.
There are other techniques as well, but these three are techniques that you can use to get the job done even without dedicated tools.