Tuesday, June 20, 2006

Talking at FrOSCon 2006



I will be at FrOSCon 2006 on June 24, with a talk about Pivot tables in MySQL 5.
I will arrive in Sankt Augustin on June 23, flying from Alghero Fertilia, Italy to Frankfurt Hahn, Germany and then driving to the conference site (travel plan).
The funny thing is that, when I booked my flight, I had no idea that I was about to get in the middle of the Soccer World Cup. Being no soccer fan, I realized this fact only when the media circus started and I could not avoid being informed. Oh, well, now I understand why it wasn't easy to find a hotel, even booking two months in advance!

The slides and the source code for this talk are available from my site.

Friday, June 02, 2006

Filling test tables quickly

Let's say that you are building a new application, and you need to test it against a large set of data. You would need either to borrow the data from some known source or to create it yourself.

If you have such collection at your disposal, good for you. But more often than not you need some sort of data that is not ready in your repositories, and then you need to have a quick method to create it.

I will tell you three quick tricks to create large datasets of simple data types, namely numbers, words, and dates.

Le's assume that we need at least one million records. Here's how to go for it.

numbers

Creating a large table with just a numeric field is quite simple.

You may be tempted to run a loop in your favorite language, or even in a SQL stored procedure, but this approach would run for quite a long time.
There is a better solution, that will fill your large table in a few seconds

drop table if exists numbers;
create table numbers ( id int not null primary key);

delimiter $$

drop procedure if exists fill_numbers $$
create procedure fill_numbers()
deterministic
begin
declare counter int default 1;
insert into numbers values (1);
while counter < 1000000
do
insert into numbers (id)
select id + counter
from numbers;
select count(*) into counter from numbers;
select counter;
end while;
end $$
delimiter ;

call fill_numbers();
Rather than inserting 1,000,000 lines, this procedure will insert just one record, and then it will double the table 20 times, until we end up with a table containing 1,048,576 lines (220). This operation runs in less than 8 seconds in my laptop, which is quite slow if compared to my usual servers.
Even without a stored procedure, you could just insert a line manually and then execute 20 times this query:
insert into numbers (id) select id + (select count(*) from numbers) from numbers; select count(*) from numbers;

And it should not take you more than 30 seconds.

words

If your case calls for a large list of unique words, you could of course build it with a program, but then again the insertion process would be quite slow to complete. A faster method is to load an existing list from as file.

All Unix system include a word list, whose size ranges from a few thousand to half a million. If you don't have such a list available in your box, you can get one (or build it from several ones) from several places. Good places to start looking for words are this and this.
At the end of the exercise, let's say that you've got a list of about half a million distinct words in /usr/share/dict/words. Thus, you can proceed to build your table.

drop table if exists words;
create table words (
id int not null auto_increment primary key,
t varchar(50) not null
);

load data local infile '/usr/share/dict/words'
into table words (t);

Query OK, 518584 rows affected (4.94 sec)
Records: 518584 Deleted: 0 Skipped: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 518584 |
+----------+
1 row in set (0.04 sec)
That was quick, but we got only about half of what we needed. We said that we wanted one million records, and here we only have a little more than five hundred thousand.
Since we need unique words, we can ask the database to produce the missing half by reversing the existing ones.
insert into words (t) select reverse(t) from words;
Query OK, 518584 rows affected (3.98 sec)
Records: 518584 Duplicates: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1037168 |
+----------+
Almost there. Now we passed the million records mark, but we are not sure that they are unique, since the reverse of one word could be the duplicate of an existing word (think about mood and doom, for example). Thus, to complete the task, let's add a unique index with the IGNORE clause, so that we'll get rid of any duplicate.
alter ignore table words add unique key (t);
Query OK, 1037168 rows affected (46.69 sec)
Records: 1037168 Duplicates: 5791 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1031377 |
+----------+
That's it. A million words, without much sweat.

dates

Finally, let's see how to create a large list of dates. Actually, you would seldom need a million dates, since a million days cover more than 2,700 years. A table of just dates, thus, will usually range from 1,000 to 10,000 records. When you need a million records, you are more likely talking about DATETIME values, with intervals of hours, minutes, or seconds. Nothing will prevent you from using this technique to create one thousand DATE values, but let's keep to our original goal of filling a large table.
Then, if we want records with one minute interval, we can proceed like this:
drop table if exists dates;
create table dates (
id int(11) not null auto_increment primary key,
dt datetime not NULL
) engine=myisam;

delimiter $$

drop procedure if exists make_dates $$
CREATE PROCEDURE make_dates( max_recs int)
begin
declare start_dt datetime;
declare numrecs int default 1;
set start_dt = date_format( now() - interval max_recs minute, '%Y-%m-%d %H:%i:00');

insert into dates (dt) values (start_dt );

while numrecs < max_recs
do
insert into dates (dt)
select dt + interval ( numrecs ) minute
from dates;
select count(*) into numrecs from dates;
select numrecs;
end while;
end $$

delimiter ;
Does it look familiar? It should, since it's the same technique we've used to put a set of numbers into a table.

Only this time we use the record count to calculate the interval in minutes between the existing records and the ones being inserted.
Also in this case, we double the table 20 times, to pass the one million records mark.

call make_dates( 1000000 );
+---------+
| numrecs |
+---------+
| 2 |
+---------+
1 row in set (0.02 sec)

+---------+
| numrecs |
+---------+
| 4 |
+---------+
1 row in set (0.02 sec)

# ... 16 more counts

+---------+
| numrecs |
+---------+
| 524288 |
+---------+
1 row in set (5.99 sec)

+---------+
| numrecs |
+---------+
| 1048576 |
+---------+
1 row in set (10.18 sec)

select count(*), min(dt), max(dt) from dates;
+----------+---------------------+---------------------+
| count(*) | min(dt) | max(dt) |
+----------+---------------------+---------------------+
| 1048576 | 2004-07-07 13:57:00 | 2006-07-05 18:12:00 |
+----------+---------------------+---------------------+
It took a little more than the numbers, because of the overhead of calculating one million date intervals, but it was about ten seconds in total, which is a reasonable time to get a test table.
There are other techniques as well, but these three are techniques that you can use to get the job done even without dedicated tools.