Thursday, March 09, 2006

Seeking alternatives to cursors

As users of stored procedures know, cursors work only with explicit SQL statements: they cannot be used with dynamic queries (see the article by Beat Vontobel and the consequent feature request).
Trying to overcome this limitation, I came up with a somewhat usable alternative and a performance surprise. Let's see both in due order.
The example is built against the World database provided on the MySQL web site.
The method used for this test is an algorithm to compute a global checksum for a table. You may know that such a global CRC already exists, but it is available for MyISAM tables only, and only as a procedure (feature request #17009).
To calculate such a checksum we can do two things:
  • Using a cursor, we start with an empty string representing the global checksum, calculate the checksum of each row, and merge each row checksum into the global one.
  • Using a SELECT statement that gets the checksum of each row, we start with an empty user variable and merge each row checksum into that variable.
The second method seems a bit expensive, because it fetches as many records as there are in the table, creating a recordset that we don't really need, since we are only interested in the final result. A nice touch in this case is the use of the BLACKHOLE storage engine: instead of a simple SELECT, we use an INSERT ... SELECT statement, so that no recordset is returned to the client.
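To make the idea concrete before the full test, here is a minimal, self-contained sketch of the pattern. The names sketch_src and sketch_sink are placeholders of my own, not part of the test that follows, and the sketch assumes a server built with BLACKHOLE support:

create table sketch_sink (crc varchar(100)) engine=BLACKHOLE;
create table sketch_src (col char(1));
insert into sketch_src values ('a'),('b'),('c');
set @crc = '';
insert into sketch_sink
select @crc := sha1(concat(@crc, sha1(col))) from sketch_src;
select @crc; -- the accumulated checksum; sketch_sink stored nothing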
Here comes the preparation code.

use world;

drop table if exists new_city;
create table new_city like City;
insert into new_city select * from City;

delimiter //

drop procedure if exists double_recs//
create procedure double_recs ()
deterministic
begin
-- double the contents of new_city seven times:
-- 4,079 rows * 2^7 = 522,112 rows
declare counter int default 0;
while counter < 7 do
insert into new_city select NULL,Name,CountryCode,District,Population from new_city;
set counter = counter + 1;
end while;
end//

delimiter ;

select 'inserting a bunch of records in new_city ' as 'wait a moment' ;
call double_recs() ;
select count(*) as 'rows in new_city' from new_city;
drop procedure if exists double_recs ;

drop table if exists crc_res;
CREATE TABLE `crc_res` (
`crc` varchar(100) default NULL,
`cnt` int(11) default NULL
) ENGINE=BLACKHOLE ;

delimiter //

DROP FUNCTION IF EXISTS make_City_CRC //

CREATE FUNCTION make_City_CRC( )
returns varchar(100)
READS SQL DATA
BEGIN
declare done boolean default false;
declare sha_global varchar(100);
declare counter int;
declare sha_riga varchar(100);
-- one checksum per row: the SHA1 of all columns concatenated
DECLARE cursor1 CURSOR FOR
SELECT SHA1(
CONCAT_WS('#',ID,Name, CountryCode, District,Population))
FROM new_city
ORDER BY ID;
-- SQLSTATE '02000' means "no more rows"
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = true;

OPEN cursor1;
SET sha_global = "";
set counter = 0;
GET_ROWS:
loop
FETCH cursor1 INTO sha_riga;
IF done THEN
leave GET_ROWS;
END IF;
-- merge the row checksum into the global one
SET sha_global = SHA1(CONCAT(sha_global, sha_riga));
set counter = counter + 1;
END loop;
CLOSE cursor1;
-- export the row count, so we can check it against the other method
set @gcounter = counter;
RETURN sha_global;
END //

delimiter ;

Now we have a function to calculate the checksum of table new_city, which is a copy of City inflated with copies of its own records until it reaches about half a million rows (4,079 rows doubled 7 times gives 4,079 * 2^7 = 522,112), so that we can compare timings in whole seconds rather than milliseconds. We have also created a BLACKHOLE table in preparation for the alternative method.
The CRC business goes like this: for each row, we calculate the SHA1 value of the whole record, by taking the CONCAT_WS of all its columns. (You can see why I would need dynamic SQL to make this function a generic one.) Then, for each row, we set the global checksum to the SHA1 of the current global checksum concatenated with the row checksum. At the end, the global checksum thus calculated is returned.
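For instance, with just two rows the chain unrolls like this (the literals 'row one' and 'row two' stand in for the CONCAT_WS of real columns):

set @g = '';
set @g = sha1(concat(@g, sha1('row one'))); -- merge the first row checksum
set @g = sha1(concat(@g, sha1('row two'))); -- merge the second row checksum
select @g as two_row_checksum;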
In the cursor method, the syntax is clear. In the alternative method, we use a previously initialized user variable and set it with the ":=" operator, thus preserving its value from row to row. Since we are sending the recordset to a BLACKHOLE table, we are not burdened by unwanted output.
Executing both methods, we get the same result, but the timings are quite different. The testing code follows.

select 'using a BLACKHOLE table' as 'first test';

set @gcrc = '';
set @gcnt = 0;
set @start_test1 = now();
insert into crc_res
select @gcrc := sha1(concat(@gcrc, sha1(concat_ws('#',ID,Name,CountryCode,District,Population)) ) ),
@gcnt := @gcnt + 1 from new_city order by ID;
set @end_test1 = now();

select @gcrc;
select @gcnt;

select 'using a cursor' as 'second test';

set @start_test2 = now();
select make_City_CRC();
set @end_test2 = now();
select @gcounter;

select timediff(@end_test1, @start_test1) as 'test BLACKHOLE', timediff(@end_test2, @start_test2) as 'test CURSOR';

And here is the result.

+-------------------------+
| first test              |
+-------------------------+
| using a BLACKHOLE table |
+-------------------------+
Query OK, 522112 rows affected (11.19 sec)
Records: 522112  Duplicates: 0  Warnings: 0
+------------------------------------------+
| @gcrc                                    |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
+--------+
| @gcnt  |
+--------+
| 522112 |
+--------+

+----------------+
| second test    |
+----------------+
| using a cursor |
+----------------+
+------------------------------------------+
| make_City_CRC()                          |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
1 row in set (18.51 sec)

+-----------+
| @gcounter |
+-----------+
|    522112 |
+-----------+
+----------------+-------------+
| test BLACKHOLE | test CURSOR |
+----------------+-------------+
| 00:00:11       | 00:00:18    |
+----------------+-------------+

The cursor method takes 18.51 seconds, while the SELECT + BLACKHOLE method takes only 11.19. Can anyone comment on this performance difference?
I know that this is not a general replacement for a cursor, but keep it in mind for the cases where you would like to use dynamic SQL with a cursor.
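Here is a sketch of the same checksum driven by dynamic SQL, which is exactly what a cursor cannot do. The column list is still hardcoded for new_city; a truly generic version would build it from INFORMATION_SCHEMA.COLUMNS:

set @tbl = 'new_city';
set @gcrc = '';
set @gcnt = 0;
set @sql = concat(
'insert into crc_res ',
'select @gcrc := sha1(concat(@gcrc, ',
'sha1(concat_ws(''#'', ID, Name, CountryCode, District, Population)))), ',
'@gcnt := @gcnt + 1 from ', @tbl, ' order by ID');
prepare crc_stmt from @sql;
execute crc_stmt;
deallocate prepare crc_stmt;
select @gcrc as global_checksum, @gcnt as rows_seen;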

Friday, March 03, 2006

MySQL 5.1: Improving ARCHIVE performance with partitioning

In a recent article (Improving Database Performance with Partitioning) Robin Schumacher explains how to use range partitioning, giving some impressive examples of performance gains.

The examples given are quite slow to create. For those willing to try out the same tables, here is a somewhat faster filling procedure:
drop table if exists no_part_tab;
create table no_part_tab
(c1 int(11) default NULL,
c2 varchar(30) default NULL,
c3 date default NULL) engine=myisam;

drop table if exists part_tab ;
CREATE TABLE part_tab
( c1 int default NULL,
c2 varchar(30) default NULL,
c3 date default NULL
) engine=myisam
PARTITION BY RANGE (year(c3))
(
PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (1997),
PARTITION p3 VALUES LESS THAN (1998),
PARTITION p4 VALUES LESS THAN (1999),
PARTITION p5 VALUES LESS THAN (2000),
PARTITION p6 VALUES LESS THAN (2001),
PARTITION p7 VALUES LESS THAN (2002),
PARTITION p8 VALUES LESS THAN (2003),
PARTITION p9 VALUES LESS THAN (2004),
PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE
);

delimiter //

drop procedure if exists load_part_tab //
CREATE PROCEDURE load_part_tab( max_recs int, rows_per_query int)
begin
-- fills part_tab with max_recs rows, using multi-row INSERT
-- statements of rows_per_query rows each, which is much faster
-- than inserting one row at a time
declare counter int default 0;
declare step int default 0;
declare base_query varchar(100) default 'insert into part_tab values ';
declare first_loop boolean default true;
declare v int default 0;
set @query = base_query;
while v < max_recs
do
if (counter = rows_per_query) then
-- the batch is full: execute it and start a new one
set first_loop = true;
set counter = 0;
prepare q from @query;
execute q;
deallocate prepare q;
set @query = base_query;
set step = step + 1;
select step, v, now(); -- progress report
end if;
if (first_loop) then
set first_loop = false;
else
set @query = concat(@query, ',');
end if;
-- append one row: (v, a constant string, a pseudo-random date in 1995-2004)
set @query = concat( @query,
'(', v, ',',
'"testing partitions"',',"',
adddate('1995-01-01',(rand(v)*36520) mod 3652), '")'
);
set v = v + 1; set counter = counter + 1;
end while;
if (counter) then
-- flush the last, possibly partial batch
prepare q from @query;
execute q;
deallocate prepare q;
end if;
end
//
delimiter ;

call load_part_tab(8000000,1000);
insert into no_part_tab select * from part_tab;
Using this data, the timings for my comparison are:
select count(*) from no_part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (9.05 sec)

select count(*) from part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (1.04 sec)
Now, coming to the purpose of this post, I created a further table using the ARCHIVE engine, to check whether I could couple ARCHIVE's size reduction with the speed improvements allowed by partitioning.
The immediate transformation from MyISAM to ARCHIVE does not work (bug #17754; update: bug fixed).
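For reference, the direct route blocked by that bug would presumably be a one-statement conversion like this (my guess at the statement involved, not something to run while the bug is open):

alter table part_tab engine=ARCHIVE;

So let's create the table explicitly: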
drop table if exists part_archive_tab;
CREATE TABLE `part_archive_tab` (
`c1` int(11) not null ,
`c2` varchar(30) default NULL,
`c3` date default NULL
-- unique key (c1)
) ENGINE=ARCHIVE DEFAULT CHARSET=latin1
PARTITION BY RANGE (year(c3) )
(
PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (1997),
PARTITION p3 VALUES LESS THAN (1998),
PARTITION p4 VALUES LESS THAN (1999),
PARTITION p5 VALUES LESS THAN (2000),
PARTITION p6 VALUES LESS THAN (2001),
PARTITION p7 VALUES LESS THAN (2002),
PARTITION p8 VALUES LESS THAN (2003),
PARTITION p9 VALUES LESS THAN (2004),
PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE ENGINE = ARCHIVE
);
insert into part_archive_tab select * from part_tab;
And I repeated the same search:
select count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (1.29 sec)

select count(*) from part_archive_tab;
+----------+
| count(*) |
+----------+
|  8000000 |
+----------+
1 row in set (9.24 sec)
That's very interesting. As I expected, a COUNT without a WHERE clause on an ARCHIVE table means a full table scan. But when we use a WHERE clause restricting the range, partition pruning kicks in and the performance is really impressive. The new PARTITIONS extension of the EXPLAIN command confirms it:
explain partitions select count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: part_archive_tab
   partitions: p1
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 798458
        Extra: Using where
Unfortunately, it seems that there are still some problems (bug #17894; update: fixed), but partitions are great, and I am sure this will be one of the top features in the near future.