Thursday, March 09, 2006

Seeking alternatives to cursors

As users of stored procedures know, cursors only work with explicit SQL statements; they cannot be used with dynamic queries (there was an article by Beat Vontobel and a consequent feature request).
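To make the limitation concrete, here is a minimal sketch (count_any is a name made up for the example) showing that PREPARE and EXECUTE happily run a statement assembled at run time, while there is no equivalent for cursors:

delimiter //

drop procedure if exists count_any//
create procedure count_any (in tbl varchar(64))
begin
set @q = concat('select count(*) from ', tbl);
prepare stmt from @q; -- the statement text is built at run time
execute stmt;
deallocate prepare stmt;
-- there is no "DECLARE cur CURSOR FOR @q" or anything similar:
-- a cursor can only be declared on a literal SELECT
end//

delimiter ;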
Trying to overcome this limitation, I came up with a somewhat usable alternative and a performance surprise. Let's see them in due order.
The example is built against the World database provided on the MySQL web site.
The method used for this test is an algorithm to compute a global checksum for a table. You may know that such a global CRC exists already, but it is available only for MyISAM tables, and only as a procedure (feature request #17009).
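For reference, the built-in facility alluded to here is presumably the live checksum that only MyISAM tables maintain (the CHECKSUM=1 table option), read back with the CHECKSUM TABLE statement rather than through a function:

create table t_myisam (id int) engine=MyISAM checksum=1; -- t_myisam is just an example name
checksum table t_myisam quick; -- QUICK returns the live checksum without scanning the table
drop table t_myisam;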
To calculate such a checksum we can do two things:
  • Using a cursor, we start with an empty string, representing the global checksum, then calculate the checksum for each row and merge it into the global checksum.
  • Using a SELECT statement that gets the checksum of each row, we start with an empty user variable and merge each row's checksum into it.
The second method seems a bit expensive, because it fetches as many records as there are in the table, creating a recordset that we really don't need, since we are only interested in its final result. A nice touch in this case is the use of the BLACKHOLE storage engine: instead of a simple SELECT, we use an INSERT ... SELECT statement.
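To see why BLACKHOLE helps, here is a quick illustration (bh is just a throwaway table name) of how that engine discards whatever is inserted into it:

create table bh (x int) engine=blackhole;
insert into bh values (1),(2),(3); -- reports 3 rows affected
select count(*) from bh; -- returns 0: the rows were discarded on arrival
drop table bh;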
Here comes the preparation code.

use world;

drop table if exists new_city;
create table new_city like City;
insert into new_city select * from City;

delimiter //

drop procedure if exists double_recs//
create procedure double_recs ()
deterministic
begin
declare counter int default 0;
while counter < 7 do
insert into new_city select NULL,Name,CountryCode,District,Population from new_city;
set counter = counter + 1;
end while;
end//

delimiter ;

select 'inserting a bunch of records in new_city ' as 'wait a moment' ;
call double_recs() ;
select count(*) as 'rows in new_city' from new_city;
drop procedure if exists double_recs ;

drop table if exists crc_res;
CREATE TABLE `crc_res` (
`crc` varchar(100) default NULL,
`cnt` int(11) default NULL
) ENGINE=BLACKHOLE ;

delimiter //

DROP FUNCTION IF EXISTS make_City_CRC //

CREATE FUNCTION make_City_CRC( )
returns varchar(100)
READS SQL DATA
BEGIN
declare done boolean default false;
declare sha_global varchar(100);
declare counter int;
declare sha_riga varchar(100);
DECLARE cursor1 CURSOR FOR
SELECT SHA1(
CONCAT_WS('#',ID,Name, CountryCode, District,Population))
FROM new_city
ORDER BY ID;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = true;

OPEN cursor1;
SET sha_global = "";
set counter =0;
GET_ROWS:
loop
FETCH cursor1 INTO sha_riga;
IF done THEN
leave GET_ROWS;
END IF;
SET sha_global = SHA1(CONCAT(sha_global, sha_riga));
set counter = counter + 1;
END loop;
CLOSE cursor1;
set @gcounter = counter;
RETURN sha_global;
END //

delimiter ;

Now we have a function to calculate the checksum of table new_city, which is a copy of City, inflated with copies of its own records until it reaches about half a million rows (so that timings can be compared in whole seconds rather than milliseconds). We have also created a BLACKHOLE table, crc_res, in preparation for the alternative method.
The CRC business goes like this: for each row, we calculate the SHA1 value of the whole record by taking a CONCAT_WS of all its columns. (You can see why I would need dynamic SQL to make this function a generic one.) The global checksum is then replaced by the SHA1 of the global checksum concatenated with the row checksum. At the end, the global checksum thus calculated is returned.
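To make the per-row step concrete, here is the same calculation spelled out for a single row, with illustrative literal values in place of the column references:

select sha1(concat_ws('#', 1, 'Kabul', 'AFG', 'Kabol', 1780000)) as row_crc;
-- folding a row checksum into the running total:
-- global := SHA1(CONCAT(global, row_crc)), starting from an empty string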
In the cursor method, the syntax is clear. In the alternative method, we use a previously initialized user variable and set it with the ":=" operator, thus carrying its value from row to row. Since the rows are sent to a BLACKHOLE table, we are not burdened by an unwanted recordset.
Executing both methods, we get the same result, but the timings are quite different. The testing code follows.

select 'using a BLACKHOLE table' as 'first test';

set @gcrc = '';
set @gcnt = 0;
set @start_test1 = now();
insert into crc_res
select @gcrc := sha1(concat(@gcrc, sha1(concat_ws('#',ID,Name,CountryCode,District,Population)) ) ),
@gcnt := @gcnt + 1 from new_city order by ID;
set @end_test1 = now();

select @gcrc;
select @gcnt;

select 'using a cursor' as 'second test';

set @start_test2 = now();
select make_City_CRC();
set @end_test2 = now();
select @gcounter;

select timediff(@end_test1, @start_test1) as 'test BLACKHOLE', timediff(@end_test2, @start_test2) as 'test CURSOR';

And here is the result.

+-------------------------+
| first test              |
+-------------------------+
| using a BLACKHOLE table |
+-------------------------+
Query OK, 522112 rows affected (11.19 sec)
Records: 522112  Duplicates: 0  Warnings: 0
+------------------------------------------+
| @gcrc                                    |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
+--------+
| @gcnt  |
+--------+
| 522112 |
+--------+

+----------------+
| second test    |
+----------------+
| using a cursor |
+----------------+
+------------------------------------------+
| make_City_CRC()                          |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
1 row in set (18.51 sec)

+-----------+
| @gcounter |
+-----------+
|    522112 |
+-----------+
+----------------+-------------+
| test BLACKHOLE | test CURSOR |
+----------------+-------------+
| 00:00:11       | 00:00:18    |
+----------------+-------------+

The cursor method takes 18.51 seconds, while the SELECT+BLACKHOLE method takes only 11.19. Can anyone comment on this performance difference?
I know that this is not a general replacement for a cursor, but keep it in mind for those cases where you would like to use dynamic SQL with a cursor.
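As a closing sketch (not something tested above), here is one way the user-variable method could be combined with dynamic SQL to obtain a generic checksum procedure, which is exactly what the cursor-based function cannot offer. The name make_any_crc is made up for the example; it assumes the target table has an integer primary key named ID and reuses the crc_res BLACKHOLE table created earlier:

delimiter //

drop procedure if exists make_any_crc//
create procedure make_any_crc (in tbl varchar(64))
begin
declare col_list text;
-- build the column list of the target table from information_schema
select group_concat(column_name order by ordinal_position)
into col_list
from information_schema.columns
where table_schema = database() and table_name = tbl;
set @gcrc = '';
set @gcnt = 0;
-- assemble the same INSERT ... SELECT used in the first test, for an arbitrary table
set @q = concat(
'insert into crc_res select ',
'@gcrc := sha1(concat(@gcrc, sha1(concat_ws(''#'', ', col_list, ')))), ',
'@gcnt := @gcnt + 1 from ', tbl, ' order by ID');
prepare stmt from @q;
execute stmt;
deallocate prepare stmt;
select @gcrc as crc, @gcnt as cnt;
end//

delimiter ;

call make_any_crc('new_city');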
