The Data Charmer: MySQL 5.1: Improving ARCHIVE performance with partitioning

In a recent article (Improving Database Performance with Partitioning) Robin Schumacher explains how to use range partitioning, giving some impressive examples of performance gains.

The examples given are quite slow to create. For those willing to try out the same tables with a filling function a bit faster, here goes:

drop table if exists no_part_tab;
create table no_part_tab
(c1 int(11) default NULL,
c2 varchar(30) default NULL,
c3 date default NULL) engine=myisam;

drop table if exists part_tab ;
CREATE TABLE part_tab
(  c1 int default NULL,
c2 varchar(30) default NULL,
c3 date default NULL
) engine=myisam
PARTITION BY RANGE (year(c3)) (PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996) , PARTITION p2 VALUES LESS THAN (1997) ,
PARTITION p3 VALUES LESS THAN (1998) , PARTITION p4 VALUES LESS THAN (1999) ,
PARTITION p5 VALUES LESS THAN (2000) , PARTITION p6 VALUES LESS THAN (2001) ,
PARTITION p7 VALUES LESS THAN (2002) , PARTITION p8 VALUES LESS THAN (2003) ,
PARTITION p9 VALUES LESS THAN (2004) , PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE );

delimiter //

drop procedure if exists load_part_tab //
CREATE PROCEDURE load_part_tab( max_recs int, rows_per_query int)
begin
declare counter int default 0;
declare step int default 0;
declare base_query varchar(100) default 'insert into part_tab values ';
declare first_loop boolean default true;
declare v int default 0;
set @query = base_query;
while v < max_recs
do
if (counter = rows_per_query) then
   set first_loop = true;
   set counter = 0;
   prepare q from @query;
   execute q;
   deallocate prepare q;
   set @query = base_query;
   set step = step + 1;
   select step, v, now();
end if;
if (first_loop) then
   set first_loop = false;
else
   set @query = concat(@query, ',');
end if;
set @query = concat( @query,
'(', v, ',',
'"testing partitions"',',"',
adddate('1995-01-01',(rand(v)*36520) mod 3652), '")'
);
set v = v + 1;        set counter = counter + 1;
end while;
if (counter) then
   prepare q from @query;
   execute q;
   deallocate prepare q;
end if;
end
//
delimiter ;

call load_part_tab(8000000,1000);
insert into no_part_tab select * from part_tab;

Using this data, the timings for my comparison are:

select  count(*) from no_part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (9.05 sec)

select  count(*) from part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (1.04 sec)

Now, coming to the purpose of this post, I created a further table using the ARCHIVE engine, to check if I could couple the gains in size reduction with the speed improvements allowed by partitioning.
The immediate transformation from MyISAM to ARCHIVE does not work (~~bug #17754~~ Update: Bug fixed), so let's create that table explicitly:

drop table if exists part_archive_tab;
CREATE TABLE `part_archive_tab` (
`c1` int(11) not null ,
`c2` varchar(30) default NULL,
`c3` date default NULL
-- unique key (c1)
) ENGINE=ARCHIVE DEFAULT CHARSET=latin1
PARTITION BY RANGE (year(c3)   )
(
PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (1997),
PARTITION p3 VALUES LESS THAN (1998),
PARTITION p4 VALUES LESS THAN (1999),
PARTITION p5 VALUES LESS THAN (2000),
PARTITION p6 VALUES LESS THAN (2001),
PARTITION p7 VALUES LESS THAN (2002),
PARTITION p8 VALUES LESS THAN (2003),
PARTITION p9 VALUES LESS THAN (2004),
PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE ENGINE = ARCHIVE
);
insert into part_archive_tab select * from part_tab;

And I repeated the same search:

select  count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
|   795181 |
+----------+
1 row in set (1.29 sec)

select  count(*) from part_archive_tab;
+----------+
| count(*) |
+----------+
|  8000000 |
+----------+
1 row in set (9.24 sec)

That's very interesting. As I expected, a COUNT without a WHERE clause to an ARCHIVE table means a full table scan. But when we use a WHERE clause involving a range, the performance is really impressive. Using the new PARTITIONS extension of the EXPLAIN command, we get this confirmed:

explain partitions select  count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31'\G
*************************** 1. row ***************************
         id: 1
select_type: SIMPLE
      table: part_archive_tab
 partitions: p1
       type: ALL
possible_keys: NULL
        key: NULL
    key_len: NULL
        ref: NULL
       rows: 798458
      Extra: Using where

Unfortunately, it seems that there are still some problems (~~bug #17894~~ update: Fixed), but partitions are great and I am sure this will be one of the top features in the next future.

1 comment:

Anonymous said...: I understand that archiving of data means moving the data from operational table to another table whose engine type is archive. However my doubts is still the load on the operational database due to the historical data is not reduced by 100%. I am looking for any way by which user could archive the data in a remove mysql process? Is this a valid problem, if so how it is approached in real time; November 11, 2010 at 7:03:00 AM GMT+1