The Data Charmer: Sharding for the masses: the spider storage engine

Monday, July 06, 2009

Sharding for the masses: the spider storage engine

In my previous article about the Spider storage engine, I made some tests and I saw that the engine has potential. I did also identify some problems, which were promptly addressed by the author. I have looked at the latest version (0.12), and the results are very much encouraging.

Installing the Spider storage engine is not trivial but not extremely difficult either. My previous article has a step-by-step explanation. Before installing, though, you need to apply one patch to the server, to enable condition pushdown to the partitions engine. What is it about? In short, when you issue a query from the Spider table, with a WHERE clause, the query is passed to a backend database without the WHERE clause. True, it's filtered by partition, but each partition may have a large data set, and this means a full table scan on the backend server. In a few cases, this is not a big deal, because the general performance of the Spider engine is better than a non-partitioned table, but it can be improved.
The lack of conditions pushdown is a problem already known in the Federated engine. With the above patch applied, the Spider engine pushes the condition down to the back end, and the query is executed with the WHERE clause intact.
Let's consider this simple query:

select * from salaries where from_date='1996-01-10' limit 10;

Before applying the patch, the backend database would execute this:

select `emp_no`,`salary`,`from_date`,`to_date`
from `employees`.`salaries` limit 0,9223372036854775807

After the patch, the backend receives and execute the more sensible

select `emp_no`,`salary`,`from_date`,`to_date`
from `employees`.`salaries`
where (`from_date` = '1996-01-10')
limit 0,9223372036854775807

A final observation. While I was testing the latest version, I noticed something that should have been clear from the beginning, i.e. that a spider engine set of backends can have multiple front-ends. Meaning that you can create a Spider table from a different host, and access the same backends that another host is already using. In theory, it means more flexibility when dealing with large amount of data. In practice, we'll have to test it more and see what happens.