Saturday, November 05, 2005

New data replication paradigms

MySQL's success is due not only to its efficiency, but also to its ease of administration.
And despite what its detractors say about its lack of features (that was before
5.0, of course), there is one feature that is crucial for enterprise database
management, and MySQL has had it built in since its early versions. I am talking
about data replication, which can transform a single server into a scalable and
highly reliable system with minimum effort and expense.

Recently, one more element was added to MySQL's set of features. MySQL Cluster is an ambitious engine that turns your servers into a full-fledged cluster that can be accessed from any node without worry. Any node is master and slave at the same time. However, these features come at the price of higher hardware requirements (MySQL Cluster is, as of today, an all-in-RAM database).

MySQL data transfer methods

There are substantial differences between the two approaches, and choosing between standard replication and MySQL Cluster is a trade-off:

method: replication
    description: asynchronous transfer, based on one master and N slaves
    the good: fast, reliable, robust, easy to set up, no special HW required, transaction-safe
    the bad: needs some care in the code

method: MySQL Cluster
    description: synchronous transfer, based on a cluster engine
    the good: fast, no code modification needed
    the bad: requires huge amounts of RAM, not easy to set up, minimal transaction support
If you start from scratch, it's easy to build your applications in such a way that they are replication-aware, thus benefiting from the low-cost, high scalability of this paradigm.
OTOH, if you have an application that is already established (and perhaps large and complex), adapting it to the one-master-many-slaves architecture could be a problem.
There are external tools that promise to fill the gap and offer clever solutions, mostly based on in-between filtering layers that give your application the illusion that there is just one server, while in truth there are several. These solutions also cost you something, not only in mere money (since they are - with limited exceptions - commercial tools), but also in features, because such filters are mostly not transaction-safe.

Different needs for data transfers

Aside from the mere method of data movement, the enterprise has other needs. Note that this is not some stretch of my imagination, but the practical result of several years of consulting on various database-related matters.

One thing that most medium to large companies want is the reverse of replication, i.e. they want the master to be updated by many slaves. The need is clear when you think of the several branches of a company needing to send updates about sales records and stock levels. Unfortunately, simple replication does not allow this, and even the cluster cannot easily be bent to this need.
Another thing that companies want is conditional replication, based on rules far beyond the limited configuration range of both replication and the cluster. For example, some companies wanted a table's rows migrated from server A to server B, but into different tables, depending on the time frame.

Enter MySQL 5, and a new world of possibilities opens up to the clever administrator.

New data transfer paradigms: FEDERATED tables

Reverse replication can be achieved with tables using the FEDERATED engine.
For each table you need to replicate from the branch database to the main house, you just need to establish a FEDERATED table and a couple of triggers.
Let's say, just for the sake of it, that you have the main server at big.central.mycompany.com, and the branches at far.away.mycompany.com (code name: FAW) and at fairly.distant.mycompany.com (code name: FDI). The main table you need to update at big.central lives in the "biz" database and looks like this:
CREATE TABLE all_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=innodb;
In each branch, you have one table for your branch sales, which you update every day with the consolidated data from your detail tables (a sketch of such a roll-up follows the table definition).
CREATE TABLE branch_sales
(
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (sale_date, item_id)
) engine=innodb;
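Just to fix ideas, the daily roll-up might look like the following. It is only a sketch: the detail table sales_details and its columns are assumptions for illustration, not part of the original schema.
-- hypothetical daily consolidation; "sales_details" and its columns
-- are assumed here for illustration only
INSERT INTO branch_sales (sale_date, item_id, quantity)
SELECT sale_time, item_id, SUM(quantity)
FROM sales_details
WHERE sale_time >= CURDATE() - INTERVAL 1 DAY
  AND sale_time < CURDATE()
GROUP BY sale_time, item_id;
Once the triggers shown below are in place, every row inserted by this statement is also forwarded to the central server, because row-level triggers fire for INSERT ... SELECT as well.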
To get the data from branch_sales to the main central table, you build a mirror of that table in each branch, but instead of being InnoDB or MyISAM, it will be FEDERATED.
CREATE TABLE central_all_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=FEDERATED
CONNECTION='mysql://user:pwd@big.central.mycompany.com/biz/all_sales';
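Before adding the triggers, it is worth checking that the federated link works. Assuming the user in the CONNECTION string has INSERT, SELECT, and DELETE privileges on the remote table, a quick round trip from the branch will tell (the sample values here are made up):
-- both statements are executed against the remote table at big.central
INSERT INTO central_all_sales
VALUES ('FAW', '2005-11-05 10:15:00', 1001, 5);

SELECT COUNT(*) FROM central_all_sales
WHERE branch_id = 'FAW' AND item_id = 1001;   -- should find the test row

-- remove the test row (this deletes it on the central server too)
DELETE FROM central_all_sales
WHERE branch_id = 'FAW'
  AND sale_date = '2005-11-05 10:15:00'
  AND item_id = 1001;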
As a final touch, you add a few triggers to your branch table, to get the data moving:
create trigger branch_sales_ai after INSERT on branch_sales
for each row
insert into central_all_sales
values ( 'FAW', new.sale_date, new.item_id, new.quantity);

create trigger branch_sales_au after UPDATE on branch_sales
for each row
update central_all_sales
set sale_date = new.sale_date,
item_id = new.item_id,
quantity = new.quantity
where
branch_id = 'FAW'
and
sale_date = old.sale_date
and
item_id = old.item_id;

create trigger branch_sales_ad after DELETE on branch_sales
for each row
delete from central_all_sales
where
branch_id = 'FAW'
and
sale_date = old.sale_date
and
item_id = old.item_id;

Similar triggers (with a branch_id of 'FDI') will be created in the second branch. Now, every record in branch_sales is replicated to the central table without human intervention. For extra safety, you may add some code to check that the record does not exist yet, before inserting.
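A minimal sketch of that check follows, as a replacement for the insert trigger shown above. Keep in mind that the existence test is a SELECT against the FEDERATED table, so it adds one more round trip to the central server for each row:
delimiter //

-- drop the earlier version of the trigger before recreating it
drop trigger branch_sales_ai//

create trigger branch_sales_ai after INSERT on branch_sales
for each row
begin
    -- forward the row only if the central table does not already have it
    if not exists (select 1 from central_all_sales
                    where branch_id = 'FAW'
                      and sale_date = new.sale_date
                      and item_id = new.item_id)
    then
        insert into central_all_sales
        values ('FAW', new.sale_date, new.item_id, new.quantity);
    end if;
end //

delimiter ;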

New data transfer paradigms: BLACKHOLE tables

Have you heard about the BLACKHOLE engine?
It's a simple concept: a table with a full structure definition that does not store anything. Everything you send to such a table is simply discarded, which is not surprising, since there is no storage device associated with BLACKHOLE tables.
What's the use, then? you may ask. The use is to take advantage of the side effects of such a table, rather than its storage capabilities. For one thing, records sent to a BLACKHOLE table still leave a trail in the binary log, which lets a server with little storage act purely as an intermediate relay, spreading data to several more slaves.
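As a sketch of that relay setup (the table name and server roles here are assumptions for illustration): the intermediate server keeps its binary log enabled, and its local copy of each replicated table is switched to BLACKHOLE, so the data passes through to the downstream slaves without being stored locally.
-- on the intermediate relay server (log-bin enabled in my.cnf):
-- switch the local copy of a replicated table to BLACKHOLE
ALTER TABLE some_big_table ENGINE=BLACKHOLE;

-- rows arriving from the master are discarded locally, but they are
-- still written to this server's binary log, so the slaves replicating
-- from this server receive them as usual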
Moreover, and here we come to the point of our quest, BLACKHOLE tables can fire INSERT triggers, and we are going to use them to meet the second need outlined above, i.e. the conditional update.
Let's say that big.central.mycompany.com wants the sales records sent to different tables depending on the time of day (it's a silly example, but bear with me for a while, pretending it's something sensible). However, the application that deals with the data transfer is already working, and nobody dares touch it, for fear of disrupting something. But a simple change can achieve the goal.
Let's change the structure of all_sales:
CREATE TABLE all_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=BLACKHOLE;
Just one word changed, and now the table will discard any data sent to it.
Before seeing what we can do with it, let's create three more tables.
CREATE TABLE all_morning_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=innodb;

CREATE TABLE all_afternoon_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=innodb;

CREATE TABLE all_late_sales
(
    branch_id char(3) not null,
    sale_date datetime not null,
    item_id int not null,
    quantity int not null,
    primary key (branch_id, sale_date, item_id)
) engine=innodb;
Here we have everything ready. Now let's add one trigger:
delimiter //

create trigger all_sales_ai after insert on all_sales
for each row
begin
    case
        when time(new.sale_date) between '08:00:00' and '13:00:00'
        then insert into all_morning_sales
             values (new.branch_id, new.sale_date, new.item_id,
                     new.quantity);
        when time(new.sale_date) between '13:00:01' and '18:00:00'
        then insert into all_afternoon_sales
             values (new.branch_id, new.sale_date, new.item_id,
                     new.quantity);
        else
             insert into all_late_sales
             values (new.branch_id, new.sale_date, new.item_id,
                     new.quantity);
    end case;
end //

delimiter ;
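A quick test shows the whole funnel in action (sample values made up for the occasion):
-- a morning sale, routed by the trigger
INSERT INTO all_sales VALUES ('FAW', '2005-11-05 09:30:00', 42, 10);

SELECT COUNT(*) FROM all_sales;          -- 0: the BLACKHOLE keeps nothing
SELECT COUNT(*) FROM all_morning_sales;  -- 1: the row was routed here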

One caveat: this trick works for inserts only. UPDATE and DELETE statements against a BLACKHOLE table never find rows to match (the table is always empty), so their row-level triggers never fire. Within the insert trigger, though, you can get creative and update several tables at once with different parts of the incoming data.
There. Now you have some food for thought to launch yourselves into a new data transfer enterprise.
