Friday, November 12, 2010

Filtering binary logs with MySQL Sandbox and replication

A few days ago, a friend of mine asked me if I knew of a way of filtering a bunch of binary logs, to extract only statements related to a single table. The task was about filtering a few hundred binary log files.

It's a tricky problem. Even with my experience with regular expressions, I knew that using a script to extract statements related to a single table was going to be a nasty business.
However, I know of an entity that can filter by table name efficiently, and that's the MySQL replication system. So I suggested using replication to a sandbox with a replicate-wild-do-table statement to get the job done.
My friend was skeptical and did not want to go that way. I was busy writing an article for an Italian magazine and did not follow up immediately. But today, with the article safely in the editor's hands, I did a quick test, and guess what? It works!
binary log filter
Here is a step-by-step procedure to do it. I started with a server built with MySQL Sandbox, using MySQL 5.5.7. I used the employees test database to create a large enough binlog, and soon I had a database containing 160 MB of data and a binary log of about the same size.
Then I decided that I wanted to filter the binlog, to get only statements about the employees table. Thus, I issued this command:

$ make_sandbox 5.5.7 --sandbox_port=6000 \
  --sandbox_directory=trans_repl \
  -c log-slave-update \
  -c replicate-wild-do-table=employees.employees \
  -c log-bin=mysql-bin \
  -c server-id=101
The "-c" option transfers its argument to the sandbox configuration file.
At the end of this operation, I had one server with the same version of the server that I had filled with the employee database. The server is ready to filter replicated streams, accepting only commands that affect the table 'employees' within the database 'employees'.

The second step was to create an empty database in the second server, with the Innodb tables converted to BlackHole (to avoid wasting unnecessary space).

Inside the first sandbox, I did this:

$ ./my sqldump -B --no-data employees \
  | perl -pe 's/=innodb/=blackhole/i' \
  | ~/sandboxes/trans_repl/use
Combining the flexibility of the sandbox with some command line skills, the operation requires just one command.
Before starting the replication, I needed to avoid re-creating the tables, or my blackhole trick would have been useless. So I looked at the binary log, and found where the CREATE TABLE statements ended:

$ ./my sqlbinlog ./data/mysql-bin.000001 |less
BEGIN
/*!*/;
# at 3057
#101112 9:48:45 server id 1 end_log_pos 3364 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1289551725/*!*/;
INSERT INTO `departments` VALUES ('d001','Marketing'),('d002','Finance'),('d003','Human Resources'),('d004','Production'),('d005','Development'),('d006','Quality Management'),('d007','Sales'),('d008','Research'),('d009','Customer Service')/*!*/;

Armed with this knowledge, I logged in the second server and did the following:

FLUSH BINARY LOGS; # to get a binlog with only the statements that I need

CHANGE MASTER TO
master_host='127.0.0.1',
master_port=5570,
master_user='msandbox',
master_password='msandbox',
master_log_file='mysql-bin.000001',
master_log_pos=3057;    # this is the position after all the CREATE TABLE
                        # statements in the master
START SLAVE;
After a few seconds, I issued a "SHOW SLAVE STATUS". All was OK.
I flushed the logs again and inspected the second binary log. As expected, it contained only the statements related to the employees table.
Total cost of the operation: 5 minutes. Way quicker than writing this report!

4 comments:

shantanu said...

This will return all the statements of the table cwa_account and ticket_sell

# mysqlbinlog mysql-bin.* | awk '/tbl_company/,/;/' | more

datacharmer said...

@Shantanu,
You will filter all the commands that contain "tbl_company", even inside a query itself. For example, if the table name is also inside the text being inserted into a table, you will filter the wrong query.
Bit that is not the most important part. The biggest problem is that the query depends on several statements that are executed before the command, such as SET TIMESTAMP and SET SQL_MODE, and so on.
So, unless you want to implement the replication server itself, a simple regular expression will not get you off the hook.

Swany said...

Hi,

I had to do something like this in a data recovery scenario recently. We had hundreds of binary logs to dig through to extract changes for only a subset of tables. Using MySQL replication to filter the binary logs was very effective.

shantanu said...

I just wanted to suggest a quick and dirty solution that may be useful in some cases.

-- Shantanu

Vote on Planet MySQL