Tuesday, May 06, 2008

Test driving the free and open online backup

Great news today (May 6, 2008). MySQL has confirmed that, contrary to previously leaked information, the online backup components will be open source. That means free, without exceptions, as announced. It couldn't be otherwise: on stage at JavaOne, both Jonathan Schwartz and Rich Green stressed the importance of being open.
Politics aside, what is this online backup about which everyone has complained but few seem to have really tried?
In short, it's a feature that allows you to copy the data from one or more databases to a file.
The Preview page on MySQL Forge tracks the progress of the ongoing development. The user manual now has a new section on online backup, where you can find the syntax of the basic commands:
BACKUP {DATABASE | SCHEMA}
{ * | db_name [, db_name] ... }
TO 'image_file_name';
You can back up just one database, or all of them at once (BACKUP SCHEMA * TO 'all_dbs.bkp').
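For example, backing up two specific databases into a single image file would look like this (db1, db2, and the file name are placeholders; the statement needs a server built from the mysql-6.0 tree):

```sql
-- Back up two named databases into one image file
BACKUP DATABASE db1, db2 TO 'two_dbs.bkp';
```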

The RESTORE command is quite simple. No options are needed. Every database that was backed up will be restored from that file.

RESTORE FROM 'image_file_name';
CAVEAT: the RESTORE command is currently destructive, even more so than mysqldump. What does that mean? A backup made with mysqldump is a script containing DROP TABLE, CREATE TABLE, and INSERT statements. If you replay that script in your client, the tables it covers are dropped and re-created; this is expected, and you know that mysqldump acts that way. However, the mysqldump script spares any table it doesn't know about: if you create a table after making the dump, reloading the script leaves that table alone. The online backup, instead, drops the whole database and re-creates it from scratch. This behavior may change, but for the moment be careful.
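To make the caveat concrete, here is a sketch of the dangerous scenario (the scratch table is a made-up example):

```sql
BACKUP SCHEMA employees TO 'emp.bkp';

-- A table created after the backup was taken...
CREATE TABLE employees.scratch (id INT);

-- ...would survive a reload of a mysqldump script, but RESTORE
-- drops and re-creates the whole database: after this statement,
-- employees.scratch is gone.
RESTORE FROM 'emp.bkp';
```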

Colin has recently published a simple test with the Sakila database. Let's try something more demanding, with an additional test that shows whether the restored data is equivalent to what was backed up.
A conveniently sized data set is the employees database, containing 4 million records. The data is fake, of course, but it's large enough to be non-trivial.
One important feature of this package is that it includes a test suite, which tells you whether the data was loaded correctly by calculating a global CRC.
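The idea behind such a check is simple: compute a checksum over the ordered contents of each table and compare it against a stored, expected value. A minimal sketch in Python (the SHA-1-over-concatenated-rows scheme is my illustration, not the package's exact implementation):

```python
import hashlib

def table_crc(rows):
    """Compute a single checksum over all rows of a table.

    Each row is rendered as a tab-separated string; the digest over
    the concatenation changes if any single value changes.
    """
    digest = hashlib.sha1()
    for row in rows:
        line = "\t".join(str(col) for col in row) + "\n"
        digest.update(line.encode())
    return digest.hexdigest()

# The check: recompute after loading and compare with the expected value.
original = [(10001, 62102), (10002, 66074)]
tampered = [(10001, 62103), (10002, 66074)]  # one salary changed

assert table_crc(original) == table_crc(original)  # reload matches
assert table_crc(original) != table_crc(tampered)  # tampering is caught
```

A record count alone would miss the salary update above; the checksum catches both missing rows and modified values, which is exactly what the employees test suite reports in its records_match and crc_match columns.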

Let's get started. For this experiment, I compiled the latest code from the mysql-6.0 tree and used MySQL Sandbox to test the server.
After downloading the employees database, I expanded it under the sandbox directory, and loaded the data:
$ ../use -t < employees.sql
+-----------------------------+
| INFO |
+-----------------------------+
| CREATING DATABASE STRUCTURE |
+-----------------------------+
+------------------------+
| INFO |
+------------------------+
| storage engine: MyISAM |
+------------------------+
+---------------------+
| INFO |
+---------------------+
| LOADING departments |
+---------------------+
+-------------------+
| INFO |
+-------------------+
| LOADING employees |
+-------------------+
+------------------+
| INFO |
+------------------+
| LOADING dept_emp |
+------------------+
+----------------------+
| INFO |
+----------------------+
| LOADING dept_manager |
+----------------------+
+----------------+
| INFO |
+----------------+
| LOADING titles |
+----------------+
+------------------+
| INFO |
+------------------+
| LOADING salaries |
+------------------+
+----------------------------+
| INFO |
+----------------------------+
| Changes in current version |
+----------------------------+
+---------+--------------+------------------------------------------------------------+
| version | version_date | notes |
+---------+--------------+------------------------------------------------------------+
| 1.0.0 | 2008-01-03 | Initial release |
| 1.0.0 | 2008-01-03 | No known bugs |
| 1.0.1 | 2008-01-04 | Adapted code to run on 4.1 |
| 1.0.2 | 2008-03-15 | Added support for PBXT - replaced LOAD DATA with SQL dumps |
+---------+--------------+------------------------------------------------------------+
The next, very important, step is to test the load, to verify that the data is the same as it was packaged.
$ ../use -t < test_employees.sql
+----------------------+
| INFO |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name | expected_records | expected_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name | found_records | found_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | OK | OK |
| salaries | OK | OK |
+--------------+---------------+-----------+
This is the expected result from the embedded test suite. Now we know that the data is valid, and we can make a snapshot using the online backup.
backup schema employees to 'emp.bkp';
+-----------+
| backup_id |
+-----------+
| 1 |
+-----------+
What is this backup_id? It's a sequential internal identifier that the backup engine uses to keep track of its tasks. More on that later.
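As far as I can tell, the counter spans all backup and restore operations on the server, each new task simply getting the next number (file names here are made-up examples, on a freshly started server):

```sql
BACKUP SCHEMA employees TO 'emp_monday.bkp';   -- returns backup_id 1
BACKUP SCHEMA employees TO 'emp_tuesday.bkp';  -- returns backup_id 2
RESTORE FROM 'emp_monday.bkp';                 -- returns backup_id 3
```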
This backup is quite fast. Compared to mysqldump, it is 10 to 20% faster (depending on the engine), and the resulting backup image is 30 to 40% smaller.

Before proceeding with the restore, let's make sure that the test routine catches any small change.
select * from salaries limit 3;
+--------+--------+------------+------------+
| emp_no | salary | from_date | to_date |
+--------+--------+------------+------------+
| 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 10001 | 66074 | 1988-06-25 | 1989-06-25 |
+--------+--------+------------+------------+
3 rows in set (0.00 sec)

select * from titles limit 3;
+--------+-----------------+------------+------------+
| emp_no | title | from_date | to_date |
+--------+-----------------+------------+------------+
| 10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
| 10002 | Staff | 1996-08-03 | 9999-01-01 |
| 10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
+--------+-----------------+------------+------------+
3 rows in set (0.00 sec)

delete from titles where emp_no=10002 limit 1;
Query OK, 1 row affected (0.00 sec)

update salaries set salary = 62103 where emp_no=10001 and salary=62102;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Now we repeat the test, and observe that the suite catches both of the small changes we made.
+--------------+------------------+------------------------------------------+
| table_name | expected_records | expected_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name | found_records | found_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443307 | ce8ac7a8e39695bc0d000e7168ad240c8dada4bb |
| salaries | 2844047 | 3135f3d7622a3ef788af4055200426d2c337d94d |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | not ok | not ok |
| salaries | OK | not ok |
+--------------+---------------+-----------+
At this point, we are satisfied that the data was loaded as expected and that the test suite reliably catches any differences.
We are ready for the final test: dropping the database and restoring its contents from the backup file.
drop schema employees;
Query OK, 10 rows affected (0.04 sec)

###############################################################
### BOOM! without a backup, this statement would be a disaster!
###############################################################

restore from 'emp.bkp';
+-----------+
| backup_id |
+-----------+
| 2 |
+-----------+
1 row in set (1 min 15.46 sec)
Now the database is back from the dead. The last thing to do is to check that we indeed have a reliable copy of the data. We launch the test suite once more.
$ ../use -t < test_employees.sql
[...]
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | OK | OK |
| salaries | OK | OK |
+--------------+---------------+-----------+
Yeah! QED.
Not bad for a feature in alpha stage. I will try playing with concurrent backup and restore calls to see how well it handles the load. That's material for another post. In the meantime, have fun with it!

1 comment:

András said...

What is not really clear to me is whether the database/table is locked during the backup process, or whether you can query/alter it as you want. I have a huge table (backing it up takes quite long) and I'm looking for a backup process that does not block the running site.