Thursday, May 15, 2008

What's in a name?

What's in a name? That which we call a rose by any other name would smell as sweet.

Good old William had his reasons to say so, but then either he was not concerned about misspellings or he didn't care. (A recent book about the Bard actually points out that all Shakespeare's known signatures are differently spelled, and none of them is spelled Shakespeare!)
The problem with my name is that, for the majority of non-Italians, it does not sound familiar, and consequently it gets misspelled. As a frequent traveler, I have seen every sort of mischief done to my first name.
I don't really understand why, but most English speakers write "GUIseppe" instead of "GIUseppe." Sometimes I was called "Giuseppa" (which is a female name) or "Giuseppi" (which is a non existing plural of my name), or "Jooseppai" (which is how I pronounce it) and several variations with additional or missing vowels.
Let me set the record straight, for anybody who cares.
The right spelling is "G-I-U-SEPPE". To remember the right order, you may use what I call the selfish rule: First I, then U. Got it? :)

Now, let me use this fact to compare the behavior of two companies. When I was hired by MySQL, I braced myself in wait of the misspelling, which did not come! All my official documents, personnel listings, email accounts, company badge, were produced without any mistake. I was impressed. Apparently, MySQL HR people are very attentive to these details. When I was hired by Sun, instead, well, let me say that it was not the same quality. MySQL was an international company, while Sun seems more an US company with international branches. They are learning, though.

I wrote the draft for this post long ago, but I left it aside. Now, since it seems that I am not the only one with this problem, I state my point.

Tuesday, May 06, 2008

Test driving the free and open online backup

Great news today (May 6, 2008). MySQL has confirmed that, contrary to previous leaked information, the online backup components will be open source. It means free, without exceptions, as announced. It couldn't be otherwise. On stage at JavaOne, both Jonathsn Schwartz and Rich Green stressed the importance of being open.
Politics aside, what is this online backup of which everyone has complained but few seem to have really tried?
In short, it's a feature that allows you to make a copy of the data from one or more database to a file.
The Preview page on MySQL Forge tracks the progress of the ongoing development. The user manual has now a new section on the online backup where you can find the syntax of the basic commands:

BACKUP {DATABASE | SCHEMA}
{ * | db_name [, db_name] ... }
TO 'image_file_name';
You can backup just one database, or all of them at once. (BACKUP SCHEMA * to 'all_dbs.bkp')

The RESTORE command is quite simple. No options are needed. Every database that was backed up will be restored from that file.

RESTORE FROM 'image_file_name';
CAVEAT: the RESTORE command is currently destructive. More destructive than mysqldump, actually. What does it mean? If you make a backup with mysqldump, you will create a script containing DROP TABLE, CREATE TABLE, and INSERT statements. If you replay that script in your client, the existing tables will be dropped and re-created. This is expected, and you know that mysqldump acts that way. However, the script created by mysqldump will spare the existing tables.
If you create a table after you made the dump, when you reload the backup script your table is not dropped. The online backup, instead, will drop the whole database and re-create it from scratch. This behavior may change, but for the moment be careful.

Colin has recently published a simple test with the Sakila database. Let's see something more demanding, with an additional test that shows us if the restore is equivalent to the backup.
A conveniently sized data collection is this employees database, containing 4 million records. The data is fake, of course, but the size is large enough to be non-trivial.
One important feature of this package is that it includes a test suite, which tells you if the data was loaded correctly, by calculating a global CRC.

Let's get started. For this experiment, I compiled the latest code from mysql-6.0 tree, and used a MySQL Sandbox to test the server.
After downloading the employees database, I expanded it under the sandbox directory, and loaded the data:
$ ../use -t < employees.sql
+-----------------------------+
| INFO |
+-----------------------------+
| CREATING DATABASE STRUCTURE |
+-----------------------------+
+------------------------+
| INFO |
+------------------------+
| storage engine: MyISAM |
+------------------------+
+---------------------+
| INFO |
+---------------------+
| LOADING departments |
+---------------------+
+-------------------+
| INFO |
+-------------------+
| LOADING employees |
+-------------------+
+------------------+
| INFO |
+------------------+
| LOADING dept_emp |
+------------------+
+----------------------+
| INFO |
+----------------------+
| LOADING dept_manager |
+----------------------+
+----------------+
| INFO |
+----------------+
| LOADING titles |
+----------------+
+------------------+
| INFO |
+------------------+
| LOADING salaries |
+------------------+
+----------------------------+
| INFO |
+----------------------------+
| Changes in current version |
+----------------------------+
+---------+--------------+------------------------------------------------------------+
| version | version_date | notes |
+---------+--------------+------------------------------------------------------------+
| 1.0.0 | 2008-01-03 | Initial release |
| 1.0.0 | 2008-01-03 | No known bugs |
| 1.0.1 | 2008-01-04 | Adapted code to run on 4.1 |
| 1.0.2 | 2008-03-15 | Added support for PBXT - replaced LOAD DATA with SQL dumps |
+---------+--------------+------------------------------------------------------------+
The next step, very important, is to test the load, to see if the data is the same as it was packaged.
$ ../use -t < test_employees.sql
+----------------------+
| INFO |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name | expected_records | expected_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name | found_records | found_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | OK | OK |
| salaries | OK | OK |
+--------------+---------------+-----------+
This is the expected result from the embedded test suite. Now we know that the data is valid, and we can make a snapshot using the online backup.
backup schema employees to 'emp.bkp';
+-----------+
| backup_id |
+-----------+
| 1 |
+-----------+
What is this backup_id? it's a sequential internal identification that is used by the backup engine to keep track of the tasks. More on that one later.
This backup is quite fast. Compared to mysqldump, is 10 to 20 % faster (depending on the engine). The resulting dump is also 30 to 40% smaller.

Before proceeding with the restore, let's make sure that the test routine catches any small change.
select * from salaries limit 3;
+--------+--------+------------+------------+
| emp_no | salary | from_date | to_date |
+--------+--------+------------+------------+
| 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 10001 | 66074 | 1988-06-25 | 1989-06-25 |
+--------+--------+------------+------------+
3 rows in set (0.00 sec)

select * from titles limit 3;
+--------+-----------------+------------+------------+
| emp_no | title | from_date | to_date |
+--------+-----------------+------------+------------+
| 10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
| 10002 | Staff | 1996-08-03 | 9999-01-01 |
| 10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
+--------+-----------------+------------+------------+
3 rows in set (0.00 sec)

delete from titles where emp_no=10002 limit 1;
Query OK, 1 row affected (0.00 sec)

update salaries set salary = 62103 where emp_no=10001 and salary=62102;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Now we repeat the test, and we observe that the test suite catches both the small changes we have made.
+--------------+------------------+------------------------------------------+
| table_name | expected_records | expected_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries | 2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name | found_records | found_crc |
+--------------+------------------+------------------------------------------+
| employees | 300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments | 9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager | 24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp | 331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles | 443307 | ce8ac7a8e39695bc0d000e7168ad240c8dada4bb |
| salaries | 2844047 | 3135f3d7622a3ef788af4055200426d2c337d94d |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | not ok | not ok |
| salaries | OK | not ok |
+--------------+---------------+-----------+
At this point, we are satisfied that the data was loaded as expected, and the test suite is reliably catching any differences..
We are ready for the final test. Dropping the database and restoring its contents from the backup file.
drop schema employees;
Query OK, 10 rows affected (0.04 sec)

###############################################################
### BOOM! without a backup, this statement would be a disaster!
###############################################################

restore from 'emp.bkp';
+-----------+
| backup_id |
+-----------+
| 2 |
+-----------+
1 row in set (1 min 15.46 sec)
Now the database is back from the dead. The last thing to do is to check that we have indeed a reliable copy of the data. We launch the test suite once more.
$ ../use -t < test_employees.sql
[...]
+--------------+---------------+-----------+
| table_name | records_match | crc_match |
+--------------+---------------+-----------+
| employees | OK | OK |
| departments | OK | OK |
| dept_manager | OK | OK |
| dept_emp | OK | OK |
| titles | OK | OK |
| salaries | OK | OK |
+--------------+---------------+-----------+
Yeah! QED.
Not bad for a feature in alpha stage. I will try to play with concurrent backup and restore calls to see how nicely it can handle the load. That's material for another post. In the meantime, have fun with it!

Slides from Creative Programming talk

Thanks to all the attendees to the session on creative programming with MySQL at CommunityOne.
As announced, the slides are online. 3.5M (PDF)
Enjoy!

Thursday, May 01, 2008

new blog at blogs.sun.com

Today the integration with Sun made two steps forward for me. I finally got the access to Sun internal network, and as a consequence, I was able to create a new blog.
My first post is just an introduction to the larger Sun community. From my next post, I will start aggregating that blog as well in Planet MySQL.