Wednesday, May 28, 2008

Piercing the fog


There have been many surveys on MySQL, but rarely as deep as the one just launched by Keith Murphy, editor of MySQL Magazine, and Mark Schoonover.
I have seen some turmoil lately, with talk of forking the code base and organizing a community conference. The community seems restless, and this initiative could be an attempt at counting its ranks, to see how far it can dare to go.
I welcome this initiative to see through the fog. If a survey was needed, rather than waiting for one to fall from the sky (or from the Sun :-) ), these industrious bloggers have started their own. I will be curious to see the results.
Inside MySQL, we have been talking about creating a phone home feature to do a live count of the user base, but as you can see there is no such thing yet. So this survey could provide the data that the missing phone home feature has failed to bring back. The user base measuring itself: that's an interesting thought.
If you can spare 10 minutes, I recommend taking the survey.

Tuesday, May 27, 2008

Chaining Proxies

If you need to combine two scripts with MySQL Proxy, you have three choices:
  • you can manually rewrite the two scripts (good luck!);
  • you can use the load-multi module to load scripts on demand;
  • or you can use the proxy chaining technique.

To chain two proxies, you start one Proxy with the first Lua script, pointing at the real backend, and listening on a non-standard port. Then you start the second Proxy with the second Lua script, using the first proxy as its backend, and listening on the standard proxy port.
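In practice, the manual procedure looks more or less like this (just a sketch: ports, addresses, and script names are examples):
$ # first proxy: first script, talking to the real backend, on a non-standard port
$ mysql-proxy --proxy-lua-script=first_script.lua \
      --proxy-backend-addresses=127.0.0.1:3306 \
      --proxy-address=127.0.0.1:4041 &
$ # second proxy: second script, using the first proxy as its backend,
$ # and listening on the standard proxy port (4040)
$ mysql-proxy --proxy-lua-script=second_script.lua \
      --proxy-backend-addresses=127.0.0.1:4041 \
      --proxy-address=127.0.0.1:4040 &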
It's a tedious and error-prone procedure, and you would gladly forget about it, if it weren't for an easy workaround. And indeed there is one: just grab the chain_proxy script from MySQL Forge and use it with this simple syntax:
$ chain_proxy first_script.lua second_script.lua
When I wrote this script, I had to decide how to handle the standard output of each proxy instance. The brute force solution is to start each instance in a separate terminal window, but that is less than optimal: it would be almost as much work as doing things manually.
Considering that I usually need the output of a Proxy session only for debugging, and that I am a frequent user of GNU screen, I made this arrangement:
The script starts each proxy inside a separate screen. At the end, it gives the list of screens being used, and creates a customized script in /tmp/ to kill the chained proxies and remove the screens.
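The core of the trick is that screen can run a command inside a detached, named session. For each proxy, the script does something like this (a simplified sketch, with the chaining options omitted):
$ screen -d -m -S first_proxy mysql-proxy --proxy-lua-script=digits.lua
Here is a sample run: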
$ chain_proxy digits.lua loop.lua
proxy started on screen "second_proxy" using script "loop.lua" - pid_file : /tmp/proxy1.pid
proxy started on screen "first_proxy" using script "digits.lua" - pid_file : /tmp/proxy_chain1.pid
stop script is /tmp/proxy_chain_stop
There are screens on:
21257.first_proxy (Detached)
21258.second_proxy (Detached)
2 Sockets in /tmp/uscreens/S-gmax.
In this example, the two proxies are started in sequence, and the script reports what is going on. The output of "digits.lua" is in the "first_proxy" screen. To see it, I only need to do
$ screen -r first_proxy
When I am finished using the chained Proxy, I type
$ /tmp/proxy_chain_stop
There are screens on:
21257.first_proxy (Detached)
21258.second_proxy (Detached)
2 Sockets in /tmp/uscreens/S-gmax.

No Sockets found in /tmp/uscreens/S-gmax.
This script could also be implemented by redirecting the output of each proxy to a different file.
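In a sketch (log file names are made up, and the chaining options are omitted):
$ mysql-proxy --proxy-lua-script=digits.lua > /tmp/first_proxy.log 2>&1 &
$ mysql-proxy --proxy-lua-script=loop.lua > /tmp/second_proxy.log 2>&1 &
YMMV. I like the screen solution better.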

I will be speaking about this feature (and more) during my MySQL University session on advanced Lua scripting.

Virtual squares - Taking virtualization to new limits


During the Italian Free Software Conference in Trento, I attended an amazing presentation on virtual components.
Renzo Davoli, a professor at the University of Bologna and a hacker, has stretched the concept of virtualization to new limits. Virtual Square is a project that aims at developing more dynamic implementations of virtual entities, which eventually become independent of the underlying operating system and of root privileges.

The coolest aspect of this project is the virtualization of single elements, such as a disk drive, a network port, or a file system, without root privileges and with no impact on other users.
Virtualizing single elements makes life easier for demanding users, and quieter for their neighbors, who won't suffer the massive drain on overall resources that comes with full virtualization of operating systems.
Think of the applications: for example, it would be easier to establish dedicated quotas for database and web server users, with better security, easier maintenance, and without creating new OS users. MySQL, with its limited interface for user resources, would surely benefit from such a system. There is a lot of potential in this idea. Let's hope it is pushed beyond academic circles.

Thursday, May 22, 2008

A look at the Open Source Census

Yesterday I had a look at the Open Source Census (OSC), a project aiming at counting the installations of open source packages.

It is a collaborative effort. The OSC offers the infrastructure, and it's open to cooperation. It works with a disk scanner, and relies on contributions being voluntary and anonymous.
If you want to contribute, you need to register, and you are encouraged to do so anonymously. The OSC is not interested in who you are, but registering gives you a chance to compare your results with the rest of the community.
The registration is also necessary to avoid duplicates, and to track your installations over time, should you decide to do that on a regular basis.
If you want to give it a try, the procedure goes like this:
  1. Register (with any fancy name you want);
  2. download the scanner package (it's open source);
  3. scan your computer (or some selected directories), using your unique identifier;
  4. look at the results;
  5. send the results to the census (using the same tool).
Bear in mind that there is no way for the OSC to track down who you are, because you are not asked in the first place. So the results are truly anonymous.
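To give an idea of the principle (this is NOT the actual OSC scanner, just a toy illustration), a scan boils down to walking directories and matching files against known package fingerprints:
$ # find files matching a known package pattern and checksum them;
$ # the checksums can then be compared against a database of known signatures
$ find $HOME -name 'mysql-connector-java-*.jar' -exec md5sum {} \; 2>/dev/null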

Some concerns:
  • The scan package is open source. However, even the simplest package available (the pure Ruby one) is more than 6,000 lines of code. If you are security conscious, examining the code is not a quick task. You will have to rely on peer review from the community, or do a test scan in a virtual machine to examine what the software does (that's what I did);
  • If you have a large machine, the scan may take hours. You will have to plan for night scans if you want to contribute seriously.

On the plus side, the process of scanning is open. You can contribute the signatures of your favorite open source tools, to be included in the next scans.
Looking at the results may bring some surprises. I did not know that I had two different versions of MySQL Connector/J on my laptop, or two versions of the PostgreSQL connector! And I wonder how I managed to get 5 (FIVE) different versions of docbook-xml, given that I never asked to install it.
However, the scanner gives you more than the public list of packages found on your box. For your own consumption (it is not sent to the census), it produces a detailed list of where each package was found. So you can analyze that list and eventually clean up your system.
There were other surprising results. I checked how often MySQL was installed, and it turns out to be present on 37% of the scanned boxes. The truly surprising result, however, is the distribution of old versions: about 20% of MySQL servers are still running version 4.x. Fascinating!
The project is young, but very promising, in my opinion. There may be problems with adoption by large corporations (security policies will be hard to deal with), but if the community picks it up, it may produce good results. Give it a try.

Thursday, May 15, 2008

What's in a name?

What's in a name? That which we call a rose by any other name would smell as sweet.

Good old William had his reasons to say so, but then either he was never bothered by misspellings, or he simply didn't care. (A recent book about the Bard actually points out that all of Shakespeare's known signatures are spelled differently, and none of them is spelled "Shakespeare"!)
The problem with my name is that, for the majority of non-Italians, it does not sound familiar, and consequently it gets misspelled. As a frequent traveler, I have seen every sort of mischief done to my first name.
I don't really understand why, but most English speakers write "GUIseppe" instead of "GIUseppe." Sometimes I have been called "Giuseppa" (which is a female name), "Giuseppi" (a non-existent plural of my name), or "Jooseppai" (which is how I pronounce it), plus several variations with extra or missing vowels.
Let me set the record straight, for anybody who cares.
The right spelling is "G-I-U-SEPPE". To remember the right order, you may use what I call the selfish rule: First I, then U. Got it? :)

Now, let me use this fact to compare the behavior of two companies. When I was hired by MySQL, I braced myself for the usual misspellings, which never came! All my official documents, personnel listings, email accounts, and my company badge were produced without a single mistake. I was impressed: apparently, MySQL HR people are very attentive to these details. When I was hired by Sun, instead, well, let's say that it was not the same quality. MySQL was an international company, while Sun seems more of a US company with international branches. They are learning, though.

I wrote the draft for this post long ago, but left it aside. Now, since it seems that I am not the only one with this problem, I am making my point.

Tuesday, May 06, 2008

Test driving the free and open online backup

Great news today (May 6, 2008). MySQL has confirmed that, contrary to previously leaked information, the online backup components will be open source. That means free, without exceptions, as announced. It couldn't be otherwise: on stage at JavaOne, both Jonathan Schwartz and Rich Green stressed the importance of being open.
Politics aside, what is this online backup, about which everyone has complained, but which few seem to have really tried?
In short, it's a feature that allows you to copy the data from one or more databases to a file.
The Preview page on MySQL Forge tracks the progress of the ongoing development. The user manual now has a new section on online backup, where you can find the syntax of the basic commands:
BACKUP {DATABASE | SCHEMA}
{ * | db_name [, db_name] ... }
TO 'image_file_name';
You can back up just one database, or all of them at once (BACKUP SCHEMA * TO 'all_dbs.bkp').

The RESTORE command is quite simple. No options are needed. Every database that was backed up will be restored from that file.

RESTORE FROM 'image_file_name';
CAVEAT: the RESTORE command is currently destructive. More destructive than mysqldump, actually. What does that mean? A backup made with mysqldump is a script containing DROP TABLE, CREATE TABLE, and INSERT statements. If you replay that script in your client, the existing tables are dropped and re-created. This is expected, and you know that mysqldump acts that way. However, the script created by mysqldump spares any table it does not contain: if you create a table after making the dump, reloading the script leaves your new table alone. The online backup, instead, will drop the whole database and re-create it from scratch. This behavior may change, but for the moment, be careful.
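A quick sketch of the difference (my_notes is a made-up table name):
backup schema employees to 'emp.bkp';
-- create a table AFTER taking the backup
create table employees.my_notes (id int);
restore from 'emp.bkp';
-- my_notes is now gone: RESTORE dropped and re-created the whole schema.
-- Reloading a mysqldump script taken at the same moment would have left
-- my_notes untouched.
show tables from employees like 'my_notes';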

Colin has recently published a simple test with the Sakila database. Let's try something more demanding, with an additional test that tells us whether the restored data is equivalent to what was backed up.
A conveniently sized data collection is this employees database, containing 4 million records. The data is fake, of course, but the size is large enough to be non-trivial.
One important feature of this package is that it includes a test suite, which tells you whether the data was loaded correctly, by calculating a global CRC for each table.
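Incidentally, if you want a quick check of your own, independent of the package's test suite, MySQL's built-in CHECKSUM TABLE offers a comparable (though not identical) verification. From within the employees database, run it before the backup and after the restore, and compare the results:
checksum table employees, departments, dept_emp, dept_manager, titles, salaries;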

Let's get started. For this experiment, I compiled the latest code from the mysql-6.0 tree, and used a MySQL Sandbox to test the server.
After downloading the employees database, I expanded it under the sandbox directory, and loaded the data:
$ ../use -t < employees.sql
+-----------------------------+
| INFO                        |
+-----------------------------+
| CREATING DATABASE STRUCTURE |
+-----------------------------+
+------------------------+
| INFO                   |
+------------------------+
| storage engine: MyISAM |
+------------------------+
+---------------------+
| INFO                |
+---------------------+
| LOADING departments |
+---------------------+
+-------------------+
| INFO              |
+-------------------+
| LOADING employees |
+-------------------+
+------------------+
| INFO             |
+------------------+
| LOADING dept_emp |
+------------------+
+----------------------+
| INFO                 |
+----------------------+
| LOADING dept_manager |
+----------------------+
+----------------+
| INFO           |
+----------------+
| LOADING titles |
+----------------+
+------------------+
| INFO             |
+------------------+
| LOADING salaries |
+------------------+
+----------------------------+
| INFO                       |
+----------------------------+
| Changes in current version |
+----------------------------+
+---------+--------------+------------------------------------------------------------+
| version | version_date | notes                                                      |
+---------+--------------+------------------------------------------------------------+
| 1.0.0   | 2008-01-03   | Initial release                                            |
| 1.0.0   | 2008-01-03   | No known bugs                                              |
| 1.0.1   | 2008-01-04   | Adapted code to run on 4.1                                 |
| 1.0.2   | 2008-03-15   | Added support for PBXT - replaced LOAD DATA with SQL dumps |
+---------+--------------+------------------------------------------------------------+
The next step, very important, is to test the load, to see if the data is the same as it was packaged.
$ ../use -t < test_employees.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | OK        |
| departments  | OK            | OK        |
| dept_manager | OK            | OK        |
| dept_emp     | OK            | OK        |
| titles       | OK            | OK        |
| salaries     | OK            | OK        |
+--------------+---------------+-----------+
This is the expected result from the embedded test suite. Now we know that the data is valid, and we can make a snapshot using the online backup.
backup schema employees to 'emp.bkp';
+-----------+
| backup_id |
+-----------+
|         1 |
+-----------+
What is this backup_id? It's a sequential internal identifier, used by the backup engine to keep track of its tasks. More on that later.
This backup is quite fast: compared to mysqldump, it is 10 to 20% faster (depending on the engine), and the resulting backup image is 30 to 40% smaller.
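If you want to repeat the comparison, a rough measurement can go along these lines (the user and password are MySQL Sandbox defaults; the socket path is an assumption, adjust it to your setup):
$ # online backup, timed through the sandbox client
$ time ../use -e "backup database employees to 'emp_timing.bkp'"
$ # mysqldump of the same data, for comparison
$ time mysqldump -u msandbox -pmsandbox --socket=/tmp/mysql_sandbox6000.sock employees > emp.sql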

Before proceeding with the restore, let's make sure that the test routine catches any small change.
select * from salaries limit 3;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  10001 |  60117 | 1986-06-26 | 1987-06-26 |
|  10001 |  62102 | 1987-06-26 | 1988-06-25 |
|  10001 |  66074 | 1988-06-25 | 1989-06-25 |
+--------+--------+------------+------------+
3 rows in set (0.00 sec)

select * from titles limit 3;
+--------+-----------------+------------+------------+
| emp_no | title           | from_date  | to_date    |
+--------+-----------------+------------+------------+
|  10001 | Senior Engineer | 1986-06-26 | 9999-01-01 |
|  10002 | Staff           | 1996-08-03 | 9999-01-01 |
|  10003 | Senior Engineer | 1995-12-03 | 9999-01-01 |
+--------+-----------------+------------+------------+
3 rows in set (0.00 sec)

delete from titles where emp_no=10002 limit 1;
Query OK, 1 row affected (0.00 sec)

update salaries set salary = 62103 where emp_no=10001 and salary=62102;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Now we repeat the test, and we observe that the test suite catches both the small changes we have made.
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | f16f6ce609d032d6b1b34748421e9195c5083da8 |
| titles       |           443307 | ce8ac7a8e39695bc0d000e7168ad240c8dada4bb |
| salaries     |          2844047 | 3135f3d7622a3ef788af4055200426d2c337d94d |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | OK        |
| departments  | OK            | OK        |
| dept_manager | OK            | OK        |
| dept_emp     | OK            | OK        |
| titles       | not ok        | not ok    |
| salaries     | OK            | not ok    |
+--------------+---------------+-----------+
At this point, we are satisfied that the data was loaded as expected, and that the test suite reliably catches any difference.
We are ready for the final test: dropping the database and restoring its contents from the backup file.
drop schema employees;
Query OK, 10 rows affected (0.04 sec)

###############################################################
### BOOM! without a backup, this statement would be a disaster!
###############################################################

restore from 'emp.bkp';
+-----------+
| backup_id |
+-----------+
|         2 |
+-----------+
1 row in set (1 min 15.46 sec)
Now the database is back from the dead. The last thing to do is to check that we indeed have a reliable copy of the data. We launch the test suite once more.
$ ../use -t < test_employees.sql
[...]
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | OK        |
| departments  | OK            | OK        |
| dept_manager | OK            | OK        |
| dept_emp     | OK            | OK        |
| titles       | OK            | OK        |
| salaries     | OK            | OK        |
+--------------+---------------+-----------+
Yeah! QED.
Not bad for a feature in its alpha stage. I will try to play with concurrent backup and restore calls, to see how well it handles the load. That's material for another post. In the meantime, have fun with it!

Slides from Creative Programming talk

Thanks to all the attendees of the session on creative programming with MySQL at CommunityOne.
As announced, the slides are online. 3.5M (PDF)
Enjoy!

Thursday, May 01, 2008

new blog at blogs.sun.com

Today, the integration with Sun took two steps forward for me. I finally got access to Sun's internal network and, as a consequence, I was able to create a new blog.
My first post is just an introduction for the larger Sun community. Starting with my next post, that blog will also be aggregated in Planet MySQL.