Tuesday, March 18, 2008

Creative MySQL hacking and Summer of Code


You can hack the MySQL server in three ways:
  • Changing the server code. This is straightforward. Download the code, and make the changes you need. If you need the change only for a limited period, this is enough. If the change must be persistent, then you should try to make it accepted. The Code Contribution Program may be for you.
  • Creating a plugin or a UDF. Using the plugin interface or the UDF API you can extend the server in many ways. UDFs require that you (and the user) be able to compile the additional code. Plugins require compiling code only from the author.
  • Mixing the existing hooks. This means being creative bu stretching the intended purpose of simple components of the server or external tools. This is how you can hack the server and create new features using stored routines, events, engines like federated, blackhole, archive, and external tools like MySQL Proxy. This is like composing chess puzzles, where you are given the basic rules, and you have to achieve clever results. This sort of jigsaw puzzle hacking is my favorite.

I am considering some of this latest hacking ideas for the incoming Summer of Code.
If you have any ideas on this department and you would like to pursue them, please send a message to the SoC mailing list. I will be glad to discuss them.

Additional reasons to attend the MySQL UC2008


I was not totally sincere.
When I said that I had seven reasons to attend the UC2008, I was just using a cute title. In reality, the main reason to attend the Users Conference is because I will meet a bunch of friends, and probably make new ones.
Yes! Let's not forget that the Users Conference, in addition to the conference rooms, has many corridors where you can start casual conversations with most anyone, and learn things as amazing as the ones that are said from the podiums.
And the social events! There is lunch by the pool, with ample tables, where you can be seated between a system architect and a stellar developer, and spend the whole meal talking about the subtleties of threading or the latest trend in distributed storage.

I can't say that I look forward to any tutorial, since my tutorial day will be filled with my own, but I should mention some more sessions that I look forward to. For example,
Since I was in the reviewing committee, I know all the sessions, and 98% of them have been accepted with my enthusiastic approval (2% of the times my opinion was outvoted, but it happens to all of us), so the only problem is that there is too much choice, and every time slot involves tough decisions about what to attend and what to miss.
More reasons:

Perl myths dispelled

Tim Bunce, the author of the Perl DBI, the database library that has influenced similar works in most languages, has published a presentation about Perl myths. The essential list of myths goes:
  • Perl is dead. No, it's alive and kicking (look at the numbers)
  • Perl is difficult to read. So are Java and C++, when written by bad programmers.
  • Perl 6 is killing Perl 5. Actually, the opposite is true. While Perl 6 is taking way too long to be implemented, its analysis has advantaged Perl 5, which has grown new features and performance in 5.10.
Highly educational stuff, as it is this other article: Good Perl code is the best form of evangelism.
As for me, Perl is my main tool of the trade, not only with MySQL and other database related work, but for all system tasks. I use it for mostly any scripting task, and on the command line, to replace grep, awk, sed. I also use it instead of shell scripting, whenever the script grows longer than a few lines.

Monday, March 17, 2008

Updated Community Development Program and a new blogger



Welcome to Patrik Backman (with his Nordic Brainstorm [*] ) among the bloggers! Patrik is Director of Software Engineering at MySQL, and very attentive to community matters.
In his first blog entry, Patrik presents the updated Community Development Program, created by Georg Richter, the Development Manager of Connectors and Client Connectivity, well known for his active role in the community.

The new program wants to strengthen the relationship between the external community and the developers inside the company. Contributors will choose among a list of features to implement and bugs to fix, and assigned to a developer who will act as mentor.
This way, the contributor can work closely with someone who can guide the coding in the right direction, saving time to both the coder and the reviewer.
If you have a knack for coding, give it a try!

[*]The name was suggested while brainstorming on IRC. Among the other choices was "the loud thought", which you will appreciate if you know Patrik. :)

Using the event scheduler with OS commands

One of the major additions to MySQL 5.1 is the the event scheduler. It is an internal scheduler, which does not need any help from the operating system. As such, it works independently in every platform.
One drawback of this feature, though, is that it can't communicate with the operating system. i.e. the event scheduler can't read system files, can't send e-mail messages, store data into log files. It can only work within the database server. This is convenient for security, but it is quite limiting. Time for hacking!

In getting started with MySQL Proxy I showed an example of how to run shell commands from any MySQL client. Unfortunately, this method can't be used with the events, because events can't send queries to the outside. Or can they? Let's be creative, and combine Federated tables and MySQL Proxy:
drop table if exists t1, t1f;
create server fed
foreign data wrapper mysql
options (
host '127.0.0.1',
port 4040
database 'test',
user 'msandbox',
password 'msandbox'
);
create table t1 (
id int nt null, cmd varchar(250), primary key (id), key (cmd)
);
create table t1f (
id int not null, cmd varchar(250), primary key (id), key (cmd)
) engine=federated connection = 'fed/t1';
Table t1f is Federated, and it is accessed through the Proxy port (4040). Thus, every query directed to this table will be intercepted by MySQL Proxy, and then we can do what we want.
The Lua script associated with this Federated table is the following, which intercepts UPDATE statements directed to table t1, and sends the contents of the cmd column to an echo statement, using a OS call. (It's just a proof of concept)
function read_query(packet)
if packet:byte() ~= proxy.COM_QUERY then return end
local query = packet:sub(2)
if query:match('UPDATE') then
local tokens = proxy.tokenize(query)
if tokens[10]['token_name'] == 'TK_INTEGER'
and tokens[2]['text'] == 't1'
and tokens[10]['text'] == '1' then
print( " --- " .. query )
os.execute('echo "executing ### ' .. tokens[6]['text'] .. ' ###"')
end
end
end
In short, this script checks if we are updating table t1 (not t1f! The Federated engine accesses the base table) with id = 1, and if yes, uses the content of cmd as argument for an OS command.
Now, let's create and event and see what happens:
set global event_scheduler=1;
create event e1
on schedule every 10 second
do
update t1f set cmd=concat('do this ', now())
where id =1;
The result, as observed by the Proxy, is as follows:
--- UPDATE `t1` SET `cmd` = 'do this 2008-03-17 02:27:47' WHERE `id` = 1 LIMIT 1
executing ### do this 2008-03-17 02:27:47 ###
--- UPDATE `t1` SET `cmd` = 'do this 2008-03-17 02:27:57' WHERE `id` = 1 LIMIT 1
executing ### do this 2008-03-17 02:27:57 ###
QED!

Saturday, March 15, 2008

PBXT, or how to really exploit the plugin architecture

MySQL 5.1 introduces the plugin interface, a series of API that allow the load and unload of server components at runtime.
The idea behind this architecture is to make MySQL server really expandable.
Surprisingly, the latest engines being developed in house (Falcon and Maria) are not implemented using the plugin technology. The best example of how to take advantage of this interface comes from the outside. Paul McCullagh's PBXT is a transactional engine that can be loaded at run time.
To use it, you can either download the pre-compiled plugin library, and load it into your server, or get the source code and build the whole server from scratch.
Of course, the most interesting option is the pre-compiled binary. I downloaded the latest 5.1.23 binaries for Mac OS X, and the corresponding plugin library for my system.
I installed the server using MySQL Sandbox, and then ran this command:
show variables like 'plugin_dir';
+---------------+----------------------------------+
| Variable_name | Value |
+---------------+----------------------------------+
| plugin_dir | /Users/gmax/opt/mysql/5.1.23/lib |
+---------------+----------------------------------+
This command tells me in which directory I should copy the library I just downloaded from primebase site.
$ cp ~/Downloads/libpbxt.so /Users/gmax/opt/mysql/5.1.23/lib
Now I can load the component.
INSTALL PLUGIN pbxt SONAME 'libpbxt.so';
Query OK, 0 rows affected (0.03 sec)

select * from information_schema.engines where engine='pbxt'\G
*************************** 1. row ***************************
ENGINE: PBXT
SUPPORT: YES
COMMENT: High performance, multi-versioning transactional engine
TRANSACTIONS: YES
XA: NO
SAVEPOINTS: NO
That's it! The additional engine is up and running! No compilation needed.

A rather unorthodox test follows. With the help of Patrick Crews I created a test employees database with about 4 million records in 6 tables. Not huge, but not trivial either.
I loaded this database using MyISAM, InnoDB and PBXT, and compared the loading times. Not surprisingly, MyISAM and InnoDB are much faster than PBXT.
Engine          Loading time
------------- ------------
MyISAM 1m08.546s
Innodb 1m46.622s
PBXT 3m20.842s
However, the database includes a test suite (If you have been following my blog, you shouldn't be surprised about that), which calculates the number of records and a global CRC for all the tables. This check includes queries like the following:
SET @crc= '';
INSERT INTO tchecksum
SELECT @crc := SHA1(CONCAT_WS('#',@crc,
emp_no,birth_date,first_name,last_name,gender,hire_date))
FROM employees ORDER BY emp_no;
INSERT INTO found_values VALUES ('employees', (SELECT COUNT(*) FROM employees), @crc);
Heavy stuff. I don't know how much relevant these queries are in real world applications, but that is what I had at hand without squeezing my brains, and this is the result of running the whole test that calculates the global CRC of 4 million records:
Engine          Checking time
------------- -------------
MyISAM 0m26.312s
Innodb 0m31.627s
PBXT 0m26.230s
In this particular test (repeated three times with two different operating systems), PBXT is faster than both MyISAM and InnoDB.
The engine is still green, and perhaps my test is silly, but this start is promising!

Thursday, March 13, 2008

Looking for a MySQL Proxy guru


MySQL Proxy is the most exciting addition to the range of MySQL products since 5.0. Using Proxy you can convert your database server into an application server, or you can create new command on the fly, fix bugs, filter queries, add load balancing to a set of servers, and a myriad of wonderful things.
The company itself is planning to do more with MySQL Proxy, and we have come to a point where we have more works in our hands that we can manage with the current manpower. So, we are looking for a Proxy enthusiast to become a QA engineer. The job opening is online. Look it up. It's a challenging job, but I can promise you that it's really exciting! (Working with developers like Jan Kneschke, Kay Roepke, Mark Matthews, Eric Herman, and the rest of the Enterprise team is a rewarding experience)

Reason #1 to attend the MySQL UC 2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#1 The lost art of the Self Join



When you work in the same field for several years, you risk to become effective but unimaginative. You may be good at coding queries or designing tables, but sometimes you lose track with your origins, when you were a creative programmer, who used to tweak the intricacies of C++ or Perl to create marvelous useless brilliancies.
If you recognize yourself in this picture, and wish you could have a spark of that enthusiastic force that made you learn new languages and idioms, despair not. Beat Vontobel session will be like a fresh wind that will clean your mind of the dull tasks and reconcile you to the beauty of programming.
If you are a programmer, come to the Users Conference and don't miss this one!

Don't forget this: Every speaker can give discount codes! Do you want one? drop me a note by email. Do you know another speaker? ask him/her for the discount code!

More reasons:

Wednesday, March 12, 2008

Reason # 2 to attend the MySQL UC2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#2 Astronomy, Petabytes, and MySQL



Petabytes! Have you heard that before? Perhaps you have, during some theoretical IT class. But have you used anything that could be remotely close to petabytes?
A petabyte is one thousand Terabyte, or one million Gigabyte.
If you struggle with databases with a few dozen Gigabytes, and you have headaches when you think that you have to deal with Terabytes of data, you can't possibly imagine how to store petabytes of data, and retrieving it!
This session is a challenge. At Stanford they are not morons, and if they are planning to use MySQL to store petabytes of data, there must be a good technical story behind it, which needs to be told. And you must not possibly miss it!

More reasons:

Reason #3 to attend the MySQL UC2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#3 Testing PHP/MySQL Applications with PHPUnit/DbUnit



Old school technologists don't think kindly of PHP. Its adepts are believed to be sloppy programmers that create brittle applications. It takes programmers like Sebastian Bergman to level the score and to show that a good programmer is shown by best practices, not by the choice of language.
The best thing a good programmer can do while developing an application is testing. Sebastian shows why you should do it, and how. Not only that, it will show you how easy it is. At the end of this session you will wonder how could you have survived for years without unit testing.
PHP programmers, mark your calendars!


More reasons:

Disabling Mac OS X spotlight, or how to make your life complicated

Since my previous laptop is temporarily out of service, I am using a new one with Mac OS X Leopard. Which is cool, and it has great enhancements. However, after a few hours of usage, I noticed a sudden increase of CPU activity (+50%), for no apparent reason.
The resource monitor fingered a process named "mdworker", and Google took me to a page explaining that the real culprit is spotlight, which is indexing the disk contents.
Since I needed all the CPU power of my laptop to compile the latest server, I asked Google again, and I found a page with a method to disable Spotlight permanently. The recommended method involves root access, changing attributes of a long list of files, and a reboot. (!!!)
I could not afford the luxury of a reboot at the moment (and the instructions looked too obscure for my taste), and therefore I used an alternative (and more Mac-ish) way. I opened the spotlight options in System Preferences, found the privacy section, and added the whole Macintosh HD to the list of places not to index.


Ten seconds later, the CPU activity was back to normal.
I have enabled again the spotlight indexing during the night, and all is well.

Tuesday, March 11, 2008

Want a discount to attend the UC? Call a speaker!



Every speaker at the Users conference has a 20% discount code for friends and relatives who want to attend the UC2008.
Would you like to attend the UC, and save 20%? Send me a message! (my_first_name AT mysql DOT com).
If you don't know me, but you know another speaker, ask him or her for the discount code! And then, when you register for the conference, enter the discount code in the registration form.

Notice that I disabled comments to this post, so nobody will be tempted to ask for the code here. If you need the discount code, don't post comments. Send an email!

Reason #4 to attend the MySQL UC2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#4 A Tour of External Language Stored Procedures for MySQL



Eric Herman and Antony Curtis are hard workers. You don't see them bragging about this and that. They do long and thorough research, and then they prototype. At that point, they claim victory, and usually with reason.
Their work on stored procedures using external languages is one of the most refreshing pieces of new technology applied to MySQL since the announcement of MySQL 5.0.
If you are dissatisfied with the cumbersome heaviness of the standard stored routines, come to this session. You will be surprised of the interesting things you can do NOW with a MySQL server.

More reasons:

Reason #5 to attend the MySQL UC 2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#5 Database Security Using White-Hat Google Hacking



Sheeri always amazes me. When I think that I have seen all it has to be seen in the field of databases, she will always come with some innovative way of looking at old things.
This session is a series of aha! ideas. Some of them can have come to you. You may have heard of some others, but Sheeri does not stop at that. When she hears about a promising practice, she digs in, until she gets all is there to be learned.
She's a captivating speaker. Once you attend one of her sessions (or if you just talk to her in a corridor about anything), you are addicted. Highly recommended!

More reasons:

Monday, March 10, 2008

Reason #6 to attend the MySQL UC 2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#6 Optimizing MySQL and InnoDB on Solaris 10 for World's Largest Photo Blogging Community



Farhan Mashraqi knows his job. He must be good at it, because he has on his hands a very busy and heavily loaded server. Fotolog is one of those servers that can take away all your resources very quickly if you don't pay attention. If you want to stay online, you have to make sure that the server is always healthy and efficient.
The above is theory. Farhan tells you the practice behind this simple rule. He will show how to manage such a monster application and still ride the waves.
Don't miss it!

More reasons:

Did you back up your data today?

My laptop disk died on Friday.
Although this is a sad occurrence, I was very pleased to find out that my backup was in good shape, and I was back in working conditions 30 minutes after the accident.
I don't regret buying a 2 TB firewire hard disk.

That made a huge difference. Last time that I had a breakout, restoring 50GB from a USB external disk took about 5 hours. Transferring the same amount of data through a firewire interface takes now less than 25 minutes. I love technology!

Reason #7 to attend the MySQL UC 2008

MySQL Conference & Expo 2008
Disclaimer: Forget about my affiliation, this is my personal list of things that I am going to enjoy at the UC.

#7 MySQL Proxy : the complete tutorial


I shall start with a shameless plug, of course. I am going to enjoy this tutorial for several reasons.

For starters, it's the first tutorial I get to host at the Users Conference, and this is understandably satisfactory in itself.
Then, because it is going to give me the technical room that I was longing for. I have been writing and speaking about MySQL Proxy for 10 months, and on every occasion I felt that I had time or space limitations. There was not enough time to explain all that we can do with Proxy, or not enough time to write a longer article.
Now we can remedy. I and Jan Kneschke will have 6 full hours to explain Proxy beauty and intricacies, ranging from simple wizardry to replication goodies.
We will cover the basics of Lua language, with some internals, so that the attendees will leave the tutorial with the basic know-how to use the Proxy effectively.
There are still available seats. Hurry up and register!

Friday, March 07, 2008

Virtual attendance to Meetup-mashup in Boston - March 10

I love this company!
MySQL is still a virtual company. The Sun acquisition hasn't change this very peculiar fact. Yesterday and the day before we have held IRC meetings between the MySQL community and some Sun open source big shots.
virtual sea
On Monday evening in Boston, there will be a more traditional user group meeting in Boston, host by Sheeri K. Cabral. Actually, not very much traditional. I mean, the real people will physically attend the meetup, but the whole meetup will be recorded, and you can participate via IRC. And, even better, I will make a guest appearance via video-conference.
So, please meet me in Boston on March 10, at 7 PM EST, on a cold evening, while I will talk from Sardinia, 6,400 Km (4,000 miles) apart. Thanks Sheeri for organizing this event!

Thursday, March 06, 2008

Logs of IRC Meetup session with Ian Murdock

As Kaj announced, there was a IRC session with Ian Murdock on March 5.
Since I have been working in a virtual company for 18 months, I should not be surprised about this kind of events. Actually, IRC has become part of my life. I could not conceive working efficiently without it. When some new colleague asks me to have a simple phone call, without any IRC or other char facilities, I start shaking at the idea of such poor and antiquate way of communicating.
Even for one with experience, though, a public IRC meeting in a crowded #room is quite an experience. The "location" for the meeting is room #mysql on irc.freenode.net, a room that normally contains 400~500 people, where the level of noise is quite high.
Melbourne, Feb 2008
Having a public interview in such room is like crashing a party of bikers and start talking philosophy. The beauty of the system is that nobody seem to care. The bikers continue talking about their stuff, and our interview on philosophy went quite smoothly, albeit intermingled with requests of help (promptly answered by a bunch of MySQL geeks) and some trollish shouting that added color to the picture.
For the curious ones, here is the full session logs.

Monday, March 03, 2008

DISTINCT? Don't be lazy!

Sometimes, the easy solution is not the best one. I saw this problem happening several times. The query returns duplicates, and the first reaction of the developer is to fix it with DISTINCT.

Let's look at an example. Given the data below:

select * from people;
+-----------+-------+
| person_id | name |
+-----------+-------+
| 1 | Joe |
| 2 | Mary |
| 3 | Frank |
+-----------+-------+
3 rows in set (0.00 sec)

select * from projects;
+------------+-------------+-----------+
| project_id | descr | person_id |
+------------+-------------+-----------+
| 1 | Joe First | 1 |
| 2 | Joe second | 1 |
| 3 | Mary First | 2 |
| 4 | Mary second | 2 |
| 5 | Frank first | 3 |
+------------+-------------+-----------+
5 rows in set (0.00 sec)

select * from jobs;
+--------+-----------+-----------+------------+
| job_id | job_descr | person_id | project_id |
+--------+-----------+-----------+------------+
| 1 | joe aaa | 1 | 1 |
| 2 | joe bbb | 1 | 1 |
| 3 | joe ccc | 1 | 2 |
| 4 | Mary aaa | 2 | 3 |
| 5 | Mary bbb | 2 | 3 |
| 6 | Mary ccc | 2 | 3 |
| 7 | Mary ddd | 2 | 4 |
| 8 | Frank aaa | 3 | 5 |
| 9 | Frank bbb | 3 | 5 |
+--------+-----------+-----------+------------+
9 rows in set (0.01 sec)

The problem comes with this query:

SELECT p.name, COUNT(j.job_id) AS total , job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.person_id = j.person_id
GROUP BY p.person_id ORDER BY total DESC,p.name;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 8 | Mary aaa |
| Joe | 6 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+

As you can easily see, the query reports twice the amount of jobs for Mary and Joe. The lazy solution is this

SELECT p.name, COUNT(DISTINCT j.job_id) AS total , job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.person_id = j.person_id
GROUP BY p.person_id ORDER BY total DESC,p.name;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 4 | Mary aaa |
| Joe | 3 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+
However, this query does not tackle the real problem, which is that the query is joining two tables (projects and jobs) using a non-primary key column. And this "solution" also ignores the even more serious problem that the person_id is redundant, and should not be in the jobs table in the first place. Joining with a pair of primary/foreign key is the right cure:
SELECT p.name, COUNT(j.job_id) AS total, job_descr
FROM people p
INNER JOIN jobs j ON p.person_id = j.person_id
INNER JOIN projects pr ON pr.project_id = j.project_id
GROUP BY p.person_id ORDER BY total DESC,p.name ;
+-------+-------+-----------+
| name | total | job_descr |
+-------+-------+-----------+
| Mary | 4 | Mary aaa |
| Joe | 3 | joe aaa |
| Frank | 2 | Frank aaa |
+-------+-------+-----------+
The result is the same, but if you apply these queries on a couple of heavily populated tables, the first lazy query can be 5 times slower than the second one. The reason is simple: since the join was done on a non primary key column, the query performs a Cartesian product of projects and jobs, followed by a costly sort to remove the duplicates. The second query, instead, filters off the duplicates efficiently on the first step, thus delivering the wanted result faster.