Thursday, December 28, 2006

MySQL Quality Assurance forum

As preparation for the upcoming Quality Contribution Program, a new MySQL forum was created today.
The new forum is dedicated to Quality Assurance matters. It is not the place where to submit bugs (there is already the bug reporting system for that purpose). It is rather a place where to discuss quality assurance problems, such as:
  • How do I report this particular kind of bug?
  • How do I make a test case for this specific situation?
  • What is the best strategy to report a nasty cluster of bugs?
  • Improving testing techniques;

Everything related to Quality assurance can be discussed there. If you have an idea on how to make better test cases, go there and launch the challenge. If you want to experiment a new technique for bug hunting, let's hear it!

Thursday, November 30, 2006

The hidden risks of SQL MODE

MySQL 5.0 introduces improved SQL modes, which can fine tune the way your server behaves. If you are a long term MySQL user, you may be familiar with the speed for accuracy trade-off. MySQL has a default for each field, and guesses a value when you don't provide an appropriate one when inserting or updating. If this behavior is not acceptable to you, you can now tell the server to be less permissive. Check out an article by Robin Schumacher, where this concept is explained thoroughly.

If you look at the manual, though, you will see that the SQL modes are quite a few, and you may be tempted to combine some of them to control every tiny part of the server input.
Beware, though. There are some pitfalls that you should be aware of. Let's walk through an example.

SET SQL_MODE='';
select @@sql_mode, cast(1 as unsigned) - cast(2 as unsigned);
+------------+-------------------------------------------+
| @@sql_mode | cast(1 as unsigned) - cast(2 as unsigned) |
+------------+-------------------------------------------+
| | 18446744073709551615 |
+------------+-------------------------------------------+
What is this? It is a subtraction between two unsigned values. There is an overflow in the result, and then the result of 1-2 becomes the highest BIGINT value minus one.
We can control this behavior, and use a specific SQL_MODE, NO_UNSIGNED_SUBTRACTION, to tell the server that it should not allow a subtraction between unsigned values, and treat them as signed instead.
SET SQL_MODE='NO_UNSIGNED_SUBTRACTION';
select @@sql_mode, cast(1 as unsigned) - cast(2 as unsigned);
+-------------------------+-------------------------------------------+
| @@sql_mode | cast(1 as unsigned) - cast(2 as unsigned) |
+-------------------------+-------------------------------------------+
| NO_UNSIGNED_SUBTRACTION | -1 |
+-------------------------+-------------------------------------------+
Fine. Now we know that we can take control of subtractions. But there is something more to know. The SQL MODE sticks to each procedure, function, or trigger, meaning that each routine is executed using the SQL mode that was active at creation time. This could lead to surprising results.
set sql_mode='';
drop function if exists subtraction;
create function subtraction(x int unsigned, y int unsigned)
returns int
deterministic
return x - y;

SET SQL_MODE='NO_UNSIGNED_SUBTRACTION';

select @@sql_mode, subtraction(1,2), cast(1 as unsigned) - cast(2 as unsigned);
+-------------------------+------------------+-------------------------------------------+
| @@sql_mode | subtraction(1,2) | cast(1 as unsigned) - cast(2 as unsigned) |
+-------------------------+------------------+-------------------------------------------+
| NO_UNSIGNED_SUBTRACTION | 2147483647 | -1 |
+-------------------------+------------------+-------------------------------------------+
Look here. We set the SQL_MODE to NO_UNSIGNED_SUBTRACTION, because we want to avoid that unpleasant effect, but the subtraction function was created with a different SQL_MODE.
Therefore, the operations inside such function will be affected by the stored SQL_MODE, regardless of the one that is active at the moment.

Whenever your result depends on a specific SQL_MODE, always check which mode is associated with the stored routines or triggers that you are using.
SELECT
routine_name,. sql_mode
FROM
information_schema.routines
WHERE routine_schema='test'
AND routine_name='subtraction'
AND routine_type='function';
+--------------+----------+
| ROUTINE_NAME | SQL_MODE |
+--------------+----------+
| subtraction | |
+--------------+----------+

There is also a more complex example involving triggers.

Monday, November 13, 2006

MySQL testing techniques: comparing tables

This is the first tutorial of a series dedicated to testing techniques.
Soon all this material will find a home in a more appropriate place. In the meantime, enjoy the lecture!

While testing large tables, it is useful to test if two tables have the same contents. For example, if you want to compare performance between two tables using different storage engines, or tables created on different filesystems, you must be sure that both tables have the same content. Having loaded both from the same source is not a guarantee that the contents are the same: a mistake or different SQL modes used during the load may result in substantial differences.

General concepts

Then, you need to compare two, possibly very large tables. There are several methods available. One is to run a query with a LEFT OUTER JOIN. However, this method is likely to take very long or even exhaust your system resources if your tables are really large.
One method that I have been advocating for long time is to run a global CRC on both tables and then compare the results.
And, I hear you asking, how do you get a global table CRC?
There is no predefined SQL feature for this task. Recent MyISAM tables have a built-in CRC, but you can't get it from a SELECT statement, and besides, if you need to compare the contents of such a table with one using a different engine, you are out of luck. Then, we need to use something more general, which can be applied to any table.
The first step to get a global CRC is to get a list of the columns that we can then pass to a CRC function such as SHA1 or MD5.

This list is a string made of the name of the columns, which we will pass to a CONCAT_WS function. However, if you know how SQL functions work, you will know that any NULL value in the list will nullify the whole expression. Therefore, we need to make sure that every nullable column is properly handled by a COALESCE function. The result of this operation, which we delegate to a stored function, is a safe list of column.
The second step towards a global table CRC is to calculate a CRC for each record. We use the above list of columns to create a SELECT statement returning a SHA1 for each record in the table. But, what to do with it? There is no aggregate SQL function available for SHA or MD5. Thus, we need to process the result and calculate our CRC manually.
As noted in a previous post, we can do that in two ways, using cursors or using a blackhole table. Some benchmarks show that the blackhole table is much faster than the cursor, and this is what we do.

We start with an empty CRC. For each row, we compute a CRC of the whole record, plus the existing CRC. Since we are using a SELECT statement, we need to get rid of the output, because we are only interested in the calculation stored in the user variable. For this purpose, a black hole table is very well suited. At the end of the SELECT + INSERT operation, we have in hand two variables, one showing the count and one holding the global CRC for the table.
Repeating this process for the second table we need to compare, we can then compare two simple values, and determine at a glance if we are dealing with comparable data sets.

Implementation

Let's put the concepts together with a few stored routines.

delimiter //

drop function if exists get_safe_column_list //
create function get_safe_column_list
(
p_db_name varchar(50),
p_table_name varchar(50),
p_null_text varchar(20)
)
returns varchar(10000)
reads sql data
begin
if ( @@group_concat_max_len < 10000 ) then
set group_concat_max_len = 10000;
end if;
return (
select
group_concat( if(is_nullable = 'no', column_name,
concat("COALESCE(",column_name, ", '", p_null_text,"')") ))
from
information_schema.columns
where
table_schema= p_db_name
and
table_name = p_table_name
);
end //
The first function returns a safe list of column names.

drop function if exists get_primary_key //
create function get_primary_key (
p_db_name varchar(50),
p_table_name varchar(50)
)
returns varchar(10000)
begin
if ( @@group_concat_max_len < 10000 ) then
set group_concat_max_len = 10000;
end if;
return (
select
group_concat(column_name order by ORDINAL_POSITION)
from information_schema.KEY_COLUMN_USAGE
where
table_schema=p_db_name
and table_name = p_table_name
and constraint_name = 'PRIMARY' );
end //
The second routine returns a table primary key, as a list of column.
drop procedure if exists table_crc //
create procedure table_crc (
IN p_db_name varchar(50),
IN p_table_name varchar(50),
OUT p_table_crc varchar(100)
)
reads sql data
main_table_crc:
begin
declare pk varchar(1000);
declare column_list varchar(10000);
set pk = get_primary_key(p_db_name, p_table_name);
set column_list = get_safe_column_list(p_db_name, p_table_name, 'NULL');
if (column_list is null) then
set p_table_crc = null;
leave main_table_crc;
end if;
set @q = concat(
'INSERT INTO bh SELECT @tcnt := @tcnt + 1, ',
'@tcrc := SHA1(CONCAT(@tcrc, CONCAT_WS("#",', column_list, ')))',
' FROM ', p_db_name, '.', p_table_name,
if (pk is null, '', concat(' ORDER BY ', pk))
);
drop table if exists bh;
create table bh (counter int, tcrc varchar(50)) engine = blackhole;
set @tcrc= '';
set @tcnt= 0;
prepare q from @q;
execute q;
set p_table_crc = concat(@tcnt,'-',@tcrc);
deallocate prepare q;
end //
The third procedure returns the global CRC of a given table.

drop procedure if exists table_compare //
create procedure table_compare (
IN p_db_name1 varchar(50),
IN p_table_name1 varchar(50),
IN p_db_name2 varchar(50),
IN p_table_name2 varchar(50),
OUT same_contents boolean
)
begin
declare crc1 varchar(100);
declare crc2 varchar(100);
call table_crc(p_db_name1,p_table_name1, crc1);
call table_crc(p_db_name2,p_table_name2, crc2);
select concat(p_db_name1, '.', p_table_name1) as table_name, crc1 as crc
union
select concat(p_db_name2, '.', p_table_name2) as table_name, crc2 as crc ;
set same_contents = (crc1 = crc2);
select crc1=crc2 as 'same contents';
end //

delimiter ;
The final routine puts all pieces together, returning a boolean value telling if the two tables have the same contents.

Testing

After loading the above code in our database, we can call a "table_crc" procedure to get our coveted value. Let's take the famous world database and let's give it a try.

mysql> use world;
Database changed
mysql> create table City2 like City;
Query OK, 0 rows affected (0.03 sec)

mysql> alter table City2 ENGINE = InnoDB;
Query OK, 0 rows affected (0.09 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into City2 select * from City order by District, population;
Query OK, 4079 rows affected (0.33 sec)
Records: 4079 Duplicates: 0 Warnings: 0

First of all, we create another table, with the same structure of City, but using a different engine, and storing data in a bizarre order to see if our routine is robust enough. In fact our routine will calculate the CRC after sorting the data by primary key, so that there won't be any surprise.

mysql> call table_crc(schema(), 'City', @city_crc);
Query OK, 0 rows affected, 1 warning (0.16 sec)

mysql> select @city_crc;
+-----------------------------------------------+
| @city_crc |
+-----------------------------------------------+
| 4079-407840fbf812b81eee55d3a438cf953f81c63bc0 |
+-----------------------------------------------+
1 row in set (0.00 sec)

mysql> call table_crc(schema(), 'City2', @city2_crc);
Query OK, 0 rows affected (0.13 sec)

mysql> select @city2_crc;
+-----------------------------------------------+
| @city2_crc |
+-----------------------------------------------+
| 4079-407840fbf812b81eee55d3a438cf953f81c63bc0 |
+-----------------------------------------------+
1 row in set (0.01 sec)

When we compare the CRC, we can easily see that the two tables are the same. If all these statements are tedious to write, we can use a shortcut:

mysql> call table_compare(schema(), 'City',schema(), 'City2', @same);
+-------------+-----------------------------------------------+
| table_name | crc |
+-------------+-----------------------------------------------+
| world.City | 4079-407840fbf812b81eee55d3a438cf953f81c63bc0 |
| world.City2 | 4079-407840fbf812b81eee55d3a438cf953f81c63bc0 |
+-------------+-----------------------------------------------+
2 rows in set (0.24 sec)

+---------------+
| same contents |
+---------------+
| 1 |
+---------------+
1 row in set (0.24 sec)
Le'ts make it fail, to see if it is true:

mysql> update City2 set population = population + 1 where id = 1;
Query OK, 1 row affected (0.05 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Now the two tables will have at least one difference:
mysql> call table_compare(schema(), 'City',schema(), 'City2', @same);
+-------------+-----------------------------------------------+
| table_name | crc |
+-------------+-----------------------------------------------+
| world.City | 4079-407840fbf812b81eee55d3a438cf953f81c63bc0 |
| world.City2 | 4079-b3b613b20570024be727ef0454053a96cfc36633 |
+-------------+-----------------------------------------------+
2 rows in set (0.23 sec)

+---------------+
| same contents |
+---------------+
| 0 |
+---------------+
1 row in set (0.23 sec)
And our routine finds it. QED.

Sunday, October 29, 2006

Speaking at the Open Source Database Conference 2006

I will be a speaker at the Open Source Database Conference, which is held in Frankfurt from 6th to 8th November 2006, parallel to the International PHP conference.
I will present two sessions, one on Advanced Replication Techniques in MySQL 5 and the other on The MySQL General Purpose Stored Routines Library.
I submitted both proposals long before I started considering joining MySQL, so I will go there with the blessing of my current employer, but I will speak under my former affiliation, to avoid attributing to my current company what are my personal ideas.

Thursday, October 19, 2006

Contributing to MySQL QA - Ideas wanted

MySQL has recently started a campaign of open contribution, inviting the community to participate to the MySQL project in many ways.

The next target, also considering the higer stakes coming from the MySQL Enterprise challenge, will be Quality Assurance.

Quality Assurance

Now what is Quality Assurance (QA)? If you think that it's just bug hunting, then you have a simplistic view of the software generation lifecycle. QA deals with all the steps in the software lifecycle, and at each steps there are actions that can affect the quality of the final outcome. QA components include (but are not limited to) failure testing, statistical control, process and performance control, best practice adoption.

MySQL AB has its own QA department. Those of you who attended the MySQL Users Conference may have had a chance to attend a presentation by senior QA manager Omer BarNir about Internal QA in Open Source Development, where he explains the challenge of being a QA professional in such a dynamic company (presentation slides ).

Software testing

Since MySQL is a software company, finding bugs is important, of course, and software testing is one of the main branches of QA, but the crucial part of finding bugs is not how many you can find, but when you find them. The earlier you find a bug, the less costly will be to fix it.

Looking for bugs as soon as a feature pops up in the development lifecycle is just part of the task. The way you look for bugs is different from person to person, depending on who you are:
  • a developer with an intimate knowledge of the code may find low level bugs that nobody else can catch, with tools like code peer reviews and walkthroughs;
  • a professional tester, even without development experience, may find functional bugs by testing the application with a methodical approach;
  • the final user, who does not have to know any of the above, may find bugs by simply using the application and comparing the results with earlier expectations.

Of all these three methods of finding bugs, the first one is the most effective. Bugs found at that stage are the least expensive to fix. But of course one can't find all the bugs with code inspections. And then there are the other two levels, which catch bugs that will cost more effort to fix.

Obviously, the bugs that affect the final users are the ones for which you care most, and it would be of mutual benefit to find these bugs as early as possible.

Community involvement

From the above explanation, you can see what the problem is. No matter how skilled are the professionals at this job, finding all the bugs is impossible. You can take any book on this subject, and every one of them will tell you the same postulate: You will not find all bugs. One thought is especially discouraging for us. Although the QA professionals find a lot of bugs, those which you neve hear of, because they are fixed before you hit the download button to get the application, there are still the ones that affect you, the final user, the most important party involved.

Therefore, here is the idea. We want to involve the community of users in our Quality Assurance activities.

How can we do that? To tell you the truth, we have plenty of ideas on what we can do and how to promote it. After all, until one month ago I was still just one of the many community members, and I have some ideas.And there are many brilliant people in house who came up with promising ideas.

But at this stage I don't want to tell you what I think it should be done, but rather listening to what users propose.

The questions to answer are:
  • how can MySQL involve the users in earlier bug finding?
  • how can MySQL involve the users in its testing process?
  • what else can community members do for QA? (hints: bugs verification, performance testing, standard compliance testing)
  • What kind of incentive would make you (more) willing to cooperate? (hints: public recognition, free services, discount on services, contests with prizes)

If you have an idea related to this topic, even if it is not an answer to the above questions, write a comment to this post, or drop me a line (giuseppe {mind the spelling} at mysql dot com).

Thanks in advance!

Also published at O'Reilly Databases

Tuesday, October 03, 2006

Take the MySQL Certification in five steps

I recently took two certification MySQL 5 exams. At that time, I was on vacation, but now that I am back I would like to share with the community some advice on how to pass the exams.
Son't worry, it is not about cheating. But read on, and you'll decide if it was worth listening.

1. read the book

So much for the ones who thought I was teaching some tricks. Nothing like that. Let me tell you why you should read the book.
  • You could read the (free) online manual instead. No questions about that. If you read the whole manual, you will know all is needed to pass the exam. But you will have to read twice as much as the book (about 1300 pages of user manual instead of 672 pages of certification book).

  • The book will tell you what is important for the exam and what isn't. True, even if the certification exam does not mention it, it could be important (and usually is), but you really want to pass, don't you? So, the book is better.

  • The book is organized by exams. There are four of them, and the book covers their subjects nicely in a very organized way. In the manual, you either read it all, or you will never be sure that you covered everything.

Notice that, although I am now a MySQL employee, I don't get any share from the book sales. What I am saying here is what I honestly believe. I bought the book long before an employment with MySQL was even faintly suggested. Actually, I should add that everything in this article is my personal opinion, and it is not the official take of MySQL AB.
Summing up this item: Read the book, because it will save you time.

2. Get some hands-on experience

Reading the book (or the manual, or both) is not enough. Even if you commit the whole beast to memory, it won't be enough to pass the exam. To pass it you need to apply your knowledge to some real world problems. I can't tell you the questions you are going to get in the exam, but I can tell you the gist of it.
During the exam you won't get questions like "what does the book say about this matter?"
Instead, you will get questions like "given this problem with this set of conditions, which of the following actions is most likely to solve it?"
If, in addition to reading the book, you have some practical experience, you will be able to apply what you learned and answer the questions. If you have a prodigious memory and remember every word of the user's manual but have never tried some of that stuff in the wild, chances are that you won't pass the exam.

3. answer the sample questions from the book

After you get some experience, then try to answer the sample questions from the book. Be aware that the book ask questions in a way that is different from what you get in the exam. The book my ask you to "list all the methods to solve a given problem," while at the exam you get questions like "which of the following methods will solve the given problem?" and you get a multilple-choices-list. So the questions from the book are actually more difficult than those in the exam itself. That's fair. If you answer all the questions from the book, the ones in the exam will look a lot easier.

4. Participate to a forum and answer questions

When it comes to practice questions, you can't get enough of them. After you answer all the questions from the book, you still feel that some more exercise could do you good. There's an easy way of practicing. Subscribe to a mailing list, a newsgroup, a forum dedicated to MySQL, and read through the questions that people ask every day. Even better answer some question yourself! Start with the easy questions, and then try tackling the though ones. You may not know all the answers, but you can find out, because you have read the book and you know where to look in the manual. If you don't, it's a good moment for starting. The real trick is this: whenever you answer a question about something that you know only in theory, spend a few minutes to do it in practice. This way, you will be sure that your answer is correct (and you'll avoid some embarrassment) and you will add some more experience to your bag of tricks. This whole process will boost your confidence a lot. After a few weeks of answering at least one question per day, you will be a celebrity in that forum of your choice, you will have made somebody happy, and many people will have thanked you. What better way of studying?

5. Play chess

Now, wait a minute! What has this to do with the exam? Don't worry. I am not out of my mind, and I will explain shortly what I mean.
During the exam, you will have to answer 70 questions in 90 minutes (the upgrade exams has a different timing, but if you go for it you will have already taken an exam, and you know already what I am talking about). This is a great source of stress. Having a clock that clicks your time away can have a negative influence on your answers. Talking to some other candidates who took the exam, the greatest concern was that time restriction. But you know what? It was not a problem at all for me. And the reason is that I am a chess player, and therefore I am used to taking decision with a clock ticking at my side, and telling me that my time is near exhaustion. In competition chess games, you are given a double clock with two buttons. When it's your time to move, your button is up and your time is running. When you have decided your move, you make it, and push the button. Then your clock stops and your opponent's start ticking.
If you are used to this stressful way of taking decisions in rapid (30 minutes for the whole game) or blitz games (five minutes!), a simple clock giving you 90 minutes for 70 "moves" looks like a joke.
So, if you play chess, resume your chess club card, or play some Internet game, and get some practice at time management. If you don't play chess, answer the above mentioned questions with a clock that rings after a given time.

Then, get a good night's sleep and take the exam

As a last piece of advice, remember that a certification exam is a stressful experience, no matter how well you have prepared. So you need all your strenght and energy for it. Go to the exam well rested and fresh. If you have to take more than one exams, don't do them in a row. Put at least a few hours before the next one, and in between take a walk, read a book, or do anything to recharge your spirits.

Good luck!

Monday, October 02, 2006

Are logins before download any good?

If you are used to open source products, chances are you have gone through this routine more than once. Search for what you need, find a suitable product, go to its web site, download it, test it. Then, if you like it, you start using it right away, otherwise you dump it without a second thought.
The whole process takes less than one minute for small packages. But anyway, even for larger packages, the total time that this whole business requires your attention is very low. Even if it requires a huge download, it can be left unattended and you can resume the testing task when you feel like it. The bottom line is that we got used to a quick try-and-use process of open source products.

Sometimes, though, while performing the above routine, there is a unexpected obstacle. The product maker requires a free login. You don't have to pay anything, but you have to go through the motions of filling a form that asks you everything about your precious self, your company, education, employment history, financial health, and so forth.
Filling these forms is really annoying for several reasons:

  • You got used to the quick download-and-try business, and this sudden stop is not welcome
  • You can't see any added value in this form filling. Actually, you are sure that your level of spam (both by email and by regular mail) will increase;
  • You think at the waste of time this form is, especially considering that you may be throwing the whole product away after ten minutes.
  • This is contrary to the whole open source spirit, where you achieve success by providing a good product. The register-bedore-downloading strategy, instead, tries to cheat into a let's-grab-a-potential-customer-as-soon-as-he-shows-up utterly losing attitude.
After this problem has bitten you once or twice, you start developing a strong defense strategy. The next time some site asks you for a registration before downloading, you start filling the form with fake information, using a temporary but legitimate email address, claiming to run a multi-million dollar business, and presto! you get away with the download, never to be seen to that site again, unless that product is really a earth-shakening tool (which seldom is, in those cases).

And so here are two reasons not to impose a registration before downloading an open source product:
  • It's useless. If you want to cheat the unfair system, they can't do nothing to prevent it.
  • It's damaging. If they want to propose an open source product, imposing a registration is like screaming: "Hey! We want to play the open source game, but we are totally and hopelessly unaware of how to play the game. Cheat us!"
Which I usually do.

Also published at ITToolbox

Friday, September 29, 2006

Log Buffer #12: a Carnival of the Vanities for DBAs

I fell for it. I commented on the Log Buffer and shortly after that I was offered to host an issue. Here we go, then. Welcome to Log Buffer #12!

Confessions of an IT Hitman by Dratz features a strong message: Don't build a data warehouse, arguing that most of the times a DW is built, it's just because a clueless customers was either listening to buzzwords from a salesman or following the latest trend. The key message, for the ones who missed the build-up of the data warehousing movement during the past decade, is DW is a business solution, not a technology solution. OTOH, there are projects that would really need a data warehouse, and don't get one. But this is a different story.

Mats Kindahl in his MySQL Musings talks about Replication and the disappearing statements, i.e. the risks and gotchas of replicating data in MySQL while limiting the databases involved in the replication process. It is something that the manual states quite clearly, but it bites back quite often nonetheless. Mats explains some unusual points in this old issue. Querying a table with a fully qualified name may result in more than overkill. You may be the next victim of the infamous disappearing statement.

Brian Aker is always a volcano of ideas. His Brian 'Krow' Aker's Idle Thoughts don't show much idleness. Rather, he's often producing some hack at great speed, or he's pondering on a new hack. This one, Firefox, yum, sounds interesting, because the suggestion came from MySQL CEO, who is also a man of vision. It looks like something will come out of this clash between dreamers.

Vadim Tkachenko from MySQL Performance Blog, which he runs together with Peter Zaitsev goes int a wrestiling with the latest beta of SolidDB for MySQL.

Test Drive of Solid is a comparision between the behavior of the SOlid engine compared to InnoDB, to check if a migration from InnoDB to Solid could be done easily. Vadim finds several differences and a few bugs. SolidDB doesn't seem to be solid enough in this beta, but it's getting closer.

A charming set of slides from Lars Thalmann in his Lars Thalmann's Blog talking about the joy and the shock of Being a MySQL Developes. If you thought that a developer for an open source company had an easier job, think again. Go through Lars's slides and discover a new dimension of cooperative work.

Good news from Markus Popp (db4free.net blog). For almost one year I have waited for MySQL status and variables to be available as INFORMATION_SCHEMA tables. Now it seems that MySQL 5.1.12 offers New information_schema views that do exactly that. I built MySQL from source a few days ago, and the new features were not there. After seeing Markus's post, I rebuilt it, and spent some time toying with the fresh additions. For example, calculating the key cache efficiency is one task that previously you had to perform with the help of a programming language. Now you can do it in SQL.

A glimpse into the future from Mikael Ronstrom, senior software architect at MySQL AB. The State of MySQL partitioning seems to be close to bug-free, and he's already fiddling with new enhancements in the next version of MySQL. It's almost a pity that I had to submit a new bug for partitioning just hours after his post!

Matt Asay from Open Sources examines The spirit of winning, taking into account the elements tha make a winning team in athletics, and finds that the same elements apply to open source competitiveness. Food for thought.

Bob's World by Bob Field is usually full of interesting concepts. This one, Advanced Features as Crutches is no exception. Advanced features, Bob warns against using an advanced feature just because it's there. Everything has its place, but it is no mandatory to use all the frills a DBMS engine offers. If you do so, security and simplicity may suffer. I fully agree.

In The Oracle Base Blog, Timothy S. Hall turns a joke into a lesson. So many people ask Where's the 'Go faster' switch, and they won't understand the hard work necessary to actually improve the performance. So he makes a surprising proposal. Read on.

Craig Mullings's dbazine.com features two entries that appeal to common sense. Intelligent Database Management lists the tasks of a DBA, stressing the effort that has to be put into the job. Only through intelligent automation of the tasks it is possible to keep up with the chore of managing database systems at a high professional level. Choosing the best DBMS is an old question that does not get a straight answer. Craig goes explains what you need to take into account to reach a sensible decision.

If you liked the previous entries, be aware that Craig writes in different places. DB2PORTAL Blog is one, and there he wrote Sequence Objects and Identity Columns. It is not another chapter of the natural vs surrogate keys saga, but a lucid explanation of what are the differences between identity columns and sequence objects, when each of them is needed and how to handle them. Even if DB2 is not your database of choice, his explanation can help you choose the right solution.

This entry gives a deja vu feeling after you read the one by Craig about Intelligent Database Management. One day in the life of a DBA, as seen by Jeff Hunter is what you get from reading What do you do all day? (a piece of So What Co-Operative). Not a common DBA, mind you, but a very organized one. As seen previously, the good DBA job is a balanced mixture of experience, organization, and an educated choice of technology. That way, even long lasting tasks can be approached with a quiet frame of mind. Jeff's tasks include such things as monitoring the databases, insuring that backups are done, checking for critical conditions. He sounds less worried than the average DBA, because he merges experience and knowledge with the right tools for the job.

Raven Zachary is a contributor to The 451 Group. In Open source database poll highlights barriers to adoption he comments on a survey launched by SolidDB about the acceptance of open source database. It seems that most of the IT professionals attending a recent Linux Expo were timid about adopting open source database.

A different angle is offered by another research, as reported by Zack Urlocker in his The Open Force blog. Linux Database Attach Rate says that 85% of RedHat customers show a strong interest in open source databases. I guess that this research and the survey in the previous item come from different sources. Zack also reports on the Zmanda Recovery Manager, an open source (dual licensed) tool dedicated to backup and recovery of MySQL databases.

Paddy Sreenivasan covers the same topic in Selective recovery of MySQL databases at the O'Reilly database weblogs. Paddy highlights the common needs in a backup solution and explains how Zmanda covers them.

If you jump to Sheeri Kritzer weblog, its title "My-ess-queue-ell vs. My-see-quel" will give away her main area of expertise. One intriguing entry in her blog, Choosing datatypes for fields is a clever hack to improve the quality of data uploads into MySQL tables. So simple, and yet so good. To choose the best datatype, first use a bad one, and then let the system tell you which ones you can have. Well done! If you know Sheeri (I had the pleasure of seeing her at work during the MySQL Users Conference, but reading her blog is enough to get the idea) you will see that she doesn't accept anything blindly, but she needs to explore and dig deeply every concept. In Isolation & Concurrency she takes Jim Starkey's assertion that MVCC could not be serialized as a starting point to an enthusiastic explanation of why you can't do it.

DBMS2 features a piece by Curt Monash, delving into the intricacies of data warehousing terms.
Data warehouse and mart uses - a tentative taxonomy is a quick introduction to the world of business intelligence by comparing the features that each different technology supports or requires. It won't replace a good book, but it sets the record straight in less than a page. Really commendable.

The scintillating Roland Bouman has done that again. He saw an intriguing problem with some clever solutions, and found a way of doing the same job in pure SQL. The matter at hand was Finding Redundant Indexes using the MySQL Information Schema. Others have done this before, using different methodologies. Roland points to the language-independent solution, and he delivers, as he usually does. Browsing Roland Bouman's blog you'll find several witty examples of his analytic approach.

A storm of advice for Oracle developers from ... a DBA. Andy Campbell in his Oracle Stuff I Should Have Known ! gives sensible advice to developers. If Only ... more developers used some nifty features that andy explains in full, their life (and the DBA's as well) would be much easier. This kind of advice (use application metadata to say which parts are being executed) would be good for any DBMS. If only ...

Thomas Kyte grabs the suggestion in The Tom Kyte's Blog and elaborates on his own about Action, Module, Program ID and V$SQL.... This entry, like the previous one, generated quite a stir, with good and bad vibrations on both sides of the Developers-DBA divide.

And finally, a personal closing note. I mentioned in one of my previous posts that I joined MySQL AB as QA developer. A few days in the job, and now I am experiencing the thrill of working in a virtual company

Thursday, September 28, 2006

Working in a virtual company




A few days in the job for MySQL and still, even though I can't say I have mastered the tasks at hand (too much to get used to!) I tasted the feeling of working with perople living in different time zones.
To make this feeling more real, I went to a Chinese shop a few blocks down my street and I bought six cheap clocks. I know that I could use the net but the tangible clocks give me a sense of being in a "real office"!

Monday, September 25, 2006

Riding the Dolphin

I have been a consultant for long time. It is an interesting job, which has given me much satisfaction. The coolest part of the job is the inner pleasure you feel when you solve a difficult problem in a few minutes, earning a reputation for being a wizard.
The negative part of the job is being alone. As a consultant, you are supposed to know everything, and as such you don't have anybody to talk with when you have doubts (and if you are really good you must have doubts from time to time). Well, this condition makes you stronger, and improves your confidence, but your interaction with your co-workers is quite weak.
So I decided to try a new career for a change, and to cash on my experience and knowledge of databases in general and MySQL in particular. And which position should be more suitable than one at MySQL AB itself?
I applied for a couple of jobs at MySQL, got two offers, chose the one that appealed me the most, and here I am. Today I joined the MySQL Quality Assurance department, while turning to a minimum my involvement with my consulting company.
From now on, I'll write a lot about testing and quality assurance. I have talked about testing in the past, but only as a matter of personal taste when developing. Now it's time to involve the community in the testing and QA process. More about this later.
In the meantime, I take a break from consulting and start the new job.

Thursday, September 14, 2006

A fruitful vacation - MySQL 5 certification

Near the end of a two weeks vacation in southern Italy, I combined pleasure and business and took the 4 to 5 upgrade exams, which I passed, to achieve both MySQL 5 DBA and Developer certifications.
I took both exams today in Palermo, one in the morning and one in the afternoon. Now I am back to full-time tourist for a few more days.

Thursday, August 31, 2006

MySQL Sandbox tutorial reprinted by (IN)SECURE magazine


The latest issue of (IN)SECURE magazine, a freely available digital security magazine discussing some of the hottest information security topics, is featuring a reprint of my introduction to MySQL Sandbox.

Friday, August 11, 2006

MySQL Federated Engine: the missing manual

If you have used the MySQL Federated engine, you may share my opinion that it's perhaps the most amazing feature introduced in recent versions, far more interesting and promising than other more advertised ones.
If you have used it, though, you must have noticed how skinny is the official manual. There are countless questions popping from your head while testing this feature, and no answers in sight.
Federated Engine Missing ManualWell, take heart! The missing manual for the Federated Engine was just released. As any missing manual, it is not an official document, but just a collection of many practical things I found out in my daily usage. There must be much more to find and to try, but for now, enjoy this article as the first step toward being productive with the Federated engine.


Errata: you can safely ignore the lines mentioning foreground in the color codes table. The producer has not rendered it the way it should have been. It will be hopefully fixed soon.
(fixed)

Monday, July 31, 2006

MySQL Sandbox tutorial translated in Portuguese

A while ago I wrote a tutorial on the MySQL Sandbox.
Wallace Reis asked me permission to translate it, and I happily agreed.
And now there it is: the tutorial Portuguese translation (the first part, but the rest will come).
Thanks, Wallace!

Friday, July 28, 2006

An ugly query should be a warning bell

A recent question in a newsgroup was

Given this table in MySQL 4.0, how can I run a single query to sort the records by the number of pipes in myfield?
id(int) -------- myfield(varchar)

1 --------- |number1|number2|
2 --------- |number1|
3 --------- |number1|number2|number3|
4 --------- |number1|number2|number3|number4|

Suppressing my urge to say something about the data structure, I suggested this clever solution from a test sample I deduced from the request:

select * from mytable;
+----+-------------------+
| id | myfield |
+----+-------------------+
| 1 | 10|1|12|9 |
| 2 | 1|2 |
| 3 | 19|5|59|91 |
| 4 | 111|31|10|1|12|9 |
| 5 | 1|2|7 |
| 6 | 95|59|91 |
| 7 | 123456789|2000000 |
+----+-------------------+

select id, myfield,
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(
replace(myfield,
'9',''),
'8',''),
'7',''),
'6',''),
'5',''),
'4',''),
'3',''),
'2',''),
'1',''),
'0','') as so
from mytable
order by so;
+----+-------------------+-------+
| id | myfield | so |
+----+-------------------+-------+
| 2 | 1|2 | | |
| 7 | 123456789|2000000 | | |
| 5 | 1|2|7 | || |
| 6 | 95|59|91 | || |
| 1 | 10|1|12|9 | ||| |
| 3 | 19|5|59|91 | ||| |
| 4 | 111|31|10|1|12|9 | ||||| |
+----+-------------------+-------+

However, the poster added hastily that his request was wrong, and that what he identified as "numbers" are actually character strings of various length, thus making my solution much harder to find. The sample given was


1 --------- |ruote in lega|macchine gialle|
2 --------- |case in affitto|
3 --------- |treni|vagoni|camini|

But then I overcame my desire of showing off with my cunning solutions, and told him that, apart for the initial misunderstanding about numbers and strings, his table was stained with the unforgivable sin of not being even in 1NF.
Therefore, the right course of action would be to split the table:

contents
1 ruote in lega
2 macchine gialle
3 case in affitto
4 treni
5 vagoni
6 camini

mytable
ID id_content
1 1
1 2
2 3
3 4
3 5
3 6

This way, the ugly query to sort by "how many pipes" would be reduced to the simple

SELECT id, count(*) as how_many
from mytable inner join contents using (id_content)
group by id
order by how_many

Beware of ugly queries. Often they tell you that something in your structure design may be wrong.

Monday, July 17, 2006

Obscure settings

I remember when MySQL was spreading rapidly, mostly due to its ease of use. That was long time ago, apparently. Now we are entering the dark ages of obscure settings, or so it seems, reading some MySQL Cluster setting definitions.
Quoting from the the manual:

TimeBetweenLocalCheckpoints

This parameter is an exception in that it does not specify a time to wait before starting a new local checkpoint; rather, it is used to ensure that local checkpoints are not performed in a cluster where relatively few updates are taking place. In most clusters with high update rates, it is likely that a new local checkpoint is started immediately after the previous one has been completed.

The size of all write operations executed since the start of the previous local checkpoints is added. This parameter is also exceptional in that it is specified as the base-2 logarithm of the number of 4-byte words, so that the default value 20 means 4MB (4 × 220) of write operations, 21 would mean 8MB, and so on up to a maximum value of 31, which equates to 8GB of write operations.


Any suggestions for taking this definition down to planet Earth?

Thursday, July 06, 2006

Measuring MySQL resources consumption on Linux

If you have ever felt the need of measuring how much of your resources a MySQL process is eating up, you're welcome to share my experience on the subject, reading Measuring resources for a MySQL server on Linux, which also introduces mysqlresources, a new command line tool that gets your server's statitics from the operating system and prints a nice report.

With MySQL, it's easy to get statistics from the server itself, but sometimes you need a view from the outside. mysqlresources does just that. Read on.

Wednesday, July 05, 2006

A first taste of logs on demand

From time to time I install MySQL from the development tree, to see what's new, and in the latest one I found a pleasant surprise:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 5.1.12-beta

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql [localhost] {datacharmer} ((none)) > show variables like "%log";
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| back_log | 50 |
| general_log | OFF |
| innodb_locks_unsafe_for_binlog | OFF |
| log | OFF |
| slow_query_log | OFF |
| sync_binlog | 0 |
+--------------------------------+-------+
6 rows in set (0.00 sec)

mysql [localhost] {datacharmer} ((none)) > select * from mysql.general_log;
Empty set (0.01 sec)

mysql [localhost] {datacharmer} ((none)) > set global general_log=1;
Query OK, 0 rows affected (0.01 sec)

mysql [localhost] {datacharmer} ((none)) > select * from mysql.general_log;
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| event_time | user_host | thread_id | server_id | command_type | argument |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| 2006-07-05 17:26:56 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | select * from mysql.general_log |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
1 row in set (0.04 sec)

mysql [localhost] {datacharmer} ((none)) > select * from mysql.general_log;
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| event_time | user_host | thread_id | server_id | command_type | argument |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| 2006-07-05 17:26:56 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | select * from mysql.general_log |
| 2006-07-05 17:27:08 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | select * from mysql.general_log |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
2 rows in set (0.00 sec)

mysql [localhost] {datacharmer} ((none)) > set global general_log=0;
Query OK, 0 rows affected (0.01 sec)

mysql [localhost] {datacharmer} ((none)) > select * from mysql.general_log;
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| event_time | user_host | thread_id | server_id | command_type | argument |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
| 2006-07-05 17:26:56 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | select * from mysql.general_log |
| 2006-07-05 17:27:08 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | select * from mysql.general_log |
| 2006-07-05 17:27:11 | datacharmer[datacharmer] @ localhost [] | 3 | 0 | Query | set global general_log=0 |
+---------------------+-----------------------------------------+-----------+-----------+--------------+---------------------------------+
3 rows in set (0.00 sec)
As you can see, the promised logs on demand are there.
This feature was long overdue. Well done!

Tuesday, June 20, 2006

Talking at FrOSCon 2006



I will be at FrOSCon 2006 on June 24, with a talk about Pivot tables in MySQL 5.
I will arrive in Sankt Augustin on June 23, flying from Alghero Fertilia, Italy to Frankfurt Hahn, Germany and then driving to the conference site (travel plan).
The funny thing is that, when I booked my flight, I had no idea that I was about to get in the middle of the Soccer World Cup. Being no soccer fan, I realized this fact only when the media circus started and I could not avoid being informed. Oh, well, now I understand why it wasn't easy to find a hotel, even booking two months in advance!

The slides and the source code for this talk are available from my site.

Friday, June 02, 2006

Filling test tables quickly

Let's say that you are building a new application, and you need to test it against a large set of data. You would need either to borrow the data from some known source or to create it yourself.

If you have such collection at your disposal, good for you. But more often than not you need some sort of data that is not ready in your repositories, and then you need to have a quick method to create it.

I will tell you three quick tricks to create large datasets of simple data types, namely numbers, words, and dates.

Le's assume that we need at least one million records. Here's how to go for it.

numbers

Creating a large table with just a numeric field is quite simple.

You may be tempted to run a loop in your favorite language, or even in a SQL stored procedure, but this approach would run for quite a long time.
There is a better solution, that will fill your large table in a few seconds

drop table if exists numbers;
create table numbers ( id int not null primary key);

delimiter $$

drop procedure if exists fill_numbers $$
create procedure fill_numbers()
deterministic
begin
declare counter int default 1;
insert into numbers values (1);
while counter < 1000000
do
insert into numbers (id)
select id + counter
from numbers;
select count(*) into counter from numbers;
select counter;
end while;
end $$
delimiter ;

call fill_numbers();
Rather than inserting 1,000,000 lines, this procedure will insert just one record, and then it will double the table 20 times, until we end up with a table containing 1,048,576 lines (220). This operation runs in less than 8 seconds in my laptop, which is quite slow if compared to my usual servers.
Even without a stored procedure, you could just insert a line manually and then execute 20 times this query:
insert into numbers (id) select id + (select count(*) from numbers) from numbers; select count(*) from numbers;

And it should not take you more than 30 seconds.

words

If your case calls for a large list of unique words, you could of course build it with a program, but then again the insertion process would be quite slow to complete. A faster method is to load an existing list from as file.

All Unix system include a word list, whose size ranges from a few thousand to half a million. If you don't have such a list available in your box, you can get one (or build it from several ones) from several places. Good places to start looking for words are this and this.
At the end of the exercise, let's say that you've got a list of about half a million distinct words in /usr/share/dict/words. Thus, you can proceed to build your table.

drop table if exists words;
create table words (
id int not null auto_increment primary key,
t varchar(50) not null
);

load data local infile '/usr/share/dict/words'
into table words (t);

Query OK, 518584 rows affected (4.94 sec)
Records: 518584 Deleted: 0 Skipped: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 518584 |
+----------+
1 row in set (0.04 sec)
That was quick, but we got only about half of what we needed. We said that we wanted one million records, and here we only have a little more than five hundred thousand.
Since we need unique words, we can ask the database to produce the missing half by reversing the existing ones.
insert into words (t) select reverse(t) from words;
Query OK, 518584 rows affected (3.98 sec)
Records: 518584 Duplicates: 0 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1037168 |
+----------+
Almost there. Now we passed the million records mark, but we are not sure that they are unique, since the reverse of one word could be the duplicate of an existing word (think about mood and doom, for example). Thus, to complete the task, let's add a unique index with the IGNORE clause, so that we'll get rid of any duplicate.
alter ignore table words add unique key (t);
Query OK, 1037168 rows affected (46.69 sec)
Records: 1037168 Duplicates: 5791 Warnings: 0

select count(*) from words;
+----------+
| count(*) |
+----------+
| 1031377 |
+----------+
That's it. A million words, without much sweat.

dates

Finally, let's see how to create a large list of dates. Actually, you would seldom need a million dates, since a million days cover more than 2,700 years. A table of just dates, thus, will usually range from 1,000 to 10,000 records. When you need a million records, you are more likely talking about DATETIME values, with intervals of hours, minutes, or seconds. Nothing will prevent you from using this technique to create one thousand DATE values, but let's keep to our original goal of filling a large table.
Then, if we want records with one minute interval, we can proceed like this:
drop table if exists dates;
create table dates (
id int(11) not null auto_increment primary key,
dt datetime not NULL
) engine=myisam;

delimiter $$

drop procedure if exists make_dates $$
CREATE PROCEDURE make_dates( max_recs int)
begin
declare start_dt datetime;
declare numrecs int default 1;
set start_dt = date_format( now() - interval max_recs minute, '%Y-%m-%d %H:%i:00');

insert into dates (dt) values (start_dt );

while numrecs < max_recs
do
insert into dates (dt)
select dt + interval ( numrecs ) minute
from dates;
select count(*) into numrecs from dates;
select numrecs;
end while;
end $$

delimiter ;
Does it look familiar? It should, since it's the same technique we've used to put a set of numbers into a table.

Only this time we use the record count to calculate the interval in minutes between the existing records and the ones being inserted.
Also in this case, we double the table 20 times, to pass the one million records mark.

call make_dates( 1000000 );
+---------+
| numrecs |
+---------+
| 2 |
+---------+
1 row in set (0.02 sec)

+---------+
| numrecs |
+---------+
| 4 |
+---------+
1 row in set (0.02 sec)

# ... 16 more counts

+---------+
| numrecs |
+---------+
| 524288 |
+---------+
1 row in set (5.99 sec)

+---------+
| numrecs |
+---------+
| 1048576 |
+---------+
1 row in set (10.18 sec)

select count(*), min(dt), max(dt) from dates;
+----------+---------------------+---------------------+
| count(*) | min(dt) | max(dt) |
+----------+---------------------+---------------------+
| 1048576 | 2004-07-07 13:57:00 | 2006-07-05 18:12:00 |
+----------+---------------------+---------------------+
It took a little more than the numbers, because of the overhead of calculating one million date intervals, but it was about ten seconds in total, which is a reasonable time to get a test table.
There are other techniques as well, but these three are techniques that you can use to get the job done even without dedicated tools.

Wednesday, May 24, 2006

Tutorial for a more powerful MySQL Sandbox

The MySQL Sandbox has been enhanced, with a new express installer, capable of setting up a sandbox in just a few seconds, and an interactive wizard, for fine tuning complex installations.
A lengthty tutorial is accompanying the new release.

This tool could be the basis for a long overdue interactive installation tool for Unix systems. What do you think?

Sunday, May 14, 2006

New homes


I have a new web site. I registered the datacharmer.org and datacharmer.com domains, which are now jointly my main page. (Both addresses point to the same page)


Another change involves the MySQL General Purpose Stored Routines Library, which has evolved to version 1.0 (not alpha anymore). Its home is now at SourceForge.

Wednesday, April 26, 2006

MySQL Community Awards

Second day of the MySQL USers Conference 2006.
At breakfast, Arjen approached me, asking if I could hang nearby the podium during the keynote. "You may end up on stage", he said. I thought that they were going to hand me the iPod that was announced some time ago. So I went there and secured a seat as close as I could, together with Markus, Beat, Roland, and Sheeri.
However, instead of the iPod (I got it after lunch, BTW) I was called on stage to get a Community Award, and the same was granted to Roland, Markus, and Rasmus (!!)
price giving


Looks quite useful. Now when somebody asks difficult questions, I can always consult the crystal ball. :)

Thanks to the MySQL Community for this wonderful surprise, and congratulations to Roland, Markus, and Rasmus!

(Pictures courtesy of Roland. Thanks)

Tuesday, April 25, 2006

Talk at the MySQL UC: Higher Order MySQL

From the highly unorthodox department, a speech about stored routines that create other routines in MySQL.

Higher Order MySQL at the MySQL Users Conference, 17:20, Ballroom C.

Get the source code and examples.

Friday, April 21, 2006

Advanced replication techniques in MySQL 5.x



OnLamp has just published an article I wrote about some interesting things you can do with MySQL 5.x and replication.
  • You can create an array of peer-to-peer nodes that are at the same time master and slave, without much worry about insertion conflicts, thanks to some server variables introduced in MySQL 5.0;
  • You can achieve automatic failover using a combination of federated tables (MySQL 5.0) and events (5.1)
The article is a proof of concept that using such techniques you can achieve your goals without help from external applications.

If you want to get your hands dirty, using the companion package MySQL Replication Playground you can try out the goodies described in the article without risking your valuable servers.

Comments welcome.

Introducing MySQL 5 Replication Playground

As promised when I talked about the MySQL Sandbox and as mentioned in this (very) recent article on replication, I have released a testing environment for replication, called the MySQL 5 Replication Playground.
It's a package that creates in one single server an array of four replicating nodes using MySQL 5.0 or 5.1. The purpose of this package is to help developers and administrators when they need to test applications involving replication. You don't need to use several boxes. One machine is enough. With just one instance of MySQL installed, the Replication Playground will install four data directories and it will create the necessary scripts to start, stop, and using the nodes.
The installation comes in two flavors: standard (i.e. one master and three slaves, as usual) and circular (where each node is at once master and slave, as explained in the above mentioned article).
Once you install it, you can play with it and test your failover procedures, without worrying about your valuable production data. The playground works with separated ports and sockets, and it's completely independent from any other MySQL installation.

The docs in the distribution package contain also some sample sessions with both the standard and the circular installation.

The package requires:
  • Linux, or FreeBSD (theoretically, it should work with any Unix flavor, but it has been tested only with Linux and FreeBSD)
  • MySQL 5.0 or 5.1
  • bash
  • tar
  • gzip
  • Perl (for the installation)
  • Optionally, you need the Perl DBI and DBD::mysql to run some test tools

Wednesday, April 19, 2006

Introducing MySQL Sandbox

Normal MySQL users are going through a flurry of announcements about new versions that make them salivate ath the thought of what can be done with these new goodies. In November, version 5.0 became generally available, thur ready for production, then version 5.1 entered beta.
Ah! It would be nice if I could just install the new versions and test them. Sometimes, though, you can't do that, because you don't have root access, or you have already an older MySQL server running.
For years, due to my consulting duties, I have maintained several non-conflicting, different versions of MySQL. I did that manually, using some scripts that I have developed and tuned over time. For me it is not a big hassle. I am used to the command line, I wrote the tools, I know what to take care of, it's easy. Not so for some of my co-workers, who need to use the same facilities but lack the same degree of confidence that I got with trial-and-error.
So finally I put together all I did for myself and transferred most of my own knowledge into a Perl script that can create a seamless installation of MySQl 5.0 or 5.1 without need of root access, and without disturbing (or being disturbed by) existing MySQL installations.
I called this tool the MySQL Sandbox and I put it on SourceForge.

With this package, achieving your goal of a separated testing installation is quite simple (provided that you are running Linux or some other Unix, that is):
  • Download the binaries for the version of MySQL you want to try, or follow Markus Popp's instructions and compile your own.
  • Unpack MySQL binaries in a convenient place (for the sake of this example, let's say you put it under $HOME/binaries/5.1.10)
  • Download the latest MySQL Sandbox package and unpack it.
  • Run the install program, saying where are the binaries
$ ./install.pl --basedir=$HOME/binaries/5.1.10
The MySQL 5 Sandbox, version 1.1 19-Apr-2006
(C) 2006 Giuseppe Maxia, Stardata s.r.l.

installing with the following parameters:
basedir : /home/gmax/binaries/5.1.10
binbash : /bin/bash
conf_file :
datadir_from : archive
db_password : datacharmer
db_user : datacharmer
force : 0
home_directory : /home/gmax
install_version : 5.1
my_file :
operating_system_user : gmax
sandbox_directory : mysql_sandbox
sandbox_port : 3310
verbose : 0
do you agree? ([Y],n)
If you answer "Y", or just press ENTER, the sandbox will create a $HOME/mysql_sandbox directory, and a data directory under it, containing the minimum for your MySQL installation to work.
The destination directory will also contain some shell scripts to start, stop, and use the test server.
Optionally, you can also change the default configuration file by specifying --my_file={small|large|huge} . The sandbox installer will try to locate the sample files from the basedir you indicated, and will merge the basic parameters into your destination file (my.sandbox.cnf)
Once the installation is over, you go to the newly created directory, and start the server:
$ ./start.sh
~/binaries/5.1.10 ~/mysql_sandbox
~/mysql_sandbox
sandbox server started
After that, using the new server is easy.
$ ./use.sh
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 5.1.10-beta-max

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql [localhost] {datacharmer} ((none)) > exit
As you see, there is also a dedicated prompt that may remind you what you are using.
It has been tested in several Linux boxes. Ideally, it should work on every Unix box. Read the docs, or try
$ ./install --help

Happy testing!

However, that is not the end of it. More goodies are coming in the next days. Stay tuned.

Sunday, April 16, 2006

SQL Easter egg

There is a tentative Obfuscated SQL contest with one entry by Kai Voigt.
SELECT CONCAT(CONV(536,8,20), LPAD(EXPORT_SET(1,'Y','N','',1), 3, "P"), CHAR(32), HEX(234), UNHEX(594939469*POW(3,2)))
Not bad, but ... SQL without data ? Doesn't sound right.
Here's my take (requires MySQL 4.1 or later):
drop table if exists complex;
drop table if exists simple;
CREATE TABLE complex ( simple char(2), complex char(1)) ;
CREATE TABLE simple ( complex char(2), simple char(1));

INSERT INTO complex VALUES ('0','7'),('1','6'),('2','6'),('3','6'),('4','2'),
('5','7'),('6','7'),('7','7'),('8','6'),('10','7'),('11','7'),('12','2'),
('13','7'),('14','6'),('15','6'),('24','2'),('25','6'),('26','6'),('27','6'),
('28','6'),('29','6'),('30','7'),('31','6');
INSERT INTO simple VALUES ('0','4'),('1','F'),('2','5'),('3','8'),('4','0'),
('5','2'),('6','1'),('7','3'),('8','A'),('10','3'),('11','5'),('12','0'),
('13','4'),('14','E'),('15','1'),('24','0'),('25','c'),('26','1'),('27','8'),
('28','b'),('29','3'),('30','2'),('31','5');
set @complex = sqrt(sqrt(0x19a1));
set @simple = sqrt(sqrt(sqrt(0x290D741)));
select group_concat(unhex(concat(simple,complex))
order by complex.non_simple separator '' )
from (select simple ^ @complex non_simple, complex simple from complex) simple natural join
(select complex ^ @simple non_simple, simple complex from simple) complex ;

Enjoy!

Sunday, April 09, 2006

Sorting data by set of keys

This seems to be a popular problem, since I saw it twice in the past few weeks in two different newsgroups. Somebody complains that, having query with a set of keys like this
SELECT * from mytable where id IN (200, 2 ,100)
they get the results in a order that ius different from the one they specified. For example, they may get something like
select
*
from
main_table
where
id in (200,2,100);
+-----+----------+
| id | contents |
+-----+----------+
| 2 | b |
| 100 | aa |
| 200 | bb |
+-----+----------+
One of the posters complained that the result should come in the specified order [200,2,100], and that MySQL was arbitrarily sorting its resultset.
When I see such requests, I usually explain that there are two misconceptions. The first being that MySQL sorts results without asking. That it does not do, because a DBMS engine is basically lazy by design, and it won't do anything unless explicitly asked for. The second is that you are asking for a sort order in your set of keys. Such list is just an indication of which keys to consider for filtering the results, and it has no influence at all on ordering. If you want ordering, you need to say so explicitly, using the ORDER BY clause.
Once I clarified what the problem is, let's see the solutions.
You can sort it on-the-fly, if your values are reasonably few
select
*
from
main_table
where
id in (200, 2, 100)
order by
case id
when 200 then 1
when 2 then 2
when 100 then 3
end ;
+-----+----------+
| id | contents |
+-----+----------+
| 200 | bb |
| 2 | b |
| 100 | aa |
+-----+----------+
This solution looks like a hack, and in fact it can negatively affect performance, and it is difficult to handle when you need to change the sort order. The best way is to use a support table providing the required sort order.
create table sorting_table (
sort_order int not null auto_increment primary key,
fk_id int,
key (fk_id)
);
# here we insert the values in the order we want.
# the auto incremented key will insure that this is the correct one
insert into sorting_table (fk_id) values (200), (2), (100);
select * from sorting_table;
+------------+-------+
| sort_order | fk_id |
+------------+-------+
| 1 | 200 |
| 2 | 2 |
| 3 | 100 |
+------------+-------+
Now to get the wanted order we use a join between the main table and the secondary one.
select
main_table.*
from
main_table
inner join sorting_table on (id=fk_id)
where
id in (200,2, 100)
order by
sort_order ;
+-----+----------+
| id | contents |
+-----+----------+
| 200 | bb |
| 2 | b |
| 100 | aa |
+-----+----------+
This is a much better solution, which offers flexibility and efficiency at the same time.

Dumping MySQL information schema

I have been invited to join the O'Reilly weblogs on databases.
Today I gave my first contribution, with a piece on dumping MySQL information schema contents.

Sunday, April 02, 2006

Measuring replication speed

I am writing an article about some advanced replication features (coming soon), and I wanted to measure the speed of each specific architecture I am working with.
There seem to be no builtin MySQL solution for this particular problem, and thus I am using a hand made metrics tool, inspired by something that Jeremy Zawodny suggested in his High Performance MySQL a few years ago (see the relevant chapter on replication with code).
The naive method would be to set a timestamp in a master table, and to measure the same record when it shows up in the slave. Unfortunately, (but correctly, I have to say!) this does not work, because the binary log includes information on the current timestamp, so that master and slave in the end have exactly the same contents.
Using a trigger in MySQL 5.0 seems promising, because you can set the timestamp tothe value of SYSDATE(), thus showing the different insertion times in master and slave. However, MySQL imestamps precision is limited to seconds. Therefore, if I insert a record at 12:00:00.9999 and the slave receives it at 12:00:01.0001, MySQL will record a difference of 1 second, even though in this case the difference would be just a couple of milliseconds.

With this in mind, I worked on Zawodny suggestion, which was still dealing with seconds, to imrove the method and measure the performance of a replication system with a much better accuracy (possibly microseconds).

The schema works as follows (see the flow chart):

Get the system time in terms of seconds and microseconds (I am using Perl with the Time::HiRes module, but any major language has a corresponding feature). Insert that time in a table on the master, measuring also the time spent for the insertion operation.
Immediately after, start looping on the slave to get that same record. When you do it, get again the system time (with microseconds precision) and subtract the time you got from the table.
For better precision subtract the time you used for inserting and for fetching, and you get the bare replication time.

Here is an example, using a record of 1000 bytes, in a loop that increases the size of the record by 1000 bytes at each step.

0 41 5 1143983120 992916 1143983120 997036
1 41 5 1143983121 207218 1143983121 211084
2 41 5 1143983121 420654 1143983121 424630
3 41 5 1143983121 634802 1143983121 638909
4 41 5 1143983121 849051 1143983121 853521
5 41 5 1143983122 65003 1143983122 67892
6 41 5 1143983122 279944 1143983122 282933
7 41 5 1143983122 493895 1143983122 498320
8 41 5 1143983122 710443 1143983122 715032
9 41 5 1143983122 927867 1143983122 932013
10 41 5 1143983123 145376 1143983123 149648

This is the raw output used for debugging. The first colum is the step. The second is the ID retrieved from the record, the third one is the sequence number (I inserted 5 records for each steps, and asked to fetch the record with sequence no. 5).
The fourth and fifth columns are the system time that was stored in the table. The remaining two columns are the system time as recorded at fetch time.
With the above data, we can produce a listing of the actual raplication times, in human-readable format.

loop data size master insert slave retrieval total repl. time bare time
---- --------- ------------- --------------- ---------------- ---------
1 1000 0.001655 0.000859 0.004120 0.001606
2 1000 0.001643 0.001268 0.003866 0.000955
3 2000 0.001505 0.000850 0.003976 0.001621
4 3000 0.001604 0.000849 0.004107 0.001654
5 4000 0.001943 0.000851 0.004470 0.001676
6 5000 0.001853 0.000981 0.002889 0.000055
7 6000 0.001964 0.000968 0.002989 0.000057
8 7000 0.003285 0.001081 0.004425 0.000059
9 8000 0.001431 0.000851 0.004589 0.002307
10 9000 0.003076 0.001012 0.004146 0.000058
11 10000 0.001282 0.001600 0.004272 0.001390

So the average replication time is, in this case, 0.001040 seconds.
For this test, I was using a sandbox environment, where both master and slave are in the same server. Thus there is some slight overhead due to the fact that one CPU is handling two instances of MySQL, and the same disk is hosting two logs (binary on the master, relay-log on the slave) in addition to the actual tables.
Using a real production replication, these times go down to an average of about 0.000500 seconds. It depends on several factors, from the speed of your networking equipment to the sheer power of your boxes.
Anyway, this is a tool that can be useful sometimes. You can get the source code from MySQL Forge snippets.

Comments welcome.

Thursday, March 09, 2006

Seeking alternatives to cursors

As users of stored procedures know, cursors can only work with explicit SQL statements, while they don't work with dynamic queries (there was an article by Beat Vontobel and a consequent feature request).
Trying to overcome this limitation, I came up with a somewhat useable alternative and a performance surprise. Let's go see them in due order.
The example is built against the World database provided in MySQL web site.
The method used for this test is an algorithm to compute a global checksum for a table. You may know that such a global CRC exists already, but it is only available for MyISAM tables, and as a procedure only (feature request #17009).
To calculate such a checksum we can do two things:
  • Using a cursor, we start with an empty string, representing the global checksum, then calculate the checksum for each row, and we merge such checksum to the global checksum.
  • Using a SELECT statement getting the checksum of each row, we start with a empty user variable, and we merge that result to the user variable.
The second method seems a bit expensive, because it fetches as many records as are in the table, creating a recordset that we really don't need, since we are only interested in its fina result. A nice touch in this case is the usage of a BLACKHOLE Storage Engine. Instead of a simple SELECT, we use a INSERT ... SELECT statement.
Here comes the preparation code.

use world;

drop table if exists new_city;
create table new_city like City;
insert into new_city select * from City;

delimiter //

drop procedure if exists double_recs//
create procedure double_recs ()
deterministic
begin
declare counter int default 0;
while counter < 7 do
insert into new_city select NULL,Name,CountryCode,District,Population from new_city;
set counter = counter + 1;
end while;
end//

delimiter ;

select 'inserting a bunch of records in new_city ' as 'wait a moment' ;
call double_recs() ;
select count(*) as 'rows in new_city' from new_city;
drop procedure if exists double_recs ;

drop table if exists crc_res;
CREATE TABLE `crc_res` (
`crc` varchar(100) default NULL,
`cnt` int(11) default NULL
) ENGINE=BLACKHOLE ;

delimiter //

DROP FUNCTION IF EXISTS make_City_CRC //

CREATE FUNCTION make_City_CRC( )
returns varchar(100)
READS SQL DATA
BEGIN
declare done boolean default false;
declare sha_global varchar(100);
declare counter int;
declare sha_riga varchar(100);
DECLARE cursor1 CURSOR FOR
SELECT SHA1(
CONCAT_WS('#',ID,Name, CountryCode, District,Population))
FROM new_city
ORDER BY ID;
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = true;

OPEN cursor1;
SET sha_global = "";
set counter =0;
GET_ROWS:
loop
FETCH cursor1 INTO sha_riga;
IF done THEN
leave GET_ROWS;
END IF;
SET sha_global = SHA1(CONCAT(sha_global, sha_riga));
set counter = counter + 1;
END loop;
CLOSE cursor1;
set @gcounter = counter;
RETURN sha_global;
END //

delimiter ;

Now we have a function to calculate the checksum of table new_city, which is a copy of City, inflated with copies of its records until it gets about half a million (to be able to compare timings with whole seconds rather than milliseconds), and we also create a BLACKHOLE table as preparation for the alternative method.
The CRC business goes like this: for each row, we calculate the SHA1 value of the whole record, by taking a CONCAT_WS of all the columns. You can see why I would need dynamic SQL to make this function a generic one. For each row, we get the SHA1 of the general checksum, concatenated to the row checksum. At the end, the general checksum thus calculated is returned.
In the case of the cursor method, the syntax is clear. In the alternative method, we use a user variable, previously initialized, and set it using the ":=" operator, thus preserving its value from row to row. Since we are sending the recordset to a BLACKHOLE table, we are not burdened by an unwanted recordset.
Executing both methods, we get the same results, but the timing are quite different. The testing code follows.

select 'using a BLACKHOLE table' as 'first test';

set @gcrc = '';
set @gcnt = 0;
set @start_test1 = now();
insert into crc_res
select @gcrc := sha1(concat(@gcrc, sha1(concat_ws('#',ID,Name,CountryCode,District,Population)) ) ),
@gcnt := @gcnt + 1 from new_city order by ID;
set @end_test1 = now();

select @gcrc;
select @gcnt;

select 'using a cursor' as 'second test';

set @start_test2 = now();
select make_City_CRC();
set @end_test2 = now();
select @gcounter;

select timediff(@end_test1, @start_test1) as 'test BLACKHOLE', timediff(@end_test2, @start_test2) as 'test CURSOR';

And here is the result.

+-------------------------+
| first test |
+-------------------------+
| using a BLACKHOLE table |
+-------------------------+
Query OK, 522112 rows affected (11.19 sec)
Records: 522112 Duplicates: 0 Warnings: 0
+------------------------------------------+
| @gcrc |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
+--------+
| @gcnt |
+--------+
| 522112 |
+--------+

+----------------+
| second test |
+----------------+
| using a cursor |
+----------------+
+------------------------------------------+
| make_City_CRC() |
+------------------------------------------+
| 1f4c512ff345a88a876fe0faddcf37344cb34f29 |
+------------------------------------------+
1 row in set (18.51 sec)

+-----------+
| @gcounter |
+-----------+
| 522112 |
+-----------+
+----------------+-------------+
| test BLACKHOLE | test CURSOR |
+----------------+-------------+
| 00:00:11 | 00:00:18 |
+----------------+-------------+

The cursor method takes 18.51 seconds, while the SELECT+BLACKHOLE method takes only 11. Can anyone comment on this performance difference?
I know that this is not a general replacement to a cursor, but keep it in mind when you would like to use dynamic SQL with a cursor.

Friday, March 03, 2006

MySQL 5.1: Improving ARCHIVE performance with partitioning

In a recent article (Improving Database Performance with Partitioning) Robin Schumacher explains how to use range partitioning, giving some impressive examples of performance gains.

The examples given are quite slow to create. For those willing to try out the same tables with a filling function a bit faster, here goes:
drop table if exists no_part_tab;
create table no_part_tab
(c1 int(11) default NULL,
c2 varchar(30) default NULL,
c3 date default NULL) engine=myisam;

drop table if exists part_tab ;
CREATE TABLE part_tab
( c1 int default NULL,
c2 varchar(30) default NULL,
c3 date default NULL
) engine=myisam
PARTITION BY RANGE (year(c3)) (PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996) , PARTITION p2 VALUES LESS THAN (1997) ,
PARTITION p3 VALUES LESS THAN (1998) , PARTITION p4 VALUES LESS THAN (1999) ,
PARTITION p5 VALUES LESS THAN (2000) , PARTITION p6 VALUES LESS THAN (2001) ,
PARTITION p7 VALUES LESS THAN (2002) , PARTITION p8 VALUES LESS THAN (2003) ,
PARTITION p9 VALUES LESS THAN (2004) , PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE );

delimiter //

drop procedure if exists load_part_tab //
CREATE PROCEDURE load_part_tab( max_recs int, rows_per_query int)
begin
declare counter int default 0;
declare step int default 0;
declare base_query varchar(100) default 'insert into part_tab values ';
declare first_loop boolean default true;
declare v int default 0;
set @query = base_query;
while v < max_recs
do
if (counter = rows_per_query) then
set first_loop = true;
set counter = 0;
prepare q from @query;
execute q;
deallocate prepare q;
set @query = base_query;
set step = step + 1;
select step, v, now();
end if;
if (first_loop) then
set first_loop = false;
else
set @query = concat(@query, ',');
end if;
set @query = concat( @query,
'(', v, ',',
'"testing partitions"',',"',
adddate('1995-01-01',(rand(v)*36520) mod 3652), '")'
);
set v = v + 1; set counter = counter + 1;
end while;
if (counter) then
prepare q from @query;
execute q;
deallocate prepare q;
end if;
end
//
delimiter ;

call load_part_tab(8000000,1000);
insert into no_part_tab select * from part_tab;
Using this data, the timings for my comparison are:
select  count(*) from no_part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
| 795181 |
+----------+
1 row in set (9.05 sec)

select count(*) from part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
| 795181 |
+----------+
1 row in set (1.04 sec)
Now, coming to the purpose of this post, I created a further table using the ARCHIVE engine, to check if I could couple the gains in size reduction with the speed improvements allowed by partitioning.
The immediate transformation from MyISAM to ARCHIVE does not work (bug #17754 Update: Bug fixed), so let's create that table explicitly:
drop table if exists part_archive_tab;
CREATE TABLE `part_archive_tab` (
`c1` int(11) not null ,
`c2` varchar(30) default NULL,
`c3` date default NULL
-- unique key (c1)
) ENGINE=ARCHIVE DEFAULT CHARSET=latin1
PARTITION BY RANGE (year(c3) )
(
PARTITION p0 VALUES LESS THAN (1995),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (1997),
PARTITION p3 VALUES LESS THAN (1998),
PARTITION p4 VALUES LESS THAN (1999),
PARTITION p5 VALUES LESS THAN (2000),
PARTITION p6 VALUES LESS THAN (2001),
PARTITION p7 VALUES LESS THAN (2002),
PARTITION p8 VALUES LESS THAN (2003),
PARTITION p9 VALUES LESS THAN (2004),
PARTITION p10 VALUES LESS THAN (2010),
PARTITION p11 VALUES LESS THAN MAXVALUE ENGINE = ARCHIVE
);
insert into part_archive_tab select * from part_tab;
And I repeated the same search:
select  count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
+----------+
| count(*) |
+----------+
| 795181 |
+----------+
1 row in set (1.29 sec)

select count(*) from part_archive_tab;
+----------+
| count(*) |
+----------+
| 8000000 |
+----------+
1 row in set (9.24 sec)
That's very interesting. As I expected, a COUNT without a WHERE clause to an ARCHIVE table means a full table scan. But when we use a WHERE clause involving a range, the performance is really impressive. Using the new PARTITIONS extension of the EXPLAIN command, we get this confirmed:
explain partitions select  count(*) from part_archive_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: part_archive_tab
partitions: p1
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 798458
Extra: Using where
Unfortunately, it seems that there are still some problems (bug #17894 update: Fixed), but partitions are great and I am sure this will be one of the top features in the next future.

Tuesday, February 28, 2006

puzzled by date functions

Today I was analyzing the behavior of the query optimizer, and I stumbled into a most curious case.
I created two functions returning the extremes of a date range, and I wanted to see how many times those functions were called when used in a WHERE clause So I added log tracing instructions to both of them. The result was quite surprising.
Let's set the environment first.
create database if not exists test ;
use test ;

delimiter //

drop function if exists today_start //
CREATE FUNCTION today_start() RETURNS datetime
begin
insert into mylog (routine_name) values ('today_start');
-- return current_date();
return '2006-02-28 00:00:00';
end //

drop function if exists today_end //
CREATE FUNCTION today_end() RETURNS datetime
begin
insert into mylog (routine_name) values ('today_end');
-- return current_date() + interval 1 day - interval 1 second;
return '2006-02-28 23:59:59';
end //

delimiter ;

drop table if exists t1;
create table t1 (
id int(11) NOT NULL auto_increment,
dt datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

drop table if exists mylog;
create table mylog (
id int not null auto_increment primary key,
routine_name varchar(20) not null,
TS timestamp
);

INSERT INTO `t1` VALUES
(1,'2006-02-28 11:19:35'), (2,'2006-02-28 11:19:38'),
(3,'2006-02-28 11:19:40'), (4,'2006-03-01 11:20:09'),
(5,'2006-03-01 11:20:11'), (6,'2006-03-01 11:20:12'),
(7,'2006-03-01 11:20:13');

select * from t1;
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
| 4 | 2006-03-01 11:20:09 |
| 5 | 2006-03-01 11:20:11 |
| 6 | 2006-03-01 11:20:12 |
| 7 | 2006-03-01 11:20:13 |
+----+---------------------+
Now I select all today's rows from t1.
select * from t1 where dt between today_start() and today_end();
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
+----+---------------------+
That's correct. Now, let's see how many times the function was called:
select * from mylog;
+----+--------------+---------------------+
| id | routine_name | TS |
+----+--------------+---------------------+
| 1 | today_start | 2006-02-28 12:26:24 |
| 2 | today_end | 2006-02-28 12:26:24 |
+----+--------------+---------------------+
And that too was what I expected. But the story changes if I use a slightly different table. This one has the same columns as t1, but the primary key is the datetime column.
drop table if exists t2;
create table t2 (
id int not null,
dt datetime NOT NULL,
PRIMARY KEY (dt)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

insert into t2 (id, dt) select id, dt from t1;
Now I did the same experiment with this table:
truncate mylog;
select * from t2 where dt between today_start() and today_end();
+----+---------------------+
| id | dt |
+----+---------------------+
| 1 | 2006-02-28 11:19:35 |
| 2 | 2006-02-28 11:19:38 |
| 3 | 2006-02-28 11:19:40 |
+----+---------------------+
The query finds the same records. Let's see what happens to mylog:
select * from mylog;
+----+--------------+---------------------+
| id | routine_name | TS |
+----+--------------+---------------------+
| 1 | today_start | 2006-02-28 12:30:00 |
| 2 | today_end | 2006-02-28 12:30:00 |
| 3 | today_start | 2006-02-28 12:30:00 |
| 4 | today_end | 2006-02-28 12:30:00 |
+----+--------------+---------------------+
I can't imagine why this is happening. The only difference is that dt is now primary key. Instead of being called once, the routine is called twice. If I simply drop the primary key in t2, then the routine is called once per query, as expected.
The result does not change if I use InnoDB tables instead of MyISAM.

Can anyone explain what is happening here?

Friday, February 17, 2006

Speaking at the 2006 MySQL Users Conference

Well, it came a bit late, but it's official (good thing I haven't bought a conference pass yet!)
I'll be speaking at the fourth MySQL Users conference on Higher Order MySQL with this abstract:

Higher order functions are those that can accept a function as an argument or can produce a function as their output. MySQL recent addition of stored procedures and dynamic querying makes it possible to extend the DBMS features, especially for business intelligence and data warehousing. Functions that can create functions are useful to expand the language, producing complex SQL code quickly and accurately.

Some examples of what can be achieved with these kind of functions: global table CRC (as oposed to record CRC) for remote table comparison, cross tabulation (pivot tables in SQL), data on demand from compressed storage (indexed, as opposed to non-indexed data stored by the archive engine), enhanced metadata queries, where records from the information schema and records from data table create customized administrative tools. The applications are countless. The power of stored procedures, paired with dynamic queries can give developers the ability of doubling the features of MySQL DBMS. Higher order functions and procedures are one step beyond normal stored procedures. They give developers one templates to create standardized functions for similar tasks. While the basics of such discipline are better exploited in application languages (C, C++, Perl come to mind), dynamic SQL is perfectly capable of achieving important goals.

Apart from speaking, I will have a chance to meet in person several people that I know only by mail, and to meet again someone that I keep meeting (gladly) over and over at the conferences.
Yes, I look forward to it!

Monday, February 13, 2006

Workshop on MySQL 5, Milan, March 2006


My company is organizing a workshop on MySQL 5.
It will be held in Milan, on March 24th, and it will cover the following:
  • MySQL 5 main new features (stored procedures, information schema,triggers, storage engines, dynamic SQL);
  • The MySQL 5 general purpose routine library;
  • New features in upcoming 5.1 (plug-in engines, XML support, partitioning, event handling);
  • Replication in MySQL 5 (some multiple master tricks and how to use storage engines as complement to replication);
  • Technical questions parade. Each subscriber can submit one technical question, which will get an answer during the workshop.
The workshop is in Italian. English speakers who wish to attend (or to organize a similar event in English) can contact me at one of my company addresses.

Friday, January 27, 2006

MySQL 5 general purpose routine library - VII: Testing and best practices

(Seventh part of the MySQL 5 general purpose routine library presentation.)

Test units

Testing is often considered a boring activity, something that is not cool enough for a developer to take into consideration.
Sometimes, it's true. Testing is almost as exciting as having a haircut or trimming your nails.
It needs to be done, but you hardly welcome the task.
However, that happens when you don't test with method. Because testing can be fun and it can be as rewarding as any other programming activity.

What you need to do, is to move your testing schedule up to the development schedule and put it in front of the code.That way, you don't have just "some testing", which is usually fumbling with a few examples to ensure that your routine does what you thought it should do.
That's not testing. That's sampling.
And if that fails, you will test even less after you fix the bug (or you think you did),
because that sort of testing is really boring and tiresome.

What you should do instead is to create testing units.
The practice of unit testing (http://en.wikipedia.org/wiki/Unit_testing) calls for the creation of a set of tests to probe all facets of the code you are about to check.
This practice is well known to Perl programmers, since almost all modules from Perl itself and from CPAN come with a standard set of tests. Some modules have even thousands of them.
The clever addition of recent practice, though, is not just writing the testing units,
but writing them before the actual code.
The idea is this: write your specifications as testing units, i.e. put down a stream of cases to prove that the code being produced meets your expectations. If your specifications are detailed and complex, so will be the testing units, which will include as many detail as your specification requires.
Using this approach, you will know that your routine works when it passes all the tests.

Unit testing in MySQL

Testing in Perl is a powerful experience. There are so many tools available, that testing it's becoming an art in itself. There are modules to help you test just about everything. Writing your tests in Perl becomes a very easy task, even pleasant sometimes.
Coming from such experience, the testing environment for MySQL stored routines looks just, well, lacking.
Therefore my first thought when I was facing the task was "let's do it in Perl", and so I did it.
My first test unit (for the arrays module) was in Perl. It did not take long to write, but I found myself writing quite a lot to describe how the tests work in Perl.
And then I realized that, for a non Perl programmer, installing the Perl DBI and DBD::mysql modules is not an easy task, and it can be actually a frightening experience.
The same goes for another well established testing environment for C and C++. Who is not used to compiling and linking a C program will find even a simple task very demanding.
Thus I came to the logical conclusion that testing for stored routines has to be done with MySQL resources alone. It is not very advanced, and it pales compared to Perl, but it can be enhanced with some well thought structures and routines.
So, here goes.
Let's see how unit testing can be performed with MySQL and the general purpose stored routines.

Testing support in mysql-sr-lib

To ease the task of testing, there are a few routines. The most important one, and perhaps the only one that you may need in setting a test, is log_test, which expects four parameters

procedure log_test(
p_description varchar(200),
p_result text,
p_expected text,
p_outcome boolean)
The first one is a description of what we are testing, perhaps the routine name, with some additional short comment. The second one is the result we are inspecting. The third one is a description of what we are expecting. The last one is a boolean, and it should receive an expression, the evaluation of which will tell us whether the test succeeded or not. I know that it sounds obscure, but it will become clearer with the examples below.
Furthermore, we have some generalized tests for something that is quite common during our testing tasks. It is very likely that, testing stored routines, we want to test if such routine exists. And it is equally common to test if a given table exists. So, there are procedures to perform these simple repetitive tasks: check_routine and check_table do the job for you, and if you look at the source code for them, you will see just a clear example of how a test should be carried out. If you have several routines to check at once (as it is the case of this collection, for instance) you can benefit from check_routine_simple and check_for_routines_existence. The first one will let you set a user variable @database and use it to check your routines, so you will set the database parameter only once. The second one gets the routine names form a table (_routine_list) and calls check_routine for each of them. This latest case will raise a question: "how do I know how the table structure?" Don't worry. Before starting your testing work, call initialize_tests, and that, among other hidden tasks, will be done for you. To see different uses of these routines, inspect the source code of test_*.mysql in the "tests" directory. for now, it's enough to say that these three uses are equivalent:
 call initialize_tests();

connect;
call check_routine(database(), 'myp1', 'procedure');
call check_routine(database(), 'myf2', 'function');

connect;
set @database = database();
call check_routine_simple('myp1', 'procedure');
call check_routine_simple('myf2', 'function');

connect;
insert into _routine_list (routine_name, routine_type) values
('myp1', 'procedure'),
('myf2', 'function');
call check_for_routines_existence(database());
Last, there is show_test_results, which gives you an account of what you have tested, listing the result of each test, a summary of passed and failed ones, and a further list of failed test for your easy review. And now it's time for some hands on.

Short tutorial: Writing a testing unit

For this brief exercise, we want to create two stored routines. A procedure that creates a table with a character column and two numeric columns, and a function that will return the sum of the two numbers given the string.
Some facts: The name of the table is 'my_data'. The procedure, 'make_table,' won't have any parameters, the function, 'get_result,' will have one. These facts are enough to set our tests.
   -- test_tutorial.mysql
call initialize_tests();

call check_routine(database(), 'make_table', 'procedure');
call check_routine(database(), 'get_result', 'function');

connect;
call make_table();
call check_table(database(), 'my_data');

connect;
insert into my_data values('first', 1, 2), ('second',10,20),('third', 100, 200);
set @my_res = get_result('first');
call log_test('get_result first', @my_res, '@my_res = 3', @my_res = 3 );

set @my_res = get_result('second');
call log_test('get_result second', @my_res, '@my_res = 30', @my_res = 30 );

set @my_res = get_result('third');
call log_test('get_result third', @my_res, '@my_res = 300', @my_res = 300 );

set @my_res = get_result('fourth');
call log_test('get_result non-existent', @my_res, '@my_res is null',
@my_res is null );

call show_test_results();
This test is what we expect the actual routines to do. It can be part of the user's requirements to signify what you want to achieve. That done, it's time to write the actual code:
   -- test_t1.mysql

delimiter //

drop procedure if exists make_table //
create procedure make_table ()
deterministic
begin
create table my_data
(
name char(10) not null primary key,
val1 int,
val2 int
);
end//

drop function if exists get_result //
create function get_result ( p_name char(10))
returns int
reads sql data
begin
declare mysum int;
select val1*val2 into mysum -- intentional bug
from my_data
where name = p_name;
return mysum;
end//

delimiter ;
Now we have all the components. To perform the tests, we need to install the test utilities and the globals module.


$ mysql databasename < globals.mysql
$ mysql databasename < test_utilities.mysql


and then we can install the two routines:

   $ mysql databasename < test_t1.mysql


Finally, we run the test. Don't forget to add the '-t' option to mysql.


$ mysql -t databasename < test_tutorial.mysql
+---------+-------------------------+--------+-----------------------+------------+
| test no | description | result | expected | outcome |
+---------+-------------------------+--------+-----------------------+------------+
| 1 | make_table p exists | 1 | routine_exists = true | pass |
| 2 | get_result f exists | 1 | routine_exists = true | pass |
| 3 | table my_data exists | 1 | table_exists = true | pass |
| 4 | get_result first | 2 | @my_result = 3 | ** fail ** |
| 5 | get_result second | 200 | @my_result = 30 | ** fail ** |
| 6 | get_result third | 20000 | @my_result = 300 | ** fail ** |
| 7 | get_result non-existent | NULL | @my_result is null | pass |
+---------+-------------------------+--------+-----------------------+------------+
7 rows in set (0.00 sec)

+-------------+--------+--------+-------------------------+
| total tests | passed | failed | passed tests percentage |
+-------------+--------+--------+-------------------------+
| 7 | 4 | 3 | 57.14 |
+-------------+--------+--------+-------------------------+
1 row in set (0.00 sec)

+---------+-------------------+--------+------------------+------------+
| test no | description | result | expected | outcome |
+---------+-------------------+--------+------------------+------------+
| 4 | get_result first | 2 | @my_result = 3 | ** fail ** |
| 5 | get_result second | 200 | @my_result = 30 | ** fail ** |
| 6 | get_result third | 20000 | @my_result = 300 | ** fail ** |
+---------+-------------------+--------+------------------+------------+


The tests worked, but something surely went wrong, because three tests were unsuccessful. We expected the result to be 3, 30,300, and instead we got 2,200,20000. Time to inspect the source, and we find quickly that get_result has val1*val1 where it should have been val1+val2. Let's correct the mistake, and run the test again

   $ mysql databasename < test_t1.mysql
$ mysql -t databasename < test_tutorial.mysql
+---------+-------------------------+--------+-----------------------+---------+
| test no | description | result | expected | outcome |
+---------+-------------------------+--------+-----------------------+---------+
| 1 | make_table p exists | 1 | routine_exists = true | pass |
| 2 | get_result f exists | 1 | routine_exists = true | pass |
| 3 | table my_data exists | 1 | table_exists = true | pass |
| 4 | get_result first | 3 | @my_result = 3 | pass |
| 5 | get_result second | 30 | @my_result = 30 | pass |
| 6 | get_result third | 300 | @my_result = 300 | pass |
| 7 | get_result non-existent | NULL | @my_result is null | pass |
+---------+-------------------------+--------+-----------------------+---------+
7 rows in set (0.00 sec)

+-------------+--------+--------+-------------------------+
| total tests | passed | failed | passed tests percentage |
+-------------+--------+--------+-------------------------+
| 7 | 7 | 0 | 100.00 |
+-------------+--------+--------+-------------------------+


Now we got what we wanted. The routine works reasonably well. Notice that we also included a test for failure, when we asked for a value 'fourth', which does not exist in the table.

best practices



Enlightened by the testing example, we can draw some guidelines for developing stored routines. Actually, this advice can be successfully applied to any programming language, and I owe this great advice to Damian Conway (Ten essential development practices). Follow these simple rules, and you'll see the quality of your routines improving soon.


Write the interface

Before writing any code, write down the interface, i.e. how you would like to call the routine, with the necessary parameters. Possibly, write down the documentation for such a routine, describing how to invoke it and what it should do. Write down every detail of the implementation such as environment variables, dependencies, side effects, and so on.

Write a test unit for your routine

Before writing the code, there is yet another step. Write some tests for your routine. You already know the interface, so you can create the tests for your routine. For every feature described in the documentation, you should prepare at least one test. Make tests for success and for failure. For example, if your routine has some pre-requisites, test that it fails when such requisites aren't met.


As soon as you implement one feature of the routine, test it

When finally writing the code, run the tests on your routine, to see if that feature works as expected. If the test for that feature fails, don't go on developing further features until the failing one is fixed.

When your routine passes all tests, you're done

Your work is finished as soon as your routine passes all the tests. You'll be amazed at the low number of bugs that this approach leaves behind. Most of them are caught during the test.

MySQL 5 general purpose routine library - VI : Globals

(Sixth part of the MySQL 5 general purpose routine library presentation.)
Among the General Puropose Routines there are some that are more general than others. I refer to those small utility routines that don't belong to any specific module, or that are useful and required by any module. I call this group globals because it applies to all the modules, and also because one of its objectives is to create global variables that are available across sessions.

Finding existing objects

This category has some functions to report on the existence of a specific database object, mainly tables, views and routines.
Although this is a task mainly related to testing, I didn't put it among the test utilities, because they are really general, and I believe that they should be useful to many developers in several different tasks.
The list starts with three simple functions table_exists, table_or_view_exists, view_exists, which can tell you in one call if a given table or view exists.
Similar ones are routine_exists, function_exists and procedure_exists. The first one tells you if a given routine exists, providing the database name, the routine name and the routine type (procedure or function). The other two functions are just shortcuts toward calling routine_exists.

Using global variables

The main course in globals, and the reason why this module holds this name is due to the implementation of global variables, i.e. variables that keep their value across sessions. Unlike user variables, that lose their value after a disconnection, global variables can stay stored for as long as you need them.
To set a global variable, you can use either a procedure global_var_set or the homonymous function.
  set @myvar = global_var_set('varname','an important value');
call global_var_set('varname', 'an important value');
Both these calls will create a global variable named 'varname'. The first usage has the side effect of actually returning the variable's value, which is immediately assigned to a user variable. The second form will just create a global variable, assigning its value to a @global_var_set user variable.
To retrieve the value, you use global_var_get.
  set @othervar = global_var_get('varname');
Global variables are created on a user basis. With the default implementation, several users can define variables with the same name, and they will be kept separated, each one being available to their respective creator. It is possible to set the global variables to be truly global, each one available to all users, by changing the return value of library_user, which is used, incidentally, by the arrays module as well.
You can drop global variables with global_var_drop, and inquire about the existence of a global variables using global_var_exists.

Sunday, January 08, 2006

Checking MySQL weak passwords

Do you have a policy of strong passwords?
Do you want to enforce it? If yes, then you may want a quick way of checking if your users are abiding to the rules.
You can either force the users to change passwords through a smart application, to prevent them from using weak passwords, or, if you are assessing an existing environment, you can check the existing passwords against a dictionary, to see if there are some imeediately exploitable passwords among your server accounts.

The mechanics of passwords checking

Checking passwords against a dictionary is a costly business. It can be made more affordable if we manage to put all the items to check in the database itself, so that we can use the DBMS force to do our job.
What we have to do is to collect a large list of words that we would not like our users to set as their password. Immediate candidates are user names, both from the operating system and the database, user full names, and of course words from as many dictionaries we believe our users could make advantage of.


WARNING Password checking is a breech of privacy. If you need to use the techniques explained in this post, be sure to:
  • Inform the users about a policy of strict passwords;
  • Inform them that you will run sanity checks from time to time;
  • Make them sign a form to acknowledge that they know and understand the policy;
  • If you are not the boss, get the management approval (in written!);
  • Have the company lawyer review and approve the policy wording and the forms.
Failure to do so may result in unpleasant situations, ranging from slight embarrassment to criminal charges, depending on the company you work with, your country regulation, and your ill luck.
The table we need to create will have a word, which is what we would not want to be used as a password. A password, which is the word encoded with MySQL password function, and a reversed_password, which is the word spelled backwards encoded with password. If you have time and patience, you may add more variations (such as all lowercase, all uppercase).
Using such table, finding the weak passwords is quite easy. To have something to find, let's create some dumb users:
GRANT all on test.* to test1 identified by "test";        # name of a database
GRANT all on test.* to test2 identified by "test2"; # name of a user
GRANT all on test.* to test3 identified by "master"; # a common word
GRANT all on test.* to test4 identified by "retsam"; # a reversed common word
GRANT all on test.* to test5 identified by "marshmallow"; # an uncommon word
Now let's get them:
select
user, word as 'chosen password', word as 'original word', 'direct' as 'password type'
from
passwords
inner join mysql.user on password=pwd
union
select
user, reverse(word),word,'reverse' as 'password type'
from
passwords
inner join mysql.user on password=reverse_pwd
;
+-----------+-----------------+---------------+---------------+
| user | chosen password | original word | password type |
+-----------+-----------------+---------------+---------------+
| test5 | marshmallow | marshmallow | direct |
| test3 | master | master | direct |
| test4 | retsam | master | reverse |
| test1 | test | test | direct |
| test2 | test2 | test2 | direct |
+-----------+-----------------+---------------+---------------+

Here we are. All the offending users came to light, with a simple (and very efficient) query.

Creating the passwords table


The following perl script will create the passwords table and populate it, using a dictionary of English words and the other above mentioned elements.

#!/usr/bin/perl 
use strict;
use warnings;
use Data::Dumper;
use DBI;

my $words_file = '/usr/share/dict/words';
my $groups_file = '/etc/group';
my $users_file = '/etc/passwd';
my $DEBUG = $ENV{DEBUG};
my $deferred_keys = $ENV{DEFERRED_KEYS} || 1;
my $max_records = $ENV{MAX_RECORDS} || 30_000;
my $base_query = qq{INSERT IGNORE INTO passwords VALUES };
my $query = $base_query;
my $record_count = 0;
my $total_recs = 0;
my $database = 'test';
my $host = 'localhost';
my $conf_file = "$ENV{HOME}/.my.cnf";
my $table_struct = qq{
create table passwords (
word char(50) collate utf8_bin NOT NULL,
pwd char(41) collate utf8_bin NOT NULL,
reverse_pwd char(41) collate utf8_bin NOT NULL,
PRIMARY KEY (word),
KEY (pwd),
KEY (reverse_pwd)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
};

my $dbh=DBI->connect("dbi:mysql:$database;host=$host"
. ";mysql_read_default_file=$conf_file",
undef,
undef,
{RaiseError => 1})
or die "Can't connect: $DBI::errstr\n";

$dbh->do(qq{drop table if exists passwords});
$dbh->do($table_struct);
$dbh->do(qq{ALTER TABLE passwords DISABLE KEYS}) if $deferred_keys;

# get the database users
my $users = $dbh->selectall_arrayref(qq{select user from mysql.user});

# get the database names
my $databases = $dbh->selectall_arrayref(qq{show databases});

# get the table names
my $tables = $dbh->selectall_arrayref(qq{
select distinct table_name from information_schema.tables});

# add all table, users, database names to the word list
for ((@{$users}, @{$databases}, @{$tables})) {
process_query( $_->[0] );
}

# get the OS group names
open my $IN, q{<}, $groups_file
or die "unable to open $groups_file\n";
process_query((split(q{:}, $_))[0]) while <$IN>;
close $IN;

# get the OS user names
open $IN, q{<}, $users_file
or die "unable to open $users_file\n";
while (<$IN>) {
my ($user_id, $full_name) = (split(q{:}, $_))[0,4];
process_query($user_id);
process_query($full_name) if $full_name;
}
close $IN;

# get the words from a dictionary
# repeat this step for every other
# dictionary file you may want to include
open $IN, q{<}, $words_file
or die "unable to open $words_file\n";
process_query( $_ ) while <$IN>;
close $IN;

# finish off leftovers
if ($record_count) {
$dbh->do($query);
$total_recs += $record_count;
print "$record_count ($total_recs)\n" if $DEBUG;
}

$dbh->do(qq{ALTER TABLE passwords ENABLE KEYS}) if $deferred_keys;

sub process_query {
my ($word) = @_;
chomp $word;
if ($record_count) {
$query .= q{,};
}
my $quoted_word = $dbh->quote($word);
$query .= q{(}
. $quoted_word
. q{, password(}
. $quoted_word
. q{), password(reverse(}
. $quoted_word
. q{)))};
if (++$record_count >= $max_records) {
$dbh->do($query);
$total_recs += $record_count;
print "$record_count ($total_recs)\n" if $DEBUG;
$record_count = 0;
$query = $base_query;
}
}

Some efficiency issues

Why do we need a Perl script to achieve such goal? Everything could be done with a few one-liners, and this is actually what I tried in the beginning.
Of course you can use some one-liners, but you will end up inserting one record at the time, which becomes unbearably slow when you pass the fifty thousand record mark.
As an example, to insert five hundred thousand records, this script takes about one minute, because it makes multiple insert statements of 30,000 records each, while inserting records one by one took about 17 minutes.
One more problem may arise if you are not careful about the column types. Your "password" and "reversed_password" columns must have exactly the same type, character set and collation of the "password" column in mysql.user table. If you don't make them match exactly, MySQL will refuse to use any index from that table when joining them. The query I showed before the script runs in less than one second. If you use a different character set, it will require a table scan, and it can take as much as one minute for a table of one million words.