Monday, December 20, 2010

Looking for a hack - Passing comment-like info through the binary log

hacker I am facing an interesting problem. I need to mark somehow a statement in such a way that the comment is preserved through the binary log.
I don't have control on how the statement is generated or using which client software. For the sake of example, let's say that I need to mark a CREATE PROCEDURE statement in such a way that, if I extract the query from the binary log and apply it to another server, the information is still available.

Background

Normally, I would use a comment. The first thing I would think is
CREATE PROCEDURE p1(i int) select "hello" /* This is my text */
But most client libraries will strip it.
There was a clever trick by Roland Bouman that allowed users to bypass this limitation. You could use a qualified comment such as /*!999999 This is my test */, but unfortunately it only works in MySQL 5.0.x, while MySQL 5.1 strips everything down, even if the comment is a legitimate keyword.

create procedure p9 (j int) insert /*!50000 INTO */ t1 values (j) ;
Query OK, 0 rows affected (0.00 sec)

 show create procedure p9\G
*************************** 1. row ***************************
           Procedure: p9
            sql_mode: 
    Create Procedure: CREATE DEFINER=`msandbox`@`%` PROCEDURE `p9`(j int)
insert  INTO  t1 values (j)
character_set_client: latin1
collation_connection: latin1_swedish_ci
  Database Collation: latin1_swedish_ci
1 row in set (0.00 sec)

Other tricks

Something else that I have tried: I can inject a query before or after the one that I need to monitor.
create table if not exists comments (t varchar(100)) engine=blackhole;
 
update comments set t='the next statement is what I need';
create procedure p1(i int) insert into t1 values (j);
update comments set t='the previous statement is what I need';
This approach does possibly introduce some overhead.

Or I can add a comment clause in the statement.
create procedure p1(i int) comment="this is what I need" insert into t1 values (j);
This approach requires parsing the SQL, and dealing with artistic indentation and usage of other options in the query. And if I need to deal with commands that don't support the "comment" option, I am back to square one.

Advice wanted

So far, the only method that works almost always is the blackhole trick (1)
I would like to know if there is any method of introducing a piece of information related to a given statement, in such a way that the comment survives after one of the following:
  • The binary log is converted to queries and passed to a MySQL client that applies the stream of queries to another server.
  • The binary log is associated with another master, and then passed to a slave through regular replication.
In both cases, I need to find the original information in the last server's binary log.

As a related matter, I know that MySQL, in regular replication, passes some information across binary logs, and that information is the server-id. If I set an intermediate server as relay slave, the server-id of the original master is associated with the query recorder in the binary log of every slave. I don't know if I can use this information for my purposes, but I would like to know how does the replication process maintain the server ID across servers.

Maybe it's too late for me and I can't see an obvious solution. I will appreciate any suggestion. Thanks in advance

(1) If the blackhole is disabled, the method fails, or introduce unacceptable overhead.

Thursday, December 16, 2010

Some hidden goods in MySQL 5.5

5.5 GA The announcement of MySQL 5.5 released as GA has outlined the improvements in this version, which indeed has enough good new features to excite most any user.
There are two additions, though, that were lost in the noise of the bigger features, and I would like to spend a few words for each of them.
The first addition is something that users of stored routines have been waiting for since MySQL 5.0. No, it is not SIGNAL and its close associate RESIGNAL, which have been publicized enough. I am talking about the stored routine parameters, for which now there is a dedicated table in the information_schema.
Let's see an example, with a simple procedure that uses three parameters.

drop procedure if exists add_to_date ;
create procedure add_to_date(in d date, in i int, out nd date)
deterministic
    set nd = d + interval i day;
This works as expected in both 5.1 and 5.5. (Never mind that it's redundant. I know it. It's only for the sake of keeping the example short).

 call add_to_date('2010-12-15',10,@new_date);
Query OK, 0 rows affected (0.00 sec)

 select @new_date;
+------------+
| @new_date  |
+------------+
| 2010-12-25 |
+------------+
1 row in set (0.00 sec)
The difference starts to show when you want to deal with this procedure programmatically. If you need to find out which parameters are expected by this procedure, your only option in MySQL 5.1 is parsing the result of SHOW CREATE PROCEDURE add_to_date. Not terribly difficult in any scripting language, but a hassle in SQL.
In MySQL 5.5, instead, you can easily get the routine parameters with a simple query:

 select parameter_name, parameter_mode,data_type from information_schema. parameters where specific_schema='test' and specific_name= 'add_to_date' order by ordinal_position;
+----------------+----------------+-----------+
| parameter_name | parameter_mode | data_type |
+----------------+----------------+-----------+
| d              | IN             | date      |
| i              | IN             | int       |
| nd             | OUT            | date      |
+----------------+----------------+-----------+
3 rows in set (0.00 sec)

Speaking of the information_Schema, there are more goodies that were not emphasized enough. The Innodb engine that you find in the server is the evolution of the InnoDB plugin that ships with MySQL 5.1. Only that it is now built-in. What many people forget to mention is that the plugin (and thus the current InnoDB engine in 5.5) comes provided with its own InnoDB-specific instrumentation tables in the information_schema.

show tables like 'innodb%';
+----------------------------------------+
| Tables_in_information_schema (innodb%) |
+----------------------------------------+
| INNODB_CMP_RESET                       |
| INNODB_TRX                             |
| INNODB_CMPMEM_RESET                    |
| INNODB_LOCK_WAITS                      |
| INNODB_CMPMEM                          |
| INNODB_CMP                             |
| INNODB_LOCKS                           |
+----------------------------------------+
7 rows in set (0.00 sec)
This is the same set of tables that you may have seen if you have worked with the InnoDB plugin in 5.1. In short, you can get a lot of the info that you used to look at in the output of SHOW ENGINE INNODB STATUS. For more information, you should look at what the InnoDB plugin manual says on this topic.
I don't know if the tables can replace the SHOW ENGINE INNODB STATUS. Perhaps someone can comment on this issue and provide more information?

Thursday, December 09, 2010

Speaking at the O'Reilly MySQL Conference - April 2011

I will present two talks at the MySQL Conference next April.
One is a three hours tutorial on Advanced MySQL Replication Techniques, and the other is a normal session on The art of sandboxing. Reducing Complex Systems to Manageable Boxes.
The first topic is not a first to me. But the contents are going to be fresh and new. There has been so much going on in the replication field, that the talk on this topic that I presented in 2007 looks like ancient history.
The second topic is completely new. I have often presented the result of my sandboxing efforts, but I have never thought of explaining the techniques themselves. Now that I have got some experience at reducing differently complex systems to sandboxes, I want to share the knowledge, to promote more work in this field.

Monday, December 06, 2010

Excluding databases from mysqldump

A question that came up during the MySQL track at the UKOUG conference in Birmingham was "Can I exclude only a few databases from mysqldump? Let's say that I have 50 databases, and I want to dump all of them, except a few."
As many know, mysqldump has an option to ignore specific tables. SO if you have 1,000 tables in a databases, you can tell mysqldump to dump all the tables except a few ones.
There is no corresponding option to exclude one or more databases.
However, if you know your command line tools, the solution is easy:
First, we get the list of all databases:
mysql -B -N -e 'show databases'
information_schema
employees
five
four
mysql
one
performance_schema
six
test
three
two
-B forces batch mode (no dashes box around the data), while -N gets the result without the headers.
Now, let's say that we want to exclude databases four, five, and six. And since we want to avoid unpleasant side effects, also information_schema and performance_schema.
Thus, we pipe the previous data through a filter. I use Perl, but sed or grep could get the job done.
mysql -B -N -e 'show databases' | \
  perl -ne 'print unless /\b(?:four|five|six|_schema)\b/' 
employees
mysql
one
test
three
two
Now that we have the list of databases that we need, we can tell mysqldump to backup the databases from such list. All we need is converting the vertical list into a horizontal one using xargs

mysql -B -N -e 'show databases' | \
  perl -ne 'print unless /\b(?:four|five|six|_schema)\b/' \
  xargs echo mysqldump -B 
mysqldump -B employees mysql one test three two
That's it. The last line is the resulting command. Once you are sure that it is what you want, remove the "echo" after xargs, and the command will be executed.

Update: Thanks to Shantanu, who pointed that the regexp does not filter properly. So I added the boundary checks (\b) to make my words match the result.

Friday, December 03, 2010

My picks for PGDay-EU 2010

PGDayEU2010
On Sunday I will be in Stuttgart with the double purpose of attending the annual European PostrgreSQL conference and the technical meeting of my company that will be held after the normal proceedings of PGDay-EU.
For the first time in several years I am attending a conference where I am not a speaker. In my previous job I did not have much opportunity to attend PostgreSQL meetings, and I welcome this opportunity. The schedule is quite interesting, and I have made my personal picks:

Who's afraid of MySQL forks?

mysql forks? There is much talk about MySQL forks and how they are going to replace MySQL, or take over MySQL user base, or become more powerful/profitable/popular/you-name-it than MySQL itself.
Let's clear some air on this topic. There is more about forks than meets the eye, especially if you think about a few obvious facts.
What's a fork? According to Wikipedia
a project fork happens when developers take a legal copy of source code from one software package and start independent development on it, creating a distinct piece of software.
By this definition, when someone who doesn't work at the MySQL project distributes a package that is based on MySQL code but differs from the original, it's a fork.
Why am I approaching the issue from this angle? Because, apart from Windows users, who mostly download MySQL from the official site, the majority of users get MySQL through a Linux distribution or some other project. And most of the time such packages are different from the ones built by the MySQL team. There is nothing wrong with that. The differences are sometimes minimal packaging changes done to adapt MySQL to the specific distribution, and sometimes they are a cherry-picking application of patches to an old version that needs to be maintained so that the package is unlike any other MySQL version that you may find in the wild. Even if the version is the same, depending on the distribution and the age of the server, the code beneath could be wildly different from the official versions.
Thus, it turns out that many users, possibly the majority, are using a MySQL fork, albeit a very minor one.
But when people talk about forks, they often refer to three main projects:
  • The Percona distribution. This is a collection of a few distinct patches in the server, coupled with a fork of the InnoDB plugin, named XtraDB, and an independent tool for backup (XtraBackup). This fork has a solid business background. Every patch has been developed to meet user requests, and the engineers at Percona maintain them appropriately.
  • Then we have the MariaDB fork, which is a series of changes to the MySQL core, motivated by the desire of the developers to build a rich set of feature enhancements while being backward compatible to the main distribution. The business model is thus a fast track of new features and bug fixes to customers.
  • And then there is Drizzle, which has even less business traction than MariaDB, but a very well defined goal of creating a lightweight database by re-engineering a bare bones stripped down version of MySQL that is now very distant from its origins.
What I said in the above descriptions is just the synopsis of what these three forks are. In recent mythology, it is fabled that, if MySQL ceases to exist (because it goes bankrupt, or Oracle kills it, or a major accident happens to the project, whatever) users can replace MySQL with one fork, and live happily ever after.
Not so fast. There is something that few people take into account when listening to this too often repeated tale.
What most observers miss is that the forks' original code (with the exception of Drizzle) is very marginal. The bulk of the distribution is still the code produced by the MySQL team, which is merged at every minor release, and integrated with the patches produced by Percona and MariaDB. So, while technically they are forks of MySQL, they can't live independently from the official MySQL distribution. Both Percona and MariaDB don't have the manpower to maintain the server by handling the huge amount of bugs that the MySQL team is fixing every month.
There is also a matter of skill set. Percona has talented InnoDB experts, while MariaDB has mostly core server experts (and some are among the top ones, I may add). They could complement each other, although it seems that cooperation between the two projects is not as good as it used to be. (Could be my personal impression.)
The bottom line, though, is if both projects are able to survive should the main project become unavailable. I am not suggesting that Oracle wants to make MySQL scarce. On the contrary, all the information at my disposal suggest that Oracle will keep MySQL publicly available for long time.
This state of affair seems to indicate that Drizzle is, instead, a true fork that does not depend on MySQL health. To some extent, this is true. However, the main storage engine in Drizzle is InnoDB. Therefore, at least today, Drizzle is as dependent on Oracle as Percona and MariaDB.
What would happen tomorrow, if the disaster depicted by doomsday advocates comes true and MySQL actually disappears? I don't honestly know, but I would love to have a public commitment from the major players, about what they are prepared to do in terms of maintaining that huge chunk of code that today they take from Oracle releases on a monthly basis.
This is all matter of thought for MySQL users.

About adoption of the forks today, I have seen five types of arguments in favor of a MySQL fork:
  1. I need the feature provided by Percona or MariaDB, or I need a quick bug fix that I can't get from the slow roadmap at Oracle. I trust that this handful of people are able to maintain that little code that differs from MySQL and matters to me. So I don't care if they don't have 100 developers on the task.
  2. Given Oracle's track record in other Open Source projects, I don't trust them to deliver MySQL according to FOSS principles, so let's go for true Open Source alternatives.
  3. Most MySQL developers have now left Oracle, and so the forks have more chances of being higher quality.
  4. Cool! MariaDB/Percona has a bunch of features more than MySQL. It must be better. Let's use it.
  5. I like new technology. Let's plunge into them!
Argument #1 is a solid business backed reason for adopting some software. The risk is often well calculated, especially if the evaluation can be backed by performance and functional tests.
Argument #2 is frivolous, as it mixes subjective feelings into business matters. And so is argument #4. Yet, these two types of advocacy are quite popular and spread much faster than the more reasonable approach seen at #1.
Argument #3 is debatable. MySQL developers at Oracle outnumber all forks easily. The idea that the departure of a few core developers can alter the system in such a way that the whole project crumble has been already negated by facts: MySQL 5.5 is an excellent release, with enthusiastic appreciation from power users. While I agree that top MySQL talents work at the forks, I consider the MySQL team to be still in excellent shape.
Argument #5 is reasonable, if it is followed by cool judgment and backed by facts. I am one who is always ready to try new solutions, and love experimenting with cool technology. But adoption is different from proof of concept. I am happy to see that Drizzle can replace MySQL in some applications, but would I trust it in its present beta stage? Certainly not. So, I am happy to test, but I trust my valuable data to more stable solutions.

What's for you, the final user? My personal advice is: don't adopt blindly because of some enthusiastic advertising. But test the product thoroughly, and if it fits your needs, by all means, go for it. But if you don't have a specific reason, I recommend staying with the official branch, because, despite the change in affiliation, there is still a well experienced team behind it.