Friday, November 28, 2008

Yet another replication trap


replication and engines

When I filed Bug#39197 replication breaks with large load with InnoDB, flush logs, and slave stop/start, I genuinely thought that it was a serious problem. I was a bit puzzled, to tell the truth, because the scenario that I was using seemed common enough for this bug to be found already.
Anyway, it was verified independently, but there was a catch. The script in the master was using SET storage_engine=InnoDB to create the tables necessary for the test. That looked good enough to me. The script was indeed creating InnoDB tables on the master. The trouble was that the "SET" command is not replicated. Thus the slaves were using the default storage engine (i.e. MyISAM). That reduces the scenario to a case were I was splitting a transaction and trying to apply it to a non transactional table. Not a bug, as Andrei Elkin and Sveta Smirnova correctly found out, but not an easy error to spot either.
According to the manual, using the storage_engines variable allows for a certain degree of flexibility in replication. The non-replication of this instruction that affects DDL commands is described as a feature. In my experience, it's a bug, by design even, but this is my personal opinion.

8 comments:

arjenAU said...

Making statements not replicate is already possible, by setting the SQL_BIN_LOG variable, and also explicitly in option syntax to some SQL commands.

I see no reason to explicitly exclude a statement such as the one above from replication, since if people want it excluded they can do so. Perhaps they WANT to get it replicated but now they don't have a choice!

So I agree with you, I reckon this is a bug. Is that specifically filed as a bug, so we can comment on it?

Mark Callaghan said...

I think this is a bug -- http://bugs.mysql.com/bug.php?id=41101

The immediate problem is that the binlog cannot be used to setup a new master as it will use the wrong table types (the set commands are not written to the binlog).

The longer term problems are:
1) things done to make slave database state different from the master must be done on the slave, not on binlog events because the binlog is used to prepare new masters
2) things that prevent a binlog event from being written should be explicit (set sql_log_bin or a no_replicate clause on the statement). As it is there are too many special cases that few people remember. This is a big TCO issue.

leithal said...

I disagree - it's not a bug, the SET command does not change data, it is not DML, etc. - so there is no reason for it to be in the binary logs at all.

There are reasons why there is an ENGINE = foo for CREATE TABLE statements.. :)

So, if you would like SET commands to be included within the binary logs - that is most definitely a feature request.

I think we actually have pretty good flexibility with what we have now, however (as long as you understand it, that is).

Mark Leith

Andrei said...

Arjen, to be precise the master has a choice to CREATE table providing engine=x.
Still, I would rather agree with you and Mark's 1-2 maxims.

I'd like to share my view on a possible future of safe and flexible replication.
There is no notion yet of the master server state context
for binlog which the slave could try to match against at its sql thread start-up time.
E.g the slave thread could refuse to start when it found
the same master's @@global.storage_engine != its own.

The context can be defined as a collection of the master server global
variables and their values at binlog creating time.
Such a snapshot of the master state should be placed on the top of every binlog file, e.g next to the FormatDescriptor event.
At the master run time changes in the master state would be recorded into the binlog.
Similarly the connection state context might be defined but there is
no reason to record its initial snapshot but rather only changes at
the run time.

In effect, a particular query, e.g CREATE TABLE w/o engine qualifier,
won't require any the session and the master server global variable to
be brought in with the query event (The query private context
(auto-inc range, last insert id, rand seed, timestamp) will remain
though). The correct mimicing of the master connection state by the
slave sql thread is provided with the initial master context snapshot
and the following records of set @@global.var and set @@session.var.

As to flexibility, act on the slave side: use ALTER etc and provide
different policy how to react on a mismatch with the master context snapshot at the slave sql thread start time.

regards,

Andrei

Roland Bouman said...

Hi!

@Leithal:

"the SET command does not change data"

Well, the SET statement itself does not. But it sure can impact data.

I don't have any experience with replication, but if no SET statements are ever replicated, not even SET sql_mode, wel....

For example, if the default sql_mode is traditional, but a client chooses to set the sql_mode to the empty string (MySQL default) then statement based replication is likely to break, no?

Dates and times and missing columns may be valid on the master but not on the slave. Also, the effect of statements may be different when things like HIGH_NOT_PRECEDENCE or PIPES_AS_CONCAT or ANSI_QUOTES are changed.

regards,

Roland Bouman

gmax said...

Received from Brian:
Drizzle will be replicating the actual "this is what we did for the table creation"... so not a bug with us.

gmax said...

@Roland,
Your reasoning is correct, but the example is not well chosen. SQL_MODE is replicated. Even if you change it for the session, its contents are preserved.
That is another argument in favor of the it's-a-bug theory. For coherence, all the environment variables should be preserved in replication. But since this is, according to the manual, a feature, then it is not replicated, by design.

Giuseppe

Roland Bouman said...

Hi Giuseppe,

thanks for pointing this out.
I guess thats my inexperience with replication showing...:)

But I agree - the only way to be sure is to maintain the entire state of the master's session.

Another example:

SET group_concat_max_len=x

?

Vote on Planet MySQL