MySQL 5.6 is probably the MySQL version with the largest bundle of new features. Now that it has been released as GA, you may want to try it soon, and I would like to offer some practical experience on how to use some of its advanced features.
Since replication is my main interest, I will focus on some of the new features in this field, and I will compare what's available in MySQL 5.6 with Tungsten Replicator.
The focus of the comparison is usability, manageability, and some hidden functionality. Parallel replication has been available in Tungsten Replicator for almost two years, and Global Transaction Identifiers for much longer than that. With MySQL 5.6, it seems that the MySQL team wants to close the gap. While the main feature (parallel execution threads) is available and performing well, there are some shocking differences in ease of use, administrative controls, and capabilities, which suggest, at the very least, poor integration with the server.
To test replication, I enabled parallel workers and GTIDs on the slaves, and I ran five concurrent write processes against separate databases. I repeated the operation with both MySQL 5.6 and Tungsten Replicator. Let's see what I found.
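For reference, this is roughly the kind of slave configuration needed to enable both features in MySQL 5.6. It is only a sketch of a plausible setup: the number of workers and the table-based repositories reflect what the tests below rely on, but the exact values used in my sandboxes may differ.

# my.cnf fragment for the slave (illustrative sketch, not the exact sandbox settings)
log-bin                   = mysql-bin
log-slave-updates
gtid-mode                 = ON
enforce-gtid-consistency
slave-parallel-workers    = 5        # one applier thread per test schema
master-info-repository    = TABLE    # master info goes to mysql.slave_master_info
relay-log-info-repository = TABLE    # relay log info goes to mysql.slave_relay_log_info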
Monitoring
First of all, there is "SHOW SLAVE STATUS", which has been enhanced to include GTIDs. The trouble is that this command was designed for a single replication channel, and when the GTIDs from several parallel threads are listed, the result is puzzling.
show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 18675
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 386003338
Relay_Log_File: mysql_sandbox18676-relay-bin.000006
Relay_Log_Pos: 315698722
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 315698559
Relay_Log_Space: 386003928
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 646
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: f6ea8d94-7147-11e2-a036-6c626da07446
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Reading event from the relay log
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: f6ea8d94-7147-11e2-a036-6c626da07446:1-1089548
Executed_Gtid_Set: f6ea8d94-7147-11e2-a036-6c626da07446:1-891950:891952-891953:891955-891963:891965-891969:891971-891973:891976-891992:891994-891995:891997:891999:892001-892002:892005-892016:892019-892023:892025-892026:892028:892030-892032:892034-892039:892041-892044:892046:892048-892051:892053-892054:892056-892063:892065-892072:892074-892084:892086-892089:892091:892094:892096-892111:892113-892122:892124-892137:892139-892142:892144-892146:892148-892150:892152:892154-892159:892161-892171:892173-892182:892184-892185:892187:892189-892193:892196-892198:892200-892206:892208:892210-892211:892213-892216:892218-892219:892222:892224-892226:892228:892230-892249:892251:892255:892257:892259-892260:892262-892263:892265:892268:892270-892288:892290:892292:892295-892298:892300-892307:892309:892312:892314-892317:892319-892320:892324-892328:892332-892336:892342:892344-892349:892352-892355:892359-892360:892362-892363:892369-892372:892376:892378-892384:892387:892390:892392:892394:892396-892397:892399:892401-892403:892405-892407:892409:892413-892415:892417:892421:892423-892426:892429-892431:892433:892435:892437-892439:892442:892445:892448
Auto_Position: 0
After this abundant display of GTIDs (alas, without any reference to the transactions being processed), I was expecting something more from the tables dedicated to replication monitoring. However, what is missing in these tables is any reference to GTIDs, and any context about the transactions. There is no indication of the schemas in which the changes are happening, how well or badly they are distributed, whether serialization was ever needed, or other useful info.
select * from mysql.slave_worker_info\G
*************************** 1. row ***************************
Id: 1
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 91550519
Master_log_name: mysql-bin.000002
Master_log_pos: 91550356
Checkpoint_relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Checkpoint_relay_log_pos: 91371802
Checkpoint_master_log_name: mysql-bin.000002
Checkpoint_master_log_pos: 91371639
Checkpoint_seqno: 508
Checkpoint_group_size: 64
Checkpoint_group_bitmap: #####(unprintable characters replaced)#####
*************************** 2. row ***************************
Id: 2
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 91497101
Master_log_name: mysql-bin.000002
Master_log_pos: 91496938
Checkpoint_relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Checkpoint_relay_log_pos: 91319665
Checkpoint_master_log_name: mysql-bin.000002
Checkpoint_master_log_pos: 91319502
Checkpoint_seqno: 501
Checkpoint_group_size: 64
Checkpoint_group_bitmap: #####(unprintable characters replaced)#####
*************************** 3. row ***************************
Id: 3
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 91540831
Master_log_name: mysql-bin.000002
Master_log_pos: 91540668
Checkpoint_relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Checkpoint_relay_log_pos: 91360379
Checkpoint_master_log_name: mysql-bin.000002
Checkpoint_master_log_pos: 91360216
Checkpoint_seqno: 511
Checkpoint_group_size: 64
Checkpoint_group_bitmap: #####(unprintable characters replaced)#####
*************************** 4. row ***************************
Id: 4
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 91461999
Master_log_name: mysql-bin.000002
Master_log_pos: 91461836
Checkpoint_relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Checkpoint_relay_log_pos: 91281068
Checkpoint_master_log_name: mysql-bin.000002
Checkpoint_master_log_pos: 91280905
Checkpoint_seqno: 510
Checkpoint_group_size: 64
Checkpoint_group_bitmap: #####(unprintable characters replaced)#####
*************************** 5. row ***************************
Id: 5
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 91373959
Master_log_name: mysql-bin.000002
Master_log_pos: 91373796
Checkpoint_relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Checkpoint_relay_log_pos: 91192982
Checkpoint_master_log_name: mysql-bin.000002
Checkpoint_master_log_pos: 91192819
Checkpoint_seqno: 508
Checkpoint_group_size: 64
Checkpoint_group_bitmap: #####(unprintable characters replaced)#####
The table dedicated to parallel replication doesn't give us any information about GTIDs, nor does it say in which schema each transaction is happening. For a feature that is based on schema isolation, this lack of information is quite startling.
So, let's have a look at the crash-safe slave tables.
select * from mysql.slave_master_info\G
*************************** 1. row ***************************
Number_of_lines: 23
Master_log_name: mysql-bin.000002
Master_log_pos: 385282815
Host: 127.0.0.1
User_name: rsandbox
User_password: rsandbox
Port: 18675
Connect_retry: 60
Enabled_ssl: 0
Ssl_ca:
Ssl_capath:
Ssl_cert:
Ssl_cipher:
Ssl_key:
Ssl_verify_server_cert: 0
Heartbeat: 1800
Bind:
Ignored_server_ids: 0
Uuid: f6ea8d94-7147-11e2-a036-6c626da07446
Retry_count: 86400
Ssl_crl:
Ssl_crlpath:
Enabled_auto_position: 0

select * from mysql.slave_relay_log_info\G
*************************** 1. row ***************************
Number_of_lines: 7
Relay_log_name: ./mysql_sandbox18676-relay-bin.000006
Relay_log_pos: 235324325
Master_log_name: mysql-bin.000002
Master_log_pos: 235324162
Sql_delay: 0
Number_of_workers: 5
Id: 1
What's wrong with this picture? Two main things:
- There is no mention of GTIDs in these tables either. So, if you are using GTIDs, your only option when a failover occurs is to look at the binary logs and see if you find any solace there. The server itself does not offer any help.
- There is only one row available for the master and relay log info. When you are using parallel replication, this means that either every commit must lock the same row, or these tables are useless for keeping the slave crash-safe. If the crash-safe information comes from the parallel workers table instead, then these tables are redundant. But as you can see for yourself, the parallel workers table has less information than slave_master_info and slave_relay_log_info.
Next, let's see how GTIDs and parallel threads are implemented in Tungsten.
Even limiting the information to the tables found in the server, we find several differences:
- The GTID is displayed in every row (it's the column named seqno). The original binlog name and position are also preserved and associated with the GTID (in the column named eventid).
- Every row shows in which schema it is being processed (it's the column named shard_id).
- There is no separate table for crash-safe info and parallel threads. When Tungsten uses a single thread, the table has only one row. When it uses parallel appliers, there are multiple rows, one for each applier. This way, there is no possibility of deadlock, as every thread locks only one row.
select * from tungsten_tsandbox.trep_commit_seqno +---------+---------+--------+-----------+-----------+--------------+-------------------------------------+-----------------+---------------------+------------+---------------------+ | task_id | seqno | fragno | last_frag | source_id | epoch_number | eventid | applied_latency | update_timestamp | shard_id | extract_timestamp | +---------+---------+--------+-----------+-----------+--------------+-------------------------------------+-----------------+---------------------+------------+---------------------+ | 0 | 2218566 | 0 | 1 | 127.0.0.1 | 0 | mysql-bin.000002:0000000330476211;0 | 1 | 2013-02-07 23:12:07 | evaluator1 | 2013-02-07 23:12:06 | | 1 | 2218521 | 0 | 1 | 127.0.0.1 | 0 | mysql-bin.000002:0000000330469587;0 | 1 | 2013-02-07 23:12:07 | evaluator2 | 2013-02-07 23:12:06 | | 2 | 2220380 | 0 | 1 | 127.0.0.1 | 0 | mysql-bin.000002:0000000330747639;0 | 1 | 2013-02-07 23:12:07 | evaluator3 | 2013-02-07 23:12:06 | | 3 | 2219360 | 0 | 1 | 127.0.0.1 | 0 | mysql-bin.000002:0000000330594617;0 | 1 | 2013-02-07 23:12:07 | evaluator4 | 2013-02-07 23:12:06 | | 4 | 2221220 | 0 | 1 | 127.0.0.1 | 0 | mysql-bin.000002:0000000330873205;0 | 0 | 2013-02-07 23:12:07 | evaluator5 | 2013-02-07 23:12:07 | +---------+---------+--------+-----------+-----------+--------------+-------------------------------------+-----------------+---------------------+------------+---------------------+
You can see the same information in vertical output, to appreciate the level of detail.
select * from tungsten_tsandbox.trep_commit_seqno\G *************************** 1. row *************************** task_id: 0 seqno: 2257934 fragno: 0 last_frag: 1 source_id: 127.0.0.1 epoch_number: 0 eventid: mysql-bin.000002:0000000336350320;0 applied_latency: 9 update_timestamp: 2013-02-07 23:12:26 shard_id: evaluator1 extract_timestamp: 2013-02-07 23:12:17 *************************** 2. row *************************** task_id: 1 seqno: 2257938 fragno: 0 last_frag: 1 source_id: 127.0.0.1 epoch_number: 0 eventid: mysql-bin.000002:0000000336350875;0 applied_latency: 9 update_timestamp: 2013-02-07 23:12:26 shard_id: evaluator2 extract_timestamp: 2013-02-07 23:12:17 *************************** 3. row *************************** task_id: 2 seqno: 2257936 fragno: 0 last_frag: 1 source_id: 127.0.0.1 epoch_number: 0 eventid: mysql-bin.000002:0000000336350648;0 applied_latency: 9 update_timestamp: 2013-02-07 23:12:26 shard_id: evaluator3 extract_timestamp: 2013-02-07 23:12:17 *************************** 4. row *************************** task_id: 3 seqno: 2257916 fragno: 0 last_frag: 1 source_id: 127.0.0.1 epoch_number: 0 eventid: mysql-bin.000002:0000000336347768;0 applied_latency: 9 update_timestamp: 2013-02-07 23:12:26 shard_id: evaluator4 extract_timestamp: 2013-02-07 23:12:17 *************************** 5. row *************************** task_id: 4 seqno: 2257933 fragno: 0 last_frag: 1 source_id: 127.0.0.1 epoch_number: 0 eventid: mysql-bin.000002:0000000336350156;0 applied_latency: 9 update_timestamp: 2013-02-07 23:12:26 shard_id: evaluator5 extract_timestamp: 2013-02-07 23:12:17
In addition to the information that you can get from the database server, Tungsten has its own tools that provide information about the process. For example, trepctl status -name stores provides, among other things, detailed information about how many transactions were processed, accepted, and rejected by each channel, and whether there was any collision that caused serialization.
trepctl status -name stores Processing status command (stores)... NAME VALUE ---- ----- activeSeqno : 2267132 doChecksum : false flushIntervalMillis : 0 fsyncOnFlush : false logConnectionTimeout : 28800 logDir : /home/tungsten/tsb2/db3/tlogs/tsandbox logFileRetainMillis : 604800000 logFileSize : 100000000 maximumStoredSeqNo : 2267230 minimumStoredSeqNo : 0 name : thl readOnly : false storeClass : com.continuent.tungsten.replicator.thl.THL timeoutMillis : 2147483647 NAME VALUE ---- ----- criticalPartition : -1 discardCount : 0 estimatedOfflineInterval: 0.0 eventCount : 2267140 headSeqno : 2267140 maxOfflineInterval : 5 maxSize : 10 name : parallel-queue queues : 5 serializationCount : 0 serialized : false stopRequested : false store.0 : THLParallelReadTask task_id=0 thread_name=store-thl-0 hi_seqno=2267140 lo_seqno=1 read=2267140 accepted=460813 discarded=1806327 events=0 store.1 : THLParallelReadTask task_id=1 thread_name=store-thl-1 hi_seqno=2267140 lo_seqno=1 read=2267140 accepted=452812 discarded=1814328 events=0 store.2 : THLParallelReadTask task_id=2 thread_name=store-thl-2 hi_seqno=2267140 lo_seqno=1 read=2267140 accepted=450812 discarded=1816328 events=0 store.3 : THLParallelReadTask task_id=3 thread_name=store-thl-3 hi_seqno=2267140 lo_seqno=1 read=2267140 accepted=450397 discarded=1816743 events=0 store.4 : THLParallelReadTask task_id=4 thread_name=store-thl-4 hi_seqno=2267140 lo_seqno=1 read=2267141 accepted=452306 discarded=1814834 events=0 storeClass : com.continuent.tungsten.replicator.thl.THLParallelQueue syncInterval : 10000 Finished status command (stores)...
The same tool can show what is happening for each shard. When you are dealing with many channels, you need to know who is doing what, sometimes with a high level of detail. This is especially useful when things go wrong. Knowing what has happened to your transaction, and at which stage it stopped working, can be a great help in fixing problems. You can follow every transaction using either the GTID (appliedLastSeqno) or the binlog + position (appliedLastEventId), combined with the schema where it is happening (shardId).
trepctl status -name shards Processing status command (shards)... NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338943258;0 appliedLastSeqno : 2275303 appliedLatency : 18.942 eventCount : 243189 shardId : evaluator1 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338940934;0 appliedLastSeqno : 2275288 appliedLatency : 18.935 eventCount : 231993 shardId : evaluator10 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338939386;0 appliedLastSeqno : 2275277 appliedLatency : 18.931 eventCount : 234687 shardId : evaluator2 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338938221;0 appliedLastSeqno : 2275269 appliedLatency : 18.928 eventCount : 231045 shardId : evaluator3 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338942040;0 appliedLastSeqno : 2275295 appliedLatency : 18.933 eventCount : 225755 shardId : evaluator4 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338942486;0 appliedLastSeqno : 2275298 appliedLatency : 18.937 eventCount : 221864 shardId : evaluator5 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338942930;0 appliedLastSeqno : 2275301 appliedLatency : 18.94 eventCount : 219574 shardId : evaluator6 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338943372;0 appliedLastSeqno : 2275304 appliedLatency : 18.931 eventCount : 220078 shardId : evaluator7 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338942600;0 appliedLastSeqno : 2275299 appliedLatency : 18.929 eventCount : 221647 shardId : evaluator8 stage : q-to-dbms NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000338940109;0 appliedLastSeqno : 2275282 appliedLatency : 18.931 eventCount : 226607 shardId : evaluator9 stage : q-to-dbms Finished status command (shards)...
And if the info by shard is not enough, we can also see the functioning of replication in terms of tasks. For the current transactions, we can check what is happening at every stage, with a level of detail that the MySQL server cannot provide.
trepctl status -name tasks Processing status command (tasks)... NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339749071;0 appliedLastSeqno : 2280695 appliedLatency : 22.267 applyTime : 53.543 averageBlockSize : 1.000 cancelled : false currentLastEventId: mysql-bin.000002:0000000339749071;0 currentLastFragno : 0 currentLastSeqno : 2280695 eventCount : 2280695 extractTime : 3053.902 filterTime : 0.194 otherTime : 2.864 stage : remote-to-thl state : extract taskId : 0 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339747579;0 appliedLastSeqno : 2280685 appliedLatency : 22.233 applyTime : 135.079 averageBlockSize : 9.983 cancelled : false currentLastEventId: mysql-bin.000002:0000000339747579;0 currentLastFragno : 0 currentLastSeqno : 2280685 eventCount : 2280685 extractTime : 2972.509 filterTime : 0.167 otherTime : 2.714 stage : thl-to-q state : extract taskId : 0 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339747579;0 appliedLastSeqno : 2280685 appliedLatency : 22.243 applyTime : 174.149 averageBlockSize : 9.813 cancelled : false currentLastEventId: mysql-bin.000002:0000000339747579;0 currentLastFragno : 0 currentLastSeqno : 2280685 eventCount : 463769 extractTime : 2930.461 filterTime : 4.553 otherTime : 1.317 stage : q-to-dbms state : extract taskId : 0 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339746639;0 appliedLastSeqno : 2280679 appliedLatency : 22.247 applyTime : 173.182 averageBlockSize : 9.790 cancelled : false currentLastEventId: mysql-bin.000002:0000000339746639;0 currentLastFragno : 0 currentLastSeqno : 2280679 eventCount : 455841 extractTime : 2931.707 filterTime : 4.374 otherTime : 1.219 stage : q-to-dbms state : extract taskId : 1 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339747249;0 appliedLastSeqno : 2280683 appliedLatency : 22.246 applyTime : 171.505 averageBlockSize : 9.819 cancelled : false currentLastEventId: mysql-bin.000002:0000000339747249;0 currentLastFragno : 0 currentLastSeqno : 2280683 eventCount : 453846 extractTime : 2933.467 filterTime : 4.28 otherTime : 1.23 stage : q-to-dbms state : extract taskId : 2 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339747415;0 appliedLastSeqno : 2280684 appliedLatency : 22.252 applyTime : 168.773 averageBlockSize : 9.792 cancelled : false currentLastEventId: mysql-bin.000002:0000000339747415;0 currentLastFragno : 0 currentLastSeqno : 2280684 eventCount : 453447 extractTime : 2936.219 filterTime : 4.283 otherTime : 1.213 stage : q-to-dbms state : extract taskId : 3 NAME VALUE ---- ----- appliedLastEventId: mysql-bin.000002:0000000339746311;0 appliedLastSeqno : 2280677 appliedLatency : 22.25 applyTime : 167.426 averageBlockSize : 9.804 cancelled : false currentLastEventId: mysql-bin.000002:0000000339746311;0 currentLastFragno : 0 currentLastSeqno : 2280677 eventCount : 454922 extractTime : 2937.622 filterTime : 4.188 otherTime : 1.25 stage : q-to-dbms state : extract taskId : 4 Finished status command (tasks)...
Fixing an error
Let's see what happens with a simple, classic replication error. We will insert a record on the slave, and then do the same on the master. We will assume full knowledge of the facts, and try to repair the damage.
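As a minimal sketch of the scenario (assuming a table test.t1 with an integer primary key, matching the statement that shows up in the error below):

-- on the slave: insert a record that should not be there
insert into test.t1 values (1);

-- on the master: insert the same record, which then replicates
-- to the slave and fails with a duplicate key error
insert into test.t1 values (1);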
In MySQL 5.6, when the slave stops replication because of the duplicate key error, we get this:
show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: rsandbox Master_Port: 18675 Connect_Retry: 60 Master_Log_File: mysql-bin.000002 Read_Master_Log_Pos: 136120191 Relay_Log_File: mysql_sandbox18676-relay-bin.000006 Relay_Log_Pos: 136120097 Relay_Master_Log_File: mysql-bin.000002 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1062 Last_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into t1 values (1)' Skip_Counter: 0 Exec_Master_Log_Pos: 136119934 Relay_Log_Space: 136120781 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into t1 values (1)' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_UUID: c3054160-7b4a-11e2-a17e-6c626da07446 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 130220 11:49:56 Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: c3054160-7b4a-11e2-a17e-6c626da07446:1-386581 Executed_Gtid_Set: c3054160-7b4a-11e2-a17e-6c626da07446:1-386580, c90226d2-7b4a-11e2-a17e-6c626da07446:1 Auto_Position: 0
In this situation, we try the old remedy of skipping a transaction.
set global sql_slave_skip_counter=1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction
Oops! Now I have two problems:
- How do I generate an empty transaction with a given GTID?
- Which is the given GTID? There are several to choose from:
show global variables like '%gtid_executed%'\G
*************************** 1. row ***************************
Variable_name: gtid_executed
        Value: c3054160-7b4a-11e2-a17e-6c626da07446:1-386580,
c90226d2-7b4a-11e2-a17e-6c626da07446:1
I found this manual page about skipping transactions, but I don't get where I should run these instructions. I tried running them on the slave, but the replication error remained in place. I guess we need a clearer example that explains how to fix the error, because this surely does not look like a solution.
Let's try the same issue with Tungsten.
$ trepctl status Processing status command... NAME VALUE ---- ----- appliedLastEventId : NONE appliedLastSeqno : -1 appliedLatency : -1.0 channels : -1 clusterName : default currentEventId : NONE currentTimeMillis : 1361356976587 dataServerHost : 127.0.0.1 extensions : latestEpochNumber : -1 masterConnectUri : thl://127.0.0.1:12110/ masterListenUri : thl://127.0.0.1:12111/ maximumStoredSeqNo : -1 minimumStoredSeqNo : -1 offlineRequests : NONE pendingError : Event application failed: seqno=444898 fragno=0 message=java.sql.SQLException: Statement failed on slave but succeeded on master pendingErrorCode : NONE pendingErrorEventId : mysql-bin.000002:0000000116727315;0 pendingErrorSeqno : 444898 pendingExceptionMessage: java.sql.SQLException: Statement failed on slave but succeeded on master insert into t1 values (1) /* ___SERVICE___ = [tsandbox] */ pipelineSource : UNKNOWN relativeLatency : -1.0 resourcePrecedence : 99 rmiPort : 10102 role : slave seqnoType : java.lang.Long serviceName : tsandbox serviceType : unknown simpleServiceName : tsandbox siteName : default sourceId : 127.0.0.1 state : OFFLINE:ERROR timeInStateSeconds : 18.747 uptimeSeconds : 923.323 version : Tungsten Replicator 2.0.7 build 278 Finished status command...
The error tells me that GTID 444898 is at fault. The remedy I would apply in this case is the same one I would use anyway, regardless of parallel replication.
$ trepctl online -skip-seqno 444898 $ trepctl status Processing status command... NAME VALUE ---- ----- appliedLastEventId : mysql-bin.000002:0000000116726621;0 appliedLastSeqno : 444893 appliedLatency : 0.0 channels : 5 clusterName : default currentEventId : NONE currentTimeMillis : 1361358074668 dataServerHost : 127.0.0.1 extensions : latestEpochNumber : 0 masterConnectUri : thl://127.0.0.1:12110/ masterListenUri : thl://127.0.0.1:12111/ maximumStoredSeqNo : 444898 minimumStoredSeqNo : 0 offlineRequests : NONE pendingError : NONE pendingErrorCode : NONE pendingErrorEventId : NONE pendingErrorSeqno : -1 pendingExceptionMessage: NONE pipelineSource : thl://127.0.0.1:12110/ relativeLatency : 1599.668 resourcePrecedence : 99 rmiPort : 10102 role : slave seqnoType : java.lang.Long serviceName : tsandbox serviceType : local simpleServiceName : tsandbox siteName : default sourceId : 127.0.0.1 state : ONLINE timeInStateSeconds : 2.076 uptimeSeconds : 2021.404 version : Tungsten Replicator 2.0.7 build 278 Finished status command...
And we are back in business. Since GTIDs in Tungsten are well integrated into every aspect of the operations, I can use them to fix problems.
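As a side note, the same sequence number can also be used to inspect the offending transaction before deciding to skip it. This is only a sketch, assuming the replicator's thl utility is in the path and the service is named tsandbox, as in the examples above:

$ thl list -service tsandbox -seqno 444898
# shows the stored event for that seqno: its metadata (source, event ID, shard)
# and the offending statement, insert into t1 values (1)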
More loose ends
What we've seen above shows that the integration between the existing replication features and the new ones is less than desirable. There are more issues that may come to the user's attention. One that struck me as very odd is Replicate_Ignore_Server_Ids.
This value is found in 'SHOW SLAVE STATUS', and it lists the servers that are excluded from replication. It corresponds to the 'IGNORE_SERVER_IDS' option that you can give in CHANGE MASTER TO. What's odd about this value? It only accepts server IDs, not server UUIDs. If you try to use a server UUID in CHANGE MASTER TO, you get an error.
However, the reason for mentioning this variable is that, just by chance, I was able to "fix" the above problem, instead of using an empty transaction. I did the following:
stop slave;
change master to ignore_server_ids=(1);
start slave;
show slave status\G
# no errors
stop slave;
change master to ignore_server_ids=();
start slave;
And we have skipped a transaction in an undocumented way. Bug or feature?
8 comments:
I think you need to somehow figure out how to run START SLAVE UNTIL, because you need to skip just one statement. By adding the server_id to the list of servers to ignore you will ignore the error, but you will also ignore any other queries after it in the replication stream. So you need to figure out how to run one event from that server, then turn skipping off again. Good luck with that :D
I found your post because I had a similar problem with error 1062 while using GTID, so the default error skipping did not work.
However, I managed to solve the faulty replication on the slave by using the commands found on the manual page about skipping transactions.
Your show slave status\G output shows the following:
Retrieved_Gtid_Set: c3054160-7b4a-11e2-a17e-6c626da07446:1-386581
Executed_Gtid_Set: c3054160-7b4a-11e2-a17e-6c626da07446:1-386580,
c90226d2-7b4a-11e2-a17e-6c626da07446:1
The GTID that caused the error was c3054160-7b4a-11e2-a17e-6c626da07446:386581 (all GTIDs up to 386580 were executed, the log shows that the slave retrieved one more which could not be executed). Even if Retrieved_Gtid_Set showed 1-386586, the faulty GTID would be 386581.
Running the following commands on the slave machine (the one that stopped replicating, I ran them in the MySQL Command Line Client) solved the problem for me:
SET GTID_NEXT="c3054160-7b4a-11e2-a17e-6c626da07446:386581";
BEGIN;
COMMIT;
SET GTID_NEXT=AUTOMATIC;
FLUSH LOGS;
PURGE BINARY LOGS TO "mysql-bin.000002";
START SLAVE;
All GTIDs and commands are adjusted to your problem. Hope this helps!
Christian,
Thanks for your comment. This method works, although it looks frightfully unfriendly.
Yes, you're right, it is not very user-friendly, yet it is still friendlier than having to recreate replication from a database dump if it happens once.
Thanks for your post anyway, it led the way to solving my problem! To avoid it next time, I'll just use users with limited rights, so that I don't write to slave databases accidentally.
What I'm really missing this time is a way to check how many transactions were really executed in parallel, regardless of the slave_parallel_type configuration... any hint on that? Cheers!
@Wagner, can you elaborate? It is not clear which system you are referring to.
just brilliant!
Probably the best post about the issue I've seen