Sunday, May 28, 2017

How to break MySQL InnoDB cluster

A few weeks ago I started experimenting with MySQL InnoDB cluster. As part of the testing, I tried to kill a node to see what happens to the cluster.

The good news is that the cluster is resilient. When the primary node goes missing, the cluster replaces it immediately, and operations continue. This is one of the features of a High Availability system, but this feature alone does not define the usefulness or the robustness of the system. In one of my previous jobs I worked on testing a commercial HA system, and I learned a few things about what makes a system reliable.

Armed with this knowledge, I did some more experiments with InnoDB Cluster. The attempt in my previous article had no other expectation than seeing operations continue smoothly after the primary node was replaced. In this article, I examine a few more features of an HA system:

  • Making sure that a failed primary node does not try to force itself back into the cluster;
  • Properly welcoming a failed node into the cluster;
  • Handling a Split Brain cluster.

To explore the above features (or lack thereof) we are going to simulate some mundane occurrences. We start with the same cluster seen in the previous article, using Docker InnoDB Cluster. The initial state is:

{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr1:3306",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    }
}
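
Throughout these tests I will use the check_cluster.sh helper from the same repository to look at the cluster from a given node. It is presumably little more than a wrapper around dba.getCluster().status(); a hypothetical sketch (the node argument, container names, and password file are my assumptions, based on how the script is invoked below):

#!/bin/bash
# Hypothetical sketch of check_cluster.sh: print the cluster status
# as seen from node N (default: node 1)
node=${1:-1}
docker exec mysqlgr$node mysqlsh --uri root@mysqlgr$node:3306 \
    -p"$(cat secretpassword.txt)" \
    -e 'print(dba.getCluster().status())'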

The first experiment is to restart a non-primary node

$ docker restart mysqlgr2

and see what happens to the cluster

$ ./tests/check_cluster.sh | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

The cluster detects that one member is missing, but after a few seconds it is back to normal:

$ ./tests/check_cluster.sh | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK",
    "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

This looks good. Now, let's do the same to the primary node

$ docker restart mysqlgr1

$ ./tests/check_cluster.sh 2| grep 'primary\|address\|status'
    "primary": "mysqlgr2:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

As before, the cluster detects that a node is missing, and excludes it from the cluster. Since it was the primary node, another one becomes primary.

However, this time the node does not come back into the cluster. Checking the cluster status again after several minutes, node 1 is still reported missing. This is not a bug: it is a feature of well-behaved HA systems. A primary node that has already been replaced should not rejoin the cluster automatically.

This experiment, too, went well. Now, for the interesting part, let's look at the split-brain situation.


At this moment, the cluster is split into two parts, and each one sees it in a different way. The view from the current primary node is the one reported above, and it is what we would expect: node 1 is not available. But if we ask node 1 for the cluster status, we get a different picture:

$ ./tests/check_cluster.sh 1 | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "(MISSING)"

Node 1 thinks it is the primary and that two nodes are missing. Nodes 2 and 3 think that node 1 is missing.
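
The same disagreement can be seen without going through the shell, by asking each node directly for its own view of the group membership. A quick sketch, reusing the container names and password file of this setup:

for node in 1 2 3
do
    echo "--- membership as seen by mysqlgr$node ---"
    docker exec mysqlgr$node mysql -uroot -p"$(cat secretpassword.txt)" \
        -e "SELECT member_host, member_state FROM performance_schema.replication_group_members"
done

In this state, node 1 should list only itself, while nodes 2 and 3 should list each other, matching the two status reports above.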

In a sane system, the logical course of action is to admit the failed node back into the cluster after checking that it is safe to do so. The InnoDB Cluster management API provides a rejoinInstance method that allows us to get an instance back:

$ docker exec -it mysqlgr2 mysqlsh --uri root@mysqlgr2:3306 -p$(cat secretpassword.txt)

mysql-js> cluster = dba.getCluster()
<Cluster:testcluster>

mysql-js> cluster.rejoinInstance('mysqlgr1:3306')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@mysqlgr1:3306':
Rejoining instance to the cluster ...

The instance 'root@mysqlgr1:3306' was successfully rejoined on the cluster.

The instance 'mysqlgr1:3306' was successfully added to the MySQL Cluster.

Sounds good, eh? Apparently, we have node 1 back in the fold. Let's check:

$ ./tests/check_cluster.sh 2| grep 'primary\|address\|status'
    "primary": "mysqlgr2:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

Nope. Node 1 is still missing. And if we try to rescan the cluster, we see that the rejoin call was not effective:

mysql-js> cluster.rescan()
Rescanning the cluster...

Result of the rescanning operation:
{
    "defaultReplicaSet": {
        "name": "default",
        "newlyDiscoveredInstances": [],
        "unavailableInstances": [
            {
                "host": "mysqlgr1:3306",
                "label": "mysqlgr1:3306",
                "member_id": "6bd04911-4374-11e7-b780-0242ac170002"
            }
        ]
    }
}

The instance 'mysqlgr1:3306' is no longer part of the HA setup. It is either offline or left the HA group.
You can try to add it to the cluster again with the cluster.rejoinInstance('mysqlgr1:3306') command or you can remove it from the cluster configuration.
Would you like to remove it from the cluster metadata? [Y|n]: n

It's curious (and frustrating) that we get a recommendation to run the very same function that we attempted a minute ago.

But, just as a devilish thought, let's try the same operation from the invalid side of the cluster.

$ docker exec -it mysqlgr1 mysqlsh --uri root@mysqlgr1:3306 -p$(cat secretpassword.txt)

mysql-js> cluster = dba.getCluster()
<Cluster:testcluster>


mysql-js> cluster.rejoinInstance('mysqlgr2:3306')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@mysqlgr2:3306':
Rejoining instance to the cluster ...

The instance 'root@mysqlgr2:3306' was successfully rejoined on the cluster.

The instance 'mysqlgr2:3306' was successfully added to the MySQL Cluster.
mysql-js> cluster.status()
{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr1:3306",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        }
    }
}

Now this was definitely not supposed to happen. The former failed node has invited a healthy node into its minority cluster and the operation succeeded!

The horrible part? This illegal operation succeeded in reconciling the views of node 1 and node 2. Now node 2 also thinks that node 1 is the primary again, and node 3 (which was minding its own business and never had any accident) is considered missing:

$ ./tests/check_cluster.sh 2| grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "(MISSING)"

And node 3 suddenly finds itself in the role of failed node, even though it had nothing to do with the previous operations:

$ ./tests/check_cluster.sh 3| grep 'primary\|address\|status'
    "primary": "mysqlgr3:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

In short, while we were attempting to fix a split brain, we ended up with a different split brain, and an unexpected node promotion. This is clearly a bug, and I hope the MySQL team can make the system more robust.

Monday, May 08, 2017

Getting to know MySQL InnoDB cluster, the new kid on the block


InnoDB Cluster was released as GA a few weeks ago. I remember the initial announcement of the product at OOW 2016, promising a seamless solution for replication and high availability with great ease of use. I was a bit disappointed to see that, at GA release time, InnoDB Cluster is a patchwork of three separate products (Group Replication, MySQL Router, MySQL Shell) that users have to collect and install separately.

Given this situation, I was very pleased when Matthew Lord published Docker-InnoDB-Cluster, an image for Docker that contains everything you need to get the system up and running. The associated scripts make the experience even easier: not only do we not have to hunt for components, but the cluster deployment procedure is also completely automated.

Installation

The process is painless. After cloning the repository the start script takes care of everything. It will create a network, deploy three database nodes, and fire up the router.

$ ./start_three_node_cluster.sh
Creating dedicated grnet network...
# network grnet already exists
NETWORK ID          NAME                DRIVER              SCOPE
8fa365076198        grnet               bridge              local
Bootstrapping the cluster...
12fb4bd975c2fb2e7152ed64e12d2d212bbc9f1d3b39d715ea0c73eeb37fed45
Container mysqlgr1 is up at Sun May  7 22:02:38 CEST 2017
Starting mysqlgr1 container...
Starting mysqlgr1 container...
MySQL init process done. Ready for start up.
Getting GROUP_NAME...
Adding second node...
a2b504ea1920d35b1555f65de24cd364fc1bc7a6ac87ca4eb32f4c02f5afce7c
Container mysqlgr2 is up at Sun May  7 22:02:48 CEST 2017
Starting mysqlgr2 container...
Starting mysqlgr2 container...
MySQL init process done. Ready for start up.
Adding third node...
393d46b9a1795531d99f68645087393a54b2463ef88b9b3c4cbe735c1527fe57
Container mysqlgr3 is up at Sun May  7 22:02:58 CEST 2017
Starting mysqlgr3 container...
Starting mysqlgr3 container...
MySQL init process done. Ready for start up.
Sleeping 10 seconds to give the cluster time to sync up
Adding a router...
830c3125bad70b09b057cee370ee490bcb88b1d4a1bfec347cda847942f3b56e
Container mysqlrouter1 is up at Sun May  7 22:03:17 CEST 2017
Done!
Connecting to the InnoDB cluster...

Most of the configuration (which has been simplified thanks to the use of MySQL Shell to add nodes) is handled inside the container initialization script. The cluster deployment script only needs a few details to get the result.
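
For reference, the AdminAPI part of that automation boils down to a handful of calls. Here is a rough sketch, not the actual script from the image: the real logic also takes care of the Group Replication prerequisites and error handling, and the host names and password handling are assumptions taken from this article.

PASS=$(cat secretpassword.txt)
# Create the cluster on the seed node, then add the other two members
# (sketch only: the image scripts do the heavy lifting)
docker exec -i mysqlgr1 mysqlsh --uri root@mysqlgr1:3306 -p"$PASS" -e "
    var cluster = dba.createCluster('testcluster');
    cluster.addInstance('root:$PASS@mysqlgr2:3306');
    cluster.addInstance('root:$PASS@mysqlgr3:3306');
"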

The deployment script also invokes MySQL Shell on one of the nodes to show the status of the cluster:

Creating a Session to 'root@mysqlgr1:3306'
Classic Session successfully established. No default schema selected.
{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr1:3306",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    }
}

The above status is the result of dba.getCluster().status(), which is a convenient way of collecting a bunch of information about the cluster and presenting it in a compact JSON structure. To see what this call does behind the scenes, you can enable the general log before running it.
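
One quick way of doing that (a sketch: the log file path is an arbitrary choice of mine):

$ docker exec mysqlgr1 mysql -uroot -p"$(cat secretpassword.txt)" \
    -e "SET GLOBAL general_log_file='/tmp/general.log'; SET GLOBAL general_log=ON"
$ ./tests/check_cluster.sh
$ docker exec mysqlgr1 cat /tmp/general.log

With the log enabled, a single status() call produces a burst of queries like this: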

select count(*) from performance_schema.replication_group_members where MEMBER_ID = @@server_uuid AND MEMBER_STATE IS NOT NULL AND MEMBER_STATE != 'OFFLINE';
select count(*) from mysql_innodb_cluster_metadata.instances where mysql_server_uuid = @@server_uuid;
SELECT @@server_uuid, VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'group_replication_primary_member';
SELECT MEMBER_STATE FROM performance_schema.replication_group_members WHERE MEMBER_ID = '0030396b-3300-11e7-a8b6-0242ac170002';
SELECT CAST(SUM(IF(member_state = 'UNREACHABLE', 1, 0)) AS SIGNED) AS UNREACHABLE,  COUNT(*) AS TOTAL FROM performance_schema.replication_group_members;
select count(*) from performance_schema.replication_group_members where MEMBER_ID = @@server_uuid AND MEMBER_STATE IS NOT NULL AND MEMBER_STATE != 'OFFLINE';
select count(*) from mysql_innodb_cluster_metadata.instances where mysql_server_uuid = @@server_uuid;
SELECT @@server_uuid, VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'group_replication_primary_member';
SELECT MEMBER_STATE FROM performance_schema.replication_group_members WHERE MEMBER_ID = '0030396b-3300-11e7-a8b6-0242ac170002';
SELECT CAST(SUM(IF(member_state = 'UNREACHABLE', 1, 0)) AS SIGNED) AS UNREACHABLE,  COUNT(*) AS TOTAL FROM performance_schema.replication_group_members;
SELECT cluster_id, cluster_name, default_replicaset, description, options, attributes FROM mysql_innodb_cluster_metadata.clusters WHERE attributes->'$.default' = true;
show databases like 'mysql_innodb_cluster_metadata';
SELECT replicaset_name, topology_type FROM mysql_innodb_cluster_metadata.replicasets WHERE replicaset_id = 7;
select count(*) from performance_schema.replication_group_members where MEMBER_ID = @@server_uuid AND MEMBER_STATE IS NOT NULL AND MEMBER_STATE != 'OFFLINE';
select count(*) from mysql_innodb_cluster_metadata.instances where mysql_server_uuid = @@server_uuid;
SELECT @@server_uuid, VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'group_replication_primary_member';
SELECT MEMBER_STATE FROM performance_schema.replication_group_members WHERE MEMBER_ID = '0030396b-3300-11e7-a8b6-0242ac170002';
SELECT CAST(SUM(IF(member_state = 'UNREACHABLE', 1, 0)) AS SIGNED) AS UNREACHABLE,  COUNT(*) AS TOTAL FROM performance_schema.replication_group_members;
SELECT @@group_replication_single_primary_mode;
SHOW STATUS LIKE 'group_replication_primary_member';
SELECT mysql_server_uuid, instance_name, role, MEMBER_STATE, JSON_UNQUOTE(JSON_EXTRACT(addresses, "$.mysqlClassic")) as host FROM mysql_innodb_cluster_metadata.instances LEFT JOIN performance_schema.replication_group_members ON `mysql_server_uuid`=`MEMBER_ID` WHERE replicaset_id = 7;

In short, these commands check how resilient the cluster is. The answer is summarized in the statusText field, which says that we can lose up to one node and the cluster will keep working.

High Availability

What we have after deployment is a system that is highly available:

  • Group replication with one primary node;
  • Access to the cluster through the router, which provides one port for Read/write and one for Read-Only;
  • Automatic failover. When the primary node fails, another one is promoted on the spot, without any manual labor.

Let's start a test. We can check whether data inserted through the R/W port can then be retrieved from other nodes through the R/O port.

$ docker exec -it mysqlrouter1 /opt/ic/tests/test_router.sh
Server ID of current master
--------------
SELECT @@global.server_id
--------------

+--------------------+
| @@global.server_id |
+--------------------+
|                100 |
+--------------------+
Create content using router
--------------
create schema if not exists test
--------------

--------------
create table t1(id int not null primary key, name varchar(50))
--------------

--------------
insert into t1 values (1, "aaa")
--------------

The first part of the test shows the server ID of the primary node, obtained through the router's R/W port (6446). Then it creates a table and inserts one record.

Server ID of a RO node
--------------
SELECT @@global.server_id
--------------

+--------------------+
| @@global.server_id |
+--------------------+
|                200 |
+--------------------+
retrieving contents using router
--------------
SELECT * from test.t1
--------------

+----+------+
| id | name |
+----+------+
|  1 | aaa  |
+----+------+

Using the read-only port (6447), we reach a different node, and we retrieve the data created on the primary node.
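
The same check can be repeated by hand, without the test script. A minimal sketch, assuming a mysql client is available inside the router container and that the router listens on the ports mentioned above (6446 for R/W, 6447 for R/O):

PASS=$(cat secretpassword.txt)
# R/W port: always reaches the current primary
docker exec mysqlrouter1 mysql -uroot -p"$PASS" -h127.0.0.1 -P6446 \
    -e "SELECT @@global.server_id"
# R/O port: reaches one of the read-only members
docker exec mysqlrouter1 mysql -uroot -p"$PASS" -h127.0.0.1 -P6447 \
    -e "SELECT @@global.server_id; SELECT * FROM test.t1"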

Now we can test high availability. Since we are using Docker, instead of simply killing the MySQL service, we can simulate an anvil falling on the server by wiping out the container:

$ docker rm -f -v mysqlgr1
mysqlgr1

The primary node is gone for good. Let's see what the cluster status says now:

$ ./tests/check_cluster.sh 2
Creating a Session to 'root@mysqlgr2:3306'
Classic Session successfully established. No default schema selected.
{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr2:3306",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    }
}

There are a few differences compared to the initial report:

  • The primary is now node 2 (mysqlgr2);
  • Node 1 is marked as MISSING;
  • The cluster has lost its resilience. Unless we add another node (as sketched below), no further failures will be handled automatically.
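
Restoring fault tolerance means bringing a third node into the group. On the AdminAPI side this is, in principle, a single addInstance() call from the current primary, sketched below with a hypothetical replacement node mysqlgr4 already running on the same grnet network. In practice, as noted in the summary at the end of this article, this step turned out to be harder than it should be.

PASS=$(cat secretpassword.txt)
# mysqlgr4 is hypothetical: a fresh node started from the same image
# and attached to the grnet network (startup details omitted)
docker exec -i mysqlgr2 mysqlsh --uri root@mysqlgr2:3306 -p"$PASS" -e "
    var cluster = dba.getCluster();
    cluster.addInstance('root:$PASS@mysqlgr4:3306');
"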

We can run the router test again, and it will work just as well, with the differences reported below:

Server ID of current master
--------------
SELECT @@global.server_id
--------------

+--------------------+
| @@global.server_id |
+--------------------+
|                200 |
+--------------------+
Create content using router
--------------
create schema if not exists test
--------------

--------------
drop table if exists t1
--------------

--------------
create table t1(id int not null primary key, name varchar(50))
--------------

--------------
insert into t1 values (1, "aaa")
--------------

Server ID of a RO node
--------------
SELECT @@global.server_id
--------------

+--------------------+
| @@global.server_id |
+--------------------+
|                300 |
+--------------------+

We see that the primary now has ID 200, and the R/O node is 300 (the only other node that has survived).

Summarizing

  • The good

    • I can see that some of the ease of use promised in San Francisco is already available. We can create a cluster with little effort.
    • The recovery from the master failure is transparent.
    • The cluster status gives clear information about the system.
  • The bad

    • MySQL Shell is difficult to use. The command-line help is insufficient: some options require trial and error to get right. It also does not use an options file the way other MySQL clients do.
    • Adding a node after the primary has become unavailable is harder than it should be, and the manual does not cover this case: it only mentions a server that can be restarted.
    • Restarting the router after the primary died is impossible with the current configuration.
    • The metadata for replication is now in three different schemas: mysql, performance_schema, and mysql_innodb_cluster_metadata. I understand the reasons, but I believe that a simplification would be possible.
  • The bottom line: quite good to start a cluster, but not enough to deal effectively with simple HA cases. Possibly released too early.