Monday, May 27, 2013

Getting started with replication from MySQL to MongoDB

As you probably know, Tungsten Replicator can replicate data from MySQL to MongoDB. The installation is relatively simple and, once done, replication works very well. There was a bug in the installation procedure recently, and as I was testing that the breakage has been fixed, I wanted to share the experience of getting started with this replication.

Step 1: install a MySQL server

For this exercise, we will use a MySQL sandbox running MySQL 5.5.31.

We download the binaries from dev.mysql.com and install a sandbox, making sure that it is configured as master, and that it is used row-based-replication.

$ mkdir -p $HOME/opt/mysql
$ cd ~/downloads
$ wget http://dev.mysql.com/get/Downloads/MySQL-5.5/mysql-5.5.31-linux2.6-x86_64.tar.gz/from/http://cdn.mysql.com/
$ make_sandbox --export_binaries ~/downloads/mysql-5.5.31-linux2.6-x86_64.tar.gz -- --master
$ echo 'binlog-format=row' >> ~/sandboxes/msb_5_5_31/my.sandbox.cnf

$ ~/sandboxes/msb_5_5_31/use -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {root} ((none)) > grant all on *.* to tungsten identified by 'secret' with grant option;
Query OK, 0 rows affected (0.00 sec)


mysql [localhost] {root} ((none)) > set global binlog_format=row;
Query OK, 0 rows affected (0.00 sec)

The above command will install an instance of MySQL 5.5.31 in the directory $HOME/sandboxes/msb_5_5_31. You can use any other MySQL version. In that case, you should change the DEPLOY.sh file below with the coordinates of your server.

Step2: install a MongoDB server

Get the binaries from MongoDB downloads, and unpack them.

$ cd ~/downloads
$ wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.3.tgz
$ cd
$ tar -xzf ~/downloads/mongodb-linux-x86_64-2.4.3.tgz
$ mv mongodb-linux-x86_64-2.4.3 mongodb

Once you have unpacked the binaries, you can run the server, with a command such as this:

#!/bin/bash
MONGODB_HOME=$HOME/mongodb

if [ ! -d $MONGODB_HOME ]
then
    echo "$MONGODB_HOME not found"
    exit 1
fi

cd $MONGODB_HOME
if [ ! -d $MONGODB_HOME/data ]
then
    mkdir $MONGODB_HOME/data
fi

./bin/mongod \
   --logpath=$MONGODB_HOME/mongodb.log \
   --dbpath=$MONGODB_HOME/data \
   --fork \
   --rest

Now MongoDB should be ready to receive commands. If needed, you can see a tutorial on basic MongoDB operations. For now, it's enough to check that your system is running:

$ cd ~/mongodb
$ ./bin/mongo
MongoDB shell version: 2.4.3
connecting to: test
> show dbs
local 0.078125GB
>

Step 3: install a master replicator using MySQL

As I mentioned above, there was a bug in the installation. The bug was fixed yesterday, and thus you should use a recent build from http://bit.ly/tr20_builds. For this exercise, I am using tungsten-replicator-2.1.0-269.tar.gz.

Of course, you should also configure your host according to the system requirements for Tungsten.

We start by creating a few directories that we need for our deployment.

$ cd 
$ mkdir deploy
$ cd deploy
$ mkdir mysql     # here we install the master replicator
$ mkdir mongodb   # here we install the slave replicator
$ wget https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.1.0-269.tar.gz
$ tar -xzf tungsten-replicator-2.1.0-269.tar.gz

We now create a defaults script that will be used by both master and slave installer.

$ cat DEPLOY.sh
MYSQL_PORT=5531
MYSQL_SANDBOX_BASE=$HOME/sandboxes/msb_5_5_31
MYSQL_BASEDIR=$HOME/opt/mysql/5.5.31
MYSQL_CONF=$MYSQL_SANDBOX_BASE/my.sandbox.cnf
MYSQL_BINLOG_DIRECTORY=$MYSQL_SANDBOX_BASE/DATA/
DEPLOY_HOME=$HOME/deploy
MYSQL_DEPLOY=$DEPLOY_HOME/mysql
MONGODB_DEPLOY=$DEPLOY_HOME/mongodb
MONGODB_PORT=27017
TUNGSTEN_BINARIES=$DEPLOY_HOME/tungsten-replicator-2.1.0-269
MASTER_THL_PORT=12500
SLAVE_THL_PORT=12600
MASTER_RMI_PORT=11500
SLAVE_RMI_PORT=11600

And finally the master installation script:

$ cat install_master.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

./tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=127.0.0.1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --datasource-mysql-conf=$MYSQL_CONF \
  --datasource-log-directory=$MYSQL_BINLOG_DIRECTORY \
  --datasource-port=$MYSQL_PORT \
  --service-name=mongodb \
  --home-directory=$MYSQL_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --thl-port=$MASTER_THL_PORT \
  --rmi-port=$MASTER_RMI_PORT \
  --java-file-encoding=UTF8 \
  --mysql-use-bytes-for-string=false \
  --mysql-enable-enumtostring=true \
  --mysql-enable-settostring=true \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=none --start-and-report

We can run the installation command for the master.

$ ./install_master.sh
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> ..
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 1.218
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Step 4: install a slave replicator using MongoDB

Now we have a master database server with a replicator that can send data across for any slave to pick it up. We can install one or more MySQL slaves, but this is not the purpose of this exercise. So we skip any regular slave installation and will install a MongoDB slave. To do so, we create a third script, which will invoke the installer to run a replicator slave service that applies data to MongoDB.

$ cat install_slave.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=127.0.0.1  \
  --service-name=mongodb \
  --home-directory=$MONGODB_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --datasource-port=$MONGODB_PORT \
  --master-thl-port=$MASTER_THL_PORT \
  --thl-port=$SLAVE_THL_PORT \
  --rmi-port=$SLAVE_RMI_PORT \
  --java-file-encoding=UTF8 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=none --start-and-report

Compared to MySQL installation, we see that the 'datasource-type' is 'mongodb', and as such it does not need username and password. Let's run it.

$ ./install_slave.sh
WARN  >> 127.0.0.1 >> Currently unable to check for the THL schema in mongodb
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 49.444
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

We get a warning, which should not worry us, as it is expected. Notice that the high latency reported by the replicator is due to the delay in installing the slave. The only event in the replication pipeline is the 'master online' broadcast, which was replicated after the installation was completed.

If you got to this point, with both the master and slave replicator in the 'ONLINE' state, then you can continue with checking that replication works.

Step 5: check replication

Replication from MySQL (a relational DBMS) to MongoDB (a key-value store) requires some understanding of what we are replicating and what limitations we can face.

In MySQL, we have tables, which are translated to MongoDB collections. What is very important to understand is that Tungsten Replicator DOES NOT REPLICATE DDL EVENTS. Any CREATE/DROP/ALTER events are ignored by the replicator.

However, the replicator is smart enough to convert every table record into a document within the relevant collection. Every insert will generate a document inside MongoDB. Let's see an example:

mysql [localhost] {msandbox} (test) > create table myfirst (id int not null primary key, name char(30), d date);
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (test) > insert into myfirst values (1, 'Harry Potter', '1997-06-30');
Query OK, 1 row affected (0.00 sec)

On the MongoDB side, we will see the corresponding object:

> show dbs
local 0.078125GB
test 0.203125GB
tungsten_mongodb 0.203125GB
> use test
switched to db test
> show collections
myfirst
system.indexes
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter", "d" : "1997-06-30" }

Not only inserts, but also updates are recognised:

mysql [localhost] {msandbox} (test) > insert into myfirst values (2, 'Harry Potter 2', '1999-06-02');
Query OK, 1 row affected (0.00 sec)

mysql [localhost] {msandbox} (test) > update myfirst set name = 'Harry Potter 1' where id =1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Which will result in:

> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }
{ "_id" : ObjectId("51a287633004fd7959a5f5ab"), "id" : "2", "name" : "Harry Potter 2", "d" : "1999-06-02" }

And delete statements:

mysql [localhost] {msandbox} (test) > delete from myfirst  where id =2;
Query OK, 1 row affected (0.00 sec)

#####
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }

This is all fine. But what happens when we insert quite a lot of data? Let's find out. We're going to use the test employees database to see if the system complains.

$ cd ~/data/employees/
$  ~/sandboxes/msb_5_5_31/use < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries

To see if replication is OK, we check both master and slave for their status, and then we count the records in MySQL and MongoDB.

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 1.152
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 106
appliedLatency  : 93.469
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

So, we see that the slave is lagging, as the data is hard to digest (about 4 million records). Let's try again after a few seconds:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 89.99
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Both the master and the slave have reached the same transaction number. The slave has some delay, due to the size of the transactions (which become even bigger because of the RBR). When the size of the transaction is small, the latency becomes quite acceptable:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT heartbeat

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 183
appliedLatency  : 0.423
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

As a last operation, let's check the numbers of the biggest table in the employees database:

$ ~/sandboxes/msb_5_5_31/use  employees
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {msandbox} (employees) > show tables;
+---------------------+
| Tables_in_employees |
+---------------------+
| departments         |
| dept_emp            |
| dept_manager        |
| employees           |
| salaries            |
| titles              |
+---------------------+
6 rows in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from employees;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.22 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from dept_emp;
+----------+
| count(*) |
+----------+
|   331603 |
+----------+
1 row in set (0.28 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from titles;
+----------+
| count(*) |
+----------+
|   443308 |
+----------+
1 row in set (0.25 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from salaries;
+----------+
| count(*) |
+----------+
|  2844047 |
+----------+
1 row in set (0.83 sec)

####

> use employees
switched to db employees
> show collections
departments
dept_emp
dept_manager
employees
salaries
system.indexes
titles
> db.employees.count()
300024
> db.dept_emp.count()
331603
> db.titles.count()
443308
> db.salaries.count()
2844047
>

This proves that replication works, and even large amount of data are replicated reasonably fast.

Here's the official documentation about MySQL to MongoDB replication: https://docs.continuent.com/wiki/display/TEDOC/Replicating+from+MySQL+to+MongoDB. There you will find more details on how to install and what to expect. Happy hacking!

Sunday, May 05, 2013

MySQL Sandbox supports latest MySQL releases, has more metadata and docs

MySQL Sandbox has been updated again. The latest version is 3.0.38, which was just released. There were four releases in the space of one week, and this last one is just a polished edition.

Cherry-picking from the Change log:

  • Added option --bind_address to complement the effects of --remote_access;
  • The script 'enable_gtid' (for MySQL 5.6 +) now is durable. Previously the changes did not survive a restart.
  • Now you can install MariaDB with its bizarre version '10.0'
  • It also works well with MySQL 5.7. A bug prevented the creation of 'enable_gtid', but it is now fixed.
  • Enhanced the 'check_slaves' script in replication sandboxes. Now it includes the output of 'show master status', so you can see at a glance if replication is working well;
  • Added README file to each sandbox (simple or multiple)
  • Added connection.json to each sandbox (simple or multiple) The file contains information to use the sandbox with third party applications.
  • Added connection samples (for PHP, Perl, Python, Java, shell) to 'connection.json';
  • Added about 80 new tests to the test suite. Now the package has more than 2,000 tests (~290 tests running 7 times, each with a different MySQL version)

Documentation and metadata

The changes that probably are most visible to users involve metadata and documentation.

When you install a single sandbox, you will find a README file, containing a summary of the actions that you can do with the sandbox, and instructions to find more information.

Composite sandboxes (replication, multiple, circular, custom) have their own customised README.

While README is a pleasant addition for first time users, the metadata files could be beneficial also to old timers. In each sandbox there are two files:

  • connection.json
  • default_connection.json

If it is a single sandbox, you will see something like this:

{
    "origin": {
        "mysql_sandbox_version" : "3.0.38",
        "mysql_version":    "5.6",
        "binaries":         "/Users/gmax/opt/mysql/5.6.11"
    },
    "connection": {
        "host":             "127.0.0.1",
        "port":             "5611",
        "socket":           "/tmp/mysql_sandbox5611.sock",
        "bind_address":     "127.0.0.1"
    },
    "users": {
       "admin": {
            "username":     "root@localhost",
            "password":     "msandbox",
            "privileges":   "all, with grant option"
        },
       "all_privileges": {
            "username":     "msandbox@127.%",
            "password":     "msandbox",
            "privileges":   "all, no grant option"
        },
        "read_write": {
            "username":     "msandbox_rw@127.%",
            "password":     "msandbox",
            "privileges":   "SELECT,INSERT,UPDATE,DELETE,CREATE,DROP,INDEX,ALTER,SHOW DATABASES,CREATE TEMPORARY TABLES,LOCK TABLES, EXECUTE"
        },
        "read_only": {
            "username":     "msandbox_ro@127.%",
            "password":     "msandbox",
            "privileges":   "SELECT,EXECUTE"
        },
        "replication": {
            "username":     "rsandbox@127.%",
            "password":     "rsandbox",
            "privileges":   "REPLICATION SLAVE"
        }
    },
    "samples": {
        "php": {
            "mysqli" : "$mysqli = new mysqli('127.0.0.1', 'msandbox', 'msandbox', 'test', '5611');",
            "pdo"    : "$dbh = new PDO('mysql:host=127.0.0.1;port=5531', 'msandbox', 'msandbox');"
        },
        "perl" : {
            "dbi" : "$dbh=DBI->connect( 'DBI:mysql:host=127.0.0.1;port=5611', 'msandbox', 'msandbox')"
        },
        "python" : {
            "mysql.connector" : "cnx = mysql.connector.connect(user='msandbox', password='msandbox', host='127.0.0.1', port=5611, database='test')"
        },
        "java" : {
            "DriverManager" : "con=DriverManager.getConnection(\\\"jdbc:mysql://127.0.0.1:5611/test\\\", \\\"msandbox\\\", \\\"msandbox\\\")"
        },
        "shell" : {
            "generic": "/Users/gmax/opt/mysql/5.6.11/bin/mysql -h 127.0.0.1 -P 5611 -u msandbox -pmsandbox"
        }
    }
}
The file contains most of the information needed to use a sandbox with third party applications, and even ready-made connection strings for some popular programming languages.

If you safe in a hurry, you may look at default_connection.json, instead:

{
    "host":     "127.0.0.1",
    "port":     "5611",
    "socket":   "/tmp/mysql_sandbox5611.sock",
    "username": "msandbox@127.%",
    "password": "msandbox"
}

In composite sandboxes, these files collect the json objects for the underlying sandboxes

$ cat ~/sandboxes/rsandbox_5_6_11/default_connection.json
{
"master":
    {
        "host":     "127.0.0.1",
        "port":     "18876",
        "socket":   "/tmp/mysql_sandbox18876.sock",
        "username": "msandbox@127.%",
        "password": "msandbox"
    }
,
"node1":
    {
        "host":     "127.0.0.1",
        "port":     "18877",
        "socket":   "/tmp/mysql_sandbox18877.sock",
        "username": "msandbox@127.%",
        "password": "msandbox"
    }
,
"node2":
    {
        "host":     "127.0.0.1",
        "port":     "18878",
        "socket":   "/tmp/mysql_sandbox18878.sock",
        "username": "msandbox@127.%",
        "password": "msandbox"
    }
}

There is another addition that was not emphasised before. In each single sandbox, there is a script named 'json_in_db'.

This script loads the contents of connection.json into the database server. This could be useful if you want to access the metadata through a SQL interface. I was also thinking of loading the same data into a table by default, but I did not want to introduce database objects without user acknowledgement. It's probably a behaviour that can be authorised with another installation option.

UPDATE: I have just found out that in the replication README, the head credits are included twice. Oh, well. I will fix it as soon as there are some more changes. Three releases in two days are enough.


Access sandboxes from other hosts

By default, a MySQL sandbox instance can be accessed by localhost only.

This access is regulated by an option, "--remote_access", which is set to '127.%', and it is used to create the default users. You can see the resulting instructions in the file 'grants.mysql' inside each sandbox.

grant all on *.* to msandbox@'127.%' identified by 'msandbox';
  grant all on *.* to msandbox@'localhost' identified by 'msandbox';
  grant SELECT,EXECUTE on *.* to msandbox_ro@'127.%' identified by 'msandbox';
  grant SELECT,EXECUTE on *.* to msandbox_ro@'localhost' identified by 'msandbox';
  grant REPLICATION SLAVE on *.* to rsandbox@'127.%' identified by 'rsandbox';

If you want to access the sandbox remotely, you can change "--remote_access" to include your specific subnet, or to open it completely:

--remote_access='192.168.1.%'
  --remote_access='%'
Starting with MySQL::Sandbox 3.0.34, you can also define the bind address for your MySQL server. By default it is '127.0.0.1'. It will be changed to '0.0.0.0' if you choose a customized --remote_access. If you change both --remote_access and --bind_address, No adjustment will be made.

Happy hacking!