Monday, May 27, 2013

Getting started with replication from MySQL to MongoDB

As you probably know, Tungsten Replicator can replicate data from MySQL to MongoDB. The installation is relatively simple and, once done, replication works very well. There was a bug in the installation procedure recently, and as I was testing that the breakage has been fixed, I wanted to share the experience of getting started with this replication.

Step 1: install a MySQL server

For this exercise, we will use a MySQL sandbox running MySQL 5.5.31.

We download the binaries from dev.mysql.com and install a sandbox, making sure that it is configured as master, and that it is used row-based-replication.

$ mkdir -p $HOME/opt/mysql
$ cd ~/downloads
$ wget http://dev.mysql.com/get/Downloads/MySQL-5.5/mysql-5.5.31-linux2.6-x86_64.tar.gz/from/http://cdn.mysql.com/
$ make_sandbox --export_binaries ~/downloads/mysql-5.5.31-linux2.6-x86_64.tar.gz -- --master
$ echo 'binlog-format=row' >> ~/sandboxes/msb_5_5_31/my.sandbox.cnf

$ ~/sandboxes/msb_5_5_31/use -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {root} ((none)) > grant all on *.* to tungsten identified by 'secret' with grant option;
Query OK, 0 rows affected (0.00 sec)


mysql [localhost] {root} ((none)) > set global binlog_format=row;
Query OK, 0 rows affected (0.00 sec)

The above command will install an instance of MySQL 5.5.31 in the directory $HOME/sandboxes/msb_5_5_31. You can use any other MySQL version. In that case, you should change the DEPLOY.sh file below with the coordinates of your server.

Step2: install a MongoDB server

Get the binaries from MongoDB downloads, and unpack them.

$ cd ~/downloads
$ wget http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.3.tgz
$ cd
$ tar -xzf ~/downloads/mongodb-linux-x86_64-2.4.3.tgz
$ mv mongodb-linux-x86_64-2.4.3 mongodb

Once you have unpacked the binaries, you can run the server, with a command such as this:

#!/bin/bash
MONGODB_HOME=$HOME/mongodb

if [ ! -d $MONGODB_HOME ]
then
    echo "$MONGODB_HOME not found"
    exit 1
fi

cd $MONGODB_HOME
if [ ! -d $MONGODB_HOME/data ]
then
    mkdir $MONGODB_HOME/data
fi

./bin/mongod \
   --logpath=$MONGODB_HOME/mongodb.log \
   --dbpath=$MONGODB_HOME/data \
   --fork \
   --rest

Now MongoDB should be ready to receive commands. If needed, you can see a tutorial on basic MongoDB operations. For now, it's enough to check that your system is running:

$ cd ~/mongodb
$ ./bin/mongo
MongoDB shell version: 2.4.3
connecting to: test
> show dbs
local 0.078125GB
>

Step 3: install a master replicator using MySQL

As I mentioned above, there was a bug in the installation. The bug was fixed yesterday, and thus you should use a recent build from http://bit.ly/tr20_builds. For this exercise, I am using tungsten-replicator-2.1.0-269.tar.gz.

Of course, you should also configure your host according to the system requirements for Tungsten.

We start by creating a few directories that we need for our deployment.

$ cd 
$ mkdir deploy
$ cd deploy
$ mkdir mysql     # here we install the master replicator
$ mkdir mongodb   # here we install the slave replicator
$ wget https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.1.0-269.tar.gz
$ tar -xzf tungsten-replicator-2.1.0-269.tar.gz

We now create a defaults script that will be used by both master and slave installer.

$ cat DEPLOY.sh
MYSQL_PORT=5531
MYSQL_SANDBOX_BASE=$HOME/sandboxes/msb_5_5_31
MYSQL_BASEDIR=$HOME/opt/mysql/5.5.31
MYSQL_CONF=$MYSQL_SANDBOX_BASE/my.sandbox.cnf
MYSQL_BINLOG_DIRECTORY=$MYSQL_SANDBOX_BASE/DATA/
DEPLOY_HOME=$HOME/deploy
MYSQL_DEPLOY=$DEPLOY_HOME/mysql
MONGODB_DEPLOY=$DEPLOY_HOME/mongodb
MONGODB_PORT=27017
TUNGSTEN_BINARIES=$DEPLOY_HOME/tungsten-replicator-2.1.0-269
MASTER_THL_PORT=12500
SLAVE_THL_PORT=12600
MASTER_RMI_PORT=11500
SLAVE_RMI_PORT=11600

And finally the master installation script:

$ cat install_master.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

./tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=127.0.0.1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --datasource-mysql-conf=$MYSQL_CONF \
  --datasource-log-directory=$MYSQL_BINLOG_DIRECTORY \
  --datasource-port=$MYSQL_PORT \
  --service-name=mongodb \
  --home-directory=$MYSQL_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --thl-port=$MASTER_THL_PORT \
  --rmi-port=$MASTER_RMI_PORT \
  --java-file-encoding=UTF8 \
  --mysql-use-bytes-for-string=false \
  --mysql-enable-enumtostring=true \
  --mysql-enable-settostring=true \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=none --start-and-report

We can run the installation command for the master.

$ ./install_master.sh
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> ..
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 1.218
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Step 4: install a slave replicator using MongoDB

Now we have a master database server with a replicator that can send data across for any slave to pick it up. We can install one or more MySQL slaves, but this is not the purpose of this exercise. So we skip any regular slave installation and will install a MongoDB slave. To do so, we create a third script, which will invoke the installer to run a replicator slave service that applies data to MongoDB.

$ cat install_slave.sh
. ./DEPLOY.sh

cd $TUNGSTEN_BINARIES

export PATH=$MYSQL_BASEDIR/bin:$PATH

tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=127.0.0.1  \
  --service-name=mongodb \
  --home-directory=$MONGODB_DEPLOY \
  --cluster-hosts=127.0.0.1 \
  --datasource-port=$MONGODB_PORT \
  --master-thl-port=$MASTER_THL_PORT \
  --thl-port=$SLAVE_THL_PORT \
  --rmi-port=$SLAVE_RMI_PORT \
  --java-file-encoding=UTF8 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=none --start-and-report

Compared to MySQL installation, we see that the 'datasource-type' is 'mongodb', and as such it does not need username and password. Let's run it.

$ ./install_slave.sh
WARN  >> 127.0.0.1 >> Currently unable to check for the THL schema in mongodb
INFO  >> 127_0_0_1 >> Getting services list
INFO  >> 127_0_0_1 >> Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 0
appliedLatency  : 49.444
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

We get a warning, which should not worry us, as it is expected. Notice that the high latency reported by the replicator is due to the delay in installing the slave. The only event in the replication pipeline is the 'master online' broadcast, which was replicated after the installation was completed.

If you got to this point, with both the master and slave replicator in the 'ONLINE' state, then you can continue with checking that replication works.

Step 5: check replication

Replication from MySQL (a relational DBMS) to MongoDB (a key-value store) requires some understanding of what we are replicating and what limitations we can face.

In MySQL, we have tables, which are translated to MongoDB collections. What is very important to understand is that Tungsten Replicator DOES NOT REPLICATE DDL EVENTS. Any CREATE/DROP/ALTER events are ignored by the replicator.

However, the replicator is smart enough to convert every table record into a document within the relevant collection. Every insert will generate a document inside MongoDB. Let's see an example:

mysql [localhost] {msandbox} (test) > create table myfirst (id int not null primary key, name char(30), d date);
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (test) > insert into myfirst values (1, 'Harry Potter', '1997-06-30');
Query OK, 1 row affected (0.00 sec)

On the MongoDB side, we will see the corresponding object:

> show dbs
local 0.078125GB
test 0.203125GB
tungsten_mongodb 0.203125GB
> use test
switched to db test
> show collections
myfirst
system.indexes
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter", "d" : "1997-06-30" }

Not only inserts, but also updates are recognised:

mysql [localhost] {msandbox} (test) > insert into myfirst values (2, 'Harry Potter 2', '1999-06-02');
Query OK, 1 row affected (0.00 sec)

mysql [localhost] {msandbox} (test) > update myfirst set name = 'Harry Potter 1' where id =1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Which will result in:

> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }
{ "_id" : ObjectId("51a287633004fd7959a5f5ab"), "id" : "2", "name" : "Harry Potter 2", "d" : "1999-06-02" }

And delete statements:

mysql [localhost] {msandbox} (test) > delete from myfirst  where id =2;
Query OK, 1 row affected (0.00 sec)

#####
> db.myfirst.find()
{ "_id" : ObjectId("51a2867d3004fd7959a5f5aa"), "id" : "1", "name" : "Harry Potter 1", "d" : "1997-06-30" }

This is all fine. But what happens when we insert quite a lot of data? Let's find out. We're going to use the test employees database to see if the system complains.

$ cd ~/data/employees/
$  ~/sandboxes/msb_5_5_31/use < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries

To see if replication is OK, we check both master and slave for their status, and then we count the records in MySQL and MongoDB.

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 1.152
role            : master
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 106
appliedLatency  : 93.469
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

So, we see that the slave is lagging, as the data is hard to digest (about 4 million records). Let's try again after a few seconds:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 182
appliedLatency  : 89.99
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

Both the master and the slave have reached the same transaction number. The slave has some delay, due to the size of the transactions (which become even bigger because of the RBR). When the size of the transaction is small, the latency becomes quite acceptable:

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $MASTER_RMI_PORT heartbeat

$ ./mongodb/tungsten/tungsten-replicator/bin/trepctl -port $SLAVE_RMI_PORT services
Processing services command...
NAME              VALUE
----              -----
appliedLastSeqno: 183
appliedLatency  : 0.423
role            : slave
serviceName     : mongodb
serviceType     : local
started         : true
state           : ONLINE
Finished services command...

As a last operation, let's check the numbers of the biggest table in the employees database:

$ ~/sandboxes/msb_5_5_31/use  employees
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.5.31-log MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost] {msandbox} (employees) > show tables;
+---------------------+
| Tables_in_employees |
+---------------------+
| departments         |
| dept_emp            |
| dept_manager        |
| employees           |
| salaries            |
| titles              |
+---------------------+
6 rows in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from employees;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.22 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from dept_emp;
+----------+
| count(*) |
+----------+
|   331603 |
+----------+
1 row in set (0.28 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from titles;
+----------+
| count(*) |
+----------+
|   443308 |
+----------+
1 row in set (0.25 sec)

mysql [localhost] {msandbox} (employees) > select count(*) from salaries;
+----------+
| count(*) |
+----------+
|  2844047 |
+----------+
1 row in set (0.83 sec)

####

> use employees
switched to db employees
> show collections
departments
dept_emp
dept_manager
employees
salaries
system.indexes
titles
> db.employees.count()
300024
> db.dept_emp.count()
331603
> db.titles.count()
443308
> db.salaries.count()
2844047
>

This proves that replication works, and even large amount of data are replicated reasonably fast.

Here's the official documentation about MySQL to MongoDB replication: https://docs.continuent.com/wiki/display/TEDOC/Replicating+from+MySQL+to+MongoDB. There you will find more details on how to install and what to expect. Happy hacking!

5 comments:

grails database fan said...

Hi, Giuseppe. Which typical scenario will we use replicating to MongoDB? Is it mostly do read operations on data stored in Mongo for performance benefit? Are people using this in some projects?

Giuseppe Maxia said...

@Grails,
The typical scenario is someone who wants the reliability of a RDBMS for their transactions and some secondary service using MongoDB.

Austin Rivas said...

Hey Giuseppe,
Love your walkthrough, very thorough!

I am in the process of setting up MySQL to MongoDB replication for a proof of concept, but I am getting hung up on MySQL install. MySQL is currently our database of record, but we are looking to move our reporting and a few other tasks to MongoDB because of the flexibility that document based DB's allow.

Specifically when I run `./install_master` I receive an error from tungsten saying it was unable to authenticate.

My environment meets all of the system requirements for tungsten, although my open file limit is set to 1024, but I don't think that is the problem.

I have attached my console output below.

Thanks in advance for any insight you can offer,
- Austin

[austin@austin:~/deploy]
[17:03:29] $ ./install_master.sh
WARN >> This command has been DEPRECATED
WARN >> Switch to the tpm script for managing installation and upgrade
WARN >> 127.0.0.1 >> THL schema tungsten_mongodb already exists at tungsten@127.0.0.1:5531 (WITH PASSWORD)
ERROR >> 127.0.0.1 >> Unable to connect to the MySQL server using tungsten@127.0.0.1:5531 (WITH PASSWORD)
#####################################################################
# Validation failed
#####################################################################
#####################################################################
# Errors for 127.0.0.1
#####################################################################
ERROR >> 127.0.0.1 >> Unable to connect to the MySQL server using tungsten@127.0.0.1:5531 (WITH PASSWORD) (MySQLLoginCheck)

Giuseppe Maxia said...

See recently updated installation commands at slideshare

Anonymous said...

@Giuseppe,
Hello, it is interest topic and I appreciate your effort!

But there is tiny typo error on link you made;system requirements for Tungsten <- this link has double h, it looks like hhttps://blahblah~~
;)

Vote on Planet MySQL