The Data Charmer: monitoring


Monday, August 31, 2015

MySQL replication in action - Part 5 - parallel appliers

Previous episodes:

MySQL replication monitoring 101

MySQL replication in action - Part 1: GTID & Co

MySQL replication in action - Part 2 - Fan-in topology

MySQL replication in action - Part 3 - All-masters P2P topology

MySQL replication in action - Part 4 - star and hybrid topologies

Parallel replication overview

One of the main grievances of replication users is that, while a well-tuned master server can handle thousands of concurrent operations, an equally tuned slave is constrained to work on a single thread. Figure 1 shows the schematics of this paradigm. Multiple operations on the master are executed simultaneously and saved to the binary log. The slave IO thread copies the binary log events to a local relay log, and the SQL thread reads that log and executes the events on the slave database. When the master is very active, chances are that the slave lags behind, causing hatred and nightmares for the DBAs.

Figure 1 - Single applier

MySQL replication in action - Part 1: GTID & Co

In the theoretical part of this series, we have seen the basics of monitoring. In that article, though, we barely mentioned the new tools available in MySQL 5.7 and MariaDB 10. Let’s start with something that has the potential to dramatically change replication as we know it.

Crash-safe tables and Global transaction identifiers in MySQL 5.6 and 5.7

Global transaction identifiers (GTID) is a feature that has been on my wish list for a long time, since the times I was working with the MySQL team. By the time I left Oracle, this feature was not even in the plans.
When MySQL 5.6 was first disclosed, the biggest improvement for replication was the introduction of crash-safe tables (see Status persistence in Monitoring 101). There are two tables in the mysql database, named slave_master_info and slave_relay_log_info. At the beginning, these tables used the MyISAM engine, thus defeating the purpose of making them crash-safe. In later versions, the developers decided to bite the bullet and create these tables with InnoDB from the beginning.
These two tables allow us to see the same information previously stored in the files master.info and relay-log.info. What makes these tables convenient is that they should survive a crash better than the standalone files.
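
As a minimal sketch of how to look at these tables, the function below only prints the two queries. It assumes that the slave was started with master_info_repository=TABLE and relay_log_info_repository=TABLE, because otherwise the tables stay empty. The function name is made up for this example.

```shell
#!/bin/bash
# Sketch: print the queries that read the crash-safe replication tables.
# Assumption: the slave runs with master_info_repository=TABLE and
# relay_log_info_repository=TABLE, so the tables are actually populated.
crash_safe_status_sql() {
    cat <<'SQL'
SELECT * FROM mysql.slave_master_info;
SELECT * FROM mysql.slave_relay_log_info;
SQL
}

# Print the statements. To run them, pipe the output to the mysql client
# connected to the slave (add -E for vertical output).
crash_safe_status_sql
```

Printing the SQL instead of running it keeps the connection details (user, password, host, port) out of the example.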

MySQL replication monitoring 101

Replication is the process that transfers data from an active master to a slave server, which reproduces the data stream to achieve, as best as possible, a faithful copy of the data in the master.

To check replication health, you may start with sampling the service, i.e. committing some sentinel value in the master and retrieving it from the slave.

Sentinel data: Tap tap… Is this thing on?

If you want to make sure that replication is working, the easiest test is using replication itself to see if data is being copied across from the master to the slaves. The method is easy:

Make sure that the data you want to see is NOT in the master or in the slave. If you skip this step, you may think that replication is working, while in fact it may not.
Either create a table in the master or use a table that you know exists both in the master and the slave.
Insert several records in the master table.
Check that they are replicated in the slave correctly.
Update a record in the master.
Watch it changing in the slave.
Delete a record in the master.
Watch it disappear in the slave.
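
The steps above can be sketched in a shell function. This is only an illustration under stated assumptions: run_master and run_slave are hypothetical wrappers that send one SQL statement to the master or to the slave and print the result, and the table test.sentinel (with a single column named note) is an invented name that must already exist on both servers. The sketch covers the insert and delete steps and leaves the update out for brevity.

```shell
#!/bin/bash
# Sketch of a sentinel test.
# Assumptions (not part of any standard tool):
#   run_master / run_slave : wrappers that execute one SQL statement, e.g.
#       run_master() { mysql -h master_host -BN -e "$1"; }
#   test.sentinel          : CREATE TABLE test.sentinel (note VARCHAR(64) PRIMARY KEY)
sentinel_check() {
    local tries=${1:-10} found
    local token="sentinel_$$_$(date +%s)"   # a value that should not exist yet

    # Make sure the value is NOT already in the slave.
    found=$(run_slave "SELECT COUNT(*) FROM test.sentinel WHERE note='$token'")
    [ "$found" = "0" ] || { echo "sentinel already present"; return 2; }

    # Insert the record in the master.
    run_master "INSERT INTO test.sentinel (note) VALUES ('$token')"

    # Check the slave once per second until the record shows up.
    while [ "$tries" -gt 0 ]; do
        found=$(run_slave "SELECT COUNT(*) FROM test.sentinel WHERE note='$token'")
        if [ "$found" = "1" ]; then
            # Clean up in the master; the delete is replicated as well.
            run_master "DELETE FROM test.sentinel WHERE note='$token'"
            echo "replication OK"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    echo "sentinel not replicated"
    return 1
}
```

Call it as sentinel_check 30 to wait up to 30 seconds. The return code is 0 when the record reached the slave, 1 when it did not, and 2 when the first precaution failed.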

Refactored again: poor man's MySQL replicator monitor

I saw that both Haidong Ji and Geert VanderKelen have proposed a Python monitor for MySQL replication, calling it the "poor man's version".
See Poor man’s MySQL replication monitoring and Geert's Refactored: Poor man’s MySQL replication monitoring.
Having Python on your server doesn't really qualify as "poor man". In many cases it's a luxury, and thus, here's my shot at the problem, using a Bash shell script.
Unlike its Python-based competition, this version also checks that the slave is replicating from the intended master, and that it is not lagging behind.

#!/bin/bash

USERNAME=msandbox
PASSWORD=msandbox
EXPECTED_MASTER_HOST=127.0.0.1
EXPECTED_MASTER_PORT=27371

SLAVE_HOST=127.0.0.1
SLAVE_PORT=27372

MYSQL="mysql -u $USERNAME -p$PASSWORD "
MASTER="$MYSQL -h $EXPECTED_MASTER_HOST -P $EXPECTED_MASTER_PORT"
SLAVE="$MYSQL -h $SLAVE_HOST -P $SLAVE_PORT"

$MASTER -e 'SHOW MASTER STATUS\G' > mstatus
$SLAVE -e 'SHOW SLAVE STATUS\G' > sstatus

function extract_value {
    FILENAME=$1
    VAR=$2
    grep -w "$VAR" "$FILENAME" | awk '{print $2}'
}

Master_Binlog=$(extract_value mstatus File )
Master_Position=$(extract_value mstatus Position )

Master_Host=$(extract_value sstatus Master_Host)
Master_Port=$(extract_value sstatus Master_Port)
Master_Log_File=$(extract_value sstatus Master_Log_File)
Read_Master_Log_Pos=$(extract_value sstatus Read_Master_Log_Pos)
Slave_IO_Running=$(extract_value sstatus Slave_IO_Running)
Slave_SQL_Running=$(extract_value sstatus Slave_SQL_Running)

ERROR_COUNT=0
if [ "$Master_Host" != "$EXPECTED_MASTER_HOST" ]
then
    ERRORS[$ERROR_COUNT]="the slave is not replicating from the host that it is supposed to"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

if [ "$Master_Port" != "$EXPECTED_MASTER_PORT" ]
then
    ERRORS[$ERROR_COUNT]="the slave is not replicating from the port that it is supposed to"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

if [ "$Master_Binlog" != "$Master_Log_File" ]
then
    ERRORS[$ERROR_COUNT]="master binlog ($Master_Binlog) and Master_Log_File ($Master_Log_File) differ"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

# shell arithmetic avoids the dependency on bc; empty values count as 0
POS_DIFFERENCE=$(( Master_Position - Read_Master_Log_Pos ))

if [ "$POS_DIFFERENCE" -gt 1000 ]
then
    ERRORS[$ERROR_COUNT]="The slave is lagging behind by $POS_DIFFERENCE bytes"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

if [ "$Slave_IO_Running" == "No" ]
then
    ERRORS[$ERROR_COUNT]="Replication is stopped"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

if [ "$Slave_SQL_Running" == "No" ]
then
    ERRORS[$ERROR_COUNT]="Replication (SQL) is stopped"
    ERROR_COUNT=$(($ERROR_COUNT+1))
fi

if [ $ERROR_COUNT -gt 0 ]
then
    EMAIL=myname@gmail.com
    SUBJECT="ERRORS in replication"
    BODY=''
    CNT=0
    while [ "$CNT" != "$ERROR_COUNT" ]
    do
        BODY="$BODY ${ERRORS[$CNT]}"
        CNT=$(($CNT+1))
    done
    echo $SUBJECT
    echo $BODY
    echo $BODY | mail -s "$SUBJECT" $EMAIL
else
    echo "Replication OK"
    printf "file: %s at %'d\n" $Master_Log_File  $Read_Master_Log_Pos
fi

Monday, February 14, 2011

How to detect if a MySQL server is an active replication slave

Sometimes you know for sure. And sometimes you wonder: Is this server part of a replication system? And, most specifically, is it an active slave?
The completeness of the answer depends on how much visibility you have on the server.
If you can ask the DBA, and possibly have access to the server data directory and configuration file, you can get a satisfactory answer. But if your access is limited to SQL access, things get a bit more complicated.
If you have the SUPER or REPLICATION CLIENT privilege, then it's easy, at least on the surface.
SHOW SLAVE STATUS will tell you if the slave is running. An empty set means that the server was not configured as a slave.
The answer is not absolute, though. You need to read the output of SHOW SLAVE STATUS to understand if replication is under way.
For example, what is the difference between these two listings?


## Listing 1
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: QA1
                  Master_User: tungsten_slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: QA2-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 106
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
## Listing 2
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: QA1
                  Master_User: test
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: QA2-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 125
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error:

They look almost the same, and yet the similarity is deceiving. The first listing is what you get immediately after a call to CHANGE MASTER TO. If you run START SLAVE at this point, replication will start flowing.
The second listing is what you get immediately after a call to RESET SLAVE. The crucial difference is that RESET SLAVE removes the two .info files containing replication credentials and positions. A call to START SLAVE in this scenario will only get you an error, as the slave does not know where and how to connect.
So, in this case, SQL visibility only tells you that the server is not receiving replication data, and that it was configured as a slave at least once. The telltale detail is the user name ("test"), which should give you a hint that something fishy is going on. Unless you have called your user "test", in which case you were asking for trouble. I would say that this situation is a bug. RESET SLAVE should remove every memory of the slave configuration, and instead it keeps only the host name. Although it is not clear in this particular example, it also forgets the master connection port.
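
To make the reasoning concrete, here is a small sketch that classifies the text printed by SHOW SLAVE STATUS\G. The function name and the three labels are mine. Notice that it cannot tell the two listings apart, since both fall in the "configured but not replicating" case, which is exactly the limit of SQL-only visibility described above.

```shell
#!/bin/bash
# Sketch: classify the output of SHOW SLAVE STATUS\G, read from standard input.
# The function name and the three labels are invented for this example.
slave_state() {
    local status io sql
    status=$(cat)
    if [ -z "$status" ]; then
        echo "not configured"      # empty set: never set up as a slave
        return 0
    fi
    # Exact field match, so similar names such as Slave_SQL_Running_State
    # in newer versions are not picked up by mistake.
    io=$(echo "$status"  | awk '$1 == "Slave_IO_Running:"  {print $2}')
    sql=$(echo "$status" | awk '$1 == "Slave_SQL_Running:" {print $2}')
    if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
        echo "active"
    else
        echo "configured but not replicating"
    fi
}
```

For example, mysql -e 'SHOW SLAVE STATUS\G' | slave_state prints one of the three labels.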

Now, if your purpose was to set up replication with different coordinates, the good news is that in both cases a well formed call ^(*) to CHANGE MASTER TO will do what you expect, i.e. it will establish the credentials to the master, so that a further invocation of START SLAVE will let replication data flow.

^(*) By "well formed" I mean a call that includes host, port, username, password, binary log file and position, and possibly any other information that you need to get the slave to work.
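
As an illustration of such a well formed call, the following sketch builds the statement from six arguments. The function name is invented, and the values are inserted without any escaping, so it is only suitable for trusted input.

```shell
#!/bin/bash
# Sketch: print a "well formed" CHANGE MASTER TO statement.
# Arguments: host port user password binlog_file binlog_position
# Note: values are not escaped, so quotes inside them would break the SQL.
change_master_sql() {
    if [ $# -ne 6 ]; then
        echo "usage: change_master_sql host port user password file position" >&2
        return 1
    fi
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=%d, MASTER_USER='%s', MASTER_PASSWORD='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d;\n" \
        "$1" "$2" "$3" "$4" "$5" "$6"
}
```

The output can be reviewed first and then piped to the mysql client on the slave, followed by START SLAVE.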

Saturday, January 22, 2011

Pitfalls of monitoring MySQL table activity with stored routines

A friend of mine needed a quick method to monitor a table that was frequently updated. There were several procedures writing to the table several times per second. He needed a quick way of determining how many updates per second the table was getting.
The table has only one row, which includes a counter that is increased at each operation. Therefore, the simple plan was:

Let's get the counter from the table;
Allow N seconds to pass;
Get the counter again;
The difference between the second counter and the first counter, divided by the number of seconds gives the updates per second.

The plan makes sense, and if you run the above commands manually, you get what you want.

However, my friend wanted the update ratio to be a single operation, say like:


SELECT update_ratio();

He went to make a simple function, following the four steps described above.


delimiter //
drop function if exists update_ratio //
create function update_ratio()
RETURNS INT 
begin
     declare sleep_wait int default 5;
     declare start int default 1;
     declare finish int default 1;

     set start = (select counter from mytable);
     do sleep(sleep_wait);
     set finish = (select counter from mytable);
     return (finish-start)/sleep_wait;
end //
delimiter ;

It seems OK. The function runs without errors, but it always returns zero.
Mystery! Running the statements manually always gives a sensible result. Using triggers to monitor the table shows that indeed it is updated many times per second, but the function always returns zero.
More puzzling is the fact that if we convert the function to a procedure, it gives the wanted result.

The solution to the mystery is found in the MySQL online manual

A stored function acquires table locks before executing, to avoid inconsistency in the binary log due to mismatch of the order in which statements execute and when they appear in the log.

In other words, all tables referenced in a stored function are locked when the function starts. Therefore the external procedures that were updating the table have to wait until the function ends before updating. When the function reads from the table, it always gets the same counter, because no updates happened in the meantime. That's why the second read is the same as the first one, and the result is zero.

What should you do then?
One option is to convert the function into a procedure:


delimiter //
drop procedure if exists show_update_ratio //
create procedure show_update_ratio()
begin
     declare sleep_wait int default 5;
     declare start int default 1;
     declare finish int default 1;
    
     select counter into start from mytable;
     do sleep(sleep_wait);
     select counter into finish from mytable;
     SET @UPS := (finish-start)/sleep_wait;
end //
delimiter ;

That gets the job done. If you want to get the result into a variable, you can do it with two statements.


call show_update_ratio();
select @UPS;

If you don't, change the last SET into a SELECT and just display the value.

Another option is using several SQL commands from your application. Also in this case, make sure that you are NOT wrapping this code inside a transaction, or you will get the same result in both queries (with InnoDB's default REPEATABLE READ isolation level, both reads see the same snapshot).

# WRONG!
     set autocommit=0;
     BEGIN;
     select counter into @start from mytable;
     do sleep(5);
     select counter into @finish from mytable;
     select (@finish - @start) / 5 as UPS;

If you go for this solution (or even the stored procedure), make sure that you are either using autocommit, or commit after each query if you must use a transaction.
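
Here is a minimal sketch of the application-side approach, written as a shell function. It assumes a hypothetical wrapper named read_counter that prints the current counter. Each call is a separate autocommit statement, so the two reads do not share a transaction. Like the function above, it returns an integer result.

```shell
#!/bin/bash
# Sketch: measure updates per second from outside the server.
# Assumption: read_counter is a wrapper (not a standard command) that prints
# the current counter, for example:
#     read_counter() { mysql -BN -e 'SELECT counter FROM mytable'; }
# Each call is its own autocommit statement, so no snapshot is shared.
update_ratio() {
    local seconds=${1:-5} start finish
    start=$(read_counter)
    sleep "$seconds"
    finish=$(read_counter)
    # Integer division, like the INT returned by the stored function.
    echo $(( (finish - start) / seconds ))
}
```

Calling update_ratio 10 samples the counter over ten seconds instead of the default five.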

Monday, September 01, 2008

Introducing the MySQL community-driven Replication Monitoring Tools

If you are using MySQL replication, you know how hard it is to monitor it properly.
You have a wide choice of commercial and free tools, all of which check the health of your replication system from the outside.
A few years ago, I wrote an article advocating a self-monitoring and self-healing replication system, using new features in MySQL 5.1. At the time, there were some missing technology pieces to make this project feasible. Now the pieces exist, and you can create your own self monitoring replication system.

Hartmut rules!

It started during FrOSCon, when I discussed with Hartmut Holzgraefe the practical application of a plugin that he wrote some time ago. One of the missing pieces for the internal replication monitoring is the ability to read replication values into variables. Currently, you can see the replication status with SHOW MASTER STATUS and SHOW SLAVE STATUS. Unfortunately, you can't manipulate the values from these commands inside stored routines. Thus there is no way of measuring the health of a replication system without the help of external programming languages.
This means that, even if you manage to detect a master failure, you can't create a CHANGE MASTER TO statement with the right parameters.
During our FrOSCon talks, Hartmut released a small Launchpad project, the MySQL replication status INFORMATION_SCHEMA plugin, which implements two INFORMATION_SCHEMA tables for MASTER and SLAVE STATUS. Using these extensions, you can get single values inside a stored routine.

The Replication Monitor is born

From this starting point, the Replication Monitoring Tools become possible.
The project is young, and it only contains a proof-of-concept monitoring tool (see below), but given enough time and help, it can be expanded into a full-fledged system.
The roadmap has a long list of possible features, some of which are quite ambitious.

The project is open. I need ideas, manpower, testers, to implement all the intended features. This is a project from the community to the community.
The principle that you have seen several times when downloading MySQL server fully applies. If you have time, you can save money and help yourself and the community at the same time.

Replication Monitor in action

The proof of concept implements a table on the master, where the slaves write their status, by means of federated tables.
Each slave uses a federated table to the master INFORMATION_SCHEMA.MASTER_STATUS to compare its status with the master, and a second federated table to write its status to the master table. The result is that you can read the global status of a replication system in the master, and in each slave (because the table is replicated).

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |      82321 | IO+,SQL+,P+,E+ |
|       102 | mysql-bin.000001 |      82530 | IO+,SQL+,P+,E+ |
|       103 | mysql-bin.000001 |      82739 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000001 |      82948 | IO+,SQL+,P+,E+ |
+-----------+------------------+------------+----------------+

The status reads as follows:
IO+ means that the slave IO thread is working (the opposite is io-).
SQL+ means that the SQL thread is working.
P+ means that the slave is reading from the latest binlog and that the position is equal to or greater than the one shown by the master.
E+ means that the slave has executed all the statements received from the master.
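
If you want to turn these flags into plain words, for instance in a report, a sketch like the following can help. The function name and the wording of each message are mine, and the meaning of the lowercase flags is inferred as the opposite of the uppercase ones.

```shell
#!/bin/bash
# Sketch: explain a slave_status string such as "IO+,SQL+,P+,e-".
# Uppercase with "+" means healthy; lowercase with "-" is read as the opposite.
# The messages are invented for this example.
explain_status() {
    local flag
    local IFS=','
    for flag in $1; do
        case "$flag" in
            IO+)  echo "IO thread running" ;;
            io-)  echo "IO thread stopped" ;;
            SQL+) echo "SQL thread running" ;;
            sql-) echo "SQL thread stopped" ;;
            P+)   echo "position up to date with the master" ;;
            p-)   echo "position behind the master" ;;
            E+)   echo "all received statements executed" ;;
            e-)   echo "received statements still pending" ;;
            *)    echo "unknown flag: $flag" ;;
        esac
    done
}
```

For example, explain_status "io-,sql-,p-" prints three lines, one for each flag.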
Let's try a small experiment. We stop slave 2 and see what happens.

slave2> stop slave;

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |      91517 | IO+,SQL+,P+,E+ |
|       102 | mysql-bin.000001 |      91517 | io-,sql-,p-    |
|       103 | mysql-bin.000001 |      91932 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000001 |      92141 | IO+,SQL+,P+,E+ |
+-----------+------------------+------------+----------------+

The monitor shows that slave 2 is not working. If we restart the slave, the situation is restored.

slave2> start slave;

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |     114894 | IO+,SQL+,P+,E+ |
|       102 | mysql-bin.000001 |     115104 | IO+,SQL+,P+,E+ |
|       103 | mysql-bin.000001 |     115314 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000001 |     115524 | IO+,SQL+,P+,E+ |
+-----------+------------------+------------+----------------+

We can do the same experiment while loading some heavy data, like the employee sample database.

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |     134158 | IO+,SQL+,P+,E+ |
|       102 | mysql-bin.000001 |     134975 | IO+,SQL+,p-    |
|       103 | mysql-bin.000001 |     133777 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000001 |    1165155 | IO+,SQL+,P+,e- |
+-----------+------------------+------------+----------------+

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |    7343962 | IO+,SQL+,P+,e- |
|       102 | mysql-bin.000001 |    8374099 | IO+,SQL+,P+,e- |
|       103 | mysql-bin.000001 |    7343751 | IO+,SQL+,P+,e- |
|       104 | mysql-bin.000001 |    8374310 | IO+,SQL+,P+,e- |
+-----------+------------------+------------+----------------+

Very rapidly (the experiment happens in a MySQL Sandbox, so all servers use the same disk and CPU), the slaves are left behind in execution.
If we stop a slave while loading, the situation is even more explicit.

slave3> stop slave;

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000001 |   45764491 | IO+,SQL+,P+,e- |
|       102 | mysql-bin.000001 |   45764703 | IO+,SQL+,P+,e- |
|       103 | mysql-bin.000001 |   42685103 | io-,sql-,p-    |
|       104 | mysql-bin.000001 |   42685103 | IO+,SQL+,P+,e- |
+-----------+------------------+------------+----------------+

Let's introduce a new element of difference, and flush logs while still loading data.

master> flush logs;

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000002 |    2044673 | IO+,SQL+,P+,e- |
|       102 | mysql-bin.000002 |    3066965 | IO+,SQL+,P+,e- |
|       103 | mysql-bin.000001 |   42685103 | io-,sql-,p-    |
|       104 | mysql-bin.000002 |    3067176 | IO+,SQL+,P+,e- |
+-----------+------------------+------------+----------------+

Now the stopped slave is really far behind. Let's see what happens when we put it back online.

slave3> start slave;

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000002 |   27604369 | IO+,SQL+,P+,e- |
|       102 | mysql-bin.000002 |   27603945 | IO+,SQL+,P+,e- |
|       103 | mysql-bin.000002 |   27604157 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000002 |   25558385 | IO+,SQL+,P+,e- |
+-----------+------------------+------------+----------------+

The restored slave was the first one to catch up in execution. This would probably not happen in a situation with separate hosts for each slave, but it's interesting to test our system.
And finally, after a few seconds more, all slaves have caught up with the master, with both data fetching and execution.

master> select * from slaves_status ;
+-----------+------------------+------------+----------------+
| server_id | binlog_file      | binlog_pos | slave_status   |
+-----------+------------------+------------+----------------+
|       101 | mysql-bin.000002 |  104641288 | IO+,SQL+,P+,E+ |
|       102 | mysql-bin.000002 |  104641501 | IO+,SQL+,P+,E+ |
|       103 | mysql-bin.000002 |  104641714 | IO+,SQL+,P+,E+ |
|       104 | mysql-bin.000002 |  104641927 | IO+,SQL+,P+,E+ |
+-----------+------------------+------------+----------------+

Getting started

There is a step-by-step set of instructions on the MySQL Forge Wiki that explains how to build and install the necessary plugin.
After that, you should get the code from the Bazaar repository and make the changes in the option files of master and slaves (check the sql/master_scripts and sql/slave_scripts directories). Then, you should load the initializing script in the master and in each slave, and you are in business.
Notice that this first version relies on a MySQL Sandbox running with MySQL 5.1.28, compiled from source. Later versions will lift this limitation, but for now, if you want to try it out, you need to follow my steps closely.

Next steps

What's left? Ah, yes. A totally internal monitoring system is not very helpful. If the DBA needs to run a query to know what's happening, then the system is almost worthless.
This is another challenge, because the MySQL architecture forbids connection to operating system services such as mail applications. There is already a solution to this problem. We only need to integrate it with the rest of the replication monitoring system. More challenges are outlined in the roadmap.
If you think this is interesting, check the roadmap, and comment on it. Even better, if you want to implement parts of it, join the sandbox developers group and start coding!


Wednesday, August 12, 2015

MySQL replication in action - Part 1: GTID & Co

Crash-safe tables and Global transaction identifiers in MySQL 5.6 and 5.7

Thursday, July 30, 2015

MySQL replication monitoring 101

Sentinel data: Tap tap… Is this thing on?

Thursday, April 07, 2011

Refactored again: poor man's MySQL replicator monitor



Giuseppe Maxia
