The Data Charmer: MySQL 5.7 : Playing with mysqlpump

Monday, September 28, 2015

MySQL 5.7 : Playing with mysqlpump

MySQL 5.7 comes with a new backup tool, named mysqlpump, which is almost the same as mysqldump with the ability of extracting data in parallel threads.

I tried a little experiment. Using a server containing 11 databases, with a total of 300 tables and about 20 million rows (roughly ≈ 10GB,) I used both mysqldump and mysqlpump to get a backup.

mysqldump --all-databases  > dump.sql

mysqlpump --all-databases \
    --add-drop-database --add-drop-table --skip-watch-progress \
    --default-parallelism=10 \
    --parallel-schemas=db,db1,db2 \
    --parallel-schemas=db3,db4,db5 \
    --parallel-schemas=db6,db7,db8 \
    --parallel-schemas=db9,db10 > pump.sql

The backup with mysqldump took 3 minutes and 33 seconds. The one with mysqlpump took 2 minutes and 55 seconds (saving 38 seconds). This does not seem to be a great gain. I experimented with several values of default-parallelism and different grouping of databases, and also without any parameters at all, but I always get the same time.

If there is a different way of invoking mysqlpump to use parallelism better, I would like to know.

There are four interesting points about mysqlpump that users should know:

mysqlpump has options to include and exclude objects (databases, tables, routines, users) from the backup. This is a long awaited feature that will be welcome by many DBAs.
The option --no-data is called --skip-dump-rows. (Just in case you want to use the new tool alternate way of reproducing DDL. But be aware that there is at least one bug)
A backup created with mysqlpump can only be loaded into a database of the same name. This is due to the parallel work, which requires that the INSERT statements contain both the database and the table names. But it means that, unlike with mysqldump, you can't backup tables from database X and load them to database Y.
The most serious limitation of mysqlpump, which I have seen both in the manual and in a blog article is that, while the backup is parallelized, the restore is serialized. Both sources say to run "mysqlpump > file.sql" and "mysql < file.sql". What is the advantage of extracting data with N parallel threads if I then need to apply it with a single thread? I would have expected an option to create N files, which I can then load using several background tasks, or even better an option in the mysql client to handle parallel backup files. I may be missing something here. I will appreciate comments by more savvy users.

The idea is good. The tool still has some rough edges, but I am sure it can be improved.

10 comments:

jynus said...: No, the greatest limitation is the lack of data consistency ("Current Limitations" on the blog post you link to). Yes, I can depool and stop replication on a slave, but that is a big assumption...; September 28, 2015 at 1:49:00 PM GMT+2
Giuseppe Maxia said...: @jynus,
I can live with the speed vs. consistency trade-off, because there is a simple workaround (as you mention, backup from a slave) but the main feature for this tool is parallel processing, which then gets lost if the output goes to a single stream.

I could easily achieve parallelism in both backup and restore by running several copies of mysqldump in the background, one for a sizable chunk of the whole data, and then run the same number of mysql clients for restore, but I will have to do this from an offline slave, and in that case I would not need the new tool.

My point is that parallel extraction without any help for later parallel apply does not look very useful.; September 28, 2015 at 2:29:00 PM GMT+2
Mike said...: What about those that backup their db regularly, but rarely need to restore? Still a gain, I'd say.; September 28, 2015 at 6:00:00 PM GMT+2
Giuseppe Maxia said...: @Mike,
I don't buy it. A backup should be tested regularly to make sure it works. In this scenario, I would like a method that makes my restore as quick as possible.; September 28, 2015 at 6:17:00 PM GMT+2
Morgan Tocker said...: Out of curiosity, did you test for a difference in restore time?

mysqlpump uses some smart optimizations around InnoDB tables (delay secondary index creation until after data is loaded) , so even though it is a single-threaded restore, it should still be faster :); September 29, 2015 at 12:41:00 AM GMT+2
Giuseppe Maxia said...: @Morgan,
I wanted to run the reload but I was blocked by bug#78593.

Anyway, I repaired mysqlpump output using a one-liner to fix the ALTER TABLE statements, and I ran the comparison.

The good news is that mysqlpump recovers data correctly. The checksum of the original data and the restored data match.
The bad news is that mysqlpump is almost 4 times slower than mysqldump 232 minutes against 67.

Examining the dump files, I see that mysqldump creates about 8,500 INSERT statements, while mysqlpump generates almost 10 times as much. I guess it is a result of the parallelism.

So, it seems that there is still some work to do.; September 29, 2015 at 1:38:00 PM GMT+2
Marcin Babij said...: Hello Giuseppe,
running test with 40+ threads (4 parallel-schemas * 10 default threads) can be not a good idea. I'd start with --default-parallelism=2 or 4 and no --parallel-schemas options specified (these should be used for example when you know that certain schemas are on separate disk drives). You can also test it with value of 0 or 1.
Second, terminals and standard library can threat standard output differently (less efficiently), and as this kind of work bases on disk writes, consider trying --result-file option instead of output redirection - in your case you compare both dump and pump using redirection, so the impact on result is smaller, but still there is some. /dev/null is also a good place if you don't want to test your disk instead of the real subject of test.
Also, please make sure the disk I/Os both on server and client are not a bottlenecks - then adding more threads won't help much (and in fact can degrade reading performance on server, especially on magnetic HDDs). If client and server are on separate machines please check if network isn't bottleneck - you seem to have 57MB/s (about 500mbps) transfer to disk already.
Please keep in mind, that processing this data takes a lot of CPU on both server and client sides, so if you have server and client on the same machine with 2 (or with HyperThreading 2x2) cores, you may not achieve good results.
Overall, we claim that it shouldn't be slower than mysqldump (you can try single-threaded mode, with --default-parallelism=0 or 1 and no --parallel-schemas) and how much you gain in most cases is bounded with your hardware abilities. On the other hand neither mysqlpump nor MySQL Server scale linearly, but making dump take 20% of time that mysqldump needs shouldn't be a problem (for example on 16+ core machine where server is located too, using about 8 threads in mysqlpump).
As to the features, one might get rows data being written to separate multiple files, but that didn't make it to first release, especially that it wouldn't be so obvious to use (some queries must be executed before adding rows, some, like creating secondary indexes, later).
Multi-threaded import - that's possible in future, tool is prepared to handle such usage cases easily. The thing is that it would work best if we have new dense and binary format for dump files (both CPU and disk space efficient) - this would allow MT import easier to implement and also would add additional features like filtering out some e.g. tables without having whole file to be read.
Pump was designed to get new features being added as separate C++ classes, without changing existing code much. This enables us to add new features in more comfortable way, and also enables you and others to extend Pump at your own and share features as additional source files, not hard-to-maintain diff files.
Also, as far as I know, there will be second development iteration to add new things, the list of items to do is not settled yet, so any input is welcome. For example we will soon make Pump assure consistency of result when multi-threading is used using locks on all tables.
As for the number of INSERTs you may want to play with --extended-insert option value (max number of rows per single INSERT statement - increases speed and memory consumption on client, it has to store them all concatenated before write). The default for it (250) is safe but most probably you may want increase it a lot if you know your rows are short.; September 29, 2015 at 10:26:00 PM GMT+2
Giuseppe Maxia said...: @MB_
Thanks for your comments. Some of your text would be useful in the manual.
I will have another shot at mysqlpump, possibly when Bug#78593 is fixed.; September 30, 2015 at 5:44:00 AM GMT+2
Unknown said...: I find that dumps are typically not nearly as slow as loads..
I wrote this shell script to perform a mysql dump or load in a parallelized manner, or both in one shot to copy a database from one machine to another

https://github.com/DeepInfoSci/migrate_db

it basically chunks up the tables and orchestrates many instances of mysqldump in parallel; March 15, 2016 at 8:44:00 PM GMT+1
Manish Kumar Rai said...: Can we use where clause on mysqlpump. I want to fatch the data for selected date range like

mysqlpump --user= --password= --database --tables where data >= '2015-12-01' and '2016-12-01' > dump.sql; December 14, 2016 at 12:16:00 PM GMT+1