Monday, May 17, 2010

LOAD DATA: a tricky replication issue

When you are importing large amounts of data from other sources LOAD DATA is a common method of inserting data into a table.
It is one of the old commands implemented in MySQL. As such it is very fast, and it has been optimized for both MyISAM and InnoDB.
All is well when you are loading data into a standalone server. All is almost well when you are using replication. LOAD DATA used to be a problem in old versions of MYSQL, prior to 4.1. With recent versions of MySQL, it is replicated correctly, and sometimes efficiently.
The trouble starts when the data file is big. The definition of big and the amount of trouble that you can get depends on many factors. That's why users may not realize that this problem exists, even with fairly large files, and then being hit by this disaster when the file is only a little larger than the previous ones.
First, let me explain what happens when you replicate LOAD DATA.
  1. The LOAD DATA query runs in the master.
  2. When the query is finished, the master starts pumping data to the binary log.
  3. The slave receives the binary log, and it will create a copy of the data file in the default temporary directory.
  4. The slave executes the LOAD DATA query using the temporary file.
  5. When the slave is done loading the data, the temporary file is deleted
  6. The data from the relay log is deleted

At the end of the exercise, your data is only in the database table, both in the master and in the slaves. However, during the loading, each slave needs THREE TIMES THE STORAGE of the initial data file size, not counting the indexes.
If your data is 10 GB, then you will need 20 GB on the master (10 for the table, 10 for the binary log, and eventually 10 more for the indexes).
On the slave, you will need 30 GB: 10 for the table (+ indexes if requested), 10 for the relay logs, and 10 for the file in the temporary directory. The last part is the tricky one. The temporary directory is whatever is indicated in the TMPDIR system variable. If that directory is in a partition with less than 10 GB free, your replication will break, even if your data directory has terabytes of free space.


Mark Robson said...

I think you're forgetting that, if you're using InnoDB, a LOAD DATA happens in one transaction which means using a lot of space in the innodb log to roll back.

As they are fixed size, on many installations a 10G LOAD DATA will simply fail. And moreover, when it does it will cause a very inefficient rollback of a huge amount of data which could (in my experience) take hours.

So large data LOAD DATA is not friendly to InnoDB; I don't know about other transactional engines (MyISAM of course doesn't have this problem).

rpbouman said...

mmm, would it make sense to add a COMMIT EVERY x LINES clause to LOAD DATA INFILE ?

Mark Callaghan said...

Feature request 24313 is open for this. A patch was provided.

Mats Kindahl said...

Row-based replication just add the rows to the binary log. It doesn't help reduce the size of the InnoDB transaction log, but at least there will not be a temporary file created.