Recently we had a small hiccup in our service; it was caused by our MySQL replication configuration and a small mistake made a long time before. The lesson was learned, and now I'll share it with you.
MySQL replication is simple. The replica set's primary server (a.k.a. the replication master) writes every event that affects the database to a binary log file, and the slaves apply that log to their own datasets. The binary log can be statement-based, row-based, or mixed; the events can be either DML or DDL queries. Simple enough.
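As a minimal sketch, the master side of this setup lives in my.cnf; the server ID, log name, and format below are illustrative, not our actual values:

```ini
# my.cnf on the master (illustrative values)
[mysqld]
server-id     = 1
log-bin       = mysql-bin   # enable the binary log
binlog_format = MIXED       # STATEMENT, ROW, or MIXED
```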
Of course, we can filter some statements out of replication, for example when there are tables which are only read from and written to on the master, and don't need to exist on the slaves. On Kinja, we have certain tables where this rule applies, for example a sessions table, which contains session-related data. We have to ensure that the data is available as soon as it is written, so there is no reason to replicate it to the slaves – no thread will ever try to read that table there, but the table is big enough to make filtering its data out of replication worthwhile.
There are many ways to filter tables out of replication: you can filter on the master or on the slaves, and both approaches have their pros and cons. The binlog-ignore-db option, or the replicate-ignore-[db|table] and replicate-wild-ignore-table options, can be used to get this behaviour: the binlog-* options apply on the master (whether the statements are written to the binlog at all), and the replicate-* options are configured on the slaves (to step over certain statements while applying the relay logs). At Kinja, we have to use the replicate-ignore options on the MySQL slaves, because the tables that are filtered out on the slaves still have to replicate to the standby master and to the backup servers. One more thing worth sharing: we run two different datacenters to ensure reliability.
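A slave-side filter of the kind described above could look like this in my.cnf; the database and table names are made up for illustration:

```ini
# my.cnf on a slave (illustrative names)
[mysqld]
# skip this table while applying the relay log on this slave
replicate-ignore-table      = kinja.sessions
# or, pattern-based:
replicate-wild-ignore-table = kinja.session%
```

The binlog-* variants would go in the master's config instead, with the trade-off that the filtered data then reaches no downstream server at all, standby master and backups included.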
So, when the second datacenter was populated with servers, the mistake was made: I built the servers from a backup of the master database, and after the hosts were up, I dropped the tables that were filtered out from replication, because they were big enough to eat up too much precious disk space.
The site ran nicely for weeks without any hiccups, but there came a day when some maintenance needed to be done. The main reason for this periodic maintenance can be read here, but in a nutshell: when data is deleted from InnoDB, the space is not always freed up, and sometimes you need to reclaim it manually. This can be done with the OPTIMIZE TABLE command, which releases the free space from the table and rewrites the rows ordered by the primary key. So far, so good.
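The maintenance step itself is a one-liner; the table name is illustrative, and for InnoDB the command is internally mapped to a full table rebuild:

```sql
-- Rebuilds the table, reclaiming free space and ordering rows by primary key.
-- Caveat: the table is blocked for writes while this runs.
OPTIMIZE TABLE sessions;
```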
The only problem with this approach is that the table is locked for the duration of the operation. Err… we don't want to lock a session-related table for a long time on a high-traffic website, so we had to find another way to solve our problem. Of course, there is a solution, named pt-online-schema-change (pt-osc). This tool is designed exactly for our needs: you can alter a table online, without locking. Its documentation speaks for itself, but in a few words: it creates an empty copy of the table, makes the changes on the new table, copies the data over from the original table (keeping track of changes to the original via triggers and applying them during the copy), and at the end swaps the two tables. Voilà.
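An invocation equivalent to what I ran might look like this; the schema, table, and host names are illustrative, and the no-op ALTER is a common pt-osc idiom for rebuilding a table online, much like OPTIMIZE TABLE would:

```shell
# rebuild kinja.sessions online via a no-op ALTER (illustrative names)
pt-online-schema-change \
  --alter "ENGINE=InnoDB" \
  --execute \
  D=kinja,t=sessions,h=db-master.example.com
```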
So, I ran my pt-online-schema-change command on the DB, and whoops, something broke. At that point I had no idea what had happened, but on the charts I saw replication lag slowly crawling up (I'll write a few words about the mambo collector in a future post), but only on some servers. Luckily, we could isolate the problem: it happened in the new datacenter, and the old one was unaffected. This was the point when the site was in its hiccup state, but after we realized that the problem existed only in one DC, we could disable those servers in the load balancers. The site was up and running, but half of the SQL slaves had broken replication. Hm.
I started to investigate what had happened, and it was easy enough to find: the SHOW SLAVE STATUS command showed that replication had stopped with an error: the pt-osc command had tried to create triggers on a nonexistent table… the table that was filtered out from replication… So I had to run SHOW CREATE TABLE on a server where the table still existed, and create it, empty, on the slaves. After this, I could issue START SLAVE on the slaves, which ran without error, and replication was fixed. After the slaves caught up to the master's position, the servers could be put back into service. (In fact, there is another way to solve this: SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; or running pt-slave-restart in the background, which does the same, but neither way is future-proof: next time they will break again.)
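The repair steps above, sketched as SQL (schema and table names are illustrative):

```sql
-- On a server where the table still exists, grab its definition:
SHOW CREATE TABLE kinja.sessions\G

-- On each broken slave, recreate the table empty using that output…
CREATE TABLE kinja.sessions ( /* columns from SHOW CREATE TABLE */ ) ENGINE=InnoDB;

-- …then resume replication and verify it is catching up:
START SLAVE;
SHOW SLAVE STATUS\G  -- check that Seconds_Behind_Master is shrinking
```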
So the lesson to learn is the following: you can filter tables out of replication, but replication can break if anything touches those tables indirectly, because replication filters only apply when the tables are referenced directly. So if you decide to filter tables out of replication, create them empty on the slaves, or truncate them instead of dropping them. (This applies to filtering out entire databases too. For example, if you filter out database1 from replication, but you run a query from database2 that points at database1, it will cause trouble – for example USE database2; INSERT INTO database1.table1 …)
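The database-level pitfall can be reproduced like this, assuming a slave configured with replicate-ignore-db = database1 and statement-based replication (the names follow the example above):

```sql
-- replicate-ignore-db matches on the *default* database, so this
-- statement is filtered out on the slave, as expected:
USE database1;
INSERT INTO table1 VALUES (1);

-- but this one is NOT filtered – the default database is database2 –
-- and it breaks the slave if database1.table1 does not exist there:
USE database2;
INSERT INTO database1.table1 VALUES (1);
```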
Managing MySQL is always fun, but always be cautious, even when it seems simple. Just think twice.