It was just a few days ago that we announced, with celebratory enthusiasm, Tungsten Replicator 2.1.1, and today we are at it again, with Tungsten Replicator 2.1.2.
What happened? In a surfeit of overconfidence, we released Tungsten 2.1.1 with faith in the test suite and its results. The faith was justified, as the test suite is able to catch every known problem and regression. The overconfidence was not, because, through a series of unfortunate events, some sections of the test suite were accidentally disabled, and the regression that was lurking in the dark was not caught.
Therefore, instead of enjoying a quiet post-release weekend, the whole team worked round the clock to plug the holes. There were two bugs that actually broke the tests (once we put the suite back in good order, that is):
- A zero-valued DATETIME causes a ClassCastException at apply time on slaves using row-based binlogs (a minimal repro sketch follows this list).
- TIME column microseconds are not replicated correctly in RBR.
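For the curious, here is a minimal sketch of the kind of statement that trips the first bug. The connection details and table name are invented for illustration; it assumes Connector/Python on the client side and a master feeding a slave through row-based binlogs.

```python
import mysql.connector  # assumes the MySQL Connector/Python package is installed

# Hypothetical master; the slave applying its row-based binlogs is where
# the ClassCastException showed up.
conn = mysql.connector.connect(host="master-host", user="test",
                               password="secret", database="test")
cur = conn.cursor()

cur.execute("SET SESSION binlog_format = 'ROW'")  # may require SUPER privilege
cur.execute("SET SESSION sql_mode = ''")          # permit zero dates
cur.execute("CREATE TABLE IF NOT EXISTS t_zero_dt "
            "(id INT PRIMARY KEY, dt DATETIME)")
# The zero-valued DATETIME is the trigger: packed into a row event, it
# reaches the slave-side applier, which choked on it before the fix.
cur.execute("INSERT INTO t_zero_dt VALUES (1, '0000-00-00 00:00:00')")
conn.commit()
conn.close()
```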
We came out of the exercise with the embarrassing bugs fixed, a few more found in the process, and a much-improved test suite, which now checks itself to make sure that nothing is missing and that everything that should be tested is actually tested. Although the bugs were fixed over the weekend, we spent three more days just testing the release with all the firepower we could muster.
The test suite has grown quite a bit over the past years. The first version of the suite, 2+ years ago, ran in about 15 minutes. We have been adding steadily to the tests, and the full set (when none of it gets accidentally switched off, that is) now requires about 12 hours. We feel much better about this release. Everything I wrote about Tungsten Replicator 2.1.1 also applies to Tungsten 2.1.2, so I encourage everyone to give it a try.
As a further lesson learned, from now on we're going to publish a release candidate a few weeks before blessing a build as GA. The labels on the downloads page show each build's stage of readiness.
1 comment:
A little like backups and other maintenance processes: recording the time a run takes and the size of what it produces are important metrics, as is, for capacity planning, the number of objects.
By recording these types of information, additional cross-checking can be added to many processes. You can use the change in values as an "approximate" measure of potential issues and red flags.
In your example, that means recording the number of tests executed in each run, then graphing and alerting on any significant negative delta, as sketched below. For a database backup, a change in execution time is generally a red flag; for testing, it's much harder to interpret, given changing hardware specs and software improvements.
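To make the suggestion concrete, here is a minimal sketch of that kind of cross-check; the history file name, the alert threshold, and the way the test count is obtained are all invented for illustration.

```python
import json
from pathlib import Path

HISTORY = Path("test_counts.json")  # hypothetical per-run metrics file
ALERT_THRESHOLD = 0.10              # flag a drop of more than 10% between runs

def record_and_check(tests_executed: int) -> None:
    """Append this run's test count and warn on a significant negative delta."""
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    if history:
        previous = history[-1]
        delta = (tests_executed - previous) / previous
        if delta < -ALERT_THRESHOLD:
            print(f"RED FLAG: test count fell from {previous} to "
                  f"{tests_executed} ({delta:.0%}); were tests switched off?")
    history.append(tests_executed)
    HISTORY.write_text(json.dumps(history))

# Example: if the previous run executed 2000 tests, this one trips the alert.
record_and_check(1400)
```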
Still, this is a good lesson for the many who do little to no testing at all on why dedicated test suites (with dedicated test environments and dedicated test data) are so important.