Wednesday, August 06, 2008

Code is the easy part: Preventing data corruptions

Coding, coding, coding. There's an awful lot of attention paid to coding in the software engineering world. I read reddit's programming category a lot these days, and most software engineering posts there are specifically about code. I think this is not because coding is the most important part of being a programmer, but because it is the most interesting and fun part.

Code is easy to change. If you have a bug, you reproduce it and fix it, and then it goes away. Sometimes it's difficult, but even then you fix it, make sure it doesn't happen again by writing test cases (preferably before you fix it), and then you forget about it.

Let's contrast fixing bugs in code with fixing corrupted data. Corrupted data in any reasonable system is probably going to be caused by a software bug. So, you first have to find out why the data is corrupted. Of course, maybe the corruption is old, and the bug has been fixed. Hard to say. You have to investigate. You could try to reproduce it, if you can figure out a likely scenario. Most likely you will fail to reproduce it, and even more likely you already know you will fail, so you don't try. So instead you read the code, looking for how it could happen. This is a useful practice. You can sometimes find the cause this way. Then you can write a test for it and fix it.

After you fix the bug, your work isn't done, of course. You still have to fix the corrupted data itself. This could be as simple as running a SQL query, or as complicated as writing a script to patch things up using a code library to do the work. Of course, sometimes you never find out why your data is corrupted, so you just have to fix it, if you can figure out how.


Some amount of corrupted data is inevitable, and in fact some may come from design decisions. For example, some database systems cannot do two-phase commits, and if you need to hook into one, you may have to accept a certain amount of data corruption because your transactions are not atomic across both systems. If the error rate is very low, some corruption may be a fair price to pay for whatever benefits the second system is getting you. Even so, this is dangerous: a low error rate today may be a severe error rate tomorrow, leaving you with a lot of angry customers with corrupted data, and a few dejected developers who have to clean it all up.

There are a few best practices you can follow to avoid data corruption:
  • Use one transactional system with ACID properties, and use it correctly.
  • When using SQL, use foreign key references whenever possible. 
  • Before saving data, assert its correctness as a precondition.  This includes both the values stored and the relationships between the data and the other data it links to or is linked from.
  • Create a data checker that will run a series of tests on your data.  This is basically like unit tests for data, but you can run it not only after a series of tests, but in your production system too.  Run this program regularly, and pay attention to the output.  You want to be alerted to any changes in the sanity of your data.  As with unit tests, the more invariants you encode into this tool, the better off you will be.  When changing or adding data, modify the data checker code.  After each QA cycle, run the data checker.  Any errors should be entered as bugs.
  • If your data can be repaired, have an automated data repairer.  This shouldn't be run regularly, because you don't want to get too complacent about your corruptions.  Instead, if you notice that the data checker has picked up some new errors, then you run this, modifying it first if the errors are of a new type.
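To make the data checker idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. The customers/orders schema, the CHECKS table, and the function names check_data and build_demo_db are all made up for illustration; a real checker would encode the invariants of your own data. Each check is just a query that returns the ids of rows violating an invariant.

```python
import sqlite3

# Hypothetical schema, for illustration only: customers and their orders.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL
);
"""

# Each check is a name plus a query returning the ids of offending rows.
# Adding a new invariant means adding one entry here.
CHECKS = {
    "order without customer": """
        SELECT o.id FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
        ORDER BY o.id""",
    "negative order total": """
        SELECT id FROM orders WHERE total_cents < 0 ORDER BY id""",
}


def check_data(conn):
    """Run every check and return {check name: [offending row ids]}."""
    problems = {}
    for name, query in CHECKS.items():
        bad_ids = [row[0] for row in conn.execute(query)]
        if bad_ids:
            problems[name] = bad_ids
    return problems


def build_demo_db():
    """Build an in-memory database containing deliberately corrupt rows."""
    conn = sqlite3.connect(":memory:")
    # Foreign key enforcement is switched off here on purpose, to simulate
    # a system where bad rows were able to slip in.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 500)")  # valid
    conn.execute("INSERT INTO orders VALUES (11, 2, 700)")  # no such customer
    conn.execute("INSERT INTO orders VALUES (12, 1, -5)")   # negative total
    conn.commit()
    return conn


if __name__ == "__main__":
    for name, ids in check_data(build_demo_db()).items():
        print(f"{name}: rows {ids}")
```

In this sketch the first check is exactly the kind of error a foreign key would have prevented up front, while the second is an invariant the schema as written does not express. Something like this could run from a scheduled job against production and after each QA cycle, with any non-empty result treated as a bug report.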

Doing a good job on all these tips should prevent most data corruption, but not all.  As with bugs, even the best preventative measures will not guarantee success.

Having clean data is extremely important.  This data is not yours; it is your customers', and they trust you with it.  You need to protect it, and it isn't easy, but preventing data corruption is always the right thing to do.
