Friday, August 24, 2012

Fault (Tolerance) Ideas

Murphy's Law says that if anything can go wrong, it will (ref: Captain Edward A. Murphy, http://www.murphys-laws.com/murphy/murphy-true.html). In our world of computing this includes:

  • Network switches and wiring. They can fail outright or, worse, flip bits in the data sent through the network.
  • The TCP checksum. A checksum error is caught and the segment gets retransmitted, but a double bit flip can corrupt a packet while leaving the checksum unchanged, so the TCP layer never knows the data is corrupted.
  • HDD wiring: installing the wrong cable, or installing the correct cable wrongly. Swap a good Ultra DMA ATA cable for a bad one (so that the drive is still detected as Ultra DMA) and we get a high Ultra DMA CRC error rate. On top of that there is a CRC-undetectable error rate of something like 5x10^-13 (illogically taken from http://doc.utwente.nl/64267/1/schiphorst.pdf). This corrupt data (on average one bit per two terabytes of data) gets stored on our disks.
  • HDD failure. Commodity disks will fail in 2-3 years, and possibly sooner. Our industry-standard RAID 5 no longer suffices for large disk deployments; better to use RAID 6 or RAID 1.
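
To make the TCP checksum point concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum that TCP uses. The function name and the sample bytes are made up for illustration. It shows one way a double bit flip can slip through: the same bit position is flipped in two different 16-bit words, once from 0 to 1 and once from 1 to 0, so the sum does not change.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload: two 16-bit words, 0x1234 and 0x5679.
original = bytes([0x12, 0x34, 0x56, 0x79])

# One flipped bit: lowest bit of the first word goes 0 -> 1.
single_flip = bytes([0x12, 0x35, 0x56, 0x79])

# Two flipped bits: the same bit goes 0 -> 1 in the first word
# and 1 -> 0 in the second word, so the two changes cancel out.
double_flip = bytes([0x12, 0x35, 0x56, 0x78])

print(internet_checksum(original) == internet_checksum(single_flip))  # False
print(internet_checksum(original) == internet_checksum(double_flip))  # True
```

The single flip is caught, the double flip is not, which is why a second, independent check higher up the stack is worth having.
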
So ideas for a large-scale fault-tolerant system include:
  • End-to-end data corruption detection. Put it at the application/database level and we get pretty good coverage of the things that can go wrong. Two different CRC algorithms will suffice.
  • Redundancy of at least 3x for each data block or object. Suddenly RAID 1 no longer suffices (because it gives only 2x redundancy).
  • Auto-replication, or self-healing. When an HDD is replaced with a new one, the data ought to be re-mirrored to the new HDD.
  • Multihomed systems. That means at least two network interfaces on each host, each connected to a different network switch, providing network redundancy.
  • Monitoring. The drawback of automatic healing or automatic failover is that the human operator doesn't know a failure is happening. Even if nothing needs to be done (such as when automatic re-mirroring is in progress), the event could be an indication that something is wrong (like a bad switch that corrupts TCP packets while leaving the checksum the same).
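
The first three ideas can be sketched together in a few lines. This is only an illustration under assumptions: the names `seal`, `is_intact` and `heal` are hypothetical, replicas are kept in a plain Python list instead of on separate disks, and the two CRCs are CRC-32 (`zlib.crc32`) and the 16-bit CRC-CCITT (`binascii.crc_hqx`) simply because both ship with the standard library.

```python
import binascii
import zlib
from typing import List, Tuple

# A stored block: (payload, CRC-32, CRC-16/CCITT). Hypothetical layout.
Block = Tuple[bytes, int, int]


def seal(payload: bytes) -> Block:
    """Attach two different CRCs to the payload at the application level."""
    return (payload, zlib.crc32(payload) & 0xFFFFFFFF, binascii.crc_hqx(payload, 0))


def is_intact(block: Block) -> bool:
    """Recompute both CRCs and compare them with the stored values."""
    payload, crc32, crc16 = block
    return (
        (zlib.crc32(payload) & 0xFFFFFFFF) == crc32
        and binascii.crc_hqx(payload, 0) == crc16
    )


def heal(replicas: List[Block]) -> int:
    """Overwrite corrupted replicas with the first intact one.

    Returns the number of replicas repaired, so that the caller can
    report it to monitoring instead of healing silently.
    """
    good = next((r for r in replicas if is_intact(r)), None)
    if good is None:
        raise ValueError("no intact replica left")
    repaired = 0
    for i, replica in enumerate(replicas):
        if not is_intact(replica):
            replicas[i] = good
            repaired += 1
    return repaired


# 3x redundancy: three copies of the same sealed block.
replicas = [seal(b"hello")] * 3

# Simulate silent corruption of one copy: payload changes, CRCs do not.
_, crc32, crc16 = replicas[1]
replicas[1] = (b"hellp", crc32, crc16)

print(is_intact(replicas[1]))  # False
print(heal(replicas))          # 1
```

With only two copies and no checksums, a mismatch tells us that one copy is wrong but not which one. The stored CRCs identify the good copy, and the third replica keeps the data redundant while the bad one is being repaired. The sketch does not handle intact replicas that disagree with each other (for example, a stale version), which a real system would need to resolve.
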
