Fault (Tolerance) Ideas

Murphy's Law says that if anything can go wrong, it will (ref: Captain Edward A. Murphy, http://www.murphys-laws.com/murphy/murphy-true.html). In our world of computing this includes:

  • our network switches and wiring: they could go down or, worse, flip bits in the data sent through the network
  • the TCP checksum: a single bit flip produces a checksum error (and the segment gets retransmitted), but a double bit flip can corrupt the packet while leaving the checksum unchanged, so the TCP layer never knows the data is corrupted
  • HDD wiring: installing the wrong cable, or wrongly installing a correct one. Swap a good Ultra DMA ATA cable for a bad one (so the drive is still detected as Ultra DMA) and we get a high Ultra DMA CRC error rate. On top of that there is a CRC-undetectable error rate of something like 5x10^-13 (borrowed, not very rigorously, from http://doc.utwente.nl/64267/1/schiphorst.pdf). That works out to one undetected error per 2x10^12 units of data, or about one bit for every two terabytes if the rate is counted per byte, and this corrupt data gets stored to our disks.
  • HDD failure: commodity disks tend to fail within 2-3 years, and possibly sooner. Our industry-standard RAID 5 no longer suffices for large disk deployments; better to use RAID 6 or RAID 1.
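
To make the TCP checksum point concrete, here is a minimal sketch in Python of the 16-bit ones'-complement sum that TCP uses. The payload and the choice of flipped bits are made up for illustration, and the real TCP checksum also covers the header and a pseudo-header, which this sketch leaves out. Two bit flips at the same bit position in two different 16-bit words, one going 0 to 1 and the other 1 to 0, cancel each other out in the sum.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over the payload only."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back in (ones'-complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"hello world!"  # hypothetical payload

# Flip the lowest bit of byte 1 ('e', bit is 1 -> 0) and of byte 3
# ('l', bit is 0 -> 1). Both are the low byte of a 16-bit word, so the
# first word drops by 1 and the second grows by 1: the sum is unchanged.
corrupted = bytearray(original)
corrupted[1] ^= 0x01
corrupted[3] ^= 0x01
corrupted = bytes(corrupted)

same_tcp_checksum = internet_checksum(original) == internet_checksum(corrupted)
same_crc32 = zlib.crc32(original) == zlib.crc32(corrupted)

print(corrupted)
print("TCP-style checksum unchanged:", same_tcp_checksum)
print("CRC-32 unchanged:", same_crc32)
```

The payload is corrupted, yet the TCP-style checksum comes out identical, while a CRC-32 over the same bytes does notice the change. That gap is what an application-level check is meant to cover.
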
So ideas for a large-scale fault-tolerant system include:
  • End-to-end data corruption detection. Put it at the application/database level and we get pretty good coverage of whatever could go wrong in the layers below. Two different CRC algorithms should suffice.
  • Redundancy of at least 3x for each data block or object. Suddenly RAID 1 no longer suffices (because it gives only 2x redundancy).
  • Auto-replication, or self-healing. When an HDD is replaced with a new one, the data ought to be re-mirrored to the new HDD automatically.
  • Multihomed systems. That means at least two network interfaces on each host, each connected to a different network switch, providing network redundancy.
  • Monitoring. The drawback of automatic healing or automatic failover is that the human operator may not know a failure is happening. Even if nothing needs to be done (such as when automatic re-mirroring is in progress), it could be an indication that something is wrong (like a bad switch that corrupts TCP packets without changing the checksum).
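
A few of these ideas can be sketched together in a handful of lines of Python. This is only a toy under stated assumptions: `ReplicatedStore`, `fingerprint` and the `repairs` counter are hypothetical names, each in-memory dict stands in for one disk, and the two checks are CRC-32 (`zlib.crc32`) and a 16-bit CRC-CCITT (`binascii.crc_hqx`). Every block is written three times with both CRCs attached, a read verifies the CRCs and rewrites any bad or missing copy from a good one, and each repair is counted so that monitoring has something to report.

```python
import binascii
import zlib

REPLICAS = 3  # assumption: at least 3 copies per block


def fingerprint(data: bytes) -> tuple:
    """Two different CRCs computed at the application level."""
    return zlib.crc32(data), binascii.crc_hqx(data, 0)


class ReplicatedStore:
    """Toy store: each dict in self.disks stands in for one disk."""

    def __init__(self, replicas: int = REPLICAS):
        self.disks = [dict() for _ in range(replicas)]
        self.repairs = 0  # exposed so an operator can see silent healing

    def put(self, key: str, data: bytes) -> None:
        record = (bytes(data), fingerprint(data))
        for disk in self.disks:
            disk[key] = record

    def get(self, key: str) -> bytes:
        good = None
        bad_disks = []
        for disk in self.disks:
            record = disk.get(key)
            if record is not None and fingerprint(record[0]) == record[1]:
                if good is None:
                    good = record
            else:
                bad_disks.append(disk)  # missing or corrupt copy
        if good is None:
            raise IOError("no intact replica left for %r" % key)
        # Self-healing: re-mirror the good copy onto the bad replicas.
        for disk in bad_disks:
            disk[key] = good
            self.repairs += 1
        return good[0]


store = ReplicatedStore()
store.put("block-1", b"important data")

# Simulate silent corruption on the first disk: one flipped bit,
# with the stored CRCs left as they were.
data, crcs = store.disks[0]["block-1"]
store.disks[0]["block-1"] = (bytes([data[0] ^ 0x01]) + data[1:], crcs)

# Simulate a replaced (empty) second disk.
store.disks[1].clear()

print(store.get("block-1"))
print("repairs performed:", store.repairs)
```

The read still returns the original bytes because the third copy is intact, and the counter shows two repairs, one for the corrupt copy and one for the empty disk. In a real system that counter would feed an alert, since a steady trickle of repairs may point at a bad cable or switch rather than at bad luck.
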
