Fault (Tolerance) Ideas
Murphy's Law says that if anything can go wrong, it will (ref: Captain Edward A. Murphy, http://www.murphys-laws.com/murphy/murphy-true.html). In our world of computing, this includes:
- Network switches and wiring. They can fail outright or, worse, flip bits in the data sent through the network.
- TCP checksums. A single flipped bit causes a checksum error (and the segment gets retransmitted), but a double bit flip can corrupt a packet while leaving the checksum unchanged, so the TCP layer never learns that the packet is corrupt.
- HDD wiring. We can install the wrong cable, or wrongly install a correct cable. Swap a good Ultra DMA ATA cable for a bad one (one that is still detected as Ultra DMA) and we get a high Ultra DMA CRC error rate. There is also a CRC-undetectable error rate of something like 5x10^-13 (a figure taken, not very rigorously, from http://doc.utwente.nl/64267/1/schiphorst.pdf ). This corrupt data (on average, one bit per two terabytes of data) gets stored on our disks.
- HDD failure. Commodity disks will fail in 2-3 years, and possibly sooner. The industry-standard RAID 5 no longer suffices for large disk deployments; better to use RAID 6 or RAID 1.
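To make the TCP checksum point concrete, here is a minimal sketch of the 16-bit ones' complement Internet checksum that TCP uses. The sample payload and the choice of which bits to flip are my own illustration: two flips at the same bit position in two different 16-bit words, one going 0 to 1 and the other 1 to 0, cancel out in the sum. A CRC-32 over the same bytes does notice the change.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, as used by TCP, UDP and IP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative payload: two 16-bit words, 0x1234 and 0x5778.
original = bytearray(b"\x12\x34\x57\x78")

corrupted = bytearray(original)
corrupted[0] ^= 0x01  # word 0: 0x1234 -> 0x1334 (a bit goes 0 -> 1)
corrupted[2] ^= 0x01  # word 1: 0x5778 -> 0x5678 (same bit goes 1 -> 0)

same_checksum = internet_checksum(bytes(original)) == internet_checksum(bytes(corrupted))
same_crc32 = zlib.crc32(bytes(original)) == zlib.crc32(bytes(corrupted))

print("payload changed:", original != corrupted)
print("TCP-style checksum unchanged:", same_checksum)
print("CRC-32 unchanged:", same_crc32)
```

The checksum comparison passes even though the payload differs, which is exactly the case where the receiver accepts corrupt data without any retransmission.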
So ideas for a large-scale fault-tolerant system include:
- End-to-end data corruption detection. Put it at the application/database level and we get pretty good coverage of whatever goes wrong in the layers below. Two different CRC algorithms should suffice.
- Redundancy of at least 3x for each data block or object. Suddenly RAID 1 no longer suffices (because it gives only 2x redundancy).
- Auto-replication or self-healing. When an HDD is replaced with a new one, the data ought to be re-mirrored to the new HDD.
- Multihomed systems. That means at least two network interfaces on each host, each connected to a different network switch, providing network redundancy.
- Monitoring. The drawback of automatic healing or automatic failover is that the human operator doesn't know a failure is happening. Even if nothing needs to be done (such as when automatic re-mirroring is in progress), the event could indicate that something is wrong (like a bad switch that corrupts TCP packets while leaving the checksum the same).
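Several of these ideas can be sketched together in a few lines. The toy below is only an illustration under my own assumptions, not a real storage system: `ReplicatedStore`, its in-memory "disks" (plain dicts), and `repair_log` are hypothetical names. Each block is stored three times together with two different CRCs (CRC-32 from `zlib` and CRC-CCITT from `binascii`), reads verify both, bad or missing replicas are re-mirrored from a good one, and every repair is recorded so an operator can see it.

```python
import binascii
import zlib

REPLICAS = 3  # at least 3x redundancy per block


def digest(data: bytes) -> tuple:
    """Two different CRCs over the same data: CRC-32 and CRC-CCITT."""
    return zlib.crc32(data), binascii.crc_hqx(data, 0)


class ReplicatedStore:
    """Toy store: each 'disk' is a dict standing in for a real drive."""

    def __init__(self, replicas: int = REPLICAS):
        self.disks = [dict() for _ in range(replicas)]
        self.repair_log = []  # what a monitoring system would alert on

    def put(self, key: str, data: bytes) -> None:
        record = (data, digest(data))
        for disk in self.disks:
            disk[key] = record

    def get(self, key: str) -> bytes:
        good = None
        bad = []
        for index, disk in enumerate(self.disks):
            record = disk.get(key)
            # end-to-end check: recompute both CRCs at read time
            if record is not None and digest(record[0]) == record[1]:
                if good is None:
                    good = record
            else:
                bad.append(index)
        if good is None:
            raise IOError(f"all replicas of {key!r} are missing or corrupt")
        for index in bad:
            # self-healing: re-mirror from a verified replica
            self.disks[index][key] = good
            self.repair_log.append((key, index))
        return good[0]


store = ReplicatedStore()
store.put("block-1", b"hello world")

# Simulate silent corruption on disk 0 and a freshly replaced disk 1.
stored_crcs = store.disks[0]["block-1"][1]
store.disks[0]["block-1"] = (b"hellp world", stored_crcs)
store.disks[1].clear()

data = store.get("block-1")
print(data)
print(store.repair_log)
```

The read still succeeds from the one healthy replica, and the repair log is the part that addresses the monitoring concern: the healing is automatic, but it is not silent.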