Fault (Tolerance) Ideas

Murphy's Law says that if anything can go wrong, it will (ref: Captain Edward A. Murphy, http://www.murphys-laws.com/murphy/murphy-true.html). In our world of computing, this includes:

Network switches and wiring: they can fail outright or, worse, flip bits in the data sent across the network.
TCP checksum: a single flipped bit produces a checksum error (and the segment gets retransmitted), but a double bit flip can corrupt a packet while leaving the checksum unchanged, so the TCP layer never knows the data is corrupted.
HDD wiring: installing the wrong cable, or wrongly installing a correct one. Swap a good Ultra DMA ATA cable for a bad one (so the drive is still detected as Ultra DMA) and we get a high Ultra DMA CRC error rate. On top of that there is a CRC-undetectable error rate of something like 5x10^-13 (illogically taken from http://doc.utwente.nl/64267/1/schiphorst.pdf); this corrupt data (on average one bit per two terabytes of data) gets stored to our disks.
HDD failure: commodity disks will fail in 2-3 years, and possibly sooner. The industry-standard RAID 5 no longer suffices for large disk deployments; better to use RAID 6 or RAID 1.
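
To make the TCP checksum weakness concrete, here is a minimal sketch in Python. It assumes the standard 16-bit one's-complement Internet checksum that TCP uses; the function name internet_checksum and the sample payload are made up for illustration. Flipping the same bit position in two different 16-bit words, one from 1 to 0 and the other from 0 to 1, leaves the sum unchanged, while a CRC-32 over the same bytes still notices the damage.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum of the kind used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"fault tolerance!"

# Simulate a double bit flip: the lowest bit of word 0 goes 1 -> 0
# ('a' becomes '`') and the lowest bit of word 1 goes 0 -> 1
# ('l' becomes 'm'). The two changes cancel out in the sum.
damaged = bytearray(original)
damaged[1] ^= 0x01
damaged[3] ^= 0x01
corrupted = bytes(damaged)

same_tcp_checksum = internet_checksum(original) == internet_checksum(corrupted)
same_crc32 = zlib.crc32(original) == zlib.crc32(corrupted)

print("payload changed:       ", original != corrupted)
print("TCP checksum unchanged:", same_tcp_checksum)
print("CRC-32 unchanged:      ", same_crc32)
```

This is only a demonstration of the arithmetic, not a packet-level test; real TCP also covers a pseudo-header and the TCP header in the same sum.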

So ideas for a large-scale fault-tolerant system include:

End-to-end data corruption detection. Put it at the application/database level and we get pretty good coverage of the things that can go wrong. Two different CRC algorithms will suffice.
Redundancy of at least 3x for each data block, or object. Suddenly RAID 1 no longer suffices (because it gives only 2x redundancy).
Auto-replication, or self-healing. When an HDD gets replaced with a new one, the data ought to be re-mirrored to the new HDD.
Multihomed systems. That means at least two network interfaces on each host, each connected to a different network switch, providing network redundancy.
Monitoring. The drawback of automatic healing or automatic failover is that the human operator doesn't know a failure is happening. Even if nothing needs to be done (such as when automatic re-mirroring is in progress), the event could indicate that something is wrong (like a bad switch that corrupts TCP packets while leaving the checksum the same).
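
As a rough sketch of how the first three ideas and the monitoring idea could fit together, the Python below keeps every block on three in-memory "replicas", stores two different CRCs next to each copy, verifies them on read, re-mirrors any bad or missing copy from a good one, and records each repair so an operator can see it. Everything here is an assumption for illustration: the ReplicatedStore class, its events list, and the choice of zlib.crc32 plus binascii.crc_hqx as the two CRC algorithms. Dictionaries stand in for disks.

```python
import binascii
import zlib
from typing import Dict, List, Optional, Tuple

Record = Tuple[bytes, Tuple[int, int]]  # (block, (crc32, crc16))


def checksums(block: bytes) -> Tuple[int, int]:
    """Two different CRCs over the same block: CRC-32 and CRC-CCITT."""
    return zlib.crc32(block), binascii.crc_hqx(block, 0)


class ReplicatedStore:
    """Hypothetical 3x replicated block store; dicts stand in for disks."""

    def __init__(self, replicas: int = 3) -> None:
        self.replicas: List[Dict[str, Record]] = [{} for _ in range(replicas)]
        self.events: List[str] = []  # what a monitoring system would collect

    def put(self, key: str, block: bytes) -> None:
        record = (block, checksums(block))
        for replica in self.replicas:
            replica[key] = record

    def get(self, key: str) -> bytes:
        good: Optional[Record] = None
        bad: List[int] = []
        for index, replica in enumerate(self.replicas):
            record = replica.get(key)
            # A copy is good only if both stored CRCs match the data.
            if record is not None and checksums(record[0]) == record[1]:
                good = good or record
            else:
                bad.append(index)
        if good is None:
            raise IOError(f"no intact copy of {key!r} left")
        for index in bad:  # self-healing: re-mirror from a good copy
            self.replicas[index][key] = good
            self.events.append(f"repaired {key!r} on replica {index}")
        return good[0]


store = ReplicatedStore()
store.put("block-1", b"important data")

# Simulate silent corruption on replica 0 and a replaced (empty) disk 1.
_, stored_sums = store.replicas[0]["block-1"]
store.replicas[0]["block-1"] = (b"importent data", stored_sums)
store.replicas[1].clear()

data = store.get("block-1")
print(data)
for event in store.events:
    print(event)
```

With only two copies, the corrupted replica and the replaced disk in this scenario would have left nothing to heal from, which is the point of asking for 3x. The events list is the part the human operator should be watching: the read succeeded, but two repairs in one read is a hint that something upstream is misbehaving.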

Inventor's Paradox