
You're probably aware of this, but for the sake of others reading:

Crashing after an fsync failure isn't sufficient if you're using buffered IO either. If a dirty page in the page cache holds the log record for a vote, your consensus implementation could end up voting for two different leaders in the same term: the machine crashes at some later point, your process restarts after the machine comes back, no longer has access to that dirty page, and can potentially vote again if asked.

Edit: Thought I'd add the series of steps for this to happen to make it clearer:

1. Process X receives an RPC from Process Y to vote for Y as leader in term 1.

2. Process X writes log record containing vote information to page cache and calls fsync.

3. Process X receives EIO in response to fsync (if it is lucky...) and crashes.

4. Process X restarts and receives a retry of the same RPC from Process Y for term 1. This time it responds affirmatively, because it can see it already voted for Process Y in term 1, which _should_ be safe.

5. Process X crashes because the machine hosting X experiences a temporary hardware failure.

6. Process X restarts after the hardware failure, receives an RPC from Process Z asking for its vote in term 1, and responds affirmatively because the dirty page that was still visible back at step 4 never actually made it to disk.

Consensus is now (potentially) broken, since two leaders can be elected for the same term.
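
To make the persistence step concrete, here is a rough Go sketch of the "crash on fsync failure" rule (the name persistVote, the log path, and the record format are made up for illustration, not taken from any real implementation). Note that, per the scenario above, even crashing on the error is not enough with buffered IO, because the kernel may never manage to get that dirty page onto disk before the whole machine goes down.

    package main

    import (
        "fmt"
        "os"
    )

    func persistVote(path string, term uint64, candidate string) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
        if err != nil {
            panic(err) // crash rather than run with unknown durability
        }
        defer f.Close()

        record := fmt.Sprintf("term=%d votedFor=%s\n", term, candidate)
        if _, err := f.WriteString(record); err != nil {
            panic(err)
        }
        // If Sync (fsync) reports an error such as EIO, the kernel may already
        // have marked the dirty page clean, so retrying fsync proves nothing;
        // crashing the process is the safest reaction available here.
        if err := f.Sync(); err != nil {
            panic(err)
        }
    }

    func main() {
        persistVote("raft_vote.log", 1, "node-Y") // hypothetical path and values
    }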



You could replace step 3) with "pull power plug from host" for the same effect.

Think of all the extra disk IO the world's databases are doing to defend against step 3) :)


In distributed-systems terms, this is a Byzantine failure, which most consensus mechanisms are not equipped to handle.


And yet many non-Byzantine consensus protocols are equipped to handle the network fault model, which could be seen as equally Byzantine under this definition.

The problem is really that many formal proofs of consensus have focused only on the network fault model, and neglected the storage fault model.

Both network/storage fault models require practical engineering efforts to get right. I think a better term for this is “near-Byzantine” fault tolerance. It's what non-Byzantine fault tolerance looks like when implemented correctly in the real world—the GP comment is a great example of how to approach and think about this from an engineering perspective.

I dive into this in detail also here: https://www.youtube.com/watch?v=rNmZZLant9o


"near-Byzantine" is not a very clear term you can reason about. A system is either Byzantine-fault-tolerant, in which case it handles all Bizantine faults, or it is not. A system that is tolerant to some faults (that you may want to call "Byzantine") is not BFT.

You don't call plaintext SMS "tamper-resistant" because it resists some simple tampering. You don't call your house "FBI-resistant" because you once managed to convince them to turn around.

A Byzantine fault is clearly defined as a case where a specific node may be doing anything, including not knowing it has failed, and including malicious behavior. It is important that people know what class of faults their system is designed to resist; for Raft/Paxos, it is NOT Byzantine faults. Those systems are pretty great, but pretending they aim at BFT is dangerous misinformation...


What, then, would you specify as the clearly defined storage fault model for non-Byzantine protocols such as Paxos/Raft that rely on stable storage for correctness?


Anything is possible with Byzantine faults on the specific failed node: it may not remember voting, it may not remember to vote, it may not remember its own identity, etc. Paxos/Raft are not tolerant of a minority of nodes exhibiting those kinds of faults, only of a minority of nodes being unreachable or partitioned.

Remember that the Byzantine generals had traitors among them, not merely communication issues.


What I mean is: if you're implementing Paxos/Raft, what do you expect of the disk? That it's perfect?
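
For what it's worth, the usual practical answer is "no, assume it can corrupt, drop, or misplace what you wrote, and make that detectable". Here is a rough, hypothetical Go sketch (not from this thread or any particular system) of the kind of checksummed record framing a write-ahead log can use so that recovery notices torn or corrupted records instead of trusting them.

    package main

    import (
        "encoding/binary"
        "hash/crc32"
    )

    // encodeRecord frames a payload as [length][crc32][payload].
    func encodeRecord(payload []byte) []byte {
        buf := make([]byte, 8+len(payload))
        binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
        binary.LittleEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(payload))
        copy(buf[8:], payload)
        return buf
    }

    // decodeRecord returns false if the record is torn or corrupted, so
    // recovery can stop (or fall back) instead of trusting bad state.
    func decodeRecord(buf []byte) ([]byte, bool) {
        if len(buf) < 8 {
            return nil, false
        }
        n := binary.LittleEndian.Uint32(buf[0:4])
        sum := binary.LittleEndian.Uint32(buf[4:8])
        if uint32(len(buf)-8) < n {
            return nil, false
        }
        payload := buf[8 : 8+n]
        if crc32.ChecksumIEEE(payload) != sum {
            return nil, false
        }
        return payload, true
    }

    func main() {
        rec := encodeRecord([]byte("term=1 votedFor=node-Y"))
        if _, ok := decodeRecord(rec); !ok {
            panic("corrupt or torn record")
        }
    }

Detecting the fault doesn't make the protocol Byzantine-fault-tolerant, but it lets a crash-fault-tolerant implementation fail loudly instead of silently re-voting on bad state.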



