Achieving Fast Failovers After Network Partitions in a Distributed SQL Database
In February of this year, Kyle Kingsbury of Jepsen.io was conducting formal testing of YugabyteDB for correctness under extreme and unorthodox conditions. Obviously, simulating all manner of network partitions is part of his testing methodology. As a result, during his testing he spotted the fact that although nodes would reliably come back after a failure, the recovery itself was taking roughly 25 seconds to occur. We certainly didn’t like the sound of that!
…