What Happens to Tablets (Shards) When a Node Is Lost and Then Brought Back Into the Cluster?
In scenarios where you have a running cluster and you lose a node (due to, say, a network partition), there is a process in place to handle this. Keep in mind that, in terms of the CAP theorem, YugabyteDB is a CP database: it prioritizes consistency over availability in the event of a network partition. However, this does not mean it is not highly available. With a replication factor of 3, your cluster can tolerate losing a single node and still serve all application traffic, because the two remaining replicas of each tablet still form a majority.
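The arithmetic behind that claim can be sketched in a few lines. This is only an illustration of majority-quorum math, not YugabyteDB code; the function names `majority` and `tolerable_failures` are made up for this example.

```python
def majority(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority of the group."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replication_factor - majority(replication_factor)


# Illustrative replication factors; the text above discusses RF=3.
for rf in (1, 3, 5, 7):
    print(f"RF={rf}: majority={majority(rf)}, "
          f"tolerates {tolerable_failures(rf)} failed node(s)")
```

With a replication factor of 3 the majority is 2, so one node can be lost. Going from 3 to 4 replicas would not add fault tolerance, which is why odd replication factors are the usual choice.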
Continuous Availability in YugabyteDB
When the node goes down, all leaders sitting on that node (whether a master-leader or a tablet-leader) will go through a 3-second re-election process. This process elects one of the followers to the leader role. During this time, any tablet-group going through re-election will see higher latencies. The same goes for any YB-Master level operations if the master-leader happened to be on that node.
* Continuous availability is one of YugabyteDB’s core design principles. This means a repaired node, once back online, will be caught up by the remaining nodes. The leaders will then be redistributed evenly across all the nodes.
* If you want to see how this compares with other database systems, check out this comparison against the 60-120 second failover window with Amazon Aurora.
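From the application's point of view, the re-election window is a few seconds of slower requests. If your client treats those slow requests as timeouts, one way to ride out the window is a bounded retry with backoff. The sketch below is a generic pattern, not a YugabyteDB driver API: `TransientError`, `call_with_retry`, and the 5-second budget are all assumptions made for illustration, and your driver may already retry for you.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for whatever retryable error your driver raises."""


def call_with_retry(operation, deadline_sec=5.0, initial_delay_sec=0.1,
                    max_delay_sec=1.0, sleep=time.sleep, clock=time.monotonic):
    """Run operation(), retrying on TransientError until the time budget is spent.

    deadline_sec is an assumed budget chosen to exceed the roughly
    3-second re-election window described above.
    """
    start = clock()
    delay = initial_delay_sec
    while True:
        try:
            return operation()
        except TransientError:
            # Give up if waiting again would exceed the overall budget.
            if clock() - start + delay > deadline_sec:
                raise
            sleep(delay)
            # Exponential backoff, capped so retries stay frequent enough
            # to notice the new leader soon after it is elected.
            delay = min(delay * 2, max_delay_sec)
```

Here `operation` would be any zero-argument callable that issues a query and raises `TransientError` when the request times out.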
When a Node Is Down for a Longer Period of Time
By default, if a node is down for longer than 900 seconds (15 minutes), you will have to replace the node, since the system will remove the data from the downed node. This threshold, the time after which a follower is considered failed because the leader has not received a heartbeat from it, is configurable (in seconds). If you expect the node to be down for a long period of time, we recommend adding a new node to the quorum and removing the downed node. Data replication to the newly introduced node happens behind the scenes, with no manual steps required from the user.
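The decision rule described above can be sketched as a simple threshold check. This is not how YugabyteDB implements it internally. The function name, the returned strings, and the constant name are made up for this example; only the 900-second default comes from the text.

```python
# Default threshold from the text above: 900 seconds (15 minutes).
FOLLOWER_FAILED_AFTER_SEC = 900


def follower_state(seconds_since_last_contact: float,
                   threshold_sec: float = FOLLOWER_FAILED_AFTER_SEC) -> str:
    """Classify a downed follower by how long it has been unreachable."""
    if seconds_since_last_contact < 0:
        raise ValueError("elapsed time cannot be negative")
    if seconds_since_last_contact > threshold_sec:
        # Past the threshold: the node's data is removed, so replace the node.
        return "failed: replace the node"
    # Within the threshold: the node can rejoin and be caught up.
    return "recoverable: node will be caught up when it returns"


print(follower_state(120))   # a two-minute outage
print(follower_state(3600))  # a one-hour outage
```

A two-minute outage stays within the default threshold, while a one-hour outage exceeds it and calls for replacing the node.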
