The Role of Consensus in Replicated Storage Systems

Paxos agrees on a single decision. Multi-Paxos, like Raft, agrees on a sequence of decisions, which can form a replicated log.
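
A minimal sketch of this distinction, where `propose` is a toy stand-in for a full Paxos/Raft round (none of these names come from a real library):

    # Single-decree consensus decides one value; a replicated log runs one
    # consensus instance per log slot.
    class ReplicatedLog:
        def __init__(self):
            self.slots = {}  # slot index -> decided value

        def propose(self, slot, value):
            # A real system would run a consensus round among replicas here;
            # we simply record the first value "decided" for the slot.
            return self.slots.setdefault(slot, value)

        def append(self, value):
            # Multi-Paxos / Raft style: agree on a sequence of decisions.
            slot = len(self.slots)
            return slot, self.propose(slot, value)

    log = ReplicatedLog()
    log.append("set x=1")  # decided in slot 0
    log.append("set y=2")  # decided in slot 1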

Apache Kafka: KRaft

Terminology: a Kafka topic (more precisely, each of its partitions) behaves like a replicated log for a storage node.

Kafka is a distributed event-streaming system composed of clients and servers that communicate over TCP. It is deployed in various environments, including bare-metal hardware, VMs, and containers, both on premises and in the cloud.

Kafka replicates the log for each topic partition across a configurable number of servers. It adopts a Raft-like protocol to keep the leader's and followers' logs identical. However, instead of using majority vote, a Kafka node maintains a set of in-sync (active, live) replicas (the ISR, similar to a session list) from which a leader is elected. The drawback of majority vote is that only a few failures can leave the quorum system with no electable leader, and providing enough redundancy to tolerate more failures costs substantial throughput for large data volumes. A write to a topic partition is not committed until all replicas in the ISR have received it, so any in-sync replica is eligible to become leader. With this ISR approach, f+1 replicas let a Kafka topic tolerate f failures without losing committed messages.

Another Kafka design principle is that crashed nodes are not required to recover with all their data intact: a replica must fully re-sync before rejoining the ISR, regardless of any unflushed data lost in its crash.
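
The two rules combine into a simple commit/rejoin protocol. Below is a toy model of it (my own simplification, not Kafka's actual code or API):

    class Partition:
        def __init__(self, replicas):
            self.log = []                              # leader's log
            self.received = {r: 0 for r in replicas}   # entries held per replica
            self.isr = set(replicas)                   # in-sync replica set

        def append(self, record):
            self.log.append(record)

        def ack(self, replica, count):
            self.received[replica] = count

        def committed(self):
            # High-water mark: entries received by *all* ISR members.
            return min(self.received[r] for r in self.isr)

        def drop(self, replica):
            # A lagging or dead replica is removed from the ISR.
            self.isr.discard(replica)

        def try_rejoin(self, replica):
            # A recovering replica re-enters the ISR only after a full
            # re-sync, even if it lost unflushed data in its crash.
            if self.received[replica] == len(self.log):
                self.isr.add(replica)

    p = Partition(["r1", "r2", "r3"])
    p.append("m0"); p.ack("r1", 1); p.ack("r2", 1)
    print(p.committed())  # 0: r3 has not received m0 yet
    p.drop("r3")
    print(p.committed())  # 1: all remaining ISR members have m0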

Kafka's consistency guarantee rests on having at least one in-sync replica. For a system in which all the replicas die, there are two possible behaviors:

  1. Wait for a replica in the ISR to come back to life and choose it as the leader.
  2. Choose the first replica to come back to life (inside or outside the ISR) as the leader.

(Availability and Consistency) Kafka uses the first approach by default; the trade-off is between availability and consistency. If the data on the last in-sync replica is lost, the system stays down under the first approach, unless we violate the consistency guarantee and take the first replica that comes back to life as the leader, making the system available again.
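
In Kafka this choice corresponds to the unclean.leader.election.enable setting (false by default, i.e. the first approach). A sketch of the two policies, again my own simplification:

    def pick_leader(live_replicas, isr, allow_unclean=False):
        # Policy 1 (default): only an ISR member may become leader; if none
        # is alive, return None and wait, favoring consistency over availability.
        in_sync_live = [r for r in live_replicas if r in isr]
        if in_sync_live:
            return in_sync_live[0]
        # Policy 2 (unclean): take whoever is alive; the partition becomes
        # available again, but committed messages may be lost.
        if allow_unclean and live_replicas:
            return live_replicas[0]
        return None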

(Availability and Durability) Kafka provides two configurations related to the number of acknowledgements (acks) required from replicas; an example follows the list.

  1. The topic's partition remains unavailable while all replicas are down.
  2. The partition accepts writes only while the ISR size stays above a configured minimum.
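
These map to real settings: the producer's acks and the topic/broker min.insync.replicas. A sketch assuming the kafka-python client and a placeholder broker address:

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder address
        acks="all",  # wait until every ISR member has the write
    )
    # With min.insync.replicas=2 set on the topic, this send fails rather
    # than succeed while fewer than two replicas are in sync.
    future = producer.send("events", b"payload")
    future.get(timeout=10)  # raises on broker-side rejection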

A Kafka cluster manages hundreds of such partitions. A special controller within each cluster tracks the status of all nodes. If a node dies, the controller elects a new leader from the remaining ISR members; if the controller itself fails, another controller is elected.

Ceph Monitors: Paxos

Ceph is a software-defined storage system designed to store multiple petabytes of data with high performance and no single point of failure. It is deployed on commodity hardware or in the cloud. The Reliable Autonomic Distributed Object Store (RADOS) is an object store composed of storage nodes that distributes data reliably and consistently. An Object Storage Daemon (OSD) is a service running on a storage node that writes data to its drives.

Data replication is performed by OSDs. A Ceph client writes to the primary OSD, which it locates via CRUSH and the cluster map; the primary OSD then writes the data to the secondary OSDs, located the same way. Each OSD writes the data to its own drive.
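
A highly simplified sketch of that flow; real CRUSH is a hierarchical, weighted placement algorithm, so the hash below is only a stand-in:

    import hashlib

    OSDS = ["osd.0", "osd.1", "osd.2", "osd.3"]

    def place(object_name, replicas=3):
        # Deterministic placement: client and OSDs compute the same mapping
        # from the object name and the cluster map, with no lookup service.
        h = int(hashlib.sha256(object_name.encode()).hexdigest(), 16)
        start = h % len(OSDS)
        return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

    acting = place("my-object")
    primary, secondaries = acting[0], acting[1:]
    # The client writes to `primary`; the primary replicates the write to
    # `secondaries` before acknowledging the client.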

A Ceph storage cluster also runs monitors, which maintain the cluster topology that the CRUSH algorithm consumes. Ceph monitors use Paxos with majority vote to achieve strong consistency of the cluster map. They maintain the master copy of the cluster map, including changes to their own operating states, the state of the cluster, the cluster membership, and so on. A single monitor cannot survive a single-point failure; at least three monitors are needed to tolerate one.
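
The monitor count follows the usual majority-vote arithmetic: 2f+1 members tolerate f failures.

    def tolerated_failures(num_monitors):
        # Majority vote: a quorum needs floor(n/2) + 1 members, so the
        # cluster survives the loss of (n - 1) // 2 of them.
        return (num_monitors - 1) // 2

    for n in (1, 3, 5):
        print(f"{n} monitor(s) tolerate {tolerated_failures(n)} failure(s)")
    # 1 -> 0, 3 -> 1, 5 -> 2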

EBS Physalia: Paxos

To be continued…

DRBD Split-brain Solution

Terminology:

  1. Replicated data set: since a volume contains the replicated data set plus a set of metadata, the term denotes the data itself, not metadata or state.

A split-brain in DRBD arises, for example, when the secondary node is disconnected from the primary and comes to see itself as the "primary". The two nodes' data can become inconsistent as both start serving writes.

DRBD tries to solve network partitions through prevention (relying on an external cluster resource manager such as Pacemaker), pre-configured auto-recovery (dangerous, since some data may be lost), manual intervention once the issue is detected (though it is unclear how to reconcile divergent data even with admin access), and the quorum method. We will look into the quorum method.

DRBD quorum is a three-node quorum composed of the original two nodes (primary and secondary) plus a third server, which may be diskless. The main idea is to apply the majority rule over a quorum that includes the third node. Data can be written to a node only if that node is connected to a majority of the cluster's nodes (counting itself); the third node serves as a tie-breaker. However, the third (diskless) server can become the bottleneck of a two-node cluster. One exception: when a primary's secondaries leave gracefully, meaning the data on those nodes is known to be out-of-date, the primary can continue writing.
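
A sketch of this majority rule (simplified terminology; not DRBD's actual code):

    def has_quorum(connected_peers, cluster_size):
        # A node may keep the Primary role / accept writes only while it
        # sees a majority of the cluster's nodes, counting itself.
        visible = 1 + len(connected_peers)
        return visible * 2 > cluster_size

    # Three-node cluster: A (primary), B (secondary), C (diskless tiebreaker).
    print(has_quorum(["C"], 3))  # True: A + C form a majority, A keeps writing
    print(has_quorum([], 3))     # False: an isolated node must not promote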

I argue that the downsides of DRBD quorum can be attributed to two sources:

  1. Cluster membership change in DRBD depends on either an external membership management tool (Pacemaker) or the internal auto-promote configuration (v9.1+). Both take a few seconds to promote a new primary node, which may happen frequently in a quorum system.
  2. Whether a node has a disk becomes a factor in DRBD's majority rule. If the third node C is first connected only to the primary node A, and is then disconnected from A and connected to B, B cannot be promoted to the new primary because B cannot check whether C is up-to-date.

This majority count of nodes differs from the majority vote applied by consensus algorithms such as Raft: it does not achieve a true majority of writes across nodes in a three-node cluster with a diskless node. Even if it could, there are still error cases to consider:

  1. Stale majority: a network partition {Node1} vs {Node2, Node3} in which Node2 holds stale data (sketched after this list).
  2. Network partitions: a node leaves one partition for another, changing the majority; during the actual promotion/demotion window, two primaries may coexist.
  3. Quorum loss and recovery.
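
A sketch of the stale-majority case; the node layout below is only illustrative:

    nodes = {
        "Node1": {"up_to_date": True,  "diskless": False},
        "Node2": {"up_to_date": False, "diskless": False},  # stale data
        "Node3": {"up_to_date": False, "diskless": True},   # tiebreaker, no data
    }
    partition = ["Node2", "Node3"]

    has_majority = len(partition) * 2 > len(nodes)  # True: 2 of 3 nodes
    has_fresh_data = any(
        nodes[n]["up_to_date"] for n in partition if not nodes[n]["diskless"]
    )                                               # False: only Node2's stale copy
    # The quorum rule is satisfied by node count, yet promoting Node2 would
    # serve stale data: a majority of nodes is not a majority of writes.
    print(has_majority, has_fresh_data)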

In conclusion, if RMR requires strong consistency for pool metadata, an RMR pool group should have more than three nodes and use a consensus algorithm such as Raft or Paxos. However, that may be challenging to implement and test.

References:

  1. DRBD User's Guide 9.0, Split Brain Notification and Recovery: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-split-brain-notification-and-recovery
  2. DRBD User's Guide 9.0, Configuring a Quorum Tiebreaker: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-configuring-quorum-tiebreaker
  3. Apache Kafka documentation
  4. Understanding Ceph: open-source scalable storage
  5. Ceph as software-defined storage (PDF)
  6. Architecture — Ceph Documentation
  7. Millions of Tiny Databases, USENIX NSDI 2020