Distributed Systems notes

Design goals

  • Infrastructure services: storage, communication, computing. The course goal is to build an abstraction that hides the distribution while realizing high performance.

  • performance -> scalability

    • scenario: n users <=> 1 web server <=> 1 database
    • When n is 1 million, adding web servers and splitting user accounts across them improves performance until web-to-db communication becomes the bottleneck, i.e., more web servers no longer help. Then the db must be refactored as well, splitting its data across multiple dbs (see the sketch after this list).
    • terms
      • scalability: N times the computers/resources yields N times the throughput.
    • special cases where adding machines does not help
      • quick response time for a single user request
      • all users want to update the same data
  • fault tolerance -> availability, recoverability

    • scenario: something is always broken when running 1,000 computers, versus running 1

    • rare events become real problems

    • terms

      • availability: system will keep operating when certain kinds of failures occur. It will not be available when too many failures occur.

      • recoverability: the system stops running/responding to requests until the failed components are repaired. It must save the latest data on disk (or similar) so it can recover when powered back up.

      • memory access speed vs. disk access speed

      • typical scale: a 3 GHz microprocessor

    • tools

      • non-volatile storage: store a log of state. -> but writing NV storage is slow (moving disk arms and waiting for the platter to rotate)
      • replicated servers: use a copy -> but the two replicas may drift out of sync
  • consistency

    • General-purpose infrastructure needs well-defined behavior.

    • anything could break and cause replicas to be lost or to diverge.

    • strong consistency often means poor performance relative to weaker consistency.
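
A minimal sketch of the data-splitting idea from the scalability scenario above, in Go. The user IDs, shard names, and the hash-mod placement are illustrative assumptions, not anything the notes prescribe:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a user ID to one of nShards databases. Splitting data this
// way lets each shard serve a fraction of the load, so adding shards adds
// throughput -- until some other component becomes the bottleneck.
func shardFor(userID string, nShards int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32()) % nShards
}

func main() {
	shards := []string{"db0", "db1", "db2"}
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> %s\n", u, shards[shardFor(u, len(shards))])
	}
}
```

A fixed hash-mod scheme like this requires moving data when shards are added; real systems typically use consistent hashing or a directory for that reason.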

Case study: MapReduce 2004

Consistency and linearizability

Case study: GFS 2003

Primary/Backup Replication

  1. One way of providing fault tolerance: Replication.
  2. Replication deals with “fail-stop” failure of a single replica. It generally cannot cope with hardware design defects, software bugs, or human configuration errors, since those affect all replicas. Geographic failures are handled only if the replicas are physically separated.
  3. Replication assumes that failures in one server are independent of failures in another; otherwise replication is useless.
  4. Whether replication is worth the N× expense in computing resources depends on the consequences of failures.
  5. Two main approaches:
    • state transfer: primary sends internal memory (state) to backup
    • replicated state machine: clients send operations to the primary, and the primary forwards those external events to the backups. Because execution is deterministic, all replicas end up in the same state as the primary as long as they execute the operations in the same order (see the sketch after this list).
  6. tradeoffs:
    • state transfer is simpler, but the state may be large and slow to transfer over the network
    • replicated state machine: the operations are usually much smaller than the state and generate less network traffic, but the scheme is complex to get right.
  7. example: VM-FT uses a replicated state machine on a single-core processor; it switches to a state-transfer scheme when extended to multi-core.
  8. Design points of replication schema
    • What state to replicate?
    • Does primary have to wait for backup?
    • When to cut over to backup?
    • Are anomalies visible at cut-over?
    • How to bring a replacement backup up to speed?
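
A minimal sketch of the replicated-state-machine approach from item 5 above (assumed key/value operations; all names are hypothetical). The point is that deterministic operations applied in the same order produce identical state on every replica:

```go
package main

import "fmt"

// Op is a deterministic operation on the replicated state.
type Op struct {
	Key   string
	Value string
}

// StateMachine applies operations in order; any replica that applies the
// same operations in the same order ends up in the same state.
type StateMachine struct {
	data map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]string)}
}

func (s *StateMachine) Apply(op Op) {
	s.data[op.Key] = op.Value
}

func main() {
	// The order of operations is decided by the primary.
	ops := []Op{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	primary, backup := NewStateMachine(), NewStateMachine()
	for _, op := range ops {
		primary.Apply(op) // primary executes the op...
		backup.Apply(op)  // ...and forwards it; the backup replays the same sequence
	}
	fmt.Println(primary.data, backup.data) // identical state
}
```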

Problem 1: At what level do we want replicas to be identical?

application state: forward the operation stream; can be efficient. Example: GFS.

machine level (registers and RAM contents): forward machine events (interrupts, DMA, &c); not as efficient. Example: running existing software without modification.

Case study: VMware FT 2010

Terms:

  1. vm-ft: fault-tolerant virtual machines
  2. fail-stop: if something goes wrong, the computer stops executing instead of producing incorrect results, e.g. the power cable is unplugged or the CPU overheats.
  3. failover: if something goes wrong, the computer will be replaced by another one.
  4. virtual lockstep
  5. mode: logging, replaying, normal
  6. a deferred-execution context (similar to a tasklet in Linux) is a mechanism for deferring work to be executed later at a more appropriate time

FT design

  1. Synchronization of primary and backup VMs is based on the technique of deterministic replay. Briefly, deterministic replay records the primary's non-deterministic operations/events as log entries in a log file; the backup reads the log and replays the primary's execution.
    • non-determinism of events and operations in VMs: virtual interrupts, reading the clock cycle counter of the processor
    • question: non-deterministic input (capture and apply) -> deterministic execution + performance unaffected (at instruction level)
    • That deterministic-replay technique was introduced for the VMware vSphere platform in 2007.
  2. FT protocol: No data is lost if a backup takes over after the primary fails.
    • Output requirement <- output rule: after a failover, the backup must continue executing consistently with all output the primary has already sent to the external world. The output rule guarantees this: the primary delays sending output until the backup has received and acknowledged the log entry of the operation that produced the output (see the sketch after this list).
    • Lost packets can be tolerated: incoming packets may be dropped during a failure of the primary, but packets can also be dropped for reasons unrelated to that failure, so clients already have to handle loss.
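
A minimal sketch of the output rule described above, using channels to stand in for the logging channel and hypothetical names throughout: the primary ships the log entry for the output-producing event, waits for the backup's acknowledgement, and only then releases the output externally.

```go
package main

import "fmt"

// LogEntry is a hypothetical record of a non-deterministic event that the
// backup must replay; Seq orders entries on the logging channel.
type LogEntry struct {
	Seq   int
	Event string
}

// releaseOutput enforces the output rule: send the log entry for the
// output-producing event to the backup, wait for an acknowledgement that
// covers it, and only then let the output reach the external world.
func releaseOutput(toBackup chan<- LogEntry, acks <-chan int, e LogEntry, emit func()) {
	toBackup <- e
	for ack := range acks {
		if ack >= e.Seq { // the backup has received (and will replay) this entry
			emit()
			return
		}
	}
}

func main() {
	toBackup := make(chan LogEntry, 1)
	acks := make(chan int, 1)

	// A stand-in "backup" that immediately acknowledges what it receives.
	go func() {
		for e := range toBackup {
			acks <- e.Seq
		}
	}()

	releaseOutput(toBackup, acks, LogEntry{Seq: 7, Event: "network output"}, func() {
		fmt.Println("output released to the client")
	})
}
```

Note that the primary may keep executing past the output-producing event; only the externally visible output is delayed.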

failure handling

  1. context: the backup VM's execution lags behind the primary's. To keep the lag small (typically under 100 ms) and bounded, the primary VM's CPU limit is adjusted: lowered so the backup can catch up, raised again once it has.

  2. failure detection: use heartbeating (UDP) to monitor the traffic on the logging channel that connects primary and backup. A timeout in the flow of log entries and acknowledgements is detected as a failure.

  3. split-brain problem: the requirement is that only one VM (primary or backup) goes live and takes over execution.

    Problem 0: How does VM-FT handle network partitions? That is, is it possible that if the primary and the backup end up in different network partitions that the backup will become a primary too and the system will run with two primaries?

    There will be only one VM executing during a network partition. The primary delays sending outputs to the external world until the backup has received and acknowledged the corresponding log entries. When a partition occurs, the primary and backup lose contact with each other, and neither can tell whether the cause is a network failure or a fault in the other VM.

    VM-FT therefore performs an atomic test-and-set operation on the shared storage that holds the VMs' virtual disks whenever the primary or the backup wants to go live. If the operation succeeds, the VM is allowed to go live. If it fails, another VM must already be live, so the current VM halts itself. If the VM cannot reach the shared storage, it waits until it can (see the sketch after this list).

  4. Start a new backup on another host: (primary, backup) = (P1, B1) -> (P1 is down, B1 takes over) -> (B1, ?) -> (B1, B2)

    • Based on VMware VMotion: migration of a running VM from one server to another server within one second.
    • Modified FT VMotion clones a VM to a remote host rather than migrating it, again with barely noticeable interruption of the running primary.
    • Simplified steps: B1 runs in logging mode and B2 will run in replaying mode -> a clustering service chooses a server to host B2 -> cloning…
  5. To eliminate non-determinism

    • on disk IO issues:

      • parallel disk IOs may access the same location on the shared storage -> detect such IO races and force the racing operations to execute sequentially
      • a disk operation can race with an application memory access when both touch the same memory page at the same time -> MMU page protection with traps would work but is too expensive => use bounce buffers (cheaper) for the disk reads/writes
      • when a failure happens, there is no way to tell whether the IOs issued around that time completed -> re-issue the pending IOs
    • network IO issues: asynchronous updates to a VM's network state while the VM is executing would be a source of non-determinism -> such updates are trapped and delivered via the VM's traps and interrupts, and outgoing packets are delayed by the output rule -> performance challenges
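
A minimal sketch of the go-live decision from item 3 above. sync/atomic stands in for the test-and-set location, which in the real system lives on the shared storage array; the VM names and functions are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// goLiveFlag stands in for the test-and-set word on shared storage.
var goLiveFlag atomic.Int32

// tryGoLive returns true if this VM wins the test-and-set and may go live;
// the loser concludes another VM is already live and halts itself.
func tryGoLive(name string) bool {
	won := goLiveFlag.CompareAndSwap(0, 1)
	if won {
		fmt.Println(name, "goes live")
	} else {
		fmt.Println(name, "halts: the other VM is already live")
	}
	return won
}

func main() {
	var wg sync.WaitGroup
	// During a partition, primary and backup may each believe the other has
	// failed and race to go live; the atomic test-and-set lets exactly one win.
	for _, vm := range []string{"primary", "backup"} {
		wg.Add(1)
		go func(vm string) {
			defer wg.Done()
			tryGoLive(vm)
		}(vm)
	}
	wg.Wait()
}
```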

Aside (ignored here): operational aspects of VM-FT

design alternatives

non-shared storage vs shared storage

advantages (of non-shared storage):

  1. no delaying disk writes for the primary VM
  2. adds availability when the primary and backup VMs are far apart

disadvantages (of non-shared storage):

  1. need an alternative mechanism to handle network partitions (no shared storage to host the test-and-set)
  2. need to explicitly keep the VMs' disks in sync

Verified Primary/Backup

case study: Grove 2023

Raft

Problem 1: Suppose we have the scenario shown in the Raft paper’s Figure 7: a cluster of seven servers, with the log contents shown. The first server crashes (the one at the top of the figure), and cannot be contacted. A leader election ensues. For each of the servers marked (a), (d), and (f), could that server be elected? If yes, which servers would vote for it? If no, what specific Raft mechanism(s) would prevent it from being elected?

Problem 2: Could a received InstallSnapshot RPC cause the state machine to go backwards in time? That is, could step 8 in Figure 13 cause the state machine to be reset so that it reflects fewer executed operations? If yes, explain how this could happen. If no, explain why it can’t happen.

Paxos

Assumptions: non-Byzantine failures. Messages may be lost, duplicated, or delivered out of order.

P1. An acceptor must accept the first proposal that it receives.

To get a value chosen (accepted by a majority) despite lost messages and multiple proposers, an acceptor must be allowed to accept more than one proposal. Proposals are therefore distinguished by numbering them: a proposal is (number, value).

A proposal is chosen when it has been accepted by a majority of acceptors; a value is chosen when a single proposal with that value has been chosen.

Multiple proposals may be chosen, but then we must guarantee that all chosen proposals have the same value.

P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
-> P2a. …then every higher-numbered proposal accepted by any acceptor has value v.
-> P2b. …then every higher-numbered proposal issued by any proposer has value v.
-> P2c. For any v and n, if a proposal (n, v) is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.

To maintain P2c, a proposer that wants to issue proposal n must control future acceptances: it asks acceptors to promise that they will accept no more proposals numbered less than n.

Proposer’s algorithm:
(1) a prepare req
A proposer sends a proposal number n and asks its acceptors to respond with:

  1. a promise never again to accept a proposal numbered less than n
  2. the proposal with the highest number less than n that it has accepted (the previously accepted value), if any.

Once the proposer has responses from a majority of the acceptors, it moves to phase 2.

(2) an accept request, sent to some set of acceptors (not necessarily the same set that responded in phase 1).
The proposer issues a proposal (n, v), where v is the value of the highest-numbered proposal among the phase-1 responses, or any value the proposer chooses if the responses reported no proposals (a proposer-side sketch follows below).
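
A minimal proposer-side sketch of the two phases just described. RPC, failure handling, and proposal-number selection are omitted, and every name here is a hypothetical stand-in:

```go
package main

import "fmt"

// Accepted is a previously accepted proposal reported in a prepare response.
type Accepted struct {
	N     int
	Value string
}

// PrepareReply is an acceptor's answer to prepare(n): a promise plus the
// highest-numbered proposal it has already accepted, if any.
type PrepareReply struct {
	Promised bool
	Prev     *Accepted
}

// Acceptor is anything that can answer prepare and accept requests.
type Acceptor interface {
	Prepare(n int) PrepareReply
	Accept(n int, v string) bool
}

// propose runs the proposer's two phases for proposal number n with candidate
// value v. The value issued in phase 2 is the value of the highest-numbered
// proposal reported in phase 1, or v if no proposal was reported.
func propose(acceptors []Acceptor, n int, v string) (string, bool) {
	majority := len(acceptors)/2 + 1

	// Phase 1: gather promises and the highest-numbered accepted proposal.
	promises := 0
	var highest *Accepted
	for _, a := range acceptors {
		r := a.Prepare(n)
		if !r.Promised {
			continue
		}
		promises++
		if r.Prev != nil && (highest == nil || r.Prev.N > highest.N) {
			highest = r.Prev
		}
	}
	if promises < majority {
		return "", false
	}
	if highest != nil {
		v = highest.Value // must adopt the reported value
	}

	// Phase 2: issue accept(n, v) and count acceptances.
	accepts := 0
	for _, a := range acceptors {
		if a.Accept(n, v) {
			accepts++
		}
	}
	return v, accepts >= majority
}

// yesAcceptor always promises and accepts; it exists only to exercise propose.
type yesAcceptor struct{}

func (yesAcceptor) Prepare(n int) PrepareReply  { return PrepareReply{Promised: true} }
func (yesAcceptor) Accept(n int, v string) bool { return true }

func main() {
	acceptors := []Acceptor{yesAcceptor{}, yesAcceptor{}, yesAcceptor{}}
	v, ok := propose(acceptors, 1, "foo")
	fmt.Println(v, ok) // foo true
}
```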

P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.

Phase 1:
(1) proposer: sends a prepare req (n)
(2) acceptor: responds with a promise not to accept proposals numbered less than n, plus the highest-numbered proposal it has accepted, or null if none

Phase 2:
(1) proposer: sends an accept req (n, v) to some set of acceptors
(2) acceptor: accepts (n, v) unless it has already responded to a prepare request numbered greater than n (P1a; an acceptor-side sketch follows below).
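
A minimal acceptor-side sketch matching P1a and the phase descriptions above; persistence to stable storage (see the implementation notes later in this section) is reduced to comments, and all names are hypothetical.

```go
package main

import "fmt"

// Proposal is a numbered value, as in (n, v).
type Proposal struct {
	N     int
	Value string
}

// PaxosAcceptor keeps the minimal state an acceptor needs.
type PaxosAcceptor struct {
	promisedN int       // highest prepare-request number responded to
	accepted  *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles prepare(n): promise not to accept proposals numbered below
// n and report the highest-numbered proposal already accepted, if any.
func (a *PaxosAcceptor) Prepare(n int) (prev *Proposal, promised bool) {
	if n <= a.promisedN {
		return nil, false // optimization: ignore lower-numbered prepares
	}
	a.promisedN = n
	// A real acceptor persists promisedN to stable storage before replying.
	return a.accepted, true
}

// Accept handles accept(n, v): accept iff no prepare request numbered greater
// than n has already been answered (P1a).
func (a *PaxosAcceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.accepted = &Proposal{N: n, Value: v}
	// A real acceptor persists the accepted proposal before replying.
	return true
}

func main() {
	acc := &PaxosAcceptor{}
	prev, ok := acc.Prepare(1)
	fmt.Println(prev, ok)             // <nil> true: nothing accepted yet
	fmt.Println(acc.Accept(1, "foo")) // true
	prev, ok = acc.Prepare(2)
	fmt.Println(prev.Value, ok)       // foo true: a later proposer must reuse "foo"
	fmt.Println(acc.Accept(1, "bar")) // false: superseded by prepare(2)
}
```

The short trace in main mirrors the scenario in the question at the end of this section: once some acceptor has accepted (1, “foo”), any later prepare reports it, forcing higher-numbered proposals to carry “foo”.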

Optimizations:

  1. an acceptor can ignore a prepare request numbered n if it has already responded to a prepare request with a number greater than n.
  2. when an acceptor ignores a request, it can inform the proposer of the higher number it has already promised, so the proposer can abandon its lower-numbered proposal.

Phase 3 (learning): acceptors report their acceptances to some set of distinguished learners, which in turn inform the other learners when a value has been chosen.

The more distinguished learners there are, the greater the reliability, but also the greater the communication complexity.

Progress/termination: a distinguished proposer must be selected as the only one to issue proposals. A reliable algorithm for electing the proposer must use either randomness or real time (e.g. timeouts).

implementation:

Paxos chooses a leader to perform the role of the distinguished proposer and the distinguished learner.

For persistence, an acceptor records its intended response in stable storage before sending it, and each proposer records the highest-numbered proposal it has issued in stable storage before issuing it.

state machine: ???

Suppose that the acceptors are A, B, and C. A and B are also proposers. How does Paxos ensure that the following sequence of events can’t happen? What actually happens, and which value is ultimately chosen?

  1. A sends prepare requests with proposal number 1, and gets responses from A, B, and C.
  2. A sends accept(1, "foo") to A and C and gets responses from both. Because a majority accepted, A thinks that "foo" has been chosen. However, A crashes before sending an accept to B.
  3. B sends prepare messages with proposal number 2, and gets responses from B and C.
  4. B sends accept(2, "bar") messages to B and C and gets responses from both, so B thinks that "bar" has been chosen.

After step 2, C has accepted proposal (1, “foo”). At step 3, B’s prepare(2) gets a response from C reporting the accepted proposal (1, “foo”) and a null response from B itself, so B must adopt the value “foo”. At step 4, B therefore does not send accept(2, “bar”) but accept(2, “foo”), and “foo” is the value ultimately chosen.