Distributed Systems notes

Design goals

  • Infrastructure services: storage, communication, computing. The course goal is to build an abstraction that hides the distribution while realizing high performance.

  • performance -> scalability

    • scenario: n users <=> 1 web server <=> 1 database
    • When n reaches 1 million, adding web servers and splitting user accounts across them improves performance until web-to-db communication becomes the bottleneck, i.e., more web servers no longer help. Then the database must be refactored by splitting the data across multiple DBs (see the sharding sketch after this list).
    • terms
      • scalability: N times the computers/resources can achieve N times the throughput.
    • special cases where adding machines does not help
      • quick response time for a single user request
      • all users want to update the same data
  • fault tolerance -> availability, recoverability

    • scenario: with 1000 computers, something is always broken, versus running just 1

    • rare events become real problems

    • terms

      • availability: system will keep operating when certain kinds of failures occur. It will not be available when too many failures occur.

      • recoverability: the system stops running/responding to requests until the failed components are repaired. It needs to save its latest data on disk (etc.) so it can recover when it powers back up.

      • memory access speed vs. disk: RAM access is nanoseconds, a hard-disk write is milliseconds

      • normal scale: 3 GHz microprocessor

    • tools

      • non-volatile storage: store a log of state changes -> but writing NV storage is slow (must move the disk arm and wait for the platter to rotate); see the log sketch after this list
      • replicated servers: if one crashes, use the other copy -> but the two replicas may drift out of sync
  • consistency

    • General-purpose infrastructure needs well-defined behavior.

    • anything could break partway through an update and leave a replica lost or out of sync.

    • strong consistency often means poor performance relative to weaker consistency.
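
A minimal sketch (not from the notes) of the data-splitting idea above: hash each user ID onto one of several database shards so that each DB serves only a fraction of the users. The shard names and the lookupShard helper are made up for illustration.

```go
// Hypothetical sharding helper: route each user to one of N database shards.
package main

import (
	"fmt"
	"hash/fnv"
)

// dbShards stands in for N independent database servers.
var dbShards = []string{"db0", "db1", "db2"}

// lookupShard maps a user ID to one shard, so each DB sees only ~1/N of the users.
func lookupShard(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return dbShards[h.Sum32()%uint32(len(dbShards))]
}

func main() {
	for _, u := range []string{"alice", "bob", "carol"} {
		fmt.Printf("user %s -> shard %s\n", u, lookupShard(u))
	}
}
```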
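
And a minimal sketch, again an assumption rather than anything from the notes, of the recoverability tool above: append each update to a write-ahead log, force it to disk before replying, and replay the log after a reboot. The wal.log file name and helper names are hypothetical; the Sync() call is the slow disk write the notes warn about.

```go
// Hypothetical write-ahead log for crash recovery.
package main

import (
	"bufio"
	"fmt"
	"os"
)

// appendUpdate durably records one update before the server acknowledges it.
func appendUpdate(logPath, record string) error {
	f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.WriteString(record + "\n"); err != nil {
		return err
	}
	return f.Sync() // force the record to non-volatile storage -- the expensive step
}

// replay rebuilds the in-memory state from the log after a crash/reboot.
func replay(logPath string) ([]string, error) {
	f, err := os.Open(logPath)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // nothing to recover
		}
		return nil, err
	}
	defer f.Close()
	var updates []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		updates = append(updates, sc.Text()) // re-apply each logged update
	}
	return updates, sc.Err()
}

func main() {
	_ = appendUpdate("wal.log", "put x=1")
	updates, _ := replay("wal.log")
	fmt.Println("recovered updates:", updates)
}
```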

case study: MapReduce 2004

case study: GFS 2003

Primary/Backup Replication

  1. One way of providing fault tolerance: Replication.
  2. Replication handles “fail-stop” failure of a single replica. It may not cope with hardware defects, software bugs, or human configuration errors, which often affect all replicas identically. Geographic failures (e.g., an earthquake) are handled only if the replicas are physically separated.
  3. Replication assumes failures in one server are independent of failures in another; otherwise replication is useless.
  4. Whether replication is worth the N× expense in computing resources depends on the consequences of failures.
  5. Two main approaches:
    • state transfer: primary sends internal memory (state) to backup
    • replicated state machine: clients send operations to the primary, and the primary forwards those external events to the backups. If the operations are deterministic and every replica executes them in the same order, all replicas end up in the same state as the primary (see the sketch after this list).
  6. tradeoffs:
    • state transfer is simpler, but the state may be large and slow to transfer over the network
    • replicated state machine: operations are usually much smaller than the state and generate less network traffic, but the scheme is complex to get right
  7. example: VM-FT uses a replicated state machine on a single-core processor; the multi-core extension switches to a state-transfer scheme.
  8. Design points of replication schema
    • What state to replicate?
    • Does primary have to wait for backup?
    • When to cut over to backup?
    • Are anomalies visible at cut-over?
    • How to bring a replacement backup up to speed?
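
A minimal sketch, assuming a toy key/value service, of the replicated-state-machine idea in item 5: the primary applies each client operation and forwards the same operation stream, in the same order, to the backup; because apply() is deterministic, both replicas converge to identical state. The types and names here are illustrative, not VM-FT's actual mechanism.

```go
// Toy replicated state machine: same deterministic ops, same order, same state.
package main

import "fmt"

type op struct {
	key, value string
}

type replica struct {
	state map[string]string
}

// apply is deterministic: the same ops in the same order always yield the same state.
func (r *replica) apply(o op) {
	r.state[o.key] = o.value
}

func main() {
	primary := &replica{state: map[string]string{}}
	backup := &replica{state: map[string]string{}}

	ops := []op{{"x", "1"}, {"y", "2"}, {"x", "3"}} // client operations in arrival order

	for _, o := range ops {
		primary.apply(o) // primary executes the operation...
		backup.apply(o)  // ...and forwards it so the backup executes it too
	}

	fmt.Println("primary:", primary.state)
	fmt.Println("backup: ", backup.state)
}
```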

Problem 1: At what level do we want replicas to be identical?

application state: forward op stream, can be efficient. ex: GFS.

machine level, registers and RAM content: forward machine events (interrupts, DMA, &c), not as efficient. ex: run existing software without modification

terms

  1. fail-stop: if something goes wrong, the computer stops executing instead of producing incorrect results, e.g., someone unplugs the server's power cable, or the CPU overheats and shuts down.

case study: VMware FT 2010

Problem 0: How does VM-FT handle network partitions? That is, is it possible that if the primary and the backup end up in different network partitions that the backup will become a primary too and the system will run with two primaries?

There will be only one primary executing across the partitions. The primary VM may not send output to the external world until the backup has received and acknowledged the log entry associated with that output. When a network partition happens, the primary cannot get a response from the backup and the backup cannot get a heartbeat from the primary; either the servers have failed or network connectivity has been lost, and the two cases are indistinguishable.

VM-FT therefore performs an atomic test-and-set operation on the shared storage that holds the VMs' virtual disks whenever the primary or the backup wants to go live. If the operation succeeds, that VM is allowed to go live. If it fails, another VM must already have gone live, so the current VM halts itself. If a VM cannot access the shared storage, it simply waits until it can. A sketch of this decision logic follows.
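
A rough sketch of that go-live decision (not VMware's actual interface): a process-local atomic flag stands in for the test-and-set variable on shared storage, just to show that only the replica that wins the test-and-set is allowed to continue as primary.

```go
// Hypothetical go-live arbitration via atomic test-and-set.
package main

import (
	"fmt"
	"sync/atomic"
)

// liveFlag plays the role of the test-and-set variable kept on shared storage.
var liveFlag int32

// tryGoLive returns true iff this replica won the race and may act as primary.
func tryGoLive(name string) bool {
	if atomic.CompareAndSwapInt32(&liveFlag, 0, 1) {
		fmt.Println(name, "wins the test-and-set: goes live")
		return true
	}
	fmt.Println(name, "loses the test-and-set: halts itself")
	return false
}

func main() {
	// Simulate a partition in which each side suspects the other is dead.
	tryGoLive("primary")
	tryGoLive("backup")
}
```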

Raft

Readings:

  1. Raft (2014): In Search of an Understandable Consensus Algorithm (Extended Version)

Raft elections

log handling

Raft persistence

client behavior

snapshots

case study: ZooKeeper 2010


chain replication

case study: cr 2004

distributed transactions

case study: scalable dfs 1997

case study: spanner 2012

case study: optimistic concurrency control 2015

case study: spark 2012

case study: scaling memcache at Facebook 2013

case study: causal consistency 2011

case study: SUNDR 2004