2025-2026 Sem I

This course will deep dive into distributed storage systems.

In the first part of the course, we will familiarize ourselves with Temporal Logic of Actions (TLA+). In the second part, we will discuss a research paper in every class. The paper discussion will be led by student presenters. Each student is expected to submit a paper review for every paper on their own.

Logistics

  • Prerequisites: COL331 or equivalent.
  • Timings: Tuesdays and Fridays. 2-3:30pm. AC Slot.
  • Place: LH421
  • Emails may not be answered in a timely manner. Instead, please post on Piazza.
  • Class material can be found here.
  • Reference textbook(s):

Tentative evaluation criteria

  • Paper presentation: 25%
    • This includes presenting a paper, creating a quiz for the class, and evaluating the quiz.
    • Presentation schedule
  • Quiz: 10%
    • Every discussed paper will have a quiz. This will be taken before the class. Other classes, such as on TLA+, may also have a quiz. All quizzes will be pre-announced (no surprise quizzes).
    • There will be no makeup quizzes. Best N-1 out of N will be considered.
  • Paper reviews: 30%
    • This includes writing a summary of the paper and your commentary on what could be improved. Submit it on Gradescope Hand it over before the class begins.
    • There will be no makeup paper reviews. Best N-1 out of N will be considered.
  • Projects: 35%
    • Mini project: 15%
      • Model Chain replication membership changes in TLA+
    • Major project: 20%
      • You can model a (non-trivial) system from scratch in TLA+ to argue about its correctness and show interesting edge cases/optimization that were not obvious in the paper.
      • And/Or you can implement a (non-trivial) system from scratch and evaluate its performance. Ideally, you provide some performance insight that was not obvious in the paper.

Audit criteria: 30% or more marks. Present a paper.

Ethics

Paper reviews: You are encouraged to discuss papers with your peers. But each review must be written independently. If two student reviews are found to be copied, both reviews will be penalized. It is encouraged to note in the header: “Paper discussed with student XYZ”. You can use LLMs to help you understand the paper. But the paper review must be written by you. All paper reviews need to be hand-written.

Quizzes: Please do not share your quiz with your peers.

Paper presentation: You are allowed to reuse slides from other presentations. But this has to be acknowledged at the start of presentation.

Tentative schedule

Date Topic
25 Jul.
Friday.
1. Introduction to distributed storage systems. Safety and Liveness. Linearizability.
Recommended reading: (CACM 2015) How Amazon web services uses formal methods
27 Jul.
Sunday.
2. Leslie’s lectures 1-4.
29 Jul.
Tuesday.
3. Read/write/visualize some TLA+ specs. State machines, actions, invariants, enabling condition of an action, explicit model checkers (like TLC) check invariants by just traversing the state machine.
1 Aug.
Friday.
4. Distributed ACID transactions. Isolation: Serializability, optimistic/pessimistic concurrency control, wound-wait deadlock avoidance. Atomicity: Two-phase commits.
Recommended reading: (CSUR 1981) Concurrency Control in Distributed Database Systems
3 Aug.
Sunday.
5. Leslie’s lectures 5-6.
5 Aug.
Tuesday.
6. Read TLA+ specs for transaction commit and two-phase commit. Restarts in two-phase commits. Write-ahead logging.
8 Aug.
Friday.
7. (OSDI 2004) Chain Replication. Replicated state machines. Write TLA+ specs for a single server key-value store and a chain replicated key-value store. Take home Quiz 1.
10 Aug.
Sunday.
8. Leslie’s lectures 8(a,b).
12 Aug.
Tuesday.
9. TLA+ specification is just a temporal formula that accepts certain behaviors. State formulas, invariants, action formulas, box action formulas. Induction for proving invariants. Prove TCSpec from transaction commit has TCTypeOk invariant.
14 Aug.
Thursday.
10. Stuttering. Prove that two-phase commit implements transaction commit. Eventually. TLA inference and equivalence rules. Take home Quiz 2.
15 Aug.
Friday.
Independence Day
11. Leslie’s lectures 9(a, b).
19 Aug.
Tuesday.
12. Liveness, weak and strong fairness.
22 Aug.
Friday.
13. How TLC checks for liveness? Proof lattices for proving liveness.
Recommended reading: (TOPLAS 1982) Proving Liveness Properties of Concurrent Programs
24 Aug.
Sunday.
14. Leslie’s lectures 10(a, b)
Recommended reading: (TOPLAS 1994) The temporal logic of actions
26 Aug.
Tuesday.
15. (WDAG 1996) How to build a highly available system using consensus. Asynchronous networks. Quorums. FLP impossibility. Paxos consensus algorithm: leader can only propose safe values.
29 Aug.
Friday.
16. (TOPLAS 2022) Prophecy made simple
Stuttering invariance, internal and external variables, refinement mapping, chain replication spec implements single server spec.
30 Aug.
Saturday.
17. (TOPLAS 2022) Prophecy made simple
Auxiliary variables, history variables, problem with Herlihy-Wing linearizable queue, prophecy variables.
2 Sep.
Tuesday.
Hacking day. No class
5 Sep.
Friday.
Milad-un-nabi
9 Sep.
Tuesday.
18. Introduce a distributed actor framework for the major project. See toy implementations resilient to chaos like message loss, duplication, delays, crashes, and restarts.
Recommended reading: Chaos Engineering
12 Sep.
Friday.
Exam week
16 Sep.
Tuesday.
Exam week
19 Sep.
Friday.
Hacking day. No class
23 Sep.
Tuesday.
19. (ATC 2014) In search of an understandable consensus algorithm
26 Sep.
Friday.
20. (ATC 2014) In search of an understandable consensus algorithm
30 Sep.
Tuesday.
Semester break
3 Oct.
Friday.
Semester break
7 Oct.
Tuesday.
21. (SOSP 2013) There is more consensus in Egalitarian Parliaments
10 Oct.
Friday.
22. (SOSP 2013) There is more consensus in Egalitarian Parliaments
14 Oct.
Tuesday.
23. (NSDI 2021) EPaxos revisited
17 Oct.
Friday.
24. (ATC 2022) High-throughput replication with integrated membership management
21 Oct.
Tuesday.
25. (SOSP 2023) LazyLog: A New Shared Log Abstraction for Low-Latency Applications
24 Oct.
Friday.
26. (OSDI 2012) Spanner: Google’s Globally Distributed Database
28 Oct.
Tuesday.
27. (OSDI 2012) Spanner: Google’s Globally Distributed Database
31 Oct.
Friday.
28. (SOSP 2007) Dynamo: Amazon’s Highly Available Key-value Store
4 Nov.
Tuesday.
29. (SOSP 2007) Dynamo: Amazon’s Highly Available Key-value Store
7 Nov.
Friday.
30. (SIGMOD 2013) Bolt-on Causal Consistency
11 Nov.
Tuesday.
31. Project presentations
14 Nov.
Friday.
32. Project presentations