COL862 Special Topics in Software Systems
2025-2026 Sem I
This course will deep dive into distributed storage systems.
In the first part of the course, we will familiarize ourselves with Temporal Logic of Actions (TLA+). In the second part, we will discuss a research paper in every class. The paper discussion will be led by student presenters. Each student is expected to submit a paper review for every paper on their own.
Logistics
- Prerequisites: COL331 or equivalent.
- Timings: Tuesdays and Fridays. 2-3:30pm. AC Slot.
- Place: LH421
- Emails may not be answered in a timely manner. Instead, please post on Piazza.
- Class material can be found here.
- Reference textbook(s):
Tentative evaluation criteria
- Paper presentation: 25%
- This includes presenting a paper, creating a quiz for the class, and evaluating the quiz.
- Presentation schedule
- Quiz: 10%
- Every discussed paper will have a quiz. This will be taken before the class. Other classes, such as on TLA+, may also have a quiz. All quizzes will be pre-announced (no surprise quizzes).
- There will be no makeup quizzes. Best N-1 out of N will be considered.
- Paper reviews: 30%
- This includes writing a summary of the paper and your commentary on what
could be improved.
Submit it on GradescopeHand it over before the class begins. - There will be no makeup paper reviews. Best N-1 out of N will be considered.
- This includes writing a summary of the paper and your commentary on what
could be improved.
- Projects: 35%
- Mini project: 15%
- Model Chain replication membership changes in TLA+
- Major project: 20%
- You can model a (non-trivial) system from scratch in TLA+ to argue about its correctness and show interesting edge cases/optimization that were not obvious in the paper.
- And/Or you can implement a (non-trivial) system from scratch and evaluate its performance. Ideally, you provide some performance insight that was not obvious in the paper.
- Mini project: 15%
Audit criteria: 30% or more marks. Present a paper.
Ethics
Paper reviews: You are encouraged to discuss papers with your peers. But each review must be written independently. If two student reviews are found to be copied, both reviews will be penalized. It is encouraged to note in the header: “Paper discussed with student XYZ”. You can use LLMs to help you understand the paper. But the paper review must be written by you. All paper reviews need to be hand-written.
Quizzes: Please do not share your quiz with your peers.
Paper presentation: You are allowed to reuse slides from other presentations. But this has to be acknowledged at the start of presentation.
Tentative schedule
Date | Topic |
---|---|
25 Jul. Friday. |
1. Introduction to distributed storage systems. Safety and Liveness. Linearizability. Recommended reading: (CACM 2015) How Amazon web services uses formal methods |
27 Jul. Sunday. |
2. Leslie’s lectures 1-4. |
29 Jul. Tuesday. |
3. Read/write/visualize some TLA+ specs. State machines, actions, invariants, enabling condition of an action, explicit model checkers (like TLC) check invariants by just traversing the state machine. |
1 Aug. Friday. |
4. Distributed ACID transactions. Isolation: Serializability, optimistic/pessimistic concurrency control, wound-wait deadlock avoidance. Atomicity: Two-phase commits. Recommended reading: (CSUR 1981) Concurrency Control in Distributed Database Systems |
3 Aug. Sunday. |
5. Leslie’s lectures 5-6. |
5 Aug. Tuesday. |
6. Read TLA+ specs for transaction commit and two-phase commit. Restarts in two-phase commits. Write-ahead logging. |
8 Aug. Friday. |
7. (OSDI 2004) Chain Replication. Replicated state machines. Write TLA+ specs for a single server key-value store and a chain replicated key-value store. Take home Quiz 1. |
10 Aug. Sunday. |
8. Leslie’s lectures 8(a,b). |
12 Aug. Tuesday. |
9. TLA+ specification is just a temporal formula that accepts certain behaviors. State formulas, invariants, action formulas, box action formulas. Induction for proving invariants. Prove TCSpec from transaction commit has TCTypeOk invariant. |
14 Aug. Thursday. |
10. Stuttering. Prove that two-phase commit implements transaction commit. Eventually. TLA inference and equivalence rules. Take home Quiz 2. |
15 Aug. Friday. |
Independence Day 11. Leslie’s lectures 9(a, b). |
19 Aug. Tuesday. |
12. Liveness, weak and strong fairness. |
22 Aug. Friday. |
13. How TLC checks for liveness? Proof lattices for proving liveness. Recommended reading: (TOPLAS 1982) Proving Liveness Properties of Concurrent Programs |
24 Aug. Sunday. |
14. Leslie’s lectures 10(a, b) Recommended reading: (TOPLAS 1994) The temporal logic of actions |
26 Aug. Tuesday. |
15. (WDAG 1996) How to build a highly available system using consensus. Asynchronous networks. Quorums. FLP impossibility. Paxos consensus algorithm: leader can only propose safe values. |
29 Aug. Friday. |
16. (TOPLAS 2022) Prophecy made simple Stuttering invariance, internal and external variables, refinement mapping, chain replication spec implements single server spec. |
30 Aug. Saturday. |
17. (TOPLAS 2022) Prophecy made simple Auxiliary variables, history variables, problem with Herlihy-Wing linearizable queue, prophecy variables. |
2 Sep. Tuesday. |
Hacking day. No class |
5 Sep. Friday. |
Milad-un-nabi |
9 Sep. Tuesday. |
18. Introduce a distributed actor framework for the major project. See toy implementations resilient to chaos like message loss, duplication, delays, crashes, and restarts. Recommended reading: Chaos Engineering |
12 Sep. Friday. |
Exam week |
16 Sep. Tuesday. |
Exam week |
19 Sep. Friday. |
Hacking day. No class |
23 Sep. Tuesday. |
19. (ATC 2014) In search of an understandable consensus algorithm |
26 Sep. Friday. |
20. (ATC 2014) In search of an understandable consensus algorithm |
30 Sep. Tuesday. |
Semester break |
3 Oct. Friday. |
Semester break |
7 Oct. Tuesday. |
21. (SOSP 2013) There is more consensus in Egalitarian Parliaments |
10 Oct. Friday. |
22. (SOSP 2013) There is more consensus in Egalitarian Parliaments |
14 Oct. Tuesday. |
23. (NSDI 2021) EPaxos revisited |
17 Oct. Friday. |
24. (ATC 2022) High-throughput replication with integrated membership management |
21 Oct. Tuesday. |
25. (SOSP 2023) LazyLog: A New Shared Log Abstraction for Low-Latency Applications |
24 Oct. Friday. |
26. (OSDI 2012) Spanner: Google’s Globally Distributed Database |
28 Oct. Tuesday. |
27. (OSDI 2012) Spanner: Google’s Globally Distributed Database |
31 Oct. Friday. |
28. (SOSP 2007) Dynamo: Amazon’s Highly Available Key-value Store |
4 Nov. Tuesday. |
29. (SOSP 2007) Dynamo: Amazon’s Highly Available Key-value Store |
7 Nov. Friday. |
30. (SIGMOD 2013) Bolt-on Causal Consistency |
11 Nov. Tuesday. |
31. Project presentations |
14 Nov. Friday. |
32. Project presentations |