2024-2025 Sem I

This course introduces cloud infrastructure. After completing it, students should be more comfortable building cloud services.

Course Information

  • Prerequisites: COL331 or equivalent.

    Note: The course includes programming assignments and thus expects proficiency with systems programming and debugging.

  • Credits: 3-0-2
  • Slot: AC, Tuesdays and Fridays 2-3:15pm in IIA 305 LH316.
  • TAs:
    • Shivam Verma: Shivam.Verma.cs520@cse.iitd.ac.in
    • Rohith Vishnumolakala: csy227589@cse.iitd.ac.in
    • Satyam Jay: anz238224@sit.iitd.ac.in
  • Reading material: There is no textbook for the course. Most lectures will link to more reading material. Lecture notes can be found here and here.
  • Acknowledgements: Thanks to Robert T. Morris, MIT and Mythili Vutukuru, IITB; parts of this course have been inspired by courses made available by them.

Grading criteria

  • 30% programming assignments
  • 20% project
  • 10% pre-class questions
  • 20% minor exam
  • 20% major exam

Supporting systems

  • Programming assignments are to be done on Baadal. You will need VPN access to the IITD network!
  • Discussions should be done on Piazza.

Policies

  • Attendance policy: If a student’s attendance is less than 75%, the student will be awarded one grade less than the actual grade that he/she has earned. For example, a student who has got an A grade but has attendance less than 75% will be awarded an A- grade.
  • Audit criteria: 40% or more marks in total, 40% or more marks in the major and minor exams combined, and attendance >= 75%.
  • Ethics: Please re-read the IITD honour code. Cheating will get an F in the course. Why should I not cheat?
  • Late policy: To help you cope with unexpected emergencies, you can hand in your lab solutions late, but the total lateness summed over all the lab deadlines must not exceed 72 hours. You can divide your 72 hours among the labs however you like; you don’t have to ask or tell us. Late hours can be used only for labs.
  • There will be no make up pre-class questions. We will count the scores from your best (n-1) marks where n is the total number of pre-class questions.
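
The arithmetic behind the late policy and the best (n-1) rule can be sketched as follows. This is only an illustration: the function names and the sample numbers are hypothetical, and it assumes lateness is tracked in hours and pre-class scores are numeric. The actual bookkeeping used by the course staff may differ.

```python
LATE_BUDGET_HOURS = 72  # total lateness allowed across all labs (from the late policy)


def late_hours_remaining(hours_late_per_lab):
    """Return unused late hours; a negative result means the budget is exceeded."""
    return LATE_BUDGET_HOURS - sum(hours_late_per_lab)


def pre_class_total(scores):
    """Sum the best (n-1) of n pre-class scores, i.e. drop the single lowest one."""
    keep = max(len(scores) - 1, 0)
    return sum(sorted(scores, reverse=True)[:keep])


if __name__ == "__main__":
    # Hypothetical student: 10 hours late on one lab, 30 hours late on another.
    print(late_hours_remaining([10, 0, 30]))  # 32
    # Hypothetical pre-class scores; the lowest one (0) is dropped.
    print(pre_class_total([1, 0, 1, 1]))      # 3
```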

Tentative topics

Computation:

  • Translate existing programs to a distributed system. (Distributed shared memory)
  • Batch computation (MapReduce, Spark), streaming computation (Spark Streaming, Flink, Google Dataflow), ML training (TensorFlow)
  • The problem of late data in streaming computation (MillWheel, Google Dataflow): watermarks, triggers, windows.
  • Fault tolerance strategies: re-run deterministic idempotent functions using lineage (MapReduce, Spark, CIEL, Ray), asynchronous consistent checkpoints using Chandy-Lamport algorithm (Flink), inconsistent checkpoints (TensorFlow).
  • Straggler mitigation strategies: backup tasks (MapReduce, Spark), wait for only M of N tasks (TensorFlow)
  • Scalability, locality, problem of synchronized times and vector clocks, etc.
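
As a small illustration of the vector clocks topic above, here is a minimal sketch of how two vector times are usually compared. The function names are hypothetical, and the sketch assumes each vector time is a fixed-length sequence of integers with one entry per process. It follows the standard textbook definition and is not taken from the lecture notes.

```python
def happened_before(u, v):
    """True if vector time u strictly precedes v: u <= v in every entry and u != v."""
    u, v = tuple(u), tuple(v)
    if len(u) != len(v):
        raise ValueError("vector times must have the same number of entries")
    return all(a <= b for a, b in zip(u, v)) and u != v


def concurrent(u, v):
    """True if neither vector time precedes the other (commonly written u || v)."""
    u, v = tuple(u), tuple(v)
    return u != v and not happened_before(u, v) and not happened_before(v, u)


if __name__ == "__main__":
    print(happened_before((1, 1, 0), (2, 1, 0)))  # True
    print(concurrent((2, 1, 0), (1, 2, 0)))       # True
    print(concurrent((1, 1, 0), (2, 1, 0)))       # False
```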

Storage:

  • PACELC theorem: If partitioned, choose between availability and consistency, else choose between latency and consistency.
  • CP systems:
    • Linearizability. Raft: quorums, leader election.
    • Serializability. Google Spanner: distributed transactions, TrueTime, hybrid logical clocks.
  • AP systems:
    • Amazon Dynamo: eventual consistency, hashing, gossip protocols, dotted version vectors, conflict-free replicated data types (CRDTs)
  • Somewhere between CP and AP:
    • Google File System
    • ZooKeeper
    • RedBlue consistency
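
To make the CRDT topic listed above a little more concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are hypothetical, and the sketch assumes a fixed, known number of replicas. It is a generic textbook construction, not Dynamo's implementation.

```python
class GCounter:
    """Grow-only counter CRDT: one non-negative slot per replica."""

    def __init__(self, replica_id, num_replicas):
        self.replica_id = replica_id
        self.counts = [0] * num_replicas

    def increment(self, amount=1):
        """Each replica only ever increases its own slot."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] += amount

    def value(self):
        """The counter value is the sum over all replica slots."""
        return sum(self.counts)

    def merge(self, other):
        """Entry-wise max: commutative, associative, and idempotent."""
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]


if __name__ == "__main__":
    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment()
    a.increment()
    b.increment()
    a.merge(b)
    b.merge(a)
    print(a.value(), b.value())  # 3 3
```

Because merge takes an entry-wise maximum, replicas can exchange state in any order, any number of times, and still converge to the same value.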

Hardware-assisted virtualization:

  • CPU virtualization: KVM, Popek-Goldberg theorem
  • Memory virtualization: 2-D page tables

Disclaimer: Actual course contents may differ depending on student interest. Reach out to the instructor as soon as possible if there is a particular interest in a topic.

Tentative Schedule

Week 1:

  • Tue, 23 Jul. LEC 1: Introduction.
  • Fri, 26 Jul. LEC 2: Scalability, Task DAGs, FaaS.

Week 2:

  • Tue, 30 Jul. LEC 3: Struggles with DSM. Paper
    What is the ping-pong effect in DSM and what does Mirage do to reduce the ping-pong effect?
  • Fri, 2 Aug. LEC 4: MapReduce. Paper
    What does MapReduce do for tolerating worker failures?

Week 3:

  • Tue, 6 Aug. LEC 5: Release Lab 1
    What are narrow and wide dependencies in Spark?
  • Fri, 9 Aug. LEC 6: Spark RDDs. Paper
    What guarantees are provided by consumer groups in Redis?

Week 4:

  • Tue, 13 Aug. Tuesday is Thursday. No class.
  • Fri, 16 Aug. No class day.
  • Sun, 18 Aug. Lab 1 due

Week 5:

  • Tue, 20 Aug. LEC 7: Spark Streaming. Paper
    What are the trade-offs in using D-streams compared to continuous processing systems?
  • Fri, 23 Aug. LEC 8: Virtual times and global states. Paper
    For two vector times u and v, when can we say u || v?

Week 6:

  • Tue, 27 Aug. LEC 9: Flink. Paper. Release Lab 2
    How does Flink use its consistent snapshots?
  • Fri, 30 Aug. LEC 10: Discuss Lab 2
    What are the different phases and phase transitions of the coordinator?

Week 7:

  • Tue, 3 Sep. LEC 11: TensorFlow computational model. Paper.
    What are the different kinds of operations and edges in a TensorFlow graph?
  • Fri, 6 Sep. LEC 12: TensorFlow. Paper.
    How does TensorFlow provide fault tolerance?
  • Sun, 8 Sep. Lab 2 due

Week 8:

  • Tue, 10 Sep. Class moved to 27 Sep.
  • Wed, 11 Sep. Wednesday is Friday. LEC 13: Ray. Paper.
    What is an actor?
  • Fri, 13 Sep. Midsem

Week 9:

  • Tue, 17 Sep. Midsem
  • Fri, 20 Sep. LEC 14:

Week 10:

  • Tue, 24 Sep. LEC 15:
  • Fri, 27 Sep. LEC 16:

Week 11:

  • Tue, 1 Oct. LEC 17:
  • Fri, 4 Oct. LEC 18:

Week 12:

  • Tue, 8 Oct. Sembreak
  • Fri, 11 Oct. Sembreak

Week 13:

  • Tue, 15 Oct. LEC 19:
  • Fri, 18 Oct. LEC 20:

Week 14:

  • Tue, 22 Oct. LEC 21:
  • Fri, 25 Oct. LEC 22:
  • Sat, 26 Oct. Saturday is Friday. LEC 23:

Week 15:

  • Tue, 29 Oct. LEC 24:
  • Fri, 1 Nov. LEC 25:

Week 16:

  • Tue, 5 Nov. Project presentations
  • Fri, 8 Nov. Project presentations

Week 17:

  • Tue, 12 Nov. LEC 26: