COL733 Cloud computing technology fundamentals
2024-2025 Sem I
This course introduces cloud infrastructure. After completing it, students should be more comfortable building cloud services.
Course Information
- Prerequisites: COL331 or equivalent.
Note: The course includes programming assignments and thus expects proficiency with systems programming and debugging.
- Credits: 3-0-2
- Slot: AC, Tuesdays and Fridays 2-3:15pm in IIA 305LH316.
- TAs:
  - Shivam Verma: Shivam.Verma.cs520@cse.iitd.ac.in
  - Rohith Vishnumolakala: csy227589@cse.iitd.ac.in
  - Satyam Jay: anz238224@sit.iitd.ac.in
- Reading material: There is no textbook for the course. Most lectures will link to more reading material. Lecture notes can be found here and here.
- Acknowledgements: Thanks to Robert T. Morris (MIT) and Mythili Vutukuru (IITB); parts of this course are inspired by courses they have made available.
Grading criteria
- 30% programming assignments
- 20% project
- 10% pre-class question
- 20% minor exam
- 20% major exam
Supporting systems
- Programming assignments are to be done on Baadal. You will need VPN access to the IITD network!
- Discussions should be done on Piazza.
Policies
- Attendance policy: If a student’s attendance is less than 75%, the student will be awarded one grade less than the actual grade that he/she has earned. For example, a student who has got an A grade but has attendance less than 75% will be awarded an A- grade.
- Audit criteria: 40% or more marks in total. 40% or more marks in major+minor exams. Attendance>=75%.
- Ethics: Please re-read IITD honour code. Cheating will get an F in the course. Why should I not cheat?
- Late policy: To help you cope with unexpected emergencies, you can hand in your lab solutions late, but the total amount of lateness summed over all the lab deadlines must not exceed 72 hours. You can divide up your 72 hours among the labs however you like; you don’t have to ask or tell us. Late hours can be used only for labs.
- There will be no make up pre-class questions. We will count the scores from your best (n-1) marks where n is the total number of pre-class questions.
Tentative topics
Computation:
- Translate existing programs to a distributed system (distributed shared memory).
- Batch computation (MapReduce, Spark), streaming computation (Spark Streaming, Flink, Google Dataflow), ML training (TensorFlow).
- The problem of late data in streaming computation (MillWheel, Google Dataflow): watermarks, triggers, windows.
- Fault tolerance strategies: re-run deterministic idempotent functions using lineage (MapReduce, Spark, CIEL, Ray), asynchronous consistent checkpoints using the Chandy-Lamport algorithm (Flink), inconsistent checkpoints (TensorFlow).
- Straggler mitigation strategies: backup tasks (MapReduce, Spark), waiting for only M of N tasks to finish (TensorFlow).
- Scalability, locality, problem of synchronized times and vector clocks, etc.
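
The vector clocks mentioned in the last item can be illustrated with a minimal sketch. It assumes clocks are plain dictionaries from a process id to a counter; the function names (`tick`, `merge`, `leq`, `happened_before`, `concurrent`) are illustrative and not taken from the course labs. Two vector times u and v are concurrent (u || v) when neither is component-wise less than or equal to the other.

```python
def tick(clock, pid):
    """Return a copy of `clock` with pid's counter advanced by one local event."""
    new = dict(clock)
    new[pid] = new.get(pid, 0) + 1
    return new


def merge(local, received, pid):
    """On message receipt: component-wise max of both clocks, then a local tick."""
    keys = set(local) | set(received)
    merged = {k: max(local.get(k, 0), received.get(k, 0)) for k in keys}
    return tick(merged, pid)


def leq(u, v):
    """True if u <= v in every component (missing entries count as 0)."""
    return all(u.get(k, 0) <= v.get(k, 0) for k in set(u) | set(v))


def happened_before(u, v):
    """u -> v: u <= v component-wise and the two clocks are not equal."""
    return leq(u, v) and not leq(v, u)


def concurrent(u, v):
    """u || v: neither clock dominates the other."""
    return not leq(u, v) and not leq(v, u)


# Two processes each record one local event, then B receives a message from A.
a1 = tick({}, "A")          # {"A": 1}
b1 = tick({}, "B")          # {"B": 1}
b2 = merge(b1, a1, "B")     # {"A": 1, "B": 2}
print(concurrent(a1, b1), happened_before(a1, b2))
```

Treating a missing entry as 0 lets processes join without resizing existing clocks; a fixed-length list indexed by process number would work equally well when the set of processes is known up front.
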
Storage:
- PACELC theorem: If partitioned, choose between availability and consistency, else choose between latency and consistency.
- CP systems:
  - Linearizability. Raft: quorums, leader election.
  - Serializability. Google Spanner: distributed transactions, TrueTime, hybrid logical clocks.
- AP systems:
  - Amazon Dynamo: eventual consistency, hashing, gossip protocols, dotted version vectors, conflict-free replicated data types (CRDTs)
- Somewhere between CP and AP
  - Google file system
  - Zookeeper
  - RedBlue consistency
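
As a small illustration of the CRDTs listed under AP systems, here is a sketch of a state-based grow-only counter (often called a G-Counter). It is a textbook example and is not taken from the Dynamo paper or the course labs; the class name `GCounter` and its methods are illustrative. Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge regardless of the order or number of merges.

```python
class GCounter:
    """Grow-only counter: one non-decreasing slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        """Record `amount` local increments; decrements are not supported."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        """The counter value is the sum over all replica slots."""
        return sum(self.counts.values())

    def merge(self, other):
        """Fold in another replica's state by taking the per-slot maximum."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing (idempotent)
print(a.value(), b.value())
```

Because the merge is commutative, associative, and idempotent, the replicas can gossip state in any order and still agree, which is what makes this style of data type a fit for eventually consistent stores.
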
Hardware-assisted virtualization:
- CPU virtualization: KVM, Popek-Goldberg theorem
- Memory virtualization: 2-D page tables
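
One way to see why 2-D (nested) page tables are a topic of study is to count the memory references of a single translation. The sketch below assumes a worst case with no TLB or page-walk-cache hits, and the function name `nested_walk_refs` is illustrative: every guest page-table entry sits at a guest-physical address that itself needs a host page walk before it can be read.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references to translate one guest virtual address
    with nested paging, assuming no TLB or page-walk-cache hits."""
    if guest_levels < 1 or host_levels < 1:
        raise ValueError("both page tables need at least one level")
    # Reading each guest page-table entry costs one host walk (host_levels
    # references) to locate it, plus one reference to read the entry itself.
    guest_table_refs = guest_levels * (host_levels + 1)
    # The final guest-physical address of the data needs one more host walk.
    final_translation_refs = host_levels
    return guest_table_refs + final_translation_refs


# Four-level guest and host tables: 24 references, versus 4 for a native walk.
print(nested_walk_refs(4, 4))
```

The count grows with the product of the two depths, which is why hardware relies on TLBs and page-walk caches to keep the common case cheap.
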
Disclaimer: Actual course contents may differ depending on student interest. Reach out to the instructor as soon as possible if there is a particular interest in a topic.
Tentative Schedule
Week | Tuesday | Friday | Sunday |
---|---|---|---|
1 | 23 Jul LEC 1: Introduction. | 26 Jul LEC 2: Scalability, Task DAGs, FaaS. | |
2 | 30 Jul LEC 3: Struggles with DSM. Paper. What is ping-pong effect in DSM and what does Mirage do to reduce the ping-pong effect? | 2 Aug LEC 4: MapReduce. Paper. What does MapReduce do for tolerating worker failures? | |
3 | 6 Aug LEC 5: Release Lab 1. What are narrow and wide dependencies in Spark? | 9 Aug LEC 6: Spark RDDs. Paper. What guarantees are provided by consumer groups in Redis? | |
4 | 13 Aug Tuesday is Thursday. no class | 16 Aug no class day | 18 Aug Lab 1 due |
5 | 20 Aug LEC 7: Spark streaming. Paper. What are the trade-offs in using D-streams compared to continuous processing systems? | 23 Aug LEC 8: Virtual times and global states. Paper. For two vector times u and v, when can we say u \|\| v? | |
6 | 27 Aug LEC 9: Flink. Paper. Release Lab 2. How does Flink use its consistent snapshots? | 30 Aug LEC 10: Discuss Lab 2. What are the different phases and phase transitions of the coordinator? | |
7 | 3 Sep LEC 11: TensorFlow computational model. Paper. What are the different kinds of operations and edges in a TensorFlow graph? | 6 Sep LEC 12: TensorFlow. Paper. How does TensorFlow provide fault tolerance? | 8 Sep Lab 2 due |
8 | 10 Sep class moved to 27 Sep | 11 Sep Wednesday is Friday. LEC 13: Ray. Paper. What is an actor? | 13 Sep midsem |
9 | 17 Sep midsem | 20 Sep LEC 14: GFS. Paper. What are the (dis)advantages of keeping a large chunk size? | |
10 | 24 Sep LEC 15: CRAQ. Paper. Release Lab 3. In Chain Replication, what kind of read-write histories can be seen by clients? | 27 Sep LEC 16: Dynamo. Paper. | |
11 | 1 Oct LEC 17: Dynamo. Paper. How does Dynamo distribute keys to servers? | 4 Oct class moved to 8 Oct | 6 Oct Lab 3 due |
12 | 8 Oct LEC 18: Bayou. Paper. Release Lab 4 | 11 Oct sembreak | |
13 | 15 Oct LEC 19: Raft. Paper. How does Raft elect a leader? | 18 Oct LEC 20: Raft. Paper. What is going on in Figure 8? | 20 Oct Lab 4 due |
14 | 22 Oct LEC 21: Zookeeper. Paper. What are Zookeeper’s consistency guarantees? | 25 Oct LEC 22: Spanner. Paper. How does Spanner do concurrency control? | 26 Oct Saturday is Friday. LEC 23: Spanner. Paper. What is safe time? |
15 | 29 Oct LEC 24: CPU virtualization. Paper. In Spanner, if we change commit timestamp for RW transactions as max of all participating servers’ clock and start timestamp of lock-free RO transactions as max of all participating servers’ clock, do we get strict serializability without any other mechanism to deal with clock-skew? | 1 Nov project day | |
16 | 5 Nov project day | 8 Nov class moved to 10 Nov | 10 Nov LEC 25: Memory virtualization. Paper. |
17 | 12 Nov LEC 26: IO virtualization. Paper. In Spanner, we make following changes: (1) set commit timestamp for RW transactions as max of all participating servers’ clock; (2) if the local clock of a RW transaction participant is behind commit timestamp during the commit phase, we advance it to the commit timestamp; and (3) set start timestamp of lock-free RO transactions as max of all participating servers’ clock. Do we get strict serializability without any other mechanism to deal with clock-skew? | | 24 Nov Project due |
Encouraging student comments after the course
Course content was really good and we studied about different systems and techniques used in industry grade systems. Labs were helpful in improving conceptual clarity by actually implementing what we were studying on a smaller scale.
I found the topics really interesting, and the excitement from the prof was also really motivating. I found the markdown notes of the papers really helpful in understanding the paper in a shorter amount of time. It was a perfect middle ground between brief lecture slides and in depth papers.
The assignments were really interesting, and the enormous effort by the TAs and prof to provide a refined starter code really helped to make the assignments easy to understand and help us put effort on getting the concepts working and not have to waste too much time dealing with the specifics of getting the setup running, and so on. Also just looking at the starter codes made me in awe. It actually motivated me to learn more about how to organize my own code better, better use of OOPs, type checking, testing with assert statements early on, proper directory structure, and the overall feel of having proper organized code. More COL courses should promote the students to have such level of coding structure in their assignments.