Research

Last updated: December 2023.

Broadly, I like to work on systems problems that span operating systems, compilers, databases, and distributed systems. I hope to work on problems that can create immediate practical impact while staying relevant in the long term. Most recently, I have been focused on how to build machine-learning-based software.

ML models have fundamentally expanded the landscape of “computable problems”. Thanks to ML, many problems that traditional computing could not handle are now well within the reach of computers. ML is going to bring a revolution in automation and productivity on the same scale as the one computers have brought since they were first introduced. Everything from new software, such as smart doorbells, to traditional software, such as operating systems, compilers, and databases, is going to have ML components.

However, it is not obvious how one should program with ML components. Building ML-based software raises fundamentally new issues:

Broken abstraction

Traditional software was built on reliable building blocks, such as sort, with well-defined inputs and outputs. As programmers, we rarely worry about sort suddenly giving wrong answers, which is what makes it a good abstraction. ML-based software builds on top of flaky building blocks, such as an OCR model that turns images into text, which can silently introduce mistakes.

No ML model claims 100% accuracy on all input variations, since the input domains of these models are typically unconstrained. We are developing novel abstractions, along with execution engines that support them, for programming in the presence of such flaky building blocks.

Our initial results show that our proposed abstractions can restore composability and improve both the precision and recall of ML-based software, while reducing the “cost” of the deployed program.

Accuracy. The new kid in town.

Maintaining high accuracy throughout the deployed lifetime of software is quite difficult, because ML-based software is “correct” (i.e., has reasonable accuracy) only under a certain “data distribution”.

We are working on abstractions that let programmers modify their programs on the fly to satisfy accuracy, throughput, latency, and cost requirements. Such modifications can be triggered upon detecting changes such as shifts in the input data distribution or drops in network bandwidth. With these modifications, programs can maintain high accuracy while still meeting their other performance targets.

The challenge is to enable programmers to specify composable control logic that modifies running programs in a meaningful way. Ideally, end users should see little degradation in latency or throughput while the program is being modified.

Heterogeneous hardware and distributed systems

We are exploring how to write ML-based software in the context of a dataflow system that we are actively developing. This system aims to make ML-based software straightforward to program, deploy, scale, and manage. Along the way, we are developing system-level optimizations so that it runs on heterogeneous hardware in a distributed manner.

We plan to experiment with some unconventional but very interesting approaches to turning a single-machine program into a distributed one. More about this later.

Earlier research

Smartphone battery debugging

When I started my PhD, smartphones and apps were just entering consumers’ lives, but phones were limited by their batteries: they would rarely last beyond 5 pm. My work focused primarily on identifying and mitigating a new class of software bugs called “energy bugs”. Software with an energy bug continues to provide its functionality, but it quickly drains the phone’s battery.

Real-world impact:

In our research, we found hundreds of new energy bugs across all software layers, including popular apps such as Facebook and Netflix, the Android framework, the Android kernel, and device drivers. We applied several patches to the Linux kernel to fix the kernel bugs.
One class of energy bugs that we identified, called no-sleep bugs, has been checked for by the Wakelock and WakelockTimeout lint checks in Android Studio since 2012, every time an Android app is built.
We launched the “eStar battery saver app”, which was downloaded by over 100k Android users and widely discussed in the media (1, 2).
Our Hush system received over 50 forks and 150 stars on GitHub, as well as global media coverage (1, 2, 3). Ideas similar to those in Hush were released as the “App Standby” and “Doze” battery-saving features in Android Marshmallow.
Our measurement study of the breakdown of daily battery drain and our work on differential energy profiling were also widely covered in the media (1, 2).