Overview

The explosive growth of cloud data calls for scalable computing infrastructure that can process this data quickly and efficiently. Our research brings mathematical rigor to the design of cloud infrastructure via stochastic modeling and analysis.

More broadly, we are interested in the modeling and performance optimization of computing systems and networks.

Tools and Approach

We use a diverse set of analytical tools drawn from probability, queueing, coding theory, and statistical learning. Our approach involves the following steps, often traversed in an iterative manner:

  1. Choosing a model that captures the essential elements of the system, while being simple and tractable.
  2. Designing algorithms that are provably optimal within the model framework.
  3. Evaluating the proposed algorithms on the real system, going beyond the assumptions of the model.

Selected Projects

Distributed Machine Learning

In large-scale machine learning, training is performed by running stochastic gradient descent (SGD) in a distributed fashion, using a central parameter server and multiple worker nodes (learners). Synchronization delays due to straggling learners can significantly increase the run-time of the training job. A solution is to use asynchronous methods, where the parameter server updates the model without waiting for all learners. While asynchronous methods alleviate the problem of stragglers, they cause stale gradient updates, which may adversely affect the convergence of the algorithm. Our research goal is to design a distributed SGD algorithm that strikes the best trade-off between the training time and the error of the trained model.
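
To make the staleness effect concrete, here is a minimal simulation sketch, not the algorithm from our work: asynchronous SGD on a toy one-dimensional quadratic loss, where the server applies each gradient as soon as it arrives even though it was computed on a stale model copy. All function names and parameters are illustrative.

```python
import random

# Hedged sketch of asynchronous SGD with stale gradients on the toy
# loss f(w) = (w - 3)^2; learner count, step count, and learning rate
# are illustrative, not tuned values from the original work.

def gradient(w):
    return 2.0 * (w - 3.0)  # exact gradient of (w - 3)^2

def async_sgd(num_learners=4, steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0
    copies = [w] * num_learners          # each learner's last-read (stale) model
    for _ in range(steps):
        i = rng.randrange(num_learners)  # whichever learner finishes first
        g = gradient(copies[i])          # gradient w.r.t. its stale copy
        w -= lr * g                      # server updates without waiting
        copies[i] = w                    # learner then pulls the fresh model
    return w

print(async_sgd())  # approaches the optimum w* = 3 despite staleness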

Queueing Redundant Tasks to Reduce Delay in Cloud Systems

We model job service in a distributed system of n servers as n identical queues, one at each server. Each incoming job is forked to all or a subset of the n servers, and we wait for any k of the copies to finish. The case k = 1 corresponds to full replication of the job at the servers. This is one of the first works to analyze the response time of queues with redundancy.
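
Here is a minimal sketch of the (n, k) fork-join idea, under simplifying assumptions that are not part of the original analysis (i.i.d. exponential service times, queueing delay ignored): the job completes when the k-th fastest of its n forked copies finishes.

```python
import random

# Hedged sketch of the (n, k) fork-join idea: a job forked to n servers
# completes when the k fastest copies finish. Service times are assumed
# i.i.d. exponential and queueing is ignored; the rates are illustrative.

def fork_join_time(n, k, rng, rate=1.0):
    finish = sorted(rng.expovariate(rate) for _ in range(n))
    return finish[k - 1]  # k-th order statistic of the service times

def mean_response(n, k, jobs=100_000, seed=0):
    rng = random.Random(seed)
    return sum(fork_join_time(n, k, rng) for _ in range(jobs)) / jobs

# k = 1 (full replication) waits only for the fastest copy; k = n waits
# for the slowest copy and is dominated by stragglers.
print(mean_response(n=5, k=1))  # ≈ 1/5 = 0.2
print(mean_response(n=5, k=5))  # ≈ H_5 = 1 + 1/2 + ... + 1/5 ≈ 2.28
```

The gap between k = 1 and k = n in this toy setting shows why a little redundancy can sharply cut the response time.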

Straggler Replication in Parallel Computing

In large-scale computing, where a job has hundreds of parallel tasks, the slowest task becomes the bottleneck. Frameworks such as MapReduce relaunch replicas of straggling tasks to cut the tail latency. We develop insights into the best time to relaunch and the number of replicas to launch, so as to reduce latency without a significant increase in computing cost. For heavy-tailed task execution times, redundancy can reduce latency and cost simultaneously!
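
For intuition on why heavy tails make this possible, the sketch below simulates one simple variant: killing a straggling task at time t and relaunching it fresh. The Pareto distribution and all numbers are illustrative, not the model from our work. A Pareto task that has already run for a long time is likely to run much longer, so a fresh copy wins on average.

```python
import random

# Hedged sketch: task running times are Pareto (heavy-tailed); if a task
# has not finished by time t, kill it and relaunch a fresh copy. The
# Pareto shape and the relaunch time are illustrative.

def task_time(rng, t, alpha=1.5):
    x = rng.paretovariate(alpha)       # P(X > x) = x^(-alpha), x >= 1
    if x <= t:
        return x                       # finished before the relaunch deadline
    return t + rng.paretovariate(alpha)  # kill the straggler at t, start fresh

def mean_latency(t, tasks=200_000, seed=0):
    rng = random.Random(seed)
    return sum(task_time(rng, t) for _ in range(tasks)) / tasks

# With one copy running at a time, machine-time cost equals latency, so
# a good relaunch time t cuts latency and cost together.
print(mean_latency(t=2.0))   # well below the no-relaunch mean
print(mean_latency(t=1e9))   # ≈ 3.0: effectively never relaunch
```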

Streaming Communication

Unlike traditional file transfer, where only the total delay matters, streaming requires fast and in-order delivery of individual packets to the user. We analyze the trade-off between throughput and in-order delivery delay, and in particular how it is affected by the frequency of feedback to the source. We propose a simple combination of repetition and greedy linear coding that achieves a close-to-optimal throughput-delay trade-off.
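
As a baseline for intuition (this is not the proposed coding scheme), the sketch below simulates in-order delivery over a Bernoulli erasure channel with per-slot feedback, where simply repeating the oldest undelivered packet suffices; the coded schemes for infrequent feedback are not modeled, and all rates are illustrative.

```python
import random

# Hedged sketch of in-order delivery over a Bernoulli erasure channel
# with per-slot feedback: in each slot, the source repeats the oldest
# packet the receiver is still missing. Arrival rate, erasure rate, and
# horizon are illustrative.

def mean_in_order_delay(num_packets=50_000, erasure=0.2, seed=0):
    rng = random.Random(seed)
    slot = 0
    total_delay = 0
    for i in range(num_packets):
        created = 2 * i                # one new packet every two slots
        slot = max(slot, created)      # cannot transmit before creation
        while True:                    # repeat until one copy survives
            slot += 1
            if rng.random() > erasure:
                break
        total_delay += slot - created  # packet i is delivered in order
    return total_delay / num_packets

# Stable because the arrival rate 0.5 is below the capacity 1 - 0.2 = 0.8;
# with infrequent feedback, repetition alone would no longer be enough.
print(mean_in_order_delay())
```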