Tools and Approach

We use a diverse set of analytical tools drawn from probability, queueing theory, coding theory, and statistical learning. Our approach involves the following steps, often traversed in an iterative manner:

  1. Choosing a model that captures the essential elements of the system, while being simple and tractable.
  2. Designing algorithms that are provably optimal within the model framework.
  3. Evaluating the proposed algorithms on the real system, going beyond the assumptions of the model.

Selected Projects

Distributed Machine Learning

In large-scale machine learning, training is performed by running stochastic gradient descent (SGD) in a distributed fashion across multiple worker nodes. Synchronization delays due to straggling workers and communication delays in aggregating updates from the workers can significantly increase the runtime of a training job. Asynchronous methods and sparse inter-node communication reduce the runtime per iteration, but they result in stale gradient updates that can adversely affect convergence. Our research goal is to design distributed SGD algorithms that strike the best trade-off between the training time and the error of the trained model.
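
As a toy illustration of this trade-off, the sketch below runs synchronous SGD on a synthetic least-squares problem with P workers, where the parameter server waits only for the fastest k gradients in each iteration. The exponential compute-time model, the objective, and all parameter values are illustrative assumptions, not the setup from our papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize ||Ax - b||^2 / (2n).
n, d, P = 1000, 20, 8                      # samples, dimension, workers
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)
shards = np.array_split(np.arange(n), P)   # one data shard per worker

def run_sgd(k_wait, iters=200, lr=0.05):
    """Synchronous SGD where the server waits for the fastest k_wait of P workers."""
    x = np.zeros(d)
    wall_clock = 0.0
    for _ in range(iters):
        # Exponential compute times model random straggling at each worker.
        times = rng.exponential(1.0, size=P)
        fastest = np.argsort(times)[:k_wait]
        wall_clock += times[fastest[-1]]   # iteration ends when the k-th gradient arrives
        grads = []
        for w in fastest:
            idx = shards[w]
            grads.append(A[idx].T @ (A[idx] @ x - b[idx]) / len(idx))
        x -= lr * np.mean(grads, axis=0)   # aggregate only the received gradients
    return wall_clock, np.linalg.norm(A @ x - b) ** 2 / (2 * n)

for k in (P, P // 2, 1):
    t, e = run_sgd(k)
    print(f"wait for k={k}: wall-clock {t:7.1f}, final loss {e:.4f}")
```

Waiting for fewer workers shortens each iteration, but averaging fewer gradients makes each step noisier; sweeping k exposes the runtime-versus-error trade-off.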

Correlated Multi-armed Bandits

Multi-armed bandit algorithms have numerous data-driven decision-making applications, including advertising, A/B testing, and hyper-parameter tuning. Unlike classic formulations that assume the rewards observed by sampling different arms are independent, we consider correlated rewards. This correlation can be exploited to identify sub-optimal, "non-competitive" arms and avoid exploring them, which reduces a K-armed bandit problem to a C-armed bandit problem, where C is the number of "competitive" arms.
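
The sketch below is a simplified stand-in for this idea, not the exact algorithm from our papers: a UCB rule that, in each round, discards arms whose correlation-based upper bound (relative to the empirically best arm) falls below that arm's empirical mean. The pseudo-reward table `pseudo_ub` is fabricated here purely for illustration; in practice such bounds would come from domain knowledge.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 5
mu = np.array([0.9, 0.8, 0.3, 0.2, 0.1])   # true (unknown) Bernoulli means

# Assumed correlation model: pseudo_ub[k, l] is a known upper bound on arm l's
# expected reward inferred from observations of arm k. Fabricated from mu
# here for illustration only.
pseudo_ub = np.minimum(1.0, np.tile(mu + 0.05, (K, 1)))

def correlated_ucb(T=20000):
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                      # play each arm once
        else:
            means = sums / counts
            best = int(np.argmax(means))
            # Arms whose correlation-based upper bound (w.r.t. the empirically
            # best arm) falls below that arm's mean cannot be optimal: skip them.
            competitive = pseudo_ub[best] >= means[best]
            competitive[best] = True
            ucb = means + np.sqrt(2 * np.log(t) / counts)
            ucb[~competitive] = -np.inf
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < mu[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += mu.max() - mu[arm]
    return regret

print("cumulative regret:", correlated_ucb())
```

With a vacuous table (all bounds equal to 1) the filter never removes an arm and the rule reduces to standard UCB; tighter bounds shrink the set of arms that ever need exploring.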

Queueing Redundant Tasks to Reduce Delay in Cloud Systems

We model job service in a distributed system of n servers as n identical queues, one at each server. Each incoming job is forked to all or a subset of the n servers, and it is complete when any k of its copies finish. The case k = 1 corresponds to replication of the job at the servers. This is one of the first works to analyze the response time of queues with redundancy.
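
A minimal event-driven simulation of this (n, k) fork-join model is sketched below, assuming Poisson arrivals, exponential service times, and cancellation of outstanding copies the moment k of them finish; all parameter values are illustrative.

```python
import heapq, random
from collections import defaultdict

random.seed(2)

def fork_join_sim(n=5, k=2, lam=0.5, mu=1.0, num_jobs=20000):
    """Mean response time of an (n, k) fork-join queue with cancellation."""
    arrivals, t = [], 0.0
    for _ in range(num_jobs):
        t += random.expovariate(lam)       # Poisson(lam) arrivals (assumed)
        arrivals.append(t)

    queues = [[] for _ in range(n)]        # per-server FIFO of job ids
    serving = [None] * n                   # job id in service at each server
    epoch = [0] * n                        # invalidates stale finish events
    copies_done = defaultdict(int)         # job id -> finished copies
    departed = set()
    resp = []
    events = [(a, 'arr', j, 0) for j, a in enumerate(arrivals)]
    heapq.heapify(events)

    def start_next(s, now):
        epoch[s] += 1                      # cancel any pending finish event
        while queues[s] and queues[s][0] in departed:
            queues[s].pop(0)               # drop copies of departed jobs
        if queues[s]:
            serving[s] = queues[s].pop(0)
            heapq.heappush(events, (now + random.expovariate(mu), 'fin', s, epoch[s]))
        else:
            serving[s] = None

    while events:
        t, kind, a, b = heapq.heappop(events)
        if kind == 'arr':
            for s in range(n):             # fork the job to all n servers
                queues[s].append(a)
                if serving[s] is None:
                    start_next(s, t)
        else:
            s, e = a, b
            if e != epoch[s]:
                continue                   # this copy was cancelled
            j = serving[s]
            copies_done[j] += 1
            if copies_done[j] == k:        # job departs; cancel other copies
                departed.add(j)
                resp.append(t - arrivals[j])
                for s2 in range(n):
                    if s2 != s and serving[s2] == j:
                        start_next(s2, t)
            start_next(s, t)
    return sum(resp) / len(resp)

for k in (1, 2, 5):
    print(f"k={k}: mean response time {fork_join_sim(k=k):.2f}")
```

In this model, k = 1 gives the full latency benefit of replication, while k = n reduces to a classic fork-join queue with no redundancy gain.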

Straggler Replication in Parallel Computing

In large-scale computing, where a job consists of hundreds of parallel tasks, the slowest task becomes the bottleneck. Frameworks such as MapReduce relaunch replicas of straggling tasks to cut the tail latency. We develop insights into the best time to relaunch, and the number of replicas to launch, so as to reduce latency without a significant increase in computing cost. For heavy-tailed task execution times, redundancy can reduce latency and cost simultaneously!
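
The Monte Carlo sketch below illustrates this effect for a single task, assuming i.i.d. Pareto execution times (a common heavy-tailed model): if the task is still running at time t, it is killed and a fresh replica is launched. Setting t to infinity recovers the no-replication baseline; the policy and parameters are illustrative, not those from our papers.

```python
import numpy as np

rng = np.random.default_rng(3)

def pareto(size, alpha=1.5, xm=1.0):
    """Heavy-tailed Pareto(alpha, xm) execution times (an assumed model)."""
    return xm * (1.0 - rng.random(size)) ** (-1.0 / alpha)

def relaunch_stats(t_relaunch, kill_original, trials=200_000):
    """Mean latency and compute cost when a replica is launched at t_relaunch."""
    x1, x2 = pareto(trials), pareto(trials)   # i.i.d. original and replica times
    straggler = x1 > t_relaunch
    if kill_original:
        # The straggling original is killed; only the fresh replica keeps running.
        latency = np.where(straggler, t_relaunch + x2, x1)
        cost = latency                         # exactly one copy runs at any time
    else:
        # Both copies run; the task finishes when the first of them does.
        latency = np.where(straggler, np.minimum(x1, t_relaunch + x2), x1)
        cost = np.where(straggler, 2 * latency - t_relaunch, x1)
    return latency.mean(), cost.mean()

print(f"{'t':>5} {'E[latency]':>11} {'E[cost]':>9}")
for t in (0.0, 1.0, 2.0, 4.0, np.inf):
    lat, cost = relaunch_stats(t, kill_original=True)
    print(f"{t:5.1f} {lat:11.2f} {cost:9.2f}")
```

For these assumed Pareto parameters, killing and relaunching a straggler at an intermediate time beats the no-replication baseline in both expected latency and expected cost, consistent with the claim above.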

Streaming Communication

Unlike traditional file transfer, where only the total delay matters, streaming requires fast and in-order delivery of individual packets to the user. We analyze the trade-off between throughput and in-order delivery delay, and in particular how it is affected by the frequency of feedback to the source. We propose a simple combination of repetition and greedy linear coding that achieves a close-to-optimal throughput-delay trade-off.
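
The toy simulation below exhibits the two extremes of this trade-off over an i.i.d. erasure channel with feedback every f slots; it is an illustration, not the coding scheme from our work. A pure-repetition source keeps in-order delay low but wastes slots when feedback is infrequent, while a fresh-packets-first source sustains throughput at the cost of delay.

```python
import random

random.seed(4)

def stream_sim(policy, f, erasure=0.2, slots=100_000):
    """Streaming over an i.i.d. erasure channel with feedback every f slots.

    'repeat': send the earliest packet not known (at the source) to be delivered.
    'fresh':  retransmit known-missing packets once after each feedback,
              otherwise always send a fresh packet.
    """
    received = set()                  # out-of-order packets at the receiver
    in_order = 0                      # receiver holds packets 0 .. in_order-1
    known_prefix, known = 0, set()    # source's stale view of the receiver
    src_ptr = 0                       # earliest packet not known delivered
    next_fresh = 0
    first_sent, delays, retx = {}, [], []

    for t in range(slots):
        if t % f == 0:                # feedback: source learns the true state
            known_prefix, known = in_order, set(received)
            while src_ptr < known_prefix or src_ptr in known:
                src_ptr += 1
            retx = [p for p in range(src_ptr, next_fresh) if p not in known]
        if policy == 'repeat':
            pkt = src_ptr if src_ptr < next_fresh else next_fresh
        else:
            pkt = retx.pop(0) if retx else next_fresh
        if pkt == next_fresh:
            next_fresh += 1
        first_sent.setdefault(pkt, t)
        if random.random() >= erasure:        # packet survives the channel
            received.add(pkt)
            while in_order in received:       # deliver the in-order prefix
                received.discard(in_order)
                delays.append(t - first_sent[in_order])
                in_order += 1
    return in_order / slots, sum(delays) / max(1, len(delays))

for f in (1, 10, 50):
    for policy in ('repeat', 'fresh'):
        thr, d = stream_sim(policy, f)
        print(f"f={f:3d} {policy:6s}: throughput {thr:.3f}, mean delay {d:6.1f}")
```

Neither extreme is good at both metrics when feedback is infrequent, which is precisely the gap the repetition-plus-greedy-coding scheme described above is designed to close.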