I'm a PhD candidate in Computer Science at Princeton University. I'm very fortunate to be advised by Elad Hazan. My research is focused on provably efficient algorithms for reinforcement learning, aided by the lens of dynamical systems and the algorithmic toolkit of optimization and (non-stochastic) online learning. My recent efforts seek to address the challenges that accompany sparse rewards, partial observability and large (or continuous) state spaces. At this time, I'm a long-term research intern at Google AI.

I did my bachelors at IIT Kanpur, where I worked on sketch-based algorithms for machine learning, and space lower bounds in the streaming model. I've also spent time at Microsoft Research in Redmond working on program synthesis.

[December 2018] Posted a new preprint on maximum entropy exploration — a provably efficient algorithm for exploring MDPs in the absence of a reward function.

Suppose an agent is in an unknown MDP in the absence of a reward signal, what might we hope that an agent can efficiently learn to do? One natural objective problem is to learn a policy which induces a distribution over state space that is as uniform as possible (in an entropic sense). Despite this mathematical program being non-convex, we provide a provably efficient method, both in terms of sample size and computational complexity, to construct such a maximum-entropy exploratory policy.

We give a polynomial-time algorithm for learning latent-state linear dynamical systems without system identification, and without assumptions on the spectral radius of the system's transition matrix. The algorithm extends the recently introduced technique of spectral filtering, previously applied only to systems with a symmetric transition matrix, using a novel convex relaxation to allow for the efficient identification of phases.

Adaptive regularization methods come in diagonal and full-matrix variants. However, only the former have enjoyed widespread adoption in training large-scale deep models. In this paper, we show how to make full-matrix adaptive regularization practical and useful. At the heart of our algorithm is an efficient method for computing the inverse square root of a low-rank matrix. We show that GGT converges to first-order local minima, providing the first rigorous theoretical analysis of adaptive regularization in non-convex optimization.

We study the control of symmetric linear dynamical systems with unknown dynamics and a hidden state. Using a recent spectral filtering technique, we formulate optimal control in this setting as a convex program. This approach eliminates the need to solve the nonconvex problem of explicit identification of the system and its latent state, and allows for provable optimality guarantees for the control signal.

We present an efficient and practical algorithm for the online prediction of discrete-time linear dynamical systems. Despite the non-convex optimization problem, our algorithm comes with provable guarantees: it has near-optimal regret bounds compared to the best LDS in hindsight, while overparameterizing the model by a small logarithmic factor. Our analysis brings together ideas from improper learning through convex relaxations, online regret minimization, and the spectral theory of Hankel matrices.

We design differentially private algorithms for the problem of online linear optimization in the full information and bandit settings with optimal $O(\sqrt{T})$
regret bounds. In the full-information setting, our results demonstrate that $\epsilon$-differential privacy may be ensured for free. In particular, the regret bounds scale as $O\left(\sqrt{T}+{\epsilon}^{-1}\right)$. In the bandit setting, this is the first $O\left({\epsilon}^{-1}\sqrt{T}\right)$-regret algorithm.

We consider regret minimization in repeated games with non-convex loss functions. Minimizing the standard notion of regret is computationally intractable. Thus, we define a natural notion of regret which permits efficient optimization and generalizes offline guarantees for convergence to an approximate local optimum. We give gradient-based methods that achieve optimal regret, which in turn guarantee convergence to equilibrium in this framework.