# Machine Learning 10-725

**Instructor:** Yuanzhi Li (yuanzhil at andrew dot cmu dot edu)

## Course description

According to Wikipedia: Convex optimization is a subfield of mathematical optimization that studies the problem of minimizing convex functions over convex sets. Many classes of convex optimization problems admit polynomial-time algorithms, whereas mathematical optimization is in general NP-hard.

According to Yin Tat Lee: The design of algorithms is traditionally a discrete endeavor. However, many advances have come from a continuous viewpoint. Typically, a continuous process, deterministic or randomized, is designed (or shown) to have desirable properties, such as approaching an optimal solution or a desired distribution, and an algorithm is derived from this by appropriate discretization. In interesting and general settings, the current fastest methods are a consequence of this perspective.

In this course: Convex optimization has long been an extremely useful tool in machine learning, capable of solving many real-world problems efficiently, such as linear regression, logistic regression, linear programming, and semidefinite programming (SDP). In modern machine learning, however, as non-convex neural networks have grown to be the dominant models in the field, the principles developed in traditional convex optimization are becoming insufficient. In these lectures, you will see a different convex optimization course: this newly designed course covers the classical topics in convex optimization, such as Gradient Descent, Stochastic Gradient Descent, Mirror Descent, Accelerated Gradient Descent, and the Interior Point Method, and connects them all the way to optimization topics in deep learning, such as Neural Tangent Kernels, learning rate schedules, momentum, batch normalization, adversarial training, Generative Adversarial Networks, and Reinforcement Learning.
By the end of the course, you will have a sense of the spirit of convex optimization, and how it can still be applied to real-world problems that are not convex at all.

## TAs

- Chirag Pabbaraju: cpabbara at andrew dot cmu dot edu
- Robin Schmucker: rschmuck at andrew dot cmu dot edu
- Shuli Jiang: shulij at andrew dot cmu dot edu
- Gabriele Farina: gfarina at andrew dot cmu dot edu
- Vishwak Srinivasan: vishwaks at andrew dot cmu dot edu

## Logistics

**Lectures:** Monday and Wednesday, 1:30-2:50pm, remote teaching (recordings are available)

**Course evaluation:** 100% homework (the assignments will be very simple and easy :) some amount of coding is expected).
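To make the claim above concrete — that convex problems such as linear regression can be solved efficiently — here is a minimal Python sketch (illustrative only, not course material; the data, step size, and iteration count are made-up choices) of plain gradient descent on a least-squares objective:

```python
import numpy as np

# Hypothetical example: least-squares linear regression solved by
# gradient descent. All problem sizes and constants are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true  # noiseless targets, so the minimizer is exactly x_true

# f(x) = 0.5 * ||Ax - b||^2 is convex; its gradient is A^T (Ax - b)
x = np.zeros(3)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # step 1/L for an L-smooth objective
for _ in range(2000):
    x -= step * A.T @ (A @ x - b)

print(np.round(x, 3))  # converges to x_true
```

The step size `1/L` (with `L` the largest eigenvalue of `A^T A`) is the standard safe choice that guarantees convergence for smooth convex objectives.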
## Schedule

### Fundamentals of convex optimization

| Lecture | Topic | Notes |
| --- | --- | --- |
| Lecture 1 | Introduction: Why am I taking this course? | (Definitely not just because the grading is super gentle) |
| Lecture 2 | Gradient Descent and Basic Mirror Descent | (One of the "only two" optimization algorithms) |
| Lecture 3 | Momentum is all you need: The Accelerated Gradient Descent algorithm | (Based on the paper "Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent") |
| Lecture 4 | Constrained optimization: Projected Gradient Descent | |
| Lecture 5 | The real Mirror Descent | (One of the "only two" optimization algorithms) |
| Lecture 6 | Stochastic Gradient Descent | |
| Lecture 7 | Variance reduction in SGD: SVRG and SAGA | |
| Lecture 8 | Duality and Minimax Optimization | (The gradient-descent-ascent and Mirror-Prox algorithms) |
| Lecture 9 | Distributed Optimization | |
| Lecture 10 | Proximal Gradient Descent | |
| Lecture 11 | Foundations of Convex Optimization: Recap | (This lecture will be removed next semester and replaced by "Geodesic Convex Optimization: Convex optimization over manifolds", at students' request) |
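As a taste of Lectures 2 and 3, here is a hedged Python sketch (not course code; the test problem and all constants are illustrative) comparing plain gradient descent with Nesterov-style accelerated gradient descent on an ill-conditioned quadratic:

```python
import numpy as np

# Illustrative comparison: gradient descent vs. accelerated gradient
# descent on f(x) = 0.5 x^T D x with condition number 100.
d = np.array([1.0, 100.0])            # eigenvalues of the diagonal Hessian
grad = lambda x: d * x                # gradient of f
f = lambda x: 0.5 * np.sum(d * x**2)  # objective value
step = 1.0 / d.max()                  # step 1/L

x_gd = np.ones(2)                     # plain gradient descent iterate
x_ag = np.ones(2)                     # accelerated iterate
y = x_ag.copy()                       # lookahead (momentum) point
for k in range(100):
    x_gd = x_gd - step * grad(x_gd)
    x_next = y - step * grad(y)                    # gradient step at lookahead
    y = x_next + (k / (k + 3)) * (x_next - x_ag)   # momentum extrapolation
    x_ag = x_next

print(f(x_gd), f(x_ag))  # the accelerated iterate reaches a lower objective
```

The `k/(k+3)` momentum weight is one common Nesterov schedule; it upgrades the O(1/k) rate of gradient descent to O(1/k²) on smooth convex problems.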
### Convex geometry

| Lecture | Topic | Notes |
| --- | --- | --- |
| Lecture 12 | Hessian Matrices and Preconditioned Gradient Descent | |
| Lecture 13 | Self-concordant functions and Interior Point Methods | (How to eat sandwiches elegantly) |
| Lecture 14 | Adaptive Algorithms | (The famous Adagrad and Adam) |
| Lecture 15 | Ellipsoid algorithm | (How to cut cakes elegantly) |
| Lecture 16* (to be added next semester) | Geodesic Convex Optimization | (Optimizing a "convex function" on a low-dimensional manifold) |
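As a preview of the "Adaptive Algorithms" lecture, here is a hedged sketch of the diagonal Adagrad update on a simple quadratic (the test problem, learning rate, and iteration count are illustrative choices, not anything prescribed by the course):

```python
import numpy as np

# Illustrative Adagrad: each coordinate gets its own step size, scaled by
# the accumulated squared gradients, so ill-conditioning is handled
# automatically without hand-tuning per-coordinate rates.
d = np.array([1.0, 100.0])       # ill-conditioned quadratic 0.5 x^T D x
x = np.array([3.0, -2.0])        # arbitrary starting point
G = np.zeros(2)                  # running sum of squared gradients
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = d * x                    # gradient of the quadratic
    G += g**2
    x -= eta * g / (np.sqrt(G) + eps)  # per-coordinate adaptive step
print(np.round(x, 4))  # both coordinates are driven essentially to zero
```

Note how the steep coordinate (curvature 100) is automatically given a much smaller effective step than the flat one — the behavior that motivates Adagrad and Adam.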
### Introduction to non-convex optimization

| Lecture | Topic | Notes |
| --- | --- | --- |
| Lecture 16 | Limitations of Convex Optimization, and why we need Non-Convexity in machine learning | (This lecture is really important; try not to miss it) |
| Lecture 17 | Saddle Points, Local Minima, Hessian Descent, and Noisy Gradient Descent | |
| Lecture 18 | Introduction to Bayesian Optimization | |
| Lecture 19 | Over-parameterization in non-convex optimization | (Larger non-convex models are easier to train?) |
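The noisy gradient descent idea from Lecture 17 can be sketched in a few lines of Python. This is an illustrative toy, not course material: the test function f(x, y) = x² + (y² − 1)²/4, which has a strict saddle at the origin, and all constants are made-up choices.

```python
import numpy as np

# Plain gradient descent started exactly at the saddle (0, 0) never moves,
# since the gradient there is zero. Adding a little Gaussian noise lets the
# iterate fall off the saddle and reach a local minimum at (0, +/-1).
rng = np.random.default_rng(1)
x = np.zeros(2)  # start exactly at the strict saddle point
grad = lambda p: np.array([2 * p[0], p[1] * (p[1]**2 - 1)])
step, noise = 0.05, 0.01
for _ in range(3000):
    x -= step * (grad(x) + noise * rng.standard_normal(2))
print(np.round(x, 2))  # |y| ends near 1: the noise pushed it off the saddle
```

The negative curvature direction (here, the y-axis) amplifies any small perturbation, which is the intuition behind why noisy gradient descent escapes strict saddle points.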
### Introduction to non-convexity in deep learning: Beyond black-box optimization

| Lecture | Topic | Notes |
| --- | --- | --- |
| Lecture 20 | Over-parameterization in deep learning: The Principle of Diversity beyond Convex Optimization, and Neural Tangent Kernels | (Larger neural networks are easier to train?) |
| Lecture 21 | Training and Generalization: Algorithmic regularization and inductive bias in non-convex optimization | (For example, why does a larger initial learning rate in SGD improve generalization? Why does momentum improve generalization?) |
| Lecture 22 | Gradient Descent in Deep Learning at the Edge of Stability | (How the local smoothness of the training objective continuously hovers above 2/(learning rate) in deep learning) |
| Lecture 23 | Adversarial examples and adversarial training in deep learning | |
| Lecture 24 | Feature purification: The inductive bias of gradient descent in deep learning | (How adversarial training performs robust deep learning) |
| Lecture 25 | Inductive bias of random initialization in deep learning: Ensembles and knowledge distillation | (How ensembles in deep learning work fundamentally differently from ensembles of random feature mappings) |
| Lecture 26 | Generative Adversarial Networks and minimax non-convex-non-concave optimization | (Optimality conditions, the local adversarial training algorithm, and the Forward Super-Resolution principle) |
| Lecture 27 | The non-convex optimization inside deep learning: Forward Feature Learning and Backward Feature Correction | |
| Lecture 28 | Quantum optimization: A quantum version of gradient descent? | |