A striking feature of modern supervised machine learning is its pervasive over-parametrization. Deep networks contain millions of parameters, often exceeding the number of data points by orders of magnitude. These networks are trained to nearly interpolate the data by driving the training error to zero. Yet, at odds with most theory, they show excellent test performance. It has become accepted wisdom that these properties are special to deep networks and require non-convex analysis to understand.
In this talk I will show that classical (convex) kernel methods do, in fact, exhibit these unusual properties. Moreover, kernel methods provide a competitive practical alternative to deep learning, after we address the non-trivial challenges of scaling to modern big data. I will also present theoretical and empirical results indicating that we are unlikely to make progress on understanding deep learning until we develop a fundamental understanding of classical "shallow" kernel classifiers in the "modern" over-fitted setting. Finally, I will show that ubiquitously used stochastic gradient descent (SGD) is very effective at driving the training error to zero in the interpolated regime, a finding that sheds light on the effectiveness of modern methods and provides specific guidance for parameter selection.
These results present a perspective and a challenge. Much of the success of modern learning comes into focus when considered from over-parametrization and interpolation point of view. The next step is to address the basic question of why classifiers in the "modern" interpolated setting generalize so well to unseen data. Kernel methods provide both a compelling set of practical algorithms and an analytical platform for resolving this fundamental issue.
Based on joint work with Siyuan Ma, Raef Bassily, Chayoue Liu and Soumik Mandal.