While convex analysis has received much attention from the machine learning community, the theoretical analysis of non-convex optimization is still nascent. This blog, as well as the recent NIPS 2015 workshop on non-convex optimization, aims to accelerate research in this area. Along with Kamalika Chaudhuri, Percy Liang, Niranjan U Naresh, and Sewoong Oh, I was one of the organizers of this NIPS workshop.
The workshop started with my talk on recent progress and challenges. Many interesting machine learning tasks are non-convex, e.g. maximum likelihood estimation of latent variable models or training multi-layer neural networks. As the input dimension grows, the number of critical points can grow exponentially, making analysis difficult. In contrast, a strictly convex objective has a unique critical point, which corresponds to the globally optimal solution.
Spectral and tensor methods
I gave an overview of instances where we can provide guarantees for non-convex optimization. Examples include finding spectral decompositions of matrices and tensors under a set of conditions. For more details on spectral methods, see Rong Ge’s post on this blog. Kevin Chen gave a compelling talk on the superiority of spectral methods in genomics for training hidden Markov models (HMM) compared to traditional approaches such as expectation maximization (EM). In particular, spectral methods provided better biological interpretation, were more robust to class imbalance, and were orders of magnitude faster than EM.
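To make the tensor case concrete, here is a toy sketch (my own illustrative example, not code from any talk) of tensor power iteration on an orthogonally decomposable third-order tensor: repeatedly applying the tensor to a vector recovers one of the hidden components.

```python
import numpy as np

# Toy sketch: tensor power iteration on an orthogonally decomposable
# 3rd-order tensor T = sum_i lambda_i a_i (x) a_i (x) a_i.
# The map v <- T(I, v, v) / ||T(I, v, v)|| converges to one of the a_i.
rng = np.random.default_rng(0)
A = np.linalg.qr(rng.normal(size=(5, 5)))[0][:, :3]   # orthonormal components a_i
lam = np.array([3.0, 2.0, 1.0])                       # component weights
T = np.einsum('i,ai,bi,ci->abc', lam, A, A, A)        # build the tensor

v = rng.normal(size=5)                                # random initialization
for _ in range(50):
    v = np.einsum('abc,b,c->a', T, v, v)              # v <- T(I, v, v)
    v /= np.linalg.norm(v)

print(np.max(np.abs(A.T @ v)))  # ~1.0: v has aligned with one component a_i
```

Convergence here is quadratic, which is why a few dozen iterations suffice; the guarantees require the orthogonality (or a whitening step), which is one of the "set of conditions" mentioned above.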
Avoiding saddle points
Moving on to more general non-convex optimization, in my talk I pointed out the difficulty of even converging to a local optimum, due to the existence of saddle points. Saddle points are critical points that are not local minima: for a minimization problem, there exist directions along which the objective value decreases. Saddle points can slow down gradient descent arbitrarily. Newton’s method, on the other hand, converges to an arbitrary critical point: it does not distinguish between a local minimum and a saddle point.
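A tiny toy illustration of both failure modes (my own sketch): on f(x, y) = x² − y², Newton's method lands exactly on the saddle, while gradient descent started near it barely moves.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin: the gradient
# vanishes there, yet moving along y decreases f.
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
hess = np.array([[2.0, 0.0], [0.0, -2.0]])

# Newton's method jumps to the critical point in a single step,
# even though it is a saddle rather than a minimum.
p0 = np.array([1.0, 1.0])
p_newton = p0 - np.linalg.solve(hess, grad(p0))

# Gradient descent started near the saddle (tiny y-component) stalls:
# the escape direction grows only geometrically from a tiny seed.
p_gd = np.array([1.0, 1e-10])
for _ in range(100):
    p_gd = p_gd - 0.01 * grad(p_gd)

print(p_newton)  # [0. 0.] -- the saddle point itself
print(p_gd)      # y is still negligible: hardly any progress escaping
```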
One solution for escaping saddle points is to use second-order Hessian information to find a direction of escape when the gradient is small: the Hessian eigenvectors with negative eigenvalues provide such directions. See works here, here and here. A recent work shows, surprisingly, that it is possible to escape saddle points using only first-order information, via noisy stochastic gradient descent (SGD). In many applications, this is far cheaper than (approximate) computation of Hessian eigenvectors. However, one unresolved issue is handling degenerate saddle points, where the Hessian has only positive and zero eigenvalues (i.e. it is positive semi-definite). At such points, even distinguishing a saddle point from a local optimum is hard. It is also an open problem to establish the presence or absence of such degenerate saddle points for particular non-convex problems, e.g. in deep learning.
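The idea behind the first-order escape can be seen on the same toy function (a sketch of the idea only, not the algorithm from the cited paper): exact gradient descent initialized at the saddle stays there forever, while adding isotropic noise to each gradient step pushes the iterate into the negative-curvature direction.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])  # f(x, y) = x^2 - y^2

# Exact gradient descent started exactly at the saddle never moves:
p_gd = np.zeros(2)
for _ in range(1000):
    p_gd = p_gd - 0.01 * grad(p_gd)

# Perturbing each gradient with isotropic noise lets the iterate fall
# into the negative-curvature (y) direction and leave the saddle.
# (On this toy f the descent continues forever, since f is unbounded below.)
p_noisy = np.zeros(2)
for _ in range(1000):
    p_noisy = p_noisy - 0.01 * (grad(p_noisy) + rng.normal(scale=0.1, size=2))

print(p_gd)             # still (0, 0)
print(abs(p_noisy[1]))  # no longer zero: the noisy iterate has escaped
```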
Optimization landscape for deep learning
Yann LeCun talked about how the success of deep learning has shown that non-convexity should not be treated as an obstacle. See slides here. An open problem is to understand the optimization landscape for deep learning. While classical works have shown the presence of bad local optima even in simple low-dimensional problems, a recent work argues that the picture is quite different in high dimensions, suggesting that the loss function for training deep neural networks can be approximated as a random Gaussian polynomial under a set of (strong) assumptions. A beautiful math paper by Auffinger and Ben Arous characterizes the critical points of such random Gaussian polynomials, proving that the objective values of all local minima concentrate in a “narrow” band, and differ significantly from the values attained at saddle points. However, to add a word of caution, this does not imply that a random Gaussian polynomial is computationally easy to optimize. Current theory does not give a good characterization of the basins of attraction for the local optima, and requires an exponential number of initializations to guarantee successful convergence. Moreover, degenerate saddle points can be present, and they are hard to escape from, as I discussed earlier.
Everything old is new again
Andrew Barron took us back in time, and described how he got interested in neural networks when he discovered that they were training 6-8 layer neural networks back in the 1960s in his dad’s company. Andrew provided a glimpse of the classical results, which bound both the approximation and estimation errors of a shallow neural network (one with a single hidden layer). The approximation error is related to the Fourier spectrum of the target function: intuitively, a smoother signal has less high-frequency content, and admits a better approximation error under the class of networks of a fixed size. The estimation bound gives the risk achieved with a finite amount of training data, but achieving this bound is computationally hard. Andrew then talked about his recent efforts to develop computationally efficient algorithms for training neural networks based on the use of generative models of the input. This builds on my recent work on training neural networks using tensor methods.
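For readers who want the shape of the classical approximation bound, here it is as I recall it (stated roughly; see Barron's 1993 paper for the precise constants and conditions):

```latex
% Roughly, Barron's approximation theorem: if the Fourier transform
% \hat{f} of the target function f has a finite first moment,
%   C_f = \int \|\omega\| \, |\hat{f}(\omega)| \, d\omega < \infty ,
% then there exists a one-hidden-layer network f_n with n sigmoidal units with
\int_{\|x\| \le r} \big( f(x) - f_n(x) \big)^2 \, \mu(dx)
  \;\le\; \frac{(2 r C_f)^2}{n},
% i.e. the squared approximation error decays as O(C_f^2 / n),
% independently of the input dimension (the dimension hides in C_f).
```

The smoothness intuition above is exactly the factor C_f: a smoother target has less high-frequency Fourier mass, hence a smaller C_f and a smaller network suffices.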
Do not accept the given non-convex problem as is
Sanjeev Arora argued that unlike traditional computer science theory, which has focused on characterizing worst-case problem instances, machine learning offers a much more flexible framework. In fact, even the objective function can be changed! He focused on the problem of non-negative matrix factorization (NMF), where the original objective function of quadratic loss minimization is hard. On the other hand, under a so-called separability assumption, the objective can be changed to finding a simplex with non-negative vertices that contains the observations, which can be solved efficiently. A similar philosophy holds for spectral methods, where the original objective function is abandoned, and instead spectral decompositions are employed to solve the learning task at hand. Sanjeev also made the excellent point that the assumptions needed for success are often something we can control as data collectors. For instance, separability holds for the NMF problem when more features are collected, e.g. for learning latent variable PCFGs via NMF techniques.
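A minimal sketch of the geometric idea (an illustrative successive-projection heuristic in the same spirit, not Sanjeev's exact algorithm): under separability, some columns of the data matrix are themselves the vertices of the simplex, and a greedy procedure can find them.

```python
import numpy as np

def find_anchors(X, k):
    """Successive projection: greedily pick the column of X with the
    largest residual norm, then project all columns onto the orthogonal
    complement of the chosen column. Under separability (every column
    is a convex combination of k 'anchor' columns actually present in X),
    this recovers the anchors in the noiseless case."""
    R = X.astype(float).copy()
    anchors = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)  # project out the chosen direction
    return sorted(anchors)

# Synthetic separable data: columns 0..2 are the anchors; the remaining
# columns are random convex combinations of them.
rng = np.random.default_rng(1)
W = rng.random((6, 3))                      # anchor columns
H = rng.dirichlet(np.ones(3), size=7).T     # convex-combination weights
X = np.hstack([W, W @ H])                   # anchors first, mixtures after
print(find_anchors(X, 3))  # -> [0, 1, 2]
```

Once the anchors are known, the remaining weights can be filled in by non-negative least squares, so the hard non-convex objective never has to be optimized directly.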
In a similar spirit, Chris Re showed how we can strengthen the current theory to remove some of the pessimism behind the hardness of many problems that have good solutions in practice. He described the notion of combinatorial width (or fractional hypertree width), a notion from logic that can provide a better analysis of Gibbs sampling and belief propagation. Gregory Valiant showed that by assuming more structure on the class of probability distributions, we can do much better than in the unstructured setting. Examples include topic models, hidden Markov models, and word embeddings.
Convex envelopes and Gaussian smoothing
Hossein Mobahi talked about smoothing approaches for convexification. The convex envelope of a function is the convex function that provides the tightest lower bound. Any global minimizer of the original objective function is also a global minimizer of the convex envelope. While we are mostly familiar with the characterization of convex envelopes through duality (i.e. the dual of the dual is the convex envelope), its characterization through a partial differential equation (PDE) has been mostly unknown to the machine learning community. Hossein introduced the PDE form of the convex envelope, and gave a nice interpretation in terms of carrying out diffusion along the non-convex regions of the objective function to obtain the convex envelope. However, the convergence time for this PDE can in general be exponential in the input dimension, so the approach is not tractable. Hossein showed that instead we can perform Gaussian smoothing efficiently for many functions, such as polynomials. He gave a novel characterization of Gaussian smoothing as a linearization of the convex envelope. He proved this by analyzing the PDE of the convex envelope and showing that its linearization yields the heat equation, which corresponds to Gaussian smoothing. Hossein also provided bounds for the continuation method, where the extent of smoothing is progressively decreased, and related them to the complexity of the objective function. For more details, refer to his paper.
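To see the continuation method in action, here is a one-dimensional toy example of mine (not from the talk): for a polynomial, the Gaussian-smoothed objective has a closed form, and shrinking the smoothing level while warm-starting gradient descent steers the iterate into the global basin.

```python
import numpy as np

# Gaussian smoothing of f(x) = x^4 - 8x^2 + 3x has a closed form:
# E_z[f(x + s*z)] = x^4 + (6s^2 - 8)x^2 + 3x + const,  z ~ N(0, 1).
# For s^2 >= 4/3 the smoothed objective is convex, so the coarsest
# problem is easy; continuation then tracks its minimizer as s -> 0.
def grad_smoothed(x, s):
    return 4 * x**3 + 2 * (6 * s**2 - 8) * x + 3

x = 2.0  # deliberately started in the basin of the *worse* local minimum
for s in np.linspace(2.0, 0.0, 21):     # progressively less smoothing
    for _ in range(2000):               # minimize the current smoothed f
        x -= 0.01 * grad_smoothed(x, s)

print(x)  # ~ -2.09: the global minimizer, not the poorer local min near 1.90
```

Plain gradient descent on the unsmoothed f from the same start converges to the local minimum near x = 1.90 (objective about -10), whereas the continuation path ends near x = -2.09 (objective about -22).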
NP-hardness should not deter us from analyzing non-convex optimization in the context of machine learning. We should be creative in making new assumptions, try changing the problem structure and the objective function, collect relevant data to make the learning problem easier, and so on.
As I conclude this blog post, I am reminded of my discussion with Leon Bottou. He said that there are three levels of thinking, of increasing sophistication. At the first level, we merely aim to prove statements that are already formulated. Unfortunately, almost all formal education focuses on this type of skill. At the second level, we try to draw implications from a fixed set of assumptions. At the third level, we must simultaneously come up with reasonable assumptions and their implications, and this level requires the most creativity. In the area of machine learning, there is tremendous opportunity for such creativity, and I am hoping that the workshop managed to foster more of it.