Off the convex path: Algorithms off the convex path.
http://offconvex.github.io/
Rip van Winkle's Razor, a Simple New Estimate for Adaptive Data Analysis

<p><em>Can you trust a model whose designer had access to the test/holdout set?</em> This implicit question
in <a href="https://science.sciencemag.org/content/349/6248/636.full">Dwork et al 2015</a> launched a new field, <em>adaptive data analysis</em>.
The question referred to the fact that in many scientific settings as well as modern machine learning (with its standardized datasets like CIFAR,
ImageNet etc.) the model designer has full access to the holdout set and is free to ignore the</p>
<blockquote>
<p>(Basic Dictum of Data Science) “Thou shalt not train on the test/holdout set.”</p>
</blockquote>
<p>Furthermore, even researchers who scrupulously follow the Basic Dictum may be unknowingly violating it when they take inspiration (and design choices)
from published works by others who presumably published <em>only the best of the many models they evaluated on the test set.</em></p>
<p>Dwork et al. showed that if the test set has size $N$, and the designer is allowed to see the error of the first $i-1$ models on the test set before designing the $i$’th model, then a clever designer can use so-called <a href="https://arxiv.org/pdf/1502.04585.pdf"><em>wacky boosting</em></a> (see this <a href="http://blog.mrtz.org/2015/03/09/competition.html">blog post</a>) to ensure that the accuracy of the $t$’th model on the test set is as high as $\Omega(\sqrt{t/N})$. In other words, the test set could become essentially useless once $t \gg N$, a
condition that already holds in ML: for popular datasets (CIFAR-10, CIFAR-100, ImageNet, etc.) $N$ is no more than $100{,}000$, while the total number of models being trained
world-wide is well into the millions, if not higher (once hyperparameter searches are included).</p>
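<p>A toy simulation conveys the flavor of the attack (a hedged sketch, not the exact construction in the papers: here each “model” is just a random $\pm 1$ prediction vector, and the better-than-chance ones are majority-voted):</p>

```python
# Sketch of a wacky-boosting-style attack: query t random "models" against
# the hidden test labels, keep those that happen to beat chance, and
# majority-vote them. No single model knows anything about the labels, yet
# the aggregate overfits the test set substantially.
import numpy as np

rng = np.random.default_rng(0)
N, t = 10_000, 5_000
y = rng.choice([-1, 1], size=N)            # hidden test labels

preds = rng.choice([-1, 1], size=(t, N))   # t random "models"
acc = (preds == y).mean(axis=1)            # t queries to the test set
good = preds[acc > 0.5]                    # keep better-than-chance models
final = np.sign(good.sum(axis=0))          # majority vote

print(f"best single model accuracy: {acc.max():.3f}")
print(f"boosted test accuracy:      {(final == y).mean():.3f}")
```

The boosted accuracy is well above that of any individual random model, even though each component carries only chance-level information.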
<blockquote>
<p><strong>Meta-overfitting Error (MOE)</strong> of a model is the difference between its average error on the test data and its expected error on the full distribution.
(It is closely related to <a href="https://en.wikipedia.org/wiki/False_discovery_rate"><em>false discovery rate</em></a> in statistics.)</p>
</blockquote>
<p>This blog post concerns <a href="https://arxiv.org/pdf/2102.13189.pdf">our new paper</a>, which gives meaningful upper bounds on this sort of trouble for popular
deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate <em>Rip van Winkle’s Razor</em>
which combines references to <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam’s Razor</a> and the
<a href="https://en.wikipedia.org/wiki/Rip_Van_Winkle">mythical person who fell asleep for 20 years</a>.</p>
<figure align="center">
<img src="http://www.offconvex.org/assets/ripvanwinkle.jpg" alt="drawing" width="50%" />
<figcaption> Rip Van Winkle wakes up from 20 years of sleep, clearly needing a Razor </figcaption>
</figure>
<h2 id="adaptive-data-analysis-brief-tour">Adaptive Data Analysis: Brief tour</h2>
<p>It is well-known that for a model trained <strong>without</strong> ever querying the test set, MOE scales (with high probability over choice of the test set) as $1/\sqrt{N}$ where $N$
is the size of the test set. Furthermore, standard concentration bounds imply that even if we train $t$ models without ever referring to the test set (in other words,
using proper data hygiene), then the maximum meta-overfitting error among the $t$ models scales whp as $O(\sqrt{\log(t)/N})$. The trouble pinpointed by Dwork et al.
can happen only if models are designed adaptively, with test error of the previous models shaping the design of the next model.</p>
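<p>The $O(\sqrt{\log(t)/N})$ scaling is easy to see numerically (a quick sketch; the true error $0.3$ and the sizes below are arbitrary choices):</p>

```python
# Non-adaptive setting: each of t models has true error 0.3, and its test
# error is an average of N independent Bernoulli draws. The worst deviation
# over t models grows only like sqrt(log(t)/N).
import numpy as np

rng = np.random.default_rng(0)
N, true_err = 10_000, 0.3
for t in [10, 100, 1_000, 10_000]:
    test_err = rng.binomial(N, true_err, size=t) / N
    max_moe = np.abs(test_err - true_err).max()
    print(f"t={t:>6}  max MOE={max_moe:.4f}  "
          f"sqrt(log t / N)={np.sqrt(np.log(t)/N):.4f}")
```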
<p>Adaptive Data Analysis has come up with many good practices for honest researchers to mitigate such issues. For instance, Dwork et al. showed that using
Differential Privacy on labels while evaluating models can lower MOE. Or the <a href="https://arxiv.org/pdf/1502.04585.pdf">Ladder mechanism</a> helps in Kaggle-like
settings where the test dataset resides on a server that can choose to answer only a selected subset of queries, which essentially takes away the MOE issue.</p>
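<p>The Ladder mechanism itself is simple enough to sketch in a few lines (a simplified version: the exact rounding convention and the choice of the threshold $\eta$ in the Blum-Hardt paper differ):</p>

```python
# Simplified Ladder mechanism: the server reveals a submission's test loss
# only when it improves on the best loss so far by more than eta, and even
# then only rounded to eta-precision; otherwise it repeats the old answer.
import numpy as np

class Ladder:
    def __init__(self, y_test, eta=0.01):
        self.y, self.eta, self.best = y_test, eta, float("inf")

    def submit(self, preds):
        loss = np.mean(preds != self.y)
        if loss < self.best - self.eta:            # genuine improvement
            self.best = round(loss / self.eta) * self.eta
        return self.best                           # else: stale answer

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
server = Ladder(y)
for _ in range(5):                                 # chance-level submissions
    print(server.submit(rng.integers(0, 2, size=1000)))
```

Because repeated near-duplicate queries return the same rounded answer, an adaptive analyst extracts very little information per submission.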
<p>For several of these good practices there exist matching lower bounds, i.e., constructions of cheating models whose MOE matches the upper bound.</p>
<p>However, such recommended best practices do not help us bound the MOE in the performance numbers of a new model, since there is no guarantee that its
inventors never tuned it using the test set, or didn’t take inspiration from existing models that may have been designed that way. Thus, statistically
speaking, the above results still give no reason to believe that a modern deep net such as ResNet-152 has low MOE.</p>
<p><a href="http://proceedings.mlr.press/v97/recht19a/recht19a.pdf">Recht et al. 2019</a> summed up the MOE issue in a catchy title: <em>Do ImageNet Classifiers Generalize to ImageNet?</em> They tried to answer their question experimentally by creating new test sets from scratch; we discuss their results later.</p>
<h2 id="moe-bounds-and-description-length">MOE bounds and description length</h2>
<p>The starting point of our work is the following classical concentration bounds:</p>
<blockquote>
<p><strong>Folklore Theorem</strong> With high probability over the choice of a test set of size $N$, the MOE of <em>all</em> models with description length at most $k$ bits is $O(\sqrt{k/N})$.</p>
</blockquote>
<p>At first sight this doesn’t seem to help us because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting
values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE.</p>
<p>Another obvious description would be the computer program used to produce the model using the (publicly available) training and validation sets. However, these
programs usually rely on imported libraries through layers of encapsulation and so the effective program size is pretty large as well.</p>
<h2 id="rip-van-winkles-razor">Rip van Winkle’s Razor</h2>
<p>Our new upper bound involves a more careful definition of <em>Description Length</em>: it is the smallest description that allows a referee to reproduce a model of
similar performance using the (universally available) training and validation datasets.</p>
<p>While this phrasing may appear reminiscent of the review process for conferences and journals, there is a subtle difference with respect to what the referee
can or cannot be assumed to know. (Clearly, assumptions about the referee can greatly affect description length: a referee ignorant of even basic
calculus might need a very long explanation!)</p>
<blockquote>
<p><strong>Informed Referee:</strong> “Knows everything that was known to humanity (e.g., about deep learning, mathematics, optimization, statistics, etc.) right up to the
moment of creation of the Test set.”</p>
</blockquote>
<blockquote>
<p><strong>Unbiased Referee:</strong> Knows nothing discovered since the Test set was created.</p>
</blockquote>
<p>Thus <em>Description Length</em> of a model is the number of bits in the shortest description that allows an informed but unbiased referee to reproduce the claimed result.</p>
<p>Note that informed referees allow descriptions to get shorter, while unbiased referees require longer descriptions that rule out any statistical “contamination” due to any interaction whatsoever with the test set. For example, momentum techniques in optimization were
well-studied before the creation of ImageNet test set, so informed referees can be expected to understand a line like “SGD with momentum 0.9.” But a
line like “Use Batch Normalization” cannot be understood by unbiased referees since conceivably this technique (invented after 2012) might have
become popular precisely because it leads to better performance on the test set of ImageNet.</p>
<p>By now it should be clear why the estimate is named after <a href="https://en.wikipedia.org/wiki/Rip_Van_Winkle">“Rip van Winkle”</a>: the referee can be thought
of as an infinitely well-informed researcher who went into deep sleep at the moment of creation of the test set, and has just been woken up years later
to start refereeing the latest papers. Real-life journal referees who luckily did not suffer this way should try to simulate the idealized Rip van Winkle
in their heads while perusing the description submitted by the researcher.</p>
<p>To allow as short a description as possible the researcher is allowed to compress the description of their new deep net non-destructively using any compression that would make sense to Rip van Winkle (e.g., <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman Coding</a>). The description of the compression method itself
is not counted towards the description length, provided the same method is used for all papers submitted to Rip van Winkle. To give an example, a
technique appearing in a text known to Rip van Winkle could be referred to succinctly via the book’s ISBN and a page number.</p>
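<p>As a toy illustration of such test-set-independent compression (not the paper's actual encoding scheme, and the token stream below is made up), a Huffman code assigns frequent vocabulary items short bit strings:</p>

```python
# Build a Huffman code over a toy stream of architecture-description tokens:
# frequent symbols like CONV receive fewer bits than rare ones, shortening
# the overall description relative to a fixed-width code.
import heapq
from collections import Counter

def huffman_code(freqs):
    # heap entries: (frequency, unique tiebreaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        code = {s: "0" + b for s, b in c1.items()}
        code.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, code))
        i += 1
    return heap[0][2]

tokens = "CONV ReLU CONV BN CONV ReLU MaxPool CONV ReLU SGD".split()
code = huffman_code(Counter(tokens))
bits = sum(len(code[t]) for t in tokens)
print(code, f"total = {bits} bits")
```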
<h2 id="estimating-moe-of-resnet-152">Estimating MOE of ResNet-152</h2>
<p>As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves $4.49\%$ top-5
test error.</p>
<p>The description consists of three types of expressions: English phrases, Math equations, and directed graphs. In the paper, we describe in detail how to encode
each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such
as <em>CONV, MaxPool, ReLU, SGD</em>, etc., as well as a graph-theoretic notation/shorthand for describing net architectures. Newly introduced concepts,
including <em>Batch-Norm</em>, <em>Layer</em>, and <em>Block</em>, are defined precisely using Math, English, and other primitive concepts.</p>
<figure align="center">
<img src="http://www.offconvex.org/assets/resnet_description.png" alt="drawing" width="80%" />
<figcaption><b>Description for reproducing ResNet-152</b></figcaption>
</figure>
<p>According to our estimate, the length of the above description is $1032$ bits, which translates into an upper bound on meta-overfitting error of merely $5\%$!
This suggests that the real top-5 error of the model on the full distribution is at most $9.49\%$. In the paper we also provide a $980$-bit description for
reproducing DenseNet-264, which leads to a $5.06\%$ upper bound on its meta-overfitting error.</p>
<p>Note that the number $5.06$ suggests higher precision than actually given by the method, since it is possible to quibble about the coding assumptions
that led to it. Perhaps others might use a more classical coding mechanism and obtain an estimate of $6\%$ or $7\%$.</p>
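<p>To see where numbers of this magnitude come from, here is a back-of-the-envelope version of the Folklore Theorem bound via Hoeffding's inequality plus a union bound over all $2^k$ descriptions (a sketch: the paper's coding conventions and constants differ, and $N = 100{,}000$ and $\delta = 0.01$ are our assumptions, so the exact $5\%$ and $5.06\%$ figures need not reproduce):</p>

```python
# Folklore-Theorem-style bound: Hoeffding gives
#   P(|test err - true err| > eps) <= 2 exp(-2 N eps^2)
# for one model; a union bound over all 2^k possible k-bit descriptions
# yields, with probability 1 - delta over the test set,
#   MOE <= sqrt((k ln 2 + ln(2/delta)) / (2N))  for every such model.
import math

def moe_bound(k_bits, N, delta=0.01):
    return math.sqrt((k_bits * math.log(2) + math.log(2 / delta)) / (2 * N))

print(f"ResNet-152   (k=1032): {moe_bound(1032, 100_000):.3f}")
print(f"DenseNet-264 (k=980):  {moe_bound(980, 100_000):.3f}")
```

With these assumed constants the sketch lands around $6\%$, in line with the quibbling room mentioned above.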
<p>But the important point is that unlike existing bounds in Adaptive Data Analysis, there is <strong>no</strong> dependence on $t$, the number of models that have been tested before, and the bound is non-vacuous.</p>
<h2 id="empirical-evidence-about-lack-of-meta-overfitting">Empirical evidence about lack of meta-overfitting</h2>
<p>Our estimates indicate that the issue of meta-overfitting on ImageNet for these mainstream models is mild. The reason is that despite the vast number
of parameters and hyper-parameters in today’s deep nets, the <em>information content</em> of these models is not high given knowledge circa 2012.</p>
<p>Recently Recht et al. <a href="https://arxiv.org/abs/1902.10811">tried to reach an empirical upper bound on MOE</a> for
ImageNet and <a href="https://arxiv.org/abs/1806.00451">CIFAR-10</a>. They created new test sets by carefully replicating the methodology used for constructing the original ones. They found that the error of famous published models of the past seven years is as much as 10-15% higher on the new test sets than on the originals. On its face, this seemed to confirm a case of bad meta-overfitting. But they also presented evidence that the swing in test error was due to systematic effects during test set creation. For instance, a comparable swing happens even for models that predate the creation of ImageNet (and thus could not have overfitted to the ImageNet test set).
<a href="https://proceedings.neurips.cc/paper/2019/hash/ee39e503b6bedf0c98c388b7e8589aca-Abstract.html">A followup study</a> of a hundred Kaggle competitions used fresh,
identically distributed test sets that were available from the official competition organizers. The authors concluded that MOE does not appear to be significant in modern ML.</p>
<h2 id="conclusions">Conclusions</h2>
<p>To us the disquieting takeaway from Recht et al.’s results was that estimating MOE by creating a new test set is rife with systematic bias at best, and perhaps impossible, especially in datasets concerning rare or one-time phenomena (e.g., stock prices). Thus their work still left a pressing need for effective upper bounds on meta-overfitting error. Our Rip van Winkle’s Razor is elementary, and easily deployable by the average researcher. We hope it becomes part of the standard toolbox in Adaptive Data Analysis.</p>
Wed, 07 Apr 2021 14:00:00 -0700
http://offconvex.github.io/2021/04/07/ripvanwinkle/
http://offconvex.github.io/2021/04/07/ripvanwinkle/

When are Neural Networks more powerful than Neural Tangent Kernels?

<p>The empirical success of deep learning has posed significant challenges to machine learning theory: Why can we efficiently train neural networks with gradient descent despite its highly non-convex optimization landscape? Why do over-parametrized networks generalize well? The recently proposed Neural Tangent Kernel (NTK) theory offers a powerful framework for understanding these questions, but it still comes with limitations.</p>
<p>In this blog post, we explore how to analyze wide neural networks beyond the NTK theory, based on our recent <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a> and follow-up <a href="https://arxiv.org/abs/2006.13436">paper on understanding hierarchical learning</a>. (This blog post is also cross-posted at the <a href="https://blog.einstein.ai/beyond-ntk/">Salesforce Research blog</a>.)</p>
<h3 id="neural-tangent-kernels">Neural Tangent Kernels</h3>
<p>The Neural Tangent Kernel (NTK) is a recently proposed theoretical framework for establishing provable convergence and generalization guarantees for wide (over-parametrized) neural networks <a href="https://arxiv.org/abs/1806.07572">(Jacot et al. 2018)</a>. Roughly speaking, the NTK theory shows that</p>
<ul>
<li>A sufficiently wide neural network trains like a linearized model governed by the derivative of the network with respect to its parameters.</li>
<li>At the infinite-width limit, this linearized model becomes a kernel predictor with the Neural Tangent Kernel (the NTK).</li>
</ul>
<p>Consequently, a wide neural network trained with a small learning rate converges to zero training loss and generalizes as well as the infinite-width kernel predictor. For a detailed introduction to the NTK, please refer to the earlier <a href="http://www.offconvex.org/2019/10/03/NTK/">blog post</a> by Wei and Simon.</p>
<h3 id="does-ntk-fully-explain-the-success-of-neural-networks">Does NTK fully explain the success of neural networks?</h3>
<p>Although the NTK yields powerful theoretical results, it turns out that real-world deep learning <em>does not operate in the NTK regime</em>:</p>
<ul>
<li>Empirically, infinite-width NTK kernel predictors perform slightly worse (though competitive) than fully trained neural networks on benchmark tasks such as CIFAR-10 <a href="https://arxiv.org/abs/1904.11955">(Arora et al. 2019b)</a>. For finite width networks in practice, this gap is even more profound, as we see in Figure 1: The linearized network is a rather poor approximation of the fully trained network at practical optimization setups such as large initial learning rate <a href="https://arxiv.org/abs/2002.04010">(Bai et al. 2020)</a>.</li>
<li>Theoretically, the NTK has poor <em>sample complexity for learning certain simple functions</em>. Though the NTK is a universal kernel that can interpolate any finite, non-degenerate training dataset <a href="https://arxiv.org/abs/1810.02054">(Du et al. 2018</a><a href="https://arxiv.org/abs/1811.03804">, 2019)</a>, the test error of this kernel predictor scales with the RKHS norm of the ground truth function. For certain non-smooth but simple functions such as a single ReLU, this norm can be exponentially large in the feature dimension <a href="https://arxiv.org/abs/1904.00687">(Yehudai & Shamir 2019)</a>. Consequently, NTK analyses yield poor sample complexity upper bounds for learning such functions, whereas empirically neural nets only require a mild sample size <a href="https://arxiv.org/abs/1410.1141">(Livni et al. 2014)</a>.</li>
</ul>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/taylor-plot.png" />
<br />
<i><b>Figure 1.</b>
Linearized model does not closely approximate the training trajectory of neural networks with practical optimization setups, whereas higher order Taylor models offer a substantially better approximation.
</i>
<br />
<br />
</div>
<p>These gaps urge us to ask the following</p>
<blockquote>
<p><strong>Question</strong>: How can we theoretically study neural networks beyond the NTK regime? Can we prove that neural networks outperform the NTK on certain learning tasks?</p>
</blockquote>
<p>The key technical question here is to mathematically understand neural networks operating <em>outside of the NTK regime</em>.</p>
<h2 id="higher-order-taylor-expansion">Higher-order Taylor expansion</h2>
<p>Our main tool for going beyond the NTK is the <em>Taylor expansion</em>. Consider a two-layer neural network with $m$ neurons, where we only train the “bottom” nonlinear layer $W$:</p>
\[f_{W_0 + W}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \sigma( (w_{0,r} + w_r)^\top x).\]
<p>(Here, $W_0+W$ is an $m\times d$ weight matrix, where $W_0$ denotes the random initialization and $W$ denotes the trainable “movement” matrix initialized at zero). For small enough $W$, we can perform a Taylor expansion of the network around $W_0$ and get</p>
\[f_{W_0+W}(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \sigma(w_{0,r}^\top x) + \sum_{k=1}^\infty \frac{1}{\sqrt{m}} \sum_{r=1}^m a_r \frac{\sigma^{(k)} (w_{0,r}^\top x)}{k!} (w_r^\top x)^k\]
<p>Let us denote the $k$-th order term as $ f^{(k)}_{W_0, W}$, and rewrite this as</p>
\[f_{W_0+W}(x) = f^{(0)}_{W_0}(x) + \sum_{k=1}^\infty f^{(k)}_{W_0, W}(x).\]
<p>Above, the term $f^{(k)}$ is a $k$-th order polynomial of the trainable parameter $W$. For the moment assume that $f^{(0)}(x)=0$ (this can be achieved via techniques such as symmetric initialization).</p>
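<p>The expansion is easy to verify numerically (a sketch with the smooth activation $\tanh$ in place of ReLU so that all derivatives exist; the width, dimension, and scale of $W$ are arbitrary choices):</p>

```python
# For a small movement W, successively adding the Taylor terms
# f^(0), f^(1), f^(2) (with sigma = tanh, so sigma' = 1 - tanh^2 and
# sigma'' = -2 tanh (1 - tanh^2)) approximates f_{W0+W}(x) increasingly well.
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 16
a = rng.choice([-1, 1], size=m)
W0 = rng.normal(size=(m, d))
W = 0.01 * rng.normal(size=(m, d))       # small trainable movement
x = rng.normal(size=d)
x /= np.linalg.norm(x)

z0, dz = W0 @ x, W @ x
th = np.tanh(z0)
f = (a * np.tanh(z0 + dz)).sum() / np.sqrt(m)
f0 = (a * th).sum() / np.sqrt(m)                            # f^(0)
f1 = (a * (1 - th**2) * dz).sum() / np.sqrt(m)              # f^(1)
f2 = (a * (-th) * (1 - th**2) * dz**2).sum() / np.sqrt(m)   # f^(2) = sigma''/2 * dz^2

print(abs(f - f0), abs(f - (f0 + f1)), abs(f - (f0 + f1 + f2)))
```

Each successive partial sum shrinks the approximation error by orders of magnitude.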
<p>The key insight of the NTK theory can be described as the following <strong>linearized approximation</strong> property</p>
<blockquote>
<p>For small enough $W$, the neural network $f_{W_0+W}$ is closely approximated by the linear model $f^{(1)}$.</p>
</blockquote>
<p>Towards moving beyond the linearized approximation, in our <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a>, we start by asking</p>
<blockquote>
<p>Why just $f^{(1)}$? Can we also utilize higher-order terms in the Taylor series, such as $f^{(2)}$?</p>
</blockquote>
<p>At first sight, this seems rather unlikely, as in Taylor expansions we always expect the linear term $f^{(1)}$ to dominate the whole expansion and have a larger magnitude than $f^{(2)}$ (and subsequent terms).</p>
<h3 id="killing-the-ntk-term-by-randomized-coupling">“Killing” the NTK term by randomized coupling</h3>
<p>We bring forward the idea of <em>randomization</em>, which helps us escape the “domination” of $f^{(1)}$ and couple neural networks with their quadratic Taylor expansion term $f^{(2)}$. This idea appeared first in <a href="https://arxiv.org/abs/1811.04918">Allen-Zhu et al. (2018)</a> for analyzing three-layer networks, and as we will show also applies to two-layer networks in a perhaps more intuitive fashion.</p>
<p>Let us now assign each weight movement $w_r$ a <em>random sign</em> $s_r\in\{\pm 1\}$, and consider the randomized weights $\{s_rw_r\}$. The random signs satisfy the following basic properties:</p>
\[E[s_r]=0 \quad {\rm and} \quad s_r^2 \equiv 1.\]
<p>Therefore, letting $SW\in\mathbb{R}^{m\times d}$ denote the randomized weight matrix, we can compare the first and second order terms in the Taylor expansion at $SW$:</p>
\[E_{S} \left[f^{(1)}_{W_0, SW}(x)\right] = E_{S} \left[ \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \sigma'(w_{0,r}^\top x) (s_rw_r^\top x) \right] = 0,\]
<p>whereas</p>
\[f^{(2)}_{W_0, SW}(x) = \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \frac{\sigma^{(2)}(w_{0,r}^\top x)}{2} (s_rw_r^\top x)^2 = \frac{1}{\sqrt{m}}\sum_{r\le m} a_r \frac{\sigma^{(2)}(w_{0,r}^\top x)}{2} (w_r^\top x)^2 = f^{(2)}_{W_0, W}(x).\]
<p>Observe that the sign randomization keeps the quadratic term $f^{(2)}$ unchanged, but “kills” the linear term $f^{(1)}$ in expectation!</p>
<p>If we train such a randomized network with freshly sampled signs $S$ at each iteration, the linear term $f^{(1)}$ will keep oscillating around zero and does not have any power in fitting the data, whereas the quadratic term is not affected at all and thus becomes the leading force for fitting the data. (The keen reader may notice that this randomization is similar to Dropout, with the key difference being that we randomize the weight <em>movement</em> matrix, whereas vanilla Dropout randomizes the weight matrix itself.)</p>
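<p>The two identities above can be checked empirically (a sketch with $\tanh$ activation and arbitrary sizes): averaging $f^{(1)}$ over random signs drives it to zero, while $f^{(2)}$ is unchanged sign-by-sign:</p>

```python
# Sign randomization "kills" the linear Taylor term in expectation
# (E_S[f^(1)] = 0) but leaves the quadratic term invariant (s_r^2 = 1).
import numpy as np

rng = np.random.default_rng(0)
m, d = 256, 8
a = rng.choice([-1, 1], size=m)
W0 = rng.normal(size=(m, d))
W = rng.normal(size=(m, d))
x = rng.normal(size=d)
z0, dz = W0 @ x, W @ x
sig1 = 1 - np.tanh(z0)**2                      # sigma'
sig2 = -2 * np.tanh(z0) * (1 - np.tanh(z0)**2) # sigma''

f1 = lambda s: (a * sig1 * (s * dz)).sum() / np.sqrt(m)          # f^(1) at SW
f2 = lambda s: (a * sig2 / 2 * (s * dz)**2).sum() / np.sqrt(m)   # f^(2) at SW

signs = [rng.choice([-1, 1], size=m) for _ in range(10_000)]
print("E_S[f^(1)] ~", np.mean([f1(s) for s in signs]))
print("f^(2) invariant:", all(np.isclose(f2(s), f2(np.ones(m))) for s in signs))
```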
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/beyond-ntk.png" />
<br />
<i><b>Figure 2.</b>
The NTK regime operates in the "NTK ball" where the network is approximately equal to the linear term. The quadratic regime operates in a larger ball where the network is approximately equal to the sum of first two terms, but the linear term dominates and can blow up at large width. Our randomized coupling technique resolves this by introducing the random sign matrix that in expectation "kills" the linear term but always preserves the quadratic term.
</i>
<br />
<br />
</div>
<p>Our first result shows that networks with sign randomization can still be efficiently optimized, despite the now non-convex optimization landscape:</p>
<blockquote>
<p><strong>Theorem</strong>: Any escaping-saddle algorithm (e.g. noisy SGD) on the regularized loss function $E_S[L(W_0+SW)]+R(W)$, with freshly sampled sign $S=S_t$ per iteration, can find the global minimum in polynomial time.</p>
</blockquote>
<p>The proof builds on the quadratic approximation $E_S[f]\approx f^{(2)}$ and recent understanding of neural networks with quadratic activations, e.g. <a href="https://arxiv.org/abs/1707.04926">Soltanolkotabi et al. (2017)</a> and <a href="https://arxiv.org/abs/1803.01206">Du and Lee (2018)</a>.</p>
<h3 id="generalization-and-sample-complexity-case-study-on-learning-low-rank-polynomials">Generalization and sample complexity: Case study on learning low-rank polynomials</h3>
<p>We next study the generalization of these networks in the context of learning <em>low-rank degree-$p$ polynomials</em>:</p>
\[f_\star(x) = \sum_{s=1}^{r_\star} \alpha_s (\beta_s^\top x)^{p_s}, \quad |\alpha_s|\le 1,\|(\beta_s^\top x)^{p_s}\|_{L_2} \le 1, p_s\le p \quad \textrm{for all } s.\]
<p>We are specifically interested in the case where $r_\star$ is small (e.g. $O(1)$), so that $y$ depends only on the projection of $x$ onto a few directions. This for example captures teacher networks with polynomial activations of bounded degree, or (approximately) analytic activations, as well as constant-depth teacher networks with polynomial activations.</p>
<p>For the NTK, the sample complexity of learning polynomials has been studied extensively in <a href="https://arxiv.org/abs/1901.08584">(Arora et al. 2019a)</a>, <a href="https://arxiv.org/abs/1904.12191">(Ghorbani et al. 2019)</a>, and many concurrent works. Combined, they showed that the sample complexity for learning degree-$p$ polynomials is $\Theta(d^p)$, with matching lower and upper bounds:</p>
<blockquote>
<p><strong>Theorem (NTK)</strong> : Suppose $x$ is uniformly distributed on the sphere, then the NTK requires $O(d^p)$ samples in order to achieve a small test error for learning any degree-$p$ polynomial, and there is a matching lower bound of $\Omega(d^p)$ for any inner-product kernel method.</p>
</blockquote>
<p>In our <a href="https://arxiv.org/abs/1910.01619">Beyond Linearization paper</a>, we show that the quadratic Taylor model achieves an improved sample complexity of $\tilde{O}(d^{p-1})$ with isotropic inputs:</p>
<blockquote>
<p><strong>Theorem (Quadratic Model)</strong>: For mildly isotropic input distributions, the two-layer quadratic Taylor model (or two-layer NN with sign randomization) only requires $\tilde{O}({\rm poly}(r_\star, p)d^{p-1})$ samples in order to achieve a small test error for learning a low-rank degree-$p$ polynomial.</p>
</blockquote>
<p>In our <a href="https://arxiv.org/abs/2006.13436">follow-up paper on understanding hierarchical learning</a>, we further design a “hierarchical learner” using a specific three-layer network, and show the following</p>
<blockquote>
<p><strong>Theorem (Three-layer hierarchical model)</strong>: Under mild input distribution assumptions, a three-layer network with a fixed representation layer of width $D=d^{p/2}$ and a trainable quadratic Taylor layer can achieve a small test error using only $\tilde{O}({\rm poly}(r_\star, p)d^{p/2})$ samples.</p>
</blockquote>
<p>When $r_\star,p=O(1)$, the quadratic Taylor model can improve over the NTK by a multiplicative factor of $d$, and we can further get a substantially larger improvement of $d^{p/2}$ by using the three-layer hierarchical learner. Here we briefly discuss the proof intuitions, and refer the reader to our papers for more details.</p>
<ul>
<li>
<p><strong>Generalization bounds</strong>: We show that, while the NTK and the quadratic Taylor model express functions using similar random feature constructions, their generalization depends differently on the norm of the input. In the NTK, generalization depends on the L2 norm of the features (as well as the weights), whereas generalization of the quadratic Taylor model depends on the operator norm of the input matrix features $\frac{1}{n}\sum x_ix_i^\top$ times the nuclear norm of $\sum w_rw_r^\top$. It turns out that this decomposition can match the one given by the NTK (it is never worse), and in addition can be better by a factor of $O(\sqrt{d})$ if the input distribution is mildly isotropic so that $\|\frac{1}{n}\sum x_ix_i^\top\|_{\rm op} \le 1/\sqrt{d} \cdot \max \|x_i\|_2^2$, leading to the $O(d)$ improvement in the sample complexity.</p>
</li>
<li>
<p><strong>Hierarchical learning</strong>: The key intuition behind the hierarchical learner is that we can utilize the $O(d)$ sample complexity gain to its fullest by applying the quadratic Taylor model not to the input $x$, but to a feature representation $h(x)\in \mathbb{R}^D$ where $D\gg d$. This yields a gain as long as $h$ is rich enough to express $f_\star$ and also isotropic enough for the operator norm $\|\frac{1}{n}\sum h(x_i)h(x_i)^\top\|_{\rm op}$ to be well-behaved. In particular, for learning degree-$p$ polynomials, the best we can do is to choose $D=d^{p/2}$, leading to a sample complexity saving of $\tilde{O}(D)=\tilde{O}(d^{p/2})$.</p>
</li>
</ul>
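<p>The isotropy condition in the first bullet is easy to probe numerically (a sketch with standard Gaussian inputs; the sizes $n$ and $d$ are arbitrary choices): for isotropic data, the operator norm of the empirical covariance sits far below the worst-case level $\max_i \|x_i\|_2^2$:</p>

```python
# For isotropic inputs, ||1/n sum x_i x_i^T||_op stays O(1) while
# max ||x_i||^2 is about d, so the condition
#   ||1/n sum x_i x_i^T||_op <= (1/sqrt(d)) max ||x_i||^2
# holds with plenty of slack.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5_000, 100
X = rng.normal(size=(n, d))                 # isotropic inputs

op = np.linalg.norm(X.T @ X / n, ord=2)     # operator norm of empirical cov
worst = (X**2).sum(axis=1).max()            # max ||x_i||^2
print(f"op norm ~ {op:.2f},  max ||x||^2 ~ {worst:.1f},  "
      f"(1/sqrt(d)) max ||x||^2 ~ {worst/np.sqrt(d):.1f}")
```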
<h3 id="concluding-thoughts">Concluding thoughts</h3>
<p>In this post, we explored higher-order Taylor expansions (in particular the quadratic expansion) as an approach to deep learning theory beyond the NTK regime. The Taylorization approach has several advantages:</p>
<ul>
<li>Non-convex but benign optimization landscape;</li>
<li>Provable generalization benefits over NTKs;</li>
<li>Ability of modeling hierarchical learning;</li>
<li>Convenient API for experimentation (cf. the <a href="https://github.com/google/neural-tangents">Neural Tangents</a> package and the <a href="https://arxiv.org/abs/2002.04010">Taylorized training</a> paper).</li>
</ul>
<p>We believe these advantages make the Taylor expansion a powerful tool for deep learning theory, and our results are just a beginning. We also remark that there are other theoretical frameworks such as the <a href="https://arxiv.org/abs/1909.08156">Neural Tangent Hierarchy</a> or the <a href="https://arxiv.org/abs/1804.06561">Mean-Field Theory</a> that go beyond the NTK with their own advantages in various angles, but without computational efficiency guarantees. See the <a href="https://jasondlee88.github.io/slides/beyond_ntk.pdf">slides</a> for more on going beyond NTK. Making progress on any of these directions (or coming up with new ones) would be an exciting direction for future work.</p>
Thu, 25 Mar 2021 07:00:00 -0700
http://offconvex.github.io/2021/03/25/beyondNTK/
http://offconvex.github.io/2021/03/25/beyondNTK/

Beyond log-concave sampling (Part 3)

<p>In the <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">first post</a> of this series, we introduced the challenges of sampling distributions beyond log-concavity. In <a href="http://www.offconvex.org/2021/03/01/beyondlogconcave2/">Part 2</a> we tackled sampling from <em>multimodal</em> distributions: a typical obstacle occurring in problems involving statistical inference and posterior sampling in generative models. In this (final) post of the series, we consider sampling in the presence of <em>manifold structure in the level sets of the distribution</em>, which also frequently manifests in the same settings. It will cover the paper <a href="https://arxiv.org/abs/2002.05576">Fast convergence for Langevin diffusion with matrix manifold structure</a> by Ankur Moitra and Andrej Risteski.</p>
<h1 id="sampling-with-matrix-manifold-structure">Sampling with matrix manifold structure</h1>
<p>The structure on the distribution we consider in this post is <em>manifolds</em> of equiprobable points: this is natural, for instance, in the presence of invariances in data (e.g. rotations of images). It can also appear in neural-network based probabilistic models due to natural invariances they encode (e.g., scaling invariances in ReLU-based networks).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/manifold.jpg" width="400" />
</center>
<p>At the level of techniques, the starting point for our results is a close connection between the geometry of a manifold, more precisely its <em>Ricci curvature</em>, and the mixing time of Brownian motion on the manifold. The following theorem holds:</p>
<blockquote>
<p><strong>Theorem (Bakry and Émery ‘85, informal)</strong>: If the manifold $M$ has positive Ricci curvature, Brownian motion on the manifold mixes rapidly in $\chi^2$ divergence.</p>
</blockquote>
<p>We will explain the notions from differential geometry shortly, but first we sketch our results, and how they use this machinery. We present two results: the first is a “meta”-theorem that provides a generic decomposition framework, and the second is an instantiation of this framework for a natural family of problems that exhibit manifold structure: posteriors for matrix factorization, sensing, and completion.
</p>
<h2 id="a-general-manifold-decomposition-framework">A general manifold decomposition framework</h2>
<p>Our first result is a general decomposition framework for analyzing mixing time of Langevin in the presence of manifolds of equiprobable points.</p>
<p>To motivate the result, note that if we consider the distribution $p_{\beta}(x) \propto e^{-\beta f(x)}$ for large (but finite) $\beta$, the Langevin chain corresponding to that distribution, started close to a manifold of local minima, will tend to stay close to (but not on!) it for a long time. See the figure below for an illustration. Thus, we will state a “robust” version of the above manifold result, for a chain that is allowed to go off the manifold.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/single_manifold.gif" width="300" />
</center>
<p>We show the following statement. (Recall that a bounded Poincaré constant corresponds to rapid mixing for Langevin. See the <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">first post</a> for a refresher.)</p>
<blockquote>
<p><strong>Theorem 1 (Moitra and Risteski ‘20, informal)</strong>:
Suppose the Langevin chain corresponding to $p(x) \propto e^{-f(x)}$ is initialized close to a manifold $M$ satisfying the following two properties:
<br /><br />
(1) It stays in some neighborhood $D$ of the manifold $M$ with large probability for a long time.
<br /><br />
(2) $D$ can be partitioned into manifolds $M^{\Delta}$ satisfying:
<br /><br />
(2.1) The conditional distribution of $p$ restricted to $M^{\Delta}$ has an upper-bounded Poincaré constant.
<br /><br />
(2.2) The marginal distribution over $\Delta$ has an upper-bounded Poincaré constant.
<br /><br />
(2.3) The conditional probability distribution over $M^{\Delta}$ does not “change too quickly” as $\Delta$ changes.
<br /><br />
Then Langevin mixes quickly to a distribution close to the conditional distribution of $p$ restricted to $D$.</p>
</blockquote>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/partition_illustration.gif" />
</center>
<p>While the above theorem is a bit of a mouthful (even very informally stated) and requires a choice of partitioning of $D$ to be “instantiated”, it’s quite natural to think of it as an analogue of local convergence results for gradient descent in optimization. Namely, it gives geometric conditions under which Langevin started near a manifold mixes to the “local” stationary distribution (i.e. the conditional distribution of $p$ restricted to $D$).</p>
<p>The proof of the theorem uses decomposition ideas similar to the result on sampling multimodal distributions from the <a href="http://www.offconvex.org/2021/03/01/beyondlogconcave2/">previous post</a>, albeit complicated by measure-theoretic arguments. Namely, the manifolds $M^{\Delta}$ technically have zero measure under the distribution $p$, so care must be taken with how the “projected” and “restricted” chains are defined—the key tool for this is the so-called <a href="https://en.wikipedia.org/wiki/Smooth_coarea_formula">co-area formula</a>.</p>
<p>The challenge in using the above framework is instantiating the decomposition: namely, the choice of the partition of $D$ into manifolds $M^{\Delta}$. In the next section, we show how this can be done for posteriors in problems like matrix factorization/sensing/completion.</p>
<h2 id="matrix-factorization-and-relatives">Matrix factorization (and relatives)</h2>
<p>To instantiate the above framework in a natural setting, we consider distributions exhibiting invariance under orthogonal transformations. Namely, we consider distributions of the type</p>
\[p: \mathbb{R}^{d \times k} \to \mathbb{R}, \hspace{0.5cm} p(X) \propto e^{-\beta \| \mathcal{A}(XX^T) - b \|^2_2}\]
<p>where $b \in \mathbb{R}^{m}$ is a fixed vector and $\mathcal{A}$ is an operator that returns an $m$-dimensional vector given a $d \times d$ matrix. For this distribution, we have $p(X) = p(XO)$ for any orthogonal matrix $O$, since $XX^T = XO (XO)^T$ . Depending on the choice of $\mathcal{A}$, we can easily recover some familiar functions inside the exponential: e.g. the $l_2$ losses for (low-rank) matrix factorization, matrix sensing and matrix completion. These losses have received a lot of attention as simple examples of objectives that are non-convex but can still be optimized using gradient descent. (See e.g. <a href="https://arxiv.org/abs/1704.00708">Ge et al. ‘17</a>.)</p>
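As a quick numerical sanity check of this invariance, here is a small Python sketch; the operator (represented by random sensing matrices `A_mats`, in the style of matrix sensing), the dimensions, and $\beta$ are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 5, 2, 7

# A hypothetical linear operator A: R^{d x d} -> R^m, represented by
# m random sensing matrices (as in matrix sensing); b is a fixed vector.
A_mats = rng.standard_normal((m, d, d))
b = rng.standard_normal(m)

def neg_log_p(X, beta=10.0):
    """Unnormalized negative log-density: beta * ||A(X X^T) - b||_2^2."""
    measurements = np.array([np.trace(Ai @ (X @ X.T)) for Ai in A_mats])
    return beta * np.sum((measurements - b) ** 2)

X = rng.standard_normal((d, k))
O, _ = np.linalg.qr(rng.standard_normal((k, k)))  # a random orthogonal matrix

# p(X) = p(XO): the density depends on X only through X X^T.
assert np.isclose(neg_log_p(X), neg_log_p(X @ O))
```

Any orthogonal $O$ works here, since $XO(XO)^T = XX^T$ leaves the measurements unchanged.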
<p>These distributions also have a very natural statistical motivation. Namely, consider the distribution over $m$-dimensional vectors $b$ such that</p>
\[b = \mathcal{A}(XX^T) + n, \hspace{0.5cm} n \sim N\left(0,\frac{1}{\sqrt{\beta}}I\right).\]
<p>Then, the distribution $p(X) \propto e^{-\beta \| \mathcal{A}(XX^T) - b \|^2_2 }$ can be viewed as the posterior distribution over $X$ with a uniform prior. Thus, sampling from these distributions can be seen as the distributional analogue of problems like matrix factorization/sensing/completion, the difference being that we are not merely trying to find the <em>most likely</em> matrix $X$, but also trying to sample from the posterior.</p>
<p>We will consider the case when $\beta$ is sufficiently large (in particular, $\beta = \Omega(\mbox{poly}(d))$): in this case, the distribution $p$ will concentrate over two (separated) manifolds: $E_1 = \{X_0 R: R \mbox{ is orthogonal with det 1}\}$ and $E_2 = \{X_0 R: R \mbox{ is orthogonal with det }-1\}$, where $X_0$ is any fixed minimizer of $\| \mathcal{A}(XX^T) - b \|^2_2$. Hence, when started near one of these manifolds, we expect Langevin to stay close to it for a long time (see figure below).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/langevin_matrix.gif" width="500" />
</center>
<p>We show:</p>
<blockquote>
<p><strong>Theorem 2 (Moitra and Risteski ‘20, informal)</strong>: Let $\mathcal{A}$ correspond to matrix factorization, sensing or completion under standard parameter assumptions for these problems. Let $\beta = \Omega(\mbox{poly}(d))$.
If initialized close to one of $E_i, i \in \{1, 2\}$, after a polynomial number of steps the discretized Langevin dynamics will converge to a distribution that is close in total variation distance
to $p(X)$ restricted to a neighborhood of $E_i$.</p>
</blockquote>
<!--Let $\beta=\Omega(\mbox{poly}(d), \log (1/\delta))$ and let $p^i(X) \propto p(𝑋)$, if $\|X −\Pi_{E_i}(𝑋)\|_2<\|X − \Pi_{E_{2−i}}(X)\|_2$ and $p^i (X)=0$ otherwise.
Then, Langevin diffusion initialized $O(\sigma_{\min}(M)/k)$ close to $E_i$ run for t steps samples from distribution $p_t$, s.t.
$$ \chi^2(p_t, p^i ) \leq \delta + e^{−𝑡/C}, 𝐶=O\left(\frac{\beta}{(k \sigma_{\min}(M))}\right) $$-->
<p>We remark that the closeness condition on the initialization is easy to ensure using existing results on gradient-descent-based optimization for these objectives. It’s also easy to use the above result to sample approximately from the distribution $p$ itself, rather than only from the “local” distributions $p^i$ (i.e. $p$ restricted to a neighborhood of $E_i$) – this is because the distribution $p$ looks like the “disjoint union” of the distributions $p^1$ and $p^2$.</p>
<p>Before we describe the main elements of the proof, we review some concepts from differential geometry.</p>
<h2 id="extremely-brief-intro-to-differential-geometry">(Extremely) brief intro to differential geometry</h2>
<p>We won’t do a full primer on differential geometry in this blog post, but we will briefly informally describe some of the relevant concepts. See Section 5 of <a href="https://arxiv.org/abs/2002.05576">our paper</a> for an intro to differential geometry (written with a computer science reader in mind, so more easy-going than a differential geometry textbook).</p>
<p>Recall that the <em>tangent space</em> at $x$, denoted by $T_x M$, is the set of all derivatives $v$ of curves passing through $x$.<!-- the exponential map at a point $x$ in direction $v \in T_x M$, denoted by $\exp_x(v)$ is the movement of $x$ along the *geodesic* (i.e. shortest path curve) at $x$ for one unit of time (note this curve is unique). See the left part of the figure below.--><!-- To explain Ricci curvature, consider first the intuitive concept of curvature: Euclidean space has zero curvature, while the sphere has positive curvature because it "folds into itself." One way to capture this is by looking at the volume of a geodesic ball around a point: the sphere's positive curvature causes the volume to be *less* than the volume in Euclidean space. We can take this change in volume as the *definition* of curvature.--> The <em>Ricci curvature</em> at a point $x$, in direction $v \in T_x M$, denoted $\mbox{Ric}_x(v)$, captures the second-order term in the rate of change of volumes of sets in a small neighborhood around $x$, as the points in the set are moved along the geodesic (i.e. shortest path curve) in direction $v$ (or more precisely, each point $y$ in the set is moved along the geodesic in the direction of the parallel transport of $v$ at $y$; see the right part of the figure below from <a href="https://projecteuclid.org/euclid.aspm/1543086328">(Ollivier 2010)</a>). A Ricci curvature of $0$ preserves volumes (think: a plane), a Ricci curvature $>0$ shrinks volume (think: a sphere), and a Ricci curvature $<0$ expands volume (think: a saddle).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/tangent.jpg" width="800" />
</center>
<!--Slightly more mathematically, it's relatively easy to understand the Ricci curvature when we have a parametrized manifold. The curvature of the manifold should be intuitively captured by the second-order behavior of the parametrization. Namely, consider a manifold parametrized locally as
$$\phi: T_x M \to T_x M \times N_x M, \phi(z) = x + (z, g(z))$$
where $N_xM$ is the *normal* space at $x$, the subspace orthogonal to $T_xM$.
Then, the Hessian viewed as the quadratic form $\nabla^2 g: T_x M \times T_x M \to N_x M$ is called the *second fundamental form* and denoted as $\mathrm{I\!I}_x$. If $\{e_i\}$ is an orthonormal basis of $T_x M$, the Ricci curvature in a direction $v \in T_x M$ is then:
$$\mbox{Ric}(v) = \sum_i \langle \mathrm{I\!I}(u,u), \mathrm{I\!I}(e_i, e_i) \rangle - \|\mathrm{I\!I}(u,e_i)\|^2$$
-->
<p>The connection between curvature and the mixing time of diffusions is rather deep, and we won’t attempt to convey it fully in a blog post - the definitive reference is <a href="https://link.springer.com/book/10.1007/978-3-319-00227-9">Analysis and Geometry of Markov Diffusion Operators</a> by Bakry, Gentil and Ledoux. The main idea is that mixing time can be bounded by how long it takes for random walks starting at different locations to “join together,” and positive curvature brings them together faster.
<!-- A small gesture towards these connections can be conveyed through a popular coupling for Brownian motion called a *reflection coupling*.--></p>
<p>To make this formal, we define a <em>coupling</em> of two random variables $X, Y$ to be any random variable $W = (X’,Y’)$ such that the marginal distributions of the coordinates $X’$ and $Y’$ are the same as the distributions of $X$ and $Y$, respectively. It’s well known that the convergence time of a random walk in total variation distance can be upper bounded by the expected time until two coupled copies of the walk join. On the plane, a canonical coupling (the <em>reflection coupling</em>) between two Brownian motions can be constructed by reflecting the move of the second process through the perpendicular bisector between the locations of the two processes (see figure below). On a positively curved manifold (like a sphere), an analogous reflection can be defined, and the curvature only brings the two processes together faster.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/reflection.jpg" width="500" />
</center>
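For readers who like to experiment, here is a minimal simulation of a reflection coupling of two planar Brownian motions; the discretization step, the meeting threshold, and the starting points are arbitrary choices of ours, purely for illustration:

```python
import math, random

random.seed(0)

def reflection_coupling_time(x, y, dt=0.01, max_steps=200000):
    """Run two coupled planar Brownian motions until they (approximately) meet.
    The second motion's increment is the first's, reflected through the
    perpendicular bisector of the segment joining the two current points."""
    for step in range(max_steps):
        dist = math.hypot(x[0] - y[0], x[1] - y[1])
        if dist < 0.05:          # declare the two walks coupled
            return step * dt
        xi = (random.gauss(0, math.sqrt(dt)), random.gauss(0, math.sqrt(dt)))
        e = ((x[0] - y[0]) / dist, (x[1] - y[1]) / dist)  # unit vector x - y
        dot = xi[0] * e[0] + xi[1] * e[1]
        # Reflection: flip the component of the increment along e.
        xi_ref = (xi[0] - 2 * dot * e[0], xi[1] - 2 * dot * e[1])
        x = (x[0] + xi[0], x[1] + xi[1])
        y = (y[0] + xi_ref[0], y[1] + xi_ref[1])
    return None

t = reflection_coupling_time((0.0, 0.0), (1.0, 0.0))
print(t)
```

Note that under this coupling the difference vector stays on the line through the two starting points, and its length performs a one-dimensional Brownian motion, so the processes meet in finite time almost surely.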
<p>As a final tool, our proof uses a very important theorem due to <a href="https://www.sciencedirect.com/science/article/pii/S0001870876800023">Milnor</a> about manifolds with algebraic structure:</p>
<blockquote>
<p><strong>Theorem (Milnor ‘76, informal)</strong>: The Ricci curvature of a Lie group equipped with a left-invariant metric is non-negative.</p>
</blockquote>
<p>In a nutshell, a Lie group is a group that is also a smooth manifold, such that the group operations are smooth maps on the manifold - so the “geometry” and the “algebra” fit together. A metric is left-invariant for the group if acting on the left by any group element leaves the metric “unchanged”.</p>
<h2 id="implementing-the-decomposition-framework">Implementing the decomposition framework</h2>
<p>To apply the framework we sketched out as part of Theorem 1, we need to verify the conditions of the Theorem.</p>
<p>To prove <strong>Condition 1</strong>, we need to show that for large $\beta$, the random walk stays near the manifold it was initialized close to. The main tools for this are <a href="https://en.wikipedia.org/wiki/It%C3%B4%27s_lemma">Ito’s lemma</a>, local convexity of the function $\| \mathcal{A}(XX^T) - b \|_2^2$, and basic results in the theory of <a href="https://en.wikipedia.org/wiki/Cox%E2%80%93Ingersoll%E2%80%93Ross_model">Cox-Ingersoll-Ross</a> processes. Namely, Ito’s lemma (which can be viewed as a “change-of-variables” formula for stochastic processes) allows us to write down a stochastic differential equation for the evolution of the distance of $X$ from the manifold, which turns out to have a “bias” towards small values, due to the local convexity of $\| \mathcal{A}(XX^T) - b \|_2^2$. This can in turn be analyzed approximately as a Cox-Ingersoll-Ross process - a well-studied type of non-negative stochastic process.</p>
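To get a feel for the last step, here is a toy Euler-Maruyama simulation of a Cox-Ingersoll-Ross process; the parameters are invented for illustration (they are not taken from the paper), but they show how the mean-reverting drift keeps the process - playing the role of the distance to the manifold - near a small level:

```python
import math, random

random.seed(0)

def simulate_cir(r0, kappa, theta, sigma, dt=1e-3, steps=50000):
    """Euler-Maruyama simulation of a Cox-Ingersoll-Ross process
        dr = kappa * (theta - r) dt + sigma * sqrt(r) dW,
    with reflection at 0 to keep the discretized path non-negative."""
    r = r0
    path = []
    for _ in range(steps):
        dW = random.gauss(0, math.sqrt(dt))
        r += kappa * (theta - r) * dt + sigma * math.sqrt(max(r, 0.0)) * dW
        r = abs(r)               # reflect, so the path stays non-negative
        path.append(r)
    return path

# Illustrative parameters: strong mean reversion toward a small level theta,
# mimicking the bias of the distance process toward small values.
path = simulate_cir(r0=1.0, kappa=5.0, theta=0.05, sigma=0.3)
avg_tail = sum(path[len(path) // 2 :]) / (len(path) // 2)
print(avg_tail)  # concentrates near theta
```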
<p>To prove <strong>Condition 2</strong>, we need to specify the partition of the space around the manifolds $E_i$. Describing the full partition is somewhat technical, but importantly, the manifolds $M^{\Delta}$ have the form $M^{\Delta} = \{\Delta U: U \mbox{ is an orthogonal matrix with det 1}\}$ for some matrix $\Delta \in \mathbb{R}^{n \times k}$.</p>
<p>The proof that $M^{\Delta}$ has a good Poincaré constant (i.e. Condition 2.1) relies on two ideas: first, $M^{\Delta}$ is a Lie group with group operation $\circ$ defined such that $(\Delta U) \circ (\Delta V) := \Delta (UV)$, along with a corresponding left-invariant metric - thus, by Milnor’s theorem, it has non-negative Ricci curvature; second, we can relate the Ricci curvature under the Euclidean metric to that under the left-invariant metric. The proof that the marginal distribution over $\Delta$ has a good Poincaré constant involves showing that this distribution is approximately log-concave. Finally, the “change-of-conditional-probability” condition (Condition 2.3) can be proved by explicit calculation.</p>
<h1 id="closing-remarks">Closing remarks</h1>
<p>In this series of posts, we surveyed two recent approaches to analyzing Langevin-like sampling algorithms <em>beyond log-concavity</em> - the most natural analogue to non-convexity in the world of sampling/inference. The structures we considered, <em>multi-modality</em> and <em>invariant manifolds</em>, are common in practice in modern machine learning.</p>
<p>In contrast with non-convex optimization, provable guarantees for sampling beyond log-concavity are still under-studied, and we hope our work will inspire and excite further efforts. For instance, how do we handle modes of different “shape”? Can we handle an exponential number of modes, if they have further structure (e.g., posteriors in concrete latent-variable models like Bayesian networks)? Can we handle more complex manifold structure (e.g. the matrix distributions we considered for <em>any</em> $\beta$)?</p>
Fri, 12 Mar 2021 06:00:00 -0800
http://offconvex.github.io/2021/03/12/beyondlogconcave3/
http://offconvex.github.io/2021/03/12/beyondlogconcave3/
Beyond log-concave sampling (Part 2)<p>In our previous <a href="http://www.offconvex.org/2020/09/19/beyondlogconvavesampling">blog post</a>, we introduced the challenges of sampling distributions beyond log-concavity.
We first introduced the problem of sampling from a distribution $p(x) \propto e^{-f(x)}$ given value or gradient oracle access to $f$, as an analogue of black-box optimization with oracle access. We then introduced the natural algorithm for sampling in this setup: Langevin Monte Carlo, a Markov chain reminiscent of noisy gradient descent,</p>
\[x_{t+\eta} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I).\]
<p>Finally, we laid out the challenges when $f$ is not convex; in particular, LMC can suffer from slow mixing.</p>
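For concreteness, here is a minimal implementation of the update rule above; we use a standard Gaussian target, i.e. $f(x) = x^2/2$ in one dimension, so the output is easy to sanity-check (step size and horizon are arbitrary choices of ours):

```python
import math, random

random.seed(0)

def langevin_monte_carlo(grad_f, x0, eta=0.01, steps=100000):
    """Discretized Langevin dynamics:
    x_{t+eta} = x_t - eta * grad_f(x_t) + sqrt(2 * eta) * xi,  xi ~ N(0, 1)."""
    x = x0
    samples = []
    for _ in range(steps):
        x = x - eta * grad_f(x) + math.sqrt(2 * eta) * random.gauss(0, 1)
        samples.append(x)
    return samples

# Target p(x) ∝ e^{-f(x)} with f(x) = x^2 / 2, i.e. a standard Gaussian,
# so the samples should have mean ≈ 0 and variance ≈ 1.
samples = langevin_monte_carlo(grad_f=lambda x: x, x0=3.0)
burn = samples[len(samples) // 2 :]          # discard the first half as burn-in
mean = sum(burn) / len(burn)
var = sum((s - mean) ** 2 for s in burn) / len(burn)
print(mean, var)
```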
<p>In this and the coming post, we describe two of our recent works tackling this problem. We identify two kinds of structure beyond log-concavity under which we can design provably efficient algorithms: <em>multi-modality</em> and <em>manifold structure in the level sets</em>. These structures commonly occur in practice, especially in problems involving statistical inference and posterior sampling in generative models.</p>
<p>In this post, we will focus on multimodality, covered by the paper <a href="https://arxiv.org/abs/1812.00793">Simulated tempering Langevin Monte Carlo</a> by Rong Ge, Holden Lee, and Andrej Risteski.</p>
<h1 id="sampling-multimodal-distributions-with-simulated-tempering">Sampling multimodal distributions with simulated tempering</h1>
<p>The classical scenario in which Langevin takes exponentially long to mix is when $p$ is a mixture of two well-separated Gaussians. In its broadest generality, this was considered by <a href="http://www.ems-ph.org/journals/show_abstract.php?issn=1435-9855%20&vol=6&iss=4&rank=1">Bovier et al. 2004</a>, who used tools from the theory of metastable processes to show that transitioning from one peak to another can take exponential time. Roughly speaking, they show the transition time is proportional to the “energy barrier” a particle has to cross. If the Gaussians have unit variance and means at distance $2r$, then the probability density at a point midway between them is $\propto e^{-r^2/2}$, so the time to cross the energy barrier scales as $e^{r^2/2}$. Thus, the mixing time is exponential in $r^2$. Qualitatively, the intuition for this phenomenon is simple to describe: if started at point A, the drift (i.e. gradient) term will push the walk towards A so long as it’s in the basin around A; hence, to transition from A to B (through C), the Gaussian noise must persistently counteract the gradient term.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_bovier.gif" width="500" />
</center>
<p>Hence Langevin on its own will not work even in very simple multimodal settings.</p>
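This slowdown is easy to observe numerically. The toy experiment below (our own illustration, with arbitrary step size and horizon) counts how often discretized Langevin crosses between the two modes of a unit-variance Gaussian mixture as the separation grows:

```python
import math, random

random.seed(0)

def crossings(r, steps=100000, eta=0.01):
    """Count sign changes of Langevin run on a mixture of two unit-variance
    Gaussians with means ±r; crossings become rare as the barrier grows."""
    def grad_f(x):
        # f(x) = -log(0.5 N(x;-r,1) + 0.5 N(x;r,1)); gradient via mixture
        # weights, computed stably by factoring out the larger exponent.
        a, b = -(x + r) ** 2 / 2, -(x - r) ** 2 / 2
        m = max(a, b)
        wa, wb = math.exp(a - m), math.exp(b - m)
        return (wa * (x + r) + wb * (x - r)) / (wa + wb)

    x, count = -r, 0                       # start at the left mode
    for _ in range(steps):
        x_new = x - eta * grad_f(x) + math.sqrt(2 * eta) * random.gauss(0, 1)
        if (x > 0) != (x_new > 0):         # crossed the midpoint
            count += 1
        x = x_new
    return count

c_easy = crossings(r=1.0)                  # barely separated modes
c_hard = crossings(r=4.0)                  # well-separated modes
print(c_easy, c_hard)
```

In our runs the crossing count collapses as $r$ grows, matching the $e^{r^2/2}$ barrier heuristic.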
<p>In <a href="https://arxiv.org/abs/1812.00793">our paper</a>, we show that combining Langevin Monte Carlo with a temperature-based heuristic called <em>simulated tempering</em> can significantly speed up mixing for multimodal distributions, where the number of modes is not too large, and the modes “look similar.”</p>
<p>More precisely, we show:</p>
<blockquote>
<p><strong>Theorem (Ge, Lee, Risteski ‘18, informal)</strong>: If $p(x)$ is a mixture of $k$ shifts of a strongly log-concave distribution in $d$ dimensions (e.g. Gaussian), an algorithm based on simulated tempering and Langevin Monte Carlo that runs in time poly($d,k, 1/\varepsilon$) produces samples from a distribution $\varepsilon$-close to $p$ in total variation distance.</p>
</blockquote>
<p>The main idea is to create a meta-Markov chain (the simulated tempering chain) which has two types of moves: change the current “temperature” of the sample, or move “within” a temperature. The main intuition behind this is that at higher temperatures, the distribution is flatter, so the chain explores the landscape faster (see the figure below).</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_tempering.gif" />
</center>
<p>More formally, the distribution at inverse temperature $\beta$ is given by $p_\beta(x) \propto e^{-\beta f(x)}$. The Langevin chain which corresponds to $\beta$ is given by</p>
\[x_{t+\eta} = x_t - \eta \beta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I).\]
<p>As in the figure above, a high temperature (i.e. $\beta<1$) flattens out the distribution and causes the chain to mix faster (top distribution in figure). However, we can’t merely run Langevin at a higher temperature, because the stationary distribution of the high-temperature chain is wrong: it’s $p_\beta(x)$ rather than $p(x)$. The idea behind simulated tempering is to run Langevin chains at different temperatures, sometimes swapping to another temperature to help lower-temperature chains explore. To maintain the right stationary distribution at each temperature, we use a Metropolis-Hastings filtering step.</p>
<p>More formally, choosing a suitable sequence $0< \beta_1< \cdots <\beta_L=1$, we define the simulated tempering chain as follows.</p>
<p><img style="float: right;" src="http://holdenlee.github.io/pics/stl.png" width="300" /></p>
<ul>
<li>The <em>state space</em> is a pair of a temperature and location in space $(i, x), i \in [L], x \in \mathbb{R}^d$.<br />
<!--$L$ copies of the state space (in our case $\mathbb R^d$), one copy for each temperature.--></li>
<li>The <em>transitions</em> are defined as follows.
<ul>
<li>If the current point is $(i,x)$, then <em>evolve</em> $x$ according to Langevin diffusion with inverse temperature $\beta_i$.</li>
<li>Propose swaps with some rate $\lambda >0$. Proposing a swap means attempting to move to a neighboring chain, i.e. change $i$ to $i' = i \pm 1$. With probability $\min\{p_{i'}(x)/p_i(x), 1\}$, the transition is accepted. Otherwise, stay at the same point. This is a <em>Metropolis-Hastings step</em>; its purpose is to preserve the stationary distribution.</li>
</ul>
</li>
</ul>
<p>Finally, it’s not too hard to see that at stationarity, the samples at the $L$th level ($\beta_L=1$) follow the desired distribution $p$.</p>
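Here is a minimal sketch of the whole scheme on a toy bimodal target (our own illustrative parameters; for simplicity, the swap step uses unnormalized densities, whereas the actual algorithm must deal with unknown partition-function ratios):

```python
import math, random

random.seed(0)

# Bimodal target: f(x) = -log(e^{-(x+3)^2/2} + e^{-(x-3)^2/2}),
# computed stably by factoring out the larger exponent.
def f(x):
    a, b = -(x + 3) ** 2 / 2, -(x - 3) ** 2 / 2
    m = max(a, b)
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))

def grad_f(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)   # numerical gradient, for brevity

betas = [0.05, 0.2, 0.5, 1.0]                # inverse temperatures, beta_L = 1
eta, swap_rate = 0.05, 0.2

i, x = 0, -3.0                               # state: (temperature index, location)
level_L_samples = []
for _ in range(200000):
    # Langevin move at inverse temperature beta_i.
    x = x - eta * betas[i] * grad_f(x) + math.sqrt(2 * eta) * random.gauss(0, 1)
    # With some rate, propose swapping to a neighboring temperature; accept
    # with the Metropolis-Hastings ratio min{p_{i'}(x)/p_i(x), 1}.
    if random.random() < swap_rate:
        j = i + random.choice([-1, 1])
        if 0 <= j < len(betas):
            log_ratio = -(betas[j] - betas[i]) * f(x)
            if math.log(random.random()) < log_ratio:
                i = j
    if i == len(betas) - 1:                  # record samples at beta_L = 1
        level_L_samples.append(x)

frac_right = sum(s > 0 for s in level_L_samples) / len(level_L_samples)
print(frac_right)  # both modes get visited
```

Despite being initialized at the left mode, the chain reaches the right mode via excursions through the high-temperature levels.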
<h2 id="proof-idea-decomposition-theorem">Proof idea: decomposition theorem</h2>
<p>The main strategy is inspired by Madras and Randall’s <a href="https://www.jstor.org/stable/2699896">Markov chain decomposition theorem</a>, which gives a criterion for a Markov chain to mix rapidly: partition the state space into sets, and show that</p>
<ol>
<li>The Markov chain mixes rapidly when restricted to each set of the partition.</li>
<li>The <em>projected</em> Markov chain, which we define momentarily, mixes rapidly. If there are $m$ sets, the projected chain $\overline M$ is defined on the state space $\{1,\ldots, m\}$, and its transition probabilities are given by the average probability flows between the corresponding sets.</li>
</ol>
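For a finite-state chain, the projected chain can be written down explicitly. The following toy example (the chain, the partition, and all the numbers are ours, purely for illustration) computes it for a two-cluster chain:

```python
import numpy as np

# A hypothetical 4-state reversible chain with two "clusters" {0,1} and {2,3},
# connected by low-probability transitions (a bottleneck).
P = np.array([[0.50, 0.45, 0.05, 0.00],
              [0.45, 0.50, 0.00, 0.05],
              [0.05, 0.00, 0.50, 0.45],
              [0.00, 0.05, 0.45, 0.50]])

# Stationary distribution: the left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

partition = [[0, 1], [2, 3]]

# Projected chain: Pbar(I, J) = sum_{x in S_I, y in S_J} pi(x) P(x, y) / pi(S_I),
# i.e. the average probability flow between the sets.
m = len(partition)
Pbar = np.zeros((m, m))
for I, SI in enumerate(partition):
    wI = pi[SI].sum()
    for J, SJ in enumerate(partition):
        Pbar[I, J] = sum(pi[x] * P[x, y] for x in SI for y in SJ) / wI

print(Pbar)  # rows sum to 1; the off-diagonal entries reflect the bottleneck
```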
<p>To implement this strategy, we first have to specify the partition. In fact, we roughly show that there is a partition of $[L] \times \mathbb{R}^d$ in which:</p>
<ol>
<li>The simulated tempering Langevin chain mixes fast within each of the sets.</li>
<li>The “volume” of the sets (under the stationary distribution of the tempering chain) is not too small.
<!-- [HL: alt.] There is no set at high temperature that has much larger volume at low temperature.
--></li>
</ol>
<p>In applying the Madras-Randall framework with this partition, it’s clear that point (1) above satisfies requirement (1) for the framework; point (2) ensures that the projected Markov chain has no “bottlenecks” and hence that it mixes rapidly (requirement (2)). More precisely, we can show rapid mixing either through the method of canonical paths or Cheeger’s inequality. To do this, we exhibit a “good-probability” path between any two sets in the partition, going through the highest temperature.</p>
<p>The intuition for why this path works is illustrated in the figure below: when transitioning from the set corresponding to the left mode at level $L$ to the right mode at level $L$, each of the steps up/down the temperatures are accepted with good probability if the neighboring temperatures are not too different; at the highest temperature, the chain mixes fast by point (1), and since each of the sets are not too small by point (2), there is a reasonable probability to end at the right mode at the highest temperature.</p>
<center>
<img src="http://www.andrew.cmu.edu/user/aristesk/animation_conductance.gif" />
</center>
<!--(rework this picture?) This is a Markov chain with a small state space, so its spectral gap is easy to lower-bound (e.g., with Cheeger's inequality). The one thing we need to check is that there is no "bottleneck," i.e., one set in the partition that has low probability at high temperature and high probability at low temperature. -->
<p>Intuitively, the partition should track the “modes” of the distribution, but a technical hurdle in implementing this plan is defining the partition when the modes overlap. One can either do this spectrally (i.e., show that the Langevin chain has a spectral gap and use theorems about <a href="https://arxiv.org/abs/1309.3223">spectral graph partitioning</a>, as we did in the <a href="https://arxiv.org/abs/1710.02736">first version</a> of the paper), or use a functional “soft decomposition theorem” (a more flexible version of the classical decomposition theorem), as we do in a <a href="https://arxiv.org/abs/1812.00793">later version</a> of the paper.</p>
<!-- ![](http://holdenlee.github.io/pics/proj_chain.png)-->
Mon, 01 Mar 2021 06:00:00 -0800
http://offconvex.github.io/2021/03/01/beyondlogconcave2/
http://offconvex.github.io/2021/03/01/beyondlogconcave2/
Can implicit regularization in deep learning be explained by norms?<p>This post is based on my <a href="https://arxiv.org/pdf/2005.06398.pdf">recent paper</a> with <a href="https://noamrazin.github.io/">Noam Razin</a> (to appear at NeurIPS 2020), studying the question of whether norms can explain implicit regularization in deep learning.
TL;DR: we argue they cannot.</p>
<h2 id="implicit-regularization--norm-minimization">Implicit regularization = norm minimization?</h2>
<p>Understanding the implicit regularization induced by gradient-based optimization is possibly the biggest challenge facing theoretical deep learning these days.
In classical machine learning we typically regularize via norms, so it seems only natural to hope that in deep learning something similar is happening under the hood, i.e. the implicit regularization strives to find minimal norm solutions.
This is actually the case in the simple setting of overparameterized linear regression $-$ there, by a folklore analysis (cf. <a href="https://openreview.net/pdf?id=Sy8gdB9xx">Zhang et al. 2017</a>), gradient descent (and any other reasonable gradient-based optimizer) initialized at zero is known to converge to the minimal Euclidean norm solution.
A spate of recent works (see <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a> for a thorough review) has shown that for various other models an analogous result holds, i.e. gradient descent (when initialized appropriately) converges to solutions that minimize a certain (model-dependent) norm.
On the other hand, as discussed last year in posts by <a href="http://www.offconvex.org/2019/06/03/trajectories/">Sanjeev</a> as well as <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">Wei and myself</a>, mounting theoretical and empirical evidence suggests that it may not be possible to generally describe implicit regularization in deep learning as minimization of norms.
Which is it then?</p>
<h2 id="a-standard-test-bed-matrix-factorization">A standard test-bed: matrix factorization</h2>
<p>A standard test-bed for theoretically studying implicit regularization in deep learning is <em>matrix factorization</em> $-$ matrix completion via linear neural networks.
Wei and I already presented this model in our <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous post</a>, but for self-containedness I will do so again here.</p>
<p>In <em>matrix completion</em>, we are given entries $\{ M_{i, j} : (i, j) \in \Omega \}$ of an unknown matrix $M$, and our job is to recover the remaining entries.
This can be seen as a supervised learning (regression) problem, where the training examples are the observed entries of $M$, the model is a matrix $W$ trained with the loss:
\[\qquad \ell(W) = \sum\nolimits_{(i, j) \in \Omega} (W_{i, j} - M_{i, j})^2 ~, \qquad\qquad \color{purple}{\text{(1)}}\]
and generalization corresponds to how similar $W$ is to $M$ in the unobserved locations.
In order for the problem to be well-posed, we have to assume something about $M$ (otherwise the unobserved locations can hold any values, and guaranteeing generalization is impossible).
The standard assumption (which has many <a href="https://en.wikipedia.org/wiki/Matrix_completion#Applications">practical applications</a>) is that $M$ has low rank, meaning the goal is to find, among all global minima of the loss $\ell(W)$, one with minimal rank.
The classic algorithm for achieving this is <a href="https://en.wikipedia.org/wiki/Matrix_norm#Schatten_norms"><em>nuclear norm</em></a> minimization $-$ a convex program which, given enough observed entries and under certain technical assumptions (“incoherence”), recovers $M$ exactly (cf. <a href="https://statweb.stanford.edu/~candes/papers/MatrixCompletion.pdf">Candes and Recht</a>).</p>
<p>Matrix factorization represents an alternative, deep learning approach to matrix completion.
The idea is to use a <em>linear neural network</em> (fully-connected neural network with linear activation), and optimize the resulting objective via gradient descent (GD).
More specifically, rather than working with the loss $\ell(W)$ directly, we choose a depth $L \in \mathbb{N}$, and run GD on the <em>overparameterized objective</em>:
\[\phi ( W_1 , W_2 , \ldots , W_L ) := \ell ( W_L W_{L - 1} \cdots W_1) ~. ~~\qquad~ \color{purple}{\text{(2)}}\]
Our solution to the matrix completion problem is then:
\[\qquad\qquad W_{L : 1} := W_L W_{L - 1} \cdots W_1 ~, \qquad\qquad\qquad \color{purple}{\text{(3)}}\]
which we refer to as the <em>product matrix</em>.
While (for $L \geq 2$) it is possible to constrain the rank of $W_{L : 1}$ by limiting dimensions of the parameter matrices $\{ W_j \}_j$, from an implicit regularization standpoint, the case of interest is where rank is unconstrained (i.e. dimensions of $\{ W_j \}_j$ are large enough for $W_{L : 1}$ to take on any value).
In this case there is <em>no explicit regularization</em>, and the kind of solution GD will converge to is determined implicitly by the parameterization.
The degenerate case $L = 1$ is obviously uninteresting (nothing is learned in the unobserved locations), but what happens when depth is added ($L \geq 2$)?</p>
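Before getting to the formal conjectures, it is easy to play with this setup numerically. The following sketch (our own toy instance, with hypothetical sizes and learning rate) runs GD on a depth $L = 2$ factorization from near-zero initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Ground truth: a rank-1 matrix M, with roughly 60% of entries observed.
u = rng.standard_normal((d, 1))
M = u @ u.T
mask = rng.random((d, d)) < 0.6                 # the observed set Omega

# Depth L = 2 factorization W_{2:1} = W2 @ W1 with unconstrained rank
# (square parameter matrices), near-zero init and small learning rate.
W1 = 1e-3 * rng.standard_normal((d, d))
W2 = 1e-3 * rng.standard_normal((d, d))
lr = 0.01

for _ in range(50000):
    R = 2 * mask * (W2 @ W1 - M)                # gradient of the loss wrt W2 @ W1
    G2, G1 = R @ W1.T, W2.T @ R                 # chain rule for phi(W1, W2)
    W2, W1 = W2 - lr * G2, W1 - lr * G1

W = W2 @ W1
train_loss = np.sum((mask * (W - M)) ** 2)
print(train_loss, np.linalg.svd(W, compute_uv=False))
```

In runs like this, GD fits the observed entries, and the singular values of the product matrix indicate a strong bias toward low effective rank.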
<p>In their <a href="https://papers.nips.cc/paper/2017/file/58191d2a914c6dae66371c9dcdc91b41-Paper.pdf">NeurIPS 2017 paper</a>, Gunasekar et al. showed empirically that with depth $L = 2$, if GD is run with small learning rate starting from near-zero initialization, then the implicit regularization in matrix factorization tends to produce low-rank solutions (yielding good generalization under the standard assumption of $M$ having low rank).
They conjectured that behind the scenes, what takes place is the classic nuclear norm minimization algorithm:</p>
<blockquote>
<p><strong>Conjecture 1 (<a href="https://papers.nips.cc/paper/7195-implicit-regularization-in-matrix-factorization.pdf">Gunasekar et al. 2017</a>; informally stated):</strong>
GD (with small learning rate and near-zero initialization) over a depth $L = 2$ matrix factorization finds the solution with minimum nuclear norm.</p>
</blockquote>
<p>Moreover, they were able to prove the conjecture in a certain restricted setting, and others (e.g. <a href="http://proceedings.mlr.press/v75/li18a/li18a.pdf">Li et al. 2018</a>) later derived proofs for additional specific cases.</p>
<p>Two years after Conjecture 1 was made, in a <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a> with Sanjeev, Wei and Yuping Luo, we presented empirical and theoretical evidence (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details) which led us to hypothesize the opposite, namely, that for any depth $L \geq 2$, the implicit regularization in matrix factorization can <em>not</em> be described as minimization of a norm:</p>
<blockquote>
<p><strong>Conjecture 2 (<a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">Arora et al. 2019</a>; informally stated):</strong>
Given a depth $L \geq 2$ matrix factorization, for any norm $\|{\cdot}\|$, there exist matrix completion tasks on which GD (with small learning rate and near-zero initialization) finds solution that does not minimize $\|{\cdot}\|$.</p>
</blockquote>
<p>Due to technical subtleties in their formal statements, Conjectures 1 and 2 do not formally contradict each other.
However, they represent opposite views on the question of whether or not norms can explain implicit regularization in matrix factorization.
The goal of my recent work with <a href="https://noamrazin.github.io/">Noam</a> was to resolve this open question.</p>
<h2 id="implicit-regularization-can-drive-all-norms-to-infinity">Implicit regularization can drive all norms to infinity</h2>
<p>The main result in our <a href="https://arxiv.org/pdf/2005.06398.pdf">paper</a> is a proof that there exist simple matrix completion settings where the implicit regularization in matrix factorization drives <strong><em>all norms towards infinity</em></strong>.
By this we affirm Conjecture 2, and in fact go beyond it in the following sense:
<em>(i)</em> not only is each norm disqualified by some setting, but there are actually settings that jointly disqualify all norms;
and
<em>(ii)</em> not only are norms not necessarily minimized, but they can grow towards infinity.</p>
<p>The idea behind our analysis is remarkably simple.
We prove the following:</p>
<blockquote>
<p><strong>Theorem (informally stated):</strong>
During GD over matrix factorization (i.e. over $\phi ( W_1 , W_2 , \ldots , W_L)$ defined by Equations $\color{purple}{\text(1)}$ and $\color{purple}{\text(2)}$), if the learning rate is sufficiently small and the initialization sufficiently close to the origin, then the determinant of the product matrix $W_{L : 1}$ (Equation $\color{purple}{\text(3)}$) does not change sign.</p>
</blockquote>
<p>A corollary is that if $\det ( W_{L : 1} )$ is positive at initialization (an event whose probability is $0.5$ under any reasonable initialization scheme), then it stays that way throughout.
This seemingly benign observation has far-reaching implications.
As a simple example, consider the following matrix completion problem ($*$ here stands for unobserved entry):
\[
\qquad\qquad
\begin{pmatrix}
* & 1 \newline
1 & 0
\end{pmatrix}
~. \qquad\qquad \color{purple}{\text{(4)}}
\]
Every solution to this problem, i.e. every matrix that agrees with its observations, must have determinant $-1$.
It is therefore only logical to expect that when solving the problem using matrix factorization, the determinant of the product matrix $W_{L : 1}$ will converge to $-1$.
On the other hand, we know that (with probability $0.5$ over initialization) $\det ( W_{L : 1} )$ is always positive, so what is going on?
This conundrum can only mean one thing $-$ as $W_{L : 1}$ fits the observations, its value in the unobserved location (i.e. $(W_{L : 1})_{11}$) diverges to infinity, which implies that <em>all norms grow to infinity!</em></p>
<p>The above idea goes way beyond the simple example given in Equation $\color{purple}{\text(4)}$.
We use it to prove that in a wide array of matrix completion settings, the implicit regularization in matrix factorization leads norms to <em>increase</em>.
We also demonstrate it empirically, showing that in such settings unobserved entries grow during optimization.
Here’s the result of an experiment with the setting of Equation $\color{purple}{\text(4)}$:</p>
<div style="text-align:center;">
<img style="width:300px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_exp.png" />
<br />
<i><b>Figure 1:</b>
Solving the matrix completion problem defined by Equation $\color{purple}{\text(4)}$ using matrix factorization leads the absolute value of the unobserved entry to increase (which in turn means norms increase) as the loss decreases.
</i>
</div>
<h2 id="what-is-happening-then">What is happening then?</h2>
<p>If the implicit regularization in matrix factorization is not minimizing a norm, what is it doing?
While a complete theoretical characterization is still lacking, there are signs that a potentially useful interpretation is <strong><em>minimization of rank</em></strong>.
In our aforementioned <a href="https://papers.nips.cc/paper/2019/file/c0c783b5fc0d7d808f1d14a6e9c8280d-Paper.pdf">NeurIPS 2019 paper</a>, we derived a dynamical characterization (and showed supporting experiments) suggesting that matrix factorization is implicitly conducting some kind of greedy low-rank search (see <a href="http://www.offconvex.org/2019/07/10/trajectories-linear-nets/">previous blog post</a> for details).
This phenomenon actually facilitated a new autoencoding architecture suggested in a recent <a href="https://arxiv.org/pdf/2010.00679.pdf">empirical paper</a> (to appear at NeurIPS 2020) by Yann LeCun and his team at Facebook AI.
Going back to the example in Equation $\color{purple}{\text(4)}$, notice that in this matrix completion problem all solutions have rank $2$, but it is possible to essentially reduce the rank to $1$ by taking the (absolute value of the) unobserved entry to infinity.
As we’ve seen, this is exactly what the implicit regularization in matrix factorization does!</p>
<p>Intrigued by the rank minimization viewpoint, <a href="https://noamrazin.github.io/">Noam</a> and I empirically explored an extension of matrix factorization to <em>tensor factorization</em>.
Tensors can be thought of as high dimensional arrays, and they admit natural factorizations similarly to matrices (two dimensional arrays).
We found that on the task of <em>tensor completion</em> (defined analogously to matrix completion $-$ see Equation $\color{purple}{\text(1)}$ and surrounding text), GD on a tensor factorization tends to produce solutions with low rank, where rank is defined in the context of tensors (for a formal definition, and a general intro to tensors and their factorizations, see this <a href="http://www.kolda.net/publication/TensorReview.pdf">excellent survey</a> by Kolda and Bader).
That is, just like in matrix factorization, the implicit regularization in tensor factorization also strives to minimize rank!
Here’s a representative result from one of our experiments:</p>
<div style="text-align:center;">
<img style="width:700px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_tf_exp.png" />
<br />
<i><b>Figure 2:</b>
In analogy with matrix factorization, the implicit regularization of tensor factorization (high dimensional extension) strives to find a low (tensor) rank solution.
Plots show reconstruction error and (tensor) rank of final solution on multiple tensor completion problems differing in the number of observations.
GD over tensor factorization is compared against "linear" method $-$ GD over direct parameterization of tensor initialized at zero (this is equivalent to fitting observations while placing zeros in unobserved locations).
</i>
<br />
<br />
</div>
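For readers unfamiliar with tensor rank, here is a minimal sketch (illustrative, not from our experiments) of the CP factorization underlying Figure 2: a $3$-way tensor of rank at most $R$ is a sum of $R$ outer products, and matricizing it along any mode yields a matrix of rank at most $R$.

```python
import numpy as np

# Illustrative sketch of tensor (CP) rank: a rank-R factorization writes a
# 3-way tensor as a sum of R outer products a_r (x) b_r (x) c_r.
def cp_tensor(A, B, C):
    # A, B, C have shapes (I, R), (J, R), (K, R); returns an (I, J, K) tensor.
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(0)
R = 2
A, B, C = (rng.standard_normal((4, R)) for _ in range(3))
T = cp_tensor(A, B, C)  # a 4x4x4 tensor of tensor rank at most 2

# Matricizing T along a mode gives a matrix of rank at most R -- a standard
# necessary condition that lets us sanity-check the construction.
assert np.linalg.matrix_rank(T.reshape(4, 16)) <= R
```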
<p>So what can tensor factorizations tell us about deep learning?
It turns out that, similarly to how matrix factorizations correspond to prediction of matrix entries via linear neural networks, tensor factorizations can be seen as prediction of tensor entries with a certain type of <em>non-linear</em> neural networks, named <em>convolutional arithmetic circuits</em> (in my PhD I worked a lot on analyzing the expressive power of these models, as well as showing that they work well in practice $-$ see this <a href="https://arxiv.org/pdf/1705.02302.pdf">survey</a> for a soft overview).</p>
<div style="text-align:center;">
<img style="width:900px;" src="http://www.offconvex.org/assets/reg_dl_not_norm_mf_lnn_tf_cac.png" />
<br />
<i><b>Figure 3:</b>
The equivalence between matrix factorizations and linear neural networks extends to an equivalence between tensor factorizations and a certain type of non-linear neural networks named convolutional arithmetic circuits.
</i>
<br />
<br />
</div>
<p>Analogously to how the input-output mapping of a linear neural network can be thought of as a matrix, that of a convolutional
arithmetic circuit is naturally represented by a tensor.
The experiment reported in Figure 2 (and similar ones presented in <a href="https://arxiv.org/pdf/2005.06398.pdf">our paper</a>) thus provides a second example of a neural network architecture whose implicit regularization strives to lower a notion of rank for its input-output mapping.
This leads us to believe that implicit rank minimization may be a general phenomenon, and developing notions of rank for input-output mappings of contemporary models may be key to explaining generalization in deep learning.</p>
<p><a href="http://www.cohennadav.com/">Nadav Cohen</a></p>
Fri, 27 Nov 2020 01:00:00 -0800
http://offconvex.github.io/2020/11/27/reg_dl_not_norm/
http://offconvex.github.io/2020/11/27/reg_dl_not_norm/
How to allow deep learning on your data without revealing the data
<p>Today’s online world and the emerging internet of things is built around a Faustian bargain: consumers (and their internet of things) hand over their data, and in return get customization of the world to their needs. Is this exchange of privacy for convenience inherent? At first sight one sees no way around because, of course, to allow machine learning on our data we have to hand our data over to the training algorithm.</p>
<p>Similar issues arise in settings other than consumer devices. For instance, hospitals may wish to pool together their patient data to train a large deep model. But privacy laws such as HIPAA forbid them from sharing the data itself, so somehow they have to train a deep net on their data without revealing their data. Frameworks such as Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) have been proposed for this but it is known that sharing gradients in that environment leaks a lot of information about the data (<a href="https://arxiv.org/abs/1906.08935">Zhu et al., 2019</a>).</p>
<p>Methods to achieve some of the above could completely change the privacy/utility tradeoffs implicit in today’s organization of the online world.</p>
<p>This blog post discusses the current set of solutions, how they don’t quite suffice for above questions, and the story of a new solution, <a href="http://arxiv.org/abs/2010.02772">InstaHide</a>, that we proposed, and takeaways from a recent attack on it by Carlini et al.</p>
<h2 id="existing-solutions-in-cryptography">Existing solutions in Cryptography</h2>
<p>Classic solutions in cryptography allow you, in principle, to outsource any computation to the cloud without revealing your data. (A modern method is Fully Homomorphic Encryption.) Adapting these ideas to machine learning presents two major obstacles: (a) (serious issue) huge computational overhead, which essentially rules it out for today’s large-scale deep models; (b) (less serious issue) the need for special setups, e.g., requiring every user to sign up for public-key encryption.</p>
<p>Significant research efforts are being made to try to overcome these obstacles and we won’t survey them here.</p>
<h2 id="differential-privacy-dp">Differential Privacy (DP)</h2>
<p>Differential privacy (<a href="https://www.iacr.org/archive/eurocrypt2006/40040493/40040493.pdf">Dwork et al., 2006</a>, <a href="https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf">Dwork&Roth, 2014</a>) involves adding carefully calculated amounts of noise during training. This is a modern and rigorous version of classic <em>data anonymization</em> techniques whose canonical application is release of noised census data to protect privacy of individuals.</p>
<p>This notion was adapted to machine learning by positing that “privacy” in machine learning refers to trained classifiers not depending on the data of individuals. In other words, if the classifier is trained on data from N individuals, its behavior should be essentially unchanged (statistically speaking) if we omit the data of any one individual. Note that this is a weak notion of privacy: it does not in any way hide the data from the company.</p>
<p>Many tech companies have adopted differential privacy in deployed systems but the following two caveats are important.</p>
<blockquote>
<p>(Caveat 1): In deep learning applications, DP’s provable guarantees are very weak.</p>
</blockquote>
<p>Applying DP to deep learning involves noticing that the gradient computation amounts to adding gradients of the loss corresponding to individual data points, and that adding noise to those individual gradients in calculated doses can help make the overall classifier limit its dependence on the individual’s datapoint.</p>
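The clip-and-noise step just described can be sketched as follows (a simplified, illustrative version of the DP-SGD aggregation; it assumes per-example gradients are already computed, and the names here are ours):

```python
import numpy as np

# Schematic of the DP-SGD step described above (simplified; assumes per-example
# gradients are given): clip each gradient to norm C, average, then add noise.
def dpsgd_aggregate(per_example_grads, clip_norm, noise_mult, rng):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped), size=avg.shape)
    return avg + noise

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) * s for s in (0.1, 1.0, 10.0)]
g = dpsgd_aggregate(grads, clip_norm=1.0, noise_mult=0.0, rng=rng)
# with no noise, the aggregate's norm is at most the clipping threshold
assert np.linalg.norm(g) <= 1.0 + 1e-9
```

The privacy accounting (how `noise_mult` translates into an $(\epsilon,\delta)$ guarantee) is the subtle part and is omitted here.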
<p>In practice, provable bounds require adding so much gradient noise that the accuracy of the trained classifier plummets. We do not know of any successful training that achieved accuracy above 75 percent on CIFAR10 (or even 10 percent on ImageNet). Furthermore, achieving this level of accuracy requires <strong>pretraining</strong> the classifier model on a large set of <strong>public</strong> images and then using the private/protected images only to fine-tune the parameters.</p>
<p>Thus it is no surprise that firms today usually apply DP with a very low noise level, which gives essentially no guarantees. Which brings us to:</p>
<blockquote>
<p>(Caveat 2): DP’s guarantees (and even weaker guarantees applying to deployment scenarios) possibly act as a fig leaf that allows firms to not address the kinds of privacy violations that the person on the street actually worries about.</p>
</blockquote>
<p>DP’s provable guarantee (which as noted, does not hold in deployed systems due to the low noise level used) would only ensure that a deployed ML software that was trained with data from tens of millions of users will not change its behavior depending upon private information of any single user.</p>
<p>But that threat model would seem remote to the person on the street. The privacy issue they worry about more is that copious amounts of our data are continuously collected/stored/mined/sold, often by entities we do not even know about. While lax regulation is primarily to blame, there is also the technical hurdle that there is no <strong>practical way</strong> for consumers to hide their data while at the same time benefiting from customized ML solutions that improve their lives.</p>
<p>Which brings us to the question we started with: <em>Could consumers allow machine learning to be done on their data without revealing their data?</em></p>
<h2 id="a-proposed-solution-instahide">A proposed solution: InstaHide</h2>
<p>InstaHide is a new concept: it hides or “encrypts” images to protect them somewhat, while still allowing standard deep learning pipelines to be applied on them. The deep model is trained entirely on encrypted images.</p>
<ul>
<li>
<p>The training speed and accuracy is only slightly worse than vanilla training: one can achieve a test accuracy of ~ 90 percent on CIFAR10 using encrypted images with a computation overhead $< 5$ percent.</p>
</li>
<li>
<p>When it comes to privacy, like every other form of cryptography, its security is based upon conjectured difficulty of the underlying computational problem.
(But we don’t expect breaking it to be as difficult as say breaking RSA.)</p>
</li>
</ul>
<h3 id="how-instahide-encryption-works">How InstaHide encryption works</h3>
<p>Here are some details. InstaHide belongs to the class of subset-sum type encryptions (<a href="https://www.cs.cmu.edu/afs/cs/user/dwoodruf/www/biwx.pdf">Bhattacharyya et al., 2011</a>), and was inspired by a data augmentation technique called Mixup (<a href="https://arxiv.org/abs/1710.09412">Zhang et al., 2018</a>). It views images as vectors of pixel values. With vectors you can take linear combinations. The figure below shows the result of a typical Mixup: adding 0.6 times the bird image with 0.4 times the airplane image. The image labels can also be treated as one-hot vectors, and they are mixed using the same coefficients as the image samples.</p>
<p style="text-align:center;">
<img src="/assets/mixup.png" width="60%" />
</p>
<p>To encrypt the bird image, InstaHide does mixup (i.e., a combination with nonnegative coefficients) with one other randomly chosen training image and with two other images chosen randomly from a large public dataset like ImageNet. The coefficients 0.6, 0.4, etc. in the figure are also chosen at random. Then it takes this composite image and, for every pixel, randomly flips the sign of its value. With that, we get the encrypted images and labels. All random choices made in this encryption act as a one-time key that is never re-used to encrypt other images.</p>
<p>InstaHide has a parameter $k$ denoting how many images are mixed; in the picture, we have $k=4$. The figure below shows this encryption mechanism.</p>
<p style="text-align:center;">
<img src="/assets/instahide.png" width="80%" />
</p>
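The mixing-and-sign-flip step can be sketched in a few lines. This is our illustrative version, not the released implementation: the coefficient-sampling scheme and pixel scaling here are assumptions, and label mixing is omitted.

```python
import numpy as np

# A minimal sketch of the encryption step described above (ours, not the
# released InstaHide code). To encrypt a private image x with parameter k:
# mix x with one other private image and k-2 public images using random
# nonnegative coefficients summing to 1, then flip each pixel's sign at
# random. The Dirichlet sampling of coefficients is an illustrative choice.
def instahide_encrypt(x, private, public, k, rng):
    others = [private[rng.integers(len(private))]]
    others += [public[rng.integers(len(public))] for _ in range(k - 2)]
    lam = rng.dirichlet(np.ones(k))                # random mixing coefficients
    mixed = lam[0] * x + sum(l * y for l, y in zip(lam[1:], others))
    signs = rng.choice([-1.0, 1.0], size=x.shape)  # one-time random sign mask
    return signs * mixed

rng = np.random.default_rng(0)
private = rng.uniform(-1, 1, size=(10, 32 * 32))   # pixel values in [-1, 1]
public = rng.uniform(-1, 1, size=(1000, 32 * 32))  # stand-in for a public set
enc = instahide_encrypt(private[0], private, public, k=4, rng=rng)
assert enc.shape == (32 * 32,)
assert np.max(np.abs(enc)) <= 1.0  # a convex mix of [-1, 1] pixels stays in [-1, 1]
```

All the random draws inside `instahide_encrypt` play the role of the one-time key, so a fresh call gives a fresh encryption of the same image.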
<p>When plugged into a standard deep learning pipeline with a private dataset of $n$ images, in each epoch of training (say $T$ epochs in total), InstaHide re-encrypts each image in the dataset using a random one-time key. This gives $n\times T$ encrypted images in total.</p>
<h3 id="the-security-argument">The security argument</h3>
<p>We conjectured, based upon intuitions from the computational complexity of the $k$-vector-subset-sum problem, that extracting information about the images could require time $N^{k-2}$. Here $N$, the size of the public dataset, can be tens or hundreds of millions, so this might be infeasible for real-life attackers.</p>
<p>We also released a <a href="https://github.com/Hazelsuko07/InstaHide_Challenge">challenge dataset</a> with $k=6, n=100, T=50$ to enable further investigation of InstaHide’s security.</p>
<h2 id="carlini-et-als-recent-attack-on-instahide">Carlini et al.’s recent attack on InstaHide</h2>
<p>Recently, Carlini et al. have shared with us a manuscript with a two-step reconstruction attack (<a href="https://arxiv.org/pdf/2011.05315.pdf">Carlini et al., 2020</a>) against InstaHide.</p>
<p><strong><em>TL;DR: They used 11 hours on Google’s best GPUs to get partial recovery of our 100 challenge encryptions and 120 CPU hours to break the encryption completely. Furthermore, the latter was possible entirely because we used an insecure random number generator, and they used exhaustive search over random seeds.</em></strong></p>
<p>Now the details.</p>
<p>The attack takes $n\times T$ InstaHide-encrypted images as the input, ($n$ is the size of the private dataset, $T$ is the number of training epochs), and returns a reconstruction of the private dataset. It goes as follows.</p>
<ul>
<li>
<p>Map the $n \times T$ encryptions to the $n$ private images by clustering encryptions of the same private image into a group. This is achieved by first building a graph representing pairwise similarity between encrypted images, and then assigning each encryption to a private image. In their implementation, they train a neural network to annotate pairwise similarity between encryptions.</p>
</li>
<li>
<p>Then, given the encrypted images and the mapping, they solve a nonlinear optimization problem via gradient descent to recover an approximation of the original private dataset.</p>
</li>
</ul>
<p>Using Google’s powerful GPUs, it took them 10 hours to train the neural network for similarity annotation, and about another hour to get an approximation of our challenge set of $100$ images with $k=6, n=100, T=50$. This gave them vaguely correct images, with significant unclear areas and color shifts.</p>
<p>They also proposed a different strategy which exploits a vulnerability of NumPy and PyTorch’s random number generators (<em>aargh; we didn’t use a cryptographically secure random number generator</em>). They did a brute-force search over the $2^{32}$ possible initial random seeds, which allowed them to reproduce the randomness used during encryption and thus perform a pixel-perfect reconstruction. As they report, this attack takes 120 CPU hours (parallelized across 100 cores to obtain the solution in a little over an hour). We will have this implementation flaw fixed in an updated version.</p>
<h3 id="thoughts-on-this-attack">Thoughts on this attack</h3>
<p>Though the attack is clever and impressive, we feel that the long-term take-away is still unclear for several reasons.</p>
<blockquote>
<p>Variants of InstaHide seem to evade the attack.</p>
</blockquote>
<p>The challenge set contained 50 encryptions each of 100 images. This corresponds to using the encrypted images for 50 epochs. But as is done in existing settings that use DP, one can pretrain the deep model on non-private images and then fine-tune it with fewer epochs on the private images. Using a pipeline similar to DPSGD (<a href="https://arxiv.org/abs/1607.00133">Abadi et al., 2016</a>), pretraining a ResNet-18 on CIFAR100 (the public dataset) and fine-tuning for $10$ epochs on CIFAR10 (the private dataset) gives an accuracy of 83 percent, still far better than any provable guarantees using DP on this dataset. The Carlini et al. team conceded that their attack probably would not work in this setting.</p>
<p>Similarly, using InstaHide purely at inference time (i.e., using ML instead of training ML) should still be completely secure, since only one encryption of each image is released. The new attack cannot work here at all.</p>
<blockquote>
<p>InstaHide was never intended to be a mission-critical encryption like RSA (which by the way also has no provable guarantees).</p>
</blockquote>
<p>InstaHide is designed to give users and the internet of things a <em>light-weight</em> encryption method that allows them to use machine learning without giving eavesdroppers or servers access to their raw data. There is no other cost-effective alternative to InstaHide for this application. If it takes Google’s powerful computers a few hours to break our challenge set of 100 images, this is not yet a cost-effective attack in the intended settings.</p>
<p>More important, the challenge dataset corresponded to an ambitious form of security, where the encrypted images themselves are released to the world. The more typical application is a Federated Learning (<a href="https://arxiv.org/abs/1610.05492">Konečný et al., 2016</a>) scenario: the adversary observes shared gradients that are computed using encrypted images (he also has access to the trained model). The attacks in this paper do not currently apply to that scenario. This is also the idea in <a href="https://arxiv.org/abs/2010.06053"><strong>TextHide</strong></a>, an adaptation of InstaHide to text data.</p>
<h2 id="takeways">Takeaways</h2>
<p>Users need lightweight encryptions that can be applied in real time to large amounts of data, and yet allow them to benefit from machine learning on the cloud. Methods to do so could completely change the privacy/utility tradeoffs implicitly assumed in today’s tech world.</p>
<p>InstaHide is the only such tool right now, and we now know that it provides moderate security that may be enough for many applications.</p>
<!--
### References
[1] [**InstaHide: Instance-hiding Schemes for Private Distributed Learning**](http://arxiv.org/abs/2010.02772), *Yangsibo Huang, Zhao Song, Kai Li, Sanjeev Arora*, ICML 2020
[2] [**mixup: Beyond Empirical Risk Minimization**](https://arxiv.org/abs/1710.09412), *Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz*, ICLR 2018
[3] [**An Attack on InstaHide: Is Private Learning Possible with Instance Encoding?**](https://arxiv.org/pdf/2011.05315.pdf) *Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramèr*, arxiv preprint
[4] [**Deep Learning with Differential Privacy**](https://arxiv.org/abs/1607.00133), *Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang*, ACM CCS 2016
[5] [**Federated learning: Strategies for improving communication efficiency**](https://arxiv.org/abs/1610.05492), *Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon*, NeurIPS Workshop 2016
[6] [**A method for obtaining digital signatures and public-key cryptosystems**](https://people.csail.mit.edu/rivest/Rsapaper.pdf), *R.L. Rivest, A. Shamir, and L. Adleman*, Communications of the ACM 1978
[7] [**Deep leakage from gradients**](https://arxiv.org/abs/1906.08935), *Ligeng Zhu, Zhijian Liu, and Song Han.* Neurips19. -->
Wed, 11 Nov 2020 02:00:00 -0800
http://offconvex.github.io/2020/11/11/instahide/
http://offconvex.github.io/2020/11/11/instahide/
Mismatches between Traditional Optimization Analyses and Modern Deep Learning
<p>You may remember our <a href="http://www.offconvex.org/2020/04/24/ExpLR1/">previous blog post</a> showing that it is possible to do state-of-the-art deep learning with a learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically <em>normalized nets</em>, which is our term for nets that use any one of the popular normalization schemes, e.g. <a href="https://arxiv.org/abs/1502.03167">BatchNorm (BN)</a>, <a href="https://arxiv.org/abs/1803.08494">GroupNorm (GN)</a>, <a href="https://arxiv.org/abs/1602.07868">WeightNorm (WN)</a>. Today’s post (based upon <a href="https://arxiv.org/abs/2010.02916">our paper</a> with Kaifeng Lyu at NeurIPS20) identifies other surprising incompatibilities between normalized nets and traditional analyses. We hope this will change the way you teach and think about deep learning!</p>
<p>Before diving into the results, we recall that normalized nets are typically trained with weight decay (aka $\ell_2$ regularization). Thus the $t$th iteration of Stochastic Gradient Descent (SGD) is:</p>
\[w_{t+1} \gets (1-\eta_t\lambda)w_t - \eta_t \nabla \mathcal{L}(w_t; \mathcal{B}_t),\]
<p>where $\lambda$ is the weight decay (WD) factor (or $\ell_2$-regularization coefficient), $\eta_t$ the learning rate, $\mathcal{B}_t$ the batch, and $\nabla \mathcal{L}(w_t,\mathcal{B}_t)$ the batch gradient.</p>
<p>As sketched in our previous blog post, under fairly mild assumptions (namely, fixing the top layer during random initialization —which empirically does not hurt final accuracy) the loss function for training such normalized nets is <em>scale invariant</em>, which means $\mathcal{L}(w _ t; \mathcal{B}_ t)=\mathcal{L}(cw _ t; \mathcal{B} _ t)$, $\forall c>0$.</p>
<p>A consequence of scale invariance is that $\nabla _ w \mathcal{L} \vert _ {w = w _ 0} = c \nabla _ w \mathcal{L}\vert _ {w = cw _ 0}$ and $\nabla ^ 2 _ w \mathcal{L} \vert _ {w = w _ 0} = c ^ 2 \nabla ^ 2 _ w \mathcal{L} \vert _ {w = cw _ 0}$, for any $c>0$.</p>
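This gradient identity is easy to check numerically. Here is a small sketch (ours, for illustration) using the toy scale-invariant loss $L(x, y) = \frac{x^2}{x^2+y^2}$ that also appears in Figure 1 below:

```python
import numpy as np

# Numerical check of the scale-invariance gradient identity, using the toy
# scale-invariant loss L(x, y) = x^2 / (x^2 + y^2).
def grad(w, eps=1e-6):
    L = lambda v: v[0] ** 2 / (v[0] ** 2 + v[1] ** 2)
    g = np.zeros(2)
    for i in range(2):  # central finite differences
        e = np.zeros(2)
        e[i] = eps
        g[i] = (L(w + e) - L(w - e)) / (2 * eps)
    return g

w0 = np.array([0.3, 0.7])
c = 2.0
# grad L at w0 equals c times grad L at c*w0
assert np.allclose(grad(w0), c * grad(c * w0), atol=1e-5)
```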
<h2 id="some-conventional-wisdoms-cws">Some Conventional Wisdoms (CWs)</h2>
<p>Now we briefly describe some conventional wisdoms. Needless to say, by the end of this post these will turn out to be very very suspect! Possibly they were OK in earlier days of deep learning, and with shallower nets.</p>
<blockquote>
<p>CW 1) As we reduce LR to zero, optimization dynamic converges to a deterministic path (Gradient Flow) along which training loss strictly decreases.</p>
</blockquote>
<p>Recall that in traditional explanation of (deterministic) gradient descent, if LR is smaller than roughly the inverse of the smoothness of the loss function, then each step reduces the loss. SGD, being stochastic, has a distribution over possible paths. But very tiny LR can be thought of as full-batch Gradient Descent (GD), which in the limit of infinitesimal step size approaches Gradient Flow (GF).</p>
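As a quick sanity check of that classical condition, here is a one-dimensional sketch (illustrative, not from the paper): on a quadratic with smoothness $L$, GD contracts precisely when the LR is below $2/L$.

```python
# Classic illustration of the smoothness condition mentioned above: on the
# quadratic f(w) = (L/2) * w^2 (smoothness L), GD converges iff lr < 2/L.
L_smooth = 10.0

def run_gd(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * L_smooth * w  # gradient of (L/2) * w^2 is L * w
    return abs(w)

assert run_gd(lr=0.1) < 1e-2   # lr < 2/L = 0.2: converges
assert run_gd(lr=0.25) > 1.0   # lr > 2/L: diverges
```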
<p>The above reasoning shows that a very small LR is guaranteed to decrease the loss at least as well as any higher LR can. Of course, in deep learning we care not only about optimization but also about generalization, and here small LR is believed to hurt.</p>
<blockquote>
<p>CW 2) To achieve the best generalization the LR must be large initially for quite a few epochs.</p>
</blockquote>
<p>This is primarily an empirical finding: using too-small learning rates or too-large batch sizes from the start (all other hyper-parameters being fixed) is known to lead to worse generalization (<a href="https://arxiv.org/pdf/1206.5533.pdf">Bengio, 2012</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>).</p>
<p>A popular explanation for this phenomenon is that the noise in gradient estimation during SGD is beneficial for generalization. (As noted, this noise tends to average out when LR is very small.) Many authors have suggested that the noise helps because it keeps the trajectory away from sharp minima, which are believed to generalize worse, although there is some difference of opinion here (<a href="http://www.bioinf.jku.at/publications/older/3304.pdf">Hochreiter&Schmidhuber, 1997</a>; <a href="https://arxiv.org/abs/1609.04836">Keskar et al., 2017</a>; <a href="https://arxiv.org/abs/1712.09913">Li et al., 2018</a>; <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>; <a href="https://arxiv.org/pdf/1902.00744.pdf">He et al., 2019</a>). <a href="https://arxiv.org/abs/1907.04595">Li et al., 2019</a> also gave an example (a simple two-layer net) where this observation of worse generalization due to small LR is mathematically proved and also experimentally verified.</p>
<blockquote>
<p>CW 3) SGD can be modeled via a Stochastic Differential Equation (SDE) in the continuous-time limit with a fixed Gaussian noise. Namely, think of SGD as a diffusion process that <strong>mixes</strong> to some Gibbs-like distribution on trained nets.</p>
</blockquote>
<p>This is the usual approach to formal understanding of CW 2 (<a href="https://arxiv.org/abs/1710.06451">Smith&Le, 2018</a>; <a href="https://arxiv.org/abs/1710.11029">Chaudhari&Soatto, 2018</a>; <a href="https://arxiv.org/abs/2004.06977">Shi et al., 2020</a>). The idea is that SGD is gradient descent with a noise term, which has a continuous-time approximation as a diffusion process described as</p>
\[dW_t = - \eta_t \lambda W_t dt - \eta_t \nabla \mathcal{L}(W_t) dt + \eta_t \Sigma_{W_t}^{1/2} dB_t,\]
<p>where $\Sigma_{W_t}$ is the covariance of the stochastic gradient $\nabla \mathcal{L}(W_t; \mathcal{B}_t)$, and $B_t$ denotes Brownian motion of the appropriate dimension. Several works have adopted this SDE view and given rigorous analyses of the effect of noise.</p>
<p>In this story, SGD turns into a geometric random walk in the landscape, which can in principle explore the landscape more thoroughly, for instance by occasionally making loss-increasing steps. While an appealing view, rigorous analysis is difficult because we lack a mathematical description of the loss landscape. Various papers assume the noise in the SDE is isotropic Gaussian, and then derive an expression for the stationary distribution of the random walk in terms of the familiar Gibbs distribution. This view gives an intuitively appealing explanation of some deep learning phenomena, since the magnitude of the noise (which is related to LR and batch size) controls the convergence speed and other properties. For instance, this SDE approximation implies the well-known <em>linear scaling rule</em> (<a href="https://arxiv.org/pdf/1706.02677.pdf">Goyal et al., 2017</a>).</p>
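The intuition behind the diffusion term (and the linear scaling rule) is that the minibatch-gradient noise covariance scales like the inverse of the batch size. A quick simulation on a toy squared loss (ours, for illustration) checks this:

```python
import numpy as np

# Sanity check of the intuition behind the SDE view: the minibatch-gradient
# noise variance scales like 1/(batch size), which is why the LR/batch ratio
# controls the diffusion term (the basis of the linear scaling rule).
rng = np.random.default_rng(0)
data = rng.standard_normal(100000)
w = 0.5  # loss per example: 0.5 * (w - x)^2, so per-example gradient is w - x

def grad_var(batch, trials=2000):
    g = [np.mean(w - rng.choice(data, size=batch)) for _ in range(trials)]
    return np.var(g)

v32, v128 = grad_var(32), grad_var(128)
# quadrupling the batch size cuts the gradient-noise variance roughly by 4
assert 2.0 < v32 / v128 < 8.0
```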
<p>Which raises the question: <em>does SGD really behave like a diffusion process that mixes in the loss landscape?</em></p>
<h2 id="conventional-wisdom-challenged">Conventional Wisdom challenged</h2>
<p>We now describe the actual discoveries for normalized nets, which show that the above CW’s are quite off.</p>
<blockquote>
<p>(Against CW1): Full batch gradient descent $\neq$ gradient flow.</p>
</blockquote>
<p>It’s well known that if the LR is smaller than the inverse of the smoothness, then the trajectory of gradient descent stays close to that of gradient flow. But for normalized networks, the loss function is scale-invariant and thus provably non-smooth (i.e., the smoothness becomes unbounded) around the origin (<a href="https://arxiv.org/abs/1910.07454">Li&Arora, 2019</a>). We show that this non-smoothness issue is very real and makes training unstable, and even chaotic, for full-batch GD with any nonzero learning rate. This occurs empirically, and provably so for some toy losses.</p>
<div style="text-align:center;">
<img style="width:60%;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/gd_not_gf.png" />
</div>
<p><strong>Figure 1.</strong> WD makes GD on scale-invariant loss unstable and chaotic.
(a) Toy model with scale-invariant loss $L(x,y) = \frac{x^2}{x^2+y^2}$ (b)(c) Convergence never truly happens for ResNet trained on sub-sampled
CIFAR10 containing 1000 images with full-batch GD (without momentum). ResNet
can easily get to 100% training accuracy but then veers off. When WD is turned off at epoch 30000 it converges.</p>
<p>Note that WD plays a crucial role in this effect: without WD, the parameter norm increases monotonically
(<a href="https://arxiv.org/abs/1812.03981">Arora et al., 2018</a>), which implies SGD moves away from the origin at all times.</p>
<p>Savvy readers might wonder whether using a smaller LR could fix this issue. Unfortunately, getting close to the origin is unavoidable: once the gradient gets small, WD dominates the dynamics and shrinks the norm at a geometric rate, which in turn makes the gradient blow up again due to scale invariance! (This happens as long as the gradient becomes arbitrarily small but never exactly zero, as is the case in practice.)</p>
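<p>The scale-invariance argument above is easy to check numerically on the toy loss from Figure 1 (a minimal sketch; only the loss $L(x,y) = x^2/(x^2+y^2)$ comes from the figure, the specific numbers are illustrative):</p>

```python
import numpy as np

def L(w):
    # scale-invariant toy loss from Figure 1: L(x, y) = x**2 / (x**2 + y**2)
    x, y = w
    return x * x / (x * x + y * y)

def grad_L(w):
    # gradient of L: (2*x*y**2, -2*x**2*y) / (x**2 + y**2)**2
    x, y = w
    r2 = x * x + y * y
    return np.array([2 * x * y * y, -2 * x * x * y]) / r2**2

w = np.array([1.0, 2.0])
for c in [1.0, 0.1, 0.01]:
    # L is unchanged under rescaling, but the gradient norm grows like 1 / c,
    # so shrinking the parameter norm (as WD does) blows up the gradient
    print(c, L(c * w), np.linalg.norm(grad_L(c * w)))
```

<p>Shrinking the parameters by a factor $100$ leaves the loss untouched but multiplies the gradient norm by $100$, which is exactly the mechanism that makes convergence near the origin impossible.</p>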
<p>In fact, this is an excellent (and rare) place where early stopping is necessary even for correct optimization of the loss.</p>
<blockquote>
<p>(Against CW 2) Small LR can generalize as well as large LR.</p>
</blockquote>
<p>This actually was a prediction of the new theoretical analysis we came up with. We ran extensive experiments to test this prediction and found that initial large LR is <strong>not necessary</strong> to match the best performance, even when <em>all the other hyperparameters are fixed</em>. See Figure 2.</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/blog_sgd_8000_train_acc.png" />
</div>
<p><strong>Figure 2</strong>. ResNet trained on CIFAR10 with SGD using the normal LR schedule (baseline) as well as a schedule with a 100 times smaller initial LR. The latter matches the performance of the baseline after one more LR decay! Note that it needs 5000 epochs, 10x more than usual. See our paper for details. (Batch size is 128, WD is 0.0005, and LR is divided by 10 at each decay.)</p>
<p>Note that the surprise here is that generalization was not hurt by a drastically smaller LR even <em>when no other hyperparameter changes</em>. It is known empirically as well as rigorously (Lemma 2.4 in <a href="https://arxiv.org/abs/1910.07454">Li &amp; Arora, 2019</a>) that it is possible to compensate for a small LR with other hyperparameter changes.</p>
<blockquote>
<p>(Against CW 3) The random walk/SDE view of SGD is way off. There is no evidence of mixing as traditionally understood, at least within normal training times.</p>
</blockquote>
<p>Actually, evidence against global mixing already exists via the phenomenon of Stochastic Weight Averaging (SWA) (<a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a>): along the trajectory of SGD, if the network parameters from two different epochs are averaged, the average has lower test loss than either. Improvement via averaging continues to work for run times 10x longer than usual, as shown in Figure 3. However, the accuracy improvement does not happen for SWA between two solutions obtained from different initializations. Thus checking whether SWA helps distinguishes pairs of solutions drawn from the same trajectory from pairs drawn from different trajectories, which shows the diffusion process has not mixed to its stationary distribution within normal training times. (This is not surprising, since the theoretical analysis of mixing does not suggest it happens rapidly.)</p>
<div style="text-align:center;">
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_test_acc.png" />
<img style="width:300px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/swa_sgd_dist.png" />
</div>
<p><strong>Figure 3</strong>. Stochastic Weight Averaging improves the test accuracy of ResNet trained with
SGD on CIFAR10. <strong>Left:</strong> Test accuracy. <strong>Right:</strong> Pairwise distance between parameters from different epochs.</p>
<p>Actually <a href="https://arxiv.org/abs/1803.05407">Izmailov et al., 2018</a> already noticed that SWA rules out the possibility that SGD is a diffusion process mixing to a unique global equilibrium. They suggested instead that the trajectory of SGD might be well approximated by a multivariate Ornstein-Uhlenbeck (OU) process around the <em>local minimizer</em> $W^*$, assuming the loss surface is locally strongly convex. Since the corresponding stationary distribution is a multivariate Gaussian $N(W^*, \Sigma)$ centered at $W^*$, this would explain why SWA helps to reduce the training loss.</p>
<p>However, we note that this OU suggestion is also contradicted by the observation that the $\ell_2$ distance between the weights from epochs $T$ and $T+\Delta$ increases monotonically in $\Delta$ for every $T$ (see Figure 3), whereas in the OU process $\mathbf{E}[\|W_T - W_{T+\Delta}\|^2]$ should converge to the constant $2\,\mathrm{Tr}[\Sigma]$ as $T, \Delta \to +\infty$. This suggests that all these weights are correlated, unlike in the hypothesized OU process.</p>
<h2 id="so-whats-really-going-on">So what’s really going on?</h2>
<p>We develop a new theory (some parts rigorously proved and others supported by experiments) suggesting that <strong>LR doesn’t play the role assumed in most discussions.</strong></p>
<p>It’s widely believed that the LR $\eta$ controls the convergence rate of SGD and affects generalization by setting the magnitude of the noise, since $\eta$ scales the gradient update at each step.
However, for normalized networks trained with SGD + WD, the effect of LR is more subtle, as it now plays two roles: (1) the multiplier on the gradient of the loss, and (2) the multiplier on the WD term. Intuitively, one imagines the WD part is useless since the loss function is scale-invariant, so the first role must be the important one. Surprisingly, this intuition is completely wrong: it turns out that the second role matters far more than the first.
Further analysis shows that a better measure of the speed of learning is $\eta \lambda$, which we call the <em>intrinsic learning rate</em> or <em>intrinsic LR</em>, denoted $\lambda_e$.</p>
<p>While previous papers have noticed qualitatively that LR and WD interact closely, our ExpLR paper (<a href="https://arxiv.org/abs/1910.07454">Li &amp; Arora, 2019</a>) gave a mathematical proof that <em>if the product WD × LR, i.e., $\lambda\eta$, is fixed, then changing the LR is equivalent to rescaling the initial parameters</em>. As far as we can tell, the performance of SGD on modern architectures is quite robust to (indeed usually independent of) the scale of the initialization, so the effect of changing the initial LR while keeping the intrinsic LR fixed is also negligible.</p>
<p>Our paper gives insight into the role of intrinsic LR $\lambda_e$ by giving a new SDE-style analysis of SGD for normalized nets, leading to the following conclusion (which rests in part on experiments):</p>
<blockquote>
<p>In normalized nets SGD does indeed lead to rapid mixing, but in <strong>function space</strong> (i.e., input-output behavior of the net). Mixing happens after $O(1/\lambda_e)$ iterations, in contrast to the exponentially slow mixing guaranteed in the parameter space by traditional analysis of diffusion walks.</p>
</blockquote>
<p>To explain what mixing in function space means, view SGD (run for a fixed number of steps) as a way to sample a trained net from a distribution over trained nets. The end result of SGD from a fixed initialization can then be viewed as a probabilistic classifier whose output on any datapoint is the $K$-dimensional vector whose $i$th coordinate is the probability of outputting label $i$. (Here $K$ is the total number of labels.) Now if two different initializations both cause SGD to produce classifiers with error $5$ percent on held-out datapoints, then <em>a priori</em> one would imagine that on a given held-out datapoint the classifier from the first distribution <strong>disagrees</strong> with the classifier from the second distribution with roughly $2 \times 5 = 10$ percent probability. (More precisely, $2 \times 5 \times (1-0.05) = 9.5$ percent.) However, convergence to an equilibrium distribution in function space means the probability of disagreement is almost $0$, i.e., the distribution is almost the same regardless of the initialization! This is indeed what we find experimentally, to our big surprise. Our theory is built around this new phenomenon.</p>
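<p>The back-of-the-envelope disagreement calculation can be sanity-checked with a tiny simulation (a hypothetical sketch, under the simplifying assumption that the two classifiers err independently and disagree exactly when one of them errs):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, err = 200_000, 0.05  # two classifiers, each with 5% error

# each classifier is wrong independently with probability `err`;
# under our simplifying assumption they disagree exactly when
# one is wrong and the other is right
a_wrong = rng.random(n) < err
b_wrong = rng.random(n) < err
disagree = np.mean(a_wrong ^ b_wrong)

print(disagree)  # close to the analytic 2 * err * (1 - err) = 0.095
```

<p>Finding disagreement near $0$ instead of near $9.5$ percent is what signals that the two end-of-training distributions coincide in function space.</p>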
<div style="text-align:center;">
<img style="width:500px;" src="https://www.cs.princeton.edu/~zl4/small_lr_blog_images/additional_blog_image/conjecture.png" />
</div>
<p><strong>Figure 4</strong>: Three runs of a simple 4-layer normalized CNN on MNIST with different schedules converge to the same equilibrium after their intrinsic LRs become equal at epoch 81. We use Monte Carlo estimation ($500$ trials) to estimate the $\ell_1$ distances between distributions.</p>
<p>In the next post, we will explain our new theory and the partial new analysis of SDEs arising from SGD in normalized nets.</p>
Wed, 21 Oct 2020 15:00:00 -0700
http://offconvex.github.io/2020/10/21/intrinsicLR/
http://offconvex.github.io/2020/10/21/intrinsicLR/Beyond log-concave sampling<p>As the growing number of posts on this blog would suggest, recent years have seen a lot of progress in understanding optimization beyond convexity. However, optimization is only one of the basic algorithmic primitives in machine learning — it’s used by most forms of risk minimization and model fitting. Another important primitive is sampling, which is used by most forms of inference (i.e. answering probabilistic queries of a learned model).</p>
<p>It turns out that there is a natural analogue of convexity for sampling — <em>log-concavity</em>. Paralleling the state of affairs in optimization, we have a variety of (provably efficient) algorithms for sampling from log-concave distributions, under a variety of access models to the distribution. Log-concavity, however, is very restrictive and cannot model common properties of distributions we frequently wish to sample from in machine learning applications, for example multi-modality and manifold structure in the level sets, which is what we’ll focus on in this and the upcoming post.</p>
<p>Unlike non-convex optimization, the field of sampling beyond log-concavity is very nascent. In this post, we will survey the basic tools and difficulties for sampling beyond log-concavity. In the next post, we will survey recent progress in this direction, in particular with respect to handling multi-modality and manifold structure in the level sets, covering the papers <a href="https://arxiv.org/abs/1812.00793">Simulated tempering Langevin Monte Carlo</a> by Rong Ge, Holden Lee, and Andrej Risteski and <a href="https://arxiv.org/abs/2002.05576">Fast convergence for Langevin diffusion with matrix manifold structure</a> by Ankur Moitra and Andrej Risteski.</p>
<h1 id="formalizing-the-sampling-problem">Formalizing the sampling problem</h1>
<p>The formulation of the sampling problem we will consider is as follows:</p>
<blockquote>
<p><strong>Problem</strong>: Sample from a distribution $p(x) \propto e^{-f(x)}$ given black-box access to $f$ and $\nabla f$.</p>
</blockquote>
<p>This formalization subsumes a lot of inference tasks involving different kinds of probabilistic models. We give several common examples:</p>
<p><em>1. Posterior inference</em>: Suppose our data is generated from a model with <em>unknown</em> parameters $\theta$, such that the data-generation process is given by $p(x \mid \theta)$ and we have a prior $p(\theta)$ over the model parameters. Then the <em>posterior distribution</em> $p(\theta \mid x)$, by Bayes’s rule, is given by</p>
\[p(\theta \mid x) = \frac{p(x \mid \theta)p(\theta)}{p(x)}\propto p(x \mid \theta)p(\theta).\]
<p>A canonical example of this is a <em>noisy inference task</em> where a signal (parametrized by $\theta$ ) is perturbed by noise (as specified by $p(x \mid \theta)$ ).</p>
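<p>As a concrete toy instance of such a noisy inference task (illustrative, with made-up numbers): a standard Gaussian prior and unit-variance Gaussian noise give a posterior whose known closed form we can verify on a grid:</p>

```python
import numpy as np

# toy noisy-inference example: prior theta ~ N(0, 1), likelihood x | theta ~ N(theta, 1)
x_obs = 1.0
theta = np.linspace(-10.0, 10.0, 200_001)
dt = theta[1] - theta[0]

# unnormalized posterior: p(theta | x) ∝ p(x | theta) * p(theta)
unnorm = np.exp(-theta**2 / 2) * np.exp(-(x_obs - theta) ** 2 / 2)
post = unnorm / (unnorm.sum() * dt)

mean = (theta * post).sum() * dt
var = ((theta - mean) ** 2 * post).sum() * dt
print(mean, var)  # the analytic posterior is N(x_obs / 2, 1/2)
```

<p>For non-conjugate models no such closed form exists, which is exactly why black-box sampling from $p \propto e^{-f}$ is needed.</p>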
<p><em>2. Posteriors in latent-variable models</em>: If the data-generation process has a <em>latent (hidden) variable</em> $h$ associated to each data point, such that $h$ has a <em>known</em> prior $p(h)$ and a <em>known</em> conditional $p_\theta(x \mid h)$, then again by Bayes’s rule, we have</p>
\[p_\theta(h \mid x) = \frac{p_\theta(x \mid h)p_\theta(h)}{p_\theta(x)}\propto p_\theta(x \mid h)p_\theta(h).\]
<p>In typical latent-variable models, $p_\theta(x \mid h)$ and $p_\theta(h)$ have a simple parametric form, which makes it easy to evaluate $p_\theta(x \mid h)p_\theta(h)$ . Some examples of latent-variable models are mixture models (where $h$ encodes which component a sample came from), topic models (where $h$ denote the topic proportions in a document), and noisy-OR networks (and latent-variable Bayesian belief networks).</p>
<p><em>3. Sampling from energy models</em>: In energy models, the distribution of the data is parametrized as $p(x) \propto \exp(-E(x))$ for some <em>energy</em> function $E(x)$ which is smaller on points in the data distribution. Recent works by <a href="https://arxiv.org/abs/1907.05600">(Song, Ermon 2019)</a> and <a href="https://arxiv.org/abs/1903.08689">(Du, Mordatch 2019)</a> have scaled up the training of these models on images so that the visual quality of the samples they produce is comparable to that of more popular generative models like GANs and flow models.</p>
<p>The “exponential form” $e^{-f(x)}$ is also helpful in making an analogy to optimization. Namely, if we sample from $p(x)\propto e^{-f(x)}$, a particular point $x$ is more likely to be sampled if $f(x)$ is small. The key difference from optimization is that while in optimization we only want to reach the minimum, in sampling we want to pick points with the correct probabilities.</p>
<h1 id="comparison-with-optimization">Comparison with optimization</h1>
<p>The computational hardness landscape for our sampling problem parallels the one for black-box optimization, in which the goal is to find the minimum of a function $f$, given value/gradient oracle access. When $f$ is <em>convex</em>, there is a unique local minimum, so that local search algorithms like <em>gradient descent</em> are efficient. When $f$ is non-convex, gradient descent can get trapped in potentially poor local minima, and in the worst case, an exponential number of queries is needed.</p>
<p>Similarly, for sampling, when $p$ is <em>log-concave</em>, the distribution is unimodal and a Markov chain which is a close relative of gradient descent — <em>Langevin Monte Carlo</em> — is efficient. When $p$ is non-log-concave, Langevin Monte Carlo can get trapped in one of many modes, and an exponential number of queries may be needed.</p>
<blockquote>
<p>A distribution $p(x)\propto e^{-f(x)}$ is <strong>log-concave</strong> if $f(x) = -\log p(x)$ is convex. It is $\alpha$-strongly log-concave if $f(x)$ is $\alpha$-strongly convex.</p>
</blockquote>
<p>However, such worst-case hardness rarely stops practitioners from tackling the non-convex optimization and non-log-concave sampling problems that are ubiquitous in modern machine learning. Often they manage to do so with great success: for instance, in training deep neural networks, gradient descent and its relatives perform quite well. Similarly, Langevin Monte Carlo and its relatives can do quite well on non-log-concave problems, though they sometimes need to be aided by temperature heuristics and other tricks.</p>
<p>As theorists, we’d like to develop theory that will lead to a better understanding of why and when these heuristics work. Just like we’ve done for optimization, we need to be guided both by hardness results and relevant structure of real-world problems in this endeavour.</p>
<p>The following table summarizes the comparisons we have come up with:</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_opt.jpg" alt="" /></p>
<p>Before we move on to non-log-concave distributions, though, we need to understand the basic algorithm for sampling and its guarantees for log-concave distributions.</p>
<h1 id="langevin-monte-carlo">Langevin Monte Carlo</h1>
<p>Just as gradient descent is the canonical algorithm for optimization, <em>Langevin Monte Carlo</em> (LMC) is the canonical algorithm for our sampling problem. In a nutshell, it is gradient descent that also injects Gaussian noise:</p>
\[\text{Gradient descent:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t)\]
\[\text{Langevin Monte Carlo:}\quad
x_{t+\eta} = x_t - \eta \nabla f(x_t) + \sqrt{2\eta}\xi_t,\quad \xi_t\sim N(0,I)\]
<p>Both of these processes can be considered as discretizations of a continuous process. For gradient descent, the limit is an <em>ordinary differential equation</em>, and for Langevin Monte Carlo a <em>stochastic differential equation</em>:</p>
\[\text{Gradient flow:} \quad dx_t = -\nabla f(x_t) dt\]
\[\text{Langevin diffusion:} \quad dx_t = -\nabla f(x_t) dt + \sqrt{2} dB_t\]
<p>where $B_t$ denotes Brownian motion of the appropriate dimension.</p>
<p>The crucial property of the above stochastic differential equation is that under fairly mild assumptions on $f$, the stationary distribution is $p(x) \propto e^{-f(x)}$. (If you’re more comfortable with optimization, note that while gradient descent generally converges to (local) minima, the Gaussian noise term prevents LMC from converging to a single point - rather, it converges to a <em>stationary distribution</em>. See animation below.)</p>
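<p>A minimal numerical sketch (with illustrative parameters) of LMC targeting the standard Gaussian, $f(x) = x^2/2$, shows the iterates settling into the stationary distribution rather than collapsing to the minimum:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # target f(x) = x**2 / 2, so p(x) ∝ exp(-f(x)) is the standard Gaussian
    return x

eta, burn_in, n_steps = 0.01, 20_000, 200_000
x, samples = 0.0, []
for t in range(n_steps):
    # LMC step: a gradient descent step plus sqrt(2 * eta) Gaussian noise
    x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal()
    if t >= burn_in:
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())  # both close to the N(0, 1) values 0 and 1
```

<p>Without the noise term the same loop would converge to the single point $x = 0$; with it, the empirical mean and variance match the target distribution.</p>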
<p><img src="http://www.andrew.cmu.edu/user/aristesk/gd_ld_animated.gif" alt="" /></p>
<p>Langevin Monte Carlo fits in the <em>Markov Chain Monte Carlo</em> (MCMC) paradigm: design a random walk, so that the stationary distribution is the desired distribution. “Mixing” means getting close to the stationary distribution, and rapid mixing means this happens quickly.</p>
<p>Like in optimization, Langevin Monte Carlo is the most “basic” algorithm: for example, one can incorporate “acceleration” and obtain <em>underdamped</em> Langevin, or use the physics-inspired Hamiltonian Monte Carlo.</p>
<h1 id="tools-for-bounding-mixing-time-challenges-beyond-log-concavity">Tools for bounding mixing time, challenges beyond log-concavity</h1>
<p>To illustrate the difficulty in moving beyond log-concavity, we’ll describe the tools that are used to prove fast mixing for log-concave distributions, and where they fall short for non-log-concave distributions.</p>
<p>We will do this by an analogy to how we analyze random walks on graphs. One common way to prove rapid mixing of a random walk on a graph is to show the Laplacian has a spectral gap (equivalently, the transition matrix has a gap between the largest and next-to-largest eigenvalue). The analogue of this for Langevin diffusion is showing a <em>Poincaré inequality</em>. (A spectral gap of $1/C$ corresponds to Poincaré constant of $C$.)</p>
<blockquote>
<p>We say that $p(x)$ satisfies a <strong>Poincaré inequality</strong> with constant $C$ if for all functions $g$ on $\mathbb R^d$ (such that $g$ and $\nabla g$ are square-integrable with respect to $p$),</p>
<div> $$\text{Var}_p(g) \le C \int_{\mathbb R^d} \|\nabla g(x)\|^2 p(x)\,dx.$$ </div>
</blockquote>
<p>A small constant $C$ implies fast mixing in $\chi^2$ divergence, which implies fast mixing in total variation distance. More precisely, the mixing time for Langevin diffusion is on the order of $C$. We note that other functional inequalities imply mixing with respect to other measures (such as log-Sobolev inequalities for KL divergence).</p>
<p>While it may not be obvious what the Poincaré inequality has to do with a spectral gap, it turns out that we can think of the right-hand side as a quadratic form involving the <em>infinitesimal generator</em> of Langevin process, which functions as the continuous analogue of a Laplacian for a graph random walk.</p>
<p>The following table shows the analogy: we can put the discrete and continuous processes on the same footing by defining a quadratic form called the Dirichlet form from the Laplacian or infinitesimal generator.</p>
<p><img src="http://www.andrew.cmu.edu/user/aristesk/table_mixing.jpg" alt="" /></p>
<p>To see how the Poincaré inequality represents a spectral gap in the discrete case, we write it in a more explicit form in a familiar special case: a lazy random walk (i.e. a random walk that with probability $1/2$ stays in the current vertex, and with probability $1/2$ goes to a random neighbor) on a regular graph with $n$ vertices. In this case, $p$ is the uniform distribution, and $v_1=\mathbf 1,\ldots, v_n$ are the eigenvectors of $A$ with eigenvalues $1=\lambda_1\ge \lambda_2\ge \cdots \ge \lambda_n\ge 0$; normalize $v_1,\ldots, v_n$ so they have unit norm with respect to $p$, i.e. $\Vert v_i\Vert_p^2=\frac 1n\sum_j v_{ij}^2=1$.</p>
<p>Writing $g= \sum_i a_i v_i$, since $v_2,\ldots, v_n$ are orthogonal to $v_1=\mathbf 1$, we have $\langle g, \mathbf 1\rangle_p = a_1$, so</p>
\[\text{Var}_p(g) = \frac{1}{n}(\sum_i g_i^2) - a_1^2 = \sum_{i=2}^n a_i^2\]
<p>Furthermore, we have</p>
\[\langle g, Lg \rangle_p = \langle \sum_i a_iv_i, (I- A)(\sum_i a_iv_i)\rangle_p= \sum_{i=2}^n a_i^2(1-\lambda_i)\]
<p>For $i \ge 2$, each coefficient $1-\lambda_i$ is at least the <em>spectral gap</em> $1-\lambda_2$, so</p>
\[\langle g, Lg \rangle_p \ge (1-\lambda_2)\text{Var}_p(g),\]
<p>which shows the Poincaré inequality with constant $(1-\lambda_2)^{-1}$.</p>
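<p>The discrete calculation above can be verified numerically on a small example (a sketch using a lazy walk on the 8-cycle; here $A$ denotes the lazy transition matrix):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# lazy random walk on the n-cycle: stay with probability 1/2,
# otherwise move to a uniformly random neighbor
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5
    A[i, (i + 1) % n] = 0.25
    A[i, (i - 1) % n] = 0.25

lam = np.sort(np.linalg.eigvalsh(A))[::-1]  # 1 = lam[0] >= lam[1] >= ... >= 0
gap = 1 - lam[1]                            # spectral gap

p = np.full(n, 1 / n)                       # stationary distribution (uniform)
for _ in range(100):
    g = rng.standard_normal(n)
    var = np.sum(p * g**2) - np.sum(p * g) ** 2          # Var_p(g)
    dirichlet = g @ (np.diag(p) @ ((np.eye(n) - A) @ g)) # <g, Lg>_p
    # Poincaré inequality with constant 1 / gap
    assert dirichlet >= gap * var - 1e-9
```

<p>Every random test function satisfies the inequality with constant $(1-\lambda_2)^{-1}$, exactly as the eigenvector decomposition predicts.</p>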
<p>A classic theorem establishes a Poincaré inequality for (strongly) log-concave distributions.</p>
<blockquote>
<p><strong>Theorem (Bakry, Emery 1985)</strong>: If $p(x)$ is $\alpha$-strongly log-concave, then $p(x)$ satisfies a Poincaré inequality with constant $\frac1{\alpha}$.</p>
</blockquote>
<p>Hence, for strongly-log-concave distributions, Langevin diffusion mixes rapidly. To complete the picture, a line of recent works, starting with <a href="https://arxiv.org/abs/1412.7392">(Dalalyan 2014)</a> have established bounds for discretization error to obtain algorithmic guarantees for Langevin Monte Carlo.</p>
<p>However, guarantees break down when we don’t assume log-concavity. Generically, algorithms for sampling depend <em>exponentially</em> on the ambient dimension $d$, or on the “size” of the non-log-concave region (e.g., the distance between modes of the distribution). In terms of their dependence on $d$, they are not doing much better than if we split space into cells and sample each according to its probability, similar to “grid search” for optimization. This is unsurprising: we can’t hope for better guarantees without structural assumptions.</p>
<p>Toward this end, in the next blog post we will consider two kinds of structure that allow efficient sampling:</p>
<ol>
<li>Simple multimodal distributions, such as a mixture of Gaussians with equal variance.</li>
<li>Manifold structure, arising from symmetries in the level sets of the distribution.</li>
</ol>
Sat, 19 Sep 2020 07:00:00 -0700
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/
http://offconvex.github.io/2020/09/19/beyondlogconvavesampling/Training GANs - From Theory to Practice<p>GANs, originally introduced in the context of unsupervised learning, have had far-reaching implications for science, engineering, and society. However, training GANs remains challenging (in part) due to the lack of convergent algorithms for nonconvex-nonconcave min-max optimization. In this post, we present a <a href="https://arxiv.org/abs/2006.12376">new first-order algorithm</a> for min-max optimization which is particularly suited to GANs. This algorithm is guaranteed to converge to an equilibrium, is competitive in terms of time and memory with gradient descent-ascent, and, most importantly, GANs trained using it appear to be stable.</p>
<h2 id="gans-and-min-max-optimization">GANs and min-max optimization</h2>
<p>Starting with the work of <a href="http://papers.nips.cc/paper/5423-generative-adversarial-nets">Goodfellow et al.</a>, Generative Adversarial Nets (GANs) have become a critical component in various ML systems; for prior posts on GANs, see <a href="https://www.offconvex.org/2018/03/12/bigan/">here</a> for a post on GAN architecture, and <a href="https://www.offconvex.org/2017/03/15/GANs/">here</a> and <a href="https://www.offconvex.org/2017/07/06/GANs3/">here</a> for posts which discuss some of the many difficulties arising when training GANs.</p>
<p>Mathematically, a GAN consists of a generator neural network $\mathcal{G}$ and a discriminator neural network $\mathcal{D}$ that are competing against each other in a way that, together, they learn the unknown distribution from which a given dataset arises. The generator takes a random “noise” vector as input and maps this vector to a sample; for instance, an image. The discriminator takes samples – “fake” ones produced by the generator and “real” ones from the given dataset – as inputs. The discriminator then tries to classify these samples as “real” or “fake”. As a designer, we would like the generated samples to be indistinguishable from those of the dataset. Thus, our goal is to choose weights $x$ for the generator network that allow it to generate samples which are difficult for <em>any</em> discriminator to tell apart from real samples. This leads to a min-max optimization problem where we look for weights $x$ which <em>minimize</em> the rate (measured by a loss function $f$) at which any discriminator correctly classifies the real and fake samples. And, we seek weights $y$ for the discriminator network which <em>maximize</em> this rate.</p>
<blockquote>
<p><strong>Min-max formulation of GANs</strong> <br /> <br /></p>
\[\min_x \max_y f(x,y),\]
\[f(x,y) := \mathbb{E}[ f_{\zeta, \xi}(x,y)],\]
<p>where $\zeta$ is a random sample from the dataset, and $\xi \sim N(0,I_d)$ is a noise vector which the generator maps to a “fake” sample. $f_{\zeta, \xi}$ measures how accurately the discriminator $\mathcal{D}(y;\cdot)$ distinguishes $\zeta$ from $\mathcal{G}(x;\xi)$ produced by the generator using the input noise $\xi$.</p>
</blockquote>
<p>In this formulation, there are several choices that we have to make as a GAN designer, and an important one is that of a loss function. One concrete choice is from the paper of Goodfellow et al.: the cross-entropy loss function:</p>
\[f_{\zeta, \xi}(x,y) := \log(\mathcal{D}(y;\zeta)) + \log(1-\mathcal{D}(y;\mathcal{G}(x;\xi)))\]
<p>See <a href="https://machinelearningmastery.com/generative-adversarial-network-loss-functions/">here</a> for a summary and comparison of different loss functions.</p>
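<p>For intuition, the cross-entropy objective can be evaluated directly on hypothetical discriminator outputs (the numbers below are made up for illustration):</p>

```python
import numpy as np

def cross_entropy_gan_loss(d_real, d_fake):
    # f_{zeta, xi}(x, y) = log D(real) + log(1 - D(fake));
    # the discriminator maximizes this, the generator minimizes it
    return np.log(d_real) + np.log(1.0 - d_fake)

# a confident, correct discriminator gives a high objective value
print(cross_entropy_gan_loss(d_real=0.9, d_fake=0.1))
# a fooled discriminator (it thinks the fake is real) gives a much lower value
print(cross_entropy_gan_loss(d_real=0.9, d_fake=0.9))
```

<p>The generator lowers the objective precisely by pushing $\mathcal{D}(y;\mathcal{G}(x;\xi))$ toward $1$, i.e., by making fakes that the discriminator labels as real.</p>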
<p>Once we fix the loss function (and the architecture of the generator and discriminator), we can compute unbiased estimates of the value of $f$ and its gradients $\nabla_x f$ and $\nabla_y f$ using batches consisting of random Gaussian noise vectors $\xi_1,\ldots, \xi_n \sim N(0,I_d)$ and random samples from the dataset $\zeta_1, \ldots, \zeta_n$. For example, the stochastic batch gradient</p>
\[\frac{1}{n} \sum_{i=1}^n \nabla_x f_{\zeta_i, \xi_i}(x,y)\]
<p>gives us an unbiased estimate for $\nabla_x f(x,y)$.</p>
<blockquote>
<p>But how do we solve the min-max optimization problem above using such a first-order access to $f$?</p>
</blockquote>
<h2 id="gradient-descent-ascent-and-variants">Gradient descent-ascent and variants</h2>
<p>Perhaps the simplest algorithm we can try for min-max optimization is gradient descent-ascent (GDA). As the generator wants to minimize with respect to $x$ and the discriminator wants to maximize with respect to $y$, the idea is to do descent steps for $x$ and ascent steps for $y$. How exactly to do this is not clear, and one strategy is to let the generator and discriminator alternate:</p>
\[x_{i+1} = x_i -\nabla_x f(x_i,y_i),\]
\[y_{i+1} = y_i +\nabla_y f(x_i,y_i).\]
<p>Other variants include, for instance, <a href="https://arxiv.org/abs/1311.1869">optimistic mirror descent</a> (OMD) (see also <a href="https://arxiv.org/abs/1807.02629">here</a> and <a href="https://arxiv.org/abs/1711.00141">here</a> for applications of OMD to GANs, and <a href="https://arxiv.org/abs/1901.08511">here</a> for an analysis of OMD and related methods)</p>
\[x_{i+1} = x_i -2\nabla_x f(x_i,y_i) + \nabla_x f(x_{i-1},y_{i-1})\]
\[y_{i+1} = y_i +2\nabla_y f(x_i,y_i)- \nabla_y f(x_{i-1},y_{i-1}).\]
<p>The advantage of such algorithms is that they are quite practical. The problem, as we discuss next, is that they are not always guaranteed to converge. Most of these guarantees only hold for special classes of loss functions $f$ that satisfy properties such as concavity (see <a href="https://papers.nips.cc/paper/9430-efficient-algorithms-for-smooth-minimax-optimization.pdf">here</a> and <a href="https://arxiv.org/abs/1906.00331">here</a>) or <a href="https://papers.nips.cc/paper/9631-solving-a-class-of-non-convex-min-max-games-using-iterative-first-order-methods.pdf">monotonicity</a>, or under the assumptions that these algorithms are provided with special starting points (see <a href="https://arxiv.org/abs/1706.08500">here</a>, <a href="https://arxiv.org/abs/1910.07512">here</a>).</p>
<h2 id="convergence-problems-with-current-algorithms">Convergence problems with current algorithms</h2>
<p>Unfortunately there are simple functions for which some min-max optimization algorithms may never converge to <em>any</em> point. For instance GDA may not converge on $f(x,y) = xy$ (see Figure 1, and our <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a> for a more detailed discussion).</p>
<div>
<img src="/assets/GDA_spiral_2.gif" alt="" />
<br />
<b>Figure 1.</b> GDA on $f(x,y) = xy, \, \, \, \, x,y \in [-5,5]$ (the red line is the set of global min-max points). GDA is non-convergent from almost every initial point.
</div>
<p><br /></p>
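<p>The spiral in Figure 1 is easy to reproduce in a few lines (a minimal sketch with a made-up step size):</p>

```python
import numpy as np

eta = 0.1          # step size (illustrative)
x, y = 1.0, 1.0    # start away from the min-max point (0, 0)

norms = [np.hypot(x, y)]
for _ in range(100):
    # simultaneous gradient descent-ascent on f(x, y) = x * y:
    # grad_x f = y (descend in x), grad_y f = x (ascend in y)
    x, y = x - eta * y, y + eta * x
    norms.append(np.hypot(x, y))

# each step multiplies the norm by sqrt(1 + eta**2), so the iterates
# spiral outward instead of converging to (0, 0)
print(norms[-1] / norms[0])
```

<p>A short calculation confirms the picture: the update is a rotation composed with a dilation by $\sqrt{1+\eta^2} &gt; 1$, so GDA moves away from the equilibrium at a geometric rate for any nonzero step size.</p>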
<p>As for examples relevant to ML, when using GDA to train a GAN on a dataset consisting of points sampled from a mixture of four Gaussians in $\mathbb{R}^2$, we observe that GDA tends to cause the generator to cycle between different modes corresponding to the four Gaussians. We also used GDA to train a GAN on the subset of the MNIST digits which have “0” or “1” as their label, which we refer to as the 0-1 MNIST dataset. We observed a cycling behavior for this dataset as well: After learning how to generate images of $0$’s, the GAN trained by GDA then forgets how to generate $0$’s for a long time and only generates $1$’s.</p>
<div>
<img style="width:400px;" src="/assets/GDA_Gaussian.gif" alt="" />
<img style="width:400px;" src="/assets/GDA_MNIST.gif" alt="" />
<br />
<b>Figure 2.</b> Mode oscillation when GDA is used to train GANs on the four-Gaussian mixture dataset (left) and the 0-1 MNIST dataset (right).
</div>
<p><br /></p>
<p>In algorithms such as GDA where the discriminator only makes local updates, cycling can happen for the following reason: Once the discriminator learns to identify one of the modes (say mode “A”), the generator can update $x$ in a way that greatly decreases $f$, by (at least temporarily) “fooling” the discriminator. The generator does this by learning to generate samples from a different mode (say mode “B”) which the discriminator has not yet learned to identify, and stops generating samples from mode A. However, after many iterations, the discriminator “catches up” to the generator and learns how to identify mode B. Since the generator is no longer generating samples from mode A, the discriminator may then “forget” how to identify samples from this mode. And this can cause the generator to switch back to generating only mode A.</p>
<h2 id="our-first-order-algorithm">Our first-order algorithm</h2>
<p>To solve the min-max optimization problem, at any point $(x,y)$, we should ideally allow the discriminator to find the global maximum, $\max_z f(x,z)$. However, this may be hard for nonconcave $f$. But we could still let the discriminator run a convergent algorithm (such as gradient ascent) until it reaches a <strong>first-order stationary point</strong>, allowing it to compute an approximation $h$ for the global max function. (Note that even though $\max_z f(x,z)$ is only a function of $x$, since $h$ is a “local’’ approximation it could also depend on the initial point $y$ where we start gradient ascent.) And we also empower the generator to simulate the discriminator’s update by running gradient ascent (see <a href="https://arxiv.org/abs/2006.12376">our paper</a> for discriminators with access to a more general class of first-order algorithms).</p>
<blockquote>
<p><strong>Idea 1: Use a local approximation to global max</strong>
<br /><br />
Starting at the point $(x,y)$, update $y$ by computing multiple gradient ascent steps for $y$ until a point $w$ is reached where \(\|\nabla_y f(x,w)\|\) is close to zero and define $h(x,y) := f(x,w)$.</p>
</blockquote>
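<p>As a concrete illustration of Idea 1, here is a minimal numpy sketch: we run gradient ascent in $y$ until the gradient norm is small, and return the value $f(x,w)$ as $h(x,y)$. The toy objective and step sizes are our own choices, not from the paper.</p>

```python
import numpy as np

def local_max_approx(f, grad_y, x, y, lr=0.05, tol=1e-6, max_iter=10000):
    """Idea 1: gradient ascent on f(x, .) starting from y, until a point w with
    ||grad_y f(x, w)|| close to zero; returns h(x, y) := f(x, w) and w."""
    w = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_y(x, w)
        if np.linalg.norm(g) <= tol:  # (approximate) first-order stationarity
            break
        w = w + lr * g                # ascent step
    return f(x, w), w

# Toy example: f(x, y) = -(y - x)^2; for fixed x, the max over y is 0, at y = x.
f = lambda x, y: -float(np.sum((y - x) ** 2))
grad_y = lambda x, y: -2.0 * (y - x)

h, w = local_max_approx(f, grad_y, x=1.0, y=np.array([-3.0]))
# h is close to the true max value 0, and w is close to the maximizer y = 1.
```

Note how $h$ depends on the starting point $y$: starting gradient ascent elsewhere could land at a different stationary point of a multimodal $f(x,\cdot)$.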
<p>We would like the generator to minimize $h(\cdot,y)$. To minimize $h$, we would ideally like to update $x$ in the direction $-\nabla_x h$. However, $h$ may be discontinuous in $x$ (see our <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a> for why this can happen). Moreover, even at points where $h$ is differentiable, computing the gradient of $h$ can take a long time and requires a large amount of memory.</p>
<p>Thus, realistically, we only have access to the value of $h$. A naive approach to minimizing $h$ would be to propose a random update to $x$, for instance an update sampled from a standard Gaussian, and then only accept this update if it causes the value of $h$ to decrease. Unfortunately, this does not lead to fast algorithms as even at points where $h$ is differentiable, in high dimensions, a random Gaussian step will be almost orthogonal to the steepest descent direction $-\nabla_x h(x,y)$, making the progress slow.</p>
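<p>The near-orthogonality of random directions in high dimension is easy to check: the cosine similarity between a random Gaussian vector and any fixed direction concentrates around $0$ at rate $1/\sqrt{d}$. A quick sketch (our own illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
g = rng.normal(size=d)   # stand-in for the steepest descent direction
v = rng.normal(size=d)   # a random Gaussian proposal, independent of g
cos = float(g @ v / (np.linalg.norm(g) * np.linalg.norm(v)))
# |cos| is typically about 1/sqrt(d) = 0.01 here: the random step is almost
# orthogonal to the descent direction, so progress per step is tiny.
print(abs(cos))
```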
<p>Another idea is to have the generator propose at each iteration an update in the direction of the gradient $-\nabla_x f(x,y)$, and to then have the discriminator update $y$ using gradient ascent. To see why this may be a reasonable thing to do, notice that once the generator proposes an update $v$ to $x$, the discriminator will only make updates which increase the value of $f$, so $h(x+v,y) \geq f(x+v,y)$. And, since $y$ is a first-order stationary point for $f(x, \cdot)$ (because $y$ was computed using gradient ascent in the <em>previous</em> iteration), we also have that $h(x,y)=f(x,y)$. Hence, whenever the proposed update decreases $h$,</p>
\[f(x+v,y) \leq h(x+v,y) < h(x,y) = f(x,y).\]
<p><em>This means that decreasing $h$ requires us to decrease $f$ (the converse is not true). So it indeed makes sense to move in the direction $-\nabla_x f(x,y)$!</em></p>
<p>While making updates using $-\nabla_x f(x,y)$ may allow the generator to decrease $h$ more quickly than updating in a random direction, it is not always the case that updating in the direction of $-\nabla_x f$ will lead to a decrease in $h$ (and doing so may even lead to an increase in $h$!). Instead, our algorithm has the generator perform a random search by proposing an update in the direction of a batch gradient with mean $-\nabla_x f$, and accepts this move only if the value of $h$ (the local approximation) decreases. The accept-reject step prevents our algorithm from cycling between modes, and using the batch gradient for the random search allows our algorithm to be competitive with prior first-order methods in terms of running time.</p>
<blockquote>
<p><strong>Idea 2: Use zeroth-order optimization with batch gradients</strong>
<br /><br />
Sample a batch gradient $v$ with mean $-\nabla_x f(x,y)$.
<br />
If $h(x+ v, y) < h(x,y) $ accept the step $x+v$; otherwise reject it.</p>
</blockquote>
<p>A final issue, which applies even in the special case of minimization, is that converging to a <em>local</em> minimum point does not mean that point is desirable from an application standpoint. The same is true for the more general setting of min-max optimization. To help our algorithm escape undesirable local min-max equilibria, we use a randomized accept-reject rule inspired by <a href="https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7">simulated annealing</a>. Simulated annealing algorithms seek to minimize a function via a randomized search, while gradually decreasing the acceptance probability of this search; in some cases this allows one to reach the global minimum of a nonconvex function (see for instance <a href="https://arxiv.org/abs/1711.02621">this paper</a>). These three ideas lead us to our algorithm.</p>
<blockquote>
<p><strong>Our algorithm</strong>
<br /><br />
<em>Input</em>: Initial point $(x,y)$, $f: \mathbb{R}^d \times \mathbb{R}^d\rightarrow \mathbb{R}$
<br />
<em>Output:</em> A local min-max equilibrium $(x,y)$</p>
<p><br /> <br /></p>
<p>For $i = 1,2, \ldots$ <br />
<br />
<strong>Step 1:</strong> Generate a batch gradient $v$ with mean $-\nabla_x f(x,y)$ and propose the generator update $x+v$.
<br /><br />
<strong>Step 2:</strong> Compute $h(x+v, y) = f(x+v, w)$, by simulating a discriminator update $w$ via gradient ascent on $f(x+v, \cdot)$ starting at $y$.
<br /><br />
<strong>Step 3:</strong> If $h(x+v, y)$ is less than $h(x,y) = f(x,y)$, accept both updates: $(x,y) = (x+v, w)$. Else, accept both updates with some small probability.</p>
</blockquote>
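<p>To make the three steps concrete, here is a minimal sketch of the loop on a toy quadratic min-max problem. Everything here (the objective, step sizes, noise scale, and the decaying acceptance probability) is our own illustrative choice; see the paper and the GitHub code for the actual implementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(x, y) = (x - 1)^2 + 2*x*y - y^2.
# For fixed x, the max over y is at y = x, giving (x - 1)^2 + x^2, minimized at x = 1/2.
f = lambda x, y: (x - 1.0) ** 2 + 2.0 * x * y - y ** 2
grad_x = lambda x, y: 2.0 * (x - 1.0) + 2.0 * y
grad_y = lambda x, y: 2.0 * x - 2.0 * y

def ascend(x, w, lr=0.1, steps=300):       # Step 2: simulate the discriminator
    for _ in range(steps):
        w = w + lr * grad_y(x, w)
    return w

x, y = 3.0, 0.0
y = ascend(x, y)                            # start y at a stationary point of f(x, .)
h_cur = f(x, y)
accept_prob = 0.05
for t in range(500):
    # Step 1: "batch" gradient = true gradient plus zero-mean noise
    v = -0.05 * (grad_x(x, y) + rng.normal(scale=0.1))
    w = ascend(x + v, y)                    # Step 2: h(x + v, y) = f(x + v, w)
    h_new = f(x + v, w)
    # Step 3: accept improvements; else accept with a small (decaying) probability
    if h_new < h_cur or rng.random() < accept_prob * 0.99 ** t:
        x, y, h_cur = x + v, w, h_new
# x is now close to the global min-max point x = 1/2, with y close to x.
```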
<p>In our paper, we show that our algorithm is guaranteed to converge to a type of local min-max equilibrium in $\mathrm{poly}(\frac{1}{\varepsilon},d, b, L)$ time whenever $f$ is bounded by some $b>0$ and has $L$-Lipschitz gradients. Our algorithm does not require any special starting points, or any additional assumptions on $f$ such as convexity or monotonicity. (See Definition 3.2 and Theorem 3.3 in our paper.)</p>
<div>
<img style="width:400px;" src="/assets/GDA_spiral_2.gif" alt="" />
<img style="width:400px;" src="/assets/OurAlgorithm_surface_run1.gif" alt="" />
<br />
<b>Figure 3.</b> GDA (left) and a version of our algorithm (right) on $f(x,y) = xy, \, \, \, \, x,y \in [-5,5]$. While GDA is non-convergent from almost every initial point, our algorithm converges to the set of global min-max points (the red line). To ensure it converges to a (local) equilibrium, our algorithm's generator proposes multiple updates, simulates the discriminator's response, and rejects updates which do not lead to a net decrease in $f$. It only stops if it can't find such an update after many attempts. (To stay inside $[-5,5]\times [-5,5]$ this version of our algorithm uses <i>projected</i> gradients.)
</div>
<p><br /></p>
<h2 id="so-how-does-our-algorithm-perform-in-practice">So, how does our algorithm perform in practice?</h2>
<p>When training a GAN on the mixture of four Gaussians dataset, we found that our algorithm avoids the cycling behavior observed in GDA. We ran each algorithm multiple times, and evaluated the results visually. By the 1500’th iteration GDA had learned only one mode in 100% of the runs, and tended to cycle between two or more modes. In contrast, our algorithm was able to learn all four modes in 68% of the runs, and three modes in 26% of the runs.</p>
<div>
<img src="/assets/Both_algorithms_Gaussian.gif" alt="" />
<br />
<b>Figure 4.</b> GAN trained using GDA and our algorithm on a four Gaussian mixture dataset. While GDA cycles between the Gaussian modes (red dots), our algorithm learns all four modes.
</div>
<p><br /></p>
<p>When training on the 0-1 MNIST dataset, we found that GDA tends to briefly generate shapes that look like a combination of $0$’s and $1$’s, then switches to generating only $1$’s, and then re-learns how to generate $0$’s. In contrast, our algorithm seems to learn how to generate both $0$’s and $1$’s early on and does not stop generating either digit. We repeated this simulation multiple times for both algorithms, and visually inspected the images at the 1000’th iteration. GANs trained using our algorithm generated both digits by the 1000’th iteration in 86% of the runs, while those trained using GDA only did so in 23% of the runs.</p>
<div>
<img src="/assets/MNIST_bothAlgorithms.gif" alt="" />
<br />
<b>Figure 5.</b> We trained a GAN with GDA and our algorithm on the
0-1 MNIST dataset. During the first 1000 iterations, GDA (left)
forgets how to generate $0$'s, while our algorithm (right) learns how to
generate both $0$'s and $1$'s early on and does not stop generating either digit.
</div>
<p><br /></p>
<p>While here we have focused on comparing our algorithm to GDA, in our paper we also include a comparison to <a href="https://arxiv.org/abs/1611.02163">Unrolled GANs</a>, which exhibits cycling between modes. We also present results for CIFAR-10 (see Figures 3 and 7 in our paper), where we compute FID scores to track the progress of our algorithm. See our paper for more details; the code is available on <a href="https://github.com/mangoubi/Min-max-optimization-algorithm-for-training-GANs">GitHub</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we have shown how to develop a practical and convergent first-order algorithm for training GANs. Our algorithm synthesizes an approximation to the global max function based on first-order algorithms, random search using batch gradients, and simulated annealing. Our simulations show that a version of this algorithm can lead to more stable training of GANs. And yet the amount of memory and time required by each iteration of our algorithm is competitive with GDA. This post, together with the <a href="https://www.offconvex.org/2020/06/24/equilibrium-min-max/">previous post</a>, shows that different local approximations to the global max function $\max_z f(x,z)$ can lead to different types of convergent algorithms for min-max optimization. We believe that this idea should be useful in other applications of min-max optimization.</p>
Mon, 06 Jul 2020 02:00:00 -0700
http://offconvex.github.io/2020/07/06/GAN-min-max/
An equilibrium in nonconvex-nonconcave min-max optimization<p>While there has been incredible progress in convex and nonconvex minimization, a multitude of problems in ML today are in need of efficient algorithms to solve min-max optimization problems.
Unlike minimization, where algorithms can always be shown to converge to some local minimum, min-max optimization has no notion of local equilibrium that is guaranteed to exist for general nonconvex-nonconcave functions.
In two recent papers, we give two notions of local equilibria that are guaranteed to exist and efficient algorithms to compute them.
In this post we present the key ideas behind a second-order notion of local min-max equilibrium from <a href="https://arxiv.org/abs/2006.12363">this paper</a>; in the next post we will present a different notion, along with an algorithm, and show its implications for GANs from <a href="https://arxiv.org/abs/2006.12376">this paper</a>.</p>
<h2 id="min-max-optimization">Min-max optimization</h2>
<p>Min-max optimization of an objective function $f:\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$</p>
\[\min_x \max_y f(x,y)\]
<p>is a powerful framework in optimization, economics, and ML as it allows one to model learning in the presence of multiple agents with competing objectives.
In ML applications, such as <a href="https://arxiv.org/abs/1406.2661">GANs</a> and <a href="https://adversarial-ml-tutorial.org">adversarial robustness</a>, the min-max objective function may be nonconvex-nonconcave.
We know that min-max optimization is at least as hard as minimization; hence, we cannot hope to find a globally optimal solution to min-max problems for general functions.</p>
<h2 id="approximate-local-minima-for-minimization">Approximate local minima for minimization</h2>
<p>Let us first revisit the special case of minimization, where there is a natural notion of an approximate second-order local minimum.</p>
<blockquote>
<p>$x$ is a second-order $\varepsilon$-local minimum of $\mathcal{L}:\mathbb{R}^d\rightarrow \mathbb{R}$ if
\(\|\nabla \mathcal{L}(x)\| \leq \varepsilon \ \ \mathrm{and} \ \ \nabla^2 \mathcal{L}(x) \succeq -\sqrt{\varepsilon}.\)</p>
</blockquote>
<p>Now suppose we just wanted to minimize a function $\mathcal{L}$, and we start from any point which is <em>not</em> at an $\varepsilon$-local minimum of $\mathcal{L}$.
Then we can always find a direction to travel in along which either $\mathcal{L}$ decreases rapidly, or the second derivative of $\mathcal{L}$ is large.
By searching in such a direction we can easily find a new point which has a smaller value of $\mathcal{L}$ using only local information about the gradient and Hessian of $\mathcal{L}$.
This means that we can keep decreasing $\mathcal{L}$ until we reach an $\varepsilon$-local minimum (see <a href="https://www.researchgate.net/profile/Boris_Polyak2/publication/220589612_Cubic_regularization_of_Newton_method_and_its_global_performance/links/09e4150dd2f0320879000000/Cubic-regularization-of-Newton-method-and-its-global-performance.pdf">Nesterov and Polyak</a>, <a href="https://dl.acm.org/doi/10.1145/3055399.3055464">here</a>, <a href="http://proceedings.mlr.press/v40/Ge15.pdf">here</a>, and also an earlier <a href="https://www.offconvex.org/2016/03/22/saddlepoints">blog post</a> for how to do this with only access to gradients of $\mathcal{L}$).
If $\mathcal{L}$ is Lipschitz smooth and bounded, we will reach an $\varepsilon$-local minimum in polynomial time from any starting point.</p>
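<p>To make the definition concrete, here is a small numpy sketch (our own illustration) that checks the two conditions and shows the escape argument on the saddle of $\mathcal{L}(x,y) = x^2 - y^2$: the gradient vanishes at the origin, but the Hessian has a large negative eigenvalue, and stepping along the corresponding eigenvector decreases $\mathcal{L}$.</p>

```python
import numpy as np

def is_eps_local_min(grad, hess, x, eps):
    """Second-order eps-local-minimum test:
    ||grad L(x)|| <= eps  and  lambda_min(hess L(x)) >= -sqrt(eps)."""
    small_grad = np.linalg.norm(grad(x)) <= eps
    lam_min = float(np.linalg.eigvalsh(hess(x)).min())
    return small_grad and lam_min >= -np.sqrt(eps)

# L(x, y) = x^2 - y^2: the origin is stationary but has negative curvature.
L = lambda p: p[0] ** 2 - p[1] ** 2
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
hess = lambda p: np.diag([2.0, -2.0])

origin = np.zeros(2)
print(is_eps_local_min(grad, hess, origin, eps=1e-2))  # False

# Escape: move along the eigenvector of the most negative eigenvalue.
evals, evecs = np.linalg.eigh(hess(origin))  # eigenvalues in ascending order
u = evecs[:, 0]                              # direction of eigenvalue -2
p = origin + 0.5 * u
print(L(p) < L(origin))                      # True: L decreased
```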
<blockquote>
<p>Is there an analogous definition with similar properties for min-max optimization?</p>
</blockquote>
<h2 id="problems-with-current-local-optimality-notions">Problems with current local optimality notions</h2>
<p>There has been much recent work on extending theoretical results in nonconvex minimization to min-max optimization (see <a href="https://arxiv.org/abs/1906.00331">here</a>, <a href="https://papers.nips.cc/paper/9430-efficient-algorithms-for-smooth-minimax-optimization">here</a>, <a href="https://arxiv.org/pdf/1807.02629.pdf">here</a>, <a href="https://papers.nips.cc/paper/9631-solving-a-class-of-non-convex-min-max-games-using-iterative-first-order-methods.pdf">here</a>, and <a href="https://arxiv.org/abs/1910.07512">here</a>).
One way to extend the notion of local minimum to the min-max setting is to seek a solution point called a “local saddle”: a point $(x,y)$ where 1) $y$ is a local maximum for $f(x, \cdot)$ and 2) $x$ is a local minimum for $f(\cdot, y).$</p>
<p>For instance,
this is used <a href="https://arxiv.org/abs/1706.08500">here</a>, <a href="https://arxiv.org/pdf/1901.00838.pdf">here</a>, <a href="https://arxiv.org/pdf/1705.10461.pdf">here</a>, and <a href="http://proceedings.mlr.press/v89/adolphs19a.html">here</a>.
But, there are very simple examples of two-dimensional bounded functions where a local saddle does not exist.</p>
<blockquote>
<p>For instance, consider $f(x,y) = \sin(x+y)$ from <a href="https://arxiv.org/abs/1902.00618">here</a>. One can check that no point of this function is simultaneously a local minimum in $x$ and a local maximum in $y$.</p>
</blockquote>
<p>The fact that no local saddle exists may be surprising, since an $\varepsilon$-global solution to a min-max optimization problem <em>is</em> guaranteed to exist as long as the objective function is uniformly bounded.
Roughly, this is because, in a global min-max setting, the max-player is empowered to globally maximize the function $f(x,\cdot)$, and the min-player is empowered to minimize the “global max” function $\max_y f(\cdot, y)$.</p>
<p>The ability to compute the global max allows the min-player to predict the max-player’s response.
If $x$ is a global minimum of $\max_y f(\cdot, y)$, the min-player is aware of this fact and will have no incentive to update $x$.
On the other hand, if the min-player can only simulate the max-player’s updates locally (as in local saddle),
then the min-player may try to update her strategy even when it leads to a net increase in $f$.
This can happen because the min-player is not powerful enough to accurately simulate the max-player’s response. (See a <a href="https://arxiv.org/abs/1902.00618">related notion</a> of local optimality with similar issues due to vanishingly small updates.)</p>
<p>The fact that players who can only make local predictions are
unable to predict their opponents’ responses can lead to convergence problems in many popular algorithms such as
gradient descent ascent (GDA). This non-convergence behavior can occur if the function has no local saddle point (e.g. the function $\sin(x+y)$ mentioned above), and can even happen on some functions, like $f(x,y) = xy$, which do have a local saddle point.</p>
<div style="text-align:center;">
<img src="/assets/GDA_spiral_fast.gif" alt="" />
<br />
<b>Figure 1.</b> GDA spirals off to infinity from almost every starting point on the objective function $f(x,y) = xy$.
</div>
<p><br /></p>
<h2 id="greedy-max-a-computationally-tractable-alternative-to-global-max">Greedy max: a computationally tractable alternative to global max</h2>
<p>To allow for a more stable min-player, and a more stable notion of local optimality, we would like to empower the min-player to more effectively simulate the max-player’s response.
While the notion of global min-max does exactly this by having the min-player compute the global max function $\max_y(f(\cdot,y))$, computing the global maximum may be intractable.</p>
<p>Instead, we replace the global max function $\max_y (f(\cdot ,y))$ with a computationally tractable alternative.
Towards this end, we restrict the max-player’s response, and the min-player’s simulation of this response, to updates which can be computed using any algorithm from a class of second-order optimization algorithms.
More specifically, we restrict the max-player to updating $y$ by traveling along continuous paths which start at the current value of $y$ and along which either $f$ is increasing or the second derivative of $f$ is positive. We refer to such paths as greedy paths since they model a class of second-order “greedy” optimization algorithms.</p>
<blockquote>
<p><strong>Greedy path:</strong> A unit-speed path $\varphi:[0,\tau] \rightarrow \mathbb{R}^d$ is greedy if $f$ is non-decreasing over this path, and for every $t\in[0,\tau]$
\(\frac{\mathrm{d}}{\mathrm{d}t} f(x, \varphi(t)) > \varepsilon \ \ \textrm{or} \ \ \frac{\mathrm{d}^2}{\mathrm{d}t^2} f(x, \varphi(t)) > \sqrt{\varepsilon}.\)</p>
</blockquote>
<p>Roughly speaking, when restricted to updates obtained from greedy paths, the max-player will always be able to reach a point which is an approximate local maximum for $f(x,\cdot)$, although there may not be a greedy path which leads the max-player to a global maximum.</p>
<div style="text-align:center;">
<img style="width:400px;" src="/assets/greedy_region_omega_t.png" alt="" /> <img style="width:400px;" src="/assets/global_max_path_no_axes_t.png" alt="" />
<br />
<b>Figure 2.</b> <i>Left:</i> The light-colored region $\Omega$ is reachable from the initial point $A$ by a greedy path; the dark region is not reachable. <i>Right:</i> There is always a greedy path from any point $A$ to a local maximum ($B$), but a global maximum ($C$) may not be reachable by any greedy path.
</div>
<p><br /></p>
<p>To define an alternative to $\max_y(f(\cdot,y))$, we consider the local maximum point with the largest value of $f(x,\cdot)$ attainable from a given starting point $y$ by any greedy path.
We refer to the value of $f$ at this point as the <em>greedy max function</em>, and denote this value by $g(x,y)$.</p>
<blockquote>
<p><strong>Greedy max function:</strong>
$g(x,y) = \max_{z \in \Omega} f(x,z),$
where $\Omega$ is the set of points reachable from $y$ by a greedy path.</p>
</blockquote>
<h2 id="our-greedy-min-max-equilibrium">Our greedy min-max equilibrium</h2>
<p>We use the greedy max function to define a new second-order notion of local optimality for min-max optimization, which we refer to as a greedy min-max equilibrium.
Roughly speaking, we say that $(x,y)$ is a greedy min-max equilibrium if
1) $y$ is a local maximum for $f(x,\cdot)$ (and hence the endpoint of a greedy path), and
2) $x$ is a local minimum of the greedy max function $g(\cdot,y)$.</p>
<p>In other words, $x$ is a local minimum of $\max_y f(\cdot, y)$ under the constraint that the maximum is computed only over the set of greedy paths starting at $y$.
Unfortunately, even if $f$ is smooth, the greedy max function may not be differentiable with respect to $x$ and may even be discontinuous.</p>
<div style="text-align:center;">
<img src="/assets/discontinuity2_grid_t.png" width="400" alt="" /> <img src="/assets/discontinuity2g_grid_t.png" width="400" alt="" />
<br />
<b>Figure 3.</b> <i>Left:</i> If we change the first argument from $x$ to a nearby value $\hat{x}$, the largest value of $f$ reachable by a greedy path undergoes a discontinuous change. <i>Right:</i> This means the greedy max function $g(x,y)$ is discontinuous in $x$.</div>
<p><br /></p>
<p>This creates a problem, since the definition of $\varepsilon$-local minimum only applies to smooth functions.</p>
<p>To solve this problem we would ideally like to smooth $g$ by convolution with a Gaussian.
Unfortunately, convolution can cause the local minima of a function to “shift”: a point which is a local minimum for $g$ may no longer be a local minimum for the convolved version of $g$ (to see why, try convolving the function $f(x) = x - 3x I(x\leq 0) + I(x \leq 0)$ with a Gaussian $N(0,\sigma^2)$ for any $\sigma>0$).
To avoid this, we instead consider a “truncated” version of $g$, and then convolve this function in the $x$ variable with a Gaussian to obtain our smoothed version of $g$.</p>
<p>This allows us to define a notion of greedy min-max equilibrium. We say that a point $(x^\star, y^\star)$ is a greedy min-max equilibrium if $y^\star$ is an approximate local maximum of $f(x^\star, \cdot)$, and $x^\star$ is an $\varepsilon$-local minimum of this smoothed version of $g(\cdot, y^\star)$.</p>
<blockquote>
<p><b>Greedy min-max equilibrium:</b>
$(x^{\star}, y^{\star})$ is an $\varepsilon$-greedy min-max equilibrium if
<br />
\(\|\nabla_y f(x^\star,y^\star)\| \leq \varepsilon, \qquad \nabla^2_y f(x^\star,y^\star) \preceq \sqrt{\varepsilon},\)
<br />
\(\|\nabla_x S(x^{\star},y^{\star})\| \leq \varepsilon, \qquad \nabla^2_x S(x^{\star},y^{\star}) \succeq -\sqrt{\varepsilon},\)
<br />
where $S(x,y):= \mathrm{smooth}_x(\mathrm{truncate}(g(x, y)))$.</p>
</blockquote>
<p>Any point which is a local saddle point (discussed earlier) also satisfies our equilibrium conditions. The converse, however, does not hold in general, since a local saddle point may not even exist. Further, for compactly supported convex-concave functions a point is a greedy min-max equilibrium (in an appropriate sense) if and only if it is a global min-max point. (See Section 7 and Appendix A respectively in <a href="https://arxiv.org/abs/2006.12363">our paper</a>.)</p>
<h2 id="greedy-min-max-equilibria-always-exist-and-can-be-found-efficiently">Greedy min-max equilibria always exist! (And can be found efficiently)</h2>
<p>In <a href="https://arxiv.org/abs/2006.12363">this paper</a> we show: A greedy min-max equilibrium is always guaranteed to exist provided that $f$ is uniformly bounded with Lipschitz Hessian. We do so by providing an algorithm which converges to a greedy min-max equilibrium, and, moreover, we show that it is able to do this in polynomial time from any initial point:</p>
<blockquote>
<p><b>Main theorem:</b> Suppose that we are given access to a smooth function $f:\mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ and to its gradient and Hessian. And suppose that $f$ is uniformly bounded by $b>0$ and has $L$-Lipschitz Hessian.
Then given any initial point, our algorithm returns an $\varepsilon$-greedy min-max equilibrium $(x^\star,y^\star)$ of $f$ in $\mathrm{poly}(b, L, d, \frac{1}{\varepsilon})$ time.</p>
</blockquote>
<p>There are a number of difficulties that our algorithm and proof must overcome:
One difficulty in designing an algorithm is that the greedy max function may be discontinuous.
To find an approximate local minimum of a discontinuous function, our algorithm combines a Monte-Carlo hill climbing algorithm with a <a href="https://arxiv.org/abs/cs/0408007">zeroth-order optimization version</a> of stochastic gradient descent.
Another difficulty is that, while one can easily compute a greedy path from any starting point, there may be many different greedy paths which end up at different local maxima.
Searching for the greedy path which leads to the local maximum point with the largest value of $f$ may be infeasible.
In other words the greedy max function $g$ may be intractable to compute.</p>
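<p>As a flavor of the zeroth-order component, here is a sketch (ours, not the paper's algorithm) of gradient-free stochastic descent: the gradient is replaced by a two-point finite-difference estimate along a random Gaussian direction, so only function values are needed.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def zeroth_order_sgd(f, x, lr=0.1, delta=1e-3, steps=4000):
    """Minimize f using only function evaluations: estimate the directional
    derivative along a random Gaussian direction u and step along -u."""
    d = x.size
    for _ in range(steps):
        u = rng.normal(size=d)
        # two-point estimate of the directional derivative <grad f(x), u>
        du = (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta)
        x = x - (lr / d) * du * u   # E[u u^T] = I, so E[du * u] = grad f(x)
        lr *= 0.999                 # gentle step-size decay
    return x

# Minimize a smooth quadratic with gradient access replaced by evaluations.
f = lambda x: float(np.sum((x - 2.0) ** 2))
x_star = zeroth_order_sgd(f, np.zeros(5))
# x_star is close to the minimizer (2, ..., 2).
```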
<div style="text-align:center;">
<img src="/assets/greedy_paths_no_axes_t.png" width="400" alt="" />
<br />
<b>Figure 4.</b> There are many different greedy paths that start at the same point $A$. They can end up at different local maxima ($B$, $D$), with different values of $f$. In many cases it may be intractable to search over all these paths to compute the greedy max function.
</div>
<p><br /></p>
<p>To get around this problem, rather than computing the exact value of $g(x,y)$, we instead compute a lower bound $h(x,y)$ for the greedy max function. Since we are able to obtain this lower bound by computing only a <em>single</em> greedy path, it is much easier to compute than the greedy max function itself.</p>
<p>In our paper, we prove that if 1) $x^\star$ is an approximate local minimum for this lower bound $h(\cdot, y^\star)$, and 2) $y^\star$ is an approximate local maximum for $f(x^\star, \cdot)$, then $x^\star$ is also an approximate local minimum for the greedy max function $g(\cdot, y^\star)$.
This allows us to design an algorithm which obtains a greedy min-max point by minimizing the computationally tractable lower bound $h$, instead of the greedy max function which may be intractable to compute.</p>
<h2 id="to-conclude">To conclude</h2>
<p>In this post we have shown how to extend a notion of second-order equilibrium from minimization to min-max optimization; the resulting equilibrium is guaranteed to exist for any bounded Lipschitz function with Lipschitz gradient and Hessian.
We have also shown that our algorithm is able to find this equilibrium in polynomial time from any initial point.</p>
<blockquote>
<p>Our results do not require any additional assumptions such as convexity, monotonicity, or sufficient bilinearity.</p>
</blockquote>
<p>In an upcoming blog post we will show how one can use some of the ideas from here to obtain a new min-max optimization algorithm with applications to stably training GANs.</p>
Wed, 24 Jun 2020 03:00:00 -0700
http://offconvex.github.io/2020/06/24/equilibrium-min-max/