Effect of Depth and Width on Local Minima in Deep Learning

In this paper, we analyze the effects of depth and width on the quality of local minima, without the strong over-parameterization and simplification assumptions made in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve towards the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic dataset as well as the MNIST, CIFAR-10 and SVHN datasets. When compared to previous studies with strong over-parameterization assumptions, the results in this paper do not require over-parameterization, and instead show the gradual effects of over-parameterization as consequences of general results.


Introduction
Deep learning with neural networks has achieved significant practical success in many fields, including computer vision, machine learning and artificial intelligence. Along with its practical success, deep learning has been theoretically analyzed and shown to be attractive in terms of its expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno et al., 1993; Barron, 1993), and deeper neural networks enable us to approximate functions of certain classes with fewer parameters (Montufar et al., 2014; Livni et al., 2014; Telgarsky, 2016). However, training deep learning models requires us to work with a seemingly intractable problem, namely non-convex and high-dimensional optimization. Finding a global minimum of a general non-convex function is NP-hard (Murty and Kabadi, 1987), and non-convex optimization to train certain types of neural networks is also known to be NP-hard (Blum and Rivest, 1992). These hardness results pose a serious concern only for high-dimensional problems, because global optimization methods can efficiently approximate global minima without convexity in relatively low-dimensional problems (Kawaguchi et al., 2015).
Therefore, a hope is that, beyond the worst-case scenarios, practical deep learning allows some additional structure or assumption that makes the non-convex high-dimensional optimization tractable. Recently, it has been shown, under strong simplification assumptions, that there are novel loss landscape structures in deep learning optimization, which may play a role in making the optimization tractable (Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016). Another key observation is that if a neural network is strongly over-parameterized so that it can memorize any dataset of a fixed size, then all stationary points (including all local minima and saddle points) become global minima, under some non-degeneracy assumptions. This observation was explained by Livni et al. (2014) and further refined by Nguyen and Hein (2017, 2018). However, these previous results (Livni et al., 2014; Nguyen and Hein, 2017, 2018) require strong over-parameterization: they assume not only that a network's width is larger than the dataset size, but also that optimizing a single layer alone (the last layer or some hidden layer) can memorize any dataset, based on an assumed condition on the rank or non-degeneracy of the other layers.
In this paper, we analyze the effects of depth and width on the values of local minima, without the strong over-parameterization and simplification assumptions in the literature. As a result, we prove quantitative upper bounds on the quality of local minima, which shows that the values of local minima of neural networks are guaranteed to be no worse than the globally optimal values of corresponding classical machine learning models, and the guarantee can improve as depth and width increase.

Preliminaries
Let H be the number of hidden layers and d_l be the width (or equivalently, the number of units) of the l-th hidden layer. To theoretically analyze concrete phenomena, this paper focuses on fully-connected feedforward networks with various depths H ≥ 1 and widths d_l ≥ 1, using rectified linear units (ReLUs), leaky ReLUs and absolute value activations, evaluated with the squared loss function. In this paper, the (finite) depth H can be arbitrarily large and the (finite) widths d_l can arbitrarily differ among different layers. This section defines the optimization problem considered in this paper and introduces basic notation.

Problem formulation
Let x ∈ R^{d_0} and y ∈ R^{d_{H+1}} be an input vector and a target vector, respectively. Let {(x_i, y_i)}_{i=1}^m be a training dataset of size m. Given a set of n matrices or vectors {M^(j)}_{j=1}^n, define [M^(j)]_{j=1}^n := [M^(1) M^(2) ⋯ M^(n)] to be the block matrix whose column blocks are M^(1), M^(2), …, M^(n). Define the training data matrices as X := ([x_i]_{i=1}^m)^T ∈ R^{m×d_0} and Y := ([y_i]_{i=1}^m)^T ∈ R^{m×d_{H+1}}. Let θ ∈ R^{d_θ} be the vector consisting of all trainable parameters, which determines the entries of the weight matrix W^(l)(θ) ∈ R^{d_{l−1}×d_l} at every layer l via vec([W^(l)(θ)]_{l=1}^{H+1}) = θ. Here, d_θ := Σ_{l=1}^{H+1} d_{l−1} d_l is the number of trainable parameters. Given an input matrix X and a parameter vector θ, the output prediction matrix Ŷ(X, θ) ∈ R^{m×d_{H+1}} of a fully-connected feedforward network can be written as

Ŷ(X, θ) := X^(H)(X, θ) W^(H+1)(θ),   (1)

where X^(l)(X, θ) ∈ R^{m×d_l} is the post-activation output of the l-th hidden layer,

X^(l)(X, θ) := σ^(l)(X^(l−1)(X, θ) W^(l)(θ)),

with X^(0)(X, θ) := X and X^(H+1)(X, θ) := Ŷ(X, θ), and where σ^(l) : R^{m×d_l} → R^{m×d_l} is defined by coordinate-wise nonlinear activation functions σ^(l)_{i,j} as (σ^(l)(M))_{i,j} := σ^(l)_{i,j}(M_{i,j}) for each (l, i, j). In this paper, each nonlinear activation function σ^(l)_{i,j} is allowed to differ among different layers and among different units within each layer, but is assumed to be ReLU (σ^(l)_{i,j}(z) = max(0, z)), leaky ReLU (σ^(l)_{i,j}(z) = max(az, z) with any fixed a ≤ 1), or absolute value activation (σ^(l)_{i,j}(z) = |z|). This paper considers the squared loss function, with which the training objective of the neural networks can be formulated as the following optimization problem:

minimize_θ L(θ) := (1/2) ‖Ŷ(X, θ) − Y‖²_F,   (2)

where ‖·‖_F is the Frobenius norm. Here, (2/m) L(θ) is the standard mean squared error, for which all of our results hold true as well, because multiplying L(θ) by the constant 2/m (in θ) only changes the overall scale of the optimization landscape.
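As a concrete illustration, the forward computation Ŷ(X, θ) = X^(H) W^(H+1) with X^(l) = σ^(l)(X^(l−1) W^(l)) and the loss L(θ) = (1/2)‖Ŷ − Y‖²_F can be sketched in a few lines of NumPy (a minimal sketch, not the paper's experimental code; the network sizes and random data below are hypothetical):

```python
import numpy as np

def forward(X, weights):
    """Forward pass of a fully-connected feedforward ReLU network:
    every hidden layer applies ReLU entrywise; the output layer is linear."""
    A = X
    for W in weights[:-1]:
        A = np.maximum(A @ W, 0.0)   # post-activation output X^(l)
    return A @ weights[-1]           # Y_hat = X^(H) W^(H+1)

def squared_loss(X, Y, weights):
    """L(theta) = (1/2) ||Y_hat - Y||_F^2."""
    R = forward(X, weights) - Y
    return 0.5 * float(np.sum(R ** 2))

# Hypothetical sizes and random data, purely for illustration.
rng = np.random.default_rng(0)
m, d0, d1, d2 = 8, 3, 5, 2
X = rng.standard_normal((m, d0))
Y = rng.standard_normal((m, d2))
weights = [rng.standard_normal((d0, d1)), rng.standard_normal((d1, d2))]
print(squared_loss(X, Y, weights))
```

The loss is zero exactly when the network output matches the targets, and multiplying it by 2/m recovers the mean squared error without changing the landscape.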

Additional terminology and notation
Given a matrix M, let M_{·,j} and M_{i,·} denote the j-th column vector and the i-th row vector of M, respectively. Let X^(l) := X^(l)(X, θ) and W^(l) := W^(l)(θ) for notational simplicity. Let Λ_{l,k} ∈ R^{m×m} be the diagonal matrix whose diagonal entries encode the activation pattern of the k-th unit at the l-th layer over the m different samples (e.g., for ReLU, (Λ_{l,k})_{i,i} ∈ {0, 1} indicates whether the unit is active on the i-th sample). Define P[M] to be the orthogonal projection matrix onto the column space (or range space) of a matrix M, and let P_N[M] be the orthogonal projection matrix onto the null space (or kernel space) of a matrix M. Given matrices (M^(j))_{j∈S} with a sequence S = (s_1, s_2, …, s_n), define [M^(j)]_{j∈S} := [M^(s_1) M^(s_2) ⋯ M^(s_n)] to be the block matrix whose column blocks are M^(s_1), M^(s_2), …, M^(s_n). Let S ⊆ (s_1, s_2, …, s_n) denote a subsequence of (s_1, s_2, …, s_n). Let I_d be the identity matrix of size d × d.
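These operators are straightforward to realize numerically. The sketch below (an illustration with hypothetical shapes, using the Moore-Penrose pseudoinverse) builds P[M], the complementary projection I − P[M] (which is how P_N[·] acts on targets in R^m in the formulas that follow), and a ReLU activation-pattern matrix Λ_{l,k}:

```python
import numpy as np

def P(M):
    """Orthogonal projection onto the column space of M (via pseudoinverse)."""
    return M @ np.linalg.pinv(M)

def P_complement(M):
    """I - P[M]: projection onto the orthogonal complement of the column
    space of M, i.e., the null space of M^T."""
    return np.eye(M.shape[0]) - P(M)

def activation_pattern(preact_col):
    """Lambda_{l,k}: m x m diagonal 0/1 matrix; entry (i, i) is 1 iff the
    k-th unit at layer l is active (pre-activation > 0) on sample i (ReLU)."""
    return np.diag((preact_col > 0).astype(float))

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 2))        # hypothetical matrix
assert np.allclose(P(M) @ P(M), P(M))          # projections are idempotent
assert np.allclose(P_complement(M) @ M, 0.0)   # complement annihilates columns of M
```

The pseudoinverse makes both projections well defined even when M is rank-deficient, which matters below because no invertibility assumptions are made.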

Deep nonlinear neural networks with local structure
Given the scarcity of theoretical understanding of the optimality of deep neural networks, Goodfellow et al. (2016) noted that it is valuable to theoretically study simplified models, i.e., deep linear neural networks. For example, Saxe et al. (2014) empirically showed that, in terms of optimization, deep linear networks exhibit several properties similar to those of deep nonlinear networks. Following these observations, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt and Ma, 2017; Arora et al., 2018a,b), as a step towards the goal of establishing an optimization theory of deep learning.
As another step towards the goal, this section discards the strong linearity assumption, and considers a locally induced nonlinear-linear structure in deep nonlinear networks with the piecewise linear activation functions such as ReLUs, leaky ReLUs and absolute value activations.

Locally induced nonlinear-linear structure
In this subsection, we describe how a standard deep nonlinear neural network can induce a nonlinear-linear structure. The nonlinear-linear structure considered in this paper is defined in Definition 1: condition (i) simply defines the index subsets S^(l) that pick out the relevant subset of units at each layer l, condition (ii) requires the existence of n linearly acting units, and condition (iii) imposes weak separability of edges.
Definition 1. A parameter vector θ is said to induce (n, t) weakly-separated linear units on a training input dataset X if there exist (H + 1 − t) sets S^(t+1), S^(t+2), …, S^(H+1) such that, for all l ∈ {t+1, t+2, …, H+1}, the three conditions sketched above hold: (i) S^(l) selects a subset of the units at layer l, (ii) the n units indexed by S^(l) act linearly on the training data, and (iii) the edges from these units to the rest of the network are negligible (weak separability). Let Θ_{n,t} be the set of all parameter vectors that induce (n, t) weakly-separated linear units on the particular training input dataset X that defines the total loss L(θ) in Equation (2). For standard deep nonlinear neural networks, all parameter vectors θ are in Θ_{d_{H+1},H}, and some parameter vectors θ are in Θ_{n,t} for other values of (n, t). Figure 1(a) illustrates locally induced structures for θ ∈ Θ_{1,0}. For a parameter θ to be in Θ_{n,t}, Definition 1 only requires the existence of a fraction n/d_l of units acting linearly, only on the particular training dataset, and merely at the particular θ. Thus, all units can be nonlinear, can act nonlinearly on the training dataset at other parameters θ, and can always operate nonlinearly on other inputs x, for example, in a test dataset or a different training dataset.

[Figure 1: locally induced structures (a) with weakly-separated edges (θ ∈ Θ_{1,0}) and (b) with strongly-separated edges (θ ∈ Θ^strong_{1,0}). The two examples of parameters θ are presented with the exact same network architecture (including activation functions and edges); i.e., even if the network architecture (or parameterization) is identical, different parameters θ can induce different local structures. With Θ_{1,3} instead of Θ_{1,0}, for example, this local structure only has to hold in the last hidden layer. With Θ_{1,4}, this local structure always holds in standard deep nonlinear networks with four hidden layers.]

The weak separability requires that the edges going from the n units to the rest of the network are negligible; it does not require the n units to be separated from the rest of the neural network.
Here, a neural network with a θ ∈ Θ_{n,t} can be a standard deep nonlinear neural network (without any linear units in its architecture), a deep linear neural network (with all activation functions being linear), or a combination of these cases. Whereas a standard deep nonlinear neural network can naturally have parameters θ ∈ Θ_{n,t}, it is also possible to guarantee that all parameters θ are in Θ_{n,t} with a desired (n, t), simply by using corresponding network architectures. For standard deep nonlinear neural networks, one can also restrict all relevant convergent solution parameters θ to be in Θ_{n,t} by using corresponding learning algorithms. Our theoretical results hold for all of these cases.
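To make the "linearly acting on the training data" notion concrete: for a ReLU unit, one simple sufficient condition (an illustrative criterion for this sketch, not Definition 1's exact condition (ii)) is that its pre-activation is sign-constant over the m training samples, in which case the unit computes a linear map of its input on that dataset while remaining nonlinear elsewhere:

```python
import numpy as np

def relu_unit_linear_on_data(preact_col):
    """Sufficient check: a ReLU unit acts linearly on this dataset if its
    pre-activations all share one sign (all >= 0: identity map; all <= 0:
    zero map). Illustrative only, not the paper's condition (ii)."""
    return bool(np.all(preact_col >= 0) or np.all(preact_col <= 0))

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))   # hypothetical training inputs
w = rng.standard_normal(5)          # hypothetical incoming weights of one unit
print(relu_unit_linear_on_data(X @ w))                   # mixed signs: False
print(relu_unit_linear_on_data(np.abs(X) @ np.abs(w)))   # all >= 0: True
```

This matches the discussion above: whether a unit acts linearly is a property of the particular dataset and the particular θ, not of the architecture.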

Theoretical Result
We state our main theoretical result in Theorem 1 and Corollary 1; a simplified statement is presented in Remark 1. Here, a classical machine learning method, basis function regression, is used as a baseline against which neural networks are compared. The global minimum value of basis function regression with an arbitrary basis matrix M(X) is

min_R (1/2) ‖M(X) R − Y‖²_F = (1/2) ‖P_N[M(X)] Y‖²_F,

where the basis matrix M(X) does not depend on R and can represent nonlinear maps, for example, by setting M(X) := ([φ(x_i)]_{i=1}^m)^T ∈ R^{m×d_φ} with any nonlinear basis functions φ and any finite d_φ. In Theorem 1, the expression P_N[X^(S)]Y represents the projection of Y onto the null space of (X^(S))^T, which is also Y minus the projection of Y onto the column space of X^(S).
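This closed-form global minimum is easy to verify numerically. The sketch below (random data and hypothetical sizes, for illustration) checks that the least-squares optimum of (1/2)‖M(X)R − Y‖²_F equals (1/2)‖P_N[M(X)]Y‖²_F, with P_N[·] realized as I − M M⁺:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_phi = 50, 6                      # hypothetical sizes
M = rng.standard_normal((m, d_phi))   # basis matrix M(X)
Y = rng.standard_normal((m, 2))

# Global minimum of (1/2)||M R - Y||_F^2 via least squares.
R_star, *_ = np.linalg.lstsq(M, Y, rcond=None)
loss_star = 0.5 * np.sum((M @ R_star - Y) ** 2)

# The same value from the projection formula: I - M M^+ projects onto
# the orthogonal complement of the column space of M.
P_N = np.eye(m) - M @ np.linalg.pinv(M)
loss_proj = 0.5 * np.sum((P_N @ Y) ** 2)

print(loss_star, loss_proj)  # the two printed values agree
```

Because the objective is convex in R, stationarity is sufficient for global optimality here, which is exactly the property that makes basis function regression a clean baseline.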
Theorem 1. For any t ∈ {0, 1, …, H}, every differentiable local minimum θ ∈ Θ_{d_{H+1},t} of L satisfies, for any subsequence S ⊆ (t, t+1, …, H) (including the case of S being the empty sequence),

L(θ) ≤ (1/2) ‖P_N[[X^(l)]_{l∈S}] Y‖²_F − (a nonnegative sum of squared projection norms of Y),   (3)

where the subtracted sum ranges over all l ∈ {1, …, H} and k_l ∈ {1, …, d_l}, is built from matrices N^(l)_{k_l} and Q^(l)_{k_l} constructed from the hidden representations and activation patterns, and yields further improvement as a network gets wider and deeper.

Remark 1. From Theorem 1 (or Corollary 1), one can see the following properties of the loss landscape: (i) Every differentiable local minimum θ ∈ Θ_{d_{H+1},t} has a loss value L(θ) better than or equal to any global minimum value of basis function regression with any combination of the basis matrices in the set {X^(l)}_{l=t}^H of fixed deep hierarchical representation matrices. In particular, with t = 0, every differentiable local minimum θ ∈ Θ_{d_{H+1},0} has a loss value L(θ) no worse than the global minimum values of standard basis function regression with the hand-crafted basis matrix X^(0) = X, and of basis function regression with the larger basis matrix [X^(l)]_{l=0}^H.
(ii) As d l and H increase (or equivalently, as a neural network gets wider and deeper), the upper bound on the loss values of local minima can further improve.
In terms of over-parameterization, Theorem 1 states that local minima of deep neural networks are as good as global minima of the corresponding basis function regression even without over-parameterization, and that over-parameterization helps to further improve the guarantee on local minima. The effect of over-parameterization is captured in both the first and second terms on the right-hand side of Equation (3). As depth and width increase, the second term tends to increase and hence the guarantee on local minima can improve. Moreover, as depth and width increase (for some of the (t+1)-th, (t+2)-th, …, H-th layers in Theorem 1), the first term tends to decrease and the guarantee on local minima can also improve. For example, if [X^(l)]_{l=t}^H has rank at least m, then the first term is zero and hence every local minimum is a global minimum with zero loss value. As a special case of this example, since every θ is automatically in Θ_{d_{H+1},H}, if X^(H) is forced to have rank at least m, every local minimum becomes a global minimum for standard deep nonlinear neural networks, which coincides with the observation about over-parameterization by Livni et al. (2014).
Without over-parameterization, Theorem 1 also recovers one of the main results in the literature on deep linear neural networks as a special case: every local minimum is a global minimum. If d_{H+1} ≤ min{d_l : 1 ≤ l ≤ H}, every local minimum θ of a deep linear network is differentiable and in Θ_{d_{H+1},0}, and hence Theorem 1 yields that L(θ) is the global minimum value; this implies that every local minimum is a global minimum for deep linear neural networks. Corollary 1 states that the same conclusion and discussion as in Theorem 1 hold true even if we fix the edges in condition (iii) of Definition 1 to be zero (by removing them as an architectural design choice or by forcing them to zero with a learning algorithm) and consider optimization problems only over the remaining edges.
Corollary 1. For any t ∈ {0, 1, …, H}, every differentiable local minimum θ ∈ Θ_{d_{H+1},t} of L|_I satisfies, for any subsequence S ⊆ (t, t+1, …, H) (including the case of S being the empty sequence), the same upper bound as in Equation (3), whose subtracted term again yields further improvement as a network gets wider and deeper; here L|_I is the restriction of L to the parameters that remain after the edges in condition (iii) of Definition 1 are fixed to zero, and the matrices N^(l)_{k_l} and Q^(l)_{k_l} are defined as in Theorem 1 for any l ∈ {1, …, H}.

Here, X^(0) = X consists of the training inputs x_i in the arbitrary given feature space embedded in R^{d_0}; e.g., given a raw input x_raw and any feature map φ taking values in R^{d_0}, one can set x = φ(x_raw). Therefore, Theorem 1 and Corollary 1 state that every differentiable local minimum of deep neural networks can be guaranteed to be no worse than any given basis function regression model with a hand-crafted basis taking values in R^d for some finite d, such as polynomial regression with a finite degree or radial basis function regression with a finite number of centers.
To illustrate an advantage of the notion of weakly-separated edges in Definition 1, one can consider the following alternative definition that requires strongly-separated edges.
Definition 2. A parameter vector θ is said to induce (n, t) strongly-separated linear units on the training input dataset X if there exist (H + 1 − t) sets S^(t+1), S^(t+2), …, S^(H+1) such that for all l ∈ {t+1, t+2, …, H+1}, conditions (i)-(iii) in Definition 1 hold together with an additional strong-separation condition on the remaining edges adjacent to the n units. Let Θ^strong_{n,t} be the set of all parameter vectors that induce (n, t) strongly-separated linear units on the particular training input dataset X that defines the total loss L(θ) in Equation (2). Figure 1 shows a comparison of weakly-separated edges and strongly-separated edges. Under this stronger restriction on the local structure, we obtain Corollary 2.
Corollary 2. For any t ∈ {0, 1, …, H}, every differentiable local minimum θ ∈ Θ^strong_{d_{H+1},t} of L satisfies the upper bound of Theorem 1 for any S ⊆ (t, H), with the matrices N^(l)_{k_l} and Q^(l)_{k_l} defined as in Theorem 1.

As a special case, Corollary 2 also recovers the statement that every local minimum is a global minimum for deep linear neural networks, in the same way as Theorem 1 does. When compared with Theorem 1, the statement in Corollary 2 is weaker, producing the upper bound only for S ⊆ (t, H). This is because the restriction to strongly-separated units forces neural networks to have less expressive power with fewer effective edges. This illustrates an advantage of the notion of weakly-separated edges in Definition 1.

Deep nonlinear neural networks without structure
The previous section presented Theorem 1, which provides various guarantees for neural networks with a possible local nonlinear-linear structure. This section focuses on the general case without any local structure, and examines standard deep nonlinear neural networks in this general case. For this case, Theorem 2 provides an upper bound on the values of local minima that illustrates the effect of depth and width for standard deep nonlinear neural networks. Let d̃_l := d_l for all l ∈ {1, …, H} and d̃_{H+1} := 1.
Theorem 2. Every differentiable local minimum θ of L satisfies the exact expression of Equation (5), whose right-hand side consists of a projection term together with a subtracted nonnegative sum that yields further improvement as a network gets wider and deeper, where the matrices D^(l)_{k_l}(θ) appearing in Equation (5) are constructed from the layer representations and activation patterns.

Unlike previous studies (Livni et al., 2014; Nguyen and Hein, 2017, 2018), Theorem 2 (and Theorem 1) requires no over-parameterization such as d_l ≥ m. Instead, Theorem 2 (and Theorem 1) provides quantitative, gradual effects of depth and width on local minima, from no over-parameterization to over-parameterization. Notably, Theorem 2 (and Theorem 1) shows the effect of over-parameterization in terms of depth as well as width, which also differs from the results of previous studies that consider over-parameterization only in terms of width (Livni et al., 2014; Nguyen and Hein, 2017, 2018).

Experiments
In Theorem 2, we have shown that at every differentiable local minimum θ, the total training loss value L(θ) has an analytical formula L(θ) = J(θ), where J(θ) denotes the right-hand side of Equation (5). In this subsection, we investigate the actual numerical values of the formula J(θ) with a synthetic dataset and standard benchmark datasets, for neural networks with different values of the depth H and the hidden layers' width d_l for l ∈ {1, 2, …, H}.
In the synthetic dataset, the data points {(x_i, y_i)}_{i=1}^m were randomly generated by a ground-truth fully-connected feedforward neural network with H = 7, d_l = 50 for all l ∈ {1, 2, …, H}, the tanh activation function, (x, y) ∈ R^10 × R, and m = 5000. MNIST (LeCun et al., 1998) is a popular dataset for recognizing handwritten digits and contains 28 × 28 grey-scale images. The CIFAR-10 (Krizhevsky and Hinton, 2009) dataset consists of 32 × 32 color images that contain different types of objects such as "airplane", "automobile" and "cat". The Street View House Numbers (SVHN) dataset (Netzer et al., 2011) contains house digits collected by Google Street View; we used the 32 × 32 color image version for the standard task of predicting the digits in the middle of these images. In order to reduce the computational cost for the image datasets (MNIST, CIFAR-10 and SVHN), we center-cropped the images (24 × 24 for MNIST and 28 × 28 for CIFAR-10 and SVHN), then resized them to smaller grey-scale images (8 × 8 for MNIST and 12 × 12 for CIFAR-10 and SVHN), and used randomly selected subsets of the datasets of size m = 10000 as the training datasets. For all the datasets, the network architecture was fixed to be a fully-connected feedforward network with the ReLU activation function. For each dataset, the values of J(θ) were computed with initial random weights drawn from a normal distribution with zero mean and normalized standard deviation 1/√d_l, and with trained weights at the end of 40 training epochs. Additional experimental details are presented in Appendix C. Figure 2 shows the results with the synthetic dataset, as well as the MNIST, CIFAR-10, and SVHN datasets. As can be seen, the values of J(θ) tend to decrease towards zero (and hence towards the global minimum value) as the width or depth of the neural network increases.
In theory, the values of J(θ) may not improve as much as desired with depth and width if the representations corresponding to each unit and each layer are redundant, in the sense of linear dependence of the columns of D^(l)_{k_l}(θ) (see Theorem 2). Intuitively, at the initial random weights, this redundancy is mitigated by the randomness of the weights, and hence a major concern is whether such redundancy arises and J(θ) degrades during training. From Figure 2, it can also be noticed that the values of J(θ) tend to decrease during training. These empirical results partially support our theoretical observation that increasing the depth and width can improve the quality of local minima.
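A rough numerical analogue of this trend can be obtained with the basis-regression bound of Remark 1 instead of the exact formula J(θ) (a sketch under assumptions: random untrained weights, random data, one hidden layer, and the t = 0 bound (1/2)‖P_N[[X^(l)]_{l=0}^H]Y‖²_F):

```python
import numpy as np

def hidden_reps(X, weights):
    """Collect [X^(0), ..., X^(H)]: the input plus each post-ReLU layer output."""
    reps, A = [X], X
    for W in weights[:-1]:
        A = np.maximum(A @ W, 0.0)
        reps.append(A)
    return reps

def basis_bound(X, Y, weights):
    """(1/2)||P_N[[X^(l)]_l] Y||_F^2: residual of Y outside the span of the
    concatenated hidden representations (the t = 0 bound of Remark 1)."""
    B = np.hstack(hidden_reps(X, weights))
    resid = Y - B @ np.linalg.lstsq(B, Y, rcond=None)[0]
    return 0.5 * float(np.sum(resid ** 2))

rng = np.random.default_rng(2)
m, d0 = 200, 4                       # hypothetical dataset sizes
X = rng.standard_normal((m, d0))
Y = rng.standard_normal((m, 1))

bounds = {}
for width in (2, 8, 32, 128):        # hypothetical widths, one hidden layer
    ws = [rng.standard_normal((d0, width)), rng.standard_normal((width, 1))]
    bounds[width] = basis_bound(X, Y, ws)
    print(width, bounds[width])      # the bound tends to shrink as width grows
```

Wider layers contribute more columns to the concatenated basis, so the projection residual, and with it the guarantee on local minima, tends to shrink.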

Probabilistic bound
To facilitate our theoretical understanding of the observed phenomenon, this subsection focuses on neural networks with one hidden layer and scalar-valued output:

Ŷ(X, θ) = σ(X W^(1)) W^(2),   (6)

where X ∈ R^{m×d_0} and Y ∈ R^m are the data matrices, and W^(1) ∈ R^{d_0×d} and W^(2) ∈ R^{d×1} are the weight matrices. In this simplified setting, Equation (5) in Theorem 2 reduces to

L(θ) = (1/2) ‖P_N[D̃] Y‖²₂,   (7)

where D̃ := [Λ_{1,1}X, Λ_{1,2}X, …, Λ_{1,d}X] ∈ R^{m×d_0 d}; the right-hand side provides further improvement as a network gets wider.
From Equation (7), the loss L(θ) is expected to get smaller as the width d of the hidden layer gets larger. To further support this theoretical observation, this subsection obtains a probabilistic upper bound on the loss L(θ) for white-noise data by fixing the activation patterns Λ_{1,k} for k ∈ {1, 2, …, d}, and assuming that the data matrix [X, Y] is a random Gaussian matrix with each entry having mean zero and variance one. We denote the vector consisting of the diagonal entries of Λ_{1,k} by Λ_k ∈ R^m for k ∈ {1, 2, …, d}, and define the activation pattern matrix as Λ := [Λ_k]_{k=1}^d ∈ R^{m×d}. For any index set I ⊆ {1, 2, …, m}, let Λ_I denote the sub-matrix of Λ consisting of its rows with indices in I.
Under some mild conditions on the activation pattern matrix Λ, Proposition 1 proves that L(θ) ≈ (1 − d_0 d/m) ‖Y‖²₂/2 in the regime d_0 d ≪ m, and L(θ) = 0 in the regime d_0 d ≫ m, supporting our theoretical observation that increasing width helps improve the quality of local minima.
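Proposition 1's two regimes can be probed numerically. Below is a sketch (an illustration of the claim, not its proof): it draws Gaussian data, forms activation patterns from a random first layer, stacks D̃ = [Λ_{1,1}X, …, Λ_{1,d}X], and measures the residual fraction ‖P_N[D̃]Y‖²₂/‖Y‖²₂:

```python
import numpy as np

rng = np.random.default_rng(3)

def residual_fraction(m, d0, d):
    """||P_N[D~] Y||^2 / ||Y||^2 for Gaussian data, where D~ stacks the
    input matrix masked by each hidden unit's ReLU activation pattern."""
    X = rng.standard_normal((m, d0))
    Y = rng.standard_normal(m)
    W1 = rng.standard_normal((d0, d))
    masks = (X @ W1 > 0).astype(float)                    # m x d patterns
    D = np.hstack([masks[:, [k]] * X for k in range(d)])  # m x (d0 d)
    r = Y - D @ np.linalg.lstsq(D, Y, rcond=None)[0]      # P_N[D~] Y
    return float(r @ r) / float(Y @ Y)

f_under = residual_fraction(m=2000, d0=4, d=10)   # d0 d = 40  << m
f_over = residual_fraction(m=100, d0=4, d=100)    # d0 d = 400 >> m
print(f_under)  # close to 1 - 40/2000 = 0.98
print(f_over)   # numerically zero: D~ reaches rank m
```

The first call illustrates L(θ) ≈ (1 − d_0 d/m)‖Y‖²₂/2; the second illustrates the over-parameterized regime where the loss at local minima vanishes.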
In the regime d d_0 ≫ m, results similar to Proposition 1 (ii) were obtained under certain diversity assumptions on the entries of the weight matrices in a previous study (Xie et al., 2017). When compared with that study, Proposition 1 specifies precise relations between the size d d_0 of the neural network and the size m of the dataset, and also holds true in the regime d d_0 ≪ m. However, Proposition 1 assumes a Gaussian data matrix, which may be a substantial limitation.

Conclusion
In this paper, we have theoretically and empirically analyzed the effect of depth and width on the loss values of local minima, with and without a possible local nonlinear-linear structure. The local nonlinear-linear structure considered in this paper might naturally arise during training, and is also guaranteed to emerge by using specific learning algorithms or architecture designs. With the local nonlinear-linear structure, we have proved that the values of local minima of neural networks are no worse than the global minimum values of corresponding basis function regression and can improve as depth and width increase. In the general case without the possible local structure, we have theoretically shown that increasing the depth and width can improve the quality of local minima, and empirically supported this theoretical observation. Furthermore, without the local structure but with a shallow neural network, we have proven concrete rates of improvement of the local minimum values in terms of width.
Our results suggest that the values of local minima are not arbitrarily poor (unless one crafts a pathological worst-case example), and can be guaranteed to some desired degree in practice, depending on the degree of over-parameterization as well as the local or global structural assumption. Indeed, a structural assumption, namely the existence of an identity map, was recently utilized to analyze the quality of local minima (Shamir, 2018; Kawaguchi and Bengio, 2018). When compared with these previous studies (Shamir, 2018; Kawaguchi and Bengio, 2018), we have shown the effect of depth and width, and have considered a different type of neural network without the explicit identity map.
In practice, we often "over-parameterize" a hypothesis space in deep learning in a certain sense (e.g., in terms of expressive power). Theoretically, with strong over-parameterization assumptions, one can show that all stationary points (including all local minima), even with respect to a single layer, achieve zero training error by memorizing any dataset, and thus are global minima. However, "over-parameterization" in practice may not satisfy such strong over-parameterization assumptions from the theoretical literature. In contrast, the results in this paper do not require over-parameterization and show the gradual effects of over-parameterization as consequences of general results.

Appendix A. Proofs for non-probabilistic statements

The following lemma decomposes the model output Ŷ in terms of the weight matrix W^(l) and a matrix D^(l) that coincides with its derivative at differentiable points.

Lemma 1. For any l ∈ {1, …, H+1}, vec(Ŷ) = D^(l) vec(W^(l)), and at any differentiable point, (∂_{W^(l)} Ŷ)^T = D^(l).

Proof. Define G^(l) to be the pre-activation output of the l-th hidden layer as G^(l) := G^(l)(X, θ) := X^(l−1)(X, θ) W^(l). By the linearity of the vec operation and the definition of G^(l), we have vec(Ŷ) = D^(l) vec(W^(l)), which proves the first statement. The second statement follows from the fact that the derivatives of D^(l) with respect to vec(W^(l)) are zero at any differentiable point, and hence (∂_{W^(l)} Ŷ)^T = D^(l) + 0.
Lemma 2 generalizes a part of Theorem A.45 in (Rao et al., 2007) by discarding invertibility assumptions.
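Lemma 2's decomposition of projections onto a block matrix with mutually orthogonal blocks, P[[A B]] = P[A] + P[B] whenever A^T B = 0, can be checked numerically even for rank-deficient blocks (a small sketch with hypothetical sizes; the pseudoinverse makes invertibility unnecessary):

```python
import numpy as np

def P(M):
    """Orthogonal projection onto the column space of M."""
    return M @ np.linalg.pinv(M)

rng = np.random.default_rng(4)
# Orthonormal columns split into two blocks, so that A^T B = 0 exactly.
Q, _ = np.linalg.qr(rng.standard_normal((12, 7)))
A, B = Q[:, :3], Q[:, 3:7]
# No invertibility is needed: duplicating a column of A makes it rank-deficient.
A = np.hstack([A, A[:, :1]])

assert np.allclose(A.T @ B, 0.0)
assert np.allclose(P(np.hstack([A, B])), P(A) + P(B))
print("P[[A B]] = P[A] + P[B] verified")
```

This additivity under block orthogonality is what allows the repeated peeling-off of blocks in the proof of Lemma 3 below.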
Lemma 2. For any block matrix [A B] with real sub-matrices A and B such that A^T B = 0, P[[A B]] = P[A] + P[B].

Proof. It follows from a straightforward calculation.

Lemma 3 decomposes a norm of a projected target vector into a form that clearly shows an effect of depth and width.
Lemma 3. For any t ∈ {0, 1, …, H} and any S ⊆ (t, t+1, …, H), the squared norm ‖P_N[[X^(l)]_{l∈S}]Y‖²_F decomposes into the corresponding squared norm for the block matrix augmented with the columns [N^(l)_{k_l}], plus a sum of squared projection norms; this is obtained by repeatedly applying Lemma 2 to each block of [[N^(l)_{k_l}]], using the fact that, from the construction of N^(l)_{k_l}, the blocks satisfy (N^(l)_k)^T N^(l')_{k'} = 0 for all (l, k) ≠ (l', k').

The following lemma plays a major role in the proof of Theorem 1.

By the definition of l*, this implies that the claim holds for any sufficiently small vector (ν_{l+1}, …, ν_{l*−2}) with θ̃(ν_{l+1}, …, ν_{l*−2}); we prove this by induction on the index j in decreasing order j = l*−2, l*−3, …, l+1. The base case j = l*−2 is proven above. Let Ã^(l) := Ã^(l)(ν_{l+2}). For the inductive step, assuming that the statement holds for j, we show that it holds for j−1, where the key step is that the first term in the second line is zero because of the inductive hypothesis with ν_j = 0. Since ‖u_j‖₂ = 1, multiplying both sides by u_j from the right yields the claim for any sufficiently small ν_j ∈ R^{d_l}. This completes the inductive step and proves that r X^(l)(ν_{l+1} u_{l+1}) = 0.
Therefore, for any S ⊆ (t, t+1, …, H), the stated chain of inequalities holds, where the second inequality holds because the column space of the augmented matrix includes the column space of X^(S). By applying Lemma 3 to the second term on the right-hand side of Equation (13), we obtain the desired upper bound in Theorem 1. Finally, we complete the proof by noticing that (1/2)‖P_N[X^(S)]Y‖²_F is the global minimum value of basis function regression with the basis X^(S) for all S ⊆ (0, 1, …, H). This is because (1/2)‖X^(S)W − Y‖²_F is convex in W, so ∂_W (1/2)‖X^(S)W − Y‖²_F = 0 is a necessary and sufficient condition for global minima; solving it yields the global minimum value (1/2)‖P_N[X^(S)]Y‖²_F.

A.4 Proof of Theorem 2
From the first order necessary condition of differentiable local minima,

Appendix B. Proofs for probabilistic statements
In the following lemma, we rewrite the loss (7) in terms of the activation patterns and the data matrices X and Y.
Lemma 5. Every differentiable local minimizer θ of L with the neural network (6) satisfies L(θ) = (1/2)‖P_N[D̃]Y‖²₂.   (14)

Proof. With r := Ŷ(X, θ) − Y, we have L(θ) = r^T r/2. For any differentiable local minimum θ, the first-order condition implies that if W^(2)_j ≠ 0, then r ⊥ Λ_{1,j} X_{·,i} for 1 ≤ i ≤ d_0. In fact, the same conclusion holds even if W^(2)_j = 0; to prove it, we use the second-order condition: the corresponding Hessian block must be positive semidefinite, from which we conclude that r^T Λ_{1,j} X_{·,i} = 0. Therefore, Ŷ(X, θ) − Y is perpendicular to the column space of D̃. Moreover, from the expression (6), Ŷ(X, θ) is in the column space of D̃; hence Ŷ(X, θ) is the projection of Y onto the column space of D̃, i.e., Ŷ(X, θ) = P[D̃]Y, and Equation (14) follows.

From Equation (14), we expect that the larger the rank of the projection matrix P[D̃], the smaller the loss L(θ). In the following lemma, we prove that, under some mild conditions on the activation pattern matrix Λ, in the regime d_0 d ≪ m we have rank D̃ = d_0 d, and in the regime d_0 d ≫ m we have rank D̃ = m. As shown later, Proposition 1 follows easily from these rank estimates of D̃.
Proof of Lemma 6. We denote the event Ω sum such that Thanks to (37) in Lemma 7, P(Ω sum ) ≥ 1 − e −d 0 m/8 . In the following we first prove Case (i): that rankD = d 0 d with high probability. We identify the space R dd 0 with d × d 0 matrix, and fix L = 2 ln(dm/δ 2 ) . We first prove that for any V in the unit sphere in R dd 0 , with probability at least 1 − e −m/(16L) , we have We notice thatD Then u is a Gaussian vector in R m with k-th entry Since by our assumption that the entries of Λ are bounded by 1, we get We denote the sets I 0 = {1 ≤ k ≤ m : a 2 k ≤ δ 2 /m}, and There are two cases, if there exists some ≥ 1 such that |I | ≥ m/L, then thanks to (38) in Lemma 7, we have that with probability at least 1 − e −m/(16L) Otherwise, we have that |I 0 | ≥ m(1 − log 2 (dm/δ 2 ) /L) = m/2. Then However by our assumption that s min (Λ I 0 ) ≥ δ, This leads to a contradiction. The claim (21) follows from (22). We take an ε-net of the unit sphere in R dd 0 , and denote it by E. The cardinality of the set E is at most (2/ε) dd 0 . We denote the event Ω such that the following holds Then by using a union bound, we get that the event Ω ∩ Ω sum holds with probability at least 1 − e −m/(16L) (2/ε) dd 0 − e −md 0 /4 . LetV be a vector in the unit sphere of R dd 0 , then there exists a vector V ∈ E such that V − V 2 ≤ ε, and we haveD From (20) and (23), for X ∈ Ω ∩ Ω sum , we have that and It follows from combining (24), (25) and (26), we get that on the event Ω ∩ Ω sum , provided that ε ≤ δ/ √ 12d 0 dmL. This implies that the smallest singular value of the matrixD is at least δ 2 /(4L), with probability provided that m ≥ 32L ln(d 0 dm/δ 2 )d 0 d. This finishes the proof of Case (i).
In the following we prove Case 2: that rankD = m with high probability. We notice that for any vector v ∈ R m , On the event Ω sum as defined in (20), we have that D v 2 2 ≤ 2dd 0 m v 2 2 for any vector v ∈ R m . In the following we prove that, for any vector v ∈ R m , if its L-th largest entry (in absolute value) is at least a for some L ≤ d/2, then We denote the vectors u i = [X ·i Λ 1,1 v, X ·i Λ 1,2 v, · · · , X ·i Λ 1,d v] , for any i = 1, 2, · · · , d 0 . Theñ D v = [u 1 , u 2 , · · · , u d 0 ] . Moreover, u 1 , u 2 , · · · , u d 0 ∈ R d are independent and identically distributed Gaussian vectors, with mean zero and covariance matrix where V is the m × m diagonal matrix, with diagonal entries given by v. We denote the eigenvalues of Σ as λ 1 (Σ) ≥ λ 2 (Σ) ≥ · · · ≥ λ d (Σ) ≥ 0. Then in distribution where {z ij } 1≤i≤d 0 ,1≤j≤d are independent Gaussian random variables with mean zero and variance one. If the L-th largest entry of v (in absolute value) is at least a for some L ≤ d/2, we denote the index set I = {1 ≤ k ≤ m : |v k | ≥ a}, then Therefore, the j-th largest eigenvalue of Σ is at least the j-th largest eigenvalue of a 2 Λ I Λ I , for any 1 ≤ j ≤ d. From our assumption, s min (Λ I ) ≥ δ, and the L-th largest eigenvalue of a 2 Λ I Λ I is at least a 2 δ 2 . Therefore the L-th largest eigenvalue of Σ is at least a 2 δ 2 , i.e. λ L (Σ) ≥ a 2 δ 2 . We can rewrite (29) as Thanks to (38) in Lemma 7, This finishes the proof of claim (28). We take an ε-net of the unit sphere in R m , and denote it by E. Letv be a vector in the unit sphere of R m , then there exists a vector v ∈ E such that v −v 2 ≤ ε, and we havẽ and on the event Ω sum using (27), we have In the rest of the proof, we show that with high probability D v 2 2 is bounded away from zero for uniformly any v ∈ E.
For any given vector v in the unit sphere of R m , we sort its entries in absolute value, |v * 1 | ≥ |v * 2 | ≥ · · · ≥ |v * m |.
Combining (33) and (35), on the event Ω_q ∩ Ω_sum, for any v ∈ E with q(v) = q we obtain the claimed lower bound. On the event Ω_sum ∩ ⋂_{q=0}^p Ω_q, the desired estimate follows from combining (30), (31) and (36). Moreover, thanks to (34), the event Ω_sum ∩ ⋂_{q=0}^p Ω_q holds with the claimed probability.

Appendix C. Additional experimental details

For training, we used a standard training procedure: mini-batch stochastic gradient descent (SGD) with momentum. The learning rate was set to 0.01. The momentum coefficient was set to 0.9 for the synthetic dataset and 0.5 for the image datasets. The mini-batch size was set to 200 for the synthetic dataset and 64 for the image datasets. To numerically compute the values of J(θ), we used the equivalent single-projection form (1/2)‖P_N[[…]] vec(Y)‖²₂, which holds for all θ. This is mainly because the form of J(θ) in Theorem 2 may accumulate positive numerical errors for each l ≤ H and k_l ≤ d_l in the sum in its second term, which may easily cause a numerical over-estimation of the effect of depth and width. To compute the projections, we adopted the method of computing a numerical cutoff criterion on singular values from (Press et al., 2007): (the numerical cutoff criterion) = (1/2) × (maximum singular value of M) × (machine precision of M) × √(d' + d + 1), for a matrix M ∈ R^{d'×d}. We also confirmed that the reported experimental results remained qualitatively unchanged with two other cutoff criteria: a criterion based on (Golub and Van Loan, 1996)
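For reference, this cutoff rule can be implemented in a few lines (a sketch; the matrix and its duplicated columns are hypothetical, and the threshold follows the Press et al. (2007) form (1/2) · s_max · ε · √(d' + d + 1) described above):

```python
import numpy as np

def numerical_rank(M):
    """Numerical rank of M: discard singular values below the cutoff
    (1/2) * s_max * eps * sqrt(d' + d + 1) for M of shape (d', d)."""
    s = np.linalg.svd(M, compute_uv=False)
    dp, d = M.shape
    eps = np.finfo(M.dtype).eps
    cutoff = 0.5 * s[0] * eps * np.sqrt(dp + d + 1)
    return int(np.sum(s > cutoff))

rng = np.random.default_rng(5)
M = rng.standard_normal((50, 10))
M = np.hstack([M, M[:, :3]])   # three exactly dependent columns
print(numerical_rank(M))       # 10: the duplicated columns are discarded
```

Singular values at the level of machine-precision noise fall below the cutoff, so exactly dependent columns do not inflate the rank, and hence do not inflate the computed projections P[·] and P_N[·].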