The title may seem like a contradiction given that there is such a thing as the Universal Approximation Theorem which simply states that a neural network with a single hidden layer of finite width(i.e finite number of neurons) can approximate any function on a compact set of given that the activation function is non-constant,bounded and continuous.
Needless to say, I haven’t found any kind of flaws in the existing proofs(see Kolmogrov or Cybenko). However, I thought of something a little simple and scoured the internet for an answer.
What if we allow an arbitrary number of hidden layers and bound the dimension of the hidden layers making them ‘Deep, Skinny Neural Networks’? Would that be a universal approximator?
The answer is No and I found it out from a paper by Jesse Johnson who proved it using nothing but basic topology for a particular class of neural networks. This seems quite important because quite often, in application, multiple hidden layers of low dimension are used in machine learning. Also, the kinds of restrictions the author placed on the activation function and the neural network appear quite commonly in the real-world.
Setting Up the Notation
Before I proceed to the proof, I’ll just set in place the required notation that will be used throughout.
Let be set of all functions in a layered classic feed-forward neural network with activation function , input dimension , output dimension and hidden layer dimension for .
Given this, let to be the union of the family of functions where for some fixed natural number .
A function in is said to be nonsingular if is continuous and injective, and the weight matrix is nonsingular. The set of all functions in the families for fixed and varying activation functions which are continuous and injective is denoted by .
Given some function and a compact subset of , another function is said to if for every .
An activation function is uniformly approximated by injective functions if there if there is a sequence of continuous, injective functions which uniformly converge to . This would allow common activation functions such as ISRU, tanh and the sigmoid function to be approximated trivially. Functions such as RELu have a horizontal line in their graph and it’s easy to appropriately choose a sequence of injective functions which will converge to RELu(try to do this yourself).
One key observation is that any function can be written as:
where are the activation functions and are linear functions determined by the weights associated with a particular layer. This is typically how we imagine forward propagation of fully connected neural networks to work.
The main theorem of his paper deals with path components associated with pre-images of level sets. So, going forward, it’ll be better to put forth a definition.
A function has unbounded level components if for every , the path components of are all unbounded.
Before, moving onto the main theorem, I’d like to simply list some rather obvious lemmas.
In (1), I mentioned the decomposition of functions in . So, now consider some function which can be decomposed as where each is a linear function between Euclidean spaces.
If there are another set of functions than each corresponding for some compact sets , then the composition for some compact set .
If is approximated by some function in where the activation function can be uniformly approximated by injective functions, then can be approximated by a function in
The proof is pretty simple so I skip it in typical mathematical snobbery. If you wish to prove it for yourself, start off by showing that every function in can be approximated by a function in .Note that you can appropriately make for every hidden layer even if the width is less than by appropriately inserting zeros in the weight matrix. Now, the next hint is a way to deal with those pesky singular linear functions . The set of all non singular matrices form an open dense set in the space of all matrices, i.e the set of all singular matrices form a Lebesgue zero measure set. This can be understood by considering these spaces to be manifolds in so perturbing elements appropriately in a singular matrix makes it non-singular so that any arbitrarily small change in the value of the linear transformation associated with the singular function can be approximated by a non-singular one.
Now, you’ve shown that every function in can be approximated by a function in and given that is approximated by some function in , you can combine the results to prove the lemma.
This section ends with a key lemma on the functions in .
If is a function in , then for every , the level set is homeomorphic to an open subset of . Hence, it has unbounded level components.
Sketch of the proof:
Again, the lemma is quite easy to prove. I’ll provide a rough sketch of the proof. Writing the function as:
where is an linear function and where is a linear function to . Essentially, we are writing the f as the compositions associated to the last layer of the neural network and the remaining layers. is injective and continuous because the activation function is injective and continuous, and is singular by definition.
Once you work through the simple details, it’s revealed that is an open subset of . On the other hand, is also closed in by the properties of continuous functions. If it was bounded, then the set would be a compact subset of making it empty so is either unbounded or empty. Hence, the path components of this set are unbounded.
Towards the Main Theorem
The final result of the paper is that deep, skinny networks can only approximate functions with unbounded level components.
What we have proven is that every function that can be approximated by a function in can be approximated by a function in and that the level sets of the functions in have unbounded level components. Now, it remains to prove that every function approximated by a function in also has unbounded level components.
So, the topological constraint is that functions with bounded level components can not be approximated by a function in where is a continuous activation function that can uniformly approximated by injective functions.
Any function approximated by a function in has unbounded level components.
Assume to the contrary that has bounded level components. Consider some level set for . is closed and components are also closed.Since, components are bounded, both these sets are compact.
Hence, there exists a number such that every element of is at a distance greater than from .
Define . Let be the boundary of this open set . is closed and also bounded since it is at a bounded distance from (which is bounded itself).
is also compact and is disjoint from it. Since the complement of the set is open, there exists some neighborhood of that is disjoint from .
obviously contains and has a component that contains . intersects and is disjoint from the boundary . Therefore, . So, is also compact.
It’s given that is approximated by a function in . Let . Bear in mind that this number is finite since is bounded and contains .
Let be an element of .
Consider some connected compact neighborhood of such that there is the boundary is at a distance . The simplest example of such a neighborhood would be the closed ball .
So, choose a function that .
. Since ,
Since , the level components are unbounded, so the components of are unbounded. We can construct a path from to a point that is at a distance from such that every point in lies in . This means that . Since , by the previous equation and inequality.
So, . The path is contained in one of the components of and since it contains , this component is .
But we constructed the path so that it starts from and ends at a point whose distance from is greater than . This is a contradiction by the definition of .
Well, that ends the proof. J. Johnson presents examples of how these restrictions apply to real-world machine learning problems which you can find in the last few pages of his paper.