Artificial intelligence extracts information from large amounts of data by learning the patterns that occur in it. Deep neural networks, a form of AI that learns to discriminate between such patterns, began surpassing traditional algorithms about 10 years ago, once enough data and processing power became available. Today’s neural networks are hungrier still for data and power. Training them requires careful tuning of the values of billions of parameters that characterize these networks. Ideal values for these parameters are found through a process known as optimization, but training a network to reach them is not easy.
Boris Knyazev and his colleagues have designed and trained a “hypernetwork”, a network that predicts parameters for other neural networks, which could speed up the training process. Given a new, untrained deep neural network designed for some task, the hypernetwork predicts its parameters in fractions of a second. The hypernetwork performs very well in certain settings, but there is still room for it to grow.
Stochastic gradient descent (SGD) is a technique for minimizing the errors a network makes on a given task. In image recognition, for instance, the SGD algorithm runs through large amounts of labeled data, adjusting the network’s parameters to reduce the errors. But this technique works only once there is a network to optimize. To build the initial neural network, made up of multiple layers of artificial neurons, engineers must rely on rules of thumb. Such architectures can differ, for example, in the number of layers of neurons.
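To make the idea of SGD concrete, here is a minimal sketch in plain Python. It is not taken from any of the work described here: the model (a straight line with two parameters), the toy data, and the names `sgd_fit_line`, `lr` and `epochs` are all assumptions chosen for illustration. The point is only to show parameters being nudged, one labeled example at a time, in the direction that reduces the error.

```python
import random

def sgd_fit_line(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                  # initial parameters (assumed starting point)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)         # visit the labeled examples in random order
        for x, y in samples:
            err = (w * x + b) - y    # prediction error on a single example
            # Gradient of 0.5 * err**2 with respect to w and b
            w -= lr * err * x
            b -= lr * err
    return w, b

# Hypothetical labeled data generated from y = 2x + 1
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_fit_line(data)
print(round(w, 3), round(b, 3))
```

On this noise-free toy data the two parameters settle close to the values that generated it. A real image-recognition network follows the same loop, only with far more parameters and a gradient computed by backpropagation.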
Ideally, one would optimize many candidate architectures and pick the best of them. But training and testing every candidate network architecture would take a nontrivial amount of time. So, in 2018, Mengye Ren, along with his former University of Toronto colleague Chris Zhang and their adviser Raquel Urtasun, tried a different approach. They designed a graph hypernetwork (GHN) to find the best architecture for a given task. The name reflects the fact that the architecture of a deep neural network can be thought of as a mathematical graph: nodes represent computational units, such as a layer of the network, and edges represent the way these units are interconnected. Ren’s team showed that their method could be used to identify the best candidate architecture.
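The graph view of an architecture can be sketched in a few lines of Python. The layer names and operation labels below are hypothetical and are not taken from Ren’s paper. The sketch only shows that once layers are nodes and connections are edges, standard graph routines apply, for example ordering the layers so that each one comes after all of its inputs.

```python
from collections import deque

# Hypothetical toy architecture: each node is a computational unit (a layer),
# each edge says which unit feeds into which.
nodes = {
    "input": "input",
    "conv1": "conv3x3",
    "conv2": "conv3x3",
    "add": "sum",
    "fc": "linear",
}
edges = [
    ("input", "conv1"),
    ("conv1", "conv2"),
    ("conv1", "add"),   # skip connection around conv2
    ("conv2", "add"),
    ("add", "fc"),
]

def topological_order(nodes, edges):
    """Return the nodes in an order where every layer follows all of its inputs."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("architecture graph contains a cycle")
    return order

order = topological_order(nodes, edges)
print(order)
```

A graph hypernetwork takes a description of this kind as its input, although the exact encoding used in the published work may differ from this dictionary-and-list form.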
Knyazev’s team came across Ren’s idea and realized they could build on it. In their new paper, they show how to use a GHN not only to find the best architecture but also to predict the parameters for that network. “It’s a very solid paper. [It] contains a lot more experimentation than what we did,” Ren said of the new work.
Knyazev and his team call their hypernetwork GHN-2. It improves on two aspects of the graph hypernetwork built by Ren’s team. First, they relied on Ren’s technique of representing the architecture of a neural network as a graph. Second, they drew on the method Ren’s team used to train the hypernetwork to make predictions for new candidate architectures. Both ideas come into play for each architecture used in training. Starting with the architecture’s graph, the graph hypernetwork predicts parameters, and the candidate artificial neural network (ANN) is initialized with them. The ANN then performs a specific task, such as image recognition, and the loss it incurs is calculated. Instead of updating the parameters of the ANN to make better predictions, the parameters of the hypernetwork that made the prediction in the first place are updated. This is repeated for every image in the data set, reducing the loss at each step.
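The training loop described above can be illustrated with a deliberately tiny stand-in, sketched here under strong assumptions. This is not GHN-2: each “architecture” is reduced to a single number `arch` instead of a graph, the target network has one weight `w`, and the hypernetwork is a linear map with two parameters `theta`. All names are hypothetical. What the sketch keeps is the key step: the loss is measured on the target network, but the update is applied to the hypernetwork’s parameters.

```python
def target_net(w, arch, x):
    """Toy target network: arch is a fixed offset, w is its only weight."""
    return (w + arch) * x

def hyper_predict(theta, arch):
    """Toy hypernetwork: predicts the weight w from the architecture descriptor."""
    return theta[0] + theta[1] * arch

def train_hypernetwork(archs, data, lr=0.05, epochs=200):
    theta = [0.0, 0.0]                          # hypernetwork parameters
    for _ in range(epochs):
        for arch in archs:                      # every architecture in the sample
            for x, y in data:                   # every labeled example
                w = hyper_predict(theta, arch)  # predicted weight for this architecture
                err = target_net(w, arch, x) - y
                grad_w = err * x                # d(0.5 * err**2) / dw
                # w itself is never updated; the gradient is passed back
                # to theta through the chain rule (dw/dtheta = [1, arch]).
                theta[0] -= lr * grad_w
                theta[1] -= lr * grad_w * arch
    return theta

# Hypothetical task: labeled data from y = 3x
data = [(k / 5.0, 3.0 * (k / 5.0)) for k in range(-5, 6)]
train_archs = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta = train_hypernetwork(train_archs, data)

# Predict a weight for an architecture that was not in the training sample
w_new = hyper_predict(theta, 0.25)
print(round(target_net(w_new, 0.25, 2.0), 3))
```

After training, the toy hypernetwork produces a usable weight for a descriptor it never saw, with no gradient steps on that target network. The real system replaces the linear map with a graph neural network and the single weight with the full parameter set of each layer.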
Knyazev’s team took these ideas, wrote their own code from scratch, and improved upon it. To ensure that GHN-2 learns to predict parameters for a wide range of neural network architectures, they created a unique data set of 1 million architectures. The team then tested GHN-2 by having it predict parameters for a given task, such as classifying images in a data set. They also tested its ability to predict parameters for any random candidate architecture. Such a candidate could have properties similar to the million architectures in the training data set, or it could differ from them. In the former case, the architecture is said to be in distribution; in the latter, it is out of distribution. Testing GHN-2 on out-of-distribution architectures was important because it shows whether the hypernetwork generalizes beyond the kinds of architectures it was trained on.
Using the fully trained GHN-2, the team predicted parameters for 500 random target network architectures. On CIFAR-10, a data set of images, the accuracy achieved with GHN-2’s predicted parameters was about 66.9% for in-distribution architectures and about 60% for out-of-distribution architectures. So GHN-2 lost only around seven percentage points of accuracy on architectures unlike those it was trained on.