In machine learning, one of the key challenges is selecting the correct hyperparameter values for your problem. This includes things like model complexity, learning rate, batch size, etc. This has largely been done using intuition, which is decidedly not scientific. At GTC 2017, there were two talks on using genetic algorithms to select the best hyperparameter values for neural networks:
S7230 – Joseph Schneible (Technica Corporation) Using Genetic Algorithms to Optimize Recurrent Neural Networks.
S7435 – Steven Young, (Oak Ridge National Laboratory) Adapting DL to New Data: An Evolutionary Algorithm for Optimizing Deep Networks.
I decided to try this out using one of the common image classification benchmarks, CIFAR-10. This is a standard classification problem, the goal being to place each image into one of 10 classes. I started with the tutorial implementation available in the TensorFlow repository.
I wrote a small library to do genetic optimization internally at Acceleware. It supports the following types of traits:
- Integers within a specified range
- Uniformly distributed floating point values within a specified range
- Geometrically distributed floating point values
- Geometrically distributed integers
The next task was to determine the ranges of values for each of the hyperparameters. The network as implemented in the sample code (cifar10.py inference()) looks like this:
I decided to vary the following parameters:
In total there are 138240 combinations for all of the parameters in the table above. At an average of one hour per combination, on 4 GPUs, it would take almost 4 years to run all possible combinations! We can do better, using genetic algorithms.
I modified inference() to take these parameters in a config dictionary:
I then created a population with the following parameters: The first generation had 20 randomly generated entities. The parents of each subsequent generation consisted of:
- The top 10 members of the previous generation
- The top 10 entities of all time
- One new random entity
Fitness was determined using the classification accuracy for the validation set. The parents were randomly paired off to breed with each other until 20 offspring were created for the next generation. When breeding, each trait had a 5% chance of mutating.
Using two NVIDIA Tesla P100 and two NVIDIA Tesla K20c GPUs, I ran this algorithm through 8 generations, and the top entities were as follows, with the original sample code values for comparison:
As you can see, we have achieved an improvement over the baseline with minimal effort. The top performing algorithms use other network topologies.
What can we do to further improve? We could modify the algorithm beyond simple hyperparameter selection and allow it to actually choose different network topologies. We could also try using Bayesian optimization as an alternative to genetic algorithms, although that can be harder to parallelize. We could also try penalizing the fitness function with the size of the network in order to build a model which is not only accurate but fast and memory-efficient as well.