Normalization is an important method for data pre-processing, especially in machine learning. Why do we need normalization? It can change the distribution of a feature data set and the distances among observation points, and it can accelerate the convergence of deep learning algorithms [1]. As another example, our AxHeat product uses normalization in the RF heating of reservoirs; applying it to the water saturation improved the stability of the algorithm.

In this blog post, we will discuss three types of fundamental normalization. Methods specific to deep learning, such as local response normalization, batch normalization and layer normalization, will be discussed in a future blog post [2].

Zero-mean normalization

Equation:

x' = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the input data.

In this case, zero-mean normalization is the same as standardization: the processed data have zero mean and unit standard deviation (and follow the standard normal distribution if the original data are normally distributed). For clustering algorithms that use distance to measure similarity, such as k-means, zero-mean normalization is a good choice [3].
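As a minimal sketch (not code from the original post), zero-mean normalization can be applied with NumPy as follows; the array x is a made-up example:

```python
import numpy as np

def zero_mean_normalize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Example: a small, made-up feature vector
x = np.array([1.0, 2.0, 4.0, 8.0])
z = zero_mean_normalize(x)
print(z.mean(), z.std())  # approximately 0.0 and 1.0
```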

Min-max normalization

Equation:

x' = (x - min) / (max - min)

Min-max normalization is a linear normalization technique. It does not change the shape of the data distribution. However, if the min and max are not stable across input data sets, it may cause instabilities in the results. Min-max normalization is the most common method in image processing, since most pixel values fall in the range [0, 255].
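Here is a minimal sketch of min-max normalization in NumPy; the optional new_min/new_max parameters for rescaling to an arbitrary interval are an addition for illustration, not part of the formula above:

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Linearly rescale x into [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return scaled * (new_max - new_min) + new_min

# Example: 8-bit pixel values rescaled to [0, 1]
pixels = np.array([0, 64, 128, 255])
print(min_max_normalize(pixels))  # [0.   0.251  0.502  1.  ]
```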

Non-linear normalizations

Typical non-linear normalizations include the logarithm, exponential functions, and inverse trigonometric functions. The choice of non-linear function depends on the distribution of your inputs and the expected distribution of the outputs. The logarithm, for example, spreads out small values and therefore has better discernibility for inputs in the range (0, 1], while the arctangent can take any real number as input and maps it to a value in the range (-π/2, π/2).
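As a sketch of two common variants (the specific formulas, log10(x) / log10(max) for positive data and arctan scaled by 2/π, are conventional choices rather than definitions from this post):

```python
import numpy as np

def log_normalize(x):
    """One common choice: log10(x) / log10(max(x)); inputs are assumed positive."""
    x = np.asarray(x, dtype=float)
    return np.log10(x) / np.log10(x.max())

def arctan_normalize(x):
    """Map any real input into (-1, 1) via arctan scaled by 2/pi."""
    return np.arctan(np.asarray(x, dtype=float)) * 2.0 / np.pi

values = np.array([1.0, 10.0, 100.0, 1000.0])
print(log_normalize(values))     # [0.     0.333  0.667  1.   ]
print(arctan_normalize(values))  # values pushed toward 1.0 as inputs grow
```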

 

Let’s have a look at the data distributions after applying zero-mean, min-max, log and arctan normalizations to a standard Cauchy distribution input:

We generate 200 sample points randomly (shown as blue points), which fall in the range [-40.0, 40.0]. After the normalizations, the data are shrunk into [-10.0, 15.0]. The resulting distributions differ, and there is no absolute good or bad; it depends on your criteria. For instance, if your goal is to minimize the distances among the points, min-max is a good choice. If you expect evenly distributed data with clear differences, log may be a good idea.
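A rough sketch of this experiment is shown below. It is not the exact code behind the figure: the random seed, the way samples are filtered to [-40, 40], and the signed-log variant used for negative values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw standard Cauchy samples and keep 200 that fall in [-40, 40]
# (seed and filtering strategy are assumptions).
samples = rng.standard_cauchy(2000)
samples = samples[np.abs(samples) <= 40.0][:200]

zero_mean = (samples - samples.mean()) / samples.std()
min_max   = (samples - samples.min()) / (samples.max() - samples.min())
log_norm  = np.sign(samples) * np.log1p(np.abs(samples))  # symmetric log for signed data
arctan    = np.arctan(samples) * 2.0 / np.pi

for name, data in [("zero-mean", zero_mean), ("min-max", min_max),
                   ("log", log_norm), ("arctan", arctan)]:
    print(f"{name:10s} range: [{data.min():.3f}, {data.max():.3f}]")
```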

Lastly, Scikit-learn provides handy tools to compare and visualize normalized results against your inputs [4]. It is a good idea to run your candidate normalization algorithm on a sample data set before applying it to your real data set.
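For example, the first two normalizations above map directly onto scikit-learn's StandardScaler and MinMaxScaler; this is only a minimal sketch on a made-up column, and reference [4] compares many more scalers with plots:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# scikit-learn scalers expect 2-D input of shape (n_samples, n_features)
X = np.array([[1.0], [2.0], [4.0], [8.0]])

print(StandardScaler().fit_transform(X).ravel())  # zero-mean normalization
print(MinMaxScaler().fit_transform(X).ravel())    # min-max normalization
```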

References

[1] https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normali...
[2] http://yeephycho.github.io/2016/08/03/Normalizations-in-neural-networks
[3] http://ai.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf
[4] http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scal...