In the previous article, on Multi Layer Perceptron networks, we mentioned activation functions, which allow neural networks to learn non-linear functions. In this article, we'll look at these functions in more detail.

As the name implies, Multi Layer Perceptron networks are based on the *perceptron* model, which fits a function

\[f(\mathbf{x}) = H(\mathbf{w} \cdot \mathbf{x} + b)\]

to the data. The activation function for this model is the *Heaviside Step Function* \(H\), but since it is not differentiable, we cannot apply the chain rule to it. Therefore, early neural networks used activation functions that replaced the sudden step with a smooth transition between positive and negative values. One such function is the *softsign function*

\[f(x) = \frac{x}{1 + |x|}\]

whose derivative is given by

\[f'(x) = \frac{1}{(1 + |x|)^2}\]

but more commonly used is the *hyperbolic tangent*

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

for which

\[\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)\]

This is related to the logistic function or *sigmoid*

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

by the relationship

\[\tanh(x) = 2\sigma(2x) - 1\]

Both the softsign and hyperbolic tangent functions are 0 at \(x=0\) and have the property

\[\lim_{x \to \pm\infty} f(x) = \pm 1\]

and so are known as *saturating functions*. One disadvantage of saturating functions is the *vanishing gradient problem*. Since the gradient of a saturating function tends to zero for large positive or negative \(x\), these functions will propagate little information to their inputs during training if they are close to saturation. This makes them unsuitable for use in deep networks. As a result, a variety of *non-saturating functions* are now used.
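As a quick numerical sketch of the vanishing gradient problem (plain Python using only the `math` module; the sample points are arbitrary), we can watch the gradient of \(\tanh\) collapse away from the origin:

```python
import math

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2
    t = math.tanh(x)
    return 1.0 - t * t

# Near the origin the gradient is 1, but once the unit saturates
# almost no gradient flows back to earlier layers.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"tanh'({x:4.1f}) = {tanh_grad(x):.2e}")
```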

One of the most common of these is the *Rectified Linear Unit* (ReLU).

\[\mathrm{ReLU}(x) = \max(0, x)\]

A layer with ReLU activation computes a piecewise linear function of its inputs. ReLU has the advantage of computational simplicity, but can suffer from *dead neurons*. Since it can produce arbitrarily large outputs, but its gradient is zero when the input is less than zero, a neuron with large negative weights can get locked into producing zero outputs. To address this problem a number of variations of ReLU are used, which produce non-zero outputs for negative inputs. *Leaky ReLU* is a piecewise linear function

\[f(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}\]

where \(\alpha\) is a constant in the range \(0 < \alpha <1\). If we treat \(\alpha\) as a trainable parameter, we get *Parametric ReLU* (PReLU).
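A minimal sketch of these variants in plain Python (the function names and the default \(\alpha = 0.01\) here are illustrative choices, not fixed by the article):

```python
def relu(x):
    # max(0, x): zero output and zero gradient for all negative inputs
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small fixed slope alpha keeps a gradient alive for x < 0;
    # treating alpha as a trainable parameter instead gives PReLU.
    return x if x >= 0.0 else alpha * x
```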

*Exponential Linear Unit* (ELU) uses an offset exponential function for negative inputs.

\[f(x) = \begin{cases} x & x \ge 0 \\ \alpha (e^x - 1) & x < 0 \end{cases}\]

This means that the activation will saturate for large negative values but not for positive values.
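A sketch of ELU in plain Python, assuming the common default \(\alpha = 1\):

```python
import math

def elu(x, alpha=1.0):
    # Identity for x >= 0; for x < 0 the output decays smoothly
    # towards -alpha, so only the negative side saturates.
    return x if x >= 0.0 else alpha * (math.exp(x) - 1.0)
```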

The exponential function

\[f(x) = e^x\]

may also be used as an activation function, but has the disadvantage that its gradients can be arbitrarily large, which can lead to an *exploding gradient problem*, where gradients increase without limit during training, leading to instability and the risk of numerical overflow. A more stable alternative is the *softplus function*

\[f(x) = \ln(1 + e^x)\]

This can be seen as a smooth alternative to ReLU. The gradient is given by

\[f'(x) = \frac{1}{1 + e^{-x}} = \sigma(x)\]

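We can check numerically that the gradient of softplus is the sigmoid, using a central finite difference (a sketch in plain Python; the overflow-safe rearrangements below are standard numerical tricks, not something from the article):

```python
import math

def softplus(x):
    # log(1 + e^x), rearranged so exp() never overflows for large x
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Branch on the sign so exp() is only ever called on non-positive values
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Central finite difference of softplus should match sigmoid closely
x, h = 0.7, 1e-6
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
print(numeric_grad, sigmoid(x))
```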
These functions are all monotonic - their gradients are always non-negative with respect to their inputs. More recent research has introduced a number of *non-monotonic* activation functions, which have a minimum for small negative inputs and tend to zero for large negative inputs. These are the *Gaussian Error Linear Unit* (GELU), the *Sigmoid Linear Unit* (SiLU), also known as the *Swish function*, and the *Mish function* (apparently named after Diganta Misra, who devised it).
The GELU activation function is defined as

\[\mathrm{GELU}(x) = x\,\Phi(x)\]

where

\[\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\]

is the cumulative distribution function of the standard normal distribution. For this,

\[\frac{d}{dx}\,\mathrm{GELU}(x) = \Phi(x) + x\,\phi(x)\]

where \(\phi\) is the standard normal probability density function.
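Since the error function is available in Python's `math` module, the exact GELU is a one-liner (a sketch, not a reference implementation):

```python
import math

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written
    # in terms of the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```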
The swish function is given by

\[\mathrm{swish}(x) = x\,\sigma(x)\]

and its derivative is

\[\frac{d}{dx}\,\mathrm{swish}(x) = \mathrm{swish}(x) + \sigma(x)\left(1 - \mathrm{swish}(x)\right)\]


The Mish function is given by

\[\mathrm{mish}(x) = x \tanh\left(\ln(1 + e^x)\right)\]

Its derivative is

\[\frac{d}{dx}\,\mathrm{mish}(x) = \tanh\left(\ln(1 + e^x)\right) + x\,\sigma(x)\,\operatorname{sech}^2\left(\ln(1 + e^x)\right)\]

These are quite similar functions. The fact that they are not monotonic gives them the property of being self-regularising - weights and inputs giving rise to large negative values of \(x\) will tend to be weakened rather than strengthened during training, thus reducing the tendency to overfit.
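The dip below zero for small negative inputs is easy to see numerically (a sketch in plain Python; the sample points are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def mish(x):
    # x * tanh(softplus(x)); log1p(exp(x)) is fine for the
    # moderate x values used here
    return x * math.tanh(math.log1p(math.exp(x)))

# Both dip below zero around x ~ -1.2 and return towards zero
# as x becomes very negative.
for f in (swish, mish):
    print(f.__name__, [round(f(x), 4) for x in (-6.0, -1.2, 0.0, 2.0)])
```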

In some circumstances, we may wish to use an activation function that passes through both large positive and large negative input values while suppressing values near zero. For this purpose there is a family of activation functions known as *Shrink functions*. The *Hard Shrink* function

\[f(x) = \begin{cases} x & |x| > \lambda \\ 0 & |x| \le \lambda \end{cases}\]

is discontinuous, which may lead to unstable behaviour during training. The *Soft Shrink* function

\[f(x) = \begin{cases} x - \lambda & x > \lambda \\ x + \lambda & x < -\lambda \\ 0 & \text{otherwise} \end{cases}\]

avoids this problem. However, if we wish to use a smooth function, there is the *Tanh Shrink* function.

\[f(x) = x - \tanh(x)\]

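The three shrink functions, sketched in plain Python (the threshold \(\lambda = 0.5\) below is just a common default, not part of the definitions):

```python
import math

def hard_shrink(x, lam=0.5):
    # Discontinuous jumps at +/- lambda
    return x if abs(x) > lam else 0.0

def soft_shrink(x, lam=0.5):
    # Continuous: large values are shifted towards zero by lambda
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def tanh_shrink(x):
    # Smooth everywhere: roughly zero near the origin,
    # roughly x for large |x|
    return x - math.tanh(x)
```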
The activation functions discussed so far apply mainly to hidden layers. For output layers, the activation functions used may be chosen according to the requirements of the problem. The sigmoid function and its generalisation the *Softmax function*

\[\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

may be used, while a simple linear output may be suitable for regression.
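A sketch of a numerically stable softmax in plain Python (subtracting the maximum before exponentiating is a standard stabilisation trick, not something required by the definition):

```python
import math

def softmax(xs):
    # Shift by the maximum before exponentiating; the result is
    # unchanged because softmax is invariant to adding a constant
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # a valid probability distribution
```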

Both TensorFlow and PyTorch provide a wide selection of activation functions.
