1. Linear Function (Identity Function)
Formula: f(x)=x
Description: The output is directly proportional to the input. It's a straight line.
When Used:
Output Layer of Regression Models: When predicting a continuous numerical value (e.g., house price, temperature).
Rarely in Hidden Layers: While theoretically possible, using only linear activations throughout a deep network makes the entire network equivalent to a single linear transformation, losing the ability to learn complex patterns.
Advantages:
Simple to understand and implement.
Constant gradient of 1, so the activation itself does not saturate or cause vanishing gradients.
Disadvantages:
Cannot learn non-linear relationships.
A neural network with only linear activation functions can only learn a linear function, regardless of the number of layers.
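A minimal NumPy sketch of that last point (the layer sizes and random weights are arbitrary, chosen only for illustration, and biases are omitted): two stacked linear layers collapse into a single equivalent linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # an arbitrary input vector
W1 = rng.normal(size=(5, 4))       # "hidden" linear layer
W2 = rng.normal(size=(3, 5))       # "output" linear layer

two_layers = W2 @ (W1 @ x)         # two stacked layers with identity activation
one_layer = (W2 @ W1) @ x          # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: depth adds no expressive power
```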
2. Binary Step Function
Formula: f(x)=1 if x≥0, else f(x)=0
Description: Outputs a binary value (0 or 1) based on whether the input crosses a certain threshold (usually 0).
When Used:
Historical Significance: Primarily used in early perceptrons for binary classification tasks.
Not in Modern Deep Learning: Rarely used in hidden layers of modern neural networks due to its limitations.
Advantages:
Simple and computationally inexpensive.
Clear binary output.
Disadvantages:
Non-differentiable at 0: This means gradient-based optimization methods (like backpropagation) cannot be directly applied.
Zero gradient everywhere else: The derivative is 0 for every input other than 0 (where it is undefined), so backpropagation provides no signal for updating the weights.
Cannot handle multi-class problems well.
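A quick NumPy sketch (the helper names are illustrative, not from any library) of the step function and its everywhere-zero derivative, which is why gradient-based learning gets no useful signal:

```python
import numpy as np

def binary_step(x):
    # 1 for x >= 0, otherwise 0
    return np.where(x >= 0, 1.0, 0.0)

def binary_step_grad(x):
    # derivative is 0 everywhere except x = 0, where it is undefined
    return np.zeros_like(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x))       # [0. 0. 1. 1. 1.]
print(binary_step_grad(x))  # [0. 0. 0. 0. 0.]  -> no gradient to learn from
```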
3. Non-Linear Activation Functions (General Advantages)
All the following functions are non-linear. The primary advantage of non-linear activation functions is that they allow neural networks to learn and approximate complex, non-linear relationships in data. Without non-linearity, a multi-layered neural network would essentially behave like a single-layered network, limiting its representational power. They enable the network to learn intricate patterns and solve non-linear classification and regression problems.
4. Sigmoid (Logistic)
Formula: f(x)=1/(1+e^(−x))
Description: Squashes the input value into a range between 0 and 1. It has an "S" shape.
When Used:
Output Layer for Binary Classification: When you need a probability-like output between 0 and 1 (e.g., predicting the probability of an email being spam).
Historically in Hidden Layers: Was popular in hidden layers but has largely been replaced by ReLU and its variants.
Advantages:
Output is normalized between 0 and 1, suitable for probabilities.
Smooth gradient, which prevents "jumps" in output values.
Disadvantages:
Vanishing Gradient Problem: Gradients are very small for very large positive or negative inputs, leading to slow or halted learning in deep networks.
Outputs are not zero-centered: This can cause issues with gradient updates, leading to a "zig-zagging" optimization path.
Computationally expensive compared to ReLU.
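A minimal NumPy sketch of the sigmoid and its derivative (helper names are illustrative), showing how the gradient vanishes for large-magnitude inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, at x = 0

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))                 # squashed into (0, 1)
print(sigmoid_grad(x))            # ~0 at the extremes -> vanishing gradient
```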
5. TanH (Hyperbolic Tangent)
Formula: f(x)=(e^x−e^(−x))/(e^x+e^(−x)) (or tanh(x))
Description: Squashes the input value into a range between -1 and 1. Also has an "S" shape, centered at 0.
When Used:
Hidden Layers: More often used in hidden layers than Sigmoid, particularly in older architectures or recurrent neural networks (RNNs) where it can be beneficial due to its zero-centered output.
Advantages:
Zero-centered output: This is a significant advantage over Sigmoid, as it helps alleviate the zig-zagging effect during gradient descent and makes training more stable.
Stronger gradients than Sigmoid for values closer to 0.
Disadvantages:
Still suffers from Vanishing Gradient Problem: Similar to Sigmoid, gradients become very small for large positive or negative inputs.
Computationally more expensive than ReLU.
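A corresponding sketch for tanh (using NumPy's built-in np.tanh); note the zero-centered output and the larger peak gradient (1 at x = 0, versus 0.25 for sigmoid):

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # peak gradient of 1 at x = 0

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))                  # squashed into (-1, 1), centered at 0
print(tanh_grad(x))                # still ~0 at the extremes (saturation)
```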
6. ReLU (Rectified Linear Unit)
Formula: f(x)=max(0,x)
Description: Outputs the input directly if it's positive, otherwise outputs 0. It's a simple piecewise linear function.
When Used:
Most Common Choice for Hidden Layers: The default activation function for hidden layers in the vast majority of deep learning models (Convolutional Neural Networks, Feedforward Networks, etc.).
Advantages:
Solves Vanishing Gradient Problem (for positive inputs): The gradient is 1 for positive inputs, preventing saturation.
Computationally Efficient: Simple to compute and its derivative is also simple (0 or 1).
Sparsity: Can lead to sparse activations (some neurons output 0), which can be beneficial for efficiency and representation.
Disadvantages:
Dying ReLU Problem: Neurons can become "dead" if their pre-activation is negative for every input; the gradient through them is then always 0, so their weights stop receiving updates via backpropagation.
Outputs are not zero-centered.
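A minimal sketch of ReLU and its derivative (helper names are illustrative); the zero gradient for negative inputs is what makes the dying-ReLU problem possible:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(x.dtype)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(x))        # [0.  0.  0.  0.1 3. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
```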
7. Leaky ReLU
Formula: f(x)=x if x≥0, else f(x)=αx (where α is a small positive constant, e.g., 0.01)
Description: Similar to ReLU, but instead of outputting 0 for negative inputs, it outputs a small linear component.
When Used:
Hidden Layers: Used as an alternative to ReLU when the dying ReLU problem is a concern.
Advantages:
Mitigates Dying ReLU Problem: By providing a small gradient for negative inputs, it allows neurons to "recover" and continue learning.
Computationally efficient.
Disadvantages:
Performance is not always consistent and can vary.
The choice of α is often heuristic.
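A sketch of Leaky ReLU with the commonly used α = 0.01 (a heuristic choice, as noted above):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # small but non-zero gradient for negative inputs
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(leaky_relu(x))       # [-0.03  -0.001  0.     0.1    3.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.   1.  ]
```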
8. Parametric ReLU (PReLU)
Formula: f(x)=x if x≥0, else f(x)=αx (where α is a learnable parameter)
Description: An extension of Leaky ReLU where the slope α for negative inputs is learned during training via backpropagation, rather than being a fixed hyperparameter.
When Used:
Hidden Layers: Can be used in architectures where fine-tuning the negative slope might lead to better performance.
Advantages:
Learns the optimal slope: Allows the model to adapt the activation function to the specific data, potentially leading to better performance.
Addresses the dying ReLU problem.
Disadvantages:
Adds an additional parameter to learn per neuron, slightly increasing model complexity.
Might be prone to overfitting if not enough data is available.
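A toy sketch of PReLU in which α is treated as a parameter with its own gradient; the plain gradient-descent update and the made-up "upstream" gradient are purely illustrative, not a full training loop:

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x >= 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d f / d alpha is x for negative inputs, 0 otherwise
    return np.where(x >= 0, 0.0, x)

alpha = 0.25                              # a common initial value
x = np.array([-2.0, -0.5, 1.0, 3.0])
upstream = np.ones_like(x)                # stand-in gradient from the loss
alpha -= 0.01 * np.sum(upstream * prelu_grad_alpha(x))  # alpha is learned, not fixed
print(alpha)
```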
9. Exponential Linear Unit (ELU)
Formula: f(x)=x if x≥0, else f(x)=α(e^x−1) (where α is a positive constant, often 1)
Description: For positive inputs, it's linear like ReLU. For negative inputs, it smoothly curves towards −α.
When Used:
Hidden Layers: Can be used as an alternative to ReLU and its variants, particularly in deep networks.
Advantages:
Addresses Dying ReLU: The negative values allow for non-zero gradients, preventing dead neurons.
Smoother transition: The exponential curve for negative inputs leads to more robust learning, especially when inputs are slightly negative.
Closer to zero-centered output: For inputs below zero, it converges to −α, pulling the mean activation closer to zero, which can lead to faster learning.
Disadvantages:
Computationally more expensive than ReLU due to the exponential function.
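A sketch of ELU with α = 1; note the smooth curve toward −α for negative inputs and the gradient that never reaches exactly 0:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # gradient is 1 for positive inputs and alpha * e^x (> 0) for negative inputs
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))       # negative side approaches -alpha = -1
print(elu_grad(x))  # never exactly 0, so neurons do not "die"
```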
10. Swish
Formula: f(x)=x⋅sigmoid(βx) (often β=1, so f(x)=x⋅sigmoid(x))
Description: A smooth, non-monotonic function that is a product of the input and the sigmoid of the input. It's "self-gated."
When Used:
Hidden Layers: Demonstrated to outperform ReLU in some deeper models, notably in architectures like EfficientNet.
Advantages:
Smooth and Non-monotonic: The non-monotonicity (a dip below zero before rising) can sometimes help with learning complex patterns.
Better performance in deep networks: Often found to yield better results than ReLU in very deep models.
Avoids the dying ReLU problem.
Disadvantages:
Computationally more expensive than ReLU due to the sigmoid function.
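A sketch of Swish with β = 1 (i.e., x⋅sigmoid(x), also known as SiLU):

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x), written out explicitly
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))   # note the small dip below zero for moderately negative inputs
```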
11. Maxout
Formula: f(x)=max(w_1^T x+b_1, w_2^T x+b_2, …, w_k^T x+b_k)
Description: Instead of a fixed function, a Maxout unit takes the maximum of k linear functions. It's a generalization of ReLU (ReLU is a Maxout unit with one linear function being 0 and the other being x).
When Used:
Hidden Layers: Can be used in deep networks, often alongside dropout.
Advantages:
Approximates any convex function: This makes it a very powerful and expressive activation function.
Does not suffer from dying ReLU: Since it's the maximum of linear functions, the gradient will always be non-zero for at least one of the linear functions.
Does not saturate, so gradients do not vanish (a consequence of its piecewise linear nature).
Disadvantages:
Increases number of parameters: Each Maxout unit has k times more parameters than a standard ReLU unit, significantly increasing model complexity and training time.
Computationally more expensive during inference due to evaluating multiple linear functions.
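A sketch of a single Maxout unit with k affine pieces; the sizes and random weights here are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4                    # k linear pieces, input dimension d
W = rng.normal(size=(k, d))    # one weight vector per piece
b = rng.normal(size=(k,))      # one bias per piece

def maxout(x):
    # evaluate all k affine functions, keep the largest
    return np.max(W @ x + b)

x = rng.normal(size=(d,))
print(maxout(x))
```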
12. Softmax
Formula: For an input vector z=[z_1, z_2, …, z_K], the softmax function outputs a probability distribution σ(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j)
Description: Converts a vector of arbitrary real values into a probability distribution, where each value is between 0 and 1, and all values sum to 1.
When Used:
Output Layer for Multi-class Classification: This is its primary and almost exclusive use. It's used when a data point belongs to exactly one of several possible classes (e.g., classifying an image as a cat, dog, or bird).
Advantages:
Provides a probability distribution: The output directly represents the confidence scores for each class.
Differentiable and pairs naturally with the cross-entropy loss. (Note that naive exponentiation can overflow for large inputs; standard implementations subtract the maximum logit before exponentiating to keep the computation numerically stable, as sketched below.)
Disadvantages:
Only suitable for multi-class classification output layers.
Not used in hidden layers.
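A sketch of softmax using the max-subtraction trick mentioned above to avoid overflow:

```python
import numpy as np

def softmax(z):
    # subtracting the max does not change the result but prevents overflow
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

z = np.array([1000.0, 1001.0, 1002.0])   # naive exp() would overflow here
p = softmax(z)
print(p, p.sum())                         # a valid probability distribution summing to 1
```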
The evolution of activation functions reflects the continuous effort to overcome limitations like vanishing gradients and improve training stability and performance in deeper neural networks. While ReLU remains the workhorse for many hidden layers due to its simplicity and effectiveness, newer functions like Swish and ELU offer promising alternatives for specific architectures and tasks.