Tanh vs Softmax: Which is Better?
Both Tanh (Hyperbolic Tangent) and Softmax are activation functions, but they serve different purposes in machine learning.
1️⃣ Tanh (Hyperbolic Tangent)
- Formula: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- Range: (-1, 1)
- Behavior:
  - Outputs values between -1 and 1 (zero-centered).
  - Works well for hidden layers in deep networks.
- Derivative: $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$
- Advantages:
  - ✅ Zero-centered output → faster training in deep networks.
  - ✅ Handles negative inputs better than sigmoid.
- Disadvantages:
  - ⚠️ Vanishing gradient problem for inputs of large magnitude.
Example in PyTorch:
```python
import torch

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
tanh_output = torch.tanh(x)
print(tanh_output)  # tensor([-0.9640, -0.7616, 0.0000, 0.7616, 0.9640])
```
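The derivative identity above is easy to check with autograd, and the same sketch shows the vanishing-gradient effect at large |x| (the input values here are arbitrary):

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()                       # fills x.grad with dtanh/dx at each point

analytic = 1 - torch.tanh(x) ** 2        # closed-form derivative
print(torch.allclose(x.grad, analytic))  # True
print(x.grad)                            # ≈ 0 at x = ±10 → vanishing gradient
```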
2️⃣ Softmax
- Formula: $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
- Range: (0, 1)
- Behavior:
  - Outputs a probability distribution (components sum to 1).
  - Used in multi-class classification.
- Derivative: $\frac{\partial S_i}{\partial x_j} = S_i(\delta_{ij} - S_j)$, where $S_i = \text{Softmax}(x_i)$
- Advantages:
  - ✅ Converts raw scores (logits) into probabilities.
  - ✅ Ensures the outputs sum to 1.
- Disadvantages:
  - ⚠️ Numerically sensitive to large input values → can overflow without stabilization (e.g., subtracting the maximum logit).
Example in PyTorch:
```python
import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])
softmax_output = F.softmax(x, dim=0)
print(softmax_output)  # tensor([0.6590, 0.2424, 0.0986])
```
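The Jacobian formula above can also be verified numerically. A minimal sketch against autograd, reusing the example logits:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])
s = F.softmax(x, dim=0)

# S_i * (delta_ij - S_j) written as a matrix: diag(S) - S Sᵀ
analytic = torch.diag(s) - torch.outer(s, s)
autograd = torch.autograd.functional.jacobian(lambda t: F.softmax(t, dim=0), x)
print(torch.allclose(analytic, autograd))  # True
```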
🔑 Key Differences
| Feature | Tanh | Softmax |
|---|---|---|
| Formula | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ |
| Range | (-1, 1) | (0, 1) |
| Use Case | Hidden layers | Output layer (multi-class classification) |
| Zero-centered? | ✅ Yes | ❌ No |
| Gradient Issues? | ⚠️ Vanishing gradient | ⚠️ Sensitive to large values |
| Probability Interpretation? | ❌ No | ✅ Yes |
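The "sensitive to large values" entry refers to overflow in $e^{x_i}$. The standard fix, which library implementations such as `F.softmax` typically apply internally, is to subtract the maximum logit first; this leaves the result unchanged because softmax is shift-invariant. A sketch with arbitrary large logits:

```python
import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows float32 → inf/inf → nan
naive = torch.exp(x) / torch.exp(x).sum()

# Stable softmax: shift by the max logit before exponentiating
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()

print(naive)   # tensor([nan, nan, nan])
print(stable)  # tensor([0.0900, 0.2447, 0.6652])
```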
🛠️ When to Use Each?
- Use Tanh in hidden layers when you want zero-centered values.
- Use Softmax in the output layer for multi-class classification to get probability scores (a minimal model sketch follows below).
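Putting the two together, here is a minimal sketch (the layer sizes and input are arbitrary): Tanh in the hidden layer, softmax on the output to produce class probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Linear(4, 8),   # hypothetical input dim 4, hidden dim 8
    nn.Tanh(),         # zero-centered hidden activation
    nn.Linear(8, 3),   # 3 output classes → raw logits
)

x = torch.randn(2, 4)              # a batch of 2 fake samples
logits = model(x)
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))            # each row sums to 1

# Note: for training, nn.CrossEntropyLoss expects raw logits (it applies
# log-softmax internally), so softmax is usually applied only at inference.
```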
🏆 Which is Better?
- For hidden layers → Tanh is the better of the two (Softmax is rarely used inside a network).
- For multi-class classification output → Softmax is the better choice, since it yields a probability distribution.
Let me know if you need further clarification! 🚀