March 20, 2025

Tanh vs Softmax: Which is Better?

Both Tanh (Hyperbolic Tangent) and Softmax are activation functions, but they serve different purposes in machine learning.


1๏ธโƒฃ Tanh (Hyperbolic Tangent)

  • Formula: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  • Range: (-1, 1)
  • Behavior:
    • Outputs values between -1 and 1 (zero-centered).
    • Works well for hidden layers in deep networks.
  • Derivative: $\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)$
  • Advantages:
    ✅ Zero-centered output → faster training in deep networks.
    ✅ Handles negative inputs better than sigmoid.
  • Disadvantages:
    โŒ Vanishing gradient problem for large/small values.

Example in PyTorch:

import torch
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
tanh_output = torch.tanh(x)
print(tanh_output) # tensor([-0.9640, -0.7616, 0.0000, 0.7616, 0.9640])
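
As a quick sanity check (an addition to the original example), the derivative identity above can be verified with PyTorch's autograd:

import torch

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], requires_grad=True)
torch.tanh(x).sum().backward() # gradient of the sum is the elementwise derivative
print(x.grad)                        # autograd result
print(1 - torch.tanh(x.detach())**2) # closed-form 1 - tanh^2(x): matches

Both lines print the same values and show the vanishing gradient at the extremes (≈0.07 at x = ±2, versus 1.0 at x = 0).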

2๏ธโƒฃ Softmax

  • Formula: $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
  • Range: (0, 1)
  • Behavior:
    • Outputs a probability distribution (sum = 1).
    • Used in multi-class classification.
  • Derivative: $\frac{\partial S_i}{\partial x_j} = S_i (\delta_{ij} - S_j)$, where $\delta_{ij}$ is the Kronecker delta (a numerical check follows this list).
  • Advantages:
    ✅ Converts raw scores into probabilities.
    ✅ Ensures the outputs sum to 1.
  • Disadvantages:
    โŒ Sensitive to large input values โ†’ may need normalization.

Example in PyTorch:

import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])
softmax_output = F.softmax(x, dim=0)
print(softmax_output) # tensor([0.6590, 0.2424, 0.0986])
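
To illustrate the sensitivity to large inputs mentioned above, here is a minimal sketch of the standard max-subtraction trick (the logit values are made up for illustration):

import torch
import torch.nn.functional as F

x = torch.tensor([1000.0, 1001.0, 1002.0])
naive = torch.exp(x) / torch.exp(x).sum()                      # exp(1000) overflows to inf
stable = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum() # shift before exp
print(naive)               # tensor([nan, nan, nan])
print(stable)              # tensor([0.0900, 0.2447, 0.6652])
print(F.softmax(x, dim=0)) # matches stable: PyTorch applies this shift internally

Subtracting the maximum leaves the result unchanged because softmax is invariant to adding a constant to every input.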

🔑 Key Differences

| Feature | Tanh | Softmax |
| --- | --- | --- |
| Formula | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ |
| Range | (-1, 1) | (0, 1) |
| Use case | Hidden layers | Output layer (multi-class classification) |
| Zero-centered? | ✅ Yes | ❌ No |
| Gradient issue | Vanishing gradient | Overflow for large inputs |
| Probability interpretation? | ❌ No | ✅ Yes |

๐Ÿ› ๏ธ When to Use Each?

  • Use Tanh in hidden layers when you want zero-centered values.
  • Use Softmax in the output layer for multi-class classification to get probability scores (a minimal model sketch follows).
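
Putting the two together, here is a minimal sketch of a classifier with Tanh in the hidden layer and Softmax at the output; the layer sizes (4 → 8 → 3) are arbitrary choices for illustration:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.Tanh(),         # zero-centered hidden activation
    nn.Linear(8, 3),
    nn.Softmax(dim=1), # probability distribution over 3 classes per sample
)
probs = model(torch.randn(2, 4)) # batch of 2 samples
print(probs.sum(dim=1))          # tensor([1., 1.]) - each row sums to 1

Note that when training with nn.CrossEntropyLoss you would drop the final Softmax and pass raw logits, since that loss applies log-softmax internally.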

🚀 Which is Better?

  • For hidden layers → Tanh is better.
  • For multi-class classification output → Softmax is better.

