1. Gradient Descent Convergence: Features with much larger scales dominate the gradient, stretching the loss surface into long, narrow contours; gradient descent then zigzags across the narrow direction and converges slowly. Scaling brings all features to a similar range, so the optimizer can take well-proportioned steps toward the minimum (see the standardization sketch after this list).
2. Activation Functions: Many activation functions (like sigmoid or tanh) are sensitive to the input range. Large input values push them into their flat, saturated regions, where the gradient becomes vanishingly small and learning stalls. Scaling keeps inputs in the range where these functions still have a usable gradient (illustrated in the saturation sketch after this list).
3. Weight Initialization: Standard initialization schemes (such as Xavier/Glorot or He initialization) assume that inputs have roughly comparable, unit-scale variance. If features have vastly different scales, the initial activations and gradients can be far too large or too small, causing instability early in training.
4. Regularization Techniques: Regularization such as L2 penalizes large weights. Without scaling, the penalty depends on the units a feature happens to be measured in rather than on its importance: a feature with a small scale needs a large weight to have the same effect and is penalized heavily, while a feature with a large scale gets a tiny weight and is barely regularized at all.
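As a concrete illustration of points 1, 3, and 4, here is a minimal NumPy sketch of z-score standardization; the two features and their magnitudes (an income-like column and an age-like column) are made-up examples, not data from this post:

```python
import numpy as np

# Two synthetic features on very different scales (illustrative values only):
# an income-like column (~tens of thousands) and an age-like column (~tens).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(60_000, 15_000, size=1_000),  # scale ~1e4
    rng.normal(40, 12, size=1_000),          # scale ~1e1
])

# Z-score standardization: subtract each column's mean and divide by its
# standard deviation, so every feature ends up with mean 0 and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("raw std per feature:   ", X.std(axis=0))        # roughly [15000, 12]
print("scaled std per feature:", X_scaled.std(axis=0)) # roughly [1, 1]
```

With both columns on the same unit scale, no single feature dominates the gradient, standard initializations behave as intended, and an L2 penalty treats the two weights comparably.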
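Point 2 can be checked directly: the sigmoid's gradient at a large raw input is effectively zero, while a scaled input keeps a usable gradient. A small sketch (the specific input values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.5))   # ~0.235   -> a useful learning signal
print(sigmoid_grad(50.0))  # ~1.9e-22 -> saturated; the gradient has vanished
```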