Kubernetes autoscaling is the ability to dynamically adjust compute resources — either at the pod level or node level — based on real-time workload demands.
This helps achieve better performance, higher efficiency, and cost savings without manual intervention.
⸻
1. Horizontal Pod Autoscaler (HPA)
Purpose:
Adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed workload demand.
Why use it?
Some applications experience fluctuating traffic — high demand during peak hours and low demand during off-hours. HPA ensures enough pods are available during spikes while scaling down during idle periods, saving resources.
⸻
How It Works
1. HPA monitors a specific metric (like CPU, memory, or a custom metric).
2. It compares the observed average value with your target value.
3. If the observed value is higher than target → it scales up (adds replicas).
4. If lower than target → it scales down (removes replicas).
Formula for scaling decision:
Desired Replicas = ceil(Current Replicas × (Current Metric Value / Target Value))
⸻
Example
• Target CPU utilization: 50%
• Current mean CPU utilization: 75%
• Current replicas: 5
Calculation:
Desired Replicas = ceil(5 × (75 / 50)) = ceil(7.5) = 8
HPA will increase replicas from 5 to 8 to balance load.
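For concreteness, here is a minimal HPA manifest matching this scenario. It is a sketch under stated assumptions: the Deployment name web-app and the replica bounds are illustrative, and the autoscaling/v2 API (stable since Kubernetes 1.23) is available in your cluster.
```yaml
# Minimal HPA: keeps average CPU utilization near 50%,
# scaling the "web-app" Deployment between 2 and 10 replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app          # assumed example Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
After applying it (kubectl apply -f hpa.yaml), scaling decisions can be watched with kubectl get hpa web-app-hpa --watch.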
⸻
Requirements
• Metrics source:
  • For CPU/memory: metrics-server must be running in your cluster.
  • For custom metrics: implement the custom.metrics.k8s.io API.
  • For external metrics (like Kafka lag or queue length): implement the external.metrics.k8s.io API.
• Pod resource requests:
CPU/memory requests must be set in your pod spec for accurate scaling (see the sketch below).
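For reference, a Deployment fragment with explicit requests and limits might look like the sketch below; the names, image, and values are all illustrative. Without the requests block, a utilization target such as 50% CPU has no baseline to compute against.
```yaml
# Deployment with explicit resource requests: requests give the
# autoscalers a baseline to measure utilization against; limits
# cap what each container may consume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app              # assumed example name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web
          image: nginx:1.27  # placeholder image
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```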
⸻
When to Use
• Stateless workloads (e.g., web apps, APIs).
• Batch jobs that can run in parallel.
• Alongside Cluster Autoscaler, so nodes scale as the pod count grows.
⸻
Best Practices
1. Install and configure metrics-server.
2. Always set requests for CPU/memory in pods.
3. Use custom metrics for application-specific scaling triggers (e.g., request latency); a sketch follows this list.
4. Combine with Cluster Autoscaler for full elasticity.
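To illustrate point 3, the sketch below scales on a per-pod custom metric. It assumes a metrics adapter (for example, Prometheus Adapter) is already serving the custom.metrics.k8s.io API; the metric name http_requests_per_second is hypothetical and depends entirely on your adapter configuration.
```yaml
# HPA driven by a custom per-pod metric instead of CPU.
# "http_requests_per_second" is an assumed metric name.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app                      # assumed example Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"          # target ~100 req/s per pod
```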
⸻
2. Vertical Pod Autoscaler (VPA)
Purpose:
Adjusts resource requests and limits (CPU, memory) for individual pods based on observed usage.
Why use it?
Some applications are not easy to scale horizontally (e.g., stateful apps, monoliths) but can benefit from more CPU/memory when needed.
⸻
How It Works
VPA has three components:
1. Recommender – Analyzes usage and suggests optimal CPU/memory requests.
2. Updater – Evicts pods whose current requests drift from the recommendation so they can be recreated with updated values.
3. Admission Controller – Modifies pod specs at creation with updated requests/limits.
Important: VPA applies changes by evicting and recreating pods; it does not resize running pods in place.
⸻
Example
If your app was originally given:
• CPU: 200m
• Memory: 256Mi
…but usage shows it consistently needs:
• CPU: 500m
• Memory: 512Mi
VPA will terminate the pod and recreate it with updated values.
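A sketch of the corresponding VPA object, assuming the VPA components (Recommender, Updater, Admission Controller) are installed in the cluster; the Deployment name my-app and the bounds are illustrative:
```yaml
# VPA: the Recommender computes request suggestions; with
# updateMode "Auto" the Updater evicts pods so they restart
# with the new values. Use "Off" to collect recommendations
# without restarting anything.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app             # assumed example Deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "1"
          memory: 1Gi
```
Once the Recommender has gathered enough usage history, its suggestions can be inspected with kubectl describe vpa my-app-vpa.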
⸻
When to Use
• Stateful workloads (databases, in-memory caches).
• Apps with unpredictable CPU/memory bursts.
• Workloads where horizontal scaling is difficult or impossible.
⸻
Best Practices
1. Start with updateMode: Off to collect recommendations first.
2. Avoid using VPA and HPA on CPU for the same workload (conflicts possible).
3. Understand seasonality: If workload fluctuates often, VPA may restart pods too frequently.
⸻
3. Cluster Autoscaler
Purpose:
Adjusts the number of nodes in a Kubernetes cluster by adding/removing nodes based on scheduling needs.
Why use it?
To ensure enough nodes are available to run pods while reducing costs during low demand.
⸻
How It Works
Cluster Autoscaler continuously checks:
1. Unschedulable pods – If a pod cannot be scheduled because all nodes are full, it adds more nodes.
2. Underutilized nodes – If a node is mostly empty and its pods can be moved elsewhere, it removes the node.
⸻
Example
• Your cluster has 3 nodes fully utilized.
• A new pod is scheduled but can’t fit anywhere.
• Cluster Autoscaler adds a new node to accommodate the pod.
• Later, if a node’s utilization drops below a threshold (e.g., 50%), it may remove that node.
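Cluster Autoscaler configuration is provider-specific, but as a rough sketch, the relevant container arguments in its Deployment might look like the fragment below on AWS. The node group name my-asg, the 1:10 size bounds, and the image tag are assumptions; pick a release that matches your Kubernetes version.
```yaml
# Fragment of the cluster-autoscaler pod spec (AWS example).
# --nodes=<min>:<max>:<group> bounds one autoscaling group;
# --scale-down-utilization-threshold=0.5 matches the ~50%
# scale-down example above.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # match your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:10:my-asg                       # assumed node group name
      - --scale-down-utilization-threshold=0.5
      - --balance-similar-node-groups
      - --skip-nodes-with-local-storage=false
```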
⸻
When to Use
• On cloud platforms (AWS, GCP, Azure) with autoscaling node pools.
• For workloads with large demand spikes.
• To save costs in pay-as-you-go environments.
⸻
Best Practices
1. Keep all nodes in a node group with the same specs.
2. Define resource requests for every pod.
3. Set a PodDisruptionBudget for critical workloads; a sketch follows this list.
4. Pair with HPA for pod scaling + node scaling synergy.
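For point 3, a minimal PodDisruptionBudget sketch (name and label are illustrative). It prevents voluntary disruptions, including autoscaler-driven node drains, from taking too many replicas offline at once:
```yaml
# Keeps at least 2 matching pods available during voluntary
# disruptions such as node drains triggered by scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app          # assumed label
```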
⸻
Best Practices for Combining Autoscaling Methods
• HPA + Cluster Autoscaler → Common pairing for elastic web services.
• VPA + Cluster Autoscaler → For workloads needing more power per pod.
• Avoid HPA + VPA on CPU for same workload (can cause constant scaling changes).
• Always have monitoring in place to validate scaling behavior (Prometheus, Grafana).
⸻
Quick Comparison Table
| Feature | HPA | VPA | Cluster Autoscaler |
| --- | --- | --- | --- |
| Scales pods? | ✅ | ❌ | ❌ |
| Scales node count? | ❌ | ❌ | ✅ |
| Changes pod resources? | ❌ | ✅ | ❌ |
| Works with stateful apps? | ⚠️ | ✅ | ✅ |
| Needs metrics-server? | ✅ | ✅ | ❌ |
| Cloud/IaaS dependent? | ❌ | ❌ | ✅ |
⸻