Autoscaling

This guide covers configuring the Kubernetes Horizontal Pod Autoscaler (HPA) for AutoCom so that its workloads scale automatically based on CPU and memory metrics.

Overview

Kubernetes Horizontal Pod Autoscaler automatically adjusts the number of pod replicas based on observed metrics. AutoCom uses HPA for:

  • API - Scale based on CPU and memory usage
  • Frontend - Scale based on CPU and memory usage
  • Nginx - Scale based on CPU usage

Prerequisites

Metrics Server

HPA requires metrics-server to collect resource metrics:

# Check if metrics-server is installed
kubectl get deployment metrics-server -n kube-system

# Install metrics-server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

For Docker Desktop, you may need to patch metrics-server to skip kubelet TLS verification:

kubectl patch deployment metrics-server -n kube-system --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}
]'

Verify Metrics

# Check node metrics
kubectl top nodes

# Check pod metrics
kubectl top pods -n autocom

HPA Configuration

API Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: autocom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

Frontend Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
  namespace: autocom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15

Nginx Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
  namespace: autocom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
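
Assuming the three manifests above are saved as api-hpa.yaml, frontend-hpa.yaml, and nginx-hpa.yaml (the file names here are illustrative), they can be applied and verified with:

# Apply the HPA manifests
kubectl apply -f api-hpa.yaml -f frontend-hpa.yaml -f nginx-hpa.yaml

# Confirm the autoscalers exist and show their targets
kubectl get hpa -n autocom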

Scaling Behavior

Scale Up Behavior

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0  # Scale up immediately
    policies:
      - type: Percent
        value: 100              # Double pods
        periodSeconds: 15       # Every 15 seconds
      - type: Pods
        value: 4                # Or add 4 pods
        periodSeconds: 15       # Every 15 seconds
    selectPolicy: Max           # Use the larger of the two

Explanation:

  • stabilizationWindowSeconds: 0 - No delay before scaling up
  • selectPolicy: Max - Choose the policy that adds the most pods
  • Can double pods OR add 4, whichever is greater
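
For example, with 3 replicas currently running, the Percent policy allows adding 3 pods (100% of 3) and the Pods policy allows adding 4, so up to 4 pods can be added in that 15-second period.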

Scale Down Behavior

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes
    policies:
      - type: Percent
        value: 50                # Remove 50% of pods
        periodSeconds: 60        # Every 60 seconds

Explanation:

  • stabilizationWindowSeconds: 300 - Wait 5 minutes before scaling down
  • Prevents flapping during temporary load drops
  • Removes at most 50% of pods per minute
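
For example, if 10 replicas are running when load drops, at most 5 pods (50% of 10) can be removed in any 60-second period, and only after the 5-minute stabilization window has elapsed.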

Resource Limits

For utilization-based HPA metrics to work, deployments must define resource requests (utilization is calculated as a percentage of the request):

resources:
  requests:
    cpu: 200m      # 0.2 CPU cores
    memory: 384Mi  # 384 MiB
  limits:
    cpu: 1000m     # 1 CPU core
    memory: 768Mi  # 768 MiB

Recommended Values

Component   CPU Request   CPU Limit   Memory Request   Memory Limit
API         200m          1000m       384Mi            768Mi
Nginx       100m          500m        128Mi            256Mi
Frontend    100m          500m        256Mi            512Mi
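
One way to apply these values without editing the deployment manifests is kubectl set resources; the example below uses the Nginx row from the table and assumes the deployment is named nginx:

# Patch the running nginx deployment with the recommended requests/limits
kubectl set resources deployment nginx -n autocom \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi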

Monitoring HPA

Check HPA Status

kubectl get hpa -n autocom

Example output:

NAME           REFERENCE             TARGETS                        MINPODS   MAXPODS   REPLICAS
api-hpa        Deployment/api        cpu: 1%/70%, memory: 4%/80%    2         10        2
frontend-hpa   Deployment/frontend   cpu: 4%/70%, memory: 23%/80%   2         8         2
nginx-hpa      Deployment/nginx      cpu: 1%/70%                    2         6         2

Detailed Status

kubectl describe hpa api-hpa -n autocom

Shows:

  • Current metrics
  • Scaling events
  • Conditions

Watch Scaling Events

kubectl get hpa -n autocom -w

Testing Autoscaling

Generate Load

Using the hey load-testing tool:

# Install hey
go install github.com/rakyll/hey@latest

# Generate load (1000 requests, 50 concurrent)
hey -n 1000 -c 50 http://localhost:8080/api/v1/health

Using kubectl run

# Create a load generator pod
kubectl run load-generator --image=busybox -n autocom -- /bin/sh -c "while true; do wget -q -O- http://nginx/api/v1/health; done"

# Watch HPA
kubectl get hpa -n autocom -w

# Clean up
kubectl delete pod load-generator -n autocom

Expected Behavior

  1. Load increases → CPU usage rises
  2. CPU exceeds 70% threshold
  3. HPA triggers scale up
  4. New pods start (within 30 seconds)
  5. Load distributed across more pods
  6. CPU usage decreases per pod
  7. After load stops, wait 5 minutes
  8. HPA triggers scale down
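
To observe this sequence while the load test runs, watch the deployment and the HPA side by side (shown here for the API):

# Watch the replica count change as the HPA reacts
kubectl get deployment api -n autocom -w

# In a second terminal, watch the HPA metrics and review its scaling events
kubectl get hpa api-hpa -n autocom -w
kubectl describe hpa api-hpa -n autocom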

Tuning Tips

For Bursty Traffic

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Pods
        value: 6         # Add more pods quickly
        periodSeconds: 10

For Steady Traffic

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30  # Wait before scaling
    policies:
      - type: Percent
        value: 50         # Scale more gradually
        periodSeconds: 30

For Cost Optimization

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Wait 10 minutes
    policies:
      - type: Percent
        value: 25         # Remove only 25% at a time
        periodSeconds: 120

Vertical Pod Autoscaler (VPA)

For automatic resource request tuning:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: autocom
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # Just recommend, don't auto-update

Note: VPA requires the VPA controller to be installed separately.
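
Once the VPA controller is running and the object above is applied, its recommendations can be read from the VPA status (the exact field layout may vary slightly between VPA versions):

# Inspect the recommended requests produced by the VPA
kubectl describe vpa api-vpa -n autocom

# Or pull just the recommendation out of the status
kubectl get vpa api-vpa -n autocom -o jsonpath='{.status.recommendation}'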

Troubleshooting

HPA Shows "Unknown" Metrics

# Check if metrics-server is running
kubectl get pods -n kube-system | grep metrics-server

# Check metrics-server logs
kubectl logs -n kube-system deployment/metrics-server

Pods Not Scaling

  1. Check resource requests are set
  2. Verify metrics-server is working
  3. Check HPA events: kubectl describe hpa api-hpa -n autocom
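
A quick way to confirm step 1 is to read the requests straight from the deployment spec, for example for the API:

# Print the resources block of the first container in the api deployment
kubectl get deployment api -n autocom \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'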

Scaling Too Aggressively

  1. Increase stabilizationWindowSeconds
  2. Reduce the scale-up policy value (percent or pods per period)
  3. Add more conservative scaleDown policies

Scaling Too Slowly

  1. Decrease stabilizationWindowSeconds
  2. Increase the scale-up policy value
  3. Add more aggressive scaleUp policies

Summary

Setting            API    Frontend   Nginx
Min Replicas       2      2          2
Max Replicas       10     8          6
CPU Target         70%    70%        70%
Memory Target      80%    80%        -
Scale Up Delay     0s     0s         0s
Scale Down Delay   300s   300s       300s

Next Steps