Kubernetes Performance Tuning: Top Best Practices
Production Kubernetes performance tuning requires systematic optimization across resource management, networking, storage, and cluster configuration. This guide provides 15 actionable best practices with implementation code for enterprise-grade performance optimization.
1. Configure Resource Requests and Limits Properly
Proper resource management forms the foundation of Kubernetes performance optimization and directly impacts both cost efficiency and cluster stability. Resource requests tell the scheduler how much CPU and memory a container needs to function properly, ensuring pods are placed on nodes with sufficient capacity. This prevents resource contention and maintains consistent application performance across the cluster. The scheduler uses these requests to make intelligent placement decisions, avoiding scenarios where multiple resource-hungry applications compete for the same node resources.
Resource limits act as safety guardrails, preventing any single container from consuming excessive resources that could impact other workloads or destabilize the entire node. When containers exceed their memory limits, Kubernetes terminates them with an out-of-memory (OOM) kill, while CPU limits trigger throttling to maintain system stability. This protection mechanism is crucial in multi-tenant environments where workload isolation is essential.
The Quality of Service (QoS) classes demonstrated in the code examples—Guaranteed, Burstable, and BestEffort—create a hierarchy for resource allocation and eviction policies. Guaranteed pods receive the highest priority and are least likely to be evicted during resource pressure, making them ideal for critical workloads. Burstable pods can utilize unused resources when available but may be throttled or evicted if resources become scarce. This tiered approach allows you to optimize resource utilization while protecting mission-critical applications.
Setting appropriate requests and limits requires understanding your application’s resource consumption patterns through monitoring and profiling. Under-provisioning leads to performance degradation and potential application failures, while over-provisioning wastes resources and increases costs. The example shows setting GOMAXPROCS based on CPU limits, which helps Go applications optimize their runtime behavior according to available resources, demonstrating how application-level optimizations complement Kubernetes resource management.
CPU and Memory Resource Management
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: optimized-app
  template:
    metadata:
      labels:
        app: optimized-app
    spec:
      containers:
      - name: app
        image: nginx:1.21
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        # Enable CPU throttling awareness
        env:
        - name: GOMAXPROCS
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
Quality of Service Classes
# Guaranteed QoS - Critical workloads
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: critical-app:latest
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "1Gi"
        cpu: "500m"
---
# Burstable QoS - Standard workloads
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: standard-app:latest
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
2. Implement Horizontal Pod Autoscaling (HPA)
Horizontal Pod Autoscaling provides dynamic scalability that adapts to changing workload demands, ensuring optimal performance while minimizing resource costs. The HPA controller continuously monitors specified metrics and automatically adjusts the number of pod replicas to maintain target performance levels. This automation eliminates the need for manual scaling interventions and responds to traffic spikes much faster than human operators could manage.
The advanced HPA configuration shown supports multiple metrics beyond basic CPU utilization, including memory usage and custom metrics like requests per second. This multi-metric approach provides more nuanced scaling decisions that better reflect real-world application performance characteristics. For example, a web application might scale based on incoming request rates rather than just CPU usage, providing more responsive scaling for user-facing services. Note that custom metrics such as requests_per_second are not served by the metrics server itself; they must be exposed through a custom metrics adapter (for example, the Prometheus Adapter) before the HPA can act on them.
The behavior configuration controls scaling velocity and stability, preventing rapid fluctuations that could destabilize applications. Scale-down policies with stabilization windows ensure that temporary traffic spikes don’t trigger immediate scale-down actions, while scale-up policies can be more aggressive to handle sudden load increases. The percentage and absolute pod limits provide fine-grained control over scaling rates, allowing you to balance responsiveness with stability based on your application’s characteristics.
Cost optimization through HPA occurs by automatically reducing replica counts during low-demand periods, such as nights and weekends, while maintaining performance during peak hours. This dynamic resource allocation can result in significant cost savings compared to static provisioning for peak capacity. The minimum and maximum replica settings provide boundaries that ensure basic availability while preventing runaway scaling costs, making HPA both a performance and cost management tool.
CPU-based HPA with Custom Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1k"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
3. Configure Vertical Pod Autoscaling (VPA)
Vertical Pod Autoscaling addresses the challenge of right-sizing individual containers by automatically adjusting CPU and memory requests and limits based on actual usage patterns. Unlike HPA which changes the number of replicas, VPA optimizes the resource allocation per pod, making it particularly valuable for applications that don’t scale horizontally effectively or have variable resource needs over time.
The VPA controller analyzes historical resource consumption data and provides recommendations or automatically updates resource specifications. This automation is especially beneficial for applications with unpredictable resource usage patterns or during development phases where optimal resource requirements aren’t yet known. The continuous optimization ensures that applications receive adequate resources for performance while avoiding over-provisioning that wastes cluster capacity.
The resource policy configuration allows fine-tuned control over VPA behavior, including minimum and maximum allowed resources and which specific resources (CPU, memory, or both) should be managed. The controlledValues setting determines whether VPA manages just requests, just limits, or both, providing flexibility in how aggressively the system optimizes resource allocation. This granular control is essential for maintaining application stability while allowing optimization.
VPA works best in combination with HPA, where VPA optimizes individual pod resource allocation while HPA manages the number of replicas. However, care must be taken to avoid conflicts between the two systems. VPA is particularly effective for batch workloads, databases, and stateful applications where horizontal scaling is limited, providing a complementary optimization strategy that ensures optimal resource utilization across different application types.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
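Because VPA can either recommend or automatically apply changes, it is often introduced in recommendation-only mode first. A minimal sketch, assuming the same target Deployment, with updateMode set to "Off" so recommendations are computed but pods are never evicted:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa-recommend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  updatePolicy:
    updateMode: "Off"  # compute recommendations only; never evict or mutate pods
The resulting recommendations appear in the VPA object's status, where they can be reviewed before switching the update mode to Auto.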
4. Optimize Node Affinity and Pod Scheduling
Strategic pod placement through node affinity and anti-affinity rules significantly impacts both performance and reliability by ensuring workloads run on appropriate hardware and maintain proper distribution across the cluster. Node affinity allows you to specify preferences or requirements for which nodes should host your pods based on node labels such as instance type, availability zone, or custom hardware characteristics. This targeting ensures that compute-intensive applications run on high-performance nodes while memory-intensive workloads are placed on memory-optimized instances.
The example demonstrates both required and preferred affinity rules, providing different levels of scheduling constraints. Required rules create hard constraints that must be satisfied, ensuring critical requirements like specific CPU architectures or compliance zones are met. Preferred rules influence scheduling decisions without creating hard failures, allowing the scheduler to optimize placement while maintaining flexibility when resources are constrained.
Pod anti-affinity rules enhance reliability by spreading replicas across different nodes, zones, or even regions, reducing the blast radius of infrastructure failures. The preferredDuringSchedulingIgnoredDuringExecution configuration shown creates soft anti-affinity that attempts to spread pods across nodes while allowing co-location if necessary for resource constraints. This approach balances high availability with practical scheduling limitations in resource-constrained environments.
Advanced scheduling strategies can significantly impact performance by reducing network latency, optimizing cache locality, and ensuring appropriate resource allocation. For example, placing frontend and backend services in the same availability zone reduces network latency, while ensuring database replicas are distributed across zones maintains availability during zone failures. The combination of node and pod affinity rules creates sophisticated placement policies that optimize for both performance and reliability requirements.
Advanced Node Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: compute-intensive-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: compute-app
  template:
    metadata:
      labels:
        app: compute-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64"]
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["c5.xlarge", "c5.2xlarge"]
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a", "us-west-2b"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["compute-app"]
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: compute-app:latest
        resources:
          requests:
            cpu: "1000m"
            memory: "2Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"
5. Implement Pod Disruption Budgets (PDB)
Pod Disruption Budgets provide essential protection against operational disruptions by defining minimum availability requirements during voluntary disruptions such as node maintenance, cluster upgrades, or scaling operations. PDBs ensure that critical applications maintain adequate capacity even when the cluster infrastructure undergoes planned changes, preventing service outages that could impact business operations.
The two main PDB configurations—minAvailable and maxUnavailable—offer different approaches to defining availability requirements. The minAvailable setting ensures a specific number of pods remain running, which is ideal for applications where you know the exact minimum capacity needed for operation. The maxUnavailable percentage-based approach is more flexible for applications that can tolerate proportional capacity reductions, automatically adapting to changes in deployment scale.
PDBs work by blocking eviction requests that would violate the defined availability constraints, causing operations like node drains or cluster autoscaler scale-downs to wait until sufficient capacity becomes available elsewhere. This protection mechanism ensures that administrative operations don’t inadvertently cause service disruptions, making cluster maintenance safer and more predictable.
The strategic implementation of PDBs requires balancing availability requirements with operational flexibility. Overly restrictive PDBs can prevent necessary maintenance operations, while insufficiently protective budgets may allow service disruptions. The examples show different strategies for critical versus standard applications, demonstrating how PDB configuration should align with service criticality and business requirements. This approach ensures that the most important services receive the strongest protection while maintaining operational flexibility for the overall cluster.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: web-app
6. Configure Readiness and Liveness Probes
Health check configuration through readiness, liveness, and startup probes creates a robust application lifecycle management system that ensures only healthy pods receive traffic while automatically recovering from failures. These probes provide Kubernetes with essential information about application state, enabling intelligent traffic routing and automated failure recovery that maintains service availability without manual intervention.
Readiness probes determine when a pod is ready to receive traffic, preventing premature traffic routing to containers that are still initializing or temporarily unable to handle requests. This mechanism is crucial for zero-downtime deployments, as it ensures new pods are fully operational before old ones are terminated. The probe configuration shown includes appropriate timing parameters that balance quick detection of ready state with avoiding false negatives during normal startup operations.
Liveness probes detect and recover from application deadlocks or hung states by restarting containers that fail health checks. The longer initial delay and period for liveness probes compared to readiness probes reflects their different purposes—liveness probes should be more conservative to avoid unnecessary restarts while still detecting genuine failures. The timeout and failure threshold settings provide tunable sensitivity that can be adjusted based on application characteristics and network conditions.
Startup probes address the challenge of applications with slow initialization times by providing a separate health check mechanism during the startup phase. This prevents liveness probes from prematurely terminating containers that require extended startup time, while still providing timely detection of startup failures. The combination of all three probe types creates a comprehensive health management system that handles the full application lifecycle from startup through steady-state operation to failure recovery, significantly improving overall service reliability.
Optimized Health Check Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
      - name: app
        image: web-app:latest
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 30
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
7. Optimize Container Images
Container image optimization directly impacts application startup time, resource consumption, and security posture, making it a critical factor in cluster performance and cost efficiency. The multi-stage Dockerfile example demonstrates how to create minimal production images by separating build dependencies from runtime requirements. This approach dramatically reduces image size, which decreases pull times, reduces storage costs, and minimizes the attack surface by excluding unnecessary components from production containers.
The build stage includes all development tools and dependencies needed to compile the application, while the production stage contains only the compiled binary and essential runtime components. This separation can reduce image sizes from hundreds of megabytes to just tens of megabytes, significantly improving pod startup times especially when images need to be pulled to new nodes. The use of Alpine Linux as the base image further minimizes size while providing necessary system libraries.
Image pull policies significantly impact cluster performance and reliability. The IfNotPresent policy shown in the example reduces network bandwidth and registry load by reusing locally cached images when available, while still ensuring that updated image tags are pulled when necessary. This policy strikes a balance between performance and freshness, reducing startup times for frequently deployed applications while maintaining the ability to deploy updates.
Security optimizations like running containers as non-root users (demonstrated with USER 65534:65534) and using specific image tags rather than “latest” improve both security and reliability. Specific tags ensure reproducible deployments and prevent unexpected changes from upstream image updates, while non-root execution reduces the potential impact of container escapes. These practices, combined with regular image scanning and updates, create a secure and efficient container foundation for your applications.
Multi-stage Dockerfile Example
# Build stage
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .
# Production stage
FROM alpine:3.18
RUN apk --no-cache add ca-certificates tzdata
# Use a directory the non-root user below can access (alpine's /root is mode 0700)
WORKDIR /app
COPY --from=builder /app/main .
USER 65534:65534
EXPOSE 8080
CMD ["./main"]
Image Pull Policy Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: optimized-app
  template:
    metadata:
      labels:
        app: optimized-app
    spec:
      containers:
      - name: app
        image: myregistry.com/optimized-app:v1.2.3
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
8. Configure CNI and Network Performance
Network configuration optimization is crucial for application performance, especially in microservices architectures where inter-service communication represents a significant portion of overall latency. The Calico network policy example demonstrates how to implement micro-segmentation that enhances security while maintaining performance through efficient traffic filtering. Well-designed network policies reduce the blast radius of security incidents while avoiding unnecessary network overhead that could impact application performance.
The network policy configuration shows label-based traffic filtering that allows only necessary communication paths between services. This approach provides security benefits by implementing zero-trust networking principles, but also enables network optimization by clearly defining traffic patterns that can be optimized by the CNI implementation. The policy structure supports both ingress and egress rules, providing comprehensive control over pod-to-pod communication.
Service configuration optimization includes choosing appropriate service types and load balancer configurations that match your performance requirements. The Network Load Balancer (NLB) annotation shown provides lower latency and higher throughput compared to Application Load Balancers for TCP traffic, while cross-zone load balancing ensures even traffic distribution across availability zones. Session affinity configuration can improve performance for stateful applications by reducing connection overhead.
CNI selection and configuration significantly impact cluster networking performance. Different CNI implementations (Calico, Flannel, Cilium, etc.) have varying performance characteristics and feature sets. Calico, as shown in the example, provides both network policy enforcement and high-performance networking, making it suitable for security-conscious environments with performance requirements. The choice of CNI should align with your specific needs for features like network policies, encryption, observability, and performance characteristics.
Calico Network Policy for Performance
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: high-performance-policy
  namespace: production
spec:
  selector: app == 'web-app'
  types:
  - Ingress
  - Egress
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: role == 'frontend'
    destination:
      ports:
      - 8080
  egress:
  - action: Allow
    protocol: TCP
    destination:
      selector: role == 'database'
      ports:
      - 5432
---
apiVersion: v1
kind: Service
metadata:
  name: high-perf-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  sessionAffinity: ClientIP
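Where a vendor-neutral policy is preferred, roughly the same ingress rule can be expressed with the standard networking.k8s.io/v1 API, which policy-enforcing CNIs such as Calico and Cilium also honor. A sketch mirroring the frontend-to-web-app rule above (label names follow the Calico example):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080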
9. Implement Cluster Autoscaling
Cluster autoscaling provides dynamic infrastructure management that automatically adjusts the number of worker nodes based on resource demand, ensuring optimal cost efficiency while maintaining application performance. The cluster autoscaler monitors pod scheduling failures due to insufficient resources and automatically provisions new nodes to accommodate pending workloads. This automation eliminates the need for manual capacity planning and responds to demand spikes much faster than human operators could manage.
The configuration parameters shown control autoscaling behavior to balance responsiveness with stability. The expander setting determines which node group to scale when multiple options are available, with “least-waste” minimizing unused resources for cost optimization. Scale-down parameters like delay-after-add and unneeded-time prevent rapid scaling fluctuations that could destabilize workloads, while the utilization threshold determines when nodes are considered underutilized for removal.
Node group auto-discovery using Auto Scaling Group (ASG) tags enables seamless integration with cloud provider infrastructure, allowing the cluster autoscaler to manage multiple node groups with different instance types and configurations. This capability enables sophisticated scaling strategies where different workload types can trigger scaling of appropriate node types, optimizing both performance and cost. For example, CPU-intensive workloads can trigger scaling of compute-optimized instances while memory-intensive applications scale memory-optimized nodes.
Cost optimization through cluster autoscaling occurs by automatically removing underutilized nodes during low-demand periods, ensuring you only pay for needed capacity. The balance-similar-node-groups feature helps maintain even distribution across availability zones and instance types, improving both cost efficiency and fault tolerance. However, careful tuning of scaling parameters is essential to avoid premature scale-downs that could impact performance or excessive scale-ups that increase costs unnecessarily.
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production-cluster
        - --balance-similar-node-groups
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        env:
        - name: AWS_REGION
          value: us-west-2
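A common complementary safeguard, assumed here rather than taken from the manifest above, is the safe-to-evict annotation, which tells the cluster autoscaler not to remove the node hosting a pod during scale-down:
apiVersion: v1
kind: Pod
metadata:
  name: stateful-worker        # illustrative name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # CA will not scale down the node hosting this pod
spec:
  containers:
  - name: worker
    image: batch-worker:latest # illustrative image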
10. Optimize Storage with CSI Drivers
Storage optimization through Container Storage Interface (CSI) drivers enables applications to leverage high-performance storage while maintaining portability across different infrastructure environments. The high-performance storage class example demonstrates how to configure AWS EBS GP3 volumes with specific IOPS and throughput settings that match application requirements. This granular control over storage performance characteristics allows you to optimize for specific workload patterns while managing costs effectively.
The CSI driver architecture provides a standardized interface between Kubernetes and storage systems, enabling advanced features like volume snapshots, cloning, and expansion without vendor lock-in. The storage class configuration shown includes encryption by default, demonstrating how security can be built into storage provisioning policies. The WaitForFirstConsumer volume binding mode optimizes placement by ensuring volumes are created in the same availability zone as the consuming pod, reducing network latency and improving performance.
Storage performance optimization requires matching volume characteristics to application access patterns. Sequential I/O workloads benefit from high throughput settings, while random I/O applications need high IOPS configurations. The GP3 volume type shown provides the flexibility to tune IOPS and throughput independently, allowing precise optimization for different workload types. This granular control enables better performance at lower costs compared to older volume types that coupled IOPS and storage size.
Volume expansion capabilities enable dynamic storage scaling without application downtime, supporting growing data requirements without service interruption. The allowVolumeExpansion setting shown enables this functionality, which is particularly important for databases and other stateful applications that may experience data growth over time. Combined with monitoring and alerting on storage utilization, this capability enables proactive storage management that prevents outages due to insufficient storage space while minimizing over-provisioning costs.
High-Performance Storage Class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: high-iops-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: high-iops-ssd
  resources:
    requests:
      storage: 100Gi  # example capacity; size to the workload's data set
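As a usage sketch, a stateful workload can consume this storage class through a volumeClaimTemplate so that each replica receives its own GP3 volume provisioned in the zone where it is scheduled; the names, image, and capacity below are illustrative:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database              # illustrative name
spec:
  serviceName: database
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
      - name: db
        image: database-app:latest  # illustrative image
        volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: high-iops-ssd
      resources:
        requests:
          storage: 50Gi       # illustrative size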
Conclusion
Implementing these 15 Kubernetes performance optimization practices creates a foundation for production-grade clusters that deliver exceptional performance, cost efficiency, and reliability. However, the true power of these techniques emerges when they work together as an integrated optimization strategy rather than isolated improvements. Resource management through proper requests and limits enables effective autoscaling, while strategic pod placement enhances the benefits of optimized networking and storage configurations.
The journey toward optimal Kubernetes performance requires a systematic approach that balances competing priorities. Performance optimizations must consider cost implications—aggressive resource allocation may improve application response times but could significantly increase infrastructure expenses. Similarly, reliability improvements through redundancy and distribution need to be weighed against resource efficiency. The most successful implementations adopt a data-driven approach, using comprehensive monitoring and observability to make informed optimization decisions based on actual workload characteristics rather than assumptions.