The “Node Not Ready” error is a common issue faced by Kubernetes operators. When a node enters this state, it is unable to accept new pods due to underlying issues. This troubleshooting guide explains the causes of the error, how it impacts pods already running on the affected node, and how to diagnose and resolve it.
Understanding Node States
Kubernetes nodes can exist in several states, with each one indicating the node’s operational status. Here are the four primary states:
- Ready: A node in the “Ready” state is fully operational and capable of running pods. It meets all the necessary conditions (resources, network connectivity, etc.) to host workloads.
- NotReady: When a node transitions to the “NotReady” state, it indicates that the node is experiencing issues preventing it from accepting new pods. Existing pods on the node continue to run, but no new ones can be scheduled.
- SchedulingDisabled: This state occurs when the node is intentionally marked as unschedulable. It won’t accept any new pods, even if it’s otherwise healthy. Administrators might set this state during maintenance or troubleshooting.
- Unknown: The “Unknown” state typically arises when the Kubernetes control plane loses communication with the node. It lacks information about the node’s status, making it impossible to determine whether it’s ready or not.
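You can check the current state of every node with kubectl get nodes; the STATUS column reports Ready, NotReady, or Unknown, and a cordoned node appears as Ready,SchedulingDisabled. For example (the node name below is a placeholder):
kubectl get nodes
kubectl get node <node-name> -o wide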
Impact on Pods
A “NotReady” node affects pod scheduling. While existing pods on the node continue to operate, no new pods can be assigned to it. Pods intended to run on a “NotReady” node stay in a pending state until the node returns to “Ready” status or they are rescheduled to another node.
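To see which pods are stuck waiting for a node, you can list pods in the Pending phase using a standard field selector:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending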
Now let’s look into the common causes of the Node NotReady error.
Common Causes of Node Not Ready Error
- Lack of System Resources:
  - Memory: Insufficient memory can lead to a node being marked as “NotReady.” Pods may fail to start due to memory constraints.
  - Disk Space: Running out of disk space impacts the node’s ability to function properly.
  - Excessive Processes: Too many processes competing for resources can render the node non-operational.
- kubelet Issues:
  - kubelet Crashes: A crash or stoppage of the kubelet process causes the node to become “NotReady.”
  - Misconfiguration: Errors in the kubelet configuration can stop the node from reaching the “Ready” state.
- Network-Related Problems:
  - Network Partition: Isolation from the cluster network can cause a node to be marked as “NotReady.”
  - DNS Resolution Issues: Nodes unable to resolve DNS names may remain in the “NotReady” state.
- Configuration Issues:
  - CNI Plugin Misconfiguration: Problems with Container Network Interface (CNI) plugins can impact node readiness.
  - Node Labels and Taints: Incorrect labels or taints may prevent pod scheduling.
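To rule out the last cause, you can inspect a node’s labels and taints directly (the node name is a placeholder):
kubectl get node <node-name> --show-labels
kubectl describe node <node-name> | grep -i taints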
Diagnosing and Troubleshooting
There are several approaches you can use to troubleshoot the “Node NotReady” error. Some of your options include the following:
Use kubectl describe node
Step 1: Run kubectl describe node <node-name> to get detailed information about the node’s status.
kubectl describe node <node-name>
Step 2: Look for conditions like MemoryPressure, DiskPressure, or PIDPressure. These indicate resource shortages that might cause the node to be “NotReady.”
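For a more compact view, the same conditions can be pulled out with a JSONPath query, which prints each condition type alongside its status:
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'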
Investigate kubelet logs
Step 1: Check the kubelet logs (journalctl -u kubelet or /var/log/kubelet.log) for any errors or warnings.
sudo journalctl -u kubelet
Step 2: Look for clues related to connectivity issues, configuration problems, or component failures.
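To narrow the output, journalctl can filter by time and message priority; for example, the following shows only recent error-level kubelet messages:
sudo journalctl -u kubelet --since "1 hour ago" -p err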
Verify network connectivity
Ensure that the node can communicate with the control plane and other nodes.
Step 1: Check Node Communication with Control Plane
kubectl get nodes
Step 2: Check Node Communication with Control Plane Using Ping
Ensure nodes and control plane are reachable via ping. Successful ping replies indicate good network connectivity.
ping -c 4 control-plane
Step 3: Check Node Communication with Another Node
ping -c 4 node-456
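If ping succeeds but the node still cannot reach the control plane, you can also check that the Kubernetes API server answers from the node; the address below is a placeholder, and 6443 is the default API server port (yours may differ):
curl -k https://<control-plane-ip>:6443/healthz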
Check DNS resolution, firewall rules, and network routes.
Option 1: Check DNS Resolution
Verify service names resolve correctly using nslookup. Proper resolution means DNS is functioning.
nslookup kubernetes.default.svc.cluster.local
Option 2: Check Firewall Rules
Confirm that firewall rules allow the traffic Kubernetes requires between nodes and the control plane. On hosts using ufw, check its status:
sudo ufw status
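Option 3: Check Network Routes
Confirm correct routes are in place using ip route. Correct routes ensure network traffic flows properly between nodes and the control plane. Assuming a Linux node with the iproute2 tools installed, you can list the routing table:
ip route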
Resolution Strategies
Address System Resource Issues
Option 1: Shut Down Non-Kubernetes Processes
Identify any non-essential processes consuming resources on the node. Shut them down or move them to other nodes.
Step 1: List Running Processes and Their Resource Usage
top -b -n 1 | head -n 20
Step 2: Identify and Shut Down Non-Essential Processes
sudo systemctl stop apache2
sudo systemctl stop mysql
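Before and after freeing resources, it can help to confirm whether the node is actually under memory or disk pressure; these are standard Linux utilities:
free -h
df -h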
Option 2: Run Malware Scans
Ensure the node is free from malware or malicious processes that might impact its performance.
Install and Run ClamAV
sudo apt-get update
sudo apt-get install clamav
sudo freshclam
sudo clamscan -r / --log=/var/log/clamav/scan.log
Option 3: Upgrade the Node
Consider upgrading the node’s hardware (CPU, memory, storage) if resource constraints persist.
Restart Components
Option 1: kubelet
Restart the kubelet service using sudo systemctl restart kubelet.
Option 2: kube-proxy
Similarly, restart kube-proxy using sudo systemctl restart kube-proxy.
Option 3: Docker
If you’re using Docker as the container runtime, restart it as well: sudo systemctl restart docker.
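After restarting components, you can verify that the services are running and watch for the node to return to the Ready state (the -w flag streams status updates):
sudo systemctl status kubelet
kubectl get nodes -w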
Consider Using a Higher Service Tier
If you’re using managed Kubernetes services (like AKS, EKS, or GKE), consider upgrading to a higher service tier. This often provides better performance, reliability, and resource availability.