If you are keen to understand why Kubernetes Pods are placed onto a particular cluster node, then you have come to the right place. This detailed guide explains the Kubernetes scheduler and how it works. It also covers node affinity and anti-affinity; taints and tolerations follow in the next post.
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that Kubelets can run them. The Kubernetes Scheduler is a core component of Kubernetes: after a user or a controller creates a Pod, the Scheduler, which monitors the Object Store for unassigned Pods, assigns the Pod to a Node. The Kubelet, which monitors the Object Store for assigned Pods, then runs the Pod.
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on.
How does the Kubernetes Scheduler work?

The Kubernetes scheduler is in charge of scheduling pods onto nodes. Basically it works like this:
- You create a pod
- The scheduler notices that the new pod you created doesn’t have a node assigned to it
- The scheduler assigns a node to the pod
The scheduler is not responsible for actually running the pod; that is the kubelet’s job. It only needs to make sure every pod has a node assigned to it. Kubernetes in general has this idea of a “controller”. A controller’s job is to:
- look at the state of the system
- notice ways in which the actual state does not match the desired state (like “this pod needs to be assigned a node”)
- act to bring the actual state closer to the desired state
- repeat
The scheduler is a kind of controller. There are lots of different controllers and they all have different jobs and operate independently.
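The loop described above can be sketched in a few lines of Python. This is a toy model, not the real scheduler: the Pod class, pick_node and reconcile_once are hypothetical names invented for illustration, and the placement policy (take the first node) stands in for the real filtering and scoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    name: str
    node_name: Optional[str] = None  # None means "no node assigned yet"


def pick_node(pod: Pod, nodes: List[str]) -> Optional[str]:
    # Hypothetical placeholder policy: take the first node, if any.
    return nodes[0] if nodes else None


def reconcile_once(pods: List[Pod], nodes: List[str]) -> List[str]:
    """One pass of the loop: find unassigned pods and bind each to a node."""
    bound = []
    for pod in pods:
        if pod.node_name is None:          # actual state != desired state
            node = pick_node(pod, nodes)
            if node is not None:
                pod.node_name = node       # act: assign the node
                bound.append(pod.name)
    return bound


pods = [Pod("nginx"), Pod("web", node_name="node3")]
first_pass = reconcile_once(pods, ["node2", "node3"])
second_pass = reconcile_once(pods, ["node2", "node3"])  # nothing left to do
```

Running the pass twice shows the controller idea: the first pass binds the unassigned pod, and the second finds the actual state already matching the desired state and does nothing.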
How Does Kubernetes Select the Right Node?
Enter Node Affinity. Node Affinity allows you to tell Kubernetes to schedule pods only to specific subsets of nodes. The initial node affinity mechanism in early versions of Kubernetes was the nodeSelector field in the pod specification. The node had to include all the labels specified in that field to be eligible to become the target for the pod.
nodeSelector Steps
git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl apply -f pod-nginx.yaml
- The label is attached to a node by name; in this case, node2 is given the label mynode=worker-1.
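To make the nodeSelector rule concrete, here is a small Python sketch of the matching logic: a node is eligible only if it carries every label listed in the pod's nodeSelector. The node label sets and the pod_spec dictionary are assumptions made for illustration (the actual contents of pod-nginx.yaml are not reproduced here), and matches_node_selector is a hypothetical helper, not a Kubernetes API.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """True if the node has every key/value pair the pod asks for."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


# Assumed cluster state after `kubectl label nodes node2 mynode=worker-1`.
nodes = {
    "node2": {"kubernetes.io/hostname": "node2", "mynode": "worker-1"},
    "node3": {"kubernetes.io/hostname": "node3"},
}

# Assumed shape of the pod spec; nodeSelector is the real field name.
pod_spec = {
    "nodeSelector": {"mynode": "worker-1"},
    "containers": [{"name": "nginx", "image": "nginx"}],
}

eligible = [name for name, labels in nodes.items()
            if matches_node_selector(labels, pod_spec["nodeSelector"])]
```

With these assumed labels, only node2 ends up in eligible. A pod with an empty nodeSelector would match every node.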
Viewing Your Pods
kubectl get pods --output=wide
[node1 Scheduler101]$ kubectl describe po nginx
Name: nginx
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node2/
Start Time: Mon, 30 Dec 2019 16:40:53 +0000
Labels: env=test
Annotations: kubectl.kubernetes.io/last-applied-configuration:
Status: Pending
Container ID:
Image: nginx
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qpgxq
Optional: false
QoS Class: BestEffort
Node-Selectors: mynode=worker-1
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7s default-scheduler Successfully assigned default/nginx to node2
Normal Pulling 3s kubelet, node2 Pulling image "nginx"
[node1 Scheduler101]$
- The output above confirms the selector: Node-Selectors: mynode=worker-1
Deleting the Pod
kubectl delete -f pod-nginx.yaml
pod "nginx" deleted
Please note:
- Node affinity is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
- There are currently two types of node affinity.
- requiredDuringSchedulingIgnoredDuringExecution (Required during scheduling, ignored during execution; also known as “hard” requirements)
- preferredDuringSchedulingIgnoredDuringExecution (Preferred during scheduling, ignored during execution; also known as “soft” requirements)
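The difference between the two types can be sketched in Python. The dictionaries below mirror the field names of the pod spec (nodeSelectorTerms, matchExpressions, weight, preference), but term_matches, in_term and schedule are hypothetical helpers, only the In operator is modelled, and the real scheduler combines affinity with many other filters and scores.

```python
def term_matches(labels: dict, term: dict) -> bool:
    """All expressions in a term must hold (only the In operator is modelled)."""
    for expr in term["matchExpressions"]:
        if expr["operator"] != "In":
            raise NotImplementedError(expr["operator"])
        if labels.get(expr["key"]) not in expr["values"]:
            return False
    return True


def in_term(key: str, *values: str) -> dict:
    # Hypothetical helper that builds a one-expression selector term.
    return {"matchExpressions": [
        {"key": key, "operator": "In", "values": list(values)}]}


def schedule(nodes: dict, affinity: dict):
    required = affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    preferred = affinity.get("preferredDuringSchedulingIgnoredDuringExecution", [])
    # Hard requirement: nodes that match no term are filtered out entirely.
    candidates = [name for name, labels in nodes.items()
                  if required is None
                  or any(term_matches(labels, t)
                         for t in required["nodeSelectorTerms"])]
    if not candidates:
        return None  # no eligible node: the pod would stay Pending

    # Soft requirement: matching nodes only earn a higher score.
    def score(name: str) -> int:
        return sum(p["weight"] for p in preferred
                   if term_matches(nodes[name], p["preference"]))

    return max(candidates, key=score)


nodes = {"node2": {"mynode": "worker-1"}, "node3": {"mynode": "worker-3"}}

hard = {"requiredDuringSchedulingIgnoredDuringExecution":
        {"nodeSelectorTerms": [in_term("mynode", "worker-3")]}}
hard_unmet = {"requiredDuringSchedulingIgnoredDuringExecution":
              {"nodeSelectorTerms": [in_term("mynode", "worker-9")]}}
soft_unmet = {"preferredDuringSchedulingIgnoredDuringExecution":
              [{"weight": 1, "preference": in_term("mynode", "worker-9")}]}

hard_choice = schedule(nodes, hard)
hard_unmet_choice = schedule(nodes, hard_unmet)
soft_unmet_choice = schedule(nodes, soft_unmet)
```

In this model an unmet hard requirement leaves the pod without a node, while an unmet soft preference still lets it land somewhere. "IgnoredDuringExecution" means that a pod already running on a node is not evicted if the node's labels change later.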
Show me a demo…
Let’s jump into a quick demonstration by cloning the repository and labelling the nodes:
git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl label nodes node3 mynode=worker-3
kubectl apply -f pod-with-node-affinity.yaml
Viewing Your Pods
kubectl get pods --output=wide
with-node-affinity 1/1 Running 0 9m46s kube-slave1 <none> <none>
[node1 Scheduler101]$ kubectl describe po
Name: with-node-affinity
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node3/
Start Time: Mon, 30 Dec 2019 19:28:33 +0000
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
Status: Pending
Container ID:
Image: nginx
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qpgxq
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26s default-scheduler Successfully assigned default/with-node-affinity to node3
Normal Pulling 22s kubelet, node3 Pulling image "nginx"
Normal Pulled 20s kubelet, node3 Successfully pulled image "nginx"
Normal Created 2s kubelet, node3 Created container nginx
Normal Started 0s kubelet, node3 Started container nginx
Cleaning up
Finally you can clean up the resources you created in your cluster:
kubectl delete -f pod-with-node-affinity.yaml
What is Node Anti-Affinity?
- Some scenarios require reserving one or more nodes for particular pods. Think of the nodes that host your monitoring application.
- Because of their role, those nodes are usually not provisioned with many resources. If pods other than the monitoring ones are scheduled onto them, those pods hurt monitoring and also degrade the applications they host.
- In such a case, you need to use node anti-affinity to keep pods away from a set of nodes.
Show me a demo…
Let us jump into an anti-affinity demonstration by cloning the repository and running the below commands:
git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl label nodes node3 mynode=worker-3
kubectl apply -f pod-anti-node-affinity.yaml
Viewing Your Pods
[node1 Scheduler101]$ kubectl get pods --output=wide
nginx 1/1 Running 0 2m37s node2 <none> <none>
Get node label details
[node1 Scheduler101]$ kubectl get nodes --show-labels | grep mynode
node2 Ready <none> 166m v1.14.9 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node2,kubernetes.io/os=linux,mynode=worker-1,role=dev
node3 Ready <none> 165m v1.14.9 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node3,kubernetes.io/os=linux,mynode=worker-3
Describe the pod
[node1 Scheduler101]$ kubectl describe pods nginx
Name: nginx
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: node2/
Start Time: Mon, 30 Dec 2019 19:02:46 +0000
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
Status: Running
Container ID: docker://2bdc20d79c360e1cd857eeb9bbb9424c726b2133e78f25bf4587e0befe3fbcc7
Image: nginx
Image ID: docker-pullable://nginx@sha256:b2d89d0a210398b4d1120b3e3a7672c16a4ba09c2c4a0395f18b9f7999b768f2
Port: <none>
Host Port: <none>
State: Running
Started: Mon, 30 Dec 2019 19:03:07 +0000
Ready: True
Restart Count: 0
Environment: <none>
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qpgxq
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned default/nginx to node2
Normal Pulling 56s kubelet, node2 Pulling image "nginx"
Normal Pulled 54s kubelet, node2 Successfully pulled image "nginx"
Normal Created 40s kubelet, node2 Created container nginx
Normal Started 39s kubelet, node2 Started container nginx
Adding another key to the matchExpressions with the operator NotIn keeps the nginx pod off any node labelled mynode=worker-1.
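The NotIn rule can be modelled with a short Python sketch. It illustrates the rule as stated in the sentence above, not the contents of pod-anti-node-affinity.yaml, which are not reproduced here. The helper expression_matches and the unlabelled node4 are hypothetical. Note that in Kubernetes label-selector semantics, NotIn also matches nodes that do not carry the key at all.

```python
def expression_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry (In and NotIn only)."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node without the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    raise NotImplementedError(op)


nodes = {
    "node2": {"mynode": "worker-1"},
    "node3": {"mynode": "worker-3"},
    "node4": {},  # hypothetical node without the mynode label
}

# Anti-affinity expressed as "mynode NotIn [worker-1]".
expressions = [{"key": "mynode", "operator": "NotIn", "values": ["worker-1"]}]

allowed = [name for name, labels in nodes.items()
           if all(expression_matches(labels, e) for e in expressions)]
```

Under this rule node2 is excluded, and both node3 and the unlabelled node4 remain eligible.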
Cleaning up
Finally you can clean up the resources you created in your cluster:
kubectl delete -f pod-anti-node-affinity.yaml
In our next blog post, we will learn about node taints and tolerations in detail.