Ajeet Raina Ajeet Singh Raina is a former Docker Captain, Community Leader and Arm Ambassador. He is a founder of Collabnix blogging site and has authored more than 570+ blogs on Docker, Kubernetes and Cloud-Native Technology. He runs a community Slack of 8900+ members and discord server close to 2200+ members. You can follow him on Twitter(@ajeetsraina).

What is Kubernetes Scheduler and why do you need it? – KubeLabs Glossary


If you are keen to understand why Kubernetes Pods are placed onto a particular cluster node, then you have come to the right place. This guide explains what the Kubernetes scheduler is and how it works. It also covers concepts such as nodeSelector, node affinity and node anti-affinity; taints and tolerations are left for a follow-up post.

In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that kubelets can run them. The Kubernetes Scheduler is a core component of Kubernetes: after a user or a controller creates a Pod, the Scheduler, which monitors the Object Store for unassigned Pods, assigns the Pod to a Node. The kubelet on that Node, which monitors the Object Store for Pods assigned to it, then runs the Pod.

A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on.

How does the Kubernetes Scheduler work?

The Kubernetes scheduler is in charge of scheduling pods onto nodes. Basically it works like this:

  1. You create a pod
  2. The scheduler notices that the new pod you created doesn’t have a node assigned to it
  3. The scheduler assigns a node to the pod

The scheduler is not responsible for actually running the pod; that is the kubelet’s job. It only needs to make sure every pod has a node assigned to it. Kubernetes in general has this idea of a “controller”. A controller’s job is to:

  • look at the state of the system
  • notice ways in which the actual state does not match the desired state (like “this pod needs to be assigned a node”)
  • repeat

The scheduler is a kind of controller. There are lots of different controllers and they all have different jobs and operate independently.
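
To make the controller idea a little more concrete, here is a toy reconciliation loop in plain shell. It is only a sketch: the file object-store.txt, the pod names and the hard-coded node choice are all made up for illustration. The real scheduler watches the API server and filters and scores nodes instead of picking a fixed one.

```shell
#!/bin/sh
# Toy model of a controller loop. This is NOT how kube-scheduler is implemented.
# Each line of the pretend object store is "<pod> <node>"; "-" means "no node assigned yet".
printf '%s\n' 'web-1 node2' 'web-2 -' 'db-1 -' > object-store.txt

# One reconciliation pass: notice pods whose actual state (no node) differs
# from the desired state (a node assigned) and fix them.
reconcile() {
  awk '$2 == "-" { $2 = "node3" } { print }' object-store.txt > object-store.tmp &&
    mv object-store.tmp object-store.txt
}

# A real controller repeats forever; two passes are enough to show that
# the second pass finds nothing left to do.
reconcile
reconcile
cat object-store.txt
```

After the run, every line of object-store.txt has a node name in its second column, and the already-assigned pod is left untouched.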

How does Kubernetes select the right node?

Enter Node Affinity. Node Affinity allows you to tell Kubernetes to schedule pods only to specific subsets of nodes. The initial node affinity mechanism in early versions of Kubernetes was the nodeSelector field in the pod specification. The node had to include all the labels specified in that field to be eligible to become the target for the pod.

nodeSelector steps

git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl apply -f pod-nginx.yaml
  • The label is attached to a node by node name; in this case, node2 is given the label mynode=worker-1.
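
The manifest itself is not shown in this guide. Going by the describe output further down (pod name nginx, label env=test, image nginx, Node-Selectors mynode=worker-1), it probably looks something like the sketch below. The file name pod-nginx-sketch.yaml is hypothetical, and the repository's own pod-nginx.yaml may differ in its details.

```shell
# Hypothetical file name; the repository's own pod-nginx.yaml may differ.
cat > pod-nginx-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
  # The pod is only eligible for nodes that carry every label listed here.
  nodeSelector:
    mynode: worker-1
EOF

# To try it against a cluster (not run here):
# kubectl apply -f pod-nginx-sketch.yaml
```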

Viewing Your Pods

kubectl get pods --output=wide
[node1 Scheduler101]$ kubectl describe po nginx
Name:               nginx
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node2/192.168.0.17
Start Time:         Mon, 30 Dec 2019 16:40:53 +0000
Labels:             env=test
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"env":"test"},"name":"nginx","namespace":"default"},"spec":{"contai...
Status:             Pending
IP:
Containers:
  nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-qpgxq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qpgxq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  mynode=worker-1
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  7s    default-scheduler  Successfully assigned default/nginx to node2
  Normal  Pulling    3s    kubelet, node2     Pulling image "nginx"
[node1 Scheduler101]$

  • In the output above, note the line Node-Selectors: mynode=worker-1.

Deleting the Pod

kubectl delete -f pod-nginx.yaml
pod "nginx" deleted

Please note:

  • Node affinity is conceptually similar to nodeSelector – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
  • There are currently two types of node affinity:
    1. requiredDuringSchedulingIgnoredDuringExecution (required during scheduling, ignored during execution; also known as “hard” requirements)
    2. preferredDuringSchedulingIgnoredDuringExecution (preferred during scheduling, ignored during execution; also known as “soft” requirements)
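
Here is a minimal sketch of how the two types can sit side by side in one pod spec. It reuses the mynode label from this guide, but the pod name affinity-sketch, the file name and the particular values are assumptions for illustration; this is not the repository's pod-with-node-affinity.yaml.

```shell
# Hypothetical file and pod name, written only to show the shape of the two rule types.
cat > pod-affinity-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: affinity-sketch
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the node must have mynode set to worker-1 or worker-3.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: mynode
            operator: In
            values:
            - worker-1
            - worker-3
      # Soft requirement: among eligible nodes, prefer the one labelled worker-3.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: mynode
            operator: In
            values:
            - worker-3
  containers:
  - name: nginx
    image: nginx
EOF

# To try it against a cluster (not run here):
# kubectl apply -f pod-affinity-sketch.yaml
```

With a spec like this, a pod stays Pending if no node satisfies the hard rule, while the soft rule only influences which of the eligible nodes is chosen.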

Show me a demo…

Let’s jump into a quick demonstration by cloning the repository and labelling the nodes

git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl label nodes node3 mynode=worker-3
kubectl apply -f pod-with-node-affinity.yaml

Viewing Your Pods


kubectl get pods --output=wide
NAME                 READY   STATUS    RESTARTS   AGE     IP          NODE          NOMINATED NODE   READINESS GATES
with-node-affinity   1/1     Running   0          9m46s   10.44.0.1   kube-slave1   <none>           <none>

[node1 Scheduler101]$ kubectl describe po
Name:               with-node-affinity
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node3/192.168.0.16
Start Time:         Mon, 30 Dec 2019 19:28:33 +0000
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"with-node-affinity","namespace":"default"},"spec":{"affinity":{"nodeA...
Status:             Pending
IP:
Containers:
  nginx:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-qpgxq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qpgxq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  26s   default-scheduler  Successfully assigned default/with-node-affinity to node3
  Normal  Pulling    22s   kubelet, node3     Pulling image "nginx"
  Normal  Pulled     20s   kubelet, node3     Successfully pulled image "nginx"
  Normal  Created    2s    kubelet, node3     Created container nginx
  Normal  Started    0s    kubelet, node3     Started container nginx

Cleaning up

Finally you can clean up the resources you created in your cluster:

kubectl delete -f pod-with-node-affinity.yaml

What is Node Anti-Affinity?

  • Some scenarios require that you don’t use one or more nodes except for particular pods. Think of the nodes that host your monitoring application.
  • Those nodes do not need many resources because of the nature of their role. If pods other than the monitoring app are scheduled onto them, monitoring suffers and the applications running in those pods are degraded as well.
  • In such a case, you need to use node anti-affinity to keep pods away from a set of nodes.

Show me a demo…

Let us jump into an anti-affinity demonstration by cloning the repository and running the below commands:

git clone https://github.com/collabnix/dockerlabs
cd dockerlabs/kubernetes/workshop/Scheduler101/
kubectl label nodes node2 mynode=worker-1
kubectl label nodes node3 mynode=worker-3
kubectl apply -f pod-anti-node-affinity.yaml

Viewing Your Pods

[node1 Scheduler101]$ kubectl get pods --output=wide
NAME    READY   STATUS    RESTARTS   AGE     IP          NODE    NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          2m37s   10.44.0.1   node2   <none>           <none>

Viewing the node labels

[node1 Scheduler101]$ kubectl get nodes --show-labels | grep mynode
node2   Ready    <none>   166m   v1.14.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node2,kubernetes.io/os=linux,mynode=worker-1,role=dev
node3   Ready    <none>   165m   v1.14.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node3,kubernetes.io/os=linux,mynode=worker-3

Describing the Pod

[node1 Scheduler101]$ kubectl describe pods nginx
Name:               nginx
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               node2/192.168.0.17
Start Time:         Mon, 30 Dec 2019 19:02:46 +0000
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"nginx","namespace":"default"},"spec":{"affinity":{"nodeAffinity":{"re...
Status:             Running
IP:                 10.44.0.1
Containers:
  nginx:
    Container ID:   docker://2bdc20d79c360e1cd857eeb9bbb9424c726b2133e78f25bf4587e0befe3fbcc7
    Image:          nginx
    Image ID:       docker-pullable://nginx@sha256:b2d89d0a210398b4d1120b3e3a7672c16a4ba09c2c4a0395f18b9f7999b768f2
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 30 Dec 2019 19:03:07 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qpgxq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-qpgxq:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qpgxq
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  60s   default-scheduler  Successfully assigned default/nginx to node2
  Normal  Pulling    56s   kubelet, node2     Pulling image "nginx"
  Normal  Pulled     54s   kubelet, node2     Successfully pulled image "nginx"
  Normal  Created    40s   kubelet, node2     Created container nginx
  Normal  Started    39s   kubelet, node2     Started container nginx

Adding another key to the matchExpressions with the operator NotIn will avoid scheduling the nginx pods on any node labelled worker-1.
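
As a rough sketch, such a rule could look like the following. This is not the repository's pod-anti-node-affinity.yaml: the file name and the choice of mynode as the key are assumptions made for illustration.

```shell
# Hypothetical file name; the key and value are chosen to match the sentence above.
cat > pod-anti-affinity-sketch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # NotIn keeps the pod off any node whose mynode label is worker-1.
          - key: mynode
            operator: NotIn
            values:
            - worker-1
  containers:
  - name: nginx
    image: nginx
EOF

# To try it against a cluster (not run here):
# kubectl apply -f pod-anti-affinity-sketch.yaml
```

Note that NotIn also matches nodes that do not carry the mynode label at all, so such a pod may land on an unlabelled node as well as on node3.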

Cleaning up

Finally you can clean up the resources you created in your cluster:

kubectl delete -f pod-anti-node-affinity.yaml

In our next blog post, we will learn about node taints and tolerations in detail.


Have Queries? Join https://launchpass.com/collabnix
