How to Monitor Node Health in Kubernetes Using Node Problem Detector

Kubernetes is a powerful container orchestration platform that allows users to deploy and manage containerized applications efficiently. However, the health of the nodes in a Kubernetes cluster is crucial for the overall stability and reliability of the applications running on it. Node problems, such as hardware failures, kernel issues, or container runtime problems, can impact the availability of pods and disrupt the entire cluster.

To address this, the Kubernetes project provides node-problem-detector, a tool that detects node problems and reports them to the cluster management stack. In this blog, we will explore node-problem-detector, its features, how to deploy it in a Kubernetes cluster, and how to configure it, with code snippets.

What is node-problem-detector?

Background and Motivation

Node problems in a Kubernetes cluster can lead to application disruptions and impact user experience. Issues like hardware failures, kernel panics, or unresponsive container runtimes are challenging to detect early and remediate. The node-problem-detector tool aims to address this problem by making various node problems visible to the upstream layers in the cluster management stack.

Problem API

node-problem-detector uses two mechanisms to report problems to the Kubernetes API server: Event and NodeCondition. Permanent problems that make the node unavailable for pods are reported as NodeConditions, while temporary problems that have limited impact on pods but are informative are reported as Events.
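
To make the NodeCondition side concrete, here is a minimal sketch that reads a Node object (the JSON shape returned by kubectl get node <name> -o json) and lists every condition other than Ready whose status is "True". The sample document, the node name worker-1, and the helper name active_problem_conditions are assumptions made for illustration; KernelDeadlock and DockerHung stand in for whatever your monitor configuration reports.

```python
import json

# Hypothetical, trimmed-down Node object used only for this example.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "worker-1"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady"},
      {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
      {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung"}
    ]
  }
}
""")


def active_problem_conditions(node):
    """Return (type, reason) for each non-Ready condition that is currently True."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c["type"], c.get("reason", ""))
        for c in conditions
        if c["type"] != "Ready" and c["status"] == "True"
    ]


for cond_type, reason in active_problem_conditions(SAMPLE_NODE):
    print(f"{cond_type}: {reason}")
```

Ready is skipped because a "True" status there means the node is healthy, whereas for problem conditions "True" means the problem is present.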

Supported Problem Daemons

node-problem-detector consists of multiple problem daemons, each responsible for monitoring specific kinds of node problems. The supported problem daemon types include System Log Monitor, System Stats Monitor, Custom Plugin Monitor, and Health Checker.

How node-problem-detector Works

System Log Monitor

The System Log Monitor is a crucial component of node-problem-detector that monitors system logs and reports problems and metrics according to predefined rules. It collects log data from various sources, including kernel logs, system logs, and container runtime logs.

Code Snippet: Configuring System Log Monitor

node-problem-detector --config.system-log-monitor=config/kernel-monitor.json,config/system-monitor.json
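
The rules in such a configuration file pair a regular expression with a problem type. The sketch below mimics that idea in Python: each log line is tested against a small rule list, and a match is classified as an Event (temporary) or a NodeCondition (permanent). The two rules are simplified versions of the kind found in a kernel monitor configuration, and the function name classify is hypothetical; this is an illustration of the matching logic, not the tool's own code.

```python
import re

# Illustrative rules only; real rule sets live in the monitor's JSON config.
RULES = [
    {
        "type": "temporary",
        "reason": "OOMKilling",
        "pattern": r"Killed process \d+ \(.+\)",
    },
    {
        "type": "permanent",
        "condition": "KernelDeadlock",
        "reason": "DockerHung",
        "pattern": r"task docker:\w+ blocked for more than \w+ seconds\.",
    },
]


def classify(line, rules=RULES):
    """Return (report_kind, reason) for the first matching rule, else None."""
    for rule in rules:
        if re.search(rule["pattern"], line):
            kind = "NodeCondition" if rule["type"] == "permanent" else "Event"
            return kind, rule["reason"]
    return None


print(classify("Killed process 1234 (java) total-vm:1024kB"))
print(classify("task docker:abc123 blocked for more than 120 seconds."))
print(classify("eth0: link up"))
```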

System Stats Monitor

The System Stats Monitor collects various health-related system stats as metrics to provide insights into the node’s health status. Although it is not fully supported yet, it’s a promising feature for future releases.

Custom Plugin Monitor

The Custom Plugin Monitor allows users to define and check various node problems using custom check scripts. This flexibility enables users to address node problems specific to their use-cases.
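
A custom check reports its result through its exit code (0 for OK, 1 for a detected problem, 2 for unknown) and a short message on standard output. The sketch below is a hypothetical disk-space check written in Python, assuming a Python interpreter is available wherever the plugin runs (plugins are commonly shell scripts instead). The 10% threshold, the 80-character message cap, and the names evaluate and run_check are assumptions chosen for illustration.

```python
import shutil

# Exit codes used by custom plugin checks: OK, problem detected, unknown.
OK, NON_OK, UNKNOWN = 0, 1, 2


def evaluate(total_bytes, free_bytes, min_free_ratio=0.10):
    """Map disk numbers to an (exit_code, message) pair."""
    if total_bytes <= 0:
        return UNKNOWN, "cannot determine disk size"
    ratio = free_bytes / total_bytes
    if ratio < min_free_ratio:
        return NON_OK, f"low disk space: {ratio:.0%} free"
    return OK, f"disk space ok: {ratio:.0%} free"


def run_check(path="/"):
    """Print a short status message and return the plugin exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"disk check failed: {exc}"[:80])
        return UNKNOWN
    code, message = evaluate(usage.total, usage.free)
    print(message[:80])  # keep the message short; long output may be truncated
    return code


# When saved as a standalone plugin script, end the file with:
#     import sys; sys.exit(run_check())
```

Keeping the threshold logic in evaluate, separate from the filesystem call, makes the check easy to test without touching a real disk.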

Health Checker

The Health Checker verifies the health of essential components in the node, such as the kubelet and container runtime. It ensures these components are functioning correctly and reports any issues detected.
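
As a rough illustration of the kind of probe involved, the sketch below polls an HTTP health endpoint and treats the component as healthy if any of a few attempts succeeds. The kubelet healthz URL, the retry count, and the function names are assumptions for illustration; this is not the Health Checker's actual implementation.

```python
import urllib.error
import urllib.request

# Assumed local kubelet health endpoint; adjust for your environment.
KUBELET_HEALTHZ = "http://127.0.0.1:10248/healthz"


def http_probe(url=KUBELET_HEALTHZ, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def is_healthy(probe=http_probe, attempts=3):
    """Call the probe up to `attempts` times and stop at the first success."""
    for _ in range(attempts):
        if probe():
            return True
    return False
```

Passing the probe in as a parameter lets the retry logic be exercised with a fake probe, without a running kubelet.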

Exporter

The Exporter is responsible for reporting node problems and metrics to certain backends. Supported exporters include the Kubernetes exporter, Prometheus exporter, and Stackdriver exporter.

Building and Deploying node-problem-detector

Deploying with Helm

Helm simplifies the deployment of node-problem-detector in a Kubernetes cluster.

Code Snippet: Deploying with Helm

helm repo add deliveryhero https://charts.deliveryhero.io/
helm install --generate-name deliveryhero/node-problem-detector

Manual Installation

For manual installation, you can use YAML manifests to deploy node-problem-detector in your cluster.

Code Snippet: Manual Installation

Edit node-problem-detector.yaml to fit your environment. Set log volume to your system log directory (used by SystemLogMonitor). You can use a ConfigMap to overwrite the config directory inside the pod.
Edit node-problem-detector-config.yaml to configure node-problem-detector.
Edit rbac.yaml to fit your environment.
Create the ServiceAccount and ClusterRoleBinding with:

kubectl create -f rbac.yaml

Create the ConfigMap with:

kubectl create -f node-problem-detector-config.yaml

Create the DaemonSet with:

kubectl create -f node-problem-detector.yaml

Alternatively, apply all three manifests in one pass:

kubectl create -f node-problem-detector-config.yaml
kubectl create -f rbac.yaml
kubectl create -f node-problem-detector.yaml

Configuration and Usage

Command Line Flags

node-problem-detector provides various command line flags to configure its behavior.

Code Snippet: Using Command Line Flags

node-problem-detector --hostname-override=my-node --enable-k8s-exporter

Configuring System Log Monitor

You can specify the paths to system log monitor configuration files, as a comma-separated list, using the --config.system-log-monitor flag.

Code Snippet: Configuring System Log Monitor

node-problem-detector --config.system-log-monitor=config/kernel-monitor.json,config/filelog-monitor.json

Configuring System Stats Monitor

System Stats Monitor is still under development, but it will allow you to collect various health-related system stats as metrics.

Configuring Custom Plugin Monitor

The Custom Plugin Monitor can be configured with a list of paths to custom plugin monitor configuration files.

Code Snippet: Configuring Custom Plugin Monitor

node-problem-detector --config.custom-plugin-monitor=config/custom-plugin-monitor.json
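
To give a feel for what such a configuration file contains, the sketch below builds one as a Python dictionary, checks that every permanent rule refers to a declared condition, and prints the JSON. The field names follow the sample custom plugin configuration as best I recall it and should be verified against your node-problem-detector version; the source name, condition type, reasons, and script path are hypothetical.

```python
import json


def build_plugin_config(source, condition_type, check_path):
    """Assemble a custom plugin monitor config (field names are assumptions)."""
    return {
        "plugin": "custom",
        "pluginConfig": {
            "invoke_interval": "30s",
            "timeout": "5s",
            "max_output_length": 80,
            "concurrency": 3,
        },
        "source": source,
        "conditions": [
            {
                "type": condition_type,
                "reason": "DiskIsOK",
                "message": "disk has enough free space",
            }
        ],
        "rules": [
            {
                "type": "permanent",
                "condition": condition_type,
                "reason": "DiskIsLow",
                "path": check_path,
                "timeout": "3s",
            }
        ],
    }


def undeclared_conditions(config):
    """List conditions used by permanent rules but missing from 'conditions'."""
    declared = {c["type"] for c in config.get("conditions", [])}
    missing = set()
    for rule in config.get("rules", []):
        if rule.get("type") == "permanent":
            condition = rule.get("condition", "")
            if condition not in declared:
                missing.add(condition)
    return sorted(missing)


config = build_plugin_config(
    "disk-custom-plugin-monitor", "DiskProblem", "./config/plugin/check_disk.py"
)
assert undeclared_conditions(config) == []
print(json.dumps(config, indent=2))
```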

Enabling Kubernetes Exporter

By default, node-problem-detector exports node problems to the Kubernetes API server, so the Kubernetes exporter needs no extra flag to be enabled. You can disable it using the --enable-k8s-exporter=false flag.

Code Snippet: Disabling Kubernetes Exporter

node-problem-detector --enable-k8s-exporter=false

Prometheus Exporter Configuration

The Prometheus exporter reports node problems and metrics locally as Prometheus metrics. The --prometheus-port flag sets the port of the scrape endpoint; 20257 is the default.

Code Snippet: Prometheus Exporter Configuration

node-problem-detector --prometheus-port=20257
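
Once the endpoint is up, its output can be inspected with any Prometheus text-format parser. The sketch below parses a hard-coded sample and lists the problem types whose gauge is set to 1. The metric names problem_gauge and problem_counter, their labels, and the sample values are assumptions for illustration; check the actual output of your own endpoint (for example http://127.0.0.1:20257/metrics) before relying on them.

```python
import re

# Hypothetical scrape output; real output will contain more metrics.
SAMPLE = '''\
# TYPE problem_gauge gauge
problem_gauge{reason="DockerHung",type="KernelDeadlock"} 1
problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
# TYPE problem_counter counter
problem_counter{reason="OOMKilling"} 3
'''

LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Parse labelled samples into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


def active_problems(text):
    """Return the sorted problem types whose gauge value is 1."""
    return sorted(
        labels["type"]
        for name, labels, value in parse(text)
        if name == "problem_gauge" and value == 1 and "type" in labels
    )


print(active_problems(SAMPLE))
```

This parser only handles labelled samples without timestamps, which is enough for a quick look; a production setup would normally let Prometheus scrape the endpoint directly.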

Stackdriver Exporter Configuration

The Stackdriver exporter reports node problems and metrics to the Stackdriver Monitoring API.

Code Snippet: Stackdriver Exporter Configuration

node-problem-detector --exporter.stackdriver=config/exporter/stackdriver-exporter.json

Conclusion

node-problem-detector is a valuable tool for monitoring node health in Kubernetes clusters. By making node problems visible to the cluster management stack, it enables administrators to detect and address issues before they impact applications. In this blog, we explored the features of node-problem-detector, how to deploy it with Helm or plain manifests, and how to configure its monitors and exporters. Armed with this knowledge, you can improve the reliability and stability of your Kubernetes clusters.
