Docker, Prometheus & Pushgateway for NVIDIA GPU Metrics & Monitoring

Estimated Reading Time: 6 minutes

In my last blog post, I talked about how to get started with NVIDIA Docker and how to interact with an NVIDIA GPU system. I demonstrated the NVIDIA Deep Learning GPU Training System, a.k.a. DIGITS, by running it inside a Docker container. ICYMI: DIGITS is essentially a webapp for training deep learning models, used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation and object detection tasks. The currently supported frameworks are Caffe, Torch, and TensorFlow. It simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment.

 

In a typical HPC environment where you run clusters with hundreds of NVIDIA GPU-equipped nodes, it becomes important to monitor those systems to gain insight into performance metrics, memory usage, temperature and utilization. Tools like Ganglia and Nagios are very popular due to their scalable, distributed monitoring architecture for high-performance computing systems such as clusters and grids. Ganglia, for example, leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. But with the advent of container technology, there is a need for modern monitoring tools and solutions that work well with Docker and microservices.

It’s the modern world of the Prometheus Stack…

Prometheus is a 100% open-source service monitoring system and time series database written in Go. It is a full monitoring and trending system that includes built-in and active scraping, storing, querying, graphing, and alerting based on time series data. It has knowledge about what the world should look like (which endpoints should exist, what time series patterns mean trouble, etc.), and actively tries to find faults.

How is it different from Nagios?

Though both serve the purpose of monitoring, Prometheus wins this debate on the following major points:

  • Nagios is host-based. Each host can have one or more services, each of which has one check. There is no notion of labels or a query language. Prometheus, on the other hand, comes with its robust query language called “PromQL”, a functional expression language that lets the user select and aggregate time series data in real time. The result of an expression can either be shown as a graph, viewed as tabular data in Prometheus’s expression browser, or consumed by external systems via the HTTP API.
  • Nagios is suitable for basic monitoring of small and/or static systems where blackbox probing is sufficient. But if you want to do whitebox monitoring, or have a dynamic or cloud-based environment, then Prometheus is a good choice.
  • Nagios is primarily about alerting based on the exit codes of scripts, which are called “checks”. Individual alerts can be silenced, but there is no grouping, routing or deduplication. In the Prometheus stack, the Alertmanager handles all three.
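To make the "consumed by external systems via the HTTP API" point concrete, here is a minimal sketch of how a PromQL expression can be turned into an instant-query URL for Prometheus's `/api/v1/query` endpoint. The metric name `gpu_temperature_celsius`, the `gpu_stats` job label and the `localhost:9090` address are assumptions made up for illustration; they are not taken from the repository used later in this post.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical metric and label names, used only for illustration.
expr = 'avg by (instance) (gpu_temperature_celsius{job="gpu_stats"})'
url = build_query_url("http://localhost:9090", expr)
print(url)

# Round-trip check: decoding the query string gives back the expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Fetching that URL from a running Prometheus server returns the result as JSON; the sketch stops at building the URL so it can be tried without a server.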

Let’s talk about Prometheus Pushgateway…

Occasionally you will need to monitor components which cannot be scraped. They might live behind a firewall, or they might be too short-lived to expose data reliably via the pull model. The Prometheus Pushgateway allows you to push time series from these components to an intermediary job which Prometheus can scrape. Combined with Prometheus’s simple text-based exposition format, this makes it easy to instrument even shell scripts without a client library.

The Prometheus Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus. It is important to understand that the Pushgateway is explicitly not an aggregator or distributed counter but rather a metrics cache. It does not have statsd-like semantics. The metrics pushed are exactly the same as you would present for scraping in a permanently running program. For machine-level metrics, the textfile collector of the Node exporter is usually more appropriate. The Pushgateway is intended for service-level metrics. It is not an event store.

In this blog post, I will showcase how NVIDIA Docker, Prometheus & Pushgateway come together to push NVIDIA GPU metrics to the Prometheus Stack.
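As a rough illustration of the push model described above, the sketch below renders a few gauge values in the text exposition format and builds the Pushgateway URL for a job/instance grouping. The metric names, the `gpu_stats` job name and the `localhost:9091` address are assumptions chosen for the example, and job and instance names are assumed to be URL-safe. The `push` helper is defined but not called, so the block runs without a Pushgateway.

```python
from urllib.request import Request, urlopen


def format_metrics(samples: dict) -> str:
    """Render {metric_name: value} as text exposition lines, one gauge each."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def pushgateway_url(base_url: str, job: str, instance: str) -> str:
    """Build the push URL for one job/instance grouping key."""
    return f"{base_url.rstrip('/')}/metrics/job/{job}/instance/{instance}"


def push(base_url: str, job: str, instance: str, samples: dict) -> int:
    """PUT the samples to the Pushgateway, replacing that group's metrics."""
    body = format_metrics(samples).encode("utf-8")
    request = Request(pushgateway_url(base_url, job, instance), data=body, method="PUT")
    with urlopen(request) as response:
        return response.status


# Hypothetical metric names and values, for illustration only.
payload = format_metrics({"gpu_temperature_celsius": 61, "gpu_memory_used_bytes": 1024})
target = pushgateway_url("http://localhost:9091", "gpu_stats", "gpu-node-1")
print(target)
print(payload)
```

Because the Pushgateway is a cache rather than an aggregator, each push simply overwrites the previous values for the same grouping key; Prometheus then scrapes whatever was pushed last.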

Infrastructure Setup:

  • Docker Version: 17.06
  • OS: Ubuntu 16.04 LTS
  • Environment : Managed Server Instance with GPU
  • GPU: GeForce GTX 1080 Graphics card

Cloning the GitHub Repository

Run the command below to clone the repository to your GPU-equipped Ubuntu 16.04 system:

git clone https://github.com/ajeetraina/nvidia-prometheus-stats

Script to bring up the Prometheus Stack (includes Grafana)

Change to the nvidia-prometheus-stats directory, make the ‘start_containers.sh’ script executable, and then run it as shown below:

cd nvidia-prometheus-stats
sudo chmod +x start_containers.sh
sudo sh start_containers.sh

This script will bring up 3 containers in sequence: Pushgateway, Prometheus & Grafana.

Executing GPU Metrics Script:

NVIDIA provides a Python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are released under the BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.
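For readers who have not used the NVML bindings, here is a minimal sketch of reading those three metrics with the `pynvml` module. It is an illustration under the assumption that `pynvml` and an NVIDIA driver are installed, and it is not necessarily how the script in the repository works. The dictionary keys and the `memory_used_ratio` helper are hypothetical names introduced for this example. The import happens inside the function so the block can be loaded on a machine without a GPU.

```python
def read_gpu_metrics():
    """Return one dict of readings per GPU, using the pynvml bindings."""
    import pynvml  # imported lazily so this module loads without NVML present

    pynvml.nvmlInit()
    try:
        readings = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            readings.append({
                "gpu": index,
                "temperature_celsius": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
                "memory_used_bytes": memory.used,
                "memory_total_bytes": memory.total,
                "utilization_percent": utilization.gpu,
            })
        return readings
    finally:
        pynvml.nvmlShutdown()


def memory_used_ratio(reading: dict) -> float:
    """Fraction of GPU memory in use for one reading."""
    return reading["memory_used_bytes"] / reading["memory_total_bytes"]


# Hand-written sample reading, so the helper can be tried without a GPU.
sample = {"gpu": 0, "memory_used_bytes": 512, "memory_total_bytes": 2048}
print(memory_used_ratio(sample))
```

On a GPU host, calling `read_gpu_metrics()` returns one dictionary per device, which can then be formatted and pushed to the Pushgateway.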

Next, in the same directory, you will find a Python script called “test.py”.

Change the IP address on line 124 of the script to match your host machine, then execute it as shown below:

sudo python test.py

That’s it. It is time to open up the Prometheus UI at http://<IP-address>:9090

Just type gpu in the Expression field and the list of GPU metrics will appear automatically.

Accessing the targets

Go to Status > Targets to see which targets are accessible. The status of each target should show UP.

Click on the Pushgateway endpoint to see the GPU metrics in detail.

Accessing Grafana

You can access Grafana through the link below:

http://<IP-address>:3000

 

Did you find this blog helpful?  Feel free to share your experience. Get in touch @ajeetsraina

If you are looking out for contribution/discussion, join me at Docker Community Slack Channel.

 

Building a secure Docker Host VM on VMware ESXi using LinuxKit & Moby

Estimated Reading Time: 4 minutes

After DockerCon 2017 in Austin, TX, I raised a feature request titled “LinuxKit command to push vmware.vmdk to remote ESXi datastore”. Within a few weeks, the feature was introduced by the LinuxKit team. Special thanks go to Daniel Finneran, who worked hard to get this feature merged into the LinuxKit main branch.

 

The LinuxKit project is 5 months old now. It has already gathered 3100+ stars, 69 contributors and 350+ forks to date. If you are new to it, LinuxKit is not a full host operating system, as it primarily has two jobs: run containerd containers, and be secure. It uses modern kernels, and updates frequently following new releases. As such, the system does not contain extraneous packages or drivers by default. Because LinuxKit is customizable, it is up to individual operators to include any additional bits they may require.

LinuxKit is undoubtedly Secure

The core system components included in the LinuxKit userspace are key to security. They are written in type-safe languages such as Rust, Go and OCaml, and run with maximum privilege separation and isolation. The project is currently leveraging MirageOS to construct unikernels to achieve this, and that progress can be tracked here. As of this writing, dhcp is the first such type-safe program. There is ongoing work to remove more C components, and to improve, fuzz test and isolate the base daemons. Further rationale about the decision to rewrite system daemons in MirageOS is explained at length in this document. I am planning to write a blog post on the “LinuxKit Security” aspect. Keep an eye on this space.

Let’s talk about building a secure Docker Host VM…

I am a great fan of VMware PowerCLI. I have been using it since I worked full time at VMware Inc. (during the 2010-2011 timeframe). Today the quickest way to get VMware PowerCLI up and running is to use the Photon OS based Docker image. With just one Docker command you are inside Photon OS, running PowerShell & PowerCLI to connect to a remote ESXi host and build up VMware infrastructure. Still, this might not give you a secure Docker host environment. If you are really interested in building a secure, portable and lean Docker host operating system, LinuxKit is the right tool. But how?

In this blog post, I am going to show how Moby & LinuxKit can help you build a secure Docker 17.07 Host VM on top of VMware ESXi.

Pre-requisites:

  • VMware vSphere ESXi 6.x
  • Linux or macOS with Go packages installed
  • Docker 17.06/17.07 installed on the system

The commands below were executed on one of my local Ubuntu 16.04 LTS systems, which can reach the ESXi host without any issues.

Cloning the LinuxKit Repository:

git clone https://github.com/linuxkit/linuxkit

Building Moby & LinuxKit:

cd linuxkit
make

Configuring the right PATH for Moby & LinuxKit

cp bin/moby /usr/local/bin
cp bin/linuxkit /usr/local/bin

A Peep into vmware.yml File

The first 3 lines show a modern, securely configured kernel. The init section spins up containerd to run services. The onboot section runs dhcpd for networking. The services section includes a getty service container for shell access and an nginx service container. The trust section indicates that all images are signed and verified.

Building VMware ISO Image using Moby

moby build -output iso-bios -name vmware vmware.yml

 

Pushing VMware ISO Image to remote ESXi datastore

linuxkit push vcenter -datastore=datastore1 -hostname=myesxi.dell.com -url https://root:xxx@100.98.x.x/sdk -folder=linuxkit vmware.iso

Usage of linuxkit push vcenter :-

 

Running a secure VM directly from the ESXi datastore using LinuxKit

dell@redfish-ubuntu:~/linuxkit/examples$ sudo linuxkit run vcenter -cpus 8 -datastore datastore1 -mem 2048 -network 'VM Network' -hostname myesxi.dell.com -powerOn -url https://root:xxx@100.98.x.x/sdk vmware.iso
Creating new LinuxKit Virtual Machine
Adding ISO to the Virtual Machine
Adding VM Networking
Powering on LinuxKit VM

Now let us verify that the VM is up and running using either the VMware vSphere Client or the SDK URL.

You will find that the VM has already booted with the latest Docker 17.07 platform up and running.

Building a Docker Host VM using Moby & LinuxKit

In case you want to build a Docker Host VM, you can refer to the below vmware.yml file:

Just re-run the command below to get the new VM image:

moby build -output iso-bios -name vmware docker-vmware.yml

Follow the steps above to push it to the remote datastore and run it using LinuxKit. You now have a secure Docker 17.07 host, ready to build Docker images and build up an application stack.

How about building a Photon OS based Docker image using Moby & LinuxKit? Once you build it and push it to a VM, it is all ready to build virtual infrastructure. Interesting, isn’t it?
