TensorFlow on Kubernetes


While GPUs are a staple of deep learning, deploying on GPUs makes everything more complicated, including your Kubernetes cluster. This quick guide will walk through adding basic single-GPU support to Kubernetes.

The guide assumes that Kubernetes is already running on Ubuntu. An LTS release is preferable, with 14.04 being the most suitable given NVIDIA's recommendations for driver hosts. Warning: Ubuntu 14.04 is not well supported by Kubernetes, so feel free to use a different distro. This guide also assumes that the proper GPU drivers and CUDA version are already installed; plenty of other guides cover those topics.

TL;DR: start with nvidia-docker, then whittle away its functionality until just plain docker remains. Then add that functionality to Kubernetes.

Working without nvidia-docker

A common way to run containerized GPU applications is to use nvidia-docker. Here is an example of running TensorFlow with full GPU support inside a container.
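The original command did not survive extraction; a minimal sketch, assuming the stock TensorFlow GPU image and a one-liner that forces device discovery:

```shell
# Run TensorFlow with GPU support via nvidia-docker. The image tag and the
# python one-liner are assumptions for illustration; creating a Session logs
# the GPU devices TensorFlow detects.
nvidia-docker run --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; tf.Session()"
```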

Simple! If all goes well, the output should list the GPU devices that TensorFlow found.

Unfortunately it's not currently possible to use nvidia-docker directly from Kubernetes. Additionally, Kubernetes does not support the nvidia-docker-plugin, since Kubernetes does not use Docker's volume mechanism.

The goal is to manually replicate the functionality provided by nvidia-docker (and its plugin). For demonstration, query the nvidia-docker-plugin REST API for the command-line arguments it would pass to docker:
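A sketch of that query, assuming the plugin is running with its default settings (the exact driver version and device numbers in the response will vary by host):

```shell
# nvidia-docker-plugin listens on port 3476 by default; the /docker/cli
# endpoint returns the extra flags nvidia-docker would add to docker run.
curl -s http://localhost:3476/v1.0/docker/cli
# Response shape (values here are illustrative, not from this host):
# --volume-driver=nvidia-docker --volume=nvidia_driver_367.57:/usr/local/nvidia:ro
# --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0
```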

These arguments then feed into plain docker, running the same python command:
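For example, the plugin's response can be substituted directly into a plain docker invocation (image and python command are the same assumptions as above):

```shell
# Splice the plugin's reported flags into plain docker run; nvidia-docker
# itself is no longer involved.
docker run --rm \
    $(curl -s http://localhost:3476/v1.0/docker/cli) \
    tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; tf.Session()"
```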

If all goes well, TensorFlow should find everything correctly and you should see the same output as before.

Finally, remove the dependency on nvidia-docker-plugin by manually specifying the driver path and manually mounting the devices and CUDA volumes.
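A sketch with the plugin's work done by hand. The driver volume name and device list below are assumptions; check `ls /dev/nvidia*` and `docker volume ls` on your host for the real values:

```shell
# Everything spelled out explicitly: driver volume, control devices, and the
# GPU device itself (one --device flag per GPU on multi-GPU hosts).
docker run --rm \
    --volume=nvidia_driver_367.57:/usr/local/nvidia:ro \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia0 \
    tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; tf.Session()"
```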

Enabling GPU devices

With the knowledge of what Docker needs to run a GPU-enabled container, it is straightforward to add this to Kubernetes. The first step is to enable an experimental flag on all of the GPU nodes. In the Kubelet options (found in /etc/default/kubelet if you use upstart for services), add --experimental-nvidia-gpus=1. This does two things. First, it exposes GPU resources on the node to the scheduler. Second, when a GPU resource is requested, it adds the appropriate device flags to the docker command. This post describes a little more about what this flag is and why it exists:

http://blog.clarifai.com/how-to-scale-your-gpu-cloud-infrastructure-with-kubernetes

The full GPU proposal, including the existing flag and future steps can be found here:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md
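On an upstart-managed Ubuntu node, enabling the flag might look like the following (the file path matches the one mentioned above; the existing contents of KUBELET_OPTS will differ per install):

```shell
# /etc/default/kubelet — append the experimental flag to the existing options.
KUBELET_OPTS="... --experimental-nvidia-gpus=1"
```

Restart the kubelet afterwards (e.g. `sudo service kubelet restart`) so the node re-advertises its resources.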

Pod Spec

With the device flags handled by the experimental GPU flag, the final step is adding the necessary volumes to the pod spec. A sample pod spec is provided below:
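The original sample spec did not survive extraction; a minimal sketch, assuming the alpha GPU resource name used by the experimental flag and a hostPath to the driver files (the path shown is the nvidia-docker-plugin volume location and will vary by host and driver version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-gpu
spec:
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu
    command: ["python", "-c", "import tensorflow as tf; tf.Session()"]
    resources:
      limits:
        # Resource exposed by --experimental-nvidia-gpus=1; requesting it
        # makes the kubelet add the /dev/nvidia* device flags.
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
  volumes:
  - name: nvidia-driver
    hostPath:
      # Assumption: driver volume as laid out by nvidia-docker-plugin.
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/367.57
```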

If set up correctly, the output should match the output from running the nvidia-docker container at the beginning.

Conclusion

Hopefully this guide helps someone wade through these undocumented features to make use of GPUs in their cluster.