Lead Image © Luxuz, photocase.com

A Hands-on Look at Kubernetes with OpenAI

Learning in Containers

Article from ADMIN 41/2017
For research into deep learning algorithms that automatically acquire new skills, OpenAI operates some of the largest Kubernetes clusters worldwide, with up to 36,000 CPU cores. We look at some practical experience with the container management system.

OpenAI [1] is a non-profit, privately funded research institution in San Francisco, where I work with about 60 other employees on machine learning and artificial intelligence. In concrete terms, my colleagues are examining how a computer can be taught new behavior through experience, without being explicitly programmed for each task. Contributors include Elon Musk (Tesla, SpaceX) [2] and Sam Altman (Y Combinator) [3], among others.

OpenAI's staff contribute academic publications, conference presentations, and software for researchers and developers. In this article, I show how OpenAI prepares its Kubernetes clusters to run artificial intelligence experiments across thousands of computers.

Go Deep

The company's main focus is on deep learning – that is, researching large neural networks with many layers. In recent years, deep learning has gained importance because it has repeatedly proven able to solve extremely complicated problems.

For example, the AlphaGo bot, developed by Google's DeepMind, learned to play the Chinese board game Go, which is considered extremely complicated, with a far wider range of moves than chess. Go experts agreed that it would take at least 20 to 30 years before a computer could beat the best human Go players. However, in the spring of 2016, the AlphaGo deep learning software defeated Lee Sedol, the top Go player at the time, and in 2017 it even beat a team of five world champions.

In contrast, OpenAI researches algorithms that, unlike AlphaGo, learn not just a single game, but a wide range of tasks. The company recently developed a robot that observes only once how a person does a task previously unknown to the robot; then, Fetch (the robot's internal name) is able to repeat the task in a new environment (Figure 1) [4]. The robot even learned the concepts of gripping and stacking by observing people.

Figure 1: The robot (left) learns how to replicate a structure created by a human (right), but with new block starting positions.

Despite the proven success of Google, Facebook, OpenAI, and various university projects, no one knows what the perfect structure of a deep learning neural network looks like. The question is, which structure is best at allowing a network to learn new skills? The bulk of the research in the company, then, comprises experiments with new learning algorithms and network structures, which require the use of a complex technical infrastructure. OpenAI researches structures (Figure 2) and seeks to find out how data moves through them.

Figure 2: A neural network for controlling a robot. The camera image in the mind of the robot "moves" from left to right, resulting in the movement of the robot arm.

Bare-Metal Performance

For researchers to work productively, the team must be in a position to distribute new algorithms and network topologies to thousands of computers in seconds and to evaluate the results. An extremely flexible infrastructure is needed. The project requirements are articulated as follows:

  • Performance is vital: Training and evaluating a neural network requires massive computing power, which is why the project uses graphics processing units (GPUs), among other hardware.
  • Computer topologies must be customizable: The project must be able to implement unforeseen changes in machine structure, such as distributing an originally centralized computation to hundreds of computers.
  • Usage fluctuates rapidly: Researchers commonly try out new methods on a single CPU core for two months and then suddenly need 10,000 cores for two weeks.

To achieve the best possible raw performance, the project needs direct access to the hardware as well as to the CPU and accelerators such as GPUs. Modern hypervisors such as Xen run the CPU code natively, but all I/O operations generate administrative overhead. OpenAI needs the maximum bandwidth, in particular for communicating with GPUs via the PCIe bus. Classical virtualization is thus out of the question.

On the other hand, a large bare-metal server farm would be a nightmare for users and admins. Instead of relying on one cloud provider, OpenAI researchers use all kinds of providers – Microsoft Azure, Amazon Web Services (AWS), Google Compute Engine (GCE), and our own data centers – for cost savings from volume discounts for thousands of computers, among other things. However, most of our researchers are not Linux or cloud experts; learning three or more cloud APIs is out of the question. Instead, they need tools that empower them to implement their ideas independently, without having to ask engineers for help.

The company relies on Docker containers [5] and Kubernetes [6] to master this balancing act between usability and performance. Containers let you describe all the dependencies and packages (e.g., TensorFlow [7]) for an experiment. The container image can be saved as a snapshot and sent over the network, much like a virtual machine (VM) image. Because containers run directly on the host kernel, however, they come without the performance overhead of VMs.
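In practice, the experiment's code and its dependencies end up in an image that is pushed to a registry, from which every node can pull it. A minimal sketch of that workflow follows; the registry address and image name are purely illustrative:

$ docker build -t registry.example.com/experiments/fetch:v1 .   # package code and dependencies as an image
$ docker push registry.example.com/experiments/fetch:v1         # ship the snapshot over the network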

The researchers describe their experiments and algorithms as containers and Kubernetes pods. The infrastructure team then ensures that Kubernetes provides the required computers (nodes) to accommodate all the pods.
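What such a pod might look like is sketched below. The pod and image names are hypothetical, and alpha.kubernetes.io/nvidia-gpu is the GPU resource name used by Kubernetes releases of this era (it also appears in Listing 2):

apiVersion: v1
kind: Pod
metadata:
  name: fetch-experiment                                # hypothetical experiment name
spec:
  restartPolicy: Never                                  # a finished experiment should not be restarted
  containers:
  - name: worker
    image: registry.example.com/experiments/fetch:v1    # the illustrative image built above
    resources:
      limits:
        cpu: "4"
        alpha.kubernetes.io/nvidia-gpu: 1               # request one GPU on the node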

The team is currently running a number of Kubernetes clusters with up to 2,000 nodes, which probably makes the company one of the largest Kubernetes users. On the minus side, this means that Kubernetes (often referred to as Kube) is on shaky ground at this scale, and stability is a constant battle. So that third parties can also benefit from our experience with Kubernetes' helpful data structures and abstractions, I describe in the following section some of the "gotchas" encountered by the team.

Uncharted Territory

The typical life cycle of an experiment starts with one or two researchers testing new ideas for several weeks or months. During this time, they only need minimal computing capacity. At some point (usually just before the deadline of an academic journal), they need tens of thousands of CPU cores in one fell swoop to carry out much larger calculations. To avoid paying for idle CPUs, the team has to change the size of the cluster at run time. The solution is an autoscaler (kubernetes-ec2-autoscaler) [8], which OpenAI has released under an MIT license on GitHub. The autoscaler works in three steps (the first of which can be checked by hand, as shown after the list):

  • fetching a list of all pods that currently cannot be placed on a node;
  • creating a plan to launch a sufficient number of nodes, taking into consideration several factors, such as providing cheap computing power, providing GPUs, co-locating related jobs, and complying with the various user preferences; and, finally,
  • putting the plan into practice.
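The first step can be reproduced by hand on any cluster: Pods that cannot be placed remain in the Pending state, and a FailedScheduling event explains which resource is missing. For example (the pod name is a placeholder):

$ kubectl get pods --all-namespaces | grep Pending     # pods the scheduler cannot place yet
$ kubectl describe pod <pod-name> | grep -A 3 Events   # the FailedScheduling event names the missing resource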

At this point, the Kube API shows one of its strengths: Because the autoscaler uses an API to access all the resources, it allows OpenAI to build a kind of programmable infrastructure. Kubernetes provides extremely practical and extensible data structures that are easy to inspect, even from a home computer:

kubectl get nodes

run against a running cluster lists its nodes (Listing 1); adding -o yaml reveals the full data structure for a single node (Listing 2). The labels serve as custom markers in which to store metadata such as the location, CPU type, and machine equipment. The autoscaler then provides machines with the correct metadata, depending on the requirement; an example of how a pod consumes such a label follows Listing 2.

Listing 1

Show Nodes

$ kubectl get nodes
NAME            STATUS                 AGE       VERSION
10.126.22.9     Ready                  3h        v1.6.2

Listing 2

Kubernetes Node Data Structure in YAML

$ kubectl get node 10.126.22.9 -o yaml
apiVersion: v1
kind: Node
metadata:
  creationTimestamp: 2017-06-07T08:15:30Z
  labels:
    openai.org/location: azure-us-east-v2
  name: 10.126.22.9
spec:
  externalID: 10.126.22.9
  providerID: azure:////62823750-1942-A94F-822E-E6BF3C9EDCC4
status:
  addresses:
  - address: 10.126.22.9
    type: InternalIP
  - address: 10.126.22.9
    type: Hostname
  allocatable:
    alpha.kubernetes.io/nvidia-gpu: "0"
    cpu: "20"
    memory: 144310716Ki
    pods: "28"
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://1.12.6
    kernelVersion: 4.4.0-72-generic
    osImage: Ubuntu 14.04.5 LTS
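
The openai.org/location label from Listing 2 shows how this works in practice: A pod that names the label in its nodeSelector is only scheduled onto machines in that region. The following sketch sets the label by hand (the autoscaler normally attaches it when it boots a node):

$ kubectl label node 10.126.22.9 openai.org/location=azure-us-east-v2

A pod then requests such a node with a nodeSelector in its spec:

spec:
  nodeSelector:
    openai.org/location: azure-us-east-v2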
