Research

Contextual Recurrent Neural Networks

There is an implicit assumption that by unfolding recurrent neural networks (RNN) in finite time, the misspecification of choosing a zero value for the initial hidden state is mitigated by later time steps. This assumption has been shown to work in practice and alternative initialization may be suggested but often overlooked. In this paper, we propose a method of parameterizing the initial hidden state of an RNN. The resulting architecture, referred to as a Contextual RNN, can be trained end-to-end. The performance on an associative retrieval task is found to improve by conditioning the RNN initial hidden state on contextual information from the input sequence. Furthermore, we propose a novel method of conditionally generating sequences using the hidden state parameterization of Contextual RNN.
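As a rough illustration of the idea (my own sketch, not the paper’s architecture), the initial hidden state becomes a small learned function of a context vector instead of a vector of zeros:

```python
import tensorflow as tf

def contextual_initial_state(context, num_units):
    """Map contextual information (e.g. a summary of the input sequence)
    to the RNN's initial hidden state instead of using zeros."""
    context_size = context.get_shape().as_list()[1]
    W = tf.get_variable("W_init", [context_size, num_units])
    b = tf.get_variable("b_init", [num_units],
                        initializer=tf.constant_initializer(0.0))
    return tf.tanh(tf.matmul(context, W) + b)
```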

Tools

Receptive Field Calculator

A convolutional layer operates over a local region of the input to that layer, with the size of this local region usually specified directly. You can also compute the effective receptive field of a convolutional layer, which is the size of the input region to the network that contributes to a layer’s activations. For example, if the first convolutional layer has a receptive field of 3x3 then its effective receptive field is also 3x3, since it operates directly on the input. However, if the second layer of a convolutional network also has a 3x3 filter, then its (local) receptive field is 3x3, but its effective receptive field is 5x5.
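The effective receptive field can be computed layer by layer. Here’s a small sketch (the function name and the (kernel, stride, dilation) tuple format are my own):

```python
def effective_receptive_field(layers):
    """Effective receptive field of a stack of convolutional layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the first layer to the last.
    """
    rf = 1      # effective receptive field so far
    jump = 1    # distance between adjacent outputs, measured in input pixels
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Two 3x3, stride-1 layers: 3x3 for the first, 5x5 for the second (as above).
print(effective_receptive_field([(3, 1, 1)]))             # -> 3
print(effective_receptive_field([(3, 1, 1), (3, 1, 1)]))  # -> 5
```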

Convolutional Filter Inputs

Calculator inputs: Layer #, Kernel Size, Stride, Dilation, Padding, Input Size, Output Size, Receptive Field.

Code

TensorFlow on Kubernetes

While GPUs are a staple of deep learning, deploying on GPUs makes everything more complicated, including your Kubernetes cluster. This quick guide will walk through adding basic single-GPU support to Kubernetes.

The guide assumes that Kubernetes is already running on Ubuntu. An LTS release is preferable, with 14.04 being the most preferable due to NVIDIA’s recommendations for driver hosts. Warning: Ubuntu 14.04 is not well supported by Kubernetes, so feel free to use a different distro. This guide also assumes that the proper GPU drivers and CUDA version have been installed. Plenty of other guides cover those topics.

TL;DR: start with nvidia-docker, then whittle away its functionality so that just plain docker remains. Then add that functionality to Kubernetes.

Working without nvidia-docker

A common way to run containerized GPU applications is to use nvidia-docker. Here is an example of running TensorFlow with full GPU support inside a container.

Simple! If all goes well the output should look something like this:

Unfortunately it’s not currently possible to use nvidia-docker directly from Kubernetes. Additionally, Kubernetes does not support the nvidia-docker-plugin, since Kubernetes does not use Docker’s volume mechanism.

The goal is to manually replicate the functionality provided by nvidia-docker (and its plugin). For demonstration, query the nvidia-docker-plugin REST API for the command line arguments it would pass to docker:
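Something along these lines works (the default port 3476 and the /v1.0/docker/cli endpoint are assumptions about nvidia-docker-plugin; adjust for your install):

```python
import requests

# Ask the plugin which flags it would add to `docker run` (devices and volumes).
cli_args = requests.get("http://localhost:3476/v1.0/docker/cli").text
print(cli_args)
```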

These arguments then feed into plain docker, running the same Python command:

If all goes well, TensorFlow should find everything correctly and you should see the same output as before.

Finally, remove the dependency on nvidia-docker-plugin by manually specifying the driver path and manually mounting the devices and CUDA volumes.

Enabling GPU devices

With the knowledge of what Docker needs to be able to run a GPU-enabled container, it is straightforward to add this to Kubernetes. The first step is to enable an experimental flag on all of the GPU nodes. In the Kubelet options (found in /etc/default/kubelet if you use upstart for services), add --experimental-nvidia-gpus=1. This does two things. First, it exposes GPU resources on the node for use by the scheduler. Second, when a GPU resource is requested, it adds the appropriate device flags to the docker command. This post describes a little more about what this flag does and why it exists:

http://blog.clarifai.com/how-to-scale-your-gpu-cloud-infrastructure-with-kubernetes

The full GPU proposal, including the existing flag and future steps can be found here:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md

Pod Spec

With the device flags added by the experimental GPU flag, the final step is to add the necessary volumes to the pod spec. A sample pod spec is provided below:
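The original spec isn’t reproduced here, but the important pieces look roughly like the outline below, written as a Python dict for illustration (the resource name, image, and driver volume path are assumptions that depend on your Kubernetes version and driver install):

```python
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tensorflow-gpu"},
    "spec": {
        "containers": [{
            "name": "tensorflow",
            "image": "tensorflow/tensorflow:latest-gpu",
            # Requesting a GPU triggers the device flags added by the kubelet.
            "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": 1}},
            # Mount the NVIDIA driver libraries that nvidia-docker would provide.
            "volumeMounts": [{"name": "nvidia-driver",
                              "mountPath": "/usr/local/nvidia",
                              "readOnly": True}],
        }],
        "volumes": [{"name": "nvidia-driver",
                     "hostPath": {"path": "/var/lib/nvidia-docker/volumes/nvidia_driver/367.57"}}],
    },
}
```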

If set up correctly, the output should match the output from running the nvidia-docker container at the beginning:

Conclusion

Hopefully this guide helps someone wade through these undocumented features to make use of GPUs in their cluster.

Theory

Hierarchical RNNs

Notes on Hierarchical Multiscale Recurrent Neural Networks

Introduces a novel update mechanism to learn latent hierarchical representations from data.

Introduction

State-of-the-art on PTB, Text8 and IAM On-Line Handwriting DB. Tied for SotA on Hutter Wikipedia.

Lots of prior work with hierarchy (hierarchical RNN / stacked RNN) and multi-scale (LSTM, clockwork RNN) but they all rely on pre-defined boundaries, pre-defined scales, or soft non-hierarchical boundaries.

Two benefits of discrete hierarchical representations:

  • Helps vanishing gradient since information is held at higher levels for more steps.
  • More computationally efficient in the discrete case since higher layers update less frequently.

Model

Uses parameterized binary boundary detectors at each layer. Avoids “soft” gating which leads to “curse of updating every timestep”.

Boundary detectors determine operations for modifying RNN state: COPY, FLUSH, UPDATE (a rough sketch follows the list):

  • UPDATE: similar to LSTM but sparse, according to boundary detector.
  • COPY: copies cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (recurrent generalization of stochastic depth) which uses Bernoulli distribution to copy hidden state across timesteps.
  • FLUSH: sends summary to next layer and re-initializes current layer’s state.
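Here is a rough sketch of how the three operations combine in the cell update (my reading of the paper: f, i, o are the usual LSTM gates, g is the cell proposal, and the z values are the binary boundary detectors):

```python
import numpy as np

def hm_lstm_update(z_prev, z_below, f, i, o, g, c_prev, h_prev):
    """Boundary-conditioned state update: FLUSH, UPDATE or COPY."""
    if z_prev == 1:        # FLUSH: a summary was sent up, re-initialize the cell
        c = i * g
    elif z_below == 1:     # UPDATE: sparse LSTM-style update
        c = f * c_prev + i * g
    else:                  # COPY: carry cell and hidden state through unchanged
        return c_prev, h_prev
    return c, o * np.tanh(c)
```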

Discrete (binary) decisions are difficult to optimize due to non-smooth gradients. Uses straight-through estimator (as an alternative to REINFORCE) to learn discrete variables. The simplest variant uses a step function on the forward pass and a hard sigmoid on backward pass for gradient estimation.
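A minimal sketch of that estimator in TensorFlow (the hard-sigmoid slope and names are mine): the forward value is the binary step while gradients flow through the hard sigmoid via tf.stop_gradient.

```python
import tensorflow as tf

def straight_through_boundary(logit):
    soft = tf.clip_by_value((logit + 1.0) / 2.0, 0.0, 1.0)  # hard sigmoid
    hard = tf.cast(tf.greater(soft, 0.5), tf.float32)       # binary step
    # Forward pass returns `hard`; the gradient flows through `soft` only.
    return soft + tf.stop_gradient(hard - soft)
```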

The slope annealing trick on the hard sigmoid compensates for the biased estimator, but the experimental results show minimal improvement. It also introduces more hyperparameters.

Implemented as a variant of LSTM (HM-LSTM) with custom operations above. No experimental results for variant with regular RNN (HM-RNN).

Results

Learns useful boundary detectors, visualized in the paper.

Latent representations possibly imperfect, or at least, not human: spaces, tree breaks, some bigrams, some prefix delineation (“dur”: during, duration, durable).

Only results on character-level compression tasks and handwriting, no explicit NLP tasks, e.g. machine translation, question-answering, or named entity recognition.

Conclusion

Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.

Process

Numerai Competition

Last week I spent some time diving into the Numerai machine learning competition. Below are my notes on the competition: things I tried, what worked and what didn’t. First an introduction to Numerai and the competition…

Numerai is a hedge fund which uses the competition to source predictions for a large ensemble that they use internally to make trades. Another detail that makes the competition unique is that the provided data has been encrypted in a way that still allows it to be used for predictions. Each week, Numerai releases a new dataset and the competition resets. After briefly holding 1st-2nd place in both score and originality, by the end of the week I was still “controlling capital” with a log loss of 0.68714. In all, this earned about $8.17 USD worth of Bitcoin.

Here’s a sample of the training data:

Validation

My first step in the competition was to generate a validation set so that I could run models locally and get a sense for how the models would do on the leaderboard. Using a simple stratified split that maintains the target distribution turned out not to be representative of the leaderboard, so I turned to “adversarial validation”. This clever idea was introduced by @fastml in a blog post here. Basically (a quick sketch in code follows the list):

  1. Train a classifier to identify whether data comes from the train or test set.
  2. Sort the training data by its probability of being in the test set.
  3. Select the training data most similar to the test data as your validation set.
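A quick sketch with scikit-learn (the file and column names are assumptions about the Numerai CSVs):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

train = pd.read_csv("numerai_training_data.csv")
test = pd.read_csv("numerai_tournament_data.csv")
features = [c for c in train.columns if c.startswith("feature")]

# 1. Classifier that tries to tell training rows from test rows.
X = pd.concat([train[features], test[features]])
y = [0] * len(train) + [1] * len(test)
p_test = cross_val_predict(LogisticRegression(), X, y,
                           method="predict_proba")[:len(train), 1]

# 2-3. The most "test-like" training rows become the validation set.
most_test_like = p_test.argsort()[-int(0.1 * len(train)):]
validation = train.iloc[most_test_like]
```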

This was much more representative, with the validation loss falling within ~0.001 of the corresponding log loss on the public leaderboard. Interestingly, the only reason this works is that the test data is dissimilar from much of the training data, which violates the IID assumption.

Baseline Model

Now that I had a good validation set I wanted to get a baseline model trained, validated and uploaded. As a starting point I used logistic regression with default settings and no feature engineering. This gets about 0.69290 validation loss and 0.69162 on the public leaderboard. It’s not great but now I know what a simple model can do. For comparison, first place is currently 0.64669, so the baseline is only about 6.5% off. This means any improvements are going to be really small. We can push this a little further with L2 regularization at 1e-2 which gets to 0.69286 (-0.006% from baseline).
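For reference, the baseline amounts to something like this (train_X, train_y, valid_X and valid_y are placeholder names for the split produced above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

baseline = LogisticRegression()   # default settings, no feature engineering
baseline.fit(train_X, train_y)
print(log_loss(valid_y, baseline.predict_proba(valid_X)[:, 1]))
```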

Neural Networks

I took a quick detour into neural networks before beginning feature engineering. Ideally, the networks would learn their own features with enough data; unfortunately, none of the architectures I tried showed much improvement over simple logistic regression. Additionally, deep neural networks can have far more learned parameters than logistic regression so I needed to regularize the parameters heavily with L2 and batch normalization (which can act as a regularizer per the paper). Dropout sometimes helped too depending on the architecture.

One interesting architecture that worked okay was using a single very wide hidden layer (2048 units) with very high dropout (0.9) and then leaving its initialized parameters fixed during training. This creates an ensemble of many random discriminators. While this worked pretty well (with a log loss around 0.689) the model hurt the final ensemble so it was removed. In the end neural networks did not yield enough improvement to continue their use here and would still rely on feature engineering, which defeated my intentions.
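A sketch of that architecture in TensorFlow (the 2048-unit width and 0.9 dropout come from the description above; the activation, initializer and placeholder names are my assumptions). The wide projection is created with trainable=False so only the readout layer learns:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, num_features])
W = tf.get_variable("random_proj", [num_features, 2048],
                    initializer=tf.random_normal_initializer(stddev=0.1),
                    trainable=False)                 # fixed after initialization
hidden = tf.nn.dropout(tf.nn.relu(tf.matmul(x, W)), keep_prob=0.1)  # dropout of 0.9
W_out = tf.get_variable("readout", [2048, 1])        # only the readout is trained
logits = tf.matmul(hidden, W_out)
```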

Data Analysis & Feature Engineering

Now I need to dig into the data, starting with a simple plot of each of the feature distributions:

Violin plot of the distributions for each feature.

The distributions are pretty similar for each feature and target. How about correlations between features:

Correlation matrix showing feature interactions.

Okay, so many of the features are strongly correlated. We can make use of this in our model by including polynomial features (e.g. PolynomialFeatures(degree=2) from scikit-learn). Adding these brings our validation loss down to 0.69256 (-0.05% from baseline).
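Adding them is a one-liner with scikit-learn (reusing the feature column names assumed earlier):

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
train_poly = poly.fit_transform(train[features])
test_poly = poly.transform(test[features])
```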

Now dimensionality reduction. I take the features and run principal component analysis (a linear method) to reduce the original features down to two dimensions for visualization:

PCA dimensionality reduction over original features.

This does not contain much useful information. How about with the polynomial features:

PCA dimensionality reduction over polynomial features.

The polynomial PCA produces a slightly better result by pulling many of the target “1” values towards the edges and many of the target “0” values towards the center. Still not great so I opted to omit PCA for now.

Instead I’ll use a fancier dimensionality reduction method called t-SNE or “t-Distributed Stochastic Neighbor Embedding”. t-SNE is often used for visualization of high-dimensional data but it has a useful property not found in PCA: t-SNE is non-linear and works on the probability of two points being selected as neighbors.

t-SNE embedding over the features; clusters colored using DBSCAN.

Here t-SNE captured really good features for visualization (e.g. local clusters), and incidentally for classification too! I add in these 2D features to the model to get the best validation loss so far: 0.68947 (-0.5% from baseline). I suspect the reason this helps is that there are actually many local features that logistic regression cannot pull out but are useful in classifying for the target. By running an unsupervised method specifically designed to align the data by pairwise similarities the model is able to use that information.

Since t-SNE is stochastic, multiple runs will produce different embeddings. To exploit this I’ll run t-SNE 5 or 6 times at different perplexities and dimensions (2D and 3D) then incorporate these extra features. Now the validation loss is 0.68839 (-0.65% from baseline).
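Roughly like this, using scikit-learn’s TSNE for illustration (a faster Barnes-Hut implementation is preferable on the full dataset, and note that t-SNE embeds train and test together since it has no transform for unseen samples):

```python
import numpy as np
from sklearn.manifold import TSNE

all_features = np.vstack([train[features].values, test[features].values])

embeddings = [TSNE(n_components=2, perplexity=p).fit_transform(all_features)
              for p in [5.0, 10.0, 15.0, 30.0, 50.0]]
embeddings.append(TSNE(n_components=3, perplexity=30.0).fit_transform(all_features))
tsne_features = np.hstack(embeddings)
```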

Note, some implementations of t-SNE do not work correctly in 3D. Plot them to make sure you’re seeing a blob, not a pyramid shape.

Additional Embeddings

Since t-SNE worked so well, I implemented several other embedding methods including autoencoders, denoising autoencoders, and generative adversarial networks. The autoencoders learned excellent reconstructions with >95% accuracy, even with noise, but their learned embeddings did not improve the model. The GAN, including a semi-supervised variant, did not outperform logistic regression. I also briefly experimented with kernel PCA and isomaps (also non-linear dimensionality reduction methods). Both improved the validation loss slightly but took significantly longer to run, reducing my ability to iterate quickly, so they were ultimately discarded. I never tried LargeVis or parametric t-SNE but they might be worth exploring. Parametric t-SNE would be particularly interesting since it allows fitting on a test holdout, rather than learning an embedding of all of the samples at once.

Isomap embedding of the original features.

Pairwise Interactions

One of the models that made it into the final ensemble was to explicitly model pairwise interactions. Basically, given features from two samples predict which of the two had a greater probability of being classified as “1”. This provides significantly more data since you’re modeling interactions between samples, rather than individual samples. It also hopefully learns useful features for classifying by the intended target. To make predictions for the target classification I take the average of each sample’s prediction against all other samples. (It’s probably worth exploring more sophisticated averaging techniques.) This performed similarly to logistic regression and produced different enough results to add to the ensemble.
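A rough sketch of the pairwise setup as I understand it (train_X and train_y are placeholder numpy arrays from the earlier split):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_pairs(X, y, n_pairs=100000, seed=0):
    rng = np.random.RandomState(seed)
    i = rng.randint(len(X), size=n_pairs)
    j = rng.randint(len(X), size=n_pairs)
    keep = y[i] != y[j]                              # only pairs with a clear winner
    pair_X = np.hstack([X[i][keep], X[j][keep]])
    pair_y = (y[i][keep] > y[j][keep]).astype(int)   # 1 if the first sample is the "1"
    return pair_X, pair_y

pairwise = LogisticRegression().fit(*make_pairs(train_X, train_y))

def pairwise_score(x, reference_X):
    """Average predicted 'wins' of sample x against a set of reference rows."""
    pairs = np.hstack([np.tile(x, (len(reference_X), 1)), reference_X])
    return pairwise.predict_proba(pairs)[:, 1].mean()
```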

Hyperparameter Search

Now that we have useful features and a few models that perform well I wanted to run a hyperparameter search and see if it could outperform the existing models. Since scikit-learn’s GridSearchCV and RandomizedSearchCV only explore hyperparameters, not entire architectures, I opted to use tpot which searches over both. This discovered that using randomized PCA would outperform PCA and that L1 regularization (sparsity) slightly outperformed L2 regularization (smoothing), especially when paired with random PCA. Unfortunately neither of the discovered interactions made it into the final ensemble: hand engineering won out.

Ensemble

With a few models complete it’s time to ensemble their predictions. There are a number of methods for doing this covered here but I opted for a simple average using the geometric mean.
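For probabilities the geometric mean is just the exponential of the mean log prediction. A small sketch (model_predictions is a placeholder for the list of per-model probability arrays):

```python
import numpy as np

stacked = np.clip(np.array(model_predictions), 1e-6, 1 - 1e-6)  # avoid log(0)
ensemble = np.exp(np.log(stacked).mean(axis=0))                 # geometric mean
```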

The final ensemble consisted of 4 models: logistic regression, gradient boosted trees, factorization machines and the pairwise model described above. I used the same features for each model, consisting of the original 21 features, five runs of t-SNE in 2D at perplexities of 5.0, 10.0, 15.0, 30.0, and 50.0, and one run of t-SNE in 3D at a perplexity of 30 (I only included a single run because it takes significantly longer in 3D). These features were combined with polynomial interactions and run through the models to produce the final log loss of 0.68714 on the leaderboard.

Conclusion

Overall it was an interesting competition, quite different from something like Kaggle. I especially enjoyed experimenting with the encrypted data, which was a first for me. While the payouts and “originality” bonuses are interesting mechanics, it’s often better to look at the rewards as points rather than currency, which made the competition more fun overall. On the other hand, now I have my first bitcoin… 😊

Theory

TD-Gammon

TL;DR Introduces temporal difference learning, TD-Lambda / TD-Gammon, and eligibility traces. Check out the Github repo for an implementation of TD-Gammon with TensorFlow.

A few weeks ago AlphaGo won a historic tournament playing the game of Go against Lee Sedol, one of the top Go players in the world. Many people have compared AlphaGo to Deep Blue, which won a series of famous chess matches against Garry Kasparov, but a different comparison can be made for the game of backgammon.

Before DeepMind tackled playing Atari games or built AlphaGo there was TD-Gammon, the first algorithm to reach an expert level of play in backgammon. Gerald Tesauro published his paper in 1992 describing TD-Gammon as a neural network trained with reinforcement learning. It is referenced in both Atari and AlphaGo research papers and helped set the groundwork for many of the advancements made in the last few years.

Temporal-Difference Learning

TD-Gammon consists of a simple three-layer neural network trained using a reinforcement learning technique known as TD-Lambda or temporal-difference learning with a trace decay parameter lambda (λ). The neural network acts as a “value function” which predicts the value, or reward, of a particular state of the game for the current player.

During training, the neural network iterates over all possible moves for the current player, evaluates each valid move, and selects the move with the highest value. Because the network evaluates moves for both players, it’s effectively playing against itself. Using TD-Lambda we want to improve the neural network so that it can reasonably predict the most likely outcome of a game from a given board state. It does this by learning to reduce the difference between its value for the next state and its value for the current state.

Let’s start with a loss function, which describes how well the network is performing for any state at time t:
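In my notation (which may differ from the original figure), with Y_t denoting the network’s output for the board state at time t, the loss is roughly:

$$L_t = \alpha\left(Y_{t+1} - Y_t\right)^2$$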

Loss function: mean squared error of the difference between our neural network’s output for the next state and the output for the current state. The variable α is a small scalar to control the learning rate.

Here we want to minimize the mean squared error of the difference between the next prediction and the current prediction. Basically, we want our predictions about the present to match our predictions about the future. This in itself isn’t very useful until we know how the game ends so for the final step of the game we modify the loss function:
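In the same notation, with f the final time step of the game:

$$L_f = \alpha\left(z - Y_f\right)^2$$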

Same as above, but z represents the true outcome of the game.

Where z is the actual outcome of the game. Together these two loss functions work okay but the network will converge slowly and never reach a strong level of play.

Temporal Credit Assignment

To make our predictions more useful we need to solve the problem of temporal credit assignment: basically, which actions did the player take in the past that resulted in the desired outcome in the future? Right now the loss only incorporates two consecutive steps and we want to stretch that out.

With the loss function above our parameter updates will look something like this:
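In the same notation, this is a gradient step scaled by the learning rate and the difference δ:

$$\theta \leftarrow \theta + \alpha\,\delta_t\,\nabla_{\theta} Y_t$$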

Parameter updates for the loss function L.

Where θ is the network’s parameters (weights), α is the learning rate and δ is the difference we defined above:
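Roughly, in the same notation:

$$\delta_t = \begin{cases} Y_{t+1} - Y_t & \text{if } t < f \\ z - Y_f & \text{if } t = f \end{cases}$$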

Definition of δ for intermediate and end-game states where f is the final time-step of the game.

Now rather than include a single gradient we want to include all past gradients while paying more attention to the most recent. This is accomplished by keeping a history of gradients, then decaying each by increasing powers of λ that reflect how old the gradient has become:
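In the same notation, the update becomes a sum over all previous gradients, each decayed by a power of λ:

$$\theta \leftarrow \theta + \alpha\,\delta_t \sum_{k=1}^{t} \lambda^{t-k}\,\nabla_{\theta} Y_k$$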

The full definition for TD-Lambda includes a sum over all previous gradients, decayed by λ.

Eligibility Traces

Keeping a running history of gradients can become memory intensive depending on the size of the network and the length of the game. An elegant solution to this problem is to use something called an “eligibility trace”. Eligibility traces replace the gradient sum of the parameter update with a single moving gradient. The eligibility trace is defined as:
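In the same notation, starting from e_0 = 0:

$$e_t = \lambda\,e_{t-1} + \nabla_{\theta} Y_t$$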

Definition of an eligibility trace decayed by λ.

Basically, we decay our eligibility trace by λ then add the new gradient. With this, our parameter update becomes:
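In the same notation:

$$\theta \leftarrow \theta + \alpha\,\delta_t\,e_t$$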

New parameter update for TD-Lambda, using an eligibility trace in place of the gradient.

This effectively allows our parameter updates to take into account decisions made in the past. Now when we backpropagate the end game state, we take into account the gradients from earlier states in the game while we avoid keeping a complete history of gradients.

Results

At the start of training, each game can take hundreds or thousands of turns to complete, since the network is effectively playing a random strategy. As the network learns, games require only around 50–100 turns, and it will outperform an opponent making random moves after around 1,000 games (about an hour of training).

The average loss for a game can never really reach zero because there’s more uncertainty at the beginning of a game but it can be useful to visualize convergence:

Average loss for each of 5,000 games.

Conclusion

Hopefully, this post shed some light on a small part of the history of recent deep reinforcement learning papers and the temporal-difference learning algorithm. If you’re interested in learning more about reinforcement learning definitely check out Richard Sutton’s book on the topic. You can also download the code for this implementation of TD-Gammon and play against the pre-trained network included in the repo.

Theory

An LSTM Odyssey

This week I read LSTM: A Search Space Odyssey. It’s an excellent paper that systematically evaluates the different internal mechanisms of an LSTM (long short-term memory) block by disabling each mechanism in turn and comparing their performance. We’re going to implement each of the variants in TensorFlow and evaluate their performance on the Penn Tree Bank (PTB) dataset. This will obviously not be as thorough as the original paper but it allows us to see, and try out, the impact of each variant for ourselves.

TL;DR Check out the Github repo for results and variant definitions.

Vanilla LSTM

We’ll start with a setup similar to TensorFlow’s RNN tutorial. The primary difference is that we’re going to use a very simple re-implementation for the LSTM cell defined as follows:
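As I recall the paper’s notation (W are input weights, R recurrent weights, p peephole weights, b biases, and t the time step), the equations are approximately:

$$\begin{aligned}
\mathbf{z}^{t} &= g(W_{z}\mathbf{x}^{t} + R_{z}\mathbf{y}^{t-1} + \mathbf{b}_{z}) \\
\mathbf{i}^{t} &= \sigma(W_{i}\mathbf{x}^{t} + R_{i}\mathbf{y}^{t-1} + \mathbf{p}_{i}\odot\mathbf{c}^{t-1} + \mathbf{b}_{i}) \\
\mathbf{f}^{t} &= \sigma(W_{f}\mathbf{x}^{t} + R_{f}\mathbf{y}^{t-1} + \mathbf{p}_{f}\odot\mathbf{c}^{t-1} + \mathbf{b}_{f}) \\
\mathbf{c}^{t} &= \mathbf{z}^{t}\odot\mathbf{i}^{t} + \mathbf{c}^{t-1}\odot\mathbf{f}^{t} \\
\mathbf{o}^{t} &= \sigma(W_{o}\mathbf{x}^{t} + R_{o}\mathbf{y}^{t-1} + \mathbf{p}_{o}\odot\mathbf{c}^{t} + \mathbf{b}_{o}) \\
\mathbf{y}^{t} &= h(\mathbf{c}^{t})\odot\mathbf{o}^{t}
\end{aligned}$$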

LSTM equations from section 2.

This corresponds to the “vanilla” LSTM from the paper. Each equation defines a particular component of the block: block input (z), input gate (i), forget gate (f), cell state (c), output gate (o) and block output (y). Both g and h represent the hyperbolic tangent function and sigma represents the sigmoid activation function. The circle dot represents element-wise multiplication.

Here’s the same thing in code:
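Here’s a rough sketch of such a cell body, with my own variable names (see the repo for the real version). It mirrors the equations above, one TensorFlow op per term:

```python
import tensorflow as tf

def vanilla_lstm(x, state, num_units):
    h_prev, c_prev = state
    input_size = x.get_shape().as_list()[1]

    def gate_params(name):
        W = tf.get_variable("W_" + name, [input_size, num_units])
        R = tf.get_variable("R_" + name, [num_units, num_units])
        b = tf.get_variable("b_" + name, [num_units],
                            initializer=tf.constant_initializer(0.0))
        return W, R, b

    def peephole(name):
        return tf.get_variable("p_" + name, [num_units])

    W_z, R_z, b_z = gate_params("z")
    W_i, R_i, b_i = gate_params("i")
    W_f, R_f, b_f = gate_params("f")
    W_o, R_o, b_o = gate_params("o")

    z = tf.tanh(tf.matmul(x, W_z) + tf.matmul(h_prev, R_z) + b_z)   # block input
    i = tf.sigmoid(tf.matmul(x, W_i) + tf.matmul(h_prev, R_i)
                   + peephole("i") * c_prev + b_i)                  # input gate
    f = tf.sigmoid(tf.matmul(x, W_f) + tf.matmul(h_prev, R_f)
                   + peephole("f") * c_prev + b_f)                  # forget gate
    c = z * i + c_prev * f                                          # cell state
    o = tf.sigmoid(tf.matmul(x, W_o) + tf.matmul(h_prev, R_o)
                   + peephole("o") * c + b_o)                       # output gate
    y = tf.tanh(c) * o                                              # block output
    return y, (y, c)
```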

Be sure to check out the full source for the rest of the cell definition. Mostly we create a new class inheriting from RNNCell and use the above code as the body of __call__. The nice part about this setup is that we can utilize MultiRNNCell to stack the LSTMs into multiple layers.

Notice that we initialize all of our parameters using get_variable. This is necessary so that we can reuse these variables for each time step rather than creating new parameters at each step. Also, all parameters are transposed from the paper’s definitions to avoid additional graph operations.

Then we define each equation as operations in the graph. Many of the operations have reversed inputs from the equations so that the matrix multiplications produce the correct dimensionality. Other than these details we’re directly translating the equations.

Note that from a performance perspective, this is a naïve implementation. If you look at the source for TensorFlow’s LSTMCell you’ll see that all of the cell inputs and states are concatenated together before doing any matrix multiplication. This is to improve performance, however, since we’re more interested in taking the LSTM apart, we’ll keep things simple.

Running this vanilla LSTM on the included notebook we obtain a test perplexity (e^cost) of less than 100. So far so good. This will serve as our baseline to compare to the other variants. Below is the cost (average negative log probability of the target words) on the validation set after each epoch:

Vanilla cost on the validation set

Variants

The most helpful bits for implementing each of the variants can be found in appendix A3 of the paper. The gate omission variants such as no input gate (NIG), no forget gate (NFG), and no output gate (NOG) simply set their respective gates to 1 (be sure to use floats, not integers, here):

NIG sets i to 1, NFG sets f to 1 and NOG sets o to 1.

The no input activation function (NIAF) and no output activation function (NOAF) variants remove their input or output activation functions, respectively:

NIAF removes the g(x) activation function, while NOAF removes the h(x) activation function.

The no peepholes (NP) variant removes peepholes from all three gates:

For all three gates, remove the peepholes.

The coupled input-forget gate (CIFG) variant sets the forget gate like so:

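As I recall from the paper, the coupling simply ties the forget gate to the input gate:

$$\mathbf{f}^{t} = 1 - \mathbf{i}^{t}$$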

The final variant, full gate recurrence (FGR), is the most complex, essentially allowing each gate’s previous state to interact with each gate’s next state:

Recurrent connections are added for each of the gates.

In many of the variants, we can remove parameters no longer needed to compute the cell. The FGR variant, however, adds significantly more parameters (9 additional square matrices) which also increases training time.

To implement each, we’ll simply duplicate our vanilla LSTM cell implementation and make the necessary modifications for the variant. There are too many to show here but you can view the full source for each variant on Github. To train each, we’ll use the same hyperparameters from the vanilla LSTM trial. This probably isn’t fair and a more thorough analysis (as performed in the paper) would try to find the best hyperparameters for each variant.

Training progress of model variants.

Results

The NFG and NOG variants fail to converge to anything useful while the NIAF variant diverges significantly after around the 8th epoch. (This divergence could probably be fixed with learning rate decay which I omitted for simplicity.)

Diverging variants

In contrast, the NIG, CIFG, NP and FGR variants all converge. The NIG and FGR variants do not produce great results while the NP and CIFG variants perform similarly to the vanilla LSTM.

Converging variants

Finally, the NOAF variant. Its poor performance is likely due to the lack of clamping from the output activation function, so its cost explodes:

Here are the test perplexities for each variant:

Conclusion

Overall it’s been fun dissecting the LSTM. Feel free to try out the code yourself and if you’re interested in taking this further I recommend running comparisons with GRUs, looking at fANOVA or extending what’s here with more thorough analysis.

Theory

Highway Networks

This week I implemented highway networks to get an intuition for how they work. Highway networks, inspired by LSTMs, are a method of constructing networks with hundreds, even thousands, of layers. Let’s see how we construct them using TensorFlow.

TL;DR Fully-connected highway repo and convolutional highway repo.

Implementation

For comparison, let’s start with a standard fully-connected (or “dense”) layer. We need a weight matrix and a bias vector then we’ll compute the following for the layer output:

$$\mathbf{y}=H(\mathbf{x},\mathbf{W_{H}})$$
Computing the output of a dense layer. (Bias omitted for simplicity and to match the paper.)

Here’s what a dense layer looks like as a graph in TensorBoard:

A dense layer in TensorBoard

For the highway layer what we want are two “gates” that control the flow of information. The “transform” gate controls how much of the activation we pass through and the “carry” gate controls how much of the unmodified input we pass through. Otherwise, the layer largely resembles a dense layer with a few additions:

$$\mathbf{y}=H(\mathbf{x},\mathbf{W_{H}})\cdot{}T(\mathbf{x},\mathbf{W_{T}})+\mathbf{x}\cdot{}C(\mathbf{x},\mathbf{W_{C}})$$
Computing the highway layer output. (Bias omitted for simplicity and to match the paper.)
  • An extra set of weights and biases to be learned for the gates.
  • The transform gate operation (T).
  • The carry gate operation (C or just 1 - T).
  • The layer output (y) with the new gates.

What happens is that when the transform gate is 1, we pass through our activation (H) and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input (x), while the activation is suppressed.
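Here’s a minimal sketch of such a layer in TensorFlow (the ReLU choice for H and the variable names are my own; the carry gate is coupled as C = 1 - T and the transform-gate bias starts negative, following the paper’s initialization advice):

```python
import tensorflow as tf

def highway_layer(x, size, carry_bias=-1.0):
    W_H = tf.get_variable("W_H", [size, size])
    b_H = tf.get_variable("b_H", [size], initializer=tf.constant_initializer(0.0))
    W_T = tf.get_variable("W_T", [size, size])
    # A negative initial bias on the transform gate favors carrying early in training.
    b_T = tf.get_variable("b_T", [size], initializer=tf.constant_initializer(carry_bias))

    H = tf.nn.relu(tf.matmul(x, W_H) + b_H)   # the usual dense activation
    T = tf.sigmoid(tf.matmul(x, W_T) + b_T)   # transform gate
    C = 1.0 - T                               # carry gate
    return H * T + x * C
```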

Here’s what the highway layer graph looks like in TensorBoard:

A highway layer in TensorBoard

Using a highway layer in a network is also straightforward. One detail to keep in mind is that consecutive highway layers must be the same size, but you can use fully-connected layers to change dimensionality. This becomes especially complicated with convolutional layers, where each layer can change the output dimensions. We can use padding (‘SAME’) to maintain each layer’s dimensionality.

Otherwise, by simply using hyperparameters from the TensorFlow docs (i.e. no hyperparameter search) the fully-connected highway network performed much better than a fully-connected network. Using MNIST as my simple trial:

  • 20 fully-connected layers fail to achieve more than 15% accuracy.
  • 18 highway layers (with two fully-connected layers to transform the input and output) achieve ~95% accuracy, which is also much better than a shallow network that only reaches 91%.

Now that we have a highway network, I wanted to answer a few questions that came up for me while reading the paper. For instance, how deep can the network go and still converge? The paper briefly mentions 1000 layers:

In pilot experiments, SGD did not stall for networks with more than 1000 layers. (2.2)

Can we train with 1000 layers on MNIST?

Yes, also reaching around 95% accuracy. Try it out with a carry bias around -20.0 for MNIST (from the paper, the network will only utilize ~15 layers anyway). The network can probably go even deeper since it’s just learning to carry the last 980 layers or so. We can’t do much useful at or past 1000 layers so that seems sufficient for now.

What happens if you set very low or very high carry biases?

In either extreme the network simply fails to converge in a reasonable amount of time. In the case of low biases (more positive), the network starts as if the carry gates aren’t present at all. In the case of high biases (more negative), we’re putting more emphasis on carrying and the network can take a long time to overcome that. Otherwise, the biases don’t seem to need to be exact, at least on this simple example. When in doubt start with high biases (more negative) since it’s easier to learn to overcome carrying than without carry gates (which is just a plain network).

Conclusion

Overall I was happy with how easy highway networks were to implement. They’re fully differentiable with only a single additional hyperparameter for the initial carry bias. One downside is that highway layers do require additional parameters for the transform weights and biases. However, since we can go deeper, the layers do not need to be as wide which can compensate.

Here are the complete notebooks if you want to play with the code: fully-connected highway repo and convolutional highway repo.

Code

TensorFlow from Node.js

Even though the full C API for TensorFlow is not yet available, we can still use it to load TensorFlow graphs and evaluate them from other languages. This is incredibly useful for embedding pre-trained models in other applications. Embedding is one of the most interesting use cases for TensorFlow as it cannot be accomplished as easily with Theano.

Note that while all of the examples here will use Node.js, the steps are nearly identical in any language with C FFI support (e.g. Rust, Go, C#, etc.).

Requirements

git clone --recursive https://github.com/tensorflow/tensorflow

Compiling a shared library

We’ll start by compiling a shared library from TensorFlow using Bazel.

UPDATE: The following build rule for creating a shared library is now part of TensorFlow: https://github.com/tensorflow/tensorflow/pull/695

  • Create a new folder in the TensorFlow repo at tensorflow/tensorflow/libtensorflow/.
  • Inside this folder we’re going to create a new BUILD file which will contain a single call to cc_binary with the linkshared option set to 1 so that we get a .so from the build. The name of the binary must end in .so or it will not work.

Here’s the final directory structure:

  • tensorflow/tensorflow/libtensorflow/
  • tensorflow/tensorflow/libtensorflow/BUILD

Below is the complete BUILD file:

  1. From the root of the repository, run ./configure.
  2. Compile the shared library with bazel build :libtensorflow.so and locate the generated file from the repo’s root: bazel-bin/tensorflow/libtensorflow/libtensorflow.so

Now that we have our shared library, create a new folder for the host language. Since this is for Node.js I’ll name it tensorflowjs/. This folder can exist outside of the TensorFlow repo since we now have everything needed in the shared library. Copy libtensorflow.so into the new folder.

If you’re on OS X and using Node.js you’ll need to rename the shared library from libtensorflow.so to libtensorflow.dylib. TensorFlow produces a .so however the standard on OS X is dylib. The Node FFI library doesn’t look for .so, only .dylib; however it can read both formats, so we just rename it.

Creating the graph

Just like with the previous C++ tutorial we’re going to create a minimal graph and write it to a protobuf file. (Be sure to name your variables and operations.)

Creating the bindings

Now we can go through the TensorFlow C API header, almost line by line, and write the appropriate binding. Most of the time this is fairly direct, simply copying the signature of the function. I also created variables for many of the common types so they were more legible; for example, any structs which map to void* I declared as variables named after the struct. We can also use the ref-array Node module, which provides helpers for types like long long* (essentially an array of long long types), so we’ll define a LongLongArray type to correspond. Otherwise, we just copy the signature:

I also defined a few helper functions to eliminate some of the boilerplate when working with the TensorFlow interface. The first is TF_Destructor, a default tensor destructor for TF_NewTensor. This comment in the TensorFlow source makes it sound like it’s optional but it’s not:

Clients can provide a custom deallocator function so they can pass in memory managed by something like numpy.

Additionally, many TensorFlow functions return a TF_Status struct and checking the status can get tedious. So I defined a function called TF_CheckOK that simply checks if the status code is TF_OK using TF_GetCode. If it’s not, we throw an error using TF_Message to hopefully get a useful error message. (This function loosely corresponds to TF_CHECK_OK in the TensorFlow source.)

And finally, reading a tensor with TF_TensorData only returns a pointer; to actually read the data we need to extend the returned Buffer to the appropriate length. Creating a Buffer with the correct size is a few lines of boilerplate so I wrapped TF_TensorData to create TF_ReadTensorData which handles that boilerplate for us. Here are the helpers:

Now that we’ve defined our interface the steps for loading the graph are the same as with C++:

  1. Initialize a TensorFlow session.
  2. Read in the graph we exported above.
  3. Add the graph to the session.
  4. Setup our inputs and outputs.
  5. Run the graph, populating the outputs.
  6. Read values from the outputs.
  7. Close the session to release resources.

We can load and execute TensorFlow graphs from Node.js! I’ve put the whole thing together into a repo here (you’ll need to provide graph.pb and libtensorflow.dylib since they’re kinda large).

Code

TensorFlow from C++

The current documentation around loading a graph with C++ is pretty sparse so I spent some time setting up a barebones example. In the TensorFlow repo there are more involved examples, such as building a graph in C++. However, the C++ API for constructing graphs is not as complete as the Python API. Many features (including automatic gradient computation) are not available from C++ yet. Another example in the repo demonstrates defining your own operations but most users will never need this. I imagine the most common use case for the C++ API is for loading pre-trained graphs to be standalone or embedded in other applications.

Be aware, there are some caveats to this approach that I’ll cover at the end.

Requirements

git clone --recursive https://github.com/tensorflow/tensorflow

Creating the graph

Let’s start by creating a minimal TensorFlow graph and write it out as a protobuf file. Make sure to assign names to your inputs and operations so they’re easier to reference when we execute the graph later. The nodes do have default names but they aren’t very useful: Variable_1 or Mul_3. Here’s an example created with Jupyter:
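Something along these lines (a minimal sketch with illustrative names and values; tf.mul and tf.initialize_all_variables are the era-appropriate calls and have newer equivalents):

```python
import tensorflow as tf

a = tf.Variable(5.0, name="a")
b = tf.Variable(6.0, name="b")
c = tf.mul(a, b, name="c")                  # tf.multiply in later versions

init = tf.initialize_all_variables()        # tf.global_variables_initializer later
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(c))                      # 30.0
    # Write the GraphDef as a binary protobuf for ReadBinaryProto to load.
    tf.train.write_graph(sess.graph_def, "models/", "graph.pb", as_text=False)
```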

Creating a simple binary or shared library

Let’s create a new folder like tensorflow/tensorflow/<my project name> for your binary or library to live in. I’m going to call the project loader since it will be loading a graph.

Inside this project folder we’ll create a new file called <my project name>.cc (e.g. loader.cc). If you’re curious, the .cc extension is essentially the same as .cpp but is preferred by Google’s code guidelines.

Inside loader.cc we’re going to do a few things:

  1. Initialize a TensorFlow session.
  2. Read in the graph we exported above.
  3. Add the graph to the session.
  4. Setup our inputs and outputs.
  5. Run the graph, populating the outputs.
  6. Read values from the outputs.
  7. Close the session to release resources.


Now we create a BUILD file for our project. This tells Bazel what to compile. Inside we want to define a cc_binary for our program. You can also use the linkshared option on the binary to produce a shared library or the cc_library rule if you’re going to link it using Bazel.

Here’s the final directory structure:

  • tensorflow/tensorflow/loader/
  • tensorflow/tensorflow/loader/loader.cc
  • tensorflow/tensorflow/loader/BUILD

Compile & Run

  • From the root of the tensorflow repo, run ./configure
  • From inside the project folder call bazel build :loader
  • From the repository root, go into bazel-bin/tensorflow/loader
  • Copy the graph protobuf to models/graph.pb
  • Then run ./loader and check the output!

You could also call bazel run :loader to run the executable directly; however, the working directory for bazel run is buried in a temporary folder and ReadBinaryProto looks in the current working directory for relative paths.

And that should be all we need to do to compile and run C++ code for TensorFlow.

The last thing to cover are the caveats I mentioned:

  1. The build is huge, coming in at 103MB, even for this simple example. Much of this is for TensorFlow, CUDA support and numerous dependencies we never use. This is especially true since the C++ API doesn’t support much functionality right now, as a large portion of the TensorFlow API is Python-only. There is probably a better way of linking to TensorFlow (e.g. shared library) but I haven’t gotten it working yet.
  2. There doesn’t seem to be a straightforward way of building this outside of the TensorFlow repo because of Bazel (many of the modules needed to link to are marked as internal). Again, there is probably a solution to this, it’s just non-obvious.

Conclusion

Hopefully someone can shed some light on these last points so we can begin to embed TensorFlow graphs in applications. If you are that person, message me on Twitter or email.
