Notes on Hierarchical Multiscale Recurrent Neural Networks

Introduces a novel update mechanism to learn latent hierarchical representations from data.

Introduction

State-of-the-art on PTB, Text8 and IAM On-Line Handwriting DB. Tied for SotA on Hutter Wikipedia.

There is plenty of prior work on hierarchy (hierarchical RNNs, stacked RNNs) and on multiple timescales (LSTM, clockwork RNN), but these approaches all rely on pre-defined boundaries, pre-defined scales, or soft, non-hierarchical boundaries.

Two benefits of discrete hierarchical representations:

  • Helps with vanishing gradients, since information is held at higher levels for more timesteps.
  • More computationally efficient in the discrete case, since higher layers update less frequently.

Model

Uses parameterized binary boundary detectors at each layer. Avoids “soft” gating, which leads to the “curse of updating every timestep”.

The boundary detectors determine which operation modifies the RNN state at each timestep: UPDATE, COPY, or FLUSH (a sketch of the state selection follows the list):

  • UPDATE: a standard LSTM-style update, but applied sparsely, as dictated by the boundary detector.
  • COPY: copies the cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (a recurrent generalization of stochastic depth), which uses a Bernoulli distribution to copy hidden states across timesteps.
  • FLUSH: passes a summary to the layer above and re-initializes the current layer’s state.
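
A minimal sketch of how the three operations select the next state of one layer, written in PyTorch with LSTM-style gates; the function name, argument names, shapes, and exact gating arithmetic are illustrative assumptions, not the paper’s reference implementation:

```python
import torch

def hm_lstm_state_select(c_prev, h_prev, f, i, o, g, z_below, z_own_prev):
    """Select this layer's next cell/hidden state via UPDATE, COPY, or FLUSH.

    c_prev, h_prev : previous cell / hidden state, shape (batch, hidden)
    f, i, o, g     : forget, input, output gates and candidate, shape (batch, hidden)
    z_below        : boundary detected by the layer below at this step, shape (batch, 1)
    z_own_prev     : this layer's own boundary at the previous step,    shape (batch, 1)
    """
    update = z_below * (1.0 - z_own_prev)        # UPDATE: sparse LSTM-style update
    copy = (1.0 - z_below) * (1.0 - z_own_prev)  # COPY: carry state through unchanged
    flush = z_own_prev                           # FLUSH: re-initialize from the candidate only

    c = update * (f * c_prev + i * g) + copy * c_prev + flush * (i * g)
    h = copy * h_prev + (1.0 - copy) * o * torch.tanh(c)
    return c, h
```

When the layer’s own boundary fired at the previous step (FLUSH), the old cell state is discarded entirely; when nothing fires (COPY), both states pass through untouched, which is what lets higher layers sit idle between boundaries.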

Discrete (binary) decisions are difficult to optimize with backpropagation because the step function has zero gradient almost everywhere. The paper uses the straight-through estimator (as an alternative to REINFORCE) to learn the discrete boundary variables. The simplest variant applies a step function on the forward pass and substitutes a hard sigmoid on the backward pass for gradient estimation.
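
A minimal sketch of that estimator, again assuming a PyTorch setting; the hard-sigmoid form and the detach-based straight-through trick are common implementations, not necessarily the paper’s exact code:

```python
import torch

def hard_sigmoid(x, slope=1.0):
    # Piecewise-linear surrogate: clamp((slope * x + 1) / 2, 0, 1).
    return torch.clamp((slope * x + 1.0) / 2.0, min=0.0, max=1.0)

def straight_through_boundary(pre_activation, slope=1.0):
    # Forward pass: binary step (boundary fires when the surrogate exceeds 0.5).
    # Backward pass: gradient of the hard sigmoid, via the detach trick.
    soft = hard_sigmoid(pre_activation, slope)
    hard = (soft > 0.5).float()
    return soft + (hard - soft).detach()
```

The returned tensor equals the binary boundary value in the forward pass, while gradients flow only through the soft surrogate.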

The slope annealing trick, which gradually steepens the hard sigmoid over the course of training, compensates for the estimator’s bias, but the experimental results show only minimal improvement, and it introduces additional hyperparameters.
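
A sketch of one plausible annealing schedule, continuing the code above; the linear form, rate, and cap here are illustrative hyperparameters:

```python
def annealed_slope(epoch, rate=0.04, max_slope=5.0):
    # Start with slope 1 and steepen linearly each epoch, capped at max_slope.
    return min(max_slope, 1.0 + rate * epoch)

# e.g. z = straight_through_boundary(pre_activation, slope=annealed_slope(epoch))
```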

The model is implemented as an LSTM variant (HM-LSTM) with the custom operations above. No experimental results are reported for the variant built on a plain RNN (HM-RNN).

Results

Learns useful boundary detectors, visualized in the paper.

The learned latent representations are possibly imperfect, or at least not human-like: boundaries fall on spaces, line breaks, some bigrams, and some prefix delineations (“dur”: during, duration, durable).

Results are reported only on character-level compression tasks and handwriting, with no explicit NLP tasks such as machine translation, question answering, or named entity recognition.

Conclusion

Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.
