Hierarchical RNNs

Notes on Hierarchical Multiscale Recurrent Neural Networks

Introduces a novel update mechanism to learn latent hierarchical representations from data.


State-of-the-art on PTB, Text8 and IAM On-Line Handwriting DB. Tied for SotA on Hutter Wikipedia.

Lots of prior work with hierarchy (hierarchical RNN / stacked RNN) and multi-scale (LSTM, clockwork RNN) but they all rely on pre-defined boundaries, pre-defined scales, or soft non-hierarchical boundaries.

Two benefits of discrete hierarchical representations:

  • Mitigates vanishing gradients, since information is held at higher layers for more timesteps.
  • More computationally efficient in the discrete case, since higher layers update less frequently.


Uses parameterized binary boundary detectors at each layer. Avoids “soft” gating which leads to “curse of updating every timestep”.

Boundary detectors determine which operation modifies the RNN state at each layer and timestep: UPDATE, COPY, or FLUSH:

  • UPDATE: similar to a standard LSTM update, but sparse, applied only when the boundary detector fires.
  • COPY: copies cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (recurrent generalization of stochastic depth) which uses Bernoulli distribution to copy hidden state across timesteps.
  • FLUSH: sends a summary of the detected segment to the next layer up and re-initializes the current layer’s state.
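
The three operations can be sketched as a single cell-state rule. This is a minimal scalar sketch of the selection logic (the function name and argument layout are mine, not the paper’s): FLUSH drops the forget term so the cell is rebuilt from new input only, UPDATE is the usual LSTM update, and COPY passes the previous state through untouched.

```python
def hm_lstm_cell_update(c_prev, f, i, g, z_prev, z_below):
    """Pick the HM-LSTM cell operation at one layer and timestep.

    c_prev:  previous cell state
    f, i, g: forget gate, input gate, candidate value
    z_prev:  boundary detected at THIS layer at t-1 (1 -> FLUSH)
    z_below: boundary detected at the layer BELOW at t (1 -> UPDATE)
    """
    if z_prev == 1:
        # FLUSH: segment ended; re-initialize the cell from new input only.
        return i * g
    if z_below == 1:
        # UPDATE: sparse LSTM-style update triggered by the lower layer.
        return f * c_prev + i * g
    # COPY: no boundary anywhere; carry the previous cell state through.
    return c_prev
```

Note that COPY makes the higher layer’s computation a no-op for that timestep, which is where the claimed efficiency of discrete boundaries comes from.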

Discrete (binary) decisions are difficult to optimize due to non-smooth gradients. Uses the straight-through estimator (as an alternative to REINFORCE) to learn discrete variables. The simplest variant uses a step function on the forward pass and a hard sigmoid on the backward pass for gradient estimation.
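
A minimal sketch of that simplest variant, with manually split forward/backward functions (in an autograd framework these would live in one custom op; the names here are mine). Forward thresholds the pre-activation to a hard 0/1; backward pretends the forward op was the hard sigmoid max(0, min(1, (a·x + 1)/2)) and uses its derivative as the surrogate gradient.

```python
import numpy as np

def binarize_forward(a):
    # Forward: hard threshold, z = 1 if a > 0 else 0
    # (equivalent to rounding the hard sigmoid at 0.5).
    return (a > 0.0).astype(np.float64)

def binarize_backward(a, grad_out, slope=1.0):
    # Backward (straight-through): use the hard sigmoid's derivative,
    # which is slope/2 inside (-1/slope, 1/slope) and 0 outside.
    inside = (np.abs(a) < 1.0 / slope).astype(np.float64)
    return grad_out * inside * (slope / 2.0)
```

The estimator is biased (the backward pass differentiates a function that was never applied on the forward pass), which is what the slope annealing trick below tries to compensate for.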

The slope annealing trick on the hard sigmoid compensates for the biased estimator, but the experimental results show only minimal improvement, and it introduces more hyperparameters.
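
The idea of the trick, sketched below under assumed values: the hard sigmoid’s slope starts at 1 and grows each epoch, so the surrogate function gradually approaches the step function actually used on the forward pass, shrinking the estimator’s bias over training. The linear rate (0.04) and cap (5) here are illustrative hyperparameters, not necessarily the paper’s exact schedule.

```python
def hard_sigmoid(x, slope=1.0):
    # Hard sigmoid with adjustable slope a: max(0, min(1, (a*x + 1) / 2)).
    return max(0.0, min(1.0, (slope * x + 1.0) / 2.0))

def annealed_slope(epoch, rate=0.04, max_slope=5.0):
    # Hypothetical linear schedule: increase the slope each epoch so the
    # hard sigmoid hardens toward a step function as training proceeds.
    return min(max_slope, 1.0 + rate * epoch)
```

Each new hyperparameter (rate, cap, schedule shape) needs tuning, which is the downside noted above.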

Implemented as a variant of the LSTM (HM-LSTM) with the custom operations above. No experimental results are reported for the plain-RNN variant (HM-RNN).


Learns useful boundary detectors, visualized in the paper.

Latent representations are possibly imperfect, or at least not human-like: spaces, tree breaks, some bigrams, some prefix delineation (“dur”: during, duration, durable).

Results are only reported on character-level compression tasks and handwriting; no explicit NLP tasks such as machine translation, question answering, or named entity recognition.


Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.
