UPDATE: similar to the standard LSTM update, but performed sparsely, gated by the boundary detector.
COPY: copies the cell and hidden states from the previous timestep to the current timestep. Similar to Zoneout (a recurrent generalization of stochastic depth), which uses a Bernoulli distribution to decide whether to copy each hidden unit across timesteps (sketched after this list).
FLUSH: sends a summary to the next layer and re-initializes the current layer's state. (A sketch of all three operations follows.)
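The three operations reduce to a gated selection over the cell and hidden states. A minimal PyTorch-style sketch, assuming the gates have already been computed; the names (`z_prev` for this layer's boundary indicator at t-1, `z_below` for the lower layer's at t) are mine, not the authors':

```python
import torch

def hm_lstm_state_transition(c_prev, h_prev, f, i, o, g, z_prev, z_below):
    """Select UPDATE / COPY / FLUSH for one HM-LSTM layer at one timestep.

    Illustrative sketch, not the authors' code. Assumed inputs:
      f, i, o, g -- forget/input/output gates and candidate cell (precomputed)
      z_prev     -- this layer's binary boundary indicator from t-1
      z_below    -- the lower layer's binary boundary indicator at t
    """
    update_c = f * c_prev + i * g   # UPDATE: standard LSTM cell update
    flush_c = i * g                 # FLUSH: discard the old cell state
    # z_prev = 1 -> FLUSH; z_prev = 0, z_below = 1 -> UPDATE; else COPY
    c = z_prev * flush_c + (1 - z_prev) * (
        z_below * update_c + (1 - z_below) * c_prev)
    # COPY (both indicators 0) carries the hidden state through unchanged
    copy = (1 - z_prev) * (1 - z_below)
    h = copy * h_prev + (1 - copy) * (o * torch.tanh(c))
    return c, h
```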
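For comparison, a minimal sketch of Zoneout's Bernoulli copying (again my naming, not the Zoneout authors' code):

```python
import torch

def zoneout(h_new, h_prev, p=0.1, training=True):
    """Zoneout sketch: each hidden unit independently keeps its previous
    value with probability p (Bernoulli), otherwise takes the new value."""
    if not training:
        # Test time: use the expectation of the stochastic mask
        return p * h_prev + (1 - p) * h_new
    mask = torch.bernoulli(torch.full_like(h_new, p))
    return mask * h_prev + (1 - mask) * h_new
```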
Discrete (binary) decisions are difficult to optimize because the gradients are non-smooth. The paper uses the straight-through estimator (as an alternative to REINFORCE) to learn the discrete boundary variables. The simplest variant uses a step function on the forward pass and a hard sigmoid on the backward pass for gradient estimation; a sketch follows.
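A hedged sketch of this estimator in PyTorch (helper names are mine): the detach trick keeps the forward value hard (0 or 1) while gradients flow through the hard-sigmoid surrogate.

```python
import torch

def hard_sigmoid(x, slope=1.0):
    # Piecewise-linear sigmoid surrogate: clamp((slope * x + 1) / 2, 0, 1)
    return torch.clamp((slope * x + 1.0) / 2.0, 0.0, 1.0)

def st_boundary(x, slope=1.0):
    """Straight-through estimator: binary step forward, hard sigmoid backward."""
    soft = hard_sigmoid(x, slope)
    hard = (soft > 0.5).float()       # step function on the forward pass
    return soft + (hard - soft).detach()
```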
The slope annealing trick on the hard sigmoid compensates for the biased estimator, but the experimental results show only minimal improvement, and it introduces more hyperparameters.
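For completeness, one way the slope might be annealed over training, reusing `st_boundary` from the sketch above (the schedule and constants here are assumptions, not the paper's exact values — which is part of the hyperparameter complaint):

```python
import torch

pre_activation = torch.randn(8, requires_grad=True)
for epoch in range(100):
    slope = min(5.0, 1.0 + 0.04 * epoch)  # sharpen the surrogate over time
    z = st_boundary(pre_activation, slope=slope)
```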
Implemented as a variant of the LSTM (HM-LSTM) with the custom operations above; no experimental results are reported for a variant built on a regular RNN (HM-RNN).
Learns useful boundary detectors, visualized in the paper.
Latent representations are possibly imperfect, or at least not human-like: boundaries fall at spaces, tree breaks, some bigrams, and some prefix delineations ("dur": during, duration, durable).
Results are limited to character-level compression tasks and handwriting; there are no explicit NLP tasks, e.g., machine translation, question answering, or named entity recognition.
Thanks to those who attended the reading group session for their discussion of this paper! Lots of good insights from everyone.