Hierarchical Variational Autoencoders

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}
\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}
\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}
\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \| #2 \right)}
\newcommand{\muvec}{\boldsymbol \mu}
\newcommand{\sigmavec}{\boldsymbol \sigma}
\newcommand{\uttid}{s}
\newcommand{\lspeakervec}{\vec{w}}
\newcommand{\lframevec}{\vec{z}}
\newcommand{\lframevect}{\lframevec_t}
\newcommand{\inframevec}{\vec{x}}
\newcommand{\inframevect}{\inframevec_t}
\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}
\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}
\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}
\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}
\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}
\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}
\newcommand{\hidden}[1]{\vec{h}^{(#1)}}
\newcommand{\pool}{\max}
\newcommand{\hpooled}{\hidden{\pool}}
\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}
\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$
I’ve decided to approach the inpainting problem given for our IFT6266 class project using a hierarchical variational autoencoder.

While the basic VAE has only a single latent variable, this architecture assumes the image generation process proceeds through a hierarchy of latent variables, each dependent on its parent. So the factorisation of the generative model looks like this:

$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)\,p(z_1|z_2) \cdots p(z_{L-1}|z_L)\,p(z_L)$$

An architecture like this was used in the PixelVAE paper, but there they use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model (the encoder) is not hierarchical; the $q_\phi$ network is structured in the following way:
$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
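To keep myself honest about what these two factorisations mean structurally, here is a minimal numpy sketch of how sampling flows through them. The names `prior_nets`, `decoder` and `encoders` are hypothetical callables standing in for the actual networks, each returning Gaussian parameters; this is a sketch of the structure, not my implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mu, log_var):
    """Reparameterised sample from N(mu, diag(exp(log_var)))."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

# Generative model: z_L ~ p(z_L), then z_l ~ p(z_l | z_{l+1}) down the
# hierarchy, and finally x ~ p(x | z_1).
def generate(prior_nets, decoder, dim_top):
    z = rng.standard_normal(dim_top)      # top-level prior N(0, I)
    for net in prior_nets:                # ordered z_{L-1}, ..., z_1
        mu, log_var = net(z)              # Gaussian params of p(z_l | z_{l+1})
        z = sample_gaussian(mu, log_var)
    return decoder(z)                     # parameters of p(x | z_1)

# Recognition model: every q(z_l | x) looks at x directly, so inference is a
# flat loop over per-level encoder heads rather than a chain.
def infer(encoders, x):
    return [sample_gaussian(*enc(x)) for enc in encoders]
```

The point of the sketch is just that generation is a chain through the levels, while inference is a flat loop where every level conditions on $x$ alone.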

The paper has a nice diagram illustrating what the structure of the model looks like.

One interesting observation about the model is that the $p_\theta(z_l|z_{l+1})$ terms are never trained directly with signals from the reconstruction of $x$. Instead, the model first learns to produce a good reconstruction (increasing $\expected{q(z_1|x)}{\log p(x|z_1)}$), which tends to increase $\Dkl{q(z_1|x)}{p(z_1|z_2)}$. In minimising $\Dkl{q(z_1|x)}{p(z_1|z_2)}$, it then has to learn a good $q(z_2|x)$, which in turn increases $\Dkl{q(z_2|x)}{p(z_2|z_3)}$. This process propagates slowly up to the top-level latent variable.
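Writing it out for my own benefit, with the factorisations above the lower bound being maximised should decompose into one reconstruction term plus one KL term per level (my derivation, not a quote from the paper):

$$\mathcal{L}(x) = \expected{q(z_1|x)}{\log p(x|z_1)} - \sum_{l=1}^{L-1} \expected{q(z_{l+1}|x)}{\Dkl{q(z_l|x)}{p(z_l|z_{l+1})}} - \Dkl{q(z_L|x)}{p(z_L)}$$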

One change I’ve made to my formulation of the model for the inpainting problem is to turn it into a “conditional” VAE. I split $x$ into $x_\text{outer}$ and $x_\text{inner}$. The prior over the top-level latent variable is then conditioned on $x_\text{outer}$, giving $p(z_L|x_\text{outer})$. This gives a model that requires the outer border during the generation process.
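Concretely, the conditional model I have in mind (my notation) factorises as:

$$p(x_\text{inner}, z_1, \dots, z_L \mid x_\text{outer}) = p(x_\text{inner}|z_1)\, p(z_1|z_2) \cdots p(z_{L-1}|z_L)\, p(z_L|x_\text{outer})$$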

FizzBuzz in Theano, or, Backpropaganda Through Time.


Deciding When To Feedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems”, was accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.

The idea behind this one came from the intuition that, in a classification task, some instances should be simpler to classify than others. The related problem of deciding when to stop in an RNN setting is also an important one. Take the bAbI task, for example, and go a step further by assuming the number of logical steps needed to arrive at the answer is not provided for you; then you need to know when the network is ‘ready’ to give an answer.

Constraining Hidden Layers for Interpretability (eventually, hopefully…)

I haven’t written much this past year, so I guess as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the stuff’s in the paper, plus I’m still kind of unsure about the legality of putting stuff that’s in the paper on this blog post, so I think I’ll talk about the other things that didn’t make it in.

Learning to Transduce with Unbounded Memory – The Neural Stack

DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement a stack, a queue and a deque “data structure” within their models. While this idea is not necessarily new, it incorporates some of the broad ideas seen in the Neural Turing Machines, where the aim is a model that is end-to-end differentiable, rather than one with the data structure decoupled from the training process. I have to admit I haven’t read any of these previous papers before, but they’re definitely on my to-read list.

In any case, this paper claims that using these memory structures beats having an 8-layered LSTM network trained for the same task. If this is true, this may mean we finally have some justification for these fancier models — simply throwing bigger networks at problems just isn’t as efficient.

I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.

Generating Singlish with LSTMs

So in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that actually kind of mean something sometimes. Even Geoffrey Hinton has some fun with it in this talk.

So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.

The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.

Long Short-Term Memory

There seems to have been a resurgence in using these units in the past year. They were first proposed in 1997 by Hochreiter and Schmidhuber but, along with most neural network literature, seemed to have been forgotten for a while, until work on neural networks made a comeback and focus started shifting toward RNNs again. Some of the more interesting recent work using LSTMs has come from Schmidhuber’s student Alex Graves. Notice the spike here in 2009, when Graves first wrote about cursive handwriting recognition (and generation) using LSTMs.

Neural Turing Machines FAQ

There’s been some interest in the Neural Turing Machines paper, and I’ve been getting some questions regarding my implementation via e-mail and the comments section on this blog. I plan to make this a blog post that I’ll regularly come back to and update with answers to some of these questions as they come up, so do check back!

Learning Gaussian Feature Extractors

While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the inputs to the hidden layer. These can be thought of as feature extractors of the input.
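In case it’s useful, this is roughly the kind of plot I mean; a minimal sketch assuming the input-to-hidden weights have been dumped to a hypothetical `hidden_weights.npy` with shape 784 × n_hidden:

```python
import numpy as np
import matplotlib.pyplot as plt

# W holds the input-to-hidden weights, shape (784, n_hidden); each column is
# one hidden unit's "feature extractor", reshaped back into a 28x28 image.
W = np.load("hidden_weights.npy")   # hypothetical dump of the trained weights

fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(W[:, i].reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.tight_layout()
plt.show()
```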


Neural Turing Machines – Copy Task

After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.

I did find, though, that training it on shorter sequences first, before letting it see longer ones, avoids the huge gradients that would make the parameters explode into NaNs. And that is a huge help. Doing that still does not guarantee convergence though, and I only get a good model at random, like this one I’ve trained here copying a sequence of length 10:
[Animation: the NTM copying a sequence of length 10]
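The length schedule itself is nothing fancy; a rough sketch of the idea (the step count, growth rate and sequence width here are placeholders, not the values I actually used):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, max_len, width = 20_000, 20, 8

for step in range(n_steps):
    # Cap the sequence length early in training and let it grow gradually,
    # so the earliest updates never see the long sequences that blow up.
    cap = min(max_len, 2 + step // 1000)
    length = int(rng.integers(1, cap + 1))
    seq = rng.integers(0, 2, size=(length, width))  # random binary copy target
    # ... feed `seq` to the NTM and take a training step here ...
```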

 
