## Hierarchical Variational Autoencoders

$$\newcommand{\expected}[2]{\mathbb{E}_{#1}\left[ #2 \right]}

\newcommand{\prob}[3]{{#1}_{#2} \left( #3 \right)}

\newcommand{\condprob}[4]{{#1}_{#2} \left( #3 \middle| #4 \right)}

\newcommand{\Dkl}[2]{D_{\mathrm{KL}}\left( #1 \middle\| #2 \right)}

\newcommand{\muvec}{\boldsymbol \mu}

\newcommand{\sigmavec}{\boldsymbol \sigma}

\newcommand{\uttid}{s}

\newcommand{\lspeakervec}{\vec{w}}

\newcommand{\lframevec}{\vec{z}}

\newcommand{\lframevect}{\lframevec_t}

\newcommand{\inframevec}{\vec{x}}

\newcommand{\inframevect}{\inframevec_t}

\newcommand{\inframeset}{\inframevec_1,\hdots,\inframevec_T}

\newcommand{\lframeset}{\lframevec_1,\hdots,\lframevec_T}

\newcommand{\model}[2]{\condprob{#1}{#2}{\lspeakervec,\lframeset}{\inframeset}}

\newcommand{\joint}{\prob{p}{}{\lspeakervec,\lframeset,\inframeset}}

\newcommand{\normalparams}[2]{\mathcal{N}(#1,#2)}

\newcommand{\normal}{\normalparams{\mathbf{0}}{\mathbf{I}}}

\newcommand{\hidden}[1]{\vec{h}^{(#1)}}

\newcommand{\pool}{\max}

\newcommand{\hpooled}{\hidden{\pool}}

\newcommand{\Weight}[1]{\mathbf{W}^{(#1)}}

\newcommand{\Bias}[1]{\vec{b}^{(#1)}}$$

I’ve decided to tackle the inpainting problem set for our IFT6266 class project using a hierarchical variational autoencoder.

While the basic VAE has only a single latent variable, this architecture assumes that images are generated from a hierarchy of latent variables $z_1, \dots, z_L$, each dependent only on the level above it. The joint distribution therefore factorises as:

$$p(x, z_1, z_2,\dots,z_L) = p(x|z_1)p(z_1|z_2) \dots p(z_{L-1}|z_L)p(z_L)$$
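To make the top-down structure concrete, here is a minimal NumPy sketch of ancestral sampling through such a hierarchy. It is only an illustration, not my actual model: the helpers `make_layer` and `ancestral_sample`, the layer sizes, the random affine maps standing in for trained networks, the standard normal prior on the top latent $z_L$, and the Gaussian form of every conditional (including $p(x|z_1)$) are all assumptions made for the example.

```python
import numpy as np

def make_layer(rng, in_dim, out_dim):
    """Random affine map standing in for a trained conditional network (hypothetical)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def ancestral_sample(rng, layers, top_dim, n_samples=1):
    """Sample z_L ~ N(0, I), then each lower level given the one above, ending at x."""
    z = rng.standard_normal((n_samples, top_dim))  # assumed prior on the top latent z_L
    for (w_mu, b_mu), (w_ls, b_ls) in layers:
        mu = np.tanh(z @ w_mu + b_mu)      # mean of the conditional for the next level down
        log_sigma = z @ w_ls + b_ls        # log standard deviation of that conditional
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
    return z  # the last conditional plays the role of p(x | z_1), assumed Gaussian here

rng = np.random.default_rng(0)
dims = [8, 16, 32, 64]  # z_3 -> z_2 -> z_1 -> x (flattened); sizes are arbitrary
layers = [(make_layer(rng, a, b), make_layer(rng, a, b))
          for a, b in zip(dims[:-1], dims[1:])]
x = ancestral_sample(rng, layers, top_dim=dims[0], n_samples=4)
print(x.shape)  # (4, 64)
```

Each pass through the loop corresponds to one conditional factor in the factorisation, read from right to left.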

An architecture like this was used in the PixelVAE paper, but there the authors use a more complex PixelCNN structure at each layer, which I am attempting to do without. In their model, the recognition model (the encoder) is not hierarchical: each latent variable is inferred directly from $x$, so the $q_\phi$ network factorises as follows:

$$q(z_1, z_2,\dots,z_L | x) = q(z_1|x)q(z_2|x) \dots q(z_L|x)$$
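A factorised posterior of this kind can be sketched as one independent head per latent level, each reading $x$ directly. The NumPy snippet below is a rough illustration under my own assumptions, not the PixelVAE implementation: the names `encode` and `gaussian_log_prob`, the linear heads, the diagonal Gaussian form of each $q(z_l|x)$, and all dimensions are hypothetical.

```python
import numpy as np

def gaussian_log_prob(z, mu, log_sigma):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(
        -0.5 * np.log(2 * np.pi) - log_sigma
        - 0.5 * ((z - mu) / np.exp(log_sigma)) ** 2,
        axis=-1,
    )

def encode(rng, x, heads):
    """Sample every z_l directly from x; return the samples and log q(z_1..z_L | x)."""
    samples, log_q = [], 0.0
    for w_mu, w_ls in heads:
        mu, log_sigma = x @ w_mu, x @ w_ls  # each head sees only x, never another z
        z = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)  # reparameterisation
        samples.append(z)
        log_q = log_q + gaussian_log_prob(z, mu, log_sigma)  # product of factors -> sum of logs
    return samples, log_q

rng = np.random.default_rng(0)
x_dim, latent_dims = 64, [32, 16, 8]  # sizes of x, z_1, z_2, z_3 (arbitrary)
heads = [(rng.normal(scale=0.1, size=(x_dim, d)),
          rng.normal(scale=0.1, size=(x_dim, d))) for d in latent_dims]
x = rng.standard_normal((4, x_dim))
zs, log_q = encode(rng, x, heads)
print([z.shape for z in zs], log_q.shape)  # [(4, 32), (4, 16), (4, 8)] (4,)
```

Because no head depends on another level's sample, the log density of the joint posterior is just the sum of the per-level log densities, which mirrors the product form of the equation.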