On “Better Exploiting Latent Variables in Text Modeling”
I’ve been working on latent variable language models for some time, and intend to make them the topic of my PhD. So when Google Scholar recommended “Better Exploiting Latent Variables in Text Modeling”, I was naturally excited to see that this line of work has continued beyond Bowman et al.’s paper on VAE language models. Of course, there have been multiple improvements on the original model since then. More recently, Yoon Kim from Harvard has been publishing particularly interesting papers on this topic. $$\newcommand{\softmax}{\mathrm{softmax}} \newcommand{\M}{\mathbf{M}} \newcommand{\h}{\mathbf{h}} \renewcommand{\a}{\mathbf{a}} \renewcommand{\b}{\mathbf{b}} \newcommand{\x}{\mathbf{x}} \newcommand{\z}{\mathbf{z}} \newcommand{\E}{\mathbb{E}} \newcommand{\hdots}{\dots} $$
However, I think there are several issues with how the method in this particular paper is evaluated. In this post, I provide two ways of looking at the bias in the loss and evaluation method used in the paper, and show how the reported log probabilities may be overestimated.
Averaging hidden states and averaging log probabilities
When we average log probabilities in the likelihood part of the loss, the loss at any given timestep, with correct label $i$, is: $$\begin{align*} \frac{1}{L} \sum_{l=1}^L \log \frac{\exp(x^{(l)}_i)}{\sum_{j=1}^N \exp(x^{(l)}_j)} &= \frac{1}{L} \sum_{l=1}^L x^{(l)}_i - \underbrace{\frac{1}{L} \sum_{l=1}^L \log \sum_{j=1}^N \exp(x^{(l)}_j)}_{\text{(1)}} \end{align*}$$ When we average the hidden states instead, then because the final output transformation is linear, we are effectively averaging the logits $x^{(l)}$: $$\begin{align*} \log \frac{\exp\left( \frac{1}{L} \sum_{l=1}^L x^{(l)}_i\right)}{\sum_{j=1}^N \exp\left( \frac{1}{L} \sum_{l=1}^L x^{(l)}_j\right)} &= \frac{1}{L} \sum_{l=1}^L x^{(l)}_i - \underbrace{\log \sum_{j=1}^N \exp\left( \frac{1}{L} \sum_{l=1}^L x^{(l)}_j\right)}_{\text{(2)}} \end{align*}$$
The two quantities are identical except for terms (1) and (2).
Observe that the only difference is where the averaging happens: outside the log-sum-exp in (1), and inside it in (2). Since log-sum-exp is convex, Jensen's inequality gives $(1) \geq (2)$, which in turn means that the second quantity is greater than or equal to the first.
So the loss used in the paper is greater than or equal to the ELBO, which means we no longer have any lower-bound guarantee for this new loss.
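To make this concrete, here is a quick numerical check (my own sketch, not code from the paper): for randomly drawn logits, term (1) is at least term (2), so the log probability computed from averaged hidden states can only overstate the average of the per-sample log probabilities.

```python
# Numerical sanity check of the Jensen's inequality argument above.
# All names here are my own; this is not code from the paper.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
L, N = 8, 1000                      # number of latent samples, vocabulary size
i = 42                              # index of the "correct" token

logits = rng.normal(size=(L, N))    # x^{(l)}: one row of logits per latent sample

term1 = logsumexp(logits, axis=1).mean()   # (1): average of the log-sum-exps
term2 = logsumexp(logits.mean(axis=0))     # (2): log-sum-exp of the averaged logits

avg_log_prob = logits[:, i].mean() - term1        # average of per-sample log-probs
log_prob_of_avg = logits[:, i].mean() - term2     # log-prob from averaged logits

assert term1 >= term2                      # convexity of log-sum-exp
assert log_prob_of_avg >= avg_log_prob     # averaged hidden states look better
print(f"gap (1) - (2): {term1 - term2:.4f}")
```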
Using multiple samples in a VAE decoder
We can also interpret the use of multiple samples as a model that is conditioned on multiple random variables, $p(\x |\z^{(1)},\hdots, \z^{(L)})$.
However, if we work out the ELBO,
$$
\begin{align}
\log p(\x) &= \log \int p(\z^{(1)})\dots \int p(\z^{(L)}) p(\x|\z^{(1)},\hdots, \z^{(L)})~\mathrm{d}\z^{(1)}\dots \mathrm{d}\z^{(L)} \\
&= \log \E_{p(\z^{(1)},\hdots,\z^{(L)})}\left[p(\x|\z^{(1)},\hdots, \z^{(L)})\right] \\
&= \log \E_{q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[ \frac{p(\x|\z^{(1)},\hdots, \z^{(L)})p(\z^{(1)},\hdots,\z^{(L)}) }{ q(\z^{(1)},\hdots,\z^{(L)}|\x) }\right] \\
&\geq \E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log \frac{p(\x|\z^{(1)},\hdots, \z^{(L)})p(\z^{(1)},\hdots,\z^{(L)}) }{ q(\z^{(1)},\hdots,\z^{(L)}|\x) }\right]
\end{align}
$$
where we know from the setup that: (a) the prior factorises, $p(\z^{(1)},\hdots,\z^{(L)}) = \prod^L_{i=1} p(\z^{(i)})$, and (b) $p(\z^{(1)} = \z') = p(\z^{(2)} = \z') = \dots = p(\z^{(L)} = \z')$ for any $\z'$, i.e., the marginals are the same distribution. The same holds for $q(\z^{(1)},\hdots,\z^{(L)} | \x)$ (factorised, and identical when conditioned on $\x$).
$$
\begin{align*}
&\E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log \frac{p(\x|\z^{(1)},\hdots, \z^{(L)})p(\z^{(1)},\hdots,\z^{(L)}) }{ q(\z^{(1)},\hdots,\z^{(L)}|\x) }\right] \\
&=\E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)}) \frac{\prod^L_{i=1} p(\z^{(i)}) }{ \prod^L_{i=1} q(\z^{(i)}|\x) }\right] \\
&=\E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)}) + \sum^L_{i=1} \left(\log p(\z^{(i)}) - \log q(\z^{(i)}|\x)\right) \right]
\end{align*}
$$
by linearity of expectation,
$$
\begin{align*}
&=\E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] + \sum^L_{i=1} \E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\z^{(i)}) - \log q(\z^{(i)}|\x) \right]
\end{align*}
$$
and, dropping the expectation over the random variables that each term does not depend on,
$$
\begin{align*}
&=\E_{ q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] - \sum^L_{i=1} \underbrace{\E_{ q(\z^{(i)}|\x)}\left[ \log q(\z^{(i)}|\x) - \log p(\z^{(i)}) \right]}_{D_\mathrm{KL} (q(\z^{(i)}|\x) || p(\z^{(i)}))}
\end{align*}
$$
The resulting ELBO, when simplified, is then:
$$
\begin{align}
&\E_{q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] - \sum^L_{l=1} D_{\mathrm{KL}}(q(\z^{(l)}|\x) || p(\z^{(l)})) \\
&= \E_{q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] - L \cdot D_{\mathrm{KL}}(q(\z|\x) || p(\z))
\end{align}
$$
where we drop the superscript in the last line because the $\z^{(l)}$ are identically distributed. We then see that the KL term appears $L$ times.
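As a concrete instance of the objective derived above, here is a minimal PyTorch sketch (my own, not the paper's code), assuming a diagonal Gaussian $q(\z|\x)$ parameterised by `mu` and `logvar`, and a hypothetical `decoder_nll(x, z_samples)` that returns $-\log p(\x|\z^{(1)},\hdots,\z^{(L)})$. Since the KL term is identical for every sample, the valid bound charges it $L$ times:

```python
# Sketch of the L-sample objective derived above; names and interfaces are my own.
import torch

def multi_sample_elbo(x, mu, logvar, decoder_nll, L=5):
    """Estimate the L-sample ELBO derived above: recon - L * KL."""
    std = torch.exp(0.5 * logvar)
    # L samples z^{(l)} ~ q(z|x) via the reparameterisation trick, shape (L, batch, dim)
    eps = torch.randn((L,) + tuple(mu.shape), device=mu.device)
    z_samples = mu.unsqueeze(0) + std.unsqueeze(0) * eps

    # log p(x | z^{(1)}, ..., z^{(L)}); `decoder_nll` is assumed to return the
    # negative log-likelihood of x given all L samples (e.g. by averaging hidden states).
    recon = -decoder_nll(x, z_samples)

    # KL(q(z|x) || N(0, I)) in closed form; it is identical for every sample.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

    elbo = recon - L * kl      # the valid lower bound derived above
    single_kl = recon - kl     # KL counted once; not a guaranteed bound
    return elbo, single_kl
```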
While we know that the ELBO above is a definite lower bound on $\log p(\x)$, the quantity in which the KL term appears only once may no longer be one. More formally, there may be a case where
$$
\begin{align*}
&\E_{q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] - D_{\mathrm{KL}}(q(\z|\x) || p(\z)) \\
&\qquad > \log p(\x) \\
&\qquad \qquad \geq \E_{q(\z^{(1)},\hdots,\z^{(L)}|\x)}\left[\log p(\x|\z^{(1)},\hdots, \z^{(L)})\right] - L \cdot D_{\mathrm{KL}}(q(\z|\x) || p(\z))
\end{align*}
$$
In such a setting, the single-KL quantity cannot be reported as a lower bound on the log probability.
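To see that this can actually happen, here is a toy, fully Gaussian construction (my own, not the model from the paper) in which $\log p(\x)$ is available in closed form: $p(\z) = \mathcal{N}(0, 1)$, the decoder conditions on the mean of the $L$ samples, $p(\x|\z^{(1)},\hdots, \z^{(L)}) = \mathcal{N}(\x; \bar{\z}, \sigma^2)$, and $q(\z|\x) = \mathcal{N}(\mu, s^2)$. With the hand-picked numbers below, the single-KL quantity overshoots $\log p(\x)$, while the ELBO with $L \cdot \mathrm{KL}$ stays below it, as it must.

```python
# Toy Gaussian example (my own construction) where log p(x) is exact, so the
# inequality above can be checked numerically.
import numpy as np

def log_normal(x, mean, var):
    # log density of N(mean, var) evaluated at x
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

L, sigma2 = 10, 0.1      # number of latent samples, decoder variance
x = 1.0
mu, s2 = 0.9, 0.1        # parameters of q(z|x), chosen by hand

# True marginal: mean(z) ~ N(0, 1/L) under the prior, so x ~ N(0, 1/L + sigma2).
log_px = log_normal(x, 0.0, 1.0 / L + sigma2)

# E_q[log p(x | z^{(1..L)})]: under q, mean(z) ~ N(mu, s2/L), so the expected
# Gaussian log-likelihood has the closed form below.
recon = -0.5 * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2 + s2 / L) / (2 * sigma2)

# KL(N(mu, s2) || N(0, 1)) in closed form.
kl = 0.5 * (s2 + mu ** 2 - 1.0 - np.log(s2))

print(f"single-KL quantity: {recon - kl:8.3f}")       # ~ -0.97
print(f"log p(x)          : {log_px:8.3f}")           # ~ -2.61
print(f"ELBO with L * KL  : {recon - L * kl:8.3f}")   # ~ -10.93
assert recon - kl > log_px >= recon - L * kl
```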