## Computing Log Normal for Isotropic Gaussians

Consider a matrix $\mathbf{X}$ with rows of datapoints $\mathbf{x_i}$ which are $(n, d)$. The matrix $\mathbf{M}$ is made up of the $\boldsymbol{\mu}_j$ of $k$ different Gaussian components. The task is to compute the log probability of each of these $k$ components for all $n$ data points.

```
import theano
import theano.tensor as T
import numpy as np
import time
X = T.matrix('X')
M = T.matrix('M')
```

As part of the computation, the squared distance from the mean is calculated from each point. The straightforward way this is implemented is to add an axis on both matrices and use broadcasting to compute the pairwise, element-wise difference.

```
slow = T.sum(T.sqr(X[:, None, :] - M[None, :, :]), axis=-1)
distance = theano.function(inputs=[X, M], outputs=slow)
start_time = time.time()
for _ in xrange(2000):
distance(np.random.randn(32, 100).astype(np.float32),
np.random.randn(128, 100).astype(np.float32))
print time.time() - start_time
```

There is an intermediate step here while doing this which produces a $(n, k, d)$ dimensional tensor. We may be able to sidestep this, however, if we use make use of the knowledge that GPUs (and code written for GPUs) tend to optimise matrix operations. For example, a matrix multiply, naively computed, also involves summing over an axis. This computation is usually implemented such that they are in place, skipping the need to allocate the 3-dimensional intermediate tensor.

Notice that what we are computing is:

$$\begin{align*} D_{ij} &= \| \mathbf{x}_i – \boldsymbol{\mu}_j \|\\ &= \mathbf{x}_i \cdot\mathbf{x}_i – 2 \mathbf{x}_i \cdot \boldsymbol{\mu}_j + \boldsymbol{\mu}_j \cdot \boldsymbol{\mu}_j\end{align*}$$

The middle term can be computed in bulk using matrices $\mathbf{X}$ and $\mathbf{M}$:

$$\mathbf{X} \mathbf{M}^{\top},$$

while the first and last terms are reductions along the $d$ axis.

```
fast = T.sum(T.sqr(X), axis=-1)[:, None] + T.sum(T.sqr(M), axis=-1)[None, :] - 2 * T.dot(X, M.T)
distance = theano.function(inputs=[X, M], outputs=fast)
start_time = time.time()
for _ in xrange(2000):
distance(np.random.randn(32, 100).astype(np.float32),
np.random.randn(128, 100).astype(np.float32))
print time.time() - start_time
```

As a final sanity check,

```
distance_check = theano.function(inputs=[X, M], outputs=[fast, slow])
for _ in xrange(200):
output_1, output_2 = distance_check(
np.random.randn(32, 100).astype(np.float32),
np.random.randn(128, 100).astype(np.float32)
)
assert(np.allclose(output_1, output_2))
print "All passed!"
```

GPUs are optimised to do matrix multiplications, so my general rule of thumb is to substitute them wherever possible. This is a case which I suppose may have been obvious for many but I’ve completely overlooked in past implementations, so hopefully it’s useful for someone else out there!

I used Theano’s `optimizer=fast_compile`

option in order to demonstrate effects that *should* be present in PyTorch as well (we’ve tried this on a project and it does seem to be faster). Also, the runtimes are reported on a GeForce GTX 750 Ti.