## Deciding <u>W</u>hen <u>T</u>o <u>F</u>eedforward (or WTF gates)

Another paper of mine, titled “Towards Implicit Complexity Control using Variable-Depth DNNs for ASR Systems” got accepted to the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016 in Shanghai, which happened not too long ago.

The idea behind this one was the intuition that in a classification task, some instances *should* be simpler than others to classify. Similarly, the problem of deciding when to stop in an RNN setting is also an important one. If we take the bAbI task for example, and go an extra step and assume the number of logical steps to arrive at the answer is not provided for you, then you need to know when the network is ‘ready’ to give an answer.

## Constraining Hidden Layers for Interpretability (eventually, hopefully…)

I haven’t written much this past year, so I guess as a parting post for 2015, I’d talk a little bit about the poster I presented at ASRU 2015. The bulk of the stuff’s in the paper, plus I’m still kind of unsure about the legality about putting stuff that’s *in* the paper on this blog post, so I think I’ll talk about the other things that didn’t make it in.

## Learning to Transduce with Unbounded Memory – The Neural Stack

DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement a stack, queue and a deque “data structure” within their models. While this idea is not necessarily new, it incorporates some of the broad ideas seen in the Neural Turing Machines, where they try to have a model that is end-to-end differentiable, rather than have the data structure decoupled from the training process. I have to admit I haven’t read any of these previous papers before, but it’s definitely on my to read list.

In any case, this paper claims that using these memory structures beats having an 8-layered LSTM network trained for the same task. If this is true, this may mean we finally have some justification for these fancier models — simply throwing bigger networks at problems just isn’t as efficient.

I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.

## Generating Singlish with LSTMs

So in in the last week, Andrej Karpathy wrote a post about the current state of RNNs, and proceeded to dump a whole bunch of different kinds of text data into them to see what they learn. Training language models and then sampling from them is lots of fun, and a character-level model is extra interesting because you see it come up with new words that actually kind of mean something sometimes. Even Geoffrey Hinton has some fun with it in this talk.

So after reading through Karpathy’s code, I got a few tips about how to do this properly, as I haven’t been able to train a proper language model before.

The data is obtained from one of Singapore’s most active sub-forums on HardwareZone.com.sg, Eat-Drink-Man-Woman. I like to think it’s a… localised 4chan. So know what to expect going in. If you want to just see what the model generates, go here.

## Long Short-Term Memory

There seems to be a resurgence in using these units in the past year. They were first proposed in 1997 by Hochreiter and Schmidhuber, but, along with most neural network literature seemed to have been forgotten for a while, until work on neural networks made a comeback, and focus started shifting toward RNNs again. Some of the more interesting recent work using LSTMs have come from Schmidhuber’s student Alex Graves. Notice the spike here in 2009 when Graves first wrote about cursive handwriting recognition (and generation) using LSTMs.

## Neural Turing Machines FAQ

There’s been some interest in Neural Turing Machines paper, and I’ve been getting some questions regarding my implementation via e-mail and the comments section on this blog. I plan to make this a blog post where I’ll regularly come back and update with answers to some of these questions as they come up, so do check back!

## Learning Gaussian Feature Extractors

While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the weights to the hidden layer. These can be thought of as feature extractors of the input.

## Neural Turing Machines – Copy Task

After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.

I did find though, that training it on shorter sequences first, before letting it see longer ones avoids huge gradients that would make the parameters explode into NaNs. And that is a huge help. Doing that still does not guarantee convergence though, and I only get a good model at random, like this one I’ve trained here copying a sequence of length 10:

## Neural Turing Machines – Implementation Hell

I’ve been struggling with the implementation of the NTM for the past week and a half now.

```
<p>
There are various problems that I’ve been trying to deal with. The paper is relatively sparse when it comes to details of the architecture, and a lot more brief when it comes to the training process. Alex Graves trains RNNs a lot in his work, and it seems to me some of the tricks he has used here might have been distributed through his previous work.
</p>
</div>
```