
Stanford Seminar - Can the brain do back-propagation? Geoffrey Hinton

Stanford Online
"Can the brain do back-propagation?" - Geoffrey Hinton of Google & University of Toronto About the talk: Deep learning has been very successful for a variety of difficult perceptual tasks. This suggests that the sensory pathways in the brain might also be using back-propagation to ensure that lower cortical areas compute features that are useful to higher cortical areas. Neuroscientists have not taken this possibility seriously because there are so many obvious objections: Neurons do not communicate real numbers; the output of a neuron cannot represent both a feature of the world and the derivative of a cost function with respect to the neuron's output; the feedback connections to lower cortical areas that are needed to communicate error derivatives do not have the same weights as the feedforward connections; the feedback connections do not even go to the neurons from which the feedforward connections originate; there is no obvious source of labelled data. I will describe joint work with Timothy Lillicrap on ways of overcoming these objections. Support for the Stanford Colloquium on Computer Systems Seminar Series provided by the Stanford Computer Forum. Speaker Abstract and Bio can be found here: http://ee380.stanford.edu/Abstracts/160427.html Colloquium on Computer Systems Seminar Series (EE380) presents the current research in design, implementation, analysis, and use of computer systems. Topics range from integrated circuits to operating systems and programming languages. It is free and open to the public, with new lectures each week. Learn more: http://bit.ly/WinYX5 0:00 Introduction 0:48 Online stochastic gradient descent 2:43 Four reasons why the brain cannot do backprop 5:20 Sources of supervision that allow backprop learning without a separate supervision signal 8:18 The wake-sleep algorithm (Hinton et. al. 1995) 12:15 New methods for unsupervised learning 13:39 Conclusion about supervision signals 14:03 Can neurons communicate real values? 16:16 Statistics and the brain 18:39 Big data versus big models 23:32 Dropout as a form of model averaging 24:53 Different kinds of noise in the hidden activities 28:38 How are the derivatives sent backwards? 30:18 A fundamental representational decision: temporal derivatives represent error derivatives 32:24 An early use of the idea that temporal derivatives encode error derivatives (Hinton & McClelland, 1988) 35:17 Combining STDP with reverse STDP 37:02 If this is what is happening, what should neuroscientists see? 39:22 What the two top-down passes achieve 40:11 A way to encode the top-level error derivatives 48:28 A consequence of using temporal derivatives to code error derivatives 48:40 The next problem 50:18 Now a miracle occurs 56:44 Why does feedback alignment work?
Hosts: Geoffrey Hinton
📅April 28, 2016
⏱️01:25:13
🌐English

Disclaimer: The transcript on this page is for the YouTube video titled "Stanford Seminar - Can the brain do back-propagation? Geoffrey Hinton" from "Stanford Online". All rights to the original content belong to their respective owners. This transcript is provided for educational, research, and informational purposes only. This website is not affiliated with or endorsed by the original content creators or platforms.

Watch the original video here: https://www.youtube.com/watch?v=VIRCybGgHts

00:00:11Geoffrey Hinton

So a lot of the work I'm going to talk about today was done jointly with Timothy Lillicrap. As I talk, my voice will get weirder and weirder, um, because I've got a small polyp growing on one vocal cord, and as I talk I'll start producing two notes at the same time.

00:00:28Geoffrey Hinton

Um, so back propagation, just to be sure you all know, takes an input vector. You go forwards through a multi-layer neural net with nonlinear units, you compare with the correct answer, you back propagate something backwards (derivatives), and then you adjust all the weights. And despite what people thought for a long time, it works great.

00:00:49Geoffrey Hinton

If you're interested in it for the brain, then you would do online stochastic learning. That is, you would take a training example, you go forwards, you go backwards, and you update the weights a little bit. And in statistical terms, you're getting an unbiased estimate of the full gradient: you get a noisy version of the full gradient from just one training case, but on average it's right. Of course, any single estimate can be a long way off, and you'd have thought that learning technique was crazy, but actually it works quite well as long as you don't learn too fast.
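
Here is a minimal sketch of that per-case procedure for a one-hidden-layer net. The layer sizes, learning rate, and squared-error loss are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def online_sgd_step(W1, W2, x, d, lr=0.01):
    """One training case: go forwards, go backwards, update the weights a little bit."""
    h = sigmoid(x @ W1)              # hidden activities
    y = sigmoid(h @ W2)              # outputs
    dE_dy = y - d                    # derivative of squared error w.r.t. the outputs
    dE_dx2 = dE_dy * y * (1 - y)     # through the output nonlinearity
    dE_dh = dE_dx2 @ W2.T            # send derivatives backwards through the weights
    dE_dx1 = dE_dh * h * (1 - h)     # through the hidden nonlinearity
    W2 -= lr * np.outer(h, dE_dx2)   # noisy single-case step; unbiased on average
    W1 -= lr * np.outer(x, dE_dx1)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, (784, 100))   # illustrative sizes
W2 = rng.normal(0, 0.01, (100, 10))
# for x, d in stream_of_training_cases: W1, W2 = online_sgd_step(W1, W2, x, d)
```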

00:01:19Geoffrey Hinton

So the question is, could the cortex be doing this? And if you talk to neuroscientists, or look at the things neuroscientists have said over the ages, until very recently they were all completely convinced that this was crazy. Most of them didn't understand what you meant, because they thought back propagation meant sending spikes backwards down the dendritic tree. And that is a kind of back propagation—it's a different form of back propagation, and you do need it for many of these algorithms—but they didn't understand that the idea of back propagation here is to send error derivatives from one cortical area to an earlier cortical area.

00:01:58Geoffrey Hinton

It's the right thing to do, and it seems to me it would be completely crazy for evolution not to have figured out a way of modifying early feature detectors so that they're useful for later feature detectors. We can all think of a dumb algorithm for doing that, which is you change them at random and see if it helps, but that's hopelessly inefficient. Back propagation is just that dumb algorithm—change them and see if it helps—except that it's more efficient by a factor of the number of connections. So if you've got a billion connections, it's a billion times more efficient than tinkering with the weights at random. And so you'd have thought evolution would have discovered it.

00:02:33Geoffrey Hinton

But neuroscientists actually have a bunch of good reasons why it's not possible, and I'm going to go over four of those reasons. I don't know whether the brain can do back-prop, but what I do know is the arguments neuroscientists use aren't very good.

00:02:47Geoffrey Hinton

So the first reason is there's no obvious source of supervision in back propagation. I'm talking for a feed-forward network, not a recurrent network here. For a feed-forward network, you go forwards, someone injects the right answer, you compare it with what you got, and you send something backwards. And there's nobody to inject the right answer in the brain. It's not like your mother has a little electrode into the middle of your brain, much as she would like that.

00:03:13Geoffrey Hinton

A second reason is that cortical neurons don't send real values to each other. And in back propagation, as it's normally used, you're communicating real values on the forward pass and real-valued derivatives backwards. So Francis Crick was very fond of this reason. He said back propagation is crazy: neurons don't communicate real values, so the brain can't be doing it.

00:03:32Geoffrey Hinton

The next argument, which I think is one of the best arguments for why it can't be done, is that to do it in the obvious way, neurons would need to send two different signals. When you're going forwards, a neuron sends a signal that says this feature is here, or this feature is here to this extent. So that's the activity of the neuron; that's what it represents. When you're going backwards, that same neuron needs to send a completely different signal, which is: how fast would the error change if I were to change the total input I receive from the layer below? That's a completely different quantity, and obviously the neuron can't be sending both quantities. I'll refer to the output of the neuron as y and the total input to the neuron as x. So on the forward pass it needs to send y, and on the backward pass it needs to send dE/dx.

00:04:19Geoffrey Hinton

And the last thing is, in back propagation you go forwards through the weights and then you go backwards through the same weights. So in matrix terms, you use the transpose of the forward matrix to send the error derivatives backwards. And there's lots of evidence in the brain that if you have two cortical areas and there's forward connections, there will be backwards connections. And if there's forward connections from one region of a cortical area, there will be backward connections to that region of the cortical area, but they're not point-to-point. So if a neuron here sends a forward connection there, it's not that this neuron sends a backward connection there. In fact, the backward connections go to different neurons. So that seemed like a major problem.

00:04:57Geoffrey Hinton

And what I'm going to do now is just go through these four arguments and show how—the main aim of this talk is to show—none of them are really insuperable obstacles. And when you combine that with the idea that we now know it works really well, it suddenly becomes plausible, I think, that the brain might be doing something that's back propagation, or something very close to back propagation.

00:05:19Geoffrey Hinton

So first, the source of supervision. People doing back propagation have worried about this problem for a long time. And in the 80s we thought that, well, one way to get a source of supervision is to do reconstruction. So you're trying to encode the data and then reconstruct the data, and you take the reconstruction error and back propagate it. So you don't need an extra supervision signal; you're just trying to reconstruct the data. That's what PCA is doing, and an autoencoder trained with back propagation is just a nonlinear generalization of PCA.

00:05:49Geoffrey Hinton

Another idea about how you might get an error signal is to extract local features and then compare what you extract from below with what the broader context says a feature detector ought to be doing. So the whole issue here is you have feature detectors in the intermediate layers and they need to figure out what they should be doing. And one thing to do is to say: a detector extracts some stuff from below, and it also gets a prediction from above, from the broader context, and it wants to make those two the same. It wants to make the prediction from the broader context agree with what it extracts from below.

00:06:23Geoffrey Hinton

And a nice little example of that in a sentence is one-trial learning. I give you a sentence like this one: "She scrummed him with a frying pan." You've never heard the word "scrummed" before, and yet in one trial you have a pretty good idea of what it means. I think it's sort of she bashed him with it somehow, right? Probably because of something sexist he said. And what's happening here is there's a certain amount of information in this character string itself: the "ed" tells you it's very likely a verb, the past tense of a verb. There may be some information in "scrum"—just that it doesn't sound good. But basically what's happening is the context tells you what the word is likely to mean, and in one trial you can get some evidence about what it means. And that's just an example of something you can detect locally, like "scrummed", and the context it's in, and you want to make those two agree.

00:07:28Geoffrey Hinton

Okay, a somewhat more principled way to get a learning signal, though not necessarily better, is to say let's learn a generative model that assigns high log probability to the input data. So for vision, let's learn a graphics model that generates things that look like the images we actually see. For complex nonlinear models that's tricky, but if instead of trying to maximize the log probability of the input data you try and maximize a variational bound on it, then you can make complex models much easier to learn, and that works pretty well.
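
For reference, the standard variational bound he is alluding to, written in the usual notation (the notation is mine, not the slide's): for hidden causes h and an approximating distribution q(h|x), log p(x) >= E_q[log p(x, h) - log q(h|x)] = log p(x) - KL(q(h|x) || p(h|x)), so maximizing the bound pushes up the log probability of the data while pulling q towards the true posterior.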

00:08:05Geoffrey Hinton

And if you then are willing to make a further approximation and say you've got this variational bound that's motivating you, and you're not even going to optimize that, you're going to optimize an approximation to that, you can get a very simple algorithm called the wake-sleep algorithm. Yes, I was going to put in a description of how it works here, but I can give you the description of how it works. I did an animation and confused myself. Yes, there you go.

00:08:35Geoffrey Hinton

There's a wake phase where you go forwards through these red connections which are recognition connections, and that determines the activities of these hidden layers. And then you do learning of the other connections. So now we're going to learn these connections, and we're going to train these to be good at reconstructing whatever it was in this layer that caused it. So whatever it was in H2 that caused the pattern in H3, you try and reconstruct it from the pattern in H3. You do a reconstruction, you look at the error, and you change these weights to get rid of that error.

00:09:06Geoffrey Hinton

And notice there's no back propagation needed here. That learning can be done sort of locally at a synapse here. It sees the state of H2 that was there previously, it sees the state of H2 that's there when it does the reconstruction, and it gets its input from the activity in H3, and so you can learn this synapse here without having to do back-prop. And that really is the right thing to learn, to maximize the probability of regenerating this from that. And you can do that sort of independently at all the layers.

00:09:39Geoffrey Hinton

And then in the sleep phase, what you do is you generate data from your model. So I start with random vectors up here generated according to the biases of these units. I generate downwards, and then for each pair of layers I try and reconstruct what actually caused this activity in the layer above. And I assume independence—it's a variational method so I make an assumption there—I assume these are independent causes of this, or rather that in the posterior they're independent. And so you get a very simple learning algorithm. It's not actually following the derivative of the variational bound, because unfortunately the variational bound involves KL(Q||P), where Q is your approximating distribution and P is the true posterior, and this is optimizing KL(P||Q). But it works pretty well, and it's very simple. There's no back propagation required.
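
A rough sketch of one wake-sleep step for a single pair of layers, assuming stochastic logistic units; the layer sizes, learning rate, and the stand-in for the top-level prior are all illustrative, and this is a schematic of the idea rather than a faithful reimplementation of the 1995 algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(R, G, v, rng, lr=0.05):
    """R: recognition weights (layer below -> layer above), shape (n_v, n_h).
       G: generative weights (layer above -> layer below), shape (n_h, n_v)."""
    # Wake phase: recognize upwards, then train G to reconstruct the layer below.
    h = sample(sigmoid(v @ R), rng)
    recon = sigmoid(h @ G)
    G += lr * np.outer(h, v - recon)                 # purely local delta rule, no backprop
    # Sleep phase: fantasize downwards, then train R to recover the fantasy's cause.
    h_fant = sample(np.full(R.shape[1], 0.5), rng)   # crude stand-in for the top-level prior
    v_fant = sample(sigmoid(h_fant @ G), rng)
    h_guess = sigmoid(v_fant @ R)
    R += lr * np.outer(v_fant, h_fant - h_guess)     # purely local delta rule, no backprop
    return R, G
```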

00:10:31Geoffrey Hinton

And so in 1995 we were quite excited about that: we had an algorithm that could learn multi-layer representations without doing any back propagation, and it worked moderately well. It learned sensible layers of feature detectors, but it wasn't as good as real back-prop. And you'll notice the training of these weights and the training of those weights uses exactly the same process.

00:10:56Geoffrey Hinton

Now there's some crazy things about it. Like when you're awake, you don't learn the recognition connections. We really thought this was wake and sleep. And when you're asleep, you don't learn the generative connections. So that means you get to the end of a day and you haven't learned to recognize things any better during the day, you have to go to sleep. That doesn't seem plausible. Maybe half asleep? Yeah, half asleep. Well, you could alternate.

00:11:20Geoffrey Hinton

Maybe there's new methods for unsupervised learning. So the problem we had that led to the wake-sleep algorithm, and that meant it wasn't doing quite the right thing, was that for this deep net we couldn't get the exact derivatives of the variational bound with respect to the recognition weights. The learning algorithm had the exact derivatives of the variational bound with respect to the generative weights, but not with respect to those recognition weights. And then Max Welling and Diederik Kingma came up with a very clever trick that allowed—I'm amazed that you can do it—that allowed you to actually get the exact derivatives. That is, something whose expected value was the exact derivative, and so now people can learn these variational autoencoders much better.
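
The "very clever trick" is what is now usually called the reparameterization trick (that name is not used in the talk): instead of sampling the hidden code directly, you sample noise and push it through a deterministic function of the recognition net's outputs, so the sample becomes differentiable and a single sample gives an unbiased estimate of the derivative of the variational bound. A one-line sketch for Gaussian codes, with illustrative values:

```python
import numpy as np
rng = np.random.default_rng(0)
mu, sigma = np.zeros(20), np.ones(20)   # outputs of the recognition net (illustrative)
eps = rng.standard_normal(20)           # the only source of randomness
z = mu + sigma * eps                    # differentiable in mu and sigma, so derivatives flow through the sample
```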

00:12:01Geoffrey Hinton

There's also other new methods of doing unsupervised learning, or rather getting a supervision signal without being given a separate supervision signal. So Ian Goodfellow and his collaborators have a thing called Generative Adversarial Nets. Where you have a net that generates data, let's suppose it's images, and to begin with it doesn't generate very good images, it generates rubbish. You have another net that looks at real images and looks at the images generated by this net, and it has to tell you whether what it's just seen is a real image or an image that came from this net. So it's learning to tell the difference between what the net produced and what real images look like. So it's an adversary.

00:12:41Geoffrey Hinton

And now if you back propagate through the adversary, you can figure out how to change the generated images so it's harder for the adversary to tell the difference between them and real images. So you get a signal that tells you how to generate images that are more difficult to distinguish from the real ones according to this adversary. But if you now keep learning the two together, it starts generating really good images. If you show it lots and lots of scenes of bedrooms—and just the furniture—and you then get it to generate, it generates things that are not particularly like any of the scenes it's seen, but you look at them and you say, "That's a bedroom." It's amazing.
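
In the usual notation (again mine, not the talk's): the discriminator D is trained to maximize log D(x) + log(1 - D(G(z))) over real images x and noise samples z, while the generator G is trained to make D(G(z)) large. The signal that improves the generated images is exactly the gradient obtained by back propagating through the discriminator into the generator.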

00:13:27Geoffrey Hinton

Okay, so there's lots of new unsupervised learning algorithms coming up, and my conclusion from all this—I'm not going to pick one of them as the best way to get a supervision signal—but my conclusion is there's many different ways to get supervision signals that you can use with back propagation. So we don't actually have to inject a separate label, so that objection doesn't really kill back propagation. There's lots of other ways you can get signals that you back propagate that will allow you to do learning. So that's not a major objection.

00:14:02Geoffrey Hinton

Now the next objection: can neurons communicate real values? So normally when you do back propagation, you send a real number forwards that's the output of a unit, and you also send real numbers backwards which are error derivatives. But as a matter of fact, people didn't try this for a long time. If you take logistic units which are sending values between 0 and 1, and they're going to send some real value forwards like 0.73, if you just randomly quantize that—so 73% of the time you send a 1 and 27% of the time you send a 0—so it's got the same expected value, the algorithm works just fine.

00:14:46Geoffrey Hinton

When you're actually computing the error derivatives for the incoming weights of a unit, you make use of the fact that it's a .73, but you never need to communicate that outside the neuron. Outside the neuron, you just do this stochastic communication, and it works just fine. If when you're back propagating errors from one layer to the next, you use two bits: you use one bit to say whether it's positive or negative, and another bit to say whether it's one or zero (or whether it's epsilon or zero, and you have to choose an epsilon so that most of the error derivatives are smaller than epsilon). And now you can stochastically communicate the back propagated derivatives with the right expected value, and back-prop will work just fine. So it is actually very robust to sending noisy things that have the right expected value.
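
A sketch of the two stochastic quantization schemes described above, assuming logistic activations in [0, 1] and an epsilon chosen so that most error derivatives are smaller than it; the exact encoding is my paraphrase of the verbal description.

```python
import numpy as np
rng = np.random.default_rng(0)

def stochastic_forward(y, rng):
    """Send 1 with probability y and 0 otherwise: same expected value as y."""
    return (rng.random(y.shape) < y).astype(float)

def stochastic_backward(dE_dx, eps, rng):
    """Two bits per derivative: a sign bit, plus a bit that is eps with probability
    |dE_dx| / eps and 0 otherwise, so the expected value equals dE_dx (for |dE_dx| <= eps)."""
    p = np.clip(np.abs(dE_dx) / eps, 0.0, 1.0)        # derivatives larger than eps get clipped
    magnitude = eps * (rng.random(dE_dx.shape) < p)
    return np.sign(dE_dx) * magnitude

y = np.array([0.73, 0.10, 0.95])
print(stochastic_forward(y, rng))                     # e.g. [1., 0., 1.]
d = np.array([0.002, -0.03, 0.0007])
print(stochastic_backward(d, eps=0.05, rng=rng))      # each entry is -eps, 0, or +eps
```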

00:15:33Geoffrey Hinton

We're now going to have a digression about statistics, and I'm going to argue that actually the fact that neurons send spikes rather than real values is an advantage. It's better than sending real values. This sounds odd. I mean, the neurons are roughly—I'm going to model them as a Poisson process. We all know they're more complicated than that, but that's a good start—that sends spikes randomly from some underlying rate. And the question is, how could that be better than sending an accurate real number?

00:16:03Geoffrey Hinton

Well, it all depends on ideas about statistics, and we've all been grossly misled by the professionals who are called statisticians. Frequentist statisticians will tell you things you probably learned when you were very young, like you shouldn't have more parameters than training cases, because if you have more parameters than training cases you can model anything. Bayesian statisticians are a bit more liberal. They'll say you can have more parameters than training cases, but you'd better integrate over the posterior. Well, I don't want to do either of those things. I want to have hugely more parameters than training cases, and I don't want to have to integrate over the posterior, because that's what the brain does.

00:16:48Geoffrey Hinton

And so there must be some regime for learning where this works, and the regime the brain is in is totally unlike anything statisticians have studied, or anything that nearly all statisticians have studied. The models they study are tiny models. Until quite recently, there were models with say 100 parameters and a thousand training cases. Now they're still tiny models with say 100 million parameters and a billion training cases. These are tiny compared with the brain. Trillions of times smaller—well, billions anyway.

00:17:18Geoffrey Hinton

The brain's got about 10^14 parameters—that's synapses—most of which seem to be adaptive, or at least a large fraction are. And you only live for about 10^9 seconds. Actually, it's 2 * 10^9, which is very lucky for some of us. So you've got roughly 100,000 synapses for every second of your life; that's the rate at which you get to spend parameters. And presumably the brain does this because supporting a synapse for your entire lifetime is much cheaper than supporting your whole body for the fraction of a second it takes to have an experience. So synapses are really cheap, and the brain manages to compute with them very cheaply, only using about 30 watts or so for 10^14 of them.
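
Spelling out the arithmetic: 10^14 synapses / (2 * 10^9 seconds) = 5 * 10^4, i.e. on the order of 100,000 synapses for every second of life, and 30 watts spread over 10^14 synapses is roughly 3 * 10^-13 watts per synapse.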

00:18:11Geoffrey Hinton

And so it needs a way to throw lots and lots of parameters at a relatively small amount of data compared with the number of parameters it's got. That was the evolutionary requirement on the brain, and presumably it's figured out how to do that. And actually, it turns out statisticians know a way of doing that. Statisticians already know a way in which you can get better as you have more parameters; your model gets better and better, it just doesn't get better and better very fast.

00:18:38Geoffrey Hinton

So this is Big Data versus big models. We all know big data is good, mainly because it's caused all our salaries to go up a bit. Although it's still not as high as the salaries of our recent graduate students, but there you go. So we know that for any given size of model, more data is better. Kind of the best regularizer you can get for a model is more data. So just get more data.

00:19:11Geoffrey Hinton

But I'm arguing it's a bad idea to do what statisticians have always recommended, which is make your model so small that whatever size data set you have, it looks big—it's bigger than the number of parameters in your model. That's what frequentist statisticians typically recommend. Big models are good, and here's what statisticians don't believe, but it's true: for any given size of data, the bigger you make the model, the better you'll do, not just at fitting the data but at generalizing—provided the data is complicated enough to be worth a big model, of course, and provided you regularize it well. In other words, there are regularizers that are so good that it always pays to have more parameters and a stronger regularizer.

00:19:55Geoffrey Hinton

And statisticians know something like that, which is if you use an ensemble. If you have an ensemble of 50 different models and I say, "Okay, would you like to have 100 different models?"—that's twice as many parameters—we draw models from some distribution and train them independently, so they're all trying to get the right answer. Almost certainly, having more models is going to give you a slightly better answer. So you can always make use of more parameters by just adding models to an ensemble, models of a fixed size to an ensemble. The question is, is there a more efficient way to use more parameters than that?

00:20:32Geoffrey Hinton

But because of this, and because there are more efficient ways to use more parameters, I think it's a good idea to always try to make the data look small by using a huge model. Now this relies on you having almost free computation. Obviously, one of the reasons statisticians didn't want to have big models was they started off doing the calculations by hand, and then they used pocket calculators, and then they used computers and so on. And until very recently, computers were so slow that you couldn't really afford to have very big models. And in fact, they're still too slow. So if you take a plausible-size model, like 10^14 parameters, computers are still too slow to fit that to even a relatively small amount of data, say a billion training cases. We'd like to fit 10^14 parameters to a billion training cases. Computers are still much too slow, but I believe Nvidia is working on it.

00:21:31Geoffrey Hinton

So I'm just going to talk about one regularizer that works quite nicely because I'm particularly attached to it. You can use a lot of parameters by having an ensemble of models and adding more models to the ensemble. But you can make that more efficient by letting models within the ensemble actually share information, so that you get more knowledge per parameter. Just having a gazillion models is a very inefficient way of using a lot of parameters. So we're going to somehow let the models in the ensemble share information.

00:22:00Geoffrey Hinton

And here's the idea. I've talked about this before actually at Stanford, so I'm going to go over it rather quickly. Let's have a neural net with one hidden layer. Each time you show it a training example, you leave out each unit with a probability of 0.5. So if you've got H hidden units, you have 2^H different models now, and each time you're going to just use one model. So for each training example, we use one of these models selected at random. And so most of these 2^H models will never see a single training example.

00:22:31Geoffrey Hinton

But the trick is, all of these models are going to share the incoming weights. So if two models both use a unit, they'll use it with the same incoming weights. So they're doing massive parameter sharing, and that's a much better regularizer. Having your parameter be like the parameter value that some other model wants is a much better regularizer than just having it be close to zero, for example, which is what L1 and L2 weight decay are trying to do.

00:22:58Geoffrey Hinton

So that's the idea of Dropout. You train by randomly leaving out half the hidden units on each training example. When you test, you put them all in and halve the outgoing weights. And if you're using a softmax up here, then at test time what you get is exactly the geometric mean of the predictions of all 2^H models. So there's an efficient way to get your model average.
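
A minimal sketch of dropout at training and test time for one hidden layer, assuming a drop probability of 0.5; sizes and data are illustrative. Scaling the hidden activities by 0.5 at test time is the same thing as halving their outgoing weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_train(x, W, rng, p_drop=0.5):
    """Training: each example sees one of the 2^H models, all sharing incoming weights."""
    h = sigmoid(x @ W)
    mask = (rng.random(h.shape) >= p_drop).astype(float)   # leave each unit out with probability 0.5
    return h * mask

def hidden_test(x, W, p_drop=0.5):
    """Testing: keep every unit but scale its output down, which approximates the geometric
    mean of the predictions of all 2^H models (exactly, for a single softmax on top)."""
    return sigmoid(x @ W) * (1.0 - p_drop)
```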

00:23:29Geoffrey Hinton

Now if you have multiple layers, then using half the outgoing weights is just an approximation, but it still works pretty well. So Dropout allows you to make models much bigger. It makes training slower, because (a) there's more noise, so even a model of the same size will be slower to train, and (b) you need to make the models bigger because on each occasion you're dropping out half of the units. But what it's really doing is preventing units from collaborating too much. So it's preventing overfitting.

00:24:03Geoffrey Hinton

Similarly, when you use an ensemble, having lots of small models prevents any collaboration at all between parameters in different models, and that's what stops overfitting. Here we're going to have some collaboration, but we're going to try and minimize it by saying, "I don't know which other hidden unit is going to be left out, so I can't rely on what he said, and I can't adjust my parameter values so that it works with this other hidden unit. I have to be sort of more independent than that." Okay, there's people here who've studied Dropout and know much more about it than I do, but it works. And at test time, you halve the outgoing weights and just do a forward pass, and that approximately computes the geometric average of all of your models.

00:24:51Geoffrey Hinton

Now you can view Dropout as using Bernoulli noise, where you take a neuron, you compute what its output should be, and you either send zero or you send twice that. Let's suppose we use probability .5 dropout. So if you either send zero or you send twice the activation, the expected value is the same as the activation, but there's some noise in it. So you're sending the right expected value but with noise. Well, a Bernoulli distribution has a standard deviation equal to the activation, because you're either going to send twice of it or zero, and so the standard deviation is equal to the activation.

00:25:34Geoffrey Hinton

You could try using Gaussian noise that had that same standard deviation and that same mean value. So that would say, use multiplicative Gaussian noise on the activations of the units where the standard deviation of the multiplicative Gaussian noise is equal to the activation. That's a lot of multiplicative Gaussian noise; obviously you can use less or more, but if you use that amount of Gaussian noise, it works about the same as using Dropout. I was really hoping it would work worse, because there's this cute property that Dropout has a certain standard deviation and it minimizes the entropy of the distribution for that standard deviation, and Gaussian multiplicative noise has the same mean and standard deviation but maximizes the entropy of the distribution. And you'd have thought minimizing versus maximizing might make it work different. Actually, it's pretty much the same. And if anything, the Gaussian noise is just slightly better, which is very annoying.

00:26:29Geoffrey Hinton

But you can also use sort of fake Poisson noise. So rather than actually implementing a Poisson process, what you can do is you can say, "I'm going to try and get something that has the same mean and variance as a Poisson process. I'm going to have multiplicative noise, but the standard deviation of the noise is now going to be the square root of the activation value." So the proportional noise gets less as the activation value gets bigger. And that works fine too. And that strongly suggests that if you use a Poisson process that has the same mean and variance as this Poisson multiplicative noise, then that'll work too. I haven't actually implemented that because my graduate students all went off and got jobs and I didn't feel like doing it myself, but it's bound to work.
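
A sketch of the three kinds of multiplicative noise being compared, applied to a vector of non-negative activations; all three have the means and standard deviations stated above, and the "fake Poisson" version is my reading of the verbal description.

```python
import numpy as np
rng = np.random.default_rng(0)

def bernoulli_noise(a, rng):
    """Dropout with p = 0.5: send 0 or twice the activation; mean a, standard deviation a."""
    return a * 2.0 * (rng.random(a.shape) < 0.5)

def gaussian_noise(a, rng):
    """Multiplicative Gaussian noise whose standard deviation equals the activation: mean a, std a."""
    return a * (1.0 + rng.standard_normal(a.shape))

def fake_poisson_noise(a, rng):
    """Noise whose standard deviation is sqrt(a), like a Poisson rate: mean a, std sqrt(a).
    (Unlike a real Poisson process, this can go slightly negative for small a.)"""
    return a + np.sqrt(a) * rng.standard_normal(a.shape)
```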

00:27:20Geoffrey Hinton

Okay, so the conclusion about sending accurate real values is this: if what a neuron does is computes an underlying rate which is a real value, and then sends Poisson spikes according to that rate, what it'll achieve is, in expectation, it's sending the real value, but it's adding a huge amount of Poisson noise to it. But adding a huge amount of noise is a very good thing to do if you're in the regime the brain is in. That's what allows you to use 10^14 parameters with only 10^9 training points. So actually, it's not a problem that you're only sending these noisy spikes; it's actually better than being able to send accurate real numbers.

00:28:06Geoffrey Hinton

Okay, so that's the end of that objection. Now having claimed that, what I'm going to do is say, well look, if I give you a theory that works when I send real numbers forwards and backwards, you can always turn it into a theory that works when I send spikes, which is just the same theory but with lots of Poisson noise as the regularizer. So from now on, I'll assume I'm sending real numbers for most of what I'm going to say, but if I can do it with real numbers now, I can do it with spikes.

00:28:38Geoffrey Hinton

Okay, so the next question, the next of the four questions is: how do we send the error derivatives backwards? So it's obvious that if the output of a neuron represents the presence of a feature in the current input, you can't also use the output of that neuron to represent the derivative of the error with respect to the total input received by that neuron, which is what you need to communicate backwards in back propagation. So you've got a problem. You can't actually use the same neuron to send stuff backwards. So it obviously has to be a different neuron.

00:29:15Geoffrey Hinton

What's more, when you send stuff backwards, this different neuron has to sometimes send positive error derivatives and sometimes send negative error derivatives. That is, the sign of the thing it's sending changes, and the effect it has must change sign too. And there's a thing in neuroscience called Dale's Law, which says that neurons are either excitatory or inhibitory; they can't change the effect they have. They can't change from having a positive effect to having a negative effect. So a single neuron whose effect changed sign would violate Dale's Law. That's why in feed-forward pathways you have things like on-center off-surround neurons and off-center on-surround neurons: you have a pair of neurons to deal with the sign reversal. And so presumably you'd need pairs of neurons to deal with the reversal of the sign of the error derivative. So it seems obvious that you can't use the same neurons to send the error signal backwards. But actually, that's all nonsense.

00:30:18Geoffrey Hinton

Um, so here's how you do it. So this is the sort of biggest claim—this is a sort of central claim of this talk—that there is a way to represent error in the brain which makes it very easy to do back propagation. In fact, if you represent error derivatives this way, back propagation just emerges. It's just an emergent property of using this representation of error derivatives plus a few other things. And the idea is this: if you've got a signal and you want to represent two things, well you could represent one thing by the value of a signal, and another thing by the rate at which that value is changing. And at least over a short time period, you have two independent quantities now: the value and the rate of change of the value.

00:31:00Geoffrey Hinton

So what we're going to do is we're going to say the output of a neuron represents what's going on in the world, and the rate at which that output is changing does not represent how fast that property of the world is changing. So if I've got a neuron that represents where something is, when it's active it says "it's here." If the activity of that neuron changes, it doesn't mean that it's moving. The rate of change of the neuron is going to represent an error derivative. If the brain does that, it's a really basic decision it's made, because it can no longer use rates of change of quantities for representing that that quantity changed.

00:31:38Geoffrey Hinton

And in fact, the brain seems to be like that. You can't use the rate of change of the activation of a position-sensitive neuron to represent the velocity of something. If you want to represent velocity, you use a different neuron. And if you want to represent acceleration, you don't use the rate of change of the output of a velocity neuron; you use yet another neuron that represents acceleration. And some people get brain damage where they lose the velocity-sensitive neurons, and then they can see that a car is here, and later that it's over there, but they never see it move. It's kind of the opposite of the waterfall effect, where you see motion without any change in position. Okay.

00:32:21Geoffrey Hinton

So let me show you a very early use of this idea that you can use temporal derivatives of the outputs to encode error derivatives with respect to the inputs. This was published in 1988—the work was actually done in 1987—by Jay McClelland and me. The year is actually quite significant, I think, because I'm fairly sure that at that time people hadn't yet discovered Spike Timing Dependent Plasticity. And what I'll show you is that this is actually a Spike Timing Dependent Plasticity rule. So we predicted it—we just predicted it with the wrong sign.

00:32:55Geoffrey Hinton

So this is for an autoencoder. What you do is you take an input. You want to train these green weights so they will reconstruct the input and thereby get some interesting representations in these layers. And I've shown it for what I call a "long loop" – that's a loop that's greater than just up and down again. The algorithm works most simply for a short loop, but I've now made it work for a long loop using logistic units.

00:33:21Geoffrey Hinton

And the idea is this: you start with random weights, you send the input round the loop once. So the first time round is green, that'll give you something different here from what the real input was. You then send it around again, so that's that red pathway, and you take the difference between the activation the first time you sent things around and the second time you send things around, and you say, "I'm going to use that as an error derivative. I'm going to change the incoming weights here so as to reduce that difference." And it will actually learn to be an autoencoder.

00:34:00Geoffrey Hinton

And notice there's no back propagation. It's actually just—well, you might call this back propagation because you get information from there to there, but it doesn't look any different from the forward propagation here. It also answers the question, how can I learn these weights when the only pathway to get information from here to there is this? It actually works. It doesn't work as well as standard back-prop, but you can make it work. And so the learning rule for a neuron here is to say, change the incoming weights of that neuron in proportion to the activity of this neuron in the layer below times the rate of change of this neuron.

00:34:45Geoffrey Hinton

And you put in a minus sign because what you want is that the thing that happened first is right and the thing that happens second is bad—you're wandering away from the right thing, and so you want to pull it back to the right thing. So if the thing that happened first is right and the thing that happened second is bad, you need a minus sign here to make the learning go in the right direction. And so that's the opposite of Spike Timing Dependent Plasticity. I will elaborate the connection to Spike Timing Dependent Plasticity in a minute.
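
A sketch of that recirculation idea for the simpler short loop (visible layer up to a hidden layer and back down); the learning rate and the regression coefficient towards the data are illustrative, and this is a schematic of the 1988 algorithm rather than a faithful reimplementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recirculation_step(V, W, x, lr=0.1, lam=0.75):
    """V: visible -> hidden weights, shape (n_v, n_h). W: hidden -> visible weights, shape (n_h, n_v).
    Two trips round the loop; each weight changes in proportion to its presynaptic activity
    times MINUS the temporal change of its postsynaptic unit (reverse STDP)."""
    h0 = sigmoid(x @ V)                           # first time round: hidden state caused by the data
    x1 = lam * x + (1 - lam) * sigmoid(h0 @ W)    # reconstruction, regressed towards the data
    h1 = sigmoid(x1 @ V)                          # second time round
    W += lr * np.outer(h0, -(x1 - x))             # presynaptic h0 times -(change at the visible units)
    V += lr * np.outer(x1, -(h1 - h0))            # presynaptic x1 times -(change at the hidden units)
    return V, W
```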

00:35:15Geoffrey Hinton

But you can actually combine it with Spike Timing Dependent Plasticity with the right sign in the following way. I gave a talk at NIPS about this in 2007, which met with universal incomprehension. First, you can use reversed Spike Timing Dependent Plasticity to learn a bunch of stacked autoencoders. So you learn one autoencoder, you then take the hidden states of it, you learn to autoencode those, and you can pile up a bunch of stacks like that. So think of those as cortical areas that are learning representations based on the input. And then our problem is, if we know something about what we'd like the final output to be, can we get that to influence the early connections?

00:36:01Geoffrey Hinton

So what you're going to do now is, having learned the autoencoders and got them working quite well, you're going to do a downwards pass all the way from the top. Just one downwards pass that uses whatever representation came out of your stack of autoencoders when you showed it the data. So that's your prediction. And then you're going to correct your prediction with the right answer by regressing it towards the right answer, and do another downwards pass. And you're going to use the difference in activations on those two downward passes as your derivative of the error with respect to the input of each unit.

00:36:44Geoffrey Hinton

And that corresponds to Spike Timing Dependent Plasticity with the correct sign. Because the bad thing came first and the good thing came second. So what comes second is better than what came first.

00:37:01Geoffrey Hinton

So let's just have a look at that. So here's Spike Timing Dependent Plasticity. This is what neuroscientists observed. You have a presynaptic spike, let's suppose it occurs here, so an incoming spike, and then the neuron fires at some point. If it fires before the presynaptic spike, what happens is the weight connecting the two goes down. And if it fires after the presynaptic spike, the weight goes up. It's normally interpreted by neuroscientists as: could this presynaptic spike have caused this neuron to fire? If it could have caused it, let's increase the connection strength.

00:37:41Geoffrey Hinton

But I want to give you a different interpretation. If you look at this thing, it looks like a derivative filter. And if what you want is to change the connection strength by the activity on the input times the rate of change of the output, then what you need to do is take that stream of output spikes, apply a derivative filter to them, and use the output of that derivative filter as the post-synaptic term in your learning rule. So you're really asking, is the rate of firing higher here or here? If the rate of firing is higher here, you want to increase the weight. And if the rate of firing is higher here, you want to decrease the weight.

00:38:27Geoffrey Hinton

That's the right learning rule if you're using the change in the rate of firing to represent the error derivative. And it's the change in this underlying rate of firing. You only get a noisy observation of this because you just see some spikes, but this thing will give you the right expected value. And for stochastic online gradient descent, all you have to do is get the expected values right and it'll work. It's very robust to all the noise. So you can interpret Spike Timing Dependent Plasticity as the signature of a system that's using the rate of change of the output to represent the error derivative with respect to the input.
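
One crude way to read that interpretation in code: estimate the change in the postsynaptic firing rate by comparing spike counts just after and just before a reference time, and use that, times the presynaptic activity, as the weight change. The window length and learning rate are arbitrary choices of mine.

```python
import numpy as np

def stdp_like_update(pre_activity, post_spikes, t, half_window=20, lr=0.01):
    """post_spikes: binary array with one entry per time step.
    Assumes half_window <= t <= len(post_spikes) - half_window.
    Returns a weight change proportional to the presynaptic activity times a
    (noisy) estimate of the postsynaptic rate change around time t."""
    rate_before = post_spikes[t - half_window:t].mean()
    rate_after = post_spikes[t:t + half_window].mean()
    return lr * pre_activity * (rate_after - rate_before)
```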

00:39:08Geoffrey Hinton

Now let me show you that in a bit more detail. Yeah. So if you do a forward pass through a bunch of stacked autoencoders, and then you do two backward passes—one from whatever you produced and one from the right answer—that difference will give you your learning signal. If you want to make it more continuous, what you do is you go forwards, you get your prediction, and then you gradually blend your prediction with the right answer. So you regress your prediction towards the right answer. And then as you're doing backward passes, the rate of change of the neuron as you gradually do this regression will be the post-synaptic term that you need to multiply the presynaptic activity by to do back propagation. And I'm going to show you that now. Hopefully.

00:40:14Geoffrey Hinton

So this is sort of the technical idea of this: I'm going to use the rate of change of the output of neuron j (that is, ẏ_j) to represent dE/dx_j. I have some target value. I have some output probability when I've driven the system bottom-up when it's making a prediction. And so what we do is we say, let's start off with the output the neuron actually produced, and then as time goes by, let's change that by regressing it—by starting there, and then adding in some of the desired output and removing some of the actual output. So we're just regressing towards the desired output.

00:41:02Geoffrey Hinton

If you differentiate that with respect to t, you'll see you just get d_j - y_j(0). And so the rate of change of the output in the top layer is going to be proportional to the derivative of the cross entropy error, because that's what the cross entropy error looks like, and that's the derivative of the cross entropy with respect to the input to a neuron in the last layer.
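
Spelling out the algebra in the notation used above, with the blending parameter t running from 0 to 1: y_j(t) = (1 - t)·y_j(0) + t·d_j, so ẏ_j = d_j - y_j(0). For a logistic or softmax output unit with cross-entropy error E, dE/dx_j = y_j(0) - d_j, so ẏ_j = -dE/dx_j: the rate of change of the output represents the error derivative with respect to that unit's total input (the sign convention here is mine).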

00:41:31Geoffrey Hinton

Just so we're all on the same page, here's what you need to do to do back propagation. You need to start off by getting this thing, the derivative of the error with respect to the output, and then you need to convert that into the derivative of the error with respect to the input, and we're going to do that. We're going to represent this thing as ẏ_j. The rate of change of the neuron is going to be representing this. Now we want to get the same quantity in the previous layer. If we can do that, we can do back propagation.

00:42:06Geoffrey Hinton

And so what I want to show is, if you train a stack of autoencoders and do this backward pass with the time-changing activities, then if you get the output units so that dE/dx_j is represented by ẏ_j, then automatically in a stack of autoencoders, the units in the layer below, the same thing will be true: the ẏ_i will be representing dE/dx_i.

00:42:39Geoffrey Hinton

Now the way you do that in back-prop is you take this quantity, you multiply it by the slope of the nonlinearity here, dy/dx. That's another problem for the brain. The brain has neurons that aren't very stable—I mean they adapt rapidly and things, so their nonlinearities keep changing slope, and so it can't actually know the slope of the nonlinearity because it keeps wandering around. And the method I'm going to show you has no problem with that, because it actually measures the slope of the nonlinearity rather than knowing it. So anyway, you need to do that.

00:43:16Geoffrey Hinton

And then you need to—this is the basic back-prop step—you need to take this quantity for every neuron in that layer, you need to multiply it by the weight on the connection but going in the other direction, and you need to add it all up, and for neuron i here, that will give you dE/dy_i. You then need to take this quantity you computed by adding up all these backwards coming things and put it through the slope of the nonlinearity there, so you need to multiply by this thing in order to get dE/dx_i. And I'm going to show you an easy way to do that.

00:44:00Geoffrey Hinton

So let's have two output neurons and one neuron in the layer before just to keep the diagram simple, and let's first do a forward pass. There's more layers down here. So we do a forward pass, these are the output neurons, we get some actual outputs here. We now start regressing those actual outputs towards their desired values. And we now replace the bottom-up input by top-down input that comes from these guys. So to begin with, the top-down input this neuron gets is y_j * w_ji + y_k * w_ki. And because this thing's been trained as an autoencoder, that top-down input ought to reconstruct what was here. So what was here shouldn't change much when you compute it using the green connections instead of computing it bottom-up.

00:45:04Geoffrey Hinton

If these two layers formed a good autoencoder, you can reconstruct whatever was here from the activities in the layer above by going backwards through these weights. And when you train a Restricted Boltzmann Machine, that's exactly what you're doing. You're using the same top-down weights as bottom-up weights and training it to be good at reconstructing, at least if you train it with contrastive divergence.

00:45:24Geoffrey Hinton

Okay, so when I use the green connections instead of the black connections, when I first use them when I have the actual outputs, this neuron won't change much. But now I start changing these guys by regressing them towards the correct values, the desired values, the target values. So now what happens here? Well, what's coming along the green connection here will be changing, and the rate at which it will change will be how fast this guy changes times the value of this weight. And what's coming along here will be changing. It'll be how fast this guy changes times the value of this weight. So the total input to this neuron here will be changing by w_ji * ẏ_j + w_ki * ẏ_k. I apologize to mathematicians, I never understand equations; I always understand it by using a particular example.

00:46:20Geoffrey Hinton

So now we can ask, how fast does the output y_i change? Well, y_i will change at a rate that's the rate of change of the input (which is what we just computed) times the slope of the nonlinearity. Notice we didn't need to know the slope of the nonlinearity; this just happens! Whatever the slope of the nonlinearity is here, if you change this top-down signal that's being used to reconstruct, it'll change what comes out, and the slope of the nonlinearity will get into the act. And so if it was the case that ẏ_j and ẏ_k were representing the error derivatives with respect to the inputs here, it will just happen that ẏ_i represents the error derivative with respect to the input to this guy.
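
Putting that example into symbols (my notation, following the diagram): the top-down input to unit i is x_i = Σ_j w_ji·y_j, so its rate of change is ẋ_i = Σ_j w_ji·ẏ_j, and the output changes at ẏ_i = (dy_i/dx_i)·ẋ_i. If each ẏ_j in the layer above represents dE/dx_j, then ẋ_i = Σ_j w_ji·dE/dx_j = dE/dy_i, and ẏ_i = (dy_i/dx_i)·dE/dy_i = dE/dx_i, which is exactly the back-propagation recursion, with the slope of the nonlinearity supplied by the neuron itself rather than known to the learning rule.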

00:47:12Geoffrey Hinton

So that is the kind of recursive step in back propagation. The idea is, it just happens automatically using these temporal derivatives as error derivatives. And you don't need to know—you can use neurons that wander around, you don't need to know the slope of the nonlinearity, because here you're getting this effect by just putting in a changing input and seeing how the output changes. It's kind of weird because if you want the error derivative with respect to the output, you sort of do your computation here once, and once you've taken the error with respect to the output that you computed here, you put it forward through the neuron and you get the error derivative with respect to the input in these rates of change.

00:47:57Audience Question

That's just basic gradient using the adjoint, right? That's what Kalman filters do and everything else does. Yeah.

00:48:07Geoffrey Hinton

It's dumb, I agree! It's just that neuroscientists were determined the brain couldn't do this. Yeah. Although I've already let you know what I think of statisticians, at least when it comes to very big models.

00:48:27Geoffrey Hinton

So I already said this. If you're using these temporal derivatives as error derivatives, you can't also use temporal derivatives to represent the temporal derivatives of quantities in the world. And so we get to the last problem. And this caused me to give up on this idea in 2007. I thought, "Okay, I got this far, but it's actually hopeless." And the reason it's hopeless is because it requires the top-down weights to be the same as the bottom-up weights. And in particular, it requires if I've got two neurons with a connection this way, I better have a connection the other way.

00:49:03Geoffrey Hinton

That's not what the brain is like. You don't have this point-wise connectivity. You have some connections going this way, and some other connections going to the same vague area coming back, but not to the same neurons. I mean, by coincidence it might occasionally happen, but in general it doesn't. So you don't have this basic property of back-prop, which is that when the signal comes backwards, it comes backwards through the transpose of the weight matrix through which you went forwards. And that seems to be essential for back-prop.

00:49:36Geoffrey Hinton

So, how could this possibly work if you had a neural net in which you've got sparse connections in both directions, with not much overlap between the pairs of neurons that are connected in the two directions? Well, you need a miracle. And in 2014, Tim Lillicrap and his co-workers at Oxford and Toronto discovered something very surprising. They couldn't believe it was true. They were using it as a control condition for something else, and suddenly something that shouldn't have worked, worked. And this is used in control—people try using this stuff for control.

00:50:24Geoffrey Hinton

So what they did was they used just random connections coming back instead of the transpose of the forward weights. Now, if you use random connections coming back, your derivatives don't even have the right signs. I mean, they're just random by the time they've gone down one layer. So using random connections coming back, if you just updated the weights using those derivatives you projected back, you wouldn't expect it to help. And yet it does—that is, if you update the weights from one hidden layer to the next hidden layer using as error derivatives these fake derivatives that you get by taking the derivatives in the last hidden layer and mapping them back through random weights...

00:51:08Audience Question

You're going to explain why?

00:51:10Geoffrey Hinton

Yes, I'm going to explain why. So the puzzle is why does it work when you map back through random weights? I mean, I didn't believe it either. I went and implemented it, and it works. It doesn't work so well for getting things through narrow bottlenecks, but it works. And so for autoencoders with narrow bottlenecks it doesn't work so well, but if you have a wide layer, it works just fine. It's a bit slower than back-prop normally, maybe a factor of two slower, but it gets down to similar errors. So it works really well. It's not that it sort of works a bit; it works really well, almost as well as true back-prop.

00:51:49Geoffrey Hinton

I'll give you an analogy. There's something else like this that works, which is in variational inference. You try and infer the states of the latent variables, and you use something that infers the wrong states, and then you go off and do learning under the assumption they're the right states, and hey presto, it all works out nicely. It's something that wants to work. And what's really happening is when you learn the generative model assuming that you got the real posterior distribution even though you didn't, the generative model will adapt so that your crummy method of doing inference is a bit more correct. And so the generative model contorts itself so as to try and fit in with your appalling way of doing inference, and the whole thing works out not too badly. There's basically two kinds of algorithms in learning as far as I can see. There's ones where you make some terrible fudge and it wants to work, so it contorts itself to make your fudge work. And there's others where you make some terrible fudge and it wants not to work, and it exploits the fact that your fudge is wrong to work really badly.

00:53:03Audience Question

Sorry, that's a noise cancellation principle, yes?

00:53:06Geoffrey Hinton

But it's not always noise cancellation. You see, in variational inference it's not noise cancellation, yeah. Okay. I like variational inference, it's double contrarian. It actually works just to show you you weren't right about the noise cancellation principle! Yeah. So this is one of those weird things that works when it shouldn't. Works much better than it should.

00:53:32Geoffrey Hinton

So all I've explained so far is there's other things like this that work when they shouldn't. That doesn't really explain how it works. Before I go into explaining how it works: when you use fixed top-down weights, it works. So obviously, if you're slowly changing the top-down weights, and the feed-forward weights can track fast enough, it'll still work. And so you should be able to do better than fixed top-down weights. You should be able to slowly change the top-down weights so as to, for example, make the things into better autoencoders. And actually that works slightly better. It's a small improvement over using fixed top-down weights.

00:54:12Geoffrey Hinton

But why does using fixed top-down weights work at all? Well, here's a really tricky case, a case that you wouldn't have thought would work. We're going to take MNIST, so we're going to take a handwritten digit, and we're going to have two hidden layers of 800 rectified linear units each (but they could be sigmoid units). And we're going to use fixed random sparse connections coming back. For the forward connections we only put in 25% of the possible connections, and then for the top-down connections we only use pairs of neurons that weren't connected in the forward direction, putting in a third of those possible connections. So we only have 25% of the top-down connections too, and there's no overlap. So we're back-projecting through a fixed random sparse matrix where none of the entries overlap with the forward matrix. That's much more like the brain than a normal back-prop net.
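
Here is a rough numpy sketch of that connectivity as I understand it (the exact sampling procedure is my assumption): keep 25% of the possible forward connections, and choose the fixed top-down connections only among pairs that have no forward connection, so the two sets never overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

def non_overlapping_masks(n_below, n_above, p_forward=0.25):
    # forward mask: keep 25% of the possible connections going up
    forward = rng.random((n_above, n_below)) < p_forward
    # top-down mask: only allowed where there is no forward connection,
    # and we keep a third of those pairs (a third of the remaining 75% is again about 25%)
    backward = (~forward) & (rng.random((n_above, n_below)) < 1 / 3)
    return forward, backward.T      # the backward mask is used in the top-down direction

fwd, bwd = non_overlapping_masks(784, 800)
print(fwd.mean(), bwd.mean())       # both roughly 0.25
print(np.any(fwd & bwd.T))          # False: the forward and top-down connections never overlap
```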

00:55:16Geoffrey Hinton

But then we have connections here. They can be sparse if you like (in this case I didn't make them sparse), but these are adaptive. And the question is, will these adapt in a sensible way? So the experiment goes like this: first you try learning where these earlier connections are fixed and you just adapt the later ones. And obviously that learns, because what you're doing is taking the input, randomly recoding it to 800 activities (which keeps it about the same size), and then learning a simple model on top. And that will learn. That'll get down to about 2.5% error or something. Maybe not quite that, but it'll do okay.

00:55:54Geoffrey Hinton

The question is, if you learn these guys and these guys at the same time, using as the learning signal for these units the errors you got at the output (which were the correct errors), back-propagated through the fixed random matrix, multiplied by the slopes of the nonlinearities here, then back-propagated through this one and multiplied by the slopes of the nonlinearities here: those are the fake derivatives you're going to use for learning. Does it help? And yes, it helps a whole lot. It uses these connections very nicely and builds nice receptive fields. So something's really working here, even though it's going through these random connections.
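
Here is a minimal dense sketch of that training step (my illustration, not the talk's code), assuming a 784-800-800-10 net with rectified linear hidden units and a softmax output, and omitting the sparsity masks described above: the true output error is pushed back through fixed random matrices, multiplied by the slopes of the nonlinearities, and those fake derivatives are used to update the forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# adaptive forward weights for a 784 -> 800 -> 800 -> 10 net
W1 = rng.standard_normal((800, 784)) * 0.01
W2 = rng.standard_normal((800, 800)) * 0.01
W3 = rng.standard_normal((10, 800)) * 0.01
# fixed random feedback weights, never the transposes of W2 and W3
B3 = rng.standard_normal((800, 10)) * 0.01
B2 = rng.standard_normal((800, 800)) * 0.01

def feedback_alignment_step(x, target, lr=0.01):
    """One update: true error at the output, fake derivatives through fixed random weights."""
    global W1, W2, W3
    z1 = W1 @ x
    h1 = np.maximum(z1, 0.0)                         # first hidden layer (ReLU)
    z2 = W2 @ h1
    h2 = np.maximum(z2, 0.0)                         # second hidden layer (ReLU)
    logits = W3 @ h2
    p = np.exp(logits - logits.max()); p /= p.sum()  # softmax output

    e = p - target                                   # true error derivative at the output
    d2 = (B3 @ e) * (z2 > 0)                         # fake derivative for the second hidden layer
    d1 = (B2 @ d2) * (z1 > 0)                        # fake derivative for the first hidden layer

    W3 -= lr * np.outer(e, h2)                       # the output layer still sees true derivatives
    W2 -= lr * np.outer(d2, h1)
    W1 -= lr * np.outer(d1, x)

# usage (hypothetical variables): feedback_alignment_step(image_as_784_vector, one_hot_label)
```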

00:56:32Geoffrey Hinton

You have to do things like this to convince yourself that it really is working, because it was so surprising. And now I'm going to give you an intuitive explanation of why it works, or at least of how it gets off the ground.

00:56:45Geoffrey Hinton

So what we can do is we can freeze the last layer of weights and just learn the earlier weights. And if we do that, we notice something. When you freeze the last layer of weights that are going to be used to propagate, to produce predictions, which then get sent backwards through random weights, the last layer of weights is actually learning correctly to begin with, because it's looking at the output and looking at what's coming in, and it's actually really learning properly. If you turn off that learning, it appears that no useful learning goes on. So if you freeze the last layer weights and you learn the earlier layers using these fixed random backwards connections, it doesn't improve at all, it just gets slightly worse as you learn. And so you think that it's not actually learning anything. But actually it's learning a lot. What it's learning is not making the error go down. It's not making the output error go down, but it's going to make it much easier for the last layer of weights to learn when you do turn them on.

00:57:51Audience Question

So are you breaking the loop when you turn something off?

00:57:57Geoffrey Hinton

No, no, the connections are still there, we're just not changing them. But we have forward connections from the last layer to the output, they're just not adapting.

00:58:06Audience Question

Okay, so there are loops.

00:58:07Geoffrey Hinton

Yeah, if you didn't do that, I don't think it would make sense at all.

00:58:14Geoffrey Hinton

So here's what's happening. Think about what happens if you've got, say, 10 output classes. I'm going to use a softmax at the output, but this also applies to linear regression. The actual outputs will all be about equal if we start off with small weights, okay? So we have small weights going to the output; they don't have much effect on the output. The actual outputs all look pretty much like that. The desired outputs, when you have a member of the fourth class as the input, will look like that. And so the derivative of the error with respect to the input to the final layer of units would look like this. It says: if you want to make the error bigger, decrease this guy and increase those guys; or if you want to make the error smaller, decrease these guys and increase this guy.
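
In symbols, the error derivative at the input to the softmax is just the softmax output minus the one-hot target, so with small weights it is essentially fixed by the class label. A tiny sketch (my illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10

logits = rng.standard_normal(n_classes) * 0.01   # small weights, so tiny logits
p = np.exp(logits); p /= p.sum()                 # softmax outputs, all roughly 0.1

target = np.zeros(n_classes); target[3] = 1.0    # an example of the fourth class
dE_dlogits = p - target                          # derivative of cross-entropy w.r.t. the logits
print(np.round(dE_dlogits, 2))                   # roughly [0.1, 0.1, 0.1, -0.9, 0.1, ...]
```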

00:59:06Geoffrey Hinton

And notice that if you have different instances of the fourth class that have nothing to do with one another, they're just very different inputs, but they're classified as the fourth class, they will get the same error derivative coming back from the output there. Because the error derivative is determined by the class. It's dominated by what the class is.

00:59:27Geoffrey Hinton

So now what happens is you take this error vector and you map it backwards through the last layer of backward connections, which is just random weights. And so they'll do some sort of random rotation and scaling and stuff of it. But if you had two things that are the same and you map them through a random matrix, they're going to end up the same pretty much. They're going to end up very similar unless it's some weird matrix. So what you're getting now is that you get these fake derivatives to come back that have nothing to do with the real derivatives. And if you change the units in that direction, it's not going to make life better for you at the output.

01:00:03Geoffrey Hinton

But for members of the same class, you get the same derivatives. And for members of different classes, you get different derivatives. So all the different members of the same class will try and move the activities in the same direction. And what'll happen when you've done a little bit of learning is that all the different members of the same class, even if they had nothing to do with each other, but you designate them as being of the same class—so you could just take random data and just designate the first 10 as being Class 1, the next 10 as being Class 2, the next 10 as being Class 3. They got nothing to do with each other. But when you turn on the learning now, things that are designated as Class 4 will all get the same error vector coming back. And so in the last hidden layer, they'll all learn to have pretty much the same representation.

01:00:52Geoffrey Hinton

That's basically why it works. That's what's going on when you learn without adapting the last layer. So you can see how this thing's working. It's making things that have the same class have similar representations in the hidden layers, and it's recursive. I'll show you that in a minute.

01:01:12Geoffrey Hinton

So this is an example I implemented. You have frozen but small forward weights there, you have fixed random weights here, and you have randomly initialized adaptive weights there. And you do this: you do back-prop, treating the error derivatives here, mapped through these fixed random weights, as if they were the true derivatives. And as you learn, as you adapt these weights, the error doesn't go down. The error, if anything, goes slightly up. Well, it basically stays at random.

01:01:46Geoffrey Hinton

Okay, but when you now turn on the learning there, what's happened is things of the same class here have a very similar representation there. And so the learning here is trivial; it learns very fast. So you were actually doing something very useful. You were constructing the representations of things in the same class to be very similar. And it works through multiple layers.

01:02:08Geoffrey Hinton

Oh, this is just an example of that. This is 100 random vectors, random binary vectors of length 784. The first 10 are designated Class 1, the next 10 are Class 2. I train without adapting the last-layer weights at all. And after training just the earlier weights, if you look at the representations in the hidden layer and measure the covariances between these vectors, you get that. You can see the first 10 all have pretty much the same representation, the next 10 have a very different but again shared representation, and so on. So it's obvious that if that's your representation of the things in the different classes, learning is trivial. And that works through multiple layers.
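
Here is a self-contained sketch of that experiment as I have reconstructed it (the hidden-layer size, learning rate, and number of passes are my guesses): 100 random binary vectors, ten per designated class, with only the input-to-hidden weights adapted through a fixed random feedback matrix while the output weights stay frozen. Afterwards, the within-class similarities of the hidden representations should come out clearly higher than the between-class ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_cls = 784, 800, 10

X = (rng.random((100, n_in)) < 0.5).astype(float)   # 100 random binary vectors
labels = np.repeat(np.arange(n_cls), 10)             # first 10 are class 0, next 10 class 1, ...
targets = np.eye(n_cls)[labels]                       # one-hot targets

W1 = rng.standard_normal((n_hid, n_in)) * 0.01        # adaptive input-to-hidden weights
W2 = rng.standard_normal((n_cls, n_hid)) * 0.01       # frozen hidden-to-output weights
B  = rng.standard_normal((n_hid, n_cls)) * 0.01       # fixed random feedback weights

lr = 0.01
for _ in range(50):                                    # a few passes over the 100 vectors
    for x, t in zip(X, targets):
        z = W1 @ x
        h = np.maximum(z, 0.0)
        logits = W2 @ h
        p = np.exp(logits - logits.max()); p /= p.sum()
        e = p - t                                      # true error at the output
        d = (B @ e) * (z > 0)                          # fake derivative through random feedback
        W1 -= lr * np.outer(d, x)                      # only W1 adapts; W2 never changes

H = np.maximum(X @ W1.T, 0.0)                          # hidden representations after training
H = H - H.mean(axis=0, keepdims=True)                  # subtract the mean hidden vector
norms = np.linalg.norm(H, axis=1)
C = (H @ H.T) / (np.outer(norms, norms) + 1e-9)        # similarities between the 100 items

same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(100, dtype=bool)
print("within-class similarity: ", C[same & off_diag].mean())
print("between-class similarity:", C[~same].mean())
```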

01:02:56Geoffrey Hinton

So once the activities in the last hidden layer are very similar for members of the same class, the slopes of the nonlinearities will be very similar for members of the same class too. And so now when you take the output error, put it through one random matrix, multiply by the slopes of the nonlinearities (which are the same for all members of the same class), and put it through another random matrix, you'll get very similar error vectors in the layer below. So it's not just in the last hidden layer that this works. As soon as that layer begins to get its act together, the earlier ones are learning too. And so you can actually learn lots of layers of representation that make the representations of inputs designated as the same class progressively more similar, and the representations of inputs from different classes progressively more different, without ever decreasing the error at the output.

01:03:46Geoffrey Hinton

Okay, so just to summarize: the fact that neurons send spikes rather than real numbers isn't a problem, that's just because you've got a huge model with not much data and having lots of noise is the right thing to do. By representing error derivatives as temporal derivatives, you can get back propagation to happen automatically. If you've got autoencoders—if you first learned a bunch of autoencoders so that you reconstruct at the layer below—and then you start changing the signal that's doing the reconstruction, watch how the reconstruction changes.

01:04:22Geoffrey Hinton

And actually, Spike Timing Dependent Plasticity is what you'd expect to see if you were using this scheme for encoding error derivatives. If the rate of change of the underlying firing rate was what represented the error derivative, you'd have to apply a derivative filter to the spike train and use the output of that derivative filter to control the learning. That's Spike Timing Dependent Plasticity.
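
One crude way to write that down (purely my illustration, with made-up rates and time constants): estimate the rate of change of the post-synaptic firing rate by applying a derivative-like filter to the spike train, a fast smoother minus a slow one, and let the weight change be the pre-synaptic trace times that filtered signal.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                      # time steps, say milliseconds

rate = 0.05 + 0.20 * (np.arange(T) > 500)     # underlying post-synaptic rate jumps up at t = 500
post = (rng.random(T) < rate).astype(float)   # post-synaptic spike train
pre  = (rng.random(T) < 0.05).astype(float)   # pre-synaptic spike train

def ema(x, tau):
    """Causal exponential moving average of a spike train."""
    out, acc, a = np.zeros_like(x), 0.0, np.exp(-1.0 / tau)
    for i, v in enumerate(x):
        acc = a * acc + (1.0 - a) * v
        out[i] = acc
    return out

# derivative filter on the post-synaptic spike train: a fast smoother minus a slow one
rate_change = ema(post, tau=20.0) - ema(post, tau=100.0)
print(rate_change[400:500].mean(), rate_change[550:650].mean())  # near zero before the jump, clearly positive after

# the learning rule would then be: weight change proportional to the pre-synaptic trace times the
# filtered post-synaptic train, so a pre spike followed by a post-synaptic rate increase tends
# to strengthen the synapse, which is the STDP-like behaviour being described
dw = np.sum(ema(pre, tau=20.0) * rate_change)
```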

01:04:44Geoffrey Hinton

And the problem that caused me to give up before, which is that you don't have top-down connections that are the transpose of the bottom-up connections, well it works anyway, because it's... contrarian. It's called noise cancellation. Okay, that's how you can define it. Okay, I'm done.

01:05:32Audience Question

Excellent presentation. I just wanted to ask a question about representing these two states, the derivative and the activation. In actual neurons there's a biochemical side: proteins like calmodulin are stored locally in the synapse or right before the synapse, in the pre-synaptic neuron and in the post-synaptic neuron. The pre-synaptic neuron fires and it's incident on the post-synaptic neuron. The state of calmodulin biases different biochemical cascades, and it's that state of the protein, changing over time, that's thought to be one of the drivers of long-term plasticity. In other words, it's the variable that's stored. In your model, are you sort of approximating these sort of biochemical cascades...?

01:06:22Geoffrey Hinton

Okay, so the model we've got at present, we haven't really filled in all the details of this bit. The reason I didn't show you a sort of complete simulation of something using entirely spikes and learning and doing back-prop is because, as you could probably tell from the presentation, there's still a lot of wiggle room left in exactly how you organize the forward pass and then these two backwards passes. And I'd be very interested to read about the biochemical mechanisms. That's what we want to implement, but I think there's still quite a few ways of implementing that.

01:07:07Audience Question

So you mentioned in the first part of the talk a regime of learning having 10^14 parameters and 10^9 data samples. So do you think there will be a new theory out there for this regime of models, and so for different... what kind of...

01:07:29Geoffrey Hinton

So I, I mean, if I understood the question right, are we going to get sort of new kinds of learning algorithms for this funny regime where you assume computation is almost free, you assume you have a reasonable size data set, but now because computation is cheap and because you care, you're going to apply a huge amount of computation and have a vast model in the hope that you can generalize better? Yeah, I think that's a different regime. It's a regime that's hardly been explored.

01:07:54Geoffrey Hinton

So, let me say a bit about MNIST. For MNIST you've got 60,000 training examples. So you can train a model with 100,000 connections and it works okay. You train a model with a million connections, it works better. Okay, so you train a model with 10 million connections and a good regularizer now, and that works, that generalizes better. Okay, so now you train a model with 100 million connections and a good regularizer, that generalizes even a little bit better. If you were to go to half a billion connections, that would be about the same ratio of parameters to training cases that the brain has. But 100 million is getting close, and the point is that for 100 million parameters and 60,000 training cases, a statistician would have told you you're completely insane, it will overfit horribly and it will never generalize. And that's just because they didn't have really strong regularizers. It turns out it's better to use more parameters and a strong regularizer.

01:09:00Audience Question

You have certainly persuaded me that the brain could do back propagation. You suggested that it worked so well and chances are that evolution would have found that. But do you think the brain actually does do back propagation, or do you think it's got other tricks up its sleeve which we still don't know?

01:09:21Geoffrey Hinton

I don't know. I just object to neuroscientists saying it couldn't possibly do it. And the arguments they give you... I think what's going on, or what was going on until fairly recently, is that they had all the arguments I gave plus another argument, which was, "And anyway back propagation isn't that great, you know, Support Vector Machines work better. So we don't really need to go into this argument do we, because back propagation clearly isn't the answer to everything."

01:09:47Geoffrey Hinton

But once you've established that back propagation is actually the answer to everything, then it becomes interesting about whether the brain really could do it. And then you start looking at these arguments and you notice all of the arguments seem like quite good arguments, but none of them are insuperable. And so if you really think the brain's got a big motivation for doing it, then maybe it can. But of course, there may be some other algorithm that's better than back propagation. And of course when you're using these random backwards weights, that's not exactly back propagation anymore. It's in the same ballpark, but it's not actually back propagation.

01:10:27Geoffrey Hinton

So my feeling is it's probably using something like back propagation that's maybe not exactly back propagation, because that's what works. And the sort of more general point is that I think there has to be some way in which feature detectors early in the sensory pathway get information about their effects later on, so that they can adapt to be more useful later on. It seems crazy not to do that, and that's sort of what back propagation does. And there's other ways of doing that that aren't quite back propagation, like using feedback alignment. That's what I really believe in, that somehow the information is getting back so these things are adapting to be useful. Now it may be that most of the adaptation is just so they can model what's below them, it's mostly unsupervised, but there has to be some element of adapting to be useful.

01:11:32Audience Question

Terry Sejnowski taught this computational neuroscience class where he taught us to model quanta with discrete packets of neurotransmitter being released from a pre-synaptic neuron using a Poisson distribution. I was intrigued by your Poisson distribution as a way of sampling error gradients instead of using things like Dropout. I was wondering if there was any relationship between these two ideas.

01:11:57Geoffrey Hinton

I guess the relationship is simply that if neurons are behaving somewhat Poisson-like, a lot of that's coming from these discrete quanta, because that really is a Poisson process. So a lot of it's in the amount of transmitter release that goes on, we know that's very noisy. If you inject charge straight into a neuron, it's much less noisy.

01:12:25Audience Question

Is there any difference between the kinds of errors that come out from the real back propagation, the original one, and the ones that come out of this propagation to make the output be the same as the input?

01:12:42Geoffrey Hinton

It's a good question and I don't think anybody's really looked at that. This stuff is all fairly new, and we've mostly been concerned with why on earth it works at all. What's going on? Obviously, if there was some significant difference between the kinds of errors they make, you might look for that and see which way people go. But I don't think anybody's really looked at it; they've just looked at the error rate.

01:13:13Audience Question

It seems like this makes the prediction that the brain is, or that cortical layers are filled with autoencoders kind of at every layer. Do you know if that's actually true? That between layers it's an autoencoder?

01:13:25Geoffrey Hinton

No, I don't know if it's true. I think it'd be a very good idea, but if I designed it... yes.

01:13:35Audience Question

There was a recent paper saying that neural synapses encode in 4.7 bits. Have you any comment on that? This was a big deal because it's more than they thought it was before. It's 27 states or something. This is for weight—a particular connection. In some cases I understand that there are several connections to the same [neuron], but an individual synapse...

01:14:10Geoffrey Hinton

Oh, that was the stuff that Terry was involved in. Yeah, it says it's about 4.7 bits per weight. I don't really have much to say about that. It would mean, though, that you could use some pretty simple arithmetic to do that. People who want things to go fast already do this—they use, for example, 8-bit weights. Yeah.

01:14:39Audience Question

So I guess back-prop gives you the gradient and in practice people have been getting some mileage out of doing things like momentum or per-feature scaling. So do any of these things exist in [the brain]?

01:14:55Geoffrey Hinton

A more general question would be, you'd really like to sort of combine stochastic gradient descent with something more like a second order method that goes in a better direction. I mean, it's sort of crazy to go and do steepest descent, you'd like to get something more like the natural gradient. And I got very excited about the idea that actually feedback alignment isn't giving you the transpose of the forward matrix, it's giving you the pseudo-inverse. Or it's giving you something closer to the pseudo-inverse. And so it's saying what do I have to do to get back to the thing I want to reconstruct? If you train it, you'll get that.

01:15:32Geoffrey Hinton

So with the pseudo-inverse, you can figure out how I would have to change the outputs of neurons in one layer to get rid of the error at the next layer. And that's a bit more like a second order method. That is, it behaves quite differently in extreme cases when you have very high correlations of things. But I didn't manage to get anything to work where you could somehow use the fact that it's more like the pseudo-inverse to make it more second order. I got very excited about that, but I couldn't get anywhere. I haven't totally given up on it, but yeah.
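
A toy numpy check of the pseudo-inverse point (my illustration, assuming linear layers): if you fit backward weights to reconstruct the layer below from the layer above, the minimum-norm least-squares solution is the pseudo-inverse of the forward matrix, not its transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
n_below, n_above = 20, 50                    # a wider layer above, so W has a left inverse

W = rng.standard_normal((n_above, n_below))  # forward weights
X = rng.standard_normal((n_below, 5000))     # many activity vectors in the lower layer
H = W @ X                                    # the corresponding activities in the layer above

# minimum-norm least-squares fit of B so that B @ H reconstructs X
# (one solution that reconstruction training can converge to)
B = X @ np.linalg.pinv(H)

print(np.allclose(B, np.linalg.pinv(W), atol=1e-6))  # True: B is essentially pinv(W)
print(np.allclose(B, W.T, atol=1e-6))                # False: it is nothing like the transpose
```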

01:16:09Audience Question

Any evidence about the random connections as opposed to... in real life networks, is there any evidence for those kinds of edges?

01:16:31Geoffrey Hinton

I guess you mean random weights that simply don't adapt? They're random in the sense that you don't get detailed one-to-one connections, but the whole thing's full of topographic maps. And if a part of this map connects to this map, then this will send connections back to the same place roughly. So there's locality but not point-to-point locality, and so the connections are not random in that sense.

01:17:10Audience Question

Do you randomly activate the... in that case there's 25 neurons going one way, 25 the other in the middle layer. Do you randomly activate 25%?

01:17:21Geoffrey Hinton

Sorry, it was 25% of the possible connections are there. But I've also done it where you choose those from—you have topographic maps, you have a location in this layer, a location of a neuron in this layer, and then you connect it to neurons that have similar locations in the layer below. And you can choose sort of how big you make the neighborhood and how dense you make the connections. And you'd expect that to work better if you're dealing with spatially organized input because you expect that if you're dealing with an image, you want to start by processing one part of it locally.

01:17:55Geoffrey Hinton

All the connections are there from the full 400 both ways; all you're doing is randomly selecting which ones are going to be there and which ones are not. But you just do that once. So at the start of the simulation you decide which connections are there and which aren't. Now as a matter of fact, to make the programming easy, you have them all there and you let them all learn, and then for 75% of them you set the weights back to zero again, but that's just programming.
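
In numpy, the programming trick he mentions might look like this (a sketch under my own assumptions about the shapes): keep dense matrices, do dense updates, and then re-apply the sparsity mask chosen at the start so the absent 75% of connections stay at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((800, 784)) * 0.01
mask = rng.random(W.shape) < 0.25     # decided once, at the start of the simulation
W *= mask                             # the other 75% of connections start at zero

# ...inside the training loop, after computing a dense gradient dW for convenience:
dW = rng.standard_normal(W.shape)     # stand-in for a real gradient
W -= 0.01 * dW                        # dense update: every entry gets changed
W *= mask                             # then the 75% "absent" connections are set back to zero
```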

01:18:27Audience Question

Have you done anything where you played with a stochastic variation of the number of weights that you feed forward/back, so rather than having a fixed 25% some are [random]?

01:18:39Geoffrey Hinton

No, I just—I mean, we've done very few experiments on the connectivity. We tried local connectivity and that's nice for images. It's more at the stage of we're trying to understand why on earth this works at all, because on the face of it it shouldn't work.

01:18:56Audience Question

It seems like it might be just sort of adding an effect like noise, because you're not seeing everything, so preventing overfitting if possible?

01:19:07Geoffrey Hinton

Maybe, but at least if you don't have backward connections where you have forward connections, you're clearly not doing exactly back propagation. And so the problem isn't so much preventing overfitting as how can it do fitting at all.

01:19:25Audience Question

So a lot of times you know, when an animal goes to a place, the activity of a place cell will rise and then fall. Does that mean sort of before it enters the place it's the wrong place, when it leaves the place it's the right place? If you map it there, how would you interpret that?

01:19:41Geoffrey Hinton

Well, the firing rate goes up and then the firing rate goes down again. So as it's coming into that location, then... yeah. For that place, it's getting more active as it comes to that location. And so you'd want to increase the incoming weights on anything that's coming in, and then as it goes out you want to decrease the weights. And so you'd get... you'd tend to get this precession, wouldn't you? You'd tend to get the location moving back towards where it came from. In other words, it'll start predicting... actually, I haven't really thought about that issue.

01:20:40Audience Question

So most of the models are using fully connected layers, but another current success of deep learning is also built upon convolutional layers. So is there also evidence that the brain is also using mechanisms of weight sharing like those convolutional layers? And if that's the case, then is back propagation also feasible in that form?

💬 0 comments
Add to My Notes
01:21:12Geoffrey Hinton

As far as I know, there's no evidence that the brain is doing any kind of direct weight sharing. So the convolutional layers, I mean convolutional neural nets have two properties: one is they're doing the weight sharing, the other is they've got these local fields. The brain is of course using the local fields, and that gives you some advantage if you've got spatial data with local correlations.

💬 0 comments
Add to My Notes
01:21:31Geoffrey Hinton

They can do something a bit like weight sharing indirectly. Because—this is a complicated argument—you look in one place, you develop low-level features. Those give you higher-level features, which give you very high-level features which work over a bigger region. You then look in another place, and if it's not too far away, you know that the high-level stuff should be the same. So the high-level stuff can give you top-down supervision for what the low-level features should look like. And you can get sort of information... once you learn good features here, you get high-level stuff. Now I can look over here, and I've got both the input and the high-level stuff which is more or less correct if it's nearby. So I can get—I should be able to get faster learning, because I know both what the input is and what the representation should be like at the higher layer, and then learning is much easier.

💬 0 comments
Add to My Notes
01:22:26Geoffrey Hinton

So in that sense, you can get information coming from features here to transfer across to the features here, not by transporting weights, but by the fact that these features told you what the object was, for example. And knowing what the object is helps you learn these features. That's the closest I can think of that the brain could get to weight sharing.

💬 0 comments
Add to My Notes
01:22:51Audience Question

Just curious, can you repeat your complaint about why statistics can't explain how these things work so well?

💬 0 comments
Add to My Notes
01:23:01Geoffrey Hinton

Okay, my main complaint with statistics, at least in this talk, was they've studied models that are in a very different regime from the brain. They studied models where the data isn't all that high dimensional, and you have not that many parameters and not much data. That's the history of statistics. And what they call Big Data is if you have a billion training examples, that's called Big Data.

💬 0 comments
Add to My Notes
01:23:30Geoffrey Hinton

From the brain's point of view, it's got a billion training examples, or more than a billion training examples, and it's small data because it's got so many more parameters. So it's in a different regime where it can't just assume that we don't need to worry. I mean, at Google they're telling me this all the time, they're saying, "You don't need to worry about regularization, we've got so much data that is not a problem." Well that's just not true for some of these problems. They've only got a trillion examples, and if you've got 10^15 parameters that's no good, you better regularize.

💬 0 comments
Add to My Notes
01:24:04Audience Question

So in wake-sleep as well as in an algorithm like feedback alignment using temporal derivatives, it seems like the earlier layers have to wait longer to get an error signal. So do you have any ideas how the brain might be able to implement these differences in the time scales of the...?

💬 0 comments
Add to My Notes
01:24:26Geoffrey Hinton

From what little neuroscience I know, I think back in the 1960s of course... I think you'd expect the earlier feature detectors to have a longer time scale. I think if you want to learn quickly... learn what's happening later, adapting the earlier stages is a bit like changing the input. You'd want to be quite conservative about that, because changing early layers changes everything else and that's going to be slow. That's all I'd say.

💬 0 comments
Add to My Notes