f2

Predictive coding and how the dynamical Bayesian brain achieves specialization and integration

Authors note: this marks the first in a new series of journal-entry style posts in which I write freely about things I like to think about. The style is meant to be informal and off the cuff, building towards a sort of socratic dialogue. Please feel free to argue or debate any point you like. These are meant to serve as exercises in writing and thinking,  to improve the quality of both and lay groundwork for future papers. 

My wife Francesca and I are spending the winter holidays vacationing in the north Italian countryside with her family. Today in our free time our discussions turned to how predictive coding and generative models can accomplish the multimodal perception that characterizes the brain. To this end Francesca asked a question we found particularly thought provoking: if the brain at all levels is only communicating forward what is not predicted (prediction error), how can you explain the functional specialization that characterizes the different senses? For example, if each sensory hierarchy is only communicating prediction errors, what explains their unique specialization in terms of e.g. the frequency, intensity, or quality of sensory inputs? Put another way, how can the different sensations be represented, if the entire brain is only communicating in one format?

We found this quite interesting, as it seems straightforward and yet the answer lies at the very basis of predictive coding schemes. To arrive at an answer we first had to lay a little groundwork in terms of information theory and basic neurobiology. What follows is a grossly oversimplified account of the basic neurobiology of perception, which serves only as a kind of philosopher’s toy example to consider the question. Please feel free to correct any gross misunderstandings.

To begin, it is clear at least according to Shannon’s theory of information, that any sensory property can be encoded in a simple system of ones and zeros (or nerve impulses). Frequency, time, intensity, and so on can all be re-described in terms of a simplistic encoding scheme. If this were not the case then modern television wouldn’t work. Second, each sensory hierarchy presumably  begins with a sensory effector, which directly transduces physical fluctuations into a neuronal code. For example, in the auditory hierarchy the cochlea contains small hairs that vibrate only to a particular frequency of sound wave. This vibration, through a complex neuro-mechanic relay, results in a tonitopic depolarization of first order neurons in the spiral ganglion.

f1
The human cochlea, a fascinating neural-mechanic apparatus to directly transduce air vibrations into neural representations.

It is here at the first-order neuron where the hierarchy presumably begins, and also where functional specialization becomes possible. It seems to us that predictive coding should say that the first neuron is simply predicting a particular pattern of inputs, which correspond directly to an expected external physical property. To try and give a toy example, say we present the brain with a series of tones, which reliably increase in frequency at 1 Hz intervals. At the lowest level the neuron will fire at a constant rate if the frequency at interval n is 1 greater than the previous interval, and will fire more or less if the frequency is greater or less than this basic expectation, creating a positive or negative prediction error (remember that the neuron should only alter its firing pattern if something unexpected happens). Since frequency here is being signaled directly by the mechanical vibration of the cochlear hairs; the first order neuron is simply predicting which frequency will be signaled. More realistically, each sensory neuron is probably only predicting whether or not a particular frequency will be signaled – we know from neurobiology that low-level neurons are basically tuned to a particular sensory feature, whereas higher level neurons encode receptive fields across multiple neurons or features. All this is to say that the first-order neuron is specialized for frequency because all it can predict is frequency; the only afferent input is the direct result of sensory transduction. The point here is that specialization in each sensory system arises in virtue of the fact that the inputs correspond directly to a physical property.

f2
Presumably, first order neurons predict the presence or absence of a particular, specialized sensory feature owing to their input. Credit: wikipedia.

Now, as one ascends higher in the hierarchy, each subsequent level is predicting the activity of the previous. The first-order neuron predicts whether a given frequency is presented, the second perhaps predicts if a receptive field is activated across several similarly tuned neurons, the third predicts a particular temporal pattern across multiple receptive fields, and so on. Each subsequent level is predicting a “hyperprior” encoding a higher order feature of the previous level. Eventually we get to a level where the prediction is no longer bound to a single sensory domain, but instead has to do with complex, non-linear interactions between multiple features. A parietal neuron thus might predict that an object in the world is a bird if it sings at a particular frequency and has a particular bodily shape.

f3
The motif of hierarchical message passing which encompasses the nervous system, according the the Free Energy principle.

If this general scheme is correct, then according to hierarchical predictive coding functional specialization primarily arises in virtue of the fact that at the lowest level each hierarchy is receiving inputs that strictly correspond to a particular feature. The cochlea is picking up fluctuations in air vibration (sound), the retina is picking up fluctuations in light frequency (light), and the skin is picking up changes in thermal amplitude and tactile frequency (touch). The specialization of each system is due to the fact that each is attempting to predict higher and higher order properties of those low-level inputs, which are by definition particular to a given sensory domain. Any further specialization in the hierarchy must then arise from the fact that higher levels of the brain predict inputs from multiple sensory systems – we might find multimodal object-related areas simply because the best hyper-prior governing nonlinear relationships between frequency and shape is an amodal or cross-model object. The actual etiology of higher-level modules is a bit more complicate than this, and requires an appeal to evolution to explain in detail, but we felt this was a generally sufficient explanation of specialization.

Nonlinearity of the world and perception: prediction as integration

At this point, we felt like we had some insight into how predictive coding can explain functional specialization without needing to appeal to special classes of cortical neurons for each sensation. Beyond the sensory effectors, the function of each system can be realized simply by means of a canonical, hierarchical prediction of each layered input, right down to the point of neurons which predict which frequency will be signaled. However, something still was missing, prompting Francesca to ask – how can this scheme explain the coherent, multi-modal, integrated perception, which characterizes conscious experience?

Indeed, we certainly do not experience perception as a series of nested predictions. All of the aforementioned machinery functions seamlessly beyond the point of awareness. In phenomenology a way to describe such influences is as being prenoetic (before knowing; see also prereflective); i.e. things that influence conscious experience without themselves appearing in experience. How then can predictive coding explain the transition from segregated, feature specific predictions to the unified percept we experience?

f4
When we arrange sensory hierarchies laterally, we see the “markov blanket” structure of the brain emerge. Each level predicts the control parameters of subsequent levels. In this way integration arises naturally from the predictive brain.

As you might guess, we already hinted at part of the answer. Imagine if instead of picturing each sensory hierarchy as an isolated pyramid, we instead arrange them such that each level is parallel to its equivalent in the ‘neighboring’ hierarchy. On this view, we can see that relatively early in each hierarchy you arrive at multi-sensory neurons that are predicting conjoint expectations over multiple sensory inputs. Conveniently, this observation matches what we actually know about the brain; audition, touch, and vision all converge in tempo-parietal association areas.

Perceptual integration is thus achieved as easily as specialization; it arises from the fact that each level predicts a hyperprior on the previous level. As one moves upwards through the hierarchy, this means that each level predicts more integrated, abstract, amodal entities. Association areas don’t predict just that a certain sight or sound will appear, but instead encode a joint expectation across both (or all) modalities. Just like the fusiform face area predicts complex, nonlinear conjunctions of lower-level visual features, multimodal areas predict nonlinear interactions between the senses.

f5
A half-cat half post, or a cat behind a post? The deep convolutional nature of the brain helps us solve this and similar nonlinear problems.

It is this nonlinearity that makes predictive schemes so powerful and attractive. To understand why, consider the task the brain must solve to be useful. Sensory impressions are not generated by simple linear inputs; certainly for perception to be useful to an organism it must process the world at a level that is relevant for that organism. This is the world of objects, persons, and things, not disjointed, individual sensory properties. When I watch a cat walk behind a fence, I don’t perceive it as two halves of a cat and a fence post, but rather as a cat hidden behind a fence. These kinds of nonlinear interactions between objects and properties of the world are ubiquitous in perception; the brain must solve not for the immediately available sensory inputs but rather the complex hidden causes underlying them. This is achieved in a similar manner to a deep convolutional network; each level performs the same canonical prediction, yet together the hierarchy will extract the best-hidden features to explain the complex interactions that produce physical sensations. In this way the predictive brain summersaults the binding problem of perception; perception is integrated precisely because conjoint hypothesis are better, more useful explanations than discrete ones. As long as the network has sufficient hierarchical depth, it will always arrive at these complex representations. It’s worth noting we can observe the flip-side of this process in common visual illusions, where the higher-order percept or prior “fills in” our actual sensory experience (e.g. when we perceive a convex circle as being lit from above).

teaser-convexconcave-01
Our higher-level, integrative priors “fill in” our perception.

Beating the homunculus: the dynamic, enactive Bayesian brain

Feeling satisfied with this, Francesca and I concluded our fun holiday discussion by thinking about some common misunderstandings this scheme might lead one into. For example, the notion of hierarchical prediction explored above might lead one to expect that there has to be a “top” level, a kind of super-homunculus who sits in the prefrontal cortex, predicting the entire sensorium. This would be an impossible solution; how could any subsystem of the brain possibly predict the entire activity of the rest? And wouldn’t that level itself need to be predicted, to be realised in perception, leading to infinite regress? Luckily the intuition that these myriad hypotheses must “come together” fundamentally misunderstands the Bayesian brain.

Remember that each level is only predicting the activity of that before it. The integrative parietal neuron is not predicting the exact sensory input at the retina; rather it is only predicting what pattern of inputs it should receive if the sensory input is an apple, or a bat, or whatever. The entire scheme is linked up this way; the individual units are just stupid predictors of immediate input. It is only when you link them all up together in a deep network, that the brain can recapitulate the complex web of causal interactions that make up the world.

This point cannot be stressed enough: predictive coding is not a localizationist enterprise. Perception does not come about because a magical brain area inverts an entire world model. It comes about in virtue of the distributed, dynamic activity of the entire brain as it constantly attempts to minimize prediction error across all levels. Ultimately the “model” is not contained “anywhere” in the brain; the entire brain itself, and the full network of connection weights, is itself the model of the world. The power to predict complex nonlinear sensory causes arises because the best overall pattern of interactions will be that which most accurately (or usefully) explains sensory inputs and the complex web of interactions which causes them. You might rephrase the famous saying as “the brain is it’s own best model of the world”.

As a final consideration, it is worth noting some misconceptions may arise from the way we ourselves perform Bayesian statistics. As an experimenter, I formalize a discrete hypothesis (or set of hypotheses) about something and then invert that model to explain data in a single step. In the brain however the “inversion” is just the constant interplay of input and feedback across the nervous system at all levels. In fact, under this distributed view (at least according to the Free Energy Principle), neural computation is deeply embodied, as actions themselves complete the inferential flow to minimize error. Thus just like neural feedback, actions function as  ‘predictions’, generated by the inferential mechanism to render the world more sensible to our predictions. This ultimately minimises prediction error just as internal model updates do, albeit in a different ‘direction of fit’ (world to model, instead of model to world). In this way the ‘model’ is distributed across the brain and body; actions themselves are as much a part of the computation as the brain itself and constitute a form of “active inference”. In fact, if one extends their view to evolution, the morphological shape of the organism is itself a kind of prior, predicting the kinds of sensations, environments, and actions the agent is likely to inhabit. This intriguing idea will be the subject of a future blog post.

Conclusion

We feel this is an extremely exciting view of the brain. The idea that an organism can achieve complex intelligence simply by embedding a simple repetitive motif within a dynamical body seems to us to be a fundamentally novel approach to the mind. In future posts and papers, we hope to further explore the notions introduced here, considering questions about “where” these embodied priors come from and what they mean for the brain, as well as the role of precision in integration.

Questions? Comments? Feel like i’m an idiot? Sound off in the comments!

Further Reading:

Brown, H., Adams, R. A., Parees, I., Edwards, M., & Friston, K. (2013). Active inference, sensory attenuation and illusions. Cognitive Processing, 14(4), 411–427. http://doi.org/10.1007/s10339-013-0571-3
Feldman, H., & Friston, K. J. (2010). Attention, Uncertainty, and Free-Energy. Frontiers in Human Neuroscience, 4. http://doi.org/10.3389/fnhum.2010.00215
Friston, K., Adams, R. A., Perrinet, L., & Breakspear, M. (2012). Perceptions as Hypotheses: Saccades as Experiments. Frontiers in Psychology, 3. http://doi.org/10.3389/fpsyg.2012.00151
Friston, K., & Kiebel, S. (2009). Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1521), 1211–1221. http://doi.org/10.1098/rstb.2008.0300
Friston, K., Thornton, C., & Clark, A. (2012). Free-Energy Minimization and the Dark-Room Problem. Frontiers in Psychology, 3. http://doi.org/10.3389/fpsyg.2012.00130
Moran, R. J., Campo, P., Symmonds, M., Stephan, K. E., Dolan, R. J., & Friston, K. J. (2013). Free Energy, Precision and Learning: The Role of Cholinergic Neuromodulation. The Journal of Neuroscience, 33(19), 8227–8236. http://doi.org/10.1523/JNEUROSCI.4255-12.2013

 

10 thoughts on “Predictive coding and how the dynamical Bayesian brain achieves specialization and integration

  1. Nice post Micah – I like the format! I see the point re: overarching models (although I think Karl and Chris explicitly argue for that in “A Duet For One”). One helpful thing, I think, is the idea that – since we impact the world, we must model that impact, and as such, we model ourselves as existing. I imagine this would entail some integration – right? That is what Blanke has argued – that we model ourselves as existing, and having done so, we believe we are the contents of our model – perhaps this is what Karl and Chris were referring to?

  2. Nice. I know a bit about Bayes, but not very much about predictive coding. Papers I have glanced at so far seem to be just either words, or impenetrable maths. Do you know of any papers with simple, concrete implementations that I could go code up to actually understand how it works. Or is the whole enterprise not at the implementation level yet?

    • Interesting question! So in SPM there is the MDP scheme which will let you specify and invert any of the models in Karl’s papers. I haven’t played with them myself but the basic idea is you specify matrices corresponding to the decision states, inputs, and actions. The rest is then implemented according to an free energy minimization scheme and my colleagues in Karl’s group tell me the code is pretty approachable once you get the idea. Most of the papers I linked have models that are exemplified in the MDP examples (all bundled with SPM 12) so you can play with then that way.

      Now, those are all variational Bayesian schemes. More classical predictive coding I think would be something like a rescorla-wagner algorithm. I don’t know of any online examples but if you are savy it shouldn’t be too difficult to implement in Matlab from Wikipedia or papers.

      I think it would be really cool if someone made a simple predictive coding simulation and out it on the web. That would make it much more accessible and a nice way to teach modelling and simulation!

    • Oh, I forgot the hierarchical Gaussian Filter toolbox by Chris Mathys! The toolbox is very easy to use, is coded very cleanly, and implements a three or n-level Bayesian inference with prediction errors and precision at all levels. It’s awesome!

  3. Could you expand on the positive vs. negative errors, perhaps? My qualm with these models is that they don’t tend to unpack sufficiently the reasons why an expected but omitted stimulus will lead to an early sensory response. And this response seems to look like the omitted stimulus.

    So when you expect A and receive B, you could get more neural activity at the population level than when you expect A and receive A, simply because there is also a faint, brief, neural pattern of B showing up – or at least that’s my hunch. This could still work within predictive coding if negative errors led to the same thing as positive errors, but I don’t know what a negative error would be, neurally speaking. Been scratching my head over this for a while.

  4. I wandered if someone would write about the somewhat fractal structure of the brain. Temporal and spatial summation, inhibition and excitation , competition and cooperation : pairs of elementary rules that seem to build a rather fantastic structure really.

    The diagram showing how there is not a control homunculus made things go click for me. A nice Xmas present!

  5. Excellent post. I’m a circuit neuroscientist, and my background has mostly been mapping cell-type specific connectivity in rodent slices, but predictive coding and bayesian inference have really caught my attention recently. I’m doing some initial reading on it for a tentative grant (which is how I came across your post!) and would appreciate your advice on something.

    I feel it would be extremely interesting to use a comparative approach drawing on the richness of human literature and resolution possible in rodent models, to tease out the single cell and circuit level implementations of these processes in perception and motor output.

    The initial pubmed searches I’ve done suggest there’s not a lot of work examining bayesian inference and prediction errors in perception/action in animal models. Am I massively missing something, as a newcomer? Or has this not been looked into that thoroughly?

    Thanks in advance, and please keep these posts coming after the holiday!

  6. A nice summary – many thanks. I still think that action is a bit of a problem for the Free Energy Principle: your post is a case in point, tracing an elegant if vague story about perception, with just a sentence or two devoted to action. I still have no idea at all how the FEP might be applied to speech production, for example. Even the kind of path integration that ants do (random walk until they find food, then direct return-to-base) just doesn’t seem to be obviously explainable under the FEP. It just doesn’t feel like a ‘prediction error thing’, in other words, though I do accept that this kind of intuition is often misleading. In other words, and despite the efforts made to push the square peg into the round hole, I would suggest that action, and complex action in particular, is still a real problem for the FEP – though only if you want the principle to explain EVERYTHING, as many seem to.

    Perhaps these behavioural things have been explained in an FEP way and I’ve missed it. Or perhaps they haven’t been but could be – and if you can do it, I’d be grateful. But if they haven’t been explained yet by the FEP (and I don’t think they have), I would caution against the conclusion that organisms can be intelligent just by embedding a single, ‘repetitive motif’ in a dynamic body. This is a seductive conclusion, because it promises that a kind of simplicity can be found behind the apparent complexity we grapple with every day. I’d like it to be true, but don’t think that it is true, or likely to be true.

    To the man with the hammer…?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s