Drilling down into a subcognitive architecture, I have suggested that the role of the perceptual system is to populate both egocentric (body-oriented) and allocentric (externally centered) spatial maps based on the incoming sensory stream. In this post, I propose a decomposition of the perceptual system and an account of how sensory stimuli flow through it and are transformed into objects and places. The distinction between objects and places reflects the twofold output of the perceptual system: objects flow into the egocentric spatial system, centered in the parietal lobe, to support short-term motor planning, while places stream into the allocentric spatial system to enable navigation and long-range planning.
Early Bifurcation in Sensory Processing
As a caveat, the division between egocentric and allocentric maps is not necessarily a strong biological model; I am proposing the separation more to enable clean discussion than to assert a neuroanatomical claim. As best I can tell, objects are positioned egocentrically in the brain, but this positioning may not have an absolute character. The egocentric position of an object for grasping may be tracked relative to the hand, while an obstacle to be avoided might be tracked relative to the torso. The positions of the object and the obstacle need not be integrated at any observable level of neural processing to form a single egocentric frame. Rather, the principle is that they could be integrated if desired, for example, if the hand grasping an object also had to avoid the obstacle during the grasp.
There is the further question of how the allocentric and egocentric systems are integrated. Again, the best evidence, as I discussed here, is that the location of objects is stored in the egocentric system implemented in the parietal lobe, and the relationship between place and object occurs through an interaction of the two systems (presumably through the retrosplenial cortex). That would mean that we should think of objects as being stored in the egocentric system and places in the allocentric system. Nonetheless the two systems should be able to resolve against each other at the stage of scene processing (see Figure 2). An object presents a surface that can become a navigational boundary, and the representation of such boundaries is part of the function of the allocentric system.
In order to focus on subcognition rather than cognition, I have made the assumption that the organism has a single territory that is completely mapped by the allocentric system. This assumption means that places in the current context are really regions within the territory, not places in the human sense of an abstract description of many possible physical locations that share core features (as in a gym, a restaurant, or a school). The latter notion of place will reappear when we address cognition itself, because I believe that place in the abstract sense is in fact the cornerstone of how ideas are represented cognitively. But for now a place will be more restricted.
The net outcome of making a strong distinction between places and objects is that sensory input should be bifurcated quite early into allocentric versus egocentric processing streams. There is plenty of evidence that this is in fact the case. In our discussion of the sensory system, we saw that the olfactory system does not connect directly into the parietal lobe at all, but rather feeds into the entorhinal cortex, which is part of the allocentric navigation system, as well as the orbitofrontal cortex. The laterodorsal thalamic nucleus relays raw vision to the hippocampus, which is involved in memory, and to the retrosplenial cortex (Perry & Mitchell, 2019), which is involved in the knowledge of specific places (Epstein, 2008). Hearing reaches the hippocampus through pathways that may not involve the auditory cortex (Xiao et al., 2018). The vestibular (balance) system connects to the entorhinal cortex, at least in rodents (Jacob et al., 2014).
The picture that emerges is an early divergence of sensory information into allocentric and egocentric spatial pathways. On the allocentric side, smell, vision, hearing, and touch arrive through various paths to the hippocampal formation, including the entorhinal cortex, the hippocampus, the retrosplenial cortex, the parahippocampal place area, and parts of the frontal lobe (e.g. the frontal eye fields for vision and the orbitofrontal cortex for smell). Sensory processing for the allocentric system seems to discard topographical information from the senses, preferring other mechanisms for holistic localization. On the egocentric side, most senses pass through a primary sensory cortex that is topographically organized and ultimately integrated into sensorimotor-focused spatial representations in the parietal lobe. The major exception is smell, which is not topographically organized beyond the olfactory bulb and, as far as I can tell for now, does not seem to feed directly into the parietal lobe. The two systems are ultimately integrated at later stages of processing, and there are some direct connections between them.
Based on this early divergence, I will first discuss processing in the egocentric system, and then return to the allocentric system.
The Egocentric Spatial System
The egocentric spatial system is responsible for recognizing and localizing objects with respect to the self. Once its processing is complete, the position and nature of objects relative to the self are known and ready to be integrated into a complete model of the whole scene around the organism.
What is an object again?
The term object in an informal sense indicates a self-contained thing that can be picked up and manipulated independently of its surroundings. This is not what I mean by object for the purposes of this architecture. As discussed previously, by object, I am referring to those things that are relevant for satisfying the needs of the organism. This includes the objects of drives, such as food for hunger and water for thirst, but also obstacles that must be planned around, such as roads, walls, buildings, and trees, and even other agents, such as people or animals. Perhaps the least object-like “objects” would be things like fog, water, or smoke, all of which are more substances than things. An understanding of what I mean can be achieved by imagining that the perceptual system cuts up the sensory input into segments, with each extracted segment or patch of reality becoming a functional output that I call an object. Perhaps percept would be a better word for this concept, but that word generally has a lower-level meaning, each percept being an aspect of an object rather than the object itself. Other candidates include segment or region, but these don’t work well either. So I’m going to continue talking about objects, and if you have a better suggestion, please leave it in the comments.
Constructing Objects at Locations
The general architecture of egocentric spatial filtering is that each sensory system has a primary, secondary, and sometimes tertiary cortex, each of which is organized topographically (retinotopically, tonotopically, or somatotopically, as described here). Thus topographical organization is a key architectural principle of egocentric sensory processing, as is sequential filtering through primary and secondary layers. A third principle is the separate processing of identity and location, known as the what-where pathways. In general, what pathways pass downward through the temporal lobe and where pathways ascend into the parietal lobe.
In the last post, I suggested an abstraction of cortical regions as functions from 3-tensors to 3-tensors, which mirrors much of deep learning research from the prior decade. The primary and secondary sensory cortices are made up of a collection of regions. For example, the visual system of the macaque monkey proceeds through a sequence of regions known as V1, V2, V3, V4, V5, and V6, each of which extracts progressively more complicated information out of the visual stream. Although the regions of the visual system that are closest to the input do process the visual stream sequentially, the processing stream eventually ramifies and runs in parallel.
To model this pattern, I propose to treat each cortical region as an attribute filter that converts its inputs into a representation specific to the region that I will equate with an attribute, such as color, shape, and motion for vision or timbre, pitch, and rhythm for sound. Each attribute filter preserves topographical organization in the first two dimensions of the tensor, so that for an output y from input x, the neural code (the third dimension of the tensor) at output y[i,j] represents an elaboration or transformation of the input x[i,j], hence retaining location either in the visual field, in the auditory pitch and timbre map, or on the body. Thus attributes throughout the architecture retain a localized map structure.
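To make this abstraction concrete, here is a minimal sketch in Python (with NumPy) of an attribute filter as a topography-preserving map from a 3-tensor to a 3-tensor. The class name `AttributeFilter`, the per-location linear recoding, and the rectification are my own placeholder choices, not claims about how cortex actually computes.

```python
import numpy as np

class AttributeFilter:
    """Hypothetical model of one cortical region: a map from a 3-tensor
    (height, width, channels_in) to a 3-tensor (height, width, channels_out)
    that preserves the topographic layout of the first two dimensions."""

    def __init__(self, channels_in: int, channels_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Per-location linear recoding; a real region would be learned and nonlinear.
        self.weights = rng.standard_normal((channels_in, channels_out)) * 0.1

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (H, W, C_in) -> y: (H, W, C_out); y[i, j] is computed from x[i, j],
        # so position in the visual field, tonotopic map, or body map is retained.
        y = np.tensordot(x, self.weights, axes=([2], [0]))
        return np.maximum(y, 0.0)  # simple rectification as a stand-in for a neural code

# Example: a "color"-like attribute computed from a toy retinotopic input.
retina = np.random.rand(64, 64, 3)
color_map = AttributeFilter(3, 8)(retina)
print(color_map.shape)  # (64, 64, 8)
```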
These attribute filters are arranged into a sensory filter graph that represents the entire primary, secondary, and tertiary sensory cortex. This graph is formed by following the connections of input and output among the cortical regions involved, and it represents the computational flow of the sensory stream. Furthermore, certain regions in the graph are to be designated as output attributes, meaning that they have outbound connections to regions that are not within the same sensory cortex. I require that every attribute in the filter graph should have some output; that is, there should not be terminal cortical regions in the sensory cortex where processing terminates without being passed to any other regions. This sensory filter graph represents all processing performed on a particular sensory input without reference to any other sensory inputs.
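The filter graph itself can be sketched in the same spirit. The snippet below (reusing the `AttributeFilter` class from the previous sketch) wires a few hypothetical regions into a small directed graph with designated output attributes; the region names and connectivity are illustrative only, not a claim about actual visual cortex wiring.

```python
import numpy as np

class SensoryFilterGraph:
    """Toy sensory filter graph: nodes are attribute filters, edges give the
    computational flow, and 'output' nodes have outbound connections that
    leave this sensory cortex."""

    def __init__(self):
        self.filters = {}     # region name -> AttributeFilter
        self.upstream = {}    # region name -> list of upstream regions ([] = raw input)
        self.outputs = set()  # regions designated as output attributes

    def add(self, name, filt, upstream=(), is_output=False):
        self.filters[name] = filt
        self.upstream[name] = list(upstream)
        if is_output:
            self.outputs.add(name)

    def run(self, raw):
        results = {}
        # Assumes regions were added in an order consistent with their connections.
        for name, filt in self.filters.items():
            ups = self.upstream[name]
            x = raw if not ups else np.concatenate([results[u] for u in ups], axis=2)
            results[name] = filt(x)
        # Every attribute feeds something; only output attributes leave this cortex.
        return {name: results[name] for name in self.outputs}

# Hypothetical visual graph: serial early stages, then parallel attribute streams.
graph = SensoryFilterGraph()
graph.add("V1", AttributeFilter(3, 8))
graph.add("V2", AttributeFilter(8, 8), upstream=["V1"])
graph.add("color", AttributeFilter(8, 4), upstream=["V2"], is_output=True)
graph.add("motion", AttributeFilter(8, 4), upstream=["V2"], is_output=True)
outputs = graph.run(np.random.rand(64, 64, 3))
print({name: attr.shape for name, attr in outputs.items()})
```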
Association and Inter-Sense Alignment
Each of the senses presents a perspective on what is happening out in the world, but to provide value, the senses have to be integrated into motor control, which also means that the senses must be integrated together with each other. In my current view, this integration is performed with respect to spatial maps that buffer perception and isolate motor control from sensory processing. If so, then positional information from different senses must be aligned at the level of the map.
There are three major associational areas in the neocortex that take input from more than one sensory area. These are the posterior parietal cortex, the temporal association cortex, and the anterior-frontal association cortex. The frontal cortex for the most part is involved in motor planning and execution, and, as best as I can tell, the input to the frontal association cortex mostly comes from the other two association cortices, as indeed my thesis would require. So we can defer the frontal association areas to our discussion on motor control.
Of the other two association areas, the posterior parietal cortex is known to build egocentric spatial models (Whitlock, 2017), while the temporal association cortex primarily integrates hearing and vision with increasing complexity moving forward along the temporal lobe, culminating in whole scene representations at the temporal pole (Freches et al., 2020). Over time, we will look at the function of these two areas more deeply, but for now it will suffice to say that these areas represent the primary division of what-where with the parietal cortex providing the where and the temporal cortex providing the what.
Let us consider the problem of alignment more generally. How is the sense of vision to be aligned with the sense of hearing? The area of space that the eyes can see is only a portion of the area over which the ears can hear, and the visual field changes as the eyes move. The relationship between hearing and vision is thus dependent on eye position. In a perfect system, the proprioceptive sense would completely determine how to align what is heard with what is seen.
The system is of course not perfect. In the real world, our sensory localization is not especially precise. Our stereoscopic eyes are most effective for measuring depth up close; the error increases with distance. Our binaural hearing can easily be tricked into believing a sound came from the opposite direction. Smell cannot on its own distinguish intensity from distance. And our self-pose estimation is not completely accurate either. Better location estimates can be obtained by integrating across the senses. If we hear the sound of a person speaking at one coordinate and see the form of a person at a nearby coordinate, then the right position for the person should be estimated by integrating sight and sound. The alignment of hearing, vision, and the other senses is therefore not merely a deterministic function of proprioception. Rather, the identity of objects should also play into the alignment, as when we hear a bird song, see a bird shape, and thus know that the two should be at the same place out in the world.
The integration of both identity and individually sensed location means that both what and where pathways must be involved in alignment. Perhaps unsurprisingly, then, we find that proprioception, balance, touch, vision, and hearing all feed together into an area at the junction of the temporal and parietal lobes. I would suggest that it is in this area that sensory alignment might primarily be performed, based both on the stereo senses and on the identity of objects as processed up to that point.
Cross-Predicting Alignment
If hearing detects a bird song and vision detects a bird shape, then it is clear that the two should be bound together as a single object, provided that their source locations are not obviously distinct. For this purpose we might introduce a metric of coherence that takes a proposed alignment among the senses and measures how well this alignment represents the filtered sensory input.
An alignment with high coherence would simultaneously accomplish two goals: (1) attributes across senses that fit with each other should be aligned to the same position; and (2) distortion of expected position within the perceptual field based on proprioceptive and vestibular input should be minimized. The first goal says that attributes that belong together, such as the bird’s song and shape, should be unified, for example, to form a bird. A coherent alignment should be characterized by such unifications. The second goal recognizes that although proprioception and balance are imperfect, they are not random. Hence a coherent alignment should agree as much as possible with the information that these systems provide about the body’s pose relative to itself and the external world.
One way in which this might be accomplished is to predict across senses. That is, we can learn a function that predicts the filtered output of each sense from all the others given a proposed alignment. Thus for vision, we might use a proposed alignment together with the output of hearing, smell, touch, proprioception, and balance to predict the output of vision, and so on for hearing, smell, touch, proprioception, and balance. The coherence of a proposed alignment is the inverse of the cumulative error of each of these predictors. If the predictors are correct, then the error is low and coherence is high. If the predictors are incorrect, then the error is high and coherence is low. An initial proposed alignment can be computed from proprioception directly, or carried over from prior times with an adjustment for motor control, as shown in Figure 1.
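A minimal sketch of this scoring, under the assumption that each sense's filtered output has already been resampled into a common frame by the proposed alignment, might look like the following. The function name `coherence`, the mean-squared error, and the stand-in predictors are all my own simplifications.

```python
import numpy as np

def coherence(aligned, predictors):
    """Score a proposed alignment by cross-prediction.

    aligned    -- dict: sense name -> filtered output resampled into a common
                  egocentric frame under the proposed alignment (same shapes).
    predictors -- dict: sense name -> learned function predicting that sense's
                  aligned output from the concatenation of all the other senses.
    Returns a scalar that is high when each sense is well predicted by the others.
    """
    total_error = 0.0
    for sense, target in aligned.items():
        others = np.concatenate(
            [v for name, v in aligned.items() if name != sense], axis=-1)
        total_error += np.mean((predictors[sense](others) - target) ** 2)
    return 1.0 / (1.0 + total_error)   # inverse of the cumulative prediction error

# Toy usage: stand-in "predictors" that just average the other senses' channels.
def mean_predictor(others):
    return others.mean(axis=-1, keepdims=True)

aligned = {s: np.random.rand(32, 32, 1) for s in ("vision", "hearing", "touch")}
print(coherence(aligned, {s: mean_predictor for s in aligned}))
```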
Incidentally, these predictors would introduce a form of top-down feedback and provide a reason for its initial evolutionary existence.
The cross-prediction mechanism described above accounts for both coherence goals. A correct alignment will match up effective correlations among hearing, vision, touch, and smell. Such an alignment will also correctly predict proprioception and balance from the other senses, which will minimize distortions.
Defining Alignment
A cross-sense alignment should be able to assign each point in the perceptual field of a particular sense to a single point in an egocentric space. Formally, we might define an alignment as a collection of alignment functions, one per sense including at least vision, hearing, touch, and proprioception (and possibly smell), with each function mapping a particular set of sense-specific coordinates to a common egocentric spatial frame.
For the sake of argument, I will assert that the egocentric space is centered in the middle of the head. Certainly, we seem to experience the space around us as though our selves were located behind our eyes, and stimulation of the medial parietal lobe can cause an out-of-body experience, which is in essence a disturbance of the center point of alignment. The medial parietal lobe is roughly located where left and right brain alignments would have to be negotiated.
The definition of an alignment as a collection of functions is intended as a functional model, not as an anatomical claim. The model can be useful even if the brain operates somewhat differently in its details. In neural terms, the purpose of an alignment is to establish routing between the different sensory pathways so that the contents of each sense can be bound together for subsequent processing.
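As a functional sketch of that definition: an alignment is just a dictionary of per-sense coordinate maps into the shared head-centered frame. The rigid-transform parameterization below is a simplifying assumption of mine (real alignments would depend on eye, head, and body pose in more complicated ways), and all names are hypothetical.

```python
import numpy as np

def make_alignment(rotations, translations):
    """Build an alignment: one function per sense, mapping that sense's own
    coordinates into a common head-centered egocentric frame.

    rotations, translations -- dicts keyed by sense name; each sense's mapping
    is modeled here, as a simplification, as a rigid transform parameterized
    by the current pose (e.g. eye position for vision)."""
    def make_fn(R, t):
        return lambda p: R @ np.asarray(p, dtype=float) + t
    return {sense: make_fn(rotations[sense], translations[sense])
            for sense in rotations}

# Toy example: vision offset by a hypothetical eye position; hearing already head-centered.
eye_offset = np.array([0.0, 0.03, 0.07])
alignment = make_alignment(
    rotations={"vision": np.eye(3), "hearing": np.eye(3)},
    translations={"vision": eye_offset, "hearing": np.zeros(3)},
)
print(alignment["vision"]([1.0, 0.0, 0.0]))   # a seen point, expressed in the head frame
```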
Note that alignment occurs at the posterior end of the temporal lobe, and that increasingly detailed co-processing of vision and hearing occurs along the length of the temporal lobe from posterior to anterior (this is the temporal association cortex, cf. Freches et al., 2020). It appears that the alignment is established at the temporal-occipital-parietal junction, where all of proprioception, balance, touch, hearing, and vision are available, and that this alignment then allows subsequent processing of vision and hearing together. In the left temporal lobe, which typically specializes in language, this processing includes correlating spoken words to written forms as lexical entries. In the right temporal lobe, which appears to specialize in social context, the analogous processing performs tasks such as matching facial expressions to tone of voice for a coherent perception of mood. In either case, proper alignment is required before these higher-order features can be interpreted and ultimately integrated into a perception of a whole scene.
Egocentric Spatial Map of Objects
A coherent alignment binds objects to locations in an egocentric spatial model. As we have seen, this alignment enables the further elaboration of object identity and scene composition inside the temporal association cortex. The end result is a complex scene of interconnected objects at the temporal pole that are located with reference to an egocentric spatial model centered in the posterior parietal cortex. These stages of processing are shown in component form in Figure 1.
Computing the Allocentric Spatial Map
In contrast to the egocentric map, the processing that forms the allocentric map is a good deal less structured, in part because the allocentric system has the sole job of encoding where the organism is from all current sensory stimuli, whereas the egocentric system is responsible for decomposing stimuli into parts and enabling direct motor behavior.
To represent where the organism is, the allocentric system decomposes the incoming sensory inputs into two aspects, namely, the position of the organism and the identity of the place containing the organism. I address these in turn.
Observing Position
Position is encoded by the place cells of the hippocampus. These cells fire when the organism is located at a particular absolute position within the environment, as I discussed here.
In a mouse model, place cells are apparently triggered by two main sources of activity. The first source is the boundary cells of the subiculum. These boundary cells are reported to fire when the organism is at a particular angle and distance from an impassable boundary, either a wall or a dropoff. It is not yet clear to me how these boundary cells detect the diverse range of boundaries that could be present.
The other source of place cell activity is the grid cells located in the entorhinal cortex. Unfortunately, I think I misunderstood the grid cells in my earlier discussion of place, where I claimed that grid cells participated in path planning. As it turns out, grid cells are not involved in path planning. Instead, the grid cells fire in hexagonal patterns based on their input. By “fire in hexagonal patterns”, what is meant is that if we plot the firing of the neuron as the organism moves around within a place, the grid cell turns out to fire just when the organism is at locations that together form a honeycomb across the place. Different grid cells respond at different sets of locations, and when plotted, the differences appear as shifts or rotations of the honeycomb.
The grid cells exist throughout the medial entorhinal cortex (mEC), and as one traverses this cortex from top to bottom, the honeycomb patterns become larger and larger, as do the responsive regions. For a sense of scale, within rodents the smallest recorded distance between firing locations along the side of the hexagon is around 35mm on the dorsal side of the mEC, while the largest recorded is several meters across. There are likely even larger patterns at the very bottom of the mEC, and human patterns are likely to be proportionally larger than those of the rodent.
Imagine that we take all the currently active grid cells across the mEC and overlay their honeycomb firing patterns, laying small honeycombs over large honeycombs. Now we erase all the regions except those that are active for every active grid cell. If we have enough grid cells, with enough honeycomb sizes, shifts, and rotations, then the result is a particular location in the environment. It is precisely this location that is picked out by the place cells. In effect, each grid cell rules out certain locations, and by combining all grid cells, enough positions are ruled out that only one position is left. Thus the activity of the grid cells is effectively a code that targets a particular position.
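The intersection idea can be illustrated with a toy simulation. The hexagonal fields below are built from three cosine gratings at 60-degree offsets, a common idealization of grid cell firing; the spacings, phases, orientations, and threshold are arbitrary choices of mine.

```python
import numpy as np

def grid_field(shape, spacing, phase, theta):
    """Boolean map of where one toy grid cell fires: a hexagonal lattice built
    from three cosine gratings at 60-degree offsets, thresholded near the peaks."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    total = np.zeros(shape)
    for k in range(3):
        ang = theta + k * np.pi / 3
        total += np.cos(2 * np.pi / spacing *
                        ((xs - phase[0]) * np.cos(ang) + (ys - phase[1]) * np.sin(ang)))
    return total > 2.0   # fire only close to the lattice vertices

# Overlay several grid cells with different spacings, phases, and orientations;
# the set of positions consistent with all of them shrinks toward a single location.
shape = (200, 200)
fields = [
    grid_field(shape, spacing=23, phase=(3, 11), theta=0.1),
    grid_field(shape, spacing=37, phase=(8, 2), theta=0.4),
    grid_field(shape, spacing=53, phase=(15, 7), theta=0.9),
]
candidates = np.logical_and.reduce(fields)
print("positions consistent with all active grid cells:", int(candidates.sum()))
```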
The entorhinal cortex receives inputs from basically all senses, but in a non-topographic form. In many cases, this input comes directly from the thalamus, that is, without any prior processing by the primary sensory regions of the neocortex. Position can be determined from any sense at any time. The navigational system will use whatever is available and is not particularly picky.
Several models have been proposed for how the place cells might perform this computation. Fabio Anselmi provides one such computational model, in which hexagonal patterns emerge naturally from certain assumptions (Anselmi et al., 2020). I do not pretend to understand the exact mechanisms of the navigational system, but I think we can describe the architecture of such a system from the above.
In general we can model the positioning system as a function that takes inputs from all sensory inputs. These inputs are not presented topographically, so we can suppose that the 3-tensor sensory input is flattened to remove topographical structure. The position is then a function from all sensory inputs into a fixed coordinate system. For the sake of argument, I will suppose that this coordinate system is three-dimensional and that the resolution enabled within this coordinate system is determined by the responsiveness of place cells to particular points in three-dimensional space. Thus we have a function that takes all sensory inputs in a flattened form and outputs a coordinate in three dimensional space, which might be realized in a model as a 3-tensor with values of either 0 or 1 in which exactly one entry is 1.
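In code, the whole positioning function reduces to something like the following sketch, where the linear readout is merely a placeholder for whatever the grid cell and place cell circuitry actually computes; the names and grid dimensions are invented.

```python
import numpy as np

def position_from_senses(flat_inputs, readout, grid_shape=(20, 20, 5)):
    """Toy positioning function: flattened, non-topographic sensory input ->
    a one-hot 3-tensor over a fixed three-dimensional coordinate grid.

    flat_inputs -- 1-D array concatenating all senses with topography discarded.
    readout     -- matrix standing in for the grid-cell/place-cell computation.
    """
    scores = readout @ flat_inputs                   # one score per candidate position
    one_hot = np.zeros(grid_shape)
    one_hot[np.unravel_index(np.argmax(scores), grid_shape)] = 1.0
    return one_hot

# Usage with random stand-ins for the sensory code and the learned readout.
flat = np.random.rand(512)
readout = np.random.rand(20 * 20 * 5, 512)
position = position_from_senses(flat, readout)
print(position.shape, position.sum())                # (20, 20, 5) 1.0
```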
Observing Place Identity
An organism must make decisions about how to satisfy its needs, and the crucial question for place identity is whether the current place can satisfy any of the current needs of the organism.
Within this broader framework, there are two aspects of place identity. The first aspect is the categorization of place, that is, what kind of place the organism finds itself in. This categorization helps the organism make rapid decisions about the value of a place. Place categorization in humans appears to involve the parahippocampal place area. From a modeling perspective, the category of a place could be derived from the same flattened inputs as used in position detection, and thus we might introduce a function that maps these flattened inputs to a vector of neural firing activities that encodes the place category.
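A sketch of such a categorizer, with an arbitrary set of category labels and a softmax normalization that I am adding purely for illustration:

```python
import numpy as np

def place_category(flat_inputs, weights):
    """Toy place categorizer: flattened sensory input -> normalized vector of
    category activations (e.g. 'nest', 'water source', 'open field')."""
    scores = weights @ flat_inputs
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

flat = np.random.rand(512)
weights = np.random.rand(4, 512)      # four hypothetical place categories
print(place_category(flat, weights))  # activations summing to 1
```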
The second aspect of place identity is that of mapping the place to understand its contents in distinction to other places of the same kind, a function that appears to involve the retrosplenial cortex in mammals. Whereas place categorization helps the organism decide where to go, place mapping enables the organism to use the place to satisfy needs. This mapping depends on the objects within a place, and hence represents an interface between the allocentric and egocentric systems, where these objects have already been constructed in egocentric spatial terms. Mapping between the two systems can be relatively deterministic given the organism’s position, since the position of each object is merely a translation from egocentric to allocentric coordinates based on vestibular and proprioceptive information. Since motor control is executed relative to egocentric coordinates, I do not feel a need to say more about allocentric to egocentric translation at this point, except to make one further observation.
That observation is that the perceptual field of the organism is both limited and known to the organism. That is, the organism can be aware at some level that there are regions it cannot perceive well. These regions are known to both the allocentric and egocentric systems, but since the allocentric system has broader scope, it makes sense to encode this lack of knowledge relative to the allocentric system. That is, place mapping should represent which areas of the allocentric map have been explored thus far. This knowledge can be modeled as a 3-tensor representing the three-dimensional allocentric map, in which an entry is 0 if the corresponding position in the map has been explored and 1 if it has not. This 3-tensor can be updated iteratively based on the position and the aligned proprioceptive and vestibular outputs.
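A sketch of that iterative update, assuming (as a simplification of mine) a spherical perceptual range around the current position:

```python
import numpy as np

def update_unexplored(unexplored, position_idx, perceptual_range):
    """Mark as explored (0) every allocentric cell within the current perceptual
    range of the organism's position; cells outside that range keep their value.

    unexplored       -- 3-tensor, 1 = not yet explored, 0 = explored.
    position_idx     -- (i, j, k) index of the current position in the map.
    perceptual_range -- radius, in map cells, that the organism can currently perceive.
    """
    idx = np.indices(unexplored.shape)
    dist2 = sum((idx[d] - position_idx[d]) ** 2 for d in range(3))
    unexplored[dist2 <= perceptual_range ** 2] = 0
    return unexplored

# Usage: a territory that starts fully unexplored, observed from one position.
territory = np.ones((20, 20, 5))
territory = update_unexplored(territory, position_idx=(10, 10, 2), perceptual_range=3)
print("unexplored cells remaining:", int(territory.sum()))
```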
A Proposal For the Perceptual System Architecture
Putting it all together, we see a proposed component architecture for the perceptual system in Figure 2. In this architecture, there is an early separation of sensory pathways supporting the allocentric versus the egocentric spatial systems. Object processing occurs primarily via attributes constructed by sensory filters that are localized, aligned, and then merged into objects that are egocentrically positioned. Place processing occurs on a separate pathway that determines absolute position and classifies the place, integrating with the egocentric object system to create a holistic scene. In addition, the place system keeps track of unexplored regions to support exploratory motives at a later stage. The egocentric and allocentric systems are integrated at the level of whole scenes, so that information can flow between the two systems.
This architecture is of course tentative and involves many assumptions that may prove to be faulty. However, certain aspects reflect the current state of neuroscience research as I understand it, and I have annotated the figure with the brain regions that appear to be involved in each stage of processing.
Again, thank you for reading, and I look forward to covering the architecture of the motivational system in the next post!