“No man ever steps into the same river, for it is not the same river, and he is not the same man.” Thus Heraclitus described the condition of mankind as one of constant change. In machine learning, we train our machines as though the world were static, but the key task of the human mind is to make sense out of a dynamic, ever-changing world that rejects our most strenuous efforts to hold it still.
In Subject, Verb, Object, I discussed how the linguistic structure of who did what to whom might emerge from the simulation of the perceptual and behavioral systems as reflected in behavior planning circuits. That explanation works for verbs that represent behaviors, but what about verbs that aren't about action? These fall into two main categories: linking verbs such as is, seems, and appears on the one hand, and intransitive process verbs such as thawing, sinking, and rising on the other. In each case, these verbs represent observations that can be mentally simulated either sequentially or statically in the perceptual faculties without the involvement of the behavioral system. Last week's post on adjectives sufficiently addressed static simulations of perception; this week I will address dynamic, sequential perception, which is typically expressed in language as a subclass of intransitive verbs.
The ship sank.
The ice melted.
The rose bloomed.
The crops grew.
In each of these sentences, the subject of the sentence didn't really do anything. The ship did not make itself sink. The ice did not choose to melt. The rose could not stop blooming once it started, nor could the crops prevent their own growth. These subjects aren't agents. In fact, there are several languages, called ergative-absolutive languages, whose grammars explicitly mark that these examples are non-agentive, with special endings reserved for agents. Hence, for example, in Basque we have (from Google Translate -- no, I don't know Basque):
Izotza urtu egin zen. ("The ice melted.")
Izotzak izoztutako nitrogenoa urtu zuen. ("The ice melted the frozen nitrogen.")
Here we have izotza, meaning “the ice”, in two different roles. The first sentence has the ice as a passive object experiencing change; this uses the default (blank) ending, indicating the absolutive case. The second sentence has the ice as an agent (izotzak), causing frozen nitrogen to melt, with the ergative ending -k added to indicate the agentive role. The distinction is still present in sentences where the verb has no direct object. The sentence the man ran translates to gizonak korrika egin zuen, where the ergative -k on gizonak tells us that the man is doing something rather than just experiencing something. Other ergative-absolutive languages include Georgian, the Mayan languages, and apparently even Hindi in some cases (namely, when the verb is perfective).
The fact that ergative-absolutive languages can distinguish agentive expressions like the man ran from non-agentive ones like the ship sank indicates that the brain makes a distinction between the two. In Subject, Verb, Object, I discussed how the brain simulates the action of a verb through behavioral planning, while it simulates the subject and object through the perceptual system. The truth must be a bit more nuanced than that; after all, we observe ourselves as we perform an action. We rely on perceptual feedback to move around in the world. Try standing on one leg with your eyes closed for ten seconds, and you will quickly see how deeply you depend on vision for a simple balancing task. In a YouTube video from a few years ago (I can't find it now), research subjects were asked to perform everyday tasks such as cooking or cleaning while equipped with virtual reality systems that showed the real world delayed by just one second. They failed abjectly. Our behavior and its effects on the world are as much a matter of perceiving as of moving.
The simulation of behavior must then have both a behavioral and a perceptual component. With agentive verbs, the subject both acts out a behavior and experiences the changes that result in the environment. With non-agentive verbs, there is a change to the environment as observed by the speaker, but without a known cause. Thus, to understand intransitive verbs, we must understand how processes of change are observed in the perceptual system. This understanding will augment our description of how objects are modeled through their attributes, as discussed in Mix-and-Match Creativity.
What does it mean for a ship to sink? First, we imagine the ship above the water, where we would generally prefer for ships to stay. Then we witness the ship start to take on water, perhaps due to damage, perhaps from high waves or stormy weather. Next, the waterline rises on some part of the ship, and the rest of the ship begins to rotate out of the water, either up into the air or over on its side. Finally, the ship gradually descends and ultimately disappears under the water.
The sinking of the ship, then, is perceived as a sequence of changes in the sensory experience. These changes may seem smooth in our imaginations, but in language the changes are broken into chunks: the ship is damaged, water enters the ship, the ship capsizes, the ship disappears. We use words like gradually to give the simulation a feeling of smoothness, but language is primarily a left-brain phenomenon, and the left brain tends to discretize experience. By contrast, the right brain does a better job of modeling continuity. For example, the right-brain analogues of language-processing regions are more active when listening to and playing music, which requires smooth, continuous control. But in language, processes are chopped up into successive steps that are regarded as stable, static phases of the process.
So as ice melts, we imagine first a whole ice cube, then a growing pool of water under the ice, then a puddle of water and no ice. As the rose blooms, we first see the bud, then the spreading petals, and finally the bloom itself. The growing of crops proceeds from a sown but empty field, to the first shoots, then growing stalks and finally large ears of corn basking under the summer sun.
These successive snapshots are not truly static. Notice that in each of the cases above, one or more of the steps represented ongoing change. The pool of water extends outward from the ice. The bloom spreads radially from the center of the bud. The stalk grows upward. Thus directional change is encoded as an integral part of the snapshot.
Nor are these steps fixed. Humans have a tendency to group processes into two to four steps. This grouping may be culturally driven, or it may reflect the underlying number sense, whereby most human brains can automatically observe numbers at least up to five without counting. Either way, the particular grouping used to express a process is not fixed. I could add a fourth step to the melting ice example in which a small chip of ice floats in the puddle. Or I could indicate melted ice in two snapshots, before and after, as an ice cube and then a puddle of water. The meaning of melting ice is not a particular breakdown of the stages of melting, but rather a remembered experience of watching ice melt. This memory can be sampled at different rates. That is, the meaning of “melted ice” is not linguistic, but rather experiential, in line with the view of semantics as simulated experience.
How then is the experience of growing crops or melted ice to be acquired? After all, both processes occur slowly over a period of time, several minutes for ice and several months for crops. Yet our minds model motion in both cases, with the ice shrinking and the crops growing upward. Since this motion cannot come from moment-to-moment observation, the model of motion must be imputed, not observed. That is, we do not see crops grow. Rather, we observe them at one time with a certain height and at a later time with a taller height, and we infer growth. This inference must be based on more basic observations of faster moving objects for which motion is an apparent and integral part of perception.
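The inference described above can be sketched as a toy computation (all numbers and function names here are illustrative, not a claim about neural mechanisms): from two widely spaced snapshots of crop height, we impute a constant growth rate and can then "replay" the process at any intermediate time.

```python
# Toy sketch: imputing motion from two widely spaced observations.
# We never see the crops grow; we infer a rate and interpolate.

def impute_growth_rate(h1, t1, h2, t2):
    """Infer a constant growth rate (cm per day) from two snapshots."""
    return (h2 - h1) / (t2 - t1)

def simulate_height(h1, t1, rate, t):
    """Mentally replay the process at any intermediate time."""
    return h1 + rate * (t - t1)

# Observed: 10 cm on day 20, then 90 cm on day 60.
rate = impute_growth_rate(10.0, 20, 90.0, 60)   # 2.0 cm/day, inferred
mid = simulate_height(10.0, 20, rate, 40)       # imputed height on day 40
print(rate, mid)  # -> 2.0 50.0
```

The point of the sketch is only that the intermediate state (day 40) is a product of inference, never of observation.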
The detection of how an object is moving in the visual field is known as optical flow. An example of the results of an optical flow algorithm appears in Figure 1. These images were generated by DeepMind's PerceiverIO (Jaegle et al., 2021, described in a DeepMind blog), and they use color to illustrate the expected direction of movement of each pixel in the next frame. Plainly, optical flow is useful for separating background from foreground, and for identifying both objects and their parts, which can move separately.
Figure 1. Three illustrations of Optical Flow in Video, by PerceiverIO. Images use color to illustrate direction and speed of movement. From Jaegle et al., 2021.
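To make the idea of flow concrete without PerceiverIO's machinery, here is a deliberately minimal one-dimensional sketch (my own toy, not how the brain or any real flow algorithm works): given two "frames" of intensities, find the displacement that best aligns them, which is the direction and speed of motion.

```python
# Toy flow estimation in 1-D: find the shift of frame_b relative to
# frame_a that minimizes the mean squared difference over the overlap.

def estimate_shift(frame_a, frame_b, max_shift=3):
    best_shift, best_err = 0, float("inf")
    n = len(frame_a)
    for s in range(-max_shift, max_shift + 1):
        overlap = [(frame_a[i], frame_b[i + s])
                   for i in range(n) if 0 <= i + s < n]
        err = sum((a - b) ** 2 for a, b in overlap) / len(overlap)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift

frame1 = [0, 0, 1, 5, 1, 0, 0, 0]   # a bright "object" near the left
frame2 = [0, 0, 0, 0, 1, 5, 1, 0]   # the same object, two pixels right
print(estimate_shift(frame1, frame2))  # -> 2
```

Real optical flow does something analogous per pixel in two dimensions, but the output is the same in kind: a direction and speed of movement attached to each part of the scene.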
A mind equipped with optical flow can represent an object in terms not only of its static properties but also of its dynamic change. As best as I can tell, it is not well understood how the brain represents optical flow, but there are several neural regions engaged in the process, including corrections to distinguish self-motion from environmental movement using the vestibular system (Wurtz, 1998; Uesaki and Ashida, 2015). The relevant parts of the brain include the middle temporal visual area (MT/V5), parts of the parietal lobe, and the uppermost parts of the visual cortex, among others.
The relevant point is that the human mind perceives not merely objects, but objects in motion. This motion is part of the snapshot that we capture when we view a scene, and it thus becomes part of the attribute vocabulary used in language. Thus growing crops can be simulated in a single snapshot, with half-grown stalks and upward motion. Although the upward motion of the stalk is too slow to observe in person, it is trivial to simulate in our minds using our internal representations of movement. Simply by juxtaposing snapshots of a scene before and after, we can infer motion at a cognitive level that is not present on sensory timescales.
Perception of motion is not linear either, but can encode patterns. In the case of the blooming rose, we see that the petals expand radially. Wheels can spin, and cars can turn. Given the concept diagrams from Recursive Mapping of Reality, we might imagine that each element of a diagram can be associated with an arrow indicating a path of movement, and so we might represent turning, expanding, circling, spiraling, proceeding, regressing, and other descriptors of motion.
A change process encapsulates a series of motions, often with defined starting and ending conditions. When the sun rises, the starting condition is night, and the ending condition is day. The change process encapsulates the salient intermediate stages. Figure 2(a) shows a temporal diagram for a sunrise with subordinate spatial diagrams and arrows of motion. First, the sky lightens, then the sun appears above the horizon moving upward, and finally the sun is in the sky above the horizon, still moving upward. Note that the arrows in the diagram are superfluous in the true mental representations, which might be more like Figure 2(b) in that the change is simply part of the simulation. A change process might also break down into sequences of predictable motions, as shown in the temporal diagram of Figure 2(c) for a sinking ship.
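One way to make this structure concrete is a small data sketch (the field names are my own invention for illustration, not a claim about the brain's encoding): a change process as a start condition, an end condition, and an ordered list of stages, each snapshot carrying its own direction of motion.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Stage:
    description: str        # the static "snapshot"
    motion: Optional[str]   # direction of change encoded in it, if any

@dataclass
class ChangeProcess:
    name: str
    start_condition: str
    end_condition: str
    stages: List[Stage] = field(default_factory=list)

# The sunrise of Figure 2(a), rendered as data.
sunrise = ChangeProcess(
    name="sunrise",
    start_condition="night",
    end_condition="day",
    stages=[
        Stage("sky lightens", motion=None),
        Stage("sun appears above the horizon", motion="upward"),
        Stage("sun in the sky above the horizon", motion="upward"),
    ],
)
print(sunrise.start_condition, "->", sunrise.end_condition)
```

Note that motion lives inside each stage, matching the earlier observation that directional change is an integral part of the snapshot rather than something between snapshots.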
Change processes are naturally represented in language as verbs because they share two key features with behaviors. First, they commonly have start and end conditions that determine whether the change process can take effect. Second, the process can be viewed from many vantage points in time, based on whether the change is ongoing, happened in the past, will happen in the future, happens repeatedly, may happen, and so on. In syntactic terms, change processes have aspect, tense, and mood. As a third point, change processes that occur without intervention can also be caused to happen by external intervention, as in the sinking of a ship or the growing of crops, so that it makes sense to model change processes as a kind of behavior at the cognitive level, even though their origins lie in perception, not action.
The final question is how these change processes can be learned. A process template can be isolated from experience in at least two ways, namely, by predictive coherence on the one hand and by explicit creation on the other. Predictive coherence means that subsequent observations follow prior observations with high probability. This technique is used to find phrases in traditional natural language processing, for example, by computing that the word “processing” has a higher likelihood than usual of occurring after the words “natural language”. A separate contextual prediction principle is used to train the highly successful language models BERT and GPT-3. Thus the statistics of cooccurrence can be used to identify cohesive episodes. There is much more to say about how episodes are learned and structured, and I will leave that topic for a later post.
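The phrase-finding version of predictive coherence can be shown in a few lines (the corpus below is a toy I made up, not real training data): a word coheres with its predecessor when it follows that predecessor far more often than its overall frequency would predict.

```python
from collections import Counter

# Toy corpus: "processing" follows "language" much more often than
# its base rate in the corpus would suggest.
corpus = ("natural language processing is fun . "
          "natural language processing works . "
          "natural language models help . "
          "image processing differs .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def follow_prob(w1, w2):
    """P(w2 | w1): how strongly w1 predicts w2 in the next position."""
    return bigrams[(w1, w2)] / unigrams[w1]

conditional = follow_prob("language", "processing")   # 2/3
base_rate = unigrams["processing"] / len(corpus)      # 3/20
print(conditional, base_rate)
```

When the conditional probability greatly exceeds the base rate, the two words cohere into a unit; the same statistic, applied to observations rather than words, could bind successive snapshots into one episode.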
But humans can also explicitly create a change process ex nihilo by specifying start and end conditions, and perhaps intermediate steps as well. With a slight pun intended, consider the change process encoded in the verb vote. The meaning of voting does not refer to a particular observation sequence. No longer does the citizen furtively drop his painted pebble into an urn. Voting can occur early, in person, by mail, and perhaps soon simply by assumption. The voter might touch buttons on a screen at a voting kiosk, drop a filled-out ballot in a mailbox, or press their inked thumb on a paper ballot. When we say that someone voted, we mean that they officially recorded their decision-making preferences in an election. The change process has a start, in which the citizen has not recorded their preferences, and an end, in which their preferences have hopefully been counted correctly. The voting process is not learned by observing events. Rather, it is taught. Children learn what voting is by first being told the word vote, and then having it explained to them what processes are entailed. Not all change processes are acquired through careful observation; they may instead be instituted by fiat, and humans learn to accept them all the same.
To recap, most verbs involve change processes that can be observed by the perceptual system, and some verbs, the non-agentive ones, represent nothing but such change processes. These change processes can be extracted from observation or constructed through social teaching. In the case of observation, optical flow (or, equally, auditory flow or tactile flow) assigns to each object or percept a direction of motion, that is, of movement in space (or pitch). Thus the concept diagrams that represent object perception at a cognitive level can also be adorned with patterns of movement that can be complex, as exemplified by the words zigzagging, spinning, expanding, or curving. The perception of motion is a basic attribute of object recognition that represents aggregation of perception over time, and abstractions of motion can be equally applied to the spatial aspects of concept diagrams more generally. When simulating the experience of a change process, these processes can be viewed from many vantage points in space and time, much like behaviors, and these perspectives may then be modeled as though they were behaviors.
In the next post, we will begin our exploration of behavioral simulation proper by discussing these vantage points as a reflex of how we observe ourselves acting within temporal confines. As always, please leave comments or questions below!
References
A. Jaegle, S. Borgeaud, J. B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver IO: A General Architecture for Structured Inputs & Outputs. ICLR, 2022. https://arxiv.org/abs/2107.14795

R. H. Wurtz. Optic flow: A brain region devoted to optic flow analysis? Dispatch, 8(16), 1998. https://www.cell.com/fulltext/S0960-9822(07)00359-4

M. Uesaki and H. Ashida. Optic-flow selective cortical sensory regions associated with self-reported states of vection. Frontiers in Psychology, 8 June 2015. https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00775/full
If change processes can be learned/extracted from predictive coherence and/or explicit creation via social teaching, and predictive coherence is the basis of powerful language models like BERT and GPT-n, what would a social teaching language model look like and how successful could it be? Off the top of my head, I can't think of any examples ... do you know of any? Do we have the hardware (or the person-power) to create such a model?
Perhaps it's because this post has a strong sense of motion due to its subject matter, but this was particularly engaging, bravo!