I’m going to risk a slightly more technical post to address the question of how prototypes and schemas could be represented within a machine learning architecture. Of course, there might be many ways to accomplish this goal, and the one I am going to introduce here is not necessarily biologically plausible, though it should lie in the right direction. And as we will see, there is an interesting connection to the transformer architecture used in the most successful language models at present.
Knowledge Graphs
Traditional knowledge engineering is built on top of an attributed graph data structure. In brief, a graph models a set of objects (the nodes) and a set of pairwise relationships between them (the edges). An attributed graph adds a set of labels, one of which must be applied to each edge. These labels are called attributes. In general, edges are directional, that is, they point from an origin node to a destination node. Though it may sound surprising, this simple structure is rich enough to encode almost arbitrarily complex types of information.
In fact, an attributed graph can be encoded inside of a simple graph. Suppose that we have an attributed graph (V, A, R) where V is the set of nodes (vertices), R is the set of attribute labels (relationships), and A is the set of attributed edges, each structured as (s, a, d) for start node s, destination d, and attribute label a. Then we can create a simple graph (W, E) where W is the set formed by taking the union of V, A, and R, and E is the set of edges that contains three edges for each attributed edge e = (s, a, d) in A, namely, (1) an edge (s, e) pointing from the start node to the attributed edge, (2) an edge (e, d) from the attributed edge to the destination, and (3) an edge (e, a) from the attributed edge to its attribute. If we wish to be even more careful, we could introduce three type nodes named Object, Attribute, and Relationship to represent the sets V, A, and R respectively, adding additional edges to specify whether a node in the simple graph came from V, A, or R. Thus, an attributed graph is a kind of meta-graph, a graph built on a graph.
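To make the construction concrete, here is a minimal Python sketch. The function name and the toy nodes are my own illustrative choices, not part of any standard library; it simply turns each attributed edge into a node of its own, wired up by the three simple edges described above.

```python
def encode_attributed_graph(V, R, A):
    """V: set of nodes, R: set of attribute labels,
    A: set of attributed edges (s, a, d). Returns the simple graph (W, E)."""
    W = set(V) | set(R) | set(A)       # every node, label, and attributed edge becomes a vertex
    E = set()
    for edge in A:
        s, a, d = edge
        E.add((s, edge))               # (1) start node -> edge node
        E.add((edge, d))               # (2) edge node -> destination node
        E.add((edge, a))               # (3) edge node -> attribute label
    return W, E

# One attributed edge, "Rex is-a Dog", becomes four vertices and three simple edges.
W, E = encode_attributed_graph({"Rex", "Dog"}, {"is-a"}, {("Rex", "is-a", "Dog")})
print(len(W), len(E))                  # prints: 4 3
```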
Figure 1: An example knowledge graph.
There are other mathematical ways to build an attributed graph on a simple graph (for example, through topology using the disjoint union of a set of simple graphs, one for each attribute), but the construction above points towards the traditional structure for a knowledge graph, which adds a type for Class in addition to the other three, along with a special relationship is-a that points to classes, subject to the logical rule that every relationship applying to a class must also apply to its instances. An example knowledge graph is shown in Figure 1 with classes in purple and instances in orange. It additionally shows instantiation of relationships. There are many other technical details and variants, but this description so far should be enough to get us started.
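As a toy illustration of the is-a rule, here is a short Python sketch. The fact triples and the helper function are invented for this post (they do not correspond to Figure 1 or to any knowledge graph library); the point is only to show relationships being copied from a class down to its instances.

```python
facts = {
    ("Dog", "chases", "Cat"),   # a relationship attached to the class Dog
    ("Rex", "is-a", "Dog"),     # Rex is an instance of Dog
}

def apply_is_a_rule(facts):
    """Copy every non-is-a relationship of a class down to its instances."""
    derived = set(facts)
    for (instance, relation, cls) in facts:
        if relation == "is-a":
            for (subject, r, obj) in facts:
                if subject == cls and r != "is-a":
                    derived.add((instance, r, obj))
    return derived

print(apply_is_a_rule(facts))   # now also contains ("Rex", "chases", "Cat")
```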
Passing to the Continuous
The initial history of AI centered on the induction of discrete structures from data. By discrete structure, I mean things like rules, programs, logical systems, and yes, knowledge graphs. Symbolic structures are the means by which educated modern people deal with information and knowledge. They are also the means by which our computers operate. Therefore it was thought that a computer should be able to learn rules from symbols. I covered some of the problems with this approach in another post, but I did not comment there on how this has affected the field of machine learning.
Good Old-Fashioned AI (GOFAI) developed symbolic methods for learning sets of rules by exhaustively searching the space of all possible rules with the aid of heuristics to prioritize more likely solutions. But these heuristics were not particularly effective. Neural networks, genetic algorithms, and statistical machine learning were developed to search more efficiently by passing through a continuous space. There is a long history in mathematics of using continuous space to solve discrete problems, as, for example, when determining whether a discrete sequence converges (as it does when squeezed between two functions that converge to the same limit).
The advantage of continuous space is that the space itself contains information about directions of improvement. The direction of steepest change is called the gradient, and it is a multi-dimensional version of the derivative from basic calculus. If a problem can be represented such that its solution is a point in a continuous space that minimizes a differentiable function on that space, then the gradient of the function will point the way to improve any potential solution, represented as a point in the space. All we have to do is generate a potential solution, and then the gradient will lead us to the minimum, subject to conditions that are usually satisfiable in practice.
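As a minimal illustration of the idea, not tied to any particular learning problem, here are a few lines of Python that follow the gradient of a one-dimensional function down to its minimum.

```python
# Minimize f(w) = (w - 3)^2 starting from an arbitrary guess.
def gradient(w):
    return 2.0 * (w - 3.0)    # derivative of (w - 3)^2

w = 10.0                       # a generated potential solution
step_size = 0.1
for _ in range(100):
    w = w - step_size * gradient(w)   # step against the gradient, i.e. downhill

print(w)                       # very close to 3.0, the minimum
```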
The function can be thought of as a landscape, and gradient descent follows the gradient down to the nearest valley. The goal is to find the lowest valley, but intuitively the solution may get stuck in some higher valley with no way down. The fear of such high valleys tied up funding for neural networks for over a decade in the 1970s, but eventually people realized that in extremely high-dimensional problems, such valleys tend not to occur. When they do, you can often get around them by adding more dimensions.
Artificial neural networks take advantage of these opportunities to use continuity to solve discrete problems. The network is initialized into a somewhat amorphous state, at the chaotic edge where chance differences in the connection strengths can make the difference between classifying an input one way or another, for example, making the decision between whether an image contains a cat or a dog. Under the influence of the gradient, the network is led away from chaos towards definitive solutions, such that the solution found is optimal for the given data but retains some chaos for unseen data.
Further details are beyond the scope of this post, but the point is that continuity (and probability) are leveraged to make the problem of learning discrete categories feasible. Large neural networks now provide the best solutions to complex, human-like machine learning problems.
Continuous Graphs
We have seen that continuity can make discrete structures learnable, but how can a graph be made learnable? The approach I will follow is to embed the graph inside of a continuous space, and then to learn the parameters of the embeddings. Embedding objects into continuous spaces is perhaps one of the main tools offered by deep learning, so this approach should be familiar to anyone with recent experience applying neural networks to problems. What is perhaps unfamiliar is the name, the idea of calling this embedding a continuous graph. And in fact, this is my contribution: to relate the old world of discrete graphs to the new world of continuous embeddings.
A continuous vector space is a kind of graph. Points in the space are connected to each other if they are close to each other. But closeness is a matter of degree, not category. If closeness is to represent edges, then the presence or absence of an edge is no longer binary; rather, every object is connected to every other to some degree. I call this degree of connection the link strength. I identify this link strength with the inner product of the vector space, which exists for most familiar vector spaces, including the finite-dimensional real vector spaces I have in mind.
Formally, given a set of objects V, we require an embedding function f that translates each object x in V to a real vector f(x). Given two objects x and y, their link strength is the quantity <f(x), f(y)>, which is the dot product of their embeddings. The pair (V, f) is what I call a continuous graph. Three observations are in order.
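Here is a small Python sketch of a continuous graph under these definitions. The objects and the random embedding table are placeholders standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
objects = ["fox", "dog", "tree"]

# The embedding function f, here a lookup table of random 8-dimensional
# vectors standing in for learned parameters.
f = {x: rng.normal(size=8) for x in objects}

def link_strength(x, y):
    return float(np.dot(f[x], f[y]))   # <f(x), f(y)>

print(link_strength("fox", "dog"))                                   # some degree of connection
print(link_strength("fox", "dog") == link_strength("dog", "fox"))    # True: the dot product is symmetric
```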
Firstly, for real vectors the dot product is symmetric: <f(x), f(y)> = <f(y), f(x)>. Thus it can only represent undirected edges; the link strength from x to y is equal to the link strength from y to x. This problem can be avoided in complex vector spaces, where there is only conjugate symmetry instead of true symmetry, but here too the space introduces the relationship of conjugacy, which must be interpreted somehow. To foreshadow a bit, we could add back directionality by using two separate embeddings, one for the source and one for the destination, but I will set this issue aside for now and reintroduce directionality through the representation of attributes.
Secondly, the dot product is not itself a distance metric; among other problems, <f(x), f(x)> is generally not zero, and the triangle inequality fails. The dot product measures similarity rather than distance. To see how it can still give rise to a metric, consider the special case of unit vectors (vectors at radius one around the origin). These vectors form a space called a hypersphere, and the angle arccos <f(x), f(y)> is a distance metric on this hypersphere, where the shortest path between two points is a curved path on the sphere called a geodesic. Incidentally, unless you can dig a hole, the shortest path between two places on earth is roughly a geodesic. This observation becomes important if we want to discuss continuous graphs formally, because we might want to restrict embeddings to unit vectors, where the dot product can be used to generate a metric.
The third curiosity concerns using the dot product to represent connections. A vector space admits opposites. Think of the north and south poles of the earth. Now, if two points on the surface of the earth are close, then so are their opposites. That is, if <f(x), f(y)> is large, then < -f(x), -f(y)> is equally large. In the end, again for formal reasons, we might want to exclude opposites somehow, for example by requiring the output of f to be nonnegative, as with the ReLU activation commonly used in neural networks. Or we might decide we like having opposites; one thing that opposites allow for is an idea of anti-relatedness. So <f(x), f(y)> = 0 indicates the absence of a relationship, a positive value indicates its presence, and a negative value indicates an opposite relationship.
Attributes in Continuous Graphs
To add in a notion of attributed edges, I will use a different approach from the meta-graph perspective used in knowledge graphs. Instead, we will imagine each attribute as defining a simple graph of its own, which happens to share vertices with the graphs of the other attributes. That is, we have a collection of graphs (V, Ai), one for each attribute i, with edge set Ai containing all the edges having attribute i. These graphs all share the same vertex set V, which allows us to relate them to each other (this perspective is the disjoint union approach referenced above).
In the continuous setting, these attribute graphs would become a collection of continuous graphs (V, fi), where each embedding function maps the objects into the same vector space, but in different ways. Now if the attributes are really separate, then we should want <fi(x), fj(y)> = 0 for all objects x and y, which is to say that attributes i and j provide no information about each other. Those familiar with linear algebra will recognize that this condition says the images of fi and fj lie in orthogonal subspaces.
Skipping substantial detail, I introduce a single continuous graph (V, f) to represent the whole, and then define an attribute i as a square matrix Ai with attribute link strength given by <f(x), Ai f(y)>. Using the singular value decomposition, the matrix Ai can be factored into two matrices Si and Di with Ai = Si.T Di (the transpose of Si times Di), so that the link strength becomes <Si f(x), Di f(y)>. The S is intended to stand for source and the D for destination. This is a modification of the continuous graph that admits directionality.
The matrix Ai need not have full rank, in which case Si and Di can reduce the dimensionality of the embedding. Also, considering two attributes Ai and Aj, neither of which has full rank, we could indeed have <Si f(x), Dj f(y)> = 0, so this structure does allow us to separate attributes, at least up to the dimensionality of the representation. But it allows attribute meanings to overlap as well.
A continuous attributed graph (CAG), then, is a tuple G = (V, f, A1, …, An) consisting of a set of objects, an embedding function, and a finite list of attributes represented as square matrices. This graph is directional due to the asymmetry of the S and D matrices. There remains an issue with polarity in that <f(x), Ai f(y)> = < -f(x), -Ai f(y)>, which can be most easily handled by assuming that f is nonnegative, so that -f(x) can never result from embedding any object in V. If the link strengths of this G participate in a loss function, then the parameters f and Ai can be trained by gradient descent. And so a continuous attributed graph can be learned from data.
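Putting the pieces together, here is a sketch of a continuous attributed graph in Python. The objects, dimensions, and random parameters are placeholders for quantities that would be learned, and rather than decomposing a given matrix with the SVD, the sketch builds a low-rank attribute directly from its S and D factors; the factorization identity being exercised is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
objects = ["fox", "dog", "tree"]
dim, rank = 8, 3

# Nonnegative embeddings (as if passed through a ReLU) to sidestep the
# polarity issue mentioned above.
f = {x: np.abs(rng.normal(size=dim)) for x in objects}

# One attribute as a (low-rank) square matrix A = S.T D.
S = rng.normal(size=(rank, dim))    # source projection
D = rng.normal(size=(rank, dim))    # destination projection
A = S.T @ D

def link_strength(x, y):
    return float(f[x] @ A @ f[y])                  # <f(x), A f(y)>

def link_strength_factored(x, y):
    return float((S @ f[x]) @ (D @ f[y]))          # <S f(x), D f(y)>

print(np.isclose(link_strength("fox", "dog"),
                 link_strength_factored("fox", "dog")))   # True: the same quantity
print(np.isclose(link_strength("fox", "dog"),
                 link_strength("dog", "fox")))            # almost surely False: directional
```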
Transformer Self-Attention Uses CAGs
Most people reading this post will have at least some awareness of generative language models such as GPT-3, which trains a 96-layer neural network to predict the next word following a sequence of text, and BERT, which masks out 15% of the words in a sentence and uses a 24-layer neural network to predict the word under each mask from the other words. These impressive technologies are based on a machine learning module called the transformer architecture. A good overview can be found here, and I’ll assume some basic familiarity. At the center of the transformer architecture is an operation called self-attention. Christopher Olah has written an article explaining why this is helpful for language modeling that is worth reading. Transformer models have been used not only for text but also for images, audio, and video.
In a tutorial on geometric deep learning, Michael Bronstein has pointed out that self-attention implements a kind of graph in which the computations establish how strongly each item is connected to each other item. That’s right, link strength. Imagine that we have a sequence of text tokens, and we write the text on two lines, one on top of the other, like so:
The quick brown fox jumped over the lazy dog.
The quick brown fox jumped over the lazy dog.
Now we have two copies of the text, and we want to draw lines from words on the bottom to words on the top reflecting important connections in meaning. So, we might draw a line from quick to fox, from lazy to dog, or from fox to jumped and from over to jumped. These connections are not arbitrary; the relationship between fox and jumped is that of subject to verb, whereas the link from over to jumped is an adverbial relationship modifying the action. So, we not only want to draw lines between words, but to label them. That is, we are creating a sort of attributed graph, known to linguists as a dependency structure, the subject matter of dependency grammar.
BERT in particular has been analyzed and found to learn such dependency structures. In fact, it shouldn’t be too surprising, because the attentional part of self-attention in transformers turns out to be closely related to the structure I called a continuous attributed graph, and the entirety of self-attention can be interpreted as a kind of information transport within a continuous attributed graph.
To be specific, inside of self-attention, an input z at one time step (one word) is linearly projected to form a query vector q = Q z and a key vector k = K z, where Q and K are dimension-reducing matrices. Then we take the dot product <k, q>. Sound familiar? We can set z = f(x) for one token and z = f(y) for another, with D = Q and S = K, and then we see that the raw, pre-softmax attention score in a transformer is <S f(x), D f(y)>, at least up to a constant scaling factor. That is, the amount of attention paid to each item in a sequence is a function of the link strength of an attribute A = S.T D (where S.T is the matrix transpose). It was this observation that led me to develop the specific representation of a continuous attributed graph the way that I did.
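Here is a small numerical check of this correspondence, with illustrative dimensions and random weights standing in for trained parameters; it verifies that the raw attention score between two tokens is exactly the link strength of the implied attribute K.T Q (the usual 1/sqrt(d) scaling is omitted, since it only changes the score by a constant factor).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4

Q = rng.normal(size=(d_head, d_model))   # query projection
K = rng.normal(size=(d_head, d_model))   # key projection
zx = rng.normal(size=d_model)            # embedding f(x) of the attended-to token
zy = rng.normal(size=d_model)            # embedding f(y) of the attending token

score = (K @ zx) @ (Q @ zy)              # raw attention score <k, q>
A = K.T @ Q                              # the implied attribute matrix A = S.T D
print(np.isclose(score, zx @ A @ zy))    # True: the score is a link strength
```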
This characterization leads to an interpretation of the attention heads of a transformer layer, which were originally one of the most inscrutable parts of a transformer. Each head represents a type of link between objects. These links examine subspaces of the token embeddings, and pairs of heads may be independent or dependent. Clearly, the number of heads and the dimensionality given to each are critical architectural choices in the transformer, and given the layerwise skip-connections, a deep transformer can be thought of as a sequential graph processor operating on a single base embedding space. At each layer, it identifies links between items and then transports information (the value vectors) across those links to modify the value at the link destination.
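To round out the picture, here is a condensed sketch of self-attention read this way: one layer, no masking, no output projection, and random weights in place of trained ones. Each head computes link strengths with its own attribute matrix and then transports value vectors across the resulting links.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head, n_heads = 5, 16, 4, 2
Z = rng.normal(size=(n_tokens, d_model))          # token embeddings entering the layer

head_outputs = []
for _ in range(n_heads):                          # each head carries its own attribute
    Q = rng.normal(size=(d_head, d_model))        # destination (query) projection
    K = rng.normal(size=(d_head, d_model))        # source (key) projection
    V = rng.normal(size=(d_head, d_model))        # value projection: what gets transported
    scores = (Z @ Q.T) @ (Z @ K.T).T / np.sqrt(d_head)  # scores[i, j]: link strength from source j to destination i
    weights = softmax(scores, axis=-1)            # normalize the links arriving at each destination
    head_outputs.append(weights @ (Z @ V.T))      # transport value vectors across the links

update = np.concatenate(head_outputs, axis=-1)    # per-token update from this layer
print(update.shape)                               # (5, 8): one update vector per token
```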
Conclusion
In this post, I’ve explained a neural model I call the continuous attributed graph. I’ve shown that a version of this object is at the core of the most successful machine learning architecture used to process images and text. But there are still additional components and concepts to introduce before we can explain how to implement prototypes and schemas in a neural network. In particular, knowledge bases are not just graphs but meta-graphs: there are classes and relationships as well as rules for how to instantiate them. We have the schema (the graph), but not the prototypes (the classes). In addition, knowledge bases obey reasoning rules beyond the mere knowledge structure. How can components such as classes and logic rules be represented inside an overall neural structure? Are the roles played by these components already fulfilled in some other way within the transformer architecture, or do transformers need to be patched or augmented in order to acquire more capabilities? These are questions I will be looking at in the next post or two.
As always, thank you for reading, and please leave any questions or comments below!