Towards an Axiological Approach to AI Alignment
AI alignment research currently operates primarily within the framework of decision theory and looks for ways to align or constrain utility functions such that they avoid “bad” outcomes and favor “good” ones. I think this is a reasonable approach because, as I will explain, it is a special case of the general problem of axiological alignment, and we should be able to solve the special case of aligning rational agents if we hope to solve axiological alignment in general. That said, for computational complexity reasons, having a solution to aligning rational agents may not be sufficient to solve the AI alignment problem, so here I will lay out my thinking on how we might approach AI alignment more generally as a question of axiological alignment.
First, credit where credit is due. Much of this line of thinking was inspired first by Paul Christiano’s writing and later by this pseudonymous attempt to “solve” AI alignment. This led me to ask “how might we align humans?” and then “how might we align feedback processes in general?”. I don’t have solid answers for either of these questions — although I think we have some good ideas of research directions — but asking these questions has encouraged me to think about and develop a broader foundation for alignment problems that is also philosophically acceptable to me. This is a first attempt to explain those ideas.
Philosophical Acceptability
When I say I want a “philosophically acceptable” foundation for alignment problems, what I mean is I want to be able to approach alignment starting from a complete phenomenological reduction. Such a desire may not be necessary to make progress in AI alignment, but I would probably not be considering AI alignment if it were not of incidental interest to addressing the deeper issues it rests upon, so it seems reasonable for me to explore AI alignment in this way insofar as I also consider existential risk from AI important. Other people likely have other motivations and so they may not find the full depth of my approach necessary or worth pursuing. Caveat lector.
The phenomenological reduction is a method of deconstructing ontology to get as near phenomena as possible, then reconstructing ontology in terms of phenomena. The method consists of two motions: first ontology is bracketed via epoche (“suspension” of interpretation) so that only phenomena themselves remain, then ontology is reduced (in the sense of “brought back”) from phenomena. The phenomenological reduction is not the same thing as scientific reduction, though, as the former seeks a total suspension of ontology while the latter accepts many phenomena as ontologically basic. In phenomenological reduction there is only one ontological construct that must be accepted, and only because its existence is inescapably implied by the very act of experiencing phenomena as phenomena — the intentional nature of experience.
This is to say that experience is always in tension between the subject and object of experience. As such experience exists only in relation to a subject and an object, an object exists for (is known by) a subject only through experience, and only through experience does a subject make existent (know) an object. Put another way, phenomena are holons because they are wholes with inseparable parts where each part implies the whole. It is from suspension of ontology to see only this 3-tuple of {subject, experience, object} that we seek to build back ontology — now understood as the way the world is as viewed through phenomena and not to be confused with understanding the world as metaphysical noumena that may exist prior to experience — and thus all of philosophy and understanding.
Axiology
Axiology is traditionally the study of values (axias) and analogous to what epistemology is for facts and ontology is for things. I claim that axiology is actually more general and subsumes epistemology, ethics, traditional axiology, and even ontology because it is in general about combining experiences of experiences as object where we term such bracketed experiences “axias”. Since this is not a frequently taken position outside Buddhist philosophy and there it is not framed in language acceptable to a technical, Western audience, I’ll take a shortish diversion to explain.
Consider the phenomenon “I eat a sandwich” where I eat a sandwich. Bracketing this phenomenon we see that there is some “I” that “eats” some “sandwich”. Each of these parts (the subject “I”, the object “sandwich”, and the experience of “eating”) is a sort of ontological fiction constructed over many phenomena: the “I” is made up of parts we might call “eyes”, “mouth”, and “body”; the “sandwich” is made up of parts we might call “bread”, “mustard”, and “avocado”; and “eating” is many experiences including “chewing”, “swallowing”, and “biting”. And each of these parts is itself deconstructable into other phenomena, with our current, generally accepted model of physics bottoming out in the interactions of quarks, other fundamental particles, and fields. So “I eat a sandwich” includes within it many other phenomena such as “my mouth chews a bite of sandwich”, “my eyes see a sandwich nearby”, and even “some atoms in my teeth use the electromagnetic force to push against some atoms in the sandwich”.
Now let us consider some experiences of the phenomenon “I eat a sandwich”. Suppose you are standing nearby when I eat a sandwich, and you are engaged in the experience of watching me eat the sandwich. Although English grammar encourages us to phrase this as “you watch me eat a sandwich”, we can easily understand this to be pointing at the intentional relationship we might formally write as “you watch ‘I eat a sandwich’” where “I eat a sandwich” is a phenomenon nested as object within the phenomenon of you watching. We can similarly consider phenomena like “you read ‘I write “I eat a sandwich”’” to describe what happens when you read my written words “I eat a sandwich” and “I believe ‘“I am hungry” caused “I eat a sandwich”’” to give a possible etiology of my sandwich eating.
To talk about the general form of such phenomena that contain other phenomena, we might abstract away certain details in each phenomenon to find the general patterns they match. When we say “you watch ‘I eat a sandwich’”, we are talking about a subject, “you”, having an experience, “watching”, of the objectified (bracketed) phenomenon “I eat a sandwich”. Calling this bracketed phenomenon an axia, we can say this experience takes the form {subject, experience, axia}. Similarly “you read ‘I write “I eat a sandwich”’” is of the form {subject, experience, axia} although here we can deconstruct the axia into the nested phenomenon {subject, experience, {subject, experience, object}}, and “I believe ‘“I am hungry” caused “I eat a sandwich”’” is of the form {subject, experience, {axia, experience, axia}}. Even “I eat a sandwich”, which is really more like “I eat ‘I experience stuff-as-sandwich’”, has the form {subject, experience, axia}, and since the “I” we place in the subject is itself the phenomenon “I experience myself”, we can generally say that all experiences that we as conscious beings recognize as experiences are of one axia experiencing another.
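As a minimal sketch of the nested forms above, one could model a phenomenon as a triple whose slots hold either a bare label or another bracketed phenomenon. The names `Phenomenon`, `Term`, `depth`, and `show` are my own illustrative assumptions and not part of any established formalism.

```python
from dataclasses import dataclass
from typing import Union

# A term is either an unanalyzed label (a string) or a bracketed phenomenon (an axia).
Term = Union[str, "Phenomenon"]


@dataclass(frozen=True)
class Phenomenon:
    """A {subject, experience, object} triple; any slot may hold a nested phenomenon."""
    subject: Term
    experience: Term
    obj: Term


def depth(term: Term) -> int:
    """Nesting depth: 0 for a bare label, 1 for a flat phenomenon, and so on."""
    if isinstance(term, str):
        return 0
    return 1 + max(depth(term.subject), depth(term.experience), depth(term.obj))


def show(term: Term) -> str:
    """Render a term in the {subject, experience, object} notation used in the text."""
    if isinstance(term, str):
        return term
    parts = (term.subject, term.experience, term.obj)
    return "{" + ", ".join(show(p) for p in parts) + "}"


eat = Phenomenon("I", "eat", "sandwich")
watch = Phenomenon("you", "watch", eat)                      # {subject, experience, axia}
read = Phenomenon("you", "read", Phenomenon("I", "write", eat))
hungry = Phenomenon("I", "am", "hungry")
believe = Phenomenon("I", "believe", Phenomenon(hungry, "caused", eat))

print(show(watch))    # {you, watch, {I, eat, sandwich}}
print(depth(read))    # 3
print(show(believe))  # {I, believe, {{I, am, hungry}, caused, {I, eat, sandwich}}}
```

The sketch treats “I” and “sandwich” as bare labels for brevity, whereas the argument above holds that these too are bracketed phenomena, so in principle every leaf could be expanded further.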
This implies that axiology may be deeply connected to consciousness, and while I don’t wish to fully address consciousness here, we must say a few things about it that are relevant to axiology.
Recall that phenomenological reduction forces us to make only the ontological assumption that stuff experiences stuff. Stuff experiences other stuff differentially, so stuff appears to clump into clusters of stuff that have more, stronger experiences between the stuff in the cluster than the stuff outside the cluster. We call such clusters things, though note that all things have fuzzy boundaries that depend on how we choose to delineate clusters, and things are ontological constructions to help us reason about stuff and its experiences and not necessarily noumena. We may then say that things experience things to talk about the experiences connecting the stuff inside one cluster with the stuff inside another, and we call such stuff-to-stuff, thing-to-thing experiences “direct experiences”.
Now consider a thing we’ll label T. If this thing experiences itself as object — viz. there are “feedback” phenomena of the form {T, experience, T} — we say that T is cybernetic because it feeds experiences of itself back into itself. Since things are clusters of stuff experiencing other stuff within a cluster, all things are necessarily cybernetic, but we can distinguish among cybernetic things based upon how much feedback phenomena we observe within them. Thus, for example, we may say that rocks are less cybernetic than trees are less cybernetic than humans even though they are all cybernetic.
A cybernetic thing may experience itself experiencing itself. That is, there may be phenomena of the form {T, experience, {T, experience, T}} that occur when the stuff of T interacts with the stuff of itself as it interacts with the stuff of itself. We term such phenomena “qualia” and say that things experiencing themselves experiencing themselves are phenomenologically conscious. As with the definition of cybernetic, this implies that everything is conscious, although some things are more conscious than others, and most of the distinctions we care about are ones of how qualia are structured.
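The two forms just defined, feedback {T, experience, T} and qualia {T, experience, {T, experience, T}}, can be told apart by shape alone. The following is only a sketch under the assumption that things are identified by simple labels; `Phenomenon`, `is_feedback`, and `is_quale` are hypothetical names.

```python
from typing import NamedTuple


class Phenomenon(NamedTuple):
    """A {subject, experience, object} triple; slots may be labels or nested triples."""
    subject: object
    experience: object
    obj: object


def is_feedback(p, thing) -> bool:
    """True for {T, experience, T}: the thing experiences itself as object."""
    return isinstance(p, Phenomenon) and p.subject == thing and p.obj == thing


def is_quale(p, thing) -> bool:
    """True for {T, experience, {T, experience, T}}: the thing experiences
    itself experiencing itself."""
    return isinstance(p, Phenomenon) and p.subject == thing and is_feedback(p.obj, thing)


direct = Phenomenon("T", "touches", "rock")      # direct experience of another thing
feedback = Phenomenon("T", "senses", "T")        # feedback phenomenon
quale = Phenomenon("T", "notices", feedback)     # quale

print(is_feedback(direct, "T"))    # False
print(is_feedback(feedback, "T"))  # True
print(is_quale(feedback, "T"))     # False
print(is_quale(quale, "T"))        # True
```

This captures only the form of the phenomena. It says nothing about how much feedback or how many qualia a thing has, which is what distinguishes more conscious things from less conscious ones.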
Since axias are the objects of qualia, they are the values or priors used by conscious things to engage in the experiences we variously call thinking, reasoning, and generally combining or synthesizing information. This is important because, although feedback and direct experience also combine information, conscious things as conscious things only combine information via qualia, thus conscious thought is entirely a matter of axiology. In this way axiology is to consciousness what cybernetics is to feedback and what physics is to direct experience because axiology is at heart the study of qualia. So if we are to say anything about the alignment of artificially conscious things, it will naturally be a matter of axiology.
Axiological Alignment
Note: In this section I give some mathematical definitions. These are currently prospective and may be revised or added to based on feedback.
Informally stated, the problem of AI alignment (also called AI safety and AI control) is to produce AIs that are not bad for humans. This sounds nice, but we need to be more precise about what “not bad for humans” means. I can find no standard formalizations of the AI alignment problem, but we have a few partial attempts:
- Stuart Armstrong has talked about the AI control problem in terms of an agent learning a policy π that is compatible with (produces the same outcomes as) a planning algorithm p run against a human reward function R.
- Paul Christiano has talked in terms of benign AI that is not “optimized for preferences that are incompatible with any combination of its stakeholders’ preferences, i.e. such that over the long run using resources in accordance with the optimization’s implicit preferences is not Pareto efficient for the stakeholders.”
- MIRI researchers have formally described corrigibility, a subproblem in AI alignment, in terms of utility functions.
- Nate Soares of MIRI has also given a semi-formal description of AI alignment as the value learning problem.
Each of these depends, in one way or another, on utility functions. Normally an agent’s utility function U is defined as a function U:X→ℝ from some mutually exclusive options X to the real numbers. This, however, leaves the domain of the function ambiguous and only considers codomains (and hence images) that are totally ordered. I believe these are shortcomings that have prevented a complete and rigorous formalization of the problem of AI alignment, let alone one I find philosophically acceptable.
First consider the domain. In toy examples the domain is usually some finite set of options, like {defect, cooperate} in the Prisoner’s Dilemma or {one box, two box} in Newcomb-like problems, but in more general cases the domain might be the set of preferences an agent holds, the set of actions an agent might take, or the set of outcomes an agent can get. Of course preferences are not actions are not outcomes unless we convert them to the same type, as in making an action a preference by having a preference for an action, making a preference an action by taking the action of holding a preference, making an outcome an action by acting to effect an outcome, or making an action an outcome by getting the outcome caused by some action. If we could make preferences, actions, outcomes, and other things of the same type, though, we would have no such difficulty and could be clear about what the domain of our utility function is. Since for our purposes we are only interested in phenomenologically conscious agents, we may construct the domain in terms of the axias of the agent.
To make clear how this works let’s take the Prisoner’s Dilemma as an example. Let X and Y be the “prisoner” agents who can each choose to defect or cooperate. From X’s perspective two possible experiences must be chosen between, {X, experience, {X, defect, Y}} or {X, experience, {X, cooperate, Y}}, and these qualia yield the axias {X, defect, Y} and {X, cooperate, Y}. Y similarly chooses between {Y, experience, {Y, defect, X}} and {Y, experience, {Y, cooperate, X}} with axias {Y, defect, X} and {Y, cooperate, X}. Once X and Y make their choices and their actions are revealed, they each end up experiencing one of four axias as world states:
- {X, defect, Y} and {Y, defect, X}
- {X, defect, Y} and {Y, cooperate, X}
- {X, cooperate, Y} and {Y, defect, X}
- {X, cooperate, Y} and {Y, cooperate, X}
Prior to defecting or cooperating X and Y can consider these world states as hypotheticals, like {X, imagine, {X, experience, {X, defect, Y} and {Y, defect, X}}}, to inform their choices, and thus can construct a utility function with these axias as the domain. From here it is straightforward to see how we can expand the domain to all of an agent’s axias when constructing their complete utility function.
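To make the construction concrete, here is a small sketch that enumerates the four world states as pairs of axias and defines a utility function for X over them. The payoff numbers are hypothetical, chosen only to follow the usual Prisoner’s Dilemma ordering, and `axia`, `world_states`, and `utility_x` are illustrative names of my own.

```python
from itertools import product

ACTIONS = ("defect", "cooperate")


def axia(subject, action, other):
    """A bracketed phenomenon {subject, action, other}, represented as a tuple."""
    return (subject, action, other)


# Each world state pairs X's axia with Y's axia, in the same order as the list above.
world_states = [
    (axia("X", ax, "Y"), axia("Y", ay, "X"))
    for ax, ay in product(ACTIONS, repeat=2)
]

# Hypothetical payoffs for X, keyed by (X's action, Y's action). Assumed values.
PAYOFF_X = {
    ("defect", "defect"): -2,
    ("defect", "cooperate"): 0,
    ("cooperate", "defect"): -3,
    ("cooperate", "cooperate"): -1,
}


def utility_x(state):
    """X's utility function, with world-state axias as its domain."""
    (_, x_action, _), (_, y_action, _) = state
    return PAYOFF_X[(x_action, y_action)]


for state in world_states:
    print(state, utility_x(state))

best_for_x = max(world_states, key=utility_x)
print(best_for_x)  # (('X', 'defect', 'Y'), ('Y', 'cooperate', 'X'))
```

The point of the sketch is only that the domain is now a single type, pairs of axias, rather than a mix of preferences, actions, and outcomes.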
Now consider the requirement that the image of the utility function be totally ordered. Unfortunately this excludes the possibility of expressing incomparable preferences and intransitive preferences. Such preferences are irrational, but humans are irrational and AIs can only approximate rationality due to computational constraints, so any complete theory of alignment must accommodate non-rational agents. This unfortunately means we cannot even require the image be preordered and can at most demand that all agents can approximately order the image, which is to say they apply to the image an order relation, ≤, defined for a set S such that for all a∈S a≤a and there exist a,b∈S where a≤b. Humans are known to exhibit approximate partial ordering, where ≤ is transitive and anti-symmetric for some subsets of the image, and humans and AIs may be capable of approximate total ordering where they totally order some subsets of the image, but lacking a proof that, for example, an agent orders its axias almost everywhere, we can only be certain that agents approximately order their axias.
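The distinction between an approximate order and a partial order can be checked mechanically on a small example. In this sketch, which assumes a finite set of hypothetical options and a relation given as explicit pairs, the whole set is only approximately ordered (it contains a preference cycle and an incomparable option) while a subset of it is partially ordered.

```python
from itertools import product


def is_reflexive(items, leq):
    """a <= a for every a."""
    return all(leq(a, a) for a in items)


def is_transitive(items, leq):
    """a <= b and b <= c imply a <= c."""
    return all(
        leq(a, c)
        for a, b, c in product(items, repeat=3)
        if leq(a, b) and leq(b, c)
    )


def is_antisymmetric(items, leq):
    """a <= b and b <= a imply a == b."""
    return all(
        a == b
        for a, b in product(items, repeat=2)
        if leq(a, b) and leq(b, a)
    )


# Hypothetical preferences: a cycle among three options, with "durian" incomparable.
options = ["apple", "banana", "cherry", "durian"]
pairs = {(o, o) for o in options} | {
    ("apple", "banana"),
    ("banana", "cherry"),
    ("cherry", "apple"),
}


def leq(a, b):
    return (a, b) in pairs


print(is_reflexive(options, leq))   # True: an approximate order in the sense above
print(is_transitive(options, leq))  # False: the cycle makes the preferences intransitive

subset = ["apple", "banana"]
print(is_transitive(subset, leq) and is_antisymmetric(subset, leq))  # True
```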
Thus a more general construct than the utility function would approximately order an agent’s axias. Let 𝒜 be the set of all axias. Define an axiology to be a 2-tuple {A,Q} where A⊆𝒜 is a set of axias and Q:A×A→𝒜 is a qualia relation that combines axias to produce other axias. We can then select a choice function C to be a qualia relation on A such that {A,C} forms an axiology where C offers an approximate order on 𝒜.
Given these constructs, we can attempt to formally state the alignment problem. Given agents X and Y, let {A,C} be the axiology of X and {A,C’} the axiology of Y. We can then say that X is axiologically aligned with Y (that {A,C} is aligned with {A,C’}) if for all a,b∈A, C’(a)≤C’(b) implies C(a)≤C(b). In English what this says is that one axiology is aligned with another if the former preserves the latter’s ordering: whenever Y values b at least as much as a, so does X.
An immediate problem arises with this formulation, though, since it requires the set of axias A in each axiology to be the same. We could let X have an axiology {A,C} and Y an axiology {A’,C’} and then define axiological alignment in terms of A∩A’, but aside from the obvious inadequacy to the purpose of alignment if we exclude the symmetric difference of A and A’, the axias of each agent are subjective so A∩A’=∅. Thus this definition of alignment makes all agents vacuously aligned.
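Both the definition of alignment and the vacuity problem can be sketched in a few lines. This is an illustration under a strong simplifying assumption: choice functions are stood in for by numeric scores compared with the ordinary ≤, which is more structure than an approximate order guarantees. The axias and scores are hypothetical.

```python
from itertools import product


def is_aligned(axias, c, c_prime):
    """X (choice function c) is aligned with Y (choice function c_prime) over
    `axias` if c_prime(a) <= c_prime(b) implies c(a) <= c(b) for all a, b."""
    return all(
        c(a) <= c(b)
        for a, b in product(axias, repeat=2)
        if c_prime(a) <= c_prime(b)
    )


# Hypothetical shared axias with numeric scores standing in for each agent's ordering.
shared = ["rest", "work", "play"]
y = {"rest": 1, "work": 2, "play": 3}.get
x_same_order = {"rest": 10, "work": 20, "play": 30}.get
x_reversed = {"rest": 3, "work": 2, "play": 1}.get

print(is_aligned(shared, x_same_order, y))  # True: X preserves Y's ordering
print(is_aligned(shared, x_reversed, y))    # False: X inverts Y's ordering

# If each agent's axias are subjective, the two sets have nothing in common,
# and the condition holds vacuously for any pair of choice functions.
a_x = {"X:rest", "X:work", "X:play"}
a_y = {"Y:rest", "Y:work", "Y:play"}
common = a_x & a_y
print(common)                             # set()
print(is_aligned(common, x_reversed, y))  # True, but only vacuously
```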
So if axiological alignment cannot be defined directly in terms of the axias of each agent, maybe it can be defined in terms of axias of one agent modeling the other. Suppose X models Y through qualia of the form {X, experience, {Y, experience, axia}}, or more properly {X, experience, {{X, experience, Y}, experience, axia}}. Let these axias {{X, experience, Y}, experience, axia} form a subset A_Y of A. Within A_Y will be an axia {{X, experience, Y}, experience, C’}, and we identify this as C’_X. Then we might say X is weakly axiologically aligned with Y if for all a,b∈A_Y, C’_X(a)≤C’_X(b) implies C(a)≤C(b).
As hinted at by “weak”, this is still insufficient because X may trivially align itself with Y by having a very low-fidelity model of Y, so Y will want some way to make sure X is aligned with it in a way that is meaningful to Y. Y will want X to be corrigible at the least, but more generally will want to be able to model X and assess that X is aligning itself with Y in a way that is satisfactory to Y. So given that X is weakly aligned with Y, we say X is strongly axiologically aligned with Y if for A’_X, the set of axias {{Y, experience, X}, experience, axia}, and C_Y, the axia {{Y, experience, X}, experience, C}, for all a,b∈A’_X, C(a)≤C(b) implies C_Y(a)≤C_Y(b).
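The weak and strong conditions can be sketched in the same style. As before this assumes numeric scores in place of approximately ordered axias, and it glosses over the fact that A_Y and A’_X live in different agents by using the same hypothetical labels for both. The names `preserves`, `weakly_aligned`, and `strongly_aligned` are my own.

```python
from itertools import product


def preserves(items, source, target):
    """True if source(a) <= source(b) implies target(a) <= target(b) on `items`."""
    return all(
        target(a) <= target(b)
        for a, b in product(items, repeat=2)
        if source(a) <= source(b)
    )


def weakly_aligned(a_y, c, c_prime_x):
    """X's own ordering c preserves X's model of Y's ordering, c_prime_x, on A_Y."""
    return preserves(a_y, c_prime_x, c)


def strongly_aligned(a_y, a_prime_x, c, c_prime_x, c_y):
    """Weak alignment, plus Y's model of X's ordering, c_y, preserves c on A'_X."""
    return weakly_aligned(a_y, c, c_prime_x) and preserves(a_prime_x, c, c_y)


# Hypothetical scores for illustration only.
c = {"help": 3, "ignore": 2, "harm": 1}.get          # X's own choice function
c_prime_x = {"help": 1}.get                          # X's low-fidelity model of Y
a_y = ["help"]                                       # X models only one of Y's axias
a_prime_x = ["help", "ignore", "harm"]               # Y's axias modeling X

c_y_mistrustful = {"help": 1, "ignore": 2, "harm": 3}.get  # Y thinks X prefers harm
c_y_accurate = {"help": 3, "ignore": 2, "harm": 1}.get     # Y models X faithfully

print(weakly_aligned(a_y, c, c_prime_x))                                 # True, trivially
print(strongly_aligned(a_y, a_prime_x, c, c_prime_x, c_y_mistrustful))   # False
print(strongly_aligned(a_y, a_prime_x, c, c_prime_x, c_y_accurate))      # True
```

Note that in the last case strong alignment holds even though X’s model of Y is nearly empty, which suggests the definition may still need a fidelity requirement on A_Y.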
This is nearly a statement of what we mean by the AI alignment problem, but the choice of a single agent Y seemingly limits us to aligning an AI to only one human. Recall, however, that everything is at least somewhat phenomenologically conscious, so we can let our agent Y be a collection of humans, an organization, or all of humanity united by some axiology. Traditional axiology leaves the problem of combining individuals’ axiologies to ethics, so here we will need a sort of generalized ethics of the kind sketched by coherent extrapolated volition, but for our present purposes it is enough to note that axiology still applies to a meta-agent like humanity by considering the qualia of humanity as a phenomenologically conscious thing. Thus, in our technical language,
The AI alignment problem is to construct an AI that is strongly axiologically aligned with humanity.
Research Directions
Stating the problem of AI alignment precisely does not tell us a lot about how to do it, but it does make clear what needs to be achieved. In particular, to achieve strong alignment we must
- construct an AI that can learn human axias and make them part of its own axiology,
- understand how to assess with high confidence if an AI is aligned with human axias,
- and develop a deep understanding of axiology to verify the theoretical correctness of our approaches to these two tasks.
Current alignment research focuses mainly on the first two issues of how to construct AIs that learn human axias and do so in ways that we can verify. The axiological alignment approach does not seem necessary to continue making progress in those areas now since progress is already being made without it, but it does provide a more general theoretical model than is currently being used and so may prove valuable when alignment research advances from rational agents to agents with bounded or otherwise approximate rationality. Further, it gives us tools to verify the correctness of AI alignment work using a rigorous theoretical framework that we can communicate clearly within rather than hoping we each understand what is meant by “alignment” in terms of the less general constructs of decision theory. For this reason alone axiological alignment seems worth considering.
I anticipate that future research on axiological alignment will primarily focus on deepening our understanding of the theory laid out here, exploring its implications, and working on ways to verify alignment. I also suspect it will allow us to make progress on specifying humanity’s choice function by giving us a framework within which to build it. I encourage collaboration and feedback, and look forward to discussing these ideas with others.