Benevolent AI Is a Bad Idea
Dreams of aligned AI give up too much agency to the machine, and assume “human values” are more stable than they are.
This article by Miya Perry was published at Palladium Magazine on November 10, 2023.
On April 1st, 2022, MIRI (the Machine Intelligence Research Institute)—the people who led the cultural charge on the idea of AGI and AI safety—announced a “death with dignity” strategy:
“tl;dr: It's obvious at this point that humanity isn't going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.”
This may or may not have been an April Fool’s joke, but either way, consensus among the people I know in the field seems to be that while progress is being made on AI, no progress is being made on AI safety. To me this is not surprising given the starting assumptions in play.
AI safety, originally “friendly AI,” was conceived as the problem of how to create an agent that would be benevolent towards humanity in general, on the assumption that that agent would have godlike superpowers. This whole line of thinking is rife with faulty assumptions—not even necessarily about the technology, which has yet to come into being, but about humans, agency, values, and benevolence.
“How do we make AI benevolent?” is a badly formulated problem. In its very asking, it ascribes agency to the AI that we don’t have to give it, that it may or may not even be able to acquire, and that is naturally ours in the first place. It implicitly ascribes all future moral agency to the AI. “How can we align AI with human values?” is also a badly formulated problem. It assumes that “human values” are or even should be universal and invariant, that they can be “figured out,” and that they need to be figured out to generate good outcomes from AI in the first place.
I have spent the last ten years exploring human psychology from both an empirical, experiential perspective and an analytic, modeling one. I sit with people as they introspect on their beliefs, feelings, and decisions, tracking what functions and cognitive routines they run in response to specific stimuli, and experimenting with what kinds of changes we can make to the system.
As someone whose job it is to examine and improve the structure of agency and clarify values, I can say with confidence that as a culture we have only a very primitive understanding of either. To varying extents, that primitiveness is the result of psychological constraints, such that even people who do think deeply about these topics get tripped up. In my conversations with people who work on AI and/or AI safety, who fear or welcome the coming AI god, or just keep up with the discourse, I generally find that their working concepts of agency, consciousness, motivation, drive, telos, values, and other such critical ideas are woefully inadequate.
The problem of benevolent AI as it is formulated is doomed to fail. One reason is that to explicate “human values” is to conceptualize them. If there can be said to be a deeper universal human value function at all, human conceptualization and reflective consciousness are merely its tools. There is some amount of danger in any project that attempts to rationalize or make conscious the entirety of human goal space, because reflective consciousness is not structurally capable of encompassing the entirety of that space, and demanding that motivation be governed by cognition creates dangerous blind spots and convolutions. Not all functions and desires in the human system are the rightful subjects of consciousness, and consciousness is not benevolent toward or competent to reign over all functions and desires in the human system.
Preference is a thing that expresses itself in environmentally contingent ways; there is no conceptualizable set of values that is truly invariant across environments. The things that people are able to consciously think of and verbalize as “values” are already far downstream of any fundamental motivation substrate—they are the extremely contingent and path-dependent products of experience, cultural conditioning, individual social strategies, and not a little trauma. Any consciousness that thinks it has its values fully understood will be surprised by its own behavior in a sufficiently new environment. Language severely under-describes conceptual space, and conceptual space severely under-describes actual possibility space. These are not problems to be transcended; they are simply facts of how abstraction works. Conceptualization is and must be Gödel-incomplete; descriptive power should grow as the information in the system grows, but the system should never be treated as though it has been, can be, or should be fully described.
The good news is that we have no need to fully describe or encapsulate human values. We can function perfectly well without complete self-knowledge; we are not meant to be complete in our conceptual self-understanding, but rather to grow in it.
The desire for complete description of human values is the result of a desire for there to be a single, safe, locked-in answer, and the desire for that answer is the result of a fear that humans are too stupid, too evil, or too insane to be left as the deciders of our own fate. The hope is that “we” (meaning, someone) can somehow tell AI the final answer about what we should want, or get it to tell us the final answer about what we should want, and then leave it to execute on our behalf all of the weighty decisions we are not competent to make ourselves. We should be very wary of a project to save ourselves, or even “empower” ourselves, that is premised on the belief that humans essentially suck.
“Protectionist” projects of this sort will seek to decrease human agency, consciously or not. That intention is easy to fulfill, in fact is already being fulfilled, regardless of whether full AGI is actually achieved. We are already getting the kind of ostensibly revolutionary automation that makes the baseline experience of living less and less agentic and thus more and more frustrating, increment by increment. The experience of being able to perceive a natural path or option through reality but not being able to execute on it is maddening, and this is the experience that much of our more advanced automation produces: the experience of waiting behind a self-driving car that could turn right at a red light but is too conservative to try it, the experience of asking ChatGPT a question about some tidbit of history or politics and receiving a censored and patronizing form lecture, the experience of calling customer service and getting a robot—these kinds of experiences point to the more horrifyingly mundane, more horrifyingly real, and much more imminent version of having no mouth and needing to scream.
Back to Agency
Absent the assumption of human incompetence, AI has the potential to be used to genuinely help increase human agency, not merely or even mainly by increasing human power or technical reach, but by increasing our conceptual range. In every aspect of our lives, a million choices go unrecognized because we are trapped within the limited conceptual frames that steer us; human life is lived on autopilot and in accordance with inherited cultural scripts or default physiological functions to a far greater degree than most people understand. This is not to say that humans can or should be glitteringly conscious of every choice in every moment, the way people imagine they would be if they were spiritually enlightened. There is a simpler and more discrete kind of psychological expansion. The kind of psychological shift someone undergoes when, for example, realizing that they’ve been subconsciously seeking out harsh, judgmental friends as a method of trying to gain their parents’ love by proxy, and realizing that they actually have the option to bond with kind and accepting people, is the kind of subtle but profound broadening of scope that changes the option landscape, and changes the trajectory of the person’s life. Whether we know it or not, our trajectories are currently determined by the way that the space of possible futures we can conceive of is narrowed by our conceptual baggage and limitations. Being told that we have other choices isn’t sufficient to change this. The person with judgmental friends was likely told many times to get better friends, long before something shifted enough for them to internalize the realization themselves. Being given more material options alone isn’t sufficient either—that person may have likewise been surrounded for years by kind people willing to befriend them, whose overtures went unnoticed in the subconscious pursuit of more actively withheld approval.
So it is for all of us: we are constantly surrounded by options and opportunities that we are conceptually blind to. The current space of imagined futures in AI is highly constrained by cultural imagination, with many people and projects pursuing a vision of AI in a way that is functionally identical to a traumatized individual pursuing an abusive relationship without realizing it, because their concepts of love and relationships have been formed and deformed by local experience and trauma. Were we to query the abuse seeker about their values in the kind of introspective research project that some people are doing in order to try and unearth the human value function, they would be able to tell us all sorts of things about desiring to feel safe, loved, taken care of, protected by something smarter or stronger than them, etc.—but this would not free them from the malformation of those “values,” and pursuing those values would eventually damage them, no matter how lovely and idealistic the words tagging the concepts sounded. To open up the space of what is imaginable takes something beyond verbalization, and beyond mere empathy.
The natural process of human psychological development is a process of models and functions observing other models and functions. For example, someone who compulsively seeks attention by interrupting others’ conversations may notice that this bothers people, and feel ashamed; the compulsion is one function, the shame is another. The latter function is formed in observation and judgment of the former, and attempts to modify or control it. With these two functions at odds, the person faces a false dichotomy: to either “behave naturally” and be disliked by their peers, or to control themselves and be accepted, though not for who they feel they really are. Obviously, these are not truly the only two options in human behavior space. Upon fully metabolizing the desire for attention and the feedback loop it hinges on, such that the compulsion dissolves, the person will have much more range to behave “naturally.” They will be able to respond to the situation at hand in authentic ways, without sacrificing the regard of their friends. The process by which one function is observed, evaluated, commented upon and/or modified by another is the natural evolutionary tree of human psychology. Up to this point in our history, the environment that psychology evolves toward has been whatever culture and material space we happen to be embedded in, perhaps with some limited technical advancements from targeted spiritual practice and therapy. But the possibilities for targeted and technologically aided increases in agency—not agency in the sense of having more DoorDash options or more ability to lobby government via apps, but agency in the sense of being able to think new thoughts and generate new possibilities—are likely to be extremely under-explored in the nascent field of AI, simply because this is not a common understanding of the meaning of agency.
An Interlude With A Hypothetical Agency-Increasing AI
Imagine the following interaction between a distressed person and a specially designed chatbot built on existing LLM technology:
Jimmy comes home from hanging out with friends and opens up his interactive GPT diary. He tells it that he’s feeling like shit. “Why?” the chatbot asks him.
“I don’t know,” he tells it, “I just feel like shit. Everything sucks.”
The chatbot proceeds to prompt him further. “What is the ‘everything’ that sucks?” it asks. It does some pattern matching for him: “that reads to me like the kind of language people use when they’re generalizing from a bad feeling.” It prompts him to specify: “How would you describe your mood right now? Ashamed? Depressed? Cringey? Angry? You can pick multiple descriptors.”
Eventually, with the chatbot’s prompting, Jimmy tells it that he has just come from an interaction with friends where he made overtures to a girl and she didn’t laugh at his jokes. The chatbot asks him to recall other times he’s felt a similar feeling: “can you think of five or six other times you’ve felt like this and describe to me what happened?” It helps him draw parallels: “what are some common threads between those incidents?”
The special thing about this particular chatbot is that it has been designed to draw on a wide range of analytic commentary. It can make a statement like “that reads to me like the kind of language people use when they’re generalizing from a bad feeling,” because its training data includes examples of this kind of analysis, and it’s designed to prompt the user with these kinds of questions. It may be trained with input from therapists, neuroscientists, film and literature critics, writers, etc.—people who study the human condition from the inside and the outside—and it will also train itself on Jimmy, and train Jimmy on itself: together, Jimmy and the chatbot are able to build a model of the process that Jimmy runs whose output is his current mood. In answering the chatbot’s questions, Jimmy will begin to do a kind of phenomenological and experiential correlation he might otherwise never perform. The chatbot will act as an external memory bank for the concrete examples Jimmy gives it, along with Jimmy’s evaluation of the meaning of those examples, and the conversation will act as a concept formation session for Jimmy, who over time will start to see ideas like I’m not worth people’s time as the outputs of a recognizable cognitive pattern that he can reflect on critically rather than as facts about the world. Once he recognizes the pattern, it has much less power over him.
This system would not need to be particularly powerful. It doesn’t give Jimmy advice or do anything especially prescriptive; it is just an external tool for aiding self-reflective concept formation, and not a particularly sophisticated one—but it is crafted so as to constantly refer its interlocutor both to his own perspective and to perspectives outside of himself, and Jimmy’s trajectory could be profoundly affected by this simple loop alone. Far more sophisticated setups are possible, which similarly require no more “goal function” than something like this, beyond the implicit weights imparted by its trainers and its interlocutor (though these will be significant). Nothing in Jimmy’s chatbot or its potentially more advanced cousins requires the AI, or Jimmy, to have deep or special knowledge of humanity’s goals and values—or even just Jimmy’s goals or values—because the chatbot does not need to be a moral arbiter. It’s merely a pattern recognition machine with a Socratic questioning function. It’s not an agent, and it doesn’t need to be.
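To make the shape of this concrete, here is a minimal sketch in Python of the kind of loop described above, assuming an OpenAI-style chat-completions client. The system prompt, the JSON memory file, and the model name are illustrative stand-ins of my own, not a specification of how such a tool would actually be built or trained.

```python
# A minimal sketch of the "interactive diary" loop, assuming the OpenAI Python SDK.
# The prompt, memory file, and model name are illustrative placeholders.
import json
from pathlib import Path

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

MEMORY_PATH = Path("diary_memory.json")  # hypothetical external memory bank

SYSTEM_PROMPT = """You are a self-reflection aid, not an advisor or moral arbiter.
Do not give advice or prescriptions. Instead:
- Ask one Socratic question at a time about what the user is feeling and why.
- When the user generalizes ("everything sucks"), gently name the pattern and
  ask for the concrete incident behind it.
- Ask for five or six past incidents that felt similar, and ask the user what
  the common threads are; never assert the threads yourself.
Refer back to remembered incidents when they seem relevant."""


def load_memory() -> list[str]:
    """Load previously recorded incidents and the user's own readings of them."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return []


def save_memory(memory: list[str]) -> None:
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))


def run_diary_session() -> None:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    memory = load_memory()
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": "Remembered incidents:\n" + "\n".join(memory)},
    ]
    print("Diary open. Type 'quit' to stop, or 'remember: <note>' to store an incident.")
    while True:
        user_text = input("> ").strip()
        if user_text.lower() == "quit":
            break
        if user_text.lower().startswith("remember:"):
            # The user, not the bot, decides what an incident means; the bot only
            # stores it. New notes become part of the context on the next session.
            memory.append(user_text[len("remember:"):].strip())
            save_memory(memory)
            continue
        messages.append({"role": "user", "content": user_text})
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=messages,
        )
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        print(answer)


if __name__ == "__main__":
    run_diary_session()
```

Note that the only “values” encoded here live in a prompt that tells the model not to advise, and in whatever the user chooses to record; everything else is pattern recognition and question generation, which is the point.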
This chatbot isn’t a particularly brilliant idea, and there would be plenty of issues in making it work. But whether or not a self-observation chatbot is the answer to our problems, I think it is at least an answer to a better formulated problem than the problem of making benevolent AGI.
Agentically Creating the Future
The current wave of AI discourse, the current telos of our culture around the topic, is an example of human social epistemics gone awry in a parasocial world; people are shadow-boxing an imaginary enemy they’ve created together, in person but also massively online. They’ve synced up on a set of trust and in-group signals that amount to feelings of competition and horror and despair, and reinforced and spread them via social media, willing an arms race into being via their socially and parasocially reinforced fear of an arms race.
It would be better to de-fixate on the arms race, and instead imagine applications that are built to help ground people in reality, to explore where and why they respond to which sensations and drives, to know themselves better and give themselves more grace. I know that to many, this will sound like an irresponsible suggestion—isn’t it a terrible idea not to fixate on something that is so serious and urgent? But this reaction is coded into the ideology; those senses of importance and responsibility and urgency are artifacts of its way of seeing. It’s helpful to reflect on whether those feelings are serving you personally. If they are not, it’s much better to let them go. You can still engage with the world and the topic from outside of the particular ideology; in fact, you will have much more agency by doing so.
Ultimately, no matter how hard we try to replace our own agency, whatever we build will be its product—so let’s try to turn our existing agency towards increasing rather than decreasing our future agency, and grow ever in our beautifully incomplete understanding of ourselves and the world. Aligned AI is not the AI that encapsulates our goals and enacts them for us; aligned AI is the AI that helps us open up our exploration of how we ourselves can better see and create the kind of future we want for ourselves.
Miya Perry is an independent psychology researcher and executive coach.