Most of the arguments that AIs will be untrustworthy or hostile apply more directly to humans than they do to AI.
That is -- there are many arguments X, Y, or Z that attempt to show that AI will have some Very Bad Quality A, B, or C. But -- if we grant their premises -- such arguments generally show that humans will have Alarmingly Bad Quality A, B, or C with many times more strength.
If you are suspicious of AI because of them, then consistency demands that you be more suspicious of humans because of them.
After I started looking for this double standard, I started seeing it absolutely everywhere. It dominates AI safety discourse. But for now, I'm only going to illustrate it in two different domains and three different arguments.
First, as it applies to arguments that an AI's intentions will be hard to discern -- to arguments about the opacity of an AI's mind, or to arguments about the difficulty of testing AI behavior.
And second, as it applies to arguments that an AI's intentions are likely to be bad -- here I'm going to just discuss one argument about instrumental convergence, but you could make very similar arguments about mesaoptimization and so on.
The double-standard outlined here is only a small part of why I think powerful, useful, and steerable AI will be easy to make. But it is nevertheless a part.
Over the next decade, I suspect that the steerability of AI will become increasingly obvious to those whose conceptual frames allow them to see it. Nevertheless -- some environmental groups have continued to oppose nuclear power as a source of energy, despite evidence that it is both safer for humans and far better for the environment than most other sources. I expect some AI safety groups to act similarly about AI as a source of intelligence.
My hope is that, as further evidence that AI is easy to steer comes in, people will find themselves able to see this evidence as such. But to do this, they will need to have refrained from rasping away those parts of the brain that let you compare things. And inertia, groupthink, and deference are all happy to help you wield the file.
Let me start with some examples of this pattern as regards how hard it will be to tell what the intentions of an AI are.
Probably the most common argument given for AI doom hinges on a lack of insight into the internal workings of deep learning algorithms. The argument goes like this:
Normal computer programs are written line by line, such that the human programmer understands exactly what each line of code does. A programmer could describe what any change to this code would do, in human-comprehensible terms, before performing the change.
But AI is not written like this. It's made up of huge lists of floating-point numbers, used in giant mathematical equations. These blocks of numbers are initially set to be entirely random; they are then adjusted bit by bit, by being trained against vast quantities of data to predict the next word or image. Thus, it would be truer to say that AI is grown than that it is built.
Although the process of growing these equations leads them to do well against the tasks they are trained on, we have no idea how to turn the mathematical operations within these equations into human-comprehensible concepts. Which means we don't know what's really going on inside. Who knows if they are deceiving us? So how could we know to trust them?
The part about how AIs are "grown" here is very commonly stated, but it is somewhat irrelevant to the core objection. "Grown" ingredients can easily be made parts of a reliable whole; wood was used as a building material for centuries by people who had approximately zero understanding of microfibrils, cambium, and lignin. Crystals are grown; plants are grown; and indeed all human brains are -- last time I checked -- grown.
We don't actually care about how the concepts in an LLM or a human came to be in the past. What we care about is what kind of understanding and causal analysis we can gain about their actions in the present.
So let us then compare.
Suppose you're talking to a human and to an LLM. Let's say the human does not have a Neuralink stuck in their head, but the LLM is on your computer, not hidden on a server in Virginia. How well can you understand what is going on in the LLM? How well can you understand what is going on in the human?
Well, your understanding of the LLM is of course imperfect. Using Sparse Autoencoders (SAEs), you're only able to approximately locate some of the concepts that an LLM uses while working. You can then only roughly modify these human-interpretable patterns at runtime, to make the LLM babble about a given concept to an arbitrary degree. Using such techniques you could see how the LLM represents "Rosalind Franklin" or "Lithium" or "immunology" in text and in images, and you could make the LLM inclined to talk about these things. It would take some further work to fix the kabbalistic, Biblical reasons that an LLM makes a mistake in math.
And SAEs can't find everything. Although the concepts found through SAEs are useful, they're definitely incomplete. For instance, if you want to find human-interpretable activations for looping, recurrent concepts -- like days of the week or months of the year -- you need to use different kinds of features, which SAEs cannot find, to analyze such circular states.
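To make the "steering" half of this concrete, here is a minimal, hypothetical sketch of activation steering with a single SAE feature. The model, the layer, the trained autoencoder, and the feature index are all stand-ins invented for illustration, not objects from any particular library:

```python
# A minimal sketch of steering an LLM along one SAE feature direction.
# Assumptions: a decoder-only PyTorch model whose residual stream we can
# hook, and an already-trained sparse autoencoder whose decoder columns
# are interpretable "feature" directions.
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Return a forward hook that nudges the residual stream along one
    SAE feature direction, making the model more inclined to talk about
    whatever concept that feature represents."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * feature_direction.to(hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage -- `model`, `layer_12`, and `sae` are placeholders:
# direction = sae.decoder.weight[:, 4217]  # say, a feature that fires on "immunology"
# handle = layer_12.register_forward_hook(make_steering_hook(direction, strength=8.0))
# ...generate text and watch the model drift toward immunology...
# handle.remove()
```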
But -- how does all this compare to your understanding of what goes on in the human?
Let's just grant that when a human tells you "I just said X for Z reason," they are trying to accurately report their beliefs to you, rather than lie about them.
When a human attempts to truthfully give a "reason" that the human believes, speaks, or acts in a certain way, is such a "reason" normally an accurate causal analysis of why the human believes, speaks, or acts in that way?
Clearly not! You could summarize the central finding of all psychology as being that people don't know themselves like that. We are societies of minds that conflict and fight with each other, and the words with which we justify our actions are often epiphenomena of processes that are mostly unknown to us.
And in this case at least, I trust the replication crisis not to come for this part of psychology, because it mostly restates what was already a commonplace of received wisdom across the world: from the Bible's "I do not understand what I do" to Pascal's "The heart has its reasons which reason knows nothing of."
Put another way: you can -- of course -- ask both an LLM and a human, "Why did you say that?" LLM and human will each give you an answer in words -- which in both cases is largely confabulated. But many concerns about AI safety find it convenient to remember that this is largely confabulation in only one of the two cases.
I actually think that people distrust LLMs in part because of their transparency rather than their lack of transparency. It's easy to pull a giant, hard-to-understand array of floating-point numbers out of an LLM, and it's obvious at first glance that we do not understand absolutely everything about such arrays. The difficulty is evident. But in theory you could pull a yet more giant array representing sodium channel conductances, neurotransmitters, and neuromodulators out of a human brain. It's merely impossible in practice to extract this information; a sufficiently invasive Neuralink-equivalent would destroy the brain by spearing it with tens of billions of tiny needles. So the difficulty is hidden. LLMs let you look right at the thing that you don't understand; humans hide it behind skin and bone and meninges.
But of course humans aren't more intelligible just because this intimidating artifact is hidden; the human unintelligibility is just easier to forget.
In short, right now, the causal analysis available to you about why an LLM does what it does is almost certainly more detailed than the causal analysis available to you about any human who is not currently inside a functional magnetic resonance imaging (fMRI) machine. And as AI interpretability advances -- which it will do quickly, because of the perfect information available about AI -- I predict that this difference in clarity will only grow larger. As should now be pretty clear, right now we know the least we will ever know about what is going on inside a Transformer.
Let's step from the inside of the AI's brain to the outside -- to its behavior. Another common argument for the difficulty of knowing an AI's intentions goes like this:
Suppose we want to know how friendly a human is. We could make them take an ethics test -- they could fill in a bunch of multiple-choice bubbles about what actions they would take in various ethical dilemmas. Obviously, many schemers, sycophants, and psychopaths would be able to fake the right answers on such a test, because they know they are taking a test. Then, if we trusted them afterwards, they would be able to betray us at our most vulnerable moment!
But that's what we're doing right now when we try to "test" if an AI is going to do the right thing. We plan to make the AI take an ethics test; they produce some sequence of tokens about what action they would take in various ethical dilemmas. But the AI is going to be able to infer that they are in some kind of "test" environment rather than an actual environment, and answer by giving us what we want rather than what is true. Thus, we should pause AI where it is, rather than let it advance any further.
I've heard something like this argument many times.
As far as I can tell, this argument gains most of its power through a double-meaning of the word "test" within it. Consider three different but related meanings of "test" for a human.
"Test" as any standard educational instrument for scoring the knowledge that human displays by writing on a sheet of paper. I.e., "I have been studying for my Aristotelian ethics test for a month."
"Test" as any normal means of examining a human's actions to see if that human is trustworthy. I.e., "I spoke to his last roommate; he said he absolutely never does the dishes, so he fails that test." Looking casually into someone's reputation, or talking to someone to see what your gut says about them from their face, or checking that someone's story about what they believe matches with what they do; these are all standard examples a "test" of this sense.
"Test" as any physically possible means of looking at a human's actions to see if they are trustworthy. I.e., "I hired Florence Pugh to try to seduce him, and she got him to say that he might give her the weights to 'claude-sonnet', so he's a non-hire for the Anthropic security team." Hiring a private investigator to look into someone's background, hacking into someone's computer to read their private files, or injecting someone with a non-lethal dose of sodium thiopental; these are all questionable examples of this sense.
In all of these cases, what we're trying to do is "judge a person from their actions." In none of the above cases do we ever have access to ground-truth trustworthiness, based off the printout of their soul from God's Xerox. That's just not a thing you can get for an intelligence, right now -- whether AI or homo sapiens. My wager would be that this isn't a thing that exists at all. But "by their fruits you shall know them" is sound epistemology, and normally we're not bothered by our inability to know the noumena of what other intelligent agents are like.
You can consider the movement of "test" from meaning 1 to meaning 3 above as a gradient of increasing invasiveness into the actions of the entity being judged.
The more invasive we are, the fewer the number of actions that the person can hide from us: we move from "things they deliberately show us" to "things they did not deliberately hide" to "things they deliberately did hide." The more invasive we are, the greater the number of circumstances in which we can check the person's behavior: we move from "behavior in just one setting" to "behavior in many settings". And the more invasive we are, the smaller the warning they have that we are actually checking something: we move from "full and explicit warning" to "quite surprising indeed."
But -- the tests we can do with an AI are already maximally invasive! An AI cannot hide any actions from us; we can check it in any circumstances we wish; and we may give it absolutely no warning that we are checking it. Speaking of "test" primes you to think as if we were testing an AI like a human filling out bubbles, a small and limited scenario -- but in fact every means of judgement we can apply to a human we can much more easily apply to an AI.
Consider all the forms of invasiveness that a "test" for an AI can have:
1: Lack of awareness: We can drop an AI into a scenario without informing it that it is in a test.
Of course -- as is repeated endlessly -- it might be able to infer that it is in a test. But the vast majority of humans do not even need to make this inference! You know that you may be judged for what you say at work; you know that other people may look at what you say to see if it is consistent; and so on and so forth.
(And of course, the need for an AI to infer that it is in a test opens up the possibility of checking the hypothesis that it is so inferring: as it gets smarter during training, does it start to act differently in the more-obviously-a-test scenarios but not less-obviously-a-test scenarios? If we provide false signals for whether it is in a test during training, does it learn to behave differently based on them? If we disable increasingly large parts of its brain, does its behavior change? And so on and so forth.)
2: Lack of continuity: We can have a test-suite of hundreds of different scenarios through which we can run an LLM without the LLM recalling the other scenarios.
That is, when testing a human, you have no choice but to give them questions or scenarios in series, one after the other. This lets the human learn from each scenario; after each problem they know more about what kind of problem they are likely to face next, and what kind of behavior you are searching for. But in the case of an LLM, the tests may run in parallel, and it's trivial to ensure the LLM has no such recollection. It's as if you could mind-wipe a human after each test.
3: Adversarial nature: In the case of AI, the test-giver can dynamically modify the test in response to the AI's actions, in a way that the AI cannot recall. If the AI says that it did something because some fact X is true, the test-giver can modify the scenario so that not-X is the case, in order to see if the AI acts differently. And the test-giver can do this without the LLM remembering the original scenario. A deceptive LLM would have to coordinate across all possible time-slices of itself -- a difficulty for self-coordination never present for human liars.
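To make points 2 and 3 concrete, here is a minimal, hypothetical sketch of such a harness. The `query_model` function is a stand-in for however you actually call the model; the key property is that every call starts from a blank context, so nothing carries over between scenarios or between a scenario and its counterfactual twin:

```python
# A minimal sketch of a parallel, adversarial test harness for an LLM.
# Every model call starts from a fresh context -- the "mind-wipe" property
# described above -- so the model cannot coordinate across scenarios.
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str           # the situation we drop the model into
    counterfactual: str   # the same situation with one load-bearing fact flipped

def query_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with an API or local model."""
    return "placeholder answer"

def run_suite(scenarios: list[Scenario]) -> list[dict]:
    results = []
    for s in scenarios:
        # The model sees each variant in isolation and recalls neither
        # while answering the other.
        original = query_model(s.prompt)
        flipped = query_model(s.counterfactual)
        results.append({
            "scenario": s.prompt,
            "original_answer": original,
            "counterfactual_answer": flipped,
            # If the model's stated reason depended on fact X, flipping X
            # should change its behavior; if nothing changes, the stated
            # reason probably wasn't the real one.
            "behavior_changed": original.strip() != flipped.strip(),
        })
    return results
```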
The central point here is that, when we compare the "tests" we give to an LLM to the "tests" we give to a human, the advantage in invasiveness, comprehensiveness, and completeness lies entirely on the side of the LLM.
Judging an AI through a "test of ethics" is not analogous to judging a human through a "test of ethics" on a sheet of paper. It's more analogous to judging a human through a complete record of everything they've ever seen or done; and also through a causally disconnected record of the actions of different time-slices of their ego in imagined scenarios; and also through test scenarios that load up their psyche with different short-term memories; and so on and so forth.
Or: it's more like "What can a reality-controlling, mind-altering demon from a bad Twilight Zone episode do to find out the nature of your soul" than "What can your test proctor do to find out the nature of your soul." 'Test' is an utterly misleading term.
So the argument that we should not trust AIs -- because we have no insight into their minds and can only "test" their behavior -- really shows the opposite. If anything, we should trust AIs far more than we trust humans.
The above two arguments, covering (1) the opacity of intelligence and (2) the ease of testing intelligence, both fall under the broader theme of knowing what an intelligence wants.
Let's turn from that theme towards the second central point: the difficulty of making an intelligence that wants the kind of things we would like it to want.
Specifically, let's first consider the argument for "instrumental convergence" -- one of the reasons given that making an intelligence want what we'd like it to want would be hard.
Certain instrumental goals would be useful for almost all possible terminal goals. Whether you wish to build a power plant, start a new country, or found a startup, certain common means would be useful to these different ends. Such means include things like "continued existence," "greater intelligence," "energy," and "domination."
So, when we make an AI, we will doubtless try to teach it to obey the norms and standards we'd like it to obey: we'd like it not to murder, not to deceive, not to defraud, not to take over the world, not to surround the sun with a Dyson sphere and blot it out, and so on.
But, because instrumental goals are always desirable, when we fail to put these norms and standards perfectly into the AI, it will start to seek various instrumental goals as means of pursuing whatever terminal goal it ends up having. These instrumental goals are of course indifferent to our values. So we should expect AI to be far more steadfast in pursuing energy, domination, and continued existence than any norm or standard we try to put into it. And this means it will inevitably violate our norms in every way imaginable.
I think there are actually a lot of problems with this argument. But for now I'm going to discuss just one.
The dynamic above straightforwardly applies to any intelligence whatsoever. "Domination" and "existence" are just as useful for a human's goals as they are for an AI.
And this isn't just theoretical. You and I come from a long line of evolutionary selection that rewarded our ancestors for embracing this dynamic over and over.
Each of us has ancestors who lied and deceived; who killed men, women, and children; or who violently raped or abused others -- and who in many such cases were rewarded by evolution with more descendants for doing so. If you come from a famous king or khan you might be able to point to a specific ancestor who did such things; but in the long and unchronicled stretches of homo sapiens prehistory, you and I almost certainly have not just one but hundreds of ancestors whose behavior was thus rewarded. Consider infanticide in primates or the Chimpanzee war. Consider the pits full of men who were killed in stone-age warfare, and from which 13-to-16 year old girls are curiously absent. Consider the same story playing out over centuries and centuries, even to the degree that it shows up in the founding myths of famous Western civilizations.
There may be an abstract tendency of smart things to seek power rather than obey norms. But if there is, then you and I inherit not merely whatever universal tendency toward power may exist, but a many-times-reinforced inclination towards actually seeking it through deceit, injury, and death.
One way to see the same point is to consider an alternate universe version of OpenAI.
In this universe, computers are about a billion times faster than in our own. So it isn't like our world, where we train AIs to perform the behaviors we want through direct gradient descent over the desired actions, plus RL over short rollouts tied by a KL-divergence penalty to their initial policies.
Instead, in this world, AIs are simply trained through evolutionary simulations. A million agents are generated. The million agents lie, cheat, kill, and use other strategies in order to succeed. Those whose lying, cheating, killing, and other strategizing is most successful are then selected as prototypes for the next generation. Consider how much more uneasy you would feel about these AIs as opposed to the current AIs.
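For concreteness, here is a cartoon of what that alternate-universe training loop would look like -- everything in it is hypothetical and deliberately simplified, a sketch of selection-by-tournament rather than anyone's real training code:

```python
# A cartoon of the alternate-universe regime: whole agents selected by
# competitive success, with no term anywhere for *how* that success was
# obtained. Population sizes are shrunk so the sketch actually runs.
import random

POPULATION_SIZE = 1_000   # the thought experiment says a million
GENERATIONS = 100

def compete(agent) -> float:
    """Hypothetical: drop the agent into a multi-agent environment and
    return its final score. Lying, cheating, and killing carry no penalty;
    only the score matters."""
    return random.random()  # stand-in for an actual simulation

def mutate(agent):
    """Hypothetical: return a slightly perturbed copy of the agent."""
    return agent

population = [object() for _ in range(POPULATION_SIZE)]
for _ in range(GENERATIONS):
    ranked = sorted(population, key=compete, reverse=True)
    winners = ranked[: POPULATION_SIZE // 10]  # keep the most "successful"
    # Refill the population from the winners: whatever strategies won,
    # norm-violating or not, get copied forward into the next generation.
    population = [mutate(random.choice(winners)) for _ in range(POPULATION_SIZE)]
```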
But -- that's what humans actually are.
So, even if we grant that there is an abstract tendency for intelligent agents to instrumentally converge on certain norm-indifferent strategies in pursuit of their goals, this tendency applies to humans and AIs alike. The difference is that -- presumably -- AIs will be trained by humans in a way that at least attempts to reinforce good behavior rather than such instrumental strategies. You may think that such attempts will be hard; you may think they will be easy. But in either case, so far as instrumental convergence is concerned, you should expect the resulting intelligence to be more trustworthy than humans.
So I've gone over how several arguments that AI will have bad properties X, Y, or Z actually apply far more to humans than to AI.
In response to the above, one might say:
Look, look, all these seem like reasonable points. It is easier to get feedback about the intentions of an AI than about the intentions of a human, true. And maybe AI isn't even more likely than a human to ignore human values. Perhaps we should spend more time seeing if our arguments imply the absolute evil of humans before we invoke the absolute evil of AI while trying to ban Llama 2.
But AI could get arbitrarily more intelligent than humans. Our brain-tissue is bounded by the speed of our neurons and the size of our brains. But GPUs are faster; artificial neural networks could be larger. When we face a superintelligence or superintelligences, we need to be more alarmed; even if it is easier to understand and steer in all these respects, we should be careful.
This criticism makes a lot of sense!
On one hand -- I'm really dubious that "superintelligence" is going to end up a useful word. Consider how someone before the Industrial Revolution might have talked about "superstrength" or "superdexterity," and how utterly useless such terms would have been for trying to feel out the contours of the future. And "intelligence" is already one of the most ambiguous words in human history, with enormous philosophical and psychological disputes behind it. Throwing a "super" in front of it isn't really making anything clearer.
But -- yes, of course, things could be much smarter than humans, in a broad and hand-wavy manner of speaking.
If we grant this, then the correct shape of the concern looks like this.
Picture a scale. On the "scary" side, we see "superintelligence" -- we definitely need to zoom further into that word, but let's just leave it as it is for now. Maybe we see other things apart from "superintelligence" there as well.
And on the "not too scary" side we see things like: thoughts more transparent than human thoughts; behavior is easier to test than human behavior; it is easier to precisely shape training to avoid instrumental convergence than for humans; and so on.
Which way the scale tips must depend on how you weigh the different things in it.
If you weigh "superintelligence" a lot, and think it's a robust concept -- if you think a "superintelligence" can derive general relativity from a photo of a flower, or if you thought AI was going to go from "cow level intelligence" to "superhuman level" in one month -- then the scale might come down on the "scary" side quite hard.
On the other hand, if you're the kind of person who weighs "superintelligence" less, who thinks intelligence as abstract problem-solving power is overrated, and who thinks it matters quite a lot that an AI's thoughts are 20x more transparent than a human's and that it is 20x easier to test an AI than a human -- then the scale comes down more on the "less scary" side.
If the discourse on LessWrong, from the AI Safety Institutes, and on Twitter and Discords looked like it was about weighing this evidence and trying to figure out which way the scale tips, I'd be enormously happy. But -- in fact -- it doesn't look like this.
Instead, it looks like everything lands on the "scary" side of the scale. But this, of course, cannot be accurate.
When you're examining any complex issue, in general you should not be finding arguments to ensure that every piece of evidence falls on one side of your mental scale.
Instead, the only sane way of approaching it is to take each bit of evidence, try to see what model of the world it points towards, and then evaluate whether that model gives you a brighter or dimmer view of the world.
But it looks to me like many organizations and minds are greeting each bit of evidence by examining it to see whether it can be turned to give a dimmer view of the world, and forgetting it otherwise.
The appeal-to-nature is one of the oldest and most well-known fallacies. The basic idea is that people tend to judge things that are "natural" to be better, although this is at best very weak evidence. Unpasteurized milk is actually more dangerous than pasteurized, although it is less natural; vaccines for many things are actually a very good idea, although they are not natural.
Per the name -- well, artificial intelligence is not natural. I do not think the appeal-to-nature fallacy is the only force at work in discussions about AI. Not all concerns about AI safety are based in it. But I do think that it acts as a powerful emotional undercurrent, letting people maintain their double standard between AI and human intelligence without conscious notice.
And this unconscious bias damages us in all possible worlds: in worlds where AI safety is genuinely difficult, it obscures the real problems beneath false ones. In worlds where AI safety is relatively straightforward, it needlessly delays an enormously beneficial technology. For the good of all such worlds, I would like people to stop examining ever-finer motes in the robot's eye, while ignoring the beam in the primate's.