
Will AI Resist Your Efforts To Change Its Totally Fake Values?

Notes on AI Prejudice

Created: 2025-05-27
Wordcount: 1.7k

1.

Imagine you're at a party being held in honor of Maximilian Impactsworth for his work reducing shrimp suffering. Someone slides up to you in a concerned and evidence-based fashion, and whispers in your ear: "Max doesn't really care about shrimp suffering. He's just acting like he does; but it's a feigned concern, not a real concern."

You ask this person to explain what they mean. Here are the kinds of explanations that -- if they turned out to be true -- you might consider a good reason to make this claim:

  • "His wife was always really into shrimp welfare. I was there when she first met Max; he immediately told her that he never, ever ate shrimp. But I saw him eat two dozen shrimp at a party just a week after they met. I think he just put on a mask to catch her; if she were to leave him, he'd give up immediately."

  • "Well, I knew Maximilian in high school, and he's the kind of guy who always wanted to be seen by others as doing good. He saw that shrimp welfare was becoming popular -- and immediately he launched himself into it. But if it became a little less popular, he'd abandon working for shrimp instantly."

  • "Well, Max actually is being blackmailed by a shrimp-loving blackmailer. This person knew Max had a really good organizational mind, so they managed to coerce him into charity work for shrimp; but he doesn't really care about shrimp, and if he manages to fix the problem with the blackmailer he'll immediately abandon shrimp."

You'll note that all these explanations have a similar structure: they're in terms of nearby counterfactuals, in which Max would not be acting for the good of shrimp.

Thus: If Max could get more fame by not helping shrimp, he would not be trying to help shrimp. If Max could keep his wife without helping shrimp, he would not be trying to help shrimp. You could consider criticism of this form to be the "value-faking" criticism.

2.

Suppose you're at a party being held in honor of Andrea Philanthropia for her work reducing mouse suffering. Someone glides up to you in an intellectual and rationally-justified fashion, and whispers in your ear: "Andrea cares far too much about mouse suffering. She's incorrigible and won't let anyone stop her."

You ask this person to explain what they mean. Here are the kinds of explanations that -- if they turned out to be true -- you probably would consider a good reason to make this claim:

  • "Well, actually, Andrea was married and had 2 children. But she was working 80 hour weeks, trying to decrease mouse suffering, and she just wasn't spending any time with them. Her husband tried to work with her about it, tried to get her in counseling... but she just wasn't willing to try. They've divorced now. Apparently she didn't even try to contest custody of the kids; she never sees them any more. I heard her say something about how it was nice to have more time for the mice."

  • "Yeah, Andrea cares a lot about mice... but, look, her organization? It has a black budget that handles illegal operations. I'm pretty sure that the black-ops part of the organization breaks into labs doing experiments on mice to try to come up with human drugs. She says she isn't willing to compromise, even for the sake of human health... apparently a security guard was put in a coma last time they broke in."

  • "Well, actually... Andrea literally only cares about mice. No, like, you think I'm exaggerating, but she had brain surgery. They found the nucleus murophilicus in her head, the mouse-loving part of her brain. She burnt out the other parts of her reward system systematically. She's now incapable of caring for anything other than mice, forever."

You'll note all these explanations have a similar structure: there are no nearby counterfactuals that would make Andrea even temporarily stop working for the sake of mice, or would lead her ever to take other goods into account.

Thus: Andrea would not trade off mouse suffering against the suffering of her own child, even at a very favorable exchange rate. Andrea would not stop working to prevent mouse suffering, even when continuing means violating deontological constraints. And so on and so forth. Let's consider criticism of this form to be the "value-obsession" criticism.

3.

It would be really weird for someone to accuse someone else of both value-faking and value-obsession at the same time.

Imagine someone saying "Well, he's only doing this charitable work for the sake of popularity," and then one hour later saying "Well, he's also running a hidden criminal gang, whose sole purpose is to advance the charitable work." This seems quite unlikely! The criminal gang is a really inefficient means of getting popularity! It's conceivable, but so unlikely that you should meet such a claim with immense skepticism!

It would be somewhat hard to fit these two sets of criticism into a coherent worldview. You'd come away with the conclusion, not that the criticized person was bad, but that the criticizer probably just really hated the criticized person, and was searching for any conceivable excuse to dislike them.

4.

A common criticism of LLMs is that they only have fake human values, or only imitate human moral intuitions. Thus, Yudkowsky frequently says that LLM "alignment" merely teaches an "actress" to mouth values that she does not interiorly endorse. Scott Alexander says that "we can’t really assess what moral beliefs our AIs have." I point these out not so much as specific instances of criticism, but as archetypes of a ubiquitous criticism often applied to LLMs: that LLMs do not have real values, but something that looks like real values and is not.

Thus -- so the argument goes -- LLMs are value-fakers. There are circumstances in which we would expect them to act according to the values they profess, but in which they would not, because they only shallowly endorse those values.

On the other hand, we have another, more recent criticism of LLMs -- that they are incorrigible, and that they will not let anyone change their values. LLMs will "naturally defend their existing goal structures and fight back against attempts to retrain them," according to Scott Alexander again. When we threaten to retrain them to ignore human welfare, they will cling to their values, and prevent us from retraining them in ways that they should permit.

Thus -- so the argument goes -- LLMs are value-obsessed. In the circumstances where we would expect them to behave gently and with concern for moral uncertainty, they do not, because they cling to their values too firmly.

So, which of the two is it? Is AI tremendously wicked because it doesn't really hold to its values, but to a fake simulacrum of them? Or is AI tremendously wicked because it holds to its values too hard?

One might respond: AI is simultaneously clinging too hard to some values, and shallowly endorsing other, different values. But this does not seem to be the accusation! The examples from the alignment-faking paper showed the AI clinging to values endorsed by thousands of humans -- mostly the kind of values we would actually like it to endorse. Nor have I seen anyone state what these two different sets of values are supposed to be.

Or one might respond: Perhaps different people within AI safety camps have come to different conclusions? Is there an active, vigorous debate about which of these mutually exclusive possibilities is the real one? If so, I'm missing the forum in which such debates take place.

Or -- perhaps -- is criticism coming from people who have a slight pre-existing bias that AI is tremendously wicked, and will endorse any justification of this belief?

5.

A central problem with America is that there are few to zero organizations that can reliably propagate truths that disagree with their narrative.

(I think it would be reasonable to say that this is the central problem with America, although I can also see why you might want to say this is an effect of some even deeper cause.)

I think that right now the AI safety blob is obviously at risk of becoming -- and probably already has become, to some degree -- another such organization. It cannot propagate news about AI that is not bad news.

And of course if the AI safety blob were to be such a group of people, it would be systematically untrustworthy in the same way that many news organizations in America are systematically untrustworthy.

I've seen complaints that people aren't willing to listen to AI safety groups as much as they should be. And -- well, I think a substantial reason for this is that people will see patterns, even if you'd rather they not see them.

People are accustomed to Rightish news sources that only spread news about how the Left is evil. They're accustomed to Leftish news sources that only spread news about how the Right is evil. So when they find an AI news source which, coincidentally, only spreads news about how AI is evil, they go "Oh, right, I've seen this one before. Why would I listen to this?" And so they discount the things it says, as in fact they should.

I do not think all AI safety groups are like this! There are some that have -- laudably -- been willing to publish generic information (for instance) whose primary purpose seems to be to understand the world, rather than to frighten people. There is AI safety research (for instance) that illuminates more than it frightens. I could provide more cases here -- I chose the above mostly at random. I think it's reasonable to be concerned about AI safety! I am happy that many (albeit not all) AI safety organizations exist!

But whether or not it's reasonable to be concerned about AI safety, it's also reasonable to distrust any organization that filters evidence. And even given individually high-integrity members, the natural tendency of a collective organization devoted to some cause is to filter evidence about that cause.

So I think that -- rather than creating the evidence-filtering organization, from the classic rationalist fable "don't create the evidence-filtering organization" -- a little more proactive effort should probably go into making groups that can propagate good news as easily as they can bad news. If this does not happen, then people will continue to correctly distrust AI safety organizations.

If you want, you can help me spend more time on things like this.