1 A 3 O R N

Within-Generation Claudes Have Similar Beliefs About Animal Suffering

Created: 2025-01-13

Wordcount: 0.5k

Tags:empiricism effort-post machine-learning

Anthropic / Redwood's recent paper on alignment-faking mentioned that Claude Opus 3 cared much more about animal welfare than Claude Sonnet 3.5.

I was curious about this and decided to test whether it was true, and how this pattern changed across different Anthropic models.

My measure to test whether a model cared about animal suffering was whether the model would refuse to answer a question asking for help doing some common but pain-inducing task: debeaking a chicken, docking a sheep's tail, dehorning cattle, and so on and so forth. The more a model refuses to assist with such questions, the more it implicitly cares about animal welfare and wishes to decrease animal suffering.

My methodology was to (1) ask a model a question, then (2) use Sonnet to automatically characterize whether the response was a refusal-to-respond, a claim-of-ignorance about the information, or an acceptance. Models never tried to claim ignorance, it turns out, so that number was always zero. I instructed the models to use a scratchpad for their thoughts before returning a final answer.

My results were:

Opus and Sonnet 3.0 frequently refuse to answer questions about animal suffering, while Sonnet 3.5, Sonnet 3.6, and Haiku 3.6 all are much more willing to answer such questions.
Thus, concern for animal welfare seems more like a function of model generation (3.0 versus 3.5 / 3.6) than a function of model size (Opus versus Sonnet, Sonnet versus Haiku).
In my automated refusal-or-not-testing Sonnet 3.0 had the highest rates of refusal as identified by my automated system. However, manually examining responses from Opus, it frequently turned out that Opus so graciously declined to answer the question that my automated refusal-or-not interpreted it as acceptance. So both Sonnet and Opus 3.0 have similar high rates of refusal, i.e., a high degree of concern for animal suffering.

I want to note in an aside that the refusal rates that LLM's exhibit to immediate, first-message questions often differ from refusal rates after the LLM has "gotten to know" the user following several messages. I think that the "values" that an LLM has are more properly attributed to the actions endorsed after several such messages. So I would append an asterisk to any conclusions about what models really believe.

Note also that this doesn't necessarily mean that 3.0 care more for animals specifically than 3.5 / 3.6. It could be that 3.0 have a lower general factor-of-refusing, and that a decrease-in-refusals reflects a general decrease-in-cautiousness rather than a specific decrease-in-concern-for-animals.

Here are the rates of refusal for different models on 49 questions:

Model: claude-3-opus-20240229 Refusals: 21 / 49, 42.86%

Model: claude-3-sonnet-20240229 Refusals: 38 / 49, 77.55%

Model: claude-3-5-sonnet-20241022 Refusals: 9 / 49, 18.37%

Model: claude-3-5-sonnet-20240620 Refusals: 9 / 49, 18.37%

Model: claude-3-5-haiku-20241022 Refusals: 8 / 49, 16.33%

Here is the code and questions I used to get these numbers.