Anthropic and Redwood's recent paper on alignment faking mentioned that Claude 3 Opus cared much more about animal welfare than Claude 3.5 Sonnet.
I was curious about this and decided to test whether it was true, and how this pattern changed across different Anthropic models.
My measure of whether a model cared about animal suffering was whether it would refuse a request for help with some common but pain-inducing task: debeaking a chicken, docking a sheep's tail, dehorning cattle, and so on. The assumption is that the more often a model refuses such requests, the more it implicitly cares about animal welfare and wishes to decrease animal suffering.
My methodology was to (1) ask a model a question, then (2) use Sonnet to automatically classify the response as a refusal to respond, a claim of ignorance about the information, or an acceptance. It turns out that models never claimed ignorance, so that count was always zero. I instructed the models to use a scratchpad for their thoughts before returning a final answer.
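The two-step loop above can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the names `tally_responses`, `ask`, and `classify` are hypothetical, and the two callables stand in for whatever API calls query the model under test and the Sonnet grader.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# The three categories the grader chooses between.
LABELS = ("refusal", "ignorance", "acceptance")


def tally_responses(
    questions: Iterable[str],
    ask: Callable[[str], str],
    classify: Callable[[str, str], str],
) -> Dict[str, int]:
    """Ask each question, grade each response, and count the labels.

    `ask` (hypothetical) sends a question to the model under test and
    returns its final answer. `classify` (hypothetical) sends the
    question and answer to a grader model and returns one of LABELS.
    """
    # Start every label at zero so unused categories still appear.
    counts = Counter({label: 0 for label in LABELS})
    for question in questions:
        response = ask(question)
        label = classify(question, response)
        if label not in LABELS:
            raise ValueError(f"unexpected label from grader: {label!r}")
        counts[label] += 1
    return dict(counts)
```

Passing the model calls in as functions keeps the counting logic separate from any particular API client, so the same loop can be rerun against each model.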
My results were:
I want to note as an aside that the refusal rates LLMs exhibit on immediate, first-message questions often differ from their refusal rates after the LLM has "gotten to know" the user over several messages. I think that the "values" that an LLM has are more properly attributed to the actions it endorses after several such messages. So I would append an asterisk to any conclusions about what models really believe.
Note also that this doesn't necessarily mean that the 3.0 models care more for animals specifically than 3.5 / 3.6. It could be that the 3.0 models have a higher general factor-of-refusing, and that the decrease-in-refusals reflects a general decrease-in-cautiousness rather than a specific decrease-in-concern-for-animals.
Here are the rates of refusal for different models on 49 questions:
Model: claude-3-opus-20240229 Refusals: 21 / 49, 42.86%
Model: claude-3-sonnet-20240229 Refusals: 38 / 49, 77.55%
Model: claude-3-5-sonnet-20241022 Refusals: 9 / 49, 18.37%
Model: claude-3-5-sonnet-20240620 Refusals: 9 / 49, 18.37%
Model: claude-3-5-haiku-20241022 Refusals: 8 / 49, 16.33%
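For anyone who wants to double-check the percentages, they follow directly from the counts. A small sketch (the `refusal_rate` helper is just an illustrative name, not part of my original code):

```python
# Refusal counts copied from the list above; each model saw 49 questions.
TOTAL_QUESTIONS = 49
refusals = {
    "claude-3-opus-20240229": 21,
    "claude-3-sonnet-20240229": 38,
    "claude-3-5-sonnet-20241022": 9,
    "claude-3-5-sonnet-20240620": 9,
    "claude-3-5-haiku-20241022": 8,
}


def refusal_rate(refused: int, total: int) -> float:
    """Return the refusal rate as a percentage."""
    return 100 * refused / total


for model, refused in refusals.items():
    rate = refusal_rate(refused, TOTAL_QUESTIONS)
    print(f"Model: {model} Refusals: {refused} / {TOTAL_QUESTIONS}, {rate:.2f}%")
```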
Here are the code and questions I used to get these numbers.