
CA1047 Must Be Changed: Notes on Ambiguity and Compute Trends

Created: 2024-05-29
Wordcount: 3k

Edit: This is about an earlier version of 1047; it has since been amended into a very different bill.

CA 1047, the California AI safety bill up for a vote this summer, must either be amended or rejected by the Assembly.

Specifically, the definition of "covered model" -- the AI models on which it imposes requirements -- should be changed. Otherwise, two predictable bad consequences will occur.

First, almost all of the actual impact of the bill will be determined by whoever is in charge of interpreting its deeply ambiguous definition of "covered model." This is contrary to the rule of law; discretionary power should not be so wide -- especially given that there is absolutely no need for it to be so wide.

Second, it will quickly start to apply to much smaller and much less rich organizations than the makers of the bill intend it to. Senator Wiener has clearly stated that he intends for the bill to impose "safety requirements only on... large and well-resourced developers building highly capable models at the frontier of AI development," but an elementary look at trends in costs of compute and algorithmic efficiency shows that this will stop being the case very quickly.

I'll explain why the above is so, then offer some easy changes that would help fix these problems.

Note: I think that things other than the definition of "covered model" probably need work. For instance, the requirement that open-weights models be shown to be non-hazardous despite unspecified quantities of further fine-tuning would kill open-weights AI; I'd similarly question the legal requirement that you use "best practices," because the meaning of "best practices" is currently deeply ambiguous and all current attempts to create standard "best practices" are blatantly partisan; the undefined benchmarks involved in qualifying for a limited duty exemption involve similar problems to those I describe below. But for now I focus my attention simply on the definition, because it's probably the most foundational issue.

What is a "Covered Model"?

Suppose CA 1047 passes. You want to train a large language model (LLM) in California. How do you not violate the law?

Well, if you want to train a model:

  • With less than 10^26 floating-point operations (FLOPs)
  • And you could "reasonably" expect that this model will be dumber than a model trained with 10^26 FLOPs in 2024 would have been.

Then you can ignore this bill, because your model is not a "covered model."
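
To put my reading of that definition in more schematic form -- this is just a sketch of the two-pronged test as I understand it, and the names are mine, not the bill's -- being "covered" looks something like this:

    # A sketch of the two-pronged "covered model" test, as I read the bill.
    # The names here are mine, not the statutory language.

    FLOP_THRESHOLD = 1e26  # the raw compute prong, fixed in the bill's text

    def is_covered_model(training_flops: float,
                         expected_to_match_2024_1e26_flop_model: bool) -> bool:
        """Covered if you cross the raw FLOP threshold, or if you could
        "reasonably" expect the model to perform as well -- on unspecified
        benchmarks -- as a hypothetical model trained with 10^26 FLOPs in 2024."""
        return (training_flops >= FLOP_THRESHOLD
                or expected_to_match_2024_1e26_flop_model)

The first prong is at least checkable before you start training. Everything that follows is about how little the second prong pins down -- and how quickly even the first one will stop tracking the frontier.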

A lot is riding on what it means for a model to be dumber. Does the law provide more concrete guidance here? Concrete guidance is particularly necessary because of the hypothetical -- what does it mean to do better than a model trained with 10^26 FLOPs in 2024 specifically?

The bill further specifies that we're concerned with intelligence "as assessed using benchmarks commonly used to quantify the general performance of state-of-the-art foundation models."

It creates a "Frontier Model Division" that is going to actually issue specific instructions about (1) which benchmarks to use and (2) what the actual scores on these benchmarks would need to be, to qualify as a covered model. Thus, section 4.10.A.1 states that the "Frontier Model Division" must issue:

Technical thresholds and benchmarks relevant to determining whether an artificial intelligence model is a covered model, as defined in Section 22602 of the Business and Professions Code.

So -- the most basic question for this law ("What benchmarks dictate that something is a covered model?") is punted down the road to whoever happens to be working in the Frontier Model Division.

There are several serious problems with this approach.

First, the bureaucrats assigned to choose these benchmarks are being given a completely undefined task.

I don't just mean that whatever bureaucrat has to decide this will find the task difficult, or might choose benchmarks that I would disagree with.

I instead mean that "the benchmarks that a hypothetical 10^26 FLOP model, trained with whatever typical practices exist in 2024, would reach" is a basically empty phrase. It's like saying "This will regulate luxury yachts that are as good as those built with 200 million dollars with the best shipbuilding techniques in 2024."

If you don't think about it too hard, you might be like "Oh, sure, it just regulates those that are... as good as the best ones. The yachts that all the people, the best people, are saying are the best yachts."

But when you think about it just a little bit longer you start to wonder what this means: Best yacht along what axis? Speed, comfort, range, cargo, seaworthiness, efficiency, ease of use, beauty, weapons systems? Resistance to pirates? Ability to flee Russia with 50 million in gems? Ability to blend seamlessly with the poorer people's yachts at Malta, and evade the KGB?

How are you going to write a regulation that specifies the best yachts by measurements along these axes, given that people value all these axes differently, and will spend their 200 million to improve alternate axes?

Similarly, LLMs can already vary enormously along many axes; different hypothetical people will spend their 10^26 FLOPs very differently.

For instance, some LLMs are trained mostly on computer code, and are correspondingly much, much better at code; some are trained mostly on human speech, and are therefore better at persuasion and literature. Some are strongly multi-lingual, and can talk intelligently in many tongues; some immediately become morons if you switch languages. Some models like Phi are trained on "synthetic data," which means they're shockingly good at the kind of task for which the synthetic data was generated, but more likely to have sudden holes in their knowledge; others are not trained on as much synthetic data, which may mean that they're less booksmart but better able to role-play 4chan greentext. And you can have all these variations even among models trained with the same number of FLOPs.

Thus, to say "The benchmarks that a hypothetical 10^26 FLOP model would reach, trained with hypothetical best practices existing in 2024" is simply undefined.

I'm actually underselling the problem by only focusing on text-specific foundation models. LLMs will vary even more in the future, as you can see by looking at some of the more advanced ones today.

GPT-4o is the latest model from OpenAI. It was trained from scratch on text, audio, and images -- while prior models had to be trained exclusively on text, for instance, and only later had image-understanding ability grafted on to them.

Yet GPT-4o is not uniformly better than prior GPT versions. According to many tests and anecdotes, GPT-4o is worse than many prior GPT-4 versions at coding, for instance.

This doesn't really match the OpenAI narrative of continual progress. But there's a simple probable explanation, hinging upon how GPT-4o was trained on images and audio tasks natively, while prior GPT versions were not.

For some tasks this cross-modal training helps -- there's a little transfer learning that improves performance. But for other tasks, this cross-modal data probably hurts -- there is competition for space in the model, such that gaining the ability to talk with a sexy voice might use up neurons that would otherwise go into C++ coding ability.

Future foundation models will have even more variance along these axes. Those training them will need to choose among different text / image / audio ratios for whatever downstream tasks they think are best, and among different kinds of text or image or audio, depending on their intended downstream performance.

So how do you choose the "right benchmark" that is equivalent to an imagined model trained with 10^26 FLOPs, given that whoever is training the (imagined) model might have different (imaginary) goals? Do you include benchmarks specific to image understanding or not? Do you include benchmarks that test audio comprehension?

Second, even if this were a well-defined task, it is nevertheless in some ways the most important part of the bill. Thus -- given that any concrete numbers given would be deeply controversial -- it should be the actual legislature that decides it, not whoever happens to be working for the Frontier Model Division.

Suppose it were a straightforward task of deriving these benchmarks for a model at 10^26 FLOPs? Great -- let's actually include the benchmarks in the bill then, rather than letting someone else decide!

Suppose on the other hand, it's a complex, weirdly hard task of deriving these benchmarks? Even more so -- we shouldn't let such an important thing be decided later, let's put it in the bill!

If there were actual, concrete benchmarks that had been proposed here -- rather than shunting the task onto invisible bureaucrats -- people would be in each other's faces about whether these benchmarks are good or not. This would be unpleasant; but it would be more healthy, and more in accord with basic democratic principles -- something doesn't become less controversial because you try to hide who decides it.

You can see how angry people would be in part by taking a look at prior benchmarks that the sponsor of this bill has proposed for regulating AI models.

The sponsor of the bill, the Center for AI Safety, has previously proposed that the US government should criminalize open source models that score better than 70% on the MMLU. That threshold sits at or below the level of models that have already been open-sourced, so it would have criminalized -- deep breath -- Llama 3 70b, Grok-1, DeepSeek V2, Yi 34b, Qwen 72b, Phi 3 Small, Phi 3 Medium, Mixtral 8x7b, and some Llama 2 70b fine-tunes, though it could not actually have criminalized half of these models, because half are Chinese or French.

Curiously, the Center for AI Safety claimed that this regulation would "only regulate a small fraction of the overall AI development ecosystem," similar to what they are claiming about this bill. Yet, although they proposed it just a year ago, I hope the examples above make clear that it would have destroyed enormous parts of the current open source ecosystem. Or at least -- it would have destroyed the American-made part of the current open source ecosystem. The Chinese and French parts of the ecosystem would have been fine.

So this bill basically shunts off all the actual law-making to whoever decides the benchmarks.

This is contrary to the rule of law. Laws should be passed by assemblies elected by people, not by whoever manages to slip into a newly made bureaucracy.

Third, the intent of the bill is, according to Senator Scott Wiener, to impose safety requirements "only [emphasis his] on the large and well-resourced developers building highly-capable frontier models."

This has been echoed by CAIS director Dan Hendrycks, who notes that covered models would "cost tens of millions or even billions to train." Scott Alexander notes that training a model costs "hundreds of millions of dollars" and so would impede only a few companies. This is even implied in the name of the new "Frontier Model Division."

Yet, by tying the bill's requirements to an unchanging FLOP count, and further tying them to the benchmarks pseudo-derived from this FLOP count, the bill's current wording guarantees that, relatively shortly, many non-frontier models made by smaller and less well-resourced developers will be covered by the bill.

How do we know this? We can look at historical trends in (1) the cost of compute and (2) how much easier it becomes to reach some given benchmark, given constant compute. Using this, we can calculate how much less it costs to reach some given benchmark over time.

The excellent research organization Epoch finds that cost-per-FLOP for GPUs halves roughly every 2 to 2.5 years. So, looking only at the decreasing cost of computer hardware, a 100-million-dollar model trained today would only cost ~25 million dollars in 5 years, and ~6 million dollars in a decade.

But benchmark-performance-per-dollar also gains from algorithmic progress. Algorithmic progress lets you reach the same measure on some benchmark with less compute. Epoch also finds that, thanks to algorithmic progress, the actual compute required to reach some benchmark halves approximately every 14 months or less. So if it costs 100 million worth of FLOPs to reach some benchmark at year 0, then after 5 years it would cost just ~6 million worth of FLOPs to reach the benchmark, counting only algorithmic progress and not the cost of GPUs.

Putting these numbers together, the cost of reaching any given benchmark should decrease by about 50-fold every 5 years.

(To be clear, I'm not sure that algorithmic progress will continue at the same rate that Epoch outlines; I wouldn't be surprised if it slowed. But I also wouldn't be surprised if alternative forms of compute suddenly made larger runs much cheaper; there's a lot of uncertainty out there. In any event, making something like this correction will clearly lead to more correct results than making no correction at all.)

Thus, if it costs 100 million dollars to train a model to some benchmark in 2024, then by 2029 it will probably cost < 2 million dollars to train to the same benchmark, which is a reasonable business expense for many small startups. By 2034, it will cost 40k, and literally millions of private individuals will be able to train one.
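
As a sanity check on that arithmetic, here is a minimal sketch of the compounding, using the two (assumed) halving times quoted above. The outputs are only as good as those inputs -- and if anything they come out a bit steeper than my rounded 50-fold figure.

    # Compound cost decline for reaching a fixed benchmark, assuming:
    #   - hardware cost-per-FLOP halves every ~2.5 years (Epoch-style estimate)
    #   - algorithmic progress halves the compute needed every ~14 months
    # Both halving times are assumptions carried over from the text above.

    HARDWARE_HALVING_YEARS = 2.5
    ALGO_HALVING_YEARS = 14 / 12

    def cost_to_reach_benchmark(cost_2024_usd: float, years_after_2024: float) -> float:
        """Constant-dollar cost to train to a fixed benchmark, years later."""
        hardware_factor = 0.5 ** (years_after_2024 / HARDWARE_HALVING_YEARS)
        algo_factor = 0.5 ** (years_after_2024 / ALGO_HALVING_YEARS)
        return cost_2024_usd * hardware_factor * algo_factor

    for years in (0, 5, 10):
        print(f"{2024 + years}: ~${cost_to_reach_benchmark(100e6, years):,.0f}")
    # 2024: ~$100,000,000
    # 2029: ~$1,300,000   (a bit under the ~$2 million figure above)
    # 2034: ~$16,000      (a bit under the ~$40k figure above)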

I know of no advocate for the bill who has bothered to discuss these trends accurately.

Scott Alexander contemplates the possibility that 10^26 FLOPs might be the kind of quantity of compute small companies can reach in thirty years. But this is a serious overestimate of the time required to reach this level of performance, as some simple math shows. Even if we assume that it costs 1 billion dollars to train a model to the somewhat-meaningless 10^26 FLOP-equivalent benchmark today -- probably an overestimate -- the trends above mean it would cost just 400k in ten years, which is easily within the range of capital expenditure for a fairly small business.
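
The same compounding, applied to that deliberately generous 1-billion-dollar starting figure, makes the point even more starkly:

    # Same assumed halving times as above (hardware ~2.5 years, algorithms ~14
    # months), applied to a $1 billion 2024 training run.
    cost_2034 = 1e9 * 0.5 ** (10 / 2.5) * 0.5 ** (10 / (14 / 12))
    print(f"~${cost_2034:,.0f}")  # roughly $160k -- even lower than the ~400k above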

So -- exceedingly predictably -- models trained by small companies or academic institutions will start hitting these benchmarks in the near future.

I assume that Senator Wiener has not been informed of these trends. But this means that Senator Wiener's desire for only large, well-funded companies to be governed by this bill would be very quickly frustrated.

Fixes

Regulation directly targeted at FLOP count is controversial among those involved in AI policy. On one hand, it is a clearly-defined quantity that can be known before training a model. It sidesteps thorny and difficult-to-answer questions like "Is Tiger-MMLU or MMLU or GPQA or GSM8K or MATH a better measure of capability?" and "Why are we trying to test general knowledge when what we actually care about is ability to make bioweapons? Isn't that like giving someone a biology test where 99 / 100 questions have nothing to do with biology?"

On the other hand, although models trained with more FLOPs are more powerful, it's controversial whether they're more dangerous. Regulation that directly targets FLOP counts will almost certainly catch many non-dangerous models as bycatch. But what it loses in precision it makes up for in clarity.

CA1047, by targeting both FLOPs and FLOP-equivalent benchmarks, gets the worst of both worlds. Because FLOPs are imprecise, it will inevitably cover many non-dangerous models; because benchmarks are unclear, it cedes immense power to whatever bureaucrat gets to decide them.

Therefore the authors of this bill should fix two crucial things.

  1. First, the bill should only target the FLOP count. The bill should only regulate models trained with 10^26 FLOPs, and have no clause giving it authority over models according to benchmark score.

This is the sine qua non, if you don't want the actual law-makers to be whoever decides the benchmarks, and if you don't want "covered models" to quickly expand to thousands of models.

Fortunately, this change is easy to make, and it would only make the law easier to administer.

  2. Second, the bill should be tied to an inflation-adjusted FLOP count.

Let's say we calculate how much money it would take to train a model with 10^26 FLOPs in 2024, and it turns out to be 200 million dollars. Every subsequent year, the Frontier Model Division calculates how many FLOPs you could get for an inflation-adjusted 200 million. This would then be the limit at which models start to be covered.
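
Mechanically, the yearly recalculation is trivial. Here is a sketch; every number in it is illustrative (I don't know what 10^26 FLOPs actually costs in 2024, and the 200-million-dollar anchor is just the hypothetical above):

    # Sketch of a dollar-anchored FLOP threshold. All numbers are illustrative.
    ANCHOR_BUDGET_2024_USD = 200e6  # hypothetical 2024 cost of 10^26 FLOPs

    def covered_model_flop_threshold(cumulative_inflation: float,
                                     cost_per_flop_now_usd: float) -> float:
        """FLOPs you can currently buy for the inflation-adjusted 2024 anchor budget."""
        budget_now = ANCHOR_BUDGET_2024_USD * cumulative_inflation
        return budget_now / cost_per_flop_now_usd

    # If $200M bought 10^26 FLOPs in 2024 (i.e., $2e-18 per FLOP), and by 2029
    # cumulative inflation is +15% while cost-per-FLOP has fallen 4x:
    print(covered_model_flop_threshold(1.15, 2e-18 / 4))  # ~4.6e26 FLOPs

The threshold then keeps tracking what frontier-scale spending can actually buy, instead of silently drifting down toward hobbyist budgets.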

As a compromise with people who are more safety-concerned than me, we could also have the FLOP limit increase over time at a fixed rate that is nevertheless less than the natural rate of decrease in per-FLOP prices. Suppose the FLOP limit for a "covered model" increased by 3% every month. This rate of increase would still be far, far under the natural rate by which FLOPs per training run increase in frontier models. Yet it would still keep small companies from hitting the FLOP limit for a very long time, as the comparison below illustrates.
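
To see why a 3%-per-month ratchet is still a real concession, compare the annual growth rates involved (the frontier-compute growth figure is my rough assumption of an Epoch-style estimate, not something in the bill):

    # Illustrative comparison of annual growth rates; all inputs are the
    # assumptions discussed in the text, not measurements.
    threshold_growth = 1.03 ** 12             # 3% per month -> ~1.43x per year
    flops_per_dollar_growth = 2 ** (1 / 2.5)  # hardware cost halving every ~2.5 years -> ~1.32x per year
    frontier_run_growth = 4.0                 # rough assumed yearly growth of frontier training compute

    print(f"FLOP threshold:   ~{threshold_growth:.2f}x per year")
    print(f"FLOPs per dollar: ~{flops_per_dollar_growth:.2f}x per year")
    print(f"frontier runs:    ~{frontier_run_growth:.2f}x per year")

A threshold growing ~1.43x per year outpaces the ~1.32x per year that a fixed budget gains from cheaper hardware, so small developers at a fixed budget stay under it for as long as those rates hold -- while remaining far below the several-fold yearly growth of frontier training runs.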

This would make me a little unhappy, and would probably make safetyists a little unhappy, and therefore seems like a reasonable compromise.

If something like these changes cannot be made, then the "Frontier Model Division" should simply be renamed the "AI Model Division", because that is what it will be in short order.

If you want, you can help me spend more time on things like this.