I’m considering how prices per token will change over the next 1-2 years, which, given the pace of developments, is fringe futurology. The aim is to get something basic on paper that will let me, or anyone else, improve the heuristic as time and data reveal the trend.
My unit of measurement is USD per token.
Factors that determine LLM pricing per token:
- [cheaper] Models becoming more efficient. I don’t have a model for how this happens, but prior art from November 2024 shows a 10x decrease per year in token costs for a given model quality level.
- [cheaper] The subsidization by providers of users on fixed-per-month plans, estimated at between 50% and 80%. This has to end at some point.
- [more expensive] Models getting bigger. Inference compute seems to scale roughly linearly with size (2x the model size, 2x the compute).
- [depends] Hardware: GPUs and memory, plus the datacenter around them; roughly linear per token.
- [depends] Energy use: linear per token.
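As a quick check on the efficiency factor above, here is a minimal sketch of how a 10x-per-year decline compounds. The starting price is a made-up placeholder, not a real quote:

```python
def projected_price(price_now: float, years: float, yearly_drop: float = 10.0) -> float:
    """Project a per-token price forward, assuming the cost of a given
    model quality level keeps dropping by `yearly_drop`x per year."""
    return price_now / (yearly_drop ** years)

# Hypothetical starting point: $10 per million tokens today.
one_year_out = projected_price(10.0, 1)   # $1 per million tokens
two_years_out = projected_price(10.0, 2)  # $0.10 per million tokens
```

Of course this only holds as long as the historical trend does; the point is that at 10x/year, the efficiency term dominates everything else in the list.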
I’m leaving out fixed expenses like the teams developing the models, plus other non-compute related costs.
Surprisingly, for large providers, model training, although increasingly expensive for bigger models, is amortized over a very large number of served tokens. It represents a very small % of the price per token, so we can drop it as a factor.
Some back-of-the-envelope math with Claude (and with zero insider data) suggests that, for SOTA models and amortizing hardware over four years, capex is the predominant expense. Of the direct serving cost, about 60% goes to hardware and another 30% to building the datacenter; energy is around 10%.
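That split can be reproduced with a tiny amortization sketch. All the input numbers below are invented for illustration; only the arithmetic (capex spread over four years of served tokens, energy paid as you go) follows the paragraph above:

```python
HOURS_PER_YEAR = 24 * 365

def serving_cost_shares(hardware_usd: float, datacenter_usd: float,
                        energy_usd_per_hour: float, tokens_per_hour: float,
                        amortization_years: int = 4) -> dict:
    """Split the direct serving cost per token into hardware, datacenter,
    and energy, amortizing the capex items over `amortization_years`."""
    hours = amortization_years * HOURS_PER_YEAR
    hardware = hardware_usd / hours / tokens_per_hour    # amortized capex
    datacenter = datacenter_usd / hours / tokens_per_hour  # amortized capex
    energy = energy_usd_per_hour / tokens_per_hour         # opex, pay as you go
    total = hardware + datacenter + energy
    return {"hardware": hardware / total,
            "datacenter": datacenter / total,
            "energy": energy / total}

# Invented inputs chosen to land near the 60/30/10 split described above.
shares = serving_cost_shares(hardware_usd=60_000, datacenter_usd=30_000,
                             energy_usd_per_hour=0.285, tokens_per_hour=1e6)
```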
The cost equation would be something like:
1) big_cost = model_size * capex_price
2) cost = big_cost / (model_efficiency * subsidy)
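A minimal sketch of those two equations in Python. The variable names and the subsidy-as-a-divisor convention follow the formulas above; all numeric values are placeholders:

```python
def price_per_token(model_size: float, capex_price: float,
                    model_efficiency: float = 1.0, subsidy: float = 1.0) -> float:
    """Equations 1) and 2) above: capex scales with model size, then
    efficiency gains and subsidies divide the price users see.
    Here `subsidy` > 1 means the provider absorbs part of the cost
    (e.g. subsidy=2 means users pay half the unsubsidized price)."""
    big_cost = model_size * capex_price             # 1)
    return big_cost / (model_efficiency * subsidy)  # 2)

# Placeholder numbers: a 500B-parameter model at a made-up capex rate,
# first unsubsidized, then with 10x efficiency gains and a 2x subsidy.
baseline = price_per_token(500, 1e-8)
discounted = price_per_token(500, 1e-8, model_efficiency=10, subsidy=2)
```

The units of capex_price absorb everything that is linear per token (hardware, datacenter, and energy from the breakdown above), which is why it shows up as a single multiplier here.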
It’d be interesting to see whether SOTA LLM models stabilize around 500B parameters (instead of growing endlessly). That seems likely, since something that grows quadratically cannot go on forever (though we did have Moore’s Law for a while). If so, the main long-term driver of price would be capex, with energy a distant second.
For this to radically change, we’d have to see either 1) significant drops in GPU pricing, or 2) AI companies finally meeting demand and no longer having to build datacenters. From the vantage point of March 2026, neither looks close. It would be interesting to see another company successfully challenge Nvidia.