Compute, Data, and Victory
Both the US and China view AI as fundamental to determining which becomes the ultimate superpower. And China’s newest frontier model is a clear message:
You can’t stop us; we are catching up.
And that’s terrible news for the US in its aspirations to remain the supreme world power.
Until recently, the US was believed to hold a crucial advantage over China in the form of superior computing power. However, the release of DeepSeek’s v3 model has effectively nullified that advantage, marking a significant turning point in the AI Cold War and likely raising tensions in the coming months; the US retaliation came just recently, on January 13th, sending NVIDIA into panic mode.
But why has this model changed everything? Let’s dive in.
Understanding the Battle
Currently, the US leads in all four key areas, although I will explain how that could soon change.
The Great Four
- For starters, they hold the lead in intellectual property (IP).
The best models in the world are trained inside the US, including o1 and o3 from OpenAI, Claude 3.5 Sonnet and Opus from Anthropic, Gemini 2.0 from Google, and Grok-3 (currently in testing ahead of release) from xAI.
- They also hold the lead in data.
While pretraining data for large language models (LLMs) is primarily public and widely available, this changes for large reasoner models (LRMs) like o1, o3, or Gemini 2.0 Flash Thinking.
These models require reasoning data, where the problem and answer also include the intermediate reasoning trace that leads to the answer. On most of the Internet, people post their conclusive thoughts and solutions, not the inner dialogue and thought process they used to reach that conclusion.
Therefore, this data must be built from scratch, requiring top talent and deep pockets. Nobody can rival venture-backed companies like OpenAI or Anthropic in this way, so the best data in the world is also inside the US.
- They hold the talent lead.
Despite most of the talent being Indian and, funnily enough, Chinese, the US’s deep pockets convince many promising researchers worldwide to join the ranks of US companies.
- And most importantly, they hold the compute lead.
Without a doubt, the US’ most significant advantage right now is in hardware, with the top GPUs designed by US companies and manufactured by TSMC, based in US-ally Taiwan.
But how is the US trying to hold its lead?
The primary battle in the AI Cold War
Protecting IP is impossible; eventually, the top researchers spill the beans on those breakthroughs, even the ones that aren’t officially published.
But IP is useless if you can’t train or run the models. For that, you need access to compute.
Therefore, until now, the US’s primary weapon has been compute restrictions, preventing Chinese companies or the CCP from getting access to top GPUs. For instance, NVIDIA can’t sell its top-of-the-line products to Chinese companies, and TSMC can’t manufacture chips for Chinese companies like Huawei using its most advanced packaging methods.
Long story short, Chinese companies are severely ‘underserved’ when it comes to GPU access, with only some GPU smuggling through Malaysia alleviating the shortage.
But as I said at the beginning, the release of DeepSeek v3 means that this problem will soon no longer exist for Chinese companies.
Here’s why.
A Deep Pockets Game
In a standard, from-scratch procedure, training a state-of-the-art model is an eight-figure effort, even nine figures in some cases. And the following numbers will make you dizzy.
Without Cash, You’re Trash
With GPT-4, back in 2022, that number was around $80 to $100 million.
With o3, or Orion, that value has allegedly grown to $500 million. Those numbers probably do not account for the entire total cost of ownership (TCO), as salaries and hardware CAPEX are likely left out.
Nonetheless, according to The Information, OpenAI spent around $3 billion on training compute in 2024.
But why?
The Four Stages of Training
The reason it’s so expensive is that it requires the complete execution of four stages:
1. Pre-training
A vast data corpus is assembled and fed to the model, which has to process trillions of words: around 12 trillion for the Llama 3.1 models, a number a human reading 8 hours a day would need roughly 200,000 years to get through.
The last published training runs used around 10 to 100 million exaFLOP, that is, 10 to 100 trillion trillion mathematical operations (10²⁵ to 10²⁶ operations).
To put that number into perspective, the number of sand grains worldwide is estimated at 7.5×10¹⁸. Taking the middle of that range, roughly 5×10²⁵ operations, such a model performs about 6.6 million times more operations than there are grains of sand on planet Earth.
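For readers who want to check those figures, here is a back-of-the-envelope sketch in Python; the reading speed and the use of the midpoint of the compute range are assumptions of mine, chosen only to reproduce the ballpark numbers above.

```python
# Back-of-the-envelope check of the figures above. The reading speed (300 words
# per minute) and the midpoint of the compute range are assumptions made here
# purely for illustration.

words_in_corpus = 12e12                  # ~12 trillion words (Llama 3.1 scale)
words_per_day = 300 * 60 * 8             # 300 wpm, 8 hours a day
years_to_read = words_in_corpus / words_per_day / 365
print(f"Reading time: ~{years_to_read:,.0f} years")   # ~230,000 years, same order of magnitude

training_ops = 5e25                      # midpoint of the 10^25 to 10^26 range
sand_grains = 7.5e18                     # estimated grains of sand on Earth
print(f"Operations per sand grain: ~{training_ops / sand_grains / 1e6:.1f} million")  # ~6.7 million
```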
This phase also includes extensive data curation and deduplication, trying to maximize the quality of the training distribution. Nowadays, it also involves a lot of synthetic data generation.
Examples like Microsoft’s Phi family, where synthetic data represents around 75% of the total training distribution, involve using other models to generate the data, increasing the costs even further.
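As an illustration of what “using other models to generate the data” looks like in practice, here is a minimal sketch using the OpenAI Python client; the model name, prompt, and topic list are placeholders of mine, not anything Phi or any lab actually uses.

```python
# Minimal sketch of synthetic data generation: a stronger "teacher" model writes
# short textbook-style passages that are then added to the training corpus.
# The model name, prompt, and topics below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
topics = ["binary search", "photosynthesis", "compound interest"]

synthetic_corpus = []
for topic in topics:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder teacher model
        messages=[{
            "role": "user",
            "content": f"Write a short, clear textbook-style explanation of {topic}.",
        }],
    )
    synthetic_corpus.append(response.choices[0].message.content)

print(f"Generated {len(synthetic_corpus)} synthetic passages")
```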
2. Supervised Fine-Tuning
In the next stage, we curate a structured distribution of {question: answer} pairs, known as an instruction dataset, from which the model learns to act as a conversationalist and, importantly, to answer the questions it receives.
This data is primarily built from scratch, although you can find examples of these datasets for free at places like HuggingFace. Although this dataset is considerably smaller than the first one, building it from scratch is still expensive.
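To make the {question: answer} structure concrete, here is a toy instruction dataset in Python; the examples are invented by me, and the Hugging Face dataset name is just one well-known public instance, not what any frontier lab trains on.

```python
# A toy instruction dataset: each record pairs an instruction (question) with
# the answer the model should learn to produce during supervised fine-tuning.
instruction_dataset = [
    {"question": "Summarize the water cycle in two sentences.",
     "answer": "Water evaporates from oceans and lakes, condenses into clouds, "
               "and falls back as precipitation. It then flows through rivers and "
               "groundwater back to the oceans, repeating the cycle."},
    {"question": "Translate 'good morning' into Spanish.",
     "answer": "Buenos días."},
]

# Public instruction datasets can also be pulled from Hugging Face, e.g.:
# from datasets import load_dataset
# ds = load_dataset("tatsu-lab/alpaca", split="train")  # one well-known example
print(instruction_dataset[0]["question"])
```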
3. RLHF
Next, we have RLHF (Reinforcement Learning from Human Feedback), a weak form of Reinforcement Learning (but very useful nonetheless) that helps models make better decisions when answering.
Again, this phase’s dataset has to be built from scratch and requires human experts, such as PhDs, who are very expensive to hire. Also, this stage involves the use of three models (except in DPO, which requires two):
- The model being trained
- The reference model (usually, the same model after finishing stage 2).
- The reward model (again, a model of the same size but ‘repurposed’ to output reward values, not words)
The reference model ensures that the model being trained doesn’t deviate too much from its stage-two form (uncontrolled RL pipelines lead to models hacking the reward to maximize it while becoming useless in the process). The reward model scores the outputs of the model being trained, which uses this signal to improve itself.
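Here is a minimal sketch of how those three models interact, using the Hugging Face transformers library with tiny “gpt2” checkpoints purely as stand-ins; real pipelines are vastly larger and run PPO-style updates on top of this signal.

```python
# A minimal sketch of the three models involved in RLHF, using tiny "gpt2"
# checkpoints as stand-ins. The reward head below is freshly initialized, so its
# scores are meaningless; this only shows how the pieces fit together.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")      # the model being trained
reference = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen stage-two copy
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", num_labels=1)                                  # outputs a scalar reward

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# 1) The policy generates a candidate answer.
with torch.no_grad():
    sequence = policy.generate(**inputs, max_new_tokens=30,
                               pad_token_id=tokenizer.eos_token_id)

# 2) The reward model scores the full (prompt + answer) sequence with one scalar.
with torch.no_grad():
    reward = reward_model(sequence).logits.squeeze()

# 3) The reference model supplies log-probs for a per-token KL penalty that keeps
#    the policy close to its supervised fine-tuned form.
def token_logprobs(model, ids):
    logits = model(ids).logits[:, :-1]                     # predictions for tokens 1..N
    return torch.log_softmax(logits, dim=-1).gather(
        -1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

with torch.no_grad():
    kl = token_logprobs(policy, sequence) - token_logprobs(reference, sequence)
# Here policy == reference, so the KL is ~0; it grows once training diverges them.
kl_generated = kl[:, prompt_len - 1:]                      # penalize only generated tokens

beta = 0.1                                                 # KL penalty coefficient
shaped_reward = reward - beta * kl_generated.sum()
print(f"raw reward: {reward.item():.3f}, KL-shaped reward: {shaped_reward.item():.3f}")
```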
If you’re curious, the algorithm that governs this training is PPO (Proximal Policy Optimization), whose objective includes the scalar reward value and a KL divergence term to prevent that deviation.
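Written out, the KL-regularized objective that PPO is used to optimize in this setting typically looks like the following (notation chosen here for illustration: r_φ is the reward model, π_ref the stage-two reference model, and β the penalty coefficient):

```latex
% Standard KL-regularized RLHF objective (notation chosen here for illustration)
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[ r_{\phi}(x, y) \Big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```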
This phase is prohibitive for all but a handful of AI labs with hyperscalers behind them paying the bills, as generating each new prediction requires running up to three gigantic models simultaneously.
4. Reasoning training
In the last stage, we have reasoning training, in which these models undergo extensive training on new types of data that teach them to approach problem-solving step by step, disregarding time (the models aren’t incentivized to respond quickly, but to think for longer).
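To show what this reasoning data looks like compared with a plain instruction pair, here is a toy example in Python; the problem, trace, and answer are invented by me purely for illustration.

```python
# A toy reasoning-training record: unlike a plain {question: answer} pair, it also
# stores the intermediate reasoning trace the model is trained to produce before
# committing to a final answer. The content is invented for illustration.
reasoning_example = {
    "problem": "A train travels 180 km in 2 hours, then 120 km in 1.5 hours. "
               "What is its average speed for the whole trip?",
    "reasoning_trace": [
        "Total distance = 180 km + 120 km = 300 km.",
        "Total time = 2 h + 1.5 h = 3.5 h.",
        "Average speed = total distance / total time = 300 / 3.5 ≈ 85.7 km/h.",
    ],
    "answer": "About 85.7 km/h.",
}

# During reasoning training, the model is rewarded for producing the trace and
# the answer, not just the answer, and is allowed to think for as long as needed.
print("\n".join(reasoning_example["reasoning_trace"]))
```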