S1: Achieving O1-Level Reasoning with Just $50
Li Fei-Fei's Team Trains an O1-Level Reasoning Model for Just $50!
The AI community has been buzzing about OpenAI's O1 model, which demonstrated remarkable test-time scaling and strong reasoning capabilities. However, the methodology behind O1 has remained undisclosed. Now a team led by Li Fei-Fei at Stanford has introduced S1, a reasoning model that matches, and on competition math even surpasses, O1-preview, all for roughly $50 in training compute.
What Is Test-Time Scaling, and Why Does It Matter?
Traditional AI models improve by increasing training compute (e.g., the GPT scaling laws). A newer paradigm, test-time scaling, instead enhances performance by allocating more compute during inference: rather than ever-larger datasets and more expensive training runs, compute is spent strategically at answer time. OpenAI's O1 hinted at this potential but didn't reveal how to achieve it. Enter S1: a radically simple and open approach to test-time scaling.
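Test-time scaling comes in several flavors. A minimal sketch of one of the simplest, self-consistency (sample several candidate answers and keep the majority vote), shows the core trade: more inference calls for higher accuracy. Here `generate_answer` is a hypothetical stand-in for any LLM call; S1 itself uses a sequential variant, budget forcing, described below.

```python
from collections import Counter

def generate_answer(question: str) -> str:
    """Hypothetical stand-in for a single (stochastic) LLM inference call."""
    raise NotImplementedError("wire this up to your model of choice")

def answer_with_more_test_time_compute(question: str, n_samples: int = 8) -> str:
    """Trade extra inference compute for accuracy: sample several
    candidate answers and return the majority vote (self-consistency).
    Raising n_samples raises both compute and, typically, accuracy."""
    candidates = [generate_answer(question) for _ in range(n_samples)]
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner
```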
Key Innovations in S1
S1 achieves its impressive reasoning capabilities through a few key innovations. Instead of relying on massive datasets, S1 was fine-tuned on just 1,000 carefully curated reasoning samples (s1K). Another breakthrough is budget forcing, a simple decoding technique that controls how many tokens the model spends reasoning, improving accuracy at a controlled compute cost. Despite these minimal resources, S1-32B outperforms O1-preview on competition-level math benchmarks, including AIME24 and MATH500.
How Did They Do It?
The foundation of S1's success lies in its curated dataset (s1K), which prioritizes quality over quantity. The research team selected 1,000 challenging, diverse questions from sources such as AIME, Olympiad competitions, and PhD-level problems, and distilled reasoning traces for them from Google's Gemini Flash Thinking model. Instead of training a model from scratch, they fine-tuned Qwen2.5-32B-Instruct, a publicly available model, completing the process in just 26 minutes on 16 NVIDIA H100 GPUs, roughly $50 worth of compute.
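As a rough illustration of how small this recipe is, here is a minimal fine-tuning sketch using Hugging Face's TRL library. The dataset name, field names, and hyperparameters are assumptions based on the public release; the team's actual training script may differ.

```python
# Minimal supervised fine-tuning sketch, assuming the public dataset
# name "simplescaling/s1K" and a recent TRL version; not the team's
# actual training script.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("simplescaling/s1K", split="train")  # ~1,000 curated samples

def to_text(example):
    # Assumed field names: join the question, its distilled reasoning
    # trace, and the final attempt into one supervised training string.
    return {
        "text": f"{example['question']}\n{example['thinking_trajectories'][0]}\n{example['attempt']}"
    }

ds = ds.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",   # base model fine-tuned by the S1 team
    train_dataset=ds,
    args=SFTConfig(
        output_dir="s1-sft",
        num_train_epochs=5,               # assumption: a few passes over 1K samples
        per_device_train_batch_size=1,
        bf16=True,
    ),
)
trainer.train()
```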
One of the most intriguing aspects of S1 is its budget forcing mechanism. By controlling how many tokens the model spends thinking, S1 can effectively double-check its own reasoning. If the model attempts to end its thinking phase too soon, the end-of-thinking delimiter is suppressed and the word "Wait" is appended, which often prompts the model to re-examine its work and correct wrong answers.
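To make the mechanism concrete, here is a minimal sketch of budget forcing with Hugging Face transformers. The model name and end-of-thinking delimiter are assumptions for illustration; the released S1 code defines its own chat template and delimiters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed names for illustration; the released S1 code defines its own
# chat template and end-of-thinking delimiter.
MODEL_NAME = "simplescaling/s1-32B"
END_OF_THINKING = "<|im_start|>answer"  # assumption: marks the switch to answering

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def budget_forced_generate(prompt: str, num_waits: int = 2, max_new_tokens: int = 2048) -> str:
    """Budget forcing, sketched: each time the model tries to end its
    thinking phase early, cut the output at the delimiter, append
    "Wait", and let it keep reasoning before it finally answers."""
    text = prompt
    for _ in range(num_waits):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        text = tokenizer.decode(output[0], skip_special_tokens=False)
        if END_OF_THINKING not in text:
            break  # the thinking budget ran out without an early stop
        # The model tried to stop thinking: suppress the delimiter, nudge it on.
        text = text.split(END_OF_THINKING)[0] + "Wait"
    # Final pass: let the model finish thinking and produce its answer.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

The design choice here is sequential rather than parallel scaling: each forced "Wait" buys more thinking tokens, and the paper reports that accuracy climbs as the thinking budget grows.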
S1 vs. OpenAI's O1: The Showdown
| Model | AIME24 (Math) | MATH500 | GPQA (PhD-Level Science) |
|---|---|---|---|
| S1-32B | 56.7% | 93.0% | 59.6% |
| O1-preview | 44.6% | 85.5% | 73.3% |
| DeepSeek R1 | 79.8% | 97.3% | 71.5% |
Takeaway: S1 beats OpenAI's O1-preview on competition math (AIME24 and MATH500) while using only 1,000 training samples and a simple test-time trick. DeepSeek R1 still scores higher overall, but it was trained with vastly more data and compute; among open reasoning models, S1 is arguably the most sample-efficient.
Why This Matters
S1 is significant for several reasons. It is one of the first fully open alternatives to OpenAI's O1, providing transparent access to its code, data, and model weights. It also demonstrates that strong reasoning models no longer require multimillion-dollar training runs; S1 reaches O1-preview-level math performance for about $50 of fine-tuning compute. More broadly, test-time scaling challenges the notion that larger training datasets always lead to better models, suggesting that smarter inference techniques can be just as powerful.
Where Can You Try S1?
You can find the full model, dataset, and code on GitHub, the model weights on Hugging Face, and the details of the research in the full paper.
With test-time scaling and budget forcing, S1 proves that strong reasoning doesn’t have to come with a massive price tag. This work could redefine how we build and scale next-generation AI models.
What do you think? Is test-time scaling the future of AI? Let’s discuss in the comments!