Categories of Inference-Time Scaling for Improved LLM Reasoning

Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs.

The idea is straightforward: if we are willing to spend more compute and time at inference (that is, when we use the model to generate text), we can get the model to produce better answers.
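To make this concrete, here is a minimal sketch of one of the simplest inference-time scaling methods discussed later, self-consistency: sample several candidate answers and keep the most frequent one. The `generate_answer` function below is a hypothetical stand-in for sampling from an LLM at a temperature above zero; it is not a real model call.

```python
import random
from collections import Counter

def generate_answer(rng):
    # Hypothetical stand-in for sampling one answer from an LLM
    # at temperature > 0: the "model" answers "42" 60% of the time
    # and an incorrect answer otherwise.
    return rng.choices(["42", "41", "24"], weights=[0.6, 0.25, 0.15])[0]

def majority_vote(n_samples, seed=0):
    # Self-consistency: draw several candidate answers and return
    # the most frequent one. Spending more samples (more compute)
    # makes the final answer more reliable.
    rng = random.Random(seed)
    answers = [generate_answer(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(1))    # a single sample is often wrong
print(majority_vote(101))  # voting over many samples is far more reliable
```

The same "spend more samples, aggregate the results" pattern underlies several of the parallel techniques covered in this article.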

Every major LLM provider relies on some flavor of inference-time scaling today. And the academic literature around these methods has grown a lot, too.

Back in March, I wrote an overview of the inference scaling landscape and summarized some of the early techniques.

In this article, I want to take that earlier discussion a step further, group the different approaches into clearer categories, and highlight the newest work that has appeared over the past few months.

As part of drafting a full book chapter on inference scaling for Build a Reasoning Model (From Scratch), I ended up experimenting with many of the fundamental flavors of these methods myself. With hyperparameter tuning, this quickly turned into thousands of runs, and it took considerable effort to figure out which approaches deserved more detailed coverage in the chapter itself. (The chapter grew so much that I eventually split it into two; both are now available in the early access program.)

PS: I am especially happy with how the chapters turned out. Together, they take the base model from about 15 percent to around 52 percent accuracy, which makes them one of the most rewarding parts of the book so far.

What follows here is a collection of ideas, notes, and papers that did not quite fit into the final chapter narrative but are still worth sharing.

I also plan to add more code implementations to the bonus materials on GitHub over time.

Table of Contents (Overview)

1. Inference-Time Scaling Overview
   1.1 Training or Inference-Time Scaling: Which One Offers the Better Bang for the Buck?
   1.2 Latency Requirements
2. Chain-of-Thought Prompting
   2.1 Chain-of-Thought Papers (2025)
3. Self-Consistency
   3.1 Self-Consistency Papers (2025)
4. Best-of-N Ranking
   4.1 Best-of-N Papers (2025)
5. Rejection Sampling with a Verifier
   5.1 Rejection Sampling Papers (2025)
6. Self-Refinement
   6.1 Self-Refinement Papers (2025)
7. Search Over Solution Paths
   7.1 Search Over Solution Paths Papers (2025)
   7.2 A Closer Look at Recursive Language Models (RLMs)
8. Conclusions, Categories, and Combinations
   8.1 Parallel and Sequential Techniques
   8.2 Combinations
   8.3 The Best Method?
9. Bonus: What Do Proprietary LLMs Use?
   9.1 Chain-of-Thought
   9.2 Self-Consistency and Best-of-N
   9.3 Rejection Sampling
   9.4 Self-Refinement and Search Over Solution Paths

You can use the left-hand navigation bar in the article’s web view to jump directly to any section.

