Small Language Models Make More Sense for Agentic AI

There is a common misconception that the bigger an LLM is, the better it performs. Since the emergence of large language models (LLMs) like GPT-4 and Claude, AI labs and researchers have been racing to build ever-larger systems with more parameters, greater computing demands, and higher costs.

For instance, OpenAI, SoftBank, and Oracle plan to spend $500 billion on The Stargate Project to build a network of AI data centres and supporting energy infrastructure in Texas and other locations. The goal is to expand the computing capacity required to develop and run advanced AI models, particularly those powering OpenAI's ChatGPT.

On the other hand, Meta is on a hiring spree to build superintelligence. 

However, a recent position paper from NVIDIA Research makes a provocative but evidence-backed argument: small language models (SLMs) are not just good enough, but better suited, more flexible, and far more economical for agentic AI.

As the paper puts it, “SLMs are the future of agentic AI.”

“The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialised tasks repetitively and with little variation,” the research paper stated. 

Agentic AI is Different

According to the paper, agentic AI refers to systems that break down tasks into smaller steps, make decisions about tool use, and perform functions like scheduling, document generation, or code execution. These agents seldom require the full spectrum of natural language understanding that a general-purpose LLM offers. Instead, what they actually need is precision, speed, and low operational cost.

“The majority of agentic subtasks in deployed systems are repetitive, scoped, and non-conversational,” the paper noted. 

That’s a crucial insight. If the task is to generate API calls, validate structured inputs, or produce JSON-formatted output, then using a full-scale LLM is not only unnecessary but also inefficient.
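To make this concrete, here is a minimal sketch of the kind of narrow, repetitive call the paper describes: an agent asks a model for a JSON-formatted tool call and validates the output before it enters the pipeline. The `call_slm` function is a hypothetical stand-in for local SLM inference (e.g. via llama.cpp); the field names are illustrative, not from the paper.

```python
import json

# Fields a valid tool call must contain, with their expected types.
# (Illustrative schema, not taken from any real agent framework.)
REQUIRED_FIELDS = {"tool": str, "arguments": dict}

def call_slm(prompt: str) -> str:
    # Stub: a real system would run a local small model here.
    return '{"tool": "schedule_meeting", "arguments": {"time": "10:00"}}'

def parse_tool_call(raw: str) -> dict:
    """Parse a model's JSON output and validate it; raise if malformed."""
    call = json.loads(raw)  # raises ValueError on invalid JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(call.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return call

call = parse_tool_call(call_slm("Schedule a meeting at 10:00"))
print(call["tool"])  # schedule_meeting
```

Because the output format is fixed and machine-checked, a small model fine-tuned on this one schema can serve the call reliably; the general-purpose breadth of a frontier LLM adds nothing here.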

The research team analysed three popular open-source agentic systems—MetaGPT, Open Operator, and Cradle. They found that 40% to 70% of LLM calls in these systems could be replaced with well-tuned SLMs today.

SLMs Can Already Do the Job

The paper defines an SLM as a language model that can fit onto a common consumer electronic device and perform inference with latency low enough to be practical when serving the agentic requests of a single user.

In simpler terms, an SLM should be small and efficient enough to run locally on a laptop, smartphone, or personal GPU, while still being fast and useful for real-world AI agent tasks.

Google’s latest model, Gemma 3n, is a good example of this. It supports text, image, and audio inputs, includes video processing capabilities, and operates with a dynamic memory footprint of just 2-3 GB, thanks to Google DeepMind’s per-layer embeddings innovation.

“SLMs are lightweight versions of big AI tools like ChatGPT or Llama. But unlike those massive models trained on everything from across the internet, SLMs are designed for specific jobs,” Varsha Medukonduru, data engineer at University of Mary Hardin-Baylor, said.

She added that instead of trying to know it all, SLMs learn only from the information they are provided. This makes them faster to train, cheaper to run, and often more effective for focused tasks.

Similarly, Rakesh K, founder of Coder’s Gyan, wrote in a post on X, “People say LLMs are trying to replace humans…But, actually, SLMs are already starting to replace LLMs…I’m not joking.”

“You don’t really need huge LLMs everywhere in your agentic AI apps. You can use these SLMs,” he added. 

It’s easy to assume that small equals less capable. But that’s no longer the case. Recent advances in architecture, training techniques, and fine-tuning have enabled SLMs with under 10 billion parameters to match or outperform older LLMs in key benchmarks. 

Take Microsoft’s Phi-2 with 2.7 billion parameters, which matches 30B parameter models in common sense reasoning and code generation while running 15 times faster, or NVIDIA’s Hymba-1.5B, which outperforms 13B models in instruction-following with over three times the throughput.

Similarly, Hugging Face’s SmolLM2 series (ranging from 125M to 1.7B parameters) competes effectively with 14B contemporaries and even performs comparably to 70B models from two years ago, particularly in tasks involving tool use and instruction-following.

Meanwhile, DeepSeek-R1-Distill-Qwen-7B has demonstrated reasoning capabilities that surpass even GPT-4o and Claude 3.5 Sonnet, both considered among the most advanced proprietary LLMs today.

NVIDIA’s research paper notes that with modern training, prompting, and agentic augmentation techniques, “capability, not the parameter count, is the binding constraint”, suggesting that what a model can do matters far more than how big it is.

Previously, AIM spoke to Harkirat Behl, a researcher at Microsoft, who was instrumental in creating the Phi family of models. 

“Big models are trained on all kinds of data and store information which may not be relevant,” Behl said. He added that with sufficient effort in curating high-quality data, it is possible to match the performance levels of these models, and perhaps even surpass them. Moreover, Microsoft hasn’t experimented with inference optimisation for Phi-4; the focus has mainly been on synthetic data.

In Phi-4, synthetic data was used in both the pre-training and mid-training phases. Microsoft said that synthetic data serves as a more effective mechanism for the model’s learning by using structured, diverse and nuanced datasets.

Microsoft’s detailed technical paper describes numerous techniques, with the greatest emphasis on ensuring the highest quality of datasets. The team created 50 broad types of synthetic datasets, each relying on a different set of skills and interaction styles. The synthetic data for Phi-4 is mostly designed to prioritise reasoning and problem-solving.

Small models like Phi-4 can have a significant impact in countries like India, where most people wouldn’t be able to shell out $20 per month for frontier models.

Narrow Tasks Deserve Narrow Models

Agentic systems often call models in predictable, templated ways. Whether it’s generating a summary, transforming data, or deciding the next API call, the interactions are constrained. 

As the paper puts it, “Agents expose only very narrow LM functionality.” Using a generalist LLM in this scenario is like hiring someone with a PhD to fill out a form.

More importantly, these systems need consistent formatting and behaviour. A hallucinated output format from a generalist LLM can break the entire pipeline. SLMs, fine-tuned on a specific format or task, offer more reliability, especially in safety-critical workflows.

Instead of betting everything on a single monolithic LLM, the paper suggests using a heterogeneous agentic system, which is a mix of small models for most tasks and large ones for the few that truly need them. This modular approach not only improves efficiency but also makes debugging, scaling, and updating much easier.
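The heterogeneous approach can be sketched as a simple router: scoped, templated subtasks go to a cheap local SLM, while open-ended requests escalate to a large model. Both backends below are stubs, and the names `slm_backend` and `llm_backend` are illustrative assumptions, not an API from the paper.

```python
# Subtask types the system treats as narrow and repetitive.
# (Illustrative list; a real deployment would derive this from logs.)
TEMPLATED_TASKS = {"summarise", "extract_json", "format_api_call"}

def slm_backend(task: str, payload: str) -> str:
    # Stub for a local small model serving cheap, scoped calls.
    return f"[slm:{task}] {payload}"

def llm_backend(task: str, payload: str) -> str:
    # Stub for an expensive frontier-model API, used sparingly.
    return f"[llm:{task}] {payload}"

def route(task: str, payload: str) -> str:
    """Dispatch narrow, repetitive tasks to the SLM; the rest to the LLM."""
    backend = slm_backend if task in TEMPLATED_TASKS else llm_backend
    return backend(task, payload)

print(route("extract_json", "invoice #42"))  # handled by the SLM
print(route("open_ended_planning", "trip"))  # escalated to the LLM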

The paper likens this to building with LEGO bricks—small, specialised pieces that come together to form a complex whole. “Scaling out by adding small, specialised experts instead of scaling up monolithic models yields systems that are cheaper, faster to debug, easier to deploy, and better aligned with the operational diversity of real-world agents.”

Agentic AI is expected to become a core part of the enterprise and developer toolkit. Its success depends on whether these systems can be scaled, specialised, and operated economically, and for that, SLMs seem just right.

The post Small Language Models Make More Sense for Agentic AI appeared first on Analytics India Magazine.
