How Small Can AI Be? Practical Limits and Opportunities

Smaller, compressed AI models trained on task-specific data can be genuinely useful on ordinary hardware, enabling distributed, cooperative intelligence rather than relying solely on ever-larger models.

One of the most interesting questions in AI right now isn’t just how big these models can get—it’s how small they can get while still being genuinely useful.

For the last few years, the dominant story has been scaling up: GPT-2 to GPT-3 to GPT-4, moving from hundreds of millions of parameters to billions, and then toward trillions. And that path clearly matters. If we want systems that don’t just handle text but can understand and operate across video, audio, and maybe even domains like physics, bigger and more sophisticated foundation models are an obvious route.

But there’s a second direction that’s turning out to be just as consequential: scaling down.

Small models are getting surprisingly good

There’s now an entire category of compact models that are far more capable than many people expected. Microsoft’s Phi models demonstrated that “a few gigabytes” can go a long way. Google’s smaller Gemma models made the same point. Part of what made these models feel shocking is how normal the hardware requirements were: you can fit them on something like a thumb drive and run them on an average desktop computer—yet the behavior you get can look, to many eyes, like something we would have associated with much larger systems not that long ago.

This reframes the question. We’re not only asking, “How good can AI get if we throw massive compute at it?” We’re also asking, “How much capability can we pack into a small footprint?”

How do you make models smaller without destroying usefulness?

A big part of the answer is that you don’t necessarily need to keep everything you trained.

One approach is distillation: you train a large “teacher” model, then train a much smaller “student” to reproduce its behavior—keeping the capabilities you care about in far fewer parameters. Related techniques, like pruning and quantization, compress a model directly, removing weights that aren’t as necessary for the behaviors you care about or storing them at lower precision. Another approach is to change what you train on. Instead of training on “everything” (all of Wikipedia plus enormous libraries like Project Gutenberg), you can focus training on more task-specific data.
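The distillation idea can be shown in miniature. The sketch below is a toy, with all numbers invented for illustration: a “student” (here just a logit vector) is fitted by gradient descent so that its temperature-softened distribution matches a teacher’s softened output for a single input.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for one input (invented for illustration).
teacher_logits = np.array([4.0, 1.0, -2.0])
T = 2.0  # temperature exposes the teacher's "dark knowledge" about near-misses
soft_targets = softmax(teacher_logits, T)

# Fit the student by gradient descent on the cross-entropy against the
# soft targets (the 1/T factor of the true gradient is folded into lr).
student_logits = np.zeros(3)
lr = 2.0
for _ in range(500):
    p = softmax(student_logits, T)
    student_logits -= lr * (p - soft_targets)  # cross-entropy gradient w.r.t. logits

# The student ends up reproducing the teacher's softened distribution.
assert np.allclose(softmax(student_logits, T), soft_targets, atol=1e-3)
```

In a real setting the student is a smaller network trained over many inputs, but the objective has this same shape: match the teacher’s output distribution, not just its top answer.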

That doesn’t mean you can skip fundamentals. A useful model still has to understand language, grammar, and some basic logic. But after those basics, task-specific training can be extremely effective—especially when you’re not forcing the model to reinvent procedures from scratch, but showing it how to do the work through targeted examples.

The result is a practical lesson: capability isn’t only about raw scale. It’s also about compression and curation.

Before asking for limits, define the behavior

If you want to talk about the theoretical limit of how small a model can be, you first have to answer a prior question: what does it need to do?

“How small can intelligence be?” is too vague to be meaningful. Intelligence for what—holding a casual conversation, writing code, proving theorems, planning over long horizons, controlling a robot, building scientific hypotheses? The smaller you want to go, the more explicit you have to be about the behaviors you’re trying to preserve.

Why humans and animals are a useful comparison (and why people misuse it)

People often compare how models learn to how babies learn—and then conclude something must be wrong, because today’s models require far more exposure to information than a baby does to acquire language.

That may be true on its face, but the comparison often misses something crucial: humans and animals are not blank slates.

We’re hardwired with a lot of structure: expectations about objects, basic physical intuitions, predispositions around language and social reasoning, and other built-in “priors.” And the strongest proof of that is how many animals are born able to function almost immediately—navigating their environment, responding to threats, and carrying out surprisingly complex behaviors without anything like the training regimen we imagine when we think about machine learning.

In other words, nature already did a huge amount of training.

DNA as evidence for how much can be compressed

A helpful way to think about lower bounds on “compressed capability” is biology. The human genome is about 1.6 gigabytes. Across the animal kingdom, there are smaller and larger genomes—and yet many organisms are born “fully wired” enough to survive and adapt in their niches.
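The 1.6-gigabyte figure is easy to sanity-check with back-of-envelope arithmetic (the numbers below are round approximations): four possible bases means 2 bits per base, and a diploid cell carries two copies of roughly 3.1 billion base pairs.

```python
# Back-of-envelope: raw information content of the human genome.
BASES_PER_COPY = 3.1e9   # approximate haploid genome length, in base pairs
BITS_PER_BASE = 2        # 4 symbols (A/C/G/T) -> 2 bits each
COPIES = 2               # diploid cells carry two copies

total_bytes = BASES_PER_COPY * BITS_PER_BASE * COPIES / 8
print(total_bytes / 1e9)  # ~1.55 GB
```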

However you interpret that, it strongly suggests that a lot of useful behavioral structure can be encoded in a surprisingly small amount of information. Evolution is, in a sense, a massive search and optimization process running over millions to billions of years. The “training environment” is nature. The result is a brain that is, from birth, already pretrained with lessons that were learned long before the individual existed.

This is also why certain reactions don’t feel learned. Some fears—like fear of snakes—often appear more like inherited priors than freshly acquired knowledge. Preferences and aversions can have the same character. These are hints about what gets “baked in” by a long optimization process.

So when we start seeing small AI models—only a few gigabytes—doing genuinely interesting work, it’s reasonable to be delighted. But it shouldn’t be completely surprising. Biology has been telling us for a long time that “a few gigabytes” can encode an enormous amount of useful structure.

A concrete example: “small” can already feel like science fiction

We now have models that, in practical terms, fit into sizes that would have sounded impossible not long ago.

For example, OpenAI’s open-weight GPT OSS 20B fits in about 12 gigabytes—roughly seven times the size of the human genome. In the grand scheme of what people have been conditioned to imagine as “advanced AI,” that’s still relatively small. And yet you can talk to it, get useful work out of it, and because it’s a reasoning model, it can “think over time” in the sense that you can build a harness around it—set it up to pursue longer-range, more agentic tasks rather than only produce a single response.
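The 12-gigabyte footprint is itself mostly arithmetic. A sketch, assuming the weights are stored at low-bit quantized precision (the bits-per-parameter value below is an illustrative assumption, not a published specification):

```python
# Rough sizing: why a ~20B-parameter model can land near 12 GB on disk.
params = 20e9            # ~20 billion parameters
bits_per_param = 4.5     # assumed low-bit quantization (illustrative, not a spec)

gigabytes = params * bits_per_param / 8 / 1e9
print(gigabytes)  # ~11.25 GB
```

The same arithmetic explains why quantization matters so much: halving bits per parameter halves the footprint.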

It’s worth pausing on how fast that shift happened. If you put a model like that on a laptop, then hopped in a time machine to early 2020—before GPT-3—and demoed it, two reactions would be predictable:

  1. People would assume it was a decade away, because it would look far more sophisticated than what most believed we had a clear path to.
  2. People would be shocked it could run on a contemporary MacBook at all.

That gap between expectations and reality is one of the most important signals in the entire space. Even just looking back a year or two, many priors turned out to be wrong. Today, you can fit something on a phone that would have seemed supernatural—and would have passed the “Turing test” as many people described it five or six years ago.

Why small models matter even if big models keep winning

The point isn’t that scaling up stops. The point is that compressibility changes the shape of the future.

If capable systems can run on small hardware, then you can run many of them. Hundreds. Thousands. And once you can do that, the unit of progress stops being only “one giant model.” You can start building collective systems: swarms of smaller models cooperating, dividing work, checking each other, and operating over large problem spaces.

Imagine taking a massive body of scientific literature and having models work through it sequentially—looking for clues, building connections, and passing insights back and forth. Or having a set of agents break a large task down, do parts in parallel, and recombine the outputs. This is one reason many frontier labs are bullish about continued improvement: we’ve moved beyond a single axis (parameter count) and into a landscape where organization, cooperation, and harnessing matter.
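That fan-out/fan-in pattern can be sketched in a few lines. This is a minimal sketch, not a real agent framework: `summarize` below is a hypothetical stand-in for a call to a small local model, and its one-line body exists only so the example runs.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk: str) -> str:
    # Stand-in for a call to a small local model (e.g. via a local
    # inference server). The real work would happen here.
    return chunk.split(".")[0]

def map_reduce(document: str, n_chunks: int = 4) -> str:
    """Split a document, run many small 'agents' in parallel, recombine."""
    words = document.split()
    step = max(1, len(words) // n_chunks)
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize, chunks))  # parallel fan-out
    return " | ".join(partials)                       # fan-in / recombine

print(map_reduce("alpha beta gamma delta", n_chunks=2))
```

With a real local model behind `summarize`, the same skeleton scales to chunked literature review, parallel subtask execution, or cross-checking agents.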

And unlike training a trillion-parameter model, this is approachable. A lot of people already have spare machines. Many can run multiple agents locally and experiment with parallel task decomposition. The barrier to entry drops dramatically when “useful intelligence” fits into a few gigabytes.

So how small can we ultimately go?

It would be a mistake to assume that something like GPT OSS 20B represents a lower bound. There’s no obvious reason to believe it’s the smallest container that can hold “that kind” of capability. It wouldn’t be surprising if we can compress further while retaining a useful model of the world.

At the same time, there are real constraints. When you compress facts into a limited space, you run into mathematical limits—there are bounds on how efficiently certain information can be represented. But there’s also an open question about how much “intelligence” is simply a pile of stored facts versus something that emerges from the way knowledge is organized and used. It’s possible that, with the right stacking and structure of compression, behaviors we label as intelligent could be self-emergent from systems that are much smaller than we’d guess by intuition.
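The cleanest version of those bounds is Shannon’s source-coding theorem: symbols drawn from a distribution p cannot be encoded, on average, in fewer than H(p) bits each. A toy calculation with a four-symbol alphabet shows how redundancy sets the floor:

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits: the floor on average code length per symbol."""
    return -sum(x * math.log2(x) for x in p if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # no redundancy: needs the full 2 bits/symbol
skewed = [0.7, 0.1, 0.1, 0.1]       # redundancy: compressible below 2 bits/symbol

print(entropy_bits(uniform))  # 2.0
print(entropy_bits(skewed))   # ~1.36
```

Facts stored verbatim hit this floor; whatever lets models go smaller has to come from the other side of the question—structure and organization rather than rote storage.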

The most important takeaway is that the future isn’t only bigger models. It’s also smaller, cheaper, more distributed intelligence—running on everyday hardware, cooperating at scale, and doing work that used to require a single massive system. The question “how small can it get?” isn’t a curiosity anymore. It’s becoming one of the main drivers of what becomes possible next.