In late 2022, ChatGPT was released, kickstarting a new era of LLM-based AI tools. At the time, its ability to answer questions about a wide range of subjects in natural language was unparalleled; even compared to previous transformer models[1], it represented a significant leap forward. However, the three short years since then have seen countless innovations in the field, and that first release now seems almost childlike in comparison to the state-of-the-art models of today. Besides simply having a wider knowledge base and improved fluency, these models have been equipped with tools like web search and IDE integration, can understand visual information like charts, tables, and diagrams, and can even achieve a certain level of logic and reasoning. Moreover, choices are abundant, with many models competing at the frontier, like OpenAI’s GPT-5, Anthropic’s Claude Opus 4, and Google’s Gemini 2.5 (and this list will almost certainly be outdated within a year, or even a couple of months[2]).
With these capabilities, we’ve moved from a paradigm where solving a problem with ML requires building a specialized model, to one where a wide swath of problems can be solved just by providing one of these LLMs with the right prompts and tools. The field of code generation in particular has benefitted greatly from this shift. The ability to use natural language not only to describe the problem, but also to correct mistakes and guide the process, is invaluable and makes code generation much more practical than before. As a Galois summer intern, I was tasked with exploring a use case for this capability: transforming a (semi-arbitrary) natural language description of a system into a (simple) simulator of it.
This task, while still nontrivial with modern LLMs, is not overly difficult. However, there’s a catch: the client was a government contractor. As the vague description of the task above might suggest, much of the information related to the project is classified or controlled, and this creates two issues when using the most powerful LLMs, which are generally hosted in the cloud. First, any information we send to the LLM can be read and stored by the host (e.g., OpenAI, Google, Amazon, etc.), which is a no-go for classified information. Second, even if we have complete trust in the host, most services collect the prompts they receive in order to do additional training on them, which means bad actors may later be able to get the LLM to regurgitate this sensitive information[3].
This issue is often overlooked, as the focus in research has tended towards achieving performance at all costs. There is something to be said for that approach, but it leaves a lot of potential innovations on the table for applications where privacy and independence are requirements. As this project has revealed, government projects are a great example of a situation where AI could improve productivity and unlock new possibilities, but only if data can be handled with care[4]. But even ignoring classification or data sensitivity, there’s always a cost to giving up privacy and security. Our data is made up of our thoughts, behaviours, and feelings, and it’s been seen time and time again that the more we make public, the higher the chance that malicious actors can exploit us. Maintaining privacy is becoming more and more difficult, but by finding and mitigating the weaknesses of small models, we can at least provide an alternative for what is becoming an essential tool.
For those who have not run an LLM locally before (which I imagine is most people), it’s worth taking a moment to explain how it works. The powerful models I mentioned earlier are all closed source, meaning you can’t just download and run them on your own hardware. The opposite would be open source, but since making a model open source would mean releasing the training method and the data, which are often the most valuable part, open-source models are rare. However, we don’t actually need the original training code just to run the model; it’s enough to have what are called the weights, or parameters. This has led to a vast array of “open-weight” models, typically in the range of 20 to 70 billion parameters. That may sound like a lot of parameters (and it is), but estimates of the sizes of models like GPT-5 and Claude Opus 4 are typically in the trillions of parameters, somewhere between ten and a hundred times larger.
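For a sense of what “running the weights” looks like in practice, here is a minimal sketch using the Ollama Python client; the client, a running local Ollama server, and the particular model tag are all assumptions on my part, and any similar local runtime would do.

```python
# Minimal sketch of running an open-weight model locally via Ollama
# (assumes the `ollama` Python package and a local Ollama server; the
# model tag below is a placeholder for whichever open-weight model you pick).
import ollama

ollama.pull("llama3.1:8b")  # downloads only the weights, not training code or data
reply = ollama.generate(model="llama3.1:8b",
                        prompt="In one sentence, what is a discrete-event simulator?")
print(reply["response"])
```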
Performance doesn’t scale linearly with the number of parameters, but this difference in size is large enough to have a noticeable effect on the abilities of these open-weight models. They struggle to constrain themselves to the instructions of the prompt, generate invalid or incorrect code, and have considerably worse vision performance (if they support images at all), just to name a few downsides. For example, the original goal of the project was to extract the system model directly from a PDF document, but much of the information was stored in diagrams, which have to be interpreted with vision. Initial tests with cloud models were promising, as they could not only understand the diagrams, but could also extract the text within them and elsewhere on the page with a high degree of precision. However, even the relatively large 70-billion-parameter models could do no more than describe the shape of a diagram, with text extraction out of reach.
A natural response to these shortcomings is: why not just run a larger model? In fact, there are some large open-weight models: DeepSeek R1, a 671-billion-parameter model, is perhaps the best known of these, along with the more recent Kimi K2, a 1-trillion-parameter model. However, just because these models are open doesn’t mean they’re easy to run, due to their computing requirements. For DeepSeek R1, the highest level of quantization (essentially, compression) available on Ollama results in a model that’s about 400 GB in size[5]. LLMs generally have to be run on GPUs for acceleration, so the model has to fit entirely within their specialized memory (called VRAM). For a single card, this tends to max out at 80 GB, as on NVIDIA’s A100 and H100 cards, which go for no less than $10k, if not much more. To run the aforementioned model, you’d need at least five of those cards, and that’s assuming zero leftover memory for caching and other processes. This can get cheaper if you rent the time on a service like AWS or Azure, but even then it gets expensive quickly.
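To make the arithmetic concrete, here is a back-of-the-envelope estimate in Python, using the ½-to-⅔ bytes-per-parameter rule of thumb from footnote 5; the figures it prints are rough lower bounds for holding the weights alone, not measured requirements.

```python
# Rough estimate of how many 80 GB GPUs are needed just to hold a model's
# weights at 4-bit quantization (0.5-0.67 bytes per parameter, per the rule
# of thumb in footnote 5). Real deployments also need VRAM for the KV cache
# and activations, so these are lower bounds.
import math

GPU_VRAM_GB = 80  # e.g., an NVIDIA A100 or H100 80 GB card

def min_gpus(params_billions: float, bytes_per_param: float) -> int:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * 1 byte ~ 1 GB
    return math.ceil(weights_gb / GPU_VRAM_GB)

for name, size_b in [("70B open-weight model", 70),
                     ("DeepSeek R1 (671B)", 671),
                     ("Kimi K2 (1T)", 1000)]:
    lo, hi = min_gpus(size_b, 0.5), min_gpus(size_b, 0.67)
    print(f"{name}: roughly {lo}-{hi} x {GPU_VRAM_GB} GB GPUs for the weights alone")
```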
In contrast, cloud models typically cost a few dollars per million tokens, or are even free. The fact that we can access these massive models at such low prices is likely only possible due to a combination of scale[6] and a desire to gain market share and data by offering cheap queries. Unfortunately, those with a need for privacy or security can’t benefit from those prices, and as long as hardware stays expensive there will be a need for smaller, more efficient models.
Over the course of this project, I identified four main methods for working around the weaknesses of local LLMs. There are certainly many more than that, but each of these four reveals something interesting about the weaknesses of small models and the shortcuts that larger models often allow us to take. The simplest method has become a core aspect of developing with LLMs regardless of size, and is often called “prompt engineering.” While multifaceted, the basic idea of prompt engineering is to craft a passage of natural language which instructs the model to respond in desirable ways to user input, and to work towards some eventual goal. Unlike typical programming, where bugs generally come from inaccuracies in the code, errors from LLMs can arise even from error-free prompts, whether because some aspect is underspecified, because of a model “hallucination”[7], or simply because the LLM “ignores” a certain part of the prompt[8]. As such, prompt engineering is still more of an art than a science. And while it’s required for both large and small models, small models need much more guidance and specificity than a large model would, while also demanding a certain level of brevity to avoid confusing the model and to work around their shorter context lengths[9].
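As a concrete (and purely illustrative) example, here is what a small-model prompt along these lines might look like, using the Ollama Python client; the model tag, the system prompt, and the task are placeholders rather than the actual prompts used on this project.

```python
# Illustrative prompt-engineering sketch for a small local model, using the
# ollama Python client (assumed installed and pointed at a local server).
# Small models need explicit rules and a short example, but the whole prompt
# is kept brief to avoid overrunning their limited context.
import ollama

SYSTEM_PROMPT = """You generate Python simulators from system descriptions.
Rules:
1. Respond with Python code only, no explanations.
2. Use only the standard library.
3. If the description is ambiguous, pick the simplest interpretation.
Example: "a queue where jobs arrive every 2 seconds" ->
a class with an `arrive()` method driven by a simple clock."""

def ask(description: str) -> str:
    response = ollama.chat(
        model="llama3.1:8b",  # placeholder; any local chat model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    return response["message"]["content"]

print(ask("a sensor that reports a temperature reading every minute"))
```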
Another method that was invaluable for local models is called structured output. Due to their unprecedented flexibility, LLMs suffer from two deficiencies that make integration difficult: nondeterminism and a lack of guarantees. At each step in generation, the model calculates the probability that each token in its vocabulary would appear in the current position, and then selects the next token at random, weighted by those probabilities. This not only means that we might get a different answer to a question if we ask it twice, but that even if we always choose the most likely token[10], the model is ultimately just selecting what it finds likely to occur. Even if it succeeds ten times in a row, we have no guarantee it will succeed the next time, or the time after that. For example, this project required the model to output its responses in JSON (in order to hook it up to the agent system and access the file editor, terminal, etc.), and often just as much of the model’s time was spent fixing mistakes in its JSON output as was spent actually solving the problem at hand. With structured output, we create a grammar that describes a “legal” statement for the model[11], and at each step we only choose from tokens that keep the output legal. The model can still generate what it needs to, but it can’t miss an important field or invent a superfluous one, cutting down on the amount of time wasted on fixing its requests.
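Here is a minimal sketch of the idea, again with the Ollama Python client, assuming a version with structured-output support (a JSON schema passed as the `format` argument); the “edit_file”/“run_command” action schema is a simplified stand-in for the kind of agent action described above, not the schema used on this project.

```python
# Sketch of structured output with the ollama Python client, assuming a
# version with structured-output support (a JSON schema passed as `format`).
# The action schema below is illustrative only.
import json
import ollama

ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["edit_file", "run_command"]},
        "path": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["action", "path", "content"],
}

response = ollama.chat(
    model="llama3.1:8b",  # placeholder model tag
    messages=[{"role": "user",
               "content": "Create hello.py that prints 'hello, world'."}],
    format=ACTION_SCHEMA,  # decoding is constrained to tokens matching this schema
)

# The reply is guaranteed to parse and to contain every required field,
# so no turns are wasted repairing malformed JSON.
action = json.loads(response["message"]["content"])
print(action["action"], action["path"])
```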
Still, even after implementing structured output, a significant issue remained: just because the output is in the right format doesn’t mean it’s useful. In fact, LLMs excel most when allowed to output natural language, and I had just ensured that the model output as little of that as possible! To resolve this, we can let the LLM start completely unconstrained and write a section of text to itself containing its plans and ideas, and only afterwards answer the question in a constrained manner. This method is often called “thinking” or “reasoning,” and while the name is a bit generous, it does roughly correspond: think of it as the difference between doing the first thing that comes to mind, versus pausing to think and then deciding on an action. This technique has become extremely popular recently, drastically increasing the performance of models at relatively little cost. However, since “thinking” usually occurs between special start/end thinking tokens, the model has to be explicitly trained to support those tokens and to generate useful content between them, restricting our model choices[12].
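One lightweight way to approximate this pattern, closer to the chain-of-thought prompting described in footnote 12 than to a true reasoning-trained model, is simply to make two calls: an unconstrained one to draft a plan, then a constrained one for the final action. The sketch below reuses the placeholder model and schema from the previous examples.

```python
# Two-pass "think, then answer" sketch: pass 1 is unconstrained free-form
# planning; pass 2 feeds the plan back in and constrains the output to the
# (illustrative) action schema. Model tag and schema are placeholders.
import json
import ollama

MODEL = "llama3.1:8b"
TASK = "Create hello.py that prints 'hello, world'."
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["edit_file", "run_command"]},
        "path": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["action", "path", "content"],
}

# Pass 1: free-form reasoning, no format constraint.
plan = ollama.chat(
    model=MODEL,
    messages=[{"role": "user",
               "content": f"Think step by step about how to do this task, "
                          f"but do not do it yet:\n{TASK}"}],
)["message"]["content"]

# Pass 2: the plan becomes context, and only the final answer is constrained.
answer = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "user", "content": TASK},
        {"role": "assistant", "content": plan},
        {"role": "user", "content": "Now give the single action to take."},
    ],
    format=ACTION_SCHEMA,
)["message"]["content"]
print(json.loads(answer))
```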
The final technique is not one that was used on this project, but it’s one I want to highlight, as it tends to be a blind spot for the general public. Earlier, I mentioned that we’ve moved out of the paradigm of training a new model for every task, but this is only half true, especially when it comes to smaller models. Fine-tuning, in which an open-weight model is put through further training to increase its performance in a given area, has become very popular, especially in academic research. In my usual area of research, automated theorem proving, this technique has been used to make small models like Goedel-Prover-V2 which outperform even the large frontier models by a wide margin on this specific task. The downside is that fine-tuning requires a large dataset of examples to learn from, and that can be very difficult to create; for this project, there was no publicly available data to build from, so it would have required manual labelling of at least a few hundred examples, and even that would still be a small number[13]. Even so, fine-tuning represents a major reduction in complexity compared to training a model from scratch, as it can generalize much faster and with fewer examples, and in the right conditions can outperform models a hundred times its size, making it one of the strongest tools a small model can wield.
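To show the shape of the technique without implying it was part of this project, here is a minimal LoRA fine-tuning sketch using the Hugging Face transformers, peft, and datasets libraries; the base model, the hypothetical `labeled_examples.jsonl` dataset, and the hyperparameters are all placeholders.

```python
# Minimal LoRA fine-tuning sketch (not used on this project). Assumes the
# transformers, peft, and datasets libraries; the base model name, the
# labeled_examples.jsonl file, and the hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "meta-llama/Llama-3.1-8B"  # placeholder open-weight base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA trains small adapter matrices instead of every weight, which is what
# makes further training of a multi-billion-parameter model feasible.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# Hypothetical dataset: one JSON object per line with a "text" field holding
# a description/simulator pair rendered as plain text.
data = load_dataset("json", data_files="labeled_examples.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```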
The place of small models in this rapidly developing field is still unclear. On the one hand, there is no shortage of weaknesses in these models compared to the leading models of OpenAI, Anthropic, Google, and the like, and the accessibility of those cloud models is only increasing as scale and competition drive prices lower. On the other hand, the gap between the two categories seems to be shrinking, an impressive feat for something so much smaller and cheaper. And as this project demonstrates, there will always be tasks for which privacy is a requirement, whether due to classification, protections on customer data, or just a desire to keep your thoughts and ideas your own.
In a sense, this leads to a question of what will happen first: will models get efficient enough at smaller sizes that the larger models aren’t needed, or will computing costs drop such that even a host that preserves privacy can offer a competitive price? I doubt anyone can say with confidence which will occur (or that some third option won’t appear instead), but regardless of the eventual answer, it’s clear that innovation will continue on both fronts, creating more and more opportunities for practical use of these models, large or small.
For my project at Galois, we decided that while local models could find some success in code generation, at the moment the time required to supervise them is at least as much as it would take to write the code manually. That conclusion reflects a lot of what I enjoyed about working at Galois this summer: there was an omnipresent spirit of research and openness to new ideas, which allowed me to propose an alternative to my original task. Ultimately, we found more success in creating developer tools to increase speed, like a domain-specific language (DSL) that made it simple for a human to translate natural language descriptions into a simulator. Still, it’s worth remembering that change comes fast, and the answer we found may not hold in a couple of years. After all, only a couple of years ago, even the concept of this project would have been unreasonable. For the moment, privacy is still at odds with LLMs, but whether through advancements in local models or through pressure on providers to improve their privacy practices, that gap can be bridged.
[1] The transformer is a model architecture introduced in the now-famous paper “Attention Is All You Need”; many models used it before ChatGPT, but none broke into the public eye in the same way.
[2] In fact, it became outdated even before this article was published, due to the release of OpenAI’s GPT-5.
[3] Source: https://www.galois.com/articles/qa-ai-ml-and-the-privacy-security-dilemma
[4] While there are government-approved model providers, getting access to these systems is non-trivial (to the point that even an ETA on access wasn’t available during my internship), and they are still more expensive, more outdated, and slower compared to the typical providers.
[5] As a rule of thumb, the memory footprint of a model in bytes is roughly ½ to ⅔ of its number of parameters (i.e., about 0.5 to 0.67 bytes per parameter) when using 4-bit quantization.
[6] A single query uses only a small fraction of a GPU’s compute, which is inefficient for a single user but, at scale, becomes an opportunity to distribute the hardware cost over many users.
[7] “Hallucination” usually refers to a model presenting something as fact when it is not, even if it was never mentioned in the prompt or the training data.
[8] While this sounds bad (and it is), it’s unfortunately not rare for LLMs to ignore parts of the prompt, and the issue only worsens as model size decreases.
[9] The context length is the number of tokens the model can take as input before it has to start throwing away or compressing information (a token is a segment of a word, around ¾ of a word on average). For large models this is usually anywhere from a couple hundred thousand to one million tokens, but small models have only recently reached a hundred thousand. Part of the reason for this is that, even if a small model supports a large context length, larger contexts require more memory, such that in practice the largest context length I had access to was around sixty thousand tokens (and even that was only achieved by using an even smaller model).
[10] We call this sampling with a “temperature” of zero.
[11] This is relatively easy for something like a JSON schema, but can become computationally expensive or even impossible when trying to define something more complex, like a programming language, so this technique isn’t always applicable.
[12] There is another technique called “chain-of-thought prompting”, where the prompt explicitly asks the model to output some thinking before giving its answer. This is also quite powerful, and is often used with large models that haven’t been trained for reasoning, but small models tend to struggle to use it while still attending to the rest of the prompt.
[13] Other techniques, such as reinforcement learning (which was also used in Goedel-Prover-V2), are more forgiving and can alleviate this labelling issue to some extent, but they still require a large amount of engineering effort to create the training pipeline, and eventually still need either human input or some other way to distinguish between "good" and "bad" responses.