Choosing the right LLM has become a full-time job. New models appear almost daily, each with its own capabilities, prices, and quirks. This competition creates strong incentives for AI labs to carve out a niche and gives new startups room to emerge, resulting in a fragmented landscape where one model may excel at reasoning, another at code, and a third at cost efficiency.
AI, in one sense, is getting cheaper faster than any previous technology, at least per unit of intelligence. For example, input tokens for Gemini Flash Lite 2.5 are approximately 600 times cheaper than what OpenAI’s GPT-3 (davinci-002) cost in August 2022, while outperforming it on every metric. At the same time, access to frontier capabilities is also becoming more expensive than ever. The reason is simple: we can now pay directly for more capability, which has led to the rise of $300+ per month Pro subscription tiers.
Today, any developer can run capable open-weight models locally for negligible marginal cost using tools like Ollama. At the same time, enterprise systems can experience sharp cost increases, depending on the model size (number of parameters, such as 3 billion, 70 billion, or even in the trillions), the number of internal processing steps, and the volume of input data. For developers, these are central system design choices that directly affect feasibility and cost structure. For end users, this complexity explains why a basic subscription differs so much from a premium plan with higher limits on advanced models.
The choices you make in these broader development decisions also determine which LLM and inference settings are optimal for your use case.
At Towards AI, we work across the LLM stack, building applications, designing enterprise systems, and offering online courses (including one on O’Reilly), custom corporate training, and LLM development consultancy. In our experience, model selection and system design have become central to getting meaningful results from these tools. Much of that, in turn, depends on where today’s models are gaining their capabilities. While scale still plays a role, recent progress has come from a broader mix of factors, including training-data quality, post-training methods, and especially how models are used at inference time.
The Shifting Foundations of Model Capability
While early gains in LLM performance tracked closely with increases in pretraining compute (larger datasets, bigger models, and more training steps), this approach now yields diminishing returns.
Recent improvements come from a broader mix of strategies. Pretraining-data quality has become just as important as quantity, with better filtering and AI-generated synthetic data contributing to stronger models. Architectural efficiency, like the innovations introduced by DeepSeek, has started to close the gap between size and capability. And post-training techniques, especially instruction tuning and reinforcement learning from human or AI feedback (RLHF/RLAIF), have made models more aligned, controllable, and responsive in practice.
The more fundamental shift, however, is happening at inference time. Since late 2024, with models like OpenAI’s o1, we’ve entered a new phase where models can trade compute for reasoning on demand. Rather than relying solely on what was baked in during training, they can now “think harder” at runtime, running more internal steps, exploring alternative answers, or chaining thoughts before responding. This opens up new capability ceilings, but also introduces new cost dynamics.
These varied improvement strategies have led to a clear divergence among AI labs and models, a rapid expansion in model choice, and in some cases, an explosion in model usage costs.
The Modern Cost Explosion: How Inference Scaling Changed the Game
Inference-time compute scaling has introduced a new dynamic in LLM system design: we’ve gone from a single lever, model size, to at least four distinct ways to trade cost for capability at runtime. The result is a widening gap in inference cost across models and use cases, sometimes by factors of 10,000x or more.
Larger Models (Size Scaling): The most obvious lever is sheer model size. Frontier LLMs, like GPT-4.5, often built with mixture of experts (MoE) architectures, can have input token costs 750 times higher than streamlined models like Gemini Flash-Lite. Larger parameter counts mean more compute per token, especially when multiple experts are active per query.
Series Scaling (“Thinking Tokens”): Newer “reasoning” LLMs perform more internal computational steps, or a longer chain of thought, before producing their final answer. For example, OpenAI’s o1 used ~30x more compute than GPT-4o on average, and often 5x more output tokens per task. Agentic systems introduce an additional method of series scaling and an extra layer of cost multiplication. As these agents think, plan, act, reassess, plan, act, and so on, they often make many LLM steps in a loop, each incurring additional cost.
Parallel Scaling: Here, the system runs multiple model instances on the same task and then automatically selects the best output via automated methods, such as majority voting (which assumes the most common answer is likely correct) or self-confidence scores (where the model output claiming the highest confidence in its response is taken as the best). The o3-pro model likely runs 5–10x parallel instances over o3. This multiplies the cost by the number of parallel attempts (with some nuance).
Input Context Scaling: In RAG pipelines, the number of retrieved chunks and their size directly influence input token costs and the LLM’s ability to synthesize a good answer. More context can often improve results, but this comes at a higher cost and potential latency. Context isn’t free; it’s another dimension of scaling that developers must budget for.
Taken together, these four factors represent a fundamental shift in how model cost scales. For developers designing systems for high-value problems, differences of 10,000x to 1,000,000x in the API cost of solving a problem, depending on architectural choices, are now realistic. Reasoning LLMs, although prominent for only about nine months, have reversed the trend of declining access costs to the very best models. This transforms the decision from “Which LLM should I use?” to include “How much reasoning do I want to pay for?”
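To make this concrete, here is a rough back-of-the-envelope sketch of how the four levers compound. All prices and multipliers are illustrative placeholders, not quotes from any provider’s price list.

```python
# A rough cost model for the four inference-time levers discussed above.
# All prices and multipliers are illustrative placeholders.

def estimate_cost_usd(
    input_tokens: int,          # input context scaling (e.g., RAG chunks)
    output_tokens: int,         # includes "thinking" tokens for reasoning models
    price_in_per_m: float,      # $ per 1M input tokens (the model-size lever)
    price_out_per_m: float,     # $ per 1M output tokens
    agent_steps: int = 1,       # series scaling: LLM calls in an agent loop
    parallel_samples: int = 1,  # parallel scaling: samples per step
) -> float:
    per_call = (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1e6
    return per_call * agent_steps * parallel_samples

# A small, cheap model answering directly vs. a frontier reasoning model in a
# multi-step, multi-sample agent loop can differ by several orders of magnitude:
cheap = estimate_cost_usd(2_000, 500, price_in_per_m=0.10, price_out_per_m=0.40)
heavy = estimate_cost_usd(50_000, 20_000, price_in_per_m=15.0, price_out_per_m=60.0,
                          agent_steps=20, parallel_samples=5)
print(f"${cheap:.5f} vs ${heavy:.2f}  (~{heavy / cheap:,.0f}x)")  # a ~500,000x gap
```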
This shift changes how we think about selection. Choosing an LLM is no longer about chasing the highest benchmark score; it’s about finding the balance point where capability, latency, and cost align with your use case.
Core Model Selection Criteria
When choosing a model, we find it important to first clearly identify your use case and the minimum core AI capabilities and attributes needed to deliver it.
A common first step is to look at standard benchmark scores (for example, LiveBench, MMLU-Pro, SWE-Bench). These benchmarks are a useful starting point, but some models are tuned on benchmark data, and real-world performance on the tasks that actually matter to you will often vary. Filtering benchmark tests and scores by your industry and task category is a valuable step here. An LLM optimized for software development might perform poorly in creative writing, or vice versa. The match between a model’s training focus and your application domain can outweigh general-purpose benchmarks.
Leaderboards like LMArena and Artificial Analysis offer broader human‑preference comparisons but still don’t replace custom real-world testing. It helps to have a set of your own example questions or tasks at hand to test out a new model for yourself and see how it performs. This should include a mix of easy tasks to establish a baseline and tough edge cases where it’s easy for a model to make mistakes.
As you move beyond ad hoc testing, for any serious development effort, custom evaluations are non-negotiable. They must be tailored to your use case and the types of problems you solve. This is the only way to truly know if a model, or a change to your system, is genuinely improving things for your users and your specific business goals.
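A custom evaluation does not have to be elaborate to be useful: a scripted loop over your own tasks with simple pass/fail checks already beats ad hoc testing. The sketch below assumes a call_model() helper for whichever API or local runtime you use; the tasks and checkers are illustrative placeholders you would replace with examples from your real workload.

```python
# A minimal custom evaluation harness sketch. call_model(), the tasks, and the
# scoring rules are placeholders to adapt to your own stack and workload.

def call_model(model_name: str, prompt: str) -> str:
    """Stand-in for your actual API or local inference call."""
    raise NotImplementedError("Wire this up to your provider SDK or local runtime.")

EVAL_SET = [
    # (prompt, checker) pairs: easy baseline tasks plus known-hard edge cases.
    ("Summarize this support ticket in one sentence: ...", lambda out: len(out.split()) < 40),
    ("Extract the invoice total from: ...", lambda out: "1,234.56" in out),
]

def evaluate(model_name: str) -> float:
    passed = 0
    for prompt, checker in EVAL_SET:
        try:
            passed += bool(checker(call_model(model_name, prompt)))
        except Exception:
            pass  # timeouts, refusals, and errors count as misses
    return passed / len(EVAL_SET)

for candidate in ["model-a", "model-b"]:  # the models you are comparing
    print(candidate, evaluate(candidate))
```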
Here are some core factors we consider:
Multimodality is emerging as a major differentiator. Models like GPT-4o and Gemini can handle not just text but also images, audio, and in some cases video, unlocking applications that pure text models can’t support.
Context window and effective context utilization are also key: How many tokens or documents can the model process at once, and how much of the advertised context window can it actually use before performance degrades relative to tasks that use less context?
Latency is especially critical for interactive applications. In general, smaller or cheaper models tend to respond faster, while reasoning-heavy models introduce delays due to deeper internal computation.
Reasoning is the ability to scale inference-time compute and perform multistep problem-solving, planning, or deep analysis.
Privacy and security are often key considerations here. For example, if you want to keep your intellectual property private, you must use a model that won’t train on your inputs, which often points toward self-hosted or specific enterprise-grade API solutions.
Trustworthiness is also becoming important and can come down to the reputation and track record of the AI lab. A model that produces erratic, biased, or reputationally damaging outputs is a liability, regardless of its benchmark scores. For instance, Grok has had well-publicized issues with its alignment. Even if such issues are supposedly fixed, it creates a lingering question of trust: How can one be sure it won’t behave similarly in the future?
The knowledge cutoff date also matters if the model will be used in a fast-moving field.
After working out whether a model meets your minimum capability bar, the next decision is usually about optimizing trade-offs among cost, reliability, security, and latency. A key rule of thumb we find useful here: if the reliability gain from a more expensive model or more inference time saves more of your or your users’ time (valued in terms of pay) than the model costs, going with the larger model is a good decision.
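As a minimal sketch of that rule of thumb, with every number an illustrative placeholder:

```python
# Break-even check for upgrading to a more expensive model or more reasoning.
# All figures below are illustrative placeholders.

def upgrade_worth_it(
    tasks_per_month: int,
    minutes_saved_per_task: float,   # human time saved via fewer errors and retries
    hourly_rate_usd: float,
    extra_cost_per_task_usd: float,  # larger model / more reasoning tokens
) -> bool:
    value_saved = tasks_per_month * (minutes_saved_per_task / 60) * hourly_rate_usd
    extra_cost = tasks_per_month * extra_cost_per_task_usd
    return value_saved > extra_cost

# e.g., 2,000 tasks/month, 3 minutes saved each, $60/hour, $0.50 extra per task:
print(upgrade_worth_it(2_000, 3, 60, 0.50))  # True: ~$6,000 saved vs ~$1,000 spent
```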
The Pros and Cons of Open-Weight and Closed-API LLMs
The rise of increasingly competitive open-weight LLMs, such as Meta’s Llama series, Mistral, DeepSeek, Gemma, Qwen, and now OpenAI’s GPT-OSS, has added a critical dimension to the model selection landscape. Momentum behind this open ecosystem surged with the release of DeepSeek’s R1 reasoning model, competitive with OpenAI’s o1 but priced at roughly 30x lower API cost. This sparked debate around efficiency versus scale and intensified the broader AI rivalry between China and the US. Reactions ranged from “OpenAI and Nvidia are obsolete” to “DeepSeek’s costs must be fabricated,” but regardless of hype, the release was a milestone. It showed that architectural innovation, not just scale, could deliver frontier-level performance with far greater cost efficiency.
This open-model offensive has continued with strong contributions from other Chinese labs such as Alibaba (Qwen), Moonshot AI (Kimi), and Tencent (Hunyuan), and has put competitive pressure on Meta after its open-weight Llama models fell behind. China’s recent leadership in open-weight LLMs has raised new security and IP concerns for some US- and European-based organizations, though we note that downloading these model weights and running them on your own infrastructure doesn’t require sending data to China.
This brings us back to the pros and cons of open weights. While closed-API LLMs still lead at the frontier of capability, the primary advantages of open-weight models are quick and affordable local testing, unparalleled flexibility, and increased data security when run internally. Organizations can also perform full fine-tuning, adapting the model’s core weights and behaviors to their specific domain, language, and tasks. Open models also provide stability and predictability: you control the version you deploy, insulating your production systems from unexpected changes or degradations that can sometimes occur with unannounced updates to proprietary API-based models.
Public closed-model APIs from major providers benefit from immense economies of scale and highly optimized GPU utilization by batching requests from thousands of users, an efficiency that is difficult for a single organization to replicate. This often means that using a closed-source API can be cheaper per inference than self-hosting an open model. Security and compliance are also more nuanced than they first appear. While some organizations must use self-hosted models to simplify compliance with regulations like GDPR by keeping data entirely within their own perimeter, this places the entire burden of securing the infrastructure on the internal team—a complex and expensive undertaking. Top API providers also often offer dedicated instances, private cloud endpoints, and contractual agreements that can guarantee data residency, zero-logging, and meet stringent regulatory standards. The choice, therefore, is not a simple open-versus-closed binary.
The boundary between open and closed models is also becoming increasingly blurred. Open-weight models are increasingly offered via API by third-party LLM inference platforms, combining the flexibility of open models with the simplicity of hosted access. This hybrid approach often strikes a practical balance between control and operational complexity.
Leading Closed LLMs
Below, we present some key costs and metrics for leading closed-source models available via API. Many of these models have additional complexity and varied pricing, including options for fast modes, thinking modes, context caching, and longer context.
We present the latest LiveBench benchmark score for each model as one measure for comparison. LiveBench is a continuously updated benchmark designed to provide a “contamination-free” evaluation of large language models by regularly releasing new questions with objective, verifiable answers. It scores models out of 100 on a diverse set of challenging tasks, with a significant focus on capabilities like reasoning, coding, and data analysis. The similar LiveBench scores of GPT-4.5 and Gemini 2.5 Flash-Lite, despite a 750x difference in input token cost, highlight both that smaller models are now very capable and that not all capabilities are captured in a single benchmark.

Leading Open-Weight LLMs
Below, we also present key costs, the LiveBench benchmark score, and context length for leading open-weight models available via API. We compare hosted versions of these models for easy comparison. Different API providers may host open-weight models with different levels of quantization, different context lengths, and different pricing, so performance can vary between providers.

Whether hosted or self-deployed, selecting a model only solves part of the problem. In practice, most of the complexity and opportunity lies in how that model is used: how it’s prompted, extended, fine-tuned, or embedded within a broader workflow. These system-level decisions often have a greater impact on performance and cost than the model choice itself.
A Practical Guide to Designing an LLM System
Simply picking the biggest or newest LLM is rarely the optimal strategy. A more effective approach starts with a deep understanding of the developer’s toolkit: knowing which technique to apply to which problem to achieve the desired capability and reliability without unnecessary cost. This is all part of the constant “march of nines” as you develop LLM systems modularly to add reliability and capability. Prioritize the easiest wins that deliver tangible value before investing in further incremental, and often costly, accuracy improvements. The reality will always vary case by case, but here is a quick guide to navigating this process.
Step 1: Open Versus Closed?
This is often your first decision.
- Go with a closed-API model (e.g., from OpenAI, Google, Anthropic) if: Your priority is accessing the absolute state-of-the-art models with maximum simplicity.
- Go with an open-weight model (e.g., Llama, Mistral, Qwen, DeepSeek) if:
- Data security and compliance are paramount: If you need to guarantee that sensitive data never leaves your own infrastructure.
- You need deep customization and control: If your goal is to fine-tune a model on proprietary data and to create a specialized expert that you control completely.
If you went open, what can you realistically run? Your own GPU infrastructure is a hard constraint. Assess your cluster size and memory to determine if you can efficiently run a large, leading 1 trillion+ parameter MoE model, such as Kimi K2, or if you are better served by a medium-size model such as Gemma 3 27B or a much smaller model like Gemma 3n that can even run on mobile.
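A quick way to sanity-check this is to estimate the memory needed just to hold the weights at a given precision. The sketch below is a lower bound only: real deployments also need headroom for the KV cache, activations, and serving overhead, and MoE models must hold all experts in memory even though only a few are active per token.

```python
# Lower-bound memory estimate for holding model weights at a given precision.

def min_weight_memory_gb(params_billions: float, bits_per_param: int = 16) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, params in [("27B-class model", 27), ("70B-class model", 70), ("1T-class MoE", 1000)]:
    fp16 = min_weight_memory_gb(params, 16)
    q4 = min_weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{q4:.0f} GB at 4-bit quantization")
```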
Step 2: Gauging the Need for Reasoning
Does your task require the model to simply blast out a response, or does it need to think first?
- Reasoning: For tasks that involve complex, multistep problem-solving, brainstorming, strategic planning, intricate code generation, or deep analysis, you need a dedicated reasoning model such as o3, Gemini 2.5 Pro, DeepSeek R1, or Claude 4. In some cases these models can be used in high-reasoning mode, which encourages the model to think for longer before responding.
- No reasoning: For straightforward tasks like simple Q&A, summarization of a single document, data extraction, or classification, a powerful reasoning model is overkill.
- The middle ground: For tasks requiring moderate reasoning, such as generating a structured report from a few data points or performing basic data analysis at scale, a “mini” reasoning model, like OpenAI’s o4-mini or Gemini Flash 2.5, offers a balance of capability and cost.
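One common pattern is a simple router that escalates only the requests that genuinely need deeper reasoning. The sketch below uses placeholder model names, a stub call_llm() helper, and a crude keyword heuristic; production routers often use a small classifier model instead.

```python
# A minimal routing sketch: cheap "mini" model by default, reasoning model on demand.
# Model names, call_llm(), and the heuristic are placeholders.

REASONING_MODEL = "reasoning-model"   # e.g., an o-series or Gemini 2.5 Pro tier
FAST_MODEL = "mini-model"             # e.g., a flash/mini tier

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("Wire this up to your provider SDK.")

def needs_reasoning(prompt: str) -> bool:
    hard_signals = ("plan", "prove", "debug", "multi-step", "analyze the trade-offs")
    return len(prompt) > 2_000 or any(s in prompt.lower() for s in hard_signals)

def answer(prompt: str) -> str:
    model = REASONING_MODEL if needs_reasoning(prompt) else FAST_MODEL
    return call_llm(model, prompt)
```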
Step 3: Pinpointing Key Model Attributes
Beyond general intelligence and reasoning, modern LLMs are specialists. Your choice should be guided by the specific attributes and “superpowers” your application needs.
- Prioritize accuracy over cost: For high-value tasks where mistakes are costly or where a human expert’s time is being saved, pay for accuracy. o3-pro is a standout model here, and it can even be used as a fact checker to meticulously check the details of an earlier LLM output.
- Prioritize speed and cost over accuracy: For user-facing, real-time applications like chatbots or high-volume, low-value tasks like simple data categorization, latency and cost are paramount. Choose a hyper-efficient “flash” or “mini” model such as Gemini 2.5 Flash-Lite. Qwen3-235B models can also be a great option here but are too complex for most teams to run inference on themselves.
- Do you need a deep, long-context researcher? For tasks that require synthesizing information from massive documents, entire codebases, or extensive legal contracts, a model with a vast and highly effective context window is crucial. Gemini 2.5 Pro excels here.
- Is multimodality essential? If your application needs to understand or generate images, process audio in real time, or analyze video, your choice narrows to models like GPT-4o or the Gemini family. For one-shot YouTube video processing, Gemini is the standout.
- Is it a code-specific task? While many models can code, some are explicitly tuned for it. Among open-weight models, Codestral and Gemma do a decent job. But Claude has won hearts and minds, at least for now.
- Do you need live, agentic web search? For answering questions about current events or topics beyond the model’s knowledge cutoff, consider a model with a built-in, reliable web search, such as o3.
- Do you need complex dialogue and emotional nuance? GPT-4.5, Kimi K2, Claude Opus 4.0, or Grok 4 do a great job.
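One way to operationalize this checklist is to encode your requirements and filter candidates before investing in custom evals. In the sketch below, the model names and attribute values are illustrative placeholders only; check current provider documentation and your own measurements before relying on anything like this.

```python
# Turning the attribute checklist above into a shortlist filter.
# All candidate names and attribute values are illustrative placeholders.

CANDIDATES = {
    "frontier-reasoner": {"reasoning": True,  "multimodal": True,  "context_k": 200, "usd_per_m_in": 10.0},
    "fast-mini":         {"reasoning": False, "multimodal": True,  "context_k": 128, "usd_per_m_in": 0.1},
    "open-weight-70b":   {"reasoning": False, "multimodal": False, "context_k": 128, "usd_per_m_in": 0.6},
}

REQUIREMENTS = {"reasoning": False, "multimodal": True, "min_context_k": 100, "max_usd_per_m_in": 1.0}

def meets(attrs: dict, req: dict) -> bool:
    return (
        (attrs["reasoning"] or not req["reasoning"])
        and (attrs["multimodal"] or not req["multimodal"])
        and attrs["context_k"] >= req["min_context_k"]
        and attrs["usd_per_m_in"] <= req["max_usd_per_m_in"]
    )

shortlist = [name for name, attrs in CANDIDATES.items() if meets(attrs, REQUIREMENTS)]
print(shortlist)  # the candidates to take into your custom evals
```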
Step 4: Prompting, Then RAG, Then Evaluation
Before you dive into more complex and costly development, always see how far you can get with the simplest techniques. This is a path of escalating complexity. Model choice for RAG pipelines is often centered on end-user latency, but more complex agentic RAG workflows and long-context RAG tasks increasingly require reasoning models or longer context capabilities.
- Prompt engineering first: Your first step is always to maximize the model’s inherent capabilities through clear, well-structured prompting. Often, a better prompt with a more capable model is all you need.
- Move to retrieval-augmented generation (RAG): If your model’s limitation is a lack of specific, private, or up-to-date knowledge, RAG is the next logical step. This is the best approach for reducing hallucinations, providing answers based on proprietary documents, and ensuring responses are current. However, RAG is not a panacea. Its effectiveness is entirely dependent on the quality and freshness of your dataset, and building a retrieval system that consistently finds and uses the most relevant information is a significant engineering challenge. RAG also comes with many associated decisions, such as the quantity of data to retrieve and feed into the model’s context window, and just how much use you make of long-context capabilities and context caching.
- Iterate with advanced RAG: To push performance, you will need to implement more advanced techniques like hybrid search (combining keyword and vector search), re-ranking retrieved results for relevance, and query transformation; a minimal hybrid-retrieval sketch follows this list.
- Build custom evaluation: Ensure that iterations on your system design, additions of new advanced RAG techniques, and updates to the latest model are actually moving your key metrics forward.
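To illustrate the retrieval side, here is a self-contained sketch of hybrid retrieval followed by prompt assembly. The bag-of-words “embedding” and keyword scores are toy stand-ins; a production system would use a real vector index, BM25, and a cross-encoder re-ranker.

```python
# A toy hybrid-retrieval sketch: combine a keyword score with a stand-in
# embedding similarity, then keep the top-k chunks for the prompt.
from collections import Counter
from math import sqrt

CHUNKS = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Our premium plan includes priority support and higher rate limits.",
    "Returned items must be unused and in their original packaging.",
]

def bow(text: str) -> Counter:
    return Counter(text.lower().split())  # stand-in for a real embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keyword_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def hybrid_retrieve(query: str, k: int = 2, alpha: float = 0.5):
    scored = [
        (alpha * cosine(bow(query), bow(ch)) + (1 - alpha) * keyword_score(query, ch), ch)
        for ch in CHUNKS
    ]
    return [ch for _, ch in sorted(scored, reverse=True)[:k]]

context = "\n".join(hybrid_retrieve("how long do refunds take for returned items"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long do refunds take?"
print(prompt)
```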
Step 5: Fine-Tune or Distill for Deep Specialization
If the model’s core behavior—not its knowledge—is still the problem, then it’s time to consider fine-tuning. Fine-tuning is a significant undertaking that requires a high-quality dataset, engineering effort, and computational resources. However, it can enable a smaller, cheaper open-weight model to outperform a massive generalist model on a specific, narrow task, making it a powerful tool for optimization and specialization.
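If you do go down this path, parameter-efficient methods such as LoRA keep the cost manageable. Below is a minimal setup sketch using Hugging Face transformers and peft; the model ID, target modules, and hyperparameters are illustrative choices rather than recommendations, and the actual training loop (for example, with the transformers Trainer or TRL) is omitted.

```python
# A minimal LoRA fine-tuning setup sketch using transformers + peft.
# The model ID and hyperparameters below are illustrative, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # small open-weight model as a stand-in
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=16,                                 # adapter rank: capacity vs. size trade-off
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in this architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# From here, train on your formatted (prompt, response) dataset with the standard
# transformers Trainer or TRL's SFTTrainer, then merge or serve the adapter.
```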
- Fine-tuning is for changing behavior, not adding knowledge. Use it to teach a model a specific skill, style, or format. For example:
- To reliably output data in a complex, structured format like specific JSON or XML schemas.
- To master the unique vocabulary and nuances of a highly specialized domain (e.g., legal, medical).
- Some closed-source models are available for fine-tuning via API, such as Gemini 2.5 Flash and various OpenAI models. The largest models are normally not available.
- Among open-weight models, Llama 3.3 70B and Qwen 2.5 72B are fine-tuning staples, though fine-tuning an open-weight model yourself is a more involved process.
- Model distillation can also serve as a production-focused optimization step. In its simplest form, this consists of generating synthetic data from larger models to create fine-tuning datasets to improve the capabilities of smaller models.
- Reinforcement fine-tuning (RFT) for problem-solving accuracy: Instead of just imitating correct answers, the model learns by trial, error, and correction. It is rewarded for getting answers right and penalized for getting them wrong.
- Use RFT to: Create a true “expert model” that excels at complex tasks with objectively correct outcomes.
- The advantage: RFT is incredibly data-efficient, often requiring only a few dozen high-quality examples to achieve significant performance gains.
- The catch: RFT requires a reliable, automated “grader” to provide the reward signal. Designing this grader is a critical engineering challenge.
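For a sense of what such a grader can look like, here is a minimal sketch for a task with objectively checkable output (valid JSON with a numeric total). The schema and reward values are illustrative placeholders.

```python
# A minimal automated-grader sketch for RFT on a verifiable task.
# The expected schema and reward values are illustrative.
import json

def grade(model_output: str, expected_total: float) -> float:
    """Return a reward in [0, 1] for the RFT training loop."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0                      # unparseable output gets no reward
    total = payload.get("total")
    if not isinstance(total, (int, float)):
        return 0.2                      # partial credit: valid JSON, wrong schema
    return 1.0 if abs(total - expected_total) < 0.01 else 0.3

print(grade('{"total": 1234.56}', 1234.56))  # 1.0
print(grade('total is 1234.56', 1234.56))    # 0.0
```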
Step 6: Orchestrated Workflows Versus Autonomous Agents
The critical decision here is how much freedom to grant. Autonomous agents are also more likely to need more expensive reasoning models with greater levels of inference scaling. Parallel inference scaling methods with multiple agents are also beginning to deliver great results. Small errors can accumulate and multiply across many successive agentic steps, so the investment in a stronger, more capable model can make all the difference in building a usable product.
- Choose an orchestrated workflow for predictable tasks: You design a specific, often linear, sequence of steps, and the LLM acts as a powerful component at one or more of those steps.
- Use when: You are automating a known, repeatable business process (e.g., processing a customer support ticket, generating a monthly financial summary). The goal is reliability, predictability, and control.
- Benefit: You maintain complete control over the process, ensuring consistency and managing costs effectively because the number and type of LLM calls are predefined.
- Build hybrid pipelines: Often, the best results will come from combining many LLMs, open and closed, within a pipeline.
- This means using different LLMs for different stages of a workflow: a fast, cheap LLM for initial query routing; a specialized LLM for a specific subtask; a powerful reasoning LLM for complex planning; and perhaps another LLM for verification or refinement.
- At Towards AI, we often have 2-3 different LLMs from different companies in an LLM pipeline.
- Choose an autonomous agent for open-ended problems: You give the LLM a high-level goal, a set of tools (e.g., APIs, databases, code interpreters), and the autonomy to figure out the steps to achieve that goal.
- Use when: The path to the solution is unknown and requires dynamic problem-solving, exploration, or research (e.g., debugging a complex software issue, performing deep market analysis, planning a multistage project).
- The critical risk—runaway costs: An agent that gets stuck in a loop, makes poor decisions, or explores inefficient paths can rapidly accumulate enormous API costs. Implementing strict guardrails is critical (see the sketch after this list):
- Budget limits: Set hard caps on the cost per task.
- Step counters: Limit the total number of “thoughts” or “actions” an agent can take.
- Human-in-the-loop: Require human approval for potentially expensive or irreversible actions.
- Gemini 2.5 Pro and o3 are our favorite closed-API models for agent pipelines, while among open-weight models we like Kimi K2.
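To make those guardrails concrete, here is a minimal sketch of an agent loop with a hard budget cap, a step limit, and a human-approval hook. The agent_step() and estimate_cost() helpers are placeholders for your actual agent framework and pricing logic.

```python
# A minimal agent-guardrail sketch: budget cap, step counter, human-in-the-loop.
# agent_step() and estimate_cost() are placeholders for your framework.

MAX_STEPS = 15
MAX_BUDGET_USD = 2.00
RISKY_ACTIONS = {"send_email", "delete_record", "execute_payment"}

def agent_step(state: dict) -> dict:
    """Placeholder: one plan/act iteration returning {'action', 'tokens_in', 'tokens_out', 'done'}."""
    raise NotImplementedError

def estimate_cost(step: dict) -> float:
    """Placeholder: convert token usage to dollars for your chosen model."""
    raise NotImplementedError

def run_agent(goal: str) -> dict:
    state = {"goal": goal, "history": []}
    spent = 0.0
    for step_count in range(MAX_STEPS):          # step-counter guardrail
        step = agent_step(state)
        spent += estimate_cost(step)
        if spent > MAX_BUDGET_USD:               # budget guardrail
            raise RuntimeError(f"Budget exceeded after {step_count + 1} steps (${spent:.2f})")
        if step["action"] in RISKY_ACTIONS:      # human-in-the-loop guardrail
            if input(f"Approve '{step['action']}'? [y/N] ").lower() != "y":
                break
        state["history"].append(step)
        if step.get("done"):
            break
    return state
```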
Working through these steps helps translate a vague problem into a concrete implementation plan, one that’s grounded in clear trade-offs and tailored to your needs. This structured approach often yields systems that are not only more capable and reliable but also far more effective for specific tasks than a general-purpose chatbot ever could be.
Conclusion
The open-versus-closed race gives us rapid access to strong LLMs but also creates complexity. Selecting and deploying them demands both engineering discipline and economic clarity.
No single LLM is a cure-all. A practical, evolving toolkit is essential, but knowing which tool to pull out for which job is the real art. The challenge isn’t just picking a model from a list; it’s about architecting a solution. This requires a systematic approach, moving from high-level strategic decisions about data and security down to the granular, technical choices of development and implementation.
The success of specialized “LLM wrapper” applications like Cursor for coding or Perplexity for search, some of which are now valued at over $10 billion, underscores the immense value in this tailored approach. These applications aren’t just thin wrappers; they are sophisticated systems that leverage foundation LLMs but add significant value through custom workflows, fine-tuning, data integration, and user experience design.
Ultimately, success hinges on informed pragmatism. Developers and organizations need a sharp understanding of their problem space and a firm grasp of how cost scales across model choice, series and parallel reasoning, context usage, and agentic behavior. Above all, custom evaluation is non-negotiable because your use case, not a benchmark, is the only standard that truly matters.