Ultimate Guide to Local LLMs
Agent Native·April 2026 Edition

Ultimate Guide to Local LLMs in 2026

A comprehensive technical reference for practitioners: production war stories, undocumented behaviors, and rare optimization techniques for deploying local LLMs in the real world.


Chapter 1: Introduction

Why local deployment moved from hobbyist territory into the default toolbelt for serious AI builders.

What the source document gets right

The opening thesis is strong: if you build anything serious with AI in 2026, local deployment is no longer a side quest. The interesting shift is not just model quality. It is that capacity, context, multimodality, and runtime ergonomics now make local inference operationally relevant.

The state of local LLMs in 2026

Running large language models locally has moved from niche experimentation into mainstream engineering practice. Teams are running frontier-class mixture-of-experts models on desks, standing up agent swarms on demand, and treating multimodality as a native assumption rather than an exotic add-on.

For practitioners, that changes the skill floor. Understanding local deployment is not optional if you care about cost control, privacy, latency, reliability, or just having a sane fallback when an external model API becomes the bottleneck.

What makes this guide different

This guide is framed as a practitioner reference rather than a docs summary. The source material is explicitly opinionated about the gaps that matter in real deployments: undocumented behavior, production failures, quantization recipes, memory optimization, and multi-GPU pitfalls.

Operational bias
Useful in production
  • War stories instead of happy-path examples
  • Measured quantization choices instead of vague quality claims
  • Memory and throughput tuning beyond official documentation
  • Monitoring configurations and failure semantics rather than toy demos
Core insight
Important
  • Most local-LLM bottlenecks are memory bandwidth and fragmentation, not raw FLOPs
  • The GPU is often waiting on weight loads instead of compute
  • Hybrid attention and MoE architectures change the hardware economics
  • Practical inference is constrained by memory movement, not benchmark fantasy
  • MoE dominance. Frontier models now routinely use Mixture-of-Experts. You still load a very large model footprint, but per-token compute only touches a subset of experts, which makes enormous models practical on smaller hardware.
  • Hybrid attention everywhere. Pure self-attention is being displaced by hybrids that drastically reduce KV-cache pressure. What once needed double-digit gigabytes for long context now often fits in a fraction of that budget.
  • Hardware and runtime co-evolution. NVFP4, FP8, unified memory systems, and faster inference runtimes are narrowing the gap between theoretical and delivered performance.

How to use this guide

The document itself is structured like a reference manual, and that is the right reading model here too. The back half of the guide opens into production deployment, hardware decisions, advanced customization, and the appendices.

Key takeaways
  1. Part I is the prerequisite for everything else in the book.
  2. If you are new to local LLMs, start here before touching framework or hardware decisions.
  3. If you already deploy models, use this section to recalibrate your mental model around memory and throughput.

Chapter 2: Core Concepts and Internals

What is actually happening when you run a model locally, and why that matters for every downstream decision.

Inference is next-token prediction, step by step

Local inference is just the forward-pass side of a pre-trained model. The important operational point is that generation is autoregressive: one token is predicted, appended, and then used as context for the next step. That is why latency compounds and why caching matters so much.

Step 1: Tokenization
Input text is split into subword tokens

Step 2: Embedding
Token IDs are mapped into learned vectors

Step 3: Forward pass
Embeddings move through attention + MLP blocks

Step 4: Prediction
The model outputs logits across the full vocabulary

Step 5: Sampling
One token is selected via greedy, temperature, top-k, or top-p

Step 6: Append and repeat
The selected token is appended and the cycle continues
Why the KV cache exists

Without the KV cache, every decode step would recompute attention against the full prefix from scratch. The cache is not a convenience feature. It is the thing that makes autoregressive decoding economically viable.
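The point is easy to see with a toy cost model. This sketch (names and units ours, counting key/value positions attended to per step) compares decode work with and without cached keys and values:

```python
# Toy cost model: attention work per decode step, with and without a KV cache.
# "Cost" counts key/value positions attended to; illustrative only.

def decode_cost(n_new_tokens, prompt_len, cached):
    total = 0
    for step in range(n_new_tokens):
        seq_len = prompt_len + step + 1  # prefix plus tokens generated so far
        if cached:
            # Only the newest token's query attends over the stored keys.
            total += seq_len
        else:
            # Every position recomputes attention over the full prefix.
            total += seq_len * seq_len
    return total

with_cache = decode_cost(256, 1024, cached=True)
without_cache = decode_cost(256, 1024, cached=False)
print(without_cache // with_cache)  # recompute costs orders of magnitude more
```

For 256 tokens on a 1,024-token prompt, the uncached path does over a thousand times more attention work, which is exactly why the cache is load-bearing rather than a convenience.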

Tokens are not words

A lot of bad capacity planning comes from treating tokens as if they were human words. They are not. Token counts vary with tokenizer, language, and domain. Long identifiers, multilingual text, and punctuation-heavy inputs all distort naive assumptions.

  • hello may be 1 token, but complex words can be 5-8 tokens.
  • A 100-word paragraph often lands around 130-150 tokens.
  • Code often tokenizes differently from natural language and can be either more compact or much noisier depending on the syntax.
  • Longer context windows always increase KV-cache memory, decode latency, and memory bandwidth pressure.
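For rough capacity planning, those observations can be folded into a heuristic. The ratios below are common rules of thumb, not tokenizer ground truth; always measure with the actual tokenizer before committing to a context budget:

```python
# Rough token-count estimates for capacity planning. The ratios are assumed
# rules of thumb (tokens per word), not measured tokenizer behavior.

def estimate_tokens(text, kind="prose"):
    ratios = {
        "prose": 1.35,        # ~130-150 tokens per 100 English words
        "code": 1.8,          # identifiers and punctuation often split heavily
        "multilingual": 2.2,  # non-English text can cost far more per word
    }
    words = len(text.split())
    return int(words * ratios[kind])

paragraph = " ".join(["word"] * 100)
print(estimate_tokens(paragraph))           # 135 with the prose rule of thumb
print(estimate_tokens(paragraph, "code"))   # 180: the same length costed as code
```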

What is inside a model file

When you download a model, you are not just pulling weights. You are also taking on tokenizer state, architectural configuration, and chat-template behavior. If any of those drift from what the runtime expects, quality collapses fast.

  • Neural network weights: the learned tensors that dominate storage and VRAM.
  • Tokenizer: vocabulary, merge rules, and special tokens such as BOS/EOS/PAD.
  • Configuration: hidden size, layer count, attention heads, context window, normalization constants, and RoPE parameters.
  • Chat template: the message framing the model expects. Wrong template often means obviously degraded or nonsensical output.
  • Architecture assumptions: especially important for MoE and hybrid attention models where the runtime must understand the structure correctly.
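To make the chat-template point concrete, here is a sketch of ChatML-style framing, the convention several model families use. The special tokens are illustrative; the authoritative template ships with the model, and mismatching it is a common cause of degraded output:

```python
# Sketch of ChatML-style message framing (used by several model families).
# The <|im_start|>/<|im_end|> tokens are illustrative of the idea; always use
# the template bundled with the model you deploy.

def apply_chatml(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = apply_chatml([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

A runtime that feeds raw concatenated text to a model trained on this framing, or vice versa, is one of the fastest ways to produce the "obviously degraded output" described above.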

Chapter 3: The Transformer Architecture

The core skeleton behind almost every modern LLM, plus the efficiency layers that now matter more than the original design.

Attention, MLPs, and positional structure

The original transformer gave us the baseline: self-attention to decide which prior tokens matter, MLP blocks to add expressiveness, residual connections for depth, and positional encoding so token order still means something.

# Self-attention mechanism (runnable NumPy sketch)
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

The expensive truth is the quadratic scaling: attention cost grows with sequence length, which is why long context becomes painful long before you hit the model's advertised maximum window.
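The quadratic term is the score matrix, which is seq_len x seq_len per head. A quick calculation (head count and dtype are assumed example values; fused kernels like FlashAttention avoid materializing this matrix, but the arithmetic cost remains quadratic) shows why doubling context hurts:

```python
# The naive attention score matrix is seq_len x seq_len per head: doubling
# context quadruples this intermediate. Head count and 2-byte elements are
# assumed example values.

def score_matrix_bytes(seq_len, n_heads=32, bytes_per_el=2):
    return seq_len * seq_len * n_heads * bytes_per_el

for ctx in (2048, 8192, 32768):
    gib = score_matrix_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:8.2f} GiB of attention scores (naive)")
```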

Modern efficiency optimizations

GQA and MQA
Memory reduction
  • Grouped Query Attention shares KV heads across query heads.
  • Multi-Query Attention pushes the idea further by letting all queries share one KV head.
  • The trade is simple: much cheaper KV-cache usage for some quality risk on harder long-context tasks.
MLA and DSA
2026 era
  • Multi-head Latent Attention compresses the KV representation before reuse.
  • DeepSeek Sparse Attention prunes full attention down to a top-k set of relevant tokens.
  • These innovations target the real bottleneck: memory movement and long-context overhead.

Why hybrid attention matters

One of the strongest claims in the source document is that the era of pure self-attention is ending. That matches what we are seeing across multiple model families: full global attention is being used sparingly, while sliding windows, sparse selectors, and alternating global layers pick up the slack.

Operational implication

When a model says it supports long context, the question is no longer just “how many tokens?”. The more useful question is “what architecture made that context affordable, and what tradeoffs did it impose on latency, quality, and memory?”.

Chapter 4: Memory, VRAM, and the KV Cache

The chapter that usually determines whether a local deployment plan is realistic or fantasy.

What must fit in VRAM

During inference, VRAM is shared by model weights, the KV cache, activation buffers, and runtime overhead. In practice, this means your simple weight-size estimate is almost never enough.

  • Model weights are usually the largest static footprint.
  • KV cache grows with sequence length and can exceed the weight footprint at long contexts.
  • Activation buffers are temporary but real, especially with larger batches.
  • Runtime overhead and fragmentation consume a non-trivial slice of the card before useful work starts.
Practical total VRAM estimate:

Total VRAM ≈ model weights + KV cache + activation buffers + runtime overhead

For planning, assume:
- FP16 model: ~2.5x to 3x raw weight size
- 4-bit quantized: ~1.5x to 2x raw weight size
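Those planning multipliers are easy to turn into a quick estimator. This sketch (function name and structure ours) applies the rule of thumb above; the outputs are rough envelopes, not measurements:

```python
# Planning-rule sketch: total VRAM as a multiple of raw weight size.
# Bytes-per-param and the low/high multipliers follow the rule of thumb
# in the text; treat results as envelopes, not measurements.

def vram_envelope_gb(param_count_b, quant="fp16"):
    rules = {
        "fp16": (2.0, 2.5, 3.0),  # bytes/param, low multiplier, high multiplier
        "q4":   (0.5, 1.5, 2.0),
    }
    bytes_per_param, lo, hi = rules[quant]
    raw_gb = param_count_b * bytes_per_param  # params in billions -> GB
    return raw_gb * lo, raw_gb * hi

lo, hi = vram_envelope_gb(8, "fp16")
print(f"8B FP16: plan for {lo:.0f}-{hi:.0f} GB")
lo, hi = vram_envelope_gb(70, "q4")
print(f"70B Q4:  plan for {lo:.0f}-{hi:.0f} GB")
```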

The KV cache reality

The formula for KV-cache size is straightforward. Production reality is not. Alignment, padding, prefix caching, and fragmentation mean real memory usage drifts significantly above the ideal math.

KV_cache = 2 x n_layers x n_kv_heads x d_head x seq_len x batch_size x bytes_per_param

Reality in production:
KV_cache_real = formula x 1.3
              + fragmentation waste
              + prefix-cache overhead
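The formula and the 1.3x production correction can be expressed directly. The example configuration below (32 layers, 8 KV heads, head dim 128) is an assumed 8B-class GQA shape, not any specific model:

```python
# Direct implementation of the KV-cache formula above, plus the ~1.3x
# production fudge factor the text describes. Example shape is an assumed
# 8B-class GQA configuration.

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   batch_size=1, bytes_per_param=2):
    ideal = (2 * n_layers * n_kv_heads * d_head
             * seq_len * batch_size * bytes_per_param)
    return ideal, int(ideal * 1.3)  # ideal math vs. rough production estimate

ideal, real = kv_cache_bytes(32, 8, 128, seq_len=32768)
print(f"ideal {ideal / 2**30:.2f} GiB, plan for ~{real / 2**30:.2f} GiB")
```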
Silent throughput killer

The source document cites vLLM research showing that only roughly 20-38% of allocated KV-cache memory is actually used in production. The rest disappears into fragmentation and over-allocation. That means “enough VRAM on paper” is not the same thing as good throughput in a live system.

Memory bandwidth is the bottleneck

This is the headline insight from the foundations section: most decode time is spent moving weights, not doing math. That is why so many local-LLM discussions go wrong. They focus on TFLOPs instead of asking what the memory subsystem can sustain under repeated decode.

Example from the source material
Qwen3-style decode on H100:
- Compute needed: ~7.7 GFLOPs per token
- Peak compute suggests microseconds
- Actual latency lands in milliseconds

Why?
- Weight loads dominate
- 6 GB of weights may need to move per token
- Practical decode becomes a bandwidth problem, not a pure compute problem
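The arithmetic behind that example is a two-line roofline estimate. The hardware figures here (~1000 dense low-precision TFLOPS, ~3.35 TB/s HBM bandwidth) are assumed round numbers for an H100-class part, not measurements:

```python
# Back-of-envelope decode latency: compute time vs. weight-movement time.
# Hardware numbers are assumed round figures for an H100-class part.

PEAK_FLOPS = 1.0e15      # ~1000 TFLOPS (dense, low precision, optimistic)
MEM_BW = 3.35e12         # ~3.35 TB/s HBM bandwidth

flops_per_token = 7.7e9  # from the example above
bytes_per_token = 6e9    # ~6 GB of weights touched per decode step

compute_us = flops_per_token / PEAK_FLOPS * 1e6
memory_us = bytes_per_token / MEM_BW * 1e6
print(f"compute-bound: {compute_us:.1f} us, bandwidth-bound: {memory_us:.0f} us")
```

Compute alone predicts single-digit microseconds per token; the weight traffic predicts nearly two milliseconds, which is why measured decode latency tracks bandwidth, not TFLOPs.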

This is also why MoE models are such an important structural change. They let you load a large-capacity system while only activating a much smaller compute footprint per token.

RoPE and KV quantization

One of the more valuable low-level notes in the source document is the interaction between RoPE and KV-cache quantization. Post-RoPE quantization often suffers because channel magnitudes are already mixed. Pre-RoPE quantization preserves cleaner channel structure and can maintain quality at much more aggressive compression.

Post-RoPE quantization
- Keys already have mixed magnitudes
- Outlier channels skew per-token ranges
- 3-bit KV cache often degrades noticeably

Pre-RoPE quantization
- Quantize keys before rotation
- Channels remain more stable
- Compression can stay aggressive with smaller quality loss
Key takeaways
  1. Local inference planning is mostly memory planning.
  2. KV cache is where long context becomes expensive.
  3. Fragmentation and allocation behavior matter as much as raw formulas.
  4. Bandwidth dominates decode, which is why architecture and runtime choice matter so much.

Chapter 5: The 2026 Model Landscape

The open-model market is now large enough that architecture and deployment characteristics matter more than leaderboard hype.

The ecosystem changed shape fast

The source document anchors this chapter in one useful fact pattern: by April 2026, the open model ecosystem is no longer a tidy list of a few serious contenders. It is a market with millions of public models, a huge downstream derivative layer, and a clear center of gravity around Chinese labs and their derivatives.

Frontier open models

GLM-5
Reasoning frontier
  • 744B MoE with roughly 40B active parameters per token.
  • Uses MLA + DSA to make very long contexts practical.
  • Strong on AIME, Terminal-Bench, and OSWorld style evaluations.
  • Tradeoff: fewer layers improve latency, but sequential reasoning drops on harder chains.
Qwen3.5
Best local ratio
  • 397B-A17B flagship plus smaller dense and MoE variants.
  • Gated DeltaNet + Gated Attention hybrid dramatically lowers KV-cache cost.
  • Qwen3.5-35B-A3B is the practitioner sweet spot for local deployment.
  • Tradeoff: DeltaNet can underperform on cross-document reference retrieval workloads.
Kimi K2.5
Coding + multimodal
  • 1T total parameters, 32B active per token.
  • Native multimodality via early-fusion training, not late add-on adapters.
  • Excellent coding and tool-heavy performance.
  • Tradeoff: vision tokens create heavy KV-cache pressure and memory planning gets expensive fast.
Step 3.5 Flash
Speed leader
  • 196B MoE with 11B active parameters and MTP-3 used during both training and inference.
  • Can exceed 100 tokens per second on Hopper-class hardware in code-heavy workloads.
  • Acceptance rates stay very high on structured code generation.
  • Tradeoff: the headline speedups compress substantially on creative or less predictable tasks.

Consumer GPU picks

The most important practical recommendation in the source material is not to chase the biggest model you can barely load. The better question is which model delivers the best quality per active token, memory footprint, and runtime behavior on the hardware you already own.

  • Qwen3.5-35B-A3B is the clearest local-deployment pick when you need serious quality on a 16GB class card.
  • MiniMax M2.5 is a useful reminder that training recipe and data quality still beat architecture novelty on many coding tasks.
  • Llama 4 Scout remains relevant when you care about broad compatibility and aggressive context windows.
  • Kimi K2.5 is compelling if coding and multimodality justify the extra memory and deployment complexity.

Decision matrix

Use case matrix

Coding / SWE-bench      -> Kimi K2.5        | Alt: Step 3.5 Flash
Math / reasoning        -> GLM-5            | Alt: Kimi K2.5
Agentic tasks           -> GLM-5            | Alt: Qwen3.5
Long context            -> Qwen3.5          | Alt: Llama 4 Scout
Raw speed               -> Step 3.5 Flash   | Alt: MiniMax M2.5
Consumer GPU deployment -> Qwen3.5-35B-A3B  | Alt: Llama 4 Scout
Multimodal workloads    -> Kimi K2.5        | Alt: Qwen3.5

Chapter 6: Architectural Innovations in 2026

The important architectural shift is not one new trick. It is that multiple labs are converging on the same efficiency ideas.

Hybrid attention is the new standard

The big architectural signal is convergence. Qwen3.5 mixes Gated DeltaNet with Gated Attention, Gemma 3 and MiMo-V2-Flash use sliding-window plus global layers, and other families land on their own variant of the same idea: keep some global signal, but stop paying full-attention cost on every layer for every token.

  • Qwen3.5 / Qwen3-Next: Gated DeltaNet + Gated Attention in a 3:1 ratio.
  • Gemma 3 / MiMo-V2-Flash: sliding-window layers with periodic full global attention.
  • Ling 2.5 and related systems: Lightning Attention combined with MLA or other compressed-cache strategies.
Why builders should care

This is the single biggest reason large-context models now fit on smaller machines. The point is not elegance. The point is that KV-cache cost drops by 3-6x, which changes what is economically deployable on consumer hardware.

Multi-token prediction

Multi-token prediction has moved from a training curiosity into a real inference lever. Models such as Step 3.5 Flash, GLM-5, and newer MiniMax variants predict several tokens at once, which gets you a speculative-decoding-like speedup without paying for a separate draft model.

Where MTP wins
High acceptance
  • Code generation with strong local regularity.
  • Technical writing with repetitive patterns.
  • Long continuations where the next few tokens are predictable.
Where MTP softens
Lower acceptance
  • Creative writing at high temperature.
  • Open-ended ideation where token paths branch more aggressively.
  • Complex reasoning chains with unstable local predictions.
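The acceptance-rate effect can be modeled with simple expected-value math. This sketch assumes each extra predicted token is kept only if all before it were, with an independent per-token acceptance probability p, which is a deliberate simplification of real verification:

```python
# Simplified MTP speedup model: one step always yields the first token, and
# each extra predicted token survives only if all before it did, with assumed
# independent per-token acceptance probability p.

def expected_tokens_per_step(k_extra, p):
    expected = 1.0  # the first token is always kept
    run = 1.0
    for _ in range(k_extra):
        run *= p
        expected += run
    return expected

print(f"code-like  (p=0.9, +3 tokens): {expected_tokens_per_step(3, 0.9):.2f}x")
print(f"creative   (p=0.5, +3 tokens): {expected_tokens_per_step(3, 0.5):.2f}x")
```

Even this crude model reproduces the qualitative claim: high-regularity workloads approach the full multi-token speedup, while branchy generation keeps only a fraction of it.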

MLA, FlashAttention-4, and multimodality by default

  • MLA is no longer a DeepSeek-only curiosity. It is spreading because more context per GB of VRAM is a real deployment advantage.
  • FlashAttention-4 matters specifically on Blackwell and adjacent NVIDIA hardware, where asynchronous MMA behavior changes the optimization game.
  • Native multimodality is now the default assumption for serious families, which means image tokens and their KV-cache cost have to be included in planning.

Chapter 7: Model Selection and Decision Framework

The right model choice is a requirements exercise, not a leaderboard exercise.

The selection framework

  • Define the primary task: coding, reasoning, chat, multimodal, or agentic orchestration.
  • Set the latency envelope: interactive user-facing chat is a different world from batch generation.
  • Decide the real context requirement rather than the aspirational maximum.
  • Map those requirements to the hardware and budget you actually have, not the benchmark setup you wish you had.

Task-to-model mapping

Code generation  -> Kimi K2.5       | Alt: Step 3.5 Flash | Why: SWE-bench lead
Math / reasoning -> GLM-5           | Alt: Kimi K2.5     | Why: top AIME score
Long context     -> Qwen3.5         | Alt: Llama 4 Scout | Why: 262K to 1M context
Agent workflows  -> GLM-5           | Alt: Qwen3.5       | Why: strongest agentic benchmarks
Raw speed        -> Step 3.5 Flash  | Alt: MiniMax M2.5  | Why: best high-throughput decode
Multimodal       -> Kimi K2.5       | Alt: Qwen3.5       | Why: early-fusion strength

Hardware compatibility matrix

8 GB VRAM    -> Phi-4-Mini, Gemma 3 (1B-4B)
16 GB VRAM   -> Phi-4, Gemma 3 (12B), Llama 4 Scout at Q4
24 GB VRAM   -> Llama 4 Maverick at Q4, Qwen2.5-Coder-32B
32 GB VRAM   -> 70B models at Q4, Llama 4 Maverick at Q8
64 GB unified memory -> Qwen3.5-122B at Q4, Mixtral 8x22B
128 GB+      -> DeepSeek-R1 at Q4, Kimi K2.5 at Q2-Q4

The main discipline here is to plan for real overhead. If the hardware matrix says something technically fits, that still does not mean it fits comfortably enough to survive long context, mixed workloads, and runtime fragmentation.

License considerations

  • Apache 2.0 and MIT are the cleanest path for unrestricted commercial use.
  • Llama-style community licenses usually require attribution and place limits around improving competing models.
  • Custom licenses deserve actual legal reading, especially when you plan to productize hosted inference or derivative fine-tunes.

Chapter 8: Quantization Fundamentals

Quantization is still the most important practical lever for local deployment, because it changes both capacity and throughput.

What quantization is

Quantization converts high-precision weights into lower-bit representations so the model occupies less memory and moves fewer bytes during inference. That shrink can be the difference between a model that never loads and a model that becomes practical on a workstation.

  • Smaller file size and lower VRAM footprint.
  • Lower memory-bandwidth demand and often better tokens per second.
  • Some approximation error, which has to be measured rather than assumed.
  • Different methods degrade in different ways, so there is no single universal best format.
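The mechanics behind all of these formats reduce to the same round trip: scale, round, and reconstruct. A minimal per-tensor symmetric int8 sketch (pure Python, names ours) makes the approximation error visible:

```python
# Minimal symmetric int8 quantization round trip, showing the approximation
# error the bullets describe. Pure-Python sketch with per-tensor absmax scaling.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0  # map absmax onto int8 range
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.012, -0.45, 0.33, 0.0071, -0.2]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.5f} (scale {scale:.5f})")
```

The error is bounded by half a quantization step, which is why the interesting engineering is not the rounding itself but how scales are chosen and grouped.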

Numeric formats explained

Floating point
Training + inference
  • FP32 remains the full-precision reference and is rarely practical for inference.
  • FP16 was the old inference default before mainstream quantization.
  • BF16 keeps FP32 exponent range and is increasingly useful on supported hardware.
Low-bit formats
Deployment formats
  • INT8 is the safe lower-precision baseline when calibration is available.
  • INT4 remains the most important consumer deployment range.
  • FP8 and NVFP4 matter because they increasingly map to hardware-native acceleration.

Quantization methods

  • Post-training quantization is fast and convenient, but can lose quality if the approximation is crude.
  • Quantization-aware training gives better quality but is far more expensive operationally.
  • GPTQ reconstructs layer outputs to preserve activation behavior.
  • AWQ protects salient channels and often holds quality better at 4-bit.
  • GGUF packages multiple quantization families into the llama.cpp ecosystem with strong portability.
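A recurring knob in these methods is the quantization group size (the `group_size=128` you see in GPTQ/AWQ configs). This pure-Python 4-bit sketch (names and values ours) shows why grouping matters: one outlier inflates a shared scale and crushes small weights, while grouped scales localize the damage:

```python
# Why group_size matters: with one shared scale, a single outlier flattens
# small weights; grouped scales (as in 4-bit GPTQ/AWQ configs) contain it.
# Symmetric 4-bit absmax round trip, pure-Python sketch.

def q4_roundtrip(values, group_size):
    out = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        scale = (max(abs(v) for v in group) / 7.0) or 1.0  # int4 range [-7, 7]
        out.extend(round(v / scale) * scale for v in group)
    return out

weights = [0.01, -0.02, 0.015, -0.01] * 16 + [2.0]  # one outlier at the end
errors = {}
for gs in (len(weights), 16):
    restored = q4_roundtrip(weights, gs)
    errors[gs] = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"group_size={gs:>3}: max error {errors[gs]:.4f}")
```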

Quantization artifacts

The failure mode is workload-specific

The source document makes an important point that many benchmark summaries blur away: the same model quantized with GPTQ, AWQ, or GGUF can fail in different ways. One may lose numerical reasoning. Another may keep reasoning but hallucinate more in creative output. You have to test with the workload you plan to ship.

Chapter 9: Quantization Formats in Detail

Each format family has a different optimization target: portability, raw GPU speed, fidelity, or hardware-native acceleration.

GGUF and universal compatibility

GGUF remains the most practical format for broad compatibility because it works across CPU, GPU, and Apple Silicon while keeping the operational story simple. That is why it remains the default answer when the deployment target is not locked to a single NVIDIA-heavy environment.

Q4_0    -> maximum compression, fastest, roughest quality
Q4_K_M  -> best general-purpose choice
Q4_K_S  -> slightly smaller / faster for lighter models
Q5_K_M  -> near-FP16 feel for quality-sensitive workloads
Q6_K    -> high quality with diminishing returns
Q8_0    -> near-lossless; the safe choice when FP16 will not fit

GPU-optimized formats

GPTQ / AWQ / EXL2
GPU-first
  • GPTQ and AWQ target NVIDIA-style GPU inference and can outperform generic portability formats.
  • AWQ often holds quality better at 4-bit when calibration matches production data.
  • EXL2 is the speed monster when you are comfortable living inside the ExLlamaV2 ecosystem.
FP8 / NVFP4
2026 mainstream
  • FP8 is now a real mainstream option on Hopper, Ada, and Blackwell class hardware.
  • NVFP4 gives hardware-accelerated 4-bit inference without the same calibration overhead.
  • These are especially attractive when both weights and KV cache can exploit the native low-bit path.

Format selection

Need CPU / Mac / universal portability? -> GGUF Q4_K_M
Need highest NVIDIA GPU throughput?      -> EXL2 or GPTQ
Care about coding / creative fidelity?   -> AWQ
Running Hopper / Ada / Blackwell?        -> FP8 weights + FP8 KV cache
Need extreme compression on 24GB?        -> GGUF Q4_K_M or EXL2 3-bit
Want hardware-accelerated 4-bit?         -> NVFP4 via llama.cpp

The source also gives a simple rule-of-thumb ranking worth remembering: for 4-bit quality, AWQ tends to preserve the most, GGUF Q4_K_M is a strong middle ground, and GPTQ can trail both depending on workload. For raw GPU speed, EXL2 leads, then GPTQ, then AWQ, then GGUF.

Chapter 10: KV Cache Quantization

If long context is your real constraint, compressing the KV cache can matter more than squeezing the weights further.

Why KV-cache quantization matters

Llama-3-70B at 32K context
Q4 model weights         -> ~40 GB
FP16 KV cache            -> ~32 GB
Total                    -> ~72 GB

With FP8 / INT8 KV cache -> ~56 GB total
With 3-bit KV cache      -> ~52 GB total

That is why KV-cache quantization is a practical breakthrough. Once context gets large, the cache is no longer a side cost. It is the dominant cost you have to bring under control.

KV-cache methods

  • FP8 KV cache gives a clean 2x memory reduction with minimal quality loss and growing runtime support.
  • INT8 KV cache remains a useful middle ground when calibration is available.
  • KVQuant-style pre-RoPE quantization changes the quality curve by quantizing Keys before rotation rather than after it.
  • TurboQuant pushes toward 3-4 bits with learned codebooks and near-zero quality loss in the better cases.
vllm serve model \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales

Deployment findings

  • Per-channel quantization for Keys is essential because per-token scaling falls apart on outlier channels.
  • Pre-RoPE quantization improves perplexity materially over post-RoPE approaches.
  • Removing a tiny set of outliers can make 3-bit KV cache usable with almost no measurable quality loss.
  • Values behave differently: per-token quantization tends to work better than per-channel because it avoids error accumulation.
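The per-channel finding for Keys is easy to reproduce in miniature. In this toy (values and shapes ours; symmetric 4-bit absmax), one persistently large channel sets every token's per-token scale and flattens the rest, while per-channel scales keep them alive:

```python
# Toy reproduction of the Keys finding: a consistent outlier channel dominates
# per-token scales and zeroes the other channels; per-channel scaling does not.
# Symmetric 4-bit absmax, pure Python, illustrative values.

keys = [[8.0,  0.05, -0.03,  0.02],   # 4 tokens x 4 channels;
        [7.5, -0.04,  0.06, -0.01],   # channel 0 is a consistent outlier
        [8.2,  0.02,  0.03,  0.05],
        [7.9, -0.06, -0.02,  0.04]]

def rt(v, scale):
    return round(v / scale) * scale   # quantize + dequantize one value

per_token = [[rt(v, max(abs(x) for x in row) / 7.0) for v in row] for row in keys]

col_scales = [max(abs(row[j]) for row in keys) / 7.0 for j in range(4)]
per_channel = [[rt(v, col_scales[j]) for j, v in enumerate(row)] for row in keys]

def zeroed(mat):
    return sum(1 for row in mat for v in row if v == 0.0)

print(f"entries zeroed per-token:   {zeroed(per_token)} of 16")
print(f"entries zeroed per-channel: {zeroed(per_channel)} of 16")
```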

Chapter 11: Quantization Recipes and Quality Metrics

Good quantization work is less about one magic format and more about disciplined testing, calibration, and conversion workflow.

Measured quality loss

Llama-3-8B on Wikitext-2

FP16 baseline      -> 6.56 perplexity
BitsAndBytes 4-bit -> 6.67 (+1.7%)
GGUF Q4_K_M        -> 6.74 (+2.7%)
AWQ 4-bit          -> 6.84 (+4.3%)
GPTQ 4-bit         -> 6.90 (+5.2%)
GGUF Q3_K_M        -> 7.45 (+13.6%)

AWQ calibration recipe

The source material is very blunt here, and it is a useful correction: AWQ quality is mostly a calibration-data problem. If calibration data does not resemble production, you can end up with a result worse than a simpler GGUF quantization.

  • For coding models, calibrate with The Stack or repository code close to your target language mix.
  • For chat models, use real conversation samples and actual system prompts rather than random web text.
  • Include longer samples so the calibration sees realistic context behavior.
  • Do not rely on generic Wikipedia or C4 if the production domain is specialized.
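The hygiene those bullets describe can be scripted. This sketch (function, thresholds, and stratification strategy are ours, not an AWQ requirement) samples a calibration set from production-like text while forcing a spread of lengths:

```python
# Calibration-set hygiene sketch: sample from production-like text, enforce a
# minimum length, and stratify across lengths so long samples are represented.
import random

def build_calibration_set(texts, n_samples=128, min_words=64, seed=0):
    rng = random.Random(seed)  # deterministic selection for reproducibility
    usable = [t for t in texts if len(t.split()) >= min_words]
    usable.sort(key=lambda t: len(t.split()))
    # Draw evenly across the length distribution, not just the short-sample bulk.
    step = max(1, len(usable) // n_samples)
    picked = usable[::step][:n_samples]
    rng.shuffle(picked)
    return picked

corpus = [f"{'word ' * n}end" for n in range(10, 500, 7)]
calib = build_calibration_set(corpus, n_samples=16)
print(len(calib), min(len(t.split()) for t in calib))
```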

Conversion recipes

# GGUF conversion (HF checkpoint -> F16 GGUF first)
python convert_hf_to_gguf.py /path/to/model \
  --outfile model-f16.gguf \
  --outtype f16

# GGUF quantize the converted file
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# GPTQ quantization with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,   # per-group scales: smaller groups track outliers better
    desc_act=False,   # skip activation-order reordering for faster inference
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,         # HF model path or hub id
    quantize_config,
)
model.quantize(calibration_data)  # tokenized examples matching the target workload
model.save_quantized(quantized_model_dir)

Quality testing checklist

Key takeaways
  1. Measure perplexity on a held-out reference set.
  2. Run the task evaluation that matches the real workload: coding, reasoning, chat, or multimodal.
  3. Stress long-context coherence rather than only short benchmark prompts.
  4. Probe edge cases such as repetition, hallucination triggers, and degradation under extended generation.
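The perplexity step is simple math: exp of the mean negative log-likelihood over held-out tokens. This toy sketch uses stand-in per-token probabilities rather than real model outputs:

```python
# Perplexity is exp(mean negative log-likelihood) over held-out tokens.
# The per-token probabilities below are toy stand-ins for model outputs.
import math

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.8, 0.7, 0.9, 0.75]
shaky = [0.3, 0.2, 0.4, 0.25]
print(f"confident model: {perplexity(confident):.2f}")
print(f"shaky model:     {perplexity(shaky):.2f}")
```

Lower is better, and small absolute differences matter: the quantization table above treats a move from 6.56 to 6.90 as a meaningful 5% regression.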

Chapter 12: Frameworks Overview

Runtime choice changes performance, portability, and operational complexity more than most teams expect.

The hybrid stack strategy

The source document makes a strong practical argument that still feels right: no single runtime should own your whole workflow. Prototype in Ollama, scale in vLLM, and keep llama.cpp in your back pocket for portability, offline deployment, and edge use cases.

  • Prototype in Ollama for low-friction experimentation and prompt shaping.
  • Scale in vLLM when throughput, concurrency, and operational SLAs matter.
  • Embed with llama.cpp when you care about portability, CPU fallback, or Apple Silicon deployment.

Framework comparison

vLLM       -> Best for production         | Speed: very high | Formats: Safetensors, FP8
llama.cpp  -> Best for portability        | Speed: medium    | Formats: GGUF
Ollama     -> Best for prototyping        | Speed: medium    | Formats: GGUF
ExLlamaV2  -> Best for max GPU perf       | Speed: very high | Formats: EXL2, GPTQ
SGLang     -> Best for structured output  | Speed: high      | Formats: Safetensors

When to use each framework

Use vLLM when
Serving layer
  • You need multi-user concurrency and OpenAI-compatible serving.
  • Throughput and batching efficiency matter more than single-node simplicity.
  • You want the strongest production feature surface for modern serving.
Use llama.cpp or Ollama when
Operator-friendly
  • You need CPU inference, Apple Silicon support, or edge portability.
  • You want one-command local iteration or a simpler distribution model.
  • You are optimizing for reach and friction, not maximum datacenter efficiency.

Chapter 13: vLLM Deep Dive

vLLM is still the production default when throughput, concurrency, and serving completeness are the actual goals.

Key features

  • PagedAttention to reduce fragmentation through fixed-size memory blocks.
  • Continuous batching to keep the GPU busier across mixed request lengths.
  • Tensor parallelism and prefill/decode disaggregation for larger deployments.
  • Prefix caching, FP8 KV cache support, and mature OpenAI-compatible serving.

Production configuration

vllm serve model \
  --tensor-parallel-size 4 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --api-key your-api-key \
  --port 8000

Undocumented behaviors

The defaults are not all safe defaults

The most useful field notes here are operational: `--gpu-memory-utilization 0.90` is a safe ceiling (pushing it higher invites OOM once fragmentation and long contexts enter the picture), prefix caching carries 10-15% memory overhead and should be disabled when the hit rate is poor, and chunked prefill often trades a few milliseconds of latency for materially better throughput on mixed workloads.

Custom metrics worth watching
- vllm:gpu_cache_usage_percent
- vllm:prefix_cache_hit_rate
- vllm:running_requests
- kv_cache_fragmentation_ratio (derived)
- request_queue_wait_seconds

Tool calling with vLLM

vLLM still has the most complete production implementation for tool calling: parallel function calls, `tool_choice`, streaming support, and schema-aware serving. If tool reliability matters, this is still the easiest serious place to start.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_weather",
        "description": "Get weather by city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Weather in Amsterdam?"}],
    tools=tools,
    tool_choice="auto",
)
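When the model decides to call a tool, the response carries `tool_calls` instead of plain content. A minimal dispatch sketch follows; the `lookup_weather` stub and `FakeFunction` stand-in are ours, shaped like the OpenAI client's `tool_call.function` objects:

```python
# Minimal dispatch for the tool_calls a response may carry. Wire it up with:
#   for call in response.choices[0].message.tool_calls: dispatch(call.function)
import json

def lookup_weather(city):
    return {"city": city, "temp_c": 12}  # stub: replace with a real lookup

TOOLS = {"lookup_weather": lookup_weather}

def dispatch(function):
    args = json.loads(function.arguments)  # arguments arrive as JSON text
    result = TOOLS[function.name](**args)
    return json.dumps(result)              # send back as a "tool" role message

class FakeFunction:  # stands in for a tool_call.function object
    name = "lookup_weather"
    arguments = '{"city": "Amsterdam"}'

print(dispatch(FakeFunction()))
```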

Chapter 14: llama.cpp Deep Dive

llama.cpp remains the portability champion, but the real value is how much low-level control it gives operators.

Recent updates

  • NVFP4 and FP8 quantization support for RTX 40/50 series.
  • Faster token generation through steady kernel and backend improvements.
  • Vulkan support for AMD and Intel GPUs.
  • A surprisingly feature-rich CLI and server surface in a very small footprint.

Multi-GPU configuration

# Layer split (default, needs P2P)
./llama-server -m model.gguf --split-mode layer -ngl 999

# Row split (works without P2P, uses more memory)
./llama-server -m model.gguf --split-mode row -ngl 999

# Non-P2P build path (CMake option, set at configure time)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build

Undocumented behaviors

  • Flash attention is not automatically faster on shorter contexts, especially on consumer GPUs.
  • The `--flash-attn` flag changes KV-cache layout and can affect quantization compatibility.
  • `--split-mode layer` breaks on non-P2P topologies at longer contexts; `row` split is safer but heavier.
  • `-ngl 999` loads until OOM, not literally all layers, which can silently leave tail layers on CPU.

CPU offloading

CPU offloading remains one of llama.cpp's most practical advantages. It lets you do ugly but useful things, like running a 70B Q4 model across a 24GB GPU plus a big RAM pool when raw speed is secondary to simply getting the model into service.

./llama-server -m model.gguf --n-gpu-layers 35 -c 4096
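Picking `--n-gpu-layers` by trial and error gets old quickly. A back-of-envelope estimator, assuming layers are roughly equal in size and reserving fixed headroom for KV cache, activations, and CUDA overhead (the function and its defaults are illustrative, not a llama.cpp API):

```python
def estimate_gpu_layers(vram_gb: float, model_file_gb: float, n_layers: int,
                        reserve_gb: float = 2.0) -> int:
    """Rough --n-gpu-layers estimate.

    Assumes layers are approximately uniform in size and keeps
    reserve_gb free for KV cache, activations, and CUDA overhead.
    """
    per_layer_gb = model_file_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# Example: a ~40 GB Q4 70B model (80 layers) on a 24 GB card.
print(estimate_gpu_layers(24, 40, 80))  # -> 44
```

Treat the result as a starting point, then walk the value up or down while watching actual VRAM usage: embedding and output layers are not the same size as transformer blocks, so the uniform-layer assumption is only approximately true.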

Chapter 15: Ollama Deep Dive

Ollama is still the fastest way to get a model into a developer's hands, but the simple surface hides useful tuning hooks.

Key features

  • One-command model management and low-friction local experimentation.
  • Built-in REST API with familiar chat and generation surfaces.
  • Cross-platform support and automatic quantization paths.
  • A very good fit for local development and single-user product prototyping.

Custom Modelfile

FROM llama3.1:70b

PARAMETER num_ctx 8192
PARAMETER num_gpu 999
PARAMETER num_thread 8
PARAMETER num_batch 512

The useful operator insight here is that Ollama's defaults are not always production-friendly. Context limits, GPU loading, thread counts, and batch size all deserve explicit control once you stop treating it like a toy shell and start using it as a real local runtime.

Environment and API usage

OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=4
OLLAMA_FLASH_ATTENTION=1
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_MODELS=/path/to/models
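The environment variables above configure the server process; per-request tuning goes through the REST payload, where the `options` object mirrors the Modelfile parameter names. A small sketch that builds and sends an `/api/chat` request with the standard library (the helper names are ours, only the endpoint and payload shape come from Ollama):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build an Ollama /api/chat payload; options override Modelfile defaults."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,  # one JSON response instead of a token stream
    }

def send(payload: dict, host: str = "http://localhost:11434") -> bytes:
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Setting `num_ctx` per request matters because Ollama will otherwise silently truncate long prompts to the model's default context window.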

Chapter 16: Other Frameworks

The secondary runtimes matter because they solve very specific distribution or workflow problems better than the defaults.

llamafile, LocalAI, and LM Studio

  • llamafile is about zero-dependency distribution: ship one executable model artifact and run it almost anywhere.
  • LocalAI is the broadest compatibility play if you need multimodal support, multiple backends, and LocalAGI-style autonomous agents.
  • LM Studio is still the GUI-first path for researchers, writers, and teams that want a desktop-oriented local model experience.

ExLlamaV2 and SGLang

ExLlamaV2
Maximum GPU throughput
  • Best when your world is quantized NVIDIA inference and raw speed matters most.
  • EXL2 mixed precision remains the sharpest path for many GPU-only setups.
SGLang
Structured generation
  • Worth watching for constrained decoding, speculative decoding, and agent pipeline orchestration.
  • A strong alternative when structured output and orchestration matter as much as raw serving throughput.

Chapter 17: Speculative Decoding

One of the biggest real speed levers in 2026, but only when the workload is predictable enough to accept the draft model often.

How it works

  • A smaller draft model predicts several tokens ahead.
  • The larger target model verifies them in one forward pass.
  • Accepted tokens are kept, rejected tokens are regenerated by the target model.
  • When acceptance is high, you get a real speedup with no output drift.

Implementation approaches

  • Draft-model speculative sampling.
  • N-gram decoding from the prompt itself.
  • Self-speculative decoding using the model's own early layers.
  • MTP-style native multi-token prediction in models that support it.

Acceptance rates and hidden costs

1B drafting for 70B
- Code:   72% acceptance, ~1.8x speedup
- Chat:   45% acceptance, ~1.3x speedup
- Creative: 28% acceptance, ~1.1x speedup

7B drafting for 70B
- Code:   85% acceptance, ~2.4x speedup
- Chat:   62% acceptance, ~1.7x speedup
- Creative: 41% acceptance, ~1.4x speedup

The hidden cost is extra memory

A 7B draft model plus a 70B target means you are now loading 77B worth of model state. On smaller hardware, the more aggressive quantization that makes this fit can eat into the theoretical speedup. The source document is explicit here: speculative decoding is not free, and it is not universally worth turning on.
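The acceptance-rate table can be sanity-checked with the standard back-of-envelope model for speculative decoding: with per-token acceptance probability `a` and `k` drafted tokens per cycle, the expected number of committed tokens per target verification is (1 - a^(k+1)) / (1 - a), at a cost of k draft passes plus one target pass. A sketch (the function and parameter values are ours, and it ignores verification overhead and memory pressure, so it is optimistic relative to the measured numbers above):

```python
def expected_speedup(accept: float, k: int, draft_cost: float = 0.1) -> float:
    """Idealized speculative decoding speedup.

    accept:     per-token probability the target accepts a draft token
    k:          tokens drafted per cycle
    draft_cost: draft forward-pass cost relative to one target pass

    Expected committed tokens per cycle: (1 - a^(k+1)) / (1 - a)
    Cycle cost in target-pass units:     k * draft_cost + 1
    """
    tokens = (1 - accept ** (k + 1)) / (1 - accept)
    return tokens / (k * draft_cost + 1)

# 1B drafting for 70B on code: a ~ 0.72, draft ~5% of target cost, k = 4.
print(round(expected_speedup(0.72, 4, 0.05), 1))  # -> 2.4
```

The idealized 2.4x versus the measured ~1.8x for the same workload is exactly the gap the chapter warns about: sampling overhead, batching effects, and the extra memory footprint all eat into the theoretical number.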

Chapter 18: Continuous Batching

Continuous batching is one of vLLM's biggest strengths, but it can also quietly hurt latency if you do not understand the workload mix.

Static vs continuous batching

Static batching waits for the slowest request in the batch. Continuous batching opportunistically admits new work as requests complete, which is why it usually wins on GPU utilization and throughput.
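The difference is easy to see in a toy model, assuming one decode step per token and idealized instant admission of new work (both functions are illustrative, not any framework's scheduler):

```python
def static_batch_finish_times(lengths: list[int]) -> list[int]:
    """Static batching: every request waits for the longest one in the batch."""
    return [max(lengths)] * len(lengths)

def continuous_batch_finish_times(lengths: list[int]) -> list[int]:
    """Continuous batching (idealized): each request finishes after its own
    number of decode steps, since freed batch slots are refilled immediately."""
    return lengths[:]

lengths = [32, 64, 512]  # tokens to generate per request
print(static_batch_finish_times(lengths))      # [512, 512, 512]
print(continuous_batch_finish_times(lengths))  # [32, 64, 512]
```

The two short requests finish 8-16x sooner under continuous batching, which is where the utilization win comes from; the trap described next is what the idealized model leaves out.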

The continuous batching trap

Head-of-line blocking still exists

The source chapter calls out the real trap: one very long request can still hurt short interactive requests by dominating the decode loop. In practical terms, that means better average throughput can still come with a worse p99 user experience.

Long request joins the batch
-> short requests inherit its decode cadence
-> TTFT rises from ~50ms to ~200ms
-> average throughput looks better
-> interactive UX gets worse

Prefill-decode disaggregation

The cleanest fix described in the manuscript is prefill-decode disaggregation: run the compute-bound prefill pass and the bandwidth-bound decode phase on separate workers so long prompts stop stalling interactive decodes. The pipeline-parallel layout below is only a coarse stand-in for that split within a single deployment; full disaggregation in vLLM is configured separately through its KV-transfer and connector options.

vllm serve model \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2

Chapter 19: Memory Optimization Techniques

Most failed local deployments are memory mistakes disguised as model mistakes.

Why the rules of thumb are wrong

70B model at Q4_K_M
Model weights:      ~40 GB
KV cache (4K):      ~2 GB
Activation buffers: ~4 GB
CUDA overhead:      ~3 GB
Fragmentation:      ~10 GB
Total:              ~59 GB

Real production minimum:
70B Q4 -> ~80 GB
70B Q4 with 32K context -> ~120 GB+
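The breakdown above can be reproduced with a small estimator. The layer and head counts below are illustrative for a Llama-style 70B (80 layers, 8 KV heads under GQA, head dim 128), and the 20% fragmentation allowance is a rule of thumb, not a measured constant:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem * batch / 2**30

def serving_footprint_gib(weights: float, kv: float, activations: float = 4.0,
                          cuda_overhead: float = 3.0, frag_factor: float = 0.2) -> float:
    """Total footprint: raw components plus a fragmentation allowance."""
    base = weights + kv + activations + cuda_overhead
    return base * (1 + frag_factor)

kv = kv_cache_gib(80, 8, 128, 4096)       # ~1.25 GiB at fp16 for 4K context
total = serving_footprint_gib(40.0, kv)   # ~58 GiB, in line with the table above
```

Two things fall out of the formula: KV cache grows linearly with both context length and batch size, which is why the 32K-context figure balloons, and halving `bytes_per_elem` via FP8 KV-cache quantization halves that term directly.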

Memory optimization strategies

  • Quantize aggressively, but only as far as the workload allows.
  • Enable FP8 or INT8 KV-cache quantization when long context is the dominant cost.
  • Reduce max context length if the request distribution does not justify the headline window.
  • Use CPU offloading when capacity matters more than peak throughput.
  • Trim batch size when activation buffers are pushing the workload over the edge.

The fragmentation problem

Fragmentation is the silent killer because it makes a machine look sufficient in aggregate while still failing under live request patterns. Variable sequence lengths, mixed workloads, and allocator churn all turn clean capacity plans into Swiss cheese.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Chapter 20: Multi-GPU Deployment

Multi-GPU serving is where topology, NCCL behavior, and imbalance finally stop being theory.

Tensor, pipeline, and expert parallelism

  • Tensor parallelism splits each layer across devices and is the standard default for large-model serving.
  • Pipeline parallelism splits groups of layers across devices and pairs well with prefill/decode separation.
  • Expert parallelism matters specifically for large MoE systems where the expert footprint dominates the topology problem.

# Tensor parallelism
vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --dtype auto \
  --max-model-len 65536

# Pipeline parallelism
vllm serve model \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2

Multi-GPU pitfalls

  • Check PCIe topology first. Non-P2P setups will hurt or break some split modes.
  • Tune NCCL timeouts to fail faster instead of hanging forever during slow collectives.
  • Watch for per-GPU imbalance: one card at 95% while others idle means your distribution is wrong.
  • Do not assume cloud or mixed-host environments keep driver and CUDA behavior perfectly aligned.

nvidia-smi topo -m            # inspect PCIe/NVLink topology before choosing a split mode
export NCCL_TIMEOUT=600       # bound collective waits so failures surface instead of hanging
export NCCL_P2P_DISABLE=1     # workaround for broken P2P paths, at a bandwidth cost
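The topology check can also be automated. A sketch that extracts the GPU-to-GPU link types from `nvidia-smi topo -m` output (the parser is ours and handles only the GPU block; real output also carries CPU affinity and NIC columns, which it ignores):

```python
def parse_p2p_links(topo_output: str) -> dict:
    """Extract GPU-to-GPU link types from `nvidia-smi topo -m` output.

    Link types like NV# (NVLink) and PIX generally support P2P;
    PHB, NODE, and SYS mean traffic crosses the host bridge or
    the CPU interconnect, which is where layer splits get slow.
    """
    lines = [l for l in topo_output.strip().splitlines() if l.strip()]
    headers = lines[0].split()
    gpu_cols = [h for h in headers if h.startswith("GPU")]
    links = {}
    for line in lines[1:]:
        cells = line.split()
        if not cells or not cells[0].startswith("GPU"):
            continue
        row = cells[0]
        for col, link in zip(gpu_cols, cells[1:1 + len(gpu_cols)]):
            if col != row:  # skip the diagonal "X" entries
                links[(row, col)] = link
    return links

sample = """\
     GPU0  GPU1
GPU0  X    PHB
GPU1  PHB  X
"""
print(parse_p2p_links(sample))  # {('GPU0', 'GPU1'): 'PHB', ('GPU1', 'GPU0'): 'PHB'}
```

A pre-flight check like this can refuse to start a `--split-mode layer` deployment when any GPU pair reports PHB or SYS, turning a mysterious mid-context hang into an explicit startup error.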

CPU + GPU offloading

Even in a multi-GPU chapter, the manuscript keeps one pragmatic reminder: CPU + GPU offloading is still often the only realistic way to run larger quantized models on consumer boxes. It is slower, but it is operationally useful when you need access more than elegance.

Production deployment, hardware, advanced topics, and appendices

Unlock the remaining chapters of the guide.