Ultimate Guide to Local LLMs in 2026
A comprehensive technical reference for practitioners: production war stories, undocumented behaviors, and rare optimization techniques for deploying local LLMs in the real world.
Chapter 1: Introduction
Why local deployment moved from hobbyist territory into the default toolbelt for serious AI builders.
The opening thesis is strong: if you build anything serious with AI in 2026, local deployment is no longer a side quest. The interesting shift is not just model quality. It is that capacity, context, multimodality, and runtime ergonomics now make local inference operationally relevant.
The state of local LLMs in 2026
Running large language models locally has moved from niche experimentation into mainstream engineering practice. Teams are running frontier-class mixture-of-experts models on desks, standing up agent swarms on demand, and treating multimodality as a native assumption rather than an exotic add-on.
For practitioners, that changes the skill floor. Understanding local deployment is not optional if you care about cost control, privacy, latency, reliability, or just having a sane fallback when an external model API becomes the bottleneck.
What makes this guide different
This guide is framed as a practitioner reference rather than a docs summary. The source material is explicitly opinionated about the gaps that matter in real deployments: undocumented behavior, production failures, quantization recipes, memory optimization, and multi-GPU pitfalls.
- War stories instead of happy-path examples
- Measured quantization choices instead of vague quality claims
- Memory and throughput tuning beyond official documentation
- Monitoring configurations and failure semantics rather than toy demos
- Most local-LLM bottlenecks are memory bandwidth and fragmentation, not raw FLOPs
- The GPU is often waiting on weight loads instead of compute
- Hybrid attention and MoE architectures change the hardware economics
- Practical inference is constrained by memory movement, not benchmark fantasy
The three trends defining 2026
- MoE dominance. Frontier models now routinely use Mixture-of-Experts. You still load a very large model footprint, but per-token compute only touches a subset of experts, which makes enormous models practical on smaller hardware.
- Hybrid attention everywhere. Pure self-attention is being displaced by hybrids that drastically reduce KV-cache pressure. What once needed double-digit gigabytes for long context now often fits in a fraction of that budget.
- Hardware and runtime co-evolution. NVFP4, FP8, unified memory systems, and faster inference runtimes are narrowing the gap between theoretical and delivered performance.
How to use this guide
The document itself is structured like a reference manual, and that is the right reading model here too. The back half of the guide covers production deployment, hardware decisions, advanced customization, and the appendices.
1. Part I is the prerequisite for everything else in the book.
2. If you are new to local LLMs, start here before touching framework or hardware decisions.
3. If you already deploy models, use this section to recalibrate your mental model around memory and throughput.
Chapter 2: Core Concepts and Internals
What is actually happening when you run a model locally, and why that matters for every downstream decision.
Inference is next-token prediction, step by step
Local inference is just the forward-pass side of a pre-trained model. The important operational point is that generation is autoregressive: one token is predicted, appended, and then used as context for the next step. That is why latency compounds and why caching matters so much.
Step 1: Tokenization
Input text is split into subword tokens
Step 2: Embedding
Token IDs are mapped into learned vectors
Step 3: Forward pass
Embeddings move through attention + MLP blocks
Step 4: Prediction
The model outputs logits across the full vocabulary
Step 5: Sampling
One token is selected via greedy, temperature, top-k, or top-p
Step 6: Append and repeat
The selected token is appended and the cycle continues.

Without the KV cache, every decode step would recompute attention against the full prefix from scratch. The cache is not a convenience feature. It is the thing that makes autoregressive decoding economically viable.
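The six steps above can be sketched as a minimal greedy decode loop. Everything model-shaped here is a stand-in: `model_forward` is a fake deterministic function over a toy vocabulary, not a real transformer. The point is the autoregressive control flow, not the model.

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def model_forward(token_ids):
    # Stand-in for steps 2-4 (embedding, forward pass, logits).
    # A real model would return logits from a transformer; here we
    # derive fake-but-deterministic logits from the last token id.
    rng = np.random.default_rng(token_ids[-1])
    return rng.normal(size=len(VOCAB))

def greedy_decode(prompt_ids, max_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_forward(ids)       # forward pass -> logits
        next_id = int(np.argmax(logits))  # step 5: greedy sampling
        ids.append(next_id)               # step 6: append and repeat
        if VOCAB[next_id] == "<eos>":
            break
    return ids

print(greedy_decode([1, 2]))
```

Note that each step feeds on the output of the previous one, which is exactly why per-token latency compounds and why the KV cache (reusing work from earlier steps) matters.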
Tokens are not words
A lot of bad capacity planning comes from treating tokens as if they were human words. They are not. Token counts vary with tokenizer, language, and domain. Long identifiers, multilingual text, and punctuation-heavy inputs all distort naive assumptions.
- "hello" may be 1 token, but complex words can be 5-8 tokens.
- A 100-word paragraph often lands around 130-150 tokens.
- Code often tokenizes differently from natural language and can be either more compact or much noisier depending on the syntax.
- Longer context windows always increase KV-cache memory, decode latency, and memory bandwidth pressure.
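For rough capacity planning, the 1.3-1.5 tokens-per-word range above can be wrapped in a helper. This is only a sanity-check heuristic for English prose; real counts require the target model's actual tokenizer.

```python
def estimate_tokens(text):
    """Rough English-prose token estimate from the ~1.3-1.5x
    tokens-per-word rule of thumb. For planning sanity checks only:
    the real count depends on tokenizer, language, and domain."""
    words = len(text.split())
    return int(words * 1.3), int(words * 1.5)

low, high = estimate_tokens("the quick brown fox " * 25)  # 100 words
print(low, high)  # prints 130 150
```

Code, multilingual text, and identifier-heavy input can land well outside this band, which is exactly the failure mode described above.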
What is inside a model file
When you download a model, you are not just pulling weights. You are also taking on tokenizer state, architectural configuration, and chat-template behavior. If any of those drift from what the runtime expects, quality collapses fast.
- Neural network weights: the learned tensors that dominate storage and VRAM.
- Tokenizer: vocabulary, merge rules, and special tokens such as BOS/EOS/PAD.
- Configuration: hidden size, layer count, attention heads, context window, normalization constants, and RoPE parameters.
- Chat template: the message framing the model expects. Wrong template often means obviously degraded or nonsensical output.
- Architecture assumptions: especially important for MoE and hybrid attention models where the runtime must understand the structure correctly.
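To make the chat-template point concrete, here is what ChatML-style message framing looks like. The template below is a generic illustration, not any specific model's template; in practice you should always use the template shipped with the model (runtimes typically read it from the tokenizer configuration) rather than hand-rolling one.

```python
# Hypothetical ChatML-style template, shown only to illustrate framing.
# Real models ship their own template; using the wrong one is a common
# cause of obviously degraded output.
def apply_chatml(messages):
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "\n".join(parts)

prompt = apply_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

If the runtime wraps messages differently from what the model saw in training, the special tokens no longer line up with the learned structure, which is why quality collapses rather than degrades gracefully.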
Chapter 3: The Transformer Architecture
The core skeleton behind almost every modern LLM, plus the efficiency layers that now matter more than the original design.
Attention, MLPs, and positional structure
The original transformer gave us the baseline: self-attention to decide which prior tokens matter, MLP blocks to add expressiveness, residual connections for depth, and positional encoding so token order still means something.
# Self-attention (single head), runnable NumPy sketch
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq, seq): the quadratic term
    weights = softmax(scores)
    return weights @ V

The expensive truth is the quadratic scaling: the score matrix grows with the square of the sequence length, which is why long context becomes painful long before you hit the model's advertised maximum window.
Modern efficiency optimizations
- Grouped Query Attention shares KV heads across query heads.
- Multi-Query Attention pushes the idea further by letting all queries share one KV head.
- The trade is simple: much cheaper KV-cache usage for some quality risk on harder long-context tasks.
- Multi-head Latent Attention compresses the KV representation before reuse.
- DeepSeek Sparse Attention prunes full attention down to a top-k set of relevant tokens.
- These innovations target the real bottleneck: memory movement and long-context overhead.
Why hybrid attention matters
One of the strongest claims in the source document is that the era of pure self-attention is ending. That matches what we are seeing across multiple model families: full global attention is being used sparingly, while sliding windows, sparse selectors, and alternating global layers pick up the slack.
When a model says it supports long context, the question is no longer just “how many tokens?”. The more useful question is “what architecture made that context affordable, and what tradeoffs did it impose on latency, quality, and memory?”.
Chapter 4: Memory, VRAM, and the KV Cache
The chapter that usually determines whether a local deployment plan is realistic or fantasy.
What must fit in VRAM
During inference, VRAM is shared by model weights, the KV cache, activation buffers, and runtime overhead. In practice, this means your simple weight-size estimate is almost never enough.
- Model weights are usually the largest static footprint.
- KV cache grows with sequence length and can exceed the weight footprint at long contexts.
- Activation buffers are temporary but real, especially with larger batches.
- Runtime overhead and fragmentation consume a non-trivial slice of the card before useful work starts.
Practical total VRAM estimate

Total VRAM = model weights
           + KV cache
           + activation buffers
           + runtime overhead

For planning, assume:
- FP16 model: ~2.5x to 3x raw weight size
- 4-bit quantized: ~1.5x to 2x raw weight size

The KV cache reality
The formula for KV-cache size is straightforward. Production reality is not. Alignment, padding, prefix caching, and fragmentation mean real memory usage drifts significantly above the ideal math.
KV_cache = 2 x n_layers x n_kv_heads x d_head x seq_len x batch_size x bytes_per_param
Reality in production:
KV_cache_real = formula x 1.3
+ fragmentation waste
+ prefix-cache overhead

The source document cites vLLM research showing that only roughly 20-38% of allocated KV-cache memory is actually used in production. The rest disappears into fragmentation and over-allocation. That means “enough VRAM on paper” is not the same thing as good throughput in a live system.
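The ideal formula and the ~1.3x production correction translate directly into code. The example numbers assume a Llama-3-70B-like geometry (80 layers, 8 GQA KV heads, head dim 128) at 32K context and FP16 cache precision.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   batch_size=1, bytes_per_param=2):
    """Ideal KV-cache size; the leading 2 is one K plus one V tensor."""
    return (2 * n_layers * n_kv_heads * d_head
            * seq_len * batch_size * bytes_per_param)

def kv_cache_real(ideal_bytes, overhead_factor=1.3):
    """Production estimate: pad ~30% for alignment and fragmentation,
    per the rule of thumb above. Prefix-cache overhead comes on top."""
    return ideal_bytes * overhead_factor

# Llama-3-70B-like: 80 layers, 8 GQA KV heads, d_head 128, 32K context
ideal = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=32768)
print(f"{ideal / 2**30:.1f} GiB ideal, "
      f"{kv_cache_real(ideal) / 2**30:.1f} GiB realistic")
# prints: 10.0 GiB ideal, 13.0 GiB realistic
```

Note how much the GQA head count matters: the same model with 64 full KV heads instead of 8 would need 8x this budget, which is the efficiency story of the next chapter.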
Memory bandwidth is the bottleneck
This is the headline insight from the foundations section: most decode time is spent moving weights, not doing math. That is why so many local-LLM discussions go wrong. They focus on TFLOPs instead of asking what the memory subsystem can sustain under repeated decode.
Example from the source material
Qwen3-style decode on H100:
- Compute needed: ~7.7 GFLOPs per token
- Peak compute suggests microseconds
- Actual latency lands in milliseconds
Why?
- Weight loads dominate
- 6 GB of weights may need to move per token
- Practical decode becomes a bandwidth problem, not a pure compute problem

This is also why MoE models are such an important structural change. They let you load a large-capacity system while only activating a much smaller compute footprint per token.
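The arithmetic is worth doing yourself. Assuming ~3.35 TB/s of HBM bandwidth (the H100 SXM spec) and the ~6 GB-per-token weight traffic quoted above, the bandwidth-bound latency floor is in milliseconds, not microseconds:

```python
# Back-of-envelope decode latency: bytes moved per token / memory bandwidth.
# Assumptions: ~3.35 TB/s HBM3 bandwidth (H100 SXM spec) and the ~6 GB of
# weight traffic per token quoted above; real kernels add overhead on top.
bytes_per_token = 6e9    # weights touched per decode step
bandwidth = 3.35e12      # bytes per second

latency_s = bytes_per_token / bandwidth
print(f"bandwidth-bound floor: {latency_s * 1e3:.2f} ms/token "
      f"(~{1 / latency_s:.0f} tok/s ceiling)")
```

The same arithmetic explains the MoE payoff: if only a 40B-active slice of the weights moves per token instead of the full footprint, the bandwidth-bound floor drops proportionally.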
RoPE and KV quantization
One of the more valuable low-level notes in the source document is the interaction between RoPE and KV-cache quantization. Post-RoPE quantization often suffers because channel magnitudes are already mixed. Pre-RoPE quantization preserves cleaner channel structure and can maintain quality at much more aggressive compression.
Post-RoPE quantization
- Keys already have mixed magnitudes
- Outlier channels skew per-token ranges
- 3-bit KV cache often degrades noticeably
Pre-RoPE quantization
- Quantize keys before rotation
- Channels remain more stable
- Compression can stay aggressive with smaller quality loss

1. Local inference planning is mostly memory planning.
2. KV cache is where long context becomes expensive.
3. Fragmentation and allocation behavior matter as much as raw formulas.
4. Bandwidth dominates decode, which is why architecture and runtime choice matter so much.
Chapter 5: The 2026 Model Landscape
The open-model market is now large enough that architecture and deployment characteristics matter more than leaderboard hype.
The source document anchors this chapter in one useful fact pattern: by April 2026, the open model ecosystem is no longer a tidy list of a few serious contenders. It is a market with millions of public models, a huge downstream derivative layer, and a clear center of gravity around Chinese labs and their derivatives.
Frontier open models
GLM-5
- 744B MoE with roughly 40B active parameters per token.
- Uses MLA + DSA to make very long contexts practical.
- Strong on AIME, Terminal-Bench, and OSWorld style evaluations.
- Tradeoff: fewer layers improve latency, but sequential reasoning drops on harder chains.
Qwen3.5
- 397B-A17B flagship plus smaller dense and MoE variants.
- Gated DeltaNet + Gated Attention hybrid dramatically lowers KV-cache cost.
- Qwen3.5-35B-A3B is the practitioner sweet spot for local deployment.
- Tradeoff: DeltaNet can underperform on cross-document reference retrieval workloads.
Kimi K2.5
- 1T total parameters, 32B active per token.
- Native multimodality via early-fusion training, not late add-on adapters.
- Excellent coding and tool-heavy performance.
- Tradeoff: vision tokens create heavy KV-cache pressure and memory planning gets expensive fast.
Step 3.5 Flash
- 196B MoE with 11B active parameters and MTP-3 used during both training and inference.
- Can exceed 100 tokens per second on Hopper-class hardware in code-heavy workloads.
- Acceptance rates stay very high on structured code generation.
- Tradeoff: the headline speedups compress substantially on creative or less predictable tasks.
Consumer GPU picks
The most important practical recommendation in the source material is not to chase the biggest model you can barely load. The better question is which model delivers the best quality per active token, memory footprint, and runtime behavior on the hardware you already own.
- Qwen3.5-35B-A3B is the clearest local-deployment pick when you need serious quality on a 16GB class card.
- MiniMax M2.5 is a useful reminder that training recipe and data quality still beat architecture novelty on many coding tasks.
- Llama 4 Scout remains relevant when you care about broad compatibility and aggressive context windows.
- Kimi K2.5 is compelling if coding and multimodality justify the extra memory and deployment complexity.
Decision matrix
Use case matrix
Coding / SWE-bench -> Kimi K2.5 | Alt: Step 3.5 Flash
Math / reasoning -> GLM-5 | Alt: Kimi K2.5
Agentic tasks -> GLM-5 | Alt: Qwen3.5
Long context -> Qwen3.5 | Alt: Llama 4 Scout
Raw speed -> Step 3.5 Flash | Alt: MiniMax M2.5
Consumer GPU deployment -> Qwen3.5-35B-A3B | Alt: Llama 4 Scout
Multimodal workloads -> Kimi K2.5 | Alt: Qwen3.5

Chapter 6: Architectural Innovations in 2026
The important architectural shift is not one new trick. It is that multiple labs are converging on the same efficiency ideas.
Hybrid attention is the new standard
The big architectural signal is convergence. Qwen3.5 mixes Gated DeltaNet with Gated Attention, Gemma 3 and MiMo-V2-Flash use sliding-window plus global layers, and other families land on their own variant of the same idea: keep some global signal, but stop paying full-attention cost on every layer for every token.
- Qwen3.5 / Qwen3-Next: Gated DeltaNet + Gated Attention in a 3:1 ratio.
- Gemma 3 / MiMo-V2-Flash: sliding-window layers with periodic full global attention.
- Ling 2.5 and related systems: Lightning Attention combined with MLA or other compressed-cache strategies.
This is the single biggest reason large-context models now fit on smaller machines. The point is not elegance. The point is that KV-cache cost drops by 3-6x, which changes what is economically deployable on consumer hardware.
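To see where the 3-6x comes from, here is a toy cache-size model for a hybrid stack. The configuration (48 layers, one global layer in six, a 1,024-token sliding window, 128K context) is illustrative rather than any specific model's geometry.

```python
def kv_tokens_cached(n_layers, seq_len, global_every=6, window=1024):
    """Tokens held in KV cache for a hybrid stack: every `global_every`-th
    layer is full global attention, the rest are sliding-window layers
    that only keep the last `window` tokens. Illustrative model only."""
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    return n_global * seq_len + n_local * min(window, seq_len)

full = 48 * 131072                   # all-global baseline, 48 layers, 128K
hybrid = kv_tokens_cached(48, 131072)
print(f"KV cache reduction: {full / hybrid:.1f}x")
```

With these assumptions the hybrid stack caches roughly 5.8x fewer tokens than all-global attention, which is squarely inside the 3-6x range quoted above; the real ratio depends on the layer mix and window size each family chooses.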
Multi-token prediction
Multi-token prediction has moved from a training curiosity into a real inference lever. Models such as Step 3.5 Flash, GLM-5, and newer MiniMax variants predict several tokens at once, which gets you a speculative-decoding-like speedup without paying for a separate draft model.
Works well:
- Code generation with strong local regularity.
- Technical writing with repetitive patterns.
- Long continuations where the next few tokens are predictable.

Works poorly:
- Creative writing at high temperature.
- Open-ended ideation where token paths branch more aggressively.
- Complex reasoning chains with unstable local predictions.
MLA, FlashAttention-4, and multimodality by default
- MLA is no longer a DeepSeek-only curiosity. It is spreading because more context per GB of VRAM is a real deployment advantage.
- FlashAttention-4 matters specifically on Blackwell and adjacent NVIDIA hardware, where asynchronous MMA behavior changes the optimization game.
- Native multimodality is now the default assumption for serious families, which means image tokens and their KV-cache cost have to be included in planning.
Chapter 7: Model Selection and Decision Framework
The right model choice is a requirements exercise, not a leaderboard exercise.
The selection framework
- Define the primary task: coding, reasoning, chat, multimodal, or agentic orchestration.
- Set the latency envelope: interactive user-facing chat is a different world from batch generation.
- Decide the real context requirement rather than the aspirational maximum.
- Map those requirements to the hardware and budget you actually have, not the benchmark setup you wish you had.
Task-to-model mapping
Code generation -> Kimi K2.5 | Alt: Step 3.5 Flash | Why: SWE-bench lead
Math / reasoning -> GLM-5 | Alt: Kimi K2.5 | Why: top AIME score
Long context -> Qwen3.5 | Alt: Llama 4 Scout | Why: 262K to 1M context
Agent workflows -> GLM-5 | Alt: Qwen3.5 | Why: strongest agentic benchmarks
Raw speed -> Step 3.5 Flash | Alt: MiniMax M2.5 | Why: best high-throughput decode
Multimodal -> Kimi K2.5 | Alt: Qwen3.5 | Why: early-fusion strength

Hardware compatibility matrix
8 GB VRAM -> Phi-4-Mini, Gemma 3 (1B-4B)
16 GB VRAM -> Phi-4, Gemma 3 (12B), Llama 4 Scout at Q4
24 GB VRAM -> Llama 4 Maverick at Q4, Qwen2.5-Coder-32B
32 GB VRAM -> 70B models at Q4, Llama 4 Maverick at Q8
64 GB unified memory -> Qwen3.5-122B at Q4, Mixtral 8x22B
128 GB+ -> DeepSeek-R1 at Q4, Kimi K2.5 at Q2-Q4

The main discipline here is to plan for real overhead. If the hardware matrix says something technically fits, that still does not mean it fits comfortably enough to survive long context, mixed workloads, and runtime fragmentation.
License considerations
- Apache 2.0 and MIT are the cleanest path for unrestricted commercial use.
- Llama-style community licenses usually require attribution and place limits around improving competing models.
- Custom licenses deserve actual legal reading, especially when you plan to productize hosted inference or derivative fine-tunes.
Chapter 8: Quantization Fundamentals
Quantization is still the most important practical lever for local deployment, because it changes both capacity and throughput.
What quantization is
Quantization converts high-precision weights into lower-bit representations so the model occupies less memory and moves fewer bytes during inference. That shrink can be the difference between a model that never loads and a model that becomes practical on a workstation.
- Smaller file size and lower VRAM footprint.
- Lower memory-bandwidth demand and often better tokens per second.
- Some approximation error, which has to be measured rather than assumed.
- Different methods degrade in different ways, so there is no single universal best format.
Numeric formats explained
- FP32 remains the full-precision reference and is rarely practical for inference.
- FP16 was the old inference default before mainstream quantization.
- BF16 keeps FP32 exponent range and is increasingly useful on supported hardware.
- INT8 is the safe lower-precision baseline when calibration is available.
- INT4 remains the most important consumer deployment range.
- FP8 and NVFP4 matter because they increasingly map to hardware-native acceleration.
Quantization methods
- Post-training quantization is fast and convenient, but can lose quality if the approximation is crude.
- Quantization-aware training gives better quality but is far more expensive operationally.
- GPTQ reconstructs layer outputs to preserve activation behavior.
- AWQ protects salient channels and often holds quality better at 4-bit.
- GGUF packages multiple quantization families into the llama.cpp ecosystem with strong portability.
Quantization artifacts
The source document makes an important point that many benchmark summaries blur away: the same model quantized with GPTQ, AWQ, or GGUF can fail in different ways. One may lose numerical reasoning. Another may keep reasoning but hallucinate more in creative output. You have to test with the workload you plan to ship.
Chapter 9: Quantization Formats in Detail
Each format family has a different optimization target: portability, raw GPU speed, fidelity, or hardware-native acceleration.
GGUF and universal compatibility
GGUF remains the most practical format for broad compatibility because it works across CPU, GPU, and Apple Silicon while keeping the operational story simple. That is why it remains the default answer when the deployment target is not locked to a single NVIDIA-heavy environment.
Q4_0 -> maximum compression, fastest, roughest quality
Q4_K_M -> best general-purpose choice
Q4_K_S -> slightly smaller / faster for lighter models
Q5_K_M -> near-FP16 feel for quality-sensitive workloads
Q6_K -> high quality with diminishing returns
Q8_0 -> near-lossless, a safe alternative to FP16

GPU-optimized formats
- GPTQ and AWQ target NVIDIA-style GPU inference and can outperform generic portability formats.
- AWQ often holds quality better at 4-bit when calibration matches production data.
- EXL2 is the speed monster when you are comfortable living inside the ExLlamaV2 ecosystem.
- FP8 is now a real mainstream option on Hopper, Ada, and Blackwell class hardware.
- NVFP4 gives hardware-accelerated 4-bit inference without the same calibration overhead.
- These are especially attractive when both weights and KV cache can exploit the native low-bit path.
Format selection
Need CPU / Mac / universal portability? -> GGUF Q4_K_M
Need highest NVIDIA GPU throughput? -> EXL2 or GPTQ
Care about coding / creative fidelity? -> AWQ
Running Hopper / Ada / Blackwell? -> FP8 weights + FP8 KV cache
Need extreme compression on 24GB? -> GGUF Q4_K_M or EXL2 3-bit
Want hardware-accelerated 4-bit? -> NVFP4 via llama.cpp

The source also gives a simple rule-of-thumb ranking worth remembering: for 4-bit quality, AWQ tends to preserve the most, GGUF Q4_K_M is a strong middle ground, and GPTQ can trail both depending on workload. For raw GPU speed, EXL2 leads, then GPTQ, then AWQ, then GGUF.
Chapter 10: KV Cache Quantization
If long context is your real constraint, compressing the KV cache can matter more than squeezing the weights further.
Why KV-cache quantization matters
Llama-3-70B at 32K context
Q4 model weights -> ~40 GB
FP16 KV cache -> ~32 GB
Total -> ~72 GB
With FP8 / INT8 KV cache -> ~56 GB total
With 3-bit KV cache -> ~52 GB total

That is why KV-cache quantization is a practical breakthrough. Once context gets large, the cache is no longer a side cost. It is the dominant cost you have to bring under control.
KV-cache methods
- FP8 KV cache gives a clean 2x memory reduction with minimal quality loss and growing runtime support.
- INT8 KV cache remains a useful middle ground when calibration is available.
- KVQuant-style pre-RoPE quantization changes the quality curve by quantizing Keys before rotation rather than after it.
- TurboQuant pushes toward 3-4 bits with learned codebooks and near-zero quality loss in the better cases.
vllm serve model \
--kv-cache-dtype fp8 \
--calculate-kv-scales

Deployment findings
- Per-channel quantization for Keys is essential because per-token scaling falls apart on outlier channels.
- Pre-RoPE quantization improves perplexity materially over post-RoPE approaches.
- Removing a tiny set of outliers can make 3-bit KV cache usable with almost no measurable quality loss.
- Values behave differently: per-token quantization tends to work better than per-channel because it avoids error accumulation.
Chapter 11: Quantization Recipes and Quality Metrics
Good quantization work is less about one magic format and more about disciplined testing, calibration, and conversion workflow.
Measured quality loss
Llama-3-8B on Wikitext-2
FP16 baseline -> 6.56 perplexity
BitsAndBytes 4-bit -> 6.67 (+1.7%)
GGUF Q4_K_M -> 6.74 (+2.7%)
AWQ 4-bit -> 6.84 (+4.3%)
GPTQ 4-bit -> 6.90 (+5.2%)
GGUF Q3_K_M -> 7.45 (+13.6%)

AWQ calibration recipe
The source material is very blunt here, and it is a useful correction: AWQ quality is mostly a calibration-data problem. If calibration data does not resemble production, you can end up with a result worse than a simpler GGUF quantization.
- For coding models, calibrate with The Stack or repository code close to your target language mix.
- For chat models, use real conversation samples and actual system prompts rather than random web text.
- Include longer samples so the calibration sees realistic context behavior.
- Do not rely on generic Wikipedia or C4 if the production domain is specialized.
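As a sketch, calibration-set preparation from production logs might look like the following. The JSONL layout and field names are hypothetical; the point is the selection discipline (real domain data, long samples, real system prompts), not the exact format.

```python
import json
import random

def build_calibration_set(log_path, n_samples=128, min_chars=2000):
    """Sample calibration texts from production chat logs.

    Assumes a hypothetical JSONL layout where each record has a
    'system' string and a 'messages' list with 'content' fields.
    Prefers long transcripts so the quantizer sees realistic context.
    """
    samples = []
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            text = rec["system"] + "\n" + "\n".join(
                m["content"] for m in rec["messages"])
            if len(text) >= min_chars:   # keep long, realistic samples
                samples.append(text)
    random.seed(0)                       # reproducible selection
    random.shuffle(samples)
    return samples[:n_samples]
```

The resulting list of strings can then be fed to the quantizer's calibration step in place of generic web-text defaults.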
Conversion recipes
# GGUF conversion (convert to F16 first, then quantize)
python convert_hf_to_gguf.py /path/to/model \
--outfile model-f16.gguf \
--outtype f16
# GGUF quantize an existing file
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# GPTQ quantization with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(
model_id,
quantize_config,
)
model.quantize(calibration_data)
model.save_quantized(quantized_model_dir)

Quality testing checklist
1. Measure perplexity on a held-out reference set.
2. Run the task evaluation that matches the real workload: coding, reasoning, chat, or multimodal.
3. Stress long-context coherence rather than only short benchmark prompts.
4. Probe edge cases such as repetition, hallucination triggers, and degradation under extended generation.
Chapter 12: Frameworks Overview
Runtime choice changes performance, portability, and operational complexity more than most teams expect.
The hybrid stack strategy
The source document makes a strong practical argument that still feels right: no single runtime should own your whole workflow. Prototype in Ollama, scale in vLLM, and keep llama.cpp in your back pocket for portability, offline deployment, and edge use cases.
- Prototype in Ollama for low-friction experimentation and prompt shaping.
- Scale in vLLM when throughput, concurrency, and operational SLAs matter.
- Embed with llama.cpp when you care about portability, CPU fallback, or Apple Silicon deployment.
Framework comparison
vLLM -> Best for production | Speed: very high | Formats: Safetensors, FP8
llama.cpp -> Best for portability | Speed: medium | Formats: GGUF
Ollama -> Best for prototyping | Speed: medium | Formats: GGUF
ExLlamaV2 -> Best for max GPU perf | Speed: very high | Formats: EXL2, GPTQ
SGLang -> Best for structured output | Speed: high | Formats: Safetensors

When to use each framework
Use vLLM when:
- You need multi-user concurrency and OpenAI-compatible serving.
- Throughput and batching efficiency matter more than single-node simplicity.
- You want the strongest production feature surface for modern serving.

Use llama.cpp or Ollama when:
- You need CPU inference, Apple Silicon support, or edge portability.
- You want one-command local iteration or a simpler distribution model.
- You are optimizing for reach and friction, not maximum datacenter efficiency.
Chapter 13: vLLM Deep Dive
vLLM is still the production default when throughput, concurrency, and serving completeness are the actual goals.
Key features
- PagedAttention to reduce fragmentation through fixed-size memory blocks.
- Continuous batching to keep the GPU busier across mixed request lengths.
- Tensor parallelism and prefill/decode disaggregation for larger deployments.
- Prefix caching, FP8 KV cache support, and mature OpenAI-compatible serving.
Production configuration
vllm serve model \
--tensor-parallel-size 4 \
--dtype auto \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--max-num-seqs 256 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--api-key your-api-key \
--port 8000

Undocumented behaviors
The most useful field notes here are operational: `--gpu-memory-utilization 0.90` is a safe default, prefix caching burns 10-15% memory overhead and should be disabled when hit-rate is poor, and chunked prefill often trades a few milliseconds of latency for materially better throughput on mixed workloads.
Custom metrics worth watching
- vllm:gpu_cache_usage_perc
- vllm:gpu_prefix_cache_hit_rate
- vllm:num_requests_running
- kv_cache_fragmentation_ratio (derived)
- request_queue_wait_seconds

Tool calling with vLLM
vLLM still has the most complete production implementation for tool calling: parallel function calls, `tool_choice`, streaming support, and schema-aware serving. If tool reliability matters, this is still the easiest serious place to start.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
tools = [{
"type": "function",
"function": {
"name": "lookup_weather",
"description": "Get weather by city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}]
response = client.chat.completions.create(
model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
messages=[{"role": "user", "content": "Weather in Amsterdam?"}],
tools=tools,
tool_choice="auto",
)Chapter 14: llama.cpp Deep Dive
llama.cpp remains the portability champion, but the real value is how much low-level control it gives operators.
Recent updates
- NVFP4 and FP8 quantization support for RTX 40/50 series.
- Faster token generation through steady kernel and backend improvements.
- Vulkan support for AMD and Intel GPUs.
- A surprisingly feature-rich CLI and server surface in a very small footprint.
Multi-GPU configuration
# Layer split (default, needs P2P)
./llama-server -m model.gguf --split-mode layer -ngl 999
# Row split (works without P2P, uses more memory)
./llama-server -m model.gguf --split-mode row -ngl 999
# Non-P2P build path (GGML_CUDA_NO_PEER_COPY is a CMake option)
cmake -B build -DGGML_CUDA_NO_PEER_COPY=ON && cmake --build build

Undocumented behaviors
- Flash attention is not automatically faster on shorter contexts, especially on consumer GPUs.
- The `--flash-attn` flag changes KV-cache layout and can affect quantization compatibility.
- `--split-mode layer` breaks on non-P2P topologies at longer contexts; `row` split is safer but heavier.
- `-ngl 999` loads until OOM, not literally all layers, which can silently leave tail layers on CPU.
CPU offloading
CPU offloading remains one of llama.cpp's most practical advantages. It lets you do ugly but useful things, like running a 70B Q4 model across a 24GB GPU plus a big RAM pool when raw speed is secondary to simply getting the model into service.
./llama-server -m model.gguf --n-gpu-layers 35 -c 4096

Chapter 15: Ollama Deep Dive
Ollama is still the fastest way to get a model into a developer's hands, but the simple surface hides useful tuning hooks.
Key features
- One-command model management and low-friction local experimentation.
- Built-in REST API with familiar chat and generation surfaces.
- Cross-platform support and automatic quantization paths.
- A very good fit for local development and single-user product prototyping.
Custom Modelfile
FROM llama3.1:70b
PARAMETER num_ctx 8192
PARAMETER num_gpu 999
PARAMETER num_thread 8
PARAMETER num_batch 512

The useful operator insight here is that Ollama's defaults are not always production-friendly. Context limits, GPU loading, thread counts, and batch size all deserve explicit control once you stop treating it like a toy shell and start using it as a real local runtime.
Environment and API usage
OLLAMA_MAX_LOADED_MODELS=1
OLLAMA_NUM_PARALLEL=4
OLLAMA_FLASH_ATTENTION=1
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_MODELS=/path/to/models

Chapter 16: Other Frameworks
The secondary runtimes matter because they solve very specific distribution or workflow problems better than the defaults.
llamafile, LocalAI, and LM Studio
- llamafile is about zero-dependency distribution: ship one executable model artifact and run it almost anywhere.
- LocalAI is the broadest compatibility play if you need multimodal support, multiple backends, and LocalAGI-style autonomous agents.
- LM Studio is still the GUI-first path for researchers, writers, and teams that want a desktop-oriented local model experience.
ExLlamaV2 and SGLang
ExLlamaV2:
- Best when your world is quantized NVIDIA inference and raw speed matters most.
- EXL2 mixed precision remains the sharpest path for many GPU-only setups.

SGLang:
- Worth watching for constrained decoding, speculative decoding, and agent pipeline orchestration.
- A strong alternative when structured output and orchestration matter as much as raw serving throughput.
Chapter 17: Speculative Decoding
One of the biggest real speed levers in 2026, but only when the workload is predictable enough for the target model to accept the draft model's proposals often.
How it works
- A smaller draft model predicts several tokens ahead.
- The larger target model verifies them in one forward pass.
- Accepted tokens are kept, rejected tokens are regenerated by the target model.
- When acceptance is high, you get a real speedup with no output drift.
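The accept/reject loop above can be sketched with toy callables standing in for the real draft and target networks. Greedy token-level agreement as the acceptance rule is a simplification of the real rejection-sampling scheme, so treat this as a shape, not an implementation:

```python
def speculative_step(target, draft, ctx, k):
    """One speculative decoding step.

    draft(ctx, k) returns k cheap proposal tokens; target(ctx, proposal)
    returns k+1 target predictions (one per proposal position, plus one
    extra) in a single notional forward pass.
    """
    proposal = draft(ctx, k)
    verified = target(ctx, proposal)          # one target pass checks all drafts
    out = []
    for i, tok in enumerate(proposal):
        if verified[i] == tok:                # target agrees: keep the draft token
            out.append(tok)
        else:
            out.append(verified[i])           # target overrides, and we stop here
            return out
    out.append(verified[k])                   # bonus token when all drafts accepted
    return out
```

When every draft token is accepted, one target pass yields k+1 tokens; on the first disagreement, the pass still yields the corrected token, so output never drifts from what the target alone would produce.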
Implementation approaches
- Draft-model speculative sampling.
- N-gram decoding from the prompt itself.
- Self-speculative decoding using the model's own early layers.
- MTP-style native multi-token prediction in models that support it.
Acceptance rates and hidden costs
1B drafting for 70B
- Code: 72% acceptance, ~1.8x speedup
- Chat: 45% acceptance, ~1.3x speedup
- Creative: 28% acceptance, ~1.1x speedup
7B drafting for 70B
- Code: 85% acceptance, ~2.4x speedup
- Chat: 62% acceptance, ~1.7x speedup
- Creative: 41% acceptance, ~1.4x speedup
A 7B draft model plus a 70B target means you are now loading 77B worth of model state. On smaller hardware, the more aggressive quantization that makes this fit can eat into the theoretical speedup. The source document is explicit here: speculative decoding is not free, and it is not universally worth turning on.
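The acceptance numbers above translate into speedup through a simple expected-value model: with per-token acceptance probability p and draft length k, one target pass yields (1 - p^(k+1)) / (1 - p) tokens on average instead of 1. A sketch, where the draft-to-target cost ratio is an assumed knob rather than a measurement:

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens per target forward pass.

    Geometric-series result assuming i.i.d. per-token acceptance
    probability p over k drafted tokens; the +1 in the exponent reflects
    the target's bonus/correction token.
    """
    if p >= 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)

def estimated_speedup(p: float, k: int, draft_cost: float) -> float:
    """Speedup vs plain decoding; draft_cost is the draft/target time ratio."""
    return expected_tokens(p, k) / (1.0 + k * draft_cost)
```

Plugging in, say, 85% acceptance with 4 draft tokens gives roughly 3.7 expected tokens per target pass before draft overhead; real speedups land lower because the draft is not free and acceptance is not i.i.d. in practice.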
Chapter 18: Continuous Batching
Continuous batching is one of vLLM's biggest strengths, but it can also quietly hurt latency if you do not understand the workload mix.
Static vs continuous batching
Static batching waits for the slowest request in the batch. Continuous batching opportunistically admits new work as requests complete, which is why it usually wins on GPU utilization and throughput.
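The difference can be shown with a toy scheduler, where each request is just a count of decode steps and step counts stand in for wall-clock time (this deliberately ignores prefill cost):

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy continuous-batching decode loop.

    requests: decode steps needed per request, in arrival order.
    Returns {request index: completion step}. New work is admitted the
    moment a batch slot frees up, rather than when the whole batch drains.
    """
    queue = deque(enumerate(requests))
    active = {}                                  # index -> steps remaining
    done = {}
    step = 0
    while queue or active:
        while queue and len(active) < max_batch:  # opportunistic admission
            idx, steps = queue.popleft()
            active[idx] = steps
        step += 1
        for idx in list(active):                  # one decode step for the batch
            active[idx] -= 1
            if active[idx] == 0:
                done[idx] = step
                del active[idx]
    return done
```

A static batcher would have made the 1-step request wait for the whole first batch to finish; here it exits immediately and its slot is recycled, which is exactly where the utilization win comes from.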
The continuous batching trap
The source chapter calls out the real trap: one very long request can still hurt short interactive requests by dominating the decode loop. In practical terms, that means better average throughput can still come with a worse p99 user experience.
Long request joins the batch
-> short requests inherit its decode cadence
-> TTFT rises from ~50ms to ~200ms
-> average throughput looks better
-> interactive UX gets worse
Prefill-decode disaggregation
The cleanest fix described in the manuscript is prefill-decode disaggregation: split the heavy first pass from the bandwidth-bound decode phase so the system can handle mixed workloads more gracefully.
vllm serve model \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
Chapter 19: Memory Optimization Techniques
Most failed local deployments are memory mistakes disguised as model mistakes.
Why the rules of thumb are wrong
70B model at Q4_K_M
Model weights: ~40 GB
KV cache (4K): ~2 GB
Activation buffers: ~4 GB
CUDA overhead: ~3 GB
Fragmentation: ~10 GB
Total: ~59 GB
Real production minimum:
70B Q4 -> ~80 GB
70B Q4 with 32K context -> ~120 GB+
Memory optimization strategies
- Quantize aggressively, but only as far as the workload allows.
- Enable FP8 or INT8 KV-cache quantization when long context is the dominant cost.
- Reduce max context length if the request distribution does not justify the headline window.
- Use CPU offloading when capacity matters more than peak throughput.
- Trim batch size when activation buffers are pushing the workload over the edge.
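Most of the budget above is arithmetic you can do before buying anything; the KV cache in particular scales linearly with context length and concurrent sequences. A sketch assuming a Llama-70B-like shape (80 layers, 8 grouped-query KV heads, head dim 128; treat these shapes and the overhead defaults as illustrative, not authoritative):

```python
def kv_cache_gb(seq_len: int, n_seqs: int = 1, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * n_seqs * bytes_per_elem
    return total / 1e9

def serving_budget_gb(weights_gb: float, seq_len: int, n_seqs: int,
                      activations_gb: float = 4.0, cuda_gb: float = 3.0,
                      frag_factor: float = 1.2) -> float:
    """Total VRAM estimate, with a ~20% cushion for fragmentation."""
    base = weights_gb + kv_cache_gb(seq_len, n_seqs) + activations_gb + cuda_gb
    return base * frag_factor
```

Two of the strategies above fall straight out of the formula: FP8/INT8 KV quantization halves or quarters `bytes_per_elem`, and trimming max context shrinks `seq_len`, both without touching the weights at all.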
The fragmentation problem
Fragmentation is the silent killer because it makes a machine look sufficient in aggregate while still failing under live request patterns. Variable sequence lengths, mixed workloads, and allocator churn all turn clean capacity plans into Swiss cheese.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Chapter 20: Multi-GPU Deployment
Multi-GPU serving is where topology, NCCL behavior, and imbalance finally stop being theory.
Tensor, pipeline, and expert parallelism
- Tensor parallelism splits each layer across devices and is the standard default for large-model serving.
- Pipeline parallelism splits groups of layers across devices and pairs well with prefill/decode separation.
- Expert parallelism matters specifically for large MoE systems where the expert footprint dominates the topology problem.
# Tensor parallelism
vllm serve deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--dtype auto \
--max-model-len 65536
# Pipeline parallelism
vllm serve model \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2
Multi-GPU pitfalls
- Check PCIe topology first. Non-P2P setups will hurt or break some split modes.
- Tune NCCL timeouts to fail faster instead of hanging forever during slow collectives.
- Watch for per-GPU imbalance: one card at 95% while others idle means your distribution is wrong.
- Do not assume cloud or mixed-host environments keep driver and CUDA behavior perfectly aligned.
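One imbalance is visible in arithmetic before it ever shows up in nvidia-smi: under tensor parallelism, weights and KV cache divide across the group, but per-GPU runtime overhead does not. A sketch (the overhead figure is an assumption):

```python
def per_gpu_gb(weights_gb: float, kv_gb: float, tp: int,
               overhead_gb: float = 3.0) -> float:
    """Per-GPU memory footprint under tensor parallelism.

    Weights and KV cache shard across the TP group; CUDA context, NCCL
    buffers, and allocator overhead are paid again on every card.
    """
    return (weights_gb + kv_gb) / tp + overhead_gb

# Doubling tensor parallelism does not halve per-GPU memory, because the
# fixed overhead is replicated on each device.
```

This is why going from TP=4 to TP=8 frees less memory per card than the naive halving suggests, and why small cards in large TP groups spend a disproportionate share of their VRAM on overhead.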
nvidia-smi topo -m
export NCCL_TIMEOUT=600
export NCCL_P2P_DISABLE=1
CPU + GPU offloading
Even in a multi-GPU chapter, the manuscript keeps one pragmatic reminder: CPU + GPU offloading is still often the only realistic way to run larger quantized models on consumer boxes. It is slower, but it is operationally useful when you need access more than elegance.