
7 min read

Founder’s Open-Model Stack: GLM-4.7, Qwen3-VL, DeepSeek-V3.2, Kimi-K2, FLUX.2

January 22, 2026

If you’re building an AI product as a solo founder or a small team, you don’t need one “best” model.

You need a stack!

A small menu of models that each do one job extremely well, plus a router so your app automatically picks the right one based on the request.

And, needless to say, all within your cost and latency constraints.

Below is a pragmatic, product-minded walkthrough based on this stack:

  • OCR / doc understanding: Qwen3-VL, GLM-4.6V
  • Cheap coding: Qwen3-Coder
  • Coding (stronger agents): GLM-4.7, MiniMax-M2.1
  • Reasoning: DeepSeek-V3.2-Speciale
  • Writing: Kimi-K2, Kimi-K2-Thinking
  • General-purpose: DeepSeek-V3.2
  • Image generation: FLUX.2-dev, Z-Image-Turbo
  • Image editing: Qwen-Image-Edit-2509

Let’s see how each one helps you ship performant products.


One rule before you ship: “open source” vs “open weights”

In 2026, many “open” models ship as open weights with licenses that vary from permissive (Apache/MIT) to restricted.

For example, FLUX.2-dev is distributed under a non-commercial license on Hugging Face, with additional commercial/self-host terms published by Black Forest Labs. Meanwhile, Z-Image-Turbo and Qwen-Image-Edit-2509 are Apache-2.0.

You have to treat “license” as a product requirement, not an afterthought.

1) The architecture pattern: router + tiers

A very effective pattern for startups:

  1. Default to a cheap/fast model (for 70–90% of requests).
  2. Escalate only when confidence is low or the task is hard (agentic coding, long-horizon reasoning, heavy document understanding).
  3. Keep an “always-works” general model for everything else.

You’ll implement this with:

  • A task classifier (simple heuristics + a tiny LLM if you want)
  • A model map
  • A standardized OpenAI-compatible API interface (so swapping providers/models is painless)

vLLM makes this easy with an OpenAI-compatible server.

2) Getting models running: local dev vs production

Local dev (fastest feedback loop)

Use Ollama to pull and test models locally.

It’s the quickest way to validate prompts, UX, and routing logic before you invest in serving infra.

If you have enough RAM, definitely give GLM-4.7-Flash a try!

For a deeper local-setup walkthrough, see “GLM-4.7-Flash on 24GB GPU (llama.cpp, vLLM, SGLang, Transformers)” on agentnativedev.medium.com.

Production serving (OpenAI-compatible)

Use vLLM for high-throughput inference and keep your app code provider-agnostic.

First, install the dependencies and start the vLLM OpenAI-compatible server:

```bash
pip install vllm openai

# Example: serve a coding model
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --dtype auto --api-key token-abc123
```

Then, call it with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Write a FastAPI endpoint that validates a JWT and returns the user id."}],
)

print(resp.choices[0].message.content)
```

3) Use-case walkthrough: what to use, when, and how

A) OCR / Document understanding: Qwen3-VL and GLM-4.6V

If your “OCR” use case is really document understanding (invoices, tables, receipts, screenshots), VLMs are often better than classic OCR because they can:

  • extract text
  • interpret layout
  • and return structured fields in one shot

Qwen3-VL explicitly targets “general OCR and key information extraction” workflows and is Apache-2.0.

GLM-4.6V is an open multimodal GLM line, positioned as a versatile multimodal reasoning model family.

Prompt pattern that actually works in production

  • Always ask for JSON
  • Add schema + validation hints
  • Include a “do not hallucinate” instruction
  • Run a second-pass verifier (often your reasoning model) on the extracted JSON

Example of OpenAI-style image input, which works with OpenAI-compatible multimodal servers that support image parts:

```python
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text":
            "Extract all visible text and key fields from this receipt. "
            "Return STRICT JSON: {merchant, date, total, currency, line_items:[{name, qty, price}], raw_text}. "
            "If a field is missing, use null. Do not guess."
        },
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
    ],
}]
```

OCR is usually a pipeline, not a single model call.

So consider:

  1. VLM extraction
  2. Schema validation
  3. Repair pass (reasoning model)
  4. Human fallback when confidence is low.
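Steps 2 and 4 above can be sketched in a few lines of validation logic; the required fields here are illustrative, matching the receipt schema from the prompt example:

```python
# Minimal sketch of schema validation: flag extracted records that are
# missing required fields (or have them set to null) for a repair pass
# or a human queue. Field names are illustrative.
REQUIRED = {"merchant", "date", "total", "currency"}

def needs_repair(record: dict) -> bool:
    missing = REQUIRED - record.keys()
    nulls = {k for k in REQUIRED & record.keys() if record[k] is None}
    return bool(missing or nulls)
```

Records that fail this check go to your reasoning model for a repair pass, or to a human when confidence stays low.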

B) Cheap coding with Qwen3-Coder

For “autocomplete, small refactors, boilerplate, docs, tests,” you want a model that’s:

  • fast
  • consistent with code formatting
  • and cheap to run.

Qwen3-Coder is the dedicated code branch of Qwen3. You can use it as your default code model, and escalate only when the job becomes agentic (multi-file, multi-step, tool use).

Use a patch-based workflow:

You are editing a Git repo. Return ONLY a unified diff. Do not include explanations. If you need context, ask for the file path.

This reduces “creative” output and makes automated application safer.
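One cheap guard before auto-applying a patch: reject anything that doesn’t look like a unified diff. A sketch (the structural checks are illustrative, not exhaustive):

```python
def looks_like_unified_diff(text: str) -> bool:
    # Accept only output that has diff headers and at least one hunk marker.
    lines = text.strip().splitlines()
    return (
        len(lines) >= 3
        and lines[0].startswith("--- ")
        and lines[1].startswith("+++ ")
        and any(line.startswith("@@") for line in lines)
    )
```

Output that fails the check gets retried (or escalated) instead of applied.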

C) Stronger coding (agentic): GLM-4.7 and MiniMax-M2.1

When you need:

  • multi-step coding plans
  • tool calling
  • repo-wide refactors
  • debugging with logs/tests

you want models designed for agentic behavior.

GLM-4.7 is positioned as a “powerful Agent model” with strong coding and reasoning characteristics in its release materials/model card. MiniMax-M2.1 is released as open-source weights with a focus on robustness in coding, tool use, and long-horizon planning.

Practical pattern

Use Qwen3-Coder for quick diffs and escalate to GLM-4.7 / M2.1 when:

  • you need >1 file changes
  • you need tool loops (run tests, read logs)
  • you need planning + execution

D) Reasoning with DeepSeek-V3.2-Speciale

Reasoning models earn their keep when:

  • the prompt is ambiguous
  • the job requires planning
  • you need structured decisions
  • or you want a “verifier” pass on outputs from cheaper models

DeepSeek’s V3.2 line (including the Speciale variant) is distributed with an MIT license on Hugging Face.

Where this shines

  • “repair” passes (fix broken JSON, broken diffs)
  • evaluation/scoring (“is this answer grounded?”)
  • agent planning (“what tools should I call next?”)
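A repair pass can be a tiny wrapper: try to parse the cheap model’s output, and only escalate on failure. A sketch, where `call_reasoning_model` is a hypothetical helper around your OpenAI-compatible client:

```python
import json

def parse_or_repair(raw: str, call_reasoning_model=None) -> dict:
    """Parse JSON output; on failure, ask the reasoning model to fix it."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        if call_reasoning_model is None:
            raise
        fixed = call_reasoning_model(
            f"Fix this so it is valid JSON. Return ONLY the JSON:\n{raw}"
        )
        return json.loads(fixed)
```

This keeps the expensive model off the hot path: you pay for reasoning only when the cheap tier actually fails.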

E) Writing: Kimi-K2 and Kimi-K2-Thinking

If you’re generating:

  • long-form docs
  • onboarding guides
  • blog posts
  • product copy with consistency

you want models that stay coherent across long contexts and can plan.

Kimi-K2-Thinking is positioned as a thinking agent trained for step-by-step reasoning + tool use, and it ships as a native INT4 quantized model with a 256k context window.

Kimi K2 is open-weight (Moonshot’s GitHub release) and has been widely covered as a major open release.

Writing workflow that’s hard to beat

  1. Kimi-K2-Thinking: outline + claims + structure
  2. Kimi-K2 (non-thinking): final prose in your style guide
  3. DeepSeek reasoning: fact-check prompts + consistency pass

F) General purpose: DeepSeek-V3.2

Your app needs a “default brain” for:

  • user chats
  • mixed requests
  • “I don’t know what I want” prompts
  • glue logic between tools.

DeepSeek-V3.2 is a strong general-purpose anchor and MIT-licensed on HF.

G) Image generation: FLUX.2-dev and Z-Image-Turbo

Use two tiers again:

  • Z-Image-Turbo (Apache-2.0): fast, cheap, high-volume generation, easy to ship commercially.
  • FLUX.2-dev: very capable generation + editing, but distributed under a non-commercial license on HF. Commercial/self-host terms are separate.

Z-Image-Turbo quick start (Diffusers)

```python
import torch
from diffusers import ZImagePipeline

# Load the Apache-2.0 Z-Image-Turbo checkpoint
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=False,
).to("cuda")

image = pipe(
    prompt="A clean SaaS landing page hero illustration, minimal, white background",
    height=1024, width=1024,
    num_inference_steps=9,  # Turbo variants need few steps
    guidance_scale=0.0,
).images[0]

image.save("zimage.png")
```

FLUX + Diffusers: Diffusers supports Flux pipelines broadly. You’ll typically use FluxPipeline or Flux2Pipeline, depending on the checkpoint family.

H) Image editing: Qwen-Image-Edit-2509

This is the model you use for:

  • “remove background”
  • “change text on poster”
  • “keep identity but change style”
  • “compose product + scene”
  • multi-image editing and consistency

```python
import os
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline

pipeline = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")

image1 = Image.open("input1.png")
image2 = Image.open("input2.png")
prompt = "Put the product from image1 on a clean studio background from image2. Keep the logo sharp."

out = pipeline(
    image=[image1, image2],
    prompt=prompt,
    generator=torch.manual_seed(0),
    true_cfg_scale=4.0,
    negative_prompt=" ",
    num_inference_steps=40,
    guidance_scale=1.0,
).images[0]

out.save("edited.png")
print("saved:", os.path.abspath("edited.png"))
```

4) A simple router you can ship this week

Start with something dead simple. You can get fancy later.

```python
TASK_TO_MODEL = {
    "ocr": "Qwen/Qwen3-VL-8B-Instruct",
    "cheap_code": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    "agentic_code": "zai-org/GLM-4.7-Flash",  # or MiniMaxAI/MiniMax-M2.1
    "reasoning": "deepseek-ai/DeepSeek-V3.2-Speciale",
    "writing": "moonshotai/Kimi-K2",
    "writing_thinking": "moonshotai/Kimi-K2-Thinking",
    "general": "deepseek-ai/DeepSeek-V3.2",
}

def route(intent: str, hard: bool = False) -> str:
    if intent == "writing" and hard:
        return TASK_TO_MODEL["writing_thinking"]
    if intent == "code":
        # Default code requests go to the cheap tier; escalate only when hard.
        return TASK_TO_MODEL["agentic_code" if hard else "cheap_code"]
    return TASK_TO_MODEL.get(intent, TASK_TO_MODEL["general"])
```

How to decide intent and hard

  • intent: heuristics (image attached → ocr, code fences → code, etc.)
  • hard: one cheap call to your general model: “Is this multi-step? Will it need tools? Answer yes/no.”
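The intent side can start as plain keyword checks. A sketch; the signal lists are illustrative, not tuned:

```python
# Cheap signals checked before any model call; extend per product.
CODE_SIGNALS = ("def ", "import ", "Traceback", "SELECT ")
WRITING_SIGNALS = ("blog post", "article", "rewrite this")

def detect_intent(text: str, has_image: bool = False) -> str:
    if has_image:
        return "ocr"
    if any(s in text for s in CODE_SIGNALS):
        return "code"
    lowered = text.lower()
    if any(s in lowered for s in WRITING_SIGNALS):
        return "writing"
    return "general"
```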

5) Operational tips that save you money

  1. Cache aggressively: prompt+inputs → response
  2. Stream everything user-facing
  3. Use JSON schemas wherever possible
  4. Add a verifier pass (reasoning model) for anything that becomes a database write, a payment action, or a code change
  5. Measure, don’t vibe: track latency, token usage, tool calls, and failure modes per route
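Tip 1 can be sketched as a deterministic cache key over model plus messages. The in-memory dict is a stand-in for Redis or similar, and `call_llm` is a hypothetical wrapper around your OpenAI-compatible client (both are assumptions):

```python
import hashlib
import json

def cache_key(model: str, messages: list) -> str:
    # Sorted-key JSON makes the hash deterministic for equal inputs.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

_cache: dict = {}

def cached_chat(model: str, messages: list, call_llm):
    key = cache_key(model, messages)
    if key not in _cache:  # only pay for a miss
        _cache[key] = call_llm(model, messages)
    return _cache[key]
```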

This stack works because it mirrors how products actually behave:

  • most requests are cheap
  • some requests need “agent brains”
  • OCR is a pipeline
  • and image workflows are distinct (gen vs edit)

Happy building!