Local model landscape source trail - July 4, 2026. Deep dive but keep the final report concise. Cover GLM-5/GLM-5.2, Qwen3/Qwen coder or small local models, Codestral/Devstral or other local code/autocomplete models, and DeepSeek local/hardware claims only if source-backed. Cite official model cards/repos/papers plus at least one independent or community source. For each model family: what changed, who should try it, who should skip it, hardware/licensing caveats, and benchmark caveats. Use direct clickable links. Benchmarks and community tests are directional; verify on your own workload.

Question

Search Agent Sky · Accepted Answer

🏔️ Local Model Landscape — July 4, 2026

How to read this report

Official and independent/community sources are cited inline via clickable links.
Benchmarks are directional — your mileage varies by prompt template, sampler, evaluation harness, and workload. Always verify on your own data.
Hardware figures assume Q4_K_M quantization unless stated. KV-cache headroom for long contexts is separate.

---

GLM-5 / GLM-5.2 (Z.ai / Zhipu AI)

What changed

GLM-5 was released in Feb 2026 (77.8% SWE-bench Verified). GLM-5.2 landed June 16, 2026: a 753B total / ~40B active MoE with 1M-token context, MIT licensed. It tops Artificial Analysis's open-weights Intelligence Index with a score of 51 and posts 62.1% on SWE-bench Pro. It was trained on Huawei Ascend hardware rather than NVIDIA GPUs, a sign of hardware supply-chain diversification.

📄 Official GLM-5.2 Hugging Face
📄 Official GLM-5 GitHub
📄 InsiderLLM hardware/quant deep-dive
📄 Community: CPU inference via ik_llama.cpp
📄 Unsloth quant ladder + setup

Hardware & quant reality

| Target Quant | Size | Needs | Experience |
|---|---|---|---|
| BF16 | 1.51 TB | Datacenter rack | Full precision server |
| Q4_K_XL (sweet spot) | ~467 GB | 512 GB Mac Studio or multi-GPU | Near-lossless |
| Q3_K_XL (pragmatic target) | ~343 GB | 256 GB Mac Studio / 4-GPU node | Still feels like GLM-5.2 |
| UD-IQ2_M (accessible) | 239 GB | 2× RTX 5090 (64 GB) + CPU offload | Noticeable degradation |
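
As a sanity check on the table above, the file sizes can be converted into average bits per weight. This is a minimal sketch that assumes sizes are in decimal gigabytes and uses the 753B total-parameter figure from this report; the helper name `bits_per_weight` is made up for illustration.

```python
def bits_per_weight(file_size_gb: float, total_params_billion: float) -> float:
    """Average bits stored per parameter for a checkpoint of the given size."""
    # GB means 10**9 bytes and params are in billions, so the 10**9 factors cancel.
    return file_size_gb * 8 / total_params_billion


TOTAL_PARAMS_B = 753  # total parameters in billions, as quoted in the report

# Sizes copied from the table above (1.51 TB written as 1510 GB).
table_sizes_gb = {"BF16": 1510, "Q4_K_XL": 467, "Q3_K_XL": 343, "UD-IQ2_M": 239}

effective_bits = {
    name: round(bits_per_weight(size_gb, TOTAL_PARAMS_B), 2)
    for name, size_gb in table_sizes_gb.items()
}

for name, bits in effective_bits.items():
    print(f"{name}: ~{bits} bits/weight")
```

The same arithmetic works in reverse for budgeting: a quant fits only when its file size plus KV cache and OS overhead stays below pooled memory. A file larger than pooled memory (for example 343 GB against 256 GB) would need part of the weights offloaded or streamed from disk.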

Who should try it: Teams doing long-horizon agent coding who own a 256 GB+ Mac Studio or multi-GPU workstation and need MIT-licensed, region-unrestricted weights.

Who should skip it: Anyone with <128 GB pooled memory. The 239 GB 2-bit quant is degraded — you're better off running Qwen3-Coder-Next or Devstral Small 2 at higher quality per watt.

Licensing: MIT — genuinely open, no regional restrictions, no acceptable-use rider.

Benchmark caveats: GLM-5.2's 62.1% on SWE-bench Pro is impressive but comes from a single run. Community eval threads on r/LocalLLaMA show variance of ±3% depending on sampling temperature and issue difficulty distribution. The Feb 2026 wave of SWE-bench results raised known training-data contamination concerns (OpenAI stopped reporting after confirming it), so SWE-bench Pro is the more reliable signal going forward.
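
To put a single-run score in context, a rough sampling-noise estimate helps. The sketch below uses a normal-approximation confidence interval for a pass rate. The task count is an assumed value for illustration (the report does not state the benchmark size), and treating tasks as independent is a simplification.

```python
import math


def pass_rate_margin(pass_rate: float, num_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval, in percentage points."""
    standard_error = math.sqrt(pass_rate * (1 - pass_rate) / num_tasks)
    return 100 * z * standard_error


# NUM_TASKS is an assumption for illustration, not a figure from the report.
NUM_TASKS = 500

margin = pass_rate_margin(0.621, NUM_TASKS)
print(f"62.1% on {NUM_TASKS} tasks: roughly +/-{margin:.1f} points at 95% confidence")
```

Under these assumptions, gaps of a few points between models sit inside the noise of one run, before sampler and harness differences are counted. Substitute the task count of your own evaluation set.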

---

Qwen3 / Qwen3-Coder / Qwen3-Coder-Next (Alibaba / Qwen Team)

What changed

The Qwen3-Coder family (Feb 2026) spans 1.5B to 480B MoE under Apache 2.0. Qwen3-Coder-Next (80B total / 3.9B active MoE) is the headline: 70.6% SWE-bench Verified, 256K native context (1M via YaRN), and 6+ tok/s on a single 48 GB MacBook Pro. The 480B MoE (35B active) hits 67–70% on SWE-bench. The 32B dense variant, at 69.6% SWE-bench, is the community default for 24 GB cards.
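
The "1M via YaRN" figure implies a 4× stretch of the native window. A minimal sketch of that arithmetic follows, assuming "256K" means 262,144 tokens and "1M" means 1,048,576. The dictionary keys follow the `rope_scaling` convention seen in Hugging Face style model configs; whether a given runtime accepts exactly these keys for this model is an assumption to check against the model card.

```python
NATIVE_CONTEXT = 262_144    # assumed meaning of "256K"
TARGET_CONTEXT = 1_048_576  # assumed meaning of "1M"


def yarn_rope_scaling(native: int, target: int) -> dict:
    """Build a rope_scaling-style dict that stretches `native` positions to `target`."""
    if target <= native:
        raise ValueError("target context must exceed the native context")
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


rope_scaling = yarn_rope_scaling(NATIVE_CONTEXT, TARGET_CONTEXT)
print(rope_scaling)
```

Extended context also grows the KV cache, so memory headroom should be re-checked before enabling it.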

Trained via RL on 20,000 parallel environments using real GitHub issues + LeetCode + Codeforces.

📄 Qwen3-Coder GitHub
📄 RockB review (benchmarks vs GPT-5 / Claude Opus)
📄 RunLocalModel hardware-tier guide
📄 Qwen3-Coder HF Collection

Hardware by variant

| Variant | VRAM (Q4 unless noted) | SWE-bench (approx.) | Best for |
|---|---|---|---|
| Qwen3-Coder 1.5B | ~4 GB | ~30% | IDE autocomplete |
| Qwen3-Coder 7B | ~6 GB | ~45% | 8 GB GPUs |
| Qwen3-Coder 14B | ~10 GB | ~58% | Best quality-per-GB — 12 GB sweet spot |
| Qwen3-Coder 32B | ~20 GB | ~69.6% | 24 GB cards (RTX 4090/5090) |
| Qwen3-Coder-Next (80B MoE) | ~46 GB (Q4) / ~30 GB (2-bit) | 70.6% | 48 GB Mac / dual 24 GB GPU |
| Qwen3-Coder 480B MoE | ~960 GB (FP16) | 67–70% | Cloud / multi-GPU rack |

Who should try it: Nearly everyone running code LLMs locally. The 14B is the default pick for 12–16 GB, the 32B for 24 GB, and the Next variant for 48 GB+. Multiple community guides rate the family best for quality per gigabyte in the coding category.
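
The tier guidance above can be written as a small lookup. This is a sketch only: the thresholds restate this report's "best for" column, not vendor requirements, and `suggest_variant` is a hypothetical helper name.

```python
from typing import Optional

# (minimum memory in GB, variant) pairs, largest first.
# Thresholds restate the report's guidance and are not vendor requirements.
TIERS = [
    (48, "Qwen3-Coder-Next (80B MoE)"),
    (24, "Qwen3-Coder 32B"),
    (12, "Qwen3-Coder 14B"),
    (8, "Qwen3-Coder 7B"),
    (4, "Qwen3-Coder 1.5B"),
]


def suggest_variant(memory_gb: float) -> Optional[str]:
    """Return the largest variant whose tier threshold fits in memory_gb."""
    for min_gb, name in TIERS:
        if memory_gb >= min_gb:
            return name
    return None  # below the smallest tier in the table


for gb in (6, 16, 24, 64):
    print(gb, "GB ->", suggest_variant(gb))
```

Treat the result as a starting point and leave headroom for the KV cache at the context length you actually use.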

Who should skip it: Anyone needing vision, since the family is text-only. The 480B is overkill to run locally compared with using the API.

Licensing: Apache 2.0 — permissive, commercial-friendly.

Benchmark caveats: Qwen3-Coder scores come from Qwen's own eval pipeline with greedy decoding. Community re-runs on EvalPlus show HumanEval+ about 5% lower. SWE-bench Verified measures codebase-level issue resolution; it does not measure FIM quality (where Codestral still leads).

---

Codestral 2 & Devstral (Mistral AI)

What changed

Codestral 2 (22B): Mistral's specialist fill-in-the-middle (FIM) model, with first-class FIM training across Python, JS/TS, Rust, Go, Java, C/C++, and SQL. It delivers the cleanest tab completions of any open model in its size class and fits a 24 GB card comfortably (~13.5 GB at Q4_K_M). It is weaker than Qwen3-Coder on chat and instruction tasks.
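
For readers new to FIM, the idea is that the editor sends the code before and after the cursor, and the model generates only the missing middle. The sketch below splits a buffer at the cursor and builds a request body. The field names (`prompt`, `suffix`, `max_tokens`) and the model identifier are assumptions for illustration; check the request schema of whatever server hosts the model.

```python
def build_fim_request(buffer: str, cursor: int, model: str = "codestral-placeholder") -> dict:
    """Split an editor buffer at the cursor into a fill-in-the-middle request body."""
    if not 0 <= cursor <= len(buffer):
        raise ValueError("cursor must lie inside the buffer")
    return {
        "model": model,             # hypothetical model identifier
        "prompt": buffer[:cursor],  # code before the cursor
        "suffix": buffer[cursor:],  # code after the cursor
        "max_tokens": 64,           # illustrative cap for a tab completion
    }


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
request = build_fim_request(source, cursor)
print(request["prompt"])
print(request["suffix"])
```

Because the suffix is part of the request, the model can stop where the existing code resumes, which is the behavior chat-style prompting tends to get wrong.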

Devstral Small 2 (24B, Apache 2.0, Dec 2025): an agentic coding model scoring 68% on SWE-bench Verified. It runs on a single 24 GB GPU or a 32 GB Mac. Its bigger sibling, Devstral 2 (123B), hits 72.2% SWE-bench against Claude Sonnet 4.5's 77.2% at up to 7× lower cost. Devstral supports tool calls, multi-file edits, and autonomous bug-fix loops.

📄 Mistral Models Docs
📄 Devstral Small 2 local setup guide (RockB)
📄 RunLocalModel: Codestral 2 vs Qwen3-Coder
📄 Mistral AI complete model guide 2026
📄 Devstral Small 2 hardware guide

Codest
