Cited source trail
Local model landscape source trail - July 4, 2026
Original question
Local model landscape source trail - July 4, 2026. Deep dive but keep the final report concise. Cover GLM-5/GLM-5.2, Qwen3/Qwen coder or small local models, Codestral/Devstral or other local code/autocomplete models, and DeepSeek local/hardware claims only if source-backed. Cite official model cards/repos/papers plus at least one independent or community source. For each model family: what changed, who should try it, who should skip it, hardware/licensing caveats, and benchmark caveats. Use direct clickable links. Benchmarks and community tests are directional; verify on your own workload.
# Local Model Landscape: July 4, 2026
## How to read this report
- **Official sources** and **independent/community sources** are cited inline as clickable links.
- Benchmarks are **directional**: results vary by prompt template, sampler, evaluation harness, and workload. Always verify on your own data.
- Hardware figures assume Q4_K_M quantization unless stated. KV-cache headroom for long contexts is separate.
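To make the split between weight memory and KV-cache headroom concrete, here is a rough back-of-the-envelope sketch. The figure of ~4.8 bits per weight for Q4_K_M and the layer/head shape in the example are assumptions for illustration, not values taken from any model card, and the KV formula assumes standard grouped-query attention with 16-bit cache entries.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for one sequence (keys plus values)."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Assumption: ~4.8 bits per weight as a rough average for Q4_K_M.
Q4_K_M_BITS = 4.8

weights = weight_size_gb(32, Q4_K_M_BITS)    # a 32B dense model
cache = kv_cache_gb(64, 8, 128, 32_768)      # hypothetical model shape, 32K context
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB")
```

The two numbers add up, which is why a model that "fits" at a given quant can still run out of memory once the context grows.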
---
## 1. GLM-5 / GLM-5.2 (Z.ai / Zhipu AI)
### What changed
GLM-5 was released in February 2026 (77.8% SWE-bench Verified). **GLM-5.2** landed June 16, 2026: a 753B total / ~40B active MoE with **1M-token context**, MIT licensed. It tops [Artificial Analysis's open-weights Intelligence Index](https://artificialanalysis.ai/) with a score of 51 and posts **62.1% on SWE-bench Pro**. It was trained on Huawei Ascend hardware (no NVIDIA), a sign of hardware supply-chain diversification.
- [Official GLM-5.2 Hugging Face](https://huggingface.co/zai-org/GLM-5.2)
- [Official GLM-5 GitHub](https://github.com/zai-org/GLM-5)
- [InsiderLLM hardware/quant deep-dive](https://insiderllm.com/guides/run-glm-5-2-locally/)
- [Community: CPU inference via ik_llama.cpp](https://tools4all.ai/trends/glm-52-local-cpu-inference-demonstrated-via-ik-llamacpp)
- [Unsloth quant ladder + setup](https://unsloth.ai/docs/models/glm-5.2)
### Hardware & quant reality
| Target Quant | Size | Needs | Experience |
|---|---|---|---|
| BF16 | 1.51 TB | Datacenter rack | Full precision server |
| Q4_K_XL (sweet spot) | ~467 GB | 512 GB Mac Studio or multi-GPU | Near-lossless |
| **Q3_K_XL (pragmatic target)** | **~343 GB** | **256 GB Mac Studio / 4-GPU node** | **Still feels like GLM-5.2** |
| UD-IQ2_M (accessible) | 239 GB | 2× RTX 5090 (64 GB) + CPU offload | Noticeable degradation |
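As a quick sanity check against the table above, the sketch below picks the largest quant whose weights fit entirely in pooled memory (VRAM plus system RAM). The sizes are copied from the table; the `headroom_gb` reserve for KV cache and the OS is a hypothetical default, not a measured figure.

```python
from typing import Optional

# Sizes in GB, copied from the table above (largest first).
GLM_52_QUANTS = [
    ("BF16", 1510.0),
    ("Q4_K_XL", 467.0),
    ("Q3_K_XL", 343.0),
    ("UD-IQ2_M", 239.0),
]


def largest_quant_that_fits(pooled_memory_gb: float,
                            headroom_gb: float = 16.0) -> Optional[str]:
    """Return the largest quant whose weights fit fully in memory, or None."""
    budget = pooled_memory_gb - headroom_gb  # headroom_gb is an assumed reserve
    for name, size_gb in GLM_52_QUANTS:
        if size_gb <= budget:
            return name
    return None


for memory in (128, 256, 512):
    print(memory, "GB ->", largest_quant_that_fits(memory))
```

This counts only weights held fully in memory. Setups that stream weights from disk or offload layers can run a larger quant than the function returns, at a cost in speed.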
**Who should try it:** Teams doing long-horizon agent coding who own a 256 GB+ Mac Studio or multi-GPU workstation and need MIT-licensed, region-unrestricted weights.
**Who should skip it:** Anyone with <128 GB pooled memory. The 239 GB 2-bit quant is noticeably degraded; you are better off running Qwen3-Coder-Next or Devstral Small 2, which deliver higher quality per watt.
**Licensing:** MIT. Genuinely open, no regional restrictions, no acceptable-use rider.
**Benchmark caveats:** GLM-5.2's 62.1% SWE-bench Pro is impressive but comes from a single run. Community eval threads on r/LocalLLaMA show variance of ±3% depending on sampling temperature and issue difficulty distribution. The Feb 2026 SWE-bench wave involved known training-data contamination concerns (OpenAI stopped reporting after confirming it), so SWE-bench Pro is the more reliable signal going forward.
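Because a single-run score hides run-to-run spread, one simple way to check a claim on your own workload is to repeat the evaluation and report the mean with its standard deviation. The sketch below shows only the summary step; the five pass rates are hypothetical placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(pass_rates: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run pass rates in percent."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(pass_rates), stdev(pass_rates)


# Hypothetical pass rates from five repeated runs with different seeds.
runs = [60.0, 62.0, 64.0, 61.0, 63.0]
avg, spread = summarize_runs(runs)
print(f"{avg:.1f}% +/- {spread:.1f}")
```

If the gap between two models is smaller than the spread you measure, treat them as tied for your purposes.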
---
## 2. Qwen3 / Qwen3-Coder / Qwen3-Coder-Next (Alibaba / Qwen Team)
### What changed
The **Qwen3-Coder family** (Feb 2026) spans 1.5B to 480B MoE, all **Apache 2.0**. **Qwen3-Coder-Next** (80B total / 3.9B active MoE) is the headline: **70.6% SWE-bench Verified**, 256K native context (1M via YaRN), and it runs on a single 48 GB MacBook Pro at 6+ tok/s. The 480B MoE (35B active) hits 67-70% SWE-bench. The 32B dense variant, at 69.6% SWE-bench, is the community default for 24 GB cards.
Trained via RL on 20,000 parallel environments using real GitHub issues + LeetCode + Codeforces.
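The "1M via YaRN" figure implies a RoPE scaling factor of roughly four over the 256K native window. The sketch below works through that arithmetic. It assumes 256K means 262,144 tokens, rounds the factor up to a whole number as a simplification, and uses key names that follow the `rope_scaling` convention seen in Hugging Face configs for earlier Qwen releases; check the model card before relying on any of these for this family.

```python
import math


def yarn_rope_scaling(native_ctx: int, target_ctx: int) -> dict:
    """Build an illustrative rope_scaling entry for a YaRN context extension."""
    factor = float(math.ceil(target_ctx / native_ctx))
    return {
        "rope_type": "yarn",                              # assumed key name
        "factor": factor,
        "original_max_position_embeddings": native_ctx,   # assumed key name
    }


cfg = yarn_rope_scaling(native_ctx=262_144, target_ctx=1_000_000)
print(cfg)
```

Extended-context settings generally trade some short-context quality for reach, so it is worth enabling them only when the workload actually needs the longer window.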
- [Qwen3-Coder GitHub](https://github.com/QwenLM/Qwen3-Coder)
- [RockB review (benchmarks vs GPT-5 / Claude Opus)](https://baeseokjae.github.io/posts/qwen3-coder-review-2026/)
- [RunLocalModel hardware-tier guide](https://runlocalmodel.com/best-local-coding-llm-2026.html)
- [Qwen3-Coder HF Collection](https://huggingface.co/collections/Qwen/qwen3-coder-6795298c2a8ab3d0cb908f2a)
### Hardware by variant
| Variant | VRAM (Q4) | SWE-bench ~ | Best for |
|---|---|---|---|
| **Qwen3-Coder 1.5B** | ~4 GB | ~30% | IDE autocomplete |
| **Qwen3-Coder 7B** | ~6 GB | ~45% | 8 GB GPUs |
| **Qwen3-Coder 14B** | ~10 GB | ~58% | **Best quality-per-GB; 12 GB sweet spot** |
| **Qwen3-Coder 32B** | ~20 GB | ~69.6% | 24 GB cards (RTX 4090/5090) |
| **Qwen3-Coder-Next (80B MoE)** | ~46 GB (Q4) / ~30 GB (2-bit) | **70.6%** | 48 GB Mac / dual 24 GB GPU |
| **Qwen3-Coder 480B MoE** | ~960 GB (FP16) | 67-70% | Cloud / multi-GPU rack |
**Who should try it:** Virtually everyone running local code LLMs. The **14B is the default pick for 12-16 GB**, the **32B for 24 GB**, and the **Next variant for 48 GB+**. Several community guides rate the family best quality-per-gigabyte in the coding category.
**Who should skip it:** Anyone needing vision input (the family is text-only). The 480B is overkill to run locally compared with using the API.
**Licensing:** Apache 2.0. Permissive and commercial-friendly.
**Benchmark caveats:** Qwen3-Coder scores come from Qwen's own eval pipeline with greedy decoding. Community re-runs on EvalPlus show HumanEval+ ~5% lower. SWE-bench Verified measures codebase-level issue resolution; it does not measure FIM quality (where Codestral still leads).
---
## 3. Codestral 2 & Devstral (Mistral AI)
### What changed
**Codestral 2** (22B) is Mistral's specialist **fill-in-the-middle (FIM) model**, with first-class FIM training across Python, JS/TS, Rust, Go, Java, C/C++, and SQL. It delivers the **cleanest tab-completions** of any open model in its size class and fits 24 GB comfortably (~13.5 GB at Q4_K_M). It is weaker than Qwen3-Coder on chat/instruction tasks.
**Devstral Small 2** (24B, Apache 2.0, Dec 2025) is an agentic coding model scoring **68% SWE-bench Verified**. It runs on a **single 24 GB GPU or 32 GB Mac**. Its bigger sibling **Devstral 2** (123B) hits **72.2% SWE-bench** vs Claude Sonnet 4.5's 77.2%, at up to 7× lower cost. Devstral supports tool calls, multi-file edits, and autonomous bug-fix loops.
- [Mistral Models Docs](https://docs.mistral.ai/models)
- [Devstral Small 2 local setup guide (RockB)](https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/)
- [RunLocalModel: Codestral 2 vs Qwen3-Coder](https://runlocalmodel.com/best-local-coding-llm-2026.html)
- [Mistral AI complete model guide 2026](https://www.aimadetools.com/blog/mistral-ai-complete-model-guide/)
- [Devstral Small 2 hardware guide](https://runaihome.com/blog/devstral-small-2-local-ai-hardware-guide-2026/)
### Codestral 2: hardware
| Variant | Size | VRAM (Q4_K_M) | Best for |
|---|---|---|---|
| Codestral 2 22B | 22B | ~13.5 GB | 24 GB GPU; FIM autocomplete specialist |
### Devstral Small 2: hardware
| Hardware | VRAM | Speed | Method |
|---|---|---|---|
| RTX 4090 | 24 GB | 25-40 tok/s | Ollama / vLLM |
| RTX 3090 | 24 GB | 15-25 tok/s | Ollama Q4_K_M |
| M3 Max 96 GB | 96 GB unified | 20-35 tok/s | Ollama Metal |
| CPU-only | 64 GB RAM | 1-3 tok/s | llama.cpp |
**Who should try Codestral 2:** Developers whose workflow is **90%+ tab completion** and who want the cleanest fills, especially in Python, Rust, Go, or TypeScript. Use via Continue.dev + Ollama.
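For readers who want to see what a fill-in-the-middle request looks like outside an editor plugin, here is a minimal sketch that builds a request body for Ollama's `/api/generate` endpoint, which accepts a `suffix` field alongside `prompt`. The model tag is a hypothetical placeholder (check `ollama list` for the tag you actually pulled), the option values are arbitrary, and whether `suffix` works depends on the model's template supporting FIM.

```python
import json

# Default local Ollama endpoint; adjust if your server listens elsewhere.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_fim_request(prefix: str, suffix: str,
                      model: str = "codestral-2-placeholder") -> dict:
    """Build a fill-in-the-middle request body (model tag is hypothetical)."""
    return {
        "model": model,
        "prompt": prefix,    # code before the cursor
        "suffix": suffix,    # code after the cursor
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 64},
    }


payload = build_fim_request("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
body = json.dumps(payload)
print(body)
```

The sketch stops at building the body; POST it to `OLLAMA_GENERATE_URL` with any HTTP client to get the completion for the gap between prefix and suffix.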
**Who should try Devstral Small 2:** Anyone wanting **a cloud-grade agentic coder on a single 24 GB card**: multi-file edits, tool calls, autonomous debugging. At 68% SWE-bench Verified it is roughly on par with Qwen3-Coder 32B (69.6%) in the same VRAM tier.
**Who should skip:** If you want one model for both chat and coding, Qwen3-Coder is more versatile. Codestral 2 is weak at instruction-style tasks.
**Licensing:** Codestral 2 uses Mistral's permissive license; Devstral is Apache 2.0. Commercial use is allowed.
**Benchmark caveats:** Codestral 2's FIM quality is hard to benchmark because most leaderboards test instruction-following, not tab completion. Community consensus (r/LocalLLaMA, promptquorum) rates it best-in-class for FIM, but this is qualitative. Devstral Small 2's 68% SWE-bench is with agentic scaffolding (OpenHands / SWE-agent); standalone prompting scores are lower.
---
## 4. DeepSeek Local: V4-Flash & Distilled R1
### What changed
**DeepSeek V4-Flash** (April 24, 2026): 284B total / **13B active** MoE, **1M context**, **MIT licensed**. The API is priced at Claude Haiku tier ($0.14/$0.55 per M tokens). Community GGUF builds appeared within 36 hours. The key insight: only 13B parameters are active per token, so decoding is bandwidth-bound rather than compute-bound, and the model runs at usable speed on a Mac Studio despite its 284B total size.
**DeepSeek R1 Distilled** models (1.5B-70B) remain the pragmatic entry point for reasoning-capable local models. They are smaller models fine-tuned to inherit R1's chain-of-thought via distillation.
- [DeepSeek V4-Flash hardware guide (Compute Market)](https://www.compute-market.com/blog/deepseek-v4-flash-local-hardware-guide-2026)
- [LLMHardware.io: DeepSeek R1 VRAM guide](https://llmhardware.io/guides/deepseek-hardware-requirements)
- [DeepSeek V4 local setup guide](https://www.aimadetools.com/blog/how-to-run-deepseek-v4-locally/)
- [Community GGUF (tecaprovn)](https://huggingface.co/tecaprovn/deepseek-v4-flash-gguf)
- [Self-hosting DeepSeek V4 with vLLM](https://lushbinary.com/blog/deepseek-v4-self-hosting-guide-vllm-hardware-deployment/)
### DeepSeek V4-Flash: quant reality
| Quant | Size | Min pooled memory | Hardware needed |
|---|---|---|---|
| BF16 | ~568 GB | ~600 GB | 4× H100 / 8× A100 |
| Q5_K_M | ~200 GB | ~210 GB | 2× H100 / 4× RTX 3090 |
| **Q4_K_M (recommended)** | **~158 GB** | **~96 GB** | **96 GB workstation / 192 GB Mac Studio** |
| Q3_K_M | ~125 GB | ~80 GB | Dual RTX 5090 (64 GB) + CPU offload |
| **IQ2_XS (floor)** | **~90 GB** | **~96 GB** | **Hobbyist; below this, function-name hallucination risk** |
### DeepSeek R1 Distilled (for those with less hardware)
| Model | Q4 VRAM | Min GPU | SWE-bench ~ |
|---|---|---|---|
| R1-Distill-7B | ~5.5 GB | RTX 4060 8 GB | ~40% |
| R1-Distill-14B | ~8.5 GB | Arc B580 12 GB | ~55% |
| R1-Distill-32B | ~17.5 GB | RTX 4090 24 GB | ~70% |
| R1-Distill-70B | ~36.5 GB | Mac Studio 64 GB | ~75% |
**Who should try V4-Flash:** Teams with 96 GB+ pooled memory wanting the best MIT-licensed 1M-context coder. Significantly cheaper per token than the API.
**Who should try R1 Distilled:** Anyone with 8-24 GB who wants DeepSeek's reasoning quality without datacenter hardware. The 32B distill is the community favorite for 24 GB cards.
**Who should skip V4-Flash:** Anyone with <90 GB pooled memory. At sub-90 GB quants, quality degrades heavily and function-name hallucination becomes a risk.
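**Licensing:** DeepSeek V4-Flash is **MIT**. The R1 distilled weights are also released under MIT, but they are built on third-party base models: the Qwen-based distills derive from Apache 2.0 models, while the Llama-based distills (8B, 70B) remain subject to the Llama license.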
**Benchmark caveats:** V4-Flash is so new (April 2026) that independent leaderboard runs are sparse. With 13B active out of 284B total parameters, decode speed depends mostly on memory bandwidth, so it will not match a small dense model on per-token latency. Community reports put it at 25-35 tok/s on a 192 GB Mac Studio via MLX. SWE-bench scores for V4-Flash are **not yet independently confirmed**: DeepSeek published API benchmarks, and local community evals are ongoing.
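The bandwidth-bound argument can be turned into a rough upper bound: if each decoded token has to read every active weight once, tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below assumes ~4.8 bits per weight for Q4_K_M and a hypothetical 800 GB/s of memory bandwidth. Both are illustrative inputs, and the bound ignores KV-cache reads, expert routing, and runtime overhead.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every active weight is read once per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Assumptions: 13B active parameters (from the text), ~4.8 bits per weight,
# and a hypothetical 800 GB/s of memory bandwidth.
ceiling = decode_ceiling_tok_s(13, 4.8, 800)
print(f"theoretical ceiling ~{ceiling:.0f} tok/s")
```

Under these assumptions the ceiling sits well above the 25-35 tok/s reported by the community. That gap is expected for an idealized bound and is a reminder to measure throughput on your own hardware.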
---
## Quick Decision Matrix
| Your hardware | Best coding model | Runner-up |
|---|---|---|
| 8 GB VRAM | Qwen3-Coder 7B Q4 | DeepSeek R1-Distill-7B |
| 12 GB VRAM | **Qwen3-Coder 14B Q4** (sweet spot) | Codestral 2 22B at IQ3_XXS |
| 16 GB VRAM | Qwen3-Coder 14B Q5_K_M | Codestral 2 22B Q4 |
| 24 GB VRAM | **Qwen3-Coder 32B Q4** or **Devstral Small 2** | DeepSeek R1-Distill-32B Q4 |
| 48 GB+ unified | **Qwen3-Coder-Next** (70.6% SWE-bench) | DeepSeek V4-Flash Q4_K_M |
| 96 GB+ unified | **DeepSeek V4-Flash Q4_K_M** (1M context) | GLM-5.2 at Q3_K_XL (343 GB) |
| 256 GB+ / multi-GPU | **GLM-5.2 Q3_K_XL** (62.1% SWE-bench Pro) | DeepSeek V4-Flash Q5_K_M |
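The matrix above can be expressed as a small lookup for scripting hardware checks. This is only a sketch: the thresholds and first-choice picks are copied from the table, the function name is hypothetical, and it ignores the table's distinction between discrete VRAM and unified memory.

```python
from typing import Optional

# (minimum memory in GB, first-choice pick), copied from the matrix above.
MATRIX = [
    (256, "GLM-5.2 Q3_K_XL"),
    (96, "DeepSeek V4-Flash Q4_K_M"),
    (48, "Qwen3-Coder-Next"),
    (24, "Qwen3-Coder 32B Q4 or Devstral Small 2"),
    (16, "Qwen3-Coder 14B Q5_K_M"),
    (12, "Qwen3-Coder 14B Q4"),
    (8, "Qwen3-Coder 7B Q4"),
]


def recommend(memory_gb: float) -> Optional[str]:
    """Return the first-choice pick for the highest tier the memory reaches."""
    for min_gb, model in MATRIX:
        if memory_gb >= min_gb:
            return model
    return None  # below the smallest tier in the matrix


for memory in (6, 12, 24, 64, 300):
    print(memory, "GB ->", recommend(memory))
```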
---
## Final caveats
- **All SWE-bench scores are directional.** Different scaffolding (SWE-agent vs OpenHands vs standalone), prompt templates, and sampling settings produce ±5-10% variance.
- **FIM quality is not captured by SWE-bench.** If tab completion is your primary use case, Codestral 2 remains the community favorite despite lower instruction-following scores.
- **Quantization quality varies.** The same bit-width from different quantizers (GGUF, AWQ, Unsloth Dynamic, GPTQ) behaves differently. Test on your actual code, not just leaderboards.
- **MIT ≠ Apache 2.0.** Both are permissive, but MIT is shorter and has no patent grant clause (Apache 2.0 does). For most commercial use either is fine, but consult legal for your jurisdiction.
- **Prices and availability change rapidly.** GPU pricing, cloud spot instances, and model releases in 2026 move week-to-week.
*Report compiled July 4, 2026 from official model cards, repos, and independent community sources.*
1. search: GLM-5 GLM-5.2 model local 2026
2. search: Qwen3 Qwen3-coder local model 2026
3. search: Codestral Devstral Mistral local code model 2026
4. search: DeepSeek local model hardware requirements 2026
5. navigate: https://huggingface.co/zai-org/GLM-5.2
6. navigate: https://github.com/zai-org/GLM-5
7. navigate: https://github.com/QwenLM/Qwen3-Coder
8. navigate: https://docs.mistral.ai/models
9. navigate: https://localclaw.io/blog/glm-5-2-local-ai
10. navigate: https://www.promptquorum.com/local-llms/best-local-llms-for-coding
11. navigate: https://llmhardware.io/guides/deepseek-hardware-requirements
12. navigate: https://runlocalmodel.com/best-local-coding-llm-2026.html
13. navigate: https://baeseokjae.github.io/posts/qwen3-coder-review-2026/
14. search: r/LocalLLaMA GLM-5.2 local impressions 2026
15. navigate: https://mistral.ai/models/codestral
16. navigate: https://insiderllm.com/guides/run-glm-5-2-locally/
17. search: Devstral Mistral local code autocomplete model 2026
18. search: DeepSeek V4 local model 2026 hardware requirements
19. navigate: https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/
20. navigate: https://www.compute-market.com/blog/deepseek-v4-flash-local-hardware-guide-2026
21. search: r/LocalLLaMA best local coding model July 2026 Qwen3 GLM-5 Devstral