Cited source trail
Local model landscape source trail - July 4, 2026
Original question
Local model landscape source trail - July 4, 2026. Deep dive but keep the final report concise. Cover GLM-5/GLM-5.2, Qwen3/Qwen coder or small local models, Codestral/Devstral or other local code/autocomplete models, and DeepSeek local/hardware claims only if source-backed. Cite official model cards/repos/papers plus at least one independent or community source. For each model family: what changed, who should try it, who should skip it, hardware/licensing caveats, and benchmark caveats. Use direct clickable links. Benchmarks and community tests are directional; verify on your own workload.
Sources checked: huggingface.co, github.com, docs.mistral.ai, localclaw.io, promptquorum.com, llmhardware.io
# 🏔️ Local Model Landscape: July 4, 2026

## How to read this report

- **Official sources** and **independent/community sources** are cited inline via clickable links.
- Benchmarks are **directional**: results vary by prompt template, sampler, evaluation harness, and workload. Always verify on your own data.
- Hardware figures assume Q4_K_M quantization unless stated. KV-cache headroom for long contexts is separate.

---

## 1. GLM-5 / GLM-5.2 (Z.ai / Zhipu AI)

### What changed

GLM-5 was released in Feb 2026 (77.8% SWE-bench Verified). **GLM-5.2** landed June 16, 2026: a 753B total / ~40B active MoE with **1M-token context**, MIT licensed. It tops [Artificial Analysis's open-weights Intelligence Index](https://artificialanalysis.ai/) with a score of 51 and posts **62.1% on SWE-bench Pro**. It was trained on Huawei Ascend (no NVIDIA), which points to hardware supply-chain diversification.

- [📄 Official GLM-5.2 Hugging Face](https://huggingface.co/zai-org/GLM-5.2)
- [📄 Official GLM-5 GitHub](https://github.com/zai-org/GLM-5)
- [📄 InsiderLLM hardware/quant deep-dive](https://insiderllm.com/guides/run-glm-5-2-locally/)
- [📄 Community: CPU inference via ik_llama.cpp](https://tools4all.ai/trends/glm-52-local-cpu-inference-demonstrated-via-ik-llamacpp)
- [📄 Unsloth quant ladder + setup](https://unsloth.ai/docs/models/glm-5.2)

### Hardware & quant reality

| Target Quant | Size | Needs | Experience |
|---|---|---|---|
| BF16 | 1.51 TB | Datacenter rack | Full precision server |
| Q4_K_XL (sweet spot) | ~467 GB | 512 GB Mac Studio or multi-GPU | Near-lossless |
| **Q3_K_XL (pragmatic target)** | **~343 GB** | **256 GB Mac Studio / 4-GPU node** | **Still feels like GLM-5.2** |
| UD-IQ2_M (accessible) | 239 GB | 2× RTX 5090 (64 GB) + CPU offload | Noticeable degradation |

**Who should try it:** Teams doing long-horizon agent coding who own a 256 GB+ Mac Studio or multi-GPU workstation and need MIT-licensed, region-unrestricted weights.
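
The sizes in the quant table above follow roughly from parameter count times average bits per weight. The sketch below shows that arithmetic under simple assumptions: it counts weights only (no KV cache, no runtime overhead), treats 1 GB as 10^9 bytes, and the function names are hypothetical helpers, not part of any quantization tool.

```python
def quant_size_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a model with
    `total_params_b` billion parameters stored at `bits_per_weight`."""
    return total_params_b * bits_per_weight / 8


def implied_bits_per_weight(total_params_b: float, size_gb: float) -> float:
    """Invert the estimate: average bits per weight implied by a file size."""
    return size_gb * 8 / total_params_b


# Total parameter count (billions) and quant sizes (GB) as quoted in the report.
GLM_5_2_PARAMS_B = 753
REPORTED_SIZES_GB = {"Q4_K_XL": 467, "Q3_K_XL": 343, "UD-IQ2_M": 239}

if __name__ == "__main__":
    # BF16 stores 16 bits per weight.
    print("BF16 GB:", quant_size_gb(GLM_5_2_PARAMS_B, 16))
    for name, size_gb in REPORTED_SIZES_GB.items():
        bpw = implied_bits_per_weight(GLM_5_2_PARAMS_B, size_gb)
        print(name, "implied bits/weight:", round(bpw, 2))
```

The BF16 estimate comes out at 1506 GB, in line with the 1.51 TB row. The implied bits per weight for the other rows sit above the nominal 4, 3, and 2 bits, which is expected because mixed-precision quants keep some tensors at higher precision.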

**Who should skip it:** Anyone with <128 GB pooled memory. The 239 GB 2-bit quant is degraded; you're better off running Qwen3-Coder-Next or Devstral Small 2 at higher quality per watt.

**Licensing:** MIT. Genuinely open, with no regional restrictions and no acceptable-use rider.

**Benchmark caveats:** GLM-5.2's 62.1% SWE-bench Pro is impressive but single-run. Community eval threads on r/LocalLLaMA show variance of ±3% depending on sampling temperature and issue difficulty distribution. The Feb 2026 SWE-bench wave involved known training-data contamination concerns (OpenAI stopped reporting after confirming it), so SWE-bench Pro is the more reliable signal going forward.

---

## 2. Qwen3 / Qwen3-Coder / Qwen3-Coder-Next (Alibaba / Qwen Team)

### What changed

The **Qwen3-Coder family** (Feb 2026) spans 1.5B → 480B MoE under **Apache 2.0**. **Qwen3-Coder-Next** (80B total / 3.9B active MoE) is the headline: **70.6% SWE-bench Verified**, 256K native context (1M via YaRN), and it runs on a single 48 GB MacBook Pro at 6+ tok/s. The 480B MoE (35B active) hits 67–70% SWE-bench. The 32B dense variant is the community default for 24 GB cards at 69.6% SWE-bench. The models were trained via RL on 20,000 parallel environments using real GitHub issues, LeetCode, and Codeforces.
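
The "256K native, 1M via YaRN" figures imply a RoPE scaling factor of 4. The sketch below just does that division and packs the result into a dictionary. The key names (`rope_type`, `factor`, `original_max_position_embeddings`) follow the `rope_scaling` convention seen in Hugging Face config files for Qwen models and are an assumption here; the helper `yarn_rope_scaling` is hypothetical, so check the official model card for the exact settings before using them.

```python
def yarn_rope_scaling(native_ctx: int, target_ctx: int) -> dict:
    """Build a YaRN-style rope_scaling entry that stretches `native_ctx`
    to `target_ctx`. Key names are assumed, not taken from the report."""
    if target_ctx <= native_ctx:
        raise ValueError("target_ctx must exceed native_ctx for scaling")
    return {
        "rope_type": "yarn",
        # Ratio of the extended window to the trained window.
        "factor": target_ctx / native_ctx,
        "original_max_position_embeddings": native_ctx,
    }


if __name__ == "__main__":
    # 256K native and 1M target, interpreted as binary multiples of 1024.
    print(yarn_rope_scaling(256 * 1024, 1024 * 1024))
```

Extended-context settings typically cost some quality on short prompts, so it is worth enabling them only when the workload actually needs the longer window.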

- [📄 Qwen3-Coder GitHub](https://github.com/QwenLM/Qwen3-Coder)
- [📄 RockB review (benchmarks vs GPT-5 / Claude Opus)](https://baeseokjae.github.io/posts/qwen3-coder-review-2026/)
- [📄 RunLocalModel hardware-tier guide](https://runlocalmodel.com/best-local-coding-llm-2026.html)
- [📄 Qwen3-Coder HF Collection](https://huggingface.co/collections/Qwen/qwen3-coder-6795298c2a8ab3d0cb908f2a)

### Hardware by variant

| Variant | VRAM (Q4 unless noted) | SWE-bench (approx.) | Best for |
|---|---|---|---|
| **Qwen3-Coder 1.5B** | ~4 GB | ~30% | IDE autocomplete |
| **Qwen3-Coder 7B** | ~6 GB | ~45% | 8 GB GPUs |
| **Qwen3-Coder 14B** | ~10 GB | ~58% | **Best quality-per-GB: 12 GB sweet spot** |
| **Qwen3-Coder 32B** | ~20 GB | ~69.6% | 24 GB cards (RTX 4090/5090) |
| **Qwen3-Coder-Next (80B MoE)** | ~46 GB (Q4) / ~30 GB (2-bit) | **70.6%** | 48 GB Mac / dual 24 GB GPU |
| **Qwen3-Coder 480B MoE** | ~960 GB (FP16) | 67–70% | Cloud / multi-GPU rack |

**Who should try it:** Virtually everyone running local code LLMs. The **14B is the default pick for 12–16 GB**, the **32B for 24 GB**, and the **Next variant for 48 GB+**. Multiple community guides rate the family best quality-per-gigabyte in the coding category.

**Who should skip it:** Anyone needing vision (the models are text-only). The 480B is overkill locally compared with using the API.

**Licensing:** Apache 2.0. Permissive and commercial-friendly.

**Benchmark caveats:** Qwen3-Coder scores come from Qwen's own eval pipeline with greedy decoding. Community re-runs on EvalPlus show HumanEval+ ~5% lower. SWE-bench Verified measures codebase-level issue resolution; it does not measure FIM quality (where Codestral still leads).

---

## 3. Codestral 2 & Devstral (Mistral AI)

### What changed

**Codestral 2** (22B) is Mistral's specialist **fill-in-middle (FIM) model**, with first-class FIM training across Python, JS/TS, Rust, Go, Java, C/C++, and SQL. It delivers the **cleanest tab-completions** of any open model in its size class.
It fits 24 GB comfortably (~13.5 GB at Q4_K_M) but is weaker on chat/instruction tasks than Qwen3-Coder.

**Devstral Small 2** (24B, Apache 2.0, Dec 2025) is an agentic coding model scoring **68% SWE-bench Verified**. It runs on a **single 24 GB GPU or 32 GB Mac**. Its bigger sibling **Devstral 2** (123B) hits **72.2% SWE-bench** vs Claude Sonnet 4.5's 77.2% at up to 7× lower cost. Both support tool calls, multi-file edits, and autonomous bug-fix loops.

- [📄 Mistral Models Docs](https://docs.mistral.ai/models)
- [📄 Devstral Small 2 local setup guide (RockB)](https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/)
- [📄 RunLocalModel: Codestral 2 vs Qwen3-Coder](https://runlocalmodel.com/best-local-coding-llm-2026.html)
- [📄 Mistral AI complete model guide 2026](https://www.aimadetools.com/blog/mistral-ai-complete-model-guide/)
- [📄 Devstral Small 2 hardware guide](https://runaihome.com/blog/devstral-small-2-local-ai-hardware-guide-2026/)

### Codestral 2: hardware

| Variant | Size | VRAM (Q4_K_M) | Best for |
|---|---|---|---|
| Codestral 2 22B | 22B | ~13.5 GB | 24 GB GPU; FIM autocomplete specialist |

### Devstral Small 2: hardware

| Hardware | Memory | Speed | Method |
|---|---|---|---|
| RTX 4090 | 24 GB | 25–40 tok/s | Ollama / vLLM |
| RTX 3090 | 24 GB | 15–25 tok/s | Ollama Q4_K_M |
| M3 Max 96 GB | 96 GB unified | 20–35 tok/s | Ollama Metal |
| CPU-only | 64 GB RAM | 1–3 tok/s | llama.cpp |

**Who should try Codestral 2:** Developers whose workflow is **90%+ tab completion** and who want the cleanest fills, especially in Python, Rust, Go, or TypeScript. Use it via Continue.dev + Ollama.

**Who should try Devstral Small 2:** Anyone wanting **a cloud-grade agentic coder on a single 24 GB card**: multi-file edits, tool calls, autonomous debugging. Its 68% SWE-bench Verified is within about two points of Qwen3-Coder 32B (~69.6%) in the same VRAM tier.

**Who should skip:** If you want one model for both chat and coding, Qwen3-Coder is more versatile.
Codestral 2 is weak at instruction-style tasks.

**Licensing:** Codestral 2 ships under Mistral's permissive license; Devstral is Apache 2.0. Commercial use is allowed.

**Benchmark caveats:** Codestral 2's FIM quality is hard to benchmark because most leaderboards test instruction-following, not tab completion. Community consensus (r/LocalLLaMA, promptquorum) rates it best-in-class for FIM, but this is qualitative. Devstral Small 2's 68% SWE-bench is measured with agentic scaffolding (OpenHands / SWE-agent); standalone prompting scores are lower.

---

## 4. DeepSeek Local: V4-Flash & Distilled R1

### What changed

**DeepSeek V4-Flash** (April 24, 2026): 284B total / **13B active** MoE, **1M context**, **MIT licensed**. The API is priced at Claude Haiku tier ($0.14/$0.55 per M tokens). Community GGUFs appeared within 36 hours. The key insight: only 13B parameters are active per token, so decoding is bandwidth-bound rather than compute-bound, and the model runs at usable speed on a Mac Studio despite its 284B total.

**DeepSeek R1 Distilled** models (1.5B–70B) remain the pragmatic entry point for reasoning-capable local models. They are fine-tuned to inherit R1's chain-of-thought via distillation.
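
The bandwidth-bound point about V4-Flash can be turned into a rough upper bound: each decoded token has to read the active weights once, so tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below is a back-of-envelope estimate only. The 4.5 bits per weight for a Q4-class quant and the 800 GB/s bandwidth are placeholder assumptions, not figures from the report, and the function name is hypothetical.

```python
def decode_tok_per_s_ceiling(
    active_params_b: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed when every token must stream the
    active weights from memory once. Ignores KV-cache reads, expert
    routing overhead, and compute time, so real speeds are lower."""
    active_weights_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / active_weights_gb


if __name__ == "__main__":
    # 13B active parameters comes from the report.
    # 4.5 bits/weight and 800 GB/s are assumed placeholder values.
    ceiling = decode_tok_per_s_ceiling(13, 4.5, 800)
    print("ceiling tok/s:", round(ceiling, 1))
```

With these placeholder inputs the ceiling is on the order of 100 tok/s, so measured speeds well below that are what you would expect once KV-cache traffic and routing overhead are included. Swap in the measured bandwidth of your own machine before drawing conclusions.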

- [📄 DeepSeek V4-Flash hardware guide (Compute Market)](https://www.compute-market.com/blog/deepseek-v4-flash-local-hardware-guide-2026)
- [📄 LLMHardware.io: DeepSeek R1 VRAM guide](https://llmhardware.io/guides/deepseek-hardware-requirements)
- [📄 DeepSeek V4 local setup guide](https://www.aimadetools.com/blog/how-to-run-deepseek-v4-locally/)
- [📄 Community GGUF (tecaprovn)](https://huggingface.co/tecaprovn/deepseek-v4-flash-gguf)
- [📄 Self-hosting DeepSeek V4 with vLLM](https://lushbinary.com/blog/deepseek-v4-self-hosting-guide-vllm-hardware-deployment/)

### DeepSeek V4-Flash: quant reality

| Quant | Size | Min pooled memory | Hardware needed |
|---|---|---|---|
| BF16 | ~568 GB | ~600 GB | 4× H100 / 8× A100 |
| Q5_K_M | ~200 GB | ~210 GB | 2× H100 / 4× RTX 3090 |
| **Q4_K_M (recommended)** | **~158 GB** | **~96 GB** | **96 GB workstation / 192 GB Mac Studio** |
| Q3_K_M | ~125 GB | ~80 GB | Dual RTX 5090 (64 GB) + CPU offload |
| **IQ2_XS (floor)** | **~90 GB** | **~96 GB** | **Hobbyist; below this, function-name hallucination risk** |

### DeepSeek R1 Distilled (for those with less hardware)

| Model | Q4 VRAM | Min GPU | SWE-bench (approx.) |
|---|---|---|---|
| R1-Distill-7B | ~5.5 GB | RTX 4060 8 GB | ~40% |
| R1-Distill-14B | ~8.5 GB | Arc B580 12 GB | ~55% |
| R1-Distill-32B | ~17.5 GB | RTX 4090 24 GB | ~70% |
| R1-Distill-70B | ~36.5 GB | Mac Studio 64 GB | ~75% |

**Who should try V4-Flash:** Teams with 96 GB+ pooled memory wanting the best MIT-licensed 1M-context coder. Significantly cheaper per token than the API.

**Who should try R1 Distilled:** Anyone with 8–24 GB who wants DeepSeek's reasoning quality without datacenter hardware. The 32B distill is the community favorite for 24 GB cards.

**Who should skip V4-Flash:** Anyone with <90 GB pooled memory. At sub-90 GB quants, quality degrades heavily and function-name hallucination becomes a risk.

**Licensing:** DeepSeek V4-Flash is **MIT**. R1 distilled variants are Apache 2.0.
Both are fully open.

**Benchmark caveats:** V4-Flash is so new (April 2026) that independent leaderboard runs are sparse. With 13B active out of 284B total, speed depends almost entirely on memory bandwidth, and it won't match dense models on latency per token. The community reports 25–35 tok/s on a 192 GB Mac Studio via MLX. SWE-bench scores for V4-Flash are **not yet independently confirmed**: DeepSeek published API benchmarks, and local community evals are ongoing.

---

## 🏁 Quick Decision Matrix

| Your hardware | Best coding model | Runner-up |
|---|---|---|
| 8 GB VRAM | Qwen3-Coder 7B Q4 | DeepSeek R1-Distill-7B |
| 12 GB VRAM | **Qwen3-Coder 14B Q4** ← sweet spot | Codestral 2 22B at IQ3_XXS |
| 16 GB VRAM | Qwen3-Coder 14B Q5_K_M | Codestral 2 22B Q4 |
| 24 GB VRAM | **Qwen3-Coder 32B Q4** or **Devstral Small 2** | DeepSeek R1-Distill-32B Q4 |
| 48 GB+ unified | **Qwen3-Coder-Next** (70.6% SWE-bench) | DeepSeek V4-Flash Q4_K_M |
| 96 GB+ unified | **DeepSeek V4-Flash Q4_K_M** (1M context) | GLM-5.2 at Q3_K_XL (343 GB) |
| 256 GB+ / multi-GPU | **GLM-5.2 Q3_K_XL** (62.1% SWE-bench Pro) | DeepSeek V4-Flash Q5_K_M |

---

## ⚠️ Final caveats

- **All SWE-bench scores are directional.** Different scaffolding (SWE-agent vs OpenHands vs standalone), prompt templates, and sampling settings produce ±5–10% variance.
- **FIM quality is not captured by SWE-bench.** If tab completion is your primary use case, Codestral 2 remains the community favorite despite lower instruction-following scores.
- **Quantization quality varies.** The same bit-width from different quantizers (GGUF, AWQ, Unsloth Dynamic, GPTQ) behaves differently. Test on your actual code, not just leaderboards.
- **MIT ≠ Apache 2.0.** Both are permissive, but MIT is shorter and has no patent grant clause (Apache 2.0 does). For most commercial use either is fine, but consult legal for your jurisdiction.
- **Prices and availability change rapidly.** GPU pricing, cloud spot instances, and model releases in 2026 move week-to-week.

*Report compiled July 4, 2026 from official model cards, repos, and independent community sources.*
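
To make the "scores are directional" caveat concrete, the sketch below estimates how much a pass rate can move from sampling noise alone, using a plain binomial standard error. It is a rough approximation under stated assumptions: tasks are treated as independent pass/fail trials, the two models are treated as independent samples even though they share a task set, and the task count of 500 is an assumption about the benchmark size rather than a figure from this report. Function names are hypothetical.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)


def gap_is_within_noise(
    rate_a: float, rate_b: float, n_tasks: int, z: float = 1.96
) -> bool:
    """True when the gap between two pass rates is no larger than z
    standard errors of the difference (about 95% for z = 1.96)."""
    se_a = pass_rate_standard_error(rate_a, n_tasks)
    se_b = pass_rate_standard_error(rate_b, n_tasks)
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return abs(rate_a - rate_b) <= z * se_diff


if __name__ == "__main__":
    n = 500  # assumed task count, not taken from the report
    # Pass rates quoted in the report: 68% and ~69.6%.
    print("SE at 68%:", round(pass_rate_standard_error(0.68, n), 4))
    print("gap within noise:", gap_is_within_noise(0.68, 0.696, n))
```

Under these assumptions a gap of one or two points between single runs is not distinguishable from noise, which is one more reason to rerun the comparison on your own workload.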

Sources visited

1. search: GLM-5 GLM-5.2 model local 2026
2. search: Qwen3 Qwen3-coder local model 2026
3. search: Codestral Devstral Mistral local code model 2026
4. search: DeepSeek local model hardware requirements 2026
5. navigate: https://huggingface.co/zai-org/GLM-5.2
6. navigate: https://github.com/zai-org/GLM-5
7. navigate: https://github.com/QwenLM/Qwen3-Coder
8. navigate: https://docs.mistral.ai/models
9. navigate: https://localclaw.io/blog/glm-5-2-local-ai
10. navigate: https://www.promptquorum.com/local-llms/best-local-llms-for-coding
11. navigate: https://llmhardware.io/guides/deepseek-hardware-requirements
12. navigate: https://runlocalmodel.com/best-local-coding-llm-2026.html
13. navigate: https://baeseokjae.github.io/posts/qwen3-coder-review-2026/
14. search: r/LocalLLaMA GLM-5.2 local impressions 2026
15. navigate: https://mistral.ai/models/codestral
16. navigate: https://insiderllm.com/guides/run-glm-5-2-locally/
17. search: Devstral Mistral local code autocomplete model 2026
18. search: DeepSeek V4 local model 2026 hardware requirements
19. navigate: https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/
20. navigate: https://www.compute-market.com/blog/deepseek-v4-flash-local-hardware-guide-2026
21. search: r/LocalLLaMA best local coding model July 2026 Qwen3 GLM-5 Devstral
Shared by Unchained Sky · Jul 4, 2026
