Model Selection Guide — claude-codex-local

┌─ /tiers memory → model fit

Reference table

01 Recommendations by memory tier

Pick a row that matches your hardware. Each tier lists models that perform well at that memory ceiling — assuming Q4–Q5 quantization and a typical 8k–16k context.

Coding model recommendations by memory tier
Tier · Hardware	Recommended coding models
24 GB · M3 Pro (18GB), RTX 4090 (24GB)	Qwen2.5-Coder-7B: fast, efficient for most coding tasks; DeepSeek-Coder-6.7B: strong code completion; CodeLlama-13B: balanced performance; Mistral-7B-Instruct: general purpose with coding ability
32 GB · M3 Max (36GB), RTX 6000 Ada (48GB)	Qwen2.5-Coder-14B: enhanced reasoning for refactoring; DeepSeek-Coder-33B: advanced code generation; CodeLlama-34B: large context window (16K); Phind-CodeLlama-34B: optimized for code search
64 GB · M2 Ultra (64GB), H100 (80GB)	Qwen2.5-Coder-32B: top-tier coding performance; DeepSeek-Coder-V2-Lite: 16B with 128K context; CodeLlama-70B: architecture design ceiling; Mixtral-8x7B-Instruct: MoE, ~47B total parameters (~13B active per token)
128 GB+ · Mac Studio (192GB), H100 NVL (188GB)	DeepSeek-Coder-V2-236B: state-of-the-art; Qwen2.5-Coder-72B: multi-file reasoning; Mixtral-8x22B: 141B params, expert routing; CodeLlama-70B + long ctx: 100K+ tokens

├─ /manual-selection 6 steps · DIY

Process

02 Selecting models manually

llmfit automates this. If you prefer to drive yourself, here's the procedure.

Check your available memory

Determine the memory budget your model has to work with:

Mac (unified): Total RAM − 4–6 GB for OS
NVIDIA GPU: VRAM, plus system RAM if you offload layers (offloaded layers run slower)
CPU-only: Total RAM − 8–12 GB for OS and apps

Example: M3 Max 36GB → ~30–32 GB available for models.
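The budget rules above can be sketched in a few lines of Python. This is a rough illustration, not part of llmfit or ccl: the function name memory_budget_gb and the platform labels are invented here, and for NVIDIA the sketch assumes the offload pool is system RAM minus 8 GB, which the guide does not specify.

```python
def memory_budget_gb(platform, ram_gb, vram_gb=0.0):
    """Return a (low, high) estimate, in GB, of memory a model can use."""
    if platform == "mac":
        # Unified memory: reserve 4-6 GB for the OS.
        low, high = ram_gb - 6, ram_gb - 4
    elif platform == "cpu":
        # CPU-only: reserve 8-12 GB for the OS and apps.
        low, high = ram_gb - 12, ram_gb - 8
    elif platform == "nvidia":
        # Low end: VRAM only (no offloading).
        # High end: VRAM plus leftover system RAM (assumption: RAM minus 8 GB).
        low, high = vram_gb, vram_gb + max(ram_gb - 8, 0)
    else:
        raise ValueError(f"unknown platform: {platform!r}")
    # Never report a negative budget on very small machines.
    return (max(low, 0), max(high, 0))


print(memory_budget_gb("mac", 36))  # the M3 Max example from this step
```

Returning a range rather than a single number mirrors the guide's own "~30–32 GB" phrasing; plan against the low end if other apps will be running.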

Calculate model memory

Estimate memory based on parameter count and quantization:

FP16: ~2 bytes / parameter
Q8: ~1 byte / parameter
Q4: ~0.5 bytes / parameter
Context buffer: add 2–4 GB

Example: 7B at Q4 → 7 × 0.5 = 3.5 GB + 2 GB ctx = ~5.5 GB.
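The same arithmetic as a small Python helper, in case you want to compare several models at once. The name estimate_model_gb is illustrative, and the default 2 GB context buffer is the low end of the 2–4 GB range above; it treats one billion parameters at one byte each as roughly 1 GB.

```python
# Approximate bytes per parameter, as listed in this step.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def estimate_model_gb(params_billion, quant, context_gb=2.0):
    """Rough memory estimate in GB: weights plus a context buffer."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + context_gb


# The worked example from this step: 7B at Q4 with a 2 GB context buffer.
print(estimate_model_gb(7, "Q4"))

# A larger model with the high end of the context range.
print(estimate_model_gb(33, "Q4", context_gb=4.0))
```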

Pick a quantization level

Balance quality vs memory:

Q8: ~95% of FP16 quality, minimal degradation
Q5 / Q6: ~90% quality, balanced default
Q4: ~85% quality, about half the memory of Q8 (a quarter of FP16)
Q2 / Q3: noticeable loss — avoid for coding

Recommended: start at Q4 or Q5 for coding models.
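To see the trade-off for a specific model size, the quality and memory figures can be laid side by side. The sketch below uses the approximate quality percentages quoted above; the 0.625 bytes per parameter for Q5 is an assumption (5 bits divided by 8) that this guide does not state, and real quantized files vary by format.

```python
QUANTS = [
    # (name, bytes per parameter, approx. quality vs FP16 as quoted above)
    ("Q8", 1.0, 95),
    ("Q5", 0.625, 90),  # assumption: 5 bits / 8 = 0.625 bytes per parameter
    ("Q4", 0.5, 85),
]


def quant_tradeoffs(params_billion):
    """List (quant, weights in GB, approx. quality %) for one model size."""
    return [(name, params_billion * bpp, quality) for name, bpp, quality in QUANTS]


for name, gb, quality in quant_tradeoffs(14):
    print(f"{name}: ~{gb:.1f} GB weights, ~{quality}% quality")
```

The weights figure excludes the context buffer, so add 2–4 GB before comparing against your memory budget.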

Test on your machine

Run a real session and watch for these signals:

Load time: under 30 s
First token: under 2–3 s
Generation: 10+ tokens/sec for interactive UX
Headroom: keep 10–20% memory free

Test: ccl chat --model <name> --benchmark
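If you note the four numbers down by hand, a small check like the following can compare them against the targets above. It is a sketch only: check_benchmark is a made-up name, it does not parse any ccl output, and it uses 3 s and 10% as the cut-offs (the lenient ends of the ranges in the list).

```python
def check_benchmark(load_s, first_token_s, tokens_per_s, free_mem_fraction):
    """Return the names of the targets a run missed (empty list = all met)."""
    missed = []
    if load_s >= 30:
        missed.append("load time")
    if first_token_s >= 3:        # upper end of the 2-3 s target
        missed.append("first token")
    if tokens_per_s < 10:
        missed.append("generation speed")
    if free_mem_fraction < 0.10:  # lower end of the 10-20% headroom target
        missed.append("memory headroom")
    return missed


# A healthy run, then one that loads slowly and generates too slowly.
print(check_benchmark(12, 1.5, 18, 0.15))
print(check_benchmark(45, 1.5, 4, 0.15))
```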

Monitor resources

Watch for the bad signs during a real workload:

Memory pressure: Activity Monitor (Mac) / Task Manager (Win)
Swap usage: rising swap = model too large
GPU util: expect 70–90% during generation
Thermals: sustained high temps throttle perf

Red flags: swap >2 GB · <5 tok/s · UI lag.
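The measurable red flags can be checked from a handful of readings taken during a session. The sketch below assumes you have sampled swap usage (in GB) a few times and measured generation speed yourself; red_flags is a hypothetical helper, and UI lag is left out because it is a subjective signal.

```python
def red_flags(swap_samples_gb, tokens_per_s):
    """Flag the warning signs from this step given readings taken by hand."""
    flags = []
    if swap_samples_gb and max(swap_samples_gb) > 2.0:
        flags.append("swap above 2 GB")
    # Rising swap over the session suggests the model is too large.
    if len(swap_samples_gb) >= 2 and swap_samples_gb[-1] > swap_samples_gb[0]:
        flags.append("swap rising")
    if tokens_per_s < 5.0:
        flags.append("generation below 5 tok/s")
    return flags


# Stable swap and good speed, then a session that is clearly struggling.
print(red_flags([0.1, 0.1, 0.1], 12.0))
print(red_flags([0.5, 1.4, 2.6], 3.0))
```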

Optimize if needed

If performance is poor, walk down one of these axes:

Lower quant: Q5 → Q4 (Q4 → Q3 only as a last resort for coding)
Smaller model: 34B → 13B → 7B
Reduce context: 4K → 2K
Offload: GPU + CPU hybrid (if supported)

Goal: 10+ tok/s with <2 s first-token latency.
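One way to walk down the first two axes systematically is to try candidates from largest to smallest and stop at the first that fits your memory budget. The sketch below treats "fits in memory" as a stand-in for acceptable performance, which is a simplification; the ordering (lower the quant before shrinking the model), the name first_fit, and the 0.625 bytes per parameter for Q5 are all assumptions, not something this guide prescribes.

```python
BYTES_PER_PARAM = {"Q5": 0.625, "Q4": 0.5}  # Q5 value is an assumption (5 bits / 8)


def first_fit(budget_gb, sizes_b=(34, 13, 7), quants=("Q5", "Q4"), context_gb=2.0):
    """Return the first (size in billions, quant) that fits, or None."""
    for size in sizes_b:        # larger models first
        for quant in quants:    # higher-quality quant first
            need_gb = size * BYTES_PER_PARAM[quant] + context_gb
            if need_gb <= budget_gb:
                return size, quant
    return None


print(first_fit(20))  # 34B fits only after dropping to Q4
print(first_fit(10))  # falls back to a 13B model
print(first_fit(3))   # nothing fits: consider offloading or a smaller context
```

Whatever this suggests is only a starting point; confirm it against the 10+ tok/s and first-token targets on your own machine.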

Pro tip: Start with a smaller model (7B–13B) to verify your setup works end-to-end, then scale up once you've confirmed performance.
