Which model fits your machine?

From 8GB laptops to 192GB workstations — a hardware-tiered map of coding models. The wizard does this for you via llmfit; this page shows the math.

Tiers: 4 memory ranges
Models: 16+ recommended
Picker: llmfit auto
┌─ /tiers memory → model fit
Reference table

01 Recommendations by memory tier

Pick a row that matches your hardware. Each tier lists models that perform well at that memory ceiling — assuming Q4–Q5 quantization and a typical 8k–16k context.

Coding model recommendations by memory tier
Each entry: tier · example hardware, followed by the recommended coding models.

24 GB · M3 Pro (18GB) · RTX 4090 (24GB)
  • Qwen2.5-Coder-7B: fast, efficient for most coding tasks
  • DeepSeek-Coder-6.7B: strong code completion
  • CodeLlama-13B: balanced performance
  • Mistral-7B-Instruct: general purpose with coding ability

32 GB · M3 Max (36GB) · RTX 6000 Ada (48GB)
  • Qwen2.5-Coder-14B: enhanced reasoning for refactoring
  • DeepSeek-Coder-33B: advanced code generation
  • CodeLlama-34B: large context window (16K)
  • Phind-CodeLlama-34B: optimized for code search

64 GB · M2 Ultra (64GB) · H100 (80GB)
  • Qwen2.5-Coder-32B: top-tier coding performance
  • DeepSeek-Coder-V2-Lite: 16B with 128K context
  • CodeLlama-70B: architecture design ceiling
  • Mixtral-8x7B-Instruct: MoE, ~47B total parameters with ~13B active per token

128 GB+ · Mac Studio (192GB) · H100 NVL (188GB)
  • DeepSeek-Coder-V2-236B: state-of-the-art
  • Qwen2.5-72B-Instruct: multi-file reasoning (general-purpose model; the Qwen2.5-Coder line tops out at 32B)
  • Mixtral-8x22B: 141B params, expert routing
  • CodeLlama-70B + long ctx: 100K+ tokens

├─ /manual-selection 6 steps · DIY
Process

02 Selecting models manually

llmfit automates this. If you prefer to do it yourself, here's the procedure.

01

Check your available memory

Determine the memory budget your model has to work with:

  • Mac (unified): Total RAM − 4–6 GB for OS
  • NVIDIA GPU: VRAM, plus system RAM if you offload layers to the CPU (offloaded layers run slower)
  • CPU-only: Total RAM − 8–12 GB for OS and apps
Example: M3 Max 36GB → ~30–32 GB available for models.
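
The rules above can be sketched as a small helper. This is a minimal illustration, not part of llmfit: the function name and the platform labels are hypothetical, and reading the NVIDIA rule as a range (VRAM only at the low end, VRAM plus system RAM at the high end) is an assumption.

```python
def memory_budget_gb(platform, total_ram_gb, vram_gb=0.0):
    """Return a (low, high) estimate of memory available for models, in GB."""
    if platform == "mac":
        # Unified memory: reserve 4-6 GB for the OS.
        low, high = total_ram_gb - 6, total_ram_gb - 4
    elif platform == "nvidia":
        # Assumption: low = model held entirely in VRAM,
        # high = VRAM plus system RAM used for offloaded layers.
        low, high = vram_gb, vram_gb + total_ram_gb
    elif platform == "cpu":
        # CPU-only: reserve 8-12 GB for the OS and apps.
        low, high = total_ram_gb - 12, total_ram_gb - 8
    else:
        raise ValueError(f"unknown platform: {platform!r}")
    # A budget can never be negative.
    return (max(low, 0), max(high, 0))


print(memory_budget_gb("mac", 36))  # (30, 32), matching the M3 Max example
```
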
02

Calculate model memory

Estimate memory based on parameter count and quantization:

  • FP16: ~2 bytes / parameter
  • Q8: ~1 byte / parameter
  • Q4: ~0.5 bytes / parameter
  • Context buffer: add 2–4 GB
Example: 7B at Q4 → 7 × 0.5 = 3.5 GB + 2 GB ctx = ~5.5 GB.
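
The same arithmetic as a short function, using only the byte counts listed above. The function name and the 2 GB default for the context buffer are illustrative choices; the result is a rough estimate, not a measurement.

```python
# Bytes per parameter, taken from the list above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def estimate_model_gb(params_billions, quant, context_gb=2.0):
    """Weights (parameters x bytes per parameter) plus a flat context buffer, in GB."""
    if quant not in BYTES_PER_PARAM:
        raise ValueError(f"no byte estimate for quantization {quant!r}")
    # One billion parameters at one byte each is about 1 GB,
    # so billions of parameters x bytes per parameter gives GB directly.
    return params_billions * BYTES_PER_PARAM[quant] + context_gb


print(estimate_model_gb(7, "Q4"))        # 5.5
print(estimate_model_gb(33, "Q4", 4.0))  # 20.5
```
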
03

Pick a quantization level

Balance quality vs memory:

  • Q8: ~95% of FP16 quality, minimal degradation
  • Q5 / Q6: ~90% quality, balanced default
  • Q4: ~85% quality, about half the memory of Q8 (a quarter of FP16)
  • Q2 / Q3: noticeable loss — avoid for coding
Recommended: start at Q4 or Q5 for coding models.
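
To see how the levels trade memory for precision, the sketch below finds the highest-precision level that fits a given budget. The Q5 and Q6 byte counts are assumptions (nominal bits divided by 8); real quantized files carry some extra overhead, so treat the numbers as lower bounds.

```python
# Nominal bytes per parameter, ordered from highest to lowest precision.
# Q8 and Q4 come from the guide; Q6 and Q5 are assumed as bits / 8.
NOMINAL_BYTES = {"Q8": 1.0, "Q6": 0.75, "Q5": 0.625, "Q4": 0.5}


def largest_quant_that_fits(params_billions, budget_gb, context_gb=2.0):
    """Return the highest-precision level whose weights plus context fit, or None."""
    # Dicts preserve insertion order, so this walks from Q8 down to Q4.
    for quant, bytes_per_param in NOMINAL_BYTES.items():
        if params_billions * bytes_per_param + context_gb <= budget_gb:
            return quant
    # Nothing at Q4 or above fits: pick a smaller model rather than dropping to Q3.
    return None


print(largest_quant_that_fits(14, 12))  # Q5
print(largest_quant_that_fits(34, 16))  # None
```
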
04

Test on your machine

Run a real session and watch for these signals:

  • Load time: under 30 s
  • First token: under 2–3 s
  • Generation: 10+ tokens/sec for interactive UX
  • Headroom: keep 10–20% memory free
Test: ccl chat --model <name> --benchmark
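
Once you have numbers from a real session, a check like the one below compares them against the thresholds above. It is a hypothetical helper that takes measurements you supply by hand; it does not parse the output of `ccl chat --benchmark`, whose format is not described here.

```python
def check_benchmark(load_s, first_token_s, tokens_per_s, mem_used_gb, mem_total_gb):
    """Return the list of thresholds that were missed; an empty list means all passed."""
    failures = []
    if load_s >= 30:
        failures.append("load time is 30 s or more")
    if first_token_s > 3:
        failures.append("first token took longer than 3 s")
    if tokens_per_s < 10:
        failures.append("generation is below 10 tokens/sec")
    # Headroom: at least 10% of memory should remain free.
    free_fraction = 1 - mem_used_gb / mem_total_gb
    if free_fraction < 0.10:
        failures.append("less than 10% of memory is free")
    return failures


print(check_benchmark(12, 1.5, 18, 28, 36))  # []
```
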
05

Monitor resources

Watch for the bad signs during a real workload:

  • Memory pressure: Activity Monitor (Mac) / Task Manager (Win)
  • Swap usage: rising swap = model too large
  • GPU util: 70–90% during generation
  • Thermals: sustained high temps throttle perf
Red flags: swap >2 GB · <5 tok/s · UI lag.
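
The measurable red flags can be checked from a few samples taken during a workload. This is a sketch with hypothetical names: how you collect the swap readings (Activity Monitor, Task Manager, or a script) is up to you, and "rising" is simplified here to the last sample exceeding the first.

```python
def red_flags(swap_samples_gb, tokens_per_s):
    """Return the red flags triggered by swap readings (in GB) and generation speed."""
    flags = []
    if swap_samples_gb and max(swap_samples_gb) > 2.0:
        flags.append("swap above 2 GB")
    # Simplified trend check: compare the last reading with the first.
    if len(swap_samples_gb) >= 2 and swap_samples_gb[-1] > swap_samples_gb[0]:
        flags.append("swap is rising, model is likely too large")
    if tokens_per_s < 5:
        flags.append("generation below 5 tok/s")
    return flags


print(red_flags([0.1, 0.1, 0.1], 14))  # []
print(red_flags([0.2, 1.1, 2.6], 4))   # all three flags
```
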
06

Optimize if needed

If performance is poor, walk down one of these axes:

  • Lower quant: Q5 → Q4 (Q4 → Q3 only as a last resort, since Q3 costs noticeable quality for coding)
  • Smaller model: 34B → 13B → 7B
  • Reduce context: 16K → 8K → 4K
  • Offload: GPU + CPU hybrid (if supported)
Goal: 10+ tok/s with <2 s first-token latency.
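
One way to walk these axes systematically is to try combinations in a fixed order and stop at the first one that fits the memory budget. The ordering (quantization first, then context buffer, then model size), the Q5 byte count (5 bits / 8), and the function name are all assumptions; a fit on paper still needs the benchmark from step 04 to confirm speed.

```python
# Q4 is from the guide; Q5 is an assumed nominal value (5 bits / 8).
BYTES_PER_PARAM = {"Q5": 0.625, "Q4": 0.5}


def first_fit(budget_gb, sizes_b=(34, 13, 7), quants=("Q5", "Q4"),
              context_gbs=(4.0, 2.0)):
    """Return (size_b, quant, context_gb, needed_gb) for the first fit, or None."""
    for size in sizes_b:                 # smaller model: 34B -> 13B -> 7B
        for quant in quants:             # lower quant: Q5 -> Q4
            for ctx in context_gbs:      # smaller context buffer: 4 GB -> 2 GB
                needed = size * BYTES_PER_PARAM[quant] + ctx
                if needed <= budget_gb:
                    return (size, quant, ctx, needed)
    return None


print(first_fit(20))  # (34, 'Q4', 2.0, 19.0)
print(first_fit(10))  # (13, 'Q4', 2.0, 8.5)
```
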
Pro tip: Start with a smaller model (7B–13B) to verify your setup works end-to-end, then scale up once you've confirmed performance.

Skip the guesswork — let the wizard pick

Running locally? llmfit reads your hardware and picks a model that fits. Routing through 9router or OpenRouter? Skip this page and head straight to setup.