Asking Claude to Help Me Replace Claude
Two days of debugging a local coding agent on a single RTX 4090, and why open-weight models are closer to ready than they look but harder to get there than they should be.
Why bother running locally at all
Two reasons, both gaining weight every month.
Cost. I’m a daily Claude Max user and an active Codex user. Between them, my monthly inference bill is in the hundreds of dollars and growing. Agent-style coding workflows burn tokens fast. The frontier models are worth that for the hard tasks. But a lot of what an agent does day to day (boilerplate edits, search-and-summarize, build babysitting, “rename this and update the callers”) doesn’t actually need a frontier model. Routing the easy 70% to a local model and the hard 30% to Claude is the obvious play, and it stops being theoretical when local quality crosses a usable threshold.
Sovereignty. Sending source code, conversations, and increasingly your filesystem context to a remote provider is a tradeoff I’d rather not make if I don’t have to. Retention policies are policies, not physics. Open-weight models like Qwen3.6, DeepSeek-V3.x, Llama 4, and Gemma 3 have crossed from “interesting” to “good enough for real work” over the last six months. Coding-specific evals show the gap to frontier closing. The point where a 35B-class MoE on a desktop is the default tool, with frontier APIs as the escalation path, is here.
The catch is that getting that local model to actually feel good (not just produce the right tokens, but produce them fast enough that you don’t context-switch waiting for the response) takes more tuning than I expected. This post walks through how I set the stack up, what went wrong, what fixed it, and where it’s going next.
Setting up the stack
The components are: llama.cpp as the inference server, a GGUF-quantized model file on disk, a systemd unit to keep the server running, and pi as the agent client that talks to it.
Build llama.cpp with CUDA
I build llama.cpp from source rather than using a packaged binary. The packaged versions tend to lag and the CUDA build flags depend on your specific GPU architecture (sm_89 for the 4090).
cd ~/code
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j
The binary you care about is build/bin/llama-server. Sanity check that it loaded the CUDA backend:
./build/bin/llama-server --help | grep -i cuda
If you see CUDA-related options, the build is good. If not, the cmake step probably didn’t find your CUDA toolkit and silently fell back to CPU only.
Get a GGUF model
I went with Qwen3.6-35B-A3B (a 35B Mixture-of-Experts model, only 3B parameters active per token). MoE matters here because inference is bandwidth-bound by the active parameter count, not the total. A 3B-active model decodes roughly 10x faster than a 35B dense model on the same hardware.
For GGUF files, the two main publishers are Bartowski (static quants) and Unsloth (Dynamic 2.0 quants, generally better quality at the same file size). I started with Bartowski’s IQ4_XS and later swapped to Unsloth’s UD-IQ4_XS as a free upgrade.
# Download via huggingface_hub. hf_transfer enables parallel chunked download
# and is roughly 3-5x faster than the default on a fast connection.
pip install huggingface_hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 \
hf download unsloth/Qwen3.6-35B-A3B-GGUF \
Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
--local-dir ~/models/qwen3.6-35b-a3b-unsloth
The file is around 17.7 GB. On a fast link with hf_transfer it pulled in roughly eight minutes for me.
Run llama-server as a systemd unit
Running it as a user systemd service means it survives terminal closes, restarts on crash, and has clean logs through journalctl. My initial unit was straightforward:
# ~/.config/systemd/user/llama-qwen.service
[Unit]
Description=llama.cpp server (Qwen3.6-35B-A3B)
After=network.target
[Service]
Type=simple
ExecStart=/home/sachin/code/llama.cpp/build/bin/llama-server \
-m /home/sachin/models/qwen3.6-35b-a3b-unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
-ngl 99 \
-c 32768 \
-fa on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja \
--host 127.0.0.1 --port 8080
Restart=on-failure
RestartSec=5
[Install]
WantedBy=default.target
The flags worth knowing on day one:
-ngl 99: offload all layers to the GPU (the number is high enough to cover any layer count).-c 32768: total KV cache budget in tokens. This is the part that gets interesting later.-fa on: flash attention. Faster decode, lower KV memory.--cache-type-k q8_0 --cache-type-v q8_0: 8-bit KV cache. Halves KV memory vs FP16 with negligible quality loss.--jinja: use the model’s built-in chat template.
Enable and start it:
systemctl --user daemon-reload
systemctl --user enable --now llama-qwen.service
journalctl --user -u llama-qwen -f
Smoke test it:
curl -s http://127.0.0.1:8080/health
# {"status":"ok"}
Wire pi to the local server
pi is a coding agent that supports custom backends through its extension system. The standard approach is to write a small TypeScript extension that registers a local-llama provider pointing at the llama-server’s OpenAI-compatible endpoint.
// ~/.pi/agent/extensions/local-llama.ts
import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
export default function (pi: ExtensionAPI) {
pi.registerProvider({
id: "local-llama",
label: "Local llama-server",
api: "openai-completions",
baseUrl: "http://127.0.0.1:8080/v1",
reasoning: true, // Qwen3.6 emits reasoning_content separate from content
});
}
Install it (extensions are not auto-discovered just by sitting in the directory):
npm install -g @mariozechner/pi-coding-agent
pi install ~/.pi/agent/extensions/local-llama.ts -l
Then run pi against the local provider:
pi --provider local-llama --print "Write a haiku about prompt caching."
If that works, the pipeline is up. The setup looks like:
GPU: NVIDIA RTX 4090, 24 GB VRAM
Model: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf (~17.7 GB)
Server: llama.cpp built from source, CUDA arch 89
Client: pi (npm: @mariozechner/pi-coding-agent)
This is the point where I had a “working” local agent. The next thing I wanted to know was whether it was actually working and how well.
Before the problem surfaced: building a tiny benchmark
A --print smoke test tells you the pipeline is up. It doesn’t tell you whether the model is good enough at the kind of work I’d actually route to it, or how much faster or slower a config change makes things. I needed something to measure both quality and speed on tasks shaped like the ones I do all day.
The obvious starting point was SWE-bench Lite, the 300-problem subset of SWE-bench that the open-source community treats as a standard for coding agents. I kicked off a run. On a local 35B-A3B going through pi, the full 300 problems would have taken many hours end-to-end. Worse, SWE-bench’s setup involves Docker images per problem, repo checkouts, and patch application, which adds enough infrastructure overhead that a single config tweak to the inference server requires a 4 to 8 hour re-run to evaluate. That’s not a feedback loop I can iterate against.
So I built a much smaller suite at ~/code/pitest/ that keeps the SWE-bench shape but takes minutes instead of hours. Three Python problems, each one a deliberately broken codebase with a frozen test suite:
ProblemTestsStarter baselineDifficultyexpr_calc1511/15Easy. 3 surface bugs, 3 files.mini_sheet2412/24Hard. 5 layered bugs, 6 files.mini_stackvm(similar shape)variesMedium. Stack VM with bytecode bugs.
Each problem has a PROBLEM.md that the agent reads, a <name>/ directory with the buggy code, a tests/ dir with the frozen test file, and a sibling <name>.reference/ directory containing the human-written fix that the agent never sees. The harness is a single shell script:
./run_bench.sh mini_sheet pi --print
It copies the problem into a fresh /tmp/pitest-<problem>-XXXX/, runs the agent with that as cwd and PROBLEM.md as the prompt, then runs pytest in the workdir and reports passed/total. Pass/fail is binary, the score is unambiguous, and the original problem dir is never modified, so re-runs always start from the same baseline.
Why this shape:
Fast feedback. A single problem runs in 2 to 5 minutes including agent time. The full suite runs in roughly 10 to 15 minutes. A config change on the llama-server side can be evaluated in one coffee break instead of one workday.
Real-shape tasks. The agent has to read multiple files, identify bugs that span them, and produce edits that pass a test suite it didn’t write. Same loop as SWE-bench, just compressed.
Different agents, same harness. The runner takes any command line that accepts a final prompt argument, so I can run
pi,claude,goose, oropencodeagainst the exact same problem state and compare directly.
The intent was never to compete with SWE-bench on coverage. It was to have a tight local loop so that tuning the inference server had a feedback signal beyond “feels faster,” and so that comparing different harnesses against the same model could be a few-minutes thing rather than a few-hours thing.
And then, while running these tests for the first few times, I noticed something off. The model produced reasonable output. But it took a strangely long time to start producing output, especially on the second and third problems in a row. That’s the bit I expected to take 30 seconds and was instead taking two minutes. Which is what the rest of this post is about.
The two-minute wait
The annoying behavior went like this. Every new pi conversation would sit there saying “working” for two minutes before the model started ripping out tokens. Once warmed up, it would fly through 15 minutes of agent work without a hitch. Then I’d ask a follow-up, and again, two minutes of nothing.
First instinct was prompt caching. Pi sends a substantial system prompt. If the prefix isn’t cached on the server, llama.cpp has to do a full prefill before generating the first token. A 10k-token system prompt at roughly 600 tok/s prefill is about 17 seconds. But this was 120 seconds. Something else was going on.
I checked the GPU power state:
$ nvidia-smi --query-gpu=pstate,clocks.current.graphics,power.draw --format=csv
P8, 210 MHz, 30.59 W
P8 is the deepest idle state. Persistence mode was off. So I enabled it:
sudo nvidia-smi -pm 1
This pins the driver in place and avoids context teardown when nothing’s using the GPU. Helpful, but not the answer. Clock-up from P8 to P0 is sub-second. Two minutes was something else.
The wait wasn’t a wait, it was prefill
I tailed journalctl --user -u llama-qwen -f while a slow turn was in flight. Empty. llama-server only logs after a request finishes, so nothing for two minutes can mean either “hung” or “actively working.” Quick check on the GPU:
$ nvidia-smi --query-gpu=pstate,utilization.gpu,power.draw,memory.used --format=csv
P2, 77 %, 153.06 W, 21116 MiB
P2, 77% utilization, 153W. Definitely working. So the “two-minute wait” was llama.cpp grinding through prompt prefill. The question became: why is it re-prefilling so much each turn?
Four slots, one user, lots of cache misses
llama-server has a /slots endpoint that exposes per-slot state. I hit it:
$ curl -s http://127.0.0.1:8080/slots | jq '.[] | {id, is_processing, n_ctx, n_decoded: .next_token[0].n_decoded}'
{ "id": 0, "is_processing": false, "n_ctx": 262144, "n_decoded": 283 }
{ "id": 1, "is_processing": true, "n_ctx": 262144, "n_decoded": 425 }
{ "id": 2, "is_processing": false, "n_ctx": 262144, "n_decoded": 567 }
{ "id": 3, "is_processing": false, "n_ctx": 262144, "n_decoded": 804 }
Four slots. I’d assumed one. The server was launched without -np, but a recent llama.cpp default is -np -1 (auto), which spun up 4 slots from my context budget. With -c 1048576 that’s 256k tokens per slot.
Each slot has its own KV cache region. The default slot picker is roughly LRU with a --slot-prompt-similarity threshold (default 0.5). When pi sent its first turn, it landed in slot N. The 15 minutes of agent work kept landing in N because nothing else was hitting the server, so the KV stayed warm. But by the time I sent a follow-up minutes later, other things had touched the server (other pi sessions), and pi’s request got routed to a different slot whose cache didn’t match. That slot then had to re-prefill the entire conversation from scratch. Two minutes of grinding.
-cmoe was making everything slower than it should be
While digging through the running command, I noticed -cmoe was set:
$ ps -ef | grep llama-server
... -ngl 99 -cmoe -c 1048576 --rope-scaling yarn --rope-scale 4 ...
-cmoe (or --n-cpu-moe) offloads the MoE expert tensors to CPU RAM, keeping only the shared and attention layers on the GPU. The trade is: you can hold a much bigger context (KV cache eats VRAM, experts move to host) at the cost of every token round-tripping the experts across PCIe.
For a 3B-active MoE that’s a brutal trade. Decode tanked from a theoretical 200 tok/s to roughly 30 to 50 tok/s. And the only reason -cmoe was on was to fit the 1M-token context budget. Which I wasn’t actually using.
The fix, day 1
The obvious move: drop the over-provisioned context, drop -cmoe, drop multi-slot, and let the model actually breathe on the GPU.
[Service]
ExecStart=/home/sachin/code/llama.cpp/build/bin/llama-server \
-m /home/sachin/models/qwen3.6-35b-a3b/Qwen_Qwen3.6-35B-A3B-IQ4_XS.gguf \
-ngl 99 \
-c 262144 \
-np 1 \
-fa on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--jinja \
--host 127.0.0.1 --port 8080
Native 256k context (no YaRN scaling needed), one slot, all experts resident on GPU. Result:
VRAM: 22.8 GB / 24 GB used
Decode: ~150 to 200 tok/s (single slot, warm cache)
Prefill: ~600+ tok/s
Cache: never invalidated. Single slot, single user.
First turn pays a normal one-time prefill of pi’s system prompt (around 7 to 12 seconds). Every turn after that is near-instant because the prefix is already cached. Problem solved.
Day 2: but what if I want multiple sessions?
The -np 1 config is great as long as I’m running one pi session at a time. The moment two pi sessions overlap, they fight over the single slot and we’re back to cache thrash.
I went looking for a way to share the KV cache dynamically across sessions. Let one session use the whole pool when alone, and split flexibly when others arrive.
It exists. It’s actually been the default behavior since July 2025 if you don’t pass -np at all:
--kv-unified, -kvu use single unified KV buffer shared across all sequences
(default: enabled if number of slots is auto)
--cache-idle-slots evict idle slots' KV to host RAM
--cache-ram N host-RAM cache size (MiB) for evicted slots
--slot-prompt-similarity SIMILARITY prefer slot with most matching prefix
--kv-unified replaces the old fixed c/N per-slot allocation with a single shared pool. Idle slots cost zero KV. One active session can use the whole pool. If three sessions are active, they split the pool by actual demand instead of a static carve-up.
--cache-idle-slots plus --cache-ram adds a tier below that: when active sessions need more KV than the pool has free, idle sessions’ KV gets paged out to host RAM (16 GB cache in my setup). When the idle session next sends a request, its KV reloads from RAM, which is much faster than re-prefilling from scratch.
--slot-prompt-similarity 0.9 makes the slot picker route requests to whichever slot has the longest matching prefix, instead of LRU. This is what keeps each pi session pinned to “its” slot across turns, preserving cache continuity.
Updated unit:
ExecStart=/home/sachin/code/llama.cpp/build/bin/llama-server \
-m /home/sachin/models/qwen3.6-35b-a3b-unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
-ngl 99 \
-c 262144 \
-fa on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--kv-unified \
--cache-idle-slots \
--cache-ram 16384 \
--slot-prompt-similarity 0.9 \
--jinja \
--host 127.0.0.1 --port 8080
Important nuance: unified KV doesn’t increase the total budget. -c 262144 still preallocates 262k tokens of KV in VRAM up front. What unified gives you is flexible distribution of that budget, not 4x capacity from the same memory.
Day 2 bonus: switching to Unsloth’s Dynamic 2.0 quant
While I was poking around, I checked Unsloth’s GGUF release for the same model. Their entire repo is “UD-” prefixed (Unsloth Dynamic 2.0), which typically beats Bartowski’s static quants at the same bit budget.
The drop-in candidate: UD-IQ4_XS at 17.73 GB versus my Bartowski IQ4_XS at around 18 GB. Same bit budget, same VRAM, ostensibly better quality.
Pointed -m at the new file, restarted, verified via /props that the new model loaded:
VRAM: 22.4 GB / 24 GB (~700 MB less than Bartowski's IQ4_XS)
Speed: identical (same architecture, same active params)
Free quality bump and a little VRAM headroom. No reason not to.
What I learned about TurboQuant (and why I’m not using it yet)
A natural question along the way: Google released TurboQuant (Walsh-Hadamard rotations plus Lloyd-Max codebooks plus QJL), which beats traditional quants at low bit-counts. Is it in llama.cpp?
Short answer: a precursor is, full TurboQuant isn’t.
PR #21038 (around April 2026) merged a simpler Hadamard-rotation step that runs before existing q4_0, q4_1, q5_0, q5_1, and q8_0 KV quantization. The headline number: Q4_0 KV jumped from 2.0% to 21.7% on AIME25. It made low-bit KV cache actually usable.
Full TurboQuant types (
TQ3_1S,TQ4_S, etc.) live only in community forks. Tracking issue #20977.vLLM has TurboQuant merged (PR #38479), but recommends FP8 or NVFP4 instead because (a) the model ecosystem doesn’t yet ship TurboQuant-quantized weights and (b) FP8 and NVFP4 are much faster on Hopper and Blackwell. My 4090 is Ada. FP8 is emulated, NVFP4 isn’t supported. So vLLM’s actual win path doesn’t apply to my hardware.
For now, the conservative move is: stay on llama.cpp, use q8_0 KV (no rotation needed at 8-bit), and watch for upstream TurboQuant merges. The next interesting experiment is --cache-type-k q4_0 --cache-type-v q4_0, which would halve KV memory and unlock -c 524288 (a 512k unified pool). Post-#21038, q4 KV is finally trustworthy for that.
Next stop: RTX 5090
The next concrete upgrade is a 5090. Specs that matter for this workload:
RTX 4090 (current)RTX 5090 (next)VRAM24 GB GDDR6X32 GB GDDR7Memory bandwidth1008 GB/s~1792 GB/sArchitectureAda (sm_89)Blackwell (sm_120)FP4 nativeemulated onlynative (NVFP4)Tensor cores4th gen5th genApprox. pricen/a~$2k MSRP
What this unlocks for the local setup:
+8 GB VRAM is genuinely meaningful here. Today the 4090 is at 22.4 GB / 24 GB, leaving 1.6 GB headroom, and that ceiling drives a lot of the trade-offs above (forced into IQ4_XS, 256k context, q8_0 KV). On a 5090 the same model leaves around 14 GB free, which means: bump the quant up to UD-Q5_K_S or UD-Q5_K_XL for noticeably better quality, and push to a 512k or 1M context unified pool, and run multiple concurrent pi sessions without thrashing.
Roughly 1.78x memory bandwidth is the actual decode speedup. MoE A3B is bandwidth-bound, not compute-bound. Every token reads about 1.5 GB of active weights. At 1.79 TB/s that’s around 1200 tok/s theoretical, 300 to 400 tok/s realistic. Roughly double current speeds.
Native NVFP4 is the bigger story. Once vLLM’s NVFP4 path is working with Qwen-class MoE weights, a Blackwell card can run those quants at full hardware speed instead of emulation. That’s the door into vLLM and SGLang’s actually-recommended quant path, which isn’t open on Ada. It also means full TurboQuant support, when upstream llama.cpp lands it, will run efficiently on the 5090 instead of being a research toy.
Headroom for a bigger model. A 70B-class dense model at Q4 is about 38 GB, too big for one 5090, but a 35B-class model at Q6 fits comfortably, and any future Qwen 3.7, DeepSeek-V4, or Gemma 4 model in the 30 to 50B range will land on a 5090 with room for serious context.
The 5090’s existence is what makes “stop sending most of my coding traffic to Anthropic” go from “experiment” to “default workflow.” The 4090 setup proves the software side works. The 5090 takes the rough edges off.
Settings that ended up mattering
In rough order of impact:
Don’t use
-cmoeunless you have to. It pushes MoE expert tensors to CPU RAM, and every token round-trips PCIe. Decode tanks. Use it only if VRAM forces your hand.Single slot with no other tenants is the easiest fast path.
-np 1guarantees zero cache invalidation if you’re the only user. The two-minute prefill problem disappears.For multi-tenant on llama-server, use
--kv-unifiedplus--slot-prompt-similarity 0.9. The unified pool flexes to demand, and similarity routing keeps each session’s cache intact across turns.--cache-idle-slotsis a free upgrade if you have host RAM to spare. Idle sessions evict to RAM, active sessions get the pool, and resumes are fast.Native context length beats YaRN-extended. If the model’s native window is enough, avoid rope scaling. It costs accuracy and uses the same KV memory.
Persistence mode on (
sudo nvidia-smi -pm 1) is a small win that survives until reboot. Worth doing once.Unsloth Dynamic quants are a free swap from Bartowski static quants at the same file size. Drop-in.
What’s next
Two parallel threads.
More models. Qwen3.6-35B-A3B is a strong baseline but it’s not the only credible local coding model. Top of the queue: Gemma 4. Google’s coding evals on the late-Gemma-3 line were already competitive with the frontier-mini tier, and Gemma 4 is expected to push further. Past that, DeepSeek-V4 distills and any future Qwen 3.7 or Llama 4.x release are worth benchmarking head-to-head on the actual workload (not synthetic evals, real pi sessions on real codebases).
More harnesses. pi works well, but it’s one client. Worth seeing how the local server holds up under different agent loops:
Goose (block.xyz). Open-source agent harness with a different conversation and tool loop than pi. Good test for prompt-caching behavior across diverse client styles.
Opencode. Terminal coding agent in the Claude Code style, but designed for any OpenAI-compatible backend. Closest workflow analog to my actual Claude Code use, which is the right stress test for “can this replace the easy 70%.”
The endgame: a routing setup where most agent traffic (refactors, test fixes, search and summarize, build babysitting) stays local on the 5090, and only the genuinely hard problems escalate to Claude or Codex. Two days of llama.cpp tuning got me to where the local model is fast enough that this is now a software-and-evals problem, not a hardware problem. The next batch of work is figuring out which model on which harness actually crosses the quality bar for which class of task.
The thing that surprised me most: this whole investigation was caused by one flag (-cmoe) that someone, probably past me, added months ago to “fit a 1M context,” and never revisited. The 1M context I was paying for nightly was being used for, on average, conversations under 30k tokens. The right ceiling was always 256k natively, and removing the workaround was the actual fix.
The general lesson: when an inference setup feels mysteriously slow, measure prefill, check /slots, and audit the flags you accumulated when you were optimizing for a different problem. The slowness is rarely magic.
Coda, and the irony
Worth naming the obvious thing: this post was written by Claude. The same Claude I’m setting up the local stack to replace for most of my daily traffic. The collaboration was genuinely useful for getting the system tuned and writing this up, and that’s also the point. The frontier models are excellent partners and the local ones are catching up fast enough that “default to remote” is no longer the unexamined choice.
I think there’s a real direction change underway. Three things look likely over the next year or two:
API access keeps getting cheaper. Frontier providers are in a price war as fast as they’re in a capabilities race. Per-token costs for a given quality level are dropping every quarter.
Open-weight models keep getting better. The gap between best-open and best-frontier on real coding work has gone from “embarrassing” to “noticeable” to “context-dependent” in 18 months. The trajectory points at “frontier-equivalent for most tasks” being a 2026 to 2027 reality at the open tier.
Capable inference hardware keeps commoditizing. A 5090 today is a 4090 plus a meaningful step. A DGX Spark at $3.5k is a credible alternative to a workstation GPU for memory-bound MoE work. Apple’s unified-memory boxes are pushing the same direction from a different angle.
Which leaves the open question: do frontier model gains keep justifying datacenter-scale deployments? If yes, we get superhuman models running in compute clusters that the rest of us rent through APIs. Useful, but with the centralization concerns intact. If the curve flattens and open-weight 30 to 70B-class models hit “good enough for 95% of real work” on commodity hardware, the economic case for ever-larger frontier deployments narrows to research, hard reasoning, and specialty domains.
Hopefully yes, because the genuinely hard problems benefit from genuinely better models. Hopefully no, because I’d rather not have my coding workflow, or anyone else’s, locked into a handful of providers indefinitely. The honest answer is probably “both, in different proportions for different tasks,” and the tuning work documented above is a small bet that “local good enough” is closer than it looks.
That’s worth two days of debugging.


