Are Local Models Good Enough Yet: Two Camps Measuring Two Different Things

Vicki Boykis says local models are good now. A 1,245-point Ask HN thread splits into two camps. Boosters measure whether local open-weight models handle daily coding. Skeptics measure whether they match cloud frontier models on hard tasks. The turning point is not that models suddenly got smart, it is that open weights crossed a usable line and local agent tooling redefined good enough. The builder question: not can it work, but how far apart are success rate, latency, and cost on your actual tasks, and is the gap worth trading privacy and control for.

Are Local Models Good Enough Yet: Two Camps Measuring Two Different Things
Photo / Unsplash

Summary

On June 15, Vicki Boykis published “Running local models is good now.” Her conclusion is blunt: she has used local models since they first shipped, and they are finally good enough that she rarely double-checks them against an API model anymore. The same day, an Ask HN thread, “Has anyone replaced Claude/GPT with a local model for daily coding,” climbed to 1,245 points, and the hundreds of replies split into two camps. One posts hardware, models, and workflows, and says they have cancelled their cloud subscriptions. The other, mostly heavy users, pushes back hard: these small models are nowhere near Opus, stop kidding yourselves.

My read: the two camps are mostly not arguing about the same thing. Boosters measure whether open-weight models plus a mature local agent workflow are good enough for daily coding tasks. Skeptics measure whether local models match cloud frontier capability on complex work. Both are right. They are measuring different quantities. What matters for builders is not who won, but the turning point this fight exposes: capability is no longer the only bottleneck.

The debate

On the surface the question is one sentence: are local models good enough. Pull it apart, though, and “good enough” means something completely different in each camp’s mouth.

Vicki Boykis offers an honest definition. Her personal metric is whether she still has to verify output against an API model, and gemma-4-26b-a4b was the first one where she started doing that a lot less often. Her task list is specific too: refactoring a notebook script into a five or six module repo, fixing type hints, proofreading blog posts, writing unit tests, bootstrapping a recommender. She adds a key caveat herself: none of these are groundbreaking tasks, and she is not sure the setup is ready for production software development.

On HN, sosodev frames the same point well: the question itself spans a huge spectrum of capabilities. Run an 8B model and expect to one-shot things, you will have a bad time. Run a ~30B model and give it a reasonably scoped, well-defined task, and they do very well. argee’s analogy is vivid: a local model works as a copilot on small to medium tasks where your hands are on the wheel and your eyes are on the road, driving under the speed limit, not one-shotting anything beyond the trivial.

The other camp measures the ceiling. jwr runs qwen3.6-35b-a3b and gemma-4-26b-a4b-qat on an M4 Max and concludes these small models are nowhere near monsters like Opus and Fable. He says a lot of people are deluding themselves: simple cases look plausible, but for solving complex design problems in a large codebase it is not worth it. redox99 goes harder: models you can run at home, like Qwen 35B, are not in the same neighborhood as Opus or GPT 5.5, and the only logical reasons are absolutely requiring privacy, doing it for fun, or niche cases like airplanes.

So this is not an argument about facts. It is two groups arguing from different task distributions, one measuring the floor on median tasks, the other the ceiling on the hard ones.

Who’s right

My take: within their own stated scope, both camps are right. But the boosters caught the change that matters more.

Give the skeptics their due first, because local advocates tend to skip this part. The capability gap is real. user43928, forced to use Qwen 3.6 27b at work, found it next to useless, worse than doing the work by hand, and felt anything below Sonnet was a waste of time. twothreeone, running the same model on a single 3090, says the worst part was unstable output quality: every few minutes you have to ask yourself whether you are holding it wrong or the model is just too stupid, and that context switch is itself a cost. lambda surfaced a hard technical flaw too: many local models were not trained to preserve reasoning across turns, so after a long chain of tool calls they re-process everything each turn, which is slow and eats context. Qwen 3.6 only just gained the option to keep thinking. These are not nitpicks. They are the things that make people quit on a tool entirely after a bad afternoon.

But the boosters caught the variable that matters more: for a large class of common tasks, the usable line has been crossed. The evidence is not a benchmark, it is real behavior change. Kostic wired VSCode to llama.cpp running Qwen 3.6 27B and cancelled a cloud subscription outright. heipei, on a 5090 with a Q6 Qwen 3.6 27b and Pi, now hands it chores like “commit this on a branch, push, create a PR, assign a reviewer.” horsawlarway replaced a $100/month Claude subscription with a five-year-old dual 3090 box. These are not reviewers. They actually stopped paying.

The key detail is that most of them did not switch because the model got smarter. heipei puts it plainly: being local means he never has to think about token pricing, quotas, time of day, or data sensitivity again. That line is the crux of the whole fight. Once capability clears the good-enough line, the decision weight shifts from how strong it is to what you are optimizing for.

Why it matters

This matters for builders because it changes the question you should ask when you choose.

For two years the local-model story was a single capability-chasing curve, and the answer was always “not yet, wait.” Now Boykis gives a number: local agentic loops run at about 75% of frontier accuracy and speed. Kyle Howells gives the hard data on the other axis. Gemma 4 26B-A4B on an M1 Max went from 58.2 tok/s on plain llama.cpp with Metal to 72.2 tok/s with MTP speculative decoding, about 24% faster. He also tested MLX, and counterintuitively the Mac-optimized runtime came in at only 45.8 tok/s, slower than llama.cpp. That hands-on, parameters-fully-published content is exactly what ryandrake complained is missing on HN: most posts say “I use Qwen and get great results” without the quantization, the parameters, or the hardware, so nobody can reproduce it.

Once 75% holds up, the question turns from a binary can-it-work into a continuous cost-benefit trade. There are three accounts to run. First, latency and speed: 72 tok/s is usable but not fast for an agent making many tool calls, and Howells himself calls 58 slow and 72 the point where it becomes usable. Second, money, but count it honestly. mtone, on dual RTX Pro 6000s running DeepSeek V4 Flash at concurrency one, estimates electricity-only cost at roughly $8.65 to $38.88 a month, which looks cheap. But weego’s rebuttal lands: for someone working for themselves, paying $100 a month and amortizing a depreciating asset are the same thing, except the asset adds maintenance and breaks even only after three to five years. Third, the account cloud cannot offer: privacy, control, supply certainty.

So the real signal from this fight is not “local won.” It is that the fulcrum moved. Once capability is good enough, what is left is all the non-capability factors, and those vary by person and task, with no single answer.

What to ignore

First, ignore the all-or-nothing framing. Almost nobody who cancelled a subscription actually went fully local. horsawlarway uses local for personal projects and company-paid Claude for the day job. fortyseven runs local Qwen for personal projects and Claude at work. bluejay2387 does 90% on Qwen and falls back to Codex for the complex work and UI polish. The reality is layered and mixed: local for volume, frontier for the finish. Treating it as a fully-local-or-fully-cloud loyalty test is asking the wrong question.

Second, ignore benchmark claims with no stated conditions. lambda notes on HN that if you believe the benchmarks, Qwen 3.6 35B-A3B already beats Claude 4 Opus, then immediately undercuts himself: open models do some benchmaxxing, bigger models always feel deeper, and this compares today’s local model against a year-old frontier model. The comparison is interesting, but it is not evidence that local has caught up. Claude 4 Opus is last year’s car.

Third, do not let the hardware-price fight pull you off course. The loudest argument under Boykis’s post was not whether the model works, it was whether telling people to buy a 64GB Mac is fair, with one estimate that only the top 10% of global earners can absorb a $2,000 device without strain. That is a real affordability question. It is separate from whether local models are good enough. One is about technical maturity, the other about who can pay. Keep them apart and your judgment stays clean.

One line for builders: stop asking whether local models work. Measure how far apart success rate, latency, and cost are between local and cloud on your actual tasks. Then ask one thing only: is the gap worth trading back for privacy and control. Nobody can answer that for you.

FAQ

Are local models good enough for daily coding now?

It depends on the task class. Vicki Boykis and many HN users report that 30B-class models like Gemma 4 or Qwen 3.6 handle refactors, unit tests, small to medium edits, splitting scripts into modules, and local doc search well enough that some cancelled their cloud subscriptions. But the same threads have heavy users (jwr, redox99) saying that on complex design in large codebases, local models are nowhere near frontier models like Opus. So good enough is not a yes or no question, it is a question about your task distribution.

What coding model runs on an M2 or M4 Mac, and how fast?

Vicki Boykis runs gemma-4-12b-qat and 26B for agentic coding on a 2022 M2 with 64GB at roughly 75% of frontier accuracy and speed. Kyle Howells benchmarked Gemma 4 26B-A4B on an M1 Max 64GB: plain llama.cpp with Metal hit 58.2 tok/s, and adding MTP speculative decoding pushed it to 72.2 tok/s, about 24% faster. HN users run 26B on a 48GB M4 Pro too, but most agree the 30B class is the sweet spot for consumer hardware. Beyond that you hit either slow output or memory limits.

How do you set up a local coding agent on macOS?

Kyle Howells published a reproducible stack: llama.cpp built with Metal as the inference engine, Gemma 4 26B-A4B in GGUF as the main model, a Q8 MTP draft model for speculative decoding, plus the multimodal projector so you can feed it screenshots, with a terminal agent like Pi connected over an OpenAI-compatible API. Vicki Boykis runs a variant with LM Studio serving the model and Pi inside a Docker container limited to bash only. The hard part is the harness config and sandbox, not the model.

Where do local models lose to cloud, and where do they win?

They lose on capability ceiling and on staying coherent over long tasks. HN users report local models pay weak attention to precise instructions in context, drift on complex tasks, and burn tokens spinning. Older versions also re-process reasoning every turn. They win on four things outside raw capability: privacy (code never leaves your machine), no token pricing or quota anxiety, low latency, and supply certainty (access cannot be revoked remotely). People like heipei switched not because the model got smarter, but to stop worrying about those four things.

Sources

  1. Running local models is good now (Vicki Boykis blog) / blog
  2. How to Setup a Local Coding Agent on macOS (Kyle Howells blog) / blog
  3. Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding? (Hacker News) / hn
  4. Running local models is good now (Hacker News discussion) / hn

No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).