★ The Real Battle in Local LLM Setup Isn't the Model

 — 7 minutes read


The question I wanted answered was narrow: with 128 GB of Apple Silicon and one hard rule — fully open, fully local, MLX-native — what is the best agentic-coding stack I can build in May 2026, and what does it cost to keep running?

Short answer: the constraint does most of the design for you, model and backend selection complete in an afternoon, and the real engineering goes into the operations layer. The longer answer is below, with the operational findings packaged as a small open-source tool, 4lm (MIT).

The constraint does most of the work

Open weights, open source, fully local, MLX-native. Each clause removes a category of tooling that dominates the visible market. Open weights rules out the proprietary frontier: Claude, GPT, Gemini, the rest. Open source rules out a handful of MLX-adjacent products with freeware-only or restricted-commercial licences. Fully local rules out hybrid setups that fall back to cloud, which would beg the question. MLX-native rules out containerised runtimes: on macOS, Docker does not expose Metal, and an MLX backend inside a container decodes at roughly half speed. Apple’s own platform punishes you for treating it like Linux.

Once that filter has run, the agent-frontend list collapses to a single entry: OpenCode. Native multi-provider, OpenAI-format first, and the current convergent choice in the open-source agentic-coding community (thoughts.jock.pl carries a contemporary write-up). The backend list collapses to a small set of MLX-native OpenAI-compatible servers, which is where the real selection happens.

Backend: one good answer, briefly revised

In April 2026 the only MLX-native open-source server with per-model tool-call parsing is mlx-openai-server (cubist38), a FastAPI wrapper carrying parsers for qwen3_coder, qwen3_next, glm4_moe, and harmony. On a single afternoon’s evaluation it is the rational pick.

For sustained use it is the wrong pick. Three weeks of running it surfaced failure modes split roughly evenly between MLX itself and the wrapper, and in every case the diagnostic information was sitting in a commit log rather than in any documentation. The replacement, omlx (jundot/omlx), is a vLLM-style MLX runtime with paged KV cache, continuous batching, and a multi-model EnginePool. These three features start to matter the moment a second concurrent client touches the backend. As of May 2026, omlx is the defensible answer.

There is a more general point here. In any domain sitting this close to a moving frontier, the README lies and the commit log does not. Reading commits is part of the job.

Models within the envelope

128 GB excludes most of what made noise in 2026’s open-weights releases. GLM-5 at 744–754B parameters, Kimi K2.6 at 1T total / 32B active, DeepSeek V4-Pro at 1.6T: all of them want ≥256 GB for the aggressive community quantisations, and on 128 GB they fall into mmap-backed disk-offload at below 2 t/s. One agentic iteration takes thirty minutes. There are no Air or Flash variants, and no public roadmap suggesting any are coming. The frontier of open weights has, for now, moved into Mac Studio and M3 Ultra territory.

What fits is the MoE class of 30–80B total parameters with 3–10B active. The architectural reason is simple: decode throughput on Apple Silicon is gated by active-parameter count and memory bandwidth (614 GB/s on the M5 Max), and high sparsity is the only way to use the available bandwidth without exceeding the wired-memory budget.

Among models that fit, Alibaba’s Qwen3 family (Apache 2.0) is the only complete lineup. Coder-Next 80B-A3B (3B active, SWE-Bench Verified 70.6%, 256k context, agentic-trajectory training) handles the build agent; 35B-A3B (3B active, 73.4%, 119 languages, where German vault queries notice the difference) handles plan and chat; Embedding-8B and Reranker-0.6B handle retrieval; VL-8B handles vision. All five share tokeniser and quantisation conventions, run in one omlx EnginePool, and settle at around 65 GB resident. Cross-vendor splits (Qwen for code, GLM-Air for chat, Nomic for embeddings) score comparably per component but multiply the tokeniser, template, and update-cadence surface you have to keep in your head. The operational tax decides it.

Where the local stack stops

A local Qwen3 stack approaches Sonnet 4.5 / Opus 4.6 on a well-defined subset of agentic work, and does not replace them on the complement. It is worth being precise about both halves, because vague claims about local LLMs do most of the damage in this discourse.

The competitive half: single-file edits, multi-turn refactoring inside bounded modules, bash tool use, test generation, small bug fixes with reproducible failure cases, code review, explanation. TTFT under three seconds at a 4k prompt on M5 with Neural Accelerators, which is interactively snappy and the threshold below which you stop reaching for cloud out of impatience.

The other half: multi-file refactors crossing repository boundaries, the harder SWE-Bench Verified cases (local ≈ 73%, Sonnet 4.6 considerably higher), autonomous loops past ~30 tool-use turns where tool-format drift compounds, prompt caching (cache_control has no effect locally), Extended Thinking, multi-hour subagent orchestration. The 95/5 split is not a slogan; it falls out of the architectural limits above.

For me the boundary is in the right place. I do the 95% locally and reach for cloud manually when the other 5% turns up. Automated routers, in my experience, cost tokens and misclassify the precise cases where the route would have paid off.

Where engineering time actually goes

Selection of model and backend completes inside an afternoon. The rest of an honest weekend goes into operations, and three decisions dominate the surface.

Process supervision. Docker is excluded by the constraint, as established. tmux is fragile across reboots and offers no status reporting. The remaining native option is launchd, which is Apple’s answer to systemd and a much older one. Apple’s documentation has not been substantively updated in several macOS releases, and most online tutorials still demonstrate the deprecated load / unload API rather than the current bootstrap / bootout / kickstart; you find this out by reading source.

Four plist defaults bite if you run an MLX backend under launchd. KeepAlive: true restarts on clean exit, which is rarely the intended semantics. The dictionary form { SuccessfulExit: false, Crashed: true } is correct. ThrottleInterval defaults to 10 seconds, and 30 seconds is the right value if you want to avoid a misconfigured daemon saturating a core inside a minute. ExitTimeOut defaults to 20 seconds, which is the window between SIGTERM and SIGKILL; 70+ GB of wired memory does not flush in 20 seconds, and 60 suffices. The fourth default is wired memory itself.

Wired-memory configuration. macOS allocates GPU-accessible memory from a wired pool that defaults to roughly 70% of installed RAM, about 89 GB on a 128 GB system. This is exactly enough for two 8-bit A3B models in steady state, the way a trunk fits exactly two suitcases; it stops being enough the moment the KV cache grows or a third model loads. The correction is iogpu.wired_limit_mb=98304, applied via sysctl and persisted in /etc/sysctl.conf, with a sudoers entry to set it without an interactive password. One line of configuration, and a class of OOM kills you no longer have to diagnose.

Network topology. “Local” is ambiguous as to device. The inference host does not have to be the keyboard. The /v1 seam is reachable from any client on the LAN: opencode on a different laptop, Open WebUI on a different host, plain curl. A common topology is one inference host on mains power (a Mac mini or Studio in a closet) serving several thin clients, including the battery-powered machine that would otherwise spend its day refusing to sleep with 70 GB of wired memory in flight. The backend has no authentication; the network is the perimeter; anything past a trusted LAN goes behind Tailscale.

The tool that came out of it

After three rebuilds of the same operational scaffolding across three profile configurations, the scaffolding became its own project. 4lm (MIT) wraps three MLX-compatible backends (omlx, mlx_lm, ollama) behind one /v1 seam, manages six profile configurations with atomic switching and bounded rollback, enforces the wired-memory limit and newsyslog-based log rotation at install time, and gives you 4lm doctor / 4lm diag when the fans spin up unexpectedly.

Two deployment shapes are supported. The workstation shape (./install.sh && 4lm start && 4lm opencode) gives you Open WebUI on :3000 and opencode in your terminal. The appliance shape (./install.sh --backend-only && 4lm start && 4lm expose lan --confirm) runs the inference host headless and exposes the /v1 seam to the LAN; clients connect at http://<host>:8000. The --confirm flag is deliberate friction, and no environment variable overrides it.

Behind the tool sits a list of ten operational items that no tutorial mentions and every serious stack needs: model pre-pull, HuggingFace cache management, log rotation, dependency pinning, Open WebUI admin-account claim, Tailscale beyond the trusted LAN, WebUI state backup, battery-state interaction with wired memory, per-client MCP configuration. Five are now enforced by install.sh. Four are policy or physics, addressable but not by tooling. One is upstream: per-client MCP duplication will go away the day Open WebUI ships server-side MCP, which is on the 4lm roadmap.

What actually changed in 2026

The thesis fits in one line. Under a hard open-local-MLX constraint, model selection is a short exercise, backend selection has one defensible answer that will shift on a quarterly cadence, and the operational layer is where the engineering time actually goes. What changed in 2026 is that the operational answers are now identifiable in advance and installable in one command.

Set the wired-memory limit, write a proper plist, point the couch laptop at the closet Mac. Then walk away. The day it acts up, and on a six-month cadence it will, the README will be wrong and the merge log won’t.