On-Device LLMs in 2026: What Actually Runs Well on Your Hardware
Real numbers for running LLMs locally in 2026: which models to pick, tokens per second on Apple Silicon and iPhone, quantization tradeoffs, and the runtime stack we ship in production.
If you want to run an LLM on your own hardware in June 2026, here's the short version: a quantized 4B to 12B model runs comfortably on any Apple Silicon Mac or mid-range GPU, mixture-of-experts models like Qwen3.6 give you near-frontier quality at laptop speeds, and the right runtime matters as much as the right model. On a base M4 you'll see roughly 24 tokens per second from a 7B model at Q4 quantization. On an M4 Max, over 80. On an iPhone 17 Pro, small 1B to 2B models stream at 55 to 70 tokens per second, which is faster than most people read.
We build on-device AI for a living at DevX Group. Parlin, our Mac transcription app, runs Whisper and an LLM cleanup pass entirely locally. Audio never touches a server. That experience shaped a strong opinion: for most product features, local models stopped being a compromise sometime in the last year. This post covers what we'd pick today, with real numbers.
Why local, in one paragraph
Three reasons keep coming up with clients. Privacy: surveys consistently put data security at the top of enterprise concerns when adopting LLMs, with around 31% of enterprises ranking it the #1 factor in provider choice. Cost: a local model has zero per-token cost, which changes the math for high-volume features like transcript cleanup or document summarization. And latency: no network round-trip means a model that starts streaming in tens of milliseconds. The trade is capability. A 7B local model is not Claude or GPT-5. The craft is knowing which jobs it's good enough for.
The models worth your attention right now
Qwen3.6-35B-A3B is the one that changed our defaults. It's a mixture-of-experts model with 35B total parameters but only 3B active per token, so it generates at small-model speed while scoring 86.0 on GPQA Diamond (vendor-reported). If your machine has the memory to hold it, this is the best quality-per-token-second available locally today.
Gemma 4 (April 2026) is Google's current open family, and the 12B released June 3 is notable for running multimodal (text, image, audio) in 16GB of unified memory. The tiny E2B variant runs in about 2GB, which is what you reach for on phones. Google also ships official Q4 quantization-aware checkpoints, so the quantized versions lose less quality than naive conversions.
Hermes 4.3 (Nous Research, 36B) is the pick when you need a model that follows instructions without arguing. It leads on RefusalBench at 74.6% and produces schema-faithful JSON, which matters more than benchmark trivia when you're parsing model output in production. It supports up to 512K context and ships in GGUF sizes that fit consumer GPUs.
Phi-4-mini (3.8B) punches way above its size on math and code, matching Llama 3.1 8B on MMLU at half the footprint. It's our default for edge devices that can't fit more.
Apple Foundation Models deserve their own mention. At WWDC on June 8, Apple announced its third-generation models, headlined by a 20B sparse on-device model that activates only 1 to 4B parameters per request and runs on devices with 12GB of RAM, shipping with iOS 27 this fall. The current framework already gives every iOS and macOS app free access to a ~3B system model with no API key and no download. If you're building for Apple platforms, start there before bundling your own weights.
Real performance numbers
Token generation speed on Apple Silicon is bound by memory bandwidth, not compute. From the llama.cpp benchmark data for a 7B model at Q4:
| Hardware | Tokens/sec (7B, Q4) |
|---|---|
| M4 (base) | ~24 |
| M4 Pro | ~51 |
| M4 Max | ~83 |
| RTX 4090 (8B) | ~95 to 110 |
| iPhone 17 Pro (1B to 2B models) | ~55 to 70 |
For reference, comfortable reading speed is around 5 tokens per second, so even a base M4 streams text three to four times faster than anyone reads it.
On quantization: the practical rule is that dropping from Q8 to Q4_K_M costs roughly 2% in quality for about 40% less memory. The real quality cliff sits between Q3 and Q4, not Q4 and Q8. We ship Q4_K_M for chat-style features and move to Q8 only for long chains of reasoning, where quantization error compounds.
The stack we actually recommend
The local ecosystem has settled into clear layers. Ollama and LM Studio are the friendly front doors. llama.cpp and Apple's MLX are the engines underneath. Adoption is no longer niche: Ollama reportedly passed 50 million monthly downloads in early 2026.
- Mac desktop app: embed MLX (or llama.cpp with Metal). MLX is typically 10 to 25% faster on M-series chips for sub-14B models. This is what we do in Parlin, with Ollama as an optional bring-your-own-model path.
- iPhone app: use the Foundation Models framework first. It's free, Swift-native, and there's nothing to download. Reach for MLX-Swift or ExecuTorch only when you need a specific open model.
- Web app: either talk to a local Ollama server through its OpenAI-compatible API, or run the model in the browser with WebLLM over WebGPU, which now works in the large majority of browsers.
When we still use cloud models
Local-first doesn't mean local-only. We route to cloud models (Claude, Gemini) when the task needs frontier reasoning, when context windows exceed what fits in RAM, or when the client's traffic pattern makes per-token pricing cheaper than provisioning hardware. The pattern we ship is a provider toggle: local by default, cloud as opt-in, with a clear line in the UI that says your data stays on your machine when using local models. If you're weighing this architecture for a product, that's exactly the kind of build we do.
FAQ
What's the best local LLM in 2026?
For most hardware, Qwen3.6-35B-A3B if you have 24GB+ of memory, Gemma 4 12B for a 16GB machine, and Phi-4-mini or Gemma 4 E2B below that. For agent-style tool use with strict JSON output, Hermes 4.3.
How much RAM do I need to run a local LLM?
At Q4 quantization, budget roughly 0.6GB per billion parameters plus overhead: a 7B model wants 6 to 8GB free, a 12B model wants about 10GB, and a 35B MoE like Qwen3.6 wants 24GB+. Unified memory on Apple Silicon counts directly toward this.
Are local LLMs private?
Yes, that's the core advantage. Inference happens entirely on your hardware, so prompts and outputs never leave the machine. You still need to handle what your app does with the text afterward, but there's no third-party model provider in the loop.
Can an iPhone really run an LLM?
Yes. Small 1B to 2B models stream at 55 to 70 tokens per second on an iPhone 17 Pro, and Apple's own ~3B system model is available to every app for free through the Foundation Models framework. The fall 2026 generation raises that to a 20B sparse model on 12GB devices.
Should my product use a local LLM or an API?
Use local when the data is sensitive, the volume is high, or offline matters. Use an API when you need frontier-level reasoning or your usage is too low to justify the integration work. Most of our client builds end up hybrid: local by default, cloud as a user-visible toggle.