Bobby Encoded
PostsAbout
PostsAbout

© 2026 Bobby Jose

← Back to Blog

Apple Just Rebuilt the Local AI Stack — and I Can't Wait to Try It

June 10, 2026 · 9 min read

Apple, WWDC, Local AI, Ollama, MLX, Core AI, LLM, Apple Silicon, Developer Tools

I've spent the last month going deeper on local AI — running models in Ollama, pointing Claude Code at a local Qwen, building a tiny coding agent to see how small models handle tool calls. My conclusion so far has been: local is great for the cheap, private inner loop, and Apple silicon's unified memory is quietly the best budget hardware for it.

Then WWDC 2026 happened this week, and Apple — of all companies — announced the most interesting local-AI news of the year. Not Siri (fine, Siri too), but a stack of developer announcements aimed squarely at people like me: Core AI, an open-sourced MLXLanguageModel, multi-Mac training over Thunderbolt, and a command-line tool that gives you free LLM calls on macOS. The developer betas just dropped; the public releases land with the OS 27 cycle this fall.

I haven't had hands-on time with any of it yet — this is fresh off the keynote and the State of the Union. But I've read everything Apple published, and I want to walk through what each piece is, why it has me excited, and — keeping my own rule about honesty — what it probably doesn't change.

The map: four different things Apple announced

Apple's naming makes this genuinely confusing, so here's the cheat sheet I wish the keynote had shown:

PieceWhat it isWhose models run on it
Foundation Models frameworkSwift API for calling LLMs from appsApple's models — plus anything via the new protocol
Apple Foundation ModelsApple's actual LLMs (small on-device + bigger server one)Apple's, closed weights
Core AINew OS framework for running custom models on-deviceYours — converted from PyTorch
MLXApple's open-source ML framework (their PyTorch-alike)Anything open-weight — Llama, Qwen, Mistral…

The first two are Apple's models in Apple's API. The last two are the bring-your-own-model story — and that's where my setup lives.

Core AI: what Ollama does, as an OS feature

Core AI is a brand-new framework whose pitch sounds like it was written for the local-AI crowd: a memory-safe Swift API to load and run AI models entirely on-device, with — Apple's exact words — "zero server dependencies and zero token costs."

The parts that made me sit up:

  • Ahead-of-time compilation. Models are specialized for the exact chip they land on, compiled in advance so a multi-gigabyte model loads fast. This is the same problem llama.cpp solves with GGUF quantization and Metal offload — now handled by the OS.
  • Python tools to convert PyTorch models to Apple silicon. The on-ramp from the open-model world is official.
  • Fine-grained inference memory control, zero-copy data paths, stateful execution. "Stateful execution" is the KV-cache story — the thing that makes long chat sessions usable. These are exactly the knobs you currently tune via Ollama's environment variables and modelfiles.
  • Apple says Core AI is what powers the new Siri. They're not asking developers to use something they don't ship on themselves.

If you've read my Claude Code + Ollama post, you know the appeal of local: no API key, no per-token bill, no data leaving the machine. Core AI is Apple institutionalizing that exact value proposition — "zero server dependencies and zero token costs" could be Ollama's tagline.

The LanguageModel protocol: the adapter I've been faking with env vars

The Foundation Models framework now sits behind a LanguageModel protocol — any conforming model can back a session. Apple open-sourced two implementations that matter enormously for local:

  • CoreAILanguageModel — your own converted models through Core AI
  • MLXLanguageModel — open-weight models from the Hugging Face MLX community, running on your Mac's GPU and Neural Engine

Which means this is now a real, supported pattern in a shipping app:

// Local Qwen on the Mac's GPU — same API as Apple's models, Claude, or Gemini
let localModel = MLXLanguageModel(/* e.g. a Qwen or Llama from the MLX community */)
let session = LanguageModelSession(model: localModel)

let response = try await session.respond {
    "Refactor this function to use the retry pattern."
}

Step back and look at what that enables: one Swift AI feature that runs on a free Apple model for most users, a local open-weight model for the privacy-obsessed, and Claude for the power tier — with a one-line swap. In the Ollama world we approximate this with the OpenAI-compatible API and base-URL overrides (I wrote a whole post about the env-var traps). Apple just made it a typed protocol.

And the hybrid setup I keep advocating — local model for the cheap inner loop, frontier model for the heavy lifting — is now expressible inside one session via the new Dynamic Profiles API: run analysis mode on-device, flip to a reasoning-enabled cloud model for the hard step, flip back.

MLX grew up: Metal 4 and Macs chained over Thunderbolt

MLX — Apple's open-source ML framework, the reason "just buy a Mac with more unified memory" became local-AI advice — got two big upgrades:

  1. Metal 4 support with GPU Neural Accelerators — more inference and training speed from the same hardware.
  2. Distributed training across multiple Macs via RDMA over Thunderbolt. Read that again: you can now chain Mac minis into a little training/fine-tuning cluster over a Thunderbolt cable, with direct memory access between machines.

Let me be upfront about the hardware tier I actually care about, because most local-AI coverage drifts straight into $4,000-workstation territory. I'm talking about the machines normal developers already own: a Mac mini or an M-series MacBook. That class of machine — 16 to 32GB of unified memory — comfortably runs the 7B–14B models I've been using with Ollama, and a higher-RAM mini stretches into the ~30B class with quantization. For anything genuinely big — 70B and up — the honest answer for most of us isn't more hardware, it's a hosted model behind an API. Local for the small stuff, server for the heavy stuff; that's the same hybrid split I keep landing on.

Which is exactly why the RDMA news caught my eye: the scaling story is chaining Macs, and two Mac minis over a Thunderbolt cable is a very different proposition than one giant workstation. Whether that's actually practical at my tier — or just a cool demo — goes on the try-list. But it's the first time "scale up local AI" has looked like add a small, cheap box instead of buy a server GPU.

(I'm not saying I'm buying a second Mac mini. I'm saying I've opened the tab.)

The sleeper hit: fm — free LLM calls in your shell

macOS 27 ships an fm CLI and a Python SDK that talk to the on-device Apple model and Private Cloud Compute directly. No API key, no bill.

My scripts folder is full of Python that calls the Anthropic API for small jobs — summarize this, classify that, draft a title. A lot of those calls don't need a frontier model; they need a model. A built-in, $0, private one that's just fm in a pipeline is exactly the kind of boring infrastructure that quietly changes habits. My blog's news-to-drafts pipeline is the first thing I plan to point at it once macOS 27 lands on my machine.

The honest part: what this probably does NOT change

Rule of this blog: no hype without the catch. Three catches I can already see from the announcements alone.

1. Apple's own models are still closed. Same verdict I gave Microsoft's MAI lineup at Build: there is no ollama pull apple-foundation-model, and there won't be. The Foundation Models framework is going open source this summer — the framework, not the weights. Apple's small model (a ~3B-class model, per third-party analysis — Apple doesn't publish the size) is callable only through their API on their devices.

2. Foundation Models doesn't replace your Ollama setup. Apple's on-device model is built for app features — structured output, tool calls, short contexts — not for "chat with a coding model for an hour." For serious local coding assistance, you'll still be running the biggest open-weight model your RAM allows in Ollama, LM Studio, or raw MLX. What changed is that shipping a local model inside a Mac/iOS app stopped being exotic.

3. The small-model agentic gap hasn't closed. Everything I wrote about local models botching tool calls still applies — a 3B-class model is even more limited than the 7B–14B class I tested. Apple's mitigation is interesting (constrained, schema-guided output and narrow built-in tools like OCR and Spotlight search instead of free-form agent loops), but physics is physics. Small model, small reasoning.

My try-list for the coming months

Here's the order I plan to work through this as the betas stabilize over the summer:

  1. The fm CLI first — lowest effort, immediately useful for script-glue tasks
  2. An MLX-community Qwen through MLXLanguageModel — I want to compare it against my Ollama baseline on the same hardware
  3. One prototype app feature on the free on-device model — the OCR + Spotlight local-RAG tools look like the most underrated announcement of the week
  4. Core AI's PyTorch conversion tooling — if converting an arbitrary Hugging Face model turns out to be genuinely smooth, this becomes the default way to ship local AI on Apple platforms
  5. Not on the list: retiring Ollama — portability matters; my Ollama setup works the same on Linux, and the moment you leave Apple hardware, this entire stack evaporates

That last point is the strategic catch worth naming: Apple's local-AI story is also a lock-in story. Ollama and llama.cpp run anywhere; Core AI runs on Apple silicon, period. The bet Apple is making — probably correctly — is that for the devices in your pocket and on your desk, "local, private, free, and integrated" beats "portable."

A month ago I wrote that the future of my setup was hybrid: local for the inner loop, frontier API for the hard problems. Nothing in these announcements changes that conclusion — but if they deliver, the local half is about to feel a lot less like a hack.

I'll be working through the try-list above as the betas mature this summer — expect follow-up posts with real numbers. For now: genuinely the most excited I've been about Apple developer news in years.

← Previous

Xcode 27 Ships With Claude Code Inside: Apple's Take on Agentic Coding

Next →

WWDC 2026: Apple Finally Shipped the AI It Promised — A Developer's Breakdown

STAY UPDATED

Get new posts on software engineering and AI in your inbox. No spam, unsubscribe anytime.