Apple Just Rebuilt the Local AI Stack — and I Can't Wait to Try It

I've spent the last month going deeper on local AI — running models in Ollama, pointing Claude Code at a local Qwen, building a tiny coding agent to see how small models handle tool calls. My conclusion so far has been: local is great for the cheap, private inner loop, and Apple silicon's unified memory is quietly the best budget hardware for it.

Then WWDC 2026 happened this week, and Apple — of all companies — announced the most interesting local-AI news of the year. Not Siri (fine, Siri too), but a stack of developer announcements aimed squarely at people like me: Core AI, an open-sourced MLXLanguageModel, multi-Mac training over Thunderbolt, and a command-line tool that gives you free LLM calls on macOS. The developer betas just dropped; the public releases land with the OS 27 cycle this fall.

I haven't had hands-on time with any of it yet — this is fresh off the keynote and the State of the Union. But I've read everything Apple published, and I want to walk through what each piece is, why it has me excited, and — keeping my own rule about honesty — what it probably doesn't change.

The map: four different things Apple announced

Apple's naming makes this genuinely confusing, so here's the cheat sheet I wish the keynote had shown:

Piece	What it is	Whose models run on it
Foundation Models framework	Swift API for calling LLMs from apps	Apple's models — plus anything via the new protocol
Apple Foundation Models	Apple's actual LLMs (small on-device + bigger server one)	Apple's, closed weights
Core AI	New OS framework for running custom models on-device	Yours — converted from PyTorch
MLX	Apple's open-source ML framework (their PyTorch-alike)	Anything open-weight — Llama, Qwen, Mistral…

The first two are Apple's models in Apple's API. The last two are the bring-your-own-model story — and that's where my setup lives.

Core AI: what Ollama does, as an OS feature

Core AI is a brand-new framework whose pitch sounds like it was written for the local-AI crowd: a memory-safe Swift API to load and run AI models entirely on-device, with — Apple's exact words — "zero server dependencies and zero token costs."

The parts that made me sit up:

Ahead-of-time compilation. Models are specialized for the exact chip they land on, compiled in advance so a multi-gigabyte model loads fast. This is the same problem llama.cpp solves with GGUF quantization and Metal offload — now handled by the OS.
Python tools to convert PyTorch models to Apple silicon. The on-ramp from the open-model world is official.
Fine-grained inference memory control, zero-copy data paths, stateful execution. "Stateful execution" is the KV-cache story — the thing that makes long chat sessions usable. These are exactly the knobs you currently tune via Ollama's environment variables and modelfiles.
Apple says Core AI is what powers the new Siri. They're not asking developers to use something they don't ship on themselves.

If you've read my Claude Code + Ollama post, you know the appeal of local: no API key, no per-token bill, no data leaving the machine. Core AI is Apple institutionalizing that exact value proposition — "zero server dependencies and zero token costs" could be Ollama's tagline.

The LanguageModel protocol: the adapter I've been faking with env vars

The Foundation Models framework now sits behind a LanguageModel protocol — any conforming model can back a session. Apple open-sourced two implementations that matter enormously for local:

CoreAILanguageModel — your own converted models through Core AI
MLXLanguageModel — open-weight models from the Hugging Face MLX community, running on your Mac's GPU and Neural Engine

Which means this is now a real, supported pattern in a shipping app:

// Local Qwen on the Mac's GPU — same API as Apple's models, Claude, or Gemini
let localModel = MLXLanguageModel(/* e.g. a Qwen or Llama from the MLX community */)
let session = LanguageModelSession(model: localModel)

let response = try await session.respond {
    "Refactor this function to use the retry pattern."
}

Step back and look at what that enables: one Swift AI feature that runs on a free Apple model for most users, a local open-weight model for the privacy-obsessed, and Claude for the power tier — with a one-line swap. In the Ollama world we approximate this with the OpenAI-compatible API and base-URL overrides (I wrote a whole post about the env-var traps). Apple just made it a typed protocol.

And the hybrid setup I keep advocating — local model for the cheap inner loop, frontier model for the heavy lifting — is now expressible inside one session via the new Dynamic Profiles API: run analysis mode on-device, flip to a reasoning-enabled cloud model for the hard step, flip back.

MLX grew up: Metal 4 and Macs chained over Thunderbolt

MLX — Apple's open-source ML framework, the reason "just buy a Mac with more unified memory" became local-AI advice — got two big upgrades:

Metal 4 support with GPU Neural Accelerators — more inference and training speed from the same hardware.
Distributed training across multiple Macs via RDMA over Thunderbolt. Read that again: you can now chain Mac minis into a little training/fine-tuning cluster over a Thunderbolt cable, with direct memory access between machines.

Let me be upfront about the hardware tier I actually care about, because most local-AI coverage drifts straight into $4,000-workstation territory. I'm talking about the machines normal developers already own: a Mac mini or an M-series MacBook. That class of machine — 16 to 32GB of unified memory — comfortably runs the 7B–14B models I've been using with Ollama, and a higher-RAM mini stretches into the ~30B class with quantization. For anything genuinely big — 70B and up — the honest answer for most of us isn't more hardware, it's a hosted model behind an API. Local for the small stuff, server for the heavy stuff; that's the same hybrid split I keep landing on.

Which is exactly why the RDMA news caught my eye: the scaling story is chaining Macs, and two Mac minis over a Thunderbolt cable is a very different proposition than one giant workstation. Whether that's actually practical at my tier — or just a cool demo — goes on the try-list. But it's the first time "scale up local AI" has looked like add a small, cheap box instead of buy a server GPU.

(I'm not saying I'm buying a second Mac mini. I'm saying I've opened the tab.)

The sleeper hit: `fm` — free LLM calls in your shell

macOS 27 ships an fm CLI and a Python SDK that talk to the on-device Apple model and Private Cloud Compute directly. No API key, no bill.

My scripts folder is full of Python that calls the Anthropic API for small jobs — summarize this, classify that, draft a title. A lot of those calls don't need a frontier model; they need a model. A built-in, $0, private one that's just fm in a pipeline is exactly the kind of boring infrastructure that quietly changes habits. My blog's news-to-drafts pipeline is the first thing I plan to point at it once macOS 27 lands on my machine.

The honest part: what this probably does NOT change

Rule of this blog: no hype without the catch. Three catches I can already see from the announcements alone.

1. Apple's own models are still closed. Same verdict I gave Microsoft's MAI lineup at Build: there is no ollama pull apple-foundation-model, and there won't be. The Foundation Models framework is going open source this summer — the framework, not the weights. Apple's small model (a ~3B-class model, per third-party analysis — Apple doesn't publish the size) is callable only through their API on their devices.

2. Foundation Models doesn't replace your Ollama setup. Apple's on-device model is built for app features — structured output, tool calls, short contexts — not for "chat with a coding model for an hour." For serious local coding assistance, you'll still be running the biggest open-weight model your RAM allows in Ollama, LM Studio, or raw MLX. What changed is that shipping a local model inside a Mac/iOS app stopped being exotic.

3. The small-model agentic gap hasn't closed. Everything I wrote about local models botching tool calls still applies — a 3B-class model is even more limited than the 7B–14B class I tested. Apple's mitigation is interesting (constrained, schema-guided output and narrow built-in tools like OCR and Spotlight search instead of free-form agent loops), but physics is physics. Small model, small reasoning.

My try-list for the coming months

Here's the order I plan to work through this as the betas stabilize over the summer:

The fm CLI first — lowest effort, immediately useful for script-glue tasks
An MLX-community Qwen through MLXLanguageModel — I want to compare it against my Ollama baseline on the same hardware
One prototype app feature on the free on-device model — the OCR + Spotlight local-RAG tools look like the most underrated announcement of the week
Core AI's PyTorch conversion tooling — if converting an arbitrary Hugging Face model turns out to be genuinely smooth, this becomes the default way to ship local AI on Apple platforms
Not on the list: retiring Ollama — portability matters; my Ollama setup works the same on Linux, and the moment you leave Apple hardware, this entire stack evaporates

That last point is the strategic catch worth naming: Apple's local-AI story is also a lock-in story. Ollama and llama.cpp run anywhere; Core AI runs on Apple silicon, period. The bet Apple is making — probably correctly — is that for the devices in your pocket and on your desk, "local, private, free, and integrated" beats "portable."

A month ago I wrote that the future of my setup was hybrid: local for the inner loop, frontier API for the hard problems. Nothing in these announcements changes that conclusion — but if they deliver, the local half is about to feel a lot less like a hack.

I'll be working through the try-list above as the betas mature this summer — expect follow-up posts with real numbers. For now: genuinely the most excited I've been about Apple developer news in years.

The map: four different things Apple announced

Apple's naming makes this genuinely confusing, so here's the cheat sheet I wish the keynote had shown:

Piece	What it is	Whose models run on it
Foundation Models framework	Swift API for calling LLMs from apps	Apple's models — plus anything via the new protocol
Apple Foundation Models	Apple's actual LLMs (small on-device + bigger server one)	Apple's, closed weights
Core AI	New OS framework for running custom models on-device	Yours — converted from PyTorch
MLX	Apple's open-source ML framework (their PyTorch-alike)	Anything open-weight — Llama, Qwen, Mistral…

The first two are Apple's models in Apple's API. The last two are the bring-your-own-model story — and that's where my setup lives.

Core AI: what Ollama does, as an OS feature

The parts that made me sit up:

Ahead-of-time compilation. Models are specialized for the exact chip they land on, compiled in advance so a multi-gigabyte model loads fast. This is the same problem llama.cpp solves with GGUF quantization and Metal offload — now handled by the OS.
Python tools to convert PyTorch models to Apple silicon. The on-ramp from the open-model world is official.
Fine-grained inference memory control, zero-copy data paths, stateful execution. "Stateful execution" is the KV-cache story — the thing that makes long chat sessions usable. These are exactly the knobs you currently tune via Ollama's environment variables and modelfiles.
Apple says Core AI is what powers the new Siri. They're not asking developers to use something they don't ship on themselves.

The LanguageModel protocol: the adapter I've been faking with env vars

The Foundation Models framework now sits behind a LanguageModel protocol — any conforming model can back a session. Apple open-sourced two implementations that matter enormously for local:

CoreAILanguageModel — your own converted models through Core AI
MLXLanguageModel — open-weight models from the Hugging Face MLX community, running on your Mac's GPU and Neural Engine

Which means this is now a real, supported pattern in a shipping app:

// Local Qwen on the Mac's GPU — same API as Apple's models, Claude, or Gemini
let localModel = MLXLanguageModel(/* e.g. a Qwen or Llama from the MLX community */)
let session = LanguageModelSession(model: localModel)

let response = try await session.respond {
    "Refactor this function to use the retry pattern."
}

MLX grew up: Metal 4 and Macs chained over Thunderbolt

MLX — Apple's open-source ML framework, the reason "just buy a Mac with more unified memory" became local-AI advice — got two big upgrades:

Metal 4 support with GPU Neural Accelerators — more inference and training speed from the same hardware.
Distributed training across multiple Macs via RDMA over Thunderbolt. Read that again: you can now chain Mac minis into a little training/fine-tuning cluster over a Thunderbolt cable, with direct memory access between machines.

(I'm not saying I'm buying a second Mac mini. I'm saying I've opened the tab.)

The sleeper hit: `fm` — free LLM calls in your shell

macOS 27 ships an fm CLI and a Python SDK that talk to the on-device Apple model and Private Cloud Compute directly. No API key, no bill.

The honest part: what this probably does NOT change

Rule of this blog: no hype without the catch. Three catches I can already see from the announcements alone.

My try-list for the coming months

Here's the order I plan to work through this as the betas stabilize over the summer:

The fm CLI first — lowest effort, immediately useful for script-glue tasks
An MLX-community Qwen through MLXLanguageModel — I want to compare it against my Ollama baseline on the same hardware
One prototype app feature on the free on-device model — the OCR + Spotlight local-RAG tools look like the most underrated announcement of the week
Core AI's PyTorch conversion tooling — if converting an arbitrary Hugging Face model turns out to be genuinely smooth, this becomes the default way to ship local AI on Apple platforms
Not on the list: retiring Ollama — portability matters; my Ollama setup works the same on Linux, and the moment you leave Apple hardware, this entire stack evaporates

Apple Just Rebuilt the Local AI Stack — and I Can't Wait to Try It

The map: four different things Apple announced

Core AI: what Ollama does, as an OS feature

The LanguageModel protocol: the adapter I've been faking with env vars

MLX grew up: Metal 4 and Macs chained over Thunderbolt

The sleeper hit: `fm` — free LLM calls in your shell

The honest part: what this probably does NOT change

My try-list for the coming months

STAY UPDATED

Apple Just Rebuilt the Local AI Stack — and I Can't Wait to Try It

The map: four different things Apple announced

Core AI: what Ollama does, as an OS feature

The LanguageModel protocol: the adapter I've been faking with env vars

MLX grew up: Metal 4 and Macs chained over Thunderbolt

The sleeper hit: `fm` — free LLM calls in your shell

The honest part: what this probably does NOT change

My try-list for the coming months

STAY UPDATED

The map: four different things Apple announced

Core AI: what Ollama does, as an OS feature

The LanguageModel protocol: the adapter I've been faking with env vars

MLX grew up: Metal 4 and Macs chained over Thunderbolt

The sleeper hit: fm — free LLM calls in your shell

The honest part: what this probably does NOT change

My try-list for the coming months

STAY UPDATED

The map: four different things Apple announced

Core AI: what Ollama does, as an OS feature

The LanguageModel protocol: the adapter I've been faking with env vars

MLX grew up: Metal 4 and Macs chained over Thunderbolt

The sleeper hit: fm — free LLM calls in your shell

The honest part: what this probably does NOT change

My try-list for the coming months

STAY UPDATED

The sleeper hit: `fm` — free LLM calls in your shell

The sleeper hit: `fm` — free LLM calls in your shell