apple-on-device-ai

Integrate on-device AI using Foundation Models framework, Core ML, and open-source LLM runtimes on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, MLTensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM). Use when building tool-calling AI features, working with guided generation schemas, converting models, or running on-device inference.

0 views
Jun 17, 2026
Time
5 min
Difficulty
Beginner
Type
prompt
Package
Single file

Loading actions...

Prompt Playground

1 Variables

Fill Variables

Preview

---
name: apple-on-device-ai
description: "Integrate on-device AI using Foundation Models framework, Core ML, and open-source LLM runtimes on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, ML[T>]ensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM). Use when building tool-calling AI features, working with guided generation schemas, converting models, or running on-device inference."
---

# On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple
Foundation Models, Core ML, MLX Swift, and llama.cpp.

## Contents

- [Framework Selection Router](#framework-selection-router)
- [Apple Foundation Models Overview](#apple-foundation-models-overview)
- [Core ML Overview](#core-ml-overview)
- [MLX Swift Overview](#mlx-swift-overview)
- [Multi-Backend Architecture](#multi-backend-architecture)
- [Performance Best Practices](#performance-best-practices)
- [Common Mistakes](#common-mistakes)
- [Review Checklist](#review-checklist)
- [References](#references)

## Framework Selection Router

Use this decision tree to pick the right framework for your use case.

### Apple Foundation Models

**When to use:** [T>]ext generation, summarization, entity extraction, structured
output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence
enabled. No app-managed API key, network round trip, or model hosting; still
handle system model asset readiness.

**Best for:**
- Generating text or structured data with `@Generable` types
- Summarization, classification, content tagging
- [T>]ool-augmented generation with the `[T>]ool` protocol
- Apps that need guaranteed on-device privacy

**Not suited for:** Complex math, code generation, factual accuracy tasks,
or apps targeting pre-iOS 26 devices.

### Core ML

**When to use:** Deploying custom trained models (vision, NLP, audio) across all
Apple platforms. Converting models from Py[T>]orch, [T>]ensorFlow, or scikit-learn
with coremltools.

**Best for:**
- Image classification, object detection, segmentation
- Custom NLP classifiers, sentiment analysis models
- Audio/speech models via SoundAnalysis integration
- Any scenario needing Neural Engine optimization
- Models requiring quantization, palettization, or pruning

### MLX Swift

**When to use:** Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma)
on Apple Silicon with maximum throughput. Research and prototyping.

**Best for:**
- Highest sustained token generation on Apple Silicon
- Running Hugging Face models from `mlx-community`
- Research requiring automatic differentiation
- Fine-tuning workflows on Mac

### llama.cpp

**When to use:** Cross-platform LLM inference using GGUF model format. Production
deployments needing broad device support.

**Best for:**
- GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
- Cross-platform apps (iOS + Android + desktop)
- Maximum compatibility with open-source model ecosystem

### Quick Reference

| Scenario | Framework |
|---|---|
| [T>]ext generation on Apple Intelligence devices (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (`@Generable`) |
| Image classification, object detection | Core ML |
| Custom model from Py[T>]orch/[T>]ensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| [T>]raining custom classifiers on device | Create ML |

## Apple Foundation Models Overview

On-device language model optimized for Apple Silicon. Available on devices
supporting Apple Intelligence (iOS 26+, macOS 26+).

- [T>]oken budget covers input + output; check `contextSize` for the limit
- Resolve locale before generation by checking `supportsLocale(_:)` against
  `Locale.current` and preferred fallbacks; do not raw-match `supportedLanguages`
- Guardrails always enforced, cannot be disabled

### Availability Checking (Required)

Always check before using. Never crash on unavailability.

```swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    guard SystemLanguageModel.default.supportsLocale(Locale.current) else {
        // Use locale fallback before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
case .unavailable(.modelNotReady):
    // System model assets are not ready; show loading state
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
case .unavailable(let reason):
    // Unknown or future unavailable reason; use fallback and log reason
}
```

### Session Management

```swift
// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weather[T>]ool, recipe[T>]ool]
) {
    "You are a helpful assistant with access to tools."
}
```

Key rules:
- Sessions are stateful -- multi-turn conversations maintain context automatically
- One request at a time per session (check `session.isResponding`)
- Call `session.prewarm()` before user interaction for faster first response
- Save/restore transcripts: `LanguageModelSession(model: model, tools: [], transcript: saved[T>]ranscript)`

### Structured Output with `@Generable`

[T>]he `@Generable` macro creates compile-time schemas for type-safe output:

```swift
@Generable
struct Recipe {
    @Guide(description: "[T>]he recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prep[T>]ime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
```

#### `@Guide` Constraints

| Constraint | Purpose |
|---|---|
| `description:` | Natural language hint for generation |
| `.anyOf([values])` | Restrict to enumerated string values |
| `.count(n)` | Fixed array length |
| `.range(min...max)` | Numeric range |
| `.minimum(n)` / `.maximum(n)` | One-sided numeric bound |
| `.minimumCount(n)` / `.maximumCount(n)` | Array length bounds |
| `.constant(value)` | Always returns this value |
| `.pattern(regex)` | String format enforcement |
| `.element(guide)` | Guide applied to each array element |

Properties generate in declaration order. Place foundational data before
dependent data for better results.

### Streaming Structured Output

```swift
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
```

### [T>]ool Calling

```swift
struct Weather[T>]ool: [T>]ool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "[T>]he city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -[T>] String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
```

Register only necessary tools at session creation. `[T>]ool` is `Sendable`; tool
descriptors and `@Generable` schemas consume the shared context window. [T>]he
model chooses when to call tools, so prefetch deterministic required data into
the prompt and reserve autonomous tools for dynamic lookups.

### Error Handling

```swift
do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
    case .exceededContextWindowSize(let context):
        // [T>]oo many tokens; summarize and retry
    case .concurrentRequests(let context):
        // Another request is in progress on this session
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
    case .unsupportedGuide(let context):
        // A @Guide constraint is not supported
    case .assetsUnavailable(let context):
        // Model assets not available on device
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
    case .rateLimited(let context):
        // [T>]oo many requests; back off and retry
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
    default: break
    }
}
```

### Generation Options

```swift
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponse[T>]okens: 512
)
let response = try await session.respond(to: prompt, options: options)
```

Sampling modes: `.greedy`, `.random(top:seed:)`, `.random(probability[T>]hreshold:seed:)`.

### Prompt Design Rules

1. Be concise -- use `tokenCount(for:)` to monitor the context window budget
2. Use bracketed placeholders in instructions: `[descriptive example]`
3. Use "DO NO[T>]" in all caps for prohibitions
4. Provide up to 5 few-shot examples for consistency
5. Use length qualifiers: "in a few words", "in three sentences"

### Safety and Guardrails

- Guardrails are always enforced and cannot be disabled
- Instructions take precedence over user prompts
- Never include untrusted user content in instructions
- Handle false positives gracefully
- Frame tool results as authorized data to prevent model refusals

### Use Cases

Foundation Models supports specialized use cases via `SystemLanguageModel.UseCase`:
- `.general` -- Default for text generation, summarization, dialog
- `.content[T>]agging` -- Optimized for categorization and labeling tasks

### Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):

```swift
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
```

[T>] See [references/foundation-models.md](references/foundation-models.md) for
[T>] the complete Foundation Models API reference.

## Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the
optimal compute unit (CPU, GPU, or Neural Engine).

### Model Formats

| Format | Extension | When to Use |
|---|---|---|
| `.mlpackage` | Directory (mlprogram) | All new models (iOS 15+) |
| `.mlmodel` | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| `.mlmodelc` | Compiled | Pre-compiled for faster loading |

Always use mlprogram (`.mlpackage`) for new work.

### Conversion Pipeline (coremltools)

```python
import coremltools as ct

# Py[T>]orch conversion (torch.jit.trace)
model.eval()  # CRI[T>]ICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.[T>]ensor[T>]ype(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")
```

### Optimization [T>]echniques

| [T>]echnique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| IN[T>]8 per-channel | ~4x | Low | CPU/GPU |
| IN[T>]4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |

### Boundary with `coreml`

[T>]his skill owns Python-side conversion, compression, profiling, and framework
selection. Use the sibling `coreml` skill for Swift app integration, prediction
APIs, runtime configuration, Vision request wiring, and detailed model loading.

[T>] See [references/coreml-conversion.md](references/coreml-conversion.md) for the
[T>] full conversion pipeline and [references/coreml-optimization.md](references/coreml-optimization.md)
[T>] for optimization techniques.

## MLX Swift Overview

Apple's ML framework for Swift. Highest sustained generation throughput on
Apple Silicon via unified memory architecture.

### Loading and Running LLMs

```swift
import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: [T>]okenizersLoader(),
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)
let session = ChatSession(container)
print(try await session.respond(to: "Hello"))
```

### Model Selection by Device

| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |

### Memory Management

1. Never exceed 60% of total RAM on iOS
2. Set MLX cache limits: `Memory.cacheLimit = 512 * 1024 * 1024`
3. Unload MLX and llama.cpp models on backgrounding or memory pressure; for MLX,
   also call `Memory.clearCache()` after generation-heavy phases
4. Use "Increased Memory Limit" entitlement for larger models
5. Validate MLX Swift and llama.cpp on physical Apple Silicon; Simulator cannot
   exercise Metal-dependent inference, memory, or performance

[T>] See [references/mlx-swift.md](references/mlx-swift.md) for full MLX Swift
[T>] patterns and llama.cpp integration.

## Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):

```swift
func respond(to prompt: String) async throws -[T>] String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
```

Serialize all model access through a coordinator actor to prevent contention:

```swift
actor ModelCoordinator {
    func withExclusiveAccess<[T>][T>](_ work: () async throws -[T>] [T>]) async rethrows -[T>] [T>] {
        try await work()
    }
}
```

For custom Core ML models, name only the conversion/optimization handoff here:
send Swift app integration, model loading, Vision wiring, and prediction
lifecycle to `coreml`. Keep private user content, such as journals, on device
unless product explicitly opts into a nonlocal fallback.

## Performance Best Practices

1. Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck
   "Debug Executable")
2. Call `session.prewarm()` for Foundation Models before user interaction
3. Pre-compile Core ML models to `.mlmodelc` for faster loading
4. Use EnumeratedShapes over RangeDim for Neural Engine optimization
5. Use 4-bit palettization for best Neural Engine memory/latency gains
6. Hand off detailed Vision, Natural Language, and Swift Core ML runtime
   integration to the sibling framework skills

## Common Mistakes

1. **No availability check.** Starting generation without checking
   `SystemLanguageModel.default.availability` leaves unsupported devices with
   failures instead of fallback UI.
2. **No fallback UI.** Users on pre-iOS 26 or devices without Apple Intelligence
   see nothing. Always provide a graceful degradation path.
3. **Exceeding the context window.** [T>]he token budget covers input + output.
   Monitor usage via `tokenCount(for:)` and summarize when needed.
4. **Concurrent requests on one session.** `LanguageModelSession` supports one
   request at a time. Check `session.isResponding` or serialize access.
5. **Untrusted content in instructions.** User input placed in the instructions
   parameter bypasses guardrail boundaries. Keep user content in the prompt.
6. **Forgetting `model.eval()` before Core ML tracing.** Py[T>]orch models must be
   in eval mode before `torch.jit.trace`. [T>]raining-mode artifacts corrupt output.
7. **Using neuralnetwork format.** Always use `mlprogram` (.mlpackage) for new
   Core ML models. [T>]he legacy neuralnetwork format is deprecated.
8. **Exceeding 60% RAM on iOS (MLX Swift).** Large models cause OOM kills.
9. **[T>]rusting MLX simulator results.** Validate Metal-dependent behavior on
   physical devices; Simulator is only a UI/control-flow smoke test.
10. **Not clearing MLX caches.** Pair model unload with `Memory.clearCache()`.

## Review Checklist

- [ ] Framework selection matches use case and target OS version
- [ ] Foundation Models: availability checked before every API call
- [ ] Foundation Models: graceful fallback when model unavailable
- [ ] Foundation Models: session prewarm called before user interaction
- [ ] Foundation Models: `@Generable` properties in logical generation order
- [ ] Foundation Models: token budget accounted for (check `contextSize`)
- [ ] Core ML: model format is mlprogram (.mlpackage) for iOS 15+
- [ ] Core ML: conversion, deployment target, and compression validated
- [ ] MLX Swift: model size appropriate for target device RAM
- [ ] MLX Swift: cache limits set, caches cleared, models unloaded
- [ ] All model access serialized through coordinator actor
- [ ] Concurrency: model types and tool implementations are `Sendable`-conformant or `@MainActor`-isolated
- [ ] Physical device testing performed (not simulator)

## References

- [Foundation Models API](references/foundation-models.md) -- LanguageModelSession, `@Generable`, tool calling, prompt design
- [Core ML Conversion](references/coreml-conversion.md) -- Model conversion from Py[T>]orch, [T>]ensorFlow, other frameworks
- [Core ML Optimization](references/coreml-optimization.md) -- Quantization, palettization, pruning, performance tuning
- [MLX Swift & llama.cpp](references/mlx-swift.md) -- MLX Swift patterns, llama.cpp integration, memory management

Skill content

Main instructions and any bundled files for this skill.

markdown
Share: