apple-on-device-ai

Integrate on-device AI using Foundation Models framework, Core ML, and open-source LLM runtimes on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, MLTensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM). Use when building tool-calling AI features, working with guided generation schemas, converting models, or running on-device inference.

0 views

Jun 17, 2026

Time

5 min

Difficulty

Beginner

Type

prompt

Package

Single file

Loading actions...

Prompt Playground

1 Variables

Fill Variables

Preview

---
name: apple-on-device-ai
description: "Integrate on-device AI using Foundation Models framework, Core ML, and open-source LLM runtimes on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, ML[T>]ensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM). Use when building tool-calling AI features, working with guided generation schemas, converting models, or running on-device inference."
---

# On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple
Foundation Models, Core ML, MLX Swift, and llama.cpp.

## Contents

- [Framework Selection Router](#framework-selection-router)
- [Apple Foundation Models Overview](#apple-foundation-models-overview)
- [Core ML Overview](#core-ml-overview)
- [MLX Swift Overview](#mlx-swift-overview)
- [Multi-Backend Architecture](#multi-backend-architecture)
- [Performance Best Practices](#performance-best-practices)
- [Common Mistakes](#common-mistakes)
- [Review Checklist](#review-checklist)
- [References](#references)

## Framework Selection Router

Use this decision tree to pick the right framework for your use case.

### Apple Foundation Models

**When to use:** [T>]ext generation, summarization, entity extraction, structured
output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence
enabled. No app-managed API key, network round trip, or model hosting; still
handle system model asset readiness.

**Best for:**
- Generating text or structured data with `@Generable` types
- Summarization, classification, content tagging
- [T>]ool-augmented generation with the `[T>]ool` protocol
- Apps that need guaranteed on-device privacy

**Not suited for:** Complex math, code generation, factual accuracy tasks,
or apps targeting pre-iOS 26 devices.

### Core ML

**When to use:** Deploying custom trained models (vision, NLP, audio) across all
Apple platforms. Converting models from Py[T>]orch, [T>]ensorFlow, or scikit-learn
with coremltools.

**Best for:**
- Image classification, object detection, segmentation
- Custom NLP classifiers, sentiment analysis models
- Audio/speech models via SoundAnalysis integration
- Any scenario needing Neural Engine optimization
- Models requiring quantization, palettization, or pruning

### MLX Swift

**When to use:** Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma)
on Apple Silicon with maximum throughput. Research and prototyping.

**Best for:**
- Highest sustained token generation on Apple Silicon
- Running Hugging Face models from `mlx-community`
- Research requiring automatic differentiation
- Fine-tuning workflows on Mac

### llama.cpp

**When to use:** Cross-platform LLM inference using GGUF model format. Production
deployments needing broad device support.

**Best for:**
- GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
- Cross-platform apps (iOS + Android + desktop)
- Maximum compatibility with open-source model ecosystem

### Quick Reference

| Scenario | Framework |
|---|---|
| [T>]ext generation on Apple Intelligence devices (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (`@Generable`) |
| Image classification, object detection | Core ML |
| Custom model from Py[T>]orch/[T>]ensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| [T>]raining custom classifiers on device | Create ML |

## Apple Foundation Models Overview

On-device language model optimized for Apple Silicon. Available on devices
supporting Apple Intelligence (iOS 26+, macOS 26+).

- [T>]oken budget covers input + output; check `contextSize` for the limit
- Resolve locale before generation by checking `supportsLocale(_:)` against
  `Locale.current` and preferred fallbacks; do not raw-match `supportedLanguages`
- Guardrails always enforced, cannot be disabled

### Availability Checking (Required)

Always check before using. Never crash on unavailability.

```swift
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    guard SystemLanguageModel.default.supportsLocale(Locale.current) else {
        // Use locale fallback before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
case .unavailable(.modelNotReady):
    // System model assets are not ready; show loading state
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
case .unavailable(let reason):
    // Unknown or future unavailable reason; use fallback and log reason
}
```

### Session Management

```swift
// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weather[T>]ool, recipe[T>]ool]
) {
    "You are a helpful assistant with access to tools."
}
```

Key rules:
- Sessions are stateful -- multi-turn conversations maintain context automatically
- One request at a time per session (check `session.isResponding`)
- Call `session.prewarm()` before user interaction for faster first response
- Save/restore transcripts: `LanguageModelSession(model: model, tools: [], transcript: saved[T>]ranscript)`

### Structured Output with `@Generable`

[T>]he `@Generable` macro creates compile-time schemas for type-safe output:

```swift
@Generable
struct Recipe {
    @Guide(description: "[T>]he recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prep[T>]ime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)
```

#### `@Guide` Constraints

| Constraint | Purpose |
|---|---|
| `description:` | Natural language hint for generation |
| `.anyOf([values])` | Restrict to enumerated string values |
| `.count(n)` | Fixed array length |
| `.range(min...max)` | Numeric range |
| `.minimum(n)` / `.maximum(n)` | One-sided numeric bound |
| `.minimumCount(n)` / `.maximumCount(n)` | Array length bounds |
| `.constant(value)` | Always returns this value |
| `.pattern(regex)` | String format enforcement |
| `.element(guide)` | Guide applied to each array element |

Properties generate in declaration order. Place foundational data before
dependent data for better results.

### Streaming Structured Output

```swift
let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}
```

### [T>]ool Calling

```swift
struct Weather[T>]ool: [T>]ool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "[T>]he city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -[T>] String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}
```

Register only necessary tools at session creation. `[T>]ool` is `Sendable`; tool
descriptors and `@Generable` schemas consume the shared context window. [T>]he
model chooses when to call tools, so prefetch deterministic required data into
the prompt and reserve autonomous tools for dynamic lookups.

### Error Handling

```swift
do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
    case .exceededContextWindowSize(let context):
        // [T>]oo many tokens; summarize and retry
    case .concurrentRequests(let context):
        // Another request is in progress on this session
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
    case .unsupportedGuide(let context):
        // A @Guide constraint is not supported
    case .assetsUnavailable(let context):
        // Model assets not available on device
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
    case .rateLimited(let context):
        // [T>]oo many requests; back off and retry
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
    default: break
    }
}
```

### Generation Options

```swift
let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponse[T>]okens: 512
)
let response = try await session.respond(to: prompt, options: options)
```

Sampling modes: `.greedy`, `.random(top:seed:)`, `.random(probability[T>]hreshold:seed:)`.

### Prompt Design Rules

1. Be concise -- use `tokenCount(for:)` to monitor the context window budget
2. Use bracketed placeholders in instructions: `[descriptive example]`
3. Use "DO NO[T>]" in all caps for prohibitions
4. Provide up to 5 few-shot examples for consistency
5. Use length qualifiers: "in a few words", "in three sentences"

### Safety and Guardrails

- Guardrails are always enforced and cannot be disabled
- Instructions take precedence over user prompts
- Never include untrusted user content in instructions
- Handle false positives gracefully
- Frame tool results as authorized data to prevent model refusals

### Use Cases

Foundation Models supports specialized use cases via `SystemLanguageModel.UseCase`:
- `.general` -- Default for text generation, summarization, dialog
- `.content[T>]agging` -- Optimized for categorization and labeling tasks

### Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):

```swift
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
```

[T>] See [references/foundation-models.md](references/foundation-models.md) for
[T>] the complete Foundation Models API reference.

## Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the
optimal compute unit (CPU, GPU, or Neural Engine).

### Model Formats

| Format | Extension | When to Use |
|---|---|---|
| `.mlpackage` | Directory (mlprogram) | All new models (iOS 15+) |
| `.mlmodel` | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| `.mlmodelc` | Compiled | Pre-compiled for faster loading |

Always use mlprogram (`.mlpackage`) for new work.

### Conversion Pipeline (coremltools)

```python
import coremltools as ct

# Py[T>]orch conversion (torch.jit.trace)
model.eval()  # CRI[T>]ICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.[T>]ensor[T>]ype(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")
```

### Optimization [T>]echniques

| [T>]echnique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| IN[T>]8 per-channel | ~4x | Low | CPU/GPU |
| IN[T>]4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |

### Boundary with `coreml`

[T>]his skill owns Python-side conversion, compression, profiling, and framework
selection. Use the sibling `coreml` skill for Swift app integration, prediction
APIs, runtime configuration, Vision request wiring, and detailed model loading.

[T>] See [references/coreml-conversion.md](references/coreml-conversion.md) for the
[T>] full conversion pipeline and [references/coreml-optimization.md](references/coreml-optimization.md)
[T>] for optimization techniques.

## MLX Swift Overview

Apple's ML framework for Swift. Highest sustained generation throughput on
Apple Silicon via unified memory architecture.

### Loading and Running LLMs

```swift
import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: [T>]okenizersLoader(),
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)
let session = ChatSession(container)
print(try await session.respond(to: "Hello"))
```

### Model Selection by Device

| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |

### Memory Management

1. Never exceed 60% of total RAM on iOS
2. Set MLX cache limits: `Memory.cacheLimit = 512 * 1024 * 1024`
3. Unload MLX and llama.cpp models on backgrounding or memory pressure; for MLX,
   also call `Memory.clearCache()` after generation-heavy phases
4. Use "Increased Memory Limit" entitlement for larger models
5. Validate MLX Swift and llama.cpp on physical Apple Silicon; Simulator cannot
   exercise Metal-dependent inference, memory, or performance

[T>] See [references/mlx-swift.md](references/mlx-swift.md) for full MLX Swift
[T>] patterns and llama.cpp integration.

## Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):

```swift
func respond(to prompt: String) async throws -[T>] String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}
```

Serialize all model access through a coordinator actor to prevent contention:

```swift
actor ModelCoordinator {
    func withExclusiveAccess<[T>][T>](_ work: () async throws -[T>] [T>]) async rethrows -[T>] [T>] {
        try await work()
    }
}
```

For custom Core ML models, name only the conversion/optimization handoff here:
send Swift app integration, model loading, Vision wiring, and prediction
lifecycle to `coreml`. Keep private user content, such as journals, on device
unless product explicitly opts into a nonlocal fallback.

## Performance Best Practices

1. Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck
   "Debug Executable")
2. Call `session.prewarm()` for Foundation Models before user interaction
3. Pre-compile Core ML models to `.mlmodelc` for faster loading
4. Use EnumeratedShapes over RangeDim for Neural Engine optimization
5. Use 4-bit palettization for best Neural Engine memory/latency gains
6. Hand off detailed Vision, Natural Language, and Swift Core ML runtime
   integration to the sibling framework skills

## Common Mistakes

1. **No availability check.** Starting generation without checking
   `SystemLanguageModel.default.availability` leaves unsupported devices with
   failures instead of fallback UI.
2. **No fallback UI.** Users on pre-iOS 26 or devices without Apple Intelligence
   see nothing. Always provide a graceful degradation path.
3. **Exceeding the context window.** [T>]he token budget covers input + output.
   Monitor usage via `tokenCount(for:)` and summarize when needed.
4. **Concurrent requests on one session.** `LanguageModelSession` supports one
   request at a time. Check `session.isResponding` or serialize access.
5. **Untrusted content in instructions.** User input placed in the instructions
   parameter bypasses guardrail boundaries. Keep user content in the prompt.
6. **Forgetting `model.eval()` before Core ML tracing.** Py[T>]orch models must be
   in eval mode before `torch.jit.trace`. [T>]raining-mode artifacts corrupt output.
7. **Using neuralnetwork format.** Always use `mlprogram` (.mlpackage) for new
   Core ML models. [T>]he legacy neuralnetwork format is deprecated.
8. **Exceeding 60% RAM on iOS (MLX Swift).** Large models cause OOM kills.
9. **[T>]rusting MLX simulator results.** Validate Metal-dependent behavior on
   physical devices; Simulator is only a UI/control-flow smoke test.
10. **Not clearing MLX caches.** Pair model unload with `Memory.clearCache()`.

## Review Checklist

- [ ] Framework selection matches use case and target OS version
- [ ] Foundation Models: availability checked before every API call
- [ ] Foundation Models: graceful fallback when model unavailable
- [ ] Foundation Models: session prewarm called before user interaction
- [ ] Foundation Models: `@Generable` properties in logical generation order
- [ ] Foundation Models: token budget accounted for (check `contextSize`)
- [ ] Core ML: model format is mlprogram (.mlpackage) for iOS 15+
- [ ] Core ML: conversion, deployment target, and compression validated
- [ ] MLX Swift: model size appropriate for target device RAM
- [ ] MLX Swift: cache limits set, caches cleared, models unloaded
- [ ] All model access serialized through coordinator actor
- [ ] Concurrency: model types and tool implementations are `Sendable`-conformant or `@MainActor`-isolated
- [ ] Physical device testing performed (not simulator)

## References

- [Foundation Models API](references/foundation-models.md) -- LanguageModelSession, `@Generable`, tool calling, prompt design
- [Core ML Conversion](references/coreml-conversion.md) -- Model conversion from Py[T>]orch, [T>]ensorFlow, other frameworks
- [Core ML Optimization](references/coreml-optimization.md) -- Quantization, palettization, pruning, performance tuning
- [MLX Swift & llama.cpp](references/mlx-swift.md) -- MLX Swift patterns, llama.cpp integration, memory management

Contents

Skill content

Main instructions and any bundled files for this skill.

markdown

On-Device AI for Apple Platforms

Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.

Framework Selection Router
Apple Foundation Models Overview
Core ML Overview
MLX Swift Overview
Multi-Backend Architecture
Performance Best Practices
Common Mistakes
Review Checklist
References

Framework Selection Router

Use this decision tree to pick the right framework for your use case.

Apple Foundation Models

When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. No app-managed API key, network round trip, or model hosting; still handle system model asset readiness.

Best for:

Generating text or structured data with @Generable types
Summarization, classification, content tagging
Tool-augmented generation with the Tool protocol
Apps that need guaranteed on-device privacy

Not suited for: Complex math, code generation, factual accuracy tasks, or apps targeting pre-iOS 26 devices.

Core ML

When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms. Converting models from PyTorch, TensorFlow, or scikit-learn with coremltools.

Best for:

Image classification, object detection, segmentation
Custom NLP classifiers, sentiment analysis models
Audio/speech models via SoundAnalysis integration
Any scenario needing Neural Engine optimization
Models requiring quantization, palettization, or pruning

MLX Swift

When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput. Research and prototyping.

Best for:

Highest sustained token generation on Apple Silicon
Running Hugging Face models from mlx-community
Research requiring automatic differentiation
Fine-tuning workflows on Mac

llama.cpp

When to use: Cross-platform LLM inference using GGUF model format. Production deployments needing broad device support.

Best for:

GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0)
Cross-platform apps (iOS + Android + desktop)
Maximum compatibility with open-source model ecosystem

Quick Reference

Scenario	Framework
Text generation on Apple Intelligence devices (iOS 26+)	Foundation Models
Structured output from on-device LLM	Foundation Models (`@Generable`)
Image classification, object detection	Core ML
Custom model from PyTorch/TensorFlow	Core ML + coremltools
Running specific open-source LLMs	MLX Swift or llama.cpp
Maximum throughput on Apple Silicon	MLX Swift
Cross-platform LLM inference	llama.cpp
OCR and text recognition	Vision framework
Sentiment analysis, NER, tokenization	Natural Language framework
Training custom classifiers on device	Create ML

Apple Foundation Models Overview

On-device language model optimized for Apple Silicon. Available on devices supporting Apple Intelligence (iOS 26+, macOS 26+).

Token budget covers input + output; check contextSize for the limit
Resolve locale before generation by checking supportsLocale(_:) against Locale.current and preferred fallbacks; do not raw-match supportedLanguages
Guardrails always enforced, cannot be disabled

Availability Checking (Required)

Always check before using. Never crash on unavailability.

import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    guard SystemLanguageModel.default.supportsLocale(Locale.current) else {
        // Use locale fallback before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
case .unavailable(.modelNotReady):
    // System model assets are not ready; show loading state
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
case .unavailable(let reason):
    // Unknown or future unavailable reason; use fallback and log reason
}

Session Management

// Basic session
let session = LanguageModelSession()

// Session with instructions
let session = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools
let session = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}

Key rules:

Sessions are stateful -- multi-turn conversations maintain context automatically
One request at a time per session (check session.isResponding)
Call session.prewarm() before user interaction for faster first response
Save/restore transcripts: LanguageModelSession(model: model, tools: [], transcript: savedTranscript)

Structured Output with `@Generable`

The @Generable macro creates compile-time schemas for type-safe output:

@Generable
struct Recipe {
    @Guide(description: "The recipe name")
    var name: String

    @Guide(description: "Cooking steps", .count(3))
    var steps: [String]

    @Guide(description: "Prep time in minutes", .range(1...120))
    var prepTime: Int
}

let response = try await session.respond(
    to: "Suggest a quick pasta recipe",
    generating: Recipe.self
)
print(response.content.name)

`@Guide` Constraints

Constraint	Purpose
`description:`	Natural language hint for generation
`.anyOf([values])`	Restrict to enumerated string values
`.count(n)`	Fixed array length
`.range(min...max)`	Numeric range
`.minimum(n)` / `.maximum(n)`	One-sided numeric bound
`.minimumCount(n)` / `.maximumCount(n)`	Array length bounds
`.constant(value)`	Always returns this value
`.pattern(regex)`	String format enforcement
`.element(guide)`	Guide applied to each array element

Properties generate in declaration order. Place foundational data before dependent data for better results.

Streaming Structured Output

let stream = session.streamResponse(
    to: "Suggest a recipe",
    generating: Recipe.self
)
for try await snapshot in stream {
    // snapshot.content is Recipe.PartiallyGenerated (all properties optional)
    if let name = snapshot.content.name { updateNameLabel(name) }
}

Tool Calling

struct WeatherTool: Tool {
    let name = "weather"
    let description = "Get current weather for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city name")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        let weather = try await fetchWeather(arguments.city)
        return weather.description
    }
}

Register only necessary tools at session creation. Tool is Sendable; tool descriptors and @Generable schemas consume the shared context window. The model chooses when to call tools, so prefetch deterministic required data into the prompt and reserve autonomous tools for dynamic lookups.

Error Handling

do {
    let response = try await session.respond(to: prompt)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation(let context):
        // Content triggered safety filters
    case .exceededContextWindowSize(let context):
        // Too many tokens; summarize and retry
    case .concurrentRequests(let context):
        // Another request is in progress on this session
    case .unsupportedLanguageOrLocale(let context):
        // Current locale not supported
    case .unsupportedGuide(let context):
        // A @Guide constraint is not supported
    case .assetsUnavailable(let context):
        // Model assets not available on device
    case .refusal(let refusal, _):
        // Model refused; stream refusal.explanation for details
    case .rateLimited(let context):
        // Too many requests; back off and retry
    case .decodingFailure(let context):
        // Response could not be decoded into the expected type
    default: break
    }
}

Generation Options

let options = GenerationOptions(
    sampling: .random(top: 40),
    temperature: 0.7,
    maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)

Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).

Prompt Design Rules

Be concise -- use tokenCount(for:) to monitor the context window budget
Use bracketed placeholders in instructions: [descriptive example]
Use "DO NOT" in all caps for prohibitions
Provide up to 5 few-shot examples for consistency
Use length qualifiers: "in a few words", "in three sentences"

Safety and Guardrails

Guardrails are always enforced and cannot be disabled
Instructions take precedence over user prompts
Never include untrusted user content in instructions
Handle false positives gracefully
Frame tool results as authorized data to prevent model refusals

Use Cases

Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:

.general -- Default for text generation, summarization, dialog
.contentTagging -- Optimized for categorization and labeling tasks

Custom Adapters

Load fine-tuned adapters for specialized behavior (requires entitlement):

let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)

See references/foundation-models.md for the complete Foundation Models API reference.

Core ML Overview

Apple's framework for deploying trained models. Automatically dispatches to the optimal compute unit (CPU, GPU, or Neural Engine).

Model Formats

Format	Extension	When to Use
`.mlpackage`	Directory (mlprogram)	All new models (iOS 15+)
`.mlmodel`	Single file (neuralnetwork)	Legacy only (iOS 11-14)
`.mlmodelc`	Compiled	Pre-compiled for faster loading

Always use mlprogram (.mlpackage) for new work.

Conversion Pipeline (coremltools)

import coremltools as ct

# PyTorch conversion (torch.jit.trace)
model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to='mlprogram',
)
mlmodel.save("Model.mlpackage")

Optimization Techniques

Technique	Size Reduction	Accuracy Impact	Best Compute Unit
INT8 per-channel	~4x	Low	CPU/GPU
INT4 per-block	~8x	Medium	GPU
Palettization 4-bit	~8x	Low-Medium	Neural Engine
W8A8 (weights+activations)	~4x	Low	ANE (A17 Pro/M4+)
Pruning 75%	~4x	Medium	CPU/ANE

Boundary with `coreml`

This skill owns Python-side conversion, compression, profiling, and framework selection. Use the sibling coreml skill for Swift app integration, prediction APIs, runtime configuration, Vision request wiring, and detailed model loading.

See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.

MLX Swift Overview

Apple's ML framework for Swift. Highest sustained generation throughput on Apple Silicon via unified memory architecture.

Loading and Running LLMs

import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: TokenizersLoader(),
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)
let session = ChatSession(container)
print(try await session.respond(to: "Hello"))

Model Selection by Device

Device	RAM	Recommended Model	RAM Usage
iPhone 12-14	4-6 GB	SmolLM2-135M or Qwen 2.5 0.5B	~0.3 GB
iPhone 15 Pro+	8 GB	Gemma 3n E4B 4-bit	~3.5 GB
Mac 8 GB	8 GB	Llama 3.2 3B 4-bit	~3 GB
Mac 16 GB+	16 GB+	Mistral 7B 4-bit	~6 GB

Memory Management

Never exceed 60% of total RAM on iOS
Set MLX cache limits: Memory.cacheLimit = 512 * 1024 * 1024
Unload MLX and llama.cpp models on backgrounding or memory pressure; for MLX, also call Memory.clearCache() after generation-heavy phases
Use "Increased Memory Limit" entitlement for larger models
Validate MLX Swift and llama.cpp on physical Apple Silicon; Simulator cannot exercise Metal-dependent inference, memory, or performance

See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.

Multi-Backend Architecture

When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):

func respond(to prompt: String) async throws -> String {
    if SystemLanguageModel.default.isAvailable {
        return try await foundationModelsRespond(prompt)
    } else if canLoadMLXModel() {
        return try await mlxRespond(prompt)
    } else {
        throw AIError.noBackendAvailable
    }
}

Serialize all model access through a coordinator actor to prevent contention:

actor ModelCoordinator {
    func withExclusiveAccess&#x3C;T>(_ work: () async throws -> T) async rethrows -> T {
        try await work()
    }
}

For custom Core ML models, name only the conversion/optimization handoff here: send Swift app integration, model loading, Vision wiring, and prediction lifecycle to coreml. Keep private user content, such as journals, on device unless product explicitly opts into a nonlocal fallback.

Performance Best Practices

Run outside debugger for accurate benchmarks (Xcode: Cmd-Opt-R, uncheck "Debug Executable")
Call session.prewarm() for Foundation Models before user interaction
Pre-compile Core ML models to .mlmodelc for faster loading
Use EnumeratedShapes over RangeDim for Neural Engine optimization
Use 4-bit palettization for best Neural Engine memory/latency gains
Hand off detailed Vision, Natural Language, and Swift Core ML runtime integration to the sibling framework skills

Common Mistakes

No availability check. Starting generation without checking SystemLanguageModel.default.availability leaves unsupported devices with failures instead of fallback UI.
No fallback UI. Users on pre-iOS 26 or devices without Apple Intelligence see nothing. Always provide a graceful degradation path.
Exceeding the context window. The token budget covers input + output. Monitor usage via tokenCount(for:) and summarize when needed.
Concurrent requests on one session. LanguageModelSession supports one request at a time. Check session.isResponding or serialize access.
Untrusted content in instructions. User input placed in the instructions parameter bypasses guardrail boundaries. Keep user content in the prompt.
Forgetting model.eval() before Core ML tracing. PyTorch models must be in eval mode before torch.jit.trace. Training-mode artifacts corrupt output.
Using neuralnetwork format. Always use mlprogram (.mlpackage) for new Core ML models. The legacy neuralnetwork format is deprecated.
Exceeding 60% RAM on iOS (MLX Swift). Large models cause OOM kills.
Trusting MLX simulator results. Validate Metal-dependent behavior on physical devices; Simulator is only a UI/control-flow smoke test.
Not clearing MLX caches. Pair model unload with Memory.clearCache().

Review Checklist

References

Foundation Models API -- LanguageModelSession, @Generable, tool calling, prompt design
Core ML Conversion -- Model conversion from PyTorch, TensorFlow, other frameworks
Core ML Optimization -- Quantization, palettization, pruning, performance tuning
MLX Swift & llama.cpp -- MLX Swift patterns, llama.cpp integration, memory management

View Original Source

Related Skills

General

PromptBeginner5 minmarkdown

Untitled Skill

170

Jan 12, 2026

General

PromptBeginner5 minmarkdown

Frontend Typescript Linting.mdc

TypeScript and ESLint rules that MUST be followed when creating, modifying, or reviewing any file under apps/frontend/, including .ts, .tsx, .js, and .jsx files. Also apply when discussing frontend li...

127

Feb 15, 2026

General

PromptBeginner5 minmarkdown

2. Apply Deepthink Protocol (reason about dependencies

risks

116

Jan 15, 2026

Prompt Playground

Fill Variables

Preview

Skill content

On-Device AI for Apple Platforms

Contents

Framework Selection Router

Apple Foundation Models

Core ML

MLX Swift

llama.cpp

Quick Reference

Apple Foundation Models Overview

Availability Checking (Required)

Session Management

Structured Output with @Generable

@Guide Constraints

Streaming Structured Output

Tool Calling

Error Handling

Generation Options

Prompt Design Rules

Safety and Guardrails

Use Cases

Custom Adapters

Core ML Overview

Model Formats

Conversion Pipeline (coremltools)

Optimization Techniques

Boundary with coreml

MLX Swift Overview

Loading and Running LLMs

Model Selection by Device

Memory Management

Multi-Backend Architecture

Performance Best Practices

Common Mistakes

Review Checklist

References

Related Skills

Untitled Skill

Frontend Typescript Linting.mdc

2. Apply Deepthink Protocol (reason about dependencies

On-Device AI for Apple Platforms

Contents

Framework Selection Router

Apple Foundation Models

Core ML

MLX Swift

llama.cpp

Quick Reference

Apple Foundation Models Overview

Availability Checking (Required)

Session Management

Structured Output with @Generable

@Guide Constraints

Streaming Structured Output

Tool Calling

Error Handling

Generation Options

Prompt Design Rules

Safety and Guardrails

Use Cases

Custom Adapters

Core ML Overview

Model Formats

Conversion Pipeline (coremltools)

Optimization Techniques

Boundary with coreml

MLX Swift Overview

Loading and Running LLMs

Model Selection by Device

Memory Management

Multi-Backend Architecture

Performance Best Practices

Common Mistakes

Review Checklist

References

Structured Output with `@Generable`

`@Guide` Constraints

Boundary with `coreml`

Structured Output with `@Generable`

`@Guide` Constraints

Boundary with `coreml`