Dispatch backends¶

ANEForge maps two working paths from a regular user process to the ANE silicon, a GPU baseline, and several surfaces that are reachable but blocked or unsuitable. This document explains each path and when to use it.

Summary¶

| Path | Latency (p50, conv3x3 [1,256,32,32]) | Device | Status |
|---|---|---|---|
| `e5rt_execution_stream_*` | ~80-110 us | ANE | Recommended |
| `_ANEInMemoryModel.evaluateWithQoS:` (Path A) | ~110 us | ANE | Stable |
| MPSGraph (`MPSGraphOptimizationLevel0`) | ~300 us | GPU | Available |
| CPU numpy fp16 | ~587 ms | CPU | Reference |
| Path B (`_ANESharedEvents` / `_ANEChainingRequest`) | - | - | Blocked (entitlement) |
| CoreML `MLComputeUnitsAll` | varies | CPU (usually) | Not viable for direct ANE |
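
The latencies above are medians (p50). This page does not say how they were collected, so the following is only a minimal sketch of one way to take such a measurement. The function name and the `warmup` and `iters` defaults are illustrative assumptions, not ANEForge's benchmark harness.

```python
import statistics
import time
from typing import Callable


def p50_latency_us(fn: Callable[[], object], warmup: int = 100, iters: int = 1000) -> float:
    """Return the median wall-clock latency of fn() in microseconds."""
    # Warm-up calls absorb one-time costs (compile, cache fill) so they
    # do not skew the steady-state median.
    for _ in range(warmup):
        fn()

    samples_ns = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        fn()
        samples_ns.append(time.perf_counter_ns() - start)

    return statistics.median(samples_ns) / 1_000.0


if __name__ == "__main__":
    # Stand-in workload; swap in a closure over a compiled net and a fixed
    # input to time a real dispatch path.
    print(f"p50 = {p50_latency_us(lambda: sum(range(1000))):.1f} us")
```

A median is used rather than a mean because a few slow outliers (scheduler preemption, cache eviction) would otherwise dominate a figure in the 100 us range.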

Where each route lands (green = reaches the ANE without an entitlement; amber = conditional or lands on another device; red = blocked):

flowchart LR
    classDef ok fill:#d6f5d6,stroke:#2e7d32,color:#000;
    classDef blocked fill:#f8d7da,stroke:#c62828,color:#000;
    classDef partial fill:#fff3cd,stroke:#f9a825,color:#000;
    classDef hw fill:#e7eaf6,stroke:#3949ab,color:#000;

    U["User process (unentitled)"]

    E5["e5rt_* C API - recommended<br/>compile-once / eval-many, ~80-110us"]:::ok
    PA["_ANEInMemoryModel (Path A)<br/>ObjC, ~110us"]:::ok
    PB["Path B - _ANEChainingRequest<br/>streaming / chaining"]:::blocked
    MPS["MPSGraph (Metal graph) ~300us"]:::partial
    CML["CoreML - MLComputeUnitsAll<br/>opaque MLComputePlan placement"]:::partial

    ANED["aned - compile MIL to HWX, SIGN, cache"]:::hw
    ANE["ANE silicon"]:::hw
    GPU["Metal GPU"]:::hw
    CPU["CPU"]:::hw

    U --> E5 --> ANED
    U --> PA --> ANED
    U --> PB
    U --> MPS --> GPU
    U --> CML

    ANED --> ANE
    PB -. "blocked: entitlement com.apple.aned.private.allow" .-> ANED
    CML -->|usually| CPU
    CML -. "above a cost threshold only" .-> ANED

The two green routes (e5rt, Path A) are the working paths; both converge on aned (which compiles, signs, and caches the HWX) and reach the same silicon.

e5rt - the fast path¶

This path is built on Espresso.framework's e5rt_* C API, whose symbols all resolve via dlsym from an unentitled process. The flow is compile-once / eval-many: compile a MIL program to a program library, retain the main function, bind input/output buffers, then encode and execute on a stream. A stable program handle stays resident in aned across submissions, so the per-call cost reduces to input memcpy + execute + output memcpy.
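
The compile-once / eval-many pattern can be illustrated with a shape-keyed cache. This is a sketch only: `ProgramCache` and `fake_compile` are hypothetical names, and the compile step is a pure-Python stand-in rather than a call into the e5rt API.

```python
from typing import Callable, Dict, Tuple

Shape = Tuple[int, ...]


class ProgramCache:
    """Keep one compiled program per input shape and reuse it across calls."""

    def __init__(self, compile_fn: Callable[[Shape], Callable]):
        self._compile_fn = compile_fn  # hypothetical: shape -> callable program
        self._programs: Dict[Shape, Callable] = {}
        self.compiles = 0  # how many times the slow path ran

    def get(self, shape: Shape) -> Callable:
        program = self._programs.get(shape)
        if program is None:
            # Slow path: exactly one compile per distinct shape.
            program = self._compile_fn(shape)
            self._programs[shape] = program
            self.compiles += 1
        return program


def fake_compile(shape: Shape) -> Callable:
    # Stand-in for a real compile step. The returned "program" only accepts
    # inputs of the shape it was compiled for, mirroring baked-in shapes.
    def run(x_shape: Shape) -> Shape:
        if x_shape != shape:
            raise ValueError(f"program compiled for {shape}, got {x_shape}")
        return x_shape

    return run


if __name__ == "__main__":
    cache = ProgramCache(fake_compile)
    for _ in range(1000):
        cache.get((1, 256, 32, 32))((1, 256, 32, 32))
    print(cache.compiles)  # the hot loop compiled once
```

The hot loop pays the compile cost on the first call only; a second shape would trigger a second compile and occupy a second cache entry.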

Full call sequence and the e5rt_* C ABI are in e5rt-dispatch-reference.md.

When to use it¶

Compile-once, eval-many workloads.
Inference hot loops where each call's shape is fixed.
Multi-op pipelines (encode multiple ops onto a single stream).

Constraints¶

Shapes are baked into the compiled program. Each new shape pays a fresh compile (~750 ms one-time).
Cross-process sharing works end-to-end. Bundle handoff is the ~30 ms warm-load path; it requires the master and worker to share a codesign identity (the same binary), and workers must be posix_spawn'd children because fork() does not work (Espresso is libdispatch-fork-unsafe). Heterogeneous binaries recompile the MIL (~750 ms cold, ~30 ms with a warm aned cache). Cross-process tensor handoff via IOSurface runs the data plane with no memcpy, at ~70 us per surface. See e5rt-dispatch-reference.md.
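
The spawn-not-fork requirement can be sketched from Python with `os.posix_spawn`, launching the worker from the same executable as the parent. The worker body and the bundle path below are hypothetical placeholders; a real worker would warm-load the bundle it is handed. `os.waitstatus_to_exitcode` needs Python 3.9 or later.

```python
import os
import sys

# Hypothetical worker body: a real worker would warm-load the bundle named in
# argv[1]. This stand-in only checks that a path argument arrived.
WORKER_CODE = "import sys; sys.exit(0 if len(sys.argv) == 2 else 1)"


def spawn_worker(bundle_path: str) -> int:
    """Launch a worker from the same executable via posix_spawn; return its exit code."""
    # Reusing sys.executable keeps parent and child on the same binary, which
    # is the assumption behind "share a codesign identity" here.
    argv = [sys.executable, "-c", WORKER_CODE, bundle_path]
    pid = os.posix_spawn(sys.executable, argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    # The path is a placeholder, not a real ANEForge bundle location.
    print(spawn_worker("/tmp/hypothetical.bundle"))
```

Unlike `os.fork()`, the child starts as a fresh process image, so no libdispatch state is inherited from the parent.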

Path A - `_ANEInMemoryModel`¶

This is the canonical Objective-C surface that ANEForge's higher-level frontend uses. The underscore-prefixed classes are exported by AppleNeuralEngine.framework and are callable without an entitlement:

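// Describe the program: MIL text plus weights, no options plist.
desc  = [_ANEInMemoryModelDescriptor modelWithMILText:milData
                                              weights:weights
                                         optionsPlist:nil];
model = [_ANEInMemoryModel inMemoryModelWithDescriptor:desc];
// One-time per shape: compile, then load.
[model compileWithQoS:33 options:opts error:&err];
[model loadWithQoS:33 options:opts error:&err];
// Hot path: evaluate a request against the loaded model.
[model evaluateWithQoS:33 options:opts request:req error:&err];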

Compile + load costs ~38 ms, one time per shape. Eval takes ~110 us at the canonical conv shape. This is the same hardware path as e5rt, just reached through ObjC rather than C.
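
For a rough sense of when a higher one-time cost pays for itself, the setup and per-call figures quoted on this page can be plugged into a break-even calculation. This is a sketch under a loud assumption: that the ~750 ms e5rt compile and the ~38 ms Path A compile + load were measured under comparable conditions (for example, both with a cold aned cache), which this page does not state. The function name is illustrative.

```python
import math
from typing import Optional


def break_even_calls(setup_a_us: float, eval_a_us: float,
                     setup_b_us: float, eval_b_us: float) -> Optional[int]:
    """Smallest call count at which path B's total time drops below path A's.

    Returns 0 when B is already cheaper before the first call, and None when
    B has the higher setup cost and is not faster per call.
    """
    extra_setup = setup_b_us - setup_a_us
    saved_per_call = eval_a_us - eval_b_us
    if extra_setup < 0:
        return 0
    if saved_per_call <= 0:
        return None
    # total_b < total_a  <=>  n * saved_per_call > extra_setup
    return math.floor(extra_setup / saved_per_call) + 1


if __name__ == "__main__":
    # Path A: ~38 ms setup, ~110 us eval. e5rt: ~750 ms setup, 80-110 us eval.
    print(break_even_calls(38_000, 110, 750_000, 80))   # best-case e5rt eval
    print(break_even_calls(38_000, 110, 750_000, 110))  # worst-case e5rt eval
```

Under that assumption, e5rt at 80 us per call overtakes Path A after 23734 calls, and at 110 us per call it never does. The estimate is only as good as the comparability of the two setup numbers.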

Use Path A when you need behavior the e5rt wrapper does not yet expose (e.g., multi-output programs the e5rt wrapper handles differently, or compile-options keys the e5rt API does not surface).

What about CoreML?¶

CoreML can reach the same hardware but routes most workloads to CPU. Apple's public MLComputePlan API (macOS 26.5) reports the per-op preferred device for any compiled .mlmodelc. Across a 451-program audit of ANEForge's fuzzer corpus:

0% of programs had any non-trivial op whose preferred device was the ANE, according to Apple's compiler.
88.9% routed every real compute op to CPU per Apple's policy.
Routing flips ANE-ward only above a workload-cost threshold. For conv3x3, for example, the knee lies between [1,128,32,32] (CPU) and [1,256,32,32] (ANE).
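
To put the conv3x3 knee in numbers, multiply-accumulate counts can serve as a proxy for workload cost. This is a sketch with two assumptions not confirmed by this page: that output channels equal input channels at both shapes, and that MACs approximate whatever cost metric Apple's heuristic actually uses, which is not documented here.

```python
def conv3x3_macs(n: int, c_in: int, h: int, w: int, c_out: int) -> int:
    """MACs for a stride-1, same-padded 3x3 convolution over an NCHW input."""
    return n * c_out * h * w * c_in * 3 * 3


# Assumption: c_out == c_in for both shapes (only the input shape is given).
below_knee = conv3x3_macs(1, 128, 32, 32, c_out=128)
above_knee = conv3x3_macs(1, 256, 32, 32, c_out=256)

if __name__ == "__main__":
    print(below_knee)  # 150994944
    print(above_knee)  # 603979776
```

Under those assumptions the flip happens somewhere between roughly 151 million and 604 million MACs, a 4x window, so the two data points bound the threshold only loosely.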

A powermetrics audit shows that ANEForge's direct paths (_ANEInMemoryModel and e5rt) bypass this heuristic: real power draw appears on the ANE rail even for the ops MLComputePlan flags as CPU-only. These paths therefore reach the ANE on smaller workloads where CoreML would keep things on CPU.
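
A power-rail check of this kind can be scripted by parsing captured powermetrics text. The sketch below assumes per-rail lines of the form `ANE Power: 42 mW`; that format may differ across macOS versions and should be verified against real output. The sample text contains made-up values for illustration, not measurements.

```python
import re
from typing import Dict

# Assumed line format, e.g. "ANE Power: 42 mW"; check it against your own output.
_POWER_LINE = re.compile(
    r"^(CPU|GPU|ANE) Power:\s*(\d+(?:\.\d+)?)\s*mW\s*$", re.MULTILINE
)


def parse_rail_power_mw(text: str) -> Dict[str, float]:
    """Extract the last reported CPU/GPU/ANE power (mW) from powermetrics text."""
    rails: Dict[str, float] = {}
    for rail, value in _POWER_LINE.findall(text):
        rails[rail] = float(value)  # later samples overwrite earlier ones
    return rails


# Made-up sample for illustration only; these are not measured values.
SAMPLE = """\
CPU Power: 812 mW
GPU Power: 14 mW
ANE Power: 236 mW
Combined Power (CPU + GPU + ANE): 1062 mW
"""

if __name__ == "__main__":
    rails = parse_rail_power_mw(SAMPLE)
    print(rails.get("ANE", 0.0) > 0.0)  # True when the ANE rail shows draw
```

A nonzero ANE reading while a workload runs, compared against an idle baseline, is the signal that the dispatch actually landed on the ANE.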

MLComputePlan is still useful as an audit oracle: it asks Apple's compiler whether a compiled MIL has any ANE-preferred ops, which catches silent CPU dispatches in a CoreML pipeline. ANEForge uses it offline to audit its corpus, not as a dispatch path - the direct e5rt / Path A routes reach the ANE regardless of what Apple's routing heuristic prefers.

Path B - blocked¶

_ANESharedEvents, _ANEChainingRequest, and the always-zero intermediateBufferHandle - collectively "Path B" - gate streaming, chained execution, and direct IOSurface ownership behind the com.apple.aned.private.allow entitlement. Apple uses Path B internally; third parties cannot.

ANEForge reproduces the Path-B behavior that mattered for autonomy - a host-free dispatch loop - on the e5rt path in bounded form (one execute_multi drives K on-engine steps). The reproduction is performance-neutral, so performance alone is no reason to seek the entitlement. See e5rt-dispatch-reference.md.

MPSGraph¶

MPSGraph is a public Metal-based graph API. It can in principle route to the ANE via private compilation-descriptor selectors, but for most workloads it keeps execution on the Metal GPU. ANEForge does not use MPSGraph as a dispatch backend. It is still a useful, verified GPU baseline for cross-device benchmarking, and the EspressoANEIOSurface metalBufferWithDevice: zero-copy primitive is available for a future hybrid CPU/GPU/ANE pipeline.

Choosing a path¶

| You need | Use |
|---|---|
| Hot-loop inference at fixed shape | e5rt (`af.compile` -> reuse the `net`) |
| Quick try from Python | `import aneforge as af`; build a graph, `af.compile`, call `net(x)` |
| Single-shot inference, don't care about latency | Either green path |
| Streaming inference with KV-cache | e5rt + paired state tensors (see capabilities.md) |
| Multi-op compiled pipeline | e5rt with multiple ops on one stream |
| Routing-truth audit | `MLComputePlan` over a compiled `.mlmodelc` (offline oracle) |
| Hardware-counter probe of which silicon ran | `sudo powermetrics --samplers ane_power,cpu_power,gpu_power` |