e5rt dispatch reference
The complete reference for the e5rt dispatch path: the unentitled, CoreML-free
route that compiles a MIL program and runs it on the ANE through
Espresso.framework's e5rt_* C API. This is the fast path aneforge uses for
every fused program. It is reachable from an ordinary user process with only
dlopen + dlsym (no entitlement, no .tbd linkage).
Two surfaces are documented here:
- The underlying e5rt_* symbols in Espresso.framework and the exact ordered sequence to drive them (the "what Apple exposes" layer).
- The project's stable C ABI (ane_e5rt_* in aneforge/_lib/ane_e5rt_dispatch.mm, built into libane_e5rt_dispatch.dylib) that wraps those symbols into a compile-once/eval-many object; this is the layer aneforge/_runtime.py calls.
For the dispatch-path matrix (e5rt vs Path A vs Path B vs MPSGraph vs CoreML), see dispatch.md.
Status: verified end-to-end on macOS 26.5 / M5 Pro (h17s). The core surface (compile, bind, encode, execute_sync) is production-grade; the multi-op, async, and cross-process surfaces are marked experimental (callable C surface that the Python frontend does not bind) where they are research-grade or partly inferred.
1. What the path is
- Unentitled. Everything below resolves via dlopen of /System/Library/PrivateFrameworks/Espresso.framework/Espresso + dlsym. The system log shows the dispatch going through aned -> ANECompilerService (MIL -> HWX) -> ANEDriver -> H11ANEIn with csIdentity = ThirdPartyAppUsingANE, i.e. no entitlement, a real assigned programHandle, ~48 KiB wired memory per model, on the qos=21 lane.
- Compile-once / eval-many. The single compile call is where the whole 4-IR pipeline (MIL -> MLIR -> LLIR -> ANECIR -> HWX) and HWX signing happen inside aned (~27 ms on this host). Every subsequent eval is bind -> encode -> execute_sync, ~80 us.
- Device selection is a bitmask: 0x1 = BNNS/CPU, 0x2 = MPSGraph/GPU, 0x4 = ANE. aneforge sets 0x4. With multiple bits the runtime auto-selects (and falls back to BNNS if the ANE compile fails); verify the choice via SelectedBackend = string("ane") in the produced bundle's analytics.mil.
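The unentitled resolution described in the first bullet can be sketched in a few lines of C. This is a minimal sketch, not the project's loader: `resolve_symbol`, `load_stream_create`, and the `e5rt_stream_create_fn` typedef are hypothetical names, and the `int` return type is an assumption (this reference only establishes that create entry points take the out-pointer first). On a host without the framework the helpers simply return NULL.

```c
#include <dlfcn.h>
#include <stddef.h>

#define ESPRESSO_PATH \
    "/System/Library/PrivateFrameworks/Espresso.framework/Espresso"

/* Assumed shape: out-pointer-first; the int return type is a guess. */
typedef int (*e5rt_stream_create_fn)(void **out_stream);

/* Returns the symbol's address, or NULL if the library or symbol is missing. */
void *resolve_symbol(const char *lib_path, const char *symbol) {
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        return NULL;
    }
    /* The handle is deliberately left open: a resolved function pointer
       is only valid while the library stays loaded. */
    return dlsym(handle, symbol);
}

/* Resolves one e5rt entry point by name; NULL when Espresso is unavailable. */
e5rt_stream_create_fn load_stream_create(void) {
    e5rt_stream_create_fn fn = NULL;
    /* POSIX-sanctioned idiom for storing a dlsym result in a function pointer. */
    *(void **)(&fn) = resolve_symbol(ESPRESSO_PATH, "e5rt_execution_stream_create");
    return fn;
}
```

A caller would check the returned pointer for NULL before use and treat NULL as "e5rt path unavailable on this host".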
2. The underlying e5rt_* call sequence
The exact ordered sequence the dylib runs (see compile_and_build_op and the
encode/execute helpers in ane_e5rt_dispatch.mm). About 30 of the 235 exported
e5rt_* symbols are used; their function-pointer typedefs are in
aneforge/_lib/e5rt_api.h.
The shape of one call: a single one-time compile (where the 4-IR pipeline and
HWX signing happen inside aned), then a cheap per-inference eval loop.
sequenceDiagram
participant U as User code
participant D as libane_e5rt_dispatch
participant E as Espresso (e5rt_*)
participant A as aned (root daemon)
participant K as ANEDriver / H11ANEIn
participant S as ANE silicon
Note over U,S: Compile once (one-time: ~27 ms warm aned cache, up to ~750 ms cold)
U->>D: ane_e5rt_program_compile(mil, mask=0x4)
D->>E: e5rt_e5_compiler_compile(mil)
E->>A: XPC compile request
A->>A: MIL to MLIR to LLIR to ANECIR to HWX, SIGN, cache (per-PID)
A-->>E: programHandle
E-->>D: library to function to operation
D->>E: retain ports, alloc + bind buffers, stream_create, encode_operation
loop Eval many (~80-110 us each)
U->>D: set_input_fp16(...) (memcpy into bound buffer)
D->>E: e5rt_execution_stream_execute_sync(stream)
E->>A: submit (signed HWX already loaded)
A->>K: dispatch
K->>S: run program on the single hardware lane
S-->>K: done (output buffer filled)
K-->>E: complete
E-->>D: return
U->>D: get_output_fp16(...) (memcpy out)
D-->>U: fp16 output
end
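The timings annotated in the diagram make the compile-once/eval-many trade-off easy to quantify: the one-time compile cost is spread over every later eval. The helper below is a small arithmetic sketch; `amortized_us_per_eval` is a hypothetical name, and the inputs are whatever compile and eval times are measured on a given host.

```c
/* Average cost per inference, in microseconds, when one compile of
   `compile_ms` milliseconds is followed by `n_evals` evals of `eval_us`
   microseconds each. Returns -1.0 when n_evals is 0. */
double amortized_us_per_eval(double compile_ms, double eval_us,
                             unsigned long n_evals) {
    if (n_evals == 0) {
        return -1.0;
    }
    double total_us = compile_ms * 1000.0 + eval_us * (double)n_evals;
    return total_us / (double)n_evals;
}
```

With the warm-cache figures quoted above (27 ms compile, 80 us eval), 1000 evals average 107 us each; with a 750 ms cold compile the same 1000 evals average 830 us.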
2.1 Compile: MIL text -> signed HWX -> an executable operation
e5rt_e5_compiler_config_options_create(&config);
e5rt_e5_compiler_config_options_set_cache_bundle_location(config, cache_dir);
e5rt_e5_compiler_create_with_config(&compiler, config);
e5rt_e5_compiler_options_create(&options);
e5rt_e5_compiler_options_set_compute_device_types_mask(options, 0x4); // 0x4 = ANE
e5rt_e5_compiler_options_set_force_recompilation(options, 1);
e5rt_e5_compiler_options_set_segmenter(options, "graph");
e5rt_e5_compiler_compile(compiler, mil_path, options, &library); // aned: MIL->...->HWX, signs + caches per-PID
e5rt_program_library_retain_program_function(library, "main", &function);
e5rt_precompiled_compute_op_create_options_create_with_program_function(&op_options, function);
e5rt_precompiled_compute_op_create_options_set_operation_name(op_options, "main");
e5rt_precompiled_compute_op_create_options_set_allocate_intermediate_buffers(op_options, 1);
e5rt_execution_stream_operation_create_precompiled_compute_operation_with_options(&operation, op_options);
The compiler/config/options can be released once operation exists.
2.2 Bind I/O buffers (per input and output port)
e5rt_execution_stream_operation_retain_input_port(operation, port_name, &port); // outputs: ..._retain_output_port
e5rt_buffer_object_alloc(&buffer, n_bytes, 0); // type 0 = plain CPU data ptr (0/1/2 valid)
e5rt_buffer_object_get_data_ptr(buffer, &ptr); // memcpy fp16 inputs into *ptr
e5rt_io_port_bind_buffer_object(port, buffer);
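Note that e5rt_buffer_object_alloc is sized in bytes while the fp16 helpers later in this reference count elements. A small helper keeps the two from being confused; `fp16_tensor_bytes` is a hypothetical name used only for this sketch, and the only fact it relies on is that an fp16 element occupies two bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte size of an fp16 tensor with the given shape.
   Returns 0 for an empty shape, a zero dimension, or size_t overflow. */
size_t fp16_tensor_bytes(const size_t *dims, size_t n_dims) {
    if (dims == NULL || n_dims == 0) {
        return 0;
    }
    size_t elems = 1;
    for (size_t i = 0; i < n_dims; ++i) {
        if (dims[i] == 0 || elems > SIZE_MAX / dims[i]) {
            return 0;  /* zero-sized dimension or element count overflow */
        }
        elems *= dims[i];
    }
    if (elems > SIZE_MAX / sizeof(uint16_t)) {
        return 0;      /* byte count would overflow */
    }
    return elems * sizeof(uint16_t);
}
```

For a [1, 784] input this yields 1568 bytes, which is the value that would be passed as n_bytes.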
2.3 Build the stream and encode
e5rt_execution_stream_create(&stream);
// optional happens-before plumbing:
e5rt_async_event_create(&evt, name, 0); // 3 args; NULL name errors
e5rt_execution_stream_operation_bind_completion_event(operation, evt);
// re-encode path only (a stream that was already used):
e5rt_execution_stream_reset(stream); // rejects a fresh stream
e5rt_execution_stream_operation_prepare_op_for_encode(operation); // only legal on already-encoded ops
e5rt_execution_stream_encode_operation(stream, operation); // once per op, in submission order
2.4 Execute and read
e5rt_execution_stream_execute_sync(stream); // the production path; serializes encoded ops in order
// or async:
e5rt_execution_stream_submit_async(stream, ^{ /* completion */ }); // 2nd arg is an ObjC block, objc_retain'd
// read fp16 back through the output buffer's data ptr; async path waits first:
e5rt_async_event_sync_wait(final_event);
e5rt_async_event_get_last_signaled_value(evt, &value);
2.5 Multi-op extras (chaining, buffer sharing)
e5rt_execution_stream_operation_bind_dependent_events(dst_op, events, 1); // src completion-event -> dst dependency (A->B)
e5rt_io_port_bind_buffer_object(dst_port, src_buffer); // share a buffer between ops (zero-copy)
2.5.1 Persistent on-device state via output->input buffer aliasing
share_buffer can alias an op's output port onto its own (or a downstream op's)
input port before the first execute. With that alias in place, a single
execute_sync call per step reads from and writes back to the same resident
buffer, with no host round-trip between steps. The sequence is:
- Compile the op and bind ports as usual.
- Call e5rt_io_port_bind_buffer_object (or ane_e5rt_program_share_buffer at the project C ABI layer) to wire the output port to the same buffer as the input port.
- Seed the buffer once from the host (get_data_ptr -> memcpy).
- Loop execute_sync with no intervening set_input / get_output.
- Read back via get_data_ptr only at checkpoints.
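The aliasing idea in the steps above can be illustrated with a host-only model that makes no e5rt calls at all. Everything here is hypothetical and purely illustrative: a "port" is modeled as a plain pointer to the buffer bound to it, `toy_share_buffer` stands in for binding one buffer to two ports, and `toy_step` stands in for one execute_sync of an op that adds 1 to its input.

```c
#include <stddef.h>

/* Host-only model: each port is just a pointer to its bound buffer. */
typedef struct {
    float *in;   /* buffer bound to the input port  */
    float *out;  /* buffer bound to the output port */
} toy_op;

/* Models binding dst's input port to the buffer behind src's output port. */
void toy_share_buffer(const toy_op *src, toy_op *dst) {
    dst->in = src->out;
}

/* Models one execute: read the input buffer, write the output buffer. */
void toy_step(toy_op *op) {
    *op->out = *op->in + 1.0f;
}

/* Runs n_steps with the op's output aliased onto its own input. */
float toy_run_resident(float seed, size_t n_steps) {
    float state = seed;            /* the single resident buffer, seeded once */
    toy_op op = { NULL, &state };
    toy_share_buffer(&op, &op);    /* alias: both ports now see `state` */
    for (size_t i = 0; i < n_steps; ++i) {
        toy_step(&op);             /* no set_input / get_output between steps */
    }
    return state;                  /* read back only at the checkpoint */
}
```

Seeding with 0 and running 100 steps leaves 100 in the buffer: each step consumes the value the previous step produced without the host copying anything in between.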
This technique is reachable without an entitlement and is wired into the frontend:
- aneforge/_runtime.py: Program gained share_buffer(src_op, src_port, dst_op, dst_port) plus granular set_input / execute / read_output.
- aneforge/_compile.py: gained compile_multi(outs) and a MultiModel class that lowers a graph with N output tensors into one fused program with N named output ports.
- aneforge/autograd.py: Trainer(resident_state=True) assembles the entire training step (forward + backward + per-param optimizer update) as one fused multi-output e5rt program, then aliases each state tensor's output port onto its own input port via share_buffer. Weights and Adam moments stay resident on the ANE across steps; the host supplies only the minibatch (x, target) and the scalar lr_t per step, and reads weights back only at epoch checkpoints.
Demonstrated result: a 784->256->10 GELU MLP (Adam) trained to 97.79% test
accuracy on full MNIST with all 12 state tensors (4 params x {w, m, v}) resident
across ~2340 steps, in ~1.0 s (the prior host-shuttle path was ~4 s). No compile
wall at this scale. Example: examples/train_mnist_mlp.py. Tests:
test_compile_multi_two_outputs, test_resident_sgd_matches_host_reference,
test_resident_adam_trains_subset_and_state_stays_resident in
tests/test_autograd.py.
Caveats. fp16 storage represents every integer exactly only up to 2048 (above that, representable values are at least 2 apart); the device is a single lane (latency/bandwidth, not parallelism). Trainable op coverage now spans MLP, CNN, and transformer-block models - matmul-family plus structural VJPs (transpose/reshape/concat/slice), conv grad-wrt-input, and avg_pool/max_pool - with the resident-state path demonstrated on all three.
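The fp16 caveat is easy to check on the host. The sketch below converts between binary32 and binary16 bit patterns with round-to-nearest-even, which is the IEEE 754 default; whether the ANE rounds identically is an assumption, and `f32_to_f16` / `f16_to_f32` are hypothetical helper names, not part of the project ABI.

```c
#include <stdint.h>
#include <string.h>

/* binary32 -> binary16 bits, round-to-nearest-even. */
uint16_t f32_to_f16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t man  = x & 0x7FFFFFu;

    if (exp == 0xFFu) {                              /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? 0x200u : 0u));
    }
    int32_t e = (int32_t)exp - 127 + 15;             /* re-biased exponent */
    if (e >= 31) {
        return (uint16_t)(sign | 0x7C00u);           /* overflow -> Inf */
    }
    if (e <= 0) {                                    /* fp16 subnormal or zero */
        if (e < -10) {
            return (uint16_t)sign;                   /* too small -> signed zero */
        }
        man |= 0x800000u;                            /* make the leading 1 explicit */
        uint32_t shift = (uint32_t)(14 - e);         /* 14..24 */
        uint32_t half  = man >> shift;
        uint32_t rem   = man & ((1u << shift) - 1u);
        uint32_t mid   = 1u << (shift - 1u);
        if (rem > mid || (rem == mid && (half & 1u))) {
            half++;
        }
        return (uint16_t)(sign | half);
    }
    uint32_t half = ((uint32_t)e << 10) | (man >> 13);
    uint32_t rem  = man & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) {
        half++;                                      /* a carry into the exponent is correct */
    }
    return (uint16_t)(sign | half);
}

/* binary16 bits -> binary32 (always exact). */
float f16_to_f32(uint16_t h) {
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    uint32_t exp  = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t man  = (uint32_t)h & 0x3FFu;
    float f;
    if (exp == 0) {                                  /* zero or subnormal: man * 2^-24 */
        f = (float)man * 5.9604644775390625e-8f;
        return sign ? -f : f;
    }
    uint32_t x = (exp == 31u)
        ? (sign | 0x7F800000u | (man << 13))         /* Inf or NaN */
        : (sign | ((exp + 112u) << 23) | (man << 13));
    memcpy(&f, &x, sizeof f);
    return f;
}
```

Round-tripping 2048.0f returns 2048.0f, while 2049.0f comes back as 2048.0f and 2051.0f as 2052.0f, which is the integer-exactness limit the caveat refers to.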
Multi-step host-free dispatch - reachable without an entitlement, and not a bottleneck.
execute_multi encodes K ops into one stream and runs them
under a single execute_sync, so one host dispatch drives K on-engine steps.
Concretely, K copies of an aliased step (chained op_i.out -> op_{i+1}.in by
share_buffer and seeded once) advance the accumulator to exactly K for K up to
100; the ceiling is ~110-120, the aned per-PID program cap. This needs no Path B and no
entitlement. It is also performance-neutral: one execute_multi over 64 ops
(66 ms median) is no faster than 64 separate execute calls (60 ms) - on the
resident-buffer path execute_sync over already-bound buffers is nearly free,
so the per-step host dispatch is not a measurable cost (~0.93 ms/step intrinsic,
single lane). The only remaining inaccessible case is unbounded zero-host
autonomy (the engine self-looping with no host call ever), which is
entitlement-gated; by this measurement it would not run the workloads here
any faster.
2.6 Release (reverse order)
operation -> op_options -> function -> library -> buffer_object ->
io_port -> async_event -> stream, each via its e5rt_*_release.
3. ABI discoveries (the non-obvious calling conventions)
These were recovered empirically; getting them wrong returns specific errors:
- Every create entry point is out-pointer-first: (void **out, payload...). Reversed args give "Invalid E5 path specified. @ GetE5PathFromCompositeBundle" or "Cannot provide program function as nullptr. @ Create".
- e5rt_async_event_create takes three args (out, name, initial_value). A NULL name returns "Invalid Function Argument: eventName is NULL."
- e5rt_execution_stream_submit_async takes two args, the second an Objective-C completion block that is objc_retain'd unconditionally; passing NULL crashes in objc_retain. (The dylib provides ane_e5rt_make_completion_block to build one from a C callback.)
- e5rt_buffer_object_alloc(out, size, type) is out-first; types 0/1/2 are valid (anything else: "Invalid BufferType @ AllocMemory"). Type 0 gives a buffer whose get_data_ptr is a plain CPU virtual address.
- compute_device_types_mask: 0x1 bnns / 0x2 mps_graph / 0x4 ane; 0x3 -> bnns wins; 0x5/0x6/0x7 -> ane (with 0x7 falling back to bnns if the ANE compile fails).
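The mask outcomes listed in the last bullet are consistent with a simple preference order (ane, then bnns, then mps_graph). The sketch below encodes that observed table so a caller can predict which backend to expect before checking the compiled bundle. It is a model of the listed observations, not the runtime's actual selection logic, and it cannot predict the compile-failure fallback; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Device bits as listed above; the identifier names are illustrative only. */
enum {
    DEV_BNNS      = 0x1,  /* BNNS / CPU     */
    DEV_MPS_GRAPH = 0x2,  /* MPSGraph / GPU */
    DEV_ANE       = 0x4   /* ANE            */
};

/* Returns the single device bit the observations above predict for `mask`,
   or 0 if no known bit is set. Assumes the ANE compile succeeds. */
uint64_t preferred_device_bit(uint64_t mask) {
    if (mask & DEV_ANE)       return DEV_ANE;        /* 0x4, 0x5, 0x6, 0x7 -> ane */
    if (mask & DEV_BNNS)      return DEV_BNNS;       /* 0x1, 0x3 -> bnns */
    if (mask & DEV_MPS_GRAPH) return DEV_MPS_GRAPH;  /* 0x2 -> mps_graph */
    return 0;
}

/* Human-readable name for a single device bit. */
const char *device_bit_name(uint64_t bit) {
    switch (bit) {
    case DEV_ANE:       return "ane";
    case DEV_BNNS:      return "bnns";
    case DEV_MPS_GRAPH: return "mps_graph";
    default:            return "none";
    }
}
```

The prediction is only a starting point; the authoritative check remains the SelectedBackend value recorded in the produced bundle.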
4. The project C ABI (ane_e5rt_*)
aneforge/_lib/ane_e5rt_dispatch.mm wraps the sequence above into a
compile-once/eval-many ane_e5rt_program_t, built into
aneforge/_lib/libane_e5rt_dispatch.dylib. aneforge/_runtime.py binds the
subset the package uses (compile / set_input / execute / get_output / release);
the full ABI is documented below.
4.1 Single-op (the lean, production subset)
ane_e5rt_program_t *ane_e5rt_program_compile(
const char *mil_path, const char *cache_dir, uint64_t device_mask,
const char *const *input_names, const size_t *input_sizes, size_t n_inputs,
const char *const *output_names, const size_t *output_sizes, size_t n_outputs);
int ane_e5rt_program_set_input_fp16 (ane_e5rt_program_t*, const char *port, const uint16_t *data, size_t n_elems);
int ane_e5rt_program_execute (ane_e5rt_program_t*); // execute_sync
int ane_e5rt_program_get_output_fp16(ane_e5rt_program_t*, const char *port, uint16_t *dest, size_t n_elems);
void ane_e5rt_program_release (ane_e5rt_program_t*);
4.2 Multi-op stream (experimental)
Encode several ops onto one stream; chain them with events; share buffers op-to-op for zero-copy hand-off.
int ane_e5rt_program_add_op(ane_e5rt_program_t*, const char *mil_path,
const char *input_name, size_t input_size, const char *output_name, size_t output_size);
int ane_e5rt_program_set_input_fp16_op (ane_e5rt_program_t*, size_t op_idx, const char *port, const uint16_t *data, size_t n);
int ane_e5rt_program_get_output_fp16_op(ane_e5rt_program_t*, size_t op_idx, const char *port, uint16_t *dest, size_t n);
int ane_e5rt_program_execute_multi(ane_e5rt_program_t*); // one execute_sync over all encoded ops
size_t ane_e5rt_program_get_op_count (ane_e5rt_program_t*);
int ane_e5rt_program_chain_ops (ane_e5rt_program_t*, size_t src_op_idx, size_t dst_op_idx, const char *event_name);
int ane_e5rt_program_share_buffer (ane_e5rt_program_t*, size_t src_op_idx, const char *src_out_port,
size_t dst_op_idx, const char *dst_in_port);
int ane_e5rt_program_get_chain_event_last_signaled(ane_e5rt_program_t*, size_t op_idx, uint64_t *out);
4.3 Async completion (experimental)
int ane_e5rt_program_execute_async (ane_e5rt_program_t*); // submit_async + completion block
int ane_e5rt_program_wait_for_completion (ane_e5rt_program_t*); // sync_wait on the final event
int ane_e5rt_program_get_final_event_signaled(ane_e5rt_program_t*, uint64_t *before, uint64_t *after);
void *ane_e5rt_make_completion_block(ane_e5rt_completion_cb_t cb, void *ctx);
void ane_e5rt_free_completion_block(void *block);
4.4 Cross-process tensor hand-off (IOSurface)
In-process zero-copy between ops is ane_e5rt_program_share_buffer (section 4.2).
Cross-process tensor hand-off uses an IOSurface passed by Mach port
(bootstrap_register), ~70 us. It exists because a compiled program is not loadable in a
fresh process (section 5), so a persistent compiled worker is fed inputs and returns outputs by
IOSurface.
5. Operational constraints (the ones that bite)
- Compile and dispatch must be in the same process. The signed HWX lives only in aned's per-PID cache; loading a previously compiled bundle in a fresh process fails with "ANE model load has failed ... Must re-compile the E5 bundle". Cross-process reuse needs same-codesign-identity posix_spawn'd children; fork() SIGSEGVs (Espresso/libdispatch is fork-unsafe).
- execute_sync serializes all encoded ops in a stream, regardless of bind_dependent_events. Multiple encode_operation calls before one execute_sync run in submission order.
- Completion events only increment on submit_async. bind_completion_event / bind_dependent_events accept correct parameters, but last_signaled_value does not advance under execute_sync. The dependency-graph machinery is wired and confirmed via the async path; full happens-before enforcement under execute_sync is therefore partly inferred, not observed.
- ~128 loaded programs per PID (the 129th load fails; release() frees a slot), and the in-flight depth is capped at 127 (dispatch_semaphore_create(127) in the runtime).
- The two locks still apply. You can only submit MIL or single-procedure ANECIR netplist (the parser gate), and only aned-signed HWX loads (the kernel signature gate, error 0xe00002e2). e5rt does not change either.
6. Where this lives in the repo
| Layer | File |
|---|---|
| Underlying e5rt_* symbol typedefs | aneforge/_lib/e5rt_api.h |
| Project C ABI + the call sequence | aneforge/_lib/ane_e5rt_dispatch.mm -> libane_e5rt_dispatch.dylib |
| aneforge runtime wrapper | aneforge/_runtime.py |
The build command for the dylib is in the repo README.md (Install section 2).