Glossary

Terms a tool user is likely to meet across the ANEForge documentation, in alphabetical order.

A - D

ANE - Apple Neural Engine. A dedicated neural-network accelerator on Apple Silicon. fp16-dataplane, ~38 TFLOPS peak on M5 Pro (the marketing spec); the measured fp16 roofline anchor used throughout these docs is the lower sustained figure (~8.9 TFLOP/s, see capabilities.md). Programmable only through private framework APIs.

ANECIR - ANE Compiler Intermediate Representation. The third IR in Apple's compile pipeline (after MIL and MLIR, before HWX). Surfaces as the net.plist XML format consumed by _ANEInMemoryModelDescriptor modelWithNetworkDescription:. Accepts ops MIL rejects, e.g., Rsqrt, Inv, Sqrt, Log2, Exp2.

ANEF - Apple Neural Engine Format. Sometimes synonymous with HWX, sometimes refers to the broader ANE deployment format including weight blobs and metadata. The kANEFModel* constants in AppleNeuralEngine.framework use this prefix.

aned - /usr/libexec/aned. The root-privileged XPC daemon that does MIL -> HWX compile work. Receives requests via XPC, runs the four-IR compile pipeline, returns a compiled program handle. Maintains a per-PID HWX program cache.

aneuserd - /usr/libexec/aneuserd. The per-user broker that arbitrates between processes and aned. Runs as _neuralengine.

BLOBFILE - A MIL reference form for external weight files: tensor<fp16, [...]>(BLOBFILE(@path="weights.bin", offset=0)). Used by the legacy subprocess Path A. Incompatible with e5rt's in-process compile because there's no sibling weight file; e5rt requires inlined fp16 hex literals.
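
To make "inlined fp16 hex" concrete at the byte level, the sketch below packs Python floats as IEEE binary16 and hex-encodes the result. The helper name `fp16_hex` and the little-endian byte order are assumptions for illustration; the exact literal syntax that e5rt expects inside MIL text is not covered here.

```python
import struct

def fp16_hex(values):
    """Pack floats as little-endian IEEE binary16 and return them as a hex string.

    Hypothetical helper: the byte order is an assumption, not taken from ANEForge.
    """
    return struct.pack(f"<{len(values)}e", *values).hex()

# 1.0 -> 0x3C00, -2.0 -> 0xC000, 0.5 -> 0x3800 (each emitted low byte first)
print(fp16_hex([1.0, -2.0, 0.5]))  # 003c00c00038
```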

BNNS - Basic Neural Network Subroutines. Apple's CPU-side neural-network inference library (in Accelerate.framework). e5rt accepts device_mask=0x1 (BNNS_MASK) to route a compiled program to BNNS on the CPU rather than the ANE.

capability family - An ANE generation's op-capability tier, gated in the compiler by a MinimumFamily<N> trait on each op. M-series chips map to H-targets by M(n) = H(n + 12) (M1 = H13 = family 2, M5 = H17 = family 5), so capability is a property of the family, not of one chip. Walls (e.g. the trig floor on older families) are family-wide. ANEForge resolves the running family with detect_family and gates target names with cross_compile_check. See cross-chip.md.
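
The M(n) = H(n + 12) rule can be written as a one-line helper. This is a minimal sketch: `h_target_for_m` is a hypothetical name, it returns only the base target (no variants such as h17s), and it does not call ANEForge's detect_family or cross_compile_check.

```python
def h_target_for_m(n: int) -> str:
    """Return the base H-target name for M-series generation n, using M(n) = H(n + 12)."""
    if n < 1:
        raise ValueError("M-series generation must be >= 1")
    return f"h{n + 12}"

print(h_target_for_m(1))  # h13
print(h_target_for_m(5))  # h17
```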

circuit breaker - ANEForge's compile backoff rate-limiter (aneforge/_circuit.py, wired into Program.compile). After a failed compile the breaker paces the next compile by a short interval, a defensive backstop so the autotuner's burst of variant compiles cannot pile up. Warn-and-sleep by default; raises under ANEFORGE_COMPILE_BREAKER_STRICT. Exposes CompileBackoffError and reset_compile_breaker.
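
The warn-and-sleep versus strict behavior can be pictured with a small stand-in. This is a sketch under stated assumptions, not the code in aneforge/_circuit.py: the class `CompileBreaker`, the exception `BackoffRequired`, the 0.5 s interval, and the constructor flag standing in for the ANEFORGE_COMPILE_BREAKER_STRICT setting are all hypothetical.

```python
import time

class BackoffRequired(Exception):
    """Raised in strict mode instead of sleeping (hypothetical name)."""

class CompileBreaker:
    """Paces the next compile after a failed one (illustrative stand-in)."""

    def __init__(self, interval_s=0.5, strict=False,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval_s = interval_s
        self.strict = strict
        self._clock = clock
        self._sleep = sleep
        self._last_failure = None  # timestamp of the most recent failed compile

    def before_compile(self):
        """Sleep (or raise, in strict mode) if the last compile failed recently."""
        if self._last_failure is None:
            return 0.0
        remaining = self.interval_s - (self._clock() - self._last_failure)
        if remaining <= 0:
            return 0.0
        if self.strict:
            raise BackoffRequired(f"retry in {remaining:.3f}s")
        self._sleep(remaining)
        return remaining

    def record(self, ok):
        """Record the outcome of a compile; a success clears the backoff."""
        self._last_failure = None if ok else self._clock()

    def reset(self):
        self._last_failure = None

# Drive it with a fake clock so the example runs instantly.
now = [0.0]
slept = []
breaker = CompileBreaker(interval_s=0.5, clock=lambda: now[0], sleep=slept.append)
breaker.record(ok=False)   # a compile failed at t = 0.0
now[0] = 0.25              # the next compile is attempted at t = 0.25
breaker.before_compile()
print(slept)               # [0.25]
```

Injecting the clock and sleep functions keeps the pacing logic testable without real delays.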

E - H

Espresso - /System/Library/PrivateFrameworks/Espresso.framework/. Apple's internal ML runtime. CoreML, MPSGraph, and ANEForge's e5rt backend all sit on top of Espresso. Exports 235+ e5rt_* C symbols.

e5rt - Espresso's runtime C API family. 235 symbols including e5rt_e5_compiler_*, e5rt_program_library_*, e5rt_execution_stream_*. ANEForge's fast backend uses this. See e5rt-dispatch-reference.md.

entitlement - A code-signing capability Apple grants its own software (for example CoreML) to use certain private interfaces. ANEForge reaches the ANE through the e5rt path, which requires no entitlement, so it runs from an ordinary user process. "Without an entitlement" or "no entitlement needed" in these docs means exactly that. The one surface that does require an entitlement is Path B.

fp16 - IEEE 754 binary16 floating point. ANE's native compute type. The dataplane is fp16-only; fp32 is acceptable as an intermediate but not as an I/O type. ANE's fp16 has documented IEEE-754 deviations: NaN propagation broken, round-half-away-from-zero, gelu(0.0) = -0.000754. See capabilities.md.
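
The rounding deviation is easiest to see on an exact tie. The sketch below uses Python's `struct` half-precision format as the IEEE reference (round-half-to-even) and a hypothetical emulation, `to_fp16_half_away`, of the round-half-away-from-zero behavior described above. The emulation runs on the host and is illustrative only; it is not a measurement of ANE output.

```python
import struct

def to_fp16_ieee(x: float) -> float:
    """Round x to IEEE 754 binary16 (round-half-to-even) and widen back to a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp16_half_away(x: float) -> float:
    """Emulate round-half-away-from-zero for finite, in-range x (illustrative only)."""
    r = to_fp16_ieee(x)
    bits = struct.unpack("<H", struct.pack("<e", r))[0]
    # In the sign-magnitude encoding, bits + 1 is the next fp16 value farther from zero.
    nxt = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    is_tie = abs(r) < abs(x) and abs(x) - abs(r) == abs(nxt) - abs(x)
    return nxt if is_tie else r

# 2049 lies exactly halfway between the fp16 neighbours 2048 and 2050.
print(to_fp16_ieee(2049.0))       # 2048.0
print(to_fp16_half_away(2049.0))  # 2050.0
```

A host-side reference computed with IEEE rounding can therefore differ from a half-away result by one fp16 step on ties, which is worth allowing for when comparing outputs.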

H-target - A compiler target architecture name of the form hNN (h13-h17 and variants such as h17s/h17g/h17d). The compiler holds 28 such targets; the __TEXT code is byte-identical across them and only the per-target HAL data differs. M-series chips map by M(n) = H(n + 12). ANEForge selects one with compile(target='hNN') and af.estimate(out, target='hNN'), validated statically without that silicon present. See cross-chip.md.

HWX - The compiled, ready-to-load binary format for ANE programs. Mach-O variant with magic 0xBEEFFACE, per-chip-generation specialization. The kernel driver (AppleH16ANEInterface) enforces code signatures on HWX binaries - you cannot synthesize and load your own. ANEForge compiles via aned, which produces correctly-signed HWX.

I - M

IOSurface - Apple's cross-process shared-memory buffer object. Used as the input/output tensor surface for ANE workloads. Zero-copy aliasable to Metal MTLBuffer via EspressoANEIOSurface metalBufferWithDevice:. The basis for hybrid CPU/GPU/ANE pipelines.

Path A - _ANEInMemoryModel.evaluateWithQoS:. The canonical unentitled ANE dispatch surface. Compiles MIL or ANECIR netplist via aned, loads via loadWithQoS:, evaluates via evaluateWithQoS:. Verified on ANE silicon via powermetrics (1.4 W on ANE rail).

Path B - The entitlement-gated streaming surface using _ANESharedEvents, _ANEChainingRequest, and the always-nonzero intermediateBufferHandle. Used by Apple's production frameworks for streaming/chained execution. Blocked for third parties without com.apple.aned.private.allow. ANEForge reproduces equivalent behavior at the model level via paired _in/_out state tensors.

MIL - Model Intermediate Language. Apple's textual IR for ML programs. Subset of CoreML's emitted MIL. ANEForge's primary input format. The documented ios18 opset is the current target. See mil-primer.md.

MLComputePlan - Public macOS 14.4+ API (MLComputePlan loadContentsOfURL:configuration:completionHandler:). Returns per-op (preferred_device, supported_set, cost_weight) for any compiled .mlmodelc. Useful as a routing oracle for what Apple's compiler would choose; does NOT reflect direct Path A behavior (which bypasses the heuristic).

MLIR - Multi-Level Intermediate Representation. The second IR in Apple's compile pipeline (between MIL and ANECIR). Not directly emittable from user-space; gated by daemon capability negotiation.

MPSGraph - Metal Performance Shaders Graph. Apple's GPU graph framework, public. Routes mostly to Metal GPU; has private compilation descriptor selectors that can route to ANE (setPreferredDevice:, setEnableANECValidationWorkflow:, setPrintANEPlacementAnalysis:). Not used by ANEForge as a dispatch backend but useful as a GPU benchmark baseline.

N - Z

native (weight) streaming - Reading compressed weights directly into the multiply-accumulate datapath, dequantizing during DMA rather than expanding them to a dense fp16 buffer first. int8, int4-LUT, and sparse weights stream this way on the e5rt path, moving fewer bytes per dispatch and buying a bandwidth win on weight-bound layers. Blockwise-affine instead dequantizes in-program (a footprint lever, not a bandwidth one). Exposed via compile(compress=...). Whether a format streams is family-dependent (on the bandwidth-starved families only int4-LUT streams).

netplist - The XML form of ANECIR programs. The plist root contains a Layers array; each layer has Type, Inputs, Outputs, Params. Loaded via _ANEInMemoryModelDescriptor modelWithNetworkDescription:. Accepts hardware-native opcodes MIL rejects.
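
The layer layout described above can be mocked up with Python's standard `plistlib`. The key names (Layers, Type, Inputs, Outputs, Params) come from the description; everything else is an assumption: the tensor names "x" and "y", the use of string arrays for Inputs and Outputs, and the empty Params dict are placeholders, and a real netplist may need further keys before the descriptor accepts it.

```python
import plistlib

# One-layer network using the Rsqrt opcode mentioned under ANECIR.
# Tensor names and the empty Params dict are placeholders, not a verified schema.
net = {
    "Layers": [
        {"Type": "Rsqrt", "Inputs": ["x"], "Outputs": ["y"], "Params": {}},
    ]
}

xml_bytes = plistlib.dumps(net, fmt=plistlib.FMT_XML)
print(xml_bytes.decode("utf-8"))
```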

powermetrics - /usr/bin/powermetrics. macOS system tool that samples power consumption per hardware rail (CPU, GPU, ANE, etc.). Requires sudo. Used in ANEForge to confirm that Path A dispatches draw power on the ANE rail (1.4 W sustained during hot loops; idle = 0 W).

Program - In ane_e5rt.py, the compiled-and-loaded e5rt dispatch handle. Holds the compiled library, function, operation, input/output ports, and execution stream. Designed for compile-once-eval-many: prog.eval(inputs) per call has only memcpy + execute_sync + memcpy overhead.

programHandle - A 64-bit handle returned by _ANEInMemoryModel programHandle after compile + load. Identifies the model within aned's per-PID cache. Stable across submissions; lost across process boundaries (per-PID).

queueDepth - The per-program in-flight cap, hardcoded at 127 via dispatch_semaphore_create(127) inside _ANEInMemoryModel. The foundation of 127-deep async pipelining for decoder loops.
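
A counting semaphore used as an in-flight cap behaves like the sketch below, which uses Python threads in place of the dispatch_semaphore inside _ANEInMemoryModel. The class `InFlightLimiter` is hypothetical and the demo uses a depth of 4 rather than 127 so it finishes quickly; it illustrates the capping pattern only, not the framework's internals.

```python
import threading
import time

class InFlightLimiter:
    """Caps concurrently running work items with a counting semaphore."""

    def __init__(self, depth=127):
        self._sem = threading.BoundedSemaphore(depth)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def submit(self, work):
        """Block until a slot is free, then run work() on a worker thread."""
        self._sem.acquire()
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)

        def run():
            try:
                work()
            finally:
                with self._lock:
                    self.in_flight -= 1
                self._sem.release()  # free the slot for the next submission

        thread = threading.Thread(target=run)
        thread.start()
        return thread

limiter = InFlightLimiter(depth=4)
threads = [limiter.submit(lambda: time.sleep(0.01)) for _ in range(16)]
for t in threads:
    t.join()
print(limiter.peak <= 4)  # True
```

The submitter blocks only when every slot is taken, so a producer loop can stay up to `depth` submissions ahead of completion.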

resident state - Keeping tensor state (weights, optimizer moments) resident on the ANE across execute_sync calls with no host round-trip, by aliasing an op's output buffer onto its own input port (share_buffer). ANEForge uses this for Trainer(resident_state=True) and the default UnrolledTrainer, so during training the host supplies only the minibatch and learning rate and reads weights back at checkpoints. Reachable without an entitlement. See training.md.

SDPA - Scaled Dot-Product Attention. softmax(Q @ K^T * scale) @ V. The MIL op scaled_dot_product_attention is fused at the MIL level; ANE routing kicks in at heads >= 64 and seq >= ~496 for d = 64. Exposed as af.sdpa(...).
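
As a host-side reference for the formula, here is a single-head implementation in plain Python. It is a minimal sketch for checking small cases by hand: `sdpa` here is a local function, not af.sdpa, and the default scale of 1/sqrt(d) is the conventional choice rather than something confirmed for ANEForge.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa(q, k, v, scale=None):
    """softmax(Q @ K^T * scale) @ V for one head; q, k, v are lists of rows."""
    d = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(d)  # conventional default (assumption)
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of the V rows.
print(sdpa([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```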

TFLOPS - Tera floating-point operations per second (10^12 FLOP/s). ANE's peak fp16 throughput on M5 Pro is ~38 TFLOPS as a marketing spec. The measured fp16 roofline anchor (capabilities.md) is the lower sustained ~8.9 TFLOP/s. Path A's measured sustained throughput at conv3x3 [1, 256, 32, 32] is 12 TFLOPS, well past the CPU AMX/SME ceiling of ~5 TFLOPS.
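
To relate a TFLOPS figure to a single dispatch, the arithmetic below counts the floating-point operations in one conv3x3 over a [1, 256, 32, 32] input. The shape gives only the input, so the rest is assumed for illustration: 256 output channels, stride 1, and padding that keeps the output at 32x32. The helper `conv2d_flops` is hypothetical.

```python
def conv2d_flops(c_in, c_out, h_out, w_out, k_h, k_w, batch=1):
    """FLOPs for a dense 2-D convolution, counting each multiply-accumulate as 2."""
    return 2 * batch * c_out * h_out * w_out * c_in * k_h * k_w

# Assumed: c_out = 256 and a 32x32 output (stride 1, "same" padding).
flops = conv2d_flops(c_in=256, c_out=256, h_out=32, w_out=32, k_h=3, k_w=3)
micros_at_12_tflops = flops / 12e12 * 1e6

print(flops)                # 1207959552
print(micros_at_12_tflops)  # roughly 100.66 microseconds per dispatch
```

Under these assumptions a sustained 12 TFLOPS corresponds to roughly 0.1 ms of compute per convolution, before any dispatch overhead.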

VJP - Vector-Jacobian product, the backward rule for an op in ANEForge's autograd (aneforge/autograd.py). The VJP registry maps each forward op to a rule that builds its gradient as ordinary ANE Tensor ops, so the backward pass runs on the engine. Coverage spans the structural and linear-algebra ops (matmul/bmm, conv, pooling, reductions, shape ops), the common activations (relu/silu/gelu/tanh/sigmoid/softmax and variants), the math ops (exp/log/sqrt/rsqrt/erf/...), and the normalization layers (layer_norm, rms_norm, group_norm, l2_norm) - enough to train transformer, LLaMA-style, diffusion-UNet, CNN, and MLP graphs end to end. See training.md.
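
The registry idea can be shown on scalars. This is a toy sketch, not the contents of aneforge/autograd.py: `VJP_RULES`, `register_vjp`, `backward`, and the rule signature (inputs, output, upstream gradient) are hypothetical, and the rules operate on Python floats rather than ANE Tensor ops.

```python
VJP_RULES = {}  # op name -> rule(inputs, output, upstream_grad) -> tuple of input grads

def register_vjp(name):
    """Decorator that files a backward rule under its forward op name."""
    def wrap(fn):
        VJP_RULES[name] = fn
        return fn
    return wrap

@register_vjp("mul")
def _mul_vjp(inputs, output, g):
    a, b = inputs
    return (g * b, g * a)  # d(a*b)/da = b, d(a*b)/db = a

@register_vjp("tanh")
def _tanh_vjp(inputs, output, g):
    return (g * (1.0 - output * output),)  # d tanh(x)/dx = 1 - tanh(x)^2

def backward(op, inputs, output, g=1.0):
    """Look up the rule for op and apply it to the upstream gradient g."""
    return VJP_RULES[op](inputs, output, g)

print(backward("mul", (3.0, 4.0), 12.0))  # (4.0, 3.0)
print(backward("tanh", (0.0,), 0.0))      # (1.0,)
```

Each rule is written in terms of the same arithmetic as the forward pass, which mirrors how a rule built from ordinary tensor ops can itself run on the engine.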