Capabilities¶
What ANEForge can and can't do, by category. For the complete per-op, per-chip table see op-catalog.md.
Coverage summary¶
ANEForge classifies the full 187-op MIL vocabulary against a machine-checkable
registry (aneforge/_capabilities.py, serialized to capabilities.json and
CI-gated). Each op falls into one of a few practical buckets:
| status | meaning |
|---|---|
| fused | Frontend primitive - lowers to ANE-MIL and fuses into one program |
| bridge | Native sub-program reached through a graph cut |
| reachable | Compiles and runs on the e5rt path, not yet a named frontend primitive |
| walled | A genuine native-op or codegen gap (design around it) |
| not-authorable | Control-flow / list / state ops with no feed-forward form |
Query reachability programmatically with af.op_info(op),
af.device_status(op, chip), af.is_native(op, chip), af.ops_on(chip, status),
af.min_native_family(op), and af.walled_everywhere().
Datatypes¶
| dtype | I/O accepted | Compute | Notes |
|---|---|---|---|
| fp16 | Y | Y | Native ANE dataplane |
| int8 | Y | N (cast required) | Quantized weights via constexpr_affine_dequantize |
| uint8 | Y | N | Pixel / index input lanes |
| int16 | Y | N | Slow compile (~35 ms) |
| uint16 | Y | N | |
| fp32 | N | Y (intermediate) | Use cast(x, fp32) -> compute -> cast(result, fp16) |
| int32 | N | N | Rejected at MIL parse |
| bf16 | N | N | No path |
| bool | N at I/O | Y (computed) | Produced by comparison ops, consumed by select |
The hard precision limits to design around: I/O is fp16/int8/uint8/int16/uint16; fp32 is available only as a compute intermediate (cast in, cast out); int32, bf16, and boolean tensor I/O have no path.
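The cast-in / compute / cast-out pattern for fp32 intermediates can be illustrated off-device. The following is a minimal host-side sketch, not ANEForge code: it emulates fp16 and fp32 rounding with the standard `struct` module, and the helper names `to_fp16` and `to_fp32` are hypothetical. It shows why a long accumulation is a typical case for an fp32 intermediate.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

values = [1.0] * 4096  # fp16-representable inputs

# Accumulate entirely in fp16: every partial sum is rounded back to fp16.
fp16_sum = 0.0
for v in values:
    fp16_sum = to_fp16(fp16_sum + v)

# cast(x, fp32) -> compute -> cast(result, fp16)
acc = 0.0
for v in values:
    acc = to_fp32(acc + to_fp32(v))
mixed_sum = to_fp16(acc)

print(fp16_sum, mixed_sum)
```

The pure-fp16 accumulator stalls once the spacing between adjacent fp16 values exceeds the addend, while the fp32 intermediate keeps the running sum exact and only rounds once at the final cast.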
Dimension bounds¶
Tensor dimension caps are per-family and per-op-class - the limit rides the op's lowering, not the tensor, so a graph that fits an M5 may overflow an M1. Practical caps:
- Spatial / contraction extents (flat W/H, matmul-K, conv spatial): 16384 through A15, 65536 at A16.
- Channel axis: 65536 on both A14 and A16.
- Transpose extent: very large (offset-field width), effectively unbounded.
- Conv kernel width: kW <= 13 on M1, <= 15 on M5 (route wider kernels via space_to_depth).
tg.limit("max_tensor_dim" | "channel_extent" | "transpose_extent", family)
(where from aneforge import _targets as tg) returns the measured caps, and
tg.preflight classifies each op's axes (including internal reshape extents like
group_norm's [1, G, C/G, H*W]) so you catch an overflow before the compile. Caps
are generation-monotone (A13 == A14 <= A16). See cross-chip.md.
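To make the "internal reshape extents" point concrete, here is a small standalone sketch of the kind of check a preflight performs. It does not call tg.preflight; the function names, the 32-group count, and the rank-3 "flat" comparison shape are illustrative assumptions, and the cap is the 16384 spatial figure quoted above, treated as inclusive.

```python
def group_norm_extents(channels: int, groups: int, height: int, width: int) -> list[int]:
    """Rank-4 internal shape [1, G, C/G, H*W] used by the group_norm lowering."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return [1, groups, channels // groups, height * width]

def over_cap(extents: list[int], cap: int) -> list[tuple[int, int]]:
    """Return (axis, extent) pairs that exceed the cap (cap assumed inclusive)."""
    return [(axis, n) for axis, n in enumerate(extents) if n > cap]

CAP = 16384  # spatial / contraction cap quoted above for families through A15

# 640 channels at 64x64, assuming 32 groups.
rank4 = group_norm_extents(640, 32, 64, 64)           # [1, 32, 20, 4096]
flat = [1, 32, (640 // 32) * 64 * 64]                 # hypothetical rank-3 form

print(over_cap(rank4, CAP))  # no axis exceeds the cap
print(over_cap(flat, CAP))   # the merged axis overflows
```

The same tensor passes or fails depending on how the lowering factors its axes, which is why the cap has to be checked per op rather than per tensor.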
Shapes (static only)¶
Every program compiles for a concrete shape. Variable-shape (symbolic) programs are not reachable through the e5rt path ANEForge uses: a symbolic dim parses but fails to compile. For variable-length (for example LLM sequence) inference, either pad to a fixed maximum and compile once, or bucket a small set of lengths and dispatch the nearest (the compile cache makes repeat lengths free).
This is shape polymorphism only - a separate axis from data-dependent values, which remain unreachable (no on-engine gather / index-by-tensor-data), so a dynamic-slice KV-cache is likewise off the e5rt path.
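The bucketing strategy for variable-length inputs can be sketched as follows. This is host-side glue only, under assumptions: the bucket sizes, `pick_bucket`, `pad_to`, and the pad id are all hypothetical, and "nearest" is interpreted as the smallest bucket that still fits the sequence.

```python
import bisect

BUCKETS = (128, 256, 512, 1024, 2048)  # illustrative fixed lengths, sorted ascending

def pick_bucket(length: int, buckets: tuple[int, ...] = BUCKETS) -> int:
    """Smallest compiled length that can hold a sequence of `length` tokens."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[i]

def pad_to(tokens: list[int], size: int, pad_id: int = 0) -> list[int]:
    """Right-pad a token list to exactly `size` entries."""
    if len(tokens) > size:
        raise ValueError("sequence is longer than the target size")
    return list(tokens) + [pad_id] * (size - len(tokens))

tokens = list(range(300))
size = pick_bucket(len(tokens))
padded = pad_to(tokens, size)
print(size, len(padded))
```

Each distinct bucket size corresponds to one static-shape compile; padded positions still need to be masked out of attention by whatever masking the model already uses.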
Operator surface¶
For the complete native MIL op x device (M1-M5) table, see op-catalog.md. The notes below cover the practical details that table does not carry: per-family floors, the quantized / compressed / dynamic-kernel paths, group-norm, and the attention routes.
Per-family op floors¶
A handful of operators are native only at or above a given capability family;
tg.op_status(op, family) (with from aneforge import _targets as tg) returns
native / decompose / reject, and compile(target=...) substitutes or rejects
accordingly. The family-gated ops are:
- sin / cos - A15+; decompose to a portable fp16 polynomial on M1 via aneforge/special.py.
- texture-engine ops crop_resize / resample / affine / gather_hw - A14+.
- sq_after_reduce - A14+.
- dropout / random - A15+.
- global_argmax / global_argmin - A15+.
- topk / sort / dynamic_slice - reject at codegen on M1 (family 2).
Everything else (conv, matmul, pooling, elementwise, activations, softmax, norms, reductions, sqrt/rsqrt/erf/exp2/log2, SDPA, resize, tile, space<->channel, atan) is native on M1 and up. See cross-chip.md.
Linear (quantized variants)¶
Beyond matmul / linear / inner_product / batch_matmul, the quantized macros
are quantized_linear (dynamic_quantize -> matmul -> dynamic_dequantize),
dynamic_quantize_only, and dynamic_dequantize_only. The working int8 substitute
is matmul(constexpr_affine_dequantize(int8_blob, scale)); both per-tensor and
per-channel scale work.
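As a reference for what the int8 substitute computes, here is a plain-Python sketch of scale-only affine dequantization followed by a matrix-vector product. It is not the ANEForge implementation: `dequantize` and `matvec` are hypothetical helpers, no zero-point is modeled because the text above only mentions a scale, and the per-channel scale is assumed to run along the output-channel (row) axis.

```python
from typing import Sequence, Union

def dequantize(q: Sequence[Sequence[int]],
               scale: Union[float, Sequence[float]]) -> list[list[float]]:
    """w[o][i] = scale[o] * q[o][i]; a scalar scale is broadcast to every row."""
    if isinstance(scale, (int, float)):
        scales = [float(scale)] * len(q)      # per-tensor scale
    else:
        scales = [float(s) for s in scale]    # per-channel scale
    if len(scales) != len(q):
        raise ValueError("need one scale per output channel")
    return [[s * v for v in row] for row, s in zip(q, scales)]

def matvec(w: Sequence[Sequence[float]], x: Sequence[float]) -> list[float]:
    """Dense matrix-vector product, one dot product per output channel."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

q = [[2, -4], [1, 3]]                    # int8 weight blob, shape [out, in]
w_pc = dequantize(q, [0.5, 0.25])        # per-channel
w_pt = dequantize(q, 0.5)                # per-tensor
y = matvec(w_pc, [1.0, 2.0])
print(w_pc, y)
```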
Convolution (fused + quantized + dynamic-kernel)¶
The base conv (1D/2D/3D) and conv_transpose/deconv are in op-catalog.md;
ANEForge adds fused composites (conv1x1, conv1x1_chain, conv_relu,
conv_gelu, conv1x1_batch_norm_relu, conv1x1_batch_norm_silu,
conv1x1_add_relu, conv1x1_project_add) and a quantized chain
conv1x1_int8_chain.
Dynamic-kernel conv (af.dynamic_conv): a conv whose weight is a runtime
input tensor rather than a baked constant, enabling hypernetwork /
weight-generating inference. Reachable and correct at batch 1 only;
batch >= 2 does not compile, so af.dynamic_conv rejects B >= 2
at build time. A constant-weight conv at any batch is unaffected. (This is why the
trainable conv uses an im2col path rather than a native dynamic-weight conv.)
Normalization¶
layer_norm / instance_norm / l2_norm are in op-catalog.md; ANEForge exposes
group_norm and rms_norm as composites. The channel-axis layer-norm is supported
as a fused "transpose -> instancenorm_1d -> transpose" route.
group_norm lowers at rank 4 ([1, G, C/groups, H*W]) with a chained reduce, so
every axis stays under the per-axis cap. This removes the former large-feature-map
wall: the Stable Diffusion 1.5 wall shapes (640ch@64, 512ch@128) now compile and
run (relerr ~ 0.002 vs fp32), and the unblock is family-wide.
Attention¶
Three routes to attention, with different tradeoffs:
| Path | Use when |
|---|---|
| af.sdpa(q, k, v) (native fused SDPA) | Any shape; runs fused attention on the ANE, including is_causal=True and the decode shape (seq_q != seq_kv) |
| MIL scaled_dot_product_attention | Apple's lowering; ANE-routes only above heads >= 64, seq >= ~496, d = 64 |
| Manual matmul -> softmax -> matmul | Masked attention; always works at any shape |
af.sdpa is the recommended route. Causal masking is native end-to-end
(is_causal=True), validated at cos 1.0 versus a masked-softmax reference. It is
reliable for sequence length S <= 2048 (SDPA_NATIVE_MAX_SEQ); above that the op
decomposes (which carries no mask). Because it also handles the decode shape, a full
autoregressive GPT/LLaMA generation loop (causal-SDPA prefill, then per-step
decode-shape SDPA) runs on the engine and matches numpy token-for-token
(examples/gpt_generate_ane.py).
Resident KV-cache decode. The decode KV-cache can stay resident on the engine
across steps so it never round-trips to the host: the masked positional write runs
in the graph, compile_multi emits the hidden state plus every cache output, and
Program.share_buffer aliases each cache output onto its own input. Works for a
single layer (TinyDecoderANE.generate_resident) and for a full L-layer decoder
with all 2L caches resident (examples/gpt_multilayer_resident.py). On this path
the decode attention is decomposed (cheap at seq_q=1) since compile_multi cannot
take the native-SDPA graph cut; the resident-cache bandwidth saving is the goal.
For masked attention generally, use the manual decomposition with a causal mask via
comparison + select (ane.causal_softmax).
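The manual decomposition with a comparison + select mask can be written out as a small numerical reference. This is a pure-Python sketch of the math only, not ane.causal_softmax itself; the function names are illustrative, and using negative infinity as the masked value is an assumption (an fp16 graph may well use a large finite negative instead).

```python
import math

def causal_softmax(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax where position i may only attend to positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        # comparison (j <= i) + select (score or -inf)
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                           # finite: j = 0 is always kept
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def causal_attention(q, k, v):
    """matmul -> causal softmax -> matmul for a single head, shapes [S, d]."""
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q]
    probs = causal_softmax(scores)
    return [[sum(p * vj[c] for p, vj in zip(row, v)) for c in range(len(v[0]))]
            for row in probs]

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 20.0]]
out = causal_attention(q, k, v)
print(out[0])  # the first token can only see itself
```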
Softmax¶
softmax is native; ANEForge adds the softmax_nd, log_softmax,
causal_softmax, and threshold_softmax composites.
Layout (gotchas)¶
Layout ops (reshape, transpose, flatten, unflatten, tile, pad, slice,
gather, concat, split, squeeze, expand_dims) are in op-catalog.md with
per-chip status. Two gotchas to know: flatten is NCHW-only, and concat expects
exactly 2 inputs (more than 2 need decomposition).
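One way to decompose an N-way concat into the 2-input form is a left fold over the inputs. The sketch below uses Python lists as stand-ins for tensors; `concat2` is a hypothetical placeholder for the 2-input concat, not an ANEForge call.

```python
from functools import reduce

def concat2(a: list, b: list) -> list:
    """Stand-in for the 2-input concat (lists model a single concat axis)."""
    return a + b

def concat_many(tensors: list[list]) -> list:
    """Chain pairwise concats: ((t0 + t1) + t2) + ... using N - 1 binary ops."""
    if not tensors:
        raise ValueError("need at least one input")
    return reduce(concat2, tensors)

print(concat_many([[1], [2, 3], [4]]))
```

A balanced tree of pairwise concats is an alternative that shortens the dependency chain; which form compiles better is not something this sketch can determine.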
Activation modes¶
ReLU, Tanh, Sigmoid, LeakyReLU, ELU, ReLU6, GELU (exact), and clip(alpha, beta)
are all available as native activations.
Specialized fused routes¶
fused_qkv (QKV projection in one call), fused_qkv_norm_proj,
dynamic_matmul_packed, dynamic_matmul_packed_relu, goc_scalar_scale_bias, and
the dynamic_goc_* family.
Image input¶
af.image_input(uint8) accepts a byte image and dequantizes it on the engine,
byte-identical to a host-side convert and saving the host conversion latency
(~2 ms/frame at 1080p). This is the terminal image-input form.
Direct 4CC interchange input (a pixel buffer fed straight from a camera or video
surface with no host RGB convert) is a no-go on the e5rt path - it needs
the entitled CoreML route. Use af.image_input(uint8) instead.
Compressed weight streaming¶
af.compile(out, compress=None | "int8" | "int4" | "sparse" | "blockwise" | "auto")
emits compressed weights that stream (dequant-during-DMA) rather than fold to
dense fp16, so a weight-bandwidth-bound op gets a real eval-latency win, not just a
smaller file. compress=None (the default) is byte-identical to fp16. int4-LUT and
sparse are accuracy-gated (int4 falls back int4->int8->fp16 within compress_atol).
All constexpr_* quant forms, including blockwise-affine, are reachable on the e5rt path.
Which formats stream natively is per-family (tg.native_streams(family), with
from aneforge import _targets as tg):
| family / chip | int4-LUT | int8-affine | sparse | blockwise |
|---|---|---|---|---|
| A13 (M1) | stream (2.37x) | fold | stream | fold |
| A14 (M2) | stream | stream | stream | fold |
| A15+ (M3, M4, M5) | stream | stream | stream | stream |
So compress="auto" is family-aware: on M1 it considers int4-LUT and sparse (the
native streams there) and skips int8/blockwise, while a budget-rejected int4 falls
back to fp16 rather than a folding encoding (which costs accuracy for zero bandwidth
win). Explicit single-mode knobs are never filtered. End to end, compression is
primarily a footprint/capacity lever (~4x smaller weights); the per-matmul win
dilutes through norms, attention, and dispatch in full models.
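The family-aware behavior described above can be summarized as a lookup over the streaming table. This sketch transcribes that table into a plain dict; it does not call tg.native_streams, and `auto_candidates` and `int4_fallback_chain` are hypothetical names for illustration only.

```python
ORDER = ("int4", "int8", "sparse", "blockwise")

# Formats that stream natively, transcribed from the per-family table above.
NATIVE_STREAMS = {
    "A13": {"int4", "sparse"},
    "A14": {"int4", "int8", "sparse"},
    "A15+": {"int4", "int8", "sparse", "blockwise"},
}

def auto_candidates(family: str) -> list[str]:
    """Encodings an 'auto' policy would consider: only those that stream."""
    return [mode for mode in ORDER if mode in NATIVE_STREAMS[family]]

def int4_fallback_chain(family: str) -> list[str]:
    """int4 -> int8 -> fp16, skipping int8 where it would fold instead of stream."""
    chain = ["int4"]
    if "int8" in NATIVE_STREAMS[family]:
        chain.append("int8")
    chain.append("fp16")
    return chain

for family in NATIVE_STREAMS:
    print(family, auto_candidates(family), int4_fallback_chain(family))
```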
Training (on-engine autograd)¶
ANEForge trains on the ANE: aneforge/autograd.py runs forward, backward, and the
optimizer update as ANE graph ops through the same e5rt path.
af.parameter makes a weight a graph input (no recompile per step); a VJP
registry builds the backward graph. Trainer (with device_optimizer,
resident_state) and UnrolledTrainer (K steps in one fused program, resident by
default) drive training; full MNIST reaches 97.79% with optimizer state resident
across steps. The gradient vocabulary covers MLPs, CNNs (trainable conv built from
primitives, since native conv needs a baked weight), and transformers. fp16
optimizer state suffices; paired-fp16 is not needed.
One codegen wall to know: mul(reduce_output, 0.0) (a reduce result times exactly
zero) fails to compile and is sidestepped with a subtract-based zero. See
training.md.
Cost estimation (measurement-free)¶
af.estimate(out, target='hXX') returns an analytic per-chip latency from the
compiler's own cycles -> roofline -> wall-time model
(latency ~ overhead + max(compute, memory)), and af.project_peak(arch) gives a
generational fp16-peak. Both cover all targets from one extraction (no on-device
measurement) and are validated on the two measured chips (M1: 3.25 TFLOP/s, 9 GB/s,
220 us floor; M5: 8.9 TFLOP/s, 57 GB/s, 110 us floor). The estimator backs the
lossless opt='routes' pass. project_peak's ~5x M5/M1 is a peak-compute ceiling;
measured typical latency speedup is 2.3-3.3x.
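The roofline form latency ~ overhead + max(compute, memory) is easy to evaluate by hand. The sketch below plugs in the measured M1 and M5 figures quoted above; it is not af.estimate, the function name `estimate_us` is hypothetical, and treating the quoted "floor" as the overhead term is an assumption.

```python
# Measured figures quoted above: peak fp16 TFLOP/s, bandwidth in GB/s, floor in us.
CHIPS = {
    "M1": {"tflops": 3.25, "gbps": 9.0, "floor_us": 220.0},
    "M5": {"tflops": 8.9, "gbps": 57.0, "floor_us": 110.0},
}

def estimate_us(flops: float, bytes_moved: float, chip: str) -> float:
    """latency ~ overhead + max(compute time, memory time), in microseconds."""
    c = CHIPS[chip]
    compute_us = flops / (c["tflops"] * 1e12) * 1e6
    memory_us = bytes_moved / (c["gbps"] * 1e9) * 1e6
    return c["floor_us"] + max(compute_us, memory_us)

# Example: a 4096 x 4096 fp16 matrix-vector product (weight traffic dominates).
n = 4096
flops = 2 * n * n          # one multiply-add per weight
weight_bytes = 2 * n * n   # fp16 weights, 2 bytes each

for chip in CHIPS:
    print(chip, round(estimate_us(flops, weight_bytes, chip), 1), "us")
```

For this shape the memory term is far larger than the compute term on both chips, which is the situation where streaming compressed weights pays off.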
Operators with no hardware backing¶
Some operators have no ANE hardware support and either reject outright or require algebraic decomposition over supported atoms:
| Op | Status | Notes |
|---|---|---|
| non_zero | Hardware-blocked | No decomposition |
| sliding_windows | Hardware-blocked | No decomposition |
| reduce_prod | Hardware-blocked | No decomposition |
| cumsum | Hardware-blocked | No decomposition |
| bitwise_and/or/xor/not | Rejected | Bit-level ops not in fp16 surface |
| Generic boolean tensor I/O | Rejected | Bool tensors are compute-only |
| trig/hyperbolic acos/asin/sinh/... | Rejected | No hardware support |
| logical and/or/xor | Rejected | No hardware support |
| nan_to_num | Broken | ANE does not propagate NaN |
non_maximum_suppression is reachable as a MIL op (presence-only) but offloads to
CPU/GPU rather than running on the ANE, so do not count it as an ANE-native layer.
The image-pipeline ops have nuanced placement: affine, resize_bilinear, and
upsample_bilinear ship as fused, on-engine ops; crop_resize stays
not-implemented-on-ANE.
Quick verification reference¶
- Direct dispatch reaches ANE silicon (~1.4 W on the ANE rail under powermetrics), not a silent CPU fallback.
- Operator outputs are bit-equal (or within fp16 noise) to numpy references.
- Per-call eval is ~80-110 us at the C-API level once compiled.
- The e5rt event graph provides true happens-before serialization across streams.
To confirm ANE placement rather than a silent CPU fallback, watch the ANE power rail
under powermetrics while a compiled model runs: a genuine dispatch draws ~1.4 W on
that rail (a CPU fallback does not), which is the same check the verification suite
uses.