Capabilities

What ANEForge can and can't do, by category. For the complete per-op, per-chip table see op-catalog.md.

Coverage summary

ANEForge classifies the full 187-op MIL vocabulary against a machine-checkable registry (aneforge/_capabilities.py, serialized to capabilities.json and CI-gated). Each op falls into one of five practical buckets:

| Status | Meaning |
| --- | --- |
| fused | Frontend primitive - lowers to ANE-MIL and fuses into one program |
| bridge | Native sub-program reached through a graph cut |
| reachable | Compiles and runs on the e5rt path, not yet a named frontend primitive |
| walled | A genuine native-op or codegen gap (design around it) |
| not-authorable | Control-flow / list / state ops with no feed-forward form |

Query reachability programmatically with af.op_info(op), af.device_status(op, chip), af.is_native(op, chip), af.ops_on(chip, status), af.min_native_family(op), and af.walled_everywhere().

Datatypes

| dtype | I/O accepted | Compute | Notes |
| --- | --- | --- | --- |
| fp16 | Y | Y | Native ANE dataplane |
| int8 | Y | N (cast required) | Quantized weights via constexpr_affine_dequantize |
| uint8 | Y | N | Pixel / index input lanes |
| int16 | Y | N | Slow compile (~35 ms) |
| uint16 | Y | N | |
| fp32 | N | Y (intermediate) | Use cast(x, fp32) -> compute -> cast(result, fp16) |
| int32 | N | N | Rejected at MIL parse |
| bf16 | N | N | No path |
| bool | N at I/O | Y (computed) | Produced by comparison ops, consumed by select |

The hard precision limits to design around: I/O is fp16/int8/uint8/int16/uint16; fp32 is available only as a compute intermediate (cast in, cast out); int32, bf16, and boolean tensor I/O have no path.
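The datatype rules above can be captured as a small lookup for validating a model's tensors before building a graph. This is a minimal sketch: `classify_dtype` and the two sets are hypothetical names that only restate the table, not part of ANEForge.

```python
# Hypothetical lookup that restates the datatype table; not an ANEForge API.
IO_DTYPES = frozenset({"fp16", "int8", "uint8", "int16", "uint16"})
COMPUTE_DTYPES = frozenset({"fp16", "fp32", "bool"})


def classify_dtype(dtype: str) -> str:
    """Return where a dtype is usable: at I/O, in compute, both, or nowhere."""
    io_ok = dtype in IO_DTYPES
    compute_ok = dtype in COMPUTE_DTYPES
    if io_ok and compute_ok:
        return "io+compute"
    if io_ok:
        return "io-only"       # needs a cast to fp16 before compute
    if compute_ok:
        return "compute-only"  # fp32 intermediates, computed bools
    return "no-path"           # int32, bf16
```

A dtype that comes back as "no-path" has to be converted on the host before it reaches the program.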

Dimension bounds

Tensor dimension caps are per-family and per-op-class - the limit rides the op's lowering, not the tensor, so a graph that fits an M5 may overflow an M1. Practical caps:

  • Spatial / contraction extents (flat W/H, matmul-K, conv spatial): 16384 through A15, 65536 at A16.
  • Channel axis: 65536 on both A14 and A16.
  • Transpose extent: very large (offset-field width), effectively unbounded.
  • Conv kernel width: kW <= 13 on M1, <= 15 on M5 (route wider kernels via space_to_depth).

tg.limit("max_tensor_dim" | "channel_extent" | "transpose_extent", family) (where from aneforge import _targets as tg) returns the measured caps, and tg.preflight classifies each op's axes (including internal reshape extents like group_norm's [1, G, C/G, H*W]) so you catch an overflow before the compile. Caps are generation-monotone (A13 == A14 <= A16). See cross-chip.md.

Shapes (static only)

Every program compiles for a concrete shape. Variable-shape (symbolic) programs are not reachable through the e5rt path ANEForge uses: a symbolic dim parses but fails to compile. For variable-length (for example LLM sequence) inference, either pad to a fixed maximum and compile once, or bucket a small set of lengths and dispatch the nearest (the compile cache makes repeat lengths free).

This is shape polymorphism only - a separate axis from data-dependent values, which remain unreachable (no on-engine gather / index-by-tensor-data), so a dynamic-slice KV-cache is likewise off the e5rt path.
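The bucketing strategy for variable-length inputs can be sketched on the host side as follows. The bucket sizes, `pick_bucket`, and `pad_to` are illustrative assumptions; the sketch picks the smallest bucket that fits and pads up to it, so each bucket length corresponds to one compiled program.

```python
from bisect import bisect_left

# Hypothetical bucket set; choose sizes that match your workload.
BUCKETS = (128, 256, 512, 1024, 2048)


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` tokens."""
    if length < 1:
        raise ValueError("length must be positive")
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to(tokens, target, pad_id=0):
    """Right-pad a token list to the fixed compiled length."""
    if len(tokens) > target:
        raise ValueError("sequence longer than target length")
    return list(tokens) + [pad_id] * (target - len(tokens))
```

Padding positions still need to be masked out of attention on the model side; this helper only fixes the shape.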

Operator surface

For the complete native MIL op x device (M1-M5) table, see op-catalog.md. The notes below cover the practical details that table does not carry: per-family floors, the quantized / compressed / dynamic-kernel paths, group-norm, and the attention routes.

Per-family op floors

A handful of operators are native only at or above a given capability family; tg.op_status(op, family) (with from aneforge import _targets as tg) returns native / decompose / reject, and compile(target=...) substitutes or rejects accordingly. The family-gated ops are:

  • sin / cos - A15+; decompose to a portable fp16 polynomial on M1 via aneforge/special.py.
  • texture-engine ops crop_resize / resample / affine / gather_hw - A14+.
  • sq_after_reduce - A14+.
  • dropout / random - A15+.
  • global_argmax / global_argmin - A15+.
  • topk / sort / dynamic_slice - reject at codegen on M1 (family 2).

Everything else (conv, matmul, pooling, elementwise, activations, softmax, norms, reductions, sqrt/rsqrt/erf/exp2/log2, SDPA, resize, tile, space<->channel, atan) is native on M1 and up. See cross-chip.md.
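To illustrate what a polynomial decomposition of sin looks like, here is a minimal sketch: range reduction into [-pi/2, pi/2] followed by a degree-7 odd Taylor polynomial in Horner form. It is not the polynomial used in aneforge/special.py (the source does not list its coefficients), and it evaluates in Python floats rather than fp16.

```python
import math


def sin_poly(x: float) -> float:
    """Approximate sin(x) with range reduction plus an odd polynomial."""
    # Reduce to r in [-pi, pi] by removing whole periods.
    k = round(x / (2.0 * math.pi))
    r = x - k * 2.0 * math.pi
    # Fold into [-pi/2, pi/2] using sin(pi - r) == sin(r).
    if r > math.pi / 2:
        r = math.pi - r
    elif r < -math.pi / 2:
        r = -math.pi - r
    r2 = r * r
    # r - r^3/6 + r^5/120 - r^7/5040, evaluated in Horner form.
    return r * (1.0 + r2 * (-1.0 / 6.0 + r2 * (1.0 / 120.0 + r2 * (-1.0 / 5040.0))))
```

The truncation error of this polynomial on the folded interval is below fp16 resolution near 1.0, which is why a short polynomial is a plausible substitute for the native op.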

Linear (quantized variants)

Beyond matmul / linear / inner_product / batch_matmul, the quantized macros are quantized_linear (dynamic_quantize -> matmul -> dynamic_dequantize), dynamic_quantize_only, and dynamic_dequantize_only. The working int8 substitute is matmul(constexpr_affine_dequantize(int8_blob, scale)); both per-tensor and per-channel scale work.
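The arithmetic behind the dequantize-then-matmul substitute can be sketched in plain Python. This assumes the usual affine form w = scale * (q - zero_point) with a zero point of 0 by default; `affine_dequantize` and `matvec` are hypothetical helpers for illustration, not ANEForge functions.

```python
def affine_dequantize(q_rows, scale, zero_point=0):
    """Dequantize int8 rows; scale may be one float or one float per row."""
    n = len(q_rows)
    scales = list(scale) if isinstance(scale, (list, tuple)) else [scale] * n
    zps = list(zero_point) if isinstance(zero_point, (list, tuple)) else [zero_point] * n
    if len(scales) != n or len(zps) != n:
        raise ValueError("per-channel parameters must match the row count")
    return [[s * (q - z) for q in row] for row, s, z in zip(q_rows, scales, zps)]


def matvec(w, x):
    """Dense matrix-vector product over lists of floats."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


q = [[2, -4], [10, 0]]
per_tensor = affine_dequantize(q, 0.5)            # one scale for all rows
per_channel = affine_dequantize(q, [0.5, 0.25])   # one scale per output row
y = matvec(per_tensor, [1.0, 2.0])
```

Per-channel scaling here means one scale per output row of the weight matrix.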

Convolution (fused + quantized + dynamic-kernel)

The base conv (1D/2D/3D) and conv_transpose/deconv are in op-catalog.md; ANEForge adds fused composites (conv1x1, conv1x1_chain, conv_relu, conv_gelu, conv1x1_batch_norm_relu, conv1x1_batch_norm_silu, conv1x1_add_relu, conv1x1_project_add) and a quantized chain conv1x1_int8_chain.

Dynamic-kernel conv (af.dynamic_conv): a conv whose weight is a runtime input tensor rather than a baked constant, enabling hypernetwork / weight-generating inference. Reachable and correct at batch 1 only; batch >= 2 does not compile, so af.dynamic_conv rejects B >= 2 at build time. A constant-weight conv at any batch is unaffected. (This is why the trainable conv uses an im2col path rather than a native dynamic-weight conv.)
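The im2col idea mentioned above turns a convolution into a patch-extraction step followed by a matmul, so the weight can be an ordinary runtime input. A minimal single-channel 1D sketch (stride 1, no padding) follows; the function names are illustrative and the source does not describe the exact form ANEForge's trainable conv takes.

```python
def im2col_1d(x, k):
    """Extract all length-k sliding patches from x (valid, stride 1)."""
    if k < 1 or k > len(x):
        raise ValueError("kernel size must be in 1..len(x)")
    return [x[i:i + k] for i in range(len(x) - k + 1)]


def conv1d_via_matmul(x, w):
    """Cross-correlate x with w as (patch matrix) @ (weight vector)."""
    patches = im2col_1d(x, len(w))
    return [sum(p * q for p, q in zip(patch, w)) for patch in patches]
```

Because the weight only appears as the right-hand operand of a matmul, it does not need to be baked into the program as a constant.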

Normalization

layer_norm / instance_norm / l2_norm are in op-catalog.md; ANEForge exposes group_norm and rms_norm as composites. The channel-axis layer-norm is supported as a fused "transpose -> instancenorm_1d -> transpose" route.

group_norm lowers at rank 4 ([1, G, C/G, H*W]) with a chained reduce, so every axis stays under the per-axis cap. This removes the former large-feature-map wall: the Stable Diffusion 1.5 wall shapes (640ch@64, 512ch@128) now compile and run (relative error ~ 0.002 vs fp32), and the unblock is family-wide.
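The effect of the rank-4 lowering on per-axis extents can be checked with simple arithmetic. The sketch below is hypothetical and separate from tg.preflight; it assumes 32 groups (the source does not state the group count), reads "@64" as a 64x64 feature map, and uses the 16384 spatial cap from the dimension bounds section.

```python
def group_norm_extents(channels, groups, height, width):
    """Internal rank-4 extents [1, G, C/G, H*W] of the group_norm lowering."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return (1, groups, channels // groups, height * width)


def axes_over_cap(extents, cap):
    """Indices of axes whose extent exceeds the per-axis cap."""
    return [i for i, extent in enumerate(extents) if extent > cap]


CAP = 16384  # spatial cap quoted for the older families
sd_640 = group_norm_extents(640, 32, 64, 64)      # (1, 32, 20, 4096)
sd_512 = group_norm_extents(512, 32, 128, 128)    # (1, 32, 16, 16384)

# For contrast only: a hypothetical form that merges the last two axes.
merged_640 = (sd_640[0], sd_640[1], sd_640[2] * sd_640[3])
```

Under these assumptions both rank-4 shapes fit the cap, while the merged form of the 640-channel case does not.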

Attention

Three routes to attention, with different tradeoffs:

| Path | Use when |
| --- | --- |
| af.sdpa(q, k, v) (native fused SDPA) | Any shape; runs fused attention on the ANE, including is_causal=True and the decode shape (seq_q != seq_kv) |
| MIL scaled_dot_product_attention | Apple's lowering; ANE-routes only when heads >= 64, seq >= ~496, and d = 64 |
| Manual matmul -> softmax -> matmul | Masked attention; always works at any shape |

af.sdpa is the recommended route. Causal masking is native end-to-end (is_causal=True), validated at cosine similarity 1.0 against a masked-softmax reference. It is reliable for sequence length S <= 2048 (SDPA_NATIVE_MAX_SEQ); above that the op decomposes, and the decomposed form carries no mask. Because it also handles the decode shape, a full autoregressive GPT/LLaMA generation loop (causal-SDPA prefill, then per-step decode-shape SDPA) runs on the engine and matches the numpy reference token for token (examples/gpt_generate_ane.py).

Resident KV-cache decode. The decode KV-cache can stay resident on the engine across steps so it never round-trips to the host: the masked positional write runs in the graph, compile_multi emits the hidden state plus every cache output, and Program.share_buffer aliases each cache output onto its own input. Works for a single layer (TinyDecoderANE.generate_resident) and for a full L-layer decoder with all 2L caches resident (examples/gpt_multilayer_resident.py). On this path the decode attention is decomposed (cheap at seq_q=1) since compile_multi cannot take the native-SDPA graph cut; the resident-cache bandwidth saving is the goal.
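A fixed-shape positional write is commonly expressed as a blend with a one-hot mask, which avoids any data-dependent indexing. The sketch below shows that formulation in plain Python; it is an assumption about the general technique, since the source does not spell out how ANEForge builds its masked write.

```python
def masked_write(cache, new_row, pos):
    """Return a cache with row `pos` replaced by new_row, via a one-hot blend.

    cache is [T][D], new_row is [D]. Every row is computed as
    mask * new + (1 - mask) * old, so the op has a static shape.
    """
    if not 0 <= pos < len(cache):
        raise ValueError("pos out of range")
    mask = [1.0 if t == pos else 0.0 for t in range(len(cache))]
    return [
        [m * n + (1.0 - m) * c for c, n in zip(row, new_row)]
        for row, m in zip(cache, mask)
    ]
```

In a resident setup the returned cache would feed the next step's cache input, which is the aliasing role the text assigns to Program.share_buffer.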

For masked attention generally, use the manual decomposition with a causal mask via comparison + select (ane.causal_softmax).
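The comparison + select pattern for a causal mask can be sketched as a reference implementation in plain Python. This shows the math only, not ANEForge's causal_softmax composite; using the most negative finite fp16 value as the fill is an assumption.

```python
import math

NEG_LARGE = -65504.0  # most negative finite fp16 value, used as the mask fill


def causal_softmax(scores):
    """Row-wise softmax over a square [S][S] score matrix with a causal mask.

    Row i may attend only to columns j <= i.
    """
    out = []
    for i, row in enumerate(scores):
        # Comparison (j <= i) followed by select between score and fill.
        masked = [s if j <= i else NEG_LARGE for j, s in enumerate(row)]
        peak = max(masked)  # subtract the row max for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out
```

Subtracting the row maximum before exponentiating keeps the unmasked entries in a safe range, which matters more in fp16 than in this float reference.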

Softmax

softmax is native; ANEForge adds the softmax_nd, log_softmax, causal_softmax, and threshold_softmax composites.

Layout (gotchas)

Layout ops (reshape, transpose, flatten, unflatten, tile, pad, slice, gather, concat, split, squeeze, expand_dims) are in op-catalog.md with per-chip status. Two gotchas to know: flatten is NCHW-only, and concat expects exactly 2 inputs (more than 2 need decomposition).
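Decomposing a many-input concat into 2-input concats is a left fold. In the sketch below, `concat2` is a stand-in that joins Python lists; in a real graph it would be whatever 2-input concat call the frontend provides.

```python
from functools import reduce


def concat2(a, b):
    """Stand-in for a 2-input concat op (here: list concatenation)."""
    return a + b


def concat_many(parts):
    """Concatenate N >= 1 inputs using only 2-input concats (N - 1 of them)."""
    if not parts:
        raise ValueError("need at least one input")
    return reduce(concat2, parts)
```

A left fold produces a chain of N - 1 ops; a balanced tree uses the same number of ops with a shallower dependency depth.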

Activation modes

ReLU, Tanh, Sigmoid, LeakyReLU, ELU, ReLU6, GELU (exact), and clip(alpha, beta) are all available as native activations.

Specialized fused routes

The specialized fused routes are fused_qkv (QKV projection in one call), fused_qkv_norm_proj, dynamic_matmul_packed, dynamic_matmul_packed_relu, goc_scalar_scale_bias, and the dynamic_goc_* family.

Image input

af.image_input(uint8) accepts a byte image and dequantizes it on the engine, byte-identical to a host-side convert and saving the host conversion latency (~2 ms/frame at 1080p). This is the terminal image-input form.

Direct 4CC interchange input (a pixel buffer fed straight from a camera or video surface with no host RGB convert) is a no-go on the e5rt path - it needs the entitled CoreML route. Use af.image_input(uint8) instead.

Compressed weight streaming

af.compile(out, compress=None | "int8" | "int4" | "sparse" | "blockwise" | "auto") emits compressed weights that stream (dequant-during-DMA) rather than fold to dense fp16, so a weight-bandwidth-bound op gets a real eval-latency win, not just a smaller file. compress=None (the default) is byte-identical to fp16. int4-LUT and sparse are accuracy-gated (int4 falls back int4->int8->fp16 within compress_atol). All constexpr_* quant forms, including blockwise-affine, are reachable on the e5rt path.

Which formats stream natively is per-family (tg.native_streams(family), with from aneforge import _targets as tg):

| family / chip | int4-LUT | int8-affine | sparse | blockwise |
| --- | --- | --- | --- | --- |
| A13 (M1) | stream (2.37x) | fold | stream | fold |
| A14 (M2) | stream | stream | stream | fold |
| A15+ (M3, M4, M5) | stream | stream | stream | stream |

So compress="auto" is family-aware: on M1 it considers int4-LUT and sparse (the native streams there) and skips int8/blockwise, while a budget-rejected int4 falls back to fp16 rather than a folding encoding (which costs accuracy for zero bandwidth win). Explicit single-mode knobs are never filtered. End to end, compression is primarily a footprint/capacity lever (~4x smaller weights); the per-matmul win dilutes through norms, attention, and dispatch in full models.
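The family-aware fallback described above can be sketched as a lookup over the streaming table. `NATIVE_STREAMS` and `int4_fallback_chain` are hypothetical names that restate the documented behavior; they are not the real tg.native_streams implementation.

```python
# Formats that stream natively per family, copied from the table above.
NATIVE_STREAMS = {
    "A13": {"int4", "sparse"},
    "A14": {"int4", "int8", "sparse"},
    "A15+": {"int4", "int8", "sparse", "blockwise"},
}


def int4_fallback_chain(family):
    """Order in which an int4 request degrades when the accuracy budget fails.

    A format that would fold on this family is skipped, because folding
    costs accuracy without saving any weight bandwidth.
    """
    streams = NATIVE_STREAMS[family]
    chain = [fmt for fmt in ("int4", "int8") if fmt in streams]
    chain.append("fp16")  # uncompressed is always the last resort
    return chain
```

On A13 the chain skips int8 entirely, matching the statement that a budget-rejected int4 on M1 falls straight back to fp16.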

Training (on-engine autograd)

ANEForge trains on the ANE: aneforge/autograd.py runs forward, backward, and the optimizer update as ANE graph ops through the same e5rt path. af.parameter makes a weight a graph input (no recompile per step); a VJP registry builds the backward graph. Trainer (with device_optimizer, resident_state) and UnrolledTrainer (K steps in one fused program, resident by default) drive training; full MNIST reaches 97.79% with optimizer state resident across steps. The gradient vocabulary covers MLPs, CNNs (trainable conv built from primitives, since native conv needs a baked weight), and transformers. fp16 optimizer state suffices; paired-fp16 is not needed.

One codegen wall to know: mul(reduce_output, 0.0) (a reduce result times exactly zero) fails to compile and is sidestepped with a subtract-based zero. See training.md.

Cost estimation (measurement-free)

af.estimate(out, target='hXX') returns an analytic per-chip latency from the compiler's own cycles -> roofline -> wall-time model (latency ~ overhead + max(compute, memory)), and af.project_peak(arch) gives a generational fp16-peak. Both cover all targets from one extraction (no on-device measurement) and are validated on the two measured chips (M1: 3.25 TFLOP/s, 9 GB/s, 220 us floor; M5: 8.9 TFLOP/s, 57 GB/s, 110 us floor). The estimator backs the lossless opt='routes' pass. project_peak's ~5x M5/M1 is a peak-compute ceiling; measured typical latency speedup is 2.3-3.3x.
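The roofline form latency ~ overhead + max(compute, memory) can be worked by hand with the measured figures above. This is a back-of-envelope sketch, not af.estimate: it treats the quoted floor as the overhead term, which is an assumption, and counts only weight traffic for the example matmul.

```python
# Measured figures quoted above; "floor_s" is assumed to be the overhead term.
CHIPS = {
    "M1": {"peak_flops": 3.25e12, "bandwidth": 9e9, "floor_s": 220e-6},
    "M5": {"peak_flops": 8.9e12, "bandwidth": 57e9, "floor_s": 110e-6},
}


def estimate_latency(flops, bytes_moved, chip):
    """Roofline latency in seconds: overhead + max(compute, memory)."""
    c = CHIPS[chip]
    compute_s = flops / c["peak_flops"]
    memory_s = bytes_moved / c["bandwidth"]
    return c["floor_s"] + max(compute_s, memory_s)


# Example: a [1, 4096] x [4096, 4096] fp16 matmul, weights read once.
FLOPS = 2 * 4096 * 4096          # one multiply and one add per weight
WEIGHT_BYTES = 4096 * 4096 * 2   # fp16 is 2 bytes per element
m1_s = estimate_latency(FLOPS, WEIGHT_BYTES, "M1")
m5_s = estimate_latency(FLOPS, WEIGHT_BYTES, "M5")
```

At batch 1 this matmul is memory-bound on both chips, which is the case where weight compression changes the estimate most.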

Operators with no hardware backing

Some operators have no ANE hardware support and either reject outright or require algebraic decomposition over supported atoms:

| Op | Status | Notes |
| --- | --- | --- |
| non_zero | Hardware-blocked | No decomposition |
| sliding_windows | Hardware-blocked | No decomposition |
| reduce_prod | Hardware-blocked | No decomposition |
| cumsum | Hardware-blocked | No decomposition |
| bitwise_and/or/xor/not | Rejected | Bit-level ops not in fp16 surface |
| Generic boolean tensor I/O | Rejected | Bool tensors are compute-only |
| trig/hyperbolic acos/asin/sinh/... | Rejected | No hardware support |
| logical and/or/xor | Rejected | No hardware support |
| nan_to_num | Broken | ANE does not propagate NaN |

non_maximum_suppression is reachable as a MIL op (presence-only) but offloads to CPU/GPU rather than running on the ANE, so do not count it as an ANE-native layer.

The image-pipeline ops have nuanced placement: affine, resize_bilinear, and upsample_bilinear ship as fused, on-engine ops; crop_resize stays not-implemented-on-ANE.

Quick verification reference

  • Direct dispatch reaches ANE silicon, not a silent CPU fallback (confirmed by the powermetrics check described below).
  • Operator outputs are bit-equal (or within fp16 noise) to numpy references.
  • Per-call eval is ~80-110 us at the C-API level once compiled.
  • The e5rt event graph provides true happens-before serialization across streams.

To confirm ANE placement rather than a silent CPU fallback, watch the ANE power rail under powermetrics while a compiled model runs: a genuine dispatch draws ~1.4 W on that rail (a CPU fallback does not), which is the same check the verification suite uses.