ANE op catalog: every native MIL op x device (M1-M5)

Generated from aneforge/_op_catalog.py in the ANEForge repository (python docs/gen_op_catalog.py); do not hand-edit. Query the same data at runtime via af.op_info, af.is_native(op, chip), af.ops_on(chip), af.min_native_family(op), af.walled_everywhere().

187 native MIL ops. Device ladder: m1=A13, m2=A14, m3=A15, m4_m5=A16/A17. Cells: Y native, ~ bridge/decompose, N walled. aneforge's higher-level ops (rms_norm/group_norm/mha/sdpa/fft/linalg/...) are composites that lower to these.
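
The cell legend and device ladder above can be modeled as a small lookup table. The sketch below is a stand-alone illustration, not aneforge's implementation: `CHIPS`, `CATALOG`, and the helper bodies are hypothetical, the four rows are copied from the tables on this page, and the real `af.is_native` / `af.min_native_family` / `af.walled_everywhere` signatures and return types may differ.

```python
# Hypothetical, stand-alone model of the catalog; NOT aneforge's own code.
CHIPS = ("m1", "m2", "m3", "m4_m5")  # device ladder, oldest first

# A few rows copied from this page: op -> one cell per chip.
CATALOG = {
    "relu":        ("Y", "Y", "Y", "Y"),
    "topk":        ("N", "Y", "Y", "Y"),
    "cos":         ("~", "~", "Y", "Y"),
    "logical_and": ("N", "N", "N", "N"),
}

def is_native(op, chip):
    """True only for a 'Y' cell; '~' (bridge/decompose) and 'N' (walled) are not native."""
    return CATALOG[op][CHIPS.index(chip)] == "Y"

def min_native_family(op):
    """Oldest chip with a 'Y' cell, or None if the op is never native."""
    for chip in CHIPS:
        if is_native(op, chip):
            return chip
    return None

def walled_everywhere():
    """Ops whose cells are 'N' on every chip."""
    return sorted(op for op, cells in CATALOG.items() if set(cells) == {"N"})

print(is_native("topk", "m1"), min_native_family("cos"), walled_everywhere())
```

Treating `~` as "not native" is a choice made for this sketch; a caller that accepts bridge or decomposed paths would test for `cell != "N"` instead.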

Activations (incl. LUT)

op M1 M2 M3 M4/M5 kernel note
ceil Y Y Y Y ElementWise F2 LUT (probed native M1)
clamped_relu Y Y Y Y ClampedRelu LUT
clip Y Y Y Y ElementWise user-facing clamp; LUT
elu Y Y Y Y Elu LUT (effectively F2 -> A13+)
erf Y Y Y Y SimpleActivation F2 LUT
exp Y Y Y Y ElementWise LUT
exp2 Y Y Y Y ElementWise F2 LUT
floor Y Y Y Y ElementWise F2 LUT (probed native M1)
gelu Y Y Y Y Gelu LUT (M1 probe: ~0.08 rel err vs exact - LUT approximation, still native)
leaky_relu Y Y Y Y LeakyRelu LUT
log Y Y Y Y ElementWise LUT (ln2 immediate)
prelu Y Y Y Y PRelu per-channel alpha (LUT); native at rank >=3 (M1-confirmed)
relu Y Y Y Y SimpleActivation F0 SimpleActivation
relu6 Y Y Y Y (none) LUT
round Y Y Y Y ElementWise F2 LUT round-nearest (probed native M1)
scaled_tanh Y Y Y Y ScaledTanh LUT
sigmoid Y Y Y Y SimpleActivation F0/LUT (incl. hard variant)
sigmoid_hard Y Y Y Y SigmoidHard LUT
sign Y Y Y Y ElementWise F2 LUT (probed native M1)
silu Y Y Y Y SimpleActivation a.k.a. swish; LUT
softmax Y Y Y Y Softmax F2 LUT (log2e immediate)
softplus Y Y Y Y Softplus LUT (+ parametric)
softplus_parametric Y Y Y Y Softplus LUT
softsign Y Y Y Y Softsign LUT
tanh Y Y Y Y ElementWise LUT
threshold Y Y Y Y ElementWise LUT
thresholded_relu Y Y Y Y ThresholdedRelu LUT

Comparison / logical

op M1 M2 M3 M4/M5 kernel note
equal Y Y Y Y ElementWise F0 compare -> bool (probed native M1)
greater Y Y Y Y ElementWise F0 compare
greater_equal Y Y Y Y ElementWise F0 compare (probed native M1)
less Y Y Y Y ElementWise F0 compare (probed native M1)
less_equal Y Y Y Y ElementWise F0 compare (probed native M1)
logical_and N N N N Unsupported Unsupported everywhere - decompose via min/mul on host
logical_not Y Y Y Y ElementWise F0 (probed native M1)
logical_or N N N N Unsupported Unsupported everywhere - decompose via max on host
logical_xor N N N N Unsupported Unsupported everywhere - decompose via != on host
not_equal Y Y Y Y ElementWise F0 compare (probed native M1)
select Y Y Y Y Select user-facing where; template_text backend
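
The host decompositions named in the logical_and / logical_or / logical_xor rows can be sketched as follows. This is a plain-Python illustration that assumes booleans are carried as 0.0 / 1.0 values; the function names are hypothetical and this is not how aneforge necessarily wires them.

```python
# Host-side sketch: booleans are represented as 0.0 / 1.0 values (assumption).
def logical_and(a, b):
    # min(x, y) is 1 only when both are 1; x * y gives the same result.
    return [min(x, y) for x, y in zip(a, b)]

def logical_or(a, b):
    # max(x, y) is 1 when either input is 1.
    return [max(x, y) for x, y in zip(a, b)]

def logical_xor(a, b):
    # Inputs differ exactly when one of them is 1 and the other is 0.
    return [1.0 if x != y else 0.0 for x, y in zip(a, b)]

a = [0.0, 0.0, 1.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0]
print(logical_and(a, b), logical_or(a, b), logical_xor(a, b))
```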

Control flow

op M1 M2 M3 M4/M5 kernel note
call ~ ~ ~ ~ Call function call; mapped_no_current_hwx_case (inlined)
cond ~ ~ ~ ~ mapped_no_current_hwx_case + Unsupported converter - no standalone ANE codegen; flatten on host
while_loop ~ ~ ~ ~ mapped_no_current_hwx_case + Unsupported/WhileLoop - unroll on host

Conv / MatMul / Pooling

op M1 M2 M3 M4/M5 kernel note
avg_pool Y Y Y Y Pool F0; window <=29 (M1) / 31 (A14+); 3D window A13+
conv Y Y Y Y Conv F0; M1 kernels <=29x29 (13x13 fp16), 3D depth native A13+; M5 <=32x32
conv_transpose Y Y Y Y Conv F0 deconv; strided axes use small-kernel caps
einsum Y Y Y Y Einsum lowers to matmul/transpose chain
l2_pool Y Y Y Y Pool special LUT pool (1024-entry fp16)
linear Y Y Y Y Linear folds to conv when RHS <=2 MB SRAM working set
linear_activation Y Y Y Y LinearActivation fused linear+activation
matmul Y Y Y Y Matmul NE lane / conv-fold; same tensor caps as conv
max_pool Y Y Y Y Pool F0
ne_bypass ~ ~ ~ ~ NEBypass private NEBypass unit; mapped_no_current_hwx_case
ne_conv Y Y Y Y NEConv private NEConv unit (fill=0x44/mir=0x5d)
ne_matmul Y Y Y Y NEMatMul private NEMatMul unit
ne_pool Y Y Y Y NEPool private NEPool unit (probe-pending codegen, treated reachable)
pe_elementwise Y Y Y Y PEElementWise private PEElementWise unit (fill=0x49/mir=0x59)
pe_goc ~ ~ ~ ~ PEGOC private PEGOC unit; mapped_no_current_hwx_case (compiler-internal)
pe_pool Y Y Y Y PEPool private PEPool unit
scaled_dot_product_attention Y Y Y Y SDPA F2; rides matmul+softmax (NOT texture-gated) - native on M1. user-facing sdpa

Detection / sampling

op M1 M2 M3 M4/M5 kernel note
argsort N Y Y Y Sort Sort family, A14+; codegen-rejected on M1 (= sort floor)
list_gather N N N N Unsupported TensorList op - Unsupported everywhere
list_length N N N N Unsupported Unsupported everywhere
list_read N N N N Unsupported Unsupported everywhere
list_scatter N N N N Unsupported Unsupported everywhere
list_write N N N N Unsupported Unsupported everywhere
make_list N N N N Unsupported Unsupported everywhere
non_maximum_suppression Y Y Y Y NonMaximumSuppression template_text NMS backend
random_bernoulli N N N N Unsupported Unsupported everywhere - host RNG
random_categorical N N N N Unsupported Unsupported everywhere - host RNG
random_normal N N N N Unsupported Unsupported everywhere - host RNG
random_uniform ~ ~ Y Y RandomUniform RNG, A15+ (HAL 0x4a9=0 on M1/M2); aneforge uses host RNG below A15 (dropout/random decomposable)
topk N Y Y Y TopK rank/sort bridge, A14+ (_OP_FLOOR); bridge validator callable on M1 but codegen-rejected (measured)

Elementwise arithmetic

op M1 M2 M3 M4/M5 kernel note
abs Y Y Y Y ElementWise PEElementWise (F0)
add Y Y Y Y ElementWise const + tensor forms; text-immediate fused const
cumsum Y Y Y Y CumSum runs ON the ANE as a single op (verified M1 2026-06-09: cos 1.0) - NOT host-decomposed. The standard MIL cumsum op is unimplemented, so it is reached via the curated e5rt path (see _capabilities).
floor_div Y Y Y Y ElementWise LUT-assisted (actlut:2)
inverse Y Y Y Y ElementWise reciprocal LUT
maximum Y Y Y Y ElementWise const + tensor (LUT)
minimum Y Y Y Y ElementWise const + tensor (LUT)
mod N N N N Unsupported Unsupported everywhere - decompose on host
mul Y Y Y Y ElementWise const + tensor forms
pow Y Y Y Y ElementWise pow_const; user-facing x ** y (probed native M1)
real_div Y Y Y Y ElementWise general divide; A11/A12 = const-fp16 reciprocal only. user-facing truediv/div
rsqrt Y Y Y Y ElementWise F2 LUT
sqrt Y Y Y Y ElementWise F2 LUT activation (native A13+, decomposed on A11/A12)
square Y Y Y Y ElementWise F0 PEElementWise
sub Y Y Y Y ElementWise lowered to add-of-negated-const
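
The mod row only says "decompose on host". One possible decomposition, shown here as an assumption rather than the catalog's actual choice, builds a floored modulo from divide, floor, multiply, and subtract. The name `mod_floor` is hypothetical, and the sign convention of the MIL op may differ from the one used below.

```python
import math

def mod_floor(x, y):
    # Floored modulo: the result takes the sign of y (same as Python's % operator).
    return x - y * math.floor(x / y)

print(mod_floor(7, 3), mod_floor(-7, 3), mod_floor(7.5, 2))
```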

Image / resize / texture

op M1 M2 M3 M4/M5 kernel note
affine N Y Y Y Affine texture-engine only (A14+); "affine transform is not supported on this architecture" on M1
crop_resize N Y Y Y CropResize texture-engine only (A14+, HAL 0x81d) - _OP_FLOOR; unavailable on M1, no host substitution wired
degamma ~ ~ ~ ~ DeGamma ISP/image op; mapped_no_current_hwx_case
gamma ~ ~ ~ ~ Gamma ISP/image op; mapped_no_current_hwx_case
pixel_buffer_to_tensor ~ ~ ~ ~ PixelBufferToTensor 4CC image input; mapped_no_current_hwx_case. Does not lower on the unentitled direct path (entitlement gate, not chip gate); use af.image_input.
resample N Y Y Y Resample texture-engine only (A14+); warp depth=1, channel in {1,2}. Walled on M1
resize ~ Y Y Y Resize F2 but texture-gated: M1 = software deconv/transpose fallback (different rounding; some modes hard-abort); native A14+
resize_bilinear ~ Y Y Y ResizeBilinear NE lane; sw-fallback on M1
resize_nearest_neighbor ~ Y Y Y ResizeNearestNeighbor NE lane; sw-fallback on M1 (1x1-source fast path exists)
tensor_to_pixel_buffer ~ ~ ~ ~ TensorToPixelBuffer mapped_no_current_hwx_case (compiler-internal)
upsample_bilinear ~ Y Y Y UpsampleBilinear NE lane; sw-fallback on M1
upsample_nearest_neighbor ~ Y Y Y UpsampleNearestNeighbor NE lane; sw-fallback on M1

Normalization

op M1 M2 M3 M4/M5 kernel note
batch_norm Y Y Y Y BatchNorm inference fold-to-affine runs everywhere (incl. A11/A12); native stats form is A13+
instance_norm Y Y Y Y InstanceNorm F2
l2_norm Y Y Y Y (none) F2
layer_norm Y Y Y Y LayerNorm F2 (native A13+)
local_response_norm Y Y Y Y LRNorm LRN bridge (measured Y on M1)
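
The "inference fold-to-affine" form mentioned in the batch_norm row is the standard rewrite of batch normalization with frozen statistics into one multiply and one add per channel. A minimal sketch, with hypothetical function names and one value per channel:

```python
import math

def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    # y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  x * scale + shift
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift

def batch_norm_reference(x, gamma, beta, mean, var, eps=1e-5):
    # Unfolded formula, kept only to check the folded form against.
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, mean, var)]

gamma, beta = [1.5, 0.5], [0.1, -0.2]
mean, var = [0.3, -1.0], [4.0, 0.25]
x = [2.0, 3.0]

scale, shift = fold_batch_norm(gamma, beta, mean, var)
folded = [xi * s + t for xi, s, t in zip(x, scale, shift)]
print(folded, batch_norm_reference(x, gamma, beta, mean, var))
```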

Quantization / dtype

op M1 M2 M3 M4/M5 kernel note
cast Y Y Y Y Cast F0 format primitive. fp16<->fp32/bool native on M1; cast(->int32) is walled on M1 (empirically confirmed) - keep dtype fp on h13
const ~ ~ ~ ~ ConstOps mapped_no_current_hwx_case - folded at compile, not a standalone codegen op
constexpr_affine_dequantize ~ ~ ~ ~ ConstOps weight-compression const; folded. int4-LUT streams natively from M1; int8/affine fold to fp16 below A15 (HAL +0x520-0x539).
constexpr_blockwise_shift_scale ~ ~ Y Y ConstOps blockwise stream gate A15+; folds to fp16 on M1/M2
constexpr_cast N N N N Unsupported Unsupported everywhere
constexpr_lut_to_dense Y Y Y Y ConstOps palette/LUT stream gate (+0x529) is A13 onward -> int4-LUT streams natively from M1 (the one compressed format that wins on M1)
constexpr_lut_to_sparse ~ ~ ~ ~ ConstOps folded const; sparse stream A15+
constexpr_sparse_blockwise_shift_scale ~ ~ Y Y ConstOps sparse+blockwise stream A15+
constexpr_sparse_to_dense ~ ~ Y Y ConstOps sparse stream A15+
dequantize Y Y Y Y Dequantize F0
quantize Y Y Y Y Quantize F0 (not texture-gated)
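
What constexpr_lut_to_dense computes can be illustrated on the host: each weight is a 4-bit index into a 16-entry palette. The sketch below is an assumption-laden illustration, not the on-device format: the helper names are hypothetical, and the nibble order (low nibble first) is chosen for the example only.

```python
def unpack_nibbles(packed):
    # Two 4-bit indices per byte; low nibble first (assumed order).
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out

def lut_to_dense(packed, lut):
    # A 4-bit palette has exactly 16 entries.
    if len(lut) != 16:
        raise ValueError("int4 LUT must have 16 entries")
    return [lut[i] for i in unpack_nibbles(packed)]

lut = [i * 0.5 for i in range(16)]   # toy palette
packed = bytes([0x10, 0xF2])          # indices 0, 1, 2, 15
print(lut_to_dense(packed, lut))
```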

Recurrent

op M1 M2 M3 M4/M5 kernel note
gru N N N N Unsupported Unsupported everywhere - unroll to conv/matmul+activation on host
lstm N N N N Unsupported Unsupported everywhere - unroll on host
rnn N N N N Unsupported Unsupported everywhere - unroll on host
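
"Unroll on host" means replacing the recurrent op with one matmul-plus-activation step per timestep, which requires a static sequence length. A minimal sketch for a plain tanh RNN cell follows; the helper names are hypothetical and the gating of gru/lstm is omitted for brevity.

```python
import math

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_step(x, h, w_x, w_h, b):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b): two matmuls, an add, an activation.
    pre = [p + q + r for p, q, r in zip(matvec(w_x, x), matvec(w_h, h), b)]
    return [math.tanh(p) for p in pre]

def rnn_unrolled(xs, h0, w_x, w_h, b):
    h = h0
    for x in xs:  # static length: one step is emitted per timestep
        h = rnn_step(x, h, w_x, w_h, b)
    return h

h_final = rnn_unrolled([[0.5], [0.25]], [0.0], [[1.0]], [[1.0]], [0.0])
print(h_final)
```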

Reductions

op M1 M2 M3 M4/M5 kernel note
reduce_argmax Y Y Y Y ReduceArg per-axis ArgMax - F0, all chips (bridge ArgMax measured Y on M1)
reduce_argmin ~ ~ Y Y ReduceArg per-axis argmin; M1/M2 walled on the MIL route (HAL 0x4f2, A15+), bridge mirrors argmax. user-facing argmin
reduce_l1_norm Y Y Y Y Reduce F2 Reduce
reduce_l2_norm Y Y Y Y Reduce F2 Reduce
reduce_log_sum Y Y Y Y Reduce LUT-assisted Reduce (ln2 immediate)
reduce_log_sum_exp Y Y Y Y Reduce LUT Reduce; aneforge wires its vjp (probed native M1)
reduce_max Y Y Y Y Reduce F2
reduce_mean Y Y Y Y Reduce F2
reduce_min Y Y Y Y Reduce F2
reduce_prod N N N N Unsupported Unsupported everywhere - decompose (log-sum-exp / scan) on host
reduce_sum Y Y Y Y Reduce F2 (native A13+; decomposed on A11/A12). reduced-axis >=192 -> transpose route (>=384 on A15+)
reduce_sum_square Y Y Y Y Reduce F2; the 0x494 reduce->square fusion is M2+ only - M1 emits an extra fp16 round (<=1-round numeric, not a wall)
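
The reduce_prod row points at a log-domain or scan decomposition without spelling it out. The sketch below shows one reading of each, as an assumption: a log-domain product (sum of log-magnitudes, then exp, with sign and zero handled separately) and a running-multiply scan. Function names are hypothetical.

```python
import math

def reduce_prod_log(xs):
    # A zero anywhere makes the product zero; log(0) is undefined, so handle it first.
    if any(x == 0 for x in xs):
        return 0.0
    negatives = sum(1 for x in xs if x < 0)
    magnitude = math.exp(sum(math.log(abs(x)) for x in xs))
    return -magnitude if negatives % 2 else magnitude

def reduce_prod_scan(xs):
    # Sequential scan: exact for small inputs, but inherently serial.
    acc = 1.0
    for x in xs:
        acc *= x
    return acc

print(reduce_prod_log([2.0, -3.0, 4.0]), reduce_prod_scan([2.0, -3.0, 4.0]))
```

The log-domain form maps onto ops that are native in this catalog (log, reduce_sum, exp) but loses a little precision; the scan form is exact for these inputs.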

Special / math

op M1 M2 M3 M4/M5 kernel note
acos N N N N Unsupported Unsupported everywhere
acosh N N N N Unsupported Unsupported everywhere
asin N N N N Unsupported Unsupported everywhere - host decomposition
asinh N N N N Unsupported Unsupported everywhere
atan Y Y Y Y ElementWise F2 LUT - native on M1 (probe: WORKS; the one trig in vocab on h13)
atanh N N N N Unsupported Unsupported everywhere
cos ~ ~ Y Y ElementWise F4 trig, native A15+ only (REJECTED on M1/A14); M1/M2 use Horner
cosh N N N N Unsupported Unsupported everywhere (REJECTED M1 probe)
cost_volume ~ ~ ~ ~ CostVolume bridge CostVolume (measured Y on M1); mapped_no_current_hwx_case
cross_product ~ ~ ~ ~ CrossProduct bridge CrossProduct (measured Y on M1) but mapped_no_current_hwx_case in MIL map - reachable via bridge
matrix_decomposition ~ ~ ~ ~ MatrixDecomposition mapped_no_current_hwx_case - no observed codegen
sin ~ ~ Y Y ElementWise F4 trig, native A15+ only (REJECTED on M1/A14 - silicon-measured); M1/M2 use special.py Horner
sinh N N N N Unsupported Unsupported everywhere (REJECTED M1 probe) - (exp(x)-exp(-x))/2 on host
tan N N N N Unsupported Unsupported everywhere (REJECTED M1 probe) - sin/cos Horner identity on host
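
The host fallbacks named in the sin, cos, sinh, and tan rows can be sketched as below. This is not the contents of special.py: the coefficients are plain Taylor terms picked for illustration, the range reduction is one common choice, and all function names are hypothetical.

```python
import math

# Taylor coefficients of sin(x)/x in powers of x^2, highest order first (illustrative).
_SIN_COEFFS = (-1 / 39916800, 1 / 362880, -1 / 5040, 1 / 120, -1 / 6, 1.0)

def sin_horner(x):
    # Range-reduce to [-pi, pi], then fold into [-pi/2, pi/2].
    x -= 2 * math.pi * round(x / (2 * math.pi))
    if x > math.pi / 2:
        x = math.pi - x
    elif x < -math.pi / 2:
        x = -math.pi - x
    # Horner evaluation in x^2, then multiply by x.
    x2 = x * x
    acc = 0.0
    for c in _SIN_COEFFS:
        acc = acc * x2 + c
    return x * acc

def cos_horner(x):
    return sin_horner(x + math.pi / 2)

def tan_host(x):
    # sin/cos identity from the tan row.
    return sin_horner(x) / cos_horner(x)

def sinh_host(x):
    # (exp(x) - exp(-x)) / 2 from the sinh row.
    return (math.exp(x) - math.exp(-x)) / 2

print(sin_horner(10.0), cos_horner(0.5), tan_host(0.5), sinh_host(1.0))
```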

Stateful (state / buffers)

op M1 M2 M3 M4/M5 kernel note
circular_buffer_to_tensor ~ ~ ~ ~ CircularBufferToTensor mapped_no_current_hwx_case; ring-buffer reader
read_state Y Y Y Y ReadState F2 stateful; reachable on M1 but needs the e5rt inout-tensor-desc plumbing for KV-cache.
tensor_buffer_to_tensor ~ ~ ~ ~ TensorBufferToTensor mapped_no_current_hwx_case; F2 ring/streaming buffer mover (A13+, reachable inside stateful graph)
tensor_to_circular_buffer ~ ~ ~ ~ TensorToCircularBuffer mapped_no_current_hwx_case; ring-buffer writer
tensor_to_tensor_buffer ~ ~ ~ ~ TensorToTensorBuffer mapped_no_current_hwx_case
write_state Y Y Y Y WriteState F2 stateful

Structural / shape

op M1 M2 M3 M4/M5 kernel note
band_part N N N N Unsupported Unsupported everywhere (mask via host)
batch_to_space Y Y Y Y BatchToSpace inverse of space_to_batch
concat Y Y Y Y Concat F0 DMA
crop Y Y Y Y Crop F0 slice/crop (distinct from texture crop_resize)
depth_to_space Y Y Y Y DepthToSpace F2 NE lane; user-facing pixel_shuffle
expand_dims Y Y Y Y ExpandDims F0
fill Y Y Y Y Fill const tensor producer
fill_like Y Y Y Y FillLike const tensor producer
flatten2d Y Y Y Y (none) F0
gather Y Y Y Y Gather software gather on M1 in narrow envelope (batch=1,depth=1); hw gather_hw path is A14+ (_OP_FLOOR)
gather_along_axis Y Y Y Y GatherAlongAxis template_text; same M1 envelope caveat
gather_nd ~ Y Y Y GatherND M1 = sw-envelope only (IsValidForH13: batch=1,depth=1,idx-ch=3); outside it rejected. Native (texture) A14+
identity Y Y Y Y Cast aliases Cast/no-op
non_zero N N N N Unsupported Unsupported everywhere (data-dependent shape)
one_hot N N N N Unsupported Unsupported everywhere - decompose (eye-gather) on host
pad Y Y Y Y Pad const pad F0 (NE lane). symmetric/reflect pad is texture-gated -> ~/sw on M1, native A14+
pixel_shuffle Y Y Y Y PixelShuffle F2 NE lane
pixel_unshuffle Y Y Y Y PixelUnshuffle template_text
range_1d ~ ~ ~ ~ (none) template_text; M1 raw-MIL probe: walled (positional-encoding range rejects on h13 codegen) - host-precompute the const
reshape Y Y Y Y Reshape F0 metadata (A11/A12 = fp16-only/Flatten-or-abort)
reshape_like Y Y Y Y ReshapeLike F0
reverse Y Y Y Y Reverse NE lane (probed native M1; aneforge wires vjp)
reverse_sequence N N N N Unsupported Unsupported everywhere - decompose on host
scatter N N N N Unsupported Unsupported everywhere - decompose on host
scatter_along_axis N N N N Unsupported Unsupported everywhere
scatter_nd N N N N Unsupported Unsupported everywhere
shape N N N N Unsupported Unsupported everywhere (static-shape graphs only)
slice_by_index ~ ~ ~ ~ SliceByIndex mapped_no_current_hwx_case; static-offset slice folds into descriptor (reachable inside graph)
slice_by_size Y Y Y Y SliceBySize F0; pre-A16 width-offset quirk (Q.4 x16 crop-DMA): concatenating multiple nonzero last-axis (width) slices returns wrong elements on A14 (the gather-axis-1 bug); a single width slice is exact on A14 (linalg column/element extraction is green on M2); on A13 a width slice also saturates
slice_update Y Y Y Y SliceUpdate template_text backend
sliding_windows N N N N NotImplemented NotImplemented on any backend - decompose on host
space_to_batch Y Y Y Y SpaceToBatch factor in {2,3,4,8}; batch cap 4096 (older)/65536
space_to_depth Y Y Y Y SpaceToDepth F2 NE lane; user-facing pixel_unshuffle
split Y Y Y Y Split F0
squeeze Y Y Y Y Squeeze F0
stack Y Y Y Y Stack F0
tile Y Y Y Y Tile F2 (A13+); factors of {2,3,4,8}. Absent on A11/A12
transpose Y Y Y Y Transpose F0 but capped by max-transpose-extent (16384 M1-A15 -> 65536 M5; 0 on A11/A12)
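
The "eye-gather" decomposition named in the one_hot row amounts to gathering rows of an identity matrix. A minimal host-side sketch, with hypothetical helper names:

```python
def eye(n):
    # n x n identity matrix; a constant that can be precomputed at build time.
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def one_hot(indices, depth):
    table = eye(depth)
    # Row gather: index i selects the row with a 1.0 in column i.
    return [table[i] for i in indices]

print(one_hot([2, 0], 3))
```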