API reference

This reference is generated from the package's own docstrings. The overview below, including the op surface, comes from the aneforge module docstring; the sub-pages document the graph and operators, the compile and optimization entry points, training, the pretrained loaders, and the applied-math toolkits.

aneforge - a clean graph->compile->run frontend for the Apple Neural Engine.

Build a small tensor graph, compile it into ONE fused e5rt program, and run it on the ANE. Fusing is the point: the ANE penalises many tiny dispatches, so a whole subgraph becomes a single program. Weights pack automatically into one BLOBFILE - fp16, or per-channel int8 streamed (dequantised during the tile DMA) when int8=True.

import aneforge as af

x = af.input((1, 3, 32, 32))
h = af.conv(x, W1, pad=1).relu()
h = af.conv(h, W2, pad=1).relu()
y = h.mean((2, 3)).reshape(1, C) @ Wfc
net = af.compile(y, int8=True)          # one fused ANE program, int8 weights
net4 = af.compile(y, compress="int4")   # alternative: 4-bit LUT weights, accuracy-gated
out = net(image)                        # run on the ANE
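To make "per-channel int8" concrete, here is a minimal NumPy sketch of symmetric quantisation with one scale per output channel, plus the matching dequantisation. It is an illustration under assumptions: the symmetric scheme, the output-channel-first weight layout, and the names quantise_per_channel and dequantise are hypothetical, and the actual BLOBFILE encoding is not described in this overview.

```python
import numpy as np

def quantise_per_channel(w):
    """Symmetric int8 quantisation, one scale per output channel (axis 0)."""
    w = np.asarray(w, dtype=np.float32)
    flat = w.reshape(w.shape[0], -1)
    max_abs = np.abs(flat).max(axis=1)
    # An all-zero channel gets scale 1.0 so the division below stays defined.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(flat / scale[:, None]), -127, 127).astype(np.int8)
    return q.reshape(w.shape), scale

def dequantise(q, scale):
    """Recover approximate float weights: int8 value times its channel scale."""
    shape = (-1,) + (1,) * (q.ndim - 1)
    return q.astype(np.float32) * scale.reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3, 3, 3)).astype(np.float32)  # toy conv weight
q, scale = quantise_per_channel(W)
err = np.abs(dequantise(q, scale) - W).max()
```

In this sketch the int8 payload is half the size of the same weight in fp16, and the worst-case reconstruction error is bounded by half of the largest channel scale.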

Op surface

linear algebra: conv, conv_transpose; matmul/linear via @; bmm
dynamic_conv: conv with a RUNTIME-tensor weight (hypernetworks / per-sample kernels; native ANE dynamic kernel, batch-1 only)
activations: relu/silu/gelu/sigmoid/tanh/exp/log/sqrt/rsqrt/abs/square/sin/cos/erf/softplus/relu6/elu/leaky_relu/clip
arithmetic: add/sub/mul/div(/)/maximum/minimum/pow
reductions/norms: mean/sum/amax/amin, softmax, l2_norm, rms_norm/layer_norm/group_norm/batch_norm
spatial/shape: max_pool/avg_pool, upsample, concat, reshape/transpose, pixel_shuffle/pixel_unshuffle
nn helpers: mha, cross_attention, geglu

Two op routes. Most ops are FUSED e5rt-MIL: they lower to MIL and fuse into ONE program (no graph cut). A second family is the NETPLIST-BRIDGE ops: native Path-A hardware layers that Apple's MIL frontend never emits (sdpa, argmax/topk/sort, cross_product/cross_correlation/cost_volume, fps/radius_search, minmax_norm/lrn, the space/channel/batch rearranges, flatten/input_view/dynamic_slice/scaled_elementwise). Each bridge op CUTS the graph: the surrounding regions run as e5rt programs, the bridge node runs as a separate native sub-program (sub-ms via the A2 persistent worker), and compile returns a SegmentedModel.
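The cutting rule can be pictured with a small pure-Python sketch. It assumes a simplified linear chain of op names rather than a real graph, and the segment function, the BRIDGE_OPS subset, and the tuple output format are hypothetical; it does not reproduce how compile actually builds a SegmentedModel.

```python
# Subset of the bridge ops named above, for illustration only.
BRIDGE_OPS = {"sdpa", "argmax", "topk", "sort"}

def segment(ops, bridge_ops=BRIDGE_OPS):
    """Split a linear op chain into fused regions and single bridge nodes."""
    segments = []
    fused = []
    for op in ops:
        if op in bridge_ops:
            if fused:                       # close the fused region before the cut
                segments.append(("e5rt", fused))
                fused = []
            segments.append(("bridge", [op]))
        else:
            fused.append(op)                # keeps fusing into the current region
    if fused:
        segments.append(("e5rt", fused))
    return segments

plan = segment(["conv", "relu", "matmul", "topk", "reshape", "softmax"])
# plan == [("e5rt", ["conv", "relu", "matmul"]),
#          ("bridge", ["topk"]),
#          ("e5rt", ["reshape", "softmax"])]
```

A chain with no bridge op yields a single fused segment, which corresponds to the one-program case described for the fused route.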

Image input: af.image_input(shape, scale=1/255, bias=0.0) declares a uint8 input port and dequantises it on the engine (cast -> scale -> bias). Raw camera or decoded-video bytes can therefore feed the model directly, and the host skips the float convert/repack. scale and bias are each a scalar or per-channel (length C, broadcast over NCHW).
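As a host-side reference for what cast -> scale -> bias computes, the following NumPy sketch applies the same three steps to an NCHW uint8 array. The function name dequantise_image is hypothetical, and doing the arithmetic in fp16 is an assumption based on the fp16-only rule; the on-engine result may round differently.

```python
import numpy as np

def dequantise_image(u8, scale=1 / 255, bias=0.0):
    """Reference cast -> scale -> bias for an NCHW uint8 tensor."""
    x = u8.astype(np.float16)                   # cast
    s = np.asarray(scale, dtype=np.float16)
    b = np.asarray(bias, dtype=np.float16)
    if s.ndim == 1:                             # per-channel: broadcast over N, H, W
        s = s.reshape(1, -1, 1, 1)
    if b.ndim == 1:
        b = b.reshape(1, -1, 1, 1)
    return x * s + b                            # scale, then bias

frame = np.full((1, 3, 2, 2), 255, dtype=np.uint8)   # stand-in for camera bytes
out = dequantise_image(frame, scale=[1 / 255] * 3, bias=[-0.5, 0.0, 0.5])
```

With a per-channel bias the three channels of the all-255 frame land near 0.5, 1.0, and 1.5, which is a quick check that the length-C vectors broadcast along the channel axis.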

Pretrained loaders: af.load(".../all-MiniLM-L6-v2") (sentence encoder), af.load_resnet18() (ImageNet classifier).

Design rules: compute is fp16 only (fp32/int32/bf16 are rejected). Reductions and matmuls use a WIDE (fp32-class) accumulator fed by radix-4 fp16-rounded input tiles, so representable sums are near-exact: a sum/dot of 16384 ones is bit-exact, where a naive fp16 running sum would stall at 2048, and a +1 survives next to a 16000 partial that an fp16 running sum would swallow. The fp16 limit sits at the products and the I/O cast, not at the running sum, so cancellation-heavy reductions still lose precision.

Weight compression: int8=True streams weights at half the bytes of fp16 and is the alias for compress='int8'. compress= chooses the weight encoding:

None: fp16 (default)
'int8': per-channel
'int4': LUT palettization, per-tensor, with an accuracy-gated fallback to int8/fp16 set by compress_atol
'sparse': unstructured bitmask, emitted when the weight is >=50% zeros, else fp16
'auto': chosen per weight: sparse if sparse, else int4 if accurate, else int8, else fp16

aneforge wraps the unentitled Espresso e5rt runtime only - no CoreML, no entitlement.
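The accumulator behaviour described in the design rules can be reproduced on the host with the standard library alone. The sketch below rounds to fp16 through the struct module's half-precision format and compares a running sum that is rounded after every add with a wide accumulator that is cast once at the end. The helper names are hypothetical, a Python float stands in for the wide accumulator, and the radix-4 tiling is not modelled.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_running_sum(values):
    """Naive accumulation: the running sum is rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def wide_sum(values):
    """fp16-rounded inputs, wide accumulator, one final cast to fp16."""
    return to_fp16(sum(to_fp16(v) for v in values))

ones = [1.0] * 16384
narrow = fp16_running_sum(ones)        # 2048.0: 2048 + 1 rounds back to 2048
wide = wide_sum(ones)                  # 16384.0: exact
swallowed = to_fp16(16000.0 + 1.0)     # 16000.0: fp16 spacing is 8 at this size
```

The stall happens because fp16 carries 11 significant bits: above 2048 consecutive integers are no longer representable, so adding 1 rounds back to the same value.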

aneforge also has a tiny reverse-mode autograd (autograd.py): af.parameter / af.backward / af.mse / af.SGD / af.Trainer train a small model with the forward and backward passes compiled and run on the ANE. It also does classification: af.softmax_cross_entropy (analytic fp16-stable on-ANE gradient) + af.Adam train a 784->128->10 MLP on MNIST to ~97% test accuracy. Trainer(..., device_optimizer=True) additionally runs the OPTIMIZER STEP on the ANE (SGD/Adam update as graph ops), so all training tensor-math is on the engine; the host only computes the scalar lr_t and shuttles state/grads (the host<->device state round-trip remains). See examples/train_mnist_mlp.py.
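To show why a device-side optimizer step can leave only a scalar on the host, here is a pure-Python sketch of Adam in the form where both bias corrections are folded into a single step size lr_t. It assumes that lr_t in the text refers to this standard formulation; the function names are hypothetical, plain lists stand in for tensors, and none of this is aneforge's own code.

```python
import math

def adam_lr_t(lr, t, beta1=0.9, beta2=0.999):
    """Host-side scalar: step size with both bias corrections folded in."""
    return lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

def adam_step(p, g, m, v, lr_t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Elementwise update built only from mul/add/sqrt/div on the tensors."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    p = [pi - lr_t * mi / (math.sqrt(vi) + eps) for pi, mi, vi in zip(p, m, v)]
    return p, m, v

# Toy problem: minimise sum(p_i^2), whose gradient is 2 * p.
p, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 201):
    g = [2.0 * pi for pi in p]
    p, m, v = adam_step(p, g, m, v, adam_lr_t(0.1, t))
```

Everything inside adam_step is elementwise tensor arithmetic, while adam_lr_t depends only on the step counter, which matches the split the text describes between engine work and host work.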

Layout: graph.py (Tensor + ops), _compile.py (per-op emit registry + compile), _blob.py (weight packing), autograd.py (on-ANE autograd), models.py (pretrained loaders).