Roadmap¶
Forward-looking strategy for ANEForge: what is solved, what is genuinely open, and what is permanently out of reach. This is a living document. Every entry has a specific reason for being here; it is not a wishlist.
ANEForge is a direct, CoreML-free Python frontend for the Apple Neural
Engine. It lowers an operator graph into a single fused Espresso e5rt
program and dispatches it to ANE silicon, with no entitlement and no
CoreML dependency. The capability surface is characterized in
capabilities.md; the cross-chip story in
cross-chip.md; and training in training.md.
Where things stand¶
The infrastructure is mature. The frontend compiles whole graphs to one
ANE program; pretrained vision CNNs, transformer encoders, and diffusion
UNet blocks load and run correctly; weights stream at fp16, int8, int4-LUT,
sparse, or blockwise; a reverse-mode autograd trains real networks
forward, backward, and optimizer-step on the engine; a per-chip cost model
estimates latency for any ANE generation without that hardware in hand; and
the same graph cross-compiles statically across the M1-M5 family. The
correctness corpus (tests/run_corpus.py) is the standing gate and is
green.
The remaining work is concentrated in three places: taking end-to-end transformer training to scale, pushing fp16-sensitive models (notably Stable Diffusion) past their numerical walls, and completing the cross-chip story for the one silicon class (M3/A15) we have not measured.
Recently completed¶
These items used to be open directions. They have shipped and moved here so the open frontiers below read honestly.
Cross-chip deployment (M1-M5)¶
A single ANEForge graph compiles statically for any ANE generation. The
compiler binary is byte-identical across chips; all per-chip difference is
HAL data plus a MinimumFamily<N> op gate, so a target is selected by
name (compile(target='hXX')) and validated without that silicon present.
cross_compile_check gates target names (an unknown architecture silently
falls back on e5rt, so it is rejected up front), and family detection
(detect_family) resolves the running chip, including the M2/M3/M4 classes
via the verified M(n) = H(n + 12) board-type ladder. The op walls are
family-wide, not M5-specific; the trig floor (Sin/Cos) and a few
resampling ops are the only core-net-irrelevant gaps on the oldest
families. fp16 numerics are the one axis that genuinely needs each chip,
and predict_fp16_divergence + CrossChipFP16Warning flag where a graph
may diverge across families. cross_compile_check runs the same static
preflight family-cap gate as compile(target=) before invoking the
compiler, so a per-family cap violation (conv kernel width, max tensor
dimension, a below-floor op) is rejected statically rather than left for
the host cross-compiler to catch. The host cross-compiler does not reliably
enforce a different family's HAL-gated caps, so such a violation would
otherwise pass cross-chip CI and fail only on the real silicon. See cross-chip.md.
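The board-type ladder and the up-front target gate can be sketched in plain Python. This is an illustration only: `m_series_to_family`, `check_target`, and `KNOWN_FAMILIES` are hypothetical names, not ANEForge API, and the family set covers just the five base M-series points (variant targets such as h17s are left out).

```python
# Hypothetical sketch; none of these names are ANEForge API.
KNOWN_FAMILIES = {f"h{n}" for n in range(13, 18)}  # h13..h17 base families only


def m_series_to_family(n: int) -> str:
    """Map M-series generation n (M1..M5) to its H family via M(n) = H(n + 12)."""
    if not 1 <= n <= 5:
        raise ValueError(f"only M1-M5 are covered here, got M{n}")
    return f"h{n + 12}"


def check_target(name: str, known=KNOWN_FAMILIES) -> str:
    """Reject an unknown target name up front rather than let it fall back silently."""
    family = name.lower()
    if family not in known:
        raise ValueError(f"unknown target {name!r}")
    return family
```

The point of the gate is ordering: the name is validated before any compile is attempted, so a typo fails loudly instead of compiling for the wrong architecture.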
Per-chip cost model¶
af.estimate(out, target='hXX') returns a measurement-free latency
estimate for any of the 28 ANE targets, derived from the compiler's own
analytic cycles -> roofline -> wall-time model. af.project_peak(arch)
gives the generational fp16-peak table. The model is anchored to
silicon-measured rooflines on both M1 and M5 (_cost._ANCHORS): the M5
loop-closure correction is in place (effective bandwidth scales with core
count, not clock), so estimate(target='h17s') now matches measured M5
convs within ~13%. The A14/h14 anchor is silicon-validated out of sample
(1.17x over nine fresh shapes). target=None keeps the precise M5-measured
heuristic so the optimizer path is unchanged. af.estimate_provenance(arch)
reports whether a target's estimate is silicon-anchored (A13/h13, A14/h14,
A16/h17s, with the A16 tier folding H16/H17* onto the h17s point) or
extrapolated from the nearest measured anchor, so a caller knows which
estimates rest on measured silicon. project_peak's generational multiple
is a peak-compute ceiling; typical measured speedups are lower.
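For readers unfamiliar with the roofline step in that chain, here is the textbook form as a minimal sketch. It is not the ANEForge cost model: the function name and all figures below are placeholders, not measured ANE values.

```python
def roofline_seconds(flops: float, bytes_moved: float,
                     peak_flops: float, bandwidth: float) -> float:
    """Textbook roofline bound: the slower of the compute and memory limits."""
    compute_s = flops / peak_flops      # time if purely compute-bound
    memory_s = bytes_moved / bandwidth  # time if purely bandwidth-bound
    return max(compute_s, memory_s)


# Placeholder machine figures for illustration only.
PEAK = 10e12  # 10 TFLOP/s
BW = 100e9    # 100 GB/s

compute_bound = roofline_seconds(2e9, 4e6, PEAK, BW)  # compute term dominates
memory_bound = roofline_seconds(2e6, 4e8, PEAK, BW)   # bandwidth term dominates
```

A peak-compute table only bounds the first term, which is one way to read why typical measured speedups fall below the generational multiple.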
Weight-compression datapaths¶
compile(compress=...) streams compressed weights on the
e5rt path with an accuracy-gated fallback chain: int4 (4-bit LUT
palettization), sparse (unstructured bitmask), int8 (per-tensor or
per-channel affine), and blockwise (blockwise affine). All forms compile
and run; compress=None is byte-identical to fp16. int4-LUT, sparse, and
int8 stream weights (dequant during DMA); blockwise dequantizes in-program,
so it is a footprint lever rather than a bandwidth one. The single-matmul
streaming win does not always carry to full models - end-to-end this is
primarily a footprint and capacity lever, with the bandwidth win realized
on large weight-bound layers. compress="auto" is family-aware: it filters
its candidate encodings through native_streams(family), so on a
bandwidth-starved family it only auto-selects forms that stream natively and
falls back to fp16 rather than to a folding encoding (which would cost
accuracy for no bandwidth win). On M1 (h13) that means int4-LUT only;
A14+ stream all four. The family is taken from compile(target=), or the
host chip when no target is given; an explicit single-mode compress= is
never filtered.
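The family-aware selection described above can be sketched as a filter followed by an accuracy gate. Everything here is a hypothetical stand-in: `native_streams_sketch` encodes only what the text states (h13 streams int4-LUT natively, later families all four), the candidate order is assumed, and `passes_accuracy` is a caller-supplied placeholder for the real gate.

```python
# Hypothetical sketch of compress="auto"; not the ANEForge implementation.
ALL_MODES = ("int4", "sparse", "int8", "blockwise")  # candidate order is an assumption


def native_streams_sketch(family: str) -> tuple:
    """Stand-in for native_streams(family), using only the facts stated in the text."""
    return ("int4",) if family == "h13" else ALL_MODES


def pick_auto(family: str, passes_accuracy) -> str:
    """Return the first natively streaming mode that passes the gate, else fp16."""
    for mode in native_streams_sketch(family):
        if passes_accuracy(mode):
            return mode
    return "fp16"  # never fall back to a folding encoding
```

With this shape, a model that fails the int4 gate lands on fp16 on h13 but on the next candidate on later families.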
On-engine training (autograd + Trainer)¶
aneforge/autograd.py is a reverse-mode autograd whose forward, backward,
and optimizer step all run as ANE graph ops. Trainer and
UnrolledTrainer train real networks: a full-MNIST MLP reaches ~98% with a
host optimizer and ~97.8% with the optimizer on the engine.
Trainer(resident_state=True) keeps weights and optimizer moments resident
on the ANE across steps via output-to-input buffer aliasing, so the host
supplies only the minibatch and learning rate. UnrolledTrainer runs K
steps in one dispatch and is resident by default. Op coverage now spans MLP
and CNN graphs (matmul, bmm, conv, pooling, relu, gelu, the elementwise and
reduction ops, softmax, slice/concat/reshape/transpose), so CNNs train;
normalization and transformer coverage, and the remaining gradient gaps, are
described below. See training.md.
group_norm at diffusion shapes¶
group_norm is tiled over rank-4 feature maps, so the SD-1.5 wall shapes
(640 channels at 64x64, 512 at 128x128) compile and run family-wide. This
removes the group-norm wall that previously blocked large diffusion feature
maps; the remaining SD-1.5 blocker is numerical, not structural (below).
On-engine image input¶
af.image_input(shape, scale=, bias=) declares a uint8 input and
dequantizes on the engine, byte-identical to a host convert and saving the
host-side conversion per frame. This is the terminal form of direct image
input on the e5rt path; true 4CC (FourCC) input is not reachable there
(below).
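As a host-side reference for what the engine-side dequantization computes, a minimal sketch follows. The affine form `scale * pixel + bias` is an assumption about how the `scale=` and `bias=` arguments combine, and `dequantize_u8` is a hypothetical helper, not ANEForge API.

```python
def dequantize_u8(pixels: bytes, scale: float, bias: float) -> list:
    """Affine dequantization of uint8 pixels: scale * pixel + bias (assumed form)."""
    return [scale * p + bias for p in pixels]


# Map the uint8 range onto roughly [-1, 1); these constants are illustrative.
frame = bytes([0, 128, 255])
normalized = dequantize_u8(frame, scale=1 / 128, bias=-1.0)
```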
Compile-failure backoff guard¶
aneforge/_circuit.py is a backoff rate-limiter wired into Program.compile.
After a compile failure the breaker delays the next compile by a short interval,
a defensive backstop so that the autotuner's burst of variant compiles cannot
pile up; a successful compile clears it. It is warn-and-sleep by default
(ANEFORGE_COMPILE_BREAKER_STRICT raises, ANEFORGE_DISABLE_COMPILE_BREAKER
disables), and exposes af.CompileBackoffError and af.reset_compile_breaker.
af.image_input uses the uint8 route.
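The pacing behavior can be sketched as a small state machine. This is a hypothetical illustration, not the contents of aneforge/_circuit.py: the class name, the default interval, and the use of `RuntimeError` in strict mode are assumptions, and the environment variables are not modeled.

```python
import time


class CompileBackoffSketch:
    """Hypothetical pacing guard: delay a compile that follows a failure."""

    def __init__(self, interval_s: float = 0.05, strict: bool = False,
                 sleep=time.sleep):
        self.interval_s = interval_s
        self.strict = strict
        self._sleep = sleep      # injectable so tests need not really sleep
        self._tripped = False

    def before_compile(self) -> None:
        """Call before each compile; paces (or raises) only after a failure."""
        if not self._tripped:
            return
        if self.strict:
            raise RuntimeError("compile attempted while backoff is active")
        self._sleep(self.interval_s)

    def record(self, ok: bool) -> None:
        """A failure arms the guard; a success clears it."""
        self._tripped = not ok
```

A caller would wrap each compile as `guard.before_compile()`, then the compile itself, then `guard.record(ok)`.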
Open frontiers¶
The honest list of what is not done and is worth doing.
End-to-end LLM training at scale¶
The autograd registry now covers the normalization layers (layer_norm,
rms_norm, group_norm, l2_norm) and silu alongside the structural,
linear-algebra, activation, and math ops, so transformer and LLaMA-style
blocks, diffusion-UNet / GroupNorm CNNs, plain CNNs, and MLPs all train end to
end on the engine; the pre-norm transformer and LLaMA-block examples converge as
the end-to-end proof. The remaining gradient gaps are low-value and do not block
mainstream training: the reduction-routing ops amax / amin / cumsum and a
few parametric activations (atan, softplus, clamped variants), each an
additive frontend follow-up. Normalization affine is now learnable too: passing
parameter Tensors for a norm's gamma / beta composes the normalize op with an
explicit trainable scale and shift, so the affine trains with the rest of the
model (a numpy gamma / beta still bakes a fixed affine). Scale is
demonstrated on three axes, all on the engine: a four-layer model that
reconstructs its training text (examples/train_charlm.py), a corpus-trained
model that generalizes to a held-out split (examples/train_charlm_corpus.py),
and a sixteen-layer model trained with a layer-streamed compile
(examples/train_charlm_deep.py). The compile-size ceiling on depth is removed:
aneforge.streaming.CheckpointedStack compiles the per-layer forward and
backward of a stack of identical layers once and reuses them for every layer,
so compile work is constant in depth (the sixteen-layer model's programs
compile in about half a second, where the eight-layer monolith took about 162
seconds), with gradients bit-exact against the monolith. What remains is engineering reach rather than a
capability gap: larger corpora, longer context windows, and a resident-state or
on-engine optimizer for the streamed path.
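The compile-once, reuse-per-layer idea can be illustrated without the engine. The sketch below is hypothetical and covers only the reuse pattern: `LayerStackSketch` and `compile_fn` are illustrative names, and it does not reproduce the interface of aneforge.streaming.CheckpointedStack.

```python
class LayerStackSketch:
    """Hypothetical: compile one layer's forward/backward once, reuse across depth."""

    def __init__(self, compile_fn):
        # Exactly two compiles, independent of how many layers are run later.
        self.forward = compile_fn("forward")
        self.backward = compile_fn("backward")

    def run_forward(self, x, weights):
        """Apply the shared forward program once per layer, keeping activations."""
        activations = [x]
        for w in weights:
            x = self.forward(x, w)
            activations.append(x)
        return x, activations

    def run_backward(self, grad, weights):
        """Apply the shared backward program once per layer, in reverse order."""
        for w in reversed(weights):
            grad = self.backward(grad, w)
        return grad
```

Because the per-layer programs are shared, doubling the depth doubles the number of dispatches but leaves the number of compiles unchanged.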
The conv input-gradient on M1 is now slice-saturation-safe (the im2col backward
concatenates patches on a non-last axis so it never transits the h13 width-axis
crop-DMA saturation). The remaining M1-specific item is the behavior under many
rapid compiles, which the compile backoff in aneforge/_circuit.py paces as a
defensive backstop, best verified on M1 hardware.
Separately, LLM inference throughput remains GPU territory: decode is dispatch-bound and the fp16-product precision limits force the cancellation-sensitive layers onto the CPU, so the GPU wins decode on both speed and energy at every batch size. The ANE's genuine niche is fp16-tolerant, compute-bound work - vision, encoders, on-device embedding - not autoregressive decode. See the bottlenecks section.
Stable Diffusion 1.5 end-to-end¶
Per-component validation passes (UNet and VAE steps run on the ANE within a few percent), and the group-norm shape wall is gone. The remaining wall is classifier-free guidance (CFG) cancellation: the conditional minus unconditional difference is a small fraction of the signal magnitude and is swamped by accumulated fp16 noise over the denoising trajectory. This is a numerical-envelope limit, not a missing op. The path forward is paired-fp16 precision carried through the UNet tail (the subtraction itself is exact by Sterbenz; the loss is input quantization), which is partially built in the math toolkit and not yet wired through the full UNet.
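The two halves of that parenthetical can be checked with the standard library alone. Sterbenz's lemma says that if b/2 <= a <= 2b, the floating-point difference a - b is exact; the sketch below confirms this for binary16 and then shows the real loss, which happens when the inputs are rounded to fp16. The sample values are illustrative, and modeling paired-fp16 as a hi + lo pair of fp16 numbers is an assumption about what that term means here.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sub(a: float, b: float) -> float:
    """fp16 subtraction of two fp16 values: exact difference, then one rounding."""
    return to_fp16(a - b)


# 1) Sterbenz: with b/2 <= a <= 2b the rounded difference equals the exact one.
rng = random.Random(0)
sterbenz_exact = True
for _ in range(10_000):
    b = to_fp16(rng.uniform(0.5, 4.0))
    a = to_fp16(rng.uniform(b / 2, 2 * b))
    sterbenz_exact = sterbenz_exact and (fp16_sub(a, b) == a - b)

# 2) Input quantization: two nearby values round to the same fp16 number,
#    so the (exact) subtraction returns zero and the signal is gone.
cond, uncond = 1.2345678, 1.2340000       # illustrative values
true_diff = cond - uncond
quantized_diff = fp16_sub(to_fp16(cond), to_fp16(uncond))

# 3) Paired fp16 (assumed hi + lo form): carry the rounding residual alongside.
def split(x: float):
    hi = to_fp16(x)
    return hi, to_fp16(x - hi)

(c_hi, c_lo), (u_hi, u_lo) = split(cond), split(uncond)
paired_diff = fp16_sub(c_hi, u_hi) + fp16_sub(c_lo, u_lo)
```

Here `quantized_diff` is zero while `paired_diff` recovers the difference to well within fp16 resolution of the residuals, which is the motivation for carrying the pair through the tail rather than fixing the subtraction.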
M2 / M3 silicon power anchors¶
The watt-complete characterization is measured on three physical chips (M1 /
A13, M2 Pro / A14, and M5 / A16-class): the M2 run is the full 16-class watt map
(device_compare_wattcomplete_results_M2.json), plus a 25-point latency grid that
drives the h14 cost anchor and its mid-utilization ramp and a power anchor
(per-compression-mode energy and idle/compute/conv rails). The M-series-to-H-target
ladder is verified, so M3/M4 remain
ground-truth capability targets (H15/H16). The remaining gap is A15/M3:
its absolute power rail and full watt-complete map have not been measured, and
neither M1 nor the now-anchored A14 can supply those points. Capability and
relative cost are covered family-wide; per-rail watts for the A15 generation
need that hardware.
Known hard limits¶
These are not roadmap items. They are the boundaries of the approach, stated so nothing here overclaims.
The two locks¶
- No fp32 / int32 / bf16 compute. The ANE compute dataplane is fp16 (with a wide, at-least-fp32 accumulator). fp32, int32, and bf16 are accepted by the MIL parser but are "not implemented on any backend" for the ANE - even a bare cast is rejected. This is silicon, not a path limit. Every cancellation-sensitive workload that needs exact products (notably long-contraction matmuls under cancellation, and the transformer down-projection) has no fp16-tolerable form.
- The entitlement boundary. Custom signed HWX cannot be loaded (the kernel driver verifies code signatures), and the fully autonomous, zero-host-call dispatch loop is entitlement-gated. Measurement shows the e5rt surface is functionally complete for the workloads here: bounded multi-step host-free dispatch is reachable without an entitlement, and the per-step host dispatch was never the bottleneck in the measurements, so the entitlement adds no capability these models need.
True 4CC (FourCC) image input¶
Declaring a true interchange image format (e.g. &BGA) as an input is
not reachable on the e5rt path. The format grammar is solved and the MIL
parses and type-checks, but the final lowering of pixel_buffer_to_tensor
does not complete without the entitled CoreML Input4CCFormat + IOSurface
route. af.image_input(uint8) is the terminal form for direct image input
here.
Control flow and recurrence¶
cf_if/else/loop, phi_virtual, and rnn_arch (LSTM/GRU) are a
genuine single-procedure wall on the authorable netplist surface, and Apple
itself routes recurrence off the ANE. Static-iteration unrolling fits;
data-dependent control flow does not.
Known bottlenecks¶
Measured, not speculative.
- Per-call dispatch floor. Each e5rt dispatch has a fixed launch cost (tens of microseconds; higher on bandwidth-rich chips). Tiny op chains are floor-bound, where BNNS or numpy can win; ANE territory begins around ~100 MFLOPs per call. Fusing the whole graph into one program - what the frontend does by default - amortizes this.
- LLM decode is dispatch-bound and GPU-favored. The fp16-product precision limits force the precision-sensitive layers onto the CPU, which burns more power than running the whole model on the GPU; combined with the per-token dispatch overhead, the GPU wins LLM decode on speed and energy at every batch size. Batching is the dominant decode-throughput lever, but ANE batched serving plateaus far below the GPU because it stays CPU-dominated.
- Per-PID program cap. aned enforces a hard cap of 128 simultaneously loaded programs per process, with no LRU eviction (program 129 fails to compile). Program.release() frees a slot immediately. Budget shape specializations against this cap, or fuse more aggressively.
- Very large or rapid back-to-back compiles. Very large unrolled graphs or many programs compiled back-to-back in one process can fail to compile. This is environmental, not a code bug; reduce program size or re-run. The compile backoff in aneforge/_circuit.py paces repeated failures as a defensive backstop.
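Since the daemon does no eviction of its own, one way to budget against the 128-program cap is a host-side LRU that releases the least recently used program before compiling a new one. The sketch below is hypothetical: `ProgramSlots` and `compile_fn` are illustrative names, and the only behavior taken from the text is that a program object has a `release()` method that frees its slot.

```python
from collections import OrderedDict

PROGRAM_CAP = 128  # per-process cap stated above


class ProgramSlots:
    """Hypothetical host-side LRU over compiled programs."""

    def __init__(self, compile_fn, cap: int = PROGRAM_CAP):
        self._compile = compile_fn
        self._cap = cap
        self._live = OrderedDict()  # key -> program, oldest first

    def get(self, key):
        """Return the program for key, compiling (and evicting) if needed."""
        if key in self._live:
            self._live.move_to_end(key)  # mark as most recently used
            return self._live[key]
        if len(self._live) >= self._cap:
            _, oldest = self._live.popitem(last=False)
            oldest.release()             # free the slot before compiling
        program = self._compile(key)
        self._live[key] = program
        return program
```

Keying on the shape specialization keeps the number of live programs bounded at the cost of recompiling a specialization that was evicted.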
Limits of the capability map¶
The op census is exhaustive over the surface Apple exports, not over the silicon's full op set. Three structural blind spots remain and are worth keeping in view:
- Internal-only layers. The crackability predictor keys on exported validators; at least one real hardware layer (RCAS) has an internal validator and no exported symbol, found only by an opcode cross-check. Others may exist that the export census structurally cannot see.
- Composite reachability without a single-layer validator. Several ops run on silicon with no *Layer validator (conv_transpose, group_norm, rms_norm, and others). Reachability is empirical (probe-confirmed), not derivable from the validator list, and the predictor is necessary-leaning but not sufficient - HWX codegen is the real gate.
- Temporal and single-process scope. The whole map is one OS build and one compiler version, measured single-process and steady-state. Contention with other ANE clients, QoS preemption, cache eviction under memory pressure, and per-op thermal behavior are unmapped. Negative labels are characterized at specific configs, not swept boundaries, and several decompose through a second authoring path (so "hard limit" can over-count genuine silicon walls). A second OS build or chip is the lever that converts these from inferred to measured.
The single largest source of unknown-unknowns is what Apple's own models
exercise (Path B, streaming state, dynamic shapes, control flow) that we
would not think to author. Sweeping the on-system Apple espresso.net
corpus is the cheapest way to surface those.