Applied math¶
Self-contained numerical toolkits built over the public operators: iterative linear
algebra, an FFT carried as real/imaginary pairs, special functions, a general einsum,
and signal processing.
Linear algebra¶
linalg ¶
aneforge.linalg - a linear-algebra toolkit for the Apple Neural Engine.
WHAT FITS THIS HARDWARE¶
The ANE is a fp16 feed-forward dataflow engine: it has no in-graph loop and no data-dependent control flow. Anything with a FIXED, data-independent schedule fits, and that covers two families, both shipped here:
- ITERATIVE, MATMUL-DOMINATED methods with a FIXED iteration count (CG, Jacobi, refinement, LSQR, GMRES, randomized SVD, ...): pure static dataflow. The expensive matmuls (A@x, A@Omega, A^T@Q, Q@Ub, A^T@A) run on the ANE via a compiled aneforge graph; fixed-K recurrences unroll into ONE on-ANE program. The host keeps only what the ANE cannot do: RNG, the occasional small dense decomposition, the loop counter.
- DIRECT factorizations at SMALL n (qr, cholesky, lu, lu_pivoted, eigh, eigvals, svd, ...): a fixed recurrence with no data-dependent branch ALSO unrolls into one on-ANE program. The unrolled chain is O(n^3) graph ops, so these are small-n. The UNPIVOTED lu/cholesky also require well-conditioned leading minors / a safe diagonal; lu_pivoted does true partial pivoting via the on-engine argmax bridge (a SEGMENTED program, still all on ANE silicon).
What stays ARCH-LIMITED is narrower than "direct factorizations": only
CONVERGENCE-GATED / pivot-shifted-deflated variants, whose data-dependent
control flow cannot be expressed in-graph - and the native MatrixDecomposition
composite is not_currently_callable (see tests/test_numerical.py).
THE fp16 ENVELOPE (from the reverse-engineering corpus, re-measured in main)¶
The ANE matmul accumulator is WIDE (>=fp32), so a single A@x is clean even at cond~1e4 - BUT the iterates/residuals are stored and re-fed as fp16, and the residual b - A x is a catastrophic-cancellation subtract. Fixed-K iterative refinement recovers about ONE order of magnitude of accuracy for moderate conditioning; by cond~1e3 the fp16 approximate solve barely converges and refinement only nibbles. We claim no accuracy past what the sweep shows.
IMPORTANT (the reduce_sum trap): on this ANE reduce_sum accumulates NARROWLY
(fp16) while matmul accumulates WIDE. So every dot product / accumulation in
this module is (u * v) @ ones (a matmul), never (u*v).sum().
API (importable as from aneforge.linalg import conjugate_gradient, qr, eigh, ...)
Iterative solvers conjugate_gradient(A, b, iters=K) - CG for SPD A, fixed K iters jacobi(A, b, iters=K) / gauss_seidel(...) - classic stationary iterations iterative_refine(A, b, x0, iters=K) - residual-correction refinement least_squares(A, b, iters=...) - normal equations solved by CG + refine lsqr(A, b, iters=K) - Golub-Kahan least squares, fully on-ANE gmres(A, b) - general square solve, one GMRES(n) cycle
Direct factorizations (small-n, fixed recurrences unrolled on-ANE) qr(A) - thin QR by modified Gram-Schmidt cholesky(A) - unpivoted Cholesky of an SPD A lu(A) - unpivoted Doolittle LU (well-conditioned A) lu_pivoted(A) - partial-pivoted LU (argmax bridge, segmented)
Eigenvalues / SVD eigh(A, sweeps=...) - full symmetric spectrum, cyclic Jacobi eigvals(A, iters=...) - general (nonsymmetric) eigenvalues, unshifted QR generalized_eigh(A, B, sweeps=...) - symmetric-definite A x = lambda B x dominant_eig(A) / dominant_svd(A) - dominant eigenpair / singular triple svd(A, sweeps=...) - all singular values via Gram matrix + eigh svdvals_topk(A, k, ...) - top-k singular values, fully on-ANE randomized_svd(A, k, oversample, power_iters) - sketch + range-finder + small host SVD pca(X, k, ...) - randomized_svd of centered data
Run the self-test
PYTHONPATH=. python3 aneforge/linalg.py
conjugate_gradient ¶
Solve A x = b for SYMMETRIC POSITIVE-DEFINITE A by CG, FIXED iters.
Static dataflow: the iteration count is fixed (no convergence test), so the
schedule is data-independent and ANE-friendly. Per iteration the ONE heavy op
is the GEMV A @ p (ANE, wide accumulator). The two dot products
(r.r, p.Ap) are also ANE matmuls; the scalar ratios / axpy updates are host.
ANE: A@p, r.r, p.Ap, r2.r2 (every matmul/dot). HOST: alpha/beta scalar ratios, x/r/p axpy updates, the fixed loop.
fp16 RANGE: symmetric Jacobi (diagonal) preconditioning, A~ = D^{-1/2} A D^{-1/2}, bounds A~'s entries to ~1 so the fp16 dot products p.Ap do not OVERFLOW (raw cond~1e2 GEMV products blow past 65504 otherwise). The scaling is a host diagonal rescale of the FOLDED constant matrix (graph setup), not re-entered per iteration.
FULLY ON THE ANE: the entire K-iteration recurrence - the A@p GEMV, the r.r and p.Ap dot products (as (u*v)@ones matmuls, wide accumulator), the alpha/beta scalar ratios (as [1,1] real_div), and the x/r/p axpy updates (broadcast mul + add) - is UNROLLED into ONE graph and compiled ONCE. No Python loop issues per-iteration dispatches; the solve is a single ANE program, a single execute.
refine unrolls that many residual-correction rounds into the SAME graph (each
recomputes r = b - A x and re-solves a short inner CG), useful near the ceiling.
Source code in aneforge/linalg.py
jacobi ¶
Jacobi iteration x_{k+1} = D^{-1} (b - (A - D) x_k), FIXED iters.
Rewritten so the heavy op is the full GEMV A @ x on the ANE:
x_{k+1} = x_k + D^{-1} (b - A x_k).
Converges for diagonally-dominant A.
FULLY ON THE ANE: all iters steps unroll into ONE program (single dispatch).
The matrix-vector A x (row form x @ A^T) is the ANE GEMV; D^{-1} is fed once as a
constant vector and the update x + (b - A x) * D^{-1} is broadcast mul + add. The
host builds the graph and reads the answer; it does not loop.
Source code in aneforge/linalg.py
gauss_seidel ¶
Gauss-Seidel iteration, FIXED iters.
Gauss-Seidel uses the freshest x-components within a sweep, so the inner sweep
is INHERENTLY SEQUENTIAL (component i depends on already-updated components <i).
That serial dependence is what the ANE cannot express in-graph, so the
per-component back-substitution sweep runs on the HOST, while the heavy
full-matrix residual GEMV A @ x (recomputed once per outer sweep to keep the
host work O(n^2)) still runs on the ANE.
ANE: A@x (the bulk FLOPs). HOST: the sequential triangular sweep + loop. The honest division: Gauss-Seidel's serial sweep is the arch-limited part; only the dense GEMV stays on-device.
Source code in aneforge/linalg.py
iterative_refine ¶
Sharpen an approximate solve x0 of A x = b by fixed-K residual correction.
Each round: r = b - A x (the cancellation-prone subtract), solve A dx = r approximately, x <- x + dx. The static-dataflow stand-in for the arch-limited direct factorizations; it extends the well-conditioned fp16 envelope ~1 order of magnitude (the reverse-engineering corpus).
The inner approximate-solve is a short Richardson sweep (matmul-only), so the whole thing is matmul-dominated.
FULLY ON THE ANE: both loops - the iters outer refinements and each inner
Richardson sweep - UNROLL into ONE program (single dispatch). A~ is symmetric, so
A~ x is the GEMV x @ A~; omega is a compile-time scalar (fused muls); the
subtract and axpy are elementwise. Host: build the graph, read the answer.
Source code in aneforge/linalg.py
randomized_svd ¶
Truncated SVD of A [m,n] via Halko-Martinsson-Tropp randomized range finding.
Steps
- Omega = host-drawn Gaussian [n, l] (a constant, not on-device compute) HOST
- Y = A @ Omega (the big sketch) ANE power iterations with re-orthonormalization: Q = qr(Y); Y = A @ (A^T @ Q) ANE + host QR
- Q = orthonormal basis of Y (l columns) HOST
- B = Q^T @ A (the projection) ANE
- (Ub, S, Vt) = svd(B) (small l x n SVD) HOST
- U = Q @ Ub (the back-projection) ANE
Return (U[:, :k], S[:k], Vt[:k, :]).
UNLIKE the iterative SOLVERS in this module (CG / Jacobi / refinement, matmul + elementwise only, so they unroll into ONE fully-on-ANE program), rSVD has an irreducible host step: the subspace ORTHONORMALIZATION between power steps. Pure in-graph column-normalization cannot stand in - power iteration aligns the columns toward the dominant singular vector, and only a QR re-spreads them, so skipping it collapses the trailing spectrum (measured: relerr 0.4-0.7 on the small singular values). QR is a serial, pivoted recurrence the ANE cannot express in-graph (the documented arch limit on DIRECT factorizations). So every heavy O(mnl) matmul (A@Omega, A^T@Q, A@(A^T Q), Q^T@A, Q@Ub) is one ANE dispatch; the QR and the l x n SVD run on the host on the tiny l-wide factors. This is the on-ANE ceiling for rSVD.
Source code in aneforge/linalg.py
pca ¶
Principal components of data X [samples, features] via randomized SVD of the CENTERED data. Centering is a host mean-subtract (cheap); the SVD's matmuls run on the ANE. Returns (components [k, features], singular_values [k], mean [features]).
Source code in aneforge/linalg.py
least_squares ¶
Solve min_x ||A x - b||_2 via the normal equations A^T A x = A^T b, solved by CG (A^T A is SPD) with iterative refinement.
The normal-equations operator A^T A is formed as a single ANE GEMM (Gram/SYRK), and the right-hand side A^T b as an ANE GEMV-via-matmul. CG then runs on the n x n SPD system. Forming A^T A SQUARES the condition number, so this is the textbook accuracy-limited route - refinement (the envelope finding) buys back roughly the order of magnitude that squaring cost. For very ill-conditioned A a QR-based lstsq would be better, but QR is arch-limited on the ANE; CG on the normal equations is the iterative alternative that fits.
ANE: A^T A (Gram), A^T b, and every A_normal@p inside CG. HOST: CG scalars + loop.
Source code in aneforge/linalg.py
lsqr ¶
Solve min_x ||A x - b||_2 by LSQR (Golub-Kahan bidiagonalization), FULLY on the
ANE: the entire bidiagonalization + Givens recurrence unrolls into ONE program. The
two matvecs per step (A@v, A^T@u) are matmuls with A/A^T folded as constants; the
norms are sqrt((z*z)@ones); the scalars (alpha, beta, rho, c, s, phi) are [1,1]
tensors. No host orthogonalization. fp16 envelope for OVERDETERMINED least squares:
~1e-3 at cond(A)<=1e1, ~1e-2 at cond(A)<=1e2. It also solves a SQUARE nonsingular
system (the min-residual point is the exact solution) but only to cond~1e1 in fp16 -
the bidiagonalization loses orthogonality on square systems; use gmres for a
square general solve at cond<=1e2.
Source code in aneforge/linalg.py
dominant_eig ¶
Dominant (largest-|.|) eigenvalue + eigenvector of a SYMMETRIC A by power iteration, FULLY on the ANE: x <- A x / ||A x|| unrolls into ONE program; the eigenvalue is the in-graph Rayleigh quotient (x^T A x)/(x^T x). Returns (lambda, eigenvector). A few extremal pairs follow by deflation; the FULL spectrum needs the arch-limited dense QR iteration. fp16-clean to ~1e-3 for a well-separated dominant eigenvalue.
Source code in aneforge/linalg.py
gmres ¶
Solve A x = b for GENERAL (nonsymmetric) A by ONE full GMRES(n) cycle, FULLY on the ANE: Arnoldi (double modified-Gram-Schmidt reorthogonalization), the Givens rotations, and the upper-triangular back-substitution ALL unroll into one program. Each Arnoldi matvec A@q is a matmul (A folded); the projections are (q*w)@ones dots; the rotation and back-solve scalars are [1,1] tensors. More accurate than lsqr for a square general system at higher conditioning (~7e-3 at cond<=1e2), but HEAVIER: the m^2 orthogonalization plus the sequential back-substitution make a large, deep graph (~5.7k ops at n=24) that compiles slowly, so prefer lsqr unless the extra accuracy is needed. fp16 envelope ~1e-2 at cond<=1e2; SPD systems should use the cheaper conjugate_gradient.
Source code in aneforge/linalg.py
dominant_svd ¶
Dominant singular triple (sigma1, u1, v1) of A by power iteration on A^T A, FULLY on the ANE: v <- A^T(A v)/||.|| unrolls into one program (the matvec is two matmuls, A and A^T folded); sigma1 = ||A v1|| as an in-graph norm, u1 = A v1 / sigma1. Returns (sigma1, u1, v1). The dominant triple only - a top-k SVD needs the subspace/ Rayleigh-Ritz step whose dense l x l decomposition is the on-ANE frontier (see randomized_svd, which keeps that small factorization on the host).
Source code in aneforge/linalg.py
eigh ¶
ALL eigenvalues of a SYMMETRIC A by fixed-sweep cyclic Jacobi on the ANE. Returns the eigenvalues sorted ascending.
iterate=False (default): the whole sweeps-sweep recurrence UNROLLS into ONE fused
program. Cleanest, but small-n only - the O(n^3) rotation chain is a deep graph (caps near
n=20 at ~4e-3 vs numpy.eigh, cond<=1e2).
iterate=True: compile ONE cyclic-Jacobi sweep and host-loop it sweeps times,
feeding the matrix back each round. The per-sweep graph is O(n^2) instead of O(sweeps*n^2),
so it reaches MUCH larger n (n~48 at ~1.2-1.8e-2). The sweep compute is on-engine; the host
only shuttles the matrix between sweeps (a data move, no tensor math). Use when n exceeds
the unrolled ceiling. For a few extremal pairs of a large matrix use dominant_eig.
Source code in aneforge/linalg.py
svd ¶
ALL singular values of A, FULLY on the ANE: form the symmetric Gram matrix A^T A as
one ANE GEMM, then take its full spectrum with the on-ANE cyclic-Jacobi eigh
(sigma_i = sqrt(eig_i)). Returns singular values descending. Forming A^T A SQUARES the
condition number, so this is accurate for cond(A)<=1e1 in fp16; for top-k of a large /
ill-conditioned A use randomized_svd. Small-n (eigh is the heavy unrolled Jacobi).
Source code in aneforge/linalg.py
svdvals_topk ¶
Top-k singular VALUES of a large A, FULLY on the ANE - randomized range finding with
every step on the engine: the sketch A@Omega and the power products (matmuls), the
range-basis orthonormalization (the on-ANE Gram-Schmidt qr, re-run between power
steps), the projection Q^T A, and the small-block svd (on-ANE cyclic Jacobi). The
host only orchestrates (array transposes/slices, no tensor math). Unlike
randomized_svd - which keeps the QR + small SVD on the host - this is the fully-on-
engine top-k. Returns the k largest singular values, ~1e-2 for a rank-k matrix.
NOTE the fp16 tradeoff: extra power iterations HURT here (relerr 1.2e-2 -> 8e-2 -> 0.2 as power_iters goes 1 -> 2 -> 3). Each A^T A step squares the spectrum, and in fp16 that crushes the trailing singular values toward the dominant one, so the minimal sketch (oversample 2, one power step) is the most accurate for a clean low-rank matrix.
Source code in aneforge/linalg.py
qr ¶
Thin QR A = Q R by modified Gram-Schmidt, FULLY on the ANE (one program). Q has orthonormal columns, R is upper-triangular. Each projection is an (a*b)@ones dot (wide accumulator) + an axpy; the whole orthogonalization unrolls. Returns (Q, R). Small-n.
Source code in aneforge/linalg.py
cholesky ¶
Lower-triangular Cholesky factor L (A = L L^T) of an SPD A, UNPIVOTED, FULLY on the ANE: a fixed recurrence with each L[i,j] a [1,1] tensor, dividing by the pivot L[j,j] via real_div. Small-n. ~3e-4 vs numpy at n=8.
Source code in aneforge/linalg.py
lu ¶
Unpivoted LU (A = L U, L unit-lower, U upper) by Doolittle, FULLY on the ANE: each entry a [1,1] tensor, dividing by the pivot U[i,i] via real_div. Returns (L, U). Small-n; UNPIVOTED, so valid when no leading pivot underflows (well-conditioned A).
Source code in aneforge/linalg.py
lu_pivoted ¶
Partial-pivoted LU: P A = L U with row interchanges, on the ANE. Returns (P, L, U).
The pivot at each column is a true on-engine argmax over the subcolumn, turned into a
permutation by a one-hot (arange + greater + select) and applied as a row-swap; the
Schur elimination is masked to the trailing submatrix so the stored multipliers are
preserved. UNLIKE the unpivoted lu, the argmax is a netplist BRIDGE op, so the program
is SEGMENTED (one graph cut per column), not a single fused program - still all on ANE
silicon. Small-n; ~5e-4 vs P A reconstruction at n<=8.
Source code in aneforge/linalg.py
generalized_eigh ¶
Eigenvalues of the generalized symmetric-definite problem A x = lambda B x (A
symmetric, B SPD; LAPACK sygv), FULLY on the ANE by composing on-engine kernels:
B = L L^T (cholesky), C = L^-1 A L^-T (triangular inverse + two GEMMs), then the symmetric
eig of C (cyclic Jacobi eigh). Returns the generalized eigenvalues ascending. Small-n
(the eigh O(n^3) graph); ~4e-4 vs the fp64 reduction at cond<=1e1.
Source code in aneforge/linalg.py
eigvals ¶
Eigenvalues of a GENERAL (nonsymmetric) real A by the unshifted QR algorithm, FULLY on
the ANE as ONE fused program: iters rounds of M <- R Q where M = Q R (Gram-Schmidt QR
in-graph, RQ a matmul) drive M to real Schur (quasi-triangular) form. NO pivoting, NO
shifts, NO deflation, so it is pure fixed-iteration dataflow with no bridge op. The host
only READS the eigenvalues off the converged form - a 1x1 diagonal block is a real
eigenvalue, a 2x2 block gives a COMPLEX-conjugate pair. Returns a complex array. Small-n
(the unrolled QR chain), well-separated spectra: ~2e-3 (real) / ~2e-2 (complex) vs
numpy.eigvals. Unshifted QR converges linearly, so clustered spectra need more iters.
Source code in aneforge/linalg.py
FFT¶
fft ¶
aneforge.fft - a real FFT for the Apple Neural Engine, built as Cooley-Tukey factored into MATMUL STAGES.
The ANE has NO complex dtype (compute is fp16 real only) and no in-graph loop or gather, but it is a matmul machine. A previous probe (tests/test_spectral_sci.py, examples/spectral_analysis.py) showed the DFT-as-a-twiddle-matrix-matmul is fp16-CLEAN to N=2048 because the matmul accumulator is wide (>=fp32) - but that is the naive O(N^2) DFT (one dense [N,N] twiddle matrix). This module keeps the "every stage is a matmul" property while cutting the op count to ~O(N*(N1+N2+...)) via the Cooley-Tukey factorization.
THE IDEA (Cooley-Tukey, one split N = N1 * N2): Index n = n1N2 + n2 (n1 in [0,N1), n2 in [0,N2)); k = k2N1 + k1. X[k1 + N1k2] = sum_{n2} [ exp(-2pi i n2 k1 / N) * ( sum_{n1} x[n1N2 + n2] * exp(-2pi i n1 k1 / N1) ) ] * exp(-2pi i n2 k2 / N2) => reshape x to [N1, N2]; (1) N1-point DFT down each column (matmul by the [N1,N1] twiddle Wn1), (2) multiply by the CROSS twiddles T[k1,n2] = exp(-2pi i k1 n2 / N), (3) N2-point DFT along each row (matmul by the [N2,N2] twiddle Wn2), (4) transpose to put the output in order X[k1, k2] -> X[k2*N1 + k1].
WHAT THIS MODULE BUILDS (and why). N is factored into AT MOST 3 balanced GROUPS, N = g0g1g2, and the signal is reshaped ONCE to a 4-D tensor [1, g0, g1, g2] that stays fully split for the whole computation. Each group is one dense-DFT matmul stage along its own axis (a single transpose to bring the axis last, the complex matmul, a single transpose back), with cross-twiddle multiplies between stages and ONE final axis-reversal transpose. Cost = N*(g0+g1+g2) MACs vs the dense DFT's N^2 - sub-quadratic, e.g. 32x fewer at N=1024, 51x at N=2048; every stage is a matmul (ANE-native).
TWO HARD ANE WALLS shape this (both found empirically on M5, see _factor / _axis_dft): - The ANE caps transpose+matmul at RANK-4 tensors (rank>=5 fails ANECCompile), so the fully-split tensor carries at most 3 factor groups -> a 3-stage FFT, not a full log-depth radix-2 cascade. - Chaining the recursive transpose->reshape->transpose COLLAPSE mis-fuses on this ANE (correct VALUES, permuted ORDER). Keeping the tensor fully split with ONE isolated transpose per stage sidesteps it. (A single transpose, or a transpose separated by a matmul, is reliable - verified.)
COMPLEX AS REAL PAIRS. Every value is a (re, im) Tensor pair. A complex matmul C = A @ B is four real matmuls: Cre = Are@Bre - Aim@Bim Cim = Are@Bim + Aim@Bre (Straight 4-matmul form, not Karatsuba: on the ANE matmul is cheap and the wide accumulator keeps the straight form cleanest - Karatsuba's (a+b)(c+d) sums lose a bit of fp16 headroom for no real op savings here.) The twiddle matrices are small fp16 CONSTANTS folded into the graph (matmul against a numpy array is a streamed weight, see graph.Tensor.matmul).
This is a submodule over the PUBLIC aneforge ops only (@ matmul, reshape, transpose,
add/sub/mul, concat, square, sqrt, af.input, af.compile). It does not touch graph.py,
_compile.py, init.py, _optimize.py, _paired.py or linalg.py.
API fft(x_re, x_im, N) -> (X_re, X_im) complex -> complex ifft(X_re, X_im, N) -> (x_re, x_im) inverse (1/N scaled) rfft(x_real, N) -> (X_re, X_im) real input (imag = 0) fft2(x_re, x_im) -> (X_re, X_im) 2-D transform of an [M,N] field ifft2(X_re, X_im) -> (x_re, x_im) inverse (1/(M*N) scaled) magnitude(X_re, X_im) -> |X| sqrt(re^2 + im^2) power(X_re, X_im) -> |X|^2
Each returns numpy arrays; under the hood it builds an aneforge graph, compiles it to
ONE fused e5rt program, and runs it on the ANE. A Plan object (fft_plan / rfft_plan)
compiles once and runs many times.
PYTHONPATH=. python3 aneforge/fft.py
Plan ¶
A compiled staged-FFT program. Holds the e5rt Model plus the cross-twiddle constant arrays, which it threads in automatically on every call.
Source code in aneforge/fft.py
Plan2 ¶
A compiled 2-D FFT program for [M,N] complex fields - ONE fused e5rt program.
The 2-D DFT is separable: X_hat = F_M @ X @ F_N^T. Applied to a whole matrix, each axis transform is a single complex matmul (a DFT of every row at once), NOT a per-row loop: rows transform as one matmul against the [N,N] twiddle, columns as transpose -> matmul against the [M,M] twiddle -> transpose back. Eight real GEMMs total, fused into one program - vs M+N dispatches for host-looping the 1-D plan.
The DENSE per-axis form, O(MN(M+N)) MACs (not the staged sub-quadratic 1-D build): at PDE-scale fields the ANE eats the GEMMs and dispatch dominates, and a per-axis staged split of a 2-D field would exceed the rank-4 transpose+matmul cap. Each transpose sits between matmuls (the safe pattern; only transpose->reshape CHAINS mis-fuse - see _axis_dft).
Source code in aneforge/fft.py
fft ¶
Forward FFT of a complex signal (real/imag arrays), length N, on the ANE. Returns (X_re, X_im) numpy arrays of length N.
ifft ¶
rfft ¶
Forward FFT of a REAL signal (imag = 0) on the ANE. Returns the full-length complex spectrum (X_re, X_im); the upper half is the conjugate mirror.
fft2 ¶
2-D FFT of an [M,N] complex field on the ANE as ONE fused program (F_M @ X @ F_N^T as eight real GEMMs). x_im=None means a real field. Returns (X_re, X_im) [M,N] numpy arrays, np.fft.fft2 convention.
Source code in aneforge/fft.py
ifft2 ¶
Inverse 2-D FFT (1/(M*N) normalized) on the ANE, one fused program. Returns (x_re, x_im), np.fft.ifft2 convention.
magnitude ¶
Special functions¶
special ¶
aneforge.special - special functions evaluated on the Apple Neural Engine as fused fp16 polynomial chains.
The ANE's strength is a long, dependent chain of fused fp16 mul/add (one e5rt
program, no graph cut). Special functions are exactly that: Horner / Clenshaw
evaluation of minimax or rational approximations, plus smooth range reduction.
This module builds each function as an aneforge graph out of the public ops
(+ - * /, exp/log/sqrt/sin/cos, clip) only, so the whole thing compiles
into one fused ANE program.
from aneforge.special import erfc, gamma, lgamma, expm1, log1p, bessel_j0
import aneforge as af, numpy as np
x = af.input((1, 64))
net = af.compile(erfc(x)) # one fused ANE program
y = net(np.linspace(0, 4, 64).astype(np.float16).reshape(1, 64))
Every function takes and returns an aneforge.Tensor; you compile and run the
result yourself (so several can share inputs / fuse together).
WHY THESE: they add value beyond the native unaries (exp/log/erf/...) which
either (a) do not exist (gamma, lgamma, erfc, expm1, log1p, Bessel) or (b)
degrade / cancel in fp16 over a wide range (erfc = 1-erf cancels to 0 past x~2;
exp loses a digit past |x|~8). See the __main__ self-test for measured
relerr vs scipy and the honest fp16 verdict per function.
THE TWO CONSTRAINTS that shape the implementations
- fp16 compute. The ANE computes in fp16 (~3-4 significant digits). A poly exact in fp64 is capped at ~1e-3..1e-4 relerr once rounded. Where a function's output leaves the fp16 range (gamma > 65504 past x~8; I0 past x~12) that is a hard wall, not a coefficient problem.
- no scalar add and no in-graph branch. The graph has
* scalar(muls) but no+ scalarand no data-dependent control flow. We add a constantcwith_const(x, c)(acos(0)=1ones tensor timesc), and use only fixed, smooth range reduction - never a per-element branch / variable step count. Functions needing a data-dependent reduction (gamma far outside [1,2], lgamma reflection for x<0) are scoped to the range a single static graph can cover, documented per function.
sin ¶
sin(x) for x in [-pi/2, pi/2] as a fused fp16 polynomial (x * P(x^2)).
Portable trig: only mul/sub/exp, so it runs on ANE families lacking the native sin op (A15+), M1/H13 included. Outside [-pi/2, pi/2] reduce the argument on the host first (the static graph has no data-dependent range reduction).
Source code in aneforge/special.py
cos ¶
cos(x) for x in [-pi/2, pi/2] as a fused fp16 polynomial (Q(x^2)).
Portable companion to sin above - native cos is A15+, this runs everywhere
M1 included. Reduce arguments outside [-pi/2, pi/2] on the host.
Source code in aneforge/special.py
erfc ¶
Complementary error function erfc(x) = 1 - erf(x) for x >= 0.
Direct rational*exp form (A&S 7.1.26) - does NOT cancel for large x, unlike
1 - native_erf. Valid for x in [0, ~6] (erfc decays smoothly; the exp
underflows to 0 past x~6, the correct limit). For x<0 use the identity
erfc(-x)=2-erfc(x) on the host (a branch the static graph can't do).
Source code in aneforge/special.py
expm1 ¶
exp(x) - 1 accurate near 0 (where exp(x)-1 cancels). Taylor
x(1 + x/2 + x^2/6 + ... + x^5/720) - valid for |x| <= ~0.7. Outside that,
use x.exp() and subtract 1 with _const (no cancellation there).
Source code in aneforge/special.py
log1p ¶
log(1 + x) accurate near 0 (where log(1+x) cancels). Evaluated as
x * poly(x) with a deg-7 minimax of log1p(x)/x - valid for x in
[-0.5, 1.0]. The x* factor keeps it exact at 0.
Source code in aneforge/special.py
lgamma ¶
Log-gamma log|Gamma(x)| for x in [1, 8] via a deg-8 minimax in the
centered variable (x-4.5). Accurate in ABSOLUTE terms (~3e-3); relative error
is large only at the zeros x=1, x=2 where lgamma -> 0 (relative error is
ill-defined there).
Source code in aneforge/special.py
gamma ¶
Gamma function on x in [1, 2] via a deg-6 minimax in the centered variable (x-1.5).
SCOPE / HONESTY: gamma grows super-exponentially and overflows fp16 (>65504)
past x~8.3, so it is fundamentally fp16-narrow. Extending [1,2] to a wider
window needs the recurrence Gamma(x+1)=x*Gamma(x) applied a data-dependent
number of times - not expressible in one static graph. For a wider but
still-fp16-bounded range use gamma_via_lgamma (x in [1, ~7.5]), which routes
through exp(lgamma(x)) instead.
Source code in aneforge/special.py
gamma_via_lgamma ¶
Gamma(x) = exp(lgamma(x)) for x in [1, ~7.5] (output stays under the
fp16 max). Wider range than gamma at a small accuracy cost (the exp
re-rounds).
bessel_j0 ¶
Bessel J0(x) for |x| <= 3 (A&S 9.4.1 polynomial in (x/3)^2). Smooth, bounded; the dominant first lobe and first zero (x~2.405) are in range.
bessel_i0 ¶
Modified Bessel I0(x) for |x| <= 3.75 (A&S 9.8.1 in (x/3.75)^2). I0 grows exponentially and overflows fp16 past x~12, so this small-argument branch is the fp16-useful range.
Source code in aneforge/special.py
bessel_k0 ¶
Modified Bessel K0(x) for 0 < x <= 2 (A&S 9.8.5):
K0(x) = -ln(x/2) I0(x) + series((x/2)^2). K0 has a -ln singularity at 0,
so x must be strictly positive (and not tiny: log underflow). I0 here reuses
the 9.8.1 polynomial (in (x/3.75)^2), valid through x=2; the K0 series part is
in (x/2)^2 -- the two use DIFFERENT scaled arguments.
Source code in aneforge/special.py
exp_wide ¶
exp for wide x via argument splitting:
exp(x) = (exp(x / 2^splits)) ^ (2^splits) by repeated squaring (the inner
exp runs on a smaller argument).
HONEST FINDING (measured on this M5 ANE): this does NOT reliably beat the
native x.exp(). The native fp16 exp is already accurate (median per-point
relerr ~1.4e-3 over [-10,10]); the squaring re-rounding usually costs MORE
than the small-argument benefit gains, so per-point accuracy is the same or
slightly worse. Provided for completeness. fp16's output range (max 65504)
caps exp at x~11 regardless of method - splitting can never extend the range,
only (at best marginally) trade accuracy.
Source code in aneforge/special.py
log_wide ¶
log for wide positive x via log(x) = 2^sqrts * log(x^(1/2^sqrts)).
Repeated square roots pull a large argument toward 1, where log is most
accurate, then scale back.
HONEST FINDING: native fp16 log is already good (~1e-3 over [1e-2, 1e4]) and
this matches rather than clearly beats it - the sqrt chain re-rounds. Useful
mainly as a documented range-reduction recipe; the native x.log() is the
better default on this hardware.
Source code in aneforge/special.py
einsum¶
einsum ¶
aneforge.einsum - a general einsum(equation, *operands) -> Tensor that
decomposes a numpy-style subscript equation into a sequence of aneforge graph ops
(transpose / reshape / matmul-bmm / broadcast-mul / reduce-sum) so the whole
contraction runs as ONE fused e5rt program on the Apple Neural Engine.
import aneforge as af
a = af.input((32, 64)); b = af.input((64, 16))
c = af.einsum("ij,jk->ik", a, b) # a plain matmul, on the ANE
net = af.compile(c); out = net(A, B)
af.einsum IS this function (the package rebinds the name to it), so it is
callable directly; from aneforge.einsum import einsum works identically.
WHY THIS IS POSSIBLE (and where it stops)¶
A two-operand einsum A,B->C partitions its indices into four classes:
* BATCH (B): appear in A, B and C -> kept as outer/batch dims
* CONTRACT (K): appear in A and B, not C -> summed over (the dot dimension)
* KEEP-A (M): appear in A and C, not B -> rows of the output
* KEEP-B (N): appear in B and C, not A -> cols of the output
Align each operand to [B..., M..., K...] / [B..., K..., N...] by
transpose, collapse each group to a single axis by reshape, then a batched
matmul [B,M,K] @ [B,K,N] -> [B,M,N] does the contract-and-sum in ONE op
(the ANE matmul accumulator is wide, >=fp32, so the K-sum is clean). Reshape the
result back to the named output dims, then transpose to the requested order.
Indices in ONE operand but NOT in the output are summed FIRST with a reduce
(af.sum); output-only indices absent from the inputs are unsupported (einsum
cannot invent a dimension). For >2 operands we left-fold: contract the running
result with the next operand pairwise.
REDUCTION-ONLY and OUTER products are handled directly
ij->i/ij->: pure reduce (sum over the dropped axes).i,j->ij: broadcast-mul (no contraction), an outer product.
WHAT IS NOT SUPPORTED (rejected with a clear error, never wrong results)¶
Any equation with a REPEATED index inside a single operand - diagonal / trace
extraction (ii->i, ii->, ...ii->...). Reading a tensor's diagonal is a
GATHER, unsupported on the ANE (per the op-conformance finding), with no
matmul/reduce decomposition. These raise EinsumUnsupported. Ellipsis (...)
is also not implemented (v1).
Run the self-test (validates a battery vs numpy.einsum on the ANE):
PYTHONPATH=. python3 aneforge/einsum.py
EinsumUnsupported ¶
Bases: NotImplementedError
Raised for einsum equations that cannot be lowered to ANE matmul/reduce ops (diagonal/trace via repeated indices, ellipsis, invented output dims).
einsum ¶
numpy-style einsum lowered to aneforge ops (matmul-reducible patterns).
Supports: matmul (ij,jk->ik), batched matmul (bij,bjk->bik), transpose
(ij->ji), outer product (i,j->ij), elementwise+reduce (ij,ij->i,
ij->i, ij->), bilinear (bi,ij,bj->b), attention scores
(bhqd,bhkd->bhqk), and general >2-operand left-fold contractions.
Rejects (EinsumUnsupported): diagonal/trace via repeated indices in one
operand (ii->i, ii->), repeated output indices, and ellipsis.
Source code in aneforge/einsum.py
Signal processing¶
dsp ¶
aneforge.dsp - a digital-signal-processing toolkit on the Apple Neural Engine.
Built on the two things the ANE is genuinely good at: REAL CONVOLUTION (the conv
layer) and the MATMUL-FFT from aneforge.fft (the DFT factored into matmul stages,
complex carried as real/imag PAIRS). This submodule composes them into the everyday
DSP kit:
from aneforge.dsp import (fir_filter, fft_convolve, freq_filter,
stft, spectrogram, hann, hamming, blackman,
correlate, autocorrelate)
WHAT FITS THE ANE
- FIR filtering - a finite impulse response IS a convolution. Short kernels run on the native conv layer (1xK, arch-capped at K<=15 on this M5, verified); longer ones fall back to FFT convolution. Feed-forward, no recurrence: a clean fit.
- FFT-domain DSP - fft_convolve / freq_filter / stft / spectrogram are all (windowed) FFTs + elementwise spectral ops + iFFT, every stage a matmul or an elementwise map (real/imag pairs via aneforge.fft). A clean fit.
- Correlation - cross/auto-correlation is convolution without the kernel flip, exactly what the ANE conv and the native CrossCorrelation bridge already do.
WHAT DOES NOT FIT (arch limit):
* IIR / recursive filters (Butterworth, biquads, any y[n] = ... + a*y[n-1]).
A direct-form IIR is a DATA-DEPENDENT RECURRENCE over samples - the scan/cumsum
wall the ANE lacks (no in-graph loop, no scalar feedback). No streaming IIR
here. iir_filter is ONLY a FIXED-LENGTH UNROLL: the recurrence becomes its
truncated FIR impulse response of a chosen length, run as an FIR. Exact only up
to the truncation, and the unroll is fixed at build time - it does NOT scale to
arbitrary-length streaming IIR. Exposed tagged, with the relerr-vs-untruncated
cost reported, rather than faking a recurrent filter.
CONVENTIONS (inherited):
* The ANE conv is a CROSS-CORRELATION (no kernel flip), matching
np.correlate(x, h, 'valid'). True convolution (np.convolve / scipy lfilter)
flips the kernel - fir_filter handles the flip + causal zero-pad internally.
* Complex is (re, im) real-tensor pairs; spectra come back as separate real arrays.
* Everything is fp16 on-device; the matmul accumulator is wide (>=fp32), so the
error floor is fp16 INPUT/twiddle rounding (~few e-4), flat in length.
PYTHONPATH=. python3 aneforge/dsp.py
hann ¶
Hann window (raised cosine). sym=False gives the periodic/DFT window
(the STFT default, matching scipy.signal.get_window('hann', M)).
Source code in aneforge/dsp.py
hamming ¶
Hamming window. sym=False = periodic (STFT) form.
Source code in aneforge/dsp.py
blackman ¶
Blackman window. sym=False = periodic (STFT) form.
Source code in aneforge/dsp.py
get_window ¶
Resolve window (name str, 'boxcar'/None for rectangular, or an array) to a
length-M fp32 coefficient vector (periodic form for the named cosine windows).
Source code in aneforge/dsp.py
fir_filter ¶
FIR filter y = x * taps (true convolution), on the ANE.
A finite impulse response is a convolution. The ANE conv is a CROSS-correlation (no kernel flip), so we flip the taps and zero-pad to realize a genuine convolution / causal FIR:
- mode='full' -> length L+K-1 (== np.convolve(x, taps))
- mode='same' -> length L, centered (== np.convolve(..., 'same'))
- mode='valid' -> length L-K+1 (== np.convolve(..., 'valid'))
- mode='lfilter' -> length L, causal (== scipy.signal.lfilter(taps, [1.0], x))
Short kernels (K<=15) run on the native conv layer in ONE fused program; longer kernels fall back to FFT convolution (aneforge.fft) automatically - same result.
cost: COMPUTE (a real conv) for short taps; FFT-domain for long taps. fit: GOOD - feed-forward, no recurrence.
Source code in aneforge/dsp.py
fft_convolve ¶
Linear convolution y = x * h via the FFT (aneforge.fft), length L+K-1.
Equivalent to np.convolve(x, h) but computed as FFT(x)*FFT(h) -> iFFT, so
every stage is a matmul or an elementwise spectral product (ANE-native). For a
short kernel against a long signal we use OVERLAP-ADD: split the signal into
blocks, multiply each block-FFT by the (cached) kernel spectrum, inverse-
transform, and sum the overlapping tails. A single block is used when the whole
thing fits one transform.
cost: COMPUTE/FUSION - staged matmul-FFTs + spectral multiply. fit: GOOD.
Source code in aneforge/dsp.py
freq_filter ¶
Frequency-domain filter: FFT the (real) signal, zero out the rejected bins, inverse-FFT. A brick-wall ideal filter in the DFT domain.
kind in {'lowpass','highpass','bandpass','bandstop'}. cutoff is a scalar
(low/high) or a (lo, hi) pair (band). fs is the sample rate (default 2.0 so a
bare cutoff reads as a normalized frequency in [0,1] == fraction of
Nyquist). Returns the real filtered signal (length L), on the ANE.
The mask is applied to BOTH the bin and its conjugate mirror so the output stays real. cost: COMPUTE/FUSION (FFT + elementwise mask + iFFT). fit: GOOD.
Source code in aneforge/dsp.py
stft ¶
Short-time Fourier transform: window each frame, FFT it (aneforge.fft), stack.
win is the frame length (int) or an explicit window-coefficient array; hop
the step between frames (default win//4 like scipy); window the named window
when win is an int. Returns (Zr, Zi), each [n_freq, n_frames] with
n_freq = win//2 + 1 (the non-redundant rfft half), matching scipy.signal.stft's
bin layout (NOT its 1/sum(win) scaling - see spectrogram for magnitudes).
Each frame is one matmul-FFT. The frames are independent (no recurrence), a clean ANE fit; we batch them through the cached FFT plan. fit: GOOD.
Source code in aneforge/dsp.py
spectrogram ¶
Magnitude (or power) spectrogram = |STFT|. mode in {'magnitude','power'}.
Returns a [n_freq, n_frames] real array. fit: GOOD.
Source code in aneforge/dsp.py
correlate ¶
Cross-correlation r[k] = sum_n a[n+k] * b[n] on the ANE.
For a short template (Lb<=15) uses the native CrossCorrelation bridge (a path
Apple's MIL frontend rejects): a length-La row and a smaller length-Lb template ->
'valid' correlation of length La-Lb+1, matching np.correlate(a, b, 'valid').
A wider template (Lb>=16, the same 1xK conv-width arch wall as fir_filter) falls
back to FFT correlation (correlation = convolution with b reversed). Other
modes zero-pad the map before the valid correlation.
cost: MIXED (cut) for the bridge; COMPUTE/FFT for the fallback. fit: GOOD.
Source code in aneforge/dsp.py
autocorrelate ¶
Autocorrelation r[k] = sum_n x[n] x[n+k] for lags k=0..max_lag, on the ANE.
Built as a correlation of the signal against its own leading window via the
CrossCorrelation bridge. max_lag defaults to len(x)//4 (the template length is
len(x)-max_lag, kept < len(x)). Returns r[0..max_lag] (the non-negative lags),
matching the tail of np.correlate(x, x, 'full'). fit: GOOD.
Source code in aneforge/dsp.py
iir_filter ¶
ARCH-LIMITED IIR via a FIXED-LENGTH FIR unroll.
A direct-form IIR a[0] y[n] = sum_i b[i] x[n-i] - sum_{j>=1} a[j] y[n-j] is a
DATA-DEPENDENT RECURRENCE over samples - each output depends on previous OUTPUTS.
The ANE is feed-forward: no in-graph loop, no scalar feedback, no scan/cumsum.
There is NO streaming IIR on this hardware.
The ONLY honest realization is to TRUNCATE the IIR's infinite impulse response to
n_taps and run it as an FIR (via fir_filter / fft_convolve). Exact only up to
the truncation tail, with the tap count fixed at build time - it does NOT scale
to arbitrary streaming IIR. Use it for stable filters whose impulse response has
decayed within n_taps samples; the residual is the truncated tail energy.
Returns the FIR-approximated, causal (lfilter-style) output (length L), plus prints nothing - the caller-facing tag is in the module docstring. fit: ARCH-LIMITED (fixed unroll only).