Reproducibility

This document maps every measured claim in the characterization paper to the one command that regenerates it and the committed result file it produces. A single fail-soft driver, scripts/reproduce.sh, runs the whole set end to end; each row below can also be run on its own with a single command.

Environment

  • Hardware: an Apple Silicon Mac. The absolute numbers were measured on an M5 Pro (ANE architecture h17s) and are host- and thermal-dependent; the portable claim is the qualitative device map, not the multipliers. The single-stream map additionally reproduces, verdict for verdict, on an M1 (A13) and an M2 (A14).
  • OS / toolchain: macOS 14+ (verified macOS 26.5, ANECompiler 3520.4.1), Xcode command-line tools.
  • Install:
    pip install -e .            # core (numpy only)
    pip install -e ".[bench]"   # adds MLX for the GPU comparison + power tooling
    sh aneforge/_lib/build.sh   # build the e5rt dispatch dylib (not tracked)
    
  • Power-measurement steps invoke sudo powermetrics and prompt for your password. powermetrics reports estimated per-rail power, not a wall-meter reading. The deterministic gates (routes, corpus) need no sudo.

All measurement scripts live in bench/ and write a committed JSON to bench/results/. A fresh run overwrites the plain <script>_results.json filename with your host's numbers; the host-suffixed snapshots (device_compare_wattcomplete_results_M{1,2,5}.json) are the paper's committed runs and are never overwritten by the scripts. As committed, the plain device_compare_wattcomplete_results.json is byte-identical to the _M5 snapshot (the paper's single-stream run, idle package 1196 mW).
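The byte-identity statement above can be checked locally before a fresh run overwrites the plain file. The following is a minimal sketch, not part of the repository: the helper names sha256_of and byte_identical are hypothetical, and only the two file paths come from this document.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def byte_identical(a: Path, b: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    return a.stat().st_size == b.stat().st_size and sha256_of(a) == sha256_of(b)


if __name__ == "__main__":
    # Paths as named in this document; run from the repository root.
    results = Path("bench/results")
    plain = results / "device_compare_wattcomplete_results.json"
    snapshot = results / "device_compare_wattcomplete_results_M5.json"
    if plain.exists() and snapshot.exists():
        print("identical" if byte_identical(plain, snapshot) else "differs")
    else:
        print("result files not found; run from the repository root")
```

After a fresh run on your own host the two files are expected to differ, since the plain filename then holds your numbers while the _M5 snapshot is left untouched.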

Two derived metrics are computed inside a single sustained power window (see the paper's methodology section): throughput = iterations completed during the window / wall time of that loop, and perf/W divides it by the median idle-subtracted package power of the same window. The latency columns in the result JSONs come from a separate minimum-latency probe that isolates each device's dispatch floor; at sub-millisecond shapes the sustained per-iteration time exceeds the minimum latency, so perf/W cannot be reconstructed from the latency column.
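The two definitions above can be sketched as a few lines of Python. This is an illustration under stated assumptions, not the repository's code: the function name sustained_metrics, its arguments, and the sample power readings are hypothetical; only the 1196 mW idle figure appears in this document. Throughput is expressed here in iterations per second, so perf/W comes out in iterations per second per watt; the paper may report different units.

```python
from statistics import median


def sustained_metrics(iterations, wall_seconds, package_mw_samples, idle_mw):
    """Return (throughput, perf_per_watt) for one sustained power window.

    throughput    = iterations completed in the window / wall time of that loop
    perf_per_watt = throughput / median idle-subtracted package power (in W)
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    throughput = iterations / wall_seconds
    # Subtract idle per sample, then take the median, and convert mW to W.
    active_w = median(s - idle_mw for s in package_mw_samples) / 1000.0
    if active_w <= 0:
        raise ValueError("idle-subtracted power is not positive")
    return throughput, throughput / active_w


if __name__ == "__main__":
    # Illustrative readings only; not taken from the paper's result files.
    tput, ppw = sustained_metrics(
        iterations=6000,
        wall_seconds=6.0,
        package_mw_samples=[3190.0, 3196.0, 3200.0],
        idle_mw=1196.0,
    )
    print(f"throughput = {tput:.1f} it/s, perf/W = {ppw:.1f} it/s/W")
```

Both quantities take their inputs from the same window, which is why the separately probed minimum latency cannot stand in for the sustained per-iteration time.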

Claim -> command -> result

| Paper element | Command | Result file |
| --- | --- | --- |
| Single-stream device map (table + Fig. 1) | python3 bench/device_compare_wattcomplete.py --window 6 | bench/results/device_compare_wattcomplete_results.json |
| Three-generation reproduction (M1 / M2 / M5) | same script, run on each host | bench/results/device_compare_wattcomplete_results_M{1,2,5}.json |
| Compute peaks + GEMM falloff (Fig. 2) | python3 bench/device_saturation_sweep.py | bench/results/device_saturation_sweep_results.json |
| Bandwidth ceilings, two effective BWs (Fig. 3) | python3 bench/device_bandwidth_roofline.py --window 6 | bench/results/device_bandwidth_roofline_results.json |
| Batched-serving crossovers (Fig. 5) | python3 bench/device_serving_sweep.py --window 5 | bench/results/device_serving_sweep_results.json |
| Per-device roofline synthesis (Fig. 4, ridge/placement tables) | python3 bench/roofline_analysis.py | bench/results/roofline_analysis_results.json |
| Weight-stream effective bandwidth (App. A) | python3 bench/gemv_bandwidth_sweep.py | bench/results/gemv_bandwidth_sweep_results.json |
| Fused-GPU baseline (App. A) | python3 bench/fused_gpu_baseline.py | bench/results/fused_gpu_baseline_results.json |
| Compressed-weight single-matmul speedup (App. A) | python3 bench/compress_speedup_bench.py | bench/results/compress_speedup_bench.json |
| Cross-path compressed-matmul latency (App. A) | python3 bench/cross_path_compress_bench.py | bench/results/cross_path_compress_bench.json |
| int4 whole-model bench | python3 bench/model_int4_bench.py | bench/results/model_int4_bench.json |
| Encoder cross-path serving | python3 bench/encoder_serving_crosspath.py | bench/results/encoder_serving_crosspath.json |
| End-to-end LLM decode sweep | python3 bench/decode_measurement.py | bench/results/decode_measurement_results.json |
| Route / capability gate (deterministic) | python3 tests/test_routes.py | docs/capabilities.json |
| Correctness corpus (deterministic) | python3 tests/run_corpus.py | - (pass/fail gate) |

roofline_analysis.py reads the saturation and bandwidth result JSONs rather than re-measuring, so run those two first if you want a fresh synthesis.
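The ordering constraint above can be sketched as a small plan builder. This is a hypothetical helper, not something shipped in the repository: synthesis_plan and run_plan are illustrative names, and the command lines are copied from the table in this document.

```python
import subprocess

# The two measurement scripts whose result JSONs the synthesis reads.
PREREQUISITES = [
    ["python3", "bench/device_saturation_sweep.py"],
    ["python3", "bench/device_bandwidth_roofline.py", "--window", "6"],
]
SYNTHESIS = ["python3", "bench/roofline_analysis.py"]


def synthesis_plan(fresh=True):
    """Return the commands in dependency order.

    With fresh=False the synthesis reuses whatever result JSONs are
    already present in bench/results/.
    """
    return (list(PREREQUISITES) if fresh else []) + [SYNTHESIS]


def run_plan(plan, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for cmd in plan:
        runner(cmd, check=True)
```

Calling run_plan(synthesis_plan()) from the repository root would re-measure and then synthesize. Unlike the fail-soft driver, this sketch stops on the first error, on the assumption that a synthesis built on a failed measurement would silently read stale inputs.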

Supporting experiments

These back specific claims in the text and appendix but do not produce a numbered figure or table. Each writes a committed JSON to bench/results/.

| Claim | Command | Result file |
| --- | --- | --- |
| Fusion AI-lever (a memory-bound block crosses the weight ridge when fused) | python3 bench/below_ridge_fusion.py | bench/results/below_ridge_fusion.json |
| int8 decode stays within tolerance (token agreement, top-5 overlap, logit relerr, softmax KL) | python3 bench/decode_int8_accuracy.py | bench/results/decode_int8_accuracy.json |
| fp16-vs-fp16 GPU/ANE energy for the real models (ResNet-18, ViT-B/16, MiniLM) | python3 bench/real_models_fp16.py --window 6 | bench/results/real_models_fp16_results.json |

Notes and caveats

  • powermetrics is a modeled SoC estimator. The headline is idle-subtracted total-package active power, cross-validated against Apple's IOReport "Energy Model" counters (App. A); the relative device comparisons are robust to a shared model bias since it cancels in ratios.
  • The CPU is measured in fp32 (Accelerate/AMX has no fast fp16 GEMM) against fp16 ANE/GPU; this asymmetry is labeled wherever it matters and carried through the roofline arithmetic.
  • Sixteen subprocess-bridge operators are excluded from the speed comparison because their timings reflect dispatch overhead rather than silicon time; see the methodology section.
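The first note's claim that a shared model bias cancels in ratios can be illustrated numerically. The sketch below assumes the bias is a single multiplicative factor applied to every device's idle-subtracted power; that assumption is not stated in this document, and an additive offset would not cancel in general. All names and numbers are illustrative.

```python
import math


def perf_per_watt(throughput, active_watts):
    """Throughput divided by idle-subtracted active power."""
    return throughput / active_watts


def ratio_under_bias(tp_a, w_a, tp_b, w_b, bias=1.0):
    """Device-A over device-B perf/W when both powers share one bias factor."""
    return perf_per_watt(tp_a, bias * w_a) / perf_per_watt(tp_b, bias * w_b)


if __name__ == "__main__":
    # Hypothetical throughputs (it/s) and active powers (W) for two devices.
    unbiased = ratio_under_bias(1000.0, 2.0, 800.0, 5.0, bias=1.0)
    biased = ratio_under_bias(1000.0, 2.0, 800.0, 5.0, bias=1.3)
    # The absolute perf/W values change with the bias; their ratio does not.
    print(math.isclose(unbiased, biased))
```

This is why the relative device comparisons are the portable result, while the absolute perf/W figures inherit whatever error the estimator carries.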