ARCHITECTURE — THE ENGINE
Stream the model.
Don't hold it.
AeroLLM serves frontier-scale teachers (70B–400B+) on a single box by streaming layers from disk, executing on metal, and never requiring full-model residency. Speculative decoding gives a ~7× wall-clock speedup on Tier-1 teachers. On Apple Silicon, unified memory makes the storage→compute hop zero-copy.
STORAGE
Layer-by-layer on disk
The 400B teacher sits on SSD as ~120 layer shards. AeroLLM prefetches the next layer while the current one runs. Whole-model residency is never required, so the VRAM or unified-memory ceiling stops limiting model size.
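The prefetch-while-running pattern described above can be sketched in a few lines. This is an illustrative sketch, not AeroLLM's implementation: `load_shard`, `run_layer`, and `stream_forward` are hypothetical names, and the "shards" are toy dictionaries standing in for weights read from SSD.

```python
from concurrent.futures import ThreadPoolExecutor


def load_shard(index):
    # Hypothetical stand-in for reading one layer shard from SSD.
    return {"index": index, "scale": index + 1}


def run_layer(shard, activations):
    # Hypothetical stand-in for executing one layer on the accelerator.
    return [x + shard["scale"] for x in activations]


def stream_forward(num_layers, activations, load=load_shard, run=run_layer):
    """Run layers in order while the next shard loads in the background."""
    if num_layers == 0:
        return activations
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, 0)
        for i in range(num_layers):
            shard = pending.result()  # block until layer i is in memory
            if i + 1 < num_layers:
                # Start reading layer i+1 before computing layer i.
                pending = pool.submit(load, i + 1)
            # Compute overlaps with the background read; only the current
            # and the prefetched shard are referenced at any time.
            activations = run(shard, activations)
    return activations


print(stream_forward(4, [0.0, 1.0]))  # [10.0, 11.0]
```

Because the loop only ever references the current shard and the one being prefetched, peak weight memory in this sketch is about two layers, regardless of how many layers the model has.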
AEROLLM CORE
Streaming + speculative
The layer dispatcher coordinates prefetch, page cache, and KV cache. A small drafter model proposes tokens; the verifier ratifies them in parallel against the teacher. Net: ~7× wall-clock speedup on Tier-1 teachers, no quality loss.
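A minimal sketch of the draft-and-verify loop, assuming greedy decoding and toy next-token functions. The names `speculative_decode`, `teacher_next`, and `drafter_next` are hypothetical and not AeroLLM APIs. A real engine would score all drafted positions in one batched teacher pass; the verification loop here is sequential for readability. Sampling-based variants use a rejection-sampling rule instead of the exact-match check shown.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: the drafter proposes, the teacher verifies."""
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    while len(tokens) < goal:
        # 1. The drafter proposes up to k tokens autoregressively.
        n = min(k, goal - len(tokens))
        draft = []
        for _ in range(n):
            draft.append(draft_next(tokens + draft))

        # 2. The verifier keeps proposals while they match the teacher's choice.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # The first mismatch is replaced by the teacher's token.
                accepted.append(expected)
                break
        else:
            # Every draft token matched: the teacher adds one bonus token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens


def teacher_next(seq):
    # Toy teacher: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def drafter_next(seq):
    # Toy drafter: agrees with the teacher except after token 2.
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


print(speculative_decode(teacher_next, drafter_next, [0], 8))
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In this greedy form the output is identical to decoding with the teacher alone, since every emitted token is either confirmed or supplied by the teacher. The speedup comes from accepting several tokens per teacher pass when the drafter guesses well.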
COMPUTE
MLX, CUDA, or CPU
On Apple Silicon, MLX runs against unified memory: no host↔device copies, ~83% less power than a discrete GPU. The CUDA path takes the same layer stream over PCIe. CPU/GGUF fallback for boxes without an accelerator.
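The MLX, then CUDA, then CPU fallback order can be sketched as a small selection routine. This is an assumption-laden illustration: `detect` and `pick_backend` are hypothetical names, and checking for `nvidia-smi` on the PATH is only a rough heuristic for a usable CUDA device, not how AeroLLM is known to probe hardware.

```python
import importlib.util
import platform
import shutil


def detect():
    """Probe the host for accelerators (heuristics, assumed for illustration)."""
    apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    return {
        # MLX only counts if we are on Apple Silicon and the package is installed.
        "mlx": apple_silicon and importlib.util.find_spec("mlx") is not None,
        # Rough proxy for an NVIDIA driver being present.
        "cuda": shutil.which("nvidia-smi") is not None,
    }


def pick_backend(available):
    """Prefer MLX, then CUDA, and fall back to CPU when neither is available."""
    for name in ("mlx", "cuda"):
        if available.get(name):
            return name
    return "cpu"


print(pick_backend(detect()))
```

Keeping detection separate from the preference order makes the fallback logic easy to exercise with hand-built dictionaries on any machine.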
INFERENCE AT WORK
A frontier-scale teacher running on one box — weights streaming off SSD, the drafter proposing, the verifier ratifying, all on metal.
