ARCHITECTURE — THE ENGINE

Stream the model.
Don't hold it.

AeroLLM serves frontier-scale teachers (70B–400B+) on a single box by streaming layers from disk, executing on metal, and never requiring full-model residency. Speculative decoding gives a ~7× wall-clock speedup on Tier-1 teachers. On Apple Silicon, unified memory makes the storage→compute hop zero-copy.

[Diagram: STORAGE → MEMORY → AEROLLM → COMPUTE → TOKENS]
STORAGE: 70B–400B teacher, layer-by-layer (LAYER 00 … LAYER 11)
MEMORY: page cache (warm layers) · unified memory (zero-copy, MLX) · KV cache (attention reuse)
AEROLLM: layer stream (prefetch + dispatch) · AeroLLM core (streaming engine) · speculative drafter (1B helper, 7×) · verifier (parallel ratify)
COMPUTE: MLX / Metal (Apple Silicon) · CUDA (discrete GPU) · CPU fallback (GGUF, llama.cpp)
TOKENS: token stream (OpenAI-compatible API), spec ⟳ verify
Caption: Stream layer N from disk → execute on metal → token out, then repeat for layer N+1.

STORAGE

Layer-by-layer on disk

The 400B teacher sits on SSD as ~120 layer shards. AeroLLM prefetches the next layer while the current one runs. Whole-model residency is never required, so the VRAM or unified-memory ceiling no longer caps model size.
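
The overlap of disk reads and compute described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not AeroLLM's actual code: `stream_layers`, `load_layer`, and `run_layer` are hypothetical names, and a single background thread stands in for whatever prefetcher the engine really uses.

```python
from concurrent.futures import ThreadPoolExecutor


def stream_layers(load_layer, run_layer, num_layers, x):
    """Run layers in order while the next shard loads in the background.

    load_layer(i) -> weights   (hypothetical: reads shard i from disk)
    run_layer(weights, x) -> x (hypothetical: executes one layer)
    """
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)  # start reading the first shard
        for i in range(num_layers):
            weights = pending.result()  # block until shard i is in memory
            if i + 1 < num_layers:
                # Kick off the read of shard i+1 before computing with shard i.
                pending = pool.submit(load_layer, i + 1)
            x = run_layer(weights, x)
            del weights  # drop our reference so shard i can be freed
    return x
```

At most two shards are referenced at any moment (the one executing and the one being prefetched), which is why peak memory tracks shard size rather than model size. With toy stand-ins `load_layer = lambda i: i + 1` and `run_layer = lambda w, x: x + w`, four layers starting from 0 return 10.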

AEROLLM CORE

Streaming + speculative

The layer dispatcher coordinates prefetch, page cache, and KV cache. A small drafter model proposes tokens; the verifier ratifies them in parallel against the teacher. Net: ~7× wall-clock speedup on Tier-1 teachers, with no quality loss, since every emitted token is one the teacher has ratified.
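
The propose-and-ratify loop above can be illustrated with a minimal greedy variant of speculative decoding. This sketch assumes greedy (argmax) decoding and uses hypothetical callables `draft_next` and `target_next` that map a token list to the next token; the source does not say which acceptance rule AeroLLM applies, and a real engine would score all drafted positions in one batched teacher pass rather than in a Python loop.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One draft-then-verify round with greedy acceptance.

    Returns the tokens accepted this round (always at least one).
    """
    # 1. The drafter proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The verifier compares each proposed token with the teacher's choice.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the teacher's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the teacher adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

In this greedy variant the accepted tokens are exactly what the teacher would have produced decoding alone, so the drafter only changes how many tokens each teacher pass yields. The speedup therefore depends on how often the drafter agrees with the teacher.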

COMPUTE

MLX, CUDA, or CPU

On Apple Silicon, MLX runs against unified memory: no host↔device copies and ~83% less power than a discrete GPU. The CUDA path takes the same layer stream over PCIe. A CPU/GGUF fallback covers boxes without an accelerator.

INFERENCE AT WORK

A frontier-scale teacher running on one box — weights streaming off SSD, the drafter proposing, the verifier ratifying, all on metal.

aerollm · streaming
Layer streaming — NVMe shards → unified-memory window → MLX compute, KV cache quantized mid-stream