frontier reasoning, streamed off your disk.
Don't shrink the model. Stream it.
A 671B model doesn't fit in your VRAM — so most tools shrink it until it does, and lose what made it smart. AeroLLM does the opposite: it keeps the model whole and feeds it through your GPU one layer at a time, streamed straight off your SSD.
Load a layer, compute, discard it, load the next. The full weight set never sits in memory at once — so frontier-scale reasoning runs on a laptop with no cluster and no cloud. With full credit to AirLLM for the layer-streaming insight; AeroLLM is that idea rebuilt in Rust for the stability and Apple Silicon (MLX) support a long pipeline run needs.
The whole model lives on your SSD. AeroLLM pulls it through a single small compute window — one layer resident at a time — and out the other side come frontier tokens. The 8GB card that could never hold a 671B model can now run one.
Weights flow off the SSD layer by layer — the model is never fully resident.
The next layer loads while the current one computes, so the GPU rarely waits.
A small draft model proposes tokens; the big one verifies in a single pass.
One deterministic Rust binary — the inference backbone of the Nucleus pipeline.
Frontier scale, off your disk. Reasoning that compounds, on hardware you already own.
No Python runtime to crash, no fragile dependency tree. A single Rust executable with a deterministic lifecycle — start it, stream a model, shut it down clean.
One self-contained Rust executable. No interpreter, no venv — a deterministic lifecycle built to run for days without falling over.
Pulls weights off the SSD one transformer layer at a time. The full model never has to fit in VRAM.
Reads the next layer from disk while the current layer is still computing, hiding storage latency behind GPU work.
A fast draft model proposes a run of tokens; the big model verifies them in one pass. Lossless — same output, up to 7× the speed on 70B+ teachers.
Attention state is kept lean and sharded so long prompts fit alongside a streamed model instead of crowding it out.
Apple Silicon unified memory (zero-copy, no passthrough) or a discrete NVIDIA GPU — same engine, same API, swap with a flag.
Rent the math, or own the run.
The deepest open models there are — up to 671B parameters at 4-bit — run locally, streamed from a 1TB laptop SSD. No rack of GPUs to rent, no per-token meter running, no data leaving the machine.
That is exactly what a multi-day distillation run needs from its teacher: a deep model that stays up, costs nothing per query once it's on disk, and answers to no one but you.
parameters at 4-bit — streamed from an SSD, run on commodity hardware. 400B+ on as little as 8GB of VRAM.
Who else runs a 671-billion-parameter model on a laptop?
Clone the engine, point it at a model, and stream 400B+ parameters off the disk you already own.