Streaming Inference Engine

AEROLLM

frontier reasoning, streamed off your disk.

One Rust binary | No GPU farm | Open-sourcing soon
01 The idea

Don't shrink the model. Stream it.

A 671B model doesn't fit in your VRAM — so most tools shrink it until it does, and lose what made it smart. AeroLLM does the opposite: it keeps the model whole and feeds it through your GPU one layer at a time, streamed straight off your SSD.

Load a layer, compute, discard it, load the next. The full weight set never sits in memory at once, so frontier-scale reasoning runs on a laptop with no cluster and no cloud. Full credit goes to AirLLM for the layer-streaming insight; AeroLLM is that idea rebuilt in Rust for the stability and Apple Silicon (MLX) support that a long pipeline run needs.
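The load, compute, discard loop can be sketched in a few lines of Rust. This is an illustrative toy under stated assumptions, not AeroLLM's source: `Layer`, `load_layer`, `apply`, and `stream_forward` are hypothetical names, and a "layer" here is just a scale and a bias standing in for real weights read from disk.

```rust
// Hypothetical sketch of layer streaming; none of these names come from AeroLLM.

/// Stand-in for one transformer layer's weights.
struct Layer {
    scale: f32,
    bias: f32,
}

/// Stand-in for reading layer `index` off the SSD.
fn load_layer(index: usize) -> Layer {
    Layer { scale: 1.0 + index as f32, bias: 0.5 }
}

/// Stand-in for the layer's forward computation, applied in place.
fn apply(layer: &Layer, activations: &mut [f32]) {
    for x in activations.iter_mut() {
        *x = *x * layer.scale + layer.bias;
    }
}

/// Runs every layer in order while holding only one layer in memory at a time.
fn stream_forward(num_layers: usize, activations: &mut [f32]) {
    for index in 0..num_layers {
        let layer = load_layer(index); // load
        apply(&layer, activations);    // compute
        drop(layer);                   // discard before the next load
    }
}

fn main() {
    let mut activations = vec![1.0_f32, 2.0];
    stream_forward(3, &mut activations);
    println!("{:?}", activations);
}
```

Only the activations persist across iterations; peak weight memory is one layer, whatever the total layer count.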

02 How it earns its speed

Streaming, made fast.

03 The core trick · the heart of it

Your disk becomes your VRAM.

The whole model lives on your SSD. AeroLLM pulls it through a single small compute window — one layer resident at a time — and out the other side come frontier tokens. The 8GB card that could never hold a 671B model can now run one.

[Diagram: ON DISK (full weights) → ONE LAYER IN VRAM → OUTPUT (frontier tokens)]

Stream

Weights flow off the SSD layer by layer — the model is never fully resident.

Prefetch

The next layer loads while the current one computes, so the GPU rarely waits.

Speculate

A small draft model proposes tokens; the big one verifies in a single pass.

Serve

One deterministic Rust binary — the inference backbone of the Nucleus pipeline.

Frontier scale, off your disk. Reasoning that compounds, on hardware you already own.

04 Under the hood

Six parts. One binary.

No Python runtime to crash, no fragile dependency tree. A single Rust executable with a deterministic lifecycle — start it, stream a model, shut it down clean.

Single Binary Runtime

One self-contained Rust executable. No interpreter, no venv — a deterministic lifecycle built to run for days without falling over.

Layer Streamer · The core

Pulls weights off the SSD one transformer layer at a time. The full model never has to fit in VRAM.

Prefetcher · Overlap I/O

Reads the next layer from disk while the current layer is still computing, hiding storage latency behind GPU work.
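One way to picture the overlap, as a minimal sketch rather than AeroLLM's actual prefetcher: a background thread loads layer N+1 while the main thread computes on layer N. All names below (`Layer`, `load_layer`, `apply`, `forward_with_prefetch`) are hypothetical, and the "disk read" is simulated in memory.

```rust
use std::thread::{self, JoinHandle};

// Hypothetical sketch of prefetching; none of these names come from AeroLLM.

/// Stand-in for one transformer layer's weights.
struct Layer {
    scale: f32,
    bias: f32,
}

/// Stand-in for a (slow) read of layer `index` from disk.
fn load_layer(index: usize) -> Layer {
    Layer { scale: 1.0 + index as f32, bias: 0.5 }
}

/// Stand-in for the layer's forward computation, applied in place.
fn apply(layer: &Layer, activations: &mut [f32]) {
    for x in activations.iter_mut() {
        *x = *x * layer.scale + layer.bias;
    }
}

/// Computes on the current layer while the next one loads on another thread.
fn forward_with_prefetch(num_layers: usize, activations: &mut [f32]) {
    if num_layers == 0 {
        return;
    }
    // Kick off the first load before entering the loop.
    let mut pending: Option<JoinHandle<Layer>> = Some(thread::spawn(|| load_layer(0)));
    let mut next_index = 1;
    while let Some(handle) = pending.take() {
        // Wait for the layer that was loading in the background.
        let layer = handle.join().expect("loader thread panicked");
        // Start loading the following layer before computing on this one.
        if next_index < num_layers {
            let index = next_index;
            pending = Some(thread::spawn(move || load_layer(index)));
            next_index += 1;
        }
        apply(&layer, activations);
        // `layer` is dropped here, so at most two layers are ever resident.
    }
}

fn main() {
    let mut activations = vec![1.0_f32];
    forward_with_prefetch(3, &mut activations);
    println!("{:?}", activations);
}
```

The trade-off in this sketch is memory for latency: two layers are resident at the peak instead of one, in exchange for the read being hidden behind compute.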

Speculative Decoder · Throughput

A fast draft model proposes a run of tokens; the big model verifies them in one pass. The result is lossless: the output matches what the big model would produce on its own, at up to 7× the speed on 70B+ teacher models.
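Why verification keeps the output identical is easiest to see with greedy decoding. The sketch below is a toy under stated assumptions, not AeroLLM's decoder: a "model" is a plain function from context to its next token, `target` and `draft` are made-up stand-ins, and the per-position calls to `target` simulate what a real engine would score in a single batched forward pass.

```rust
// Hypothetical sketch of greedy speculative decoding; all names are illustrative.

/// A "model" maps a token context to its greedy next token.
type Model = fn(&[u32]) -> u32;

const VOCAB: u32 = 11;

/// Toy stand-in for the big model.
fn target(context: &[u32]) -> u32 {
    (context.iter().sum::<u32>() * 7 + 3) % VOCAB
}

/// Toy stand-in for the draft model: agrees with the target most of the time.
fn draft(context: &[u32]) -> u32 {
    let token = target(context);
    if context.len() % 4 == 3 { (token + 1) % VOCAB } else { token }
}

/// One speculative step: draft proposes `k` tokens, target verifies them.
fn speculative_step(context: &mut Vec<u32>, k: usize, draft: Model, target: Model) {
    let base = context.len();
    let mut proposal = context.clone();
    for _ in 0..k {
        let token = draft(&proposal);
        proposal.push(token);
    }
    // Simulated verification: one target query per proposed position.
    for i in 0..k {
        let expected = target(&proposal[..base + i]);
        context.push(expected);
        if expected != proposal[base + i] {
            return; // first mismatch: keep the target's token, drop the rest
        }
    }
    // Every proposal matched, so the same pass also yields one extra token.
    let bonus = target(context.as_slice());
    context.push(bonus);
}

/// Baseline: the target model decoding one token at a time.
fn generate_greedy(prompt: &[u32], n: usize, model: Model) -> Vec<u32> {
    let mut out = prompt.to_vec();
    for _ in 0..n {
        let token = model(&out);
        out.push(token);
    }
    out
}

/// Speculative generation of `n` new tokens, `k` proposals per step.
fn generate_speculative(prompt: &[u32], n: usize, k: usize, draft: Model, target: Model) -> Vec<u32> {
    let mut out = prompt.to_vec();
    while out.len() < prompt.len() + n {
        speculative_step(&mut out, k, draft, target);
    }
    out.truncate(prompt.len() + n);
    out
}

fn main() {
    let prompt = [1, 2];
    let plain = generate_greedy(&prompt, 20, target);
    let fast = generate_speculative(&prompt, 20, 4, draft, target);
    println!("identical output: {}", plain == fast);
}
```

Every token that lands in the output is one the target itself would have chosen for that prefix, so a weak draft model costs speed, never correctness. Sampling-based decoding needs a probabilistic acceptance rule instead, which this sketch does not cover.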

Sharded KV Cache · Long context

Attention state is kept lean and sharded so long prompts fit alongside a streamed model instead of crowding it out.
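The page does not say how the cache is sharded, so the following is only one plausible reading: keys and values are stored in fixed-size shards of tokens that are allocated on demand, so memory grows in small steps with the prompt instead of being reserved up front. `KvShard`, `ShardedKvCache`, and every method here are hypothetical names for illustration.

```rust
// Hypothetical sketch of a token-sharded KV cache; not AeroLLM's actual layout.

/// One fixed-capacity block of cached keys and values.
struct KvShard {
    keys: Vec<f32>,
    values: Vec<f32>,
    len: usize, // tokens stored in this shard
}

struct ShardedKvCache {
    shard_tokens: usize, // tokens per shard
    head_dim: usize,     // floats per key or value vector
    shards: Vec<KvShard>,
}

impl ShardedKvCache {
    fn new(shard_tokens: usize, head_dim: usize) -> Self {
        assert!(shard_tokens > 0 && head_dim > 0);
        ShardedKvCache { shard_tokens, head_dim, shards: Vec::new() }
    }

    /// Appends one token's key and value, opening a new shard when the last is full.
    fn append(&mut self, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.head_dim);
        assert_eq!(value.len(), self.head_dim);
        let need_new = self.shards.last().map_or(true, |s| s.len == self.shard_tokens);
        if need_new {
            let capacity = self.shard_tokens * self.head_dim;
            self.shards.push(KvShard {
                keys: Vec::with_capacity(capacity),
                values: Vec::with_capacity(capacity),
                len: 0,
            });
        }
        let shard = self.shards.last_mut().expect("a shard exists after the check above");
        shard.keys.extend_from_slice(key);
        shard.values.extend_from_slice(value);
        shard.len += 1;
    }

    /// Total number of cached tokens.
    fn len(&self) -> usize {
        self.shards.iter().map(|s| s.len).sum()
    }

    fn shard_count(&self) -> usize {
        self.shards.len()
    }

    /// Returns the (key, value) vectors cached for token position `pos`, if any.
    fn get(&self, pos: usize) -> Option<(&[f32], &[f32])> {
        let shard = self.shards.get(pos / self.shard_tokens)?;
        let offset = pos % self.shard_tokens;
        if offset >= shard.len {
            return None;
        }
        let range = offset * self.head_dim..(offset + 1) * self.head_dim;
        Some((&shard.keys[range.clone()], &shard.values[range]))
    }
}

fn main() {
    let mut cache = ShardedKvCache::new(4, 2);
    for token in 0..10 {
        let v = token as f32;
        cache.append(&[v, v], &[-v, -v]);
    }
    println!("{} tokens in {} shards", cache.len(), cache.shard_count());
}
```

Because each shard is an independent allocation, a design like this could also move cold shards out of fast memory one at a time; whether AeroLLM does so is not stated.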

MLX + CUDA Backends · Anywhere

Apple Silicon unified memory (zero-copy, no passthrough) or a discrete NVIDIA GPU — same engine, same API, swap with a flag.
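A flag-selected backend might look like the sketch below. The flag name `--backend` and the values `mlx` and `cuda` are assumptions made for illustration; the page does not document AeroLLM's real command line, and `Backend` and `backend_from_args` are hypothetical names.

```rust
use std::str::FromStr;

// Hypothetical sketch: the `--backend` flag and its values are assumed, not documented.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Mlx,
    Cuda,
}

impl FromStr for Backend {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "mlx" => Ok(Backend::Mlx),
            "cuda" => Ok(Backend::Cuda),
            other => Err(format!("unknown backend: {other}")),
        }
    }
}

/// Scans arguments for `--backend <name>`; returns `Ok(None)` if the flag is absent.
fn backend_from_args<I>(args: I) -> Result<Option<Backend>, String>
where
    I: IntoIterator<Item = String>,
{
    let mut args = args.into_iter();
    while let Some(arg) = args.next() {
        if arg == "--backend" {
            let value = args
                .next()
                .ok_or_else(|| "--backend needs a value".to_string())?;
            return value.parse::<Backend>().map(Some);
        }
    }
    Ok(None)
}

fn main() {
    match backend_from_args(std::env::args().skip(1)) {
        Ok(Some(backend)) => println!("using backend: {backend:?}"),
        Ok(None) => println!("no --backend flag given"),
        Err(message) => eprintln!("error: {message}"),
    }
}
```

Keeping the choice in a single enum is what lets the rest of an engine stay backend-agnostic: everything downstream matches on `Backend` rather than on strings.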

05 What that buys you

Frontier scale, no cluster.

Rent the math, or own the run.

The deepest open models there are — up to 671B parameters at 4-bit — run locally, streamed from a 1TB laptop SSD. No rack of GPUs to rent, no per-token meter running, no data leaving the machine.
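The disk-space claim is easy to check with back-of-envelope arithmetic. The sketch below counts raw weight bits only; it assumes decimal units (1 TB = 10^12 bytes) and ignores quantization metadata such as scales, which adds some overhead in practice. `weight_bytes` is a hypothetical helper.

```rust
/// Raw storage for `parameters` weights at `bits_per_weight` bits each.
/// Ignores quantization scales and other metadata (an assumption of this sketch).
fn weight_bytes(parameters: u64, bits_per_weight: u64) -> u64 {
    parameters * bits_per_weight / 8
}

fn main() {
    let ssd_bytes: u64 = 1_000_000_000_000; // 1 TB, decimal
    for parameters in [671_000_000_000_u64, 400_000_000_000] {
        let bytes = weight_bytes(parameters, 4);
        println!(
            "{}B params at 4-bit: {:.1} GB, fits on 1 TB: {}",
            parameters / 1_000_000_000,
            bytes as f64 / 1e9,
            bytes <= ssd_bytes
        );
    }
    // prints:
    // 671B params at 4-bit: 335.5 GB, fits on 1 TB: true
    // 400B params at 4-bit: 200.0 GB, fits on 1 TB: true
}
```

At half a byte per weight, a 671B model needs roughly a third of a 1TB drive, which is why the SSD, not the VRAM, sets the ceiling.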

That is exactly what a multi-day distillation run needs from its teacher: a deep model that stays up, costs nothing per query once it's on disk, and answers to no one but you.

671B

parameters at 4-bit — streamed from an SSD, run on commodity hardware. 400B+ on as little as 8GB of VRAM.

Who else runs a 671-billion-parameter model on a laptop?

06 Where it lives

Open-sourcing soon.
Yours to run, fork, and ship.

Apache 2.0 | MLX · Apple Silicon | CUDA | Single binary | No GPU farm

Run a frontier model tonight.

Clone the engine, point it at a model, and stream 400B+ parameters off the disk you already own.