AeroLLM

01 The idea

Don't shrink the model. Stream it.

A 671B model doesn't fit in your VRAM — so most tools shrink it until it does, and lose what made it smart. AeroLLM does the opposite: it keeps the model whole and feeds it through your GPU one layer at a time, streamed straight off your SSD.

Load a layer, compute, discard it, load the next. The full weight set never sits in memory at once, so frontier-scale reasoning runs on a laptop with no cluster and no cloud. Full credit goes to AirLLM for the layer-streaming insight; AeroLLM rebuilds that idea in Rust for the stability and Apple Silicon (MLX) support that a long pipeline run needs.
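The load, compute, discard loop can be sketched in a few lines of Rust. This is a minimal illustration, not AeroLLM's actual code: the function names, the flat byte layout, and the toy element-wise "layer" are all assumptions made for the example.

```rust
use std::io::{self, Read};

/// Streams a toy model through one reusable buffer.
/// Assumed layout (hypothetical): each layer is `width` little-endian f32
/// scale factors, stored back to back. Real transformer layers are far larger.
pub fn stream_forward<R: Read>(
    mut weights: R,
    num_layers: usize,
    mut activations: Vec<f32>,
) -> io::Result<Vec<f32>> {
    let width = activations.len();
    // The only weight memory that is ever resident: one layer's worth.
    let mut layer_bytes = vec![0u8; width * 4];
    for _ in 0..num_layers {
        // Load the next layer; overwriting the buffer discards the previous one.
        weights.read_exact(&mut layer_bytes)?;
        // Compute: an element-wise scale stands in for a real transformer layer.
        for (x, w) in activations.iter_mut().zip(layer_bytes.chunks_exact(4)) {
            *x *= f32::from_le_bytes([w[0], w[1], w[2], w[3]]);
        }
    }
    Ok(activations)
}

/// Example helper: serializes layers into the byte layout assumed above.
pub fn encode_layers(layers: &[Vec<f32>]) -> Vec<u8> {
    layers.iter().flatten().flat_map(|w| w.to_le_bytes()).collect()
}
```

Because `stream_forward` accepts any `Read`, the same loop works over an in-memory cursor in tests and over a `std::fs::File` on disk.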

02 How it earns its speed

Streaming, made fast.

One question has driven most of the lab's work: how do we make inference faster and use fewer resources at the same time? Streaming alone trades speed for fit. The three primitives below — prefetch, speculate, shard the cache — are how AeroLLM claws the speed back without giving the memory budget back. Each has its own entry in the Performance cluster of the dictionary.

03 The core trick · the heart of it

Your disk becomes your VRAM.

The whole model lives on your SSD. AeroLLM pulls it through a single small compute window — one layer resident at a time — and out the other side come frontier tokens. The 8GB card that could never hold a 671B model can now run one.

Stream

Weights flow off the SSD layer by layer — the model is never fully resident.

Prefetch

The next layer loads while the current one computes, so the GPU rarely waits.

Speculate

A small draft model proposes tokens; the big one verifies in a single pass.

Serve

One deterministic Rust binary — the inference backbone of the Nucleus pipeline.

Frontier scale, off your disk. Reasoning that compounds, on hardware you already own.

04 Under the hood

Six parts. One binary.

No Python runtime to crash, no fragile dependency tree. A single Rust executable with a deterministic lifecycle — start it, stream a model, shut it down clean.

Single Binary Runtime

One self-contained Rust executable. No interpreter, no venv — a deterministic lifecycle built to run for days without falling over.

Layer Streamer · The core

Pulls weights off the SSD one transformer layer at a time. The full model never has to fit in VRAM.

Prefetcher · Overlap I/O

Reads the next layer from disk while the current layer is still computing, hiding storage latency behind GPU work.
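One common way to overlap disk reads with compute is a loader thread feeding a bounded queue. The sketch below assumes that design; `run_with_prefetch`, `load_layer`, and `compute` are hypothetical names, and `load_layer` stands in for an SSD read.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Runs `compute` over every layer while a background thread loads ahead.
/// `load_layer` is a hypothetical stand-in for reading one layer from disk.
pub fn run_with_prefetch<L, C>(num_layers: usize, load_layer: L, mut compute: C)
where
    L: Fn(usize) -> Vec<f32> + Send + 'static,
    C: FnMut(usize, &[f32]),
{
    // Capacity 1 bounds memory: at most one layer queued, one being loaded,
    // and one being computed at any moment.
    let (tx, rx) = sync_channel::<(usize, Vec<f32>)>(1);
    let loader = thread::spawn(move || {
        for i in 0..num_layers {
            // `send` blocks while the queue is full, which throttles the reads.
            if tx.send((i, load_layer(i))).is_err() {
                break; // the consumer went away
            }
        }
    });
    // Consumer: compute layer i while the loader is already fetching layer i + 1.
    for (i, weights) in rx {
        compute(i, &weights);
    }
    loader.join().expect("loader thread panicked");
}
```

The queue depth is the tuning knob in this design: a deeper queue hides more storage jitter but keeps more layers resident.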

Speculative Decoder · Throughput

A fast draft model proposes a run of tokens; the big model verifies them in one pass. Lossless: every token kept is one the big model would have produced itself, so the output is unchanged, at up to 7× the speed on 70B+ target models.
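A minimal sketch of one propose-and-verify step, assuming greedy decoding. `speculative_step`, `draft`, and `target` are hypothetical names: each model is reduced to a function from context to next token id. A real engine would score all proposed positions in a single batched forward pass; the sketch calls the target once per position for clarity.

```rust
/// One greedy speculative decoding step. Appends tokens to `context` and
/// returns how many of the `k` drafted tokens were accepted.
pub fn speculative_step<D, T>(context: &mut Vec<u32>, k: usize, draft: D, target: T) -> usize
where
    D: Fn(&[u32]) -> u32,
    T: Fn(&[u32]) -> u32,
{
    // 1. The draft model proposes k tokens autoregressively.
    let start = context.len();
    let mut proposal = context.clone();
    for _ in 0..k {
        let t = draft(&proposal[..]);
        proposal.push(t);
    }
    // 2. The target checks each proposed position against its own choice.
    let mut accepted = 0;
    for i in 0..k {
        let expected = target(&proposal[..start + i]);
        // Either way the target's token is kept, so the output matches
        // what target-only greedy decoding would have produced.
        context.push(expected);
        if expected != proposal[start + i] {
            return accepted; // first mismatch: drop the rest of the draft
        }
        accepted += 1;
    }
    // 3. Every draft token matched: the same pass yields one bonus token.
    let bonus = target(&context[..]);
    context.push(bonus);
    accepted
}
```

Under these assumptions each step emits between 1 and k + 1 tokens per target pass, which is where the speedup comes from when the draft model agrees often.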

Sharded KV Cache · Long context

Attention state is kept lean and sharded so long prompts fit alongside a streamed model instead of crowding it out.
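The text does not say how the cache is sharded, so the sketch below assumes one plausible scheme: keys and values live in fixed-size shards that are allocated only as the prompt grows. The type `ShardedKvCache` and all of its fields and methods are hypothetical.

```rust
/// Hypothetical sharded KV cache: storage grows one fixed-size shard at a
/// time, so memory tracks the prompt length instead of the maximum context.
pub struct ShardedKvCache {
    shard_tokens: usize,   // tokens per shard
    head_dim: usize,       // floats per key (and per value)
    shards: Vec<Vec<f32>>, // each shard stores key then value for each token
    len: usize,            // tokens stored so far
}

impl ShardedKvCache {
    pub fn new(shard_tokens: usize, head_dim: usize) -> Self {
        assert!(shard_tokens > 0 && head_dim > 0);
        Self { shard_tokens, head_dim, shards: Vec::new(), len: 0 }
    }

    /// Appends one token's key and value, allocating a shard when needed.
    pub fn append(&mut self, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.head_dim);
        assert_eq!(value.len(), self.head_dim);
        if self.len % self.shard_tokens == 0 {
            // All existing shards are full (or none exist yet).
            self.shards.push(Vec::with_capacity(self.shard_tokens * 2 * self.head_dim));
        }
        let shard = self.shards.last_mut().expect("a shard exists at this point");
        shard.extend_from_slice(key);
        shard.extend_from_slice(value);
        self.len += 1;
    }

    /// Returns the (key, value) pair stored for `token`, if present.
    pub fn get(&self, token: usize) -> Option<(&[f32], &[f32])> {
        if token >= self.len {
            return None;
        }
        let shard = &self.shards[token / self.shard_tokens];
        let off = (token % self.shard_tokens) * 2 * self.head_dim;
        Some(shard[off..off + 2 * self.head_dim].split_at(self.head_dim))
    }

    pub fn num_shards(&self) -> usize {
        self.shards.len()
    }
}
```

Fixed-size shards also make it straightforward to evict or spill old context one shard at a time, which is one way a cache could coexist with streamed weights.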

MLX + CUDA Backends · Anywhere

Apple Silicon unified memory (zero-copy, no passthrough) or a discrete NVIDIA GPU — same engine, same API, swap with a flag.

05 What that buys you

Frontier scale, no cluster.

Rent the math, or own the run.

The deepest open models there are — up to 671B parameters at 4-bit — run locally, streamed from a 1TB laptop SSD. No rack of GPUs to rent, no per-token meter running, no data leaving the machine.
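The storage claim is easy to check with back-of-the-envelope arithmetic. The helper names below are hypothetical, and the estimate ignores quantization metadata (scales, zero points) and any non-weight files.

```rust
/// Bytes needed to store `params` weights at `bits_per_weight` bits each.
/// Ignores quantization metadata and non-weight files.
pub fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// Average bytes per layer if the weights were split evenly.
/// `num_layers` is model-specific and must be supplied by the caller.
pub fn avg_layer_bytes(total_bytes: u64, num_layers: u64) -> u64 {
    total_bytes / num_layers
}
```

For 671B parameters at 4 bits, `weight_bytes(671_000_000_000, 4)` gives 335,500,000,000 bytes, about 335.5 GB, which leaves room to spare on a 1TB SSD.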

Streaming off disk trades latency for reach, so it opens doors that real-time chat never could: research, batch reasoning, evaluation, offline pipelines, and anywhere else a slower answer is acceptable. The result is a deep model that stays up, carries no per-token fee once the weights are on disk, and answers to no one but you.

671B

parameters at 4-bit — streamed from an SSD, run on commodity hardware. 400B+ on as little as 8GB of VRAM.

Who else runs a 671-billion-parameter model on a laptop?

06 Where it lives

Open-sourcing soon.
Yours to run, fork, and ship.

Apache 2.0 · MLX (Apple Silicon) · CUDA · Single binary · No GPU farm

Run a frontier model tonight.
