Learn · the front door to the suite

The AI Dictionary.

Learn the ropes. Every model-building term — from LoRA and quantization to QuKaiZen's own Super Skill and Wisdom per Watt — defined plainly, each with a concrete example.

102 terms · machine-readable at /what

Ablation

Fundamentals

Removing one component to measure how much it actually contributes.

An ablation study turns a part off — a layer, a loss term, a data source — and measures the drop, isolating what really matters. It's how you separate the ingredient from the marketing.

Example: Ablating the distractor documents shows RAFT's robustness gains came from training through noise.

AdamW

Training · Adam with weight decay

The default optimizer for training transformers — Adam with decoupled weight decay.

AdamW adapts the learning rate per parameter using running estimates of gradient mean and variance, and decouples weight decay from the gradient update for cleaner regularization. It is the workhorse optimizer for LLM training.

Example: A typical run uses AdamW with lr=2e-4, betas=(0.9, 0.95), weight_decay=0.1, plus warmup and a cosine schedule.
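A minimal sketch of one AdamW step for a single scalar weight, using the hyperparameters from the example above. The function name `adamw_step` and the `eps` value are illustrative assumptions; real optimizers apply the same arithmetic to whole tensors.

```python
import math

def adamw_step(w, g, m, v, t, lr=2e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """Apply one AdamW update to scalar weight w given gradient g.

    m and v are the running estimates of the gradient mean and of the
    squared gradient; t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w          # decoupled weight decay (not mixed into g)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w)
```

The weight decay line is what separates AdamW from plain Adam with L2 regularization: the decay shrinks the weight directly instead of being added to the gradient and rescaled by the adaptive term.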

Adapters

Fine-tuning · adapter layers

Small trainable modules inserted into a frozen model to add new skills without retraining it.

Adapters are tiny bottleneck layers added between a frozen model's existing layers; only the adapters train. They are a parameter-efficient way to teach new tasks, and you can keep a library of swappable adapters for one base. LoRA is a popular low-rank flavor of this idea.

Example: Ship one 7B base plus a 'legal' adapter and a 'medical' adapter; load whichever the task needs.

Adversarial Swarm

QuKaiZen · swarm

A loop of agents (interrogate, challenge, evaluate, correct) that hardens a model until it stops breaking.

The Adversarial Swarm Reactor pits Interrogator, Adversary, Evaluator, and Corrector agents (plus data-collection agents) against the student in cycles, systematically hunting and eliminating hallucination pathways. The model graduates not by passing a fixed test but when the swarm can no longer break it.

Example: The swarm keeps inventing harder kernel-bug traps until the student answers them all; then it graduates.

AeroLLM

QuKaiZen

QuKaiZen's inference engine that streams frontier models off disk so they run without full GPU residency.

AeroLLM is the inference layer that makes disk-streamed teachers practical — layer streaming plus speculative decoding to claw back speed. It is how QuKaiZen serves 400B+ teachers on workstations that lack the VRAM to hold them.

Example: Point the teacher backend at AeroLLM to stream a 405B teacher on a single box instead of an 8x H100 node.

Agent

Architecture

An LLM that takes actions — calls tools, makes decisions — toward a goal, not just chats.

An agent wraps a model with tools, memory, and a control loop so it can plan, act, observe, and iterate. PaperAgents declares teams of small specialist agents; ARAIL's Buddy is a lab agent.

Example: A dispatch agent reads a load board, computes margin, and books profitable freight without a human in the loop.

Agentic

Architecture

Software built around autonomous, tool-using model agents.

Agentic systems give models autonomy to decide and act over many steps using tools and feedback, instead of producing a single response. The tradeoff is power vs. predictability — hence guardrails and declared workflows.

Example: An agentic workflow downloads data, analyzes it, decides what to do, and acts on the result, looping until the job is done.

Alignment

RL & Alignment

Making a model's behavior match human intent and values.

Alignment is the work of making models helpful, honest, and harmless — via methods like RLHF and DPO plus evaluation for refusal and faithfulness. Misalignment shows up as unsafe or off-intent output.

Example: RLHF aligns a base model so it follows instructions and declines harmful requests.

Attention

Architecture · self-attention

The mechanism that lets each token weigh and pull information from every other token.

Attention computes, for each token, a weighted sum of all tokens' value vectors, where weights come from the similarity (dot product) of its query with others' keys. It is how transformers model long-range relationships, and its quadratic cost is what FlashAttention and the KV-cache optimize.

Example: In 'the cat sat because it was tired', attention links 'it' back to 'cat' by giving that pair a high weight.
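The weighted sum described above can be sketched in plain Python. This is an illustrative, unmasked, single-head version; the division by the square root of the key dimension is the standard Transformer scaling and is an addition to the description above, and the toy vectors are made up.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no masking)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# One query that matches the first key more closely than the second.
out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[1.0, 0.0], [0.0, 1.0]])[0]
print(out)  # the first value vector gets the larger weight
```

Every query is scored against every key, which is where the quadratic cost in sequence length comes from.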

Automation

Architecture

Letting software run repeatable work end-to-end with no human in the loop.

Automation captures a repeatable process so it runs on its own, reliably and on schedule. PaperAgents automates with small specialist agents reconciled to a desired state.

Example: Invoicing that books, charges, and files itself every night.

AutoResearch

QuKaiZen

The swarm's brain — it evolves the rubrics every other agent consults.

AutoResearch is a first-class meta-agent that evolves the rubrics driving probes, traps, and scoring, and independently fact-checks certification. It is never merged into another service.

Example: AutoResearch notices repeated failures on edge cases and rewrites the rubric to target them next cycle.

Backprop

Training · backpropagation

The algorithm that computes how to nudge every weight by propagating error gradients backward.

Backpropagation applies the chain rule to compute the gradient of the loss with respect to every parameter, flowing from the output layer back to the input. Those gradients tell the optimizer which direction to move each weight to reduce error.

Example: After a forward pass yields loss 2.3, backprop computes the gradient for every weight; AdamW then updates them.

Benchmark

Fundamentals · eval

A standardized test set used to measure and compare model capability.

Benchmarks score models on fixed tasks — knowledge, reasoning, code — so results are comparable. QuKaiZen's Gate 1 uses MMLU, HellaSwag, ARC, GSM8K, and IFEval to verify capability survives distillation.

Example: A distilled student must retain ≥85% of its base model's MMLU score to pass the regression gate.

BF16

Quantization · bfloat16

A 16-bit float with the same exponent range as FP32 — the default precision for training LLMs.

bfloat16 keeps FP32's 8-bit exponent (the same huge dynamic range) but truncates the mantissa to 7 bits. That range makes it numerically stable for training without loss scaling, at half the memory and bandwidth of FP32.

Example: Most LLMs train in BF16 on A100/H100/TPU; weights are half the size of FP32 with no overflow headaches.
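Because bfloat16 is the top 16 bits of a float32, the conversion can be sketched with the standard `struct` module. This sketch truncates, as the definition above describes; real hardware typically rounds to nearest instead, and the helper names are illustrative.

```python
import struct

def to_bf16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 form."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits

def from_bf16_bits(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x

def roundtrip(x):
    return from_bf16_bits(to_bf16_bits(x))

print(roundtrip(1.0))     # 1.0 (exactly representable)
print(roundtrip(0.1))     # 0.099609375 (only 7 mantissa bits survive)
print(roundtrip(3.0e38))  # still finite: the exponent range matches FP32
```

The last line is the point of the format: a value near the top of FP32's range survives, where FP16 would overflow to infinity.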

Buddy

QuKaiZen · ARAIL Buddy

ARAIL's local companion agent — a context-aware lab partner you learn alongside, running entirely on your own hardware.

ARAIL began with Buddy: a local agent to learn alongside. Buddy needed an environment, and that environment became a lab — pluggable, observable, and entirely owned by you. Buddy drives the lab in plain language and draws on your knowledge base for real context, so it can answer "what should I do next?" or "what's interesting in today's pull?" — offline, with no telemetry.

Example: Ask Buddy "what's worth reading in today's arXiv pull?" and it answers from your own knowledge base, with no cloud round-trip and nothing leaving your machine.

Chain-of-Thought

Fundamentals · CoT

Prompting a model to show its intermediate steps, which sharply improves reasoning.

Chain-of-thought elicits step-by-step intermediate reasoning before the final answer. Wei et al. (2022) showed it dramatically improves math and logic; QuKaiZen distills symbolic CoT into small students.

Example: Instead of just '42', a CoT response writes the derivation line by line, then states 42, and is right far more often.

Checkpoint

Training · model checkpoint

A saved snapshot of model weights (and often optimizer state) you can resume or deploy from.

A checkpoint persists the model's parameters — and during training, the optimizer state and step — so a run can resume after interruption or a version can be evaluated and shipped. Modern checkpoints use safetensors for safe, fast loading.

Example: Saving a checkpoint every 500 steps means a crash at step 1700 resumes from 1500, not from scratch.

Continuous Batching

Inference · in-flight batching

Swapping requests in and out of a running batch every step to keep the GPU saturated.

Continuous (in-flight) batching removes finished sequences and adds new ones each step, instead of waiting for a whole batch to complete — dramatically improving serving throughput and latency.

Example: A server using continuous batching serves many users at once with no idle GPU gaps.

Convergence

QuKaiZen

Graduation by exhaustion — the model is done when the swarm can't break it anymore.

Rather than a fixed number of rounds, QuKaiZen runs until convergence: 95%+ evaluator scores, exhausted experiments, and no further reasoning gains. Quality is measured by swarm exhaustion, then verified by three gates.

Example: After dozens of cycles the swarm finds no new failure patterns; the student converges and is sealed.

Convergence Graduation

QuKaiZen · Convergence-Based Graduation

A model graduates when the adversarial swarm gives up trying to break it — not at a fixed cycle limit.

Instead of a fixed number of rounds, QuKaiZen runs until convergence: 95%+ evaluator scores, exhausted experiments, and no further reasoning improvement. Quality is measured by swarm exhaustion, then verified by a three-gate certification before the Nucleus Seal is minted.

Example: After dozens of cycles the swarm finds no new failure patterns; the student converges, passes the gates, and is sealed.

CUDA

Formats & Runtime

NVIDIA's platform/language for general-purpose GPU computing — the substrate most ML runs on.

CUDA is NVIDIA's parallel-computing API and toolkit that lets code run on GPUs. Frameworks compile their tensor ops down to CUDA kernels (and libraries like cuBLAS/cuDNN), which is why GPU availability and CUDA versions dominate ML ops.

Example: A version mismatch between a PyTorch build and the installed CUDA toolkit is the classic 'it will not see the GPU' bug.

Desired State

Architecture

The end state you declare; the system's job is to make reality match it.

Desired-state configuration means you describe what you want (the team, the config), not the steps to get there, and a controller reconciles reality to it. The declaration is idempotent and version-controlled.

Example: team.toml lists four agents; apply it and the platform makes exactly those run.

Distillation

Fine-tuning · knowledge distillation

Transfer a big teacher model's behavior into a small student model.

Knowledge distillation trains a small student to mimic a large teacher — matching its outputs, probabilities, or reasoning traces — so the student captures much of the teacher's capability at a fraction of the size and cost. It is the core of QuKaiZen's pipeline.

Example: A 3B student trained on a 400B teacher's chain-of-thought traces can match the teacher in-domain while running on a laptop.

DPO

RL & Alignment · Direct Preference Optimization

Align to preferences directly from good/bad answer pairs — no reward model or RL loop.

DPO skips RLHF's separate reward model and PPO loop, reframing alignment as a simple classification-style loss over (preferred, rejected) pairs that directly raises the likelihood of preferred answers. Simpler and more stable than PPO-based RLHF, with comparable results.

Example: Feed pairs like (concise correct answer = preferred, rambling answer = rejected); DPO's loss directly widens the margin between them.
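A sketch of the per-pair DPO loss, assuming you already have summed log-probabilities of each answer under the model being trained and under a frozen reference copy (the reference is part of the published DPO objective, though not spelled out above). The function name and the `beta` value are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of answers."""
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)), written as log(1 + exp(-logit)).
    return math.log1p(math.exp(-logit))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has moved toward the preferred answer: the loss drops.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(baseline, improved)
```

The loss is a binary classification loss on the margin, which is why no separate reward model or sampling loop is needed.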

Draft Model

Inference

The small, fast model that proposes candidate tokens in speculative decoding.

The draft model is a smaller, cheaper model that guesses the next several tokens; the large target model then verifies them together. The closer the draft tracks the target, the more tokens are accepted per pass.

Example: A 1B draft proposes 5 tokens; the 70B target verifies all 5 in one pass when they agree.
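The accept-or-correct logic can be sketched with greedy verification, one simple variant of speculative decoding. Here `toy_target` is a made-up stand-in for the large model, and it is called once per position for clarity; a real target scores every drafted position in a single forward pass.

```python
SENTENCE = ["the", "cat", "sat", "on", "the", "mat", "."]

def toy_target(context):
    """Hypothetical target model: always continues the fixed sentence."""
    return SENTENCE[len(context)]

def accept_draft(draft_tokens, target_next_token, prefix):
    """Keep draft tokens while the target agrees; stop at the first miss."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's own token replaces the miss
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next_token(prefix + accepted))
    return accepted

result = accept_draft(["cat", "sat", "in", "a", "box"], toy_target, ["the"])
print(result)  # ['cat', 'sat', 'on']
```

Two draft tokens are accepted and the third is corrected, so one verification round yields three tokens instead of one. The output always matches what the target alone would have produced.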

Dropout

Training · dropout regularization

Randomly zeroing activations during training to prevent overfitting.

Dropout randomly sets a fraction of activations to zero each training step, forcing the network not to rely on any single unit and improving generalization. It is disabled at inference. Large pretraining often uses little or none, but it is common when fine-tuning on small data.

Example: Dropout 0.1 on a fine-tune randomly drops 10% of activations per step to curb overfitting on a small dataset.
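A small sketch of dropout on a list of activations. It uses the common "inverted" form, which scales the surviving activations by 1/(1-p) during training so nothing needs rescaling at inference; that scaling is standard practice but is an addition to the description above.

```python
import random

def dropout(activations, p=0.1, training=True, rng=random):
    """Zero each activation with probability p during training."""
    if not training or p == 0.0:
        return list(activations)  # disabled at inference
    keep = 1.0 - p
    # Survivors are scaled up so the expected value stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
train_out = dropout([1.0] * 1000, p=0.1, rng=rng)
print(train_out.count(0.0))                       # roughly 100 of 1000 dropped
print(dropout([1.0, 2.0, 3.0], training=False))   # unchanged at inference
```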

Embeddings

Fundamentals · embedding vectors

Dense numeric vectors representing tokens or text so similar meanings sit close together.

An embedding maps a token or piece of text to a vector in high-dimensional space where geometric closeness reflects semantic similarity. Models learn input embeddings for tokens; separate embedding models turn whole documents into vectors for search and RAG.

Example: 'king' minus 'man' plus 'woman' lands near 'queen'; RAG retrieves the docs whose embeddings are nearest the query's.

Eval

Training · evaluation

The practice of measuring model quality with repeatable tests — from public benchmarks to task-specific graders.

An eval is any repeatable measurement of how well a model does something: a public benchmark, a private held-out set, an LLM-as-judge rubric, or a unit-test-style check. Good evals are the steering wheel of model building — without them you cannot tell whether a change helped. QuKaiZen's certification gates are the evals a student model must pass before it graduates.

Example: Before shipping a fine-tune you run an eval suite (MMLU for knowledge, GSM8K for math, IFEval for instruction-following) and only ship if every score holds or improves.

Experiment

Fundamentals · experimentation

A single tracked training or evaluation run with a fixed configuration, used to test one change against a baseline.

An experiment isolates one variable — a hyperparameter, a data change, an architecture tweak — and measures its effect against a baseline under otherwise identical conditions. Each run logs its config, metrics, and artifacts so results are reproducible and comparable. In ARAIL, autoresearch agents run experiments continuously and score each against evolving rubrics — what gets measured gets improved.

Example: Change only the learning rate from 2e-4 to 1e-4, rerun training, and compare validation loss to the baseline; if it improves and nothing else changed, the experiment isolated the cause.

Fine-tune

Fine-tuning · fine-tuning

Continue training a pretrained model on new data to specialize it for a task or domain.

Fine-tuning takes a general pretrained model and trains it further on a focused dataset so it adapts to a domain, style, or task. It can be full (all weights) or parameter-efficient (LoRA/PEFT), and is the bridge from a generic base to a useful specialist.

Example: Fine-tune a base 7B on 30 years of Linux-kernel commits and it starts reasoning like a kernel engineer.

FlashAttention

Inference · Flash Attention

An exact attention kernel that is fast and memory-light by never materializing the full attention matrix.

FlashAttention computes exact attention in tiles that stay in fast on-chip SRAM, avoiding the quadratic N-by-N matrix in slow HBM. It cuts memory from quadratic to linear and speeds up training and inference, enabling much longer contexts.

Example: Swapping standard attention for FlashAttention-2 can train a long-context model ~2x faster with far less memory.

FP8

Quantization · 8-bit float

An 8-bit floating-point format for faster training and inference on H100-class hardware.

FP8 represents numbers in 8 bits (e4m3 or e5m2 variants), halving memory and up to doubling peak throughput versus BF16 on supporting GPUs. It needs careful scaling but is increasingly used for both training and high-throughput inference.

Example: Serving a teacher in FP8 on H100s can roughly double tokens/sec versus BF16 with minimal quality loss.

FSDP

Training · Fully Sharded Data Parallel

Shards model parameters, gradients, and optimizer state across GPUs so huge models fit in training.

FSDP (PyTorch) splits parameters, gradients, and optimizer states across all data-parallel GPUs, gathering each shard only when needed. It trains models far larger than a single GPU's memory, with less overhead than older model-parallel schemes.

Example: Training a 70B model across 8 GPUs: FSDP keeps only 1/8 of the weights resident on each, all-gathering layers on the fly.

Function Calling

Architecture

A structured protocol for a model to request a specific tool with typed arguments.

Function calling has the model emit a structured call — a name plus JSON arguments — that your code executes and returns, for the model to use. It's the reliable mechanism beneath most tool use.

Example: The model returns {"name": "get_rate", "args": {"lane": "CHI-DAL"}}; your server runs it and feeds back the price.

GELU

Architecture · Gaussian Error Linear Unit

A smooth activation function used in transformer feed-forward layers.

GELU multiplies an input x by Φ(x), the standard Gaussian CDF: the probability that a standard normal variable falls below x. The result is a smooth alternative to ReLU that lets small negative values through. Its smoothness helps gradient flow, and it is the default activation in many transformer MLP blocks (with SwiGLU now common too).

Example: A transformer's feed-forward block applies GELU between its two linear layers.
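The exact form of GELU is x times the standard Gaussian CDF, which the standard library can compute through `math.erf`. This is a plain scalar sketch; frameworks often use a faster tanh approximation instead.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu(x), 4))
```

Large positive inputs pass through almost unchanged, large negative inputs go to nearly zero, and small negative inputs yield a small negative output rather than the hard zero ReLU would give.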

GGUF

Formats & Runtime · GGML successor

A single-file binary format for quantized models, built for fast local inference (llama.cpp).

GGUF packs weights (usually quantized), tokenizer, and metadata into one memory-mappable file so a model loads fast and runs on commodity hardware. It is the format used by llama.cpp and friends, superseding the older GGML format.

Example: llama-2-7b.Q4_K_M.gguf is a 7B model quantized to ~4-bit (~4GB) that runs on a laptop with llama.cpp.

Gradient

Training · gradients

The vector of partial derivatives telling how the loss changes as you tweak each weight.

A gradient points in the direction of steepest increase of the loss; training steps move weights the opposite (descent) way. Gradient magnitude and stability (vanishing/exploding) are central concerns, handled with clipping, normalization, and good optimizers.

Example: Gradient clipping caps the global gradient norm (e.g., 1.0) to stop a huge update from blowing up training.
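Clipping by global norm, as in the example above, can be sketched on a flat list of gradient values. The function name is illustrative; frameworks do the same thing across all parameter tensors at once.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]: same direction, norm 1.0
```

All components are scaled by the same factor, so the update keeps its direction and only its size is capped.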

GSM8K

Training · Grade School Math 8K

Around 8,500 grade-school math word problems that test multi-step arithmetic reasoning.

GSM8K (Grade School Math 8K) is a dataset of about 8,500 linguistically diverse grade-school word problems, each needing two to eight reasoning steps. It became the standard probe for whether a model reasons step by step instead of pattern-matching — and the benchmark that made chain-of-thought prompting famous.

Example: A GSM8K problem may say a robe needs 2 bolts of blue fiber and half that of white; the model must compute 2 + 1 = 3, and the benchmark scores only the final number.

HalluLens

Training · LLM hallucination benchmark

A benchmark for measuring how often an LLM hallucinates — asserts unsupported or fabricated facts.

HalluLens is a hallucination benchmark that separates extrinsic hallucination (claims grounded in no source) from intrinsic hallucination (contradicting the given input), and probes models with tasks designed to surface confident-but-false answers. It exists because fluency hides unreliability — a model can sound right while being wrong.

Example: Asked to summarize a paper that does not exist, a hallucinating model invents authors and results; HalluLens scores whether it fabricates or correctly declines.

HELM

Training · Holistic Evaluation of Language Models

Stanford's broad, multi-metric benchmark suite that scores models across many scenarios, not just accuracy.

HELM (Holistic Evaluation of Language Models), from Stanford CRFM, evaluates models across a wide matrix of scenarios and metrics — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — so a model is judged on many axes at once instead of a single headline score.

Example: Under HELM, two models with identical accuracy can rank differently once robustness and calibration are weighed in.

Hypothesis

Fundamentals

A testable prediction you set out to confirm or refute with an experiment.

A hypothesis states, in advance, what you expect a change to do and how you'll measure it — turning a hunch into something falsifiable. Good research lives or dies on sharp hypotheses.

Example: 'Adding symbolic CoT will raise faithfulness by 5 points.' Then you run it and find out.

IFEval

Training · Instruction-Following Eval

A benchmark of machine-verifiable instructions that measures how precisely a model obeys format and constraint requests.

IFEval (Instruction-Following Eval) uses prompts whose compliance can be checked programmatically — answer in exactly three bullet points, avoid a given word, respond in JSON. Because each rule is machine-verifiable, it scores obedience objectively, with no human or judge model in the loop.

Example: Given an instruction to write two paragraphs and end with a specific word, IFEval checks both conditions automatically; missing either one counts as a fail.

Inference

Fundamentals · serving

Running a trained model to produce outputs — the deployment side, as opposed to training.

Inference is using a trained model to generate predictions for real inputs. For LLMs it is autoregressive: produce one token, append it, repeat. Latency, throughput, and memory (the KV-cache) are the central concerns, distinct from the one-time cost of training.

Example: Typing a prompt into a chatbot and watching tokens stream back is inference; the KV-cache and sampling settings shape its speed and style.

INT4

Quantization · 4-bit

4-bit integer weights — the aggressive quantization that makes big models fit on small hardware.

INT4 stores each weight in 4 bits (16 levels), roughly 8x smaller than FP32. Schemes like GPTQ, AWQ, and NF4 pick scales and zero-points to preserve quality. Small models tolerate 4-bit well; frontier models often need 8-bit for the same fidelity.

Example: A 7B model in INT4 is ~4GB and runs on a laptop; a 671B MoE at Q4 fits a 1TB SSD for layer-streamed inference.

KICE

QuKaiZen · Knowledge Injection & Corpus Evolution

QuKaiZen's agent that extracts certified, verifiable domain knowledge in six layers.

KICE mines a corpus for rare concepts, edge cases, historical conflicts, subsystem interactions, nuanced reasoning, and ambiguity — knowledge that can be verified against authoritative sources. It feeds the distillation pipeline with high-quality, checkable material.

Example: For a Linux-kernel skill, KICE surfaces a subtle locking edge case documented in a 2009 mailing-list thread.

KV-Cache

Inference · key-value cache

Cached key/value tensors from past tokens so generation does not recompute the whole sequence each step.

During autoregressive generation each new token attends to all previous tokens. The KV-cache stores the keys and values already computed, so each step processes only the new token instead of recomputing the whole prefix, which cuts per-step work from quadratic to linear in sequence length. Beyond the weights themselves, it is the main consumer of inference memory.

Example: Generating token 1000 reuses 999 cached K/V pairs; only the new token's attention is computed. vLLM's PagedAttention manages this cache efficiently.

Latency

Inference

The delay before and during a model's response — time-to-first-token and per-token time.

Latency is how quickly a single request responds, distinct from throughput (total volume). Keeping the model warm and prefetching weights cut it.

Example: Warm-keeping the SLM drops a dictionary lookup from ~17s cold to a couple of seconds.

Layer Streaming

QuKaiZen · layer-by-layer inference

Load one transformer layer from disk, compute, discard — running 400B+ models on tiny VRAM.

Layer-streaming inference (AeroLLM's core primitive) streams a model layer by layer from SSD: load a layer's weights, compute, free, repeat. It trades latency for the ability to run frontier-scale teachers (70B-671B) on commodity hardware with a few GB of VRAM.

Example: A 671B MoE at Q4 streams off a 1TB SSD on a MacBook: slow per token, but a background swarm does not mind waiting for depth.

LayerNorm

Architecture · Layer Normalization

Normalizes activations within each layer to keep training stable; modern LLMs often use RMSNorm.

Layer normalization rescales each token's activation vector to zero mean and unit variance (RMSNorm skips the mean), stabilizing and speeding training. Placement (pre-norm vs post-norm) and the variant chosen materially affect deep-transformer stability.

Example: Llama-style models apply pre-RMSNorm before attention and the feed-forward block for stable deep training.
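The difference between the two normalizations can be sketched on a single token's activation vector. This sketch leaves out the learned gain (and, for LayerNorm, bias) that real layers apply afterwards, and the `eps` value is an assumed typical default.

```python
import math

def layer_norm(x, eps=1e-5):
    """Shift to zero mean, then scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rms_norm(x, eps=1e-5):
    """Scale by the root mean square only; the mean is left alone."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
print([round(v, 3) for v in ln])  # centered around zero
print([round(v, 3) for v in rn])  # same signs as the input, rescaled
```

RMSNorm skips the mean subtraction, which saves a reduction per token while keeping the scale of activations under control.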

LLM

Fundamentals · Large Language Model

A transformer trained on vast text to predict the next token, yielding broad language ability.

A large language model is a big transformer trained on internet-scale text with next-token prediction; scale plus instruction tuning yields general capability. QuKaiZen distills that capability into small, owned models.

Example: GPT-4, Claude, and Llama are LLMs; a 1–7B Super Skill is a small, specialized descendant.

Logits

Fundamentals · logit

The model's raw, unnormalized output scores over the vocabulary, before softmax makes them probabilities.

Logits are the final layer's raw scores, one per vocabulary token, not yet normalized into probabilities. Temperature and top-k act on the logits directly before softmax converts them into the next-token distribution; top-p (nucleus) filtering is then applied to the resulting probabilities.

Example: Dividing logits by a temperature of 0.2 sharpens them, making the top token far more likely after softmax.
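The effect of temperature can be checked with a few lines of Python. The three logits below are made-up values for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
neutral = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.2)
print([round(p, 3) for p in neutral])  # top token gets about 0.665
print([round(p, 3) for p in sharp])    # top token gets about 0.993
```

The ranking of tokens never changes; only how much probability mass the leader takes.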

LoRA

Fine-tuning · Low-Rank Adaptation

Fine-tune a model by training tiny low-rank adapter matrices while the base weights stay frozen.

LoRA freezes the original weights and injects small trainable rank-decomposition matrices into each layer. You train only those low-rank matrices — often under 1% of the parameters — which slashes memory and lets a single GPU fine-tune models that would otherwise need a cluster.

Example: Fully fine-tuning a 7B model needs ~60GB+; with LoRA you train ~10-50MB of adapters in ~10GB, then merge or hot-load them at inference.
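A toy sketch of a LoRA-augmented linear layer, assuming the usual formulation: the frozen weight W is left alone and a low-rank product B·A, scaled by alpha/r, is added to its output. The matrix sizes, the `alpha` value, and the helper names are illustrative.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16):
    """y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)                       # rank = number of rows in A
    base = matvec(W, x)              # frozen path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def lora_param_counts(d_out, d_in, r):
    """Return (full weight parameters, LoRA parameters) for one matrix."""
    return d_out * d_in, r * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # r x d_in, with r = 1
B = [[0.0], [0.0]]    # d_out x r, zero-initialized as is conventional
print(lora_forward(W, A, B, [2.0, 3.0]))  # [2.0, 3.0]: no change before training
print(lora_param_counts(4096, 4096, 8))   # (16777216, 65536)
```

For a 4096 by 4096 matrix at rank 8, the adapter holds about 0.4% of the parameters of the weight it modifies.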

MCP

Architecture · Model Context Protocol

An open standard for connecting models to tools and data sources.

MCP lets agents discover and call external tools, resources, and data through a uniform interface, so capabilities plug in without bespoke glue per integration.

Example: An agent connects to a load-board MCP server and instantly gains 'list loads' and 'book load' tools.

MLX

Formats & Runtime

Apple's array framework for running and training models on Apple Silicon's unified memory.

MLX uses the shared CPU/GPU memory of Apple Silicon for zero-copy inference and fine-tuning — no host↔device transfers and much lower power. AeroLLM targets it for Mac deployments.

Example: On an M-series Mac, MLX runs a streamed model against unified memory with ~83% less power than a discrete GPU.

MMLU

Training · Massive Multitask Language Understanding

A benchmark of ~16,000 multiple-choice questions across 57 subjects, measuring an LLM's breadth of knowledge.

MMLU (Massive Multitask Language Understanding) tests a model with four-option multiple-choice questions spanning 57 subjects, from elementary math to law, medicine, and ethics. It is the standard yardstick for general knowledge and reasoning breadth; scores range from 25% (random guessing) to roughly 90% for frontier models.

Example: An MMLU item might pose a college-level biology fact with four choices; a model scoring 70% got 70% of questions right, averaged across all 57 subjects.

MoE

Architecture · Mixture of Experts

A model split into many expert sub-networks where a router activates only a few per token.

Mixture-of-Experts replaces a dense layer with many parallel expert networks plus a router that picks a small subset (e.g., 2 of 64) per token. Total parameters balloon while compute per token stays modest — huge capacity at a fraction of dense FLOPs.

Example: A 671B-parameter MoE might activate only ~37B per token, so it runs far cheaper than a dense 671B model.

Multi-Agent

Architecture

Several specialized agents collaborating, each owning a function.

A multi-agent system splits work across specialist agents that hand off to one another — often cheaper and more reliable than one giant generalist. PaperAgents reconciles a declared team of them.

Example: Dispatch, billing, and safety agents each handle their domain and pass tasks along.

Nucleus Seal

QuKaiZen

An Ed25519 cryptographic provenance chain proving how a Super Skill model was made.

The Nucleus Seal binds a model's DNA — teacher hash, corpus hash, pipeline config, audit, and AutoResearch report — into a signed Ed25519 chain. It is cryptographic proof the pipeline distilled the model correctly, and seals are dynamically monitored and revocable.

Example: Each model version is minted with a Seal linking it to the exact teacher and corpus that produced it, so provenance is verifiable.

Orchestration

Architecture

Coordinating multiple agents or services into one coherent flow.

Orchestration sequences and supervises the parts of a multi-step system — who runs when, with what inputs — handling retries and handoffs. QuKaiZen orchestrates the swarm; PaperAgents orchestrates a team.

Example: The orchestrator fans work to dispatch, waits, then hands results to billing.

PagedAttention

Inference

Storing the KV-cache in non-contiguous pages so long contexts fit without waste.

PagedAttention (from vLLM) manages attention key/value cache in fixed-size pages like virtual memory, eliminating fragmentation and letting many requests share memory — large serving-throughput gains.

Example: Paged KV-cache lets a server batch far more concurrent long-context requests.

PEFT

Fine-tuning · Parameter-Efficient Fine-Tuning

An umbrella for methods (LoRA, adapters, prefix-tuning) that tune a tiny fraction of parameters.

PEFT covers techniques that adapt a model by training only a small set of new or selected parameters while freezing the rest — LoRA, adapters, prefix/prompt tuning, and more. It is also the name of Hugging Face's library implementing them.

Example: Using the PEFT library, you wrap a base model with a LoRA config and train under 1% of its parameters.

Perplexity

Fundamentals · PPL

A measure of how surprised a model is by text — lower means it predicts the text better.

Perplexity is the exponentiated average negative log-likelihood a model assigns to a sequence — roughly the effective number of equally likely choices it faces each step. Lower is better, but it is an intrinsic metric, not a substitute for task benchmarks.

Example: A model with perplexity 10 on a test set is about as uncertain as choosing uniformly among 10 tokens each step.
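The definition above translates directly into code. The input is the probability the model assigned to each actual next token; the probabilities below are made up to reproduce the perplexity-10 example.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives every correct token probability 0.1 is as uncertain
# as a uniform choice among 10 options.
uniform_ten = perplexity([0.1] * 5)
confident = perplexity([0.9, 0.8, 0.95])
print(round(uniform_ten, 6))  # 10.0
print(round(confident, 3))    # close to 1: the model is rarely surprised
```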

PPO

RL & Alignment · Proximal Policy Optimization

The RL algorithm classically used to optimize a model against a reward model in RLHF.

PPO is a policy-gradient method that improves a model while clipping each update to stay close to the previous policy, preventing destructive jumps. In RLHF it is the optimizer that pushes the model to maximize reward-model scores.

Example: During RLHF, PPO raises the probability of high-reward responses but clips the step if the new policy strays too far from the old one.
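The clipping can be sketched as PPO's per-sample surrogate objective. Here `ratio` is the new policy's probability of an action divided by the old policy's, `advantage` says how much better than expected the action was, and the `clip_eps` value of 0.2 is a commonly used setting assumed for illustration.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the [1 - eps, 1 + eps] band.
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_objective(1.1, 1.0))   # 1.1: inside the band, unclipped
print(ppo_clipped_objective(1.5, 1.0))   # 1.2: gain capped at 1 + eps
print(ppo_clipped_objective(0.5, -1.0))  # -0.8: penalty floor at 1 - eps
```

Once the ratio leaves the band in the direction the advantage favors, the objective goes flat, so the gradient stops pushing the policy further from the old one.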

Prefetch

Inference

Loading the next layer from disk while the current compute runs, hiding I/O latency.

Prefetching overlaps disk reads with computation: while the GPU works on layer N, layer N+1 is already streaming in, so the model rarely waits on storage. It's what makes layer streaming fast.

Example: AeroLLM prefetches the next shard so the GPU stays busy instead of stalling on the SSD.

Prompt

Fundamentals

The input text you give a model to steer what it does.

A prompt is the instruction plus context handed to a model at inference time. Prompt engineering tunes wording, examples, and structure to elicit better output — but unlike training, it never changes the model's weights.

ExampleAdding 'think step by step' to a prompt can lift accuracy on reasoning tasks with no retraining.

Provenance

QuKaiZen

A verifiable record of exactly what went into a model and how it was built.

Provenance is the chain of custody for a model: which teacher, which corpus version, which config and audits. QuKaiZen hashes each artifact and signs the chain so the lineage is tamper-evident.

ExampleThe provenance chain lets anyone verify a sealed model was distilled from the stated teacher and corpus.
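
A tamper-evident chain of this kind can be sketched with a hash chain. SHA-256 and the record layout are assumptions made for illustration; the source does not specify them, and the signing step is omitted because it needs a cryptography library.

```python
import hashlib
import json


def chain_digest(artifacts):
    """Fold (name, payload_bytes) pairs into one digest.

    Each link commits to the artifact's hash and to the previous link,
    so changing any artifact changes the final digest.
    """
    previous = ""
    for name, payload in artifacts:
        record = json.dumps(
            {
                "name": name,
                "sha256": hashlib.sha256(payload).hexdigest(),
                "prev": previous,
            },
            sort_keys=True,
        ).encode("utf-8")
        previous = hashlib.sha256(record).hexdigest()
    return previous


original = chain_digest([("teacher", b"teacher-v1"), ("corpus", b"corpus-v3")])
tampered = chain_digest([("teacher", b"teacher-v1"), ("corpus", b"corpus-v4")])
print(original != tampered)  # True
```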

QLoRA

Fine-tuningQuantized LoRA

LoRA on top of a 4-bit quantized base model — fine-tune big models on one consumer GPU.

QLoRA quantizes the frozen base to 4-bit (NF4) to shrink its footprint, then trains LoRA adapters on top in higher precision, with gradients flowing through the quantized weights via dequant-on-the-fly. Near-full-fine-tune quality at a fraction of the VRAM.

ExampleQLoRA made it possible to fine-tune a 65B model on a single 48GB GPU — previously impossible without multiple A100s.

Quantization

Quantizationquantisation

Storing weights/activations in fewer bits (FP16 to INT4) to shrink models and speed inference.

Quantization maps high-precision weights to a smaller numeric type (8-bit, 4-bit, ...) using a scale and zero-point, trading a little accuracy for big savings in memory and bandwidth. It is what lets frontier-scale models run on commodity hardware.

ExampleQuantizing a 13B model from FP16 (26GB) to Q4 (~7GB) lets it load on a single consumer GPU.
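
The scale and zero-point mapping, and the memory arithmetic behind the example, can be sketched as follows. This is a simple per-tensor affine scheme chosen for illustration; real 4-bit formats typically quantize in small groups and store extra scale data, which is one plausible reason a 13B model lands nearer 7GB than the raw 6.5GB.

```python
def quantize(values, bits=4):
    """Affine quantization: code = round(v / scale) + zero_point, clipped."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant input
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    return [(code - zero_point) * scale for code in codes]


def weight_gigabytes(num_params, bits):
    """Raw weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return num_params * bits / 8 / 1e9


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, zero_point = quantize(weights, bits=4)
restored = dequantize(codes, scale, zero_point)
print(weight_gigabytes(13e9, 16), weight_gigabytes(13e9, 4))  # 26.0 6.5
```

Each restored value differs from its original by at most about half a quantization step, which is the "little accuracy" being traded away.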

RAFT

Fine-tuningRetrieval Augmented Fine-Tuning

Fine-tuning that teaches a model to reason over retrieved docs while ignoring distractors.

RAFT trains on a question plus a mix of oracle (relevant) and distractor (irrelevant) documents, teaching the model to cite the right source and ignore noise. The result reasons through imperfect retrieval rather than memorizing — domain-specific RAG baked into the weights.

ExampleFor a kernel-bug question, RAFT shows the real commit (oracle) plus two unrelated patches (distractors); the model learns to ground its answer in the oracle.
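
A single training example of this shape can be sketched as below. The field names and prompt layout are hypothetical, and published RAFT recipes also vary whether the oracle is present at all, which this sketch leaves out.

```python
import random


def build_raft_example(question, oracle_doc, distractor_docs, answer, seed=0):
    """Mix the oracle with distractors and shuffle so position gives no hint."""
    documents = [oracle_doc] + list(distractor_docs)
    random.Random(seed).shuffle(documents)
    context = "\n".join(f"[doc {i}] {doc}" for i, doc in enumerate(documents))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "completion": answer,
    }


example = build_raft_example(
    question="Which commit fixed the use-after-free?",
    oracle_doc="commit A: free the buffer only after the last reader exits",
    distractor_docs=["commit B: rename a config option", "commit C: fix a typo"],
    answer="Commit A, because it delays the free until readers are done.",
)
print(example["prompt"])
```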

RAG

ArchitectureRetrieval-Augmented Generation

Fetch relevant documents at query time and feed them to the model as context.

RAG retrieves passages from a knowledge store and injects them into the prompt, so the model answers from fresh, specific data rather than memory. It's the opposite of distillation — knowledge stays external and looked-up.

ExampleA support bot retrieves the latest policy doc and answers from it, with no retraining when the policy changes.
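
The retrieve-then-inject flow can be sketched with a toy word-overlap retriever. Production systems generally use embedding similarity and a vector store instead; the function names and prompt wording here are assumptions.

```python
import re


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = words(query)
    ranked = sorted(
        documents, key=lambda doc: len(query_words & words(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=1):
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


policies = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 5 business days.",
]
prompt = build_prompt("How many days do refunds take to be accepted?", policies)
print(prompt)
```

Updating the policy only means replacing the string in `policies`; nothing about the model changes.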

Reasoning

Fundamentals

A model working through a problem in intermediate steps instead of answering in one leap.

Reasoning is a model's ability to chain intermediate inferences — premises, rules, constraints, cross-references — toward a conclusion, rather than pattern-matching a final answer. Chain-of-thought elicits it; distillation transfers and sharpens it into small models.

ExampleGiven a multi-step word problem, a reasoning model writes each step ('first the rate, then the time…') and lands the answer far more reliably than guessing.

Reconciliation

Architecturereconcile

Continuously closing the gap between the team you declared and the team that's running.

Borrowed from infrastructure (Kubernetes-style control loops), reconciliation compares desired state to observed state and converges them; a watcher fixes drift forever after. PaperAgents applies it to agent teams.

ExampleDeclare four agents; the watcher notices one died and restarts it to match the manifest.
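
One pass of such a control loop can be sketched as a set difference. This is a generic illustration rather than PaperAgents' API; the agent names are made up.

```python
def reconcile(desired, observed):
    """Return the actions that would make the observed set match the desired set."""
    to_start = sorted(desired - observed)
    to_stop = sorted(observed - desired)
    return [("start", name) for name in to_start] + [
        ("stop", name) for name in to_stop
    ]


desired = {"collector", "analyst", "evaluator", "reporter"}
observed = {"collector", "analyst", "reporter"}  # the evaluator died
print(reconcile(desired, observed))  # [('start', 'evaluator')]
```

A watcher would run this comparison repeatedly and apply the returned actions, so drift is corrected whenever it appears.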

Research

Fundamentalsautoresearch

Systematic inquiry — forming hypotheses, running experiments, and measuring results.

Research is the disciplined loop of asking a question, forming a hypothesis, experimenting, and measuring. ARAIL is built to run that loop with AI: autoresearch agents gather sources, probe ideas, and surface what's interesting.

ExampleARAIL's agents pull recent papers, summarize the state of the art, and propose the next experiment to run.

RLHF

RL & AlignmentReinforcement Learning from Human Feedback

Align a model to human preferences via a reward model trained on human rankings, then RL.

RLHF collects human comparisons of model outputs, trains a reward model to predict which response people prefer, then fine-tunes the policy with reinforcement learning (usually PPO) to maximize that reward. It is how raw pretrained models became helpful, harmless assistants.

ExampleGiven two answers to 'explain recursion', humans pick the clearer one; the reward model learns that preference; PPO nudges the model toward it.
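
The reward-model step is commonly trained with a pairwise (Bradley-Terry style) loss, sketched below for one comparison. The function name is hypothetical, and the rewards are assumed to be plain scalar scores.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


print(preference_loss(2.0, 0.0))  # small: the preferred answer already scores higher
print(preference_loss(0.0, 2.0))  # large: the ranking is the wrong way round
```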

RoPE

ArchitectureRotary Position Embedding

Encodes token position by rotating query/key vectors — the dominant positional scheme in modern LLMs.

Rotary Position Embeddings inject position by rotating query and key vectors by an angle proportional to their position, so attention naturally depends on relative distance. RoPE extrapolates to longer contexts better than learned absolute embeddings and underlies most current LLMs.

ExampleRoPE scaling tricks (NTK, YaRN) stretch a model trained at 4k context to 32k+ by adjusting the rotation frequencies.
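
The rotation, and the relative-distance property it gives attention scores, can be sketched in plain Python. This uses the adjacent-pair layout and a base of 10000; some implementations pair dimensions differently, so treat the layout as an assumption.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    dim = len(vector)
    if dim % 2:
        raise ValueError("dimension must be even")
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]
# Same offset of 2 between query and key, at different absolute positions.
near = dot(rope(query, 3), rope(key, 1))
far = dot(rope(query, 12), rope(key, 10))
print(abs(near - far) < 1e-9)  # True
```

Context-extension methods change how `theta` is computed; the rotation itself stays the same.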

Rubric

QuKaiZen

The evolving scoring criteria AutoResearch uses to probe and grade the student.

Rubrics are structured criteria that drive the Interrogator's probes, the Adversary's traps, and the Evaluator's scoring. AutoResearch evolves them over time as new failure modes are discovered.

ExampleA rubric for the kernel domain weights memory-safety reasoning heavily, so the swarm probes it hardest.

SafeTensors

Formats & Runtime

A safe, fast, zero-copy tensor file format — the modern replacement for pickle-based checkpoints.

SafeTensors stores weights in a simple, memory-mappable layout with no arbitrary code execution (unlike Python pickle, which can run malicious code on load). It loads fast via zero-copy and is now the default for sharing weights on the Hugging Face Hub.

Examplemodel.safetensors loads almost instantly via mmap and cannot execute hidden code, unlike a .bin/.pt pickle.
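
The layout is simple enough to sketch by hand: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw bytes. The helpers below are a toy illustration of that layout, not the safetensors library, and they skip the validation a real reader performs.

```python
import json
import struct


def pack(tensors):
    """tensors maps name -> (dtype, shape, raw_bytes)."""
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        blobs.append(raw)
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + b"".join(blobs)


def read_header(blob):
    """Parsing is just struct + JSON, so no code from the file is executed."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])


blob = pack({"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
print(read_header(blob))
```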

Scaling Laws

Trainingneural scaling laws

Empirical power-law curves showing model loss falls predictably as parameters, data, and compute grow.

Scaling laws are power-law relationships found empirically: a model's loss drops smoothly and predictably as parameters, training tokens, and compute increase together. They let labs forecast a model's capability before training it, and they later motivated training smaller models on far more data (compute-optimal, Chinchilla-style).

ExampleScaling laws predicted how much a 10x larger compute budget would cut loss, so a lab could plan a frontier run's size and data in advance.
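
The forecasting step can be sketched with the simplest power-law form, loss = a * compute^(-b). The measurements below are made up for illustration, and real fits usually include an irreducible-loss term and many more data points.

```python
import math


def fit_exponent(compute_a, loss_a, compute_b, loss_b):
    """Exponent b in loss = a * compute**(-b), from two measured runs."""
    return -math.log(loss_b / loss_a) / math.log(compute_b / compute_a)


def predicted_loss(loss_now, compute_multiplier, exponent):
    """Loss after scaling compute by compute_multiplier under the same power law."""
    return loss_now * compute_multiplier ** (-exponent)


# Hypothetical runs: 10x more compute took loss from 2.50 to 2.25.
exponent = fit_exponent(1e20, 2.50, 1e21, 2.25)
forecast = predicted_loss(2.25, 10, exponent)
print(round(exponent, 4), round(forecast, 3))
```

Under a pure power law every further 10x of compute multiplies the loss by the same factor, which is what makes the curve usable for planning.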

SCoTD

Fine-tuningSymbolic Chain-of-Thought Distillation

Distill a teacher's step-by-step reasoning into a small model via many symbolic CoT traces.

Symbolic Chain-of-Thought Distillation samples multiple chain-of-thought rationales from a large teacher and trains a small student on them, so even a 1-3B model learns to reason in explicit steps rather than pattern-match. It is a key reason small QuKaiZen students can think.

ExampleA 1.3B student trained on 175B-teacher CoT traces learns to lay out premise, rule, then conclusion on its own.

Seal

QuKaiZenNucleus Seal

A cryptographic signature certifying a model's provenance — what it was distilled from and that it is untampered.

A seal is a cryptographic signature (QuKaiZen uses Ed25519) bound to a finished model, certifying its provenance: which teacher and corpus it came from, which certification gates it passed, and that its weights have not changed since. Anyone can verify the seal offline, so an owned model carries proof of exactly what it is — the Nucleus Seal.

ExampleBefore trusting a distilled 3B model in production you verify its Ed25519 seal; if a single weight changed, verification fails.

SFT

TrainingSupervised Fine-Tuning

Plain supervised training on curated input-to-output examples; the first step of post-training.

SFT fine-tunes a pretrained model on labeled prompt/response pairs so it learns to follow instructions in a target format or domain. It is the foundation step before preference alignment (RLHF/DPO) and the simplest way to specialize a base model.

ExampleTrain on 10k (instruction, ideal answer) pairs so a base model answers like a helpful assistant instead of just continuing text.

Softmax

Fundamentalssoftmax function

Turns a vector of logits into a probability distribution that sums to 1.

Softmax exponentiates each logit and divides by the sum, producing positive values that add to 1, which is a probability distribution. At the output it turns logits into the probabilities the next token is sampled from; inside attention, it weights how much each token attends to the others.

ExampleLogits [2.0, 1.0, 0.1] become probabilities about [0.66, 0.24, 0.10] after softmax.
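
The example numbers can be reproduced with a few lines. Subtracting the largest logit first is a standard numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```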

Speculative Decoding

Inferencespeculative sampling

A small draft model proposes several tokens; the big model verifies them in one pass — lossless speedup.

Speculative decoding runs a cheap draft model to guess the next few tokens, then the large target model verifies them all in a single forward pass, accepting the longest prefix it agrees with and correcting the first mismatch. The output distribution is identical to normal decoding, but throughput rises (2-3x is commonly reported) because the expensive model runs less often.

ExampleThe draft proposes 5 tokens; the target accepts the first 4 and corrects the 5th, so 5 tokens are produced for roughly one big-model step.
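
One draft-and-verify step can be sketched for the greedy case. `draft_next` and `target_next` are hypothetical callables that return the next token for a sequence; a real system scores all drafted positions in one batched forward pass, which this loop only imitates.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=5):
    """Return the tokens emitted by one draft-and-verify step (greedy variant)."""
    proposed = []
    for _ in range(num_draft):
        proposed.append(draft_next(prefix + proposed))

    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


TARGET = ["the", "cat", "sat", "on", "the", "mat", "."]
DRAFT = ["the", "cat", "sat", "on", "a", "mat", "."]
step = speculative_step([], lambda s: DRAFT[len(s)], lambda s: TARGET[len(s)])
print(step)  # ['the', 'cat', 'sat', 'on', 'the']
```

The step emits the four accepted draft tokens plus the target's own fifth token, and the result matches what the target alone would have generated.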

SSDP

QuKaiZenSuper Skill Distillation Pipeline

QuKaiZen's pipeline that distills deep reasoning from frontier teacher models into small, owned Super Skill models.

The Super Skill Distillation Pipeline (SSDP) extracts deep domain reasoning from 400B+ frontier teacher models and crystallizes it into small 1-7B Super Skill models that run on commodity hardware, air-gapped, and owned forever. It is not RAG — a Super Skill knows its domain. Nucleus implements it: KICE/TICE knowledge extraction, RAFT, Symbolic Chain-of-Thought distillation, an adversarial swarm that trains the student to convergence, three certification gates, and an Ed25519 Nucleus Seal.

ExampleSSDP can take a frontier model's mastery of a regulatory domain and mint a 3B model that answers offline at a fraction of the energy — high Wisdom per Watt.

Student Model

QuKaiZen

The small model being trained to absorb the teacher's reasoning.

The student is the compact model (1–7B) that learns from the teacher's traces, RAFT data, and adversarial correction — ending as a sealed Super Skill that can beat its teacher in-domain.

ExampleAfter distillation the 3B student out-reasons its 500B teacher inside the target domain.

Super Skill

QuKaiZenSuper Skill Model

A 1-7B model that durably knows a domain, distilled from a frontier teacher and owned forever.

A Super Skill Model is the output of QuKaiZen's pipeline: a small (1-7B) model that crystallizes a frontier teacher's deep reasoning for a domain, runs on commodity or air-gapped hardware, and keeps improving. It knows — it does not look things up like RAG.

ExampleA Linux-Kernel Super Skill trained on 30 years of commits, CVEs, and mailing lists reasons about kernel bugs offline.

Symbolic Chain-of-Thought

QuKaiZenSymbolic CoT

Capturing a teacher's reasoning as reusable symbolic structure, not just imitated text traces.

Symbolic Chain-of-Thought captures the structure of a teacher's reasoning — the steps, rules, and relationships — in symbolic form rather than copying surface-level wording. Distilling that structure (SCoTD) teaches a small student to reason faithfully instead of mimicking phrasing, which is what makes a Super Skill robust rather than brittle.

ExampleInstead of memorizing one solution's wording, a student trained on symbolic CoT learns the underlying procedure and applies it to unseen problems.

Teacher Model

QuKaiZen

The large frontier model whose reasoning is distilled into a small student.

In distillation the teacher is the big, capable model (400B+) that generates reasoning traces and judgments; the student learns to reproduce its competence in-domain. QuKaiZen uses two-tier teachers for breadth and depth.

ExampleA 400B teacher writes step-by-step solutions that train a 3B student to match it in-domain.

Temperature

Inferencesampling temperature

A knob for randomness in generation — low is focused/deterministic, high is creative/diverse.

Temperature divides the logits before softmax: below 1 sharpens the distribution (safer, more repetitive), above 1 flattens it (more diverse, more errors). At 0 the division is undefined, so implementations fall back to greedy (argmax) decoding. It is the simplest lever for output style.

ExampleUse temperature 0.2 for code or facts; 0.9 for brainstorming or creative writing.
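
The effect of the knob can be sketched by dividing the logits before normalizing. The logits are arbitrary illustrative values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
focused = softmax_with_temperature(logits, 0.2)
diverse = softmax_with_temperature(logits, 0.9)
print(round(focused[0], 3), round(diverse[0], 3))
```

At 0.2 nearly all of the probability sits on the top token; at 0.9 the alternatives keep a meaningful share.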

Throughput

Inference

How many tokens a system generates per unit time, across all requests.

Throughput measures total tokens/second a serving stack produces; it trades off against per-request latency. Speculative decoding and batching push it up.

ExampleSpeculative decoding lifts AeroLLM throughput up to 7× on 70B+ teachers.

TICE

QuKaiZenTacit knowledge Injection & Corpus Evolution

QuKaiZen's agent for Layer-7 tacit knowledge — the unwritten expert know-how and gotchas.

TICE extracts implicit, tribal knowledge — folklore, gotchas, and esoteric patterns that are not formally documented. It is the highest-value extractor for user-data-enriched (Mode 2/3) Super Skills, capturing expertise that lives only in practitioners' heads.

ExampleTICE captures a farmer's unwritten rule of thumb about soil timing that no manual records, then teaches it to the student.

Tokenizer

Fundamentalstokenization

Splits text into tokens (subword units) the model actually reads, and back again.

A tokenizer converts raw text into integer token IDs (and back) using a learned vocabulary, usually via subword schemes like BPE or SentencePiece. Token count drives context limits and cost, and odd tokenization explains many model quirks.

Example'tokenization' might split into ['token', 'ization']; rare words and emoji can become many tokens, inflating cost.
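
The split can be sketched with a greedy longest-match tokenizer over a toy vocabulary. This is a simplification: BPE proper applies learned merge rules rather than longest-match lookup, and the vocabulary here is invented.

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split, falling back to single characters."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches, so emit one character.
            tokens.append(word[start])
            start += 1
    return tokens


vocab = {"token", "ization", "iz", "ation"}
print(tokenize("tokenization", vocab))  # ['token', 'ization']
print(tokenize("xyz", vocab))           # ['x', 'y', 'z']
```

The second call shows why rare strings are expensive: with no matching subwords, every character becomes its own token.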

Tool Use

Architecture

A model invoking external tools — APIs, code, search — to act beyond text.

Tool use lets a model call functions (query a database, run code, hit an API) and fold the results back into its reasoning, turning a language model into an actor in real systems.

ExampleAsked today's freight rate, the agent calls a rates API instead of guessing.

Transformer

Architecturetransformer architecture

The attention-based neural architecture behind essentially every modern LLM.

The transformer stacks blocks of multi-head attention and feed-forward layers with residual connections and normalization, processing all tokens in parallel. Introduced in 'Attention Is All You Need' (2017), it scales beautifully and underpins GPT, Llama, and the rest.

ExampleA 7B decoder-only transformer is about 32 such blocks; depth and width set the parameter count.
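
How depth and width set the parameter count can be sketched with simple arithmetic. The dimensions below are assumptions matching a commonly cited Llama-style 7B layout (no biases, gated MLP, untied output head); other 7B models differ.

```python
def llama_style_param_count(d_model=4096, d_ffn=11008, n_layers=32, vocab=32000):
    """Count the weights in a decoder-only transformer with a gated MLP."""
    attention = 4 * d_model * d_model  # query, key, value, and output projections
    mlp = 3 * d_model * d_ffn          # gate, up, and down projections
    norms = 2 * d_model                # two norm weight vectors per block
    per_block = attention + mlp + norms
    embeddings = 2 * vocab * d_model   # input embedding plus output head
    final_norm = d_model
    return n_layers * per_block + embeddings + final_norm


print(f"{llama_style_param_count():,}")  # 6,738,415,616
```

Almost all of the total sits in the 32 blocks; the embeddings contribute only about 0.26B.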

Triton

Formats & RuntimeOpenAI Triton

A Python-like language for writing fast GPU kernels without hand-writing CUDA C++.

Triton (from OpenAI) lets researchers write custom GPU kernels in a Python-like syntax that compiles to efficient code, making fused high-performance ops far easier to author. Many modern kernels, including FlashAttention implementations, are written in Triton.

ExampleA fused softmax written in about 30 lines of Triton can beat a naive PyTorch version by a wide margin.

Verifier

Inference

The target-model pass that accepts or corrects speculatively drafted tokens.

In speculative decoding the verifier is the large model's single forward pass that checks the draft's proposed tokens — keeping the correct prefix and resampling the first mismatch — which guarantees the same distribution as decoding normally.

ExampleOf 5 drafted tokens the verifier accepts 4 and corrects the 5th, all in one pass.
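
The distribution guarantee comes from the acceptance rule used in the speculative sampling literature: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The sketch below checks that rule on one position; the function names and example distributions are illustrative.

```python
def residual_distribution(target, draft):
    """Distribution to resample from after a rejection."""
    gaps = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(gaps)
    if total == 0:
        return list(target)  # draft equals target, so nothing is ever rejected
    return [gap / total for gap in gaps]


def output_distribution(target, draft):
    """Probability of emitting each token under accept-or-resample."""
    # Drafting token x with probability q and accepting with min(1, p/q)
    # contributes min(p, q) to the output mass of x.
    accepted = [min(p, q) for p, q in zip(target, draft)]
    reject_prob = 1.0 - sum(accepted)
    residual = residual_distribution(target, draft)
    return [a + reject_prob * r for a, r in zip(accepted, residual)]


target = [0.5, 0.3, 0.2]
draft = [0.2, 0.6, 0.2]
print(output_distribution(target, draft))  # matches target up to rounding
```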

vLLM

Inference

A high-throughput LLM serving engine; its PagedAttention manages the KV-cache like virtual memory.

vLLM maximizes GPU throughput via PagedAttention — treating the KV-cache as paged memory to eliminate fragmentation — plus continuous batching of incoming requests. It is the enterprise-grade backend for serving teacher models on GPUs.

ExampleQuKaiZen uses vLLM (TEACHER_BACKEND=vllm) to serve teachers on H100s with continuous batching and FP8.

Warmup

Traininglearning-rate warmup

Ramping the learning rate up from near zero over the first steps to avoid early instability.

Learning-rate warmup starts the LR small and increases it over the first few hundred or thousand steps before the main schedule (often cosine decay). Early gradients are noisy; warmup prevents large destabilizing updates while the optimizer's statistics settle.

Example500 warmup steps ramping to 2e-4, then cosine decay to near zero over the run.
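
The schedule in the example can be sketched as a function of the step number. The 10,000-step run length is an assumption added for illustration; the warmup length and peak come from the example.

```python
import math


def learning_rate(step, peak_lr=2e-4, warmup_steps=500, total_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 249, 499, 5_250, 10_000):
    print(step, f"{learning_rate(step):.2e}")
```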

Wisdom per Watt

QuKaiZenwisdom-per-watt

QuKaiZen's core metric: certified, permanently-owned reasoning capability per unit of lifetime energy to mint and run it.

Renting a frontier model burns full datacenter energy on every query, forever, and you own nothing. QuKaiZen spends energy once to distill that reasoning into a small model you keep — certified by three independent gates, sealed with cryptographic provenance, run locally at near-zero marginal cost. Capability only counts if it is proven and trustworthy: a model that scores high but fails the hallucination gate scores zero. The energy is not a running cost; it is capital spent on an asset you own forever.

ExampleRent a 400B model and the 100,000th query costs the same datacenter energy as the first — and you still own nothing. Mint a 3B Super Skill and you pay the energy once; after the break-even query, owning beats renting and the gap only widens.
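
The break-even query in the example is simple arithmetic, sketched below. All numbers are invented and unitless; they are not measurements of any real model.

```python
import math


def break_even_queries(mint_energy, rented_per_query, owned_per_query):
    """Queries after which minting once has cost less energy than renting."""
    saving = rented_per_query - owned_per_query
    if saving <= 0:
        return None  # owning never catches up
    return math.ceil(mint_energy / saving)


# Hypothetical energy units: 1000 to mint, 0.5 per rented query, 0.25 per owned query.
print(break_even_queries(1000.0, 0.5, 0.25))  # 4000
```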

Workflow

Architecture

A declared sequence of steps an agent or pipeline executes.

A workflow encodes the steps — download, analyze, decide, process — as configuration rather than ad-hoc code, so it's inspectable, versionable, and reproducible. PaperAgents declares them in TOML.

Example[[workflow]]: download loads → analyze margin → decide → process.

ZeRO

TrainingZero Redundancy Optimizer

DeepSpeed's optimizer that partitions optimizer state, gradients, and params to remove memory redundancy.

ZeRO eliminates the memory redundancy of vanilla data parallelism by partitioning optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across GPUs. It is the idea FSDP also implements, enabling trillion-parameter training.

ExampleZeRO-3 lets each of 64 GPUs hold only 1/64 of the parameters, gradients, and optimizer state, freeing memory for larger models or batches.
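
The per-GPU savings of each stage can be sketched with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state. The 7.5B model size is an illustrative assumption, and activations are not counted.

```python
def zero_gigabytes_per_gpu(num_params, num_gpus, stage):
    """Model-state memory per GPU in GB (10**9 bytes) for ZeRO stage 0 to 3."""
    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(zero_gigabytes_per_gpu(7.5e9, 64, stage), 1))
```

With 64 GPUs the footprint drops from 120GB without partitioning to under 2GB per GPU at stage 3.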