{"ok":true,"count":102,"terms":[{"slug":"ablation","term":"Ablation","category":"fundamentals","short":"Removing one component to measure how much it actually contributes.","definition":"An ablation study turns a part off — a layer, a loss term, a data source — and measures the drop, isolating what really matters. It's how you separate the ingredient from the marketing.","example":"Ablating the distractor documents shows RAFT's robustness gains came from training through noise.","related":["experiment","baseline"],"source":"QuKaiZen AI Dictionary"},{"slug":"adamw","term":"AdamW","aka":["Adam with weight decay"],"category":"training","short":"The default optimizer for training transformers — Adam with decoupled weight decay.","definition":"AdamW adapts the learning rate per parameter using running estimates of gradient mean and variance, and decouples weight decay from the gradient update for cleaner regularization. It is the workhorse optimizer for LLM training.","example":"A typical run: AdamW with lr=2e-4, betas=(0.9, 0.95), weight_decay=0.1, plus warmup and a cosine schedule.","related":["gradient","backprop","warmup"],"source":"authored"},{"slug":"adapters","term":"Adapters","aka":["adapter layers"],"category":"fine-tuning","short":"Small trainable modules inserted into a frozen model to add new skills without retraining it.","definition":"Adapters are tiny bottleneck layers added between a frozen model's existing layers; only the adapters train. They are a parameter-efficient way to teach new tasks, and you can keep a library of swappable adapters for one base. 
LoRA is a popular low-rank flavor of this idea.","example":"Ship one 7B base plus a 'legal' adapter and a 'medical' adapter; load whichever the task needs.","related":["lora","peft","fine-tune"],"source":"authored"},{"slug":"adversarial-swarm","term":"Adversarial Swarm","aka":["swarm"],"category":"qukaizen","short":"A loop of agents (interrogate, challenge, evaluate, correct) that hardens a model until it stops breaking.","definition":"The Adversarial Swarm Reactor pits Interrogator, Adversary, Evaluator, and Corrector agents (plus data-collection agents) against the student in cycles, systematically hunting and eliminating hallucination pathways. The model graduates not by passing a fixed test but when the swarm can no longer break it.","example":"The swarm keeps inventing harder kernel-bug traps until the student answers them all, then it graduates.","related":["convergence-graduation","super-skill","nucleus-seal"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"aerollm","term":"AeroLLM","aka":["AeroLLM"],"category":"qukaizen","short":"QuKaiZen's inference engine that streams frontier models off disk so they run without full GPU residency.","definition":"AeroLLM is the inference layer that makes disk-streamed teachers practical — layer streaming plus speculative decoding to claw back speed. 
It is how QuKaiZen serves 400B+ teachers on workstations that lack the VRAM to hold them.","example":"Point the teacher backend at AeroLLM to stream a 405B teacher on a single box instead of an 8x H100 node.","related":["layer-streaming","speculative-decoding","super-skill","vllm"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"agent","term":"Agent","category":"architecture","short":"An LLM that takes actions — calls tools, makes decisions — toward a goal, not just chats.","definition":"An agent wraps a model with tools, memory, and a control loop so it can plan, act, observe, and iterate. PaperAgents declares teams of small specialist agents; ARAIL's Buddy is a lab agent.","example":"A dispatch agent reads a load board, computes margin, and books profitable freight without a human in the loop.","related":["agentic","tool-use","multi-agent"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"agentic","term":"Agentic","category":"architecture","short":"Software built around autonomous, tool-using model agents.","definition":"Agentic systems give models autonomy to decide and act over many steps using tools and feedback, instead of producing a single response. The tradeoff is power vs. predictability — hence guardrails and declared workflows.","example":"An agentic workflow downloads data, analyzes it, decides, and processes — looping until the job is done.","related":["agent","tool-use","workflow"],"source":"QuKaiZen AI Dictionary"},{"slug":"alignment","term":"Alignment","category":"rl-alignment","short":"Making a model's behavior match human intent and values.","definition":"Alignment is the work of making models helpful, honest, and harmless — via methods like RLHF and DPO plus evaluation for refusal and faithfulness. 
Misalignment shows up as unsafe or off-intent output.","example":"RLHF aligns a base model so it follows instructions and declines harmful requests.","related":["rlhf","dpo","faithfulness"],"source":"QuKaiZen AI Dictionary"},{"slug":"attention","term":"Attention","aka":["self-attention","scaled dot-product attention"],"category":"architecture","short":"The mechanism that lets each token weigh and pull information from every other token.","definition":"Attention computes, for each token, a weighted sum of all tokens' value vectors, where weights come from the similarity (dot product) of its query with others' keys. It is how transformers model long-range relationships, and its quadratic cost is what FlashAttention and the KV-cache optimize.","example":"In 'the cat sat because it was tired', attention links 'it' back to 'cat' by giving that pair a high weight.","related":["transformer","flashattention","kv-cache","softmax"],"source":"authored"},{"slug":"automation","term":"Automation","category":"architecture","short":"Letting software run repeatable work end-to-end with no human in the loop.","definition":"Automation captures a repeatable process so it runs on its own, reliably and on schedule. PaperAgents automates with small specialist agents reconciled to a desired state.","example":"Invoicing that books, charges, and files itself every night.","related":["workflow","agent","reconcile"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"autoresearch","term":"AutoResearch","category":"qukaizen","short":"The swarm's brain — it evolves the rubrics every other agent consults.","definition":"AutoResearch is a first-class meta-agent that evolves the rubrics driving probes, traps, and scoring, and independently fact-checks certification. 
It is never merged into another service.","example":"AutoResearch notices repeated failures on edge cases and rewrites the rubric to target them next cycle.","related":["rubric","adversarial","convergence"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"backprop","term":"Backprop","aka":["backpropagation","backward pass"],"category":"training","short":"The algorithm that computes how to nudge every weight by propagating error gradients backward.","definition":"Backpropagation applies the chain rule to compute the gradient of the loss with respect to every parameter, flowing from the output layer back to the input. Those gradients tell the optimizer which direction to move each weight to reduce error.","example":"After a forward pass yields loss 2.3, backprop computes the gradient for every weight; AdamW then updates them.","related":["gradient","adamw","dropout"],"source":"authored"},{"slug":"beam-search","term":"Beam Search","aka":["beam search"],"category":"inference","short":"A decoding strategy that keeps the top-k partial sequences each step to find a higher-probability output.","definition":"Beam search explores several candidate sequences (beams) in parallel, expanding and pruning to the k most probable at each step. It usually finds higher-likelihood outputs than greedy decoding and, unlike sampling, is deterministic. That suits translation and structured tasks, but it can be bland for open-ended generation.","example":"With beam width 4, the decoder tracks the 4 best running sequences and returns the best completed one.","related":["temperature","logits","inference"],"source":"authored"},{"slug":"benchmark","term":"Benchmark","aka":["eval","evaluation"],"category":"fundamentals","short":"A standardized test set used to measure and compare model capability.","definition":"Benchmarks score models on fixed tasks — knowledge, reasoning, code — so results are comparable. 
QuKaiZen's Gate 1 uses MMLU, HellaSwag, ARC, GSM8K, and IFEval to verify capability survives distillation.","example":"A distilled student must retain ≥85% of its base model's MMLU score to pass the regression gate.","related":["mmlu","gsm8k","ifeval"],"source":"QuKaiZen AI Dictionary"},{"slug":"bf16","term":"BF16","aka":["bfloat16"],"category":"quantization","short":"A 16-bit float with the same exponent range as FP32 — the default precision for training LLMs.","definition":"bfloat16 keeps FP32's 8-bit exponent (the same huge dynamic range) but truncates the mantissa to 7 bits. That range makes it numerically stable for training without loss scaling, at half the memory and bandwidth of FP32.","example":"Most LLMs train in BF16 on A100/H100/TPU; weights are half the size of FP32 with no overflow headaches.","related":["fp8","int4","quantization"],"source":"authored"},{"slug":"buddy","term":"Buddy","aka":["ARAIL Buddy"],"category":"qukaizen","short":"ARAIL's local companion agent — a context-aware lab partner you learn alongside, running entirely on your own hardware.","definition":"ARAIL began with Buddy: a local agent to learn alongside. Buddy needed an environment, and that environment became a lab — pluggable, observable, and entirely owned by you. 
Buddy drives the lab in plain language and draws on your knowledge base for real context, so it can answer \"what should I do next?\" or \"what's interesting in today's pull?\" — offline, with no telemetry.","example":"Ask Buddy \"what's worth reading in today's arXiv pull?\" and it answers from your own knowledge base — no cloud round-trip, nothing leaving your machine.","related":["super-skill","aerollm"],"seeAlso":[{"label":"ARAIL lab","href":"/arail"},{"label":"ARAIL explainer","href":"/explainers/arail"}],"source":"ARAIL"},{"slug":"chain-of-thought","term":"Chain-of-Thought","aka":["CoT"],"category":"fundamentals","short":"Prompting a model to show its intermediate steps, which sharply improves reasoning.","definition":"Chain-of-thought elicits step-by-step intermediate reasoning before the final answer. Wei et al. (2022) showed it dramatically improves math and logic; QuKaiZen distills symbolic CoT into small students.","example":"Instead of just '42', a CoT response writes the derivation line by line, then states 42 — and is right far more often.","related":["reasoning","scotd","distillation"],"source":"QuKaiZen AI Dictionary"},{"slug":"checkpoint","term":"Checkpoint","aka":["model checkpoint"],"category":"training","short":"A saved snapshot of model weights (and often optimizer state) you can resume or deploy from.","definition":"A checkpoint persists the model's parameters — and during training, the optimizer state and step — so a run can resume after interruption or a version can be evaluated and shipped. 
Modern checkpoints use safetensors for safe, fast loading.","example":"Saving a checkpoint every 500 steps means a crash at step 1700 resumes from 1500, not from scratch.","related":["safetensors","gguf","fsdp"],"source":"authored"},{"slug":"continuous-batching","term":"Continuous Batching","aka":["in-flight batching"],"category":"inference","short":"Swapping requests in and out of a running batch every step to keep the GPU saturated.","definition":"Continuous (in-flight) batching removes finished sequences and adds new ones each step, instead of waiting for a whole batch to complete — dramatically improving serving throughput and latency.","example":"A server using continuous batching serves many users at once with no idle GPU gaps.","related":["paged-attention","throughput","latency"],"source":"QuKaiZen AI Dictionary"},{"slug":"convergence","term":"Convergence","category":"qukaizen","short":"Graduation by exhaustion — the model is done when the swarm can't break it anymore.","definition":"Rather than a fixed number of rounds, QuKaiZen runs until convergence: 95%+ evaluator scores, exhausted experiments, and no further reasoning gains. Quality is measured by swarm exhaustion, then verified by three gates.","example":"After dozens of cycles the swarm finds no new failure patterns; the student converges and is sealed.","related":["adversarial","gate","seal"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"convergence-graduation","term":"Convergence Graduation","aka":["Convergence-Based Graduation"],"category":"qukaizen","short":"A model graduates when the adversarial swarm gives up trying to break it — not at a fixed cycle limit.","definition":"Instead of a fixed number of rounds, QuKaiZen runs until convergence: 95%+ evaluator scores, exhausted experiments, and no further reasoning improvement. 
Quality is measured by swarm exhaustion, then verified by a three-gate certification before the Nucleus Seal is minted.","example":"After dozens of cycles the swarm finds no new failure patterns; the student converges, passes the gates, and is sealed.","related":["adversarial-swarm","nucleus-seal","super-skill"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"cuda","term":"CUDA","aka":["CUDA"],"category":"formats-runtime","short":"NVIDIA's platform/language for general-purpose GPU computing — the substrate most ML runs on.","definition":"CUDA is NVIDIA's parallel-computing API and toolkit that lets code run on GPUs. Frameworks compile their tensor ops down to CUDA kernels (and libraries like cuBLAS/cuDNN), which is why GPU availability and CUDA versions dominate ML ops.","example":"A version mismatch between a PyTorch build and the installed CUDA toolkit is the classic 'it will not see the GPU' bug.","related":["triton","flashattention"],"source":"authored"},{"slug":"desired-state","term":"Desired State","category":"architecture","short":"The end state you declare; the system's job is to make reality match it.","definition":"Desired-state configuration means you describe what you want — the team, the config — not the steps to get there, and a controller reconciles reality to it. 
The declaration is idempotent and version-controlled.","example":"team.toml lists four agents; apply it and the platform makes exactly those run.","related":["reconcile","drift","idempotent"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"distillation","term":"Distillation","aka":["knowledge distillation"],"category":"fine-tuning","short":"Transfer a big teacher model's behavior into a small student model.","definition":"Knowledge distillation trains a small student to mimic a large teacher — matching its outputs, probabilities, or reasoning traces — so the student captures much of the teacher's capability at a fraction of the size and cost. It is the core of QuKaiZen's pipeline.","example":"A 3B student trained on a 400B teacher's chain-of-thought traces can match the teacher in-domain while running on a laptop.","related":["scotd","raft","super-skill","fine-tune"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"authored"},{"slug":"dpo","term":"DPO","aka":["Direct Preference Optimization"],"category":"rl-alignment","short":"Align to preferences directly from good/bad answer pairs — no reward model or RL loop.","definition":"DPO skips RLHF's separate reward model and PPO loop, reframing alignment as a simple classification-style loss over (preferred, rejected) pairs that directly raises the likelihood of preferred answers. Simpler and more stable than PPO-based RLHF, with comparable results.","example":"Feed pairs like (concise correct answer = preferred, rambling answer = rejected); DPO's loss directly widens the margin between them.","related":["rlhf","ppo","sft"],"source":"authored"},{"slug":"draft-model","term":"Draft Model","category":"inference","short":"The small, fast model that proposes candidate tokens in speculative decoding.","definition":"The draft model is a smaller, cheaper model that guesses the next several tokens; the large target model then verifies them together. 
The closer the draft tracks the target, the more tokens are accepted per pass.","example":"A 1B draft proposes 5 tokens; the 70B target verifies all 5 in one pass when they agree.","related":["speculative","verifier"],"source":"QuKaiZen AI Dictionary"},{"slug":"dropout","term":"Dropout","aka":["dropout regularization"],"category":"training","short":"Randomly zeroing activations during training to prevent overfitting.","definition":"Dropout randomly sets a fraction of activations to zero each training step, forcing the network not to rely on any single unit and improving generalization. It is disabled at inference. Large pretraining often uses little or none, but it is common when fine-tuning on small data.","example":"Dropout 0.1 on a fine-tune randomly drops 10% of activations per step to curb overfitting on a small dataset.","related":["backprop","fine-tune","layernorm"],"source":"authored"},{"slug":"embeddings","term":"Embeddings","aka":["embedding vectors"],"category":"fundamentals","short":"Dense numeric vectors representing tokens or text so similar meanings sit close together.","definition":"An embedding maps a token or piece of text to a vector in high-dimensional space where geometric closeness reflects semantic similarity. Models learn input embeddings for tokens; separate embedding models turn whole documents into vectors for search and RAG.","example":"'king' minus 'man' plus 'woman' lands near 'queen'; RAG retrieves the docs whose embeddings are nearest the query's.","related":["tokenizer","transformer","attention"],"source":"authored"},{"slug":"eval","term":"Eval","aka":["evaluation","evals"],"category":"training","short":"The practice of measuring model quality with repeatable tests — from public benchmarks to task-specific graders.","definition":"An eval is any repeatable measurement of how well a model does something: a public benchmark, a private held-out set, an LLM-as-judge rubric, or a unit-test-style check. 
Good evals are the steering wheel of model building — without them you cannot tell whether a change helped. QuKaiZen's certification gates are the evals a student model must pass before it graduates.","example":"Before shipping a fine-tune you run an eval suite — MMLU for knowledge, GSM8K for math, IFEval for instruction-following — and only ship if every score holds or improves.","related":["benchmark","mmlu","gsm8k","ifeval","kice"],"source":"authored"},{"slug":"experiment","term":"Experiment","aka":["experimentation","training run"],"category":"fundamentals","short":"A single tracked training or evaluation run with a fixed configuration, used to test one change against a baseline.","definition":"An experiment isolates one variable — a hyperparameter, a data change, an architecture tweak — and measures its effect against a baseline under otherwise identical conditions. Each run logs its config, metrics, and artifacts so results are reproducible and comparable. In ARAIL, autoresearch agents run experiments continuously and score each against evolving rubrics — what gets measured gets improved.","example":"Change only the learning rate from 2e-4 to 1e-4, rerun training, and compare validation loss to the baseline; if it improves and nothing else changed, the experiment isolated the cause.","related":["checkpoint","perplexity"],"seeAlso":[{"label":"ARAIL lab","href":"/arail"}],"source":"authored"},{"slug":"fine-tune","term":"Fine-tune","aka":["fine-tuning"],"category":"fine-tuning","short":"Continue training a pretrained model on new data to specialize it for a task or domain.","definition":"Fine-tuning takes a general pretrained model and trains it further on a focused dataset so it adapts to a domain, style, or task. 
It can be full (all weights) or parameter-efficient (LoRA/PEFT), and is the bridge from a generic base to a useful specialist.","example":"Fine-tune a base 7B on 30 years of Linux-kernel commits and it starts reasoning like a kernel engineer.","related":["sft","lora","peft","distillation"],"source":"authored"},{"slug":"flashattention","term":"FlashAttention","aka":["Flash Attention"],"category":"inference","short":"An exact attention kernel that is fast and memory-light by never materializing the full attention matrix.","definition":"FlashAttention computes exact attention in tiles that stay in fast on-chip SRAM, avoiding the quadratic N-by-N matrix in slow HBM. It cuts memory from quadratic to linear and speeds up training and inference, enabling much longer contexts.","example":"Swapping standard attention for FlashAttention-2 can train a long-context model ~2x faster with far less memory.","related":["attention","kv-cache","transformer"],"source":"authored"},{"slug":"fp8","term":"FP8","aka":["8-bit float"],"category":"quantization","short":"An 8-bit floating-point format for faster training and inference on H100-class hardware.","definition":"FP8 represents numbers in 8 bits (e4m3 or e5m2 variants), halving memory and doubling throughput versus BF16 on supporting GPUs. It needs careful scaling but is increasingly used for both training and high-throughput inference.","example":"Serving a teacher in FP8 on H100s roughly doubles tokens/sec versus BF16 with minimal quality loss.","related":["bf16","int4","quantization","vllm"],"source":"authored"},{"slug":"fsdp","term":"FSDP","aka":["Fully Sharded Data Parallel"],"category":"training","short":"Shards model parameters, gradients, and optimizer state across GPUs so huge models fit in training.","definition":"FSDP (PyTorch) splits parameters, gradients, and optimizer states across all data-parallel GPUs, gathering each shard only when needed. 
It trains models far larger than a single GPU's memory, with less overhead than older model-parallel schemes.","example":"Training a 70B model across 8 GPUs: FSDP keeps only 1/8 of the weights resident on each, all-gathering layers on the fly.","related":["zero","backprop","gradient"],"source":"authored"},{"slug":"function-calling","term":"Function Calling","category":"architecture","short":"A structured protocol for a model to request a specific tool with typed arguments.","definition":"Function calling has the model emit a structured call (a name plus JSON arguments) that your code executes, returning the result for the model to use. It's the reliable mechanism beneath most tool use.","example":"The model returns {name:'get_rate', args:{lane:'CHI-DAL'}}; your server runs it and feeds back the price.","related":["tool-use","agent","mcp"],"source":"QuKaiZen AI Dictionary"},{"slug":"gelu","term":"GELU","aka":["Gaussian Error Linear Unit"],"category":"architecture","short":"A smooth activation function used in transformer feed-forward layers.","definition":"GELU scales an input x by the standard Gaussian CDF at x (the probability that a standard normal variable is at most x), giving a smooth alternative to ReLU that lets small negative values through. Its smoothness helps gradient flow, and it is the default activation in many transformer MLP blocks (with SwiGLU now common too).","example":"A transformer's feed-forward block applies GELU between its two linear layers.","related":["transformer","layernorm"],"source":"authored"},{"slug":"gguf","term":"GGUF","aka":["GGML successor"],"category":"formats-runtime","short":"A single-file binary format for quantized models, built for fast local inference (llama.cpp).","definition":"GGUF packs weights (usually quantized), tokenizer, and metadata into one memory-mappable file so a model loads fast and runs on commodity hardware. 
It is the format used by llama.cpp and friends, superseding the older GGML format.","example":"llama-2-7b.Q4_K_M.gguf is a 7B model quantized to ~4-bit (~4GB) that runs on a laptop with llama.cpp.","related":["quantization","int4","safetensors","inference"],"source":"authored"},{"slug":"gradient","term":"Gradient","aka":["gradients"],"category":"training","short":"The vector of partial derivatives telling how the loss changes as you tweak each weight.","definition":"A gradient points in the direction of steepest increase of the loss; training steps move weights the opposite (descent) way. Gradient magnitude and stability (vanishing/exploding) are central concerns, handled with clipping, normalization, and good optimizers.","example":"Gradient clipping caps the global gradient norm (e.g., 1.0) to stop a huge update from blowing up training.","related":["backprop","adamw","fsdp"],"source":"authored"},{"slug":"gsm8k","term":"GSM8K","aka":["Grade School Math 8K"],"category":"training","short":"Around 8,500 grade-school math word problems that test multi-step arithmetic reasoning.","definition":"GSM8K (Grade School Math 8K) is a dataset of about 8,500 linguistically diverse grade-school word problems, each needing two to eight reasoning steps. 
It became the standard probe for whether a model reasons step by step instead of pattern-matching — and the benchmark that made chain-of-thought prompting famous.","example":"A GSM8K problem may say a robe needs 2 bolts of blue fiber and half that of white; the model must compute 2 + 1 = 3, and the benchmark scores only the final number.","related":["benchmark","eval","mmlu","scotd"],"source":"authored"},{"slug":"hallulens","term":"HalluLens","aka":["LLM hallucination benchmark"],"category":"training","short":"A benchmark for measuring how often an LLM hallucinates — asserts unsupported or fabricated facts.","definition":"HalluLens is a hallucination benchmark that separates extrinsic hallucination (claims grounded in no source) from intrinsic hallucination (contradicting the given input), and probes models with tasks designed to surface confident-but-false answers. It exists because fluency hides unreliability — a model can sound right while being wrong.","example":"Asked to summarize a paper that does not exist, a hallucinating model invents authors and results; HalluLens scores whether it fabricates or correctly declines.","related":["eval","benchmark"],"source":"authored"},{"slug":"helm","term":"HELM","aka":["Holistic Evaluation of Language Models"],"category":"training","short":"Stanford's broad, multi-metric benchmark suite that scores models across many scenarios, not just accuracy.","definition":"HELM (Holistic Evaluation of Language Models), from Stanford CRFM, evaluates models across a wide matrix of scenarios and metrics — accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — so a model is judged on many axes at once instead of a single headline score.","example":"Under HELM two models with identical accuracy can rank differently once robustness and calibration are weighed in.","related":["benchmark","eval","mmlu"],"source":"authored"},{"slug":"hypothesis","term":"Hypothesis","category":"fundamentals","short":"A testable 
prediction you set out to confirm or refute with an experiment.","definition":"A hypothesis states, in advance, what you expect a change to do and how you'll measure it — turning a hunch into something falsifiable. Good research lives or dies on sharp hypotheses.","example":"'Adding symbolic CoT will raise faithfulness by 5 points' — then you run it and find out.","related":["experiment","research"],"source":"QuKaiZen AI Dictionary"},{"slug":"ifeval","term":"IFEval","aka":["Instruction-Following Eval"],"category":"training","short":"A benchmark of machine-verifiable instructions that measures how precisely a model obeys format and constraint requests.","definition":"IFEval (Instruction-Following Eval) uses prompts whose compliance can be checked programmatically — answer in exactly three bullet points, avoid a given word, respond in JSON. Because each rule is machine-verifiable, it scores obedience objectively, with no human or judge model in the loop.","example":"Given an instruction to write two paragraphs and end with a specific word, IFEval checks both conditions automatically; missing either one counts as a fail.","related":["benchmark","eval","mmlu"],"source":"authored"},{"slug":"inference","term":"Inference","aka":["serving"],"category":"fundamentals","short":"Running a trained model to produce outputs — the deployment side, as opposed to training.","definition":"Inference is using a trained model to generate predictions for real inputs. For LLMs it is autoregressive: produce one token, append it, repeat. 
Latency, throughput, and memory (the KV-cache) are the central concerns, distinct from the one-time cost of training.","example":"Typing a prompt into a chatbot and watching tokens stream back is inference; the KV-cache and sampling settings shape its speed and style.","related":["kv-cache","speculative-decoding","temperature","vllm"],"source":"authored"},{"slug":"int4","term":"INT4","aka":["4-bit"],"category":"quantization","short":"4-bit integer weights — the aggressive quantization that makes big models fit on small hardware.","definition":"INT4 stores each weight in 4 bits (16 levels), roughly 8x smaller than FP32. Schemes like GPTQ, AWQ, and NF4 pick scales and zero-points to preserve quality. Small models tolerate 4-bit well; frontier models often need 8-bit for the same fidelity.","example":"A 7B model in INT4 is ~4GB and runs on a laptop; a 671B MoE at Q4 fits a 1TB SSD for layer-streamed inference.","related":["quantization","gguf","qlora","bf16"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"knowledge_base/wiki/concepts/Quantization_SNR_Affine.md"},{"slug":"kice","term":"KICE","aka":["Knowledge Injection & Corpus Evolution"],"category":"qukaizen","short":"QuKaiZen's agent that extracts certified, verifiable domain knowledge in six layers.","definition":"KICE mines a corpus for rare concepts, edge cases, historical conflicts, subsystem interactions, nuanced reasoning, and ambiguity — knowledge that can be verified against authoritative sources. 
It feeds the distillation pipeline with high-quality, checkable material.","example":"For a Linux-kernel skill, KICE surfaces a subtle locking edge case documented in a 2009 mailing-list thread.","related":["tice","super-skill","distillation"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"kv-cache","term":"KV-Cache","aka":["key-value cache","KV cache"],"category":"inference","short":"Cached key/value tensors from past tokens so generation does not recompute the whole sequence each step.","definition":"During autoregressive generation each new token attends to all previous tokens. The KV-cache stores the keys and values already computed, so each step only processes the new token — turning quadratic regeneration into linear. It is the main consumer of inference memory.","example":"Generating token 1000 reuses 999 cached K/V pairs; only the new token's attention is computed. vLLM's PagedAttention manages this cache efficiently.","related":["attention","speculative-decoding","vllm","inference"],"source":"authored"},{"slug":"latency","term":"Latency","category":"inference","short":"The delay before and during a model's response — time-to-first-token and per-token time.","definition":"Latency is how quickly a single request responds, distinct from throughput (total volume). Keeping the model warm and prefetching weights cut it.","example":"Warm-keeping the SLM drops a dictionary lookup from ~17s cold to a couple of seconds.","related":["throughput","prefetch"],"source":"QuKaiZen AI Dictionary"},{"slug":"layer-streaming","term":"Layer Streaming","aka":["layer-by-layer inference"],"category":"qukaizen","short":"Load one transformer layer from disk, compute, discard — running 400B+ models on tiny VRAM.","definition":"Layer-streaming inference (AeroLLM's core primitive) streams a model layer by layer from SSD: load a layer's weights, compute, free, repeat. 
It trades latency for the ability to run frontier-scale teachers (70B-671B) on commodity hardware with a few GB of VRAM.","example":"A 671B MoE at Q4 streams off a 1TB SSD on a MacBook — slow per token, but a background swarm does not mind waiting for depth.","related":["aerollm","speculative-decoding","quantization","super-skill"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"knowledge_base/wiki/concepts/Layer_Streaming_Inference.md"},{"slug":"layernorm","term":"LayerNorm","aka":["Layer Normalization","RMSNorm"],"category":"architecture","short":"Normalizes activations within each layer to keep training stable; modern LLMs often use RMSNorm.","definition":"Layer normalization rescales each token's activation vector to zero mean and unit variance (RMSNorm skips the mean), stabilizing and speeding training. Placement (pre-norm vs post-norm) and the variant chosen materially affect deep-transformer stability.","example":"Llama-style models apply pre-RMSNorm before attention and the feed-forward block for stable deep training.","related":["transformer","attention","gelu"],"source":"authored"},{"slug":"llm","term":"LLM","aka":["Large Language Model"],"category":"fundamentals","short":"A transformer trained on vast text to predict the next token, yielding broad language ability.","definition":"A large language model is a big transformer trained on internet-scale text with next-token prediction; scale plus instruction tuning yields general capability. 
QuKaiZen distills that capability into small, owned models.","example":"GPT-4, Claude, and Llama are LLMs; a 1–7B Super Skill is a small, specialized descendant.","related":["transformer","distillation","super-skill"],"source":"QuKaiZen AI Dictionary"},{"slug":"logits","term":"Logits","aka":["logit"],"category":"fundamentals","short":"The model's raw, unnormalized output scores over the vocabulary, before softmax makes them probabilities.","definition":"Logits are the final layer's raw scores — one per vocabulary token — not yet normalized into probabilities. Sampling controls (temperature, top-k/p) operate on logits before softmax converts them into the next-token distribution.","example":"Dividing logits by a temperature of 0.2 sharpens them, making the top token far more likely after softmax.","related":["softmax","temperature","beam-search"],"source":"authored"},{"slug":"lora","term":"LoRA","aka":["Low-Rank Adaptation"],"category":"fine-tuning","short":"Fine-tune a model by training tiny low-rank adapter matrices while the base weights stay frozen.","definition":"LoRA freezes the original weights and injects small trainable rank-decomposition matrices into each layer. 
You train only those low-rank matrices — often under 1% of the parameters — which slashes memory and lets a single GPU fine-tune models that would otherwise need a cluster.","example":"Fully fine-tuning a 7B model needs ~60GB+; with LoRA you train ~10-50MB of adapters in ~10GB, then merge or hot-load them at inference.","related":["qlora","peft","adapters","fine-tune"],"source":"qukaizen/docs/TECHNIQUES.md"},{"slug":"mcp","term":"MCP","aka":["Model Context Protocol"],"category":"architecture","short":"An open standard for connecting models to tools and data sources.","definition":"MCP lets agents discover and call external tools, resources, and data through a uniform interface, so capabilities plug in without bespoke glue per integration.","example":"An agent connects to a load-board MCP server and instantly gains 'list loads' and 'book load' tools.","related":["tool-use","function-calling","agent"],"source":"QuKaiZen AI Dictionary"},{"slug":"mlx","term":"MLX","category":"formats-runtime","short":"Apple's array framework for running and training models on Apple Silicon's unified memory.","definition":"MLX uses the shared CPU/GPU memory of Apple Silicon for zero-copy inference and fine-tuning — no host↔device transfers and much lower power. AeroLLM targets it for Mac deployments.","example":"On an M-series Mac, MLX runs a streamed model against unified memory with ~83% less power than a discrete GPU.","related":["streaming","quantization"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"QuKaiZen AI Dictionary"},{"slug":"mmlu","term":"MMLU","aka":["Massive Multitask Language Understanding"],"category":"training","short":"A benchmark of ~16,000 multiple-choice questions across 57 subjects, measuring an LLM's breadth of knowledge.","definition":"MMLU (Massive Multitask Language Understanding) tests a model with four-option multiple-choice questions spanning 57 subjects, from elementary math to law, medicine, and ethics. 
It is the standard yardstick for general knowledge and reasoning breadth; scores range from 25% (random guessing) to roughly 90% for frontier models.","example":"An MMLU item might pose a college-level biology fact with four choices; a model scoring 70% got 70% of questions right, averaged across all 57 subjects.","related":["benchmark","eval","gsm8k","ifeval"],"source":"authored"},{"slug":"moe","term":"MoE","aka":["Mixture of Experts"],"category":"architecture","short":"A model split into many expert sub-networks where a router activates only a few per token.","definition":"Mixture-of-Experts replaces a dense layer with many parallel expert networks plus a router that picks a small subset (e.g., 2 of 64) per token. Total parameters balloon while compute per token stays modest — huge capacity at a fraction of dense FLOPs.","example":"A 671B-parameter MoE might activate only ~37B per token, so it runs far cheaper than a dense 671B model.","related":["transformer","attention","layer-streaming"],"source":"authored"},{"slug":"multi-agent","term":"Multi-Agent","category":"architecture","short":"Several specialized agents collaborating, each owning a function.","definition":"A multi-agent system splits work across specialist agents that hand off to one another — often cheaper and more reliable than one giant generalist. PaperAgents reconciles a declared team of them.","example":"Dispatch, billing, and safety agents each handle their domain and pass tasks along.","related":["agent","orchestration","handoff"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"nucleus-seal","term":"Nucleus Seal","aka":["Nucleus Seal"],"category":"qukaizen","short":"An Ed25519 cryptographic provenance chain proving how a Super Skill model was made.","definition":"The Nucleus Seal binds a model's DNA — teacher hash, corpus hash, pipeline config, audit, and AutoResearch report — into a signed Ed25519 chain. 
It is cryptographic proof the pipeline distilled the model correctly, and seals are dynamically monitored and revocable.","example":"Each model version is minted with a Seal linking it to the exact teacher and corpus that produced it, so provenance is verifiable.","related":["super-skill","convergence-graduation","distillation"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"orchestration","term":"Orchestration","category":"architecture","short":"Coordinating multiple agents or services into one coherent flow.","definition":"Orchestration sequences and supervises the parts of a multi-step system — who runs when, with what inputs — handling retries and handoffs. QuKaiZen orchestrates the swarm; PaperAgents orchestrates a team.","example":"The orchestrator fans work to dispatch, waits, then hands results to billing.","related":["multi-agent","workflow","reconcile"],"source":"QuKaiZen AI Dictionary"},{"slug":"paged-attention","term":"PagedAttention","category":"inference","short":"Storing the KV-cache in non-contiguous pages so long contexts fit without waste.","definition":"PagedAttention (from vLLM) manages attention key/value cache in fixed-size pages like virtual memory, eliminating fragmentation and letting many requests share memory — large serving-throughput gains.","example":"Paged KV-cache lets a server batch far more concurrent long-context requests.","related":["kv-cache","continuous-batching","context-window"],"source":"QuKaiZen AI Dictionary"},{"slug":"peft","term":"PEFT","aka":["Parameter-Efficient Fine-Tuning"],"category":"fine-tuning","short":"An umbrella for methods (LoRA, adapters, prefix-tuning) that tune a tiny fraction of parameters.","definition":"PEFT covers techniques that adapt a model by training only a small set of new or selected parameters while freezing the rest — LoRA, adapters, prefix/prompt tuning, and more. 
It is also the name of Hugging Face's library implementing them.","example":"Using the PEFT library, you wrap a base model with a LoRA config and train under 1% of its parameters.","related":["lora","qlora","adapters"],"source":"authored"},{"slug":"perplexity","term":"Perplexity","aka":["PPL"],"category":"fundamentals","short":"A measure of how surprised a model is by text — lower means it predicts the text better.","definition":"Perplexity is the exponentiated average negative log-likelihood a model assigns to a sequence — roughly the effective number of equally likely choices it faces each step. Lower is better, but it is an intrinsic metric, not a substitute for task benchmarks.","example":"A model with perplexity 10 on a test set is about as uncertain as choosing uniformly among 10 tokens each step.","related":["logits","softmax","tokenizer"],"source":"authored"},{"slug":"ppo","term":"PPO","aka":["Proximal Policy Optimization"],"category":"rl-alignment","short":"The RL algorithm classically used to optimize a model against a reward model in RLHF.","definition":"PPO is a policy-gradient method that improves a model while clipping each update to stay close to the previous policy, preventing destructive jumps. In RLHF it is the optimizer that pushes the model to maximize reward-model scores.","example":"During RLHF, PPO raises the probability of high-reward responses but clips the step if the new policy strays too far from the old one.","related":["rlhf","dpo"],"source":"authored"},{"slug":"prefetch","term":"Prefetch","category":"inference","short":"Loading the next layer from disk while the current compute runs, hiding I/O latency.","definition":"Prefetching overlaps disk reads with computation: while the GPU works on layer N, layer N+1 is already streaming in, so the model rarely waits on storage. 
It's what makes layer streaming fast.","example":"AeroLLM prefetches the next shard so the GPU stays busy instead of stalling on the SSD.","related":["streaming","latency","throughput"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"QuKaiZen AI Dictionary"},{"slug":"prompt","term":"Prompt","category":"fundamentals","short":"The input text you give a model to steer what it does.","definition":"A prompt is the instruction plus context handed to a model at inference time. Prompt engineering tunes wording, examples, and structure to elicit better output — but unlike training, it never changes the model's weights.","example":"Adding 'think step by step' to a prompt can lift accuracy on reasoning tasks with no retraining.","related":["chain-of-thought","context-window","rag"],"source":"QuKaiZen AI Dictionary"},{"slug":"provenance","term":"Provenance","category":"qukaizen","short":"A verifiable record of exactly what went into a model and how it was built.","definition":"Provenance is the chain of custody for a model: which teacher, which corpus version, which config and audits. QuKaiZen hashes each artifact and signs the chain so the lineage is tamper-evident.","example":"The provenance chain lets anyone verify a sealed model was distilled from the stated teacher and corpus.","related":["seal","ed25519","faithfulness"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"qlora","term":"QLoRA","aka":["Quantized LoRA"],"category":"fine-tuning","short":"LoRA on top of a 4-bit quantized base model — fine-tune big models on one consumer GPU.","definition":"QLoRA quantizes the frozen base to 4-bit (NF4) to shrink its footprint, then trains LoRA adapters on top in higher precision, with gradients flowing through the quantized weights via dequant-on-the-fly. 
Near-full-fine-tune quality at a fraction of the VRAM.","example":"QLoRA made it possible to fine-tune a 65B model on a single 48GB GPU — previously impossible without multiple A100s.","related":["lora","quantization","int4","peft"],"source":"authored"},{"slug":"quantization","term":"Quantization","aka":["quantisation"],"category":"quantization","short":"Storing weights/activations in fewer bits (FP16 to INT4) to shrink models and speed inference.","definition":"Quantization maps high-precision weights to a smaller numeric type (8-bit, 4-bit, ...) using a scale and zero-point, trading a little accuracy for big savings in memory and bandwidth. It is what lets frontier-scale models run on commodity hardware.","example":"Quantizing a 13B model from FP16 (26GB) to Q4 (~7GB) lets it load on a single consumer GPU.","related":["int4","bf16","fp8","gguf"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"knowledge_base/wiki/concepts/Quantization_SNR_Affine.md"},{"slug":"raft","term":"RAFT","aka":["Retrieval Augmented Fine-Tuning"],"category":"fine-tuning","short":"Fine-tuning that teaches a model to reason over retrieved docs while ignoring distractors.","definition":"RAFT trains on a question plus a mix of oracle (relevant) and distractor (irrelevant) documents, teaching the model to cite the right source and ignore noise. 
The result reasons through imperfect retrieval rather than memorizing — domain-specific RAG baked into the weights.","example":"For a kernel-bug question, RAFT shows the real commit (oracle) plus two unrelated patches (distractors); the model learns to ground its answer in the oracle.","related":["fine-tune","distillation","scotd","super-skill"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"knowledge_base/wiki/concepts/RAFT.md"},{"slug":"rag","term":"RAG","aka":["Retrieval-Augmented Generation"],"category":"architecture","short":"Fetch relevant documents at query time and feed them to the model as context.","definition":"RAG retrieves passages from a knowledge store and injects them into the prompt, so the model answers from fresh, specific data rather than memory. It's the opposite of distillation — knowledge stays external and looked-up.","example":"A support bot retrieves the latest policy doc and answers from it, with no retraining when the policy changes.","related":["raft","context-window","knowledge-base"],"source":"QuKaiZen AI Dictionary"},{"slug":"reasoning","term":"Reasoning","category":"fundamentals","short":"A model working through a problem in intermediate steps instead of answering in one leap.","definition":"Reasoning is a model's ability to chain intermediate inferences — premises, rules, constraints, cross-references — toward a conclusion, rather than pattern-matching a final answer. 
Chain-of-thought elicits it; distillation transfers and sharpens it into small models.","example":"Given a multi-step word problem, a reasoning model writes each step ('first the rate, then the time…') and lands the answer far more reliably than guessing.","related":["chain-of-thought","distillation","super-skill"],"source":"QuKaiZen AI Dictionary"},{"slug":"reconcile","term":"Reconciliation","aka":["reconcile","desired-state reconciliation"],"category":"architecture","short":"Continuously closing the gap between the team you declared and the team that's running.","definition":"Borrowed from infrastructure (Kubernetes-style control loops), reconciliation compares desired state to observed state and converges them; a watcher fixes drift forever after. PaperAgents applies it to agent teams.","example":"Declare four agents; the watcher notices one died and restarts it to match the manifest.","related":["desired-state","drift","watcher"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"research","term":"Research","aka":["autoresearch"],"category":"fundamentals","short":"Systematic inquiry — forming hypotheses, running experiments, and measuring results.","definition":"Research is the disciplined loop of asking a question, forming a hypothesis, experimenting, and measuring. 
ARAIL is built to run that loop with AI: autoresearch agents gather sources, probe ideas, and surface what's interesting.","example":"ARAIL's agents pull recent papers, summarize the state of the art, and propose the next experiment to run.","related":["experiment","hypothesis","ablation"],"seeAlso":[{"label":"ARAIL","href":"/arail"}],"source":"QuKaiZen AI Dictionary"},{"slug":"rlhf","term":"RLHF","aka":["Reinforcement Learning from Human Feedback"],"category":"rl-alignment","short":"Align a model to human preferences via a reward model trained on human rankings, then RL.","definition":"RLHF collects human comparisons of model outputs, trains a reward model to predict which response people prefer, then fine-tunes the policy with reinforcement learning (usually PPO) to maximize that reward. It is how raw pretrained models became helpful, harmless assistants.","example":"Given two answers to 'explain recursion', humans pick the clearer one; the reward model learns that preference; PPO nudges the model toward it.","related":["dpo","ppo","sft"],"source":"authored"},{"slug":"rope","term":"RoPE","aka":["Rotary Position Embedding"],"category":"architecture","short":"Encodes token position by rotating query/key vectors — the dominant positional scheme in modern LLMs.","definition":"Rotary Position Embeddings inject position by rotating query and key vectors by an angle proportional to their position, so attention naturally depends on relative distance. 
RoPE extrapolates to longer contexts better than learned absolute embeddings and underlies most current LLMs.","example":"RoPE scaling tricks (NTK, YaRN) stretch a model trained at 4k context to 32k+ by adjusting the rotation frequencies.","related":["attention","transformer","embeddings"],"source":"authored"},{"slug":"rubric","term":"Rubric","category":"qukaizen","short":"The evolving scoring criteria AutoResearch uses to probe and grade the student.","definition":"Rubrics are structured criteria that drive the Interrogator's probes, the Adversary's traps, and the Evaluator's scoring. AutoResearch evolves them over time as new failure modes are discovered.","example":"A rubric for the kernel domain weights memory-safety reasoning heavily, so the swarm probes it hardest.","related":["autoresearch","adversarial","gate"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"safetensors","term":"SafeTensors","aka":["safetensors"],"category":"formats-runtime","short":"A safe, fast, zero-copy tensor file format — the modern replacement for pickle-based checkpoints.","definition":"SafeTensors stores weights in a simple, memory-mappable layout with no arbitrary code execution (unlike Python pickle, which can run malicious code on load). It loads fast via zero-copy and is now the default for sharing weights on the Hub.","example":"model.safetensors loads almost instantly via mmap and cannot execute hidden code, unlike a .bin/.pt pickle.","related":["gguf","checkpoint"],"source":"authored"},{"slug":"scaling-laws","term":"Scaling Laws","aka":["neural scaling laws"],"category":"training","short":"Empirical power-law curves showing model loss falls predictably as parameters, data, and compute grow.","definition":"Scaling laws are power-law relationships found empirically: a model's loss drops smoothly and predictably as parameters, training tokens, and compute increase together. 
They let labs forecast a model's capability before training it, and they later motivated training smaller models on far more data (compute-optimal, Chinchilla-style).","example":"Scaling laws predicted how much a 10x larger compute budget would cut loss, so a lab could plan a frontier run's size and data in advance.","related":["benchmark","distillation","perplexity"],"source":"authored"},{"slug":"scotd","term":"SCoTD","aka":["Symbolic Chain-of-Thought Distillation"],"category":"fine-tuning","short":"Distill a teacher's step-by-step reasoning into a small model via many symbolic CoT traces.","definition":"Symbolic Chain-of-Thought Distillation samples multiple chain-of-thought rationales from a large teacher and trains a small student on them, so even a 1-3B model learns to reason in explicit steps rather than pattern-match. It is a key reason small QuKaiZen students can think.","example":"A 1.3B student trained on 175B-teacher CoT traces learns to lay out premise, rule, then conclusion on its own.","related":["distillation","raft","super-skill"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"knowledge_base/wiki/concepts/SCoTD.md"},{"slug":"seal","term":"Seal","aka":["Nucleus Seal","cryptographic seal"],"category":"qukaizen","short":"A cryptographic signature certifying a model's provenance — what it was distilled from and that it is untampered.","definition":"A seal is a cryptographic signature (QuKaiZen uses Ed25519) bound to a finished model, certifying its provenance: which teacher and corpus it came from, which certification gates it passed, and that its weights have not changed since. 
Anyone can verify the seal offline, so an owned model carries proof of exactly what it is — the Nucleus Seal.","example":"Before trusting a distilled 3B model in production you verify its Ed25519 seal; if a single weight changed, verification fails.","related":["nucleus-seal","ssdp","ed25519"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"sft","term":"SFT","aka":["Supervised Fine-Tuning"],"category":"training","short":"Plain supervised training on curated input-to-output examples — the first step of post-training.","definition":"SFT fine-tunes a pretrained model on labeled prompt/response pairs so it learns to follow instructions in a target format or domain. It is the foundation step before preference alignment (RLHF/DPO) and the simplest way to specialize a base model.","example":"Train on 10k (instruction, ideal answer) pairs so a base model answers like a helpful assistant instead of just continuing text.","related":["rlhf","dpo","fine-tune","lora"],"source":"authored"},{"slug":"softmax","term":"Softmax","aka":["softmax function"],"category":"fundamentals","short":"Turns a vector of logits into a probability distribution that sums to 1.","definition":"Softmax exponentiates each logit and divides by the sum, producing positive values that add to 1 — a probability distribution. 
It converts logits into the next-token distribution that sampling draws from and, inside attention, weights how much each token attends to others.","example":"Logits [2.0, 1.0, 0.1] become probabilities about [0.66, 0.24, 0.10] after softmax.","related":["logits","attention","temperature"],"source":"authored"},{"slug":"speculative-decoding","term":"Speculative Decoding","aka":["speculative sampling","spec decode"],"category":"inference","short":"A small draft model proposes several tokens; the big model verifies them in one pass — lossless speedup.","definition":"Speculative decoding runs a cheap draft model to guess the next few tokens, then the large target model verifies them all in a single forward pass, accepting the longest correct prefix. The output distribution matches normal decoding, but throughput typically rises 2-3x because the expensive model runs less often.","example":"The draft proposes 5 tokens, the target accepts the first 4 and corrects the 5th, so 5 tokens are produced for roughly one big-model step.","related":["kv-cache","inference","vllm","layer-streaming"],"seeAlso":[{"label":"AeroLLM","href":"/aerollm"}],"source":"knowledge_base/wiki/concepts/speculative-decoding.md"},{"slug":"ssdp","term":"SSDP","aka":["Super Skill Distillation Pipeline"],"category":"qukaizen","short":"QuKaiZen's pipeline that distills deep reasoning from frontier teacher models into small, owned Super Skill models.","definition":"The Super Skill Distillation Pipeline (SSDP) extracts deep domain reasoning from 400B+ frontier teacher models and crystallizes it into small 1-7B Super Skill models that run air-gapped on commodity hardware and are owned forever. It is not RAG — a Super Skill knows its domain. 
Nucleus implements it: KICE/TICE knowledge extraction, RAFT, Symbolic Chain-of-Thought distillation, an adversarial swarm that trains the student to convergence, three certification gates, and an Ed25519 Nucleus Seal.","example":"SSDP can take a frontier model's mastery of a regulatory domain and mint a 3B model that answers offline at a fraction of the energy — high Wisdom per Watt.","related":["super-skill","distillation","kice","symbolic-cot","nucleus-seal"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"student","term":"Student Model","category":"qukaizen","short":"The small model being trained to absorb the teacher's reasoning.","definition":"The student is the compact model (1–7B) that learns from the teacher's traces, RAFT data, and adversarial correction — ending as a sealed Super Skill that can beat its teacher in-domain.","example":"After distillation the 3B student out-reasons its 500B teacher inside the target domain.","related":["teacher","distillation","super-skill"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"super-skill","term":"Super Skill","aka":["Super Skill Model","SSM"],"category":"qukaizen","short":"A 1-7B model that durably knows a domain, distilled from a frontier teacher and owned forever.","definition":"A Super Skill Model is the output of QuKaiZen's pipeline: a small (1-7B) model that crystallizes a frontier teacher's deep reasoning for a domain, runs on commodity or air-gapped hardware, and keeps improving. 
It knows — it does not look things up like RAG.","example":"A Linux-Kernel Super Skill trained on 30 years of commits, CVEs, and mailing lists reasons about kernel bugs offline.","related":["distillation","nucleus-seal","wisdom-per-watt","scotd"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"symbolic-cot","term":"Symbolic Chain-of-Thought","aka":["Symbolic CoT","SCoT"],"category":"qukaizen","short":"Capturing a teacher's reasoning as reusable symbolic structure, not just imitated text traces.","definition":"Symbolic Chain-of-Thought captures the structure of a teacher's reasoning — the steps, rules, and relationships — in symbolic form rather than copying surface-level wording. Distilling that structure (SCoTD) teaches a small student to reason faithfully instead of mimicking phrasing, which is what makes a Super Skill robust rather than brittle.","example":"Instead of memorizing one solution's wording, a student trained on symbolic CoT learns the underlying procedure and applies it to unseen problems.","related":["scotd","distillation","super-skill","ssdp"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"teacher","term":"Teacher Model","category":"qukaizen","short":"The large frontier model whose reasoning is distilled into a small student.","definition":"In distillation the teacher is the big, capable model (400B+) that generates reasoning traces and judgments; the student learns to reproduce its competence in-domain. 
QuKaiZen uses two-tier teachers for breadth and depth.","example":"A 400B teacher writes step-by-step solutions that train a 3B student to match it in-domain.","related":["student","distillation","two-tier"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen AI Dictionary"},{"slug":"temperature","term":"Temperature","aka":["sampling temperature"],"category":"inference","short":"A knob for randomness in generation — low is focused/deterministic, high is creative/diverse.","definition":"Temperature scales logits before softmax: below 1 sharpens the distribution (safer, more repetitive), above 1 flattens it (more diverse, more errors). At 0 the model is effectively greedy. It is the simplest lever for output style.","example":"Use temperature 0.2 for code or facts; 0.9 for brainstorming or creative writing.","related":["logits","softmax","beam-search","inference"],"source":"authored"},{"slug":"throughput","term":"Throughput","category":"inference","short":"How many tokens a system generates per unit time, across all requests.","definition":"Throughput measures total tokens/second a serving stack produces; it trades off against per-request latency. Speculative decoding and batching push it up.","example":"Speculative decoding lifts AeroLLM throughput up to 7× on 70B+ teachers.","related":["latency","speculative","continuous-batching"],"source":"QuKaiZen AI Dictionary"},{"slug":"tice","term":"TICE","aka":["Tacit knowledge Injection & Corpus Evolution"],"category":"qukaizen","short":"QuKaiZen's agent for Layer-7 tacit knowledge — the unwritten expert know-how and gotchas.","definition":"TICE extracts implicit, tribal knowledge — folklore, gotchas, and esoteric patterns that are not formally documented. 
It is the highest-value extractor for user-data-enriched (Mode 2/3) Super Skills, capturing expertise that lives only in practitioners' heads.","example":"TICE captures a farmer's unwritten rule of thumb about soil timing that no manual records, then teaches it to the student.","related":["kice","super-skill","distillation"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"tokenizer","term":"Tokenizer","aka":["tokenization","BPE"],"category":"fundamentals","short":"Splits text into tokens (subword units) the model actually reads, and back again.","definition":"A tokenizer converts raw text into integer token IDs (and back) using a learned vocabulary, usually via subword schemes like BPE or SentencePiece. Token count drives context limits and cost, and odd tokenization explains many model quirks.","example":"'tokenization' might split into ['token', 'ization']; rare words and emoji can become many tokens, inflating cost.","related":["embeddings","perplexity"],"source":"authored"},{"slug":"tool-use","term":"Tool Use","category":"architecture","short":"A model invoking external tools — APIs, code, search — to act beyond text.","definition":"Tool use lets a model call functions (query a database, run code, hit an API) and fold the results back into its reasoning, turning a language model into an actor in real systems.","example":"Asked today's freight rate, the agent calls a rates API instead of guessing.","related":["function-calling","agent","mcp"],"source":"QuKaiZen AI Dictionary"},{"slug":"transformer","term":"Transformer","aka":["transformer architecture"],"category":"architecture","short":"The attention-based neural architecture behind essentially every modern LLM.","definition":"The transformer stacks blocks of multi-head attention and feed-forward layers with residual connections and normalization, processing all tokens in parallel. 
Introduced in 'Attention Is All You Need' (2017), it scales beautifully and underpins GPT, Llama, and the rest.","example":"A 7B decoder-only transformer is about 32 such blocks; depth and width set the parameter count.","related":["attention","moe","layernorm","rope"],"source":"authored"},{"slug":"triton","term":"Triton","aka":["OpenAI Triton"],"category":"formats-runtime","short":"A Python-like language for writing fast GPU kernels without hand-writing CUDA C++.","definition":"Triton (from OpenAI) lets researchers write custom GPU kernels in a Python-like syntax that compiles to efficient code, making fused high-performance ops far easier to author. Many modern kernels, including FlashAttention implementations, are written in Triton.","example":"A fused softmax written in about 30 lines of Triton can beat a naive PyTorch version by a wide margin.","related":["cuda","flashattention"],"source":"authored"},{"slug":"verifier","term":"Verifier","category":"inference","short":"The target-model pass that accepts or corrects speculatively drafted tokens.","definition":"In speculative decoding the verifier is the large model's single forward pass that checks the draft's proposed tokens — keeping the correct prefix and resampling the first mismatch — which guarantees the same distribution as decoding normally.","example":"Of 5 drafted tokens the verifier accepts 4 and corrects the 5th, all in one pass.","related":["speculative","draft-model"],"source":"QuKaiZen AI Dictionary"},{"slug":"vllm","term":"vLLM","aka":["vLLM"],"category":"inference","short":"A high-throughput LLM serving engine; its PagedAttention manages the KV-cache like virtual memory.","definition":"vLLM maximizes GPU throughput via PagedAttention — treating the KV-cache as paged memory to eliminate fragmentation — plus continuous batching of incoming requests. 
It is the enterprise-grade backend for serving teacher models on GPUs.","example":"QuKaiZen uses vLLM (TEACHER_BACKEND=vllm) to serve teachers on H100s with continuous batching and FP8.","related":["kv-cache","inference","flashattention"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"qukaizen/docs/TECHNIQUES.md"},{"slug":"warmup","term":"Warmup","aka":["learning-rate warmup"],"category":"training","short":"Ramping the learning rate up from near zero over the first steps to avoid early instability.","definition":"Learning-rate warmup starts the LR small and increases it over the first few hundred or thousand steps before the main schedule (often cosine decay). Early gradients are noisy; warmup prevents large destabilizing updates while the optimizer's statistics settle.","example":"500 warmup steps ramping to 2e-4, then cosine decay to near zero over the run.","related":["adamw","gradient"],"source":"authored"},{"slug":"wisdom-per-watt","term":"Wisdom per Watt","aka":["wisdom-per-watt"],"category":"qukaizen","short":"QuKaiZen's core metric: certified, permanently-owned reasoning capability per unit of lifetime energy to mint and run it.","definition":"Renting a frontier model burns full datacenter energy on every query, forever, and you own nothing. QuKaiZen spends energy once to distill that reasoning into a small model you keep — certified by three independent gates, sealed with cryptographic provenance, run locally at near-zero marginal cost. Capability only counts if it is proven and trustworthy: a model that scores high but fails the hallucination gate scores zero. The energy is not a running cost; it is capital spent on an asset you own forever.","example":"Rent a 400B model and the 100,000th query costs the same datacenter energy as the first — and you still own nothing. 
Mint a 3B Super Skill and you pay the energy once; after the break-even query, owning beats renting and the gap only widens.","related":["super-skill","distillation","layer-streaming"],"seeAlso":[{"label":"Nucleus pipeline","href":"/nucleus"}],"source":"QuKaiZen NUCLEUS_AGENT_PROTOCOL"},{"slug":"workflow","term":"Workflow","category":"architecture","short":"A declared sequence of steps an agent or pipeline executes.","definition":"A workflow encodes the steps — download, analyze, decide, process — as configuration rather than ad-hoc code, so it's inspectable, versionable, and reproducible. PaperAgents declares them in TOML.","example":"[[workflow]]: download loads → analyze margin → decide → process.","related":["automation","orchestration","agentic"],"seeAlso":[{"label":"PaperAgents","href":"/paperagents"}],"source":"QuKaiZen AI Dictionary"},{"slug":"zero","term":"ZeRO","aka":["Zero Redundancy Optimizer"],"category":"training","short":"DeepSpeed's optimizer that partitions optimizer state, gradients, and params to remove memory redundancy.","definition":"ZeRO eliminates the memory redundancy of vanilla data parallelism by partitioning state across GPUs in cumulative stages: optimizer states (stage 1), then also gradients (stage 2), then also parameters (stage 3). It is the idea FSDP also implements, enabling trillion-parameter training.","example":"ZeRO-3 lets each of 64 GPUs hold only 1/64 of the parameters, gradients, and optimizer state, freeing memory for larger models and batches.","related":["fsdp","adamw","gradient"],"source":"authored"}]}