NVIDIA AI Releases Star Elastic: A Single Checkpoint Containing 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

Training a family of large language models (LLMs) has always come with a painful multiplication problem: each model in the family, whether 8B, 30B, or 70B, typically requires its own full training run, storage, and deployment stack. For a dev team running inference at scale, this means multiplying the compute cost by the number of model sizes they want to support. NVIDIA researchers now propose a different approach called Star Elastic.

Star Elastic is a post-training method that embeds multiple nested submodels, each at a different parameter budget, within a single parent reasoning model, using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens. All three variants reside in a single checkpoint and can be extracted without further fine-tuning.

What Does “Nested” Actually Mean Here?

If you’ve never encountered nested or elastic architectures before, here’s the idea: instead of training three separate models of 30B, 23B, and 12B, you train one model that contains the smaller ones as subsets. The submodels reuse the most important weights from the parent, identified by a process called importance estimation.

Star Elastic scores each component of the model (embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels) by how much it contributes to the model’s accuracy. The components are then ranked and sorted, so the smaller-budget models always use the highest-ranked subset of components from the larger model. This property is called nested weight sharing.

The method supports nesting along multiple axes: SSM (State Space Model) dimensions, embedding channels, attention heads, Mamba heads and head channels, MoE expert counts, and FFN intermediate dimensions. For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both router gate values and expert output magnitudes. This is a more principled signal than naive frequency-based pruning, which ignores how much each expert actually contributes to a layer’s output.
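The ranking idea can be illustrated with a minimal Python sketch. The importance scores, the head names, and the helper `nested_subsets` are all hypothetical; the point is only that taking a top-k prefix of one shared ranking guarantees that each smaller variant is a subset of the larger one.

```python
def nested_subsets(importance, budgets):
    """Rank components by importance and keep the top-k for each budget.

    importance: dict of component name -> importance score (hypothetical values)
    budgets: iterable of component counts, one per variant
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    # Every variant is a prefix of the same ranking, so smaller ones nest
    # inside larger ones by construction.
    return {k: ranked[:k] for k in budgets}


# Hypothetical importance scores for eight attention heads.
scores = {"head0": 0.91, "head1": 0.12, "head2": 0.55, "head3": 0.78,
          "head4": 0.05, "head5": 0.63, "head6": 0.33, "head7": 0.47}

subsets = nested_subsets(scores, budgets=[8, 6, 3])
small, medium, large = subsets[3], subsets[6], subsets[8]

print(small)   # ['head0', 'head3', 'head5']
print(set(small) <= set(medium) <= set(large))  # True
```

Because all variants index into the same parent weights, no weights are duplicated: a smaller model is just a selection over the larger one.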
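To make the contrast with frequency-based pruning concrete, here is a toy sketch of a router-weighted saliency score. It assumes the score for an expert is the mean, over the tokens routed to it, of the gate value times the L2 norm of the expert’s output; the exact formulation used in Star Elastic may differ, and all numbers below are made up.

```python
import math


def reap_saliency(gates, outputs):
    """Mean of (gate value * output L2 norm) over tokens routed to each expert.

    gates[t][e]   : router gate value of expert e for token t (0.0 if not routed)
    outputs[t][e] : output vector of expert e for token t (None if not routed)
    """
    num_experts = len(gates[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for g_row, o_row in zip(gates, outputs):
        for e in range(num_experts):
            if g_row[e] > 0.0 and o_row[e] is not None:
                norm = math.sqrt(sum(x * x for x in o_row[e]))
                totals[e] += g_row[e] * norm
                counts[e] += 1
    return [totals[e] / counts[e] if counts[e] else 0.0
            for e in range(num_experts)]


def routing_frequency(gates):
    """Baseline signal: how often each expert is selected."""
    return [sum(1 for row in gates if row[e] > 0.0) for e in range(len(gates[0]))]


# Toy data: 4 tokens, 3 experts. Expert 0 is picked often but emits tiny
# outputs; expert 1 is picked once but emits a large output.
gates = [[0.6, 0.0, 0.4], [0.7, 0.0, 0.3], [0.5, 0.5, 0.0], [0.8, 0.0, 0.2]]
tiny, big, mid = [0.1, 0.0], [3.0, 4.0], [1.0, 0.0]
outputs = [[tiny, None, mid], [tiny, None, mid], [tiny, big, None], [tiny, None, mid]]

saliency = reap_saliency(gates, outputs)
freq = routing_frequency(gates)
by_saliency = sorted(range(3), key=lambda e: saliency[e], reverse=True)
by_frequency = sorted(range(3), key=lambda e: freq[e], reverse=True)
print(by_saliency)   # [1, 2, 0]
print(by_frequency)  # [0, 2, 1]
```

In this toy case the two signals disagree completely: frequency would keep expert 0 first, while the router-weighted score keeps expert 1, the one that moves the layer output the most.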

A Learnable Router, Not a Fixed Compression Recipe

A key difference from previous compression methods such as Minitron is that Star Elastic uses an end-to-end trainable router to determine the nested submodel architectures. The router takes a target budget (e.g., “give me a model with 2.8B active parameters”) as a one-hot input and outputs differentiable masks that select which components are active at that budget level. These masks are trained jointly with the model using Gumbel-Softmax, which allows gradients to flow through discrete architectural decisions.

The loss function combines knowledge distillation (KD), in which the non-elastified parent model acts as the teacher, with a router loss that penalizes deviations from the target resource budget (parameter count, memory, or latency). This means that the router learns to make architecture choices that improve accuracy under KD, rather than simply minimizing a proxy metric.

Training uses a two-phase curriculum: a short-context phase (sequence length 8,192) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is important for reasoning performance. The research team’s ablations on Nano v2, explicitly cited as the empirical basis for choosing the same curriculum in Nano v3, show gains from Phase 2 alone of up to 19.8% on AIME-2025 for the 6B variant and 4.0% for the 12B variant, which motivates its use here.
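For readers unfamiliar with Gumbel-Softmax, the following sketch shows only the forward sampling step in plain Python. It is not the Star Elastic router: the logits and the candidate head counts are hypothetical, and in real training an autograd framework would carry gradients through the relaxed sample.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed (soft) one-hot sample over discrete choices."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) for U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
router_logits = [1.2, 0.3, -0.5]   # hypothetical scores for three options
head_options = [8, 6, 4]           # hypothetical numbers of heads to keep

soft_choice = gumbel_softmax(router_logits, temperature=1.0, rng=rng)
# A soft mask yields a differentiable "expected" head count during training.
expected_heads = sum(p * n for p, n in zip(soft_choice, head_options))
print(soft_choice, expected_heads)
```

Lowering the temperature pushes the sample toward a hard one-hot choice, which is why the relaxation can stand in for a discrete architecture decision while still being trainable.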
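A rough sketch of how such a combined objective could be written is shown below. The KL-divergence form of the distillation term, the relative-deviation form of the budget penalty, and the weighting factor `lam` are all assumptions made for illustration; the source does not give the exact formulas.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) over a single next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def budget_penalty(actual_cost, target_cost):
    """Relative deviation of the selected architecture from its target budget."""
    return abs(actual_cost - target_cost) / target_cost


def total_loss(teacher_logits, student_logits, actual_cost, target_cost, lam=1.0):
    # lam is a hypothetical weight balancing accuracy against the budget.
    return (kd_loss(teacher_logits, student_logits)
            + lam * budget_penalty(actual_cost, target_cost))


teacher = [2.0, 0.5, -1.0]   # logits from the frozen parent (made-up values)
student = [1.5, 0.7, -0.8]   # logits from a masked submodel (made-up values)

# Suppose the router picked masks totalling 3.0B active parameters
# while the requested budget was 2.8B.
loss = total_loss(teacher, student, actual_cost=3.0, target_cost=2.8)
print(loss)
```

The two terms pull in different directions: the KD term rewards keeping more capacity, while the budget term pushes the selected masks back toward the requested size.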
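The budget sampling schedule described above can be sketched in a few lines. The sequence lengths and probabilities come from the article; the dictionary layout, the phase names, and the function `sample_budget` are illustrative assumptions.

```python
import random
from collections import Counter

# Phase settings as described in the text; the structure is hypothetical.
PHASES = {
    "short_context": {
        "seq_len": 8192,
        "weights": {"30B": 1 / 3, "23B": 1 / 3, "12B": 1 / 3},  # uniform
    },
    "extended_context": {
        "seq_len": 49152,
        "weights": {"30B": 0.5, "23B": 0.3, "12B": 0.2},        # favors 30B
    },
}


def sample_budget(phase, rng):
    """Pick which nested variant a training step should optimize."""
    weights = PHASES[phase]["weights"]
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
counts = Counter(sample_budget("extended_context", rng) for _ in range(10_000))
print(counts)  # roughly 5,000 / 3,000 / 2,000 draws for 30B / 23B / 12B
```

Sampling one budget per step lets a single training run update all three variants while spending the most long-context compute on the full model.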

Elastic Budget Control: Different Models for Different Phases of Reasoning

Existing budget control in reasoning models, including the default behavior of Nemotron Nano v3, works by capping the number of tokens generated during the thinking phase before forcing a final answer. This approach uses the same model throughout. Star Elastic opens up a different strategy: use different submodels for the thinking phase and the answering phase.

The researchers tested four configurations. The best one, called ℳS → ℳL (smaller model for reasoning, larger model for the answer), assigns a cheaper model to generate the extended reasoning trace and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration in particular advances the accuracy-latency Pareto frontier, reaching up to 16% higher accuracy and 1.9× lower latency compared to the default Nemotron Nano v3 budget control. The intuition: reasoning tokens are high in volume but tolerate some reduction in capacity, while the final answer requires high precision.
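The ℳS → ℳL pattern can be sketched as a two-stage call. Everything here is hypothetical: the model callables, the `</think>` delimiter, and the token limits are stand-ins, and the source does not describe how the hand-off between submodels is implemented.

```python
def elastic_generate(prompt, think_model, answer_model,
                     think_budget=4096, answer_budget=512):
    """Reason with a smaller submodel, then answer with a larger one.

    think_model / answer_model: hypothetical callables (text, max_tokens) -> text
    """
    # Stage 1: the cheaper variant produces the long reasoning trace.
    reasoning = think_model(prompt, think_budget)
    # Stage 2: the full-capacity variant reads the trace and writes the answer.
    answer = answer_model(prompt + reasoning + "</think>", answer_budget)
    return reasoning, answer


# Stub models so the sketch runs without any real checkpoint.
calls = []


def make_stub(name):
    def model(text, max_tokens):
        calls.append((name, max_tokens))
        return f"[{name} output]"
    return model


reasoning, answer = elastic_generate(
    "What is 17 * 24?", think_model=make_stub("23B"), answer_model=make_stub("30B")
)
print(calls)  # [('23B', 4096), ('30B', 512)]
```

Since both variants live in the same checkpoint, switching between them amounts to selecting a different mask over shared weights rather than loading a second model.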

Quantization Without Breaking Nesting Structure

A naive way to quantize an elastic model would be to quantize each variant separately after slicing. That breaks the nested weight-sharing structure and requires a separate quantization pass for each size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly to the elastic checkpoint, preserving the nested mask hierarchy throughout.

For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes an average accuracy drop of 4.12%, so a short nested QAD phase (roughly 5B tokens at 48K context) brings recovery to 97.79% for the 30B variant. In both cases, zero-shot slicing of the 23B and 12B variants from a single quantized checkpoint is preserved.

The memory implications are significant. Storing separate BF16 checkpoints for 12B, 23B, and 30B requires 126.1 GB; a single elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits in 18.7 GB, allowing the 12B NVFP4 variant to run on an RTX 5080, where every BF16 configuration runs out of memory. On an RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4× throughput improvement over the 30B BF16 baseline.
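The storage and throughput figures above can be turned into ratios with simple arithmetic. The derived numbers are computed here from the article’s figures and are not reported in the source; the baseline throughput in particular is only an estimate, because the 3.4× factor is rounded.

```python
# Figures quoted in the article.
separate_bf16_gb = 126.1    # three independent BF16 checkpoints
elastic_bf16_gb = 58.9      # one elastic BF16 checkpoint
elastic_nvfp4_gb = 18.7     # one elastic NVFP4 checkpoint
nvfp4_12b_tokens_per_s = 7426
reported_speedup = 3.4

# Derived ratios (not stated in the source).
nesting_saving = separate_bf16_gb / elastic_bf16_gb       # about 2.14x
quant_saving = elastic_bf16_gb / elastic_nvfp4_gb         # about 3.15x
combined_saving = separate_bf16_gb / elastic_nvfp4_gb     # about 6.74x
implied_baseline = nvfp4_12b_tokens_per_s / reported_speedup  # about 2184 tokens/s

print(f"nesting alone:      {nesting_saving:.2f}x smaller")
print(f"NVFP4 on top:       {quant_saving:.2f}x smaller")
print(f"combined:           {combined_saving:.2f}x smaller")
print(f"implied 30B BF16:   {implied_baseline:.0f} tokens/s")
```

Nesting and quantization compound: sharing weights roughly halves storage, and NVFP4 cuts the remaining footprint by about a further factor of three.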

Depth vs. Width: Why Star Elastic Compresses Width

One design choice worth calling out explicitly: the research team compared two compression strategies, removing layers entirely (depth compression) versus reducing internal dimensions such as hidden size, expert count, and head count (width compression). At a 15% parameter reduction with 25B knowledge distillation tokens, width compression recovered 98.1% of baseline performance while depth compression recovered only 95.2%, with significant degradation on HumanEval and MMLU-Pro. As a result, Star Elastic prioritizes width-based elasticity in its main results, although depth compression (layer skipping) remains available as an option for scenarios where latency matters most.

Across the evaluation suite (AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench), the Elastic-30B variant matches its parent Nemotron Nano v3 30B on most benchmarks, while the Elastic-23B and Elastic-12B variants remain competitive with models of comparable size. Elastic-23B scores a remarkable 85.63 on AIME-2025 compared to 80.00 for Qwen3-30B-A3B, despite having fewer active parameters.

On training cost, the research team reports a 360× reduction in tokens compared to pretraining each variant from scratch, and a 7× reduction compared to prior state-of-the-art compression methods that require sequential distillation runs for each model size. The 12B variant runs at 2.4× the throughput of the 30B parent on an H100 GPU in bfloat16 with the same input/output sequence lengths.

Key Takeaways

  • Star Elastic trains nested 30B, 23B, and 12B reasoning models in a single 160B-token post-training run, achieving a 360× token reduction over training from scratch.
  • Elastic budget control (23B for reasoning, 30B for the answer) advances the accuracy-latency Pareto frontier with up to 16% higher accuracy and 1.9× lower latency.
  • A learnable router trained with Gumbel-Softmax enables end-to-end trainable architecture selection, eliminating the need for a separate compression run for each model size.
  • Nested QAD preserves zero-shot slicing across the FP8 and NVFP4 quantized checkpoints, shrinking the 30B checkpoint to 18.7 GB in NVFP4.
  • All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.

Check out the Paper and the Elastic Models on Hugging Face in BF16, FP8, and NVFP4.