Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks

The team behind Kimi.ai (Moonshot AI) recently made a significant contribution to the open AI infrastructure space: it has released FlashKDA (Flash Kimi Delta Attention), a high-performance, CUTLASS-based kernel library for Kimi Delta Attention (KDA). FlashKDA is available on GitHub under the MIT license. It delivers 1.72× to 2.22× prefill speedups over the flash-linear-attention baseline on NVIDIA H20 GPUs and serves as a drop-in backend for the popular flash-linear-attention library.

What Is Kimi Delta Attention, And Why Is It Important?

To understand FlashKDA, it helps to first understand where it sits in the LLM landscape.

Standard softmax attention has quadratic complexity with respect to sequence length, which means that as you feed longer contexts into the model, the computational cost grows very quickly. This has driven a wave of research into linear attention methods, which approximate or modify softmax attention to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI's contribution to this space: a refined delta-rule attention approach that extends Gated DeltaNet with fine-grained, channel-wise gating, allowing more efficient use of the limited memory in its recurrent state.
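For intuition, the sketch below is a minimal, unoptimized PyTorch rendering of a gated delta-rule recurrence with per-channel decay, in the spirit of Gated DeltaNet. It is purely illustrative of the family of updates KDA belongs to; it is not the exact KDA formulation, and bears no resemblance to the fused kernel.

import torch

def gated_delta_rule_reference(q, k, v, alpha, beta, scale):
    # Illustrative gated delta-rule recurrence (NOT the exact KDA update).
    # q, k:  [T, K]  queries / keys
    # v:     [T, V]  values
    # alpha: [T, K]  per-channel decay gates in (0, 1) -- fine-grained gating
    # beta:  [T]     per-step write strengths in (0, 1)
    T, K = k.shape
    V = v.shape[-1]
    S = torch.zeros(K, V)                                 # fixed-size recurrent state, O(K*V) memory
    out = torch.empty(T, V)
    for t in range(T):
        S = alpha[t].unsqueeze(-1) * S                    # decay the state channel-wise
        pred = k[t] @ S                                   # what the state currently reads out for k_t
        S = S + beta[t] * torch.outer(k[t], v[t] - pred)  # delta rule: correct the state toward v_t
        out[t] = (scale * q[t]) @ S                       # linear-attention readout
    return out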

KDA is not just a research prototype. It is the core attention mechanism in Kimi Linear, Moonshot AI's open-source hybrid model with 48B total parameters and 3B activated parameters. Kimi Linear uses a 3:1 KDA-to-MLA (Multi-Head Latent Attention) ratio – three KDA layers for each global attention layer – which reduces KV cache usage by up to 75% during generation while achieving up to 6× higher decoding throughput at a 1 million token context length compared to full attention. FlashKDA is a production-grade CUDA kernel library that makes such architectures faster at runtime.

Specifically, the KDA forward pass takes queries (q), keys (k), values (v), pre-activation gates (g), beta logits (beta), and a scale factor, writes the result into an output tensor (out), and uses the gate parameters A_log (per-head gate parameter), dt_bias (gate bias), and lower_bound (a lower bound in the range -5.0 to 0). The sigmoid on beta is applied internally by the kernel. The mechanism also supports optional initial and final recurrent states – useful when you want to carry state across chunks or requests.

The recurrent formulation means the model can process long sequences efficiently during decoding. But the prefill stage of these architectures still requires highly optimized GPU kernels – which is exactly what FlashKDA delivers.

Under the Hood: CUTLASS on Hopper

FlashKDA is built on top of CUTLASS, NVIDIA's open-source library of CUDA C++ template abstractions for high-performance linear algebra and custom kernel development. CUTLASS lets developers write kernels that take full advantage of NVIDIA's Tensor Core architecture, and it is the same framework used by libraries such as FlashAttention-3.

The library targets SM90 and above – meaning NVIDIA's Hopper architecture (H100, H20) and newer. Minimum requirements are CUDA 12.9 and PyTorch 2.4. The codebase is primarily CUDA (56.4%), with Python bindings (36.2%) and C++ glue code (6.7%).

The main API is flash_kda.fwd, which takes the following inputs:

  • q, k, v, g: bf16 tensors of shape [B, T, H, K] or [B, T, H, V] (g is the pre-activation gate)
  • beta: bf16 beta logits of shape [B, T, H] (sigmoid is applied internally)
  • scale: fp32 scaling factor
  • out: bf16 output tensor of shape [B, T, H, V]
  • A_log, dt_bias, lower_bound: fp32 gate parameters
  • initial_state, final_state: optional bf16 or fp32 recurrent states
  • cu_seqlens: int64 cumulative sequence lengths for variable-length batching

One current limitation: the kernel requires the head dimension to be K = V = 128.
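Based on the parameter list above, a call could look roughly like the sketch below. The exact signature of flash_kda.fwd (argument order, keyword names, shapes of A_log/dt_bias, return convention) is an assumption here and should be checked against the repository; only the shapes and dtypes described above are taken from the documentation.

import torch
import flash_kda  # assumed module name for the installed package

B, T, H, D = 2, 8192, 64, 128                 # the kernel currently requires K = V = 128
dev = "cuda"

q    = torch.randn(B, T, H, D, device=dev, dtype=torch.bfloat16)
k    = torch.randn(B, T, H, D, device=dev, dtype=torch.bfloat16)
v    = torch.randn(B, T, H, D, device=dev, dtype=torch.bfloat16)
g    = torch.randn(B, T, H, D, device=dev, dtype=torch.bfloat16)  # pre-activation gate
beta = torch.randn(B, T, H, device=dev, dtype=torch.bfloat16)     # sigmoid applied inside the kernel
out  = torch.empty(B, T, H, D, device=dev, dtype=torch.bfloat16)  # kernel writes the result here

A_log   = torch.randn(H, device=dev, dtype=torch.float32)         # per-head gate parameter (shape assumed)
dt_bias = torch.randn(H, device=dev, dtype=torch.float32)         # gate bias (shape assumed)
lower_bound = -5.0                                                # documented range: -5.0 to 0

# Hypothetical call; verify the real signature in the repository before use.
flash_kda.fwd(q, k, v, g, beta, scale=D ** -0.5, out=out,
              A_log=A_log, dt_bias=dt_bias, lower_bound=lower_bound)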

Variable-length batching support via cu_seqlens is the most notable feature for production use. In real-world serving, requests in a batch rarely share the same sequence length. Being able to pack multiple sequences of different lengths into a single kernel call is a critical requirement for high-throughput deployment systems.
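As an illustration of the variable-length convention, cu_seqlens holds cumulative sequence offsets in the style of FlashAttention-like varlen APIs. The sketch below, which assumes FlashKDA follows that convention, builds the offsets for the sequence lengths used in the benchmark further down.

import torch

# Per-request lengths from the varlen benchmark case below
seq_lens = [1300, 547, 2048, 963, 271, 3063]

lens = torch.tensor(seq_lens, dtype=torch.int64)
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int64)
cu_seqlens[1:] = torch.cumsum(lens, dim=0)       # [0, 1300, 1847, 3895, 4858, 5129, 8192]
cu_seqlens = cu_seqlens.to("cuda")

total_T = int(cu_seqlens[-1])                    # 8192 tokens packed into a single kernel call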

Benchmark Results: 1.72× to 2.22× on H20

Benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) at a total sequence length of T=8192, head dimension D=128, and two head-count settings: H=96 and H=64. Each benchmark was run with 30 warmup iterations, 200 calibration iterations, and 5 repetitions.

For H=96:

Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup
Fixed-length | 2.6219 | 4.5052 | 1.72×
Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95×
Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22×

For H=64:

Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup
Fixed-length | 1.6199 | 2.9587 | 1.83×
Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80×
Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18×

The 2.22× peak speedup appears in the uniform variable-length case (seq_lens=1024 × 8: eight sequences of length 1024 packed into T=8192). The fixed-length case sets the lower end of the range at 1.72×. Across both head configurations and all three sequence conditions, FlashKDA consistently outperforms the flash-linear-attention baseline kernels.
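The repository ships its own benchmark script; for readers who want to reproduce this kind of measurement on their own hardware, a generic CUDA timing harness (warmup iterations followed by timed iterations, measured with CUDA events) looks roughly like the sketch below. The wrapper names in the usage comment are hypothetical placeholders for the two implementations under comparison.

import torch

def bench(fn, warmup=30, iters=200):
    # Generic GPU kernel timing helper: warm up, then average over timed iterations.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # average milliseconds per call

# Usage (hypothetical wrappers around the two kernels):
# t_flash = bench(lambda: run_flash_kda(...))
# t_fla   = bench(lambda: run_fla_chunk_kda(...))
# print(f"speedup: {t_fla / t_flash:.2f}x")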

Integration with flash-linear-attention

One of the most compelling aspects of FlashKDA is its integration story. Once installed, FlashKDA is automatically dispatched from flash-linear-attention's chunk_kda – which means existing code that uses flash-linear-attention needs no manual changes to run the faster kernel. The integration is tracked in flash-linear-attention PR #852.

Installation is straightforward:

git clone  flash-kda
cd flash-kda
git submodule update --init --recursive
pip install -v .

The correctness test suite (tests/test_fwd.py) validates the kernel against a PyTorch reference implementation and cross-checks against flash-linear-attention. This gives AI engineers a reliable basis for verifying kernel behavior before deploying it to production.
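For teams writing their own pre-deployment checks, the usual pattern is the same one such test suites use: run the fused kernel and a trusted reference on identical inputs and compare outputs within a tolerance. A minimal sketch follows; the kernel_fn and reference_fn callables are hypothetical stand-ins for the kernel wrapper and whichever baseline you trust.

import torch

def check_against_reference(kernel_fn, reference_fn, *inputs, atol=2e-2, rtol=2e-2):
    # Compare a fused kernel against a reference implementation on the same inputs.
    # bf16 kernels accumulate differently than fp32 references, so tolerances are
    # deliberately loose here; tighten them to whatever the test suite expects.
    out_kernel = kernel_fn(*inputs).float()
    out_ref = reference_fn(*inputs).float()
    torch.testing.assert_close(out_kernel, out_ref, atol=atol, rtol=rtol)
    print("max abs diff:", (out_kernel - out_ref).abs().max().item())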

Key Takeaways

  • FlashKDA is Moonshot AI's open-source, CUTLASS-based CUDA kernel library for Kimi Delta Attention (KDA), delivering 1.72×–2.22× prefill speedups over the flash-linear-attention baseline on NVIDIA H20 GPUs.
  • KDA extends Gated DeltaNet with an efficient, fine-grained channel-wise gate. It is the core attention mechanism behind Kimi Linear, a 48B-total / 3B-active-parameter hybrid model that reduces KV cache usage by up to 75% and achieves up to 6× faster decoding at 1M context length.
  • The kernel targets SM90+ hardware (NVIDIA Hopper: H100, H20, and newer), requires CUDA 12.9+ and PyTorch 2.4+, and currently supports only the fixed head size K = V = 128.
  • Variable-length batching is natively supported via the cu_seqlens parameter, which allows multiple sequences of different lengths to be packed into a single kernel call – an important feature for high-throughput inference systems.
  • Once installed, FlashKDA is automatically dispatched from flash-linear-attention's chunk_kda, making it a drop-in performance upgrade for any existing codebase that uses the flash-linear-attention library – no architectural changes required.

Check out the GitHub Repo.
