Meet Flash-KMeans: An IO-Aware, Direct Method That Runs Over 200× Faster Than FAISS on GPUs

Means has been an offline tool for decades. You run it once to process the data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open source library that targets a different setting. Modern AI pipelines now call for internal training methods and touch loops. At that point, the latency of each call is more important than the FLOPs considered.

Flash-KMeans is an IO-aware implementation of Lloyd’s k-means standard methods. It doesn’t change the math, and it doesn’t balance. It only rearranges the way the algorithm moves data to the GPU. On the NVIDIA H200, the research team reported up to 17.9× end-to-end speed over the best baseline. Against NVIDIA cuML they report 33 ×. Against FAISS they report more than 200 ×.

What is Flash-KMeans

Flash-KMeans is an integrated k-means library written for the Triton GPU. It ships under Apache 2.0 and includes the pip install flash-kmeans.

The output is statistically identical to Lloyd’s standard methods for k. The speedup comes from kernel-level data flow, not from task skipping. That distinguishes it from algorithmic methods such as triangle-unevenness pruning or coreset sampling.

A typical Lloyd’s repetition has two stages. The function class calculates the distance of each point to each centroid, and selects the closest one. The update phase averages the points in each cluster to create new centroids. Both categories are simple arithmetic. For GPUs, both are limited by memory, not computation.

Two Bottles Attack

I the first bottle it is a share class. The standard code constructs a full range matrix D of N×K shape in High Bandwidth Memory (HBM). Writes the matrix, then reads it back to initialize argmin. For N=65536, K=1024, d=128, B=32, distance calculations take 2.6ms. Writing and executing D takes about 23ms. The matrix is the cost, not the math.

Flash-KMeans replaces this with FlashAssign. The design was borrowed from FlashAttention. FlashAssign broadcasts tiles of points and centroids from the HBM to the chip SRAM. Includes distance calculation with online argmin. A full N×K matrix has never been created. This reduces the outstanding IO complexity from O(NK) to O(Nd + Kd). At the kernel level, FlashAssign is up to 21.2×. On the other hand it reduced the share from 122.5ms to 5.8ms.

I the second bottle is the renewal stage of the centroid. The standard code uses scatter-style atomic additions. Each thread adds its point to the shared sum buffer marked with the batch id. Multiple threads hit the same ‘hot’ cluster at the same time. That causes atomicity and hardware serialization conflicts. The research team only measured an effective bandwidth of 50 GB/s here on the H200.

Flash-KMeans replaces this with Filter-Inverse Update. Sorts a 1D allocation vector by cluster id using argsort. Similar cluster ids then form overlapping segments. Each serial block reduces a segment on the chip, and subtracts one additional atom per segment. A heavy point matrix is never physically allowed. Atomic functions decrease from $(O((K + frac{N}{B_N})d))$ . The kernel is up to 6.3×.

Benchmark

The research team tested it on an H200 with CUDA 12.8 data, FP16, and d=128. They sweep N, K, and cluster size B. They compare with four improved bases: fast_pytorch_kmeans, fastkmeans, cuML, and FAISS.

Comparison	Acceleration has been reported	Workload context
End to end vs best base	up to 17.9 ×	N=8M, K=1024 (large N, small K)
vs NVIDIA cuML	33×	industrial library
vs FAISS	more than 200 ×	industry library
FlashAssign kernel	up to 21.2 ×	N=1M, K=8192 (share)
Sort-Inverse Update the kernel	up to 6.3 ×	N=33M, K=4096 (update)
Without the spine, a large measure	up to 10.5 ×	N=400M, K=16384 vs fastkmeans

One failure mode is important in context. The standard implementation of PyTorch runs out of memory on large K settings. They cannot form an N×K matrix. FAISS is an industry-standard library for many vector-search programs.

The library is also not working properly. At one billion points (K=32768, d=128), it completes the iteration in 41.4s, compared to 261.8s for the baseline. It uses processed stream overlay to hide the PCIe transmission behind the computer. The cache-aware compile heuristic also reduces tuning up to 175×, within 0.3% of the tuned speed.

MTP Interactive Explainer

Marktechpost · Interactive Explainer

Flash-KMeans: exact k-means, reconstructed around GPU memory

Lloyd’s statistics are the same as standard k-methods – faster due to data flow. Run a live compilation, watch the update bottleneck, and the size of the IO it removes.

17.9×end to end vs best base

33×vs NVIDIA cuML

200×+vs FAISS

1Bpoints, out-of-core

1 · Live integration

2 · Review the contention

3 · IO calculator

Data points (N) 800

Collections (K) 5

Repetition0

Centroid shift–

The situationdoing nothing

This runs Lloyd’s original methods in your browser with 2-D points. The algorithm is the same as that accelerated by Flash-KMeans – only the GPU data flow is different. Each step = one assignment + one centroid update.

Press play. A general scatter update plots if the blocks write the same “hot” centroid (red stores). The Inverse Sorting Update sorts the cluster IDs first, so that each block combines overlapping segments with a single atomic addition — no collisions.

Atoms are normalO(N·d)

Sort out the opposite atomsO(K+N/B)·d)

Measured std bandwidth50 GB/s

Kernel acceleration6.3x

Standard updates issue one atomic addition per token. Multiple threads hit the same centroid at the same time, causing contention. Sorting by cluster ID converts scatters into component-level reductions in chip memory.

General – form an N×K matrix, O(NK)–

FlashAssign – streaming input, O(Nd+Kd)–

–HBM minimum traffic step function (theory)

Use Cases

Fast and accurate k methods are changing what you can use online, not just offline.

Vector search index: FAISS builds its search indexes with k-means. Faster k-means allows you to reindex as data changes, instead of rebuilding overnight.
Little attention: Routing Transformers and Tactic collection tokens to route attention. Millisecond k-means performs this operation inside the inference loop.
KV storage compression: ClusterKV cluster tokens in the semantic space to compress the cache. Cheap assembly makes layer by layer, step by step compression work.
KV minimum measurement: Latest methods of integrating KV into codebooks, iteratively. Fast integration reduces pre-processing costs.
Diffusion Transformers: Sparse VideoGen2 concatenated k calls during forward pass. Allows tokens with semantic similarity for minimal usage.

Using it

The API shows faiss and sklearn. The bottom line includes the integral tensor (B, N, d).

import torch
from flash_kmeans import batch_kmeans_Euclid

x = torch.randn(32, 75600, 128, device="cuda", dtype=torch.float16)
cluster_ids, centers, _ = batch_kmeans_Euclid(
    x, n_clusters=1000, tol=1e-4, verbose=True
)

A scikit-style interface is available.

from flash_kmeans import FlashKMeans

km = FlashKMeans(d=128, k=8192, niter=100)
labels = km.fit_predict(large_cpu_tensor)  # device=None uses all visible GPUs

The kernel automatically ships with the shape and type of d. D sub-route carries ud≤512. A split-D router handles large distances without creating a distance matrix. Multi-GPU automatically executes on large N data stored in CPU memory.

Key Takeaways

Flash-KMeans is exact, not approximate – similar Lloyd calculations, accelerated by GPU data flow.
FlashAssign includes distance + online argmin, IO cut assignment from O(NK) to O(Nd+Kd) — up to 21.2×.
Review of Inverse Filtering sorts cluster IDs into segments, instead of scatter atoms — up to 6.3×.
Report coming 17.9× end to end33× over cuML, and more than 200× over FAISS in H200.
Scales out of-core to 1 billion points and crop tuning up to 175×.

Check it out Paper again Repo. Also, feel free to follow us Twitter and don’t forget to join our 150k+ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? contact us