Consistency diffusion language models: Up to 14x faster, no quality loss

February 19, 2026 ・ By Minseo Kim, Chenfeng Xu, Coleman Richard Charles Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami | Seoul National University, University of California, Berkeley, Together AI

Summary

We introduce consistency diffusion language models (CDLM), which accelerate diffusion language model inference by combining consistency-based multi-token finalization with block-wise KV caching, achieving up to 14.5x latency speedups on math and coding tasks.

Diffusion language models (DLMs) are emerging as a promising alternative to autoregressive (AR) LMs. Instead of generating one token at a time, DLMs iteratively refine a partially masked sequence over multiple sampling steps, gradually transforming a fully masked sequence into clean text. This refinement process creates a compelling opportunity: it enables parallel generation, allowing the model to finalize multiple tokens per iteration and potentially achieve higher throughput than AR decoding. At the same time, it can exploit bidirectional context to unlock new capabilities such as text infilling and refinement.

[Figure: Visualization of inference in CDLM, naive DLMs, and autoregressive (AR) models.]

In practice, however, standard DLMs suffer from two major inefficiencies:

1. KV caching incompatibility under full bidirectional attention. Standard DLMs commonly use bidirectional (non-causal) attention, which requires recomputing attention over the full context at every denoising step, making inference expensive and preventing standard KV caching.

2. High refinement step counts to maintain quality. High-quality generation typically requires many denoising/refinement steps, often comparable to the generation length. Naively reducing the number of steps tends to degrade quality sharply.
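The first bottleneck can be made concrete with a back-of-envelope count of attention-score computations. This is an illustrative sketch, not from the paper: it assumes one denoising step per generated token for the naive DLM (steps comparable to generation length, as noted above) and ignores the prompt and constant factors.

```python
def attention_scores_ar(L):
    # AR with KV caching: step i computes scores for 1 new query
    # against i cached keys, so the total is 1 + 2 + ... + L.
    return L * (L + 1) // 2

def attention_scores_naive_dlm(L, N):
    # Naive bidirectional DLM: no KV cache is possible, so every one
    # of the N denoising steps recomputes the full L x L attention map.
    return N * L * L

L = 512
ar = attention_scores_ar(L)              # ~L^2 / 2
dlm = attention_scores_naive_dlm(L, L)   # L^3 when N ~ L
print(f"AR: {ar}, naive DLM: {dlm}, ratio: {dlm / ar:.0f}x")
```

Under these assumptions the naive DLM does on the order of L times more attention work than cached AR decoding, which is why CDLM attacks both the step count and the cache incompatibility.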
CDLM targets both bottlenecks through a post-training recipe that makes fewer-step inference reliable while enabling exact block-wise KV caching.

Preliminary: Inference in diffusion language models

DLM generation is an iterative refinement over N discrete sampling steps. It transforms a fully masked sequence at time t=1 into a clean sequence at t=0. At each step, the model predicts a clean-sequence distribution x0 given the current noisy sequence xt and prompt c:

$p_{\theta}(\mathbf{x}_0 \mid \mathbf{x}_t, c)$

A common deterministic instantiation is low-confidence remasking: the model greedily unmasks tokens (often within blocks), finalizing the highest-confidence masked positions while keeping the others masked. This leads to the decoding trajectory:

$\mathcal{T}_{\mathbf{x}} = \left(\mathbf{x}_{t_0}, \mathbf{x}_{t_1}, \ldots, \mathbf{x}_{t_N}\right), \quad t_k = 1 - \frac{k}{N}$

which records how the partially refined sequence evolves step by step. This trajectory becomes the core object for CDLM's post-training recipe.
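The low-confidence remasking loop above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `predict` is a hypothetical stand-in for $p_\theta(\mathbf{x}_0 \mid \mathbf{x}_t, c)$ that returns a predicted token and a confidence for every position, and `MASK` is an assumed mask-token id.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id

def low_confidence_remask(predict, gen_len, n_steps):
    """Greedily unmask tokens: at each of n_steps steps, finalize the
    highest-confidence masked positions and keep the rest masked,
    recording the decoding trajectory (x_{t_0}, ..., x_{t_N})."""
    x = np.full(gen_len, MASK)                   # fully masked sequence, t = 1
    trajectory = [x.copy()]                      # x_{t_0}
    per_step = -(-gen_len // n_steps)            # tokens finalized per step
    for k in range(n_steps):
        tokens, conf = predict(x)                # stand-in for p_theta(x0 | xt, c)
        masked = np.where(x == MASK)[0]
        if masked.size == 0:
            break
        # finalize the masked positions with the highest confidence
        chosen = masked[np.argsort(-conf[masked])][:per_step]
        x[chosen] = tokens[chosen]
        trajectory.append(x.copy())              # x_{t_k}, t_k = 1 - k/N
    return x, trajectory

# toy predictor: random token ids and confidences
rng = np.random.default_rng(0)
def toy_predict(x):
    return rng.integers(0, 100, size=x.shape), rng.random(size=x.shape)

final, traj = low_confidence_remask(toy_predict, gen_len=16, n_steps=4)
```

With `gen_len=16` and `n_steps=4`, each step finalizes 4 tokens, so the trajectory holds 5 states from fully masked to fully decoded. A real DLM would run a fresh bidirectional forward pass inside `predict` at every step, which is exactly the cost CDLM's block-wise KV caching targets.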
