STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

Yuzhe Chen1, Jiale Cao1, Xuyang Liu2, Jin Xie3, Aiping Yang1, Yanwei Pang1
1Tianjin University 2Sichuan University 3Chongqing University
A training-free and cache-compatible spatio-temporal stability guided decoder that replaces global confidence thresholding with token-adaptive thresholds.
STDec teaser
10.03× average speedup on Dream-7B-Instruct
7.60× average speedup on LLaDA-8B-Instruct
3.57× average speedup on LaViDa-Reason

Highlight of STDec. Compared with existing dLLM acceleration methods, STDec achieves up to 47.80× speedup in long-context settings, delivers average speedups of 10.03× on Dream and 7.60× on LLaDA, and raises multimodal throughput to 3.57× the baseline on LaViDa-Reason, while preserving comparable generation quality.

Contributions

(1) Spatio-temporal stability as a decoding signal: We show that dLLM decoding exhibits strong spatial stability and temporal stability: newly decoded tokens tend to appear near decoded neighbors, and many token IDs remain consistent for multiple denoising steps before the final commit.

(2) A simple training-free decoder: We propose STDec, which combines spatial-aware decoding and temporal-aware decoding to replace a single global confidence threshold with token-adaptive thresholds derived from local decoded states and historical ID consistency.

(3) Strong efficiency-quality trade-off across modalities: Experiments on Dream, LLaDA, and LaViDa show that STDec delivers substantial throughput gains on textual reasoning, code generation, and multimodal understanding, while remaining compatible with cache-based acceleration methods.

Motivation

Existing dLLM decoders usually rely on a single global selection rule, such as top-k decoding or a fixed confidence threshold. These strategies treat all masked positions equally, even though the denoising process is highly structured. In practice, whether a token is decoded at a given step depends strongly on whether it lies near already decoded neighbors and whether its predicted ID has stabilized over previous steps.

STDec motivation

Comparison with existing decoding strategies. Top-k and anchor-based decoding policies only partially leverage the spatio-temporal stability signals that emerge during denoising.
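As a rough illustration of the two global rules named above, the sketch below selects positions to commit from per-position confidences. The function names, the use of `None` for already decoded positions, and the example values are assumptions made for this illustration; they are not taken from STDec or from any dLLM library.

```python
from typing import List, Optional

def select_fixed_threshold(conf: List[Optional[float]], tau: float) -> List[int]:
    """Commit every masked position whose confidence exceeds one global tau."""
    return [i for i, c in enumerate(conf) if c is not None and c > tau]

def select_top_k(conf: List[Optional[float]], k: int) -> List[int]:
    """Commit the k most confident masked positions, regardless of their value."""
    masked = [(c, i) for i, c in enumerate(conf) if c is not None]
    # Sort by descending confidence; break ties by position for determinism.
    masked.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in masked[:k])

# None marks positions that are already decoded (hypothetical convention).
conf = [None, 0.95, 0.40, 0.85, None, 0.70]
print(select_fixed_threshold(conf, tau=0.8))  # [1, 3]
print(select_top_k(conf, k=1))                # [1]
```

Both rules look only at the confidence value; neither uses a position's neighborhood or its prediction history.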

Spatial consistency analysis

Spatial stability. More than 90% of decoded tokens appear near at least one previously decoded neighbor, suggesting that positions with decoded neighbors are safer to decode.
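One way such a statistic could be measured is sketched below. It assumes a decoding trace is available as a list `decode_step`, where `decode_step[i]` is the step at which position `i` was committed, and it counts a token as spatially supported if some position within `radius` was committed at a strictly earlier step. The trace format, the `radius` parameter, and the strict-inequality rule are assumptions; the exact protocol behind the 90% figure is not specified here.

```python
from typing import List

def spatial_stability_rate(decode_step: List[int], radius: int = 1) -> float:
    """Fraction of tokens that had an earlier-decoded neighbor within `radius`."""
    n = len(decode_step)
    if n == 0:
        return 0.0
    supported = 0
    for i, step in enumerate(decode_step):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        # A neighbor counts only if it was committed strictly before this token.
        if any(decode_step[j] < step for j in range(lo, hi) if j != i):
            supported += 1
    return supported / n

# Toy trace: positions 0 and 4 are decoded first, the rest follow.
print(spatial_stability_rate([0, 1, 1, 2, 0, 3]))  # 0.5
```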

Temporal consistency analysis

Temporal stability. For nearly 95% of decoded tokens, the Top-1 predicted ID was already stable for at least one step before decoding, indicating that many tokens could be committed earlier than the step at which they are finally decoded.
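A comparable measurement for temporal stability might look like the following sketch. It assumes `top1_history[t][i]` holds the Top-1 predicted ID at step `t` for position `i`, and that `decode_step[i]` is the step at which position `i` was committed. These data structures and the `min_steps` parameter are hypothetical, introduced only to make the definition concrete.

```python
from typing import List

def stable_steps_before_commit(top1_history: List[List[int]],
                               decode_step: List[int], i: int) -> int:
    """Number of consecutive earlier steps whose Top-1 ID equals the committed ID."""
    step = decode_step[i]
    final_id = top1_history[step][i]
    run, t = 0, step - 1
    while t >= 0 and top1_history[t][i] == final_id:
        run += 1
        t -= 1
    return run

def temporal_stability_rate(top1_history: List[List[int]],
                            decode_step: List[int], min_steps: int = 1) -> float:
    """Fraction of tokens whose Top-1 ID was stable for at least `min_steps`."""
    n = len(decode_step)
    if n == 0:
        return 0.0
    stable = sum(
        stable_steps_before_commit(top1_history, decode_step, i) >= min_steps
        for i in range(n)
    )
    return stable / n

# Three steps, three positions; only position 1 keeps its ID (8) before commit.
history = [[5, 7, 9], [5, 8, 9], [5, 8, 2]]
print(temporal_stability_rate(history, decode_step=[0, 2, 2]))
```

In this toy trace one of three tokens is stable before its commit step, so the rate is one third.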

Method

STDec computes token-adaptive decoding thresholds by combining local decoded context with cross-step prediction consistency. Spatial-aware decoding constructs a threshold map from decoded and masked states, then smooths it so positions near decoded neighbors become easier to commit. Temporal-aware decoding further relaxes thresholds for masked tokens whose predicted IDs remain consistent across denoising steps. The final decoder commits a token when its confidence exceeds its own spatio-temporal threshold rather than a single global one.

STDec framework

Overall framework of STDec. The decoder first builds a spatially smoothed threshold map from current decoding states, then applies temporal relaxation to tokens with stable predicted IDs, and finally decodes tokens using the resulting token-adaptive thresholds.
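The three stages described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the raw threshold values `tau_high` and `tau_low`, the box-filter smoothing with `radius`, the subtractive relaxation `delta`, and all function names are hypothetical choices made to show how token-adaptive thresholds could arise.

```python
from typing import List

def spatial_thresholds(decoded: List[bool], tau_high: float, tau_low: float,
                       radius: int = 1) -> List[float]:
    """Build a raw threshold map, then smooth it with a box filter (assumed)."""
    n = len(decoded)
    # Assumption: decoded positions contribute a low value, masked ones a high value.
    raw = [tau_low if d else tau_high for d in decoded]
    smoothed = []
    for i in range(n):
        window = raw[max(0, i - radius):min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed

def temporal_relax(thresholds: List[float], stable_steps: List[int],
                   delta: float, min_stable: int = 1) -> List[float]:
    """Lower the threshold of positions whose Top-1 ID has been stable (assumed rule)."""
    return [max(0.0, t - delta) if s >= min_stable else t
            for t, s in zip(thresholds, stable_steps)]

def stdec_like_step(conf: List[float], decoded: List[bool], stable_steps: List[int],
                    tau_high: float = 0.9, tau_low: float = 0.5,
                    radius: int = 1, delta: float = 0.1) -> List[int]:
    """Return masked positions whose confidence exceeds their own threshold."""
    thr = spatial_thresholds(decoded, tau_high, tau_low, radius)
    thr = temporal_relax(thr, stable_steps, delta)
    return [i for i, c in enumerate(conf) if not decoded[i] and c > thr[i]]

decoded = [True, False, False, False]
conf = [1.0, 0.80, 0.85, 0.85]
stable_steps = [0, 0, 0, 2]
print(stdec_like_step(conf, decoded, stable_steps))  # [1, 3]
```

In this toy input, position 1 is committed because it sits next to a decoded token and position 3 because its prediction has been stable, while position 2 has neither kind of support and stays masked. A single global threshold of 0.9 would commit none of them.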

Main Results

Main textual reasoning results on Dream-7B-Instruct and LLaDA-8B-Instruct. The table reports throughput (TPS), speedup relative to the vanilla decoder, and task score on code generation and mathematical reasoning benchmarks. STDec consistently delivers the strongest efficiency-quality trade-off among the compared training-free decoding baselines.
Task       Method       Dream-7B-Instruct       LLaDA-8B-Instruct
                        TPS     Speed   Score   TPS     Speed   Score
Code
MBPP       Vanilla      6.57    1.00×   51.40   5.68    1.00×   37.60
           + Half-Step  13.11   2.00×   35.80   11.51   2.02×   34.80
           + dKV-Cache  11.12   1.69×   49.60   10.18   1.79×   39.00
           + Fast-dLLM  56.88   8.66×   54.80   62.68   11.04×  37.40
           + LocalLeap  63.65   9.69×   53.60   76.48   13.46×  37.60
           + STDec      91.16   13.88×  55.60   80.47   14.17×  38.40
HumanEval  Vanilla      11.82   1.00×   59.15   10.82   1.00×   48.17
           + Half-Step  23.69   2.00×   35.37   21.67   2.00×   35.37
           + dKV-Cache  15.21   1.29×   56.10   14.23   1.32×   46.95
           + Fast-dLLM  46.87   3.97×   62.20   38.49   3.56×   48.78
           + LocalLeap  53.80   4.55×   58.54   50.33   4.65×   46.34
           + STDec      64.50   5.46×   60.37   52.92   4.89×   48.78
Mathematics & Science
GPQA       Vanilla      6.95    1.00×   32.83   6.24    1.00×   28.79
           + Half-Step  13.88   2.00×   32.32   12.66   2.02×   30.81
           + dKV-Cache  13.21   1.90×   33.33   12.33   1.98×   32.32
           + Fast-dLLM  88.95   12.80×  34.85   62.61   10.03×  28.79
           + LocalLeap  149.15  21.46×  33.33   74.60   11.96×  29.29
           + STDec      193.62  27.86×  32.83   92.08   14.76×  29.29
GSM8K      Vanilla      4.71    1.00×   83.47   4.19    1.00×   78.01
           + Half-Step  9.41    2.00×   74.22   8.25    1.97×   75.82
           + dKV-Cache  9.83    2.09×   79.08   8.82    2.11×   77.63
           + Fast-dLLM  17.20   3.65×   82.94   12.16   2.90×   78.77
           + LocalLeap  20.71   4.40×   82.49   17.01   4.06×   77.98
           + STDec      23.70   5.03×   82.34   16.94   4.04×   78.01
MATH       Vanilla      12.67   1.00×   44.64   11.90   1.00×   40.38
           + Half-Step  25.31   2.00×   39.40   23.64   1.99×   39.44
           + dKV-Cache  15.56   1.23×   44.04   14.73   1.24×   40.90
           + Fast-dLLM  40.26   3.18×   44.16   39.00   3.28×   41.00
           + LocalLeap  52.35   4.13×   44.26   49.93   4.20×   39.66
           + STDec      55.15   4.35×   44.64   53.05   4.46×   39.86
Average    Vanilla      8.54    1.00×   54.30   7.77    1.00×   46.59
           + Half-Step  17.08   2.00×   43.42   15.55   2.00×   43.25
           + dKV-Cache  12.99   1.52×   52.43   12.06   1.55×   47.36
           + Fast-dLLM  50.03   5.86×   55.79   42.99   5.53×   46.95
           + LocalLeap  67.93   7.95×   54.44   53.67   6.91×   46.17
           + STDec      85.63   10.03×  55.16   59.09   7.60×   46.87
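The Speed column can be reproduced from the TPS column as a ratio to the Vanilla row of the same task. The short check below uses the Dream-7B-Instruct numbers for STDec from the table above. It also suggests that the Average row is the ratio of the averaged TPS values rather than the mean of the per-task speedups; that reading is inferred from the reported numbers and is not stated explicitly in the table.

```python
def speedup(tps: float, baseline_tps: float) -> float:
    """Relative speedup of a method over the vanilla decoder."""
    return tps / baseline_tps

def mean(values):
    return sum(values) / len(values)

# Dream-7B-Instruct TPS for MBPP, HumanEval, GPQA, GSM8K, MATH (from the table).
dream_vanilla = [6.57, 11.82, 6.95, 4.71, 12.67]
dream_stdec = [91.16, 64.50, 193.62, 23.70, 55.15]

# Per-task speedup on MBPP matches the reported 13.88x.
print(round(speedup(dream_stdec[0], dream_vanilla[0]), 2))  # 13.88

# Averaged TPS matches the Average row (85.63 and 8.54).
print(round(mean(dream_stdec), 2), round(mean(dream_vanilla), 2))

# Ratio of the reported average TPS values gives the reported 10.03x.
print(round(speedup(85.63, 8.54), 2))  # 10.03

# The mean of per-task speedups is noticeably larger (about 11.3x).
mean_of_ratios = mean([speedup(s, v) for s, v in zip(dream_stdec, dream_vanilla)])
print(round(mean_of_ratios, 1))
```

Because GPQA has a much larger speedup than the other tasks, the two ways of averaging differ, so it matters which one a headline number refers to.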
Main multimodal understanding results on LaViDa-Reason, comparing STDec with Prefix-DLM, Fast-dLLM, and LocalLeap on MathVerse, MathVision, and MathVista. STDec achieves the highest throughput while maintaining competitive multimodal reasoning quality.
Task        Method                  TPS     Speed   Score
MathVerse   LaViDa w/o Prefix-DLM   5.71    1.00×   28.30
            + Prefix-DLM            12.22   2.14×   27.03
            + Fast-dLLM             11.94   2.09×   28.68
            + LocalLeap             14.31   2.51×   28.30
            + STDec                 18.22   3.19×   28.30
MathVision  LaViDa w/o Prefix-DLM   5.34    1.00×   19.74
            + Prefix-DLM            12.09   2.26×   20.39
            + Fast-dLLM             10.87   2.04×   20.72
            + LocalLeap             12.81   2.40×   21.71
            + STDec                 16.71   3.13×   21.71
MathVista   LaViDa w/o Prefix-DLM   5.66    1.00×   47.20
            + Prefix-DLM            12.38   2.19×   40.90
            + Fast-dLLM             15.31   2.70×   47.50
            + LocalLeap             18.19   3.21×   47.70
            + STDec                 24.66   4.36×   46.20
Average     LaViDa (baseline)       5.57    1.00×   31.75
            + STDec                 19.86   3.57×   32.07
LLaDA speed bars

Composability on LLaDA. STDec can be stacked with dKV-Cache and still adds substantial extra throughput gains over the cache baseline.

LaViDa speed bars

Composability on LaViDa. STDec also complements Prefix-DLM style caching on multimodal understanding, improving efficiency without disrupting output quality.

Case Study

Multimodal

Case study on multimodal understanding with LaViDa-Reason

STDec preserves the same key objects and relative spatial relations in the generated scene description while substantially reducing decoding time on LaViDa-Reason.

Question 1

Case study on GSM8K with LLaDA-8B-Instruct

Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?

LLaDA-8B-Instruct 12.04 secs

To determine how many kilometers Lily can run in 8 hours, we need to break down the problem into two parts: the distance she runs in the first 4 hours and the distance she runs in the next 4 hours.

First, let's calculate the distance she runs in the first 4 hours:

- She runs 12 kilometers per hour for 4 hours.
- Distance = Speed × Time
- Distance in the first 4 hours = 12 km/h × 4 h = 48 km

Next, let's calculate the distance she runs in the next 4 hours:

- She runs 6 kilometers per hour for 4 hours.
- Distance = Speed × Time
- Distance in the next 4 hours = 6 km/h × 4 h = 24 km

Now, we add the distances from both parts to find the total distance she can run in 8 hours:

- Total distance = Distance in the first 4 hours + Distance in the next 4 hours
- Total distance = 48 km + 24 km = 72 km

Therefore, Lily can run a total of 72 kilometers in 8 hours.

+ STDec 2.55 secs (4.72×)

To determine how many kilometers Lily can run in 8 hours, we need to break down the problem into two parts: the distance she runs in the first 4 hours and the distance she runs in the next 4 hours.

First, let's calculate the distance she runs in the first 4 hours:

- Speed: 12 kilometers per hour
- Time: 4 hours
- Distance = Speed × Time
- Distance = 12 km/h × 4 h
- Distance = 48 kilometers

Next, let's calculate the distance she runs in the next 4 hours:

- Speed: 6 kilometers per hour
- Time: 4 hours
- Distance = Speed × Time
- Distance = 6 km/h × 4 h
- Distance = 24 kilometers

Now, we add the distances from both parts to find the total distance she can run in 8 hours:

- Total distance = Distance in the first 4 hours + Distance in the next 4 hours
- Total distance = 48 km + 24 km
- Total distance = 72 kilometers

Therefore, Lily can run 72 kilometers in 8 hours.

Question 2

Case study on GSM8K with LLaDA-8B-Instruct

Joy can read 8 pages of a book in 20 minutes. How many hours will it take her to read 120 pages?

LLaDA-8B-Instruct 11.94 secs

To determine how many hours it will take Joy to read 120 pages, we need to follow these steps:

Step 1: Calculate Joy's reading rate in pages per minute.
Joy reads 8 pages in 20 minutes. Therefore, her reading rate is 8 / 20 = 0.4 pages per minute.

Step 2: Determine the total time required to read 120 pages.
Since Joy's reading rate is 0.4 pages per minute, the time required to read 120 pages is 120 / 0.4 = 300 minutes.

Step 3: Convert the total time from minutes to hours.
There are 60 minutes in an hour, so 300 / 60 = 5 hours.

Therefore, it will take Joy 5 hours to read 120 pages.

+ STDec 2.60 secs (4.59×)

To determine how many hours it will take Joy to read 120 pages, we need to follow these steps:

Step 1: Calculate Joy's reading rate in pages per minute.
Joy reads 8 pages in 20 minutes. Therefore, her reading rate is 8 / 20 = 0.4 pages per minute.

Step 2: Determine the total time required to read 120 pages.
Since Joy reads at a rate of 0.4 pages per minute, the total time required to read 120 pages is 120 / 0.4 = 300 minutes.

Step 3: Convert the total time from minutes to hours.
Total time in hours = 300 / 60 = 5 hours.

Final answer: 5.

Citation

BibTeX
@article{chen2026stdec,
  title={STDec: Spatio-Temporal Stability Guided Decoding for dLLMs},
  author={Chen, Yuzhe and Cao, Jiale and Liu, Xuyang and Xie, Jin and Yang, Aiping and Pang, Yanwei},
  journal={arXiv preprint},
  year={2026}
}