ToMA: Token Merge with Attention for Diffusion Models

Published in ICML 2025, PMLR 267:40930–40951

Proposed ToMA, a GPU-aligned token merging framework that reformulates merging as an attention-like linear transformation with an invertible unmerge, enabling practical acceleration of diffusion models. Applied submodular optimization to select representative tokens, providing theoretical guarantees on information coverage while improving efficiency and generation fidelity. Co-designed GPU-efficient merging/unmerging as pure attention-like matrix operations to minimize overhead, exploiting spatial locality and temporal redundancy across layers and timesteps. Achieved notable acceleration on both UNet and DiT architectures: up to 1.3× speedup on Flux and 1.4× on SDXL with no quality degradation as measured by FID, CLIP, and DINO scores.
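The mechanism can be pictured with a short PyTorch sketch: greedily select representative tokens under a facility-location-style submodular objective, build an attention-like merge matrix from token-to-representative similarities, and use a pseudo-inverse as one way to realize an (approximately) invertible linear unmerge. The function names (`greedy_token_selection`, `build_merge_unmerge`) and the temperature `tau` are illustrative assumptions, not the paper's released API; this is a minimal sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def greedy_token_selection(x, k):
    """Pick k representative tokens by greedy facility-location maximization.

    x: (N, d) token features. Returns indices of the selected tokens.
    Illustrative submodular-selection sketch, not the paper's exact routine.
    """
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.T                                  # (N, N) cosine similarities
    cov = torch.zeros(x.shape[0], device=x.device)   # best similarity to any selected token
    selected = []
    for _ in range(k):
        gain = (sim - cov).clamp_min(0).sum(dim=1)   # marginal coverage gain per candidate
        if selected:
            gain[selected] = float("-inf")           # do not reselect a token
        c = int(gain.argmax())
        selected.append(c)
        cov = torch.maximum(cov, sim[c])
    return torch.tensor(selected, device=x.device)

def build_merge_unmerge(x, idx, tau=0.1):
    """Attention-like merge matrix (k, N) and a linear unmerge (N, k).

    The merge is a softmax over representative-to-token similarities; the unmerge
    shown here is a pseudo-inverse, one way to obtain an (approximately)
    invertible linear map back to the full token set.
    """
    xn = F.normalize(x, dim=-1)
    sim = xn[idx] @ xn.T / tau                       # (k, N) representative-to-token similarities
    merge = torch.softmax(sim, dim=1)                # each merged token = weighted average of tokens
    unmerge = torch.linalg.pinv(merge)               # (N, k) linear map back to N tokens
    return merge, unmerge

# Toy usage: shrink the token set, run a (placeholder) heavy block, expand back.
N, d, k = 1024, 64, 256
x = torch.randn(N, d)
idx = greedy_token_selection(x, k)
merge, unmerge = build_merge_unmerge(x, idx)
merged = merge @ x                                   # (k, d) compressed tokens
processed = merged                                   # stand-in for attention/MLP on k << N tokens
x_restored = unmerge @ processed                     # (N, d) approximate full token set
```

Because both merge and unmerge reduce to dense matrix multiplies, they run on the same GPU-friendly kernels as attention itself, which is the overhead-minimizing co-design described above; the same matrices can also be reused across adjacent layers and timesteps to exploit the spatial and temporal redundancy the summary mentions.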

Recommended citation: Lu, W.*, Zheng, S.*, Xia, Y., & Wang, S. (2025). "ToMA: Token Merge with Attention for Diffusion Models." ICML 2025, PMLR 267:40930–40951.