Text-to-image diffusion models secretly encode instance boundaries in their self-attention maps during denoising. TRACE decodes these hidden cues into sharp instance edges without any annotations, points, boxes, or prompts, achieving 81× faster inference and up to +5.1 AP improvement on COCO.
High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain limited by the semantic bias of their backbones and by human priors, often producing merged or fragmented masks.
We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries.
On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation.
Self-attention in diffusion models briefly yet reliably reveals instance-level structure during denoising, unlike common vision transformers.
TRACE unifies two key ideas, Instance Emergence Point and Attention Boundary Divergence, for annotation-free instance boundary discovery.
Boosts unsupervised instance segmentation by +4.4 AP with only 6% overhead; surpasses point-supervised panoptic models by up to +7.1 PQ on VOC 2012.
Overview of TRACE. (a) Diffusion forward locates the instance emergence point t★ via a KL peak and extracts instance-aware attention; ABDiv converts it into a pseudo edge map. (b) One-step self-distillation trains an edge decoder, yielding connected boundaries at inference without IEP or ABDiv.
During the forward diffusion process, self-attention maps transition from semantic grouping to instance-level structure. IEP identifies the exact timestep where this transition peaks by maximizing the KL divergence between consecutive attention maps.
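The IEP criterion above can be sketched in a few lines. This is a minimal illustrative version, not the paper's implementation: the function name, the per-query averaging, and the assumption that attention maps arrive as row-stochastic matrices ordered along the diffusion trajectory are all ours.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Per-row KL divergence between two row-stochastic attention maps."""
    p = p + eps
    q = q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def find_iep(attn_maps):
    """Locate the Instance Emergence Point t* as the timestep whose
    self-attention map diverges most from its predecessor.

    attn_maps: list of (N, N) row-stochastic self-attention maps,
    ordered along the diffusion trajectory. Illustrative sketch only;
    names and shapes are assumptions, not the paper's API.
    """
    scores = []
    for t in range(1, len(attn_maps)):
        # Mean per-query KL between consecutive timesteps.
        scores.append(kl_divergence(attn_maps[t], attn_maps[t - 1]).mean())
    # +1 because scores[0] compares timesteps 1 and 0.
    return int(np.argmax(scores)) + 1
```

A sequence whose attention snaps from diffuse to peaked mid-trajectory yields its IEP exactly at the snap, which is the behavior the KL-peak criterion is designed to capture.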
A non-parametric score that converts instance-aware self-attention maps into boundary maps by measuring criss-cross divergence between opposite neighbors. Boundary pixels show sharp divergence while interior pixels remain stable.
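One plausible reading of the criss-cross comparison is a symmetric KL between each pixel's opposite neighbors, summed over the horizontal and vertical axes. The sketch below is an assumed formulation for illustration (the `(H, W, D)` layout and symmetric-KL choice are ours, not the paper's):

```python
import numpy as np

def abdiv(attn, eps=1e-8):
    """Attention Boundary Divergence sketch (assumed formulation).

    attn: (H, W, D) array; each pixel holds a normalized attention
    distribution over D reference tokens. The score at each pixel is
    the symmetric KL between its opposite neighbors, horizontal plus
    vertical: boundary pixels diverge, interior pixels stay flat.
    """
    def sym_kl(p, q):
        p = p + eps
        q = q + eps
        return np.sum(p * np.log(p / q) + q * np.log(q / p), axis=-1)

    H, W, _ = attn.shape
    edge = np.zeros((H, W))
    # Horizontal: compare left and right neighbors of each interior pixel.
    edge[:, 1:-1] += sym_kl(attn[:, :-2], attn[:, 2:])
    # Vertical: compare up and down neighbors.
    edge[1:-1, :] += sym_kl(attn[:-2, :], attn[2:, :])
    return edge
```

Because both neighbors of an interior pixel share the same attention distribution, the score vanishes there and fires only where the distribution flips across an object boundary.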
Pseudo edge maps are distilled into a lightweight decoder via LoRA fine-tuning, enabling single-pass inference at t=0. This achieves 81× speedup (3682ms → 45ms) while improving edge connectivity.
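The distillation step can be illustrated with a toy stand-in: a per-pixel linear head trained with binary cross-entropy against ABDiv pseudo edges, mimicking how pseudo labels from the slow diffusion pass supervise a fast one-step decoder. This is deliberately not the paper's LoRA-tuned decoder; every name and hyperparameter here is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distill_edge_decoder(features, pseudo_edges, steps=200, lr=0.5):
    """Toy self-distillation sketch (not the paper's LoRA decoder).

    features: (N, D) per-pixel features from a single t=0 pass.
    pseudo_edges: (N,) soft/binary edge targets produced offline
    (e.g., by ABDiv on the IEP attention maps).
    Fits a linear edge head by gradient descent on BCE.
    """
    N, D = features.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(features @ w + b)
        grad = p - pseudo_edges          # dBCE/dlogit
        w -= lr * (features.T @ grad) / N
        b -= lr * grad.mean()
    return w, b
```

At inference only the cheap head runs on t=0 features, which is the source of the one-pass speedup: the expensive IEP search and ABDiv scoring are paid once, at training time.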
Emergence of instance cues in diffusion attention. Cross-attention remains semantic even with explicit prompts, whereas self-attention at specific timesteps reveals instance-level structure, a hidden capability we unlock with TRACE.
TRACE consistently improves existing UIS baselines across all benchmarks with APmk gains of +3.6 to +5.3.
With only image-level tags, TRACE+DHR surpasses point-supervised methods on both VOC and COCO.
Qualitative Comparison
Coming soon
TRACE instance edges reconnect fragmented masks and separate adjacent objects. White dotted circles mark corrected boundaries.
Even the smallest diffusion model (PixArt-α, 0.6B) significantly outperforms the massive 72B-parameter Qwen2.5-VL, confirming that TRACE leverages the unique generative nature of diffusion models.
@inproceedings{jo2026trace,
title = {TRACE: Your Diffusion Model Is Secretly an Instance Edge Detector},
author = {Jo, Sanghyun and Lee, Ziseok and Lee, Wooyeol and Choi, Jonghyun and Park, Jaesik and Kim, Kyungsu},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
note = {Oral Presentation}
}