Text-to-image diffusion models secretly encode instance boundaries in their self-attention maps during denoising. TRACE decodes these hidden cues into sharp instance edges without any annotations, points, boxes, or prompts, achieving 81× faster inference and up to +5.1 AP improvement on COCO.
High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain limited by the semantic bias of their backbones and by human priors, often producing merged or fragmented masks.
We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries.
On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation.
Self-attention in diffusion models briefly yet reliably reveals instance-level structure during denoising, unlike common vision transformers.
TRACE unifies two key ideas, Instance Emergence Point and Attention Boundary Divergence, for annotation-free instance boundary discovery.
Boosts unsupervised instance segmentation by +4.4 AP with only 6% overhead; surpasses point-supervised panoptic models by up to +7.1 PQ on VOC 2012.
Overview of TRACE. (a) Diffusion forward locates the instance emergence point t★ via a KL peak and extracts instance-aware attention; ABDiv converts it into a pseudo edge map. (b) One-step self-distillation trains an edge decoder, yielding connected boundaries at inference without IEP or ABDiv.
During the forward diffusion process, self-attention maps transition from semantic grouping to instance-level structure. IEP identifies the exact timestep where this transition peaks by maximizing the KL divergence between consecutive attention maps.
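The IEP criterion above can be sketched in a few lines. This is a minimal illustrative version, not the paper's implementation: the function name, the per-query averaging, and the assumption that attention maps arrive as row-stochastic matrices ordered along the diffusion trajectory are all ours.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Per-row KL divergence between two row-stochastic attention maps."""
    p = p + eps
    q = q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def find_iep(attn_maps):
    """Locate the Instance Emergence Point t* as the timestep whose
    self-attention map diverges most from its predecessor.

    attn_maps: list of (N, N) row-stochastic self-attention maps,
    ordered along the diffusion trajectory. Illustrative sketch only;
    names and shapes are assumptions, not the paper's API.
    """
    scores = []
    for t in range(1, len(attn_maps)):
        # Mean per-query KL between consecutive timesteps.
        scores.append(kl_divergence(attn_maps[t], attn_maps[t - 1]).mean())
    # +1 because scores[0] compares timesteps 1 and 0.
    return int(np.argmax(scores)) + 1
```

A sequence whose attention snaps from diffuse to peaked mid-trajectory yields its IEP exactly at the snap, which is the behavior the KL-peak criterion is designed to capture.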
A non-parametric score that converts instance-aware self-attention maps into boundary maps by measuring criss-cross divergence between opposite neighbors. Boundary pixels show sharp divergence while interior pixels remain stable.
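One plausible reading of the criss-cross comparison is a symmetric KL between each pixel's opposite neighbors, summed over the horizontal and vertical axes. The sketch below is an assumed formulation for illustration (the `(H, W, D)` layout and symmetric-KL choice are ours, not the paper's):

```python
import numpy as np

def abdiv(attn, eps=1e-8):
    """Attention Boundary Divergence sketch (assumed formulation).

    attn: (H, W, D) array; each pixel holds a normalized attention
    distribution over D reference tokens. The score at each pixel is
    the symmetric KL between its opposite neighbors, horizontal plus
    vertical: boundary pixels diverge, interior pixels stay flat.
    """
    def sym_kl(p, q):
        p = p + eps
        q = q + eps
        return np.sum(p * np.log(p / q) + q * np.log(q / p), axis=-1)

    H, W, _ = attn.shape
    edge = np.zeros((H, W))
    # Horizontal: compare left and right neighbors of each interior pixel.
    edge[:, 1:-1] += sym_kl(attn[:, :-2], attn[:, 2:])
    # Vertical: compare up and down neighbors.
    edge[1:-1, :] += sym_kl(attn[:-2, :], attn[2:, :])
    return edge
```

Because both neighbors of an interior pixel share the same attention distribution, the score vanishes there and fires only where the distribution flips across an object boundary.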
Pseudo edge maps are distilled into a lightweight decoder via LoRA fine-tuning, enabling single-pass inference at t=0. This achieves 81× speedup (3682ms → 45ms) while improving edge connectivity.
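The distillation step can be illustrated with a toy stand-in: a per-pixel linear head trained with binary cross-entropy against ABDiv pseudo edges, mimicking how pseudo labels from the slow diffusion pass supervise a fast one-step decoder. This is deliberately not the paper's LoRA-tuned decoder; every name and hyperparameter here is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def distill_edge_decoder(features, pseudo_edges, steps=200, lr=0.5):
    """Toy self-distillation sketch (not the paper's LoRA decoder).

    features: (N, D) per-pixel features from a single t=0 pass.
    pseudo_edges: (N,) soft/binary edge targets produced offline
    (e.g., by ABDiv on the IEP attention maps).
    Fits a linear edge head by gradient descent on BCE.
    """
    N, D = features.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(features @ w + b)
        grad = p - pseudo_edges          # dBCE/dlogit
        w -= lr * (features.T @ grad) / N
        b -= lr * grad.mean()
    return w, b
```

At inference only the cheap head runs on t=0 features, which is the source of the one-pass speedup: the expensive IEP search and ABDiv scoring are paid once, at training time.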
Emergence of instance cues in diffusion attention. Cross-attention remains semantic even with explicit prompts, whereas self-attention at specific timesteps reveals instance-level structure, a hidden capability we unlock with TRACE.
TRACE consistently improves existing UIS baselines across all benchmarks with APmk gains of +3.6 to +5.3.
With only image-level tags, TRACE+DHR surpasses point-supervised methods on both VOC and COCO.
Qualitative Comparison
Coming soon
TRACE instance edges reconnect fragmented masks and separate adjacent objects. White dotted circles mark corrected boundaries.
Even the smallest diffusion model (PixArt-α, 0.6B) significantly outperforms the massive 72B-parameter Qwen2.5-VL, confirming that TRACE leverages the unique generative nature of diffusion models.
@inproceedings{jo2026trace,
title = {TRACE: Your Diffusion Model Is Secretly an Instance Edge Detector},
author = {Jo, Sanghyun and Lee, Ziseok and Lee, Wooyeol and Choi, Jonghyun and Park, Jaesik and Kim, Kyungsu},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
note = {Oral Presentation}
}