Under Review

ISAC: Training-Free Instance-to-Semantic Attention Control
for Multi-Instance Generation

¹OGQ ²Seoul National University
Equal contribution Corresponding authors
Qualitative comparison on multi-instance prompts: SD1.5, InitNO, Self-Cross, and ISAC

Where prior training-free methods merge or mix, ISAC separates and binds. On “two horses” and “a dog and a sheep”, SD1.5, InitNO, and Self-Cross omit, merge, or confuse instances. ISAC-LO (latent optimization) keeps every requested instance distinct.

Key Insight

Token-conditioned cross-attention can separate concepts, but it assumes instance regions have already emerged. In early denoising it cannot carve them out, so count failures and semantic mixing persist. Self-attention, by contrast, exposes class-agnostic instance layouts early. ISAC stabilizes self-attention instance layouts first, then binds cross-attention semantics within them. It is fully training-free and model-agnostic, and it uses no external vision models.

Motivation

Open-weight diffusion models still omit or merge requested objects (count failures) and leak attributes across instances (semantic mixing), especially when instances are semantically similar.

Semantic overlap of instance-aware semantic masks across class pairs grouped by supercategory

Semantic mixing is strongest within a supercategory. We build instance-aware semantic masks and measure their Dice overlap for each class pair. Overlap is consistently higher for pairs in the same supercategory (blue boxes), showing that token-level semantics spill across multiple similar objects, motivating an instance-first hierarchy.
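The overlap measurement above can be illustrated with a minimal sketch of the standard Dice coefficient between two binary masks. The function name, the toy masks, and the convention for two empty masks are assumptions made for illustration; how the instance-aware semantic masks themselves are extracted is not shown here.

```python
def dice_overlap(mask_a, mask_b):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must have the same shape")
    intersection = sum(a * b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Assumed convention: two empty masks have zero overlap.
    return 2.0 * intersection / total if total else 0.0


# Toy masks (hypothetical): two similar classes that share one column of pixels.
horse = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
zebra = [[0, 1, 1, 0],
         [0, 1, 1, 0]]

print(dice_overlap(horse, zebra))  # 0.5
```

A value near 0 means the two class masks occupy separate regions; a value near 1 means the two tokens claim almost the same pixels, which is the mixing pattern described above.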

Why It Works: Structure Before Semantics

In early denoising, instance structure emerges in self-attention while semantics are still underdeveloped. ISAC exploits this asymmetry: form instances from structure first, then assign semantics.

Dynamics of text-to-image diffusion: self-attention vs cross-attention vs Grounding DINO across timesteps

Dynamics of T2I diffusion. Early steps form class-agnostic instance structure in self-attention; semantics (cross-attention) sharpen later. Detection models such as Grounding DINO rely on strong semantic cues, so they become effective only in late steps, too late to carve out instances.

01 Abstract

Recent open-weight text-to-image (T2I) diffusion models still struggle with multi-instance prompts, often omitting or merging instances and mixing semantics among similar objects. We trace these failures to early denoising steps, before instance boundaries are reliably stabilized.

Existing training-free guidance is largely driven by cross-attention or other token-conditioned semantic signals. Such guidance can separate concepts at the token level, but largely assumes that distinct instance regions have already emerged; in early denoising steps it cannot reliably carve out these regions, so count failures and semantic mixing persist. By contrast, self-attention exposes class-agnostic instance layouts during early denoising. To exploit this asymmetry, we propose ISAC (Instance-to-Semantic Attention Control), a training-free, model-agnostic objective that first stabilizes self-attention layouts and then binds cross-attention semantics within them, without fine-tuning or external vision models.

Across T2I-CompBench, HRS-Bench, and our newly curated IntraCompBench, ISAC consistently outperforms prior training-free methods. Furthermore, ISAC enhances layout-to-image controllers by refining coarse, overlapping bounding boxes into dense instance masks.

02 Contributions

Instance-to-Semantic Hierarchy

ISAC is a training-free, model-agnostic objective that enforces an instance-to-semantic hierarchy: it separates instance formation from semantic assignment, stabilizing class-agnostic layouts from self-attention before binding cross-attention semantics within them.

Diagnosis of Semantic Mixing

We quantify semantic mixing via the Dice overlap between instance-aware semantic masks, revealing substantially higher overlap for class pairs within the same supercategory, the regime where count failures and mixing are most severe.

IntraCompBench

A new benchmark for explicit 2–5-instance counting and intra-supercategory multi-class compositions, stress-testing the similar-object cases that existing benchmarks do not isolate.

How ISAC Differs

Conceptual comparison of methods for multi-instance text-to-image generation

Conceptual comparison. ISAC is the only method that preserves instance structure first, separates semantic masks, stays diffusion-backbone agnostic, and needs no fine-tuning or extra counting model.

03 Method

Overview of ISAC: Phase 1 instance formation and Phase 2 instance-aware semantic separation

Two phases along the denoising trajectory. Phase 1 clusters the self-attention map into N class-agnostic instance masks and repels their overlap to establish clean layouts early. Phase 2 injects this stabilized structure into cross-attention to form instance-aware semantic masks, then applies a repel-and-bind loss so semantics follow instance shapes. An instance-to-semantic schedule shifts weight from Phase 1 to Phase 2 over time.

Phase 1 · Instance Formation ($L_{\mathrm{ins}}$)

Same-instance pixels attend to each other more strongly, so the most discriminative self-attention clusters reveal disjoint instance layouts. ISAC builds a foreground gate from cross-attention, clusters self-attention into N class-agnostic masks, and penalizes their worst overlap via Maximum Pixel-wise Overlap (MPO), carving out the requested number of separated regions before any class label is assigned.
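As a rough illustration of penalizing the worst overlap, the sketch below computes a maximum pixel-wise overlap over all pairs of soft masks. The exact MPO definition is not given in this text, so taking the per-pixel overlap as `min(m_i, m_j)` is an assumption, as are the function name and the toy masks. The foreground gate and the self-attention clustering step are omitted; the masks are supplied directly.

```python
from itertools import combinations


def max_pixelwise_overlap(masks):
    """Largest per-pixel overlap across all pairs of soft masks.

    masks: list of N flattened soft masks with values in [0, 1], equal length.
    The per-pixel overlap is taken as min(m_i, m_j), an assumption of this sketch.
    """
    if len(masks) < 2:
        return 0.0
    worst = 0.0
    for m_i, m_j in combinations(masks, 2):
        pair_worst = max(min(a, b) for a, b in zip(m_i, m_j))
        worst = max(worst, pair_worst)
    return worst


# Hypothetical soft masks for N = 2 instances over 4 pixels.
separated = [[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, 0.9, 0.8]]
merged = [[0.9, 0.8, 0.7, 0.0], [0.0, 0.6, 0.9, 0.8]]

print(max_pixelwise_overlap(separated))  # 0.1
print(max_pixelwise_overlap(merged))     # 0.7
```

Because only the single worst pixel contributes, lowering this value targets the exact location where two instances collide rather than averaging the penalty over the whole map.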

Phase 2 · Instance-aware Semantic Separation ($L_{\mathrm{sem}}$)

With sharp instance structure in place, ISAC injects it into cross-attention ($\mathrm{CA}_{\mathrm{ins}} = \mathrm{SA} \cdot \mathrm{CA}$) to form instance-aware semantic masks, then applies a repel-and-bind loss: tokens for different instances are pushed apart, while class and attribute tokens of the same instance are pulled together, so semantics follow instance shapes without leaking across regions.
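The injection step can be sketched on a toy example, reading the product of SA and CA as a matrix product between a pixel-by-pixel self-attention map and a pixel-by-token cross-attention map. That reading, the shapes, the function name, and the toy values are assumptions; the repel-and-bind loss itself is not implemented here.

```python
def inject_instance_structure(self_attn, cross_attn):
    """Matrix product SA @ CA: (P x P) times (P x K) gives (P x K)."""
    num_tokens = len(cross_attn[0])
    return [
        [sum(w * cross_attn[q][k] for q, w in enumerate(row)) for k in range(num_tokens)]
        for row in self_attn
    ]


# Hypothetical layout: pixels 0-1 form one instance, pixels 2-3 another.
self_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
# Noisy cross-attention; columns are the tokens ("dog", "sheep").
cross_attn = [
    [1.0, 0.0],
    [0.0, 0.5],
    [0.5, 1.0],
    [0.0, 1.0],
]

ca_ins = inject_instance_structure(self_attn, cross_attn)
print(ca_ins)  # [[0.5, 0.25], [0.5, 0.25], [0.25, 1.0], [0.25, 1.0]]
```

In this toy case each token's response becomes constant inside an instance region, so the per-token masks inherit the instance boundaries before any separation loss is applied.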

Instance-to-Semantic Schedule

A single schedule aligns the objective with diffusion dynamics: $\lambda_{\mathrm{ins}}(t) = t/T$ and $\lambda_{\mathrm{sem}}(t) = 1 - t/T$, so early steps focus on instance formation and later steps on semantic refinement. The objective plugs into latent optimization (ISAC-LO) or latent selection (ISAC-LS), with one shared step size across all models and benchmarks.
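A minimal sketch of the schedule, assuming the timestep t counts down from T (noisiest, earliest denoising step) to 0, and that the two phase losses are combined as a weighted sum. The weighted-sum form and the function names are illustrative assumptions, not a confirmed implementation.

```python
def isac_weights(t, total_steps):
    """Return (lambda_ins, lambda_sem) for timestep t, with t running from T down to 0."""
    if not 0 <= t <= total_steps:
        raise ValueError("t must lie in [0, total_steps]")
    lam_ins = t / total_steps
    return lam_ins, 1.0 - lam_ins


def combined_loss(loss_ins, loss_sem, t, total_steps):
    """Weighted sum of the two phase losses (assumed form)."""
    lam_ins, lam_sem = isac_weights(t, total_steps)
    return lam_ins * loss_ins + lam_sem * loss_sem


print(isac_weights(50, 50))  # (1.0, 0.0): earliest step, instance formation only
print(isac_weights(25, 50))  # (0.5, 0.5): midpoint
print(isac_weights(0, 50))   # (0.0, 1.0): final step, semantic refinement only
```

Under this convention no extra hyperparameter sets the hand-off point: the weights cross at the midpoint of the trajectory.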

Algorithm 1: ISAC with Latent Optimization (ISAC-LO)

04 Results

1.9× Multi-Object Gain over the diffusion baseline
+10% vs. Count-Supervised accuracy, with no extra training
52% Multi-Class Accuracy on IntraCompBench with SD3.5-M (vs. 34%)
3 Benchmarks: T2I-CompBench · HRS-Bench · IntraCompBench

Consistent Gains Across Benchmarks and Backbones

Quantitative comparison on HRS-Bench, T2I-CompBench, and IntraCompBench (Multi-Class). At inference cost comparable to other attention-control methods, ISAC-LO outperforms every prior training-free baseline on both SD1.5 and SD3.5-M, with the largest gains in the crowded intra-category regime (e.g., multi-class accuracy 25% → 52% on SD3.5-M).

Why the Order Matters: Instance Before Semantics

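| Loss schedule | $\lambda_{\mathrm{ins}}(t)$ | $\lambda_{\mathrm{sem}}(t)$ | Multi-Class | Multi-Instance |
| --- | --- | --- | --- | --- |
| Instance only | 1 | 0 | 10% | 65% |
| Semantic only | 0 | 1 | 28% | 54% |
| Fixed balance | 0.5 | 0.5 | 25% | 60% |
| Semantic → Instance | 1 − t/T | t/T | 21% | 55% |
| Instance → Semantic (ours) | t/T | 1 − t/T | 36% | 69% |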

Effect of the loss schedule (SD1.5). Forming instances first and binding semantics later (our schedule) wins on both metrics. Instance-only cannot assign semantics, and semantic-first is unstable without prior boundary stabilization.

Beats Count-Supervised Methods, Training-Free

Given the same instance-count supervision, ISAC-LO reaches 70% (SD1.4) and 76% (SDXL) instance-counting accuracy, surpassing Counting Guidance (49%) and CountGen (69%), with no fine-tuning, no auxiliary networks, and no mask labels. A VLM judge (GPT-5.5) independently reproduces these gains, confirming they are not detector-specific.

05 Applications

Layout-to-Image: Carving Dense Masks from Coarse Boxes

Applied on top of the GLIGEN controller, ISAC-LO separates adjacent and overlapping boxes early in the trajectory. On HRS-Bench it lifts counting F1 from 0.666 to 0.713 and color accuracy from 0.307 to 0.452, beating the layout-refinement baseline CAR&SAR.

Qualitative examples of applying ISAC to the GLIGEN layout-to-image controller

ISAC on GLIGEN. From coarse, overlapping reference layouts, ISAC carves dense instance masks so neighboring objects stay distinct, where the controller alone merges them.

Closing the Gap to Commercial Models

Applied to strong open-weight backbones, ISAC-LS moves Qwen-Image (multi-class 48% → 56%) and Flux.2-dev (88% → 92%) toward the upper bound set by GPT-Image-1.5 and Nano Banana 2, with no fine-tuning.

Two-dimensional comparison of average performance on IntraCompBench, open-weight + ISAC vs commercial models

Open-weight + ISAC moves toward the commercial upper bound. On IntraCompBench, ISAC-LS shifts open-weight models up and to the right on both axes, narrowing the gap to closed-source systems.

Performance comparison between open-weight models equipped with ISAC and commercial models

Gradient-Free Scaling via Latent Selection (ISAC-LS)

For large backbones where backpropagation is costly, ISAC also works as a verifier: score a batch of candidate latents with the ISAC objective and keep the best (best-of-10). This gradient-free variant lifts Flux.1-dev multi-class accuracy from 31% to 51% with no model gradients.
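The verifier idea can be sketched as a plain best-of-N selection. The sketch assumes a lower objective value means a better candidate, and `toy_isac_score` is a hypothetical stand-in for the real ISAC objective, which would be computed from the model's attention maps. The denoising step at which candidates are scored is not specified here.

```python
import random


def select_best_latent(candidates, score_fn):
    """Gradient-free selection: keep the candidate with the lowest objective value."""
    return min(candidates, key=score_fn)


def toy_isac_score(latent):
    """Hypothetical stand-in for the ISAC objective (lower is better)."""
    return sum((x - 0.5) ** 2 for x in latent)


rng = random.Random(0)
# Best-of-10: draw ten candidate latents (toy 4-dimensional vectors).
candidates = [[rng.random() for _ in range(4)] for _ in range(10)]
best = select_best_latent(candidates, toy_isac_score)
print(toy_isac_score(best))
```

Only forward evaluations of the score are needed, so the cost grows with the number of candidates rather than with backpropagation through the backbone.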

Algorithm 2: ISAC with Latent Selection (ISAC-LS)
Qualitative results of latent selection: best vs worst samples scored by the ISAC objective

ISAC as a verifier. The ISAC score cleanly separates good candidates from samples with missing instances or semantic mixing on Flux.1-dev and Qwen-Image.

Plug-and-Play Across Backbones

Few-Step Models

Even under tight 8-step and 4-step budgets, ISAC-LO boosts multi-class accuracy on Z-Image-Turbo (48% → 68%) and Flux.2-klein-4B (34% → 54%), confirming suitability for low-latency use.

Fine-Tuned Models

ISAC-LO is complementary to supervised fine-tuning: stacked on TokenCompose (8% → 36%) and IterComp (5% → 30%), it further improves multi-class accuracy.

Text and Layout

One objective serves both text-to-image backbones (SD1.5, SD3.5-M, Flux, Qwen-Image) and layout-to-image controllers (GLIGEN), with schedules fixed by design and a single shared step size.

06 Qualitative Gallery

Qualitative comparison of attention-control methods on SD1.5 and SD3.5-M

Across prompts and backbones. From simple color-shape pairs to crowded scenes of vehicles and animals, ISAC allocates distinct, spatially coherent instances to each requested class while keeping their attributes, where InitNO and Self-Cross blur boundaries or merge categories.

07 Limitation

Limitation of ISAC: depth ordering through transparent materials

No explicit 3D understanding. ISAC operates on 2D attention, so prompts requiring depth ordering through transparent materials (e.g., “two apples behind a glass bottle”) remain hard. 3D- and physics-aware extensions are left to future work.

08 Citation

BibTeX
@article{jo2025isac,
  title   = {ISAC: Training-Free Instance-to-Semantic Attention Control for Multi-Instance Generation},
  author  = {Jo, Sanghyun and Lee, Wooyeol and Lee, Ziseok and Choi, Jonghyun and Park, Jaesik and Kim, Kyungsu},
  journal = {arXiv preprint arXiv:2505.20935},
  year    = {2025}
}