Chain-of-Prompts: One Click per Cell Type Suffices

Group Prompting (SAM3 + CoP) segments every cell in 3 clicks; per-instance prompting (SAM3) needs 245.

Key Insight

SAM's frozen image encoder already clusters same-type cells in its feature space before any prompt is given. CoP exploits this property to propagate a single click per cell type to all instances of that type, retaining 92.7% of per-instance performance with 81.7× fewer prompts, fully training-free.

Motivation

Cell-specific models break on unseen cell types, while interactive foundation models such as SAM3 require one click per instance. Neither path scales to histopathology images with hundreds of cells.

Pretrained models fail on unseen cell types; SAM3 needs per-instance clicks

Existing pipelines force a hard trade-off. Cell-specific baselines and open-vocabulary detectors miss out-of-distribution cell types (red dashed boxes), while SAM3 generalizes only if the user clicks every cell (e.g., 245 clicks per image).

Our Approach: Group Prompting

We replace per-instance clicking, the long-standing O(N) bottleneck, with Group Prompting: one click per cell type, recursively propagated across all same-type cells.

From 245 Clicks to 3. With just 3 clicks (one per cell type), CoP reaches 92.7% of the per-instance upper bound with 81.7× fewer prompts.
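The 81.7× figure follows directly from the two click counts quoted above. Here N is the number of clicks under per-instance prompting and T is the number of cell types, using the document's O(N) and O(T) notation:

```latex
\frac{N}{T} = \frac{245}{3} \approx 81.7
```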

01 Abstract

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances.

We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance O(N) to per-type O(T), where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage.

On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%.

02 Contributions

① Group Prompting

Shifts interactive segmentation from per-instance O(N) to per-type O(T) interaction, reducing annotation cost from the number of cells to the number of cell types while remaining robust to out-of-distribution types without cell-specific training.

② Chain-of-Prompts

A training-free framework that recursively expands prompt coverage by combining Hierarchical Similarity Gating (HSG) and Farthest Prompt Recursion (FPR), maintaining precision ≥ 96% at every iteration.

③ Generalization on 7 Benchmarks

Retains over 90% of per-instance performance on cell-type-annotated datasets and over 99% on morphologically homogeneous datasets, outperforming fully-supervised methods that require complete mask annotations.

03 Method

Two Steps, Zero Training: Identify Reliable Cells, Then Expand. A frozen SAM encoder extracts high- and low-resolution feature maps once per image. HSG turns each user click into a reliable point set via hierarchical similarity gating + connected-component labeling. FPR then iteratively re-prompts the farthest uncovered point until no new cells appear, decoding the converged set into instance masks.

Hierarchical Similarity Gating (HSG)

A single feature scale cannot achieve spatial precision and type selectivity simultaneously. HSG gates SAM's high-resolution features F_h element-wise with its low-resolution features F_l, so the product suppresses the tissue-level false activations of F_h while preserving its sharp localization. It then thresholds the gated result non-parametrically at μ+σ and applies connected-component labeling to extract reliable point centroids.
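The gating step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: similarity to the clicked location is measured with cosine similarity, the low-resolution map is upsampled by nearest-neighbor repetition, negative similarities are clipped to zero before the product, and components are 4-connected. The names `f_h`, `f_l`, `hsg`, and the helper functions are hypothetical.

```python
import numpy as np

def cosine_similarity_map(features, point):
    """Cosine similarity between the feature at `point` and every location.
    features: (C, H, W) array; point: (row, col) on that map's grid."""
    c, h, w = features.shape
    flat = features.reshape(c, -1)
    flat = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + 1e-8)
    query = flat[:, point[0] * w + point[1]]
    return (query @ flat).reshape(h, w)

def upsample_nearest(sim, factor):
    """Nearest-neighbor upsampling of a 2D map (assumed, for simplicity)."""
    return np.repeat(np.repeat(sim, factor, axis=0), factor, axis=1)

def connected_component_centroids(mask):
    """Label 4-connected components by flood fill and return their centroids."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    centroids = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            stack, pixels = [(r, c)], []
            seen[r, c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            ys, xs = zip(*pixels)
            centroids.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centroids

def hsg(f_h, f_l, click):
    """Sketch of gating: click is (row, col) on the high-resolution grid."""
    factor = f_h.shape[1] // f_l.shape[1]
    sim_h = cosine_similarity_map(f_h, click)
    sim_l = cosine_similarity_map(f_l, (click[0] // factor, click[1] // factor))
    # Element-wise product: the low-resolution map gates the high-resolution one.
    gated = np.clip(sim_h, 0, None) * np.clip(upsample_nearest(sim_l, factor), 0, None)
    # Non-parametric threshold at mean + standard deviation of the gated map.
    threshold = gated.mean() + gated.std()
    return connected_component_centroids(gated > threshold)
```

In this toy form, a location that looks similar at high resolution but not at low resolution gets a gated value of zero and never reaches the reliable set. A production version would more likely use an optimized labeling routine such as `scipy.ndimage.label` in place of the flood fill.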

Farthest Prompt Recursion (FPR)

Feature similarity decays across distant tissue regions, so a single prompt under-covers far-away cells. FPR selects the reliable point that is farthest (in image coordinates) from all previous prompts, feeds it back into HSG, and merges newly discovered points into the reliable set. The cycle repeats until convergence, ensuring whole-image coverage without feature drift.
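The recursion can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: "farthest from all previous prompts" is interpreted here as maximizing the minimum Euclidean distance to any earlier prompt, `discover` is a hypothetical callback standing in for HSG, and `max_iters` is an added safety cap that is not part of the description above.

```python
import math

def farthest_prompt_recursion(first_click, discover, max_iters=50):
    """Expand one click into a converged reliable point set.

    discover(prompt) -> iterable of reliable (row, col) points; a stand-in for HSG.
    Returns (reliable_points, prompts_used).
    """
    prompts = [first_click]
    reliable = set(discover(first_click))
    for _ in range(max_iters):
        candidates = [p for p in reliable if p not in prompts]
        if not candidates:
            break
        # Pick the reliable point whose nearest previous prompt is farthest away.
        next_prompt = max(
            candidates,
            key=lambda p: min(math.dist(p, q) for q in prompts),
        )
        prompts.append(next_prompt)
        new_points = set(discover(next_prompt)) - reliable
        if not new_points:
            break  # converged: the new prompt revealed no new cells
        reliable |= new_points
    return reliable, prompts
```

Because every new prompt is itself drawn from the reliable set, each step re-queries from a point already accepted as the same type, and the loop ends as soon as a prompt adds nothing new.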

Training-Free Mask Decoding

Each point in the converged reliable set is decoded into an instance mask via SAM's frozen decoder; overlapping predictions are resolved by non-maximum suppression at IoU > 0.5. The entire pipeline operates in feature space without any backpropagation or task-specific training.
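The overlap-resolution step can be sketched as greedy non-maximum suppression over binary masks. The 0.5 IoU threshold comes from the text above; the per-mask `scores` used for ranking are an assumption (for example, a mask-quality estimate from the decoder), and the function names are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS: keep masks in descending score order, dropping any mask
    whose IoU with an already kept mask exceeds the threshold."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The function returns the indices of the surviving masks, so two prompts that landed on the same cell yield a single instance.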

04 Results

92.7% of Upper Bound: vs. per-instance SAM3 on CoNIC

81.7× Fewer Prompts: 245 clicks → 3 clicks with group prompting

>90% Retention: on 3 cell-type-annotated benchmarks

>99% Retention: on 4 morphologically homogeneous benchmarks

Group Prompting Tracks the Upper Bound with 81.7× Fewer Prompts

Figure 2: Group Prompting O(T) vs. per-instance O(N) on CoNSeP/test_2. A single click per cell type (left) recovers the same cell populations that per-instance prompting reaches with 245 clicks, while the AJI vs. number-of-prompts curve (right) tracks the upper bound.

One Click per Type Beats Fully-Supervised Models

Table 1: Quantitative comparison on cell-type-annotated benchmarks (CoNIC, CoNSeP, GlaS). Prompt types: Τ text, 𝒱 visual, ℳ mask supervision, 𝒫_N per-instance points, 𝒫_T per-type points. With only one click per cell type, CoP retains ≥ 90% of per-instance performance on every benchmark and outperforms fully-supervised methods that require complete mask annotations.

>99% Retention from a Single Click

Table 2: Quantitative results on morphologically homogeneous benchmarks without cell-type annotations (MoNuSeg, TNBC, CryoNuSeg, CPM-17). When cells within an image share similar morphology, CoP segments every instance from a single click and retains > 99% of per-instance performance.

CoP Recovers Cells That Supervised Models Miss

Qualitative comparison on CoNIC. Fully-supervised baselines drop entire cell populations that fall outside their training distribution (red dashed boxes), while CoP discovers them from a single click per type.

Why It Works: SAM Already Clusters Same-Type Cells

UMAP of SAM's frozen encoder features at GT centroids. (a) F_h mixes cell types; (b) F_l groups same-type cells without any training. CoP exploits this latent clustering: no fine-tuning, no labels, just the right propagation rule.

05 Supplementary Video

One Click, Hundreds of Cells. Watch CoP recursively expand a single user click into every same-type cell in the image.

06 Citation

BibTeX

@article{jo2026cop,
  title   = {One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation},
  author  = {Jo, Sanghyun and Lee, Seo Jin and Hong, Seohyung and Gang, Yoorim and Kim, Hyeongsub and Seo, Hyungseok and Kim, Kyungsu},
  journal = {arXiv preprint arXiv:2605.29429},
  year    = {2026}
}