SAM's frozen image encoder already clusters same-type cells in its feature space before any prompt is given. CoP exploits this property to propagate a single click per cell type to all instances of that type, retaining 92.7% of per-instance performance with 81.7× fewer prompts, fully training-free.
Cell-specific models break on unseen cell types, while interactive foundation models such as SAM3 require one click per instance. Neither path scales to histopathology images with hundreds of cells.
Existing pipelines force a hard trade-off. Cell-specific baselines and open-vocabulary detectors miss out-of-distribution cell types (red dashed boxes), while SAM3 generalizes only when the user clicks every cell (e.g., 245 clicks per image).
We replace per-instance clicking, the long-standing O(N) bottleneck, with Group Prompting: one click per cell type, recursively propagated across all same-type cells.
From 245 Clicks to 3. With just 3 clicks (one per cell type), CoP reaches 92.7% of the per-instance upper bound with 81.7× fewer prompts.
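The quoted reduction factor is consistent with dividing the per-instance click count by the per-type click count. The snippet below is only a worked check of that arithmetic; the function name `prompt_reduction` is hypothetical and not part of CoP.

```python
def prompt_reduction(num_instance_clicks: int, num_type_clicks: int) -> float:
    """Ratio of per-instance prompts to per-type prompts for one image."""
    if num_type_clicks <= 0:
        raise ValueError("at least one per-type click is required")
    return num_instance_clicks / num_type_clicks


# Figures quoted in the caption above: 245 per-instance clicks vs. 3 per-type clicks.
factor = prompt_reduction(245, 3)
print(f"{factor:.1f}x fewer prompts")
```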
Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types. Interactive foundation models overcome this through per-instance prompting, but the cost is prohibitive for histopathology images containing hundreds to thousands of densely packed instances.
We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance O(N) to per-type O(T), where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage.
On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99% of per-instance performance.
Shifts interactive segmentation from per-instance O(N) to per-type O(T) interaction, reducing annotation cost from the number of cells to the number of cell types while remaining robust to out-of-distribution types without cell-specific training.
A training-free framework that recursively expands prompt coverage by combining Hierarchical Similarity Gating (HSG) and Farthest Prompt Recursion (FPR), maintaining precision ≥ 96% at every iteration.
Retains over 90% of per-instance performance on cell-type-annotated datasets and over 99% on morphologically homogeneous datasets, outperforming fully-supervised methods that require complete mask annotations.
Two Steps, Zero Training: Identify Reliable Cells, Then Expand. A frozen SAM encoder extracts high- and low-resolution feature maps once per image. HSG turns each user click into a reliable point set via hierarchical similarity gating followed by connected-component labeling. FPR then iteratively re-prompts the farthest uncovered point until no new cells appear, and the converged set is decoded into instance masks.
A single feature scale cannot achieve spatial precision and type selectivity simultaneously. HSG gates SAM's high-resolution features (Fh) with its low-resolution features (Fl) element-wise, so that the product suppresses the tissue-level false activations of Fh while preserving its sharp localization. It then thresholds the result non-parametrically at μ+σ and applies connected-component labeling to extract reliable point centroids.
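The gating step can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: similarity is cosine similarity to the feature at the click, negative similarities are clipped to zero before gating, the low-resolution map is upsampled by nearest-neighbor repetition, μ and σ are the mean and standard deviation of the gated map, and components are 4-connected. All function names are hypothetical.

```python
from collections import deque

import numpy as np


def cosine_similarity_map(features, y, x):
    """Cosine similarity between the feature at (y, x) and every location.

    features has shape (C, H, W); the result has shape (H, W).
    """
    unit = features / (np.linalg.norm(features, axis=0, keepdims=True) + 1e-8)
    return np.tensordot(unit[:, y, x], unit, axes=([0], [0]))


def connected_component_centroids(mask):
    """Label 4-connected components by BFS and return one (row, col) centroid each."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    centroids = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            queue, pixels = deque([(sy, sx)]), []
            seen[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            ys, xs = zip(*pixels)
            centroids.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centroids


def hsg_reliable_points(f_high, f_low, click_yx):
    """Gate high-res similarity with low-res similarity, threshold at mean + std."""
    factor = f_high.shape[1] // f_low.shape[1]  # assumes an integer scale ratio
    y, x = click_yx
    # Assumption: clip negatives so two negative similarities cannot multiply to a positive.
    sim_high = np.clip(cosine_similarity_map(f_high, y, x), 0.0, None)
    sim_low = np.clip(cosine_similarity_map(f_low, y // factor, x // factor), 0.0, None)
    sim_low_up = np.repeat(np.repeat(sim_low, factor, axis=0), factor, axis=1)
    gated = sim_high * sim_low_up  # element-wise gating
    threshold = gated.mean() + gated.std()  # non-parametric threshold at mu + sigma
    return connected_component_centroids(gated > threshold)
```

Called with two feature arrays of shape `(C, H, W)` and `(C, H/k, W/k)` and a click in high-resolution coordinates, `hsg_reliable_points` returns a list of `(row, col)` centroids. A region that resembles the click only at high resolution is zeroed out by the product, which is the behaviour the paragraph above describes.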
Feature similarity decays across distant tissue regions, so a single prompt under-covers far-away cells. FPR selects the reliable point that is farthest (in image coordinates) from all previous prompts, feeds it back into HSG, and merges newly discovered points into the reliable set. The cycle repeats until convergence, ensuring whole-image coverage without feature drift.
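The recursion can be sketched as a farthest-point loop. This is an illustrative reading rather than the authors' code: "farthest from all previous prompts" is interpreted as maximizing the distance to the nearest previous prompt, the loop stops as soon as a prompt yields no new points, and `discover` is a hypothetical callback standing in for one HSG pass that returns reliable points for a given prompt.

```python
import math


def farthest_point(candidates, prompts):
    """Candidate whose nearest previous prompt is farthest away (max-min distance)."""
    return max(candidates, key=lambda p: min(math.dist(p, q) for q in prompts))


def farthest_prompt_recursion(first_click, discover, max_iters=50):
    """Expand one click into a reliable point set by re-prompting far-away points.

    discover(prompt) -> iterable of (row, col) points; hypothetical stand-in for HSG.
    Returns (reliable_points, prompts_used).
    """
    prompts = [first_click]
    reliable = set(discover(first_click))
    for _ in range(max_iters):  # safety cap, an assumption of this sketch
        candidates = sorted(reliable - set(prompts))
        if not candidates:
            break
        next_prompt = farthest_point(candidates, prompts)
        prompts.append(next_prompt)
        new_points = set(discover(next_prompt)) - reliable
        if not new_points:
            break  # converged: the latest prompt found no new cells
        reliable |= new_points
    return reliable, prompts


# Toy usage: similarity "reaches" only 15 pixels, so one click cannot cover the row.
cells = [(0, 0), (0, 10), (0, 20), (0, 30), (0, 40)]
reach = lambda prompt: [c for c in cells if math.dist(c, prompt) <= 15]
reliable, prompts = farthest_prompt_recursion((0, 0), reach)
print(len(reliable), "cells found with", len(prompts), "internal prompts")
```

Only the first prompt comes from the user; every later prompt is chosen automatically from the reliable set, which is why the user-facing cost stays at one click per type.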
Each point in the converged reliable set is decoded into an instance mask via SAM's frozen decoder; overlapping predictions are resolved by non-maximum suppression at IoU > 0.5. The entire pipeline operates in feature space without any backpropagation or task-specific training.
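The overlap-resolution step can be sketched as greedy mask-level non-maximum suppression. This is a generic version under assumptions the text does not confirm: each decoded mask comes with a confidence score used for ranking, and IoU is computed on binary masks rather than bounding boxes. The names `mask_iou` and `mask_nms` are hypothetical.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring mask, drop any mask with IoU > threshold.

    Returns the indices of the kept masks in descending score order.
    """
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Toy usage: mask 1 duplicates most of mask 0 (IoU 0.75); mask 2 is a separate cell.
m0 = np.zeros((4, 4), dtype=bool); m0[0:2, :] = True
m1 = np.zeros((4, 4), dtype=bool); m1[0:2, 0:3] = True
m2 = np.zeros((4, 4), dtype=bool); m2[2:4, :] = True
kept = mask_nms([m0, m1, m2], scores=[0.9, 0.8, 0.7])
print(kept)
```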
Cell-Type-Annotated Benchmarks (CoNIC, CoNSeP, GlaS). Prompt types: Τ text, 𝒱 visual, ℳ mask supervision, 𝒫N per-instance points, 𝒫T per-type points. With only one click per cell type, CoP retains ≥ 90% of per-instance performance on every benchmark and outperforms fully-supervised methods that require complete mask annotations.
Morphologically Homogeneous Benchmarks (MoNuSeg, TNBC, CryoNuSeg, CPM-17). When cells within an image share similar morphology, CoP segments every instance from a single click and retains > 99% of per-instance performance.
Qualitative comparison on CoNIC. Fully-supervised baselines drop entire cell populations that fall outside their training distribution (red dashed boxes), while CoP discovers them from a single click per type.
UMAP of SAM's frozen encoder features at GT centroids. (a) Fh mixes cell types; (b) Fl groups same-type cells without any training. CoP exploits this latent clustering: no fine-tuning, no labels, just the right propagation rule.
One Click, Hundreds of Cells. Watch CoP recursively expand a single user click into every same-type cell in the image.
@inproceedings{jo2026cop,
title = {One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation},
author = {Jo, Sanghyun and Lee, Seo Jin and Hong, Seohyung and Gang, Yoorim and Kim, Hyeongsub and Seo, Hyungseok and Kim, Kyungsu},
booktitle = {International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)},
year = {2026},
note = {Early Accept (Top 9\%)}
}