
Efficient SAM2

Efficient SAM2 pipeline diagram
Figure 1: The pipeline including our automatic prompt generator module. Attention scores are extracted from the vision encoder, clustered with HDBSCAN, and the resulting points are fed as prompts to the mask decoder.

Training deep neural networks from scratch has largely been replaced by foundation models. A prime example in computer vision is the Segment Anything Model (SAM), a foundation model for promptable instance segmentation.

Promptable means that, analogously to prompting an LLM, SAM2 exposes an interactive interface that allows users to inject priors to condition the output. This enhances zero-shot capabilities, at the cost of reduced automation.

In this post we propose Efficient-SAM2, a configurable, automatic prompting solution that overcomes this limitation with a training-free, content-aware approach that exploits attention scores from the vision encoder. By adjusting just two intuitive parameters, users can trade off between latency and segmentation granularity to match their specific use case.

Motivation

SAM2 was (before SAM3) the state of the art in promptable image segmentation, but its reliance on manual prompting limits practicality in high-throughput or fully automatic settings.

The most common workaround is Automatic SAM, which blindly samples \(1024\) points on a \(32 \times 32\) grid. This suffers from two key problems:

  1. Lack of control over granularity: excessive prompting leads to fragmented masks, preventing the model from capturing object-level semantics.
  2. Inefficiency: processing thousands of prompt points, many on background or redundant areas, is computationally expensive.

These inefficiencies propagate into downstream systems. For example, SAM3D achieves impressive zero-shot 3D instance segmentation by applying SAM to video frames, but inherits Automatic SAM's 1024-point overhead per frame, even though frames in datasets like ScanNet3D typically contain only a few object instances.

Background

Sample images from COCO, SA-1B, and LVIS
Figure 2: (a), (b), and (c) images from the COCO, SA-1B, and LVIS datasets with their ground-truth instance masks overlaid. (d) Attention scores computed by encoding image (a), corresponding to the highlighted patch (red dot).

Instance segmentation assigns each pixel a unique identifier corresponding to an individual object. As Figure 2 shows, different benchmarks exhibit substantial variation in object distributions, granularity, and annotation detail.

SAM Architecture

SAM is composed of three components: an Image Encoder (a ViT that processes patches \(x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}\) into embeddings \(z_p \in \mathbb{R}^{N \times D}\)), a Prompt Encoder (which embeds user-provided points, boxes, or masks into a conditioning latent space), and a Mask Decoder (which combines image and prompt embeddings to produce segmentation masks). Crucially, the image encoder runs only once per image regardless of the number of prompts.
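As a shape-level sketch of the encoder's first step, the patchify-and-embed operation maps an image to the \(x_p\) and \(z_p\) tensors above. The dimensions below are toy values chosen for readability, not SAM's actual configuration:

```python
import numpy as np

# Toy dimensions (illustrative only, not SAM's real values)
H = W = 64   # image height and width
C = 3        # channels
P = 16       # patch size
D = 256      # embedding dimension
N = (H // P) * (W // P)  # number of patches

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Patchify: split into N flattened patches, x_p in R^{N x (P^2 * C)}
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
x_p = patches.reshape(N, P * P * C)

# Linear projection to embeddings z_p in R^{N x D}
W_proj = rng.standard_normal((P * P * C, D))
z_p = x_p @ W_proj
```

Because the prompt encoder and mask decoder are lightweight, the expensive part, producing \(z_p\), is amortized across all prompts for the same image.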

HDBSCAN

HDBSCAN is a hierarchical density-based clustering algorithm. Given two points \(x_p, x_q \in \mathbb{R}^N\) and a distance metric \(d(\cdot, \cdot)\), the mutual reachability distance is:

$$ d_{\text{mrd}}(x_p, x_q) = \max \{ d_{\text{core}_k}(x_p), d_{\text{core}_k}(x_q), d(x_p, x_q) \}. $$

Here \(d_{\text{core}_k}(x_p)\) is the distance to the \(k\)-th nearest neighbor, where \(k\) corresponds to the \(\texttt{min\_samples}\) parameter. Larger \(k\) yields more conservative clusters restricted to higher-density regions. The \(\texttt{min\_cluster\_size}\) parameter discards clusters below a specified size. These two parameters are the main degrees of freedom in our pipeline.
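The mutual reachability distance can be sketched in a few lines of NumPy. This is an illustrative reimplementation using a plain Euclidean base distance, not HDBSCAN's optimized internals:

```python
import numpy as np

def mutual_reachability(X, k):
    """Pairwise mutual reachability distances, as used by HDBSCAN.

    d_mrd(p, q) = max(core_k(p), core_k(q), d(p, q)),
    where core_k(p) is the distance from p to its k-th nearest neighbor.
    """
    # Pairwise Euclidean distances
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    # k-th nearest neighbor distance: column k of each sorted row
    # (column 0 is the point itself at distance 0)
    core = np.sort(d, axis=1)[:, k]
    # Element-wise max of d(p, q), core_k(p), core_k(q)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))
```

Note how points in sparse regions inflate all their distances through the core term, which is what lets HDBSCAN separate dense clusters from noise.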

Method

Efficient SAM2 replaces naive point sampling with a training-free, configurable, content-aware alternative. It requires no changes to model weights or architecture, only the prompting mechanism.

1. Mining Attention Scores

We exploit SAM2’s ViT image encoder (from the Hiera family) by intercepting attention scores \(e \in [0,1]^{N \times N}\) from the deepest global attention layers. These scores quantify pairwise similarity between patch embeddings and can be interpreted as cosine similarities between representations \(z_p\). For further intuition on the semantic structure of such embeddings, see this work.

As shown in Figure 2(d), a patch on a horse attends strongly to other patches on the same object, and so attention distributions reveal object instances.
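A toy NumPy sketch of the attention matrix we mine is below. Queries and keys are random here; in the real pipeline the scores are intercepted from SAM2's encoder (e.g., via a forward hook) rather than recomputed:

```python
import numpy as np

def attention_scores(Q, K):
    """Softmax attention matrix e in [0,1]^{N x N}: row i describes
    how much patch i attends to every other patch."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d = 16, 32  # toy sizes, not SAM2's
Q, K = rng.standard_normal((2, N, d))
e = attention_scores(Q, K)
# Each row is a probability distribution over patches and serves as
# the feature vector for that patch in the clustering step.
```

Since each row sums to one and concentrates mass on semantically similar patches, rows belonging to the same object end up close under cosine distance, which is exactly what the next step exploits.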

Head Selection. Individual attention heads specialize further. Through ablations on COCO, we found that specific heads (e.g., Head 2 of Layer 22 in SAM2 with the Hiera-Large checkpoint) correlate strongly with object instances. Detailed head configurations for every backbone are listed in the repository.

Patches clustering in overlay on the image
Figure 3: Patch-level cluster labels overlaid on the original image. Clusters capture object-level semantics: background forms one cluster, individual objects get distinct labels.

2. Clustering as Prompt Generation

We treat each row \(e_i\) of the attention matrix as a feature vector for the \(i\)-th patch. Patches on the same object should have similar attention vectors, so we cluster them with HDBSCAN using cosine distance:

$$ d(x, y) = 1 - \text{cosine}(x, y). $$

Cosine distance better captures the structure of attention distributions than Euclidean distance. The cluster medoids are projected back to image coordinates and used as point prompts for the SAM2 decoder.

In Figure 3, HDBSCAN effectively separates patches by semantic content: the background forms a dominant cluster, while individual objects (each horse) get distinct labels.
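The medoid-to-prompt step can be sketched as follows, assuming a square patch grid and a square image. The helper name `cluster_medoids` is ours, introduced here for illustration:

```python
import numpy as np

def cluster_medoids(features, labels, grid_side, image_size):
    """For each cluster, pick the medoid (the member minimizing total
    cosine distance to the rest) and map its patch index to image
    coordinates at the patch center. HDBSCAN noise (label -1) is skipped."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    patch = image_size / grid_side  # patch size in pixels
    prompts = []
    for c in np.unique(labels):
        if c == -1:
            continue
        idx = np.flatnonzero(labels == c)
        f = feats[idx]
        dist = 1.0 - f @ f.T  # pairwise cosine distances within the cluster
        medoid = idx[dist.sum(axis=1).argmin()]
        row, col = divmod(int(medoid), grid_side)
        prompts.append(((col + 0.5) * patch, (row + 0.5) * patch))  # (x, y)
    return np.array(prompts)
```

Using medoids rather than centroids guarantees each prompt lands on an actual patch of the cluster, so the point is always on (or near) the object it represents.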

3. Post-Processing: Mask Merging

Standard Automatic SAM uses aggressive NMS to remove duplicate masks from its dense grid. Our sparser prompts produce fewer overlaps, and when overlaps do occur they are meaningful: either granularity differences (the same object at multiple detail levels) or objects split by occlusion or by locality biases in the patch representations.

Instead of suppressing these, we merge overlapping masks using a custom intersection metric that handles nested masks:

$$ \texttt{Intersection}(M_i, M_j) = \frac{|M_i \cap M_j|}{\min(|M_i|, |M_j|)}. $$

Thresholding the pairwise intersection matrix and applying union-find to the induced graph yields the final merged masks. While this merging strategy matches human intuition about segmentation, the ablation in Table 3 shows it is not the only source of quality improvement in our method.
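A minimal sketch of the merging step (the threshold value here is illustrative, not necessarily the one used in our experiments):

```python
import numpy as np

def merge_masks(masks, thr=0.8):
    """Merge overlapping boolean masks. Two masks are joined when
    |Mi ∩ Mj| / min(|Mi|, |Mj|) exceeds `thr` (this also catches fully
    nested masks, unlike IoU); connected groups are fused via union-find."""
    n = len(masks)
    parent = list(range(n))

    def find(i):
        # Find the group root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    areas = [m.sum() for m in masks]
    for i in range(n):
        for j in range(i + 1, n):
            inter = np.logical_and(masks[i], masks[j]).sum()
            if inter / min(areas[i], areas[j]) > thr:
                parent[find(i)] = find(j)  # union the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [np.logical_or.reduce([masks[i] for i in g]) for g in groups.values()]
```

Normalizing by the smaller area means a mask fully contained in another scores 1.0 regardless of the size ratio, which is why nested (multi-granularity) masks get merged while plain IoU would miss them.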

Results

We evaluate across instance segmentation (COCO, LVIS, SA-1B), salient object segmentation (ILSO1K, SOC, SIP), and camouflaged object segmentation (NC4K, COD10K, CAMO). All comparisons use identical backbones (Hiera-L) and decoder settings, varying only the prompting strategy and post-processing. Latency is measured as end-to-end inference time including prompt generation, decoding, and post-processing.

Configurations

| Name | `min_samples` | `min_cluster_size` | avg. #P |
|---|---|---|---|
| (1) Low Latency | 2 | 10 | 107.3 |
| (2) High Quality | 2 | 5 | 213.1 |
| (3) SAM3D | 5 | 40 | 22.5 |
| (4) SAM1 | 2 | 3 | 300.5 |

Table 1: HDBSCAN hyperparameter configurations with the average number of input points produced across all experiments.

These configurations illustrate a key strength of the approach: the user controls the granularity-efficiency trade-off through just two HDBSCAN parameters. Low Latency averages ~100 points per image, reducing decoder passes by roughly 10x. High Quality halves min_cluster_size to detect smaller objects at the cost of more prompts. The SAM3D and SAM1 configurations are task-specific presets tailored to those pipelines. Custom configurations for new use cases are straightforward to define. See the repository for details.
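For illustration, the presets could be expressed as a small config table; the keys follow HDBSCAN's parameter names, and the mapping of Table 1's "# samples" and "cluster size" columns onto `min_samples` and `min_cluster_size` is our reading of that table:

```python
# Hypothetical preset table mirroring Table 1 (keys follow HDBSCAN's
# parameter names; a new use case just adds another entry).
PRESETS = {
    "low_latency":  {"min_samples": 2, "min_cluster_size": 10},
    "high_quality": {"min_samples": 2, "min_cluster_size": 5},
    "sam3d":        {"min_samples": 5, "min_cluster_size": 40},
    "sam1":         {"min_samples": 2, "min_cluster_size": 3},
}
```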

Instance Segmentation

| Prompting | COCO IoU ↑ | COCO #P ↓ | COCO Lat. ↓ | LVIS IoU ↑ | LVIS #P ↓ | LVIS Lat. ↓ | SA-1B IoU ↑ | SA-1B #P ↓ | SA-1B Lat. ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Random | 44.5 | 1024 | | 40.0 | 1024 | | 45.5 | 1024 | |
| Grid (32x32) | 45.6 | 1024 | 1.37s | 41.2 | 1024 | 1.37s | 49.7 | 1024 | 1.56s |
| Ours Low Latency | 60.5 (+32.5%) | 105.5 (-89.7%) | 1.22s (-0.16s) | 54.3 (+31.8%) | 105.7 (-89.7%) | 1.18s (-0.19s) | 41.7 (-16.2%) | 110.3 (-89.2%) | 1.13s (-0.43s) |
| Ours High Quality | 66.4 (+45.5%) | 210.4 (-79.5%) | 1.41s (+0.04s) | 61.0 (+48.1%) | 211.1 (-79.4%) | 1.33s (-0.04s) | 57.3 (+15.2%) | 220.7 (-78.4%) | 1.11s (-0.44s) |

Table 2: Instance segmentation results on 10k-image subsets. Parenthesized values show the relative change vs. Grid (32x32).

On COCO and LVIS, Efficient SAM2 improves IoU by over 30% while using ~90% fewer prompts. On SA-1B, the Low Latency config trades some IoU due to the dataset's extreme annotation granularity, but High Quality recovers and exceeds grid performance. Dense uniform prompting is neither necessary nor optimal: semantic-aware prompt selection yields better masks at dramatically lower cost. Importantly, the user can tune the trade-off between speed and detail simply by adjusting min_cluster_size, a single parameter that controls segmentation granularity without retraining or architectural changes.

Ablation

| Ours | Grid (32x32) | NMS | MM | IoU | #P | Lat. |
|---|---|---|---|---|---|---|
| ✗ | ✓ | ✓ | ✗ | 46.6 | 1024 | 0.00s |
| ✗ | ✓ | ✗ | ✓ | 47.0 | 1024 | 0.02s |
| ✓ | ✗ | ✓ | ✗ | 66.2 | 210.1 | -0.04s |
| ✓ | ✗ | ✗ | ✓ | 66.9 | 210.1 | -0.01s |

Table 3: Ablation on 1k COCO images. Salient prompting accounts for most of the gain (~20 IoU points); mask merging provides a consistent additional improvement. Best performance comes from combining both.

We ablate the pipeline's post-processing options: NMS (non-maximum suppression) and MM (our mask merging). The noise injected by grid prompt sampling cannot be recovered even by our mask merging algorithm, which demonstrates the quality impact of content-aware prompting.

Salient Object Segmentation

| Prompting | ILSO1K IoU ↑ | ILSO1K #P ↓ | SOC IoU ↑ | SOC #P ↓ | SIP IoU ↑ | SIP #P ↓ |
|---|---|---|---|---|---|---|
| Grid (16x16) | 64.9 | 256 | 61.0 | 256 | 83.9 | 256 |
| Random (1024) | 64.3 | 1024 | 67.0 | 1024 | 86.1 | 1024 |
| Grid (32x32) | 64.6 | 1024 | 65.9 | 1024 | 85.7 | 1024 |
| Ours (Low Latency) | 87.9 | 104.6 | 83.0 | 106.3 | 93.4 | 104.7 |
| Avg. Var. | +36.7% | -89.8% | +23.9% | -89.6% | +8.5% | -89.8% |

Table 4: Salient object segmentation on the full ILSO1K, SOC, and SIP test sets. Efficient SAM2 naturally aligns with tasks emphasizing global object prominence, even without explicit saliency training.

Camouflaged Object Segmentation

| Prompting | NC4K IoU ↑ | NC4K #P ↓ | COD10K IoU ↑ | COD10K #P ↓ | CAMO IoU ↑ | CAMO #P ↓ |
|---|---|---|---|---|---|---|
| Grid (16x16) | 19.6 | 256 | 22.3 | 256 | 14.5 | 256 |
| Random (1024) | 21.1 | 1024 | 26.1 | 1024 | 16.3 | 1024 |
| Grid (32x32) | 16.6 | 1024 | 16.3 | 1024 | 15.5 | 1024 |
| Ours (Low Latency) | 56.2 | 100.5 | 48.7 | 102.7 | 58.2 | 104.2 |
| Avg. Var. | +238.6% | -90.2% | +198.8% | -90.0% | +275.5% | -89.8% |

Table 5: Camouflaged object segmentation on the full NC4K, COD10K, and CAMO test sets.

Grid prompting largely fails here (roughly 15-22 IoU), while attention-driven prompting achieves 2-3x improvements: semantic consistency in embedding space remains informative even when appearance cues are weak.

Generalization to SAM1

| Prompting | COCO IoU ↑ | COCO #P ↓ | Lat. ↓ |
|---|---|---|---|
| Grid (16x16) | 60.5 | 256 | |
| Grid (32x32) | 68.9 | 1024 | 0.00s |
| Ours | 66.6 (-3.4%) | 300.5 (-70.7%) | +0.50s |

Table 6: SAM1 (ViT-L) on 10k COCO images.

We also applied our approach to the older SAM1 pipeline, which has an inherently different output mask distribution. We observe a small IoU drop in exchange for a 70% reduction in prompts, confirming that the approach is model-agnostic and does not rely on SAM2-specific features.

Summary

Across all tasks, Efficient SAM2 achieves segmentation quality higher than or comparable to Automatic SAM, reduces prompt count by 70-90%, and lowers inference latency, all without modifying model weights or architecture. The approach is fully configurable: users can adapt it to their specific latency, granularity, and accuracy requirements by tuning two parameters. Semantic attention is a powerful and underutilized prior for prompt generation.

Citation

BibTeX Citation
@mastersthesis{zirilli2025efficientsam2,
      title={Efficient SAM2: Training-Free Content-Aware Prompt Generation for the Segment Anything Model},
      author={Alessandro Zirilli and Emanuele RodolΓ },
      year={2025},
      school={Sapienza University of Rome / Technical University of Munich},
      type={Master's Thesis},
}

Resources

Author: Alessandro Zirilli