Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan¹, Xiangtai Li², Tao Zhang²,³, Zilong Huang², Shilin Xu⁴,
Shunping Ji³, Yunhai Tong⁴, Lu Qi², Jiashi Feng², Ming-Hsuan Yang¹

¹UC Merced ²ByteDance Seed ³WHU ⁴PKU


Sa2VA can segment the referred object and understand the whole scene. It supports image conversation, video conversation, image referring segmentation, video referring segmentation, and grounded caption generation with one-shot instruction tuning. Sa2VA achieves strong results on multiple image and video referring segmentation and chat benchmarks compared with existing MLLMs such as GLaMM and OMG-LLaVA.

Demos

(Demo 1) Input Video (Source: La La Land, 2016):

Instruction: "Please segment the girl wearing the yellow dress."

(Demo 2) Input Video (Source: La La Land, 2016):

Instruction: "Please segment the main character."

(Demo 3) Input Video (Source: Internet):

Instruction: "Please segment the person wearing sun glasses."

(Demo 4) Input Video (Source: Internet):

Instruction: "Please segment the singing girl."

(Demo 5) Input Video (Source: Internet):

Instruction: "Please segment the guy in the center."

(Demo 6) Input Video (Source: The Godfather, 1972):

Instruction: "What is the atmosphere of the scene?"

Answer: "The scene has a dark and mysterious atmosphere, with the men dressed in suits and ties, and the dimly lit room."

(Demo 7) Input Video (Source: Internet):

Instruction: "What are the guys doing in the video?"

Answer: "The guys in the video are dancing together in a group. They are performing a choreographed routine, moving in sync with each other."

Short Summary

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content.

Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

Model Scope Comparison

Comparison of the capabilities of representative models. Our method supports a broad range of tasks and modalities. Because it is interactive on video, Sa2VA can perform multiple promptable video tasks: Ref-VOS, image/video chat, and visual prompt understanding.

Method: Sa2VA

Figure: Our proposed Sa2VA model. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed through a large language model (LLM). The output text tokens are used to generate the [SEG] token and associated language outputs. The SAM-2 decoder receives the image and video features from the SAM-2 encoder, along with the [SEG] token, to generate corresponding image and video masks.
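
The data flow in the figure can be illustrated with a toy sketch. This is not the Sa2VA implementation: it assumes the hidden state at the [SEG] position is linearly projected into a prompt vector, and it replaces the SAM-2 decoder with a per-pixel dot product followed by a threshold. The names `SEG_TOKEN_ID`, `find_seg_positions`, `linear`, `decode_masks`, and `segment_from_llm_output` are hypothetical and exist only for this illustration.

```python
SEG_TOKEN_ID = 9999  # hypothetical vocabulary id standing in for the "[SEG]" token


def find_seg_positions(token_ids, seg_id=SEG_TOKEN_ID):
    """Return the indices at which the LLM emitted the [SEG] token."""
    return [i for i, t in enumerate(token_ids) if t == seg_id]


def linear(vec, weight):
    """Project vec (length d_in) with weight (d_out rows, each of length d_in)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def decode_masks(prompt, frame_features, threshold=0.0):
    """Toy stand-in for a mask decoder (assumption, not SAM-2).

    frame_features holds one H x W grid of feature vectors per frame.
    A pixel is foreground when its dot product with the prompt exceeds threshold.
    """
    masks = []
    for frame in frame_features:
        mask = [
            [1 if sum(p * f for p, f in zip(prompt, pixel)) > threshold else 0
             for pixel in row]
            for row in frame
        ]
        masks.append(mask)
    return masks


def segment_from_llm_output(token_ids, hidden_states, proj_weight, frame_features):
    """Turn the first [SEG] hidden state into one binary mask per frame."""
    positions = find_seg_positions(token_ids)
    if not positions:
        return []  # plain chat answer: nothing to segment
    seg_hidden = hidden_states[positions[0]]
    prompt = linear(seg_hidden, proj_weight)
    return decode_masks(prompt, frame_features)


# Tiny hand-made example: 4 output tokens, 2-dim hidden states, 2 frames of 1 x 2 pixels.
token_ids = [5, 17, SEG_TOKEN_ID, 2]
hidden_states = [[0.0, 0.0], [0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
proj_weight = [[1.0, 0.0], [0.0, 1.0]]  # identity projection for readability
frame_features = [
    [[[1.0, 0.0], [0.0, 1.0]]],  # frame 0
    [[[0.0, 2.0], [3.0, 1.0]]],  # frame 1
]
masks = segment_from_llm_output(token_ids, hidden_states, proj_weight, frame_features)
print(masks)
```

The same prompt vector is reused for every frame here, which mirrors the idea that a single [SEG] token drives both image and video masks. When the output contains no [SEG] token, the sketch returns an empty list, corresponding to a pure conversation answer.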

Experiment Results: Sa2VA

Figure: Experiment results on various settings, including image/video referring segmentation benchmarks and image/video chat benchmarks. Sa2VA achieves stronger performance than existing methods.

Labeled Examples: Ref-SAV dataset

Figure: Samples from our Ref-SAV benchmark. The benchmark features multiple granularities, complex occlusion and reappearance, and both short and long text expressions.

BibTeX

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv},
  year={2025}
}