Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

1UC Merced    2ByteDance Seed    3WHU    4PKU

Figure 1: Illustration of the capabilities of our proposed Sa2VA. (a) Given a video, Sa2VA is able to segment the referred object and understand the whole scene. (b) Sa2VA supports image conversation, video conversation, image referring segmentation, video referring segmentation, and grounded caption generation with one-shot instruction tuning. (c) Sa2VA achieves strong results on multiple image and video referring segmentation and chat benchmarks compared with existing MLLMs such as GLaMM and OMG-LLaVA.

Short Summary

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content.

Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

Model Scope Comparison


Comparison of the capabilities of representative models. Our method supports a wide range of tasks and modalities. Benefiting from its interactive features on video, Sa2VA can perform multiple promptable video tasks: Ref-VOS, image/video chat, and visual prompt understanding.

Method: Sa2VA


Figure: Our proposed Sa2VA model. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed through a large language model (LLM). The output text tokens are used to generate the [SEG] token and associated language outputs. The SAM-2 decoder receives the image and video features from the SAM-2 encoder, along with the [SEG] token, to generate corresponding image and video masks.
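
Below is a minimal PyTorch-style sketch of the forward pass described in the figure caption. All module and variable names (llava, llm, sam2_encoder, sam2_decoder, seg_token_id, proj) are hypothetical placeholders chosen for illustration; this is a conceptual outline of the [SEG]-token flow, not the released implementation.

import torch
import torch.nn as nn

class Sa2VASketch(nn.Module):
    # Hypothetical sketch: encode text + visual inputs into a shared token space,
    # run the LLM, and use the hidden state at the [SEG] token to prompt SAM-2.
    def __init__(self, llava, llm, sam2_encoder, sam2_decoder,
                 seg_token_id, hidden_dim, prompt_dim):
        super().__init__()
        self.llava = llava                  # vision-language encoder (text/image/video -> tokens)
        self.llm = llm                      # large language model over the shared token space
        self.sam2_encoder = sam2_encoder    # SAM-2 image/video feature encoder
        self.sam2_decoder = sam2_decoder    # SAM-2 mask decoder
        self.seg_token_id = seg_token_id    # id of the special [SEG] token
        self.proj = nn.Linear(hidden_dim, prompt_dim)  # map [SEG] hidden state to a SAM-2 prompt

    def forward(self, text_ids, frames):
        # 1) Encode text, images, and video frames into one shared token sequence.
        tokens = self.llava(text_ids=text_ids, frames=frames)

        # 2) Run the LLM; it produces language outputs and, when asked to segment, a [SEG] token.
        out = self.llm(inputs_embeds=tokens, output_hidden_states=True)
        ids = out.logits.argmax(dim=-1)
        hidden = out.hidden_states[-1]

        # 3) Use the hidden state at the [SEG] position as a segmentation prompt.
        seg_positions = ids == self.seg_token_id
        seg_prompt = self.proj(hidden[seg_positions])

        # 4) SAM-2 decodes masks from its own visual features plus the [SEG] prompt,
        #    propagating them across video frames.
        feats = self.sam2_encoder(frames)
        masks = self.sam2_decoder(feats, prompt=seg_prompt)
        return ids, masks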

Experiment Results: Sa2VA


Figure: Experimental results across various settings, including image/video referring segmentation benchmarks and image/video chat benchmarks. Sa2VA achieves stronger performance than existing methods.

Labeled Examples: Ref-SAV dataset


Figure: Samples from our Ref-SAV benchmark. The benchmark features multi-granularity annotations, complex occlusion and reappearance, and both short and long text expressions.

Demos

(Demo 1) Input Video (Source: La La Land, 2016):

Instruction: "Please segment the girl wearing the yellow dress."

(Demo 2) Input Video (Source: La La Land, 2016):

Instruction: "Please segment the main character."

(Demo 3) Input Video (Source: The Godfather, 1972):

Instruction: "What is the atmosphere of the scene?"

Answer: "The scene has a dark and mysterious atmosphere, with the men dressed in suits and ties, and the dimly lit room."

(Demo 4) Input Video (Source: Internet):

Instruction: "Please segment the person wearing sun glasses."

(Demo 5) Input Video (Source: Internet):

Instruction: "Please segment the singing girl."

(Demo 6) Input Video (Source: Internet):

Instruction: "What are the guys doing in the video?"

Answer: "The guys in the video are dancing together in a group. They are performing a choreographed routine, moving in sync with each other."

(Demo 7) Input Video (Source: Internet):

Instruction: "Please segment the guy in the center."

BibTeX

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv},
  year={2025}
}