OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

1Wuhan University, 2Skywork AI, 3S-Lab, NTU

The comprehensive capabilities of OMG-LLaVA. OMG-LLaVA can handle a variety of pixel-level, object-level, and image-level understanding and reasoning tasks with only one visual encoder, one visual decoder and one LLM.

Abstract

We propose OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information.

OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect individual specialist models, our work aims at end-to-end training with a single encoder, a single decoder, and a single LLM.
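As a rough illustration of this data flow, the sketch below wires a frozen segmentation encoder's pixel-centric and object-centric tokens into an LLM and decodes a [SEG] token back into a mask. All module names, dimensions, and the toy Transformer stand-in for the LLM are assumptions made for illustration, not the released implementation.

import torch
import torch.nn as nn

class OMGLLaVASketch(nn.Module):
    """Minimal sketch of the OMG-LLaVA data flow (illustrative, not the real code)."""
    def __init__(self, vis_dim=256, llm_dim=512):
        super().__init__()
        self.to_llm = nn.Linear(vis_dim, llm_dim)        # visual tokens -> LLM embedding space
        self.llm = nn.TransformerEncoder(                # stand-in for the LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=1)
        self.to_seg = nn.Linear(llm_dim, vis_dim)        # [SEG] hidden state -> decoder query

    def forward(self, pixel_tokens, object_tokens, text_embeds, pixel_feats):
        # 1. Concatenate pixel-centric and object-centric tokens from the (frozen)
        #    segmentation encoder and project them into the LLM embedding space.
        visual = self.to_llm(torch.cat([pixel_tokens, object_tokens], dim=1))
        # 2. The LLM reads the visual tokens followed by the embedded text instruction.
        hidden = self.llm(torch.cat([visual, text_embeds], dim=1))
        # 3. Treat the last hidden state as the [SEG] token, project it back, and
        #    decode a mask via a dot product with per-pixel features.
        seg_query = self.to_seg(hidden[:, -1])                           # (B, vis_dim)
        masks = torch.einsum("bc,bchw->bhw", seg_query, pixel_feats)     # (B, H, W)
        return hidden, masks

# Toy usage with random tensors.
model = OMGLLaVASketch()
pixel_tokens  = torch.randn(1, 64, 256)    # flattened pixel-centric tokens
object_tokens = torch.randn(1, 10, 256)    # object-centric tokens (objects / visual prompts)
text_embeds   = torch.randn(1, 20, 512)    # embedded text instruction ending with [SEG]
pixel_feats   = torch.randn(1, 256, 32, 32)
hidden, masks = model(pixel_tokens, object_tokens, text_embeds, pixel_feats)
print(hidden.shape, masks.shape)           # torch.Size([1, 94, 512]) torch.Size([1, 32, 32])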

Model Scope Comparison


Comparison of the capabilities of different models. We include several representative methods here. Our OMG-LLaVA offers the most comprehensive capabilities, encompassing image-level, object-level, and pixel-level understanding and reasoning. Compared to GLaMM and AnyRef, OMG-LLaVA features an elegant and simple system architecture with only a single visual encoder.

Video

Method: OMG-LLaVA


Overview of OMG-LLaVA. OMG-LLaVA consists of OMG-Seg and an LLM. OMG-Seg tokenizes the image into pixel-centric visual tokens and encodes the detected objects and input visual prompts into object-centric visual tokens. The [SEG] token output by the LLM is decoded by OMG-Seg into segmentation masks. OMG-Seg remains frozen at all stages. In particular, we present a perception prior embedding method that enhances the pixel features with object priors before the visual tokens are sent to the LLM.
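As a concrete illustration, here is a minimal sketch of how such a perception prior embedding could be computed, assuming the frozen OMG-Seg decoder exposes its object queries and the corresponding mask logits. The tensor names, the softmax-based soft assignment, and the additive fusion are illustrative assumptions rather than the exact formulation used in OMG-LLaVA.

import torch
import torch.nn.functional as F

def perception_prior_embedding(pixel_feats, object_queries, mask_logits, temperature=1.0):
    """Sketch: inject object priors from a frozen segmentation decoder into pixel features.

    pixel_feats:    (B, C, H, W) pixel-centric features
    object_queries: (B, Q, C)    object-centric query embeddings
    mask_logits:    (B, Q, H, W) mask prediction scores per query
    """
    B, C, H, W = pixel_feats.shape
    # Soft assignment of each pixel to the detected objects.
    assign = F.softmax(mask_logits.flatten(2) / temperature, dim=1)      # (B, Q, H*W)
    # Aggregate object query embeddings onto pixels according to the assignment.
    obj_prior = torch.einsum("bqn,bqc->bcn", assign, object_queries)     # (B, C, H*W)
    obj_prior = obj_prior.view(B, C, H, W)
    # Fuse the object prior with the original pixel features.
    return pixel_feats + obj_prior

# Toy usage with random tensors.
feats   = torch.randn(2, 256, 32, 32)
queries = torch.randn(2, 100, 256)
logits  = torch.randn(2, 100, 32, 32)
print(perception_prior_embedding(feats, queries, logits).shape)  # torch.Size([2, 256, 32, 32])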

Experiments

Comprehensive comparison of OMG-LLaVA and other MLLMs in terms of pixel-level and object-level understanding and reasoning capabilities and performance. "-" indicates that the method does not handle this task. GLaMM used the GranD dataset for pretraining, which is significantly larger than the datasets used by other methods.


Performance on referring expression segmentation datasets. The evaluation metric is cIoU. "ft" indicates finetuning on the referring expression datasets.


Performance on grounded conversation generation datasets. “ft” indicates finetuning on the GranDf dataset. † indicates that the method used the GranD dataset for pretraining.


Ablation study on RES and GCG datasets.


Demos


Qualitative comparison on the referring expression segmentation task. LISA uses a 13B LLM, while GLaMM and our OMG-LLaVA use a 7B LLM.


Qualitative comparison on the grounded conversation generation task.


Qualitative comparison on the visual prompt-based description task.


Qualitative comparison on the image-based conversation task.

BibTeX

@article{OMGLLaVA,
  title={OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding},
  author={Zhang, Tao and Li, Xiangtai and Fei, Hao and Yuan, Haobo and Wu, Shengqiong and Ji, Shunping and Loy, Chen Change and Yan, Shuicheng},
  journal={arXiv preprint},
  year={2024}
}