📝 Publications

\* denotes equal contribution.

Recent Works

  • SAMTok: Representing Any Mask with Two Words, Yikang Zhou*, Tao Zhang*, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li [Tech Report] Represents any mask with two words, enabling existing VLM infrastructure to be reused for pixel-level LLMs. | Code Data
  • Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence, Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang [Tech Report] The first non-agentic spatio-temporal RL post-training framework. | Code Data
  • Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs, Haochen Wang*, Yuhao Wang*, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang [ICLR-2026] SOTA region captioning models. | Code Model and Data
  • MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation, Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li [ICLR-2026] The first unified parallel generation model. | Code Model and Data
  • DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World, Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi [Tech Report] A strong data engine for dense grounded captioning. | Code Data
  • The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer, Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang [ICCV 2025] Scales a single-Transformer backbone for both VLM and vision tasks. | Code
  • Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, Haobo Yuan*, Xiangtai Li*, Tao Zhang*, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang [Tech Report] Marries SAM2 with a LLaVA-like MLLM for open-ended spatio-temporal understanding. | Project Page
Previous Works

  • DreamRelation: Bridging Customization and Relation Generation, Qingyu Shi, Lu Qi, Jianzong Wu, Jinbin Bai, Jingbo Wang, Yunhai Tong, Xiangtai Li [CVPR 2025, Oral] The first relation-aware customization approach. | Github
  • RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything, Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang [ICLR 2025, Oral] A new real-time multi-task segmentation setting, a benchmark, and a simple, efficient baseline. | Github
  • OMG-Seg: Is One Model Good Enough For All Segmentation?, Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy [CVPR 2024] One model that performs image, video, open-vocabulary, multi-dataset, and interactive segmentation in one shot. | Project Page
  • OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding, Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan [NeurIPS 2024] The first end-to-end MLLM that unifies image-level, object-level, and pixel-level understanding and reasoning. | Github
  • Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation, Xiangtai Li, Haobo Yuan, Wenwei Zhang, Guangliang Cheng, Jiangmiao Pang, Chen Change Loy [ICCV 2023] The first unified SOTA universal video segmentation model. | Project
  • Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation, Xiangtai Li*, Wenwei Zhang*, Jiangmiao Pang*, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy [CVPR 2022, Oral, top 2%] The first unified video segmentation model and codebase for VPS, VIS, and VSS. | Code
  • Semantic Flow for Fast and Accurate Scene Parsing, Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong [ECCV 2020, Oral, top 2%] The first real-time model to exceed 80% mIoU on the Cityscapes test set. | Code