📝 Publications
* denotes equal contribution.
Recent Works
SAMTok: Representing Any Mask with Two Words,
Yikang Zhou*, Tao Zhang*, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
[CVPR-2026, Highlight] Represents any mask with just two words, enabling pixel-level LLMs to reuse standard VLM infrastructure. | Code
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence,
Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
[ICML-2026] The first non-agentic spatio-temporal RL post-training framework. | Code
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs,
Haochen Wang*, Yuhao Wang*, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
[ICLR-2026] State-of-the-art region captioning models. | Code, Model, and Data
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation,
Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
[ICLR-2026] The first unified parallel generation model. | Code, Model, and Data
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos,
Haobo Yuan*, Xiangtai Li*, Tao Zhang*, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
[Tech Report] Marrying SAM2 with a LLaVA-like MLLM for open-ended spatio-temporal understanding. | Project Page
Several Other Previous Works
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything,
Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang
[ICLR-2025, Oral] A new real-time multi-task segmentation setting, benchmark, and a simple efficient baseline. | Code
OMG-Seg: Is One Model Good Enough For All Segmentation?,
Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy
[CVPR-2024] One model to perform image/video/open-vocabulary/multi-dataset/interactive segmentation in one shot. | Project Page
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation,
Xiangtai Li*, Wenwei Zhang*, Jiangmiao Pang*, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy
[CVPR-2022, Oral] Top 2%. The first unified video segmentation model and codebase for VPS, VIS, and VSS. | Code
Code can be found on my GitHub.