I am Xiangtai Li. I work on computer vision, multi-modal learning, and related problems.

I am a Staff Research Scientist at TikTok (ByteDance) in Singapore.

Our team works on applied research and product development for TikTok Live. Our models are deployed directly in TikTok Live and reach billions of users.

Our topics cover multi-modal large language models, diffusion models, and LLM reasoning.

Previously, I worked as a Research Fellow at MMLab@NTU, S-Lab, advised by Prof. Chen Change Loy.

I obtained my PhD degree from Peking University (PKU) under the supervision of Prof. Yunhai Tong, and my bachelor’s degree from Beijing University of Posts and Telecommunications (BUPT).

My research focuses on two main areas:

  • Multi-modal learning with LLMs (MLLM): unified modeling, benchmarking, dataset pipeline building, RL-based post-training, diffusion language models.

  • Image/video generation and editing, including controllable image/video generation.

The code and models for nearly all of my work (about 98%), including projects I have contributed to substantially, are open-sourced on GitHub.

I serve as a regular reviewer for many conferences and journals, including CVPR, ICCV, ECCV, ICLR, AAAI, NeurIPS, ICML, IJCAI, IEEE-TIP, IEEE-TPAMI, IJCV, IEEE-TSCVT, IEEE-TMM, and IEEE-TGRS.

I also serve as an Area Chair for ICLR-2025/2026, CVPR-2026, ICML-2025, ICCV-2025, NeurIPS-2025, AAAI-2025/2026, WACV-2026, and ECCV-2026.

I also serve as an Associate Editor for IEEE-TPAMI.

[Urgent!] I am looking for strong interns with LLM/Diffusion infra or AIGC backgrounds; candidates with strong infra ability are preferred. Locations: Beijing and Singapore (ByteIntern / 筋斗云实习生 program).

My email addresses are xiangtai94@gmail.com and xiangtai.li@bytedance.com. Feel free to contact me directly.

📝 News

  • 2026.05.01: Two works on Temporal Grounding and Highlight Detection were accepted to ICML 2026!

  • 2026.04.10: SAMTok was accepted to CVPR 2026 as a Highlight paper!

📝 Publications

* means equal contribution.

Recent Works

  • SAMTok: Representing Any Mask with Two Words, Yikang Zhou*, Tao Zhang*, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li [CVPR-2026, Highlight] Represents any mask with two words, allowing pixel-level LLMs to reuse standard VLM infrastructure. | Code
  • Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence, Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang [ICML-2026] The first non-agentic spatio-temporal RL post-training framework. | Code
  • Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs, Haochen Wang*, Yuhao Wang*, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang [ICLR-2026] State-of-the-art region captioning models. | Code | Model and Data
  • MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation, Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li [ICLR-2026] The first unified parallel generation model. | Code | Model and Data
  • Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, Haobo Yuan*, Xiangtai Li*, Tao Zhang*, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang [Tech Report] Marrying SAM2 with a LLaVA-like MLLM for open-ended spatio-temporal understanding. | Project Page
Selected Previous Works

  • RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything, Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang [ICLR-2025, Oral] A new real-time multi-task segmentation setting and benchmark, with a simple and efficient baseline. | Code
  • OMG-Seg: Is One Model Good Enough For All Segmentation?, Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy [CVPR-2024] One model to perform image/video/open-vocabulary/multi-dataset/interactive segmentation in one shot. | Project Page
  • Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation, Xiangtai Li*, Wenwei Zhang*, Jiangmiao Pang*, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy [CVPR-2022, Oral] Top 2%. The first unified video segmentation model and codebase for VPS, VIS, VSS. | Code
  • Code can be found on my GitHub.

🎖 Honors and Awards

  • National Scholarship, Ministry of Education of China, at PKU (2019-2020, 2020-2021).

  • President Scholarship of PKU (2020-2021).

  • Beijing Excellent Graduate (2017, 2022).

  • BUPT Excellent Graduate (2017) and PKU Excellent Graduate (2022).

📖 Education

  • 2017.09 - 2022.07, Ph.D., Peking University (PKU).

  • 2013.09 - 2017.07, Bachelor's degree, Beijing University of Posts and Telecommunications (BUPT).

💬 Invited Talks

  • 2024.03 Invited talk on Open-Vocabulary Segmentation and Segment Anything at VALSE, online. Slides, Video.

  • 2023.08 Invited talk on Video Segmentation at VALSE, online. Slides, Video.

  • 2022.05 Invited talk on Panoptic Segmentation and Beyond at the Baidu PaddleSeg Group.

  • 2021.12 Invited talk on Video Segmentation at the DiDi Auto-Driving Group.

  • 2021.10 Invited talk on Aligned Segmentation at the Huawei Noah Auto-Driving Group.

💻 Internships