I am Xiangtai Li. I work on computer vision, multi-modal learning, and related problems.

I am a Research Scientist at ByteDance Seed (TikTok), Singapore, working on multi-modal large language models, covering both products and related research.

Previously, I worked as a Research Fellow at MMLab@NTU, S-Lab, advised by Prof. Chen Change Loy.

I obtained my PhD degree at Peking University (PKU) under the supervision of Prof. Yunhai Tong, and my bachelor's degree at Beijing University of Posts and Telecommunications (BUPT).

Before that, I worked as a research intern or research scientist at DeepMotion (now Xiaomi Car), JD Exploration Academy, SenseTime Research, Shanghai AI Laboratory, and Skywork 2050 Research, with several research outputs at top conferences and journals.

My research topics are:

Multi-modal learning with LLMs (MLLM): Benchmarking, new architecture design, unified modeling.

Large language models (LLMs) and auto-regressive models.

Image/video generation, editing, and synthesis (diffusion models).

Previously, I worked on image/video segmentation and detection, and open-vocabulary learning.

Moreover, the code and models for nearly all (roughly 98%) of my works, including those I have contributed to deeply, are open-sourced on GitHub.

I serve as a regular reviewer for many conferences and journals, including CVPR, ICCV, ECCV, ICLR, AAAI, NeurIPS, ICML, IJCAI, IEEE-TIP, IEEE-TPAMI, IJCV, IEEE-TCSVT, IEEE-TMM, IEEE-TGRS, and Remote Sensing.

I also serve as an area chair for ICLR-2025, ICML-2025, and ICCV-2025.

πŸ”₯ News

  • 2025.01 πŸ”₯πŸ”₯ Released a new video MLLM, Sa2VA (project page), which combines SAM-2 and LLaVA into one model for dense grounded understanding of images and videos.
  • 2025.01 πŸ”₯πŸ”₯ Released a new MLLM benchmark, MMVM (project page), which explores the visual correspondence shortcomings of current MLLMs.
  • 2025.01 πŸŽ‰πŸŽ‰ Several works are accepted by ICLR-2025, including one oral and one spotlight.
  • 2024.12 πŸ”₯πŸ”₯ Serving as an Area Chair for both ICML-2025 and ICCV-2025!
  • 2024.12 πŸŽ‰πŸŽ‰ Several works accepted by AAAI-2025 and 3DV-2025: Point Cloud Mamba, Point RWKV, LDM-Seg, and ReasonSeg3D.
  • 2024.09 πŸŽ‰πŸŽ‰ Several works accepted by NeurIPS-2024: OMG-LLaVA, MotionBooth (spotlight), SemFlow, and MambaAD. Thanks to all co-authors for their help!
  • 2024.07 πŸŽ‰πŸŽ‰ Our Transformer Survey is finally accepted by T-PAMI. Arxiv.
  • 2024.07 πŸ”₯πŸ”₯ The training code of Edge-SAM and the corresponding iOS app, β€œCutcha”, are available now: link. Code.
  • 2024.07 πŸ”₯πŸ”₯ Check out our recent universal dense MLLM model, OMG-LLaVA: project, code.
  • 2024.07 πŸŽ‰πŸŽ‰ DVIS-DAQ, Open-Vocabulary SAM, FaceAdapter, and GenView are accepted by ECCV-2024. All code and models are released.

πŸ“ Publications

* means equal contribution.

Recent Arxiv

  • Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang Marrying SAM2 with a LLaVA-like MLLM for open-ended spatio-temporal understanding. | Project Page
  • Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs, Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi The first MLLM visual matching benchmark and a simple contrastive token solution. | Project Page
  • DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation, Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong The first MLLM-based generation method for customized manga generation. | Project Page
  • Several Recent Works

  • Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis, Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Qing-Guo Chen, Xiangtai Li, Zhen Dong, Lei Zhu, Shuicheng Yan ICLR 2025 Make Masked Generative Transformer For Text to Image Generation Great Again! | Github
  • Towards Semantic Equivalence of Tokenization in Multimodal LLM, Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan ICLR 2025 A new visual tokenizer for various MLLMs design. | Github
  • RAP-SAM: Towards Real-Time All-Purpose Segment Anything, Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang ICLR 2025 (oral) A new real-time multi-task segmentation setting, benchmark, and a simple efficient baseline. | Github
  • Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model, Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang 3DV 2025 Searching and Reasoning 3D Segmentation with LLMs. | Github
  • OMG-Seg: Is One Model Good Enough For All Segmentation?, Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy CVPR 2024 One model to perform image/video/open-vocabulary/multi-dataset/interactive segmentation in one shot. | Project Page
  • Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively, Haobo Yuan, Xiangtai Li, Chong Zhou, Yining Li, Kai Chen, Chen Change Loy ECCV 2024 Extends SAM to recognize over twenty thousand classes. | Project Page
  • OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding, Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan NeurIPS 2024 The first end-to-end MLLM that unifies image-level, object-level, pixel-level understanding and reasoning. | Github
  • MotionBooth: Motion-Aware Customized Text-to-Video Generation, Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen NeurIPS 2024 (spotlight) The first customized T2V with motion control. | Github
  • Several Previous Works

  • Towards Open Vocabulary Learning: A Survey, Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, Dacheng Tao T-PAMI 2024 The first survey in open-vocabulary learning. (PAMI popular paper) | Github
  • Transformer-Based Visual Segmentation: A Survey, Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy T-PAMI 2024 The first survey that summarizes the transformer-based segmentation method from technical views. (PAMI popular paper) | Github
  • Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation, Xiangtai Li, Haobo Yuan, Wenwei Zhang, Guangliang Cheng, Jiangmiao Pang, Chen Change Loy, ICCV 2023 The first unified SOTA universal video segmentation model. | Project
  • TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, Qianyu Zhou*, Xiangtai Li*, Lu He, Yibo Yang, Guangliang Cheng, Yunhai Tong, Lizhuang Ma, Dacheng Tao, T-PAMI 2023 The first end-to-end vision transformer for video object detection, with SOTA results on video object detection. | Code
  • Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation, Xiangtai Li*, Wenwei Zhang*, Jiangmiao Pang*, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy, CVPR 2022 (Oral, top2%) The first unified video segmentation model and codebase for VPS, VIS, VSS | Code
  • Semantic Flow for Fast and Accurate Scene Parsing, Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong, ECCV 2020 (Oral, top2%) The first real-time model over 80% mIoU on Cityscapes test set. | Code
  • Improving Semantic Segmentation via Decoupled Body and Edge Supervision, Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, Yunhai Tong ECCV 2020 Improving semantic segmentation via decoupled body and edge supervision. | Code
  • Gated Fully Fusion for Semantic Segmentation, Xiangtai Li, Houlong Zhao, Lei Han, Yunhai Tong, Shaohua Tan, Kuiyuan Yang AAAI 2020 (Oral, top2%) Gated fusion of multi-scale features for semantic segmentation. | Code
  • Code for more works can be found in this.

πŸŽ– Honors and Awards

  • National Scholarship, Ministry of Education of China, at PKU (2019-2020 and 2020-2021).

  • President Scholarship of PKU (2020-2021).

  • Beijing Excellent Graduate (2017 and 2022).

  • BUPT Excellent Graduate (2017) and PKU Excellent Graduate (2022).

πŸ“– Education

  • 2017.09 - 2022.07, PhD, Peking University (PKU).

  • 2013.09 - 2017.07, Bachelor's degree, Beijing University of Posts and Telecommunications (BUPT).

πŸ’¬ Invited Talks

  • 2024.03 Invited talk on Open-Vocabulary Segmentation and Segment Anything at VALSE, online. Slides, Video.
  • 2023.08 Invited talk on Video Segmentation at VALSE, online. Slides, Video.
  • 2022.05 Invited talk on Panoptic Segmentation and Beyond at the Baidu PaddleSeg Group.
  • 2021.12 Invited talk on Video Segmentation at the DiDi Auto-Driving Group.
  • 2021.10 Invited talk on Aligned Segmentation at the Huawei Noah Auto-Driving Group.

πŸ’» Internships and Work Experience