Can Qin

Google

Email: qin.ca[at]northeastern.edu

Hello and welcome! I work at Google CoreML, where I focus on building Multimodal and GenMedia models: systems that can understand and generate content across text, image, video, and other modalities. Before joining Google, I was at Salesforce AI Research, where I contributed to multimodal and generative models. I received my Ph.D. from Northeastern University in Boston, and during my studies I interned at Adobe and Salesforce, gaining hands-on experience in real-world AI research. I am passionate about pushing the boundaries of what AI can see, reason about, and create.

news

Jan, 2026 I have joined Google, working on Multimodal and GenMedia!
Dec, 2025 VLM2Vec-V2 (MMEB-V2) and our MLLM token compression survey were accepted by TMLR.
Oct, 2025 HoliTom was accepted by NeurIPS 25. We have released CoDA, a 1.7B coding diffusion LLM (DLLM).
May, 2025 CogAlign was accepted by ACL Findings, and we have released BLIP3-o.
Feb, 2025 We have two papers accepted by CVPR 25! Our latest paper, CogAlign, was released.
Sep, 2024 Our Medical MLLM paper was accepted by EMNLP 24 (Main)!
Aug, 2024 xGen-MM (BLIP-3) and xGen-VideoSyn-1 were released to the public! We have a paper accepted by TKDE; congrats to Yizhou! I have been invited as a reviewer for Nature Communications.
Jul, 2024 We have one paper accepted by ECCV 24!
Feb, 2024 We have one paper accepted by CVPR 24!
Nov, 2023 Began my journey at Salesforce Research in Palo Alto!

selected publications

  1. Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
    Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao
    arXiv preprint arXiv:2511.16043, 2025
    Agent · Self-evolving
  2. VLM2Vec-V2 (MMEB-V2): Advancing Multimodal Embedding for Videos, Images, and Visual Documents
    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, and others
    Transactions on Machine Learning Research (TMLR), 2025
    Embedding Model · Multimodal
  3. When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
    Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang
    Transactions on Machine Learning Research (TMLR), 2025
    Token Compression · Survey · Multimodal
  4. BLIP3-o: A Family of Fully Open Unified Multimodal Models - Architecture, Training and Dataset
    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, and others
    arXiv preprint arXiv:2505.09568, 2025
    Unified Multimodal Model
  5. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Video LLM · Token Compression
  6. xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
    Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, and others
    arXiv preprint arXiv:2408.12590, 2024
    Diffusion · Video Generation
  7. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, and others
    arXiv preprint arXiv:2408.08872, 2024
    VLM · Multimodal
  8. HIVE: Harnessing Human Feedback for Instructional Visual Editing
    Shu Zhang*, Xinyi Yang*, Yihao Feng*, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, and Ran Xu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Diffusion · Image Editing · Human-in-the-loop
  9. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
    Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu
    Advances in Neural Information Processing Systems (NeurIPS), 2023
    Diffusion · Controllable Image Generation