Can Qin

Salesforce AI Research, 181 Lytton Avenue, Palo Alto, CA, 94301, USA


Email: cqin[at]salesforce.com or qin.ca[at]northeastern.edu

Hello and welcome! I'm a Research Scientist at Salesforce AI Research, driven by a deep passion for Generative AI and Multi-modal Learning. My work focuses on developing video/image-to-text (understanding) and text-to-video/image (generation) techniques.

In 2023, I earned my Ph.D. from Northeastern University in Boston, USA, where my research centered on Transfer Learning and Efficient AI.

Before my Ph.D., I obtained my B.E. degree from Xidian University in Xi'an, China, in 2018, which laid the groundwork for my ongoing pursuit of knowledge and innovation.

news

Sep, 2024 Our Medical MLLM paper was accepted by EMNLP 24 (Main)!
Aug, 2024 xGen-MM (BLIP-3) and xGen-VideoSyn-1 were released to the public! We have a paper accepted by TKDE; congrats to Yizhou! I have been invited to review for Nature Communications.
Jul, 2024 We have one paper accepted by ECCV 24!
Feb, 2024 We have one paper accepted by CVPR 24!
Nov, 2023 Began my journey at Salesforce Research in Palo Alto!
Jun, 2023 I passed my Ph.D. dissertation defense and became Dr. Qin!

selected publications

  1. xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
    Michael S Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, and Juan Carlos Niebles
    arXiv preprint arXiv:2410.16267, 2024
  2. xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
    Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, and others
    arXiv preprint arXiv:2408.12590, 2024
  3. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, and others
    arXiv preprint arXiv:2408.08872, 2024
  4. Self-Training Large Language and Vision Assistant for Medical
    Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao
    Conference on Empirical Methods in Natural Language Processing (to appear), 2024
  5. SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
    Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, and Zhiqiang Tao
    European Conference on Computer Vision, 2024
  6. HIVE: Harnessing Human Feedback for Instructional Visual Editing
    Shu Zhang*, Xinyi Yang*, Yihao Feng*, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, and Ran Xu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
  7. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
    Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu
    Advances in Neural Information Processing Systems, 2023
  8. GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
    Can Qin, Ning Yu, Chen Xing, Shu Zhang, Zeyuan Chen, Stefano Ermon, Yun Fu, Caiming Xiong, and Ran Xu
    International Conference on Computer Vision, 2023