I am currently a PhD student at UCLA, advised by Prof. Cho-Jui Hsieh, and a project lead at TurningPointAI, advised by Ruochen Wang, Prof. Minhao Cheng, and Prof. Tianyi Zhou. TurningPointAI is a collaborative initiative dedicated to advancing the field of Multimodal Language Agents. Learn more about our work at TurningPointAI and stay updated by following us on Twitter. Previously, I was fortunate to work with Prof. Yue Gao on 3D Computer Vision at Tsinghua University.
My research is funded by the Amazon Trainium Fellowship. For further information, please see my CV (last updated: Nov 23, 2024).
Research Interests: My research centers on advancing MLLM post-training, with a focus on reasoning, agents, and multimodality.
I developed VisualThinker, one of the first open-source repositories to replicate the "aha moment" of DeepSeek-R1 on a small non-SFT multimodal model (600+ GitHub stars).
Prior to the LLM era, I worked on 3D Computer Vision, Human-Computer Interaction (HCI), and visually-rich document understanding.
We released the first successful replication of DeepSeek-R1's 'aha moment' in a multimodal task using only a 2B non-SFT model! - February 26, 2025
Our submission on multimodal oversensitivity was accepted to ICLR! - January 22, 2025
We presented the first test suite assessing whether current MLLMs overreact to benign queries! - June 22, 2024
R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model