About me

I am Ziyuan Huang, a research scientist at Ant Group, advancing omni-modal intelligence, the next frontier of artificial general intelligence. My work pursues a foundational breakthrough: a unified model, grounded in unified representations, that unlocks deep cross-modal and cross-task synergy and moves beyond isolated perception or generation pipelines. The goal is AI systems that can truly assist, create, and collaborate with people in complex, real-world contexts, intuitively, coherently, and across any modality.

I earned my Ph.D. from the National University of Singapore in 2023 under the supervision of Prof. Marcelo Ang. Before joining Ant, I had the pleasure of conducting research in the MARS Lab under Prof. Hang Zhao, at TONGYI under Dr. Shiwei Zhang, and in the Vision4Robotics Group at Tongji University under Prof. Changhong Fu. I am also fortunate to have worked closely with Dr. Liang Pan and Prof. Ziwei Liu at S-Lab@NTU.

We are actively hiring self-motivated full-time research scientists and interns to work on cutting-edge research projects on unified omni-modal models. Feel free to drop me an email if you are interested!

Selected technical reports

For a full publication list, please refer to my Google Scholar.

Ming-Flash-Omni

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Ming team, Ant Group.

A sparse omni-modal MoE model at the 100-billion-parameter scale, delivering leading results in text-to-image generation, generative segmentation, and contextual ASR.

[paper] [code] [hf]

Ming-UniAudio

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, Kaimeng Ren, Ming Yang, Mingxue Yang, Qiang Xu, Qin Zhao, Ruijie Xiong, Shaoxiong Lin, Xuezhi Wang, Yi Yuan, Yifei Wu, Yongjie Lyu, Zhengyu He, Zhihao Qiu, Zhiqiang Fang, Ziyuan Huang

A unified audio model for understanding, generating, and editing audio content, based on unified representations.

[paper] [code] [hf]

Ming-UniVision

Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lyu, Taoye Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou

A unified MLLM for understanding, generating, and editing visual content that seamlessly supports multi-round interactions, all powered by the first-ever continuous unified visual representation.

[paper] [code] [hf]

Ming-Omni

Ming-Omni: A Unified Multimodal Model for Perception and Generation

Ming team, Ant Group.

The first open-source omni-modal model to match GPT-4o in input-output modality support.

[paper] [code] [hf]

Selected publications

ARGenSeg

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model

Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou

NeurIPS 2025.

[paper]

Skip-Vision

Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan

ICCV 2025.

[paper]

Chain-of-Sight

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang

NeurIPS 2024.

[paper]

SkySense

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li

CVPR 2024.

[paper]

Res-Tuning

Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone

Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, Jingren Zhou

NeurIPS 2023.

[paper]

TCTrack

Towards Real-World Visual Tracking with Temporal Contexts

Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu

TPAMI.

[paper] [code]

MAR

MAR: Masked Autoencoders for Efficient Action Recognition

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang

TMM.

[paper] [code]

DiST

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang

ICCV 2023.

[paper] [code]

PVT++

PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework

Bowen Li*, Ziyuan Huang*, Junjie Ye, Yiming Li, Sebastian Scherer, Hang Zhao, Changhong Fu

ICCV 2023.

[paper] [code]

TAdaConv

TAda! Temporally-Adaptive Convolutions for Video Understanding

Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, Marcelo H Ang Jr

ICLR 2022.

[paper] [code]

SSCS

Support-Set Based Cross-Supervision for Video Grounding

Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan Huang, Mingqian Tang, Xinbo Gao

ICCV 2021.

[paper]

MoSI

Self-supervised Motion Learning from Static Images

Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, Marcelo H Ang Jr

CVPR 2021.

[paper] [code]

DNet

Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications

Feng Xue, Guirong Zhuo, Ziyuan Huang, Wufei Fu, Zhuoyue Wu, Marcelo H Ang Jr

IROS 2020.

[paper] [code]

ARCF

Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking

Ziyuan Huang, Changhong Fu, Yiming Li, Fuling Lin, Peng Lu

ICCV 2019.

[paper] [code]