For a full publication list, please see my Google Scholar profile.
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
[paper][code][hf] A sparse omni-modal MoE model at the 100-billion-parameter scale, delivering leading results in text-to-image generation, generative segmentation, and contextual ASR. Ming team, Ant Group.
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
[paper][code][hf] A unified speech LLM for understanding, generating, and editing audio content, built on a unified speech representation. Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, Kaimeng Ren, Ming Yang, Mingxue Yang, Qiang Xu, Qin Zhao, Ruijie Xiong, Shaoxiong Lin, Xuezhi Wang, Yi Yuan, Yifei Wu, Yongjie Lyu, Zhengyu He, Zhihao Qiu, Zhiqiang Fang, Ziyuan Huang
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
[paper][code][hf] A unified MLLM for understanding, generating, and editing visual content that seamlessly supports multi-round interaction, powered by the first continuous unified visual representation. Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lyv, Taoye Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
Ming-Omni: A Unified Multimodal Model for Perception and Generation
[paper][code][hf] The first open-source omni-modal model to match GPT-4o in the range of input and output modalities it supports. Ming team, Ant Group.
