Towards Better Vision-Inspired Vision-Language Models

Jun 1, 2024

Yun-Hao Cao

Kaixiang Ji

Ziyuan Huang

and 5 other authors


Abstract

Vision-language (VL) models have recently achieved unprecedented success, and the connection module is key to bridging the modality gap. Nevertheless, most existing methods do not sufficiently exploit the abundant visual cues. On the vision side, most existing approaches use only the last feature of the vision tower and ignore the low-level features. On the language side, most existing methods introduce only shallow vision-language interactions. In this paper, we present a vision-inspired vision-language connection module, dubbed VIVL, which efficiently exploits visual cues for VL models. To take advantage of the lower-level information from the vision tower, a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers, which enriches the visual cues with negligible parameter and computation overhead. To enhance VL interactions, we propose deep vision-conditioned prompts (DVCP) that allow deep interactions between vision and language features efficiently. VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task, which greatly improves data efficiency. When used as a plug-in module, VIVL consistently improves performance across various backbones and VL frameworks, delivering new state-of-the-art results on multiple benchmarks, e.g., NoCaps and VQAv2.
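
The abstract does not say how the FPE fuses intermediate-layer features, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes each vision-tower layer yields a feature map of the same shape and that fusion is a softmax-weighted sum with one scalar per layer. The names `fuse_pyramid` and `layer_logits` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_pyramid(layer_feats, layer_logits):
    """Fuse features from several vision-tower layers into one feature map.

    layer_feats:  list of arrays, each of shape (num_tokens, dim).
    layer_logits: one scalar per layer (hypothetical learnable parameters;
                  this fusion rule is an assumption, not the paper's FPE).
    """
    stacked = np.stack(layer_feats, axis=0)                 # (L, N, D)
    weights = softmax(np.asarray(layer_logits, dtype=np.float64))  # (L,)
    # Contract the layer axis: weighted sum of the L feature maps.
    return np.tensordot(weights, stacked, axes=1)           # (N, D)


# Toy stand-ins for three intermediate-layer outputs (4 tokens, 8 channels).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8)) for _ in range(3)]

# Equal logits reduce the fusion to a plain average over layers.
fused = fuse_pyramid(feats, [0.0, 0.0, 0.0])
```

Under this assumption the fusion adds only one scalar per layer, which is in the spirit of the "negligible overhead" claim, although the actual FPE design may differ.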

Type

Conference paper

Publication

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Last updated on Jun 1, 2024

MLLM

Authors

Ziyuan Huang

Research scientist
