Multi-modal large language models with hierarchical visual features and deeper vision-language interactions.
Jun 1, 2024