In vision-language models (VLMs), visual tokens typically account for a large share of the computational cost, despite carrying lower information density than text tokens. To address this, ...
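The entry's specific method is truncated above, so as a generic illustration of the problem it names, here is a minimal sketch of one common way to cut visual-token overhead: pruning patch tokens by an attention-based importance score before they reach the language model. The function name, the use of [CLS]-to-patch attention as the score, and the `keep_ratio` hyperparameter are all assumptions for illustration, not the paper's method.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_scores: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the most-attended visual tokens (hypothetical sketch).

    visual_tokens: (batch, num_tokens, dim) patch embeddings from a vision encoder
    attn_scores:   (batch, num_tokens) importance scores, e.g. [CLS]-to-patch
                   attention from the encoder's last layer
    keep_ratio:    fraction of tokens to retain (illustrative hyperparameter)
    """
    num_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    # Top-k token indices per sample, then re-sorted to preserve spatial order
    topk = attn_scores.topk(num_keep, dim=1).indices.sort(dim=1).values
    # Gather the surviving tokens along the token dimension
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return visual_tokens.gather(1, idx)
```

With `keep_ratio=0.25`, the language model sees a quarter of the original visual tokens, which directly shrinks the quadratic attention cost over the visual span.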
Abstract: Vision-language models (VLMs) can solve complex tasks such as visual question answering by integrating visual and linguistic information. Their performance has improved significantly with ...
¹ University of Science and Technology of China  ² WeChat, Tencent Inc.

1. A Novel Parameter Space Alignment Paradigm

Recent MLLMs follow an input space alignment paradigm that aligns visual features ...
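The parameter space alignment paradigm this entry proposes is truncated, but the input space alignment baseline it contrasts against is well established: vision-encoder features are projected into the LLM's token-embedding space and consumed as ordinary input tokens. A minimal sketch of that baseline follows; the class name, the two-layer MLP projector, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InputSpaceProjector(nn.Module):
    """Input space alignment (sketch): map vision-encoder features into the
    LLM's token-embedding space so they can be fed in as input tokens.
    Dimensions and the two-layer MLP are assumptions, not the paper's design."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim) from the vision encoder
        # text_embeds:  (batch, seq_len, llm_dim) from the LLM's embedding table
        visual_embeds = self.proj(visual_feats)
        # Prepend projected visual tokens to the text sequence for the LLM
        return torch.cat([visual_embeds, text_embeds], dim=1)
```

A parameter space approach, by contrast, would move the alignment into the model's weights rather than its input sequence; the entry's details on that are cut off above.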
Abstract: Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and ...