Evaluating Zero-Shot Learning Capabilities of Vision-Language Models
Date: 2024
Author: Doğan, Mustafa
Access: Open access

Abstract
Vision-Language Models (VLMs) stand at the forefront of artificial intelligence research, aiming to bridge the gap between visual content and natural language understanding. Their significance lies in their potential to enable machines to comprehend and interact with the world in a more human-like manner. However, evaluating VLMs poses a twofold challenge that requires careful consideration and innovative approaches.
One of the primary challenges in evaluating VLMs revolves around understanding the intricate relationship between visual and linguistic information. While these models excel at processing individual modalities, such as images, videos, or text, effectively integrating these modalities to derive meaningful insights remains a complex task. Particularly in dynamic and context-rich scenarios, VLMs must navigate diverse visual stimuli while interpreting accompanying textual cues, requiring robust mechanisms for cross-modal fusion and comprehension.
Furthermore, the lack of transparency in VLMs adds another layer of complexity to their evaluation. While these models may exhibit high performance on benchmark datasets, understanding the underlying reasoning processes and knowledge representations remains elusive. Deciphering how VLMs leverage their learned knowledge to generate responses and make predictions is essential for gaining insights into their capabilities and limitations.
This thesis addresses these challenges by conducting a comprehensive comparative analysis of Multimodal Large Language Models (MLLMs) and Video-Language Models (VidLMs). It focuses on their ability to bridge the semantic gap between visual inputs and linguistic outputs. Through empirical evaluation, this research examines the strengths and limitations of these models in comprehending and articulating visual content in both static and dynamic contexts.
This thesis makes two main contributions. First, it conducts a comprehensive analysis of few-shot In-Context Learning (ICL) and Chain-of-Thought (CoT) strategies on MLLMs, revealing that these strategies can significantly boost performance compared to zero-shot settings. Second, it introduces a novel zero-shot foiling test for VidLMs, designed to assess their proficiency in recognizing actions and actors within dynamic scenes. The findings indicate that current VidLMs struggle with temporal reasoning and action recognition, performing only marginally better than chance, thereby highlighting the need for advances in VidLM architectures that can effectively handle spatio-temporal tasks.
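For illustration only, a foiling test of this kind is typically scored as a two-way forced choice: the VidLM's video-text matching score for the correct caption is compared against its score for the foiled caption, so chance level is 50%. The minimal Python sketch below shows such a scoring procedure; the model_score interface, file names, and captions are hypothetical placeholders, not the thesis's actual evaluation code.

import random

def foiling_accuracy(model_score, examples):
    """Pairwise foiling accuracy: the fraction of examples where the model
    scores the correct caption higher than its foiled counterpart.

    model_score(video, caption) -> float is a hypothetical video-text
    matching function exposed by the VidLM under evaluation.
    """
    correct = sum(
        model_score(video, caption) > model_score(video, foil)
        for video, caption, foil in examples
    )
    return correct / len(examples)

# Chance level for the two-way caption-vs-foil choice is 0.5; a VidLM that
# cannot distinguish actions or actors lands only marginally above it.
if __name__ == "__main__":
    rng = random.Random(0)

    def dummy_score(video, caption):
        # Stand-in scorer that ignores its inputs and returns a random score.
        return rng.random()

    examples = [
        (f"video_{i}.mp4", "a man opens a door", "a man closes a door")
        for i in range(1000)
    ]
    print(f"pairwise accuracy: {foiling_accuracy(dummy_score, examples):.3f}")

With the random stand-in scorer the accuracy hovers around 0.5, which is the chance baseline against which the reported VidLM results are compared.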
In conclusion, this thesis sheds light on the performance of MLLMs and VidLMs, offering valuable insights and identifying areas for future improvement. It underscores the importance of ongoing innovation in multimodal architectures to develop more robust and contextually aware language models capable of bridging the gap between visual content and natural language.