Towards Understanding Intuitive Physics with Language and Vision
Date: 2021
Author: Ateş, Tayfun
Open access
Abstract
Visual question answering (VQA) is one of the most challenging tasks in multimodal machine reasoning. VQA requires machines to provide correct answers to questions about an image or a video. Here, the machine must perceive the scene and infer correct judgments about the relationships between different entities. Recent VQA benchmarks have mostly been proposed for static images, and they probe only the spatial reasoning capabilities of artificial models. In other words, these benchmarks do not require machines to learn the physical properties of objects or to understand the different physical relationships among them. Hence, it is not possible to evaluate whether models have intuitive physics or causal and temporal reasoning capabilities using these datasets. This thesis proposes a new benchmark, CRAFT, designed to evaluate these capabilities of artificial intelligence models. In particular, it comprises 38K video and question pairs that are automatically generated from 3K videos of dynamic scenes. These scenes are synthetically created using a physics engine, with ten different two-dimensional scene layouts containing a variable number of dynamic objects. While generating the questions in CRAFT, we consider five different categories, two of which (descriptive and counterfactual) have been investigated in earlier work. In addition, we introduce three new question categories (cause, enable, and prevent), inspired by representations of causal relationships in cognitive science. Special attention has been given to the data generation process in order to create questions that are easy for humans to solve but difficult for machines. To support this claim, CRAFT questions were posed to both artificial models and 12 adult human participants. Our experimental results demonstrate that, although the tasks seem intuitive for human participants, there is a large performance gap between humans and the most successful artificial model.
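As an illustration of the kind of synthetic scene generation the abstract describes, the following minimal Python sketch simulates a two-dimensional dynamic scene with a variable number of objects. It uses the open-source pymunk 2D physics engine; the engine choice, layout, and all parameters here are assumptions for illustration, not the thesis's actual generation pipeline.

```python
# Minimal sketch of generating a 2D dynamic scene with a physics engine.
# pymunk is used purely for illustration; the abstract does not name its
# engine, so the library choice and all parameters below are assumptions.
import random
import pymunk

def simulate_scene(num_objects=5, steps=300, dt=1 / 60.0):
    space = pymunk.Space()
    space.gravity = (0.0, -900.0)  # downward gravity, in pixels/s^2

    # Static floor so falling objects collide and come to rest.
    floor = pymunk.Segment(space.static_body, (0, 10), (600, 10), 5)
    floor.elasticity = 0.6
    space.add(floor)

    # A variable number of dynamic circular objects, echoing the
    # variable-object layouts described in the abstract.
    for _ in range(num_objects):
        mass, radius = 1.0, random.uniform(8, 20)
        body = pymunk.Body(mass, pymunk.moment_for_circle(mass, 0, radius))
        body.position = (random.uniform(50, 550), random.uniform(200, 400))
        shape = pymunk.Circle(body, radius)
        shape.elasticity = 0.8
        space.add(body, shape)

    # Step the simulation and record object positions; a real pipeline
    # would also render frames to video and log collision events for
    # downstream question generation.
    trajectory = []
    for _ in range(steps):
        space.step(dt)
        trajectory.append([tuple(b.position) for b in space.bodies])
    return trajectory

frames = simulate_scene(num_objects=7)
print(f"Simulated {len(frames)} steps for {len(frames[0])} objects")
```

In a setup like this, the recorded trajectories and collision events would provide the ground truth from which descriptive, counterfactual, cause, enable, and prevent questions could be generated automatically.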