Multimodal Machine Comprehension of How-to Instructions with Images and Text
In the blink of an eye, we understand what we are looking at. Much of our brain is devoted to processing the visual information we receive; thus, replicating human intelligence requires a thorough understanding of human vision. But is understanding vision enough to understand human intelligence? Probably not. Besides our visual perception skills, language is an essential and unique ability and a natural means of communication for humans. For thousands of years, humankind has been telling stories and giving instructions through spoken language. Among the earliest written forms of language are instructions, specifically food recipes. These instructions not only tell us what the people of that time ate but also teach us how they lived. Instructions have been around for centuries, whether as recipes or how-to guides, carved on stone tablets, printed in books, or published on the web. How-to instructions with images and text are ideal candidates for studying human intelligence, and understanding them is an important and intriguing research problem. Modern how-to guides almost always contain multimodal information such as images, videos, and text. Instructions are key to understanding and replicating a process, and how-to guides are excellent sources of instruction, since we can reproduce a process simply by following the guide. However, how-to instructions are also very challenging to understand: they often combine multiple modalities such as images and text, involve many objects and entities, and require a procedural understanding of actions and of interactions between entities that are often referred to across modalities.
How-to guides such as cooking recipes typically consist of multiple steps involving various objects and entities, most of which interact with each other through different actions. Considering an action as a combination of a verb and an object or entity, generalizing to unseen compositions of these action compounds poses a great challenge. In this regard, how visually grounded textual instructions might improve models' systematic generalization abilities remains an important research question. In this thesis, we examine multimodal machine comprehension of how-to instructions with images and text, review the related literature, and point out current challenges. We also propose methods to address some of these challenges and ways to improve upon existing approaches. The main contributions of this thesis can be summarized as follows. We investigate machine comprehension and reasoning problems and review the previous literature to lay the groundwork for understanding multimodal how-to instructions. We survey the compositional generalization literature, highlight current research challenges, and discuss its relation to understanding multimodal how-to instructions. We introduce a multimodal benchmark dataset of how-to instructions comprising cooking recipes with images and text. We propose novel methods for understanding multimodal procedures. Finally, we present a challenging multimodal compositional generalization setup, propose benchmark methods demonstrating that multimodality significantly improves the current state of the art in understanding multimodal how-to instructions, and conclude with future research directions and a discussion of open challenges.