Abstract:
Visual Question Answering is a complex problem that fuses natural language processing and image processing to answer a question using information from an image. The basic architecture uses a CNN to extract features from the image and an RNN to process the question, then combines the two in an MLP to produce an answer. These architectures perform well at identifying content but fail at higher-level reasoning, such as spatial awareness and combining objects. To help remedy this, we propose using attention to divide the image into separate objects, then using the features extracted from each object, along with its location and size, to train the MLP.
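To make the proposed input representation concrete, here is a minimal sketch of how per-object features might be combined with location and size and passed to an MLP alongside a question encoding. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how objects are combined, so the fixed object count, the simple concatenation, the normalized box encoding, and every name here (`object_descriptor`, `answer_distribution`, `init_layer`) are hypothetical. The appearance and question vectors are hand-written stand-ins for CNN and RNN outputs, and the MLP weights are random rather than trained.

```python
import math
import random


def object_descriptor(appearance, box, image_w, image_h):
    """Append normalized location and size to an object's appearance features.

    `box` is (x_min, y_min, x_max, y_max) in pixels (an assumed convention).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / image_w   # box center, as a fraction of width
    cy = (y_min + y_max) / 2 / image_h   # box center, as a fraction of height
    w = (x_max - x_min) / image_w        # box width, as a fraction of width
    h = (y_max - y_min) / image_h        # box height, as a fraction of height
    return list(appearance) + [cx, cy, w, h]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(question_vec, object_descs, layers):
    """Concatenate question and object descriptors, then run the MLP."""
    x = list(question_vec)
    for desc in object_descs:
        x.extend(desc)
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:          # ReLU on hidden layers only
            x = [max(0.0, v) for v in x]
    return softmax(x)


rng = random.Random(0)
question_vec = [0.2, -0.1, 0.5]          # stand-in for an RNN question encoding
objects = [                              # stand-ins for CNN features per object
    object_descriptor([0.9, 0.1], (10, 20, 110, 220), 200, 400),
    object_descriptor([0.3, 0.7], (100, 0, 200, 100), 200, 400),
]
n_in = len(question_vec) + sum(len(o) for o in objects)
layers = [init_layer(rng, n_in, 8), init_layer(rng, 8, 4)]  # 4 candidate answers
probs = answer_distribution(question_vec, objects, layers)
```

Normalizing the box by the image dimensions keeps the location and size values on a scale comparable to the appearance features. Concatenation requires a fixed number of objects per image, which is a simplification made only for this sketch.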
Date of Conference: 14-19 May 2017
Date Added to IEEE Xplore: 03 July 2017
ISBN Information:
Electronic ISSN: 2161-4407