Positional Self-attention Based Hierarchical Image Captioning

Abstract:

Intelligent robotic systems process images captured by camera devices to extract high-level semantic concepts. Current image captioning approaches use convolutional neural networks as the encoder and recurrent neural networks as the decoder. As a result, the different modalities are treated differently, and the chain architecture prevents full parallel computation. To address these issues, we present a parallel hierarchical neural network based on the encoder-decoder architecture to generate descriptions for images. Only convolutional neural networks are adopted, and the hierarchical framework takes fewer steps to capture distant dependencies. A masked positional self-attention mechanism is used in the decoder to improve performance. The fixed-size windows make parallel computation possible. This generative model uses a unified architecture for different modalities, reaching a BLEU-1 score (the higher the better) over 0.7 with a higher training speed.
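
To make the idea of masked self-attention in a decoder concrete, the following is a minimal sketch of generic causally masked scaled dot-product self-attention. It is not the authors' formulation: the abstract does not specify the positional encoding, the learned projections, or the window size, so all of these are omitted, and the function name masked_self_attention and the toy input are assumptions made for illustration only.

```python
import math


def masked_self_attention(x):
    """Generic causally masked self-attention (illustrative, no learned weights).

    x: list of T token vectors, each a list of d floats.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Mask: position i only sees positions 0..i, never later words.
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        row = [e / total for e in exps] + [0.0] * (len(x) - i - 1)
        weights.append(row)
        outputs.append(
            [sum(row[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical toy input: three 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = masked_self_attention(tokens)
```

Each row of the result depends only on the input vectors, not on a hidden state carried over from the previous step, which is why, during training, all positions can be computed independently rather than one after another as in a recurrent decoder.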

Published in: 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO)

Date of Conference: 06-08 December 2019

Date Added to IEEE Xplore: 20 January 2020

DOI: 10.1109/ROBIO49542.2019.8961665

Conference Location: Dali, China