
Improved Speaker and Navigator for Vision-and-Language Navigation



Abstract:

Prior works in vision-and-language navigation (VLN) focus on using long short-term memory (LSTM) to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The capability of LSTM to process intermodal interactions has been widely verified; however, LSTM neglects intramodal interactions, which negatively affects both the navigator and the speaker. Attention-based Transformers perform well in sequence-to-sequence translation, but directly applying the Transformer structure to VLN has not yet produced satisfactory results. In this article, we propose novel Transformer-based multimodal frameworks for the navigator and the speaker, respectively. In our frameworks, multihead self-attention with residual connections carries the information flow. Specifically, in our navigator framework we set a switch to prevent them from directly entering the information flow. In experiments, we verify the effectiveness of our proposed approach and show significant performance advantages over the baselines.
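The core mechanism the abstract describes, multihead self-attention with a residual connection carrying the information flow, can be sketched as follows. This is a minimal NumPy illustration of the standard Transformer sublayer, not the authors' actual architecture; the random projection matrices stand in for learned weights, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(X, num_heads, rng):
    """X: (seq_len, d_model) token features (e.g., word or view embeddings).

    Random projections stand in for the learned Q/K/V/output weights.
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split into heads: (num_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention per head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    # Concatenate heads back to (seq_len, d_model), then project
    out = (attn @ Vh).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

def transformer_sublayer(X, num_heads=4, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Residual connection: the attention output is added back to the input,
    # so the sublayer carries the information flow forward.
    return X + multihead_self_attention(X, num_heads, rng)

tokens = np.random.default_rng(1).standard_normal((5, 16))  # 5 tokens, d_model=16
out = transformer_sublayer(tokens)
print(out.shape)
```

Because each token attends to every other token in the same sequence, this sublayer captures the intramodal interactions that, per the abstract, an LSTM-based information flow neglects.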
Published in: IEEE MultiMedia ( Volume: 28, Issue: 4, 01 Oct.-Dec. 2021)
Page(s): 55 - 63
Date of Publication: 09 February 2021



