
Encoder–Decoder Calibration for Multimodal Machine Translation



Impact Statement:
In machine translation, the translation accuracy of a model can be enhanced by adding image information related to the text. Because there is a modal gap between images and texts, how to integrate more complementary high-quality features is a key consideration for MMT. Many recent works have attempted to use attention mechanisms to fuse multimodal information and have achieved remarkable results. However, recent usability studies have questioned the ability of attention mechanisms to select the decisive multimodal inputs. We construct an Enc–Dec calibration method. After adopting our method, translation accuracy is clearly improved, especially for short sentences. In addition, examples and analysis show that this technique can reduce some problems of inaccurate multimodal alignment.

Abstract:

The main purpose of multimodal machine translation (MMT) is to improve the quality of translation results by taking the corresponding visual context as an additional input. Recently, many studies in neural machine translation have attempted to obtain high-quality multimodal representations in the encoder or decoder via attention mechanisms. However, the attention mechanism does not always accurately identify the decisive input for each prediction, which leads to unsatisfactory multimodal information fusion. To this end, we propose an encoder–decoder (Enc–Dec) calibration method which can automatically calibrate the fused image–text representation in the encoder and find the decisive input to the translation in the decoder. We validate our model on the MMT dataset Multi30K. Experimental results show that our method significantly outperforms several recent baselines on both English–German and English–French translation tasks in terms of BLEU and METEOR.
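The attention-based fusion the abstract refers to can be sketched generically as text tokens attending over image region features. The following is a minimal illustration of that mechanism only, not the authors' Enc–Dec calibration method; all names, shapes, and the residual-fusion choice are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats):
    """Each text token attends over image region features.

    text_feats:  (T, d) token representations from a text encoder
    image_feats: (R, d) region features from an image encoder
    Returns a (T, d) representation fusing both modalities.
    """
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)   # (T, R) alignment scores
    weights = softmax(scores, axis=-1)                 # attention over regions
    visual_context = weights @ image_feats             # (T, d) weighted visual context
    return text_feats + visual_context                 # residual-style fusion

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))    # 5 text tokens, 64-dim
image = rng.standard_normal((36, 64))  # 36 image regions, 64-dim
fused = cross_modal_attention(text, image)
print(fused.shape)  # (5, 64)
```

As the abstract notes, the attention weights computed this way do not always concentrate on the input that is actually decisive for a given prediction, which is the failure mode the proposed calibration method targets.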
Published in: IEEE Transactions on Artificial Intelligence ( Volume: 5, Issue: 8, August 2024)
Page(s): 3965 - 3973
Date of Publication: 17 January 2024
Electronic ISSN: 2691-4581

