
Better Localness for Non-Autoregressive Transformer

Published: 08 May 2023

Abstract

The non-autoregressive Transformer has attracted much attention from researchers because of its low inference latency. Although its performance has improved significantly in recent years, a gap remains between the non-autoregressive Transformer and the autoregressive Transformer. Motivated by the success of localness modeling in the autoregressive Transformer, in this work we incorporate localness into the non-autoregressive Transformer. Specifically, we design a dynamic mask matrix based on the query tokens, the key tokens, and their relative distance, and we unify the localness module across the self-attention and cross-attention modules. We conduct experiments on several benchmark tasks, and the results show that our model significantly improves the performance of the non-autoregressive Transformer.
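For illustration, the PyTorch-style sketch below shows one way such a dynamic localness mask could be wired into an attention layer: a bias on the attention logits is computed from the query vector, the key vector, and their relative distance, and the same module serves self-attention and cross-attention simply by swapping the key/value inputs. The class name, the window predictor, and the Gaussian-style distance penalty are assumptions of this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicLocalnessAttention(nn.Module):
    """Multi-head attention with a dynamic localness bias (illustrative sketch).

    For each (query, key) pair, a window width is predicted from the query and
    key vectors; the bias then penalises the attention logit in proportion to
    the squared relative distance, scaled by that window.
    """

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Predicts one positive window width per (head, query, key) pair.
        self.window_pred = nn.Linear(2 * self.d_head, 1)

    def forward(self, query, key, value):
        # query: (B, Lq, d_model); key, value: (B, Lk, d_model)
        B, Lq, _ = query.shape
        Lk = key.size(1)
        q = self.q_proj(query).view(B, Lq, self.num_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(key).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(value).view(B, Lk, self.num_heads, self.d_head).transpose(1, 2)

        # Content-based attention logits: (B, H, Lq, Lk).
        logits = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5

        # Relative distance |i - j| between query position i and key position j.
        dist = (torch.arange(Lq, device=query.device)[:, None]
                - torch.arange(Lk, device=query.device)[None, :]).abs().float()

        # Dynamic window width from the concatenated query/key vectors.
        pair = torch.cat(
            [q.unsqueeze(3).expand(-1, -1, -1, Lk, -1),
             k.unsqueeze(2).expand(-1, -1, Lq, -1, -1)],
            dim=-1,
        )  # (B, H, Lq, Lk, 2 * d_head)
        window = F.softplus(self.window_pred(pair)).squeeze(-1) + 1.0  # (B, H, Lq, Lk)

        # Soft localness mask: distant keys are penalised more when the window is small.
        logits = logits - dist ** 2 / (2.0 * window ** 2)

        attn = torch.softmax(logits, dim=-1)
        out = torch.matmul(attn, v).transpose(1, 2).reshape(B, Lq, -1)
        return self.out_proj(out)


# Usage: the same module handles self-attention (key/value = decoder states)
# and cross-attention (key/value = encoder states).
if __name__ == "__main__":
    attn = DynamicLocalnessAttention(d_model=512, num_heads=8)
    dec = torch.randn(2, 7, 512)   # decoder states
    enc = torch.randn(2, 11, 512)  # encoder states
    print(attn(dec, dec, dec).shape)  # self-attention:  torch.Size([2, 7, 512])
    print(attn(dec, enc, enc).shape)  # cross-attention: torch.Size([2, 7, 512])
```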



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 5
  May 2023, 653 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3596451


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 May 2023
      • Online AM: 11 March 2023
      • Accepted: 7 March 2023
      • Revised: 22 November 2022
      • Received: 11 December 2021

