Abstract
The non-autoregressive Transformer has attracted much attention from researchers due to its low inference latency. Although its performance has improved significantly in recent years, a quality gap remains between the non-autoregressive Transformer and its autoregressive counterpart. Motivated by the success of localness modeling in the autoregressive Transformer, in this work we incorporate localness into the non-autoregressive Transformer. Specifically, we design a dynamic mask matrix conditioned on the query tokens, the key tokens, and their relative distance, and we unify the localness module across the self-attention and cross-attention modules. Experiments on several benchmark tasks show that our model significantly improves the performance of the non-autoregressive Transformer.
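The abstract only names the ingredients of the dynamic mask (query tokens, key tokens, relative distance) without giving its parameterization, so the sketch below is one plausible reading rather than the paper's implementation. The class `LocalnessAttention`, the gating MLP `mask_mlp`, the distance embedding `dist_emb`, and the clipping range `max_rel_dist` are all illustrative assumptions; the Gaussian-bias-style gating follows the general spirit of localness modeling for self-attention networks.

```python
# A minimal sketch of localness-aware attention with a dynamic mask computed
# from (query, key, relative distance). All names and the gating function are
# assumptions; the abstract does not specify the exact formulation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalnessAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 32):
        super().__init__()
        self.scale = math.sqrt(d_model)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Embeddings of relative distance, clipped to [-max_rel_dist, +max_rel_dist]
        # (hypothetical component implementing the "relative distance" input).
        self.max_rel_dist = max_rel_dist
        self.dist_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)
        # Scores each (query, key, distance) triple to a soft gate in (0, 1).
        self.mask_mlp = nn.Linear(3 * d_model, 1)

    def forward(self, queries, keys, values):
        # queries: (B, Tq, d); keys/values: (B, Tk, d). One unified module can
        # serve self-attention (Tq == Tk) and cross-attention (Tq != Tk).
        q, k, v = self.q_proj(queries), self.k_proj(keys), self.v_proj(values)
        B, Tq, d = q.shape
        Tk = k.shape[1]
        logits = torch.einsum("bqd,bkd->bqk", q, k) / self.scale

        # Clipped relative positions, shape (Tq, Tk). For cross-attention we
        # simply align absolute positions, an illustrative simplification.
        pos_q = torch.arange(Tq, device=queries.device).unsqueeze(1)
        pos_k = torch.arange(Tk, device=queries.device).unsqueeze(0)
        rel = (pos_q - pos_k).clamp(-self.max_rel_dist, self.max_rel_dist)
        dist = self.dist_emb(rel + self.max_rel_dist)  # (Tq, Tk, d)

        # Dynamic mask in (0, 1) from the (query, key, distance) triple.
        triple = torch.cat(
            [
                q.unsqueeze(2).expand(B, Tq, Tk, d),
                k.unsqueeze(1).expand(B, Tq, Tk, d),
                dist.unsqueeze(0).expand(B, Tq, Tk, d),
            ],
            dim=-1,
        )
        gate = torch.sigmoid(self.mask_mlp(triple)).squeeze(-1)  # (B, Tq, Tk)

        # Apply the soft mask as an additive bias in log space, then attend.
        logits = logits + torch.log(gate + 1e-9)
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bqk,bkd->bqd", attn, v)
```

Applying the gate as `log(gate)` added to the logits keeps the softmax properly normalized, and the sigmoid keeps the mask soft and differentiable, so the model can learn how local each attention head should be rather than using a fixed window.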