Abstract
The non-autoregressive Transformer has attracted much attention from researchers due to its low inference latency. Although its performance has improved significantly in recent years, a quality gap remains between the non-autoregressive Transformer and its autoregressive counterpart. Motivated by the success of localness modeling in the autoregressive Transformer, in this work we incorporate localness into the non-autoregressive Transformer. Specifically, we design a dynamic mask matrix conditioned on the query tokens, the key tokens, and their relative distance, and we unify the localness module across the self-attention and cross-attention modules. Experiments on several benchmark tasks show that our model significantly improves the performance of the non-autoregressive Transformer.
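The abstract only names the ingredients of the dynamic mask (query tokens, key tokens, relative distance) without giving its parameterization, so the sketch below is one plausible reading rather than the paper's implementation. The class `LocalnessAttention`, the gating MLP `mask_mlp`, the distance embedding `dist_emb`, and the clipping range `max_rel_dist` are all illustrative assumptions; the Gaussian-bias-style gating follows the general spirit of localness modeling for self-attention networks.

```python
# A minimal sketch of localness-aware attention with a dynamic mask computed
# from (query, key, relative distance). All names and the gating function are
# assumptions; the abstract does not specify the exact formulation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalnessAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 32):
        super().__init__()
        self.scale = math.sqrt(d_model)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Embeddings of relative distance, clipped to [-max_rel_dist, +max_rel_dist]
        # (hypothetical component implementing the "relative distance" input).
        self.max_rel_dist = max_rel_dist
        self.dist_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)
        # Scores each (query, key, distance) triple to a soft gate in (0, 1).
        self.mask_mlp = nn.Linear(3 * d_model, 1)

    def forward(self, queries, keys, values):
        # queries: (B, Tq, d); keys/values: (B, Tk, d). One unified module can
        # serve self-attention (Tq == Tk) and cross-attention (Tq != Tk).
        q, k, v = self.q_proj(queries), self.k_proj(keys), self.v_proj(values)
        B, Tq, d = q.shape
        Tk = k.shape[1]
        logits = torch.einsum("bqd,bkd->bqk", q, k) / self.scale

        # Clipped relative positions, shape (Tq, Tk). For cross-attention we
        # simply align absolute positions, an illustrative simplification.
        pos_q = torch.arange(Tq, device=queries.device).unsqueeze(1)
        pos_k = torch.arange(Tk, device=queries.device).unsqueeze(0)
        rel = (pos_q - pos_k).clamp(-self.max_rel_dist, self.max_rel_dist)
        dist = self.dist_emb(rel + self.max_rel_dist)  # (Tq, Tk, d)

        # Dynamic mask in (0, 1) from the (query, key, distance) triple.
        triple = torch.cat(
            [
                q.unsqueeze(2).expand(B, Tq, Tk, d),
                k.unsqueeze(1).expand(B, Tq, Tk, d),
                dist.unsqueeze(0).expand(B, Tq, Tk, d),
            ],
            dim=-1,
        )
        gate = torch.sigmoid(self.mask_mlp(triple)).squeeze(-1)  # (B, Tq, Tk)

        # Apply the soft mask as an additive bias in log space, then attend.
        logits = logits + torch.log(gate + 1e-9)
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bqk,bkd->bqd", attn, v)
```

Applying the gate as `log(gate)` added to the logits keeps the softmax properly normalized, and the sigmoid keeps the mask soft and differentiable, so the model can learn how local each attention head should be rather than using a fixed window.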