
Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Published: 21 February 2023

Abstract

Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, which is the foundation for building corpora for downstream tasks. This study performs Tibetan SBD at the syllable level so that word segmentation (WS) errors do not degrade SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in a Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results show that the proposed method achieves better results than established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. The models are also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to test their reliability and generalizability. The results show that the method is efficient not only for low-resource languages but also for high-resource languages. More importantly, the experimental results can be applied directly to research on downstream tasks, such as machine translation and automatic summarization.
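To make the task framing concrete, the sketch below shows one way to turn raw Tibetan text into per-shad classification candidates at the syllable level. It is an illustration only and does not reproduce the model from the article: the function names `syllabify` and `shad_candidates`, the context window size, and the `<pad>` token are all assumptions. It relies only on two Unicode facts, that the tsheg (U+0F0B) separates syllables and that the shad is U+0F0D. Other Tibetan punctuation is not handled.

```python
TSHEG = "\u0f0b"  # intersyllabic mark separating Tibetan syllables
SHAD = "\u0f0d"   # the mark whose sentence-final status must be decided


def syllabify(text):
    """Split text into syllables, keeping each shad as its own token."""
    tokens = []
    current = ""
    for ch in text:
        if ch == TSHEG or ch == SHAD or ch.isspace():
            # A delimiter closes the syllable collected so far.
            if current:
                tokens.append(current)
                current = ""
            if ch == SHAD:
                tokens.append(SHAD)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def shad_candidates(text, window=3, pad="<pad>"):
    """Return one (left_context, right_context) pair per shad.

    The window size and padding token are hypothetical choices; a
    classifier would label each pair as sentence-final or not.
    """
    tokens = syllabify(text)
    candidates = []
    for i, tok in enumerate(tokens):
        if tok != SHAD:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Pad so every candidate has a fixed-length context on both sides.
        left = [pad] * (window - len(left)) + left
        right = right + [pad] * (window - len(right))
        candidates.append((left, right))
    return candidates


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་བདེ་ལེགས།"
    for left, right in shad_candidates(sample, window=2):
        print(left, SHAD, right)
```

Each candidate pair would then be mapped to syllable (or component) embeddings and passed to a classifier; in the article that classifier is an RNN with attention, which this sketch leaves out.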


Cited By

  • (2024) Improved Tibetan Word Vectors Models Based on Position Information Fusion. ACM Transactions on Asian and Low-Resource Language Information Processing 23:11 (1-21). DOI: 10.1145/3681787. Online publication date: 21-Nov-2024
  • (2022) A Novel Sentiment Analysis Model of Museum User Experience Evaluation Data Based on Unbalanced Data Analysis Technology. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/2096634. Online publication date: 1-Jan-2022


        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 6
        November 2022
        372 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3568970

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 21 February 2023
        Online AM: 01 April 2022
        Accepted: 18 March 2022
        Revision received: 21 February 2022
        Received: 09 August 2020
        Published in TALLIP Volume 21, Issue 6


        Author Tags

        1. Low-resource language
        2. Tibetan sentence boundary disambiguation
        3. recurrent neural network
        4. attention mechanism
        5. shad

        Qualifiers

        • Research-article

        Funding Sources

        • Ministry of Education - China Mobile Research Foundation
        • The Fundamental Research Funds for the Central Universities
        • National Natural Science Foundation of China
        • Major National Project of High Resolution Earth Observation System
        • State Grid Corporation of China Science and Technology Project
        • Program for New Century Excellent Talents in University
        • Strategic Priority Research Program of the Chinese Academy of Sciences
        • Google Research Awards and Google Faculty Award
