research-article

Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level

Authors:

Qingguo Zhou

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 6

Article No.: 134, Pages 1 - 18

https://doi.org/10.1145/3527663

Published: 21 February 2023

Abstract

Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, and it is the foundation for building corpora for downstream tasks. This study implements Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect the accuracy of SBD. Specifically, an attention mechanism is added to a recurrent neural network (RNN) to study Tibetan SBD. The primary objective is to determine, using a trained model, whether a shad in Tibetan text marks the end of a sentence; experiments on syllable embedding and component embedding measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method achieves better results than the established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. The model is also validated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to test its reliability and generalizability. The results demonstrate that this method is efficient not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied directly to research on downstream tasks, such as machine translation and automatic summarization.
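The syllable-level framing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_syllables`, `shad_candidates`, `attention_pool`), the context window size, and the plain dot-product attention are all assumptions made for illustration, and no RNN or training is included. The sketch only shows how text might be split into syllables at the tsheg (U+0F0B), how each shad (U+0F0D) becomes a classification candidate with its surrounding syllables, and how attention weights could pool syllable vectors into one representation for a classifier.

```python
import math

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan mark that may or may not end a sentence


def split_syllables(text):
    """Split text into syllables; every shad is kept as its own token."""
    tokens, current = [], []
    for ch in text:
        if ch == TSHEG or ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        elif ch == SHAD:
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(SHAD)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def shad_candidates(tokens, window=3):
    """Return (index, left context, right context) for every shad token.

    `window` is a hypothetical hyperparameter: the number of neighbouring
    tokens kept on each side of the candidate shad.
    """
    candidates = []
    for i, tok in enumerate(tokens):
        if tok == SHAD:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            candidates.append((i, left, right))
    return candidates


def attention_pool(vectors, query):
    """Dot-product attention: softmax over scores, then a weighted sum.

    Stands in for the attention layer; in a real model `vectors` would be
    RNN hidden states and `query` a learned parameter (both assumed here).
    """
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors))
              for d in range(len(query))]
    return weights, pooled


# Latin placeholders stand in for Tibetan syllables in this demo.
demo = "ka" + TSHEG + "kha" + SHAD + " " + "ga" + TSHEG + "nga" + SHAD
for index, left, right in shad_candidates(split_syllables(demo), window=2):
    print(index, left, right)
```

Each candidate would then be labelled as sentence-ending or not by a trained classifier; working on syllables rather than words is what lets this kind of pipeline skip the word segmentation step mentioned in the abstract.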


Cited By

Lv H, Lv H, Yang L, Shen J, Duo L, Li Y, Zhou Q, Yong B (2024). Improved Tibetan Word Vectors Models Based on Position Information Fusion. ACM Transactions on Asian and Low-Resource Language Information Processing 23:11 (1-21). DOI: 10.1145/3681787. Online publication date: 21-Nov-2024
https://dl.acm.org/doi/10.1145/3681787
Chen X, Chen Z, Xiao L, Zhou M (2022). A Novel Sentiment Analysis Model of Museum User Experience Evaluation Data Based on Unbalanced Data Analysis Technology. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/2096634. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/2096634

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 21, Issue 6

November 2022

372 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3568970

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 February 2023

Online AM: 01 April 2022

Accepted: 18 March 2022

Revision received: 21 February 2022

Received: 09 August 2020

Published in TALLIP Volume 21, Issue 6

Qualifiers

Research-article

Funding Sources

Ministry of Education - China Mobile Research Foundation
The Fundamental Research Funds for the Central Universities
National Natural Science Foundation of China
Major National Project of High Resolution Earth Observation System
State Grid Corporation of China Science and Technology Project
Program for New Century Excellent Talents in University
Strategic Priority Research Program of the Chinese Academy of Sciences
Google Research Awards and Google Faculty Award


