Character-Based N-gram Model for Uyghur Text Retrieval

Tohti, Turdi; Xu, Lirui; Huang, Jimmy; Musajan, Winira; Hamdulla, Askar

doi:10.1007/978-3-319-97909-0_72

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10996)

Included in the following conference series:

Chinese Conference on Biometric Recognition

Abstract

Uyghur is a low-resource language, yet Uyghur information retrieval (IR) is becoming increasingly important. Although related research and stem-based Uyghur IR systems exist, high retrieval performance remains difficult to achieve because of the limitations of existing stemming methods. In this paper, we propose a character-based N-gram model and a corresponding smoothing algorithm for Uyghur IR. A full-text IR system based on the character N-gram model is developed using the open-source tool Lucene. A series of experiments and comparative analyses is conducted. Experimental results show that the proposed method outperforms conventional Uyghur IR systems.
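To make the idea of character N-gram retrieval with smoothing concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the smoothing algorithm or the N-gram order, so the sketch assumes trigrams and substitutes standard Jelinek-Mercer interpolation with a collection model. The function names `char_ngrams` and `score`, the parameter `lam`, the probability floor, and the sample strings are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text.

    Runs of whitespace are collapsed to a single space first.
    """
    s = " ".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def score(query, doc, collection, n=3, lam=0.5):
    """Log-likelihood of the query's n-grams under a smoothed document model.

    Assumption: Jelinek-Mercer smoothing, which mixes the document's
    n-gram distribution with the collection's, weighted by lam.
    This is a stand-in, not the smoothing algorithm proposed in the paper.
    """
    doc_counts = Counter(char_ngrams(doc, n))
    coll_counts = Counter(g for d in collection for g in char_ngrams(d, n))
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())

    logp = 0.0
    for g in char_ngrams(query, n):
        p_doc = doc_counts[g] / doc_len if doc_len else 0.0
        p_coll = coll_counts[g] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # N-gram unseen in the whole collection: apply a small
            # hypothetical floor so the logarithm stays defined.
            p = 1e-12
        logp += math.log(p)
    return logp


# Illustrative Latin-script strings (hypothetical sample data).
docs = ["kitablar", "mektep"]
ranked = sorted(docs, key=lambda d: score("kitab", d, docs), reverse=True)
```

Because matching happens on character sequences, a query can match an inflected form that shares a stem with it without any stemmer being involved, which is the motivation the abstract gives for moving away from stem-based indexing.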

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61562083, 61262062) and the Western Region Talent Cultivation Special Projects of the China Scholarship Council (201608655002).

Author information

Authors and Affiliations

School of Information Science and Engineering, Xinjiang University, Ürümqi, China
Turdi Tohti, Lirui Xu, Winira Musajan & Askar Hamdulla
Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Canada
Turdi Tohti & Jimmy Huang

Corresponding author

Correspondence to Turdi Tohti.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jie Zhou
Beihang University, Beijing, China
Yunhong Wang
Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Xinjiang University, Urumqi, China
Zhenhong Jia
Tsinghua University, Beijing, China
Jianjiang Feng
Chinese Academy of Sciences, Beijing, China
Shiguang Shan
Xinjiang University, Urumqi, China
Kurban Ubul
Tsinghua University, Shenzhen, China
Zhenhua Guo

About this paper

Cite this paper

Tohti, T., Xu, L., Huang, J., Musajan, W., Hamdulla, A. (2018). Character-Based N-gram Model for Uyghur Text Retrieval. In: Zhou, J., et al. Biometric Recognition. CCBR 2018. Lecture Notes in Computer Science, vol 10996. Springer, Cham. https://doi.org/10.1007/978-3-319-97909-0_72

DOI: https://doi.org/10.1007/978-3-319-97909-0_72
Published: 09 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97908-3
Online ISBN: 978-3-319-97909-0
eBook Packages: Computer Science, Computer Science (R0)
