Improved N-grams Approach for Web Page Language Identification

Selamat, Ali

doi:10.1007/978-3-642-24016-4_1

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 6910)

Abstract

Language identification is widely used in machine translation and information retrieval. This paper proposes an improved N-grams (ING) approach for web page language identification. The ING approach combines the original N-grams (ONG) approach with a modified N-grams (MNG) approach that has been used for language identification of web documents. The features selected by the ING approach are based on N-gram frequency and N-gram position. The features selected by the ONG approach are based on a distance measurement, and the features selected by the MNG approach are based on a Boolean matching rate, for language identification of web pages in Roman and Arabic scripts. A large real-world document collection from the British Broadcasting Corporation (BBC) website, comprising 1000 documents in each of the languages (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, and Pashto), was used for evaluation. Precision, recall, and F1 measures were used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of web page documents in Roman and Arabic scripts from the available datasets.
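The abstract names two scoring ideas, a distance measurement over N-gram profiles and a Boolean matching rate, without giving their formulas. The sketch below illustrates one common reading of each, assuming the rank-order ("out-of-place") distance from Cavnar and Trenkle's N-gram text categorization for the former and a simple "fraction of document N-grams found in the language profile" for the latter. It does not reproduce the paper's ING combination of frequency and position features. All function names, the `top_k` and `n_max` parameters, and the two sample sentences are hypothetical and are not taken from the paper or from the BBC collection.

```python
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield character n-grams (n = 1..n_max) from each whitespace-separated token."""
    for token in text.lower().split():
        padded = f"_{token}_"  # underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, top_k=300, n_max=3):
    """Return the top_k most frequent n-grams, most frequent first."""
    counts = Counter(char_ngrams(text, n_max))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Fraction of the document's n-grams that also occur in the language profile."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, lang_profiles, measure="distance"):
    """Pick the language with the smallest distance or the highest matching rate."""
    doc = build_profile(text)
    if measure == "distance":
        return min(lang_profiles,
                   key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]))
    return max(lang_profiles,
               key=lambda lang: boolean_matching_rate(doc, lang_profiles[lang]))


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy training text; a real profile would be built from many documents.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "indonesian": "saya sedang membaca buku di rumah dan dia pergi ke pasar bersama teman",
}
PROFILES = {lang: build_profile(text) for lang, text in SAMPLES.items()}
```

Calling `identify(some_text, PROFILES)` scores a document by distance, and passing `measure="match"` switches to the matching rate. With profiles this small the result on unseen text is unreliable; the sketch only shows how the two measures are computed and compared.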

Author information

Authors and Affiliations

Software Engineering Research Group, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, UTM Johor Baharu Campus, 81310, Johor, Malaysia
Ali Selamat

Editor information

Editors and Affiliations

Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370, Wroclaw, Poland
Ngoc Thanh Nguyen

About this chapter

Cite this chapter

Selamat, A. (2011). Improved N-grams Approach for Web Page Language Identification. In: Nguyen, N.T. (eds) Transactions on Computational Collective Intelligence V. Lecture Notes in Computer Science, vol 6910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24016-4_1

DOI: https://doi.org/10.1007/978-3-642-24016-4_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24015-7
Online ISBN: 978-3-642-24016-4
eBook Packages: Computer Science, Computer Science (R0)
