
Improved N-grams Approach for Web Page Language Identification

Transactions on Computational Collective Intelligence V

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 6910)

Abstract

Language identification is widely used in machine translation and information retrieval. This paper proposes an improved N-grams (ING) approach for web page language identification. The ING approach combines the original N-grams (ONG) approach with a modified N-grams (MNG) approach that has been used for language identification of web documents. The features selected by the ING approach are based on N-gram frequency and N-gram position. The features selected by the ONG approach are based on a distance measurement, whereas those selected by the MNG approach are based on a Boolean matching rate, for language identification of web pages in Roman and Arabic scripts. A large real-world document collection from the British Broadcasting Corporation (BBC) website, comprising 1000 documents in each language (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, Pashto), was used for evaluation. Precision, recall and F1 measures were used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of Roman- and Arabic-script web page documents in the available datasets.
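The abstract does not give the formulas behind these features, so the following Python sketch is only an illustration under stated assumptions, not the chapter's method. It assumes character N-grams ranked by frequency, takes the "distance measurement" to be a rank-order (out-of-place) distance in the style of Cavnar and Trenkle, and takes the "Boolean matching rate" to be the fraction of a document's N-grams that occur in a language profile. All function names and parameters (`ngram_profile`, `top_k`, and so on) are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character n-grams (n = 1..n_max), most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # assumption: mark word boundaries with "_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Assumed distance: sum of rank differences, with a fixed
    maximum penalty for n-grams missing from the language profile."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Assumed matching rate: share of document n-grams found in the profile."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, lang_profiles):
    """Pick the language whose profile is closest to the document profile."""
    doc = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]),
    )


def precision_recall_f1(true_labels, predicted_labels, label):
    """Per-language precision, recall and F1 from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In use, one profile would be built per language from training text, and each test page would be assigned to the language with the smallest distance (or, alternatively, the highest matching rate). How the chapter combines frequency and position features into the ING approach is not stated in the abstract and is not modelled here.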




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Selamat, A. (2011). Improved N-grams Approach for Web Page Language Identification. In: Nguyen, N.T. (eds) Transactions on Computational Collective Intelligence V. Lecture Notes in Computer Science, vol 6910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24016-4_1


  • DOI: https://doi.org/10.1007/978-3-642-24016-4_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24015-7

  • Online ISBN: 978-3-642-24016-4

  • eBook Packages: Computer Science, Computer Science (R0)
