Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 285)

Abstract

National web archives have been successfully made available through domain-specific and language-specific web crawlers for years. We propose here another focused web crawler for collecting foreign-language web pages that are also related to a nation. Rather than finding the most relevant individual web pages, an ensemble machine learning model is trained with selected features to find relevant clusters of unvisited web pages, called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found website segments. Preliminary experiments in the real web space on topics related to Thai tourism show that this approach can take advantage of recent crawling experience to produce more promising harvest rates than traditional breadth-first and best-first baselines.
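The crawling cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `crawl`, `visit`, `score`, and `retrain` are hypothetical, the ensemble model is reduced to a caller-supplied scoring function, and harvest rate is taken in its usual sense for focused crawling, the fraction of downloaded pages that are judged relevant.

```python
import heapq
from typing import Callable, List, Set, Tuple

Segment = str  # hypothetical identifier for a website segment


def harvest_rate(flags: List[bool]) -> float:
    """Fraction of downloaded pages judged relevant (0.0 if none downloaded)."""
    return sum(flags) / len(flags) if flags else 0.0


def crawl(
    seeds: List[Segment],
    visit: Callable[[Segment], Tuple[List[bool], List[Segment]]],
    score: Callable[[Segment], float],
    retrain: Callable[[List[Tuple[Segment, float]]], None],
    cycles: int,
    budget: int,
) -> List[bool]:
    """Run `cycles` crawling cycles, visiting at most `budget` segments each.

    `visit` returns (relevance flag per downloaded page, linked segments).
    `score` stands in for the trained model's relevance estimate.
    `retrain` receives (segment, observed harvest rate) pairs after each cycle.
    """
    frontier: List[Segment] = list(seeds)
    seen: Set[Segment] = set(seeds)
    flags: List[bool] = []
    for _ in range(cycles):
        # Rank unvisited segments with the current model, highest score first.
        heap = [(-score(seg), seg) for seg in frontier]
        heapq.heapify(heap)
        frontier = []
        experience: List[Tuple[Segment, float]] = []
        for _ in range(min(budget, len(heap))):
            _, seg = heapq.heappop(heap)
            page_flags, linked = visit(seg)
            flags.extend(page_flags)
            experience.append((seg, harvest_rate(page_flags)))
            for nxt in linked:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        # Segments not visited in this cycle stay queued for the next one.
        frontier.extend(seg for _, seg in heap)
        # Feed the experience of this cycle back into the model.
        retrain(experience)
    return flags
```

On a toy link graph, a scoring function that ranks a promising segment ahead of an irrelevant one raises the overall harvest rate for the same download budget, which is the effect the abstract reports against the baselines.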

Acknowledgments

The first author thanks the JSTP-NSTDA Thailand for the funding support.

Author information

Corresponding author

Correspondence to Tanaphol Suebchua.

Copyright information

© 2014 Springer Science+Business Media Singapore

About this paper

Cite this paper

Suebchua, T., Manaskasemsak, B., Rungsawang, A. (2014). Thai Related Foreign Language Specific Web Crawling Approach. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_72

  • DOI: https://doi.org/10.1007/978-981-4585-18-7_72

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-4585-17-0

  • Online ISBN: 978-981-4585-18-7

  • eBook Packages: Engineering, Engineering (R0)
