Thai Related Foreign Language Specific Web Crawling Approach

Suebchua, Tanaphol; Manaskasemsak, Bundit; Rungsawang, Arnon

doi:10.1007/978-981-4585-18-7_72

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 285)

Abstract

National web archives have been built successfully for years using domain- and language-specific web crawlers. We propose a focused web crawler that collects foreign-language web pages related to a particular nation. Rather than searching for the most relevant individual web pages, the crawler trains an ensemble machine learning classifier on selected features to find relevant clusters of unvisited web pages, called website segments. Over consecutive crawling cycles, the classifier is retrained with features extracted from newly found website segments. Preliminary experiments in the real web space on Thai-tourism-related topics show that this approach can exploit recent crawling experience to produce more promising harvest rates than traditional breadth-first and best-first baselines.
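As a rough illustration of the segment-first idea in the abstract, the following Python sketch visits pages one website segment at a time, highest-scoring segment first, and reports the harvest rate (the fraction of downloaded pages judged relevant). It is a toy under stated assumptions, not the authors' implementation: the function names, the example URLs, and the fixed segment scores are all hypothetical, the ensemble classifier is replaced by a plain scoring function, and retraining between cycles is omitted.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def harvest_rate(relevance_flags: List[bool]) -> float:
    """Fraction of downloaded pages that were judged relevant."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)


def crawl_cycle(
    segments: Dict[str, List[str]],
    score_segment: Callable[[str], float],
    is_relevant: Callable[[str], bool],
    budget: int,
) -> Tuple[List[str], List[bool]]:
    """Visit pages segment by segment, best-scoring segment first.

    `score_segment` stands in for the trained classifier (an assumption);
    `budget` caps the number of pages downloaded in this cycle.
    """
    # heapq is a min-heap, so negate scores to pop the best segment first.
    frontier = [(-score_segment(seg), seg) for seg in segments]
    heapq.heapify(frontier)

    visited: List[str] = []
    flags: List[bool] = []
    while frontier and len(visited) < budget:
        _, seg = heapq.heappop(frontier)
        for url in segments[seg]:
            if len(visited) >= budget:
                break
            visited.append(url)
            flags.append(is_relevant(url))
    return visited, flags


# Hypothetical toy data: two segments of one site with made-up scores.
segments = {
    "example.org/travel/": ["example.org/travel/a", "example.org/travel/b"],
    "example.org/news/": ["example.org/news/x", "example.org/news/y"],
}
scores = {"example.org/travel/": 0.9, "example.org/news/": 0.2}

visited, flags = crawl_cycle(
    segments,
    score_segment=lambda seg: scores[seg],
    is_relevant=lambda url: "/travel/" in url,  # stand-in relevance judge
    budget=3,
)
print(visited)
print(harvest_rate(flags))
```

In a fuller version, the relevance labels collected in one cycle would be fed back to retrain the segment scorer before the next cycle, which is the step the abstract credits for the improved harvest rates.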

Acknowledgments

The first author thanks the JSTP-NSTDA Thailand for the funding support.

Author information

Authors and Affiliations

Massive Information and Knowledge Engineering, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, 10900, Thailand
Tanaphol Suebchua, Bundit Manaskasemsak & Arnon Rungsawang

Corresponding author

Correspondence to Tanaphol Suebchua.

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
Tutut Herawan
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Mustafa Mat Deris
School of Information Technology, Deakin University, Burwood, Victoria, Australia
Jemal Abawajy

About this paper

Cite this paper

Suebchua, T., Manaskasemsak, B., Rungsawang, A. (2014). Thai Related Foreign Language Specific Web Crawling Approach. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore. https://doi.org/10.1007/978-981-4585-18-7_72

DOI: https://doi.org/10.1007/978-981-4585-18-7_72
Published: 15 December 2013
Publisher Name: Springer, Singapore
Print ISBN: 978-981-4585-17-0
Online ISBN: 978-981-4585-18-7
eBook Packages: Engineering, Engineering (R0)
