A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

Tsekouras, George E.; Gavalas, Damianos; Filios, Stefanos; Niros, Antonios D.; Bafaloukas, George

doi:10.1007/978-3-540-87881-0_43


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5138)

Abstract

We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a number of words for each thematic cultural area. We then create multidimensional document vectors comprising the most frequent word occurrences. The dissimilarity between these vectors is measured by the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into a number of clusters. Finally, our approach is illustrated via a proof-of-concept application which scrutinizes hundreds of web pages spanning different cultural thematic areas.
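
The vector, distance, and clustering stages described above can be illustrated with a minimal sketch. The abstract does not specify the vector encoding or the clustering algorithm, so several choices here are assumptions made purely for illustration: documents are encoded as binary presence vectors over the globally most frequent words, and a simple k-medoids loop stands in for the authors' cluster analysis. All function names and the sample documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, top_k):
    """Return the top_k most frequent words across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(top_k)]


def to_vector(doc, vocabulary):
    """Assumed encoding: 1 if the vocabulary word occurs in doc, else 0."""
    present = set(tokenize(doc))
    return tuple(1 if word in present else 0 for word in vocabulary)


def hamming(u, v):
    """Hamming distance: number of positions where the vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def assign(vectors, medoids):
    """Label each vector with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: hamming(vec, vectors[medoids[c]]))
        for vec in vectors
    ]


def k_medoids(vectors, initial_medoids, max_iter=20):
    """Illustrative k-medoids under Hamming distance (not the paper's method)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(vectors, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the previous medoid if a cluster ends up empty.
                updated.append(old)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(
                members,
                key=lambda i: sum(hamming(vectors[i], vectors[j]) for j in members),
            ))
        if updated == medoids:
            break
        medoids = updated
    return assign(vectors, medoids), medoids


# Hypothetical sample documents from two cultural themes.
docs = [
    "ancient greek sculpture museum",
    "museum of ancient sculpture",
    "folk music festival dance",
    "dance and folk music",
]
vocabulary = build_vocabulary(docs, top_k=6)
vectors = [to_vector(doc, vocabulary) for doc in docs]
labels, medoids = k_medoids(vectors, initial_medoids=[0, 2])
print(labels, medoids)
```

The binary encoding keeps the Hamming distance easy to interpret: it counts how many of the frequent words appear in one document but not the other. A categorical encoding, or a different clustering algorithm, would slot into the same three stages.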

Author information

Authors and Affiliations

Department of Cultural Technology and Communication, University of the Aegean, 81100, Mytilene, Lesvos, Greece
George E. Tsekouras, Damianos Gavalas, Stefanos Filios, Antonios D. Niros & George Bafaloukas

Editor information

John Darzentas, George A. Vouros, Spyros Vosinakis, Argyris Arnellos

About this paper

Cite this paper

Tsekouras, G.E., Gavalas, D., Filios, S., Niros, A.D., Bafaloukas, G. (2008). A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds) Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, vol 5138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87881-0_43

DOI: https://doi.org/10.1007/978-3-540-87881-0_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87880-3
Online ISBN: 978-3-540-87881-0
eBook Packages: Computer Science, Computer Science (R0)
