Applying Clickstream Data Mining to Real-Time Web Crawler Detection and Containment Using ClickTips Platform

Lourenço, Anália; Belo, Orlando

doi:10.1007/978-3-540-70981-7_39

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

The uncontrolled spread of Web crawlers has led to server overload and content misuse. Most crawlers still have legitimate and useful goals, but standard detection heuristics have not evolved along with Web crawling technology and can no longer identify most of today’s programs. In this paper, we propose an integrated approach to the problem that ensures the generation of up-to-date decision models, targeting both crawler monitoring and clickstream differentiation. The ClickTips platform supports the Web crawler detection and containment mechanisms, and its data webhousing system is responsible for clickstream processing and subsequent data mining. Web crawler detection and monitoring help preserve Web server performance and Web site privacy, while differentiated clickstream analysis provides focused reporting and interpretation of navigational patterns. The detection models are generated through clickstream data mining and target not only well-known Web crawlers but also camouflaged and previously unknown programs. Experiments with different real-world Web sites are encouraging, indicating that the approach is both feasible and adequate.
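The abstract does not give implementation details, so the following is only a minimal sketch of what session-level clickstream features for crawler detection might look like. It is not the ClickTips implementation. The `Hit` record, the 30-minute session timeout, the feature names, and the hand-written rule in `looks_like_crawler` are all assumptions made for illustration; in the approach described above, a model mined from labelled clickstream data would take the place of that rule.

```python
from dataclasses import dataclass

# Hypothetical, simplified view of one Web server log record.
@dataclass
class Hit:
    ip: str
    agent: str
    timestamp: float  # seconds since some epoch
    method: str       # e.g. "GET", "HEAD"
    path: str
    referrer: str     # "" or "-" when absent

SESSION_TIMEOUT = 30 * 60  # assumed inactivity threshold, in seconds
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")

def sessionize(hits):
    """Group hits by (ip, agent); start a new session after a long gap."""
    sessions = []
    open_session = {}  # (ip, agent) -> list of hits in the current session
    last_seen = {}     # (ip, agent) -> timestamp of the previous hit
    for hit in sorted(hits, key=lambda h: h.timestamp):
        key = (hit.ip, hit.agent)
        expired = key in last_seen and hit.timestamp - last_seen[key] > SESSION_TIMEOUT
        if key not in open_session or expired:
            open_session[key] = []
            sessions.append(open_session[key])
        open_session[key].append(hit)
        last_seen[key] = hit.timestamp
    return sessions

def session_features(session):
    """Compute a few illustrative per-session features."""
    n = len(session)
    return {
        "requests": n,
        "robots_txt": any(h.path == "/robots.txt" for h in session),
        "head_ratio": sum(h.method == "HEAD" for h in session) / n,
        "image_ratio": sum(h.path.lower().endswith(IMAGE_SUFFIXES) for h in session) / n,
        "no_referrer_ratio": sum(h.referrer in ("", "-") for h in session) / n,
    }

def looks_like_crawler(features):
    """Placeholder rule (an assumption); a mined decision model would replace it."""
    if features["robots_txt"]:
        return True
    return (
        features["requests"] >= 5
        and features["image_ratio"] == 0.0
        and features["no_referrer_ratio"] >= 0.9
    )
```

Keeping feature extraction separate from the decision rule means the rule can be swapped for a periodically retrained classifier without touching the sessionizing step, which is one way to read the abstract's emphasis on up-to-date models.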

Author information

Authors and Affiliations

Department of Informatics, School of Engineering, University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
Anália Lourenço & Orlando Belo

Editor information

Editors and Affiliations

Department of Business Administration and Economics, Bielefeld University, Universitätsstr. 25, 33501, Bielefeld, Germany
Reinhold Decker
Department of Economics, Freie Universität Berlin, Garystraße 21, 14195, Berlin, Germany
Hans-J. Lenz

About this paper

Cite this paper

Lourenço, A., Belo, O. (2007). Applying Clickstream Data Mining to Real-Time Web Crawler Detection and Containment Using ClickTips Platform. In: Decker, R., Lenz, H.J. (eds) Advances in Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70981-7_39

DOI: https://doi.org/10.1007/978-3-540-70981-7_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70980-0
Online ISBN: 978-3-540-70981-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
