Locality sensitive hashing for scalable structural classification and clustering of web documents

Authors:

Christian Hachenberg,

Thomas Gottron

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Pages 359 - 368

https://doi.org/10.1145/2505515.2505673

Published: 27 October 2013

Abstract

Web content management systems and web front ends to databases usually rely on homogeneous templates to generate and populate HTML documents containing structured, semi-structured or plain text data. Wrapper-based information extraction techniques build on such templates, but they depend heavily on the availability of suitable training documents based on the specific template. Structural classification and structural clustering of web documents are therefore important contributing factors to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality sensitive hash values in the form of short sequences of characters which effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template-based documents, lead to no or only minor variations in the corresponding fingerprint. Based on the fingerprints, we introduce a scalable index structure and algorithm that efficiently retrieve structurally similar documents from large collections of web documents. The effectiveness of our approach is empirically validated in a classification task on a data set of 13,237 documents based on 50 templates from different domains. Its efficiency and scalability are evaluated in a clustering task on a data set retrieved from the Open Directory Project comprising more than 3.6 million web documents. For both tasks, our template fingerprint approach provides results of high quality and demonstrates a runtime that is linear, O(n), in the number of documents.
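The abstract does not say how a template fingerprint is computed, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It swaps in a SimHash-style scheme over the set of distinct root-to-node tag paths of a page. The names `TagPathCollector`, `structural_fingerprint` and `hamming_distance`, the 64-bit width and the hexadecimal encoding are all illustrative assumptions.

```python
import hashlib
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects root-to-node tag paths; text and attributes are ignored."""

    # Elements that never have a closing tag and must not stay on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structural_fingerprint(html, bits=64):
    """Return a short hex string summarising the tag structure of `html`.

    Assumption: a SimHash over distinct tag paths stands in for the
    paper's template fingerprint, which the abstract does not specify.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    features = set(collector.paths)  # distinct paths only, repeats ignored

    weights = [0] * bits
    for path in features:
        digest = hashlib.md5(path.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1

    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return format(fingerprint, "0{}x".format(bits // 4))


def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two hex fingerprints."""
    return bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")


if __name__ == "__main__":
    page_a = "<html><body><ul><li>one</li><li>two</li></ul></body></html>"
    page_b = "<html><body><ul><li>x</li><li>y</li><li>z</li></ul></body></html>"
    print(structural_fingerprint(page_a), structural_fingerprint(page_b))
```

Because only distinct tag paths are hashed, two pages that differ in their text or in the number of repeated list items receive the same fingerprint in this sketch, while pages with different markup skeletons typically differ in some bits. A real index would group documents by fingerprint (or by parts of it) so that candidates can be found without comparing every pair of documents.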

Index Terms

Locality sensitive hashing for scalable structural classification and clustering of web documents
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

October 2013

2612 pages

ISBN:9781450322638

DOI:10.1145/2505515

General Chairs:
Qi He (LinkedIn, USA)
Arun Iyengar (IBM T.J. Watson Research Center, USA)

Program Chairs:
Wolfgang Nejdl (L3S Research Center, Germany)
Jian Pei (Simon Fraser University, Canada)
Rajeev Rastogi (Amazon, India)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM'13

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management

October 27 - November 1, 2013

San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Citations

Cited By

Grigera J, Gardey J, Rossi G, Garrido A (2023). Flexible Detection of Similar DOM Elements. Web Information Systems and Technologies, pages 174-195. 10.1007/978-3-031-24197-0_10. Online publication date: 18-Jan-2023
https://doi.org/10.1007/978-3-031-24197-0_10
Bakaev M, Speicher M, Heil S, Gaedke M (2020). I Don’t Have That Much Data! Reusing User Behavior Models for Websites from Different Domains. Web Engineering, pages 146-162. 10.1007/978-3-030-50578-3_11. Online publication date: 9-Jun-2020
https://dl.acm.org/doi/10.1007/978-3-030-50578-3_11
Proskurnia J, Cartright M, Garcia-Pueyo L, Krka I, Wendt J, Kaufmann T, Miklos B, Barrett R, Cummings R, Agichtein E, Gabrilovich E (2017). Template Induction over Unstructured Email Corpora. Proceedings of the 26th International Conference on World Wide Web, pages 1521-1530. 10.1145/3038912.3052631. Online publication date: 3-Apr-2017
https://dl.acm.org/doi/10.1145/3038912.3052631
Wendt J, Bendersky M, Garcia-Pueyo L, Josifovski V, Miklos B, Krka I, Saikia A, Yang J, Cartright M, Ravi S, Bennett P, Josifovski V, Neville J, Radlinski F (2016). Hierarchical Label Propagation and Discovery for Machine Generated Email. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 317-326. 10.1145/2835776.2835780. Online publication date: 8-Feb-2016
https://dl.acm.org/doi/10.1145/2835776.2835780
Toker K, Yuksel S (2015). Accelerating classification time in Hyperspectral Images. 2015 23nd Signal Processing and Communications Applications Conference (SIU), pages 2126-2129. 10.1109/SIU.2015.7130292. Online publication date: May-2015
https://doi.org/10.1109/SIU.2015.7130292
Alrwais S, Yuan K, Alowaisheq E, Li Z, Wang X, Fu K (2014). Understanding the dark side of domain parking. Proceedings of the 23rd USENIX conference on Security Symposium, pages 207-222. 10.5555/2671225.2671239. Online publication date: 20-Aug-2014
https://dl.acm.org/doi/10.5555/2671225.2671239
