Clustering Template Based Web Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Abstract

More and more documents on the World Wide Web are based on templates. On a technical level, this gives those documents very similar source code and DOM tree structures. Grouping documents that are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. By combining those distance measures with different clustering approaches, we show which combination of methods leads to the desired result.
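The abstract does not spell out the distance measures or clustering methods it compares, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes one plausible structural measure (Jaccard distance over the set of root-to-element tag paths in the DOM) and a simple single-linkage grouping with a distance threshold. The names `tag_paths`, `path_distance`, `cluster` and the threshold value are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_distance(html_a, html_b):
    """Jaccard distance between tag-path sets (assumed measure, 0 = identical)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster(docs, threshold=0.5):
    """Single-linkage grouping: documents closer than `threshold` are merged."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if path_distance(docs[i], docs[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    page_a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
    page_b = "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>"
    page_c = "<html><body><table><tr><td>q</td></tr></table></body></html>"
    print(cluster([page_a, page_b, page_c]))
```

In this toy input, the first two pages share every tag path and differ only in text content, so they fall into one group, while the table-based page is placed in its own group. A set of paths ignores how often an element repeats, which is one reason a real comparison would weigh several measures against each other.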

Editor information

Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gottron, T. (2008). Clustering Template Based Web Documents. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_7

  • DOI: https://doi.org/10.1007/978-3-540-78646-7_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78645-0

  • Online ISBN: 978-3-540-78646-7

  • eBook Packages: Computer Science, Computer Science (R0)
