Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach

  • Conference paper
Current Trends in Database Technology - EDBT 2004 Workshops (EDBT 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Abstract

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources with similar query capabilities. Abstractly, this problem is essentially one of clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, for clustering Web query schemas, the model-differentiation function outperforms existing objective functions when used with the hierarchical agglomerative clustering algorithm.
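
The idea of treating each query schema as a transaction and merging statistically similar clusters can be illustrated with a small sketch. The abstract does not specify the exact test statistic or merge rule, so everything below is an assumption made for illustration: each schema is a set of attribute names, each cluster is summarized by pooled attribute counts (a simple multinomial view), and a Pearson chi-square statistic stands in for the paper's hypothesis test. The names `attribute_counts`, `chi_square`, and `cluster` are hypothetical helpers, not the authors' code, and the example schemas are invented.

```python
from collections import Counter
from itertools import combinations


def attribute_counts(schemas):
    """Pool attribute occurrences over all schemas in one cluster."""
    counts = Counter()
    for schema in schemas:
        counts.update(schema)
    return counts


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x k table of attribute counts.

    A small value means the two clusters look like draws from the same
    attribute distribution; a large value means they differ.
    Assumes both clusters contain at least one attribute.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column = counts_a[attr] + counts_b[attr]
        for observed, row in ((counts_a[attr], total_a), (counts_b[attr], total_b)):
            expected = row * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster(schemas, k):
    """Agglomerative clustering: repeatedly merge the most homogeneous pair.

    Simplification (assumption): raw statistics are compared directly,
    ignoring that degrees of freedom differ from pair to pair.
    """
    clusters = [[schema] for schema in schemas]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: chi_square(
                attribute_counts(clusters[pair[0]]),
                attribute_counts(clusters[pair[1]]),
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Invented query schemas for two domains (books and flights).
example_schemas = [
    {"title", "author", "isbn"},
    {"title", "author", "publisher"},
    {"from", "to", "date"},
    {"from", "to", "passengers"},
]
for group in cluster(example_schemas, k=2):
    print([sorted(schema) for schema in group])
```

Merging the pair with the smallest statistic keeps the remaining clusters as statistically distinct as possible at each step, which is one plausible reading of "maximizing statistical heterogeneity among clusters". A fuller treatment would compare p-values rather than raw statistics.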

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

He, B., Tao, T., Chang, K.CC. (2004). Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_53

  • DOI: https://doi.org/10.1007/978-3-540-30192-9_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23305-3

  • Online ISBN: 978-3-540-30192-9

  • eBook Packages: Computer Science, Computer Science (R0)