Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach

He, Bin; Tao, Tao; Chang, Kevin Chen-Chuan

doi:10.1007/978-3-540-30192-9_53

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources with similar query capabilities. In abstraction, this problem is essentially clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering the Web query schemas, the model-differentiation function outperforms existing ones with the hierarchical agglomerative clustering algorithm.
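To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It treats each query schema as a transaction of attribute names, summarizes each cluster by its attribute counts (a stand-in for a multinomial generative model), and runs hierarchical agglomerative clustering that always merges the two clusters whose count distributions look most alike. The abstract does not give the model-differentiation function itself, so a plain Pearson chi-square homogeneity statistic is assumed here in its place. The function names `chi_square` and `cluster` and the sample schemas are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square statistic for homogeneity of two attribute-count vectors.

    Assumption: this stands in for the paper's model-differentiation test,
    which the abstract does not spell out. Both counters must be non-empty.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        column_total = a[attr] + b[attr]
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster(schemas, k):
    """Agglomerative clustering of schemas (lists of attribute names) into k groups.

    Each step merges the pair of clusters with the smallest chi-square
    statistic, i.e. the pair least likely to come from different models.
    Returns a list of clusters, each a sorted list of schema indices.
    """
    clusters = [([i], Counter(schema)) for i, schema in enumerate(schemas)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: chi_square(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


# Hypothetical query schemas: two book sources and two airfare sources.
schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["from", "to", "departure date"],
    ["from", "to", "airline"],
]
groups = cluster(schemas, k=2)
```

On these four toy schemas the two book sources end up in one cluster and the two airfare sources in the other, because schemas sharing attributes yield a smaller statistic than schemas with disjoint vocabularies. A fuller treatment would compare the statistic against a significance threshold instead of fixing `k` in advance.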

Author information

Authors and Affiliations

Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Bin He, Tao Tao & Kevin Chen-Chuan Chang

Editor information

Editors and Affiliations

Sidonia Systems, Grubmühl 20, D-82131, Stockdorf, Germany
Wolfgang Lindner
Università di Milano, Italy
Marco Mesiti
Functional Genomics Center Zurich (FGCZ), UZH / ETH Zurich, Winterthurerstrasse 190, CH–8057, Zurich, Switzerland
Can Türker
Computer Science Department, University of Crete, and Institute of Computer Science, FORTH-ICS, Greece
Yannis Tzitzikas
Aristotle University of Thessaloniki
Athena I. Vakali

Cite this paper

He, B., Tao, T., Chang, K.CC. (2004). Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_53

DOI: https://doi.org/10.1007/978-3-540-30192-9_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science, Computer Science (R0)
