An Efficient Algorithm for Clustering XML Schemas

Rhim, Tae-Woo; Lee, Kyong-Ho; Ko, Myeong-Cheol

doi:10.1007/978-3-540-30480-7_38


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3306)

Included in the following conference series:

International Conference on Web Information Systems Engineering


Abstract

Schema clustering is an important prerequisite to the integration of XML schemas. This paper presents an efficient method for clustering XML schemas. The proposed method first computes similarities among schemas. Similarity is defined by the size of the common structure between two schemas, under the assumption that schemas that cost less to integrate are more similar. Specifically, we extract one-to-one matchings between paths with the largest number of corresponding elements. Finally, a hierarchical clustering method is applied to the similarity values. Experimental results on many XML schemas show that the method outperforms previous work, achieving a precision of 98% and a clustering rate of 95% on average.
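The pipeline outlined in the abstract (path-based similarity, one-to-one path matching, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given here. The sketch assumes each schema is already flattened into root-to-leaf paths of element names, uses a longest-common-subsequence count as the "corresponding elements" measure, picks matchings greedily, normalizes by the larger schema, and clusters with average linkage and a stopping threshold. The names `lcs_len`, `similarity`, `cluster`, and `threshold` are hypothetical.

```python
from itertools import combinations


def lcs_len(a, b):
    """Length of the longest common subsequence of two element-name paths."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(paths_a, paths_b):
    """Greedy one-to-one path matching (an assumption, not the paper's matcher).

    Score = total matched elements / element count of the larger schema.
    """
    scored = sorted(
        ((lcs_len(p, q), i, j)
         for i, p in enumerate(paths_a)
         for j, q in enumerate(paths_b)),
        reverse=True,
    )
    used_a, used_b, common = set(), set(), 0
    for score, i, j in scored:
        # Each path may take part in at most one matching.
        if score and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            common += score
    size = max(sum(map(len, paths_a)), sum(map(len, paths_b)))
    return common / size if size else 0.0


def cluster(schemas, threshold):
    """Average-linkage agglomerative clustering over pairwise similarities.

    Merging stops once the most similar pair of clusters falls below threshold.
    """
    names = list(schemas)
    sim = {frozenset((a, b)): similarity(schemas[a], schemas[b])
           for a, b in combinations(names, 2)}
    clusters = [[n] for n in names]

    def link(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < threshold:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


# Illustrative input: three toy schemas as lists of root-to-leaf paths.
schemas = {
    "book1": [["book", "title"], ["book", "author", "name"]],
    "book2": [["book", "title"], ["book", "author", "name"], ["book", "year"]],
    "order": [["order", "item", "price"], ["order", "customer"]],
}
print(cluster(schemas, threshold=0.5))
```

With these toy schemas the two book schemas share five of seven path elements and merge, while the order schema shares none and stays separate. A greedy matcher is used only for brevity; an optimal assignment could give different scores on larger inputs.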

This work was supported by the Korea Research Foundation Grant (KRF-2003-003-D00429).



Author information

Authors and Affiliations

Dept. of Computer Science, Yonsei Univ., 134 Shinchon-dong, Sudaemoon-ku, Seoul, 120-749, Korea
Tae-Woo Rhim & Kyong-Ho Lee
Dept. of Computer Science, Konkuk Univ, 322 Danwol-dong, Chungju-si, Chungbuk, 380-701, Korea
Myeong-Cheol Ko

Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Database Systems Research and Development Center, University of Florida, P.O. Box 116125, 470 CSE, 32601-6125, Gainesville, FL, USA
Stanley Su
INFOLAB, Dept. of Information Systems and Management, Tilburg University, The Netherlands
Mike P. Papazoglou
Polish-Japanese Institute of Information Technology, Faculty of IT, Ul. Koszykowa 86, 02-008, Warsaw, Poland
Maria Elzbieta Orlowska
Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Science and Innovation Campus, OX11 0QX, Didcot, UK
Keith Jeffery

About this paper

Cite this paper

Rhim, TW., Lee, KH., Ko, MC. (2004). An Efficient Algorithm for Clustering XML Schemas. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_38

DOI: https://doi.org/10.1007/978-3-540-30480-7_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive
