Clustering DTDs: An interactive two-level approach

Zhou, Aoying; Qian, Weining; Qian, Hailei; Zhang, Long; Liang, Yuqi; Jin, Wen

doi:10.1007/BF02960771


Correspondence
Published: November 2002

Volume 17, pages 807–819, (2002)

Journal of Computer Science and Technology


Abstract

XML (eXtensible Markup Language) is a standard that is widely applied in data representation and data exchange. However, DTD (Document Type Definition), an important concept in XML, is not taken full advantage of in current applications. In this paper, a new method for clustering DTDs is presented, which can also be used in XML document clustering. The two-level method clusters the elements in DTDs and clusters the DTDs themselves as separate steps. Element clustering forms the first level and provides element clusters, which are generalizations of relevant elements. DTD clustering utilizes this generalized information and forms the second level of the whole clustering process. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separated words in the vector model; 3) it facilitates the search for outliers. The experiments show that this method is able to categorize the relevant DTDs effectively.
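
The two-level idea can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state the similarity measures, so the string-similarity test, the cosine measure, the greedy grouping, both thresholds, and all function names below are assumptions chosen for illustration. The sketch also uses only element names and ignores DTD structure, which the paper's method takes into account.

```python
from difflib import SequenceMatcher
from math import sqrt


def cluster_elements(names, threshold=0.7):
    """Level 1 (assumed): greedily group element names by string similarity."""
    clusters = []
    for name in sorted(set(names)):
        for cluster in clusters:
            # Compare against the first member, used as the cluster representative.
            ratio = SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def dtd_vector(dtd_elements, clusters):
    """Count how many of a DTD's elements fall into each element cluster."""
    index = {name: i for i, cluster in enumerate(clusters) for name in cluster}
    vector = [0] * len(clusters)
    for name in dtd_elements:
        vector[index[name]] += 1
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster_dtds(dtds, elem_threshold=0.7, dtd_threshold=0.5):
    """Level 2 (assumed): group DTDs whose element-cluster vectors are similar."""
    all_names = [name for elements in dtds.values() for name in elements]
    clusters = cluster_elements(all_names, elem_threshold)
    vectors = {key: dtd_vector(elements, clusters) for key, elements in dtds.items()}
    groups = []
    for key in sorted(dtds):
        for group in groups:
            if cosine(vectors[key], vectors[group[0]]) >= dtd_threshold:
                group.append(key)
                break
        else:
            groups.append([key])
    return groups


# Hypothetical DTDs, reduced to their element names.
dtds = {
    "book": ["title", "author", "publisher"],
    "article": ["title", "authors", "journal"],
    "invoice": ["customer", "amount", "price"],
}
print(cluster_dtds(dtds))
```

Because "author" and "authors" fall into one element cluster at the first level, the book and article DTDs share two dimensions at the second level and end up in the same group, while the invoice DTD shares none and stays alone. A DTD left in a singleton group like this is the kind of candidate outlier the abstract refers to.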



Author information

Authors and Affiliations

Department of Computer Science, Laboratory for Intelligent Information Processing, Fudan University, 200433, Shanghai, P.R. China
Zhou Aoying, Qian Weining, Qian Hailei, Zhang Long & Liang Yuqi
Department of Computer Science, Simon Fraser University, Canada
Jin Wen


Additional information

This work is supported by the NKBRSF of China (Grant No.G1998030414), the National Natural Science Foundation of China (Grant No.60003016), the National Doctoral Research Foundation of China, and the Joint Project with IBM China Research Lab.

The second author is partially supported by Microsoft Research Fellowship.

ZHOU Aoying received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science, Fudan University. His main research interests include object-oriented data models for multimedia information, Web/XML data management, data mining and data warehousing, peer-to-peer computing, and novel database technologies and their applications to digital libraries and electronic commerce.

QIAN Weining is a Ph.D. candidate in the Department of Computer Science, Fudan University. His speciality is database and knowledge base. His research interests include clustering, data mining and Web mining.

QIAN Hailei is a graduate student in the Department of Computer Science, Fudan University. She majors in database and knowledge base. Her research interests include clustering and data mining.

ZHANG Long is a graduate student in the Department of Computer Science, Fudan University. He majors in database and knowledge base. His research interest is XML data management.

LIANG Yuqi is a graduate student in the Department of Computer Science, Fudan University. He majors in database and knowledge base. His research interests are XML data management and Web services.

JIN Wen is a Ph.D. candidate in the School of Computing, Simon Fraser University, Canada, supervised by Dr. Jiawei Han. His current research interests are database and data warehousing, data mining, Web mining and XML.


About this article

Cite this article

Zhou, A., Qian, W., Qian, H. et al. Clustering DTDs: An interactive two-level approach. J. Comput. Sci. & Technol. 17, 807–819 (2002). https://doi.org/10.1007/BF02960771


Received: 04 January 2001
Revised: 15 August 2001
Issue Date: November 2002
DOI: https://doi.org/10.1007/BF02960771
