
Clustering DTDs: An interactive two-level approach

  • Correspondence
Journal of Computer Science and Technology

Abstract

XML (eXtensible Markup Language) is a standard widely applied in data representation and data exchange. However, the DTD (Document Type Definition), an important concept of XML, is not taken full advantage of in current applications. This paper presents a new method for clustering DTDs, which can also be used in XML document clustering. The two-level method clusters the elements in DTDs and the DTDs themselves separately. Element clustering forms the first level and produces element clusters, which generalize relevant elements. DTD clustering forms the second level and uses this generalized information. The two-level method has the following advantages: 1) it takes into consideration both the content and the structure within DTDs; 2) the generalized information about elements is more useful than the separate words in the vector model; 3) it facilitates the search for outliers. The experiments show that the method is able to categorize relevant DTDs effectively.
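
The two-level idea can be illustrated with a minimal Python sketch. The abstract does not say which similarity measures or clustering algorithms the method uses, so everything concrete here is an assumption made for illustration: element names are compared by plain string similarity (`difflib.SequenceMatcher`), DTDs are compared by Jaccard overlap of their element-cluster sets, grouping is a greedy single pass, and the names `element_names`, `group`, `name_similar` and `two_level_cluster` are hypothetical. The sketch also looks only at element names, not at DTD structure.

```python
import re
from difflib import SequenceMatcher


def element_names(dtd_text):
    """Return the set of element names declared in a DTD string."""
    return set(re.findall(r"<!ELEMENT\s+([\w.:-]+)", dtd_text))


def group(items, similar):
    """Greedy single-pass grouping (an assumed stand-in for real clustering).

    Each item joins the first group whose first member is similar to it;
    otherwise it starts a new group.
    """
    groups = []
    for item in items:
        for g in groups:
            if similar(g[0], item):
                g.append(item)
                break
        else:
            groups.append([item])
    return groups


def name_similar(a, b, threshold=0.8):
    """Assumed element similarity: case-insensitive string ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def two_level_cluster(dtds, dtd_threshold=0.5):
    """dtds maps a DTD label to its text; returns (element clusters, DTD clusters)."""
    names = {label: element_names(text) for label, text in dtds.items()}
    all_elements = sorted(set().union(*names.values()))

    # Level 1: cluster elements, so related names share one cluster id.
    element_clusters = group(all_elements, name_similar)
    cluster_of = {e: i for i, g in enumerate(element_clusters) for e in g}

    # Generalized view of each DTD: the set of element-cluster ids it uses.
    profile = {label: {cluster_of[e] for e in els} for label, els in names.items()}

    def dtd_similar(a, b):
        union = profile[a] | profile[b]
        return bool(union) and len(profile[a] & profile[b]) / len(union) >= dtd_threshold

    # Level 2: cluster DTDs using the generalized element information.
    return element_clusters, group(sorted(dtds), dtd_similar)


dtds = {
    "book1": "<!ELEMENT book (title, author)> <!ELEMENT title (#PCDATA)> <!ELEMENT author (#PCDATA)>",
    "book2": "<!ELEMENT book (titles, authors)> <!ELEMENT titles (#PCDATA)> <!ELEMENT authors (#PCDATA)>",
    "order": "<!ELEMENT order (item, price)> <!ELEMENT item (#PCDATA)> <!ELEMENT price (#PCDATA)>",
}
element_clusters, dtd_clusters = two_level_cluster(dtds)
print(dtd_clusters)  # [['book1', 'book2'], ['order']]
```

In this toy input, `book1` and `book2` share only one raw element name out of five (`book`), so comparing raw names directly gives a Jaccard overlap of 0.2 and would not group them at the 0.5 threshold. After the first level merges `title`/`titles` and `author`/`authors`, their generalized profiles coincide, which is the kind of benefit the abstract attributes to generalized element information.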

Author information

Additional information

This work is supported by the NKBRSF of China (Grant No.G1998030414), the National Natural Science Foundation of China (Grant No.60003016), the National Doctoral Research Foundation of China, and the Joint Project with IBM China Research Lab.

The second author is partially supported by a Microsoft Research Fellowship.

ZHOU Aoying received his M.S. degree in computer science from Sichuan University in 1988, and his Ph.D. degree in computer software from Fudan University in 1993. He is currently a professor in the Department of Computer Science, Fudan University. His main research interests include object-oriented data models for multimedia information, Web/XML data management, data mining and data warehousing, peer-to-peer computing, and novel database technologies and their applications to digital libraries and electronic commerce.

QIAN Weining is a Ph.D. candidate in the Department of Computer Science, Fudan University. His speciality is database and knowledge base. His research interests include clustering, data mining and Web mining.

QIAN Hailei is a graduate student in the Department of Computer Science, Fudan University. She majors in database and knowledge base. Her research interests include clustering and data mining.

ZHANG Long is a graduate student in the Department of Computer Science, Fudan University. He majors in database and knowledge base. His research interest is XML data management.

LIANG Yuqi is a graduate student in the Department of Computer Science, Fudan University. He majors in database and knowledge base. His research interests are XML data management and Web services.

JIN Wen is a Ph.D. candidate in the School of Computing, Simon Fraser University, Canada, supervised by Dr. Jiawei Han. His current research interests are database and data warehousing, data mining, Web mining and XML.

Cite this article

Zhou, A., Qian, W., Qian, H. et al. Clustering DTDs: An interactive two-level approach. J. Compt. Sci. & Technol. 17, 807–819 (2002). https://doi.org/10.1007/BF02960771
