XTRACT: Learning Document Type Descriptors from XML Document Collections

Data Mining and Knowledge Discovery

Abstract

XML is rapidly emerging as the new standard for data representation and exchange on the Web. Unlike HTML, tags in XML documents describe the semantics of the data and not how it is to be displayed. In addition, an XML document can be accompanied by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective formulation and optimization of XML queries. Despite their importance, however, DTDs are not mandatory, and it is frequently possible that documents in XML databases will not have accompanying DTDs. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and replacing them with regular expressions to generate “general” candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.
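Step (3) of the abstract can be illustrated with a minimal sketch of MDL-based selection. This is not the encoding scheme used by XTRACT itself: the candidate representation (a tuple of alternative blocks plus a "starred" flag), the bit-cost formulas, and all function names below are assumptions made for illustration, and element names are abbreviated to single characters. The idea it shows is the trade-off MDL captures: the cost of a candidate is the number of bits needed to write down the expression plus the number of bits needed to describe each input sequence given that expression.

```python
import math

# Hypothetical candidate form: (choices, starred).
#   (("ab", "abab"), False) stands for  ab|abab
#   (("ab",), True)         stands for  (ab)*
#   (("a", "b"), True)      stands for  (a|b)*

def render(candidate):
    """Write the candidate as a regular-expression string."""
    choices, starred = candidate
    body = "|".join(choices)
    return f"({body})*" if starred else body

def repetitions(choices, seq):
    """Fewest blocks from `choices` whose concatenation is `seq`, or None."""
    best = [0] + [None] * len(seq)
    for end in range(1, len(seq) + 1):
        for block in choices:
            start = end - len(block)
            if start < 0 or best[start] is None or seq[start:end] != block:
                continue
            if best[end] is None or best[start] + 1 < best[end]:
                best[end] = best[start] + 1
    return best[len(seq)]

def total_bits(candidate, sequences):
    """Assumed MDL cost: bits for the expression plus bits for the data."""
    choices, starred = candidate
    alphabet = set("".join(sequences))
    # Each character of the expression is a symbol or one of ( ) * |
    model = len(render(candidate)) * math.log2(len(alphabet) + 4)
    pick = math.log2(len(choices))  # bits to name one alternative
    # Bits to state a repetition count, bounded by the longest sequence.
    count = math.log2(max(len(s) for s in sequences) + 1)
    data = 0.0
    for seq in sequences:
        if not starred:
            data += pick if seq in choices else math.inf
        else:
            reps = repetitions(choices, seq)
            data += math.inf if reps is None else count + reps * pick
    return model + data

def pick_best(sequences, candidates):
    """Return the rendered candidate with the smallest total cost."""
    return render(min(candidates, key=lambda c: total_bits(c, sequences)))

SEQUENCES = ["ab", "abab", "ababab", "abababab"]
CANDIDATES = [
    (tuple(SEQUENCES), False),  # exact enumeration of the inputs
    (("ab",), True),            # concise generalization
    (("a", "b"), True),         # overly general expression
]
print(pick_best(SEQUENCES, CANDIDATES))
```

Under these assumed costs, the enumeration is expensive to write down, the overly general expression is expensive to use (every symbol of every sequence must be spelled out), and the intermediate candidate is cheapest overall. With a single input sequence the balance shifts and the literal enumeration wins, which is the behaviour one would expect from an MDL criterion.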


Author information

Corresponding author

Correspondence to Kyuseok Shim.

About this article

Cite this article

Garofalakis, M., Gionis, A., Rastogi, R. et al. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery 7, 23–56 (2003). https://doi.org/10.1023/A:1021560618289

  • DOI: https://doi.org/10.1023/A:1021560618289
