Abstract
XML is rapidly emerging as the new standard for data representation and exchange on the Web. Unlike HTML tags, tags in XML documents describe the semantics of the data rather than how it is to be displayed. In addition, an XML document can be accompanied by a Document Type Descriptor (DTD), which plays the role of a schema for an XML data collection. DTDs contain valuable information on the structure of documents and thus play a crucial role in the efficient storage of XML data, as well as in the effective formulation and optimization of XML queries. Despite their importance, however, DTDs are not mandatory, and documents in XML databases frequently lack accompanying DTDs. In this paper, we propose XTRACT, a novel system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT inference algorithms employ a sequence of sophisticated steps: (1) finding patterns in the input sequences and replacing them with regular expressions to generate “general” candidate DTDs, (2) factoring candidate DTDs using adaptations of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases.
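To make step (3) concrete, the following is a minimal sketch of MDL-style selection among candidate content models. It is not XTRACT's actual encoding scheme: the tuple representation of regular expressions, the helper names (`match_costs`, `size`, `mdl_cost`), and the bit costs (one bit per star iteration plus a stop bit, log2(k) bits per choice among k alternatives, a fixed cost per model token) are all assumptions chosen for illustration.

```python
import math

def match_costs(expr, seq, i):
    """Return {end: min_bits} for encoding seq[i:end] using expr (assumed costs)."""
    if isinstance(expr, str):  # a single element name costs no extra bits
        return {i + 1: 0.0} if i < len(seq) and seq[i] == expr else {}
    op, *args = expr
    if op == "seq":  # concatenation: chain the parts left to right
        cur = {i: 0.0}
        for part in args:
            nxt = {}
            for pos, bits in cur.items():
                for end, extra in match_costs(part, seq, pos).items():
                    nxt[end] = min(nxt.get(end, math.inf), bits + extra)
            cur = nxt
        return cur
    if op == "or":  # choice: pay log2(k) bits to name the alternative taken
        choice_bits = math.log2(len(args))
        out = {}
        for alt in args:
            for end, extra in match_costs(alt, seq, i).items():
                out[end] = min(out.get(end, math.inf), choice_bits + extra)
        return out
    if op == "star":  # repetition: 1 bit per iteration, plus 1 stop bit
        out = {i: 1.0}
        frontier = {i: 0.0}
        while frontier:
            nxt = {}
            for pos, bits in frontier.items():
                for end, extra in match_costs(args[0], seq, pos).items():
                    if end > pos:  # require progress so the loop terminates
                        nxt[end] = min(nxt.get(end, math.inf), bits + 1.0 + extra)
            for end, bits in nxt.items():
                out[end] = min(out.get(end, math.inf), bits + 1.0)
            frontier = nxt
        return out
    raise ValueError(f"unknown operator: {op!r}")

def size(expr):
    """Number of tokens (element names and operators) in the expression."""
    return 1 if isinstance(expr, str) else 1 + sum(size(a) for a in expr[1:])

def mdl_cost(expr, sequences, alphabet_size):
    """Model bits plus data bits; infinite if expr rejects some sequence."""
    model_bits = size(expr) * math.log2(alphabet_size + 3)  # 3 operators assumed
    data_bits = 0.0
    for s in sequences:
        bits = match_costs(expr, s, 0).get(len(s))
        if bits is None:
            return math.inf
        data_bits += bits
    return model_bits + data_bits

# Child-element sequences observed under one parent element (toy data).
sequences = [("a", "b"), ("a", "b", "a", "b"), ("a", "b", "a", "b", "a", "b")]

enumerated = ("or", ("seq", "a", "b"),
                    ("seq", "a", "b", "a", "b"),
                    ("seq", "a", "b", "a", "b", "a", "b"))  # ab | abab | ababab
star_seq = ("star", ("seq", "a", "b"))                      # (ab)*
star_or = ("star", ("or", "a", "b"))                        # (a|b)*

candidates = [enumerated, star_seq, star_or]
best = min(candidates, key=lambda c: mdl_cost(c, sequences, alphabet_size=2))
```

Under these assumed costs, the enumerated candidate pays heavily for describing the model, the overly general `(a|b)*` pays heavily for describing each sequence, and `(ab)*` achieves the lowest total, so `best` is `star_seq`. This is the trade-off between conciseness and precision that the MDL step is meant to balance.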
Cite this article
Garofalakis, M., Gionis, A., Rastogi, R. et al. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery 7, 23–56 (2003). https://doi.org/10.1023/A:1021560618289
DOI: https://doi.org/10.1023/A:1021560618289