Extracting Minimum Length Document Type Definitions Is NP-Hard

Fernau, Henning

doi:10.1007/978-3-540-30195-0_26

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3264)

Abstract

XML (eXtensible Markup Language) is becoming increasingly popular. Since not all XML documents come with a (proper) accompanying Document Type Descriptor (DTD), it is a challenge to find “good” DTDs automatically. Note that many optimization procedures rely on being given a well-fitting DTD to work properly.

M. Garofalakis et al. have developed XTRACT, a system for extracting Document Type Descriptors (DTDs) from XML documents. This system can integrate many of the other proposals as a kind of subroutine, since it ultimately tries to find the “best” DTD among those proposals. Because DTD extraction is closely connected to regular expression inference, see [1,2], any good inference algorithm for regular expressions can be incorporated. Observe that first running learning algorithms designed for deterministic finite automata and then turning the learned automata into regular expressions by “textbook algorithms” (as proposed in [2]) tends to produce expressions that are rather “unreadable” from a human perspective. In the envisaged application, the extraction of DTDs, this is particularly bad, since those DTDs are meant to be read and understood by humans. This is one of the reasons why the Grammatical Inference community should take an interest in MDL (minimum description length) approaches to learning, as proposed with the XTRACT project.
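To make the MDL idea concrete, the following is a minimal sketch of how one might choose among candidate content models with a crude two-part code. It is not XTRACT's actual encoding scheme: the function name `mdl_cost`, the cost formulas, the single-character element names, and the sample data are all assumptions made for illustration. The model part charges a fixed number of bits per character of the expression; the data part charges the bits needed to index each sample among all strings of the same length that the expression accepts.

```python
import itertools
import math
import re


def mdl_cost(expr, samples, alphabet):
    """Crude two-part description length in bits (illustrative only).

    expr     -- candidate regular expression over single-character names
    samples  -- child-element sequences, written as strings over `alphabet`
    alphabet -- the element names that may occur in the samples
    """
    pattern = re.compile(expr)
    # A candidate is only admissible if it covers every sample.
    if not all(pattern.fullmatch(s) for s in samples):
        return math.inf
    # Model part: every character is an element name or one of ( ) | *
    bits_per_char = math.log2(len(alphabet) + 4)
    model_bits = len(expr) * bits_per_char
    # Data part: index each sample among the accepted strings of its length.
    accepted = {}
    data_bits = 0.0
    for s in samples:
        n = len(s)
        if n not in accepted:
            accepted[n] = sum(
                1
                for t in itertools.product(alphabet, repeat=n)
                if pattern.fullmatch("".join(t))
            )
        data_bits += math.log2(accepted[n])
    return model_bits + data_bits


# Hypothetical sample sequences and hand-written candidate expressions.
samples = ["ab", "abab", "ababab"]
candidates = ["ab|abab|ababab", "(ab)*", "(a|b)*"]
best = min(candidates, key=lambda e: mdl_cost(e, samples, "ab"))
print(best)
```

Under these assumed costs, the enumeration of the samples pays for a long expression, the overly general `(a|b)*` pays for encoding the data, and the short, specific expression balances both. The brute-force counting is exponential in the sample length and serves only to keep the sketch short.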

References

1. Berstel, J., Boasson, L.: XML grammars. Acta Informatica 38, 649–671 (2002)
2. Fernau, H.: Learning XML grammars. In: Perner, P. (ed.) MLDM 2001. LNCS (LNAI), vol. 2123, pp. 73–87. Springer, Heidelberg (2001)
3. Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: learning document type descriptors from XML document collections. Data Mining and Knowledge Discovery 7, 23–56 (2003)

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Newcastle, University Drive, NSW 2308, Callaghan, Australia
Henning Fernau
Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Sand 13, D-72076, Tübingen, Germany
Henning Fernau

Editor information

Editors and Affiliations

Institute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, Athens, Greece
Georgios Paliouras
Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, 223-8522, Yokohama, Japan
Yasubumi Sakakibara

About this paper

Cite this paper

Fernau, H. (2004). Extracting Minimum Length Document Type Definitions Is NP-Hard. In: Paliouras, G., Sakakibara, Y. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2004. Lecture Notes in Computer Science, vol 3264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30195-0_26

DOI: https://doi.org/10.1007/978-3-540-30195-0_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23410-4
Online ISBN: 978-3-540-30195-0
eBook Packages: Springer Book Archive
