Abstract
Information contained in XML documents cannot be interpreted properly without an appropriate DTD. However, XML documents collected from the web are not always accompanied by the corresponding DTD, which makes extracting information from such sources difficult. In this study, we reverse-construct a DTD from XML sources whose DTD is unknown and use it to extract information from XML inputs. The DTD construction module scans the input XML files in one pass, whereas most other implementations use a two-pass approach. The modules also provide clean Java programming interfaces, so that they can be integrated seamlessly with other web applications.
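As a rough illustration of what a one-pass construction could look like, the following sketch infers a deliberately loose DTD from a single SAX traversal. It is not the module described in this paper: the class name `DtdSketch`, the methods `infer` and `toDtd`, and the choice of SAX as the parsing API are all assumptions made for this example. Every content model is emitted as a starred choice over the child names seen, and every attribute is declared `CDATA #IMPLIED`, so ordering and cardinality are not inferred.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not the authors' module): collects element structure in one SAX pass.
final class DtdSketch extends DefaultHandler {
    // Element name -> child element names seen anywhere under that element.
    private final Map<String, Set<String>> children = new LinkedHashMap<>();
    // Element name -> attribute names seen on that element.
    private final Map<String, Set<String>> attributes = new LinkedHashMap<>();
    // Elements that directly contained non-whitespace character data.
    private final Set<String> hasText = new LinkedHashSet<>();
    // Stack of currently open elements; the head is the innermost one.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        children.computeIfAbsent(qName, k -> new LinkedHashSet<>());
        Set<String> seen = attributes.computeIfAbsent(qName, k -> new LinkedHashSet<>());
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add(atts.getQName(i));
        }
        if (!open.isEmpty()) {
            children.get(open.peek()).add(qName); // record qName as a child of its parent
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty() && !new String(ch, start, length).trim().isEmpty()) {
            hasText.add(open.peek());
        }
    }

    // Renders the collected structure as loose DTD declarations.
    String toDtd() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Set<String>> e : children.entrySet()) {
            String name = e.getKey();
            String kids = String.join("|", e.getValue());
            String model;
            if (kids.isEmpty()) {
                model = hasText.contains(name) ? "(#PCDATA)" : "EMPTY";
            } else if (hasText.contains(name)) {
                model = "(#PCDATA|" + kids + ")*"; // mixed content
            } else {
                model = "(" + kids + ")*";
            }
            out.append("<!ELEMENT ").append(name).append(' ').append(model).append(">\n");
            for (String att : attributes.get(name)) {
                out.append("<!ATTLIST ").append(name).append(' ').append(att)
                   .append(" CDATA #IMPLIED>\n");
            }
        }
        return out.toString();
    }

    // Parses the XML text once and returns the inferred declarations.
    static String infer(String xml) {
        DtdSketch handler = new DtdSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.toDtd();
    }
}
```

Calling `DtdSketch.infer(xmlText)` on a well-formed document returns one `<!ELEMENT ...>` line per distinct element name, followed by its `<!ATTLIST ...>` lines. Because the handler only accumulates sets while streaming, the input is read once. A tighter DTD with sequences and `?`/`+` occurrence indicators would need additional bookkeeping per element.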
This work is supported in part by the Ministry of Information & Communication of Korea.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jung, JS., Oh, DI., Kong, YH., Ahn, JK. (2002). Extracting Information from XML Documents by Reverse Generating a DTD. In: Shafazand, H., Tjoa, A.M. (eds) EurAsia-ICT 2002: Information and Communication Technology. EurAsia-ICT 2002. Lecture Notes in Computer Science, vol 2510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36087-5_37
DOI: https://doi.org/10.1007/3-540-36087-5_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00028-0
Online ISBN: 978-3-540-36087-2
eBook Packages: Springer Book Archive