Abstract
This paper describes work towards automatically building online structured information resources from sources that consist largely of natural language but follow some structuring conventions. The conversion requires two phases: identifying regions in the incoming documents, and mapping the information they contain into a more structured form. We describe a system that uses decision-tree-based machine learning techniques to build a classifier that can accurately identify document regions, and we discuss pattern-discovery methods for extracting information from the identified regions. Experiments demonstrate that this approach works well.
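To make the region-identification phase concrete, the following is a minimal sketch of a decision-tree classifier that assigns a region label to each line of a document. It is an illustration only, not the system described in the paper: the three boolean features, the region labels, the training lines, and the ID3-style learner are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter

# Hypothetical line features; the paper's actual feature set is not given here.
def features(line):
    return {
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "has_at_sign": "@" in line,
        "is_short": len(line.split()) <= 4,
    }

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, names):
    """ID3-style induction over boolean features.

    Returns either a label (leaf) or a tuple (feature, true_branch, false_branch).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not names:
        return majority

    def gain(name):
        remainder = 0.0
        for value in (True, False):
            subset = [lab for row, lab in zip(rows, labels) if row[name] == value]
            if subset:
                remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(names, key=gain)
    if gain(best) <= 1e-12:  # no feature separates the labels any further
        return majority
    rest = [n for n in names if n != best]
    branches = []
    for value in (True, False):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches.append(
            build_tree([rows[i] for i in keep], [labels[i] for i in keep], rest)
        )
    return (best, branches[0], branches[1])

def classify(tree, row):
    # Walk from the root to a leaf, following the branch for each feature value.
    while isinstance(tree, tuple):
        name, if_true, if_false = tree
        tree = if_true if row[name] else if_false
    return tree

# Toy training data (made up for illustration): one line of text and its region.
TRAINING = [
    ("Abstract", "heading"),
    ("References", "heading"),
    ("J. R. Quinlan. C4.5: Programs for machine learning, 1993.", "reference"),
    ("Ke Wang and Huiqing Liu. Schema discovery for semistructured data. In KDD, 1997.", "reference"),
    ("Contact: someone@example.org", "contact"),
    ("Send questions to the maintainers at list@example.org please", "contact"),
    ("This paper describes a system for region identification in documents.", "body"),
]

rows = [features(text) for text, _ in TRAINING]
labels = [label for _, label in TRAINING]
tree = build_tree(rows, labels, ["has_year", "has_at_sign", "is_short"])

for line in ["Introduction", "Mail us at help@example.org"]:
    print(line, "->", classify(tree, features(line)))
```

A real system would learn from many more labelled lines and richer features (position in the document, punctuation patterns, neighbouring lines), and would add pruning to avoid overfitting. The second phase, pattern discovery within each identified region, is not covered by this sketch.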
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ma, L., Shepherd, J., Zhang, Y. (2002). Extracting Information from Semistructured Data. In: Meng, X., Su, J., Wang, Y. (eds) Advances in Web-Age Information Management. WAIM 2002. Lecture Notes in Computer Science, vol 2419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45703-8_13
DOI: https://doi.org/10.1007/3-540-45703-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44045-1
Online ISBN: 978-3-540-45703-9
eBook Packages: Springer Book Archive