Extracting XML data from the web

Authors: Ngo Sy Viet Phu, Toshiyuki Amagasa, Hiroyuki Kitagawa

iiWAS '10: Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services

Pages 109 - 116

https://doi.org/10.1145/1967486.1967507

Published: 08 November 2010

Abstract

Information Extraction (IE) is a technique for extracting structured information (records) from unstructured documents such as Web pages. However, existing techniques mainly aim at extracting simple records, such as binary relationships like "(company, location)" or named entities like "(organization)". In this paper, we propose an algorithm for extracting complex records such as XML by utilizing an existing IE technique. Given a set of seed records in the form of XML data (XML records), we first infer the schema information from the XML records. Then, we transform the XML records into a set of relational records consisting of several tables. The obtained relational tables are decomposed into a set of binary relations, which are forwarded to a record extraction system. We reconstruct XML data from the results returned by the record extraction system. We point out that a naive implementation does not work well, and propose an improved scheme for more efficient XML record extraction. We evaluate the effectiveness of the proposed algorithm experimentally.
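
The pipeline outlined in the abstract (infer a schema from seed XML records, flatten them into relational rows, decompose the rows into binary relations, then rebuild XML) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes flat records with a single key field, uses a hypothetical seed document and hypothetical function names, and skips the record extraction system entirely, so the binary relations are passed straight to the reconstruction step.

```python
import xml.etree.ElementTree as ET

# Hypothetical seed records; element names and values are illustrative only.
SEED = """
<companies>
  <company><name>Acme</name><location>Tokyo</location><ceo>Sato</ceo></company>
  <company><name>Globex</name><location>Paris</location><ceo>Martin</ceo></company>
</companies>
"""

def xml_to_rows(xml_text, record_tag):
    """Flatten each record element into one row (child tag -> text)."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in rec}
        for rec in root.iter(record_tag)
    ]

def infer_schema(rows):
    """Collect the field names seen across the seed rows, in first-seen order."""
    schema = []
    for row in rows:
        for field in row:
            if field not in schema:
                schema.append(field)
    return schema

def to_binary_relations(rows, key):
    """Decompose rows into binary relations of the form (key, field)."""
    relations = {}
    for row in rows:
        for field, value in row.items():
            if field != key:
                relations.setdefault((key, field), set()).add((row[key], value))
    return relations

def rebuild_xml(relations, key, record_tag, root_tag):
    """Join the binary relations on the key value and emit XML records."""
    merged = {}
    for (_, field), pairs in relations.items():
        for key_value, value in pairs:
            merged.setdefault(key_value, {})[field] = value
    root = ET.Element(root_tag)
    for key_value in sorted(merged):
        rec = ET.SubElement(root, record_tag)
        ET.SubElement(rec, key).text = key_value
        for field in sorted(merged[key_value]):
            ET.SubElement(rec, field).text = merged[key_value][field]
    return root

rows = xml_to_rows(SEED, "company")
schema = infer_schema(rows)
relations = to_binary_relations(rows, key="name")
rebuilt = rebuild_xml(relations, "name", "company", "companies")
print(schema)
print(ET.tostring(rebuilt, encoding="unicode"))
```

In a real setting, each binary relation such as (name, location) would serve as seed input to an extraction system, and the join in `rebuild_xml` would run over the newly extracted pairs. Joining on a single key value is a simplifying assumption; nested records would need several tables linked by identifiers.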

Published In

iiWAS '10: Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services

November 2010

895 pages

ISBN: 9781450304214

DOI: 10.1145/1967486

General Chair: Eric Pardede (La Trobe University, Australia)

Program Chairs: David Taniar (Monash University, Australia), Eric Pardede (La Trobe University, Australia)

Copyright © 2010 ACM.

Sponsors

IIWAS: International Organization for Information Integration
Web-b: Web-b

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

iiWAS '10

iiWAS '10: 12th International Conference on Information Integration and Web-based Applications & Services

November 8 - 10, 2010

Paris, France
