DOI: 10.1145/1967486.1967507
research-article

Extracting XML data from the web

Published: 08 November 2010

Abstract

Information Extraction (IE) is a technique for extracting structured information (records) from unstructured documents such as Web pages. However, existing techniques mainly aim at extracting simple records, such as binary relationships like "(company, location)" or named entities like "(organization)". In this paper, we propose an algorithm for extracting complex records such as XML data by utilizing an existing IE technique. Given a set of seed records in the form of XML data (XML records), we first infer the schema information from the XML records. Then, we transform the XML records into a set of relational records consisting of several tables. The obtained relational tables are decomposed into a set of binary relations, which are forwarded to a record extraction system. We reconstruct XML data from the results obtained from the record extraction system. We point out that a naive implementation does not work well, and propose an improved scheme for more efficient XML record extraction. We evaluate the effectiveness of our proposed algorithm in several experiments.
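
The steps named in the abstract (schema inference, transformation to relational tables, decomposition into binary relations, and reconstruction) can be illustrated with a minimal sketch. This is not the paper's algorithm: the seed record, the function names, the use of integer row ids as join keys, and the "parent" relation are all assumptions made for illustration. The record extraction system that would discover new tuples between decomposition and reconstruction is omitted, so the sketch only round-trips the seed record. It also assumes each leaf tag occurs at most once per parent and that every non-leaf element has at least one leaf child.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import count

# Hypothetical seed record; not taken from the paper.
SEED_XML = (
    "<company><name>Acme</name><location>Tokyo</location>"
    "<product><title>Widget</title><price>10</price></product>"
    "<product><title>Gadget</title><price>25</price></product>"
    "</company>"
)

def infer_schema(root):
    """Map each non-leaf tag to the set of child tags seen under it."""
    schema = defaultdict(set)
    for elem in root.iter():
        for child in elem:
            schema[elem.tag].add(child.tag)
    return dict(schema)

def shred(root):
    """One table per non-leaf tag; each row is (row_id, parent_id, leaf values)."""
    tables = defaultdict(list)
    ids = count(1)

    def visit(elem, parent_id):
        row_id = next(ids)
        leaves = {c.tag: (c.text or "").strip() for c in elem if len(c) == 0}
        tables[elem.tag].append((row_id, parent_id, leaves))
        for c in elem:
            if len(c) > 0:  # nested record: goes into its own table
                visit(c, row_id)

    visit(root, None)
    return dict(tables)

def to_binary(tables):
    """Split every table into binary relations keyed by (table, column)."""
    binary = defaultdict(list)
    for table, rows in tables.items():
        for row_id, parent_id, leaves in rows:
            if parent_id is not None:
                binary[(table, "parent")].append((row_id, parent_id))
            for col, value in leaves.items():
                binary[(table, col)].append((row_id, value))
    return dict(binary)

def reconstruct(binary):
    """Rebuild XML elements from binary relations; returns the root elements."""
    elems, parent_of = {}, {}
    for (table, col), pairs in binary.items():
        for row_id, value in pairs:
            if row_id not in elems:
                elems[row_id] = ET.Element(table)
            if col == "parent":
                parent_of[row_id] = value
            else:
                ET.SubElement(elems[row_id], col).text = value
    for row_id, parent_id in parent_of.items():
        elems[parent_id].append(elems[row_id])
    return [e for row_id, e in elems.items() if row_id not in parent_of]

if __name__ == "__main__":
    seed = ET.fromstring(SEED_XML)
    print(infer_schema(seed))
    for rebuilt in reconstruct(to_binary(shred(seed))):
        print(ET.tostring(rebuilt, encoding="unicode"))
```

In a full pipeline, each binary relation would presumably be given to the extraction system as seed tuples, and the row ids would have to be replaced by values that actually occur in Web text (for example, a company name), since synthetic ids cannot be extracted from documents.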


          Published In

          iiWAS '10: Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
          November 2010
          895 pages
          ISBN:9781450304214
          DOI:10.1145/1967486

          Sponsors

          • IIWAS: International Organization for Information Integration
          • Web-b: Web-b

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. WWW
          2. XML data
          3. information extraction

          Qualifiers

          • Research-article
