Extracting Information from Semi-structured Web Documents

Hemnani, Ajay; Bressan, Stephane

doi:10.1007/3-540-46105-1_20

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2426)

Abstract

The World Wide Web has now entered its mature age. It not only hosts and serves large numbers of pages but also offers large amounts of information potentially useful to individuals and businesses. Modern decision support can no longer be effective without timely and accurate access to this unprecedented source of data. However, unlike in a database, the structure of data available on the Web is not known a priori, and understanding it seems to require human intervention. Yet the conjunction of rules for interpreting layout and simple domain knowledge enables, in many cases, the automatic extraction of such data. In such cases we say that the data is semi-structured. In this paper, we present a framework that addresses the problem of extracting semi-structured data. This framework combines a syntactical extraction strategy with a set of mapping rules, heuristics and simple domain knowledge, which map a syntactical structure identified in Web documents to a conceptual/semantic structure. We present and analyse one instance of this framework, in which a syntactical extraction strategy exploits the HTML structure of Web documents using a Tree Alignment algorithm with a novel combination of heuristics to detect repeated patterns and infer rules to extract relevant records. Then, using domain knowledge, we refine the extraction rules so that they not only extract data but also assign meaning to the extracted results.
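
The two stages described in the abstract (syntactical detection of repeated HTML patterns, followed by a mapping to semantic fields) can be illustrated with a minimal sketch. It is not the authors' method: in place of the Tree Alignment algorithm it uses a much cruder stand-in, exact equality of tag-structure signatures among sibling subtrees. All names here (`Node`, `TreeBuilder`, `find_records`, `label`, the `min_repeats` threshold, the price rule, and the sample page) are assumptions made for illustration only.

```python
import re
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element (or text fragment, tag '#text') of the parsed page."""
    def __init__(self, tag, parent=None, data=""):
        self.tag, self.parent, self.data = tag, parent, data
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree from HTML using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # close the nearest open element with this tag
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def signature(node):
    """Tag structure of a subtree, ignoring text (stand-in for tree alignment)."""
    inner = "".join(signature(c) for c in node.children if c.tag != "#text")
    return f"({node.tag}{inner})"


def texts(node):
    """Text fragments of a subtree in document order."""
    if node.tag == "#text":
        return [node.data]
    return [t for c in node.children for t in texts(c)]


def find_records(root, min_repeats=3):
    """Return the largest group of sibling subtrees sharing one signature."""
    best, stack = [], [root]
    while stack:
        node = stack.pop()
        elems = [c for c in node.children if c.tag != "#text"]
        counts = Counter(signature(c) for c in elems)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeats and n > len(best):
                best = [c for c in elems if signature(c) == sig]
        stack.extend(elems)
    return [texts(r) for r in best]


PRICE = re.compile(r"^\d+\.\d{2}$")       # hypothetical domain rule


def label(record):
    """Map extracted strings to concepts using the toy domain rule above."""
    return {("price" if PRICE.match(v) else "title"): v for v in record}


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find_records(builder.root)


SAMPLE = """
<html><body><h1>Books</h1>
<ul>
  <li><b>Dune</b> <i>9.99</i></li>
  <li><b>Emma</b> <i>4.50</i></li>
  <li><b>Ulysses</b> <i>12.00</i></li>
</ul>
<p>Contact us</p></body></html>
"""

if __name__ == "__main__":
    for record in extract(SAMPLE):
        print(label(record))
```

On the sample page, the three `li` subtrees share one signature and are returned as records, while the heading and the footer paragraph are ignored. Exact signature matching fails as soon as records differ slightly in structure (an optional field, for instance), which is the kind of variation an alignment-based comparison is meant to tolerate.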

Author information

Authors and Affiliations

National University of Singapore, 3 Science Drive 2, 117543, Singapore
Ajay Hemnani & Stephane Bressan

Editor information

Editors and Affiliations

LIUPPA, Computer Science Research Department, University of Pau, B.P. 1155, 64013, Pau Cedex, France
Jean-Michel Bruel
LIRMM, 161 rue Ada, 34392, Montpellier Cedex 5, France
Zohra Bellahsene

About this paper

Cite this paper

Hemnani, A., Bressan, S. (2002). Extracting Information from Semi-structured Web Documents. In: Bruel, JM., Bellahsene, Z. (eds) Advances in Object-Oriented Information Systems. OOIS 2002. Lecture Notes in Computer Science, vol 2426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46105-1_20

DOI: https://doi.org/10.1007/3-540-46105-1_20
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44088-8
Online ISBN: 978-3-540-46105-0
eBook Packages: Springer Book Archive