Mining Web Sites Using Wrapper Induction, Named Entities, and Post-processing

Sigletos, Georgios; Paliouras, Georgios; Spyropoulos, Constantine D.; Hatzopoulos, Michalis

doi:10.1007/978-3-540-30123-3_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3209)

Included in the following conference series:

European Web Mining Forum

Abstract

This paper presents a new framework for extracting information from collections of Web pages across different sites. The framework applies a standard wrapper induction algorithm to pages in which named entities have already been identified, so that the induced wrappers can exploit that information. Post-processing of the extraction results is introduced to resolve ambiguous fields and improve overall extraction performance. It exploits two additional sources of information: field transition probabilities, given by a trained bigram model, and confidence scores, estimated for each field by the wrapper induction system. A multiplicative model based on the product of these two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of the new framework.
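
The multiplicative post-processing idea can be illustrated with a small sketch. It is not the authors' implementation: the field names, the toy training sequences, the confidence values, and the `floor` smoothing constant are all assumptions made for illustration. The sketch estimates field transition probabilities from labelled field sequences with a simple bigram count, then resolves an ambiguous fragment by choosing the candidate field with the highest product of transition probability and wrapper confidence.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Estimate P(next_field | prev_field) from labelled field sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each field with its predecessor; "<start>" precedes the first.
        for prev, nxt in zip(["<start>"] + seq, seq):
            counts[prev][nxt] += 1
    return {
        prev: {f: c / sum(nxt.values()) for f, c in nxt.items()}
        for prev, nxt in counts.items()
    }


def resolve(prev_field, candidates, bigram, floor=1e-6):
    """Pick the candidate field maximising transition * confidence.

    `candidates` maps each competing field to a confidence score.
    `floor` is an assumed smoothing value for unseen transitions.
    """
    def score(field):
        transition = bigram.get(prev_field, {}).get(field, floor)
        return transition * candidates[field]

    return max(candidates, key=score)


# Hypothetical field sequences for laptop descriptions (toy data).
training = [
    ["manufacturer", "model", "processor", "speed", "ram"],
    ["manufacturer", "model", "processor", "speed", "hdd"],
    ["manufacturer", "processor", "speed", "ram", "hdd"],
]
bigram = train_bigram(training)

# A fragment the wrappers label as both "ram" and "speed" (assumed scores).
ambiguous = {"ram": 0.55, "speed": 0.60}
chosen = resolve("speed", ambiguous, bigram)
print(chosen)
```

In this toy case the confidence score alone would favour `speed`, but a `speed` field never follows another `speed` field in the training sequences, so the product favours `ram`. A real system would need a principled smoothing scheme and calibrated confidence scores; the abstract does not specify either.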

Author information

Authors and Affiliations

Institute of Informatics and Telecommunications, NCSR “Demokritos”, P.O. BOX 60228, Aghia Paraskeyh, GR-153 10, Athens, Greece
Georgios Sigletos, Georgios Paliouras & Constantine D. Spyropoulos
Department of Informatics and Telecommunications, University of Athens, TYPA Buildings, Panepistimiopolis, Athens, Greece
Georgios Sigletos & Michalis Hatzopoulos

Editor information

Editors and Affiliations

Department of Computer Science, K.U. Leuven, B-3001, Heverlee, Belgium
Bettina Berendt
Knowledge & Data Engineering Group, University of Kassel, Wilhelmshöher Allee 73, D-34121, Kassel, Germany
Andreas Hotho
Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Dunja Mladenič
Human Computer Studies Lab, University of Amsterdam, Kruislaan 419, 1089 VA, Amsterdam, The Netherlands
Maarten van Someren
Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany
Myra Spiliopoulou
Research Center L3S, Appelstr. 9a, D-30167, Hannover, Germany
Gerd Stumme

About this paper

Cite this paper

Sigletos, G., Paliouras, G., Spyropoulos, C.D., Hatzopoulos, M. (2004). Mining Web Sites Using Wrapper Induction, Named Entities, and Post-processing. In: Berendt, B., Hotho, A., Mladenič, D., van Someren, M., Spiliopoulou, M., Stumme, G. (eds) Web Mining: From Web to Semantic Web. EWMF 2003. Lecture Notes in Computer Science, vol 3209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30123-3_6

DOI: https://doi.org/10.1007/978-3-540-30123-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23258-2
Online ISBN: 978-3-540-30123-3
eBook Packages: Springer Book Archive
