Web Information Extraction Based on Similar Patterns

Ye, Na; Wu, Xuejun; Zhu, Jingbo; Chen, Wenliang; Yao, Tianshun

doi:10.1007/978-3-540-27772-9_67

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Information extraction (IE) is an important research topic in data mining. In this paper we introduce a web information extraction approach based on similar patterns, in which the construction of the pattern library is a knowledge acquisition bottleneck. We use a method based on similarity computation to automatically acquire patterns from a large-scale corpus. Starting from the given seed patterns, relevant patterns can be learned from unlabeled training web pages. The generated patterns can be put to use after only minor manual correction. Compared to other algorithms, our approach requires much less human intervention and avoids the need to hand-tag a training corpus. Experimental results show that the acquired patterns achieve an IE precision of 79.45% and a recall of 66.51% in an open test.
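The abstract does not say how patterns are represented or how similarity is computed, so the following is only a minimal sketch of the general idea of seed-driven pattern acquisition, not the authors' method. It assumes that a pattern is a tuple of tokens, that similarity is Jaccard overlap between token sets, and that a fixed threshold decides which candidates are kept. The names `jaccard`, `acquire_patterns`, `precision_recall` and the example patterns are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: a pattern is a sequence of tokens or slot labels.
Pattern = Tuple[str, ...]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Jaccard overlap of the token sets of two patterns (assumed measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def acquire_patterns(
    seeds: Iterable[Pattern],
    candidates: Iterable[Pattern],
    threshold: float = 0.5,
) -> List[Tuple[Pattern, float]]:
    """Keep candidates whose best similarity to any seed reaches the threshold.

    Returns (pattern, score) pairs sorted by descending score. In the
    workflow the abstract describes, a human would then correct this list.
    """
    seed_list = list(seeds)
    kept = []
    for cand in candidates:
        if cand in seed_list:
            continue  # already known, nothing to learn
        best = max((jaccard(cand, s) for s in seed_list), default=0.0)
        if best >= threshold:
            kept.append((cand, best))
    kept.sort(key=lambda pair: -pair[1])
    return kept


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Standard precision and recall of extracted items against a gold set."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical seed and candidate patterns, for illustration only.
seeds = [("<PERSON>", "was", "appointed", "<POSITION>", "of", "<ORG>")]
candidates = [
    ("<PERSON>", "was", "named", "<POSITION>", "of", "<ORG>"),
    ("<ORG>", "reported", "quarterly", "earnings"),
]
learned = acquire_patterns(seeds, candidates, threshold=0.5)
```

In this toy run only the first candidate survives, because it shares five of seven distinct tokens with the seed, while the second shares one of nine. A real system would likely need a similarity measure that respects token order and slot types, which a set-based overlap ignores.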

Author information

Authors and Affiliations

Natural Language Processing Lab, Northeastern University, Shenyang, China
Na Ye, Xuejun Wu, Jingbo Zhu, Wenliang Chen & Tianshun Yao

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Shenyang Liaoning, Northeastern University, 110004, China
Guoren Wang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng

About this paper

Cite this paper

Ye, N., Wu, X., Zhu, J., Chen, W., Yao, T. (2004). Web Information Extraction Based on Similar Patterns. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_67

DOI: https://doi.org/10.1007/978-3-540-27772-9_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive
