Record Extraction Based on User Feedback and Document Selection

Zhang, Jianwei; Ishikawa, Yoshiharu; Kitagawa, Hiroyuki

doi:10.1007/978-3-540-72524-4_59

Jianwei Zhang¹,
Yoshiharu Ishikawa² &
Hiroyuki Kitagawa¹,³

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4505)

Abstract

Record extraction from large document collections has attracted growing research interest in recent years. However, two problems remain. First, when the extraction target is a large document collection, the extraction process usually becomes very expensive. Second, the extracted records may not match the topic the user is interested in. To address these problems, we propose a method that efficiently extracts the records whose topics agree with the user’s interest. To improve the efficiency of the information extraction system, our method identifies the documents from which useful records are likely to be extracted. It uses user feedback on extraction results to find topic-related documents and records. Our experiments show that the system achieves high extraction accuracy across different extraction targets.
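The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea it describes: use the user's accept/reject feedback on extracted records to rank unseen documents, then run extraction only on the top-ranked ones. Everything in it is an assumption made for illustration and not the authors' method: the regular-expression extractor, the term-count topic profile, and the names `extract_records`, `build_topic_profile`, `score`, and `select_documents` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical extraction pattern (an assumption, not from the paper):
# "<Organization> is headquartered in <Location>".
PATTERN = re.compile(r"([A-Z]\w+) is headquartered in ([A-Z]\w+)")


def extract_records(text):
    """Return (organization, location) pairs found in one document."""
    return PATTERN.findall(text)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_topic_profile(docs, feedback):
    """Build term weights from user feedback on extracted records.

    Terms of a document are added when one of its records was accepted
    (feedback value True) and subtracted when one was rejected (False).
    Records without feedback are ignored.
    """
    profile = Counter()
    for text in docs.values():
        for record in extract_records(text):
            verdict = feedback.get(record)
            if verdict is True:
                profile.update(tokenize(text))
            elif verdict is False:
                profile.subtract(tokenize(text))
    return profile


def score(text, profile):
    """Average profile weight of the document's tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(profile[t] for t in tokens) / len(tokens)


def select_documents(candidates, profile, k):
    """Return the ids of the k candidates that best match the profile."""
    ranked = sorted(candidates,
                    key=lambda doc_id: score(candidates[doc_id], profile),
                    reverse=True)
    return ranked[:k]


# Documents already processed, with user feedback on their records.
seen = {
    "d1": "Acme is headquartered in Tokyo and builds robots for factories",
    "d2": "Bistro is headquartered in Paris and serves pasta daily",
}
feedback = {("Acme", "Tokyo"): True, ("Bistro", "Paris"): False}

# Unprocessed documents; only the selected ones are passed to extraction.
candidates = {
    "c1": "Robotics factories expand; Nova is headquartered in Osaka and builds robots",
    "c2": "Pasta lovers rejoice; Trattoria is headquartered in Rome and serves pasta",
}

profile = build_topic_profile(seen, feedback)
selected = select_documents(candidates, profile, k=1)
records = [r for doc_id in selected for r in extract_records(candidates[doc_id])]
```

In this toy setup the extractor runs on one candidate instead of two. That is the cost saving the abstract refers to: documents unlikely to yield on-topic records are never processed.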

Author information

Authors and Affiliations

Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennohdai, Tsukuba, Ibaraki, 305-8573, Japan
Jianwei Zhang & Hiroyuki Kitagawa
Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi, 464-8601, Japan
Yoshiharu Ishikawa
Center for Computational Sciences, University of Tsukuba, 1-1-1 Tennohdai, Tsukuba, Ibaraki, 305-8573, Japan
Hiroyuki Kitagawa

Editor information

Guozhu Dong, Xuemin Lin, Wei Wang, Yun Yang, Jeffrey Xu Yu

About this paper

Cite this paper

Zhang, J., Ishikawa, Y., Kitagawa, H. (2007). Record Extraction Based on User Feedback and Document Selection. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_59

DOI: https://doi.org/10.1007/978-3-540-72524-4_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72483-4
Online ISBN: 978-3-540-72524-4
eBook Packages: Computer Science, Computer Science (R0)
