Record Extraction Based on User Feedback and Document Selection

  • Conference paper
Advances in Data and Web Management (APWeb 2007, WAIM 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4505)

Abstract

In recent years, record extraction from large document collections has become an active research topic. However, two problems remain. First, when the extraction target is a large document collection, the process is usually very expensive. Second, the extracted records may not match the topic the user is interested in. To address these problems, we propose a method that efficiently extracts the records whose topics agree with the user’s interest. To improve the efficiency of the information extraction system, the method identifies the documents from which useful records are likely to be extracted. It uses user feedback on extraction results to find topic-related documents and records. Our experiments show that the system achieves high extraction accuracy across different extraction targets.
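
The abstract does not say how documents are selected, so the following is only an illustrative sketch of feedback-driven document selection in general, not the authors' algorithm. It assumes a hypothetical setup in which the user marks the records extracted from a few documents as relevant or not, and unseen documents are then ranked by how many terms they share with the accepted and rejected ones. The function names (`tokenize`, `term_weights`, `rank_unseen`), the +1/-1 weighting, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenization; keeps purely alphanumeric tokens only.
    return [t for t in text.lower().split() if t.isalnum()]


def term_weights(docs, feedback):
    """Build term weights from user feedback (hypothetical scheme).

    docs     -- dict mapping document id to its text
    feedback -- dict mapping document id to True if the user accepted the
                records extracted from it, False if the user rejected them
    """
    weights = Counter()
    for doc_id, accepted in feedback.items():
        for term in set(tokenize(docs[doc_id])):
            weights[term] += 1 if accepted else -1
    return weights


def rank_unseen(docs, feedback):
    """Rank documents without feedback, most promising first."""
    weights = term_weights(docs, feedback)

    def score(doc_id):
        return sum(weights[t] for t in set(tokenize(docs[doc_id])))

    unseen = [d for d in docs if d not in feedback]
    # Higher score first; ties are broken by document id for determinism.
    return sorted(unseen, key=lambda d: (-score(d), d))


docs = {
    "d1": "acme corp headquarters in tokyo",
    "d2": "recipe for apple pie",
    "d3": "globex corp headquarters in osaka",
    "d4": "apple pie baking tips",
}
feedback = {"d1": True, "d2": False}
ranking = rank_unseen(docs, feedback)
print(ranking)
```

In this toy setting an extractor would process the top-ranked documents first and skip those with low scores, which is one simple way to avoid running an expensive extraction step over the whole collection.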

Editor information

Guozhu Dong, Xuemin Lin, Wei Wang, Yun Yang, Jeffrey Xu Yu

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Zhang, J., Ishikawa, Y., Kitagawa, H. (2007). Record Extraction Based on User Feedback and Document Selection. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb 2007, WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_59

  • DOI: https://doi.org/10.1007/978-3-540-72524-4_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72483-4

  • Online ISBN: 978-3-540-72524-4

  • eBook Packages: Computer Science, Computer Science (R0)
