Turning the Web into a Database: Extracting Data and Structure

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Abstract

People build databases to collect, systematize, and make available to users knowledge in a consistent and hopefully trustworthy form. But the largest data collection today, the web, is not systematic, consistent, or trustworthy, and the access techniques we use are provably inadequate. Focusing just on text, what would it take to extract information from the web, organize it, and form a database (both instances and metadata) from it? This paper discusses some of the core problems and provides examples of recent research in NLP: automated instance mining, metadata structure harvesting, and inter-concept relation discovery.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hovy, E.H. (2010). Turning the Web into a Database: Extracting Data and Structure. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_1

  • DOI: https://doi.org/10.1007/978-3-642-12550-8_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12549-2

  • Online ISBN: 978-3-642-12550-8

  • eBook Packages: Computer Science, Computer Science (R0)
