A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles

  • Conference paper
String Processing and Information Retrieval (SPIRE 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6393)

Abstract

Wikipedia is the largest encyclopedia on the web and has been widely used as a reliable source of information. Researchers have been extracting entities, relationships and attribute-value pairs from Wikipedia and using them in information retrieval tasks. In this paper we present a self-supervised approach for autonomously extracting attribute-value pairs from Wikipedia articles. We apply our method to the Wikipedia automatic infobox generation problem, where it outperforms a method presented in the literature by 21.92% in precision, 26.86% in recall and 24.29% in F1.
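
For readers unfamiliar with the three measures reported above, the following is a minimal sketch of how precision, recall and F1 are commonly computed over sets of extracted attribute-value pairs. It is not the authors' evaluation code: the function name `precision_recall_f1`, the exact-match criterion, and the sample pairs are assumptions made purely for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score extracted (attribute, value) pairs against a gold set.

    Assumption: a predicted pair counts as correct only if it matches a
    gold pair exactly; the paper may use a different matching rule.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Precision: fraction of extracted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold pairs that were extracted.
    recall = true_positives / len(gold) if gold else 0.0

    # F1: harmonic mean of precision and recall (0 if both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical infobox pairs, invented for this example.
gold = {
    ("birth_place", "Lisbon"),
    ("occupation", "poet"),
    ("birth_year", "1888"),
}
predicted = {
    ("birth_place", "Lisbon"),
    ("birth_year", "1888"),
    ("occupation", "writer"),
    ("nationality", "Spanish"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy case two of the four extracted pairs are correct and two of the three gold pairs are recovered, so precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.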

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Brandão, W.C., Moura, E.S., Silva, A.S., Ziviani, N. (2010). A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_29

  • DOI: https://doi.org/10.1007/978-3-642-16321-0_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16320-3

  • Online ISBN: 978-3-642-16321-0

  • eBook Packages: Computer Science, Computer Science (R0)
