A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles

Brandão, Wladmir C.; Moura, Edleno S.; Silva, Altigran S.; Ziviani, Nivio

doi:10.1007/978-3-642-16321-0_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6393)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Wikipedia is the largest encyclopedia on the web and has been widely used as a reliable source of information. Researchers have been extracting entities, relationships and attribute-value pairs from Wikipedia and using them in information retrieval tasks. In this paper we present a self-supervised approach for autonomously extracting attribute-value pairs from Wikipedia articles. We apply our method to the Wikipedia automatic infobox generation problem and outperform a method presented in the literature by 21.92% in precision, 26.86% in recall and 24.29% in F1.
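
The precision, recall and F1 figures quoted above can be illustrated with a minimal sketch. It assumes that extracted attribute-value pairs are compared by exact match against a reference set such as an existing infobox; the abstract does not state the matching criterion, so this is an assumption. The function name and the sample pairs are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# An attribute-value pair, e.g. ("nationality", "Brazilian").
Pair = Tuple[str, str]


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Score extracted pairs against reference pairs using exact matching.

    Exact matching is an assumption made for this illustration only.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical reference pairs (for example, from a hand-written infobox).
gold = {
    ("birth_place", "Belo Horizonte"),
    ("occupation", "researcher"),
    ("nationality", "Brazilian"),
}
# Hypothetical pairs produced by an extractor.
predicted = {
    ("birth_place", "Belo Horizonte"),
    ("occupation", "professor"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here one of the two predicted pairs is correct and one of the three reference pairs is recovered, so precision is 1/2, recall is 1/3, and F1 is their harmonic mean.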

Author information

Authors and Affiliations

Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
Wladmir C. Brandão & Nivio Ziviani
Department of Computer Science, Federal University of Amazonas, Manaus, Brazil
Edleno S. Moura & Altigran S. Silva

Editor information

Editors and Affiliations

School of Physics and Mathematics, Edificio "B", Universidad Michoacana, Ciudad Universitaria, 5800, Morelia, Mich., Mexico
Edgar Chavez
Dept. of Computer Science and Engineering, University of California, 92521, Riverside, CA, USA
Stefano Lonardi

About this paper

Cite this paper

Brandão, W.C., Moura, E.S., Silva, A.S., Ziviani, N. (2010). A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_29

DOI: https://doi.org/10.1007/978-3-642-16321-0_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer Science, Computer Science (R0)