Abstract
Wikipedia is the largest encyclopedia on the web and has been widely used as a reliable source of information. Researchers have been extracting entities, relationships, and attribute-value pairs from Wikipedia and using them in information retrieval tasks. In this paper we present a self-supervised approach for autonomously extracting attribute-value pairs from Wikipedia articles. We apply our method to the Wikipedia automatic infobox generation problem and outperform a method presented in the literature by 21.92% in precision, 26.86% in recall, and 24.29% in F1.
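The abstract reports gains in precision, recall, and F1 but does not say how extracted pairs were matched against the reference infoboxes. As a minimal sketch, assuming exact-match comparison between a set of extracted attribute-value pairs and a set of gold pairs, the three measures could be computed as follows. The function name `precision_recall_f1` and the toy pairs are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# An attribute-value pair, e.g. ("birth_place", "Ulm").
Pair = Tuple[str, str]


def precision_recall_f1(extracted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Score extracted pairs against gold pairs using exact matching (an assumption)."""
    true_positives = len(extracted & gold)
    # Precision: fraction of extracted pairs that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of gold pairs that were recovered.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data, for illustration only.
gold = {("birth_place", "Ulm"), ("occupation", "physicist"), ("nationality", "German")}
extracted = {("birth_place", "Ulm"), ("occupation", "physicist"), ("spouse", "unknown")}

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

A real evaluation would likely need value normalization (case, whitespace, units) before matching; that step is omitted here.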
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brandão, W.C., Moura, E.S., Silva, A.S., Ziviani, N. (2010). A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_29
DOI: https://doi.org/10.1007/978-3-642-16321-0_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer Science (R0)