Abstract
Wikipedia is the largest encyclopedia on the web and has been widely used as a reliable source of information. Researchers have been extracting entities, relationships and attribute-value pairs from Wikipedia and using them in information retrieval tasks. In this paper we present a self-supervised approach for autonomously extracting attribute-value pairs from Wikipedia articles. We apply our method to the Wikipedia automatic infobox generation problem, where it outperforms a method presented in the literature by 21.92% in precision, 26.86% in recall and 24.29% in F1.
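The abstract reports gains in precision, recall and F1. As a reminder of how these measures are typically computed for extracted attribute-value pairs, here is a minimal sketch. It assumes exact-match comparison between predicted and gold pairs; the function name, the attribute names and the sample data are hypothetical and are not taken from the paper, whose actual matching criterion may differ.

```python
from typing import Set, Tuple

# An attribute-value pair, e.g. ("birth_place", "Ulm").
Pair = Tuple[str, str]


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Score predicted pairs against gold pairs using exact matching (an assumption)."""
    true_positives = len(predicted & gold)
    # Precision: fraction of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold pairs that were recovered.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical infobox for one article: gold pairs vs. pairs an extractor produced.
gold = {
    ("birth_place", "Ulm"),
    ("occupation", "physicist"),
    ("nationality", "German"),
    ("field", "physics"),
}
predicted = {
    ("birth_place", "Ulm"),
    ("occupation", "physicist"),
    ("nationality", "Swiss"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

In this toy case two of the three predicted pairs are correct and two of the four gold pairs are recovered, so precision is 2/3, recall is 1/2 and F1 is 4/7.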
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brandão, W.C., Moura, E.S., Silva, A.S., Ziviani, N. (2010). A Self-Supervised Approach for Extraction of Attribute-Value Pairs from Wikipedia Articles. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_29
DOI: https://doi.org/10.1007/978-3-642-16321-0_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer Science (R0)