Abstract
We extend the scope of Wikification to novel words by relaxing two premises of Wikification: (i) we wikify without using the surface form of the word (ii) to a mixture of Wikipedia senses instead of a single sense. We identify two types of “novel” words: words where the connection between their surface form and their meaning is broken (e.g., a misspelled word), and words where there is no meaning to connect to—the meaning itself is also novel. We propose a method capable of wikifying both types of novel words while also dealing with the inherently large-scale disambiguation problem. We show that the method can disambiguate between up to 1,000 Wikipedia senses, and it can explain words with novel meaning as a mixture of other, possibly related senses. This mixture representation compares favorably to the widely used bag of words representation.
The work was carried out while Zoltán Szabó was working at Eötvös Loránd University, Hungary
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The form of a word as it appears in the text.
- 2.
The word to be explained with Wikipedia senses.
- 3.
In “A rose is a rose is a rose”, there are three word types (a, rose, is), but eight word tokens.
- 4.
Downloaded from http://dumps.wikimedia.org/enwiki/.
References
Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Conference on Information and Knowledge Management (CIKM), pp. 233–242 (2007)
Akmajian, A.: Linguistics: An Introduction to Language and Communication. The MIT press, Cambridge (2001)
Harris, Z.: Distributional structure. Word 10, 146–162 (1954)
Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4, 1–106 (2012)
Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the Conference on Information and Knowledge Management (CIKM), pp. 509–518 (2008)
Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD), pp. 457–466 (2009)
Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and global algorithms for disambiguation to Wikipedia. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.1375–1384 (2011)
Kantor, P.B., Voorhees, E.M.: The TREC-5 confusion track: comparing retrieval methods for scanned text. Inf. Retr. 2, 165–176 (2000)
Garofolo, J.S., Auzanne, C.G.P., Voorhees, E.M.: The TREC spoken document retrieval track: a success story. In: RIAO, pp. 1–20 (2000)
Jenatton, R., Mairal, J., Obozinski, G., Bach, F.: Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12, 2297–2334 (2011)
Martins, A.F.T., Smith, N.A., Aguiar, P.M.Q., Figueiredo, M.A.T.: Structured sparsity in structured prediction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1500–1511 (2011)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)
Yuan, M., Yuan, M., Lin, Y., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. 68, 49–67 (2006)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58, 267–288 (1994)
Porter, M.F.: An algorithm for suffix stripping. Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., San Francisco (1997)
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38, 39–41 (1995)
BNC Consortium: The British National Corpus, version 2 (BNC World) (2001)
Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona State University, Tempe (2009)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001)
Lee, Y.K., Ng, H.T.: An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 41–48 (2002)
Schütze, H.: Automatic word sense discrimination. Comput. Linguist. 24, 97–123 (1998)
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measures. In: IEEE 10th International Conference on Data Mining (ICDM), pp. 911–916. IEEE (2010)
Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17, 107–145 (2001)
Sharma, S.: Applied Multivariate Techniques. Wiley, New York (1996)
Han, E.H., Karypis, G.: Centroid-based document classification: analysis and experimental results. In: Proceedings of the Conference on Principles of Data Mining and Knowledge Discovery (PKDD), pp. 116–123 (2000)
Acknowledgments
The research has been supported by the ‘European Robotic Surgery’ EC FP7 grant (no.: 288233). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of other members of the consortium or the European Commission. The research was carried out as part of the EITKIC_12-1-2012-0001 project, which is supported by the Hungarian Government, managed by the National Development Agency, financed by the Research and Technology Innovation Fund and was performed in cooperation with the EIT ICT Labs Budapest Associate Partner Group.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Pintér, B., Vörös, G., Szabó, Z., Lőrincz, A. (2015). Wikifying Novel Words to Mixtures of Wikipedia Senses by Structured Sparse Coding. In: Fred, A., De Marsico, M. (eds) Pattern Recognition Applications and Methods. Advances in Intelligent Systems and Computing, vol 318. Springer, Cham. https://doi.org/10.1007/978-3-319-12610-4_15
Download citation
DOI: https://doi.org/10.1007/978-3-319-12610-4_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12609-8
Online ISBN: 978-3-319-12610-4
eBook Packages: EngineeringEngineering (R0)