Abstract
As the volume of text data grows, descriptive metadata become increasingly important for processing it efficiently. Keywords are one kind of such metadata, encountered, for example, in everyday browsing of web pages. They are useful in various scenarios, such as web search or content-based recommendation. We study the keyword extraction problem from the perspective of a vector space and present a novel method for extracting relevant words from an article, in which each word and phrase of the article is represented as a vector of its latent features. We evaluate our method on a text categorisation task using the well-known 20-newsgroups dataset and achieve state-of-the-art results.
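To make the vector-space view concrete, the following is a minimal sketch and not the method of this paper: it assumes that every word already has a feature vector and ranks the words of an article by cosine similarity to the mean of the article's word vectors. The toy vectors, the centroid heuristic, and the names `TOY_VECTORS`, `cosine`, `centroid` and `rank_words` are all assumptions introduced for illustration.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration;
# in practice they would come from a pretrained embedding model.
TOY_VECTORS = {
    "government": [0.9, 0.1, 0.0],
    "security": [0.8, 0.3, 0.1],
    "company": [0.7, 0.2, 0.2],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_words(tokens, vectors):
    """Rank the distinct known tokens by similarity to the article centroid.

    Assumed heuristic: a word close to the mean vector of the article is
    taken to be more representative of the article's content.
    """
    known_tokens = [t for t in tokens if t in vectors]
    if not known_tokens:
        return []
    # Repeated tokens count towards the centroid, so frequent words pull it.
    centre = centroid([vectors[t] for t in known_tokens])
    distinct = list(dict.fromkeys(known_tokens))
    return sorted(distinct, key=lambda t: cosine(vectors[t], centre), reverse=True)


article = ["government", "security", "company", "banana", "security"]
ranking = rank_words(article, TOY_VECTORS)
print(ranking)
```

In this toy setting the off-topic word ends up at the bottom of the ranking because its vector points away from the article centroid. A real system would replace the hand-written vectors with learned ones and would also need to handle phrases, which this sketch ignores.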
Notes
- 1. Available online at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing (last accessed on May 14, 2014).
- 2. Available online at http://www.washingtonpost.com/world/national-security/cybersecurity-poll-americans-divided-over-government-requirements-on-companies/2012/06/06/gJQAmWqnJV_story.html (last accessed on May 14, 2014).
- 3. Available online at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html (last accessed on May 14, 2014).
Acknowledgement
This work was partially supported by grants No. VG1/0675/11 and APVV-0208-10, and it is a partial result of the Research and Development Operational Programme project “University Science Park of STU Bratislava”, ITMS 26240220084, co-funded by the European Regional Development Fund.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Šajgalík, M., Barla, M., Bieliková, M. (2014). Exploring Multidimensional Continuous Feature Space to Extract Relevant Words. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_12
DOI: https://doi.org/10.1007/978-3-319-11397-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer Science (R0)