Abstract
For knowledge management purposes, it would be useful to automatically classify and tag documents based on their content. Keyphrase extraction is one way of achieving this automatically by using statistical or semantic methods. Whereas corpus-index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. Document-based heuristics can solve this issue, but often result in keyphrases that are not concepts. To increase concept precision, that is, the percentage of extracted keyphrases that represent actual concepts, we contribute a method to filter keyphrases based on a pre-trained convolutional neural network (CNN). We tested CNNs containing vertical and horizontal filters to decide whether an n-gram (i.e., a consecutive sequence of n words) is a concept or not, based on a training set of labeled examples. The classification training signal is derived from the Wikipedia corpus, assuming that an n-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that the n-gram represents a concept. Multiple configurations for vertical and horizontal filters are analyzed and optimised through a hyper-parameterization process. The results demonstrated an average concept precision for extracted keywords of between 60% and 80%. Consequently, by applying a CNN-based concept recognition filter, the concept precision of keyphrase extraction was significantly improved. For an optimal parameter configuration with an average of five extracted keyphrases per document, the concept precision could be increased from 0.65 to 0.8, meaning that on average, at least four out of five keyphrases extracted by our algorithm were actual concepts verified by Wikipedia titles.
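The concept precision measure and the filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the small title set standing in for Wikipedia page titles, and the lookup table standing in for the trained CNN's output probabilities are all assumptions made for illustration, and the toy numbers are not results from the chapter.

```python
from typing import Callable, Iterable, List, Set


def concept_precision(keyphrases: Iterable[str], concept_titles: Set[str]) -> float:
    """Fraction of keyphrases that match a known concept title.

    `concept_titles` is assumed to hold lowercase titles; keyphrases are
    lowercased and stripped before the lookup.
    """
    phrases = [p.strip().lower() for p in keyphrases]
    if not phrases:
        return 0.0
    return sum(p in concept_titles for p in phrases) / len(phrases)


def filter_concepts(
    candidates: Iterable[str],
    concept_probability: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the chapter
) -> List[str]:
    """Keep only candidates whose predicted concept probability reaches the threshold."""
    return [c for c in candidates if concept_probability(c) >= threshold]


# Hypothetical stand-in for the set of Wikipedia page titles.
TOY_TITLES = {
    "neural network",
    "word embedding",
    "knowledge management",
    "keyphrase extraction",
}

# Hypothetical stand-in for the CNN output: a fixed probability per n-gram.
TOY_SCORES = {
    "neural network": 0.95,
    "word embedding": 0.90,
    "knowledge management": 0.85,
    "keyphrase extraction": 0.80,
    "based on their": 0.20,   # correctly rejected by the filter
    "would be useful": 0.60,  # a false positive that survives the filter
}


def toy_probability(ngram: str) -> float:
    """Look up the toy score; unknown n-grams get probability 0."""
    return TOY_SCORES.get(ngram, 0.0)


candidates = list(TOY_SCORES)  # six extracted keyphrases, four of them concepts
filtered = filter_concepts(candidates, toy_probability)

precision_before = concept_precision(candidates, TOY_TITLES)  # 4 of 6
precision_after = concept_precision(filtered, TOY_TITLES)     # 4 of 5

print(f"before filter: {precision_before:.2f}, after filter: {precision_after:.2f}")
```

In this toy run the filter removes one non-concept and keeps one false positive, so four of the five remaining keyphrases are concepts. In a real setting `toy_probability` would be replaced by the trained classifier applied to the word vectors of each n-gram.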
Notes
- 2. The permanent link for the selected news item is https://perma.cc/PF53-SY2L.
Acknowledgements
This research has been funded in part by the Swiss Commission for Technology and Innovation (CTI) as part of the research project Feasibility Study X-MAS: Cross-Platform Mediation, Association and Search Engine, CTI-No. 26335.1 PFES-ES. We thank Benjamin Haymond for proof-reading and copy-editing of our work.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Waldis, A., Mazzola, L., Kaufmann, M. (2019). Concept Recognition with Convolutional Neural Networks to Optimize Keyphrase Extraction. In: Quix, C., Bernardino, J. (eds) Data Management Technologies and Applications. DATA 2018. Communications in Computer and Information Science, vol 862. Springer, Cham. https://doi.org/10.1007/978-3-030-26636-3_8
DOI: https://doi.org/10.1007/978-3-030-26636-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26635-6
Online ISBN: 978-3-030-26636-3
eBook Packages: Computer Science, Computer Science (R0)