Abstract
A suffix index built on a suffix array supports full-text search over arbitrary data, and its search speed can be further improved with a keyword index over the set of keywords extracted from the data. In this article, we explore a method for extracting keywords from data using deep learning and a suffix array. The study starts with Chinese text because many published Chinese word segmentation results are available for performance evaluation. We propose a new Chinese word segmentation method that combines a neural network with a suffix array of the training data. The suffix array of the training data is used to divide long sentences in the input text into short fragments, which our neural network method then segments without a context window. Experiments on commonly used datasets show that the proposed method achieves encouraging precision, recall and \(F_1\) score compared with existing advanced methods while avoiding the drawback of a context window. This study offers useful experience toward a general solution for extracting keywords from data using a suffix array.
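As a rough illustration of the full-text search that a suffix array supports, the following minimal sketch builds a suffix array and looks up every occurrence of a pattern by binary search. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is a stand-in chosen for brevity; it is not the construction algorithm or the segmentation method used in the article, and the neural network part is not shown.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting the suffixes directly (illustration only);
    linear-time construction algorithms exist for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of pattern in text, using binary search on sa."""
    if not pattern:
        return []
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "我爱北京天安门北京"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "北京"))
```

Because all suffixes that share a prefix are adjacent in the sorted order, one pair of binary searches finds every occurrence of a substring, which is what lets a suffix index answer queries over any data without a predefined vocabulary.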





Acknowledgements
This work was funded by the National Natural Science Foundation of China (Grant number 61872391) and the Special Funds for Guangzhou Scientific and Technological Innovation and Development (Grant number 201802010011).
Cite this article
Xu, W., Nong, G. A study for extracting keywords from data with deep learning and suffix array. Multimed Tools Appl 81, 7419–7437 (2022). https://doi.org/10.1007/s11042-021-11762-7
DOI: https://doi.org/10.1007/s11042-021-11762-7