A Sample Extension Method Based on Wikipedia and Its Application in Text Classification

Zhu, Wenhao; Liu, Yiting; Hu, Guannan; Ni, Jianyue; Lu, Zhiguo

doi:10.1007/s11277-018-5416-z

Published: 08 February 2018

Wireless Personal Communications, Volume 102, pages 3851–3867 (2018)

Wenhao Zhu¹, Yiting Liu¹, Guannan Hu¹, Jianyue Ni¹ & Zhiguo Lu² (ORCID: orcid.org/0000-0002-9044-2819)

Abstract

Text classification is a topic in natural language processing that is particularly useful for processing Internet information. Methods based on supervised learning require large numbers of manually annotated training samples. Annotating these samples is time-consuming, and classifier performance relies heavily on their quality. This paper presents a text classification method based on sample extension. The extension relies on the correlation between the labeled sample data and the concepts in Wikipedia. Using the rich link relationships between concepts, we select appropriate articles from Wikipedia to expand the training sample set. By introducing Wikipedia's large collection of semantically rich concept pages, together with the links that connect related pages, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method outperforms both supervised and semi-supervised methods.
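
The general idea of sample extension can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state which correlation measure or link strategy the paper uses, so the bag-of-words cosine similarity, the threshold of 0.3, the one-hop link expansion, and the helper names `bow`, `cosine`, and `extend_samples` are all assumptions made for illustration. The articles and links are toy data standing in for Wikipedia pages.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (assumed representation, not from the paper)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extend_samples(labeled, articles, links, threshold=0.3):
    """Return {article title: label} for articles chosen as extra training samples.

    An article is chosen if it is similar enough to a labeled sample, or if it
    is linked from such an article (one hop only; a hypothetical simplification).
    The first label assigned to an article is kept.
    """
    chosen = {}
    for text, label in labeled:
        sample_vec = bow(text)
        for title, body in articles.items():
            if cosine(sample_vec, bow(body)) >= threshold:
                chosen.setdefault(title, label)
                # Follow the outgoing links of the matched article.
                for linked in links.get(title, []):
                    if linked in articles:
                        chosen.setdefault(linked, label)
    return chosen


# Toy data standing in for labeled samples and Wikipedia pages.
labeled = [
    ("football match goal team", "sports"),
    ("stock market shares price", "finance"),
]
articles = {
    "Association football": "football is a team sport where a goal wins the match",
    "FIFA World Cup": "international tournament held every four years",
    "Stock exchange": "a market where shares are traded at a price",
    "Photosynthesis": "plants convert light into chemical energy",
}
links = {"Association football": ["FIFA World Cup"], "Stock exchange": []}

extended = extend_samples(labeled, articles, links)
for title, label in extended.items():
    print(f"{label}: {title}")
```

In this toy run, "FIFA World Cup" shares no words with the sports sample and is added only because a matched article links to it, which is the kind of benefit link relationships can bring. A real system would need a check that linked pages stay on topic before labeling them.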

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Nos. 61572434, 61303097).

Author information

Authors and Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, China
Wenhao Zhu, Yiting Liu, Guannan Hu & Jianyue Ni
Library of Shanghai University, Shanghai, China
Zhiguo Lu

Corresponding author

Correspondence to Zhiguo Lu.

About this article

Cite this article

Zhu, W., Liu, Y., Hu, G. et al. A Sample Extension Method Based on Wikipedia and Its Application in Text Classification. Wireless Pers Commun 102, 3851–3867 (2018). https://doi.org/10.1007/s11277-018-5416-z

Published: 08 February 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s11277-018-5416-z
