A Sample Extension Method Based on Wikipedia and Its Application in Text Classification

Zhu, Wenhao; Liu, Yiting; Hu, Guannan; Ni, Jianyue; Lu, Zhiguo

doi:10.1007/s11277-018-5416-z

Published: 08 February 2018

Volume 102, pages 3851–3867, (2018)

Wireless Personal Communications

Wenhao Zhu¹, Yiting Liu¹, Guannan Hu¹, Jianyue Ni¹ & Zhiguo Lu² (ORCID: orcid.org/0000-0002-9044-2819)

Abstract

Text classification is a topic in natural language processing that is particularly useful for processing Internet information. Methods based on supervised learning require a large number of manually annotated training samples. Annotating these samples is time-consuming, and classifier performance relies heavily on their quality. This paper presents a text classification method based on sample extension. The extension is based on the correlation between the labeled sample data and the concepts in Wikipedia. Using the rich link relationships between concepts, we select appropriate articles from Wikipedia to expand the training sample set. By drawing on the large number of semantically rich concept pages in Wikipedia, together with the links that relate different pages, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method outperforms both supervised and semi-supervised methods.
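
The abstract gives only the outline of the sample extension step, so the following is a minimal sketch of that general idea and should not be read as the authors' algorithm. It assumes, hypothetically, that correlation is measured by cosine similarity over bag-of-words counts, that each class is represented by the summed counts of its labeled samples, and that links are followed for one hop at a relaxed threshold. The function names, the `threshold` parameter, and the toy articles are all illustrative; the tokenizer only handles lowercase Latin letters.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extend_samples(labeled, articles, links, threshold=0.3):
    """Map article titles to the class label they are assigned, if any.

    labeled:  list of (text, label) seed samples
    articles: dict mapping article title -> article text
    links:    dict mapping article title -> list of linked titles
    """
    # Assumption: one aggregate term-count vector per class.
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))

    def best_label(title):
        vec = bag_of_words(articles[title])
        return max(
            ((label, cosine(vec, centroid)) for label, centroid in centroids.items()),
            key=lambda pair: pair[1],
        )

    selected = {}
    # Step 1: keep articles whose text correlates strongly with a class.
    for title in articles:
        label, score = best_label(title)
        if score >= threshold:
            selected[title] = label

    # Step 2: follow one hop of links from the selected articles. A linked
    # article is kept if it agrees on the label at half the threshold.
    for title, label in list(selected.items()):
        for neighbor in links.get(title, []):
            if neighbor in articles and neighbor not in selected:
                n_label, n_score = best_label(neighbor)
                if n_label == label and n_score >= threshold / 2:
                    selected[neighbor] = label
    return selected


# Toy data standing in for labeled samples and Wikipedia pages.
labeled = [
    ("football match goal team", "sports"),
    ("stock market bank profit", "finance"),
]
articles = {
    "Football": "football team goal league",
    "League": "league season team",
    "Bank": "bank profit loan market",
    "Cooking": "recipe oven flour",
}
links = {"Football": ["League", "Cooking"], "Bank": []}

extended = extend_samples(labeled, articles, links)
print(extended)
```

In this toy run, "League" falls just short of the direct threshold but is admitted through its link from "Football", while "Cooking" is rejected because it shares no vocabulary with either class. The selected articles would then be added to the training set under their assigned labels before the classifier is trained.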

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Nos. 61572434, 61303097).

Author information

Authors and Affiliations

School of Computer Engineering and Science, Shanghai University, Shanghai, China
Wenhao Zhu, Yiting Liu, Guannan Hu & Jianyue Ni
Library of Shanghai University, Shanghai, China
Zhiguo Lu

Corresponding author

Correspondence to Zhiguo Lu.

About this article

Cite this article

Zhu, W., Liu, Y., Hu, G. et al. A Sample Extension Method Based on Wikipedia and Its Application in Text Classification. Wireless Pers Commun 102, 3851–3867 (2018). https://doi.org/10.1007/s11277-018-5416-z

Issue Date: October 2018
