A Sample Extension Method Based on Wikipedia and Its Application in Text Classification

Wireless Personal Communications

Abstract

Text classification is a natural language processing task that is particularly useful for Internet information processing. Supervised methods require a large number of manually annotated training samples; annotation is time-consuming, and classifier performance depends heavily on sample quality. This paper presents a text classification method based on sample extension. The extension relies on the correlation between the labeled samples and concepts in Wikipedia. Using this correlation together with the rich link relationships between concepts, we select appropriate Wikipedia articles to expand the training set. By drawing on Wikipedia's large collection of semantically rich concept pages and the links that connect them, our approach improves the performance and generalization of the classifier. Experiments demonstrate that the proposed method outperforms both supervised and semi-supervised baseline methods.
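
The abstract does not specify how correlation is scored or how links are used, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes bag-of-words cosine similarity as the correlation measure, a small in-memory dictionary standing in for Wikipedia, and two hypothetical thresholds: a page becomes a seed if it is similar enough to a labeled sample, and a page linked from a seed is accepted at a lower similarity. All names, thresholds, and data are illustrative assumptions.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "in", "on", "to", "and", "of", "what", "where"}

def bow(text):
    """Bag-of-words vector of a text, lowercased, stopwords dropped."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def extend_samples(labeled, articles, links, seed_threshold=0.3, link_threshold=0.1):
    """Pick articles to add to the training set; returns {title: label}.

    labeled:  list of (text, label) training samples
    articles: {title: text}, a stand-in for Wikipedia concept pages
    links:    {title: [linked titles]}, a stand-in for the link graph
    """
    article_vecs = {title: bow(body) for title, body in articles.items()}
    best = {}  # title -> (score, label); keeps the strongest match per article

    def offer(title, score, label):
        if title not in best or score > best[title][0]:
            best[title] = (score, label)

    for text, label in labeled:
        vec = bow(text)
        scores = {t: cosine(vec, v) for t, v in article_vecs.items()}
        # Seeds: pages that correlate strongly with the labeled sample.
        seeds = [t for t, s in scores.items() if s >= seed_threshold]
        for seed in seeds:
            offer(seed, scores[seed], label)
            # Pages linked from a seed are accepted at a lower threshold.
            for neighbor in links.get(seed, []):
                if scores.get(neighbor, 0.0) >= link_threshold:
                    offer(neighbor, scores[neighbor], label)
    return {title: label for title, (score, label) in best.items()}

# Toy data, invented for illustration only.
labeled = [
    ("the team won the football match", "sports"),
    ("the bank raised interest rates", "finance"),
]
articles = {
    "Football": "football is a team sport where the team tries to win the match",
    "Goalkeeper": "the goalkeeper guards the goal in football",
    "Interest rate": "an interest rate is what the bank charges on a loan",
    "Volcano": "a volcano erupts lava and ash",
}
links = {"Football": ["Goalkeeper", "Volcano"], "Interest rate": []}

extra = extend_samples(labeled, articles, links)
print(extra)
```

On this toy data, "Goalkeeper" falls below the seed threshold but is accepted because it is linked from the seed "Football", while "Volcano" is linked from the same seed yet rejected because it shares no terms with the labeled sample. The selected pages would then be appended to the training set with their assigned labels before training any classifier.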

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Nos. 61572434, 61303097).

Author information

Corresponding author

Correspondence to Zhiguo Lu.

About this article

Cite this article

Zhu, W., Liu, Y., Hu, G. et al. A Sample Extension Method Based on Wikipedia and Its Application in Text Classification. Wireless Pers Commun 102, 3851–3867 (2018). https://doi.org/10.1007/s11277-018-5416-z

  • DOI: https://doi.org/10.1007/s11277-018-5416-z
