A Text Clustering Algorithm to Detect Basic Level Categories in Texts

Xu, Jingyun; Cai, Yi; Wang, Shuai; Yang, Kai; Du, Qing; Zhang, Jun; Yao, Li; Li, Jingjing

doi:10.1007/978-3-319-66733-1_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10473)

Included in the following conference series:

International Conference on Web-Based Learning

Abstract

With the rapid development of the Internet and the explosion of text, an appropriate way to organize large volumes of text is necessary. Text clustering is of great practical importance for web-based learning because it can group similar texts (e.g., documents, textbooks, and online notes) to provide users with more valuable information. However, most existing text clustering algorithms are very sensitive to parameters that users must supply, and it is hard to choose appropriate values, since a computer cannot know in advance what an appropriate value is. To address this problem, and drawing on studies in cognitive psychology and our own observations, this paper first introduces basic level categories and category utility and then proposes a non-parametric text clustering algorithm that detects basic level categories in texts automatically. Experimental results show that our algorithm significantly outperforms a basic level concept detection method, k-means, and single linkage clustering on different datasets.
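The abstract names category utility without defining it, and this preview does not include the paper's own formulation. As a rough illustration only, the sketch below scores a partition of documents with one commonly cited form of category utility: for each cluster, the cluster's share of the documents times the gain in squared term probability over the whole collection, averaged over clusters. The function name category_utility and the representation of each document as a set of terms are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Set


def category_utility(clusters: List[List[Set[str]]]) -> float:
    """Score a partition of documents (each a set of terms).

    Hypothetical helper: uses a commonly cited category utility form,
    which may differ from the definition used in the paper.
    """
    docs = [doc for cluster in clusters for doc in cluster]
    n_docs = len(docs)
    if n_docs == 0:
        return 0.0

    vocab = set().union(*docs)

    # Sum over terms of P(term)^2 in the whole collection.
    base_sq = sum(
        (sum(term in doc for doc in docs) / n_docs) ** 2 for term in vocab
    )

    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue
        p_cluster = len(cluster) / n_docs
        # Sum over terms of P(term | cluster)^2.
        cond_sq = sum(
            (sum(term in doc for doc in cluster) / len(cluster)) ** 2
            for term in vocab
        )
        total += p_cluster * (cond_sq - base_sq)

    # Average over clusters so that finer partitions are not favored by default.
    return total / len(clusters)
```

Under this form, a partition that separates documents with disjoint vocabularies scores higher than one that mixes them, so a clustering procedure could compare candidate partitions by this score instead of asking the user for a cluster count.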

Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities, SCUT (Nos. 2017ZD048 and 2015ZM136), the Tiptop Scientific and Technical Innovative Youth Talents of Guangdong special support program (No. 2015TQ01X633), the Science and Technology Planning Project of Guangdong Province, China (No. 2016A030310423), the Science and Technology Program of Guangzhou (International Science and Technology Cooperation Program, No. 201704030076), and the Science and Technology Planning Major Project of Guangdong Province (No. 2015A070711001).

Author information

Authors and Affiliations

School of Software Engineering, South China University of Technology, Guangzhou, China
Jingyun Xu, Yi Cai, Kai Yang, Qing Du & Jun Zhang
Ideological and Political Office, Sun Yat-sen University, Guangzhou, China
Shuai Wang
School of Software Engineering, Beijing Normal University, Beijing, China
Li Yao
School of Software Engineering, South China Normal University, Guangzhou, China
Jingjing Li

Corresponding author

Correspondence to Yi Cai.

Editor information

Editors and Affiliations

The Education University of Hong Kong, Hong Kong, China
Haoran Xie
University of Craiova, Craiova, Romania
Elvira Popescu
City University of Hong Kong, Hong Kong, Hong Kong
Gerhard Hancke
Department of Software, Complutense University of Madrid, Madrid, Spain
Baltasar Fernández Manjón

About this paper

Cite this paper

Xu, J. et al. (2017). A Text Clustering Algorithm to Detect Basic Level Categories in Texts. In: Xie, H., Popescu, E., Hancke, G., Fernández Manjón, B. (eds) Advances in Web-Based Learning – ICWL 2017. ICWL 2017. Lecture Notes in Computer Science, vol 10473. Springer, Cham. https://doi.org/10.1007/978-3-319-66733-1_8

DOI: https://doi.org/10.1007/978-3-319-66733-1_8
Published: 19 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66732-4
Online ISBN: 978-3-319-66733-1
eBook Packages: Computer Science, Computer Science (R0)
