A Comparative Study of Classification and Clustering Methods from Text of Books

Probierz, Barbara; Kozak, Jan; Hrabia, Anita

doi:10.1007/978-3-031-21967-2_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13758)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Library book collections are an important source of information, but unless books are assigned to appropriate categories, finding books on similar topics is troublesome for both librarians and readers. Automating this assignment is difficult because it requires analyzing large sets of real text data, namely the full content of books. We therefore propose a model that automatically assigns books to categories by analyzing the text of their content. We tested the approach on a dataset of 552 documents, each containing the full content of one book. All books come from Project Gutenberg and belong to one of five categories: Art, Biology, Mathematics, Philosophy, or Technology. Well-known natural language processing (NLP) techniques were used to preprocess the book content and analyze the data. Two machine learning approaches were then applied to assign books to the selected categories: classification (supervised learning) and clustering (unsupervised learning). Classification quality was evaluated with accuracy, precision, and recall. Classification achieved good results, with accuracy reaching above 90%. Clustering algorithms also assigned books to categories effectively.
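To make the three evaluation measures concrete, the following minimal Python sketch computes accuracy, precision, and recall for a multi-class labelling such as the five book categories. The abstract does not say how precision and recall were averaged across categories, so macro-averaging is an assumption here. The function name `evaluate` and the six example labels are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall.

    Macro-averaging (unweighted mean over categories) is an assumption;
    the chapter's abstract does not state which averaging was used.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter()   # correctly predicted books per category
    predicted = Counter()  # books assigned to each category
    actual = Counter()     # books that truly belong to each category

    for t, p in zip(y_true, y_pred):
        actual[t] += 1
        predicted[p] += 1
        if t == p:
            true_pos[t] += 1

    accuracy = sum(true_pos.values()) / len(y_true)
    # A category with no predictions (or no true members) contributes 0.
    precisions = [true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in labels]
    recalls = [true_pos[c] / actual[c] if actual[c] else 0.0 for c in labels]
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
    }


# Hypothetical labels for six books (illustrative only, not data from the chapter).
y_true = ["Art", "Art", "Biology", "Biology", "Mathematics", "Mathematics"]
y_pred = ["Art", "Biology", "Biology", "Biology", "Mathematics", "Art"]
scores = evaluate(y_true, y_pred)
print(scores)
```

In this toy example four of six books are labelled correctly, so accuracy is 4/6. Precision and recall differ from accuracy because each is first computed per category and only then averaged.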

Notes

1. https://www.gutenberg.org/

Author information

Authors and Affiliations

Department of Machine Learning, University of Economics in Katowice, 1 Maja, 40-287, Katowice, Poland
Barbara Probierz, Jan Kozak & Anita Hrabia

Corresponding author

Correspondence to Barbara Probierz.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
Vietnam National University, Ho Chi Minh City, Ho Chi Minh City, Vietnam
Tien Khoa Tran
Al-Farabi Kazakh National University, Almaty, Kazakhstan
Ualsher Tukayev
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
University of Newcastle, Newcastle, NSW, Australia
Edward Szczerbicki

About this paper

Cite this paper

Probierz, B., Kozak, J., Hrabia, A. (2022). A Comparative Study of Classification and Clustering Methods from Text of Books. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_2

DOI: https://doi.org/10.1007/978-3-031-21967-2_2
Published: 09 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2
eBook Packages: Computer Science, Computer Science (R0)
