Abstract
Library book collections are an important source of information, but when books are not assigned to appropriate categories, finding books on similar topics is troublesome for both librarians and readers. Automatic categorization is difficult because it requires analyzing large sets of real text data, such as the full content of books. We therefore propose a model that automatically assigns books to categories by analyzing the text of their content. The approach was tested on a database of 552 documents, each containing the full content of one book. All books come from Project Gutenberg and belong to one of five categories: Art, Biology, Mathematics, Philosophy, or Technology. Well-known natural language processing (NLP) techniques were used to preprocess the book content and to analyze the data. Two machine learning approaches were then applied to assign books to the selected categories: classification (supervised learning) and clustering (unsupervised learning). Classification quality was evaluated with accuracy, precision, and recall. Classification gave good results, with accuracy exceeding 90% in some cases. Clustering algorithms also assigned books to categories effectively.
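As an illustration of the evaluation measures named in the abstract, the following minimal sketch computes accuracy, precision, and recall for a five-category problem. The abstract does not say how precision and recall are aggregated across categories, so macro-averaging (an unweighted mean over categories) is an assumption here. The function name `evaluate` and the label lists are hypothetical and are not taken from the paper's data or results.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists.

    Macro-averaging is an assumption; the paper may aggregate differently.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    actual = Counter(y_true)      # books that truly belong to each category
    predicted = Counter(y_pred)   # books assigned to each category
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / len(y_true)
    # Per-category precision = correct / predicted, recall = correct / actual.
    # A category with a zero denominator contributes 0.0 to the mean.
    precision = sum(
        correct[c] / predicted[c] if predicted[c] else 0.0 for c in labels
    ) / len(labels)
    recall = sum(
        correct[c] / actual[c] if actual[c] else 0.0 for c in labels
    ) / len(labels)
    return accuracy, precision, recall


# Hypothetical labels for ten books (illustrative only, not the paper's data).
y_true = ["Art", "Art", "Biology", "Biology", "Mathematics",
          "Mathematics", "Philosophy", "Philosophy", "Technology", "Technology"]
y_pred = ["Art", "Art", "Biology", "Technology", "Mathematics",
          "Mathematics", "Philosophy", "Art", "Technology", "Technology"]

accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this toy case eight of ten books are assigned correctly, so accuracy is 0.8. Precision and recall differ because the two errors inflate the predicted counts for Art and Technology while reducing the recovered counts for Biology and Philosophy.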
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Probierz, B., Kozak, J., Hrabia, A. (2022). A Comparative Study of Classification and Clustering Methods from Text of Books. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_2
DOI: https://doi.org/10.1007/978-3-031-21967-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21966-5
Online ISBN: 978-3-031-21967-2
eBook Packages: Computer Science (R0)