DOI: 10.1145/3451471.3451489
research-article

Analysis of Clustering Algorithms to Clean and Normalize Early Modern European Book Titles

Published: 13 July 2021

ABSTRACT

In this paper, we compare five common clustering algorithms and identify the most accurate one for deduplicating book records from past centuries, gathered from multiple libraries, ahead of data analysis. The presence of duplicate records is a major concern in data analysis. The dataset we studied contains over 5 million records of books published in European languages between 1500 and 1800, in the Machine-Readable Cataloging (MARC) data format, from 17,983 libraries in 123 countries. However, each book record was cataloged by the library that owns the book. This creates a consistency problem: the same book is recorded in slightly different ways by different libraries. Moreover, changes in geography and language over the past centuries affect the consistency of personal and place names, so that many slightly different names represent the same record. Analyzing such a dataset without proper cleaning would distort the results. Because of the size of the dataset and the unknown number of duplicate records and their variants, it is impractical to create a lookup table to replace each record. To solve this problem, we use data clustering to deduplicate the dataset. Our work is informed by scholarship on European History and the History of the Book. We find that clustering is an effective method for detecting the slight differences between records caused by the cataloging inconsistencies described above. We began by experimenting with several candidate clustering methods on a test dataset, which was prepared by corrupting a clean dataset according to the characteristics found in the whole dataset. The clean dataset contains roughly 1,000 random records in English, German, French, and Latin, with approximately the same language distribution and average record length as the whole dataset. Our evaluation shows that some clustering algorithms achieve an accuracy of up to 0.97072. As demonstrated in this paper, clustering techniques perform well on the dataset we studied.
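The evaluation idea described in the abstract (cluster a labelled test set, then score how well the clusters match the known duplicates) can be illustrated with a minimal sketch. The abstract does not name the five algorithms, so the key-collision "fingerprint" method used here is an assumption chosen purely for illustration, as are the function names, the sample titles, and the pairwise definition of accuracy. It should not be read as the authors' implementation.

```python
import string
import unicodedata
from collections import defaultdict
from itertools import combinations


def fingerprint(title: str) -> str:
    """Normalize a title: drop accents, case and punctuation, then sort unique tokens."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = no_accents.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster_by_key(titles):
    """Group record indices whose titles share the same fingerprint."""
    groups = defaultdict(list)
    for index, title in enumerate(titles):
        groups[fingerprint(title)].append(index)
    return list(groups.values())


def pairwise_accuracy(clusters, true_labels):
    """Fraction of record pairs where predicted and true 'same book?' answers agree."""
    predicted = {}
    for cluster_id, members in enumerate(clusters):
        for index in members:
            predicted[index] = cluster_id
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (predicted[a] == predicted[b]) == (true_labels[a] == true_labels[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical test records: hand-made variants of two titles,
# labelled with the id of the book they are assumed to represent.
titles = [
    "A Treatise of Human Nature",
    "a treatise of human nature.",
    "Treatise of Human Nature, A",
    "Histoire générale des voyages",
    "Histoire generale des Voyages",
]
true_labels = [0, 0, 0, 1, 1]

clusters = cluster_by_key(titles)
accuracy = pairwise_accuracy(clusters, true_labels)
```

A key-collision method like this one only merges variants that normalize to the identical key. Spelling differences would need a distance-based method instead, which is the kind of trade-off a comparison of several algorithms is meant to expose.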

Published in

ICSIM '21: Proceedings of the 2021 4th International Conference on Software Engineering and Information Management
January 2021
251 pages
ISBN: 9781450388955
DOI: 10.1145/3451471

Copyright © 2021 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited
