ABSTRACT
In this paper, we compare five common clustering algorithms to identify the most accurate method for deduplicating book records from past centuries, collected from multiple libraries, before data analysis. Duplicate records are a major concern in data analysis. The dataset we studied contains over 5 million records of books published in European languages between 1500 and 1800, in the Machine-Readable Cataloging (MARC) data format, from 17,983 libraries in 123 countries. Each book record was cataloged by the library that owns it, which creates a consistency problem: the same book is recorded in slightly different ways by different libraries. Moreover, changes in geography and language over the past centuries affect the consistency of personal and place names, so many slightly different names represent the same record. Analyzing such a dataset without proper cleaning would produce misleading results. Because of the size of the dataset and the unknown number of duplicate records and their variants, it is impractical to create a lookup table to replace each record. To solve this problem, we use data clustering to deduplicate the dataset. Our work is informed by scholarship on European History and the History of the Book. We find that clustering is an effective method for detecting the slight differences in records caused by the cataloging inconsistencies described above. Our approach is founded on experiments with several candidate clustering methods on a test dataset, which we prepared by corrupting a clean dataset so that it shows the same characteristics found in the whole dataset. The clean dataset contains roughly 1,000 random records in English, German, French, and Latin, with approximately the same language distribution and average record length as the whole dataset. Our evaluation shows that some clustering algorithms achieve an accuracy of up to 0.97072, and that clustering techniques perform well on the dataset we studied.
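The evaluation loop described in the abstract (corrupt a clean set of titles, cluster the corrupted records, score the result against the known grouping) can be illustrated with a small sketch. It is not the authors' method: the abstract does not name the five algorithms, so this example assumes a simple single-link grouping over `difflib.SequenceMatcher` similarity, a hypothetical `corrupt` function that applies random character edits, and pairwise agreement as one possible reading of "accuracy". The sample titles and the 0.8 threshold are also illustrative assumptions.

```python
import difflib
import itertools
import random


def corrupt(title, rng, n_edits=2):
    """Apply n_edits random character edits (delete, swap, replace).

    Hypothetical corruption model, assumed for illustration only.
    """
    chars = list(title)
    for _ in range(n_edits):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        op = rng.choice(["delete", "swap", "replace"])
        if op == "delete":
            del chars[i]
        elif op == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def cluster(records, threshold=0.8):
    """Single-link clustering: records whose similarity ratio reaches
    the threshold end up in the same cluster (union-find)."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in itertools.combinations(range(len(records)), 2):
        ratio = difflib.SequenceMatcher(
            None, records[i].lower(), records[j].lower()
        ).ratio()
        if ratio >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(records))]


def pairwise_accuracy(predicted, truth):
    """Fraction of record pairs on which the predicted and true
    groupings agree (same cluster vs. different cluster)."""
    pairs = list(itertools.combinations(range(len(truth)), 2))
    agree = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    # Illustrative clean titles; each gets three corrupted variants.
    clean = [
        "Paradise Lost",
        "Philosophiae Naturalis Principia Mathematica",
        "Candide, ou l'Optimisme",
    ]
    rng = random.Random(0)
    records, truth = [], []
    for label, title in enumerate(clean):
        for _ in range(3):
            records.append(corrupt(title, rng))
            truth.append(label)
    print(pairwise_accuracy(cluster(records), truth))
```

Comparing all pairs is quadratic in the number of records, so a sketch like this only suits a test set of roughly the size the abstract mentions; a dataset of millions of records would need blocking or another way to limit comparisons.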
Index Terms
- Analysis of Clustering Algorithms to Clean and Normalize Early Modern European Book Titles