research-article

Analysis of Clustering Algorithms to Clean and Normalize Early Modern European Book Titles

Authors:
Evan Bryer, Theppatorn Rhujittawiwat, Samyu Comandur, Vasco Madrid, Stephanie Riley, John Rose, Colin Wilder

University of South Carolina, United States

ICSIM '21: Proceedings of the 2021 4th International Conference on Software Engineering and Information Management, January 2021, Pages 106–112. https://doi.org/10.1145/3451471.3451489

Published: 13 July 2021

ABSTRACT

In this paper, we identify, out of five common clustering algorithms, the most accurate method for deduplicating book records from past centuries that were collected from multiple libraries for data analysis. The presence of duplicate records is a major concern in data analysis. The dataset we studied contains over 5 million records of books published in European languages between 1500 and 1800, stored in the Machine-Readable Cataloging (MARC) data format and drawn from 17,983 libraries in 123 countries. However, each book record was cataloged by the library that owns the book. This creates a consistency problem: the same book is recorded in slightly different ways by different libraries. Moreover, changes in geography and language over the past centuries also affect the consistency of personal names and place names, so that many slightly different names refer to the same entity. Analyzing such a dataset without proper cleaning would produce misleading results. Because of the size of the dataset and the unknown number of duplicate records with variations, it is impractical to create a lookup table to replace each record. To solve this problem, we use data clustering to deduplicate the dataset. Our work is informed by scholarship on European History and the History of the Book. We find that clustering is an effective method for detecting the slight differences between records caused by the cataloging inconsistencies described above. Our evaluation is based on experiments with several candidate clustering methods on a test dataset. The test dataset was prepared by corrupting a clean dataset in ways that reproduce the characteristics found in the whole dataset. The clean dataset contains roughly 1,000 random records in English, German, French, and Latin, with approximately the same language distribution and average record length as the whole dataset. Our evaluation reveals that some clustering algorithms can achieve an accuracy of up to 0.97072. As demonstrated in this paper, the clustering techniques perform well on the dataset we studied.
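As a rough illustration of how clustering can group slightly different catalog entries for the same book, the sketch below uses key-collision clustering with a fingerprint key, a method in the spirit of the one offered by OpenRefine (which appears among the author tags). The abstract does not name the five algorithms that were compared, so this is not necessarily one of them. The function names `fingerprint` and `cluster_titles` and the sample titles are hypothetical and chosen only for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(title: str) -> str:
    """Build a normalized key: no accents, no ASCII punctuation, lowercase, sorted unique tokens."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation with spaces, then lowercase.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = unaccented.translate(table).lower()
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster_titles(titles):
    """Group titles whose fingerprints collide; keep only groups with more than one member."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[fingerprint(title)].append(title)
    return [group for group in buckets.values() if len(group) > 1]


# Hypothetical sample titles, invented for illustration (not taken from the studied dataset).
titles = [
    "Institutio Christianae religionis",
    "Institutio christianae religionis.",
    "Christianae religionis institutio",
    "Les Essais",
    "Les essais.",
    "De revolutionibus orbium coelestium",
]

clusters = cluster_titles(titles)
for group in clusters:
    print(group)
```

Key collision only catches variants that differ in case, punctuation, accents, or word order. Spelling variants and transcription errors would need a distance-based method, which this sketch does not cover.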



Published in

ICSIM '21: Proceedings of the 2021 4th International Conference on Software Engineering and Information Management
January 2021
251 pages
ISBN: 9781450388955
DOI: 10.1145/3451471

Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Analysis
Clustering
Comparison
OpenRefine
Qualifiers
- research-article
- Research
- Refereed limited