Abstract:
Metadata is crucial for the accessibility, interoperability, and long-term usability of digital objects such as Electronic Theses and Dissertations (ETDs). In large-scale academic repositories, poor metadata quality can significantly impede the discovery and use of resources. This study addresses persistent issues of incomplete and inconsistent ETD metadata collected from U.S. university libraries. Directly applying machine learning-based error detection and correction models to such metadata, however, may introduce new errors because these models are imperfect. We propose an ETD metadata improvement system (ETDMIS) that mitigates this problem by integrating metadata validation and a version control mechanism. We applied the system to a dataset of 100,000 U.S. ETDs, which resulted in substantial improvements in metadata quality, and demonstrated its scalability by processing the entire dataset efficiently. The original and the enhanced metadata for the 100,000 ETDs are publicly accessible at https://github.com/lamps-lab/ETDMiner/tree/master/Meta100K.
Published in: 2024 IEEE International Conference on Big Data (BigData)
Date of Conference: 15-18 December 2024
Date Added to IEEE Xplore: 16 January 2025