D-Lib Magazine
The Magazine of Digital Library Research

March/April 2015
Volume 21, Number 3/4

 

OpenDOAR Repositories and Metadata Practices

Heather Lea Moulaison, Felicity Dykas and Kristen Gallant
University of Missouri
{moulaisonhe, dykasf}@missouri.edu and kahm9c@mail.missouri.edu

DOI: 10.1045/march2015-moulaison

 


 

Abstract

In spring 2014, authors from the University of Missouri conducted a nation-wide survey on metadata practices among United States-based OpenDOAR repositories. Examining the repository systems and current practices of metadata in these repositories, researchers collected and analyzed the responses of 23 repositories. Results from this survey include information about the creators of metadata, best practices and resources, and controlled vocabularies. Findings will inform libraries about the current state of repository and metadata choices in open repositories in the United States, especially as they pertain to overarching questions of interoperability.

 

Introduction

The creation of metadata for research and repository content is an essential part of the scholarly communication process and is necessary for the long-term access and preservation of our digital (and digitized) heritage. Metadata choices and practices affect the findability of resources in the online environment, and these choices, influenced by the content itself, also reflect the institutions, stakeholders, and users of specific repositories. Content in repositories may be one-of-a-kind, with academic libraries creating digital repositories to house and make available the campus's unique intellectual capital (see Cullen & Chawner, 2011). Other institutions such as the American Museum of Natural History and the New York Public Library have also chosen to make curated digitized information freely available on the open Web. Such collections often have original or unique content, and "can more broadly facilitate the creation of new knowledge by an even wider array of scholars and researchers than in the past" (Gasaway, 2010, pp. 758-759). As it stands, no one-size-fits-all answers to metadata practices have been devised, and details about current practice remain understudied.

Having an understanding of current practice in highly visible and accessible repositories that are part of the OpenDOAR registry will provide insight into larger questions of access and interoperability. In spring 2014, we conducted a nation-wide survey on metadata practices among United States-based OpenDOAR repositories. The current analysis begins by addressing questions of repository demographics, including providing an overview of which systems respondents are using, what is being made available, and the overall size of the collection. Next, we investigate metadata practices including the metadata creation environment. Specifically on the topic of standards, we investigate the metadata schema being used as well as the controlled vocabularies. In order to ensure consistency and interoperability, repositories must have knowledgeable team members performing related duties. The survey gathered information on the kinds of individuals doing metadata-related work and inquired into the resources, including best practices documentation, that they have at their disposal.

 

OpenDOAR Repositories

OpenDOAR is a directory of open repositories developed by the University of Nottingham in the United Kingdom and Lund University in Sweden (Jacso, 2006; "OpenDOAR: Open Access," 2006; "OpenDOAR or Directory," 2005). The directory serves an international academic community, establishing an authoritative and quality-based source for accessing open-access scholarly materials ("About," 2014). OpenDOAR ensures quality by visiting the repositories prior to listing them in the directory. The directory includes 2,704 repositories and their content ("Content Types," 2014), and through the OpenDOAR website, it is possible to search or browse for repositories based on criteria such as repository type, content type held, and subject area. In addition, the site includes OpenDOAR Search, which uses Google Custom Search to search across repository content ("Tools for Repository Administrators," 2014). Content in OpenDOAR repositories can be as varied as private university scholarly publications or digital collections of music and art from a public library, and includes objects such as journal articles, theses and dissertations, conference papers, software, patents, datasets, learning objects, audio-visual materials, and books ("Content Types," 2014).

 

Method

This study limited its scope to United States-based institutions and their repositories as listed on the OpenDOAR registry. Of the 328 OpenDOAR institutions listed in February 2014, we randomly selected 50 institutions for study. If individual United States-based institutions listed more than one repository in the directory, we focused our efforts on the first listed repository. In one case, we included both a repository and the consortium of repositories of which it was a member.

In May 2014, emails were sent directly to repository administrators with links to the Qualtrics online survey instrument. The survey questions included information about the demographics of the repository and the metadata creation environment. (View a PDF of the survey through the MOspace repository.) Recipients were asked to forward the email to the responsible party if they felt someone else was better suited to answer.

 

Findings and Analysis

The survey contained five sections with 19 major questions and sub-questions which aimed to investigate current practices in relation to the metadata choices and practices. Representatives from 23 of the institutions identified (46%) completed the first two sections of the survey, with 19 institutions (38% of the total) completing the entire survey. The responses were collected using the online survey software, Qualtrics, and analyzed further in Excel.

 

Repository Demographics: System, Content, and Size

Institutions were asked demographic questions about the repository system they used, the content they collected, and the size of their collections. Among the 23 respondents, the most common repository system/software was DSpace, with 43% of survey respondents using it. This finding is consistent with the findings of Li and Banach (2011) in their spring 2010 survey on preservation in academic libraries in North America and with the posted report about all OpenDOAR repositories' usage of repository software ("Usage of Open Access Repository Software", 2014). Twenty-six percent of libraries surveyed used Digital Commons (bepress), with 13% (n=3) using Fedora. The remaining respondents were using other software, including both commercial and homegrown systems; no respondents reported using EPrints in this study. Five repositories reported using more than one software package (see Table 1).

 
Repository/Software Number Using (N=23)     %    
DSpace 10 43%
Digital Commons (bepress) 6 26%
Fedora 3 13%
ExLibris DigiTool 2 9%
Hydra 2 9%
Islandora 1 4%
Omeka 1 4%
ContentDM 1 4%
Locally developed software 2 9%
Other 2 9%

Table 1: Repository Software or System Used
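The percentages in Table 1 can be reproduced from the counts with a few lines of Python. This is only a sketch of the arithmetic: the counts are copied from the table, while the variable names are our own and rounding to the nearest whole percent is an assumption about how the figures were prepared. Because respondents could report more than one system, the selections add up to more than 23 and the percentages to more than 100%.

```python
# Counts copied from Table 1; N is the number of responding repositories.
N = 23
counts = {
    "DSpace": 10,
    "Digital Commons (bepress)": 6,
    "Fedora": 3,
    "ExLibris DigiTool": 2,
    "Hydra": 2,
    "Islandora": 1,
    "Omeka": 1,
    "ContentDM": 1,
    "Locally developed software": 2,
    "Other": 2,
}

# Each percentage is computed against the respondents (N), not the selections.
percentages = {name: round(100 * n / N) for name, n in counts.items()}

# Multi-select question: total selections exceed the number of respondents.
total_selections = sum(counts.values())

for name, pct in percentages.items():
    print(f"{name}: {counts[name]} ({pct}%)")
print(f"Total selections: {total_selections} from {N} respondents")
```

Computing each share against the number of respondents rather than the number of selections keeps the figures comparable with the other tables in this article.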

Content made available in these open repositories was varied, with 78% making individual articles, student projects, and/or images available. Seventy-four percent made photographs available, and 65% made electronic theses and dissertations of some kind available. Sixty-one percent made reports, digitized books, video, journals or presentations available (see Table 2). These responses are consistent with the data reported for the entirety of the OpenDOAR repositories as listed on the OpenDOAR website, with journal articles and theses and dissertations listed as the most common kinds of content ("Content Types in OpenDOAR Repositories," 2014).

 
Kinds of Content Number Using (N=23)     %    
Images 18 78%
Individual articles 18 78%
Student projects 18 78%
Photographs 17 74%
ETDs 15 65%
Presentations 14 61%
Reports 14 61%
Digitized books 14 61%
Video 14 61%
Journals 14 61%
Newspapers 12 52%
Audio 11 48%
White papers 9 39%
Research data/datasets 8 35%
Born digital books 6 26%
Databases 3 13%
Websites 1 4%
Other: government documents 1 4%
Other: university archive items 1 4%
Other: collective bargaining agreements 1 4%

Table 2: Kinds of Content

Overall, the 23 repositories surveyed held between two and 15 different kinds of content. The repository with only two kinds of content held individual articles and ETDs (see Figure 1).


Figure 1: Kinds of Content, by Repository

Finally, the repositories varied in size. Twenty-two respondents provided information about the extent of their digital collections. Thirty-two percent (n=7) held between 500 and 4,999 digital objects, 23% (n=5) held 5,000-9,999, and another 23% (n=5) held 10,000-19,999. Only one respondent reported holding 100,000-1,000,000 digital objects, the largest range reported.

Given the variety of content mentioned in the previous question, it seems reasonable that smaller collections contain fewer content types. When plotted, the trendline generally confirms that collections with fewer kinds of content have smaller numbers of digital objects (see Figure 2).


Figure 2: Digital Objects in the Collection by Kinds of Content

 

Metadata Practices

Respondents also supplied information about the metadata schema and controlled vocabularies used in their repositories. The question about encoding schema was followed directly by the question about controlled vocabularies in use; all 23 respondents answered the schema question, but only 17 supplied information about controlled vocabularies.

In terms of metadata schema being used, many of the 23 respondents selected more than one schema. The greatest number of respondents by far used Dublin Core (n=12; 52%) or Qualified Dublin Core (QDC) (n=11; 48%). Use of the Metadata Object Description Schema (MODS) (n=6; 26%) exceeded use of MAchine-Readable Cataloging (MARC) (n=4; 17%) in the repositories, and a variety of other schema were mentioned, each with only one or two repositories reporting their use. One respondent (4%) reported being unaware of the schema being used.

The high use of standards such as Dublin Core supports the interoperability of repository content. These results should not be read too optimistically, however. Park reported in 2006 that the ambiguity of Dublin Core metadata elements can hamper consistency in their use. She noted that "This in turn has great potential to hinder semantic interoperability" (Park, 2006, p. 32).
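To make the point about element ambiguity concrete, the following sketch builds a minimal simple Dublin Core record with Python's standard library. The item, its values, the `record` wrapper element, and the `build_record` helper are all hypothetical and were not drawn from any surveyed repository or from Park's study; only the Dublin Core element names and namespace URI are standard. The two `dc:date` values illustrate one way an unqualified element can be read differently by different repositories.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a flat Dublin Core record from (element, value) pairs.

    Hypothetical helper for illustration; the <record> wrapper is an
    assumption, not part of the Dublin Core specification.
    """
    record = ET.Element("record")
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# Hypothetical digitized photograph.
fields = [
    ("title", "Main Street, looking north"),
    ("creator", "Unknown photographer"),
    ("date", "1925"),  # when the photograph was taken?
    ("date", "2013"),  # when it was digitized? Simple DC does not say.
    ("type", "Image"),
]

record = build_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Both dates are valid simple Dublin Core, yet a harvester cannot tell which one describes the original photograph. Qualified Dublin Core refinements such as `created` exist to reduce this kind of ambiguity.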

Library-centric encoding schema like MARC are less used in these repositories, yet the library-based Library of Congress Subject Headings (LCSH) is the most common controlled vocabulary used, with 88% (n=15) of the seventeen respondents using it. Other library-based vocabularies used are the Library of Congress Name Authority File (NAF) access points (n=4; 24%); Medical Subject Headings (MeSH) (n=2; 12%); Library of Congress Genre/Form Terms (LCGFT) (n=2; 12%); and the Library of Congress's Thesaurus for Graphic Materials (TGM) (n=2; 12%). Controlled vocabularies maintained by the Getty were also mentioned, though not frequently, with 18% (n=3) using the Art & Architecture Thesaurus (AAT) and only one respondent (6%) using the Getty Thesaurus of Geographic Names (TGN). As mentioned, six survey respondents chose not to answer the question about controlled vocabularies, and one indicated that no controlled vocabularies are used (6%). In the results as we present them, we interpret respondents' lack of response as a result of their inability to speak to specifics of controlled vocabularies in use and have calculated the responses out of 17; we acknowledge that other explanations, including that no controlled vocabularies are in use, are also possible.
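The choice of denominator matters for how the LCSH figure is read. The short sketch below, using counts from the paragraph above, compares the share reported here (out of the 17 who answered) with the share that would result if the six non-respondents were assumed to use no controlled vocabulary at all. That second scenario is only one of the alternative explanations acknowledged above, not a finding of the survey, and the function name is our own.

```python
LCSH_USERS = 15  # respondents reporting use of LCSH
ANSWERED = 17    # respondents who answered the vocabulary question
SURVEYED = 23    # all respondents to the first two survey sections


def percent(part, whole):
    """Return part/whole as a whole-number percentage."""
    return round(100 * part / whole)


# Share as reported in this article: non-respondents are excluded.
share_of_answering = percent(LCSH_USERS, ANSWERED)

# Lower bound: assumes every non-respondent uses no controlled vocabulary.
share_of_all = percent(LCSH_USERS, SURVEYED)

print(f"LCSH among those answering: {share_of_answering}%")
print(f"LCSH among all respondents (lower bound): {share_of_all}%")
```

Under either reading, LCSH remains the most widely used controlled vocabulary among the repositories surveyed.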

 

Metadata Creation Environments

On the topic of the metadata creation environment, the survey asked respondents about staff involved in the creation of metadata for their repository and the tools and resources used. Nineteen respondents provided information about who inputs or loads metadata, creates descriptive metadata, creates administrative metadata, and reviews metadata. Overall, professional librarians with a master's level degree did the majority of the work. These librarians created descriptive metadata at 16 out of 19 institutions; they created administrative metadata at 14 institutions; and reviewed metadata at 15 institutions. Paraprofessional staff were the next most common group contributing to the repository's metadata, followed by administrators and department heads (see Table 3).

 
Team Member (N=19)                    Creates DESCRIPTIVE metadata    Creates ADMINISTRATIVE metadata    Reviews metadata
Librarian (master's level)            16 (84%)                        14 (74%)                           15 (79%)
Paraprofessional                      10 (53%)                        3 (16%)                            6 (32%)
Administrator (outside department)    7 (37%)                         3 (16%)                            3 (16%)
Department head                       3 (16%)                         3 (16%)                            4 (21%)
Subject specialist                    4 (21%)                         2 (11%)                            3 (16%)
Student worker                        4 (21%)                         2 (11%)                            0 (0%)
Volunteer                             2 (11%)                         1 (5%)                             1 (5%)
IT                                    0 (0%)                          2 (11%)                            2 (11%)

Table 3: Creating and Reviewing Metadata

These standards are applied based on documentation and specialized resources. Eighteen respondents supplied information about the resources used in their repository work, with many choosing more than one. In responding to the question about best practices, the majority of respondents (n=11; 61%) reported using homegrown best practices, though six did not mention any best practices documentation (see Table 4).

Best practices Number Using (N=18)     %    
Best practices: homegrown 11 61%
Best practices: Resource Description and Access (RDA) 4 22%
Best practices: Western States /CDP Dublin Core Metadata Best Practices 1 6%
Best practices: other 1 6%
None mentioned 6 33%

Table 4: Best Practices Used

Most repositories used one set of best practices, with only four respondents using more than one set of best practices (see Figure 3).


Figure 3: Number of Best Practices Used by Repositories

Best practices documentation is not the only way that quality is ensured. Other resources respondents used included OCLC Connexion (n=8; 44%), the oXygen XML editor (n=4; 22%), and RDA Toolkit (n=4; 22%). Five of the 18 respondents did not indicate any additional tools or resources (see Table 5).

Other Resources Number Using (N=18)     %    
OCLC Connexion 8 44%
oXygen XML editor 4 22%
RDA Toolkit 4 22%
Cataloger's Desktop 3 17%
Classification Web 3 17%
MarcEdit 3 17%
Virtual International Authority File (VIAF) 3 17%
id.loc.gov 2 11%
DublinCore Generator.com 1 6%
ORCID 1 6%
None mentioned 5 28%

Table 5: Other Resources Used in Metadata Creation

 

Discussion

Having content available in open repositories is a first step toward ensuring the content's use, especially in a federated environment such as OpenDOAR. Providing the best conditions for access, including adequate systems and consistent and correct metadata, contributes to the usefulness of content over the long term.

In general, the OpenDOAR repositories surveyed in this study are using known library standards, including Dublin Core, MODS, MARC, and LCSH. The use of some of these standards reflects the use of repository software systems that have been built with these standards as default options. Those creating and/or reviewing the metadata are primarily librarians and paraprofessional staff.

A strength of traditional cataloging is the use of shared standards for descriptive and subject cataloging. Libraries worldwide participate in creating and using bibliographic records in WorldCat. OCLC reports that 72,000 libraries are represented in WorldCat ("A Global Library Resource," 2014). As described on the OCLC website, "OCLC members harness the collective energy and innovation of library world to share collections, metadata, best practices and expertise" ("The Value of Cooperative Cataloging," 2014). From this survey, we conclude that this collective energy is yet to be realized in digital repositories. Hillmann (2008) discusses the differences between traditional cataloging and work done in digital libraries. She observes that in digital libraries, "few communities of practice have been able to define their needs as a community" (p. 68). The report of the types of content in the digital repositories in this survey shows that many are unique items, including electronic theses and dissertations (ETDs), student projects, white papers, and research data and datasets. It is likely that many of the photographs and images are unique to the reporting repository, too. Given that repositories are not sharing records, the need for shared best practices does not occur at the repository level and the need for a community of practice may not be perceived as important. On the other hand, a large number of respondents are using controlled vocabularies with at least some of their repository material.

The size of repositories may impact this perception about best practices, too. In comparison to the traditional holdings of large academic libraries, digital repositories are still small. Thirty-seven percent of respondents answering the question indicated that their repositories held fewer than 5,000 items, and 82% reported fewer than 20,000 items. The smaller repositories also reported having fewer kinds of content. Stvilia and Gasser (2008) observe that large repositories may receive greater use than smaller repositories and their need for quality metadata may be greater, but, in what they call the "cycle of diminishing returns," larger repositories may have "greater difficulty in providing those metadata with limited resources as the metadata collection continues to grow and becomes increasingly diverse" (p. 67).

Repositories, although essential in the provision of information, especially unique content, are still working through the challenges of providing access to their content based on their own unique environments, cultures, and resources.

 

Conclusion

The usefulness of metadata is dependent on many factors, including system functionality, the encoding of metadata for machine manipulation, and the quality of the metadata. In this study we gathered information on systems used, metadata encoding schemes and elements that impact metadata quality, including the level of staff creating it and best practices resources in use, in an effort to describe metadata practices. Repository operations surveyed were drawn from those registered with OpenDOAR, a directory of vetted repositories adhering to practices of openness. Having an understanding of current practices in this environment provides insight into larger questions of access and interoperability.

 

Acknowledgements

This research was funded by a grant from the University of Missouri Richard Wallace Faculty Incentive.

 

References

[1] About OpenDOAR. (2014). OpenDOAR.

[2] Content Types in OpenDOAR Repositories — Worldwide. (2014). OpenDOAR.

[3] Cullen, R., & Chawner, B. (2011). Institutional repositories, open access, and scholarly communication: A study of conflicting paradigms. The Journal of Academic Librarianship, 37(6), 460-470. http://doi.org/10.1016/j.acalib.2011.07.002

[4] Gasaway, L. N. (2010). Libraries, digital content, and copyright. Vanderbilt Journal of Entertainment & Technology Law, 12(4), 755-778.

[5] A global library resource. (2014). OCLC.

[6] Hillmann, D. I. (2008). Metadata quality: From evaluation to augmentation. Cataloging & Classification Quarterly, 46(1), 65-80. http://doi.org/10.1080/01639370802183008

[7] Jacso, P. (2006). GeoScienceWorld, OpenDOAR, and Enciclopedia Estudiantil Hallazgos. Online, 30(3), 52-54.

[8] Li, Y. & Banach, M. (2011). Institutional repositories and digital preservation: Assessing current practices at research libraries. D-Lib Magazine, 17(5/6). http://doi.org/10.1045/may2011-yuanli

[9] OpenDOAR or directory of open access repositories. (2005). Information Services & Use, 25(2), 109-111.

[10] OpenDOAR: open access to research information. (2006). Library Hi Tech News, 23(3), 20-21.

[11] Park, J.-r. (2006). Semantic interoperability and metadata quality: An analysis of metadata item records for digital image collections. Knowledge Organization, 33(1), 20-34.

[12] Stvilia, B., & Gasser, L. (2008). Value-based metadata quality assessment. Library & Information Science Research, 30, 67-74.

[13] Tools for Repository Administrators. (2014). OpenDOAR.

[14] Usage of Open Access Repository Software — Worldwide. (2014). OpenDOAR.

[15] The value of cooperative cataloging. (2014). OCLC.

 

About the Authors


Heather Lea Moulaison is Assistant Professor at the iSchool at the University of Missouri. Her research focuses primarily on the intersection of the organization of information and technology and includes the study of issues pertaining to metadata, standards, and digital preservation. An ardent Francophile, Dr. Moulaison is also interested in international aspects of access to information.

 

Felicity Dykas is Head of the Digital Services Department at the University of Missouri. Previous positions have included Head of the Catalog Department and electronic resources librarian. She works with the university institutional repository and digital library and has additional interests in metadata standards, organizational systems for online resources, and the preservation of print and digital material.

 

Kristen Gallant is a graduate student at the University of Missouri's iSchool. Ms. Gallant holds a Master of Arts in Art History and is interested in digital resources and their metadata.

 