Automatic evaluation of metadata quality in digital repositories

  • regular paper
International Journal on Digital Libraries

Abstract

Owing to recent developments in automatic metadata generation and interoperability between digital repositories, the production of metadata now vastly surpasses manual quality control capabilities. Abandoning quality control altogether is problematic, because low-quality metadata compromise the effectiveness of the services that repositories provide to their users. To address this problem, we present a set of scalable quality metrics for metadata based on the Bruce & Hillmann framework for metadata quality control. We perform three experiments to evaluate our metrics, measuring (1) the degree of correlation between the metrics and manual quality reviews, (2) the discriminatory power between metadata sets and (3) the usefulness of the metrics as low-quality filters. Through statistical analysis, we found that several metrics, especially Text Information Content, correlate well with human evaluation and that the average of all the metrics is roughly as effective as people at flagging low-quality instances. The implications of this finding are discussed. Finally, we propose possible applications of the metrics to improve tools for the administration of digital repositories.

References

  1. Barton J., Currier S., Hey J.M.N.: Building quality assurance into metadata creation: an analysis based on the learning objects and e-prints communities of practice. In: Sutton, S., Greenberg, J., Tennis, J. (eds) Proceedings 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice—Metadata Research and Applications, pp. 39–48. Seattle, Washington (2003)

  2. Beall J.: Metadata and data quality problems in the digital library. J. Dig. Inf. 6(3), 20 (2005)

  3. Bederson B.B., Shneiderman B., Wattenberg M.: Ordered and quantum treemaps: making effective use of 2D space to display hierarchies. ACM Trans. Graph. 21(4), 833–854 (2002)

  4. Bingham E., Mannila H.: Random projection in dimensionality reduction: applications to image and text data. In: Provost, F., Srikant, R. (eds) Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250. ACM Press, New York (2001)

  5. Bruce T.R., Hillmann D.: Metadata in Practice, Chap. The Continuum of Metadata Quality: Defining, Expressing, Exploiting, pp. 238–256. ALA Editions, Chicago (2004)

  6. Bui, Y., Park, J.: An assessment of metadata quality: a case study of the national science digital library metadata repository. In: Moukdad, H. (ed.) Proceedings of CAIS/ACSI 2006 Information Science Revisited: Approaches to Innovation, p. 13 (2006)

  7. Cardinaels, K., Meire, M., Duval, E.: Automating metadata generation: the simple indexing interface. In: WWW ’05: Proceedings of the 14th International Conference on World Wide Web, pp. 548–556. ACM Press, New York (2005)

  8. Chapman A., Massey O.: A catalogue quality audit tool. Libr. Manage. 23(6–7), 314–324 (2002)

  9. DCMI: Dublin Core Metadata Initiative. http://dublincore.org. 2 April 2007 (1995)

  10. Dushay, N., Hillmann, D.: Analyzing metadata for effective use and re-use. In: Sutton, S., Greenberg, J., Tennis, J. (eds.) DCMI Metadata Conference and Workshop. Dublin Core Metadata Initiative, p. 10. Seattle, USA (2003)

  11. Duval, E., Hodgins, W.: Making metadata go away: hiding everything but the benefits. In: Proceedings of the DCMI 2004 Conference. Dublin Core Metadata Initiative, pp. 29–35. Shanghai, China (2004)

  12. Duval E., Warkentyne K., Haenni F., Forte E., Cardinaels K., Verhoeven B., Van Durm R., Hendrikx K., Forte M., Ebel N. et al.: The ariadne knowledge pool system. Commun. ACM 44(5), 72–78 (2001)

  13. Ede S.: Fitness for purpose: the future evolution of bibliographic records and their delivery. Catalogue Index 116, 1–3 (1995)

  14. Foltz P.W., Kintsch W., Landauer T.K.: The measurement of textual coherence with latent semantic analysis. Discourse Process. 25, 285–307 (1998)

  15. Foulonneau M.: Information redundancy across metadata collections. Inf. Process. Manage. Int. J. 43(3), 740–751 (2007)

  16. Greenberg, J., Pattuelli, M.C., Parsia, B., Robertson, W.D.: Author-generated dublin core metadata for web resources: a baseline study in an organization. In: Oyama, K., Gotoda, H. (eds.) DC’01: Proceedings of the International Conference on Dublin Core and Metadata Applications 2001, pp. 38–46. National Institute of Informatics (2001)

  17. Guy M., Powell A., Day M.: Improving the quality of metadata in eprint archives. Ariadne 38, 5 (2004)

  18. Harman D.: Overview of the first trec conference. In: Korfhage, R., Rasmussen, E.M., Willett, P. (eds) SIGIR ’93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 36–47. ACM Press, New York (1993)

  19. Hughes B.: Metadata quality evaluation: experience from the open language archives community. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, E. (eds) Digital Libraries: International Collaboration and Cross-Fertilization: Proceedings of the 7th International Conference on Asian Digital Libraries, ICADL 2004, pp. 320–329. Springer Verlag, Shanghai (2004)

  20. Hughes B., Kamat A.: A metadata search engine for digital language archives. D-Lib Mag. 11(2), 6 (2005)

  21. IEEE: IEEE 1484.12.1 Standard: Learning Object Metadata. http://ltsc.ieee.org/wg12/par1484-12-1.html. 2 April 2007 (2002)

  22. Landauer T., Foltz P., Laham D.: An introduction to latent semantic analysis. Discourse Process. 25(2–3), 259–284 (1998)

  23. Liu X., Maly K., Zubair M., Nelson M.L.: Arc—an oai service provider for digital library federation. D-Lib Mag. 7(4), 12 (2001)

  24. Lubas R., Wolfe R., Fleischman M.: Creating metadata practices for mit’s opencourseware project. Libr. Hi Tech 22(2), 138–143 (2004)

  25. McCallum D.R., Peterson J.L.: Computer-based readability indexes. In: Burns, W.J., Ward, D.L. (eds) ACM 82: Proceedings of the ACM ’82 Conference, pp. 44–48. ACM Press, New York (1982)

  26. Meire M., Ochoa X., Duval E.: Samgi: automatic metadata generation v2.0. In: Seale, C.M.J. (eds) Proceedings of the ED-MEDIA 2007 World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 1195–1204. AACE, Chesapeake (2007)

  27. Moen W.E., Stewart E.L., McClure C.R.: Assessing metadata quality: findings and methodological considerations from an evaluation of the u.s. government information locator service (gils). In: Smith, T.R. (eds) ADL ’98: Proceedings of the Advances in Digital Libraries Conference, pp. 246–255. IEEE Computer Society, Washington (1998)

  28. Najjar, J., Ternier, S., Duval, E.: The actual use of metadata in ariadne: an empirical analysis. In: Duval, E. (ed.) Proceedings of the 3rd Annual ARIADNE Conference, pp. 1–6. ARIADNE Foundation (2003)

  29. Najjar J., Ternier S., Duval E.: User behavior in learning objects repositories: an empirical analysis. In: McLoughlin, L.C.C. (eds) Proceedings of the ED-MEDIA 2004 World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 4373–4378. AACE, Chesapeake (2004)

  30. Newman M., Watts D., Barabasi A.L.: The Structure and Dynamics of Networks. Princeton University Press, New Jersey (2006)

  31. Ochoa X., Cardinaels K., Meire M., Duval E.: Frameworks for the automatic indexation of learning management systems content into learning object repositories. In: Kommers, P., Richards, G. (eds) Proceedings of the ED-MEDIA 2005 World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 1407–1414. AACE, Chesapeake (2005)

  32. O’Neill E.T.: Frbr: functional requirements for bibliographic records; application of the entity-relationship model to humphry clinker. Libr. Resour. Tech. Serv. 46(4), 150–159 (2002)

  33. Richardson M., Prakash A., Brill E.: Beyond pagerank: machine learning for static ranking. In: Goble, C., Dahlin, M. (eds) Proceedings of the 15th International Conference on World Wide Web, pp. 707–715. ACM Press, New York (2006)

  34. Salton G., Buckley C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. Int. J. 24(5), 513–523 (1988)

  35. Salton G., Wong A., Yang C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)

  36. Shannon C., Weaver W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1963)

  37. Shreeves S.L., Knutson E.M., Stvilia B., Palmer C.L., Twidale M.B., Cole T.W.: Is “quality” metadata “shareable” metadata? The implications of local metadata practices for federated collections. In: Thompson, H.A. (eds) Currents and convergence: navigating the rivers of change. Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, pp. 223–237. ALA, Minneapolis, USA (2005)

  38. Shrout P., Fleiss J.: Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979)

  39. Simon B., Massart D., van Assche F., Ternier S., Duval E., Brantner S., Olmedilla D., Miklos Z.: A simple query interface for interoperable learning repositories. In: Olmedilla, D., Saito, N., Simon, B. (eds) Proceedings of the 1st Workshop on Interoperability of Web-based Educational Systems, pp. 11–18. CEUR, Chiba (2005)

  40. Van de Sompel H., Nelson M., Lagoze C., Warner S.: Resource harvesting within the oai-pmh framework. D-Lib Mag. 10(12), 1082–9873 (2004)

  41. Strong, D.M., Lee, Y.W., Wang, R.Y.: Data quality in context. Commun. ACM 40(5), 103–110 (1997). citeseer.ist.psu.edu/strong97data.html

  42. Stvilia, B.: Measuring information quality. Ph.D. thesis, University of Illinois at Urbana-Champaign, Urbana, IL (2006)

  43. Stvilia B., Gasser L., Twidale M.: Information quality management: theory and applications. Chap Metadata quality problems in federated collections, pp. 154–186. Idea Group, Hershey (2006)

  44. Stvilia B., Gasser L., Twidale M.: A framework for information quality assessment. J. Am. Soc. Inf. Sci. Technol. 58(12), 1720–1733 (2007)

  45. Stvilia B., Gasser L., Twidale M.B., Shreeves S.L., Cole T.W.: Metadata quality for federated collections. In: Chengalur-Smith, I.N., Raschid, L., Long, J., Seko, C. (eds) IQ, pp. 111–125. MIT Press, Cambridge (2004)

  46. Thomas S.E.: Quality in bibliographic control. Libr. Trends 44(3), 491–505 (1996)

  47. Verbert, K., Jovanovic, J., Gasevic, D., Duval, E.: Repurposing learning object components. In: Meersman, R., Tari, Z., Herrero, P. (eds.) On the Move to Meaningful Internet Systems 2005: OTM Workshops. Lecture Notes in Computer Science, vol. 3762, pp. 1169–1178. Springer, Berlin (2005)

  48. Wilson A.J.: Toward releasing the metadata bottleneck—a baseline evaluation of contributor-supplied metadata. Libr. Resour. Tech. Serv. 51(1), 16–28 (2007)

  49. Zhu X., Gauch S.: Incorporating quality metrics in centralized/distributed information retrieval on the world wide web. In: Yannakoudakis, E., Leong, N.J.B.M.K., Ingwersen, P. (eds) Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 288–295. ACM Press, New York (2000)

Correspondence to Xavier Ochoa.

Ochoa, X., Duval, E. Automatic evaluation of metadata quality in digital repositories. Int J Digit Libr 10, 67–91 (2009). https://doi.org/10.1007/s00799-009-0054-4

  • DOI: https://doi.org/10.1007/s00799-009-0054-4
