Content Profiling for Preservation: Improving Scale, Depth and Quality

  • Conference paper
The Emergence of Digital Libraries – Research and Practices (ICADL 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8839)

Abstract

Content profiling is a crucial step in digital preservation that enables controlled management of content over time. However, large-scale profiling faces a set of challenges. As data grows and becomes more diverse, the only way to control it is to combine the outputs of multiple characterization tools to cover the variety of formats and extract the features of interest. Combining tools in this way introduces conflicting measures and poses challenges for data quality. Sparsity and labeling conflicts make it difficult or impossible to partition, sample and analyze the large metadata sets of a content profile. Without these capabilities, it is virtually impossible to manage heterogeneous collections reliably over time.

In this paper, we present the content profiling tool C3PO, which includes rule-based techniques and heuristics designed for conflict reduction. We conduct a set of experiments in which we assess the effect of such a mechanism and rule set on the quality and effectiveness of content profiling. The results show the potential of simple conflict reduction rules to strongly improve the data quality of content profiling for analysis and decision support.
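
To make the idea of rule-based conflict reduction concrete, the following is a minimal sketch and not C3PO's actual rule set or API. It assumes that each object carries one reported format label per characterization tool, and that a single hypothetical rule maps labels known to be equivalent onto one canonical label. The tool names (`tool_a`, `tool_b`), the `EQUIVALENT_LABELS` table and both functions are illustrative assumptions.

```python
# Hypothetical equivalence rule: labels that different tools may use for the
# same format, mapped to one canonical label. Not taken from C3PO.
EQUIVALENT_LABELS = {
    "jpeg file interchange format": "JPEG",
    "jpeg": "JPEG",
    "portable document format": "PDF",
    "pdf": "PDF",
}


def normalize(value):
    """Map a reported label to its canonical form if a rule covers it."""
    return EQUIVALENT_LABELS.get(value.strip().lower(), value.strip())


def resolve(reports, use_rules=True):
    """Combine the values that several tools report for one property.

    reports: dict mapping a tool name to the value it reported.
    Returns (status, value): status is "OK", "CONFLICT" or "MISSING".
    """
    values = [v.strip() for v in reports.values() if v and v.strip()]
    if use_rules:
        values = [normalize(v) for v in values]
    distinct = sorted(set(values))
    if not distinct:
        return ("MISSING", None)
    if len(distinct) == 1:
        return ("OK", distinct[0])
    return ("CONFLICT", distinct)


def conflict_rate(collection, use_rules=True):
    """Fraction of objects whose tool outputs still disagree."""
    if not collection:
        return 0.0
    statuses = [resolve(r, use_rules)[0] for r in collection]
    return statuses.count("CONFLICT") / len(statuses)


# Illustrative data only: four objects, two hypothetical tools.
collection = [
    {"tool_a": "JPEG File Interchange Format", "tool_b": "jpeg"},
    {"tool_a": "PDF", "tool_b": "Portable Document Format"},
    {"tool_a": "PDF", "tool_b": "TIFF"},
    {"tool_a": "TIFF", "tool_b": "TIFF"},
]

print(conflict_rate(collection, use_rules=False))
print(conflict_rate(collection, use_rules=True))
```

In this toy collection, the equivalence rule removes the two conflicts that are only labeling differences and leaves the genuine disagreement (PDF versus TIFF) flagged for inspection. Comparing the conflict rate with and without rules is one simple way to measure the effect of a rule set on data quality.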

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Kulmukhametov, A., Becker, C. (2014). Content Profiling for Preservation: Improving Scale, Depth and Quality. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_1

  • DOI: https://doi.org/10.1007/978-3-319-12823-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12822-1

  • Online ISBN: 978-3-319-12823-8

  • eBook Packages: Computer Science, Computer Science (R0)
