Abstract
This paper identifies three distinct data production paradigms for Earth science data, each having its own versioning structure:
- Climate data record production, used when the data producer’s dominant concern is providing a homogeneous error structure for each data set version, particularly when the data record is expected to cover a long time period
- Operational data set production, used when the producer must ensure low latency and service continuity with less attention to error homogeneity across the entire record
- Exploratory production, used for validation or research in which the producer decides which processes to apply by interacting with the data. In this paradigm, there may not be a common versioning structure from one production episode to another.
This paper then develops a mathematical framework for three provenance tracing activities that are important in long-term preservation of Earth science data:
- tracing the history of data production that created an item of Earth science data, with particular attention to the versioning structure of the data collections
- tracing the history of custody for an item
- tracing the history of Intellectual Property Rights transfers for an item
Each of these activities has its own type of Directed Acyclic Graph (DAG) underlying a particular kind of provenance. Provenance tracing is equivalent to performing a Breadth First Search on the appropriate DAG.
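As a rough illustration of that equivalence, the sketch below traces the ancestors of one item with a breadth-first search. It is a minimal example under stated assumptions, not the paper's formal framework: the function name `trace_provenance`, the dictionary-based graph representation, and every item name in `production_parents` are hypothetical. The same routine would apply to a custody DAG or an Intellectual Property Rights transfer DAG by supplying a different parent mapping.

```python
from collections import deque


def trace_provenance(parents, item):
    """Return every ancestor of `item` in breadth-first order.

    `parents` maps an item to the list of items it was directly
    derived from (assumed representation; items with no entry are
    treated as original sources).
    """
    visited = {item}          # guards against revisiting shared ancestors
    order = []                # ancestors in the order they are discovered
    queue = deque([item])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in visited:
                visited.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical production DAG: each key lists the items it was made from.
production_parents = {
    "monthly_mean_v2": ["daily_flux_v2"],
    "daily_flux_v2": ["calibrated_radiance_v2", "ancillary_cloud_mask_v1"],
    "calibrated_radiance_v2": ["raw_counts", "calibration_coeffs_v2"],
}

ancestors = trace_provenance(production_parents, "monthly_mean_v2")
print(ancestors)
```

Because the search proceeds level by level, direct inputs are reported before more distant ones, so the result reads as a history ordered by distance from the item being traced.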
Acknowledgements
The author is deeply grateful to the reviewers of this paper for helping him to remove a number of misconceptions and to clarify the writing.
Additional information
Communicated by: H. A. Babaie
Cite this article
Barkstrom, B.R. A mathematical framework for earth science data provenance tracing. Earth Sci Inform 3, 167–196 (2010). https://doi.org/10.1007/s12145-010-0057-0
DOI: https://doi.org/10.1007/s12145-010-0057-0