Abstract

Next-generation scientific instruments will collect data at unprecedented rates: multiple GB/s and in excess of a TB/day. Such runs will benefit from automation and steering via machine learning methods, but these methods require new data management and policy techniques. We present here the Braid Provenance Engine (Braid-DB), a system that embraces AI-for-science automation in deciding how and when to analyze and retain data, and when to alter experimental configurations. Traditional provenance systems automate record-keeping so that humans and/or machines can recover how a particular result was obtained and, when failures occur, diagnose causes and enable rapid restart. Related workflow automation efforts need to record additional information about model training inputs, including experiments, simulations, and the structures of other learning and analysis activities. Braid-DB combines provenance and version control concepts to provide a robust and usable solution.

Acknowledgments

This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under contract number DE-AC02-06CH11357.

Author information

Corresponding author

Correspondence to Justin M. Wozniak.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Wozniak, J.M., Liu, Z., Vescovi, R., Chard, R., Nicolae, B., Foster, I. (2022). Braid-DB: Toward AI-Driven Science with Machine Learning Provenance. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_14

  • DOI: https://doi.org/10.1007/978-3-030-96498-6_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96497-9

  • Online ISBN: 978-3-030-96498-6

  • eBook Packages: Computer Science, Computer Science (R0)
