Research Article · DOI: 10.1145/3332186.3332241

Petrel: A Programmatically Accessible Research Data Service

Published: 28 July 2019

Abstract

We report on our experiences deploying and operating Petrel, a data service designed to support science projects that must organize and distribute large quantities of data. Building on a high-performance 3.2 PB parallel file system and embedded in Argonne National Laboratory's 100+ Gbps network fabric, Petrel leverages Science DMZ concepts and Globus APIs to provide application scientists with a high-speed, highly connected, and programmatically controllable data store. We describe Petrel's design, implementation, and usage and give representative examples to illustrate the many different ways in which scientists have employed the system.
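To illustrate what "programmatically controllable" means in practice, the sketch below uses the Globus Python SDK (globus_sdk) to list a directory on a Petrel-style Globus endpoint and to submit an asynchronous transfer into it. This is a minimal sketch of the API style the paper builds on, not code from the paper: the endpoint UUIDs, paths, and access token are hypothetical placeholders.

# Minimal sketch: driving a Petrel-style Globus endpoint from Python.
# Assumes the globus_sdk package and a transfer-scoped access token
# obtained via Globus Auth; endpoint UUIDs and paths are hypothetical.
import globus_sdk

TRANSFER_TOKEN = "..."                     # placeholder: Globus Auth access token
PETREL_EP = "...petrel-endpoint-uuid..."   # hypothetical endpoint UUID
LOCAL_EP = "...local-endpoint-uuid..."     # hypothetical endpoint UUID

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# List a project directory on the data store.
for entry in tc.operation_ls(PETREL_EP, path="/projects/example/"):
    print(entry["type"], entry["name"])

# Submit an asynchronous transfer from a local endpoint into the store;
# the Globus service manages and retries the transfer on our behalf.
tdata = globus_sdk.TransferData(tc, LOCAL_EP, PETREL_EP, label="example upload")
tdata.add_item("/local/data/run42.h5", "/projects/example/run42.h5")
task = tc.submit_transfer(tdata)
print("transfer task:", task["task_id"])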



Information

Published In

PEARC '19: Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning)
July 2019
775 pages
ISBN:9781450372275
DOI:10.1145/3332186
  • General Chair: Tom Furlani
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '19

Acceptance Rates

Overall acceptance rate: 133 of 202 submissions (66%)

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 1

Reflects downloads up to 17 January 2025

Cited By

  • (2024) Cheap and FAIR: A Serverless Research Data Repository for the Next Generation Cosmic Microwave Background Experiment. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 1-4. DOI: 10.1145/3626203.3670558
  • (2024) Machine learning controller for data rate management in science DMZ networks. Computer Networks 242:C. DOI: 10.1016/j.comnet.2024.110237
  • (2023) Globus automation services. Future Generation Computer Systems 142:C, 393-409. DOI: 10.1016/j.future.2023.01.010
  • (2020) Toward an Automated HPC Pipeline for Processing Large Scale Electron Microscopy Data. In 2020 IEEE/ACM 2nd Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP), 16-22. DOI: 10.1109/XLOOP51963.2020.00008
  • (2019) Understanding Data Motion in the Modern HPC Data Center. In 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), 74-83. DOI: 10.1109/PDSW49588.2019.00012
  • (2019) Virtual Excited State Reference for the Discovery of Electronic Materials Database: An Open-Access Resource for Ground and Excited State Properties of Organic Molecules. The Journal of Physical Chemistry Letters 10:21, 6835-6841. DOI: 10.1021/acs.jpclett.9b02577
