Abstract
In ever more disciplines, science is driven by data, which makes data analytics a primary skill for researchers. This covers the complete process, from data acquisition at sensors through pre-processing and feature extraction to the application of machine learning. Sensors often produce a plethora of data that must be handled in near real time, which requires a combined effort ranging from implementations at the hardware level to the high-level design of data flows. In this paper we outline two use cases that illustrate this wide span of data analysis for science, using a real-world example from astroparticle physics. We outline a high-level design approach that is capable of defining the complete data flow from sensor hardware to final analysis.
Notes
The term has been used at Silicon Graphics by John Mashey since 1998, but only became popular in 2012, with a trending peak in 2016 according to Google Trends.
Project C3, led by Wolfgang Rhode, Katharina Morik, and Tim Ruhe, investigates astrophysical data from the IceCube project and Cherenkov telescopes. Project C5, led by Bernhard Spaan and Jens Teubner, analyzes the data of the LHCb experiment at the Large Hadron Collider (LHC) facility in Geneva.
TU Dortmund University has offered studies in data science within the statistics faculty since 2002. Within the computer science faculty, students may specialize in data science.
Acknowledgements
This work has been supported by the DFG, Collaborative Research Center SFB 876 (http://sfb876.tu-dortmund.de/), projects C3 and A1.
Cite this article
Morik, K., Bockermann, C. & Buschjäger, S. Big Data Science. Künstl Intell 32, 27–36 (2018). https://doi.org/10.1007/s13218-017-0522-8
DOI: https://doi.org/10.1007/s13218-017-0522-8