research-article

Data motifs: a lens towards fully understanding big data and AI workloads

Authors:

Rui Ren

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Article No.: 2, Pages 1 - 14

https://doi.org/10.1145/3243176.3243190

Published: 01 November 2018

Abstract

The complexity and diversity of big data and AI workloads make them difficult to understand. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistic. We implement the eight data motifs on different software stacks as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench 4.0 (publicly available from http://prof.ict.ac.cn/BigDataBench), and perform a comprehensive characterization of those data motifs from the perspective of data sizes, types, sources, and patterns as a lens towards fully understanding big data and AI workloads. We believe the eight data motifs are promising abstractions and tools not only for big data and AI benchmarking, but also for domain-specific hardware and software co-design.
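The pipeline view described in the abstract can be illustrated with a minimal sketch. It is not BigDataBench code: the names `Motif`, `run_pipeline`, `sample_half` and `toy_workload` are hypothetical, and the three-stage toy workload is an assumption chosen only to show how motif-labelled units of computation pass intermediate data to one another.

```python
import random
import statistics
from enum import Enum
from typing import Any, Callable, List, Tuple


class Motif(Enum):
    """The eight data motif classes named in the abstract."""
    MATRIX = "Matrix"
    SAMPLING = "Sampling"
    LOGIC = "Logic"
    TRANSFORM = "Transform"
    SET = "Set"
    GRAPH = "Graph"
    SORT = "Sort"
    STATISTIC = "Statistic"


# A pipeline stage pairs a motif label with a unit of computation.
# This type alias is illustrative, not part of any benchmark API.
Stage = Tuple[Motif, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], data: Any) -> Tuple[Any, List[str]]:
    """Feed each stage's output into the next and record the motif sequence."""
    trace = []
    for motif, unit in stages:
        data = unit(data)  # intermediate data becomes the next stage's input
        trace.append(motif.value)
    return data, trace


def sample_half(values: List[int], seed: int = 0) -> List[int]:
    """Toy Sampling unit: draw half of the inputs without replacement."""
    rng = random.Random(seed)
    return rng.sample(values, len(values) // 2)


# Hypothetical workload: sample the input, sort the sample, take its median.
toy_workload: List[Stage] = [
    (Motif.SAMPLING, sample_half),
    (Motif.SORT, sorted),
    (Motif.STATISTIC, statistics.median),
]

result, trace = run_pipeline(toy_workload, list(range(100)))
print(trace)
```

Labelling each stage with its motif class keeps the classification separate from the implementation, which mirrors the abstract's point that a motif captures common requirements while staying divorced from any individual implementation.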

Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

November 2018

494 pages

ISBN:9781450359863

DOI:10.1145/3243176

General Chair: Skevos Evripidou (University of Cyprus, Cyprus)

Program Chairs: Per Stenström (Chalmers University of Technology, Sweden), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IFIP WG 10.3
IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Conference

PACT '18

PACT '18: International conference on Parallel Architectures and Compilation Techniques

November 1 - 4, 2018

Limassol, Cyprus

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

Yang Y, Wang L, Zhan J (2024). A Linear Combination-Based Method to Construct Proxy Benchmarks for Big Data Workloads. Benchmarking, Measuring, and Optimizing, pp. 120-136. Online publication date: 14-Feb-2024
https://doi.org/10.1007/978-981-97-0316-6_8
Ahmed H, Ismail M (2023). A Structured Approach Towards Big Data Identification. IEEE Transactions on Big Data 9:1, pp. 147-159. Online publication date: 1-Feb-2023
https://doi.org/10.1109/TBDATA.2021.3139069
Umeike J, Patel N, Manley A, Mamandipoor A, Yun H, Alian M (2023). Profiling gem5 Simulator. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 103-113. Online publication date: Apr-2023
https://doi.org/10.1109/ISPASS57527.2023.00019
Everman B, Villwock T, Chen D, Soto N, Zhang O, Zong Z (2023). Evaluating the Carbon Impact of Large Language Models at the Inference Stage. 2023 IEEE International Performance, Computing, and Communications Conference (IPCCC), pp. 150-157. Online publication date: 17-Nov-2023
https://doi.org/10.1109/IPCCC59175.2023.10253886
Sakly H, Ayres A, Ferraciolli S, da Costa Leite C, Kitamura F, Said M (2023). Radiology, AI and Big Data: Challenges and Opportunities for Medical Imaging. Trends of Artificial Intelligence and Big Data for E-Health, pp. 33-55. Online publication date: 2-Jan-2023
https://doi.org/10.1007/978-3-031-11199-0_3
Zhang C, Wang S, Yu Z, Wang H, Xu Y, Cai L, Tang D, Sun N, Bao Y (2022). A Labeled Architecture for Low-Entropy Clouds: Theory, Practice, and Lessons. Intelligent Computing 2022. Online publication date: Jan-2022
https://doi.org/10.34133/2022/9795476
Bhat K, Nai K, Shetty N, Nagarajan Y, Kalambur S (2022). Performance Analysis of Big Data Motifs on Large Core Machines. 2022 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), pp. 37-42. Online publication date: 12-Dec-2022
https://doi.org/10.1109/CCEM57073.2022.00014
Zhan J (2022). A BenchCouncil view on benchmarking emerging and future computing. BenchCouncil Transactions on Benchmarks, Standards and Evaluations 2:2, article 100064. Online publication date: Apr-2022
https://doi.org/10.1016/j.tbench.2022.100064
Zhan J (2022). Open-source computer systems initiative: The motivation, essence, challenges, and methodology. BenchCouncil Transactions on Benchmarks, Standards and Evaluations 2:1, article 100038. Online publication date: Mar-2022
https://doi.org/10.1016/j.tbench.2022.100038
Gao W, Tang F, Zhan J, Wen X, Wang L, Cao Z, Lan C, Luo C, Liu X, Jiang Z (2021). AIBench Scenario: Scenario-distilling AI Benchmarking. Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, pp. 142-158. Online publication date: 26-Sep-2021
https://dl.acm.org/doi/10.1109/PACT52795.2021.00018
