Benchmarking big data architectures for social networks data processing using public cloud platforms
Introduction
Today, millions of users generate and exchange information by means of On-line Social Networks (OSNs). The development of big data technologies has enhanced OSN features, enabling users to share their own lives, generating and interacting with large amounts of multimedia content (text, audio, video, images) and enriching it with feedback, comments, or feelings [[1], [2]]. This technological development has led to the proliferation of a huge amount of data, whose statistics are impressive: on average, every day Facebook users publish 4.5 billion posts, share more than 4.7 billion status updates, and watch over 1 billion videos; Instagram, which recently reached the mark of 300 million monthly active users, sees 2.5 billion likes and 70 million new photo uploads per day; as for YouTube (where 100 h of new content is uploaded each minute), more than 1 billion unique users visit the site each month, consuming over 6 billion hours of video. Beyond the social value of OSN data, the multimedia content of such networks may be valuable in a number of public and strategic applications such as marketing and security, but also medicine, epidemiological analyses, counter-terrorism, and so on. Therefore, the analysis of multimedia OSNs represents an extremely interesting real-world application domain, characterized by huge and heterogeneous datasets (e.g., in terms of type of social network and time variance of data) that introduce particularly challenging problems.
The big data paradigm has been designed to achieve the optimal management and analysis of such large quantities of data (big data analytics). The performance of such analytics, possibly influenced by a number of heterogeneous factors such as the type of data, the class of problem to address, or the underlying processing systems, is of the utmost importance, as it impacts both the effectiveness and the cost of the overall knowledge extraction process [3]. In this context, big data benchmarks are useful to generate application-specific workloads and tests in order to evaluate the analysis processes for data matching the well-known Volume, Velocity, Variety and Veracity (i.e., 4V) properties.
In this paper, we focus on the real-time analysis of massive OSN streams containing both textual and multimedia information coming from multiple data sources, leveraged to feed analytics and advanced applications. We exploited two state-of-the-art big data streaming architectures, namely Lambda and Kappa, designing and deploying them on the Microsoft Azure public-cloud Platform-as-a-Service. We produced meaningful evaluation results by implementing, for both architectures, a novel influence maximization and diffusion algorithm [4] in charge of the automated real-time analysis of multimedia streams. It is worth noting that the problem of stream analysis (although not novel per se) is particularly challenging in the presence of several multimedia streams, due to the very nature of multimedia data, which is complex, heterogeneous, and large in size. This makes the analysis of multimedia streams computationally expensive, so that, when it is deployed on a single centralized system, computing power becomes a bottleneck. Moreover, the knowledge base could be so large as to prevent its storage on a single node [5].
The main contributions of the paper are as follows:
considering the influence analysis problem as an interesting case study for OSNs, we have implemented a state-of-the-art algorithm [2] for addressing this problem on both Lambda and Kappa architectures, and we have selected it as the reference task;
through purposely designed experimental campaigns, we have evaluated Lambda and Kappa architectures against the reference task; the evaluation has been performed using state-of-the-art open-source analytics frameworks (such as Apache Storm and Apache Spark), which are increasingly adopted in both industry and academia, proving to be the de facto standard technologies [[6], [7]];
the evaluation deployment involved Microsoft Azure cloud PaaS services, thus making it possible to obtain easily reproducible configurations and results, as well as to estimate the actual costs of the analyses according to real provider fees;
the performance of Lambda and Kappa architectures has been evaluated along different dimensions, primarily considering timeliness, deployment costs, and outcome quality; with this goal in mind, a number of different factors have been identified and taken into account during the experimental campaigns to investigate the performance of the considered architectures, such as the volume of the input dataset, the size of the deployed cluster, and the characteristics of the nodes composing the cluster.
In the light of the considerations above, we believe that this paper provides insightful information to both researchers and industry practitioners interested in deploying big data analytics systems onto the cloud. In addition, the obtained results are helpful to both newcomers (possibly interested in the qualitative impact of the analyzed factors on the performance) and experts (who can leverage the provided information to optimize their own deployments).
The paper is organized as follows. Section 2 reports the state of the art in big data benchmarking and performance evaluation on cloud infrastructures. Section 3 describes the Lambda and Kappa architectural models for OSN applications. In Section 4 we introduce the benchmarking task based on influence diffusion problems. In Section 5 we describe the dimensions along which the considered architectures are evaluated, discussing the choices leading to the analyses, whose experimental results are reported and discussed in Section 7. Finally, discussions, lessons learned, and conclusions are reported in Section 8.
Related work
Due to the tremendous interest in big data by academia, industry, and a larger and larger user base, a variety of solutions and products has been released in recent years. With their gradual maturity, a growing need to evaluate and compare these solutions has been observed. Therefore, benchmarking big data systems has notably attracted the interest of the scientific community. Indeed, benchmarking solutions hugely facilitate performance comparison between equivalent systems, providing useful
Big data architectures
In this section, we briefly summarize the characteristics of the two big data architectures we consider in this work – namely, Lambda and Kappa – that are adopted for executing analytics.
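As a rough illustration of the difference between the two models, the following minimal Python sketch contrasts the commonly described structure of a Lambda architecture (a batch layer and a speed layer whose views are merged at serving time) with that of a Kappa architecture (a single streaming code path, where reprocessing amounts to replaying the event log). It is a toy under stated assumptions, not the implementation evaluated in this paper: the function names, the per-user event count used as the "view", and the sample events are all hypothetical.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer (Lambda): recompute a view from the full master dataset."""
    return Counter(event["user"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer (Lambda): incremental view over events not yet batch-processed."""
    view = Counter()
    for event in recent_events:
        view[event["user"]] += 1
    return view


def lambda_query(master_dataset, recent_events):
    """Serving layer (Lambda): merge batch and speed views at query time."""
    return batch_view(master_dataset) + speed_view(recent_events)


def kappa_query(event_log):
    """Kappa: a single streaming path; reprocessing means replaying the log."""
    view = Counter()
    for event in event_log:
        view[event["user"]] += 1
    return view


# Hypothetical events: the first three are already in the master dataset,
# the last two have arrived since the latest batch run.
events = [{"user": u} for u in ["ann", "bob", "ann", "cid", "ann"]]
lambda_result = lambda_query(events[:3], events[3:])
kappa_result = kappa_query(events)
```

In this toy both paths produce the same view; the practical difference lies in maintaining two code paths (Lambda) versus one (Kappa).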
Task definition
Estimating the influence exerted among users is an important task in OSNs, contributing to properly defining social communities and improving the performance of recommender systems. In this section we first introduce the OSN model we refer to (Section 4.1); then, we provide the details of the algorithm we have implemented for analyzing interactions among OSN users (Section 4.2). This algorithm represents the reference task executed by the big data architectures we benchmark. The workload
Performance evaluation
Big data systems are required to provide timely, cost-effective, and quality answers to data-driven questions [27]. Therefore, we consider three main dimensions along which the architectures taken into account are evaluated: timeliness, cost, and quality of the output. Here we discuss the choices leading to the analyses whose experimental results are detailed in Section 7.
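To make the timeliness and cost dimensions concrete, the following minimal Python sketch shows one simple way such quantities could be computed for a cluster run. It is an illustration under stated assumptions, not the metrics or pricing model used in this paper: the function names are hypothetical, billing is assumed to be linear in node-hours (real providers may bill with coarser granularity), and the prices and runtimes are made-up values rather than measurements.

```python
def run_cost(num_nodes, price_per_node_hour, runtime_seconds):
    """Cost of one run, assuming billing linear in node-hours (an assumption)."""
    return num_nodes * price_per_node_hour * runtime_seconds / 3600.0


def speedup(baseline_seconds, scaled_seconds):
    """Ratio between the baseline runtime and the scaled-cluster runtime."""
    return baseline_seconds / scaled_seconds


def scaling_efficiency(baseline_nodes, baseline_seconds, scaled_nodes, scaled_seconds):
    """Speedup divided by the factor by which the cluster size grew."""
    growth = scaled_nodes / baseline_nodes
    return speedup(baseline_seconds, scaled_seconds) / growth


# Hypothetical values only: 4 nodes for 1800 s versus 8 nodes for 1200 s,
# at an invented price of 0.5 currency units per node-hour.
baseline_cost = run_cost(4, 0.5, 1800)
scaled_cost = run_cost(8, 0.5, 1200)
observed_speedup = speedup(1800, 1200)
observed_efficiency = scaling_efficiency(4, 1800, 8, 1200)
```

With these invented numbers, doubling the cluster shortens the run but raises its cost, which is the kind of trade-off a cost-versus-performance analysis has to weigh.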
As for timeliness, works in the scientific literature have proposed and adopted several metrics to investigate
Experimental testbed
In this section, for the sake of repeatability of the analyses, we provide the details of the experimental testbed we set up. We first describe the implementation details of the architectures in Section 6.1; then we detail the cloud deployment in Section 6.2; finally, we describe the procedures adopted for obtaining the input datasets in Section 6.3.
Experiments and results
We report here the results obtained through our experimental campaigns. In Section 7.1 we discuss the timeliness of the architectures when they are fed with different volumes of input data; in Section 7.2 we evaluate how the performance improves when increasing the number of nodes composing the architectures (horizontal scaling) as well as when deploying VMs with enhanced characteristics (vertical scaling); in Section 7.3 the trade-off between cost and performance is analyzed; finally in
Discussion and conclusion
In this paper we have analyzed the performance of two state-of-the-art big data analytics architectures (Kappa and Lambda) when deployed onto a public-cloud PaaS. To achieve this goal we have considered (i) Apache Spark and Storm (providing the de facto standard implementation for Kappa and Lambda, respectively); (ii) an implementation of the popular influence analysis task to generate the workload; (iii) the Flickr YFCC100M big-data dataset as input; (iv) Microsoft Azure HDInsight PaaS as deployment
Acknowledgment
This work is partially funded by art. 11 DM 593/2000 for NM2 srl (Italy).
Valerio Persico is a Post Doc at the Department of Electrical Engineering and Information Technology of the University of Napoli Federico II. He has a PhD in computer engineering from the University of Napoli Federico II. His work focuses on measurement and monitoring of cloud network infrastructures. Valerio is the recipient of the best student paper award at ACM CoNEXT 2013.
References (42)
- et al., Real-time event detection for online behavioral analysis of big social data, Future Gener. Comput. Syst. (2017)
- et al., Measuring network throughput in the cloud: the case of Amazon EC2, Comput. Netw. (2015)
- Social Physics: How Social Networks Can Make Us Smarter (2015)
- et al., Multimedia social network modeling: A proposal
- et al., Benchmarking big data systems: a review, IEEE Trans. Serv. Comput. (2017)
- et al., Diffusion algorithms in multimedia social networks: a preliminary model
- An efficient industrial big-data engine, IEEE Trans. Ind. Inform. (2017)
- et al., Architecting time-critical big-data systems, IEEE Trans. Big Data (2016)
- Yanpei Chen, et al., We don't know enough to make a big data benchmark suite - an academia-industry view, in: Proc. of...
- et al., Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads, Proc. VLDB Endow. (2012)
- Bigbench: towards an industry standard benchmark for big data analytics
- Benchmarking cloud serving systems with YCSB
- The PigMix Benchmark on Pig, MapReduce, and HPCC Systems
- Bigdatabench: a big data benchmark suite from internet services
- BDGS: A scalable big data generator suite in big data benchmarking
- Implementation of the Linear Road Benchmark on the Basis of the Real-Time Stream-Processing System Storm
- Benchmarking Apache Spark with Machine Learning Applications
- Empirical Evaluation and Architecture Design for Big Monitoring Data Analysis
- Social-network-sourced big data analytics, IEEE Internet Comput.
- The LDBC social network benchmark: Interactive workload
Antonio Pescapè is a Full Professor of computer engineering at the University of Napoli Federico II. His work focuses on Internet technologies and more precisely on measurement, monitoring, and analysis of the Internet. Antonio has co-authored more than 200 conference and journal papers and is the recipient of a number of research awards.
Antonio Picariello is a Full Professor at the Department of Electrical Engineering and Information Technology, University of Naples Federico II. He works in the fields of multimedia databases and multimedia information systems, multimedia ontologies and the Semantic Web, natural language processing, and sentiment analysis.
Giancarlo Sperlí is a Post Doc in the Department of Electrical Engineering and Information Technology at the University of Naples Federico II. He holds a Master's Degree and a Bachelor's Degree in Computer Science and Engineering, both from the University of Naples Federico II, Italy. His main research interests are in the areas of cybersecurity, semantic analysis of multimedia data, and social network analysis.