Elsevier

Future Generation Computer Systems

Volume 89, December 2018, Pages 98-109

Benchmarking big data architectures for social networks data processing using public cloud platforms

https://doi.org/10.1016/j.future.2018.05.068

Highlights

  • We evaluate the performance of two big data architectures, namely Lambda and Kappa.

  • We consider Online Social Network data analysis as the reference task.

  • We propose a reproducible methodology to assess performance over cloud platforms.

Abstract

When considering popular On-line Social Networks (OSNs) containing heterogeneous multimedia data sources, the complexity of the underlying processing systems becomes challenging and requires implementing application-specific yet comprehensive benchmarking. The variety of big data architectures (and of their possible realizations) for both batch and streaming processing, across a huge number of application domains, makes the benchmarking of these systems critical for both academic and industrial communities.

In this work, we evaluate the performance of two state-of-the-art big data architectures, namely Lambda and Kappa, considering OSN data analysis as the reference task. In more detail, we have implemented and deployed an influence analysis algorithm on the Microsoft Azure public cloud platform to investigate the impact of a number of factors on the performance obtained by cloud users. These factors comprise the type of architecture implemented, the volume of the data to analyze, the size of the cluster of nodes realizing the architectures and their characteristics, the deployment costs, as well as the quality of the output when the analysis is subject to strict temporal deadlines. Experimental campaigns have been carried out on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. Reported results and discussions show that the Lambda architecture outperforms the Kappa architecture for the class of problems investigated. Through a variety of analyses – e.g., investigating the impact of dataset size, scaling, and cost – this paper provides useful insights on the performance of these state-of-the-art big data architectures, helpful to both experts and newcomers interested in deploying big data architectures leveraging cloud platforms.

Introduction

Today, millions of users generate and exchange information by means of On-line Social Networks (OSNs). The development of big data technologies has enhanced OSN features, enabling users to share their own lives, generating and interacting with huge amounts of multimedia content (text, audio, video, images) and enriching it with feedback, comments, or feelings [[1], [2]]. As a matter of fact, this technological development has led to the proliferation of a huge amount of data whose statistics are impressive: on average, every day Facebook users publish 4.5 billion posts, share more than 4.7 billion status updates, and watch over 1 billion videos; Instagram – which recently reached the 300 million monthly active user mark – sees 2.5 billion likes and 70 million new photo uploads per day; concerning YouTube (where 100 hours of new content are uploaded each minute), more than 1 billion unique users visit the site each month, consuming over 6 billion hours of video. Besides the social value of OSN data, the multimedia content of such networks may be valuable in a number of public and strategic applications such as marketing and security, but also medicine, epidemiological analyses, counter-terrorism, and so on. Therefore, the analysis of multimedia OSNs represents an extremely interesting real application domain, characterized by extremely large and heterogeneous datasets (e.g., in terms of type of social network and time variance of data) and introducing particularly challenging problems.

The big data paradigm has been designed to achieve the optimal management and analysis of such large quantities of data (big data analytics). The performance of such analytics – possibly influenced by a number of heterogeneous factors such as the type of data, the class of problem to address, or the underlying processing systems – is of the utmost importance, as it impacts both the effectiveness and the cost of the overall knowledge-extraction process [3]. In this context, big data benchmarks are therefore useful to generate application-specific workloads and tests in order to evaluate the analysis processes for data matching the well-known Volume, Velocity, Variety, and Veracity (i.e., 4V) properties.

In this paper, we focus on the real-time analysis of massive OSN streams containing both textual and multimedia information coming from multiple data sources, leveraged to feed analytics and advanced applications. We exploited two state-of-the-art big data streaming architectures, namely Lambda and Kappa, designing and deploying them on the Microsoft Azure public-cloud Platform-as-a-Service (PaaS). We produced meaningful evaluation results by implementing, for both architectures, a novel influence maximization and diffusion algorithm [4] in charge of the automated real-time analysis of multimedia streams. It is worth noting that the problem of stream analysis (although not novel per se) is particularly challenging in the presence of several multimedia streams, due to the very nature of multimedia data, which is complex, heterogeneous, and large in size. This makes the analysis of multimedia streams computationally expensive, so that, when deployed on a single centralized system, computing power becomes a bottleneck. Moreover, the size of the knowledge base could be so large as to prevent its storage on a single node [5].

The main contributions of the paper are as follows:

  • considering the influence analysis problem as an interesting case study for OSNs, we have implemented a state-of-the-art algorithm [2] addressing this problem on both the Lambda and Kappa architectures, selecting it as the reference task;

  • through purposely designed experimental campaigns, we have evaluated the Lambda and Kappa architectures against the reference task; the evaluation has been performed with state-of-the-art open-source analytics frameworks (namely Apache Storm and Apache Spark), which are increasingly adopted in both industry and academia and have proven to be de facto standard technologies [[6], [7]] (a minimal sketch of such a streaming pipeline is reported after this list);

  • the evaluation deployment involved Microsoft Azure cloud PaaS services, thus allowing us to obtain easily reproducible configurations and results, as well as to estimate the actual costs of the analyses according to real provider fees;

  • the performance of the Lambda and Kappa architectures has been evaluated along different dimensions, primarily considering timeliness, deployment costs, and output quality; with this goal in mind, a number of different factors have been identified and taken into account during the experimental campaigns to investigate the performance of the considered architectures, such as the volume of the input dataset, the size of the deployed cluster, as well as the characteristics of the nodes composing the cluster.
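
To make the contribution above more concrete, the following is a minimal PySpark (DStream) sketch of a micro-batch pipeline counting per-user interactions from a text stream. It is not the implementation used in the paper: the socket source, the record format ("<src_user> <dst_user>" per line), and all identifiers are assumptions introduced only for illustration.

    # Hypothetical sketch only (not the paper's code): counts, per micro-batch,
    # how many interactions each source user generates.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="osn-interaction-count")
    ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

    # Assumed socket source emitting one interaction per line: "<src> <dst>"
    lines = ssc.socketTextStream("localhost", 9999)

    counts = (lines.map(lambda line: (line.split()[0], 1))
                   .reduceByKey(lambda a, b: a + b))     # per-user interaction counts

    counts.pprint()                                      # print a sample for each batch

    ssc.start()
    ssc.awaitTermination()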

In light of the considerations above, we believe that this paper provides insightful information to both research and industry practitioners interested in deploying big data analytics systems on the cloud. In addition, the obtained results are helpful both to newcomers (possibly interested in the qualitative impact of the analyzed factors on the performance) and to experts (who can leverage the provided information to optimize their own deployments).

The paper is organized as follows. In Section 2 the state of the art in big data benchmarking and performance evaluation on cloud infrastructures is reported. Section 3 describes the Lambda and Kappa architectural models for OSN applications. In Section 4 we introduce the benchmarking task, based on influence diffusion problems. In Section 5 we describe the dimensions along which the considered architectures are evaluated, discussing the choices leading to the analyses; Section 6 details the experimental testbed, and the experimental results are reported and discussed in Section 7. Finally, discussions, lessons learned, and conclusions are reported in Section 8.

Section snippets

Related work

Due to the tremendous interest in big data from academia, industry, and a larger and larger user base, a variety of solutions and products has been released in recent years. With their gradual maturation, a growing need to evaluate and compare these solutions has emerged. Therefore, benchmarking big data systems has notably attracted the interest of the scientific community. Indeed, benchmarking solutions hugely facilitate performance comparison between equivalent systems, providing useful

Big data architectures

In this section, we briefly summarize the characteristics of the two big data architectures considered in this work – namely, Lambda and Kappa – which are adopted for executing the analytics.
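
As a rough, hypothetical illustration of the difference between the two models (not the paper's code): in Lambda, a query merges a precomputed batch view with the incremental view maintained by the speed layer, whereas in Kappa a single streaming layer maintains the only view and reprocessing is performed by replaying the input log. The Python functions and dictionary-based views below are placeholders.

    # Conceptual sketch with hypothetical names, not the benchmarked implementation.
    def lambda_query(key, batch_view, speed_view):
        # Lambda: merge the precomputed batch view with the incremental
        # view produced by the speed layer on recent data.
        return batch_view.get(key, 0) + speed_view.get(key, 0)

    def kappa_query(key, stream_view):
        # Kappa: one streaming layer maintains the only view;
        # recomputation is done by replaying the immutable event log.
        return stream_view.get(key, 0)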

Task definition

The estimation of the influence exerted among users is an important task in OSNs, contributing to properly defining social communities and improving the performance of recommender systems. In this section we first introduce the OSN model we refer to (Section 4.1); then, we provide the details of the algorithm we have implemented for analyzing interactions among OSN users (Section 4.2). This algorithm represents the reference task executed by the big data architectures we benchmark. The workload
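
For readers unfamiliar with this class of tasks, the plain-Python sketch below shows a generic independent-cascade diffusion step over a user interaction graph. It is only an illustrative stand-in: the benchmarked workload is the algorithm defined in [4], not this code, and the toy graph, activation probability, and function names are assumptions.

    # Generic independent-cascade sketch (illustrative only, not algorithm [4]).
    import random

    def diffuse(graph, seeds, p=0.1, rng=random.Random(42)):
        """Activate each untouched neighbor of an active user with probability p."""
        active, frontier = set(seeds), list(seeds)
        while frontier:
            newly_activated = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        newly_activated.append(v)
            frontier = newly_activated
        return active

    # Toy interaction graph among four users; count users reached from "a".
    g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(len(diffuse(g, ["a"])))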

Performance evaluation

Big data systems are required to provide timely, cost-effective, and quality answers to data-driven questions [27]. Therefore, we consider three main dimensions along which the architectures taken into account are evaluated: timeliness, cost, and quality of the output. Here we discuss the choices leading to the analyses whose experimental results are detailed in Section 7.
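
As an example of how the cost dimension can be tied to timeliness, the helper below estimates the deployment cost of a run from its completion time, the cluster size, and a per-node hourly rate billed per started hour; the rate and the billing model are placeholders, not actual Azure HDInsight fees.

    # Hypothetical cost model: flat per-node rate, billed per started hour.
    import math

    def deployment_cost(completion_time_s, n_nodes, hourly_rate):
        billed_hours = math.ceil(completion_time_s / 3600.0)
        return billed_hours * n_nodes * hourly_rate

    # e.g., a 4-node cluster finishing in 5000 s at an assumed 0.50 EUR/h per node
    print(deployment_cost(5000, 4, 0.50))   # -> 4.0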

For what concerns timeliness, works in the scientific literature have proposed and adopted several metrics to investigate

Experimental testbed

In this section, for the sake of repeatability of the analyses, we provide the details of the experimental testbed we set up. We first describe the implementation details of the architectures in Section 6.1; then we detail the cloud deployment in Section 6.2; finally, we describe the procedures adopted for obtaining the input datasets in Section 6.3.

Experiments and results

We report here the results obtained through our experimental campaigns. In Section 7.1 we discuss the timeliness of the architectures when they are fed with different volumes of data in input; in Section 7.2 we evaluate how the performance improves when increasing the number of nodes composing the architectures (horizontal scaling) as well as when deploying VMs with enhanced characteristics (vertical scaling); in Section 7.3 the trade-off between cost and performance is analyzed; finally in
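
When reading the horizontal-scaling results, the usual speedup and efficiency figures can be derived from completion times as in the small helpers below (generic definitions, not quantities reported in the paper).

    # Generic scaling metrics (illustrative helpers).
    def speedup(t_baseline_s, t_scaled_s):
        return t_baseline_s / t_scaled_s

    def efficiency(t_baseline_s, t_scaled_s, n_nodes):
        return speedup(t_baseline_s, t_scaled_s) / n_nodes

    # e.g., 3600 s on the baseline vs 1200 s on 4 nodes
    print(speedup(3600, 1200), efficiency(3600, 1200, 4))   # -> 3.0 0.75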

Discussion and conclusion

In this paper we have analyzed the performance of two state-of-the-art big data analytics architectures (Kappa and Lambda) when deployed onto a public-cloud PaaS. To achieve this goal we have considered (i) Apache Spark and Storm (providing the de facto standard implementations for Kappa and Lambda, respectively); (ii) an implementation of the popular influence analysis task to generate the workload; (iii) the Flickr YFCC100M big data dataset as input; (iv) Microsoft Azure HDInsight PaaS as deployment

Acknowledgment

This work is partially funded by art. 11 DM 593/2000 for NM2 srl (Italy).

References (42)

  • Nguyen D.T., et al., Real-time event detection for online behavioral analysis of big social data, Future Gener. Comput. Syst. (2017)
  • Persico V., et al., Measuring network throughput in the cloud: the case of Amazon EC2, Comput. Netw. (2015)
  • Pentland A., Social Physics: How Social Networks Can Make Us Smarter (2015)
  • Amato F., et al., Multimedia social network modeling: a proposal
  • Han R., et al., Benchmarking big data systems: a review, IEEE Trans. Serv. Comput. (2017)
  • Amato F., et al., Diffusion algorithms in multimedia social networks: a preliminary model
  • Basanta-Val P., An efficient industrial big-data engine, IEEE Trans. Ind. Inform. (2017)
  • Basanta-Val P., et al., Architecting time-critical big-data systems, IEEE Trans. Big Data (2016)
  • Chen Y., et al., We don't know enough to make a big data benchmark suite – an academia-industry view, in: Proc. of...
  • Chen Y., et al., Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads, Proc. VLDB Endow. (2012)
  • Ghazal A., et al., BigBench: towards an industry standard benchmark for big data analytics
  • Cooper B.F., et al., Benchmarking cloud serving systems with YCSB
  • Ouaknine K., et al., The PigMix Benchmark on Pig, MapReduce, and HPCC Systems
  • Wang L., et al., BigDataBench: a big data benchmark suite from internet services
  • Ming Z., et al., BDGS: a scalable big data generator suite in big data benchmarking
  • Möller R., et al., Implementation of the Linear Road Benchmark on the Basis of the Real-Time Stream-Processing System Storm (2014)
  • Wei J., et al., Benchmarking Apache Spark with Machine Learning Applications (2016)
  • Singh S., Empirical Evaluation and Architecture Design for Big Monitoring Data Analysis (2016)
  • Satra N., Is ‘Distributed’ worth it? Benchmarking Apache Spark with Mesos, 2015, ...
  • Tan W., et al., Social-network-sourced big data analytics, IEEE Internet Comput. (2013)
  • Erling O., et al., The LDBC social network benchmark: interactive workload

    Valerio Persico is a Post Doc at the Department of Electrical Engineering and Information Technology of University of Napoli Federico II. He has a PhD in computer engineering from the University of Napoli Federico II. His work focuses on measurement and monitoring of cloud network infrastructures. Valerio is the recipient of the best student paper award at ACM CoNext 2013.

    Antonio Pescapè is a Full Professor of computer engineering at the University of Napoli Federico II. His work focuses on Internet technologies and more precisely on measurement, monitoring, and analysis of the Internet. Antonio has co-authored more than 200 conference and journal papers and is the recipient of a number of research awards.

    Antonio Picariello is a Full Professor at Department of Electrical Engineering and Information Technology, University of Naples Federico II. He works in the field of Multimedia Database and Multimedia Information Systems, Multimedia Ontology and Semantic Web, Natural Language Processing and Sentiment Analysis.

    Giancarlo Sperlí is a Post Doc in the Department of Electrical Engineering and Information Technology at the University of Naples Federico II. He holds a Master’s Degree and a Bachelor’s Degree in Computer Science and Engineering, both from the University of Naples Federico II, Italy. His main research interests are in the areas of Cybersecurity, Semantic Analysis of Multimedia Data, and Social Network Analysis.
