1 Introduction

The age of online social networks is firmly upon us. Not only have generic online social networks such as Facebook and Twitter grown to unprecedented scales, but the past few years have also witnessed the meteoric rise of more specialized networks such as LinkedIn (professional networking), FourSquare (location) and Instagram (photos). These networks generate terabytes of data per day, host billions of users, and contain a variety of information ranging from message streams to ever-changing graphs. Many online social networks, most notably Twitter, make their data public. It is no surprise that online social networks have become one of the most important sources of data for analysis.

The palpable momentum behind big data fuels continuous development in analytics algorithms and large-scale platforms (Apache Hadoop, Apache Spark, Apache Storm, etc.). For many analytics tasks, such as graph analysis and user or population modeling, online social networks provide arguably the richest and largest data source available. Given the sheer number of developments in this area, fair and comprehensive evaluation of different hardware, software platforms and analytics algorithms in an online social network context becomes ever more important. End users can leverage such evaluation to navigate the many competing big data solutions based on their needs, whereas solution providers and developers can measure their solutions against the state of the art.

However, there is an unfortunate vacuum when it comes to benchmarking big data platforms and analytics against online social network data. The unprecedented size, variety and ever-changing nature of online social network data pose unique challenges to data analytics and the underlying data platforms. Most existing big data benchmarks, such as BigBench [3] and BigDataBench [11], are designed around data sets and use cases largely different from online social networks. They cannot be used to truly measure new solutions against online social network data. LDBC-SNB [2], a new benchmark still very much under development, is to the best of our knowledge the only online-social-network-centric benchmark. Nonetheless, LDBC-SNB is not specifically geared towards big data and does not consist of workloads readily executable on popular big data platforms such as Spark or Storm. In our view, both LDBC-SNB and other online-social-network-agnostic benchmarks fall short of meeting the following key requirements of an effective online social network benchmark for big data:

  • Realistic Data. This requirement is rooted in the Veracity of the well-known 5 V model of big data [1]. Most existing benchmarks use simulated data. The simulation is often based on assumptions that do not necessarily hold in real online social networks. For instance, LDBC-SNB simulates friendships based on people’s common interests and/or places of study. The simulation thus misses online social network friendships formed by working at the same company, by introduction through mutual friends, by chance encounters at the same event, and so forth.

  • Live Data. When real data is used in existing big data benchmarks, the data set is often a point-in-time copy and no attempt is made to update it. Crucially, online social networks are always changing, as new users join and new content (e.g. tweets, images) is posted. A text sentiment classification solution that delivered fast and accurate results against tweets downloaded last year may not have the same success today, perhaps because there are more tweets from non-English-speaking users that are harder to classify and require more processing. A big data platform that efficiently analyzes the social graph of NYC today may underperform in a mere few months. As new people move into the city and join the network, the graph's size and structural complexity grow, potentially shifting the in-memory vs. on-disk ratio of graph storage as well as the compute demands of the graph.

  • Comprehensive Online-social-network-centric Big Data Workloads. Online social networks present fertile ground for a wide range of user-oriented mining algorithms aimed at inter-connected users and user-generated content. These include text analytics (e.g. topic discovery), image analysis (e.g. principal component analysis), graph analysis (e.g. influencer detection), trend analysis (e.g. major event prediction), user analytics (e.g. interest or personality analysis) and recommendations (e.g. friend suggestion). Such workloads enable comparison between different platforms (e.g. Spark vs. MapReduce). Existing benchmarks often cover only one or two of these categories. BigDataBench, for instance, only provides graph analytics for online social networks.

  • Analytics-aware Metrics. Many online social network analytics are machine learning algorithms that build models of classification or prediction. When comparing analytics algorithms, the efficiency of execution is only a part of the story. The accuracy of classification or prediction is equally important. Existing benchmarks only provide metrics at a system or platform level. We are not aware of one that measures classification or prediction accuracy.

Our goal is for this benchmark to be the first to meet the aforementioned requirements. It can be used to evaluate both platform innovations, such as a new caching algorithm for Spark, and new analytics algorithms, such as a more accurate way to analyze tweet sentiment. The benchmark tracks both live and historic data from Twitter (and other online social networks) and channels it in a way that ensures fair comparison across big data platforms. It aims to provide out-of-the-box implementations of all the analytics categories above on state-of-the-art platforms such as Hadoop MapReduce, Spark and Storm.

The vast number of new applications made possible by online social networks also opens up an unusual opportunity: the opportunity to demonstrate the functionality of a big data platform in a compelling way. Benchmark designs to date have focused solely on testing system efficiency. The result is workloads that differentiate platforms with regard to execution time, throughput, etc., but not necessarily workloads that show what exciting applications a big data platform makes possible. We have run into many scenarios where the latter is necessary, in addition to the usual emphasis on efficiency. Hence we have designed the benchmark to provide additional application-level plug-ins that can be used to showcase the functional or business value of big data solutions. For instance, we provide a stock recommendation application (together with a GUI) that builds on Hadoop-based sentiment analysis.

This paper makes the following contributions:

  • We identify key challenges in building a big data benchmark based on online social network data, in terms of data, workloads and metrics.

  • We present the design and an early prototype of a benchmark that meets these challenges.

  • We put forward the additional notion of demo applications for big data benchmarks.

The following two sections describe the challenges and design of our benchmark. Section 4 presents use cases. Related work is discussed in Sect. 5, before we offer concluding remarks and future plans in Sect. 6.

2 Challenges

To further set the stage for this work, we first describe the major challenges in designing an online social network benchmark for big data, with respect to data generation, workloads and evaluation metrics.

2.1 Realistic and Live Data

Online social networks are constantly evolving as new content is generated and new users are registered. Certain analytics such as graph and text analysis can exhibit vastly different behaviors depending on the text content and graph structure. The benchmark must keep pace with this growing complexity, so as to deliver the most meaningful evaluation of big data systems.

Integrating the latest online social network data into a benchmark is far from simple. Firstly, the benchmark must have a means to obtain the latest data from online social networks. This often means building a crawler that can acquire a sufficient sample from the online social network via carefully identified APIs.

Secondly, one must balance keeping the data up-to-date with making it possible to use the same data set for fair evaluation of different systems. This implies taking reasonable snapshots of the data, as new data is continuously downloaded.

Third, there is an additional challenge in organizing the downloaded data. The organization needs to take into account both time and the snapshot copies described above. Different data may need to be organized differently: text data such as tweets, where data items are relatively independent of one another, may be organized differently from graph data, where the data points are (far) more correlated.

Last but not least, steps can be taken to minimize the users' effort in transforming the downloaded data into a storage format they are interested in. For instance, users may be interested in a specific file format, such as a standard row-oriented format or a column-oriented format like Parquet, an increasingly popular format supported by many big data platforms. They may also want to place the files in one of several storage systems, including HDFS and object stores like OpenStack Swift, in which case some file grouping or splitting might be needed to attain optimal (or even acceptable) performance. For instance, HDFS has a known problem in handling a large number of small files, so some automatic grouping would be useful.
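To make the small-file concern concrete, the following minimal sketch (assuming PySpark 2.x; the HDFS paths, selected fields and partition count are illustrative, not part of the benchmark's actual layout) shows how many per-tweet JSON files could be coalesced into one daily Parquet bundle:

```python
# Sketch: bundle many small per-tweet JSON files into one Parquet file per day,
# so HDFS is not burdened with millions of tiny files. Paths and field names
# are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-bundler").getOrCreate()

# Read the raw JSON objects downloaded by the crawler for one day.
raw = spark.read.json("hdfs:///osn-bench/raw/2016-01-15/*.json")

# Keep only the fields the workloads need, then write a single daily bundle.
bundle = raw.select("id", "created_at", "text", "user.screen_name", "entities")
bundle.coalesce(8).write.mode("overwrite") \
      .parquet("hdfs:///osn-bench/bundles/2016-01-15.parquet")
```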

2.2 Comprehensive Workloads

Designing a benchmark like ours imposes a range of challenges regarding the workloads that comprise the benchmark. We focus on the main challenge of offering a comprehensive set of workloads that represent a sufficient range of big data analytics relevant to the online social network domain. This ensures big data platforms and algorithms can be evaluated with regard to a wide range of realistic use cases.

2.3 Evaluation Metrics

A successful benchmark depends on a well-defined set of metrics for apples-to-apples quantitative comparison between different platform offerings, analytics algorithms, cluster configurations, etc. Such differentiation allows users to make an informed decision between solutions based on their needs.

Defining an effective set of metrics for online social networks is challenging. One reason is that no single set of metrics works for all the workloads we plan to cover. For example, some streaming applications require the underlying data analytics frameworks to support near-real-time processing, on the order of milliseconds, whereas for batch workloads such as iterative machine learning, end users may care more about job throughput. Moreover, metrics not directly related to system performance also come into play. Scalability metrics are needed to show how the performance of the system changes as the data and cluster size scale up and down. In order to evaluate different analytics algorithms (that perform the same type of analysis on the same data), we also need a set of metrics to evaluate the output quality (e.g. classification accuracy) of the algorithms being compared.

3 Benchmark Design and Early Prototype

This section presents our benchmark design, as shown in Fig. 1. Herein we describe several key components and their current prototyping states in detail.

Fig. 1. Architecture

3.1 Data Generator

The benchmark currently uses the Twitter Search API to crawl live tweet-related data. We have chosen Twitter as the primary data source because it is the largest online social network with an open data policy. The Search API returns a sample of the latest tweets on Twitter that match given keywords. The crawler is implemented in PHP. It maintains a list of default keywords but can also be customized by users according to their interests.
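Our crawler is implemented in PHP; purely as an illustration of the polling pattern it follows, the sketch below uses Python against the documented Search API v1.1 endpoint. The bearer token, keyword list and output file layout are placeholders, not the benchmark's actual configuration.

```python
# Illustrative only (not the actual PHP crawler): poll the Twitter Search API
# v1.1 for recent tweets matching a keyword list. Requires a valid bearer
# token; keywords and file layout are placeholders.
import json, time, requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
BEARER_TOKEN = "..."                          # placeholder credential
KEYWORDS = ["#bigdata", "spark", "hadoop"]    # user-customizable list

def crawl_once(query, since_id=None):
    params = {"q": query, "count": 100, "result_type": "recent"}
    if since_id:
        params["since_id"] = since_id
    resp = requests.get(SEARCH_URL, params=params,
                        headers={"Authorization": "Bearer " + BEARER_TOKEN})
    resp.raise_for_status()
    return resp.json()["statuses"]

while True:
    for kw in KEYWORDS:
        for tweet in crawl_once(kw):
            # Append each tweet as one JSON line in the current day's raw file.
            with open(time.strftime("raw-%Y-%m-%d.json"), "a") as f:
                f.write(json.dumps(tweet) + "\n")
    time.sleep(60)    # simple pause to stay within API rate limits
```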

Each downloaded tweet is a JSON object. We pre-process the data to extract the tweet text, the user profile information, the time stamp, and any attached image and/or URL. For Twitter, we have not encountered very large objects (e.g. over the 5 GB object size supported by OpenStack Swift) that need to be split up. Instead, the data associated with tweets is bundled into one file per day by default. One is free to take the latest N days of data for an evaluation. We plan to experiment with different bundle sizes.
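The sketch below illustrates the kind of field extraction performed in this pre-processing step. Field names follow Twitter's public JSON schema; the output record layout is illustrative rather than the benchmark's exact format.

```python
# Sketch of the pre-processing step: pull out the fields the workloads use
# from one raw tweet JSON object. The output layout is a placeholder.
def preprocess(tweet):
    media = tweet.get("entities", {}).get("media", [])
    urls = tweet.get("entities", {}).get("urls", [])
    return {
        "id": tweet["id_str"],
        "timestamp": tweet["created_at"],
        "text": tweet["text"],
        "user": {
            "id": tweet["user"]["id_str"],
            "name": tweet["user"]["screen_name"],
            "followers": tweet["user"]["followers_count"],
        },
        "image_urls": [m["media_url"] for m in media],
        "link_urls": [u["expanded_url"] for u in urls],
    }
```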

We are able to place data into, and extract data from, cluster file systems like HDFS and object stores such as OpenStack Swift. The Search API yields around 1% of the latest tweets. We are exploring means to sample more data and to vary the distribution of the samples.
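As a hedged illustration of this data placement step (not the benchmark's actual tooling), the sketch below copies a daily bundle into HDFS via the standard hadoop fs CLI and stores the same bundle as one object in Swift using python-swiftclient; endpoint, credentials and paths are placeholders.

```python
# Sketch: place a daily bundle into HDFS and into a Swift container.
# All names below are placeholders.
import subprocess
from swiftclient.client import Connection

bundle = "raw-2016-01-15.json"

# HDFS: copy the local bundle into the benchmark's raw directory.
subprocess.check_call(["hadoop", "fs", "-put", "-f", bundle,
                       "/osn-bench/raw/" + bundle])

# Swift: store the same bundle as one object.
conn = Connection(authurl="http://swift.example.com/auth/v1.0",
                  user="benchmark:user", key="secret")
with open(bundle, "rb") as f:
    conn.put_object("osn-bench-raw", bundle, contents=f)
```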

3.2 Workload Provider

We have identified a comprehensive set of workloads that represent the different online social network analytics that can be performed on the data collected by the data generator. This set of workloads is described in Table 1. In this table, the first column, Source Type, gives the type of data source used (i.e., plain text files, graphs, etc.); the second column, Category, gives the category of the data analysis; the third column, Analysis Type, describes the kind of analysis to be performed on the data; and the fourth column, Methods, lists the most common algorithms or heuristics used to carry out the corresponding analysis.

Table 1. Description of workloads

Our set of workloads covers a broad range of techniques used in the analysis of online social network data. Some workloads are already being integrated into the benchmark using existing Hadoop or Spark libraries. For example, in the category of media analytics, the HIPI library provides an image processing interface for Hadoop. For text analytics, different implementations of LDA for topic modeling are available for Hadoop and Spark, and the same is true for item recommendations. We plan to open up our APIs for third parties to add new workloads to the benchmark.
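As one example of how a Spark-based text analytics workload could plug in, the following sketch runs LDA topic modeling over a daily tweet bundle using Spark's built-in ML library (one of several possible implementations, not necessarily the one the benchmark ships); paths, column names and parameters are illustrative.

```python
# Sketch: LDA topic modeling over the text of one daily tweet bundle,
# using pyspark.ml. Paths and hyperparameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("osn-lda").getOrCreate()
tweets = spark.read.parquet("hdfs:///osn-bench/bundles/2016-01-15.parquet")

# Tokenize tweet text and build a bag-of-words representation.
tokens = RegexTokenizer(inputCol="text", outputCol="words",
                        pattern="\\W+").transform(tweets)
vectorized = CountVectorizer(inputCol="words", outputCol="features",
                             vocabSize=10000).fit(tokens).transform(tokens)

# Fit an LDA model and print the top terms of each discovered topic.
model = LDA(k=20, maxIter=50).fit(vectorized)
model.describeTopics(5).show()
```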

3.3 Demo Plug-Ins

It is a goal of our benchmark to provide out-of-the-box applications to be used for demonstrating big data solutions. These demo applications will be built on the social network analytics workloads in the benchmark. We believe it is important not only to demonstrate the efficiency of novel big data systems or algorithms via the benchmark, but also to showcase what exciting applications these systems or features enable. New system features and analytics algorithms run the risk of being under-appreciated, especially from a business or end-user point of view, if their use cases are not properly highlighted.

In our current prototype, we have included an application that recommends the most promising stock combinations to purchase based on the sentiment expressed about the corresponding companies on Twitter. The application is based on the demonstrated correlation between tweet sentiment and stock market movement [8]. It currently directs the data generator to crawl tweets related to companies in the S&P 100 index. The application currently leverages two workloads: a MapReduce-based sentiment analysis workload, which classifies each tweet downloaded in a given period as positive, negative or neutral, and further aggregates the per-tweet classifications at the stock and portfolio (stock combination) level; and a query workload that finds the top N portfolios with the most positive sentiment. We are adding a third workload that analyzes each tweet live in a streaming fashion (vs. the batch fashion of the MapReduce workload). The query workload is linked to a GUI, which displays the recommended stock combinations.
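The sketch below illustrates, in plain Python rather than the actual MapReduce and query jobs, the aggregation logic behind the demo: per-tweet sentiment labels are summed per stock, and every portfolio of a fixed size is then ranked. The label encoding (+1/0/-1) and the portfolio size are illustrative choices.

```python
# Sketch of the aggregation and top-N query behind the stock demo.
from collections import defaultdict
from itertools import combinations

def stock_scores(classified_tweets):
    """classified_tweets: iterable of (ticker, label), label in {+1, 0, -1}."""
    scores = defaultdict(float)
    for ticker, label in classified_tweets:
        scores[ticker] += label
    return scores

def top_portfolios(scores, size=3, n=5):
    """Score every portfolio of `size` tickers and return the n best."""
    ranked = sorted(
        ((sum(scores[t] for t in combo), combo)
         for combo in combinations(sorted(scores), size)),
        reverse=True)
    return ranked[:n]
```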

We have used the stock recommendation application at several major conferences to demo different big data innovations (see [12] for an example). We have found that the application helped engage people who would otherwise not pay attention to the features or would fail to appreciate their value. The benchmark proposed in this paper is, to our knowledge, the first to introduce the notion of functionality demonstration. It is our hope that benchmarking results can be better understood and valued when given a compelling context. We plan to identify and bundle additional applications in the benchmark and to provide the facilities (e.g. script wrappers that make benchmark workloads easy to integrate into a GUI) for the community to do so.

3.4 Metrics Evaluator

Within the benchmark, we focus on two types of metrics: system performance metrics and analytics accuracy metrics. While the former concentrate on latency, throughput, system utilization, power consumption and scalability, the latter focus on how accurately the data analytics algorithms can perform prediction or classification.

System performance metrics are common across all the workload types within the benchmark. A tradeoff is often needed between latency and throughput when running online social network applications on data processing frameworks. With a given cluster size and setup, the lower the latency and the higher the throughput the system can offer, the better it is. For streaming applications and interactive queries, we plan to enable users to evaluate and identify the peak throughput the system can achieve by increasing the data set size and varying the data arrival interval under a user-specified latency constraint. Moreover, system utilization metrics help users pinpoint performance bottlenecks and compare frameworks. Given a fixed number of machines, a better system exhibits lower and more balanced resource utilization; in other words, resource utilization offers deeper insight into why an underlying system can achieve higher throughput while maintaining low latency. Power consumption is a utilization metric that can help vastly reduce the cost of operating a data center, and our benchmark guides users towards big data solutions that consume less power when other metrics are comparable. In addition, the benchmark includes scalability as one of its metrics. A good system should be able to increase throughput while maintaining low latency as the number of machines increases.
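As a sketch of how the evaluator might locate peak throughput under a latency constraint, the following ramp-up loop is one possible policy; the run_workload hook (which replays data at a given arrival rate and returns observed latencies) is hypothetical, not the benchmark's actual harness.

```python
# Sketch: find the highest arrival rate whose 99th-percentile latency
# still meets a user-specified SLO. `run_workload` is a hypothetical hook.
def peak_throughput(run_workload, latency_slo_ms, start_rate=100, step=100):
    rate, best = start_rate, 0
    while True:
        latencies = sorted(run_workload(arrival_rate=rate))  # latencies in ms
        idx = max(int(0.99 * len(latencies)) - 1, 0)
        if latencies[idx] > latency_slo_ms:
            return best              # last rate that met the latency SLO
        best = rate
        rate += step
```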

For analytics accuracy metrics, we initially choose the confusion matrix, precision, recall, F-measure and Receiver Operating Characteristic (ROC) curves, as defined in [9]. The confusion matrix records the basic true and false positives and true and false negatives. Precision and recall capture how well the algorithm detects true positives while limiting false positives and false negatives, respectively. ROC curves, in turn, show how the true positive rate changes as the false positive rate varies.
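A minimal sketch of computing these accuracy metrics for a binary classifier, assuming scikit-learn is available on the evaluation host; the labels and scores below are placeholders.

```python
# Sketch: accuracy metrics for a binary classifier from predicted and
# true labels. All input values are placeholders.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_curve)

y_true = [1, 0, 1, 1, 0, 1]               # ground-truth labels (placeholder)
y_pred = [1, 0, 0, 1, 0, 1]               # classifier output (placeholder)
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]  # classifier confidence (placeholder)

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
```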

4 Use Cases

Our benchmark can be used to evaluate new analytics algorithms, novel big data systems, or optimizations made to existing systems. Its evaluation can focus on an entire big data environment or on specific subsystems such as data storage. Herein we discuss several use cases, although space constraints preclude us from providing extensive details.

4.1 Evaluating Big Data QoS Impact

The increasing multi-tenant nature of the cloud and big data environments requires differentiation between analytics workloads sharing the same underlying system and competing for resources. In previous work [12], we created a Quality of Service (QoS) capability for big data that maintains relative priority between workloads from Hadoop MapReduce through to back-end storage.

We used the demo application (see Sect. 3.3) in our benchmark to evaluate the QoS feature via a with-vs-without comparison. The QoS feature was first disabled so that one could observe the severity of storage bandwidth contention between (a) a batch analytics workload that scores stock combinations based on their corresponding tweets and (b) an interactive workload that queries the top N stock combinations. Later in the run the QoS feature was enabled to resolve the contention. Our benchmark, in this case, allowed us to observe the percentage improvement in interactive query response time attributable to the intelligent control.

This is an evaluation that would not have been possible with any existing big data benchmark reviewed in Sect. 5, because they lack either the means of acquiring real Twitter data or the support for the required big data workloads.

4.2 Constructing an Object Store Benchmark

Object storage [5] has been rapidly gaining momentum in recent times, due largely to its ability to scale to a large number of small objects and its storage-agnostic RESTful APIs. Running analytics on object storage is a very recent development, and there is a clear void in assessing the performance of object storage against these new workloads. Existing object storage benchmarks, such as COSBench, are merely simplistic read/write request generators. We are constructing a sub-benchmark for object stores using selected combinations of benchmark data and analytics workloads, as shown in Table 2.

Table 2. Sub-benchmark for object storage

We have chosen linked images and linked web documents as the data set because they typically range from kilobytes to megabytes in size and a large number of tweets link to such data. Currently we rely on the MapReduce implementation of PCA, but hope to include a Spark implementation in the future. Note that no benchmark reviewed in Sect. 5 directly provides image data, which is arguably the data type most suited to object stores, since images are small, abundant and unstructured.
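As an illustration of how this data could be staged for the object-store sub-benchmark (not the benchmark's actual loader), the sketch below downloads images linked from tweets and stores each as one Swift object; the credentials, container name and tweet record format are placeholders.

```python
# Sketch: fetch images linked from tweets and store each as one object in
# an OpenStack Swift container. All names below are placeholders.
import requests
from swiftclient.client import Connection

conn = Connection(authurl="http://swift.example.com/auth/v1.0",
                  user="benchmark:user", key="secret")

def store_linked_images(tweets, container="osn-images"):
    for tweet in tweets:
        for i, url in enumerate(tweet.get("image_urls", [])):
            img = requests.get(url, timeout=10)
            if img.status_code == 200:
                name = "%s-%d.jpg" % (tweet["id"], i)
                conn.put_object(container, name, contents=img.content,
                                content_type=img.headers.get("Content-Type"))
```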

4.3 Evaluating Spark Core Optimization

The increasingly wide adoption of Spark has spurred broad interest in optimizing its core system for better performance. Our benchmark can be used to evaluate the effectiveness of Spark performance optimizations such as scheduling policies, memory management and caching. In the near future, we plan to use the workloads contained in the benchmark to evaluate specific Spark optimizations of this kind.
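As one example of such an evaluation, the hedged sketch below times an iterative filtering job over a tweet bundle under two Spark persistence levels; the workload and paths are illustrative stand-ins for the benchmark's real iterative analytics.

```python
# Sketch: compare two Spark caching configurations by timing an iterative
# job over one tweet bundle. Paths and the workload are placeholders.
import time
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-compare").getOrCreate()
tweets = spark.read.parquet("hdfs:///osn-bench/bundles/2016-01-15.parquet")

def timed_iterations(df, level, iters=10):
    df = df.persist(level)
    df.count()                       # materialize the cache
    start = time.time()
    for _ in range(iters):
        df.filter(df.text.contains("spark")).count()
    elapsed = time.time() - start
    df.unpersist()
    return elapsed

print("MEMORY_ONLY     :", timed_iterations(tweets, StorageLevel.MEMORY_ONLY))
print("MEMORY_AND_DISK :", timed_iterations(tweets, StorageLevel.MEMORY_AND_DISK))
```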

Table 3. Big data benchmarking: state of the art

5 Related Work

To the best of our knowledge, this is the first work that proposes a big data benchmark tailored to online social network data and analytics. Table 3 summarizes the most relevant benchmarks [2–4, 11]. Although most of these benchmarks (except LDBC-SNB [2]) are presented as capable of evaluating different big data platforms, they lack mechanisms to collect and analyze live data. These benchmarks also support only a relatively narrow set of online social network workloads, mostly focusing on graph analytics and/or micro-benchmarks.

Amongst the reviewed benchmarks, LDBC-SNB and BigDataBench [11] are the most closely related. LDBC-SNB proposes a social network data generator and an interactive workload for structured and semi-structured data. Unlike our benchmark, LDBC-SNB focuses solely on the generation of synthetic data and offers mostly graph analysis workloads. Note that LDBC-SNB does not seem to provide any benchmarking workloads based on big data platforms such as Spark. BigDataBench is presented as a benchmark that offers workloads for Internet services such as e-commerce, search engines and social networks. Although BigDataBench provides workloads covering several categories of web services, the number of workloads is small for each category (including the online social network category). Furthermore, BigDataBench adopts a synthetic data approach. In comparison, our benchmark provides a broad set of online social network workloads and uses real data.

Literature on data crawling and simulation is also relevant. Studies on capturing and analyzing data from online social networks have shown the evolving nature of these data sources [7], motivating the need to capture data of different sizes and time frames, and in real time. This is the philosophy our benchmark tries to follow. By contrast, existing work on data simulation is mainly designed to create synthetic data sets [6, 10] and lacks evaluation methods that support the veracity of the data being generated for benchmarking.

6 Concluding Remarks

Although it is still work in progress, our benchmark has taken an important step towards an inaugural online-social-network-centric benchmark for big data. We plan to continue extending the benchmark with regard to workload coverage and data variety (e.g. the Twitter follower graph and data from other public networks). We are conducting extensive experiments to verify the usefulness of the benchmark and, where possible, to quantify the benefits it offers compared to existing benchmarks. We also hope to add more compelling demos.