1 Introduction

In the online eCommerce world, a site has suppliers with services or goods that meet the needs of consumers, e.g., shops selling their wares, property owners offering short-term rentals, advertisers competing for impressions and clicks on social media sites. Enterprises operating the sites have always had powerful analytical platforms for in-house business analysts to make sense of the impressions, clicks, sales, rents, rides, etc. However, suppliers on the site, big or small, are business owners in their own right, and they need analytical capability of their own to gain deeper insights and to optimize their businesses. For the rest of the paper, we use the term user to refer to a supplier.

While a site has a huge amount of data, each supplier or user on the site owns only a small fraction of the whole, and one user cannot access the data of another user. There can be up to a hundred million users, but each user usually has no more than tens of millions of records or rows of data. So, an analytical platform for these users has to provide maximum concurrency for the large community while giving each user access to only a small fraction of the whole data set.

This leads to an architecture that compresses each user’s data into a finite set of blobs. Each blob contains a user’s data, organized as rows of a certain data type for a given month. Each blob can be accessed via a composite key of user-type-month. A key-value store, where a value is a persisted blob, or a distributed blob store, can deliver a blob in <50 ms to a query node and can support a large number of concurrent reads. A thread on a query node, using the compression and query algorithm described in [1], can join, filter and aggregate millions of rows contained in a blob in a second.
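
As a concrete illustration of this access pattern, the blob store can be viewed as a key-value interface keyed by the user-type-month triple. The following Java sketch is purely illustrative and is not the actual store API.

import java.util.Optional;

// Hypothetical view of the blob store as a key-value interface keyed by the
// composite user-type-month key; names are illustrative, not the actual API.
public interface BlobStore {

    // Composite key, e.g., (42001123, "impression", "2024-05").
    record BlobKey(long userId, String type, String month) {}

    Optional<byte[]> get(BlobKey key);   // delivers a compressed blob to a query node
    void put(BlobKey key, byte[] blob);  // associates the key with a (new) blob
}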

figure a

This architecture decouples compute from storage and allows any compute node to serve a user’s request by retrieving the relevant blobs from the store, wherever those blobs reside. A distributed blob store usually does a good job of replicating blobs to each geography, so the compute node simply gets the blobs from the same geography. The scheme of processing the data where the data resides may not be the best approach given the large number of concurrent users involved. With 100 million users, even using 1,000 data nodes in the store (eBay uses far fewer than that) would mean 100 K users per node, and each node would become a severe compute bottleneck for those users.

For the rest of the paper, we’ll use the term distributed blob store, or simply blob store, to differentiate it from a distributed key-value in-memory hash where a value is relatively small and is always cached in memory. We’ll also use the term blob to refer to the value.

This paper describes an approach to ingesting data into blobs in such an architecture. We’ll use the example of 1,000,000,000,000, or 1 trillion, real-time impressions per day, 100,000,000, or 100 million, users on the site, and 1,000,000,000 listings, where each listing is an item to sell, a property to rent, a car and driver to provide rides, or an ad to show.

The impression stream is used as the basis for discussion in this paper because impressions tend to have the largest volume of all the data. The mechanisms discussed can obviously be applied to other types of streams with smaller volumes.

This paper focuses on the core concepts involved in solving the problem at hand. Given the challenges involved in implementing a reliable distributed stream processing application, topics such as fault-tolerance, exactly-once delivery semantics, etc., are outside the scope of this paper.

2 Ingestion to Blob Store

Ingesting an impression from the stream involves using the user (and the type impression for the current month) as a key to retrieve the existing blob containing the rows of impression data already ingested, creating a new blob by merging all the existing impressions with the new impression, deleting the existing blob, and associating the key with the newly created blob. Any query from this point on using the composite key of user-type-month will retrieve the new blob containing the new impression data.
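
The read-merge-write cycle can be sketched as follows, building on the hypothetical BlobStore interface shown in the introduction; the codec stands in for the compression algorithm of [1], and all names are illustrative assumptions rather than the actual implementation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ingesting one impression into a user's monthly blob.
// BlobStore and BlobKey are the illustrative interface from the introduction.
public final class MonthlyIngestor {

    public record Impression(long listingId, long timestampMillis) {}

    public interface ImpressionCodec {               // compression per [1], assumed
        byte[] compress(List<Impression> rows);
        List<Impression> decompress(byte[] blob);
    }

    private final BlobStore store;
    private final ImpressionCodec codec;

    public MonthlyIngestor(BlobStore store, ImpressionCodec codec) {
        this.store = store;
        this.codec = codec;
    }

    public void ingest(long userId, String month, Impression impression) {
        BlobStore.BlobKey key = new BlobStore.BlobKey(userId, "impression", month);

        // 1. Retrieve and decompress the existing blob for user-type-month, if any.
        List<Impression> rows = store.get(key)
                .map(codec::decompress)
                .map(ArrayList::new)
                .orElseGet(ArrayList::new);

        // 2. Merge the new impression with the previously ingested rows.
        rows.add(impression);

        // 3. Compress into a new blob and associate the key with it,
        //    replacing the existing blob.
        store.put(key, codec.compress(rows));
    }
}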

This process repeats for the blob representing the current month (or week, or quarter). At the beginning of the next month, a new blob is created with the first impression of the month, and the process repeats only for this new blob for the new, now current, month. The past month (the former current month) is left untouched, unless some data arrives during the current month with a time stamp indicating it belongs to the past month.

We’ll use the terms write, write to the blob store, or update the blob to refer to the process of creating a new blob and replacing an existing blob with it.

3 Challenges of Ingestion

The architecture uses a key-value, distributed blob store as the backend storage system, and this presents a unique challenge. To add a new impression to the blob that contains a user’s impression data for the current month, the existing impression data must be merged with the new impression to generate a new blob, which then replaces the existing blob for the composite key of user-type-month. Ingesting one trillion real-time impressions per day would mean generating 11,574,074 new blobs to replace existing blobs every second, or more than 11 million writes to the blob store per second. To further compound the problem, users are likely to have impressions at the same moment (possibly for different listings) processed by different threads of the system, triggering a race condition. The high volume and concurrency are beyond the realm of feasibility for any blob store on the market.

The rest of the paper discusses how to bring the number of writes to the blob store per second down to a reasonable level, and how to do it without concurrent writes.

4 Partition of the Stream

We assume impressions arrive in the form of a stream (there are many streaming platforms, e.g., Kafka [2]), and the consumer of the stream is the first touch point for the impression data. The consumer of the stream comprises many threads. Impressions of the same user can potentially be consumed by many threads concurrently, each trying to write to the same monthly blob for the same user to ingest an impression. This triggers a highly frequent race condition on the monthly blob.

To avoid the problem, we need to divert the stream in an organized way and partition the impressions by user ID. For simplicity, let’s assume there is a unique numerical ID for each user, the user ID, and that impressions are partitioned by the user ID modulo 1,000 (of course, the number 1,000 can be fine-tuned depending on the actual size of the user community). A user ID generated from the hash code of the user identifier, if that identifier is alphanumeric in nature, will serve the purpose.

The partition of the stream can be achieved through Kafka. The consumer at the first touch point simply adds the user ID to the impression and produces the impression message down the pipeline for a second consumer. Kafka can be configured to dispatch impressions with the same modulo of the user ID to the same thread of the second consumer. We’ll refer to this second consumer as the partition consumer, or partition thread.
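
One possible way to realize this dispatch with the standard Kafka producer API is a custom Partitioner that routes each record to partition (user ID mod number of partitions). This is only a sketch: it assumes the record key of the second-stage topic carries the numeric user ID as a decimal string, and that the topic is created with 1,000 partitions.

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical partitioner: impressions with the same (user ID mod N) land in
// the same partition and therefore on the same partition-consumer thread.
public final class UserIdModPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();   // e.g., 1,000
        // Assumption: the key is the numeric user ID rendered as a string.
        long userId = Long.parseLong(new String(keyBytes, StandardCharsets.UTF_8));
        return (int) (userId % numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

The producer of the second-stage topic would then be configured with partitioner.class pointing at this class. Note that even Kafka’s default key-based partitioning would satisfy the requirement, since identical keys always map to the same partition; the explicit mod simply matches the scheme described above.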

figure b

Given a modulus of 1,000, there will be 1,000 threads, each consuming a sub-stream of impression data based on user ID mod 1,000. Each thread will consume 1 billion impressions a day, or 11,574 impressions per second.

Partitioning by user ID does not reduce the overall number of impressions to be ingested, or the number of writes required to the blob store, but it does eliminate the race condition, because partitioning ensures that all impressions of the same user are ingested by only one partition consumer thread.

5 Ingestion Interval

To reduce the number of writes, an ingestion interval is introduced. The partition consumer thread receiving impressions accumulates them in memory and only ingests them into a blob at the end of the interval. To make further discussion easier, we assume the impressions in memory are organized as a hash of user ID to a list of that user’s impressions, or in Java notation Map<User,List<Impression>>.
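
A minimal sketch of this per-partition buffer follows, using a numeric user ID as the map key for brevity; all types are illustrative placeholders.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory buffer of a partition-consumer thread: impressions
// are grouped per user and flushed to the blob store at the end of each
// ingestion interval.
public final class IntervalBuffer {

    public record Impression(long userId, long listingId, long timestampMillis) {}

    public interface BlobWriter {
        void write(Long userId, List<Impression> impressions);
    }

    private final Map<Long, List<Impression>> byUser = new HashMap<>();

    // Called for every impression consumed from the partition's sub-stream.
    public void accumulate(Impression impression) {
        byUser.computeIfAbsent(impression.userId(), id -> new ArrayList<>())
              .add(impression);
    }

    // Called once per interval (e.g., every 60 s): hand the buffered
    // impressions to the writer that updates the blobs, then reset.
    public void flush(BlobWriter writer) {
        byUser.forEach(writer::write);
        byUser.clear();
    }
}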

For the purpose of this paper, we use an interval of 1 min, or 60 s. On average, each user can potentially have some impressions every minute, triggering a write to the blob store. That is 100 million writes to the blob store every minute, or 1,666,667 writes per second. Much better than 11 million writes per second, but still too high.

Note that impressions will not be evenly distributed across users or across minutes. The effect of uneven distribution on the overall architecture and algorithm is discussed in a later section.

It is worthwhile to point out that lengthening the interval will reduce the number of writes to the blob store but has the undesirable side effect of increasing the memory footprint of each partition, as it has to hold more impressions before flushing them out. We will first continue to tackle the problem of reducing writes to the blob store, then discuss this side effect.

6 Daily Blob for Multiple Users

Each partition thread is still processing impressions for 100 million / 1,000 = 100,000 users. To reduce the overall number of writes to the blob store, we’ll introduce a daily blob. A daily blob contains 1,000 users’ impressions, but only for the current day. Of course, the number 1,000 can be adjusted based on the actual environment and is a different, independent number from the number of partitions.

Given a user ID of the form nnnnnnxxxppp, where ppp is the user ID mod 1,000 and determines the partition, we simply mask out xxx, the 3 digits to the left of ppp, to 000, and use nnnnnn000ppp as the user ID key to determine the blob. This effectively combines 1,000 users’ impressions into one blob. We only add the impressions of the current day to this daily blob.
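
The masking itself is simple integer arithmetic; a hypothetical helper might look like this.

// Hypothetical sketch of deriving the shared daily-blob key from a user ID of
// the form nnnnnnxxxppp: the three digits to the left of the partition digits
// ppp are forced to 000, so 1,000 users map to the same key.
public final class DailyBlobKey {

    // e.g., maskUserId(123456789123L) == 123456000123L
    public static long maskUserId(long userId) {
        long ppp = userId % 1_000;          // partition digits, kept as-is
        long upper = userId / 1_000_000;    // digits above the masked xxx
        return upper * 1_000_000 + ppp;     // xxx becomes 000
    }
}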

figure c

The partition thread writes 1,000 users’ impressions to one shared daily blob, instead of writing 1,000 users’ impressions to 1,000 monthly blobs (one for each user). This effectively reduces the number of writes to the blob store by a factor of 1,000, from 1,666,667 writes per second to 1,667 writes per second.

Although the daily blob now contains multiple users’ impressions, adding a filter condition on user ID eliminates other users’ impressions from the result of a user’s query.

The algorithm to compress and decompress data, and the algorithm to execute a SQL-like query, as described in [1], can be further optimized to speed up this special filtering by user ID.

7 Merging the Daily Blob with Monthly Blob

At the beginning of a new day, a new daily blob is created, and we need to merge the daily blob for the previous day with the current monthly blob. Given there are 100 million users, there are 100 million monthly blobs, and each needs one write to add the impressions from the previous day, that is, 100 million writes. This, of course, cannot be accomplished in an instant when the day switches. Assuming we spread the writes over an entire day, that is 100 million writes in 24 h, or 1,157 writes per second to merge the previous day’s daily blobs into the current monthly blobs.
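
A minimal sketch of pacing these merges is shown below, assuming each partition thread walks its own roughly 100,000 users once per day; the single-threaded loop and all names are illustrative simplifications.

import java.util.List;

// Hypothetical pacing of the daily-to-monthly merge: one monthly-blob write
// per user, spread evenly over 24 h instead of bursting at midnight.
public final class DailyToMonthlyMerger {

    public interface MonthlyWriter {
        // Merges one user's rows from yesterday's daily blob into that
        // user's current monthly blob (one read-merge-write per user).
        void mergeUser(long userId);
    }

    public static void mergeSpreadOverDay(List<Long> userIds, MonthlyWriter writer)
            throws InterruptedException {
        long dayMillis = 24L * 60 * 60 * 1000;
        long pauseMillis = dayMillis / Math.max(1, userIds.size());   // ~864 ms for 100,000 users
        for (long userId : userIds) {
            writer.mergeUser(userId);
            Thread.sleep(pauseMillis);   // pace the writes across the whole day
        }
    }
}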

Adding the writes per second needed to merge yesterday’s daily blobs into the monthly blobs to the writes per second needed to update the current daily blobs, we arrive at 2,824 writes per second, which is easily within the scalability of any distributed blob store today.

Note that the two daily blobs for the previous day and the current day, and the blobs for the current month, are all instantly available for queries. It is not difficult to determine whether a monthly blob already contains the impressions from the previous day and, if not, to include the previous day’s daily blob as an extra blob, the same way the current daily blob is included in a query.
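
A minimal sketch of this query-time blob selection follows; the key format and the flag indicating whether yesterday’s merge has completed are hypothetical.

import java.util.ArrayList;
import java.util.List;

// Hypothetical selection of the blobs a query over the current month must
// read: the user's monthly blob, today's shared daily blob, and, if not yet
// merged, yesterday's shared daily blob.
public final class QueryBlobSelector {

    public record BlobRef(String key) {}

    public static List<BlobRef> blobsForCurrentMonth(long userId, String month,
                                                     String today, String yesterday,
                                                     boolean yesterdayAlreadyMerged) {
        List<BlobRef> blobs = new ArrayList<>();
        blobs.add(new BlobRef(userId + ":impression:" + month));                 // monthly blob
        blobs.add(new BlobRef(maskedUserId(userId) + ":impression:" + today));   // current daily blob
        if (!yesterdayAlreadyMerged) {
            blobs.add(new BlobRef(maskedUserId(userId) + ":impression:" + yesterday));
        }
        return blobs;
    }

    private static long maskedUserId(long userId) {
        return (userId / 1_000_000) * 1_000_000 + userId % 1_000;   // user ID masking of the daily-blob scheme
    }
}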

figure d

This mechanism ensures that up-to-date impressions are included in the online users’ real-time aggregation. In fact, this mechanism makes the existence of daily blobs entirely transparent to the online users.

Thus, an impression goes through two stages of ingestion, first into the current daily blob and then from the previous day’s daily blob into the current monthly blob. An impression is available for query and aggregation the moment it gets into one of the blobs.

8 Hourly Impression Count Per Listing

Given 1 trillion impressions per day and 100 million users, each user has on average 10,000 impressions per day. The number of impressions in a daily blob of 1,000 users is therefore 10,000 * 1,000 = 10,000,000, or 10 million.

However, given the volume of impression data, a user is usually not interested in each and every impression but rather in an aggregated value, for example, hourly impressions per listing.

With the assumption of 1 billion listings and 100 million users, each user has on average 10 listings. Assuming each listing has at least one impression in each hour, each listing’s impression data needs at most 24 rows per day, one for each hour. A given user will have 10 * 24 = 240 rows for a day. A daily blob of 1,000 users will have 240 * 1,000 = 240,000 rows.

The monthly blob for a single user will contain on average 240 * 31 = 7,440 rows.

The scan and query algorithm described in [1] can process millions of rows in a second.

Now, the hourly impression count absorbs more impressions by simply adding to the count, while maintaining the same number of rows in a blob. Though we used 1 trillion impressions per day, the metrics are very much the same if the daily impression volume is 10 trillion.

As pointed out earlier, combining users into a partition has the undesirable side effect of increasing the memory footprint of the partition, but we can now counter that with a desirable side effect: aggregating to the current hour, minute by minute, in the partition and in memory. The hourly impression aggregation dramatically reduces the number of rows in the daily blob and, by the same token, reduces the memory footprint of a partition.
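
A minimal sketch of this in-memory hourly aggregation follows; the types are illustrative, and a real key would also carry the date.

import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory hourly aggregation inside a partition thread: only a
// count per (user, listing, hour) is kept, rather than the raw impressions.
public final class HourlyAggregator {

    public record HourKey(long userId, long listingId, int hourOfDay) {}

    private final Map<HourKey, Long> counts = new HashMap<>();

    public void add(long userId, long listingId, long timestampMillis) {
        int hourOfDay = (int) ((timestampMillis / 3_600_000L) % 24);   // UTC hour bucket
        counts.merge(new HourKey(userId, listingId, hourOfDay), 1L, Long::sum);
    }

    // Called once per ingestion interval: hands the per-interval counts to the
    // daily-blob writer (which increments the matching hour rows), then resets.
    public Map<HourKey, Long> drain() {
        Map<HourKey, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }
}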

9 Incremental Impression Count for the Current Hour

A daily blob contains one row per user per listing per hour with the count of the impressions for the listing within that hour. However, the daily impression blob is refreshed every minute, so the row for a given listing and the current hour has its impression count incremented minute by minute until the next hour, when a new row is created.
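
A minimal sketch of how one flush is applied to the rows of the daily blob, assuming the blob has been decompressed into a map keyed by (user, listing, hour); all names are illustrative.

import java.util.Map;

// Hypothetical merge of a per-interval flush into the daily blob's rows: an
// existing (user, listing, hour) row has its count incremented, while a new
// hour simply produces a new row.
public final class DailyBlobMerge {

    public record RowKey(long userId, long listingId, int hourOfDay) {}

    // `rows` is the decompressed content of the daily blob; `deltas` is the
    // per-interval output of the in-memory hourly aggregator.
    public static void apply(Map<RowKey, Long> rows, Map<RowKey, Long> deltas) {
        deltas.forEach((key, delta) -> rows.merge(key, delta, Long::sum));
        // The updated rows are then re-compressed into a new blob that replaces
        // the previous daily blob under the same composite key.
    }
}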

figure e

This enables a dashboard that polls the daily blob at a certain rate, say once per minute, to see the incremental impression count for a given listing minute by minute for the current hour and chart it in a graph. Once time advances to the next hour, the past hour will only have a single hourly count, but one can still chart the daily impressions hour by hour. This should be sufficient for most analytical purposes.
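
A minimal sketch of such a poller follows, assuming a hypothetical reader for the (user, listing, current hour) row.

// Hypothetical dashboard poller: reads the current hour's count once per
// minute and charts the minute-over-minute increment.
public final class ImpressionPoller {

    public interface DailyBlobReader {
        // Returns the count in the (user, listing, current hour) row, 0 if absent.
        long currentHourCount(long userId, long listingId);
    }

    private long lastSeen = 0;

    public long pollIncrement(DailyBlobReader reader, long userId, long listingId) {
        long current = reader.currentHourCount(userId, listingId);
        long increment = current - lastSeen;   // impressions since the last poll
        lastSeen = current;                    // a real poller would also reset at the hour boundary
        return increment;
    }
}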

While most of the parameters in this paper can be changed based on the application, we believe it is essential to maintain the hourly impression count even for historical data.

As communication becomes cheaper and faster, covering ever longer distances, a business owner often finds their customers not only in a different time zone but in an entirely different country (and often speaking a different language). For a US owner of a short-term rental in Australia, with an ad in Britain hoping to catch potential customers visiting Australia for a Commonwealth rugby game, it is quite useless to show daily impressions based on any US time zone.

The hourly impression count enables the user to view the daily impression trend in any time zone, by simply assembling the relevant hours of the local day.
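
A minimal sketch of assembling such a local-day view, assuming hourly counts keyed by the UTC hour-start instant; the key format is an assumption.

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical re-assembly of a daily trend in an arbitrary time zone from
// hourly counts keyed by the hour-start instant in UTC.
public final class LocalDayView {

    // Returns local hour-of-day -> impression count for the requested local date.
    public static Map<Integer, Long> localDay(Map<Instant, Long> hourlyCounts,
                                              ZoneId zone, LocalDate localDate) {
        Map<Integer, Long> byLocalHour = new TreeMap<>();
        hourlyCounts.forEach((hourStart, count) -> {
            var local = hourStart.atZone(zone);
            if (local.toLocalDate().equals(localDate)) {
                byLocalHour.merge(local.getHour(), count, Long::sum);
            }
        });
        return byLocalHour;
    }
}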

10 Implementation

This paper outlines an approach to solving the problem at hand, using the example of ingesting one trillion real-time impressions per day, without delving into the details of implementation. Our overall Big Data system involves three major components, i.e., a real-time ingestion pipeline, a blob store and a query engine. This paper focuses on a design approach for a highly scalable ingestion pipeline handling an extremely large number of impressions and a very large number of users.

Our current solution involves a home-grown implementation based on the Apache Kafka [2] Java consumer and a custom in-memory aggregator. We have also validated the design ideas using the stream processing frameworks Apache Flink [4] and Kafka Streams [5]. However, it should certainly be possible to leverage other stream processing frameworks such as open source Apache Storm [3], Apache Spark [6] or commercial solutions like Google Cloud Dataflow [7], Amazon Kinesis [8] and Azure Stream Analytics [9] to implement an ingestion pipeline based on the ideas presented here.

11 Uneven Distribution

The paper has so far assumed an even distribution of impressions among users and at each time interval down to the second, which is certainly not the case in a real-life application. It is quite reasonable to assume that not all users have impressions every minute for every listing. In fact, it is quite possible that some users have no impressions for any of their listings in a whole day. A more realistic, uneven distribution involves more complex mathematics and obviously has an impact on how the system works.

To keep it simple, let’s just look at some examples. If a user happens not to have any impressions in a given day, the daily process to merge daily blobs with monthly blobs can skip the step for that user, reducing the number of writes to the blob store by 1. Given the constant overall number of 1 trillion impressions per day, it simply means some other users will have more impressions in their respective daily impression blobs. However, the merge process still needs only 1 write to replace such a user’s monthly blob with a new monthly blob containing the incremental daily impressions, regardless of the number of hourly listing impressions involved. So, it can be generally stated that this process requires fewer writes to the blob store in the case of uneven distribution among users.

As an extreme example, if all 1 trillion impressions in a day are for just one big user, then there is just one daily blob to be updated, and only 24 h * 60 min = 1,440 writes to the blob store in the whole day.

The process of writing impressions to the daily impression blob at the end of every one-minute interval, however, can potentially see less variation with regard to the number of writes. Given the sample configuration of 1,000 users sharing one daily blob, it is unlikely that none of the 1,000 users will have any impressions. As long as one user has a single impression in that one-minute interval, a write to the combined daily blob is required. However, the algorithm described in this paper has already taken that into consideration.

While the number of listings per user and their impressions vary to a great degree, from a single mom-and-pop user to an enterprise user exploring the site as an additional commerce channel, combining 1,000 users into a single daily blob absorbs some of the ups and downs. This implies that the load on each partition will somewhat even out on its own. The daily impression blob is no longer needed once merged into the monthly blobs, providing the opportunity to fine-tune the partitioning of users for future days.

By now it is obvious that although even distribution is unlikely in reality, it actually represents the worst-case scenario in theory in some respects. While uneven distribution does bring additional challenges in some other parts of the algorithm, the nature of the design tends to reduce its effect.

We have shown that the approach can work for the worst case, and there is opportunity to optimize by further fine-tuning the parameters.

12 System at a Glance

The system at eBay, which more or less follows the general architecture described in this paper and [1], currently has ~400 TB of data covering a span of 5+ years and growing. The system holds ~10 billion key-value pairs and serves close to 20 million queries per day using fewer than 50 VMs for compute, while still managing an average query time under 140 ms. The longest query joins 7 types of data, contains 17 selects and 15 group-bys, and is expressed in more than 10,000 characters of text. Depending on the type, data is updated as frequently as every 1 min.

13 Summary

An approach has been described to ingest trillions of real-time impressions per day for analytics by a community of 100 million users with a delay of just minutes. The blueprint laid out in this paper stretches to a scale beyond that of most eCommerce web sites today. It is quite straightforward to build a spreadsheet with the formulas, plug in the numbers, and fine-tune the setup for any given real-life production environment.

With this approach, we believe our architecture of decoupling compute and storage can scale to 100 million online users end-to-end, from data ingestion to serving online real-time aggregation.