research-article

Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Authors:

Jimmy LinAuthors Info & Claims

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 1147 - 1158

https://doi.org/10.1145/2463676.2465290

Published: 22 June 2013 Publication History

Abstract

We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production today, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

References

[1]

C. Aggarwal. Data Streams: Models and Algorithms. Kluwer Academic Publishers, 2007.

Digital Library

[2]

E. Alfonseca, M. Ciaramita, and K. Hall. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In EMNLP, 2009.

Digital Library

[3]

G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in Map-Reduce clusters using Mantri. In OSDI, 2010.

Digital Library

[4]

R. Baraglia, F. M. Nardini, C. Castillo, R. Perego, D. Donato, and F. Silvestri. The effects of time on query flow graph-based models for query suggestion. In RIAO, 2010.

Digital Library

[5]

D. Borthakur, J. Gray, J. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer. Apache Hadoop goes realtime at Facebook. In SIGMOD, 2011.

Digital Library

[6]

M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at Twitter. In ICDE, 2012.

Digital Library

[7]

H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD, 2008.

Digital Library

[8]

D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams--a new class of data management applications. In VLDB, 2002.

Digital Library

[9]

B. Chandramouli, J. Goldstein, and S. Duan. Temporal analytics on big data for web advertising. In ICDE, 2012.

Digital Library

[10]

F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006.

Digital Library

[11]

S. Cucerzan and E. Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In EMNLP, 2004.

[12]

H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma. Query expansion by mining user logs. TKDE, 15(4):829--839, 2003.

Digital Library

[13]

W. Dakka, L. Gravano, and P. Ipeirotis. Answering general time-sensitive queries. TKDE, 24(2):220--235, 2012.

Digital Library

[14]

M. Efron and G. Golovchinsky. Estimation methods for ranking recent information. In SIGIR, 2011.

Digital Library

[15]

Y. Ganjisaffar, R. Caruana, and C. Lopes. Bagging gradient-boosted trees for high precision, low variance ranking models. In SIGIR, 2011.

Digital Library

[16]

A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: The Pig experience. In VLDB, 2009.

Digital Library

[17]

B. Gedik, H. Andrade, K.-L. Wu, P. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. In SIGMOD, 2008.

Digital Library

[18]

J. Gehrke. Special issue on data stream processing. Bulletin of the Technical Committee on Data Engineering, 26(1):2, 2003.

[19]

K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, and V. Ye. Building LinkedIn's real-time activity data pipeline. Bulletin of the Technical Committee on Data Engineering, 35(2):33--45, 2012.

[20]

P. Hunt, M. Konar, F. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for Internet-scale systems. In USENIX, 2010.

Digital Library

[21]

R. Jones and F. Diaz. Temporal profiles of queries. ACM TOIS, 25(3), 2007.

Digital Library

[22]

R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In WWW, 2006.

Digital Library

[23]

J. Koenemann and N. Belkin. A case for interaction: A study of interactive information retrieval behavior and effectiveness. In CHI, 1996.

Digital Library

[24]

J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB Workshop, 2011.

[25]

S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. In SIGMOD, 2010.

Digital Library

[26]

Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: Mitigating skew in MapReduce applications. In SIGMOD, 2012.

Digital Library

[27]

W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. In VLDB, 2012.

Digital Library

[28]

V. Lavrenko and W. Croft. Relevance-based language models. In SIGIR, 2001.

Digital Library

[29]

G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. In VLDB, 2012.

Digital Library

[30]

F. Leibert, J. Mannix, J. Lin, and B. Hamadani. Automatic management of partitioned, replicated search services. In SoCC, 2011.

Digital Library

[31]

H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers, 2011.

Digital Library

[32]

X. Li and W. Croft. Time-based language models. In CIKM, 2003.

Digital Library

[33]

J. Lin and A. Kolcz. Large-scale machine learning at Twitter. In SIGMOD, 2012.

Digital Library

[34]

J. Lin and G. Mishne. A study of "churn" in tweets and real-time search queries. In ICWSM, 2012.

[35]

J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explorations, 14(2):6--19, 2012.

Digital Library

[36]

J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MAPREDUCE Workshop, 2011.

Digital Library

[37]

C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999.

Digital Library

[38]

Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM, 2008.

Digital Library

[39]

S. Mizzaro. How many relevances in information retrieval? Interacting With Computers, 10(3):305--322, 1998.

[40]

C. Moretti, J. Bulosan, D. Thain, and P. Flynn. All-Pairs: An abstraction for data-intensive cloud computing. In IPDPS, 2008.

[41]

L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In KDCloud Workshop at ICDM, 2010.

Digital Library

[42]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008.

Digital Library

[43]

D. Pearce. A comparative evaluation of collocation extraction techniques. In LREC, 2002.

[44]

D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010.

Digital Library

[45]

K. Radinsky, K. Svore, S. Dumais, J. Teevan, A. Bocharov, and E. Horvitz. Modeling and predicting behavioral dynamics on the web. In WWW, 2012.

Digital Library

[46]

J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System--Experiments in Automatic Document Processing. Prentice-Hall, 1971.

[47]

M. Shokouhi. Detecting seasonal queries by time-series analysis. In SIGIR, 2011.

Digital Library

[48]

M. Shokouhi and K. Radinsky. Time sensitive query auto-completion. In SIGIR, 2012.

Digital Library

[49]

A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, 2010.

Digital Library

[50]

M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD, 2004.

Digital Library

[51]

J. Xu and W. Croft. Improving the effectiveness of information retrieval with local context analysis. ACM TOIS, 18(1):79--112, 2000.

Digital Library

Cited By

Theodorakopoulos LTheodoropoulou A(2024)Leveraging Big Data Analytics for Understanding Consumer Behavior in Digital Marketing: A Systematic ReviewHuman Behavior and Emerging Technologies10.1155/2024/36415022024:1Online publication date: 24-Oct-2024
https://doi.org/10.1155/2024/3641502
Xia SShi YZhou YWu YYang LLetaief K(2024)Federated Learning With Massive Random AccessIEEE Transactions on Wireless Communications10.1109/TWC.2024.340545123:10(13856-13871)Online publication date: Oct-2024
https://doi.org/10.1109/TWC.2024.3405451
Ahmad PRai M(2024)Optimized Continuous Quality and Storage Management Model for Big Data Analysis2024 2nd International Conference on Advancements and Key Challenges in Green Energy and Computing (AKGEC)10.1109/AKGEC62572.2024.10868899(1-6)Online publication date: 21-Nov-2024
https://doi.org/10.1109/AKGEC62572.2024.10868899
Show More Cited By

Index Terms

Fast data in the era of big data: Twitter's real-time related query suggestion architecture
1. Information systems
  1. Information retrieval

Recommendations

Big Data Management: Advanced Issues and Approaches

The objective of this article is to provide the advanced issues and approaches of big data management. The literature review indicates the overview of big data management; the aspects of Big Data Analytics BDA; the importance of big data management; the ...
Disease Surveillance System for Big Climate Data Processing and Dengue Transmission

Ambient intelligence is an emerging platform that provides advances in sensors and sensor networks, pervasive computing, and artificial intelligence to capture the real time climate data. This result continuously generates several exabytes of ...
Design and Development of a Medical Big Data Processing System Based on Hadoop

Secondary use of medical big data is increasingly popular in healthcare services and clinical research. Understanding the logic behind medical big data demonstrates tendencies in hospital information technology and shows great significance for hospital ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs:
Kenneth Ross
Columbia University
,
Divesh Srivastava
AT&T Research
,
Program Chair:
Dimitris Papadias
HKUST

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

67
Total Citations
View Citations
2,070
Total Downloads

Downloads (Last 12 months)37
Downloads (Last 6 weeks)6

Reflects downloads up to 20 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Theodorakopoulos LTheodoropoulou A(2024)Leveraging Big Data Analytics for Understanding Consumer Behavior in Digital Marketing: A Systematic ReviewHuman Behavior and Emerging Technologies10.1155/2024/36415022024:1Online publication date: 24-Oct-2024
https://doi.org/10.1155/2024/3641502
Xia SShi YZhou YWu YYang LLetaief K(2024)Federated Learning With Massive Random AccessIEEE Transactions on Wireless Communications10.1109/TWC.2024.340545123:10(13856-13871)Online publication date: Oct-2024
https://doi.org/10.1109/TWC.2024.3405451
Ahmad PRai M(2024)Optimized Continuous Quality and Storage Management Model for Big Data Analysis2024 2nd International Conference on Advancements and Key Challenges in Green Energy and Computing (AKGEC)10.1109/AKGEC62572.2024.10868899(1-6)Online publication date: 21-Nov-2024
https://doi.org/10.1109/AKGEC62572.2024.10868899
Paulraj B(2023)Enhancing Data Engineering Frameworks for Scalable Real-Time Marketing SolutionsIntegrated Journal for Research in Arts and Humanities10.55544/ijrah.3.5.343:5(309-315)Online publication date: 24-Oct-2023
https://doi.org/10.55544/ijrah.3.5.34
Li F(2023)Modernization of Databases in the Cloud Era: Building Databases that Run Like LegosProceedings of the VLDB Endowment10.14778/3611540.361163916:12(4140-4151)Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.14778/3611540.3611639
Wang JLi TSong HYang XZhou WLi FYan BWu QLiang YYing CWang YChen BCai CRuan YWeng XChen SYin LYang CCai XXing HYu NChen XHuang DSun J(2023)PolarDB-IMCI: A Cloud-Native HTAP Database System at AlibabaProceedings of the ACM on Management of Data10.1145/35897851:2(1-25)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589785
Ahmad PRai M(2023)Analysis of Optimization Strategies for Big Data Storage Management: A Study2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC)10.1109/ICESC57686.2023.10193738(1747-1753)Online publication date: 6-Jul-2023
https://doi.org/10.1109/ICESC57686.2023.10193738
Patil TAnand KBhateja AJamal KSawant-Patil SPaygude P(2023)Real-Time Clickstream Data Processing and Visualization Using Apache Tools2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA)10.1109/ICCUBEA58933.2023.10392270(1-5)Online publication date: 18-Aug-2023
https://doi.org/10.1109/ICCUBEA58933.2023.10392270
Garriga MMonsieur GTamburri D(2023)Big Data ArchitecturesData Science for Entrepreneurship10.1007/978-3-031-19554-9_4(63-76)Online publication date: 24-Mar-2023
https://doi.org/10.1007/978-3-031-19554-9_4
Damaskinos GGuerraoui RKermarrec ANitu VPatra RTaiani F(2022)FLeet: Online Federated Learning via Staleness Awareness and Performance PredictionACM Transactions on Intelligent Systems and Technology10.1145/352762113:5(1-30)Online publication date: 23-Sep-2022
https://dl.acm.org/doi/10.1145/3527621
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten