DOI: 10.1145/2463676.2465290
research-article

Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Published: 22 June 2013

Abstract

We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production today, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN: 9781450320375
    DOI: 10.1145/2463676
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 June 2013

    Author Tags

    1. hadoop
    2. log analysis
    3. mapreduce

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 37
    • Downloads (Last 6 weeks): 6
    Reflects downloads up to 20 Feb 2025

    Cited By

    • (2024) Leveraging Big Data Analytics for Understanding Consumer Behavior in Digital Marketing: A Systematic Review. Human Behavior and Emerging Technologies, 2024:1. DOI: 10.1155/2024/3641502. Online publication date: 24-Oct-2024
    • (2024) Federated Learning With Massive Random Access. IEEE Transactions on Wireless Communications, 23:10 (13856-13871). DOI: 10.1109/TWC.2024.3405451. Online publication date: Oct-2024
    • (2024) Optimized Continuous Quality and Storage Management Model for Big Data Analysis. 2024 2nd International Conference on Advancements and Key Challenges in Green Energy and Computing (AKGEC) (1-6). DOI: 10.1109/AKGEC62572.2024.10868899. Online publication date: 21-Nov-2024
    • (2023) Enhancing Data Engineering Frameworks for Scalable Real-Time Marketing Solutions. Integrated Journal for Research in Arts and Humanities, 3:5 (309-315). DOI: 10.55544/ijrah.3.5.34. Online publication date: 24-Oct-2023
    • (2023) Modernization of Databases in the Cloud Era: Building Databases that Run Like Legos. Proceedings of the VLDB Endowment, 16:12 (4140-4151). DOI: 10.14778/3611540.3611639. Online publication date: 1-Aug-2023
    • (2023) PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba. Proceedings of the ACM on Management of Data, 1:2 (1-25). DOI: 10.1145/3589785. Online publication date: 20-Jun-2023
    • (2023) Analysis of Optimization Strategies for Big Data Storage Management: A Study. 2023 4th International Conference on Electronics and Sustainable Communication Systems (ICESC) (1747-1753). DOI: 10.1109/ICESC57686.2023.10193738. Online publication date: 6-Jul-2023
    • (2023) Real-Time Clickstream Data Processing and Visualization Using Apache Tools. 2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA) (1-5). DOI: 10.1109/ICCUBEA58933.2023.10392270. Online publication date: 18-Aug-2023
    • (2023) Big Data Architectures. Data Science for Entrepreneurship (63-76). DOI: 10.1007/978-3-031-19554-9_4. Online publication date: 24-Mar-2023
    • (2022) FLeet: Online Federated Learning via Staleness Awareness and Performance Prediction. ACM Transactions on Intelligent Systems and Technology, 13:5 (1-30). DOI: 10.1145/3527621. Online publication date: 23-Sep-2022