Large-scale machine learning at Twitter

Authors:

Alek Kolcz

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Pages 793 - 804

https://doi.org/10.1145/2213836.2213958

Published: 20 May 2012

Abstract

The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
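
The two techniques the abstract singles out, stochastic gradient descent for online learning and ensemble methods, can be illustrated with a small sketch. This is not the paper's Pig implementation: it is plain Python, and the function names, the synthetic two-cluster data, the learning rate, and the choice of three partitions are all assumptions made for illustration. Each partition trains its own logistic regression model in a single streaming pass per epoch, and the independently trained models are then combined by majority vote.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd_logistic(examples, dim, epochs=5, lr=0.1):
    """Fit logistic regression by SGD, updating after every single example."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0


def ensemble_predict(models, x):
    """Majority vote over independently trained models."""
    votes = sum(predict(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


def make_data(n, rng):
    # Hypothetical synthetic data: two Gaussian clusters, one per class.
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 2.0 if y == 1 else -2.0
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], y))
    return data


rng = random.Random(0)
train = make_data(900, rng)
test = make_data(200, rng)

# Each partition stands in for the slice of data one parallel task would see.
partitions = [train[i::3] for i in range(3)]
models = [train_sgd_logistic(p, dim=2) for p in partitions]

accuracy = sum(ensemble_predict(models, x) == y for x, y in test) / len(test)
print(f"ensemble accuracy: {accuracy:.3f}")
```

Because each model only ever sees its own partition and processes one example at a time, the training step needs no coordination between partitions, which is the property that makes this combination straightforward to scale out.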

Index Terms

Large-scale machine learning at twitter
1. Information systems
  1. Data management systems
    1. Query languages
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query languages (principles)

Information & Contributors

Information

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

May 2012

886 pages

ISBN:9781450312479

DOI:10.1145/2213836

General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)

Program Chair: Luis Gravano (Columbia University)

Publications Chair: Ariel Fuxman (Microsoft Research)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '12

Sponsor:

SIGMOD

SIGMOD/PODS '12: International Conference on Management of Data

May 20 - 24, 2012

Scottsdale, Arizona, USA

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 148
Total Downloads: 3,602
Downloads (last 12 months): 76
Downloads (last 6 weeks): 4

Reflects downloads up to 07 Mar 2025

Citations

Cited By

Jalalian S, Patel S, Hajidehi M, Seltzer M, Fedorova A, Bagchi S, Zhang Y (2024). EXTMEM. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pp. 397-408. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.5555/3691992.3692017
Nowling R (2024). Experience Report from a Graduate ML Production Systems Course. 2024 IEEE International Conference on Electro Information Technology (eIT), pp. 058-065. Online publication date: 30-May-2024.
https://doi.org/10.1109/eIT60633.2024.10609876
Paul J, Das Chatterjee A, Misra D, Majumder S, Rana S, Gain M, De A, Mallick S, Sil J (2024). A survey and comparative study on negative sentiment analysis in social media data. Multimedia Tools and Applications, 83:30, pp. 75243-75292. Online publication date: 15-Feb-2024.
https://doi.org/10.1007/s11042-024-18452-0
Yang H, Pan X, Deng J, Yin J (2024). An Effective RSP Data Sampling Algorithm. Knowledge Science, Engineering and Management, pp. 331-342. Online publication date: 27-Jul-2024.
https://doi.org/10.1007/978-981-97-5501-1_25
Li W, Hacid H, Almazrouei E, Debbah M (2023). A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques. AI, 4:3, pp. 729-786. Online publication date: 13-Sep-2023.
https://doi.org/10.3390/ai4030039
Shumovskaia V, Kayaalp M, Cemri M, Sayed A (2023). Discovering Influencers in Opinion Formation Over Social Graphs. IEEE Open Journal of Signal Processing, 4, pp. 188-207. Online publication date: 2023.
https://doi.org/10.1109/OJSP.2023.3261132
Nahar N, Zhang H, Lewis G, Zhou S, Kästner C (2023). A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners. 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN), pp. 171-183. Online publication date: May-2023.
https://doi.org/10.1109/CAIN58948.2023.00034
Krishan R (2023). Customer Behaviour Analysis Using Machine Learning Algorithms. Digital Transformation, Strategic Resilience, Cyber Security and Risk Management, pp. 133-142. Online publication date: 28-Sep-2023.
https://doi.org/10.1108/S1569-37592023000111B009
Minho L, Cardeal Z, Menezes H (2023). A deep learning‐based simulator for comprehensive two‐dimensional GC applications. Journal of Separation Science, 46:19. Online publication date: 31-Jul-2023.
https://doi.org/10.1002/jssc.202300187
Abdelkader M (2022). A Probabilistic Deep Learning Approach for Twitter Sentiment Analysis. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines, pp. 367-381. Online publication date: 10-Jun-2022.
https://doi.org/10.4018/978-1-6684-6303-1.ch020
