DOI: 10.1145/2213836.2213958

Large-scale machine learning at Twitter

Published: 20 May 2012

Abstract

The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions to provide predictive analytics capabilities that incorporate machine learning, focused specifically on supervised classification. In particular, we have identified stochastic gradient descent techniques for online learning and ensemble methods as being highly amenable to scaling out to large amounts of data. In our deployed solution, common machine learning tasks such as data sampling, feature generation, training, and testing can be accomplished directly in Pig, via carefully crafted loaders, storage functions, and user-defined functions. This means that machine learning is just another Pig script, which allows seamless integration with existing infrastructure for data management, scheduling, and monitoring in a production environment, as well as access to rich libraries of user-defined functions and the materialized output of other scripts.
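
The abstract names two techniques, stochastic gradient descent for online learning and ensemble methods, without showing them. The following is a minimal sketch of how the two might fit together, written in Python rather than Pig and not taken from the paper. All names (`OnlineLogisticRegression`, `train_ensemble`, `ensemble_predict`, `make_toy_data`) and the toy data are hypothetical. Each ensemble member is a logistic regression model trained one example at a time on its own disjoint partition, roughly as independent workers in a cluster might do, and predictions are combined by averaging probabilities.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Binary logistic regression updated one example at a time (SGD)."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, label):
        # The log-loss gradient for a single example is (p - y) * x.
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def train_ensemble(examples, num_partitions, num_features, epochs=5):
    """Train one model per disjoint partition of the examples."""
    models = []
    for p in range(num_partitions):
        model = OnlineLogisticRegression(num_features)
        partition = examples[p::num_partitions]
        for _ in range(epochs):
            for features, label in partition:
                model.update(features, label)
        models.append(model)
    return models


def ensemble_predict(models, features):
    """Average the members' probabilities and threshold at 0.5."""
    mean_p = sum(m.predict_proba(features) for m in models) / len(models)
    return 1 if mean_p >= 0.5 else 0


def make_toy_data(n, seed=0):
    """Hypothetical toy data: the label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 1 if x[0] + x[1] > 0 else 0))
    return data


if __name__ == "__main__":
    train = make_toy_data(600, seed=0)
    test = make_toy_data(200, seed=1)
    models = train_ensemble(train, num_partitions=3, num_features=2)
    correct = sum(ensemble_predict(models, x) == y for x, y in test)
    print(f"ensemble accuracy: {correct / len(test):.2f}")
```

Because each member sees only its own partition and never communicates with the others during training, this style of ensemble parallelizes trivially, which is one plausible reading of why the abstract calls these methods amenable to scaling out.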

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ensembles
  2. logistic regression
  3. online learning
  4. stochastic gradient descent

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '12

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • (2024) EXTMEM. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pp. 397-408. DOI: 10.5555/3691992.3692017. Online publication date: 10-Jul-2024
  • (2024) Experience Report from a Graduate ML Production Systems Course. 2024 IEEE International Conference on Electro Information Technology (eIT), pp. 058-065. DOI: 10.1109/eIT60633.2024.10609876. Online publication date: 30-May-2024
  • (2024) A survey and comparative study on negative sentiment analysis in social media data. Multimedia Tools and Applications, 83:30, pp. 75243-75292. DOI: 10.1007/s11042-024-18452-0. Online publication date: 15-Feb-2024
  • (2024) An Effective RSP Data Sampling Algorithm. Knowledge Science, Engineering and Management, pp. 331-342. DOI: 10.1007/978-981-97-5501-1_25. Online publication date: 27-Jul-2024
  • (2023) A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques. AI, 4:3, pp. 729-786. DOI: 10.3390/ai4030039. Online publication date: 13-Sep-2023
  • (2023) Discovering Influencers in Opinion Formation Over Social Graphs. IEEE Open Journal of Signal Processing, 4, pp. 188-207. DOI: 10.1109/OJSP.2023.3261132. Online publication date: 2023
  • (2023) A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners. 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN), pp. 171-183. DOI: 10.1109/CAIN58948.2023.00034. Online publication date: May-2023
  • (2023) Customer Behaviour Analysis Using Machine Learning Algorithms. Digital Transformation, Strategic Resilience, Cyber Security and Risk Management, pp. 133-142. DOI: 10.1108/S1569-37592023000111B009. Online publication date: 28-Sep-2023
  • (2023) A deep learning‐based simulator for comprehensive two‐dimensional GC applications. Journal of Separation Science, 46:19. DOI: 10.1002/jssc.202300187. Online publication date: 31-Jul-2023
  • (2022) A Probabilistic Deep Learning Approach for Twitter Sentiment Analysis. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines, pp. 367-381. DOI: 10.4018/978-1-6684-6303-1.ch020. Online publication date: 10-Jun-2022
