DOI:10.1145/2038916.2038944
research-article

Scaling the mobile millennium system in the cloud

Published: 26 October 2011

Abstract

We report on our experience scaling up the Mobile Millennium traffic information system using cloud computing and the Spark cluster computing framework. Mobile Millennium uses machine learning to infer traffic conditions for large metropolitan areas from crowdsourced data, and Spark was specifically designed to support such applications. Many studies of cloud computing frameworks have demonstrated scalability and performance improvements for simple machine learning algorithms. Our experience implementing a real-world machine-learning-based application corroborates such benefits, but we also encountered several challenges that have not been widely reported: managing large parameter vectors, using memory efficiently, and integrating with the application's existing storage infrastructure. This paper describes these challenges and the changes they required in both the Spark framework and the Mobile Millennium software. While we focus on a system for traffic estimation, we believe that the lessons learned apply to other machine-learning-based applications.

Published In

SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing
October 2011
377 pages
ISBN:9781450309769
DOI:10.1145/2038916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%
