Cluster computing for web-scale data processing

Authors:
Aaron Kimball, University of Washington, Seattle, WA, USA
Sierra Michels-Slettvet, Department of Computer Science and Engineering, University of Washington, WA, USA
Christophe Bisciglia, Google, Inc., Mountain View, CA, USA
SIGCSE '08: Proceedings of the 39th SIGCSE technical symposium on Computer science education, March 2008, Pages 116–120
https://doi.org/10.1145/1352135.1352177

Published: 12 March 2008

ABSTRACT

In this paper we present the design of a modern course in cluster computing and large-scale data processing. The defining differences between this and previously published designs are its focus on processing very large data sets and its use of Hadoop, an open source Java-based implementation of MapReduce and the Google File System, as the platform for programming exercises. Hadoop proved to be a key element for successfully implementing structured lab activities and independent design projects. Through this course, offered at the University of Washington in 2007, we imparted new skills to our students, improving their ability to design systems capable of solving web-scale problems.

Index Terms

Social and professional topics
  Professional topics
    Computing education
      Computing education programs
        Computer science education
        Information science education

Published in
SIGCSE '08: Proceedings of the 39th SIGCSE technical symposium on Computer science education
March 2008, 606 pages
ISBN: 9781595937995
DOI: 10.1145/1352135
General Chairs: J. D. Dougherty (Haverford College), Susan Rodger (Duke University)
Program Chairs: Sue Fitzgerald (Metropolitan State University), Mark Guzdial (Georgia Institute of Technology)

ACM SIGCSE Bulletin, Volume 40, Issue 1 (SIGCSE 08)
March 2008, 549 pages
ISSN: 0097-8418
DOI: 10.1145/1352322

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clusters
distributed computing
education
hadoop
mapreduce
Qualifiers
- research-article