ABSTRACT
In this paper we present the design of a modern course in cluster computing and large-scale data processing. The defining differences between this design and previously published ones are its focus on processing very large data sets and its use of Hadoop, an open source Java-based implementation of MapReduce and the Google File System, as the platform for programming exercises. Hadoop proved to be a key element for successfully implementing structured lab activities and independent design projects. Through this course, offered at the University of Washington in 2007, we imparted new skills to our students, improving their ability to design systems capable of solving web-scale problems.
Cluster computing for web-scale data processing
SIGCSE '08