Pydoop: a Python MapReduce and HDFS API for Hadoop

Published: 21 June 2010

Abstract

MapReduce has become increasingly popular as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop, written in Java. There is considerable interest, however, in tools that enable Python programmers to access the framework, due to the language's high popularity. Here we present a Python package that provides an API for both the MapReduce and the distributed file system sections of Hadoop, and show its advantages with respect to the other available solutions for Hadoop Python programming, Jython and Hadoop Streaming.
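
For readers unfamiliar with the paradigm, the following is a minimal in-process sketch of a MapReduce word count in plain Python. It does not use Pydoop, Hadoop, or any API described in the paper; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical and chosen only for illustration, and the grouping step stands in for the shuffle that a real framework performs across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_words(_key, line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce in a single process (illustrative only)."""
    mapped = chain.from_iterable(
        map_words(offset, line) for offset, line in enumerate(lines)
    )
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog", "The end"]))
```

In a distributed setting the user would typically supply only the map and reduce functions, while the framework handles input splitting, grouping by key, and scheduling.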

Published In

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN: 9781605589428
DOI: 10.1145/1851476

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

HPDC '10

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Cited By

  • (2024) Detecting DoS Outbreaks in Cloud Environment Using Machine Learning Algorithms in Hadoop Cluster. Control and Information Sciences, 177-188. DOI: 10.1007/978-981-99-9554-7_13. Online publication date: 17-May-2024
  • (2023) DyFuzz: Skeleton-based Fuzzing for Python Libraries. 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS), 325-336. DOI: 10.1109/QRS60937.2023.00040. Online publication date: 22-Oct-2023
  • (2023) Dealing with Missing Values in a Relation Dataset Using the DROPNA Function in Python. Mathematics and Computer Science Volume 1, 463-470. DOI: 10.1002/9781119879831.ch25. Online publication date: 17-Jul-2023
  • (2022) A Survey on Spatio-temporal Data Analytics Systems. ACM Computing Surveys 54:10s, 1-38. DOI: 10.1145/3507904. Online publication date: 14-Jan-2022
  • (2022) Towards understanding bugs in Python interpreters. Empirical Software Engineering 28:1. DOI: 10.1007/s10664-022-10239-x. Online publication date: 13-Dec-2022
  • (2021) Bengali Handwritten Character Transformation: Basic to Compound and Compound to Basic Using Convolutional Neural Network. 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), 142-146. DOI: 10.1109/ICREST51555.2021.9331247. Online publication date: 5-Jan-2021
  • (2019) An overview and comparison of free Python libraries for data mining and big data analysis. 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 977-982. DOI: 10.23919/MIPRO.2019.8757088. Online publication date: May-2019
  • (2019) Upgrading a high performance computing environment for massive data processing. Journal of Internet Services and Applications 10:1. DOI: 10.1186/s13174-019-0118-7. Online publication date: 16-Oct-2019
  • (2019) Improved Programming-Language Independent MapReduce on Shared-Memory Systems. Big Data Analytics and Knowledge Discovery, 206-220. DOI: 10.1007/978-3-030-27520-4_15. Online publication date: 3-Aug-2019
  • (2018) XRT: Programming-Language Independent MapReduce on Shared-Memory Systems. 2018 IEEE International Congress on Big Data (BigData Congress), 182-189. DOI: 10.1109/BigDataCongress.2018.00031. Online publication date: Jul-2018
