ABSTRACT
Over the last decade, Hadoop has evolved into a widely used platform for Big Data applications. Acknowledging its widespread use, we present a comprehensive analysis of the solved issues with applied patches in the Hadoop ecosystem. The analysis focuses on Hadoop's two essential components, HDFS (storage) and MapReduce (computation). It covers a total of 4218 solved issues over the last six years: 2180 from HDFS and 2038 from MapReduce. Insights derived from the study concern system design and development, particularly with respect to correlated issues and correlations between root causes of issues and characteristics of the Hadoop subsystems. These findings shed light on the future development of Big Data systems, on their testing, and on bug-finding tools.
Index Terms
- Understanding issue correlations: a case study of the Hadoop system