DOI: 10.1145/2351316.2351323
research-article

Compression-aware I/O performance analysis for big data clustering

Published: 12 August 2012

Abstract

As data volumes grow, the I/O bottleneck has become a major challenge for data analysis, and data compression can alleviate it effectively. Taking the K-means algorithm as an example, this paper proposes a compression-aware performance improvement model for big-data clustering. The model quantitatively analyzes how a variety of compression-related factors affect performance across the entire computational process. We perform clustering experiments on 10-dimensional data up to 1.114 TB in size on a cluster computer with hundreds of computing cores. The measurements show that compression contributes significantly to improving I/O performance and empirically confirm our theoretical analysis. Furthermore, the proposed model can effectively determine when and how to use compression to improve I/O performance for big-data analysis.
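
The abstract does not give the model's equations, so the following is only a minimal sketch of the kind of break-even reasoning a compression-aware I/O model involves, not the model proposed in the paper. It assumes a single sequential read of the data, a fixed compression ratio, and constant I/O and decompression bandwidths. All function and parameter names (`compression_speedup`, `ratio`, `io_bw`, `decompress_bw`, `overlap`) are hypothetical.

```python
def compression_speedup(size_bytes, ratio, io_bw, decompress_bw, overlap=False):
    """Estimate the read-time speedup from storing data compressed.

    Assumptions (illustrative, not taken from the paper):
      size_bytes    -- uncompressed data size in bytes
      ratio         -- uncompressed size / compressed size (> 1 means smaller)
      io_bw         -- storage read bandwidth in bytes per second
      decompress_bw -- decompression throughput in uncompressed bytes per second
      overlap       -- if True, reading and decompressing are fully pipelined
    """
    # Time to read the raw, uncompressed data.
    t_plain = size_bytes / io_bw
    # Time to read the smaller compressed data.
    t_read = (size_bytes / ratio) / io_bw
    # CPU time needed to restore the original bytes.
    t_decompress = size_bytes / decompress_bw
    if overlap:
        # With perfect pipelining the slower stage dominates.
        t_compressed = max(t_read, t_decompress)
    else:
        # Without overlap the two stages run back to back.
        t_compressed = t_read + t_decompress
    return t_plain / t_compressed


def compression_helps(size_bytes, ratio, io_bw, decompress_bw, overlap=False):
    """Return True when the estimated speedup exceeds 1."""
    return compression_speedup(size_bytes, ratio, io_bw, decompress_bw, overlap) > 1.0
```

Under these assumptions, compression pays off when the I/O time saved by reading fewer bytes outweighs the decompression cost. That tends to favor slow storage, high compression ratios, and fast decompressors. An iterative algorithm such as K-means that re-reads its input on every pass would incur the same trade-off on each iteration.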

    Information & Contributors

    Information

    Published In

    BigMine '12: Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
    August 2012
    134 pages
    ISBN:9781450315470
    DOI:10.1145/2351316
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. I/O bottleneck
    2. big data clustering
    3. compression contribution model

    Qualifiers

    • Research-article

    Conference

    KDD '12

    Acceptance Rates

    Overall Acceptance Rate 13 of 23 submissions, 57%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 27 Feb 2025

    Citations

    Cited By

    • (2024) Efficient Exploration of Mobile Robot Based on DL-RRT and AP-BO. IEEE Transactions on Instrumentation and Measurement, 73 (1-9). DOI: 10.1109/TIM.2024.3418090. Online publication date: 2024
    • (2024) Ensemble-Based System Benchmarking for HPC. 2024 23rd International Symposium on Parallel and Distributed Computing (ISPDC) (1-8). DOI: 10.1109/ISPDC62236.2024.10705405. Online publication date: 8-Jul-2024
    • (2024) Big data clustering method based on parallel K-means. 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA) (893-897). DOI: 10.1109/ICPECA60615.2024.10470970. Online publication date: 26-Jan-2024
    • (2019) Knowledge Discovery and Big Data Analytics. Web Services (168-183). DOI: 10.4018/978-1-5225-7501-6.ch011. Online publication date: 2019
    • (2018) Content-Based Textual Big Data Analysis and Compression. Proceedings of the 2018 International Conference on Computing and Big Data (7-12). DOI: 10.1145/3277104.3277107. Online publication date: 8-Sep-2018
    • (2018) The Application of Artificial Intelligence Technology in Energy Internet. 2018 2nd IEEE Conference on Energy Internet and Energy System Integration (EI2) (1-5). DOI: 10.1109/EI2.2018.8582096. Online publication date: Oct-2018
    • (2017) Knowledge Discovery and Big Data Analytics. Web Semantics for Textual and Visual Information Retrieval (144-164). DOI: 10.4018/978-1-5225-2483-0.ch007. Online publication date: 2017
    • (2016) The Memory Challenge in Reduce Phase of MapReduce Applications. IEEE Transactions on Big Data, 2:4 (380-386). DOI: 10.1109/TBDATA.2016.2607756. Online publication date: 1-Dec-2016
    • (2016) Big Data Analytics. Big Data Technologies and Applications (13-52). DOI: 10.1007/978-3-319-44550-2_2. Online publication date: 17-Sep-2016
    • (2015) Big data analytics: a survey. Journal of Big Data, 2:1. DOI: 10.1186/s40537-015-0030-3. Online publication date: 1-Oct-2015
