DOI: 10.1145/3277104.3277107

Content-Based Textual Big Data Analysis and Compression

Published: 08 September 2018

Abstract

As technology and the Internet advance, the number of Internet users grows daily. Users search the web and access many kinds of websites, such as social media and banking, and as a result a large volume of data is generated every day. This data must be loaded for analysis, but memory space and transmission time are the main factors that limit processing. In most cases, only the important textual data needs to be extracted from these vast raw datasets. In this work, we propose content-based compression (CBC) for textual data analysis, based on the Huffman code. The data is pre-analyzed to find very high-frequency words, and each of those words is then replaced with a shorter symbol. The compression preserves the original format of the data, so that the compressed data structure is completely transparent to the Hadoop platform. The algorithm is evaluated on a set of real-world datasets (e.g., Amazon movie reviews and food reviews), and an average data size reduction of 52.4% is obtained in the experiments. Though this gain may seem modest, it can supplement all other compression optimization techniques. Furthermore, the proposed technique can be effectively applied to big data optimization.
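
The substitution step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' CBC implementation: it builds no Huffman tree and only borrows the underlying idea that the most frequent words should receive the shortest symbols. The marker character, the hexadecimal symbol format, the `top_k` and `min_len` parameters, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

MARKER = "\x01"  # hypothetical escape character, assumed absent from the input
WORD = re.compile(r"\w+")
SYMBOL = re.compile(re.escape(MARKER) + r"[0-9a-f]+")


def build_codebook(text, top_k=3, min_len=3):
    """Pre-analysis: map the top_k most frequent words to short symbols."""
    if MARKER in text:
        raise ValueError("marker character already occurs in the text")
    # Skip words shorter than min_len: replacing them would not save space.
    counts = Counter(w for w in WORD.findall(text) if len(w) >= min_len)
    # most_common() is ordered by descending count, so the most frequent
    # words receive the smallest (shortest) hexadecimal indices.
    return {w: f"{MARKER}{i:x}" for i, (w, _) in enumerate(counts.most_common(top_k))}


def compress(text, codebook):
    """Replace codebook words with symbols; whitespace and punctuation stay as-is."""
    return WORD.sub(lambda m: codebook.get(m.group(0), m.group(0)), text)


def decompress(text, codebook):
    """Invert compress() using the same codebook."""
    reverse = {sym: word for word, sym in codebook.items()}
    return SYMBOL.sub(lambda m: reverse.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    sample = "the movie was great and the movie was long\nthe food was fine"
    book = build_codebook(sample)
    packed = compress(sample, book)
    print(len(sample), len(packed), decompress(packed, book) == sample)
```

Because only whole words are rewritten, line breaks and field separators are left untouched, which is one way to keep record-oriented input readable by line-based tools; whether the paper's scheme works exactly this way is not stated in the abstract.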

Cited By

  • (2020) A Hive-Based Retrieval Optimization Scheme for Long-Term Storage of Massive Call Detail Records. IEEE Access 8, 431-444. DOI: 10.1109/ACCESS.2019.2961692. Online publication date: 2020.

    Published In

    ICCBD '18: Proceedings of the 2018 International Conference on Computing and Big Data
    September 2018
    103 pages
    ISBN:9781450365406
    DOI:10.1145/3277104

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Compression
    2. Hadoop
    3. Huffman Tree Algorithm
    4. text-based encoding

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICCBD '18
