Content-Based Textual Big Data Analysis and Compression

Authors: Fei Gao, Ananya Dutta, Jiangjiang Liu

ICCBD '18: Proceedings of the 2018 International Conference on Computing and Big Data

Pages 7 - 12

https://doi.org/10.1145/3277104.3277107

Published: 08 September 2018

Abstract

With the continued advancement of technology and the Internet, the number of Internet users is increasing daily. Users search the web and access many types of websites, such as social media and banking sites. As a result, a large volume of data is generated every day. This data must be loaded for analysis, but memory space and transmission time are the main limiting factors in processing it. In most cases, only the important textual data needs to be extracted from these vast raw datasets. In this work, we propose content-based compression (CBC) for textual data analysis, based on the Huffman code. The data is pre-analyzed to find very high-frequency words, and each of those words is then replaced with a shorter symbol. The compression is designed to preserve the original format of the data, so that the compressed data structure remains completely transparent to the Hadoop platform. The algorithm is evaluated on a set of real-world datasets (e.g., Amazon movie reviews and food reviews), and an average data size reduction of 52.4% is obtained. Though this gain may seem modest, the technique can complement other compression optimization techniques. Furthermore, the proposed technique can be applied effectively to big data optimization.
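
The word-substitution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the marker character `~`, the hexadecimal symbol scheme, and the function names `build_codebook`, `compress`, and `decompress` are all assumptions made for illustration. It borrows only the Huffman idea that more frequent words should receive shorter codes and does not build an actual Huffman tree. Text stays line-oriented and space-delimited, which is one plausible reading of "maintaining the original format".

```python
from collections import Counter

MARKER = "~"  # hypothetical escape character, assumed absent from the input


def build_codebook(lines, top_k=3):
    """Give the top_k most frequent words short symbols; rank 0 is the most frequent."""
    if any(MARKER in line for line in lines):
        raise ValueError("input already contains the marker character")
    freq = Counter(word for line in lines for word in line.split(" ") if word)
    codebook = {}
    for rank, (word, _count) in enumerate(freq.most_common(top_k)):
        symbol = f"{MARKER}{rank:x}"
        if len(symbol) < len(word):  # substitute only when it saves space
            codebook[word] = symbol
    return codebook


def compress(lines, codebook):
    """Replace coded words token by token, keeping one output line per input line."""
    return [" ".join(codebook.get(tok, tok) for tok in line.split(" ")) for line in lines]


def decompress(lines, codebook):
    """Invert compress() using the same codebook."""
    inverse = {symbol: word for word, symbol in codebook.items()}
    return [" ".join(inverse.get(tok, tok) for tok in line.split(" ")) for line in lines]


if __name__ == "__main__":
    reviews = ["the movie was great", "the movie was long"]
    book = build_codebook(reviews, top_k=2)
    print(compress(reviews, book))
```

Because each input line maps to exactly one output line and tokens remain separated by single spaces, a line-based reader can still split the compressed text into records. Words with attached punctuation (for example `great,`) are counted as separate tokens in this sketch; a real pre-analysis pass would likely need a more careful tokenizer.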

Cited By

Peng X, Liu L, Zhang L (2020). A Hive-Based Retrieval Optimization Scheme for Long-Term Storage of Massive Call Detail Records. IEEE Access 8, 431-444. https://doi.org/10.1109/ACCESS.2019.2961692

Index Terms

Information systems > Data management systems > Data structures > Data layout > Data compression

Published In

ICCBD '18: Proceedings of the 2018 International Conference on Computing and Big Data

September 2018

103 pages

ISBN: 9781450365406

DOI: 10.1145/3277104

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICCBD '18: 2018 International Conference on Computing and Big Data

September 8-10, 2018

Charleston, SC, USA
