A survey of big data management: Taxonomy and state-of-the-art

https://doi.org/10.1016/j.jnca.2016.04.008

Abstract

The rapid growth of emerging applications and the evolution of cloud computing technologies have significantly enhanced the capability to generate vast amounts of data. Thus, managing such a voluminous amount of data has become a great challenge in this big data era. The recent advancements in big data techniques and technologies have enabled many enterprises to handle big data efficiently. However, these advances in techniques and technologies have not yet been studied in detail and a comprehensive survey of this domain is still lacking. With a focus on big data management, this survey aims to investigate feasible techniques for managing big data by emphasizing storage, pre-processing, processing and security. Moreover, the critical aspects of these techniques are analyzed by devising a taxonomy in order to identify the problems and the proposals made to alleviate them. Furthermore, big data management techniques are also summarized. Finally, several future research directions are presented.

Introduction

Over the last few years, the volume of data worldwide has exploded with the amplified use of various digital devices that continuously generate massive amounts of heterogeneous, structured or unstructured data, resulting in what is now called “big data” (Kambatla et al., 2014). Big data refers to rapidly growing amounts of data for which traditional database mechanisms have become inefficient in terms of storage, processing and analysis (Manyika et al., 2011). Managing big data with diverse data formats is a main basis for competition in business and management. Nonetheless, it has also become a new challenge for Information and Communication Technologies in both science and industry, encouraging the pursuit of data-centric architectures and operational models (Han et al., 2014).

Meanwhile, traditional data storage and processing were typically fed with relatively clean data sets generated by limited sources; hence, the results tended to be accurate. However, the evolution of big data has revealed a serious management problem, as standard tools and procedures are not designed to manage such massive data volumes (Philip Chen and Zhang, 2014). At the same time, current infrastructures are not yet capable of addressing the distributed computational needs of managing big data and exploiting large quantities and varieties of data (Candela et al., 2012). This is due not only to the growth in the volume of data sets but also to their complexity and volatility, which make processing and analysis very hard to achieve through traditional data management techniques and technologies. Clearly, it is very challenging for current infrastructures to sustain huge amounts of data (Russom, 2011).

Current techniques and technologies designed to handle big data management problems mostly emphasize certain characteristics of big data, such as volume, variety and velocity (Philip Chen and Zhang, 2014). Moreover, big data comprise complex data that are massively produced and managed in geographically dispersed repositories (Kambatla et al., 2014). Such complexity motivates the development of advanced management techniques and technologies for dealing with the challenges of big data. However, these advances in techniques and technologies have not yet been studied in detail and a comprehensive survey of this domain is required. Although several studies related to big data management exist (Han et al., 2014; Russom, 2013; McAfee and Brynjolfsson, 2012; Chaudhuri, 2012; Borkar et al., 2012), none focuses directly on the technical aspects of big data management by describing existing techniques in storage, preprocessing, processing and security. This survey also analyzes several problems inhibiting big data management and reviews the corresponding solutions by devising a taxonomy.

This survey focuses primarily on big data aspects in the context of data management. A broad coverage of existing work on storage, preprocessing, processing and security is provided. In addition, this survey offers added value by means of a comprehensive taxonomy of existing techniques and technologies and by highlighting the importance of typical big data management challenges related to storage, preprocessing, processing and security. Moreover, this survey aims to be a useful guide to challenges and solutions in big data management and also a point of reference for future work on big data management. Furthermore, this survey summarizes the benefits that can be achieved if techniques are adopted for specific application areas of management, such as storage, preprocessing, analysis and/or security. Most significantly, this research serves as a guide for researchers in the search for suitable big data management techniques and in the development of augmented techniques in response to the insufficiency of existing solutions.

To achieve the aims mentioned above, we carry out our investigation by answering the following questions related to recent advances in big data management: (a) How do big data management techniques optimize storage resources to meet rapid growth and fast retrieval requirements? (b) How are pre-processing tools and technologies, such as cleansing and transformation, managed to support upcoming trends in big data? (c) How is big data analytics performed to deal with abundant information that could impact the business? (d) How does security influence the big data management process?

The contributions are as follows:

  • A comprehensive review of big data management techniques with respect to data storage, pre-processing, processing and security

  • A discussion of a taxonomy of the big data management process flow, with a focus on the problems and available solutions related to storage, pre-processing, processing and security

  • A comparison of different big data management techniques for storage, pre-processing and processing based on parameters including availability, scalability, integrity, heterogeneity, resource optimization and velocity

  • A discussion on future directions and challenges regarding big data management

The rest of the article is organized as follows: Section 2 provides a general overview of big data management. A taxonomy of techniques for the storage, pre-processing, processing and security aspects of big data is presented in Section 3. Section 4 discusses techniques for storage, pre-processing and processing and analyzes their capability to meet big data management requirements, such as availability, scalability, integrity, heterogeneity, resource optimization and velocity. Section 5 highlights challenges and future research directions for big data management and Section 6 concludes the study.

Overview of big data management

Big data management is a new discipline, where data management techniques, tools and platforms including storage, pre-processing, processing and security can be applied. However, data management is a broad practice that encompasses other data disciplines, such as data warehousing, data integration, data quality and data governance (McAfee and Brynjolfsson, 2012). Thus, big data management is a complex process, particularly when abundant data originating from heterogeneous sources are to be used

Taxonomy of big data management

This section discusses the components involved in big data management techniques. Based on the required constituents of big data management as illustrated in Fig. 1, the process commences by transforming big data from its original format to computer formats. It then progresses by applying big data operations towards achieving decision-making. We propose a big data management process flow as a layered component diagram that shows all steps big data must undergo in order to accomplish the management

Discussion

Numerous studies have addressed a number of significant challenges and techniques pertaining to big data management, as discussed in the previous section. With respect to big data management, six parameters are considered essential requirements when analyzing techniques for storage, pre-processing and processing: availability, scalability, integrity, heterogeneity, resource optimization and velocity. These requirements are considered because they are matters that describe storage,

Future directions

Although research on big data management has already achieved much, many hard problems remain to be solved. To help researchers get a better grasp of future research directions in the field of big data management, more insight into future research challenges and opportunities is provided as follows:

Conclusion

The amount of data at present is huge and continues to increase every day. The variety of data being generated is also expanding. Therefore, the need for effective management techniques and technologies to handle big data is becoming crucial. This study presented a comprehensive survey of big data management and proposed the management process flow as a taxonomy. Big data management was discussed in terms of data storage, pre-processing, processing and security and state-of-the-art techniques for each

Acknowledgements

The authors would like to thank the University of Malaya for the grant “Big Data and Mobile Cloud for Collaborative Experiments”, Project Number: RP012C-13AFR; the Malaysian Ministry of Higher Education under the University of Malaya High Impact Research Grant “Mobile Cloud Computing: Device and Connectivity”, Project Number: M.C/625/1/HIR/MOE/FCSIT/03; and the Bantuan Kecil Penyelidikan (BKP) Grant, Project Number: BK074-2015.

References (138)

  • Meroufel, B., et al., 2013. Managing data replication and placement based on availability. AASRI Procedia.
  • Ma’ayan, A., 2014. Lean big data integration in systems biology and systems pharmacology. Trends Pharmacol. Sci.
  • Philip Chen, C.L., et al., 2014. Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci.
  • Putnik, G., 2013. Scalability in manufacturing systems design and operation: state-of-the-art and future developments roadmap. CIRP Ann. – Manuf. Technol.
  • Spaho, E., 2013. P2P data replication and trustworthiness for a JXTA-Overlay P2P system using fuzzy logic. Appl. Soft Comput.
  • Agrawal, D., El Abbadi, A., Antony, S., Das, S., 2010. Data management challenges in cloud computing infrastructures,...
  • Azevedo, D.R., et al. Application of data mining techniques to storage management and online distribution of satellite images.
  • Azeem, R., et al., 2012. Techniques about data replication for mobile ad-hoc network databases. Int. J. Multidiscip. Sci. Eng.
  • Ahamed, B.B., Ramkumar, T., Hariharan, S., 2014. Data integration progression in large data source using mapping...
  • Abu-Libdeh, H., 2011. Symbiotic routing in future data centers. ACM SIGCOMM Comput. Commun. Rev.
  • Armstrong, J., 2009. OFDM for optical communications. J. Light Technol.
  • Achtert, E., et al., 2007. On exploring complex relationships of correlation clusters. In: Proceedings of the IEEE...
  • Agrawal, D., Aggarwal, C.C., 2001. On the design and quantification of privacy preserving data mining algorithms. In:...
  • Assunção, M.D., 2014. Big Data computing and clouds: trends and future directions. J. Parallel Distrib. Comput.
  • Assunção, M.D., et al., 2013. Big Data Computing and Clouds: Challenges, Solutions, and Future Directions. arXiv...
  • Borkar, V., Carey, M.J., Li, C., 2012. Inside Big Data management: ogres, onions, or parfaits? In: Proceedings of the...
  • Baker, T, 2014. Designing and managing Big Data – How are you researching your outcomes. In SimTecT...
  • Buza, K., Buza, A., Kis, P.B., 2011. A distributed genetic algorithm for graph-based clustering Man-Machine...
  • Bautista Villalpando, L.E., et al. DIPAR: a framework for implementing big data science in organizations.
  • Baskar, S., et al., 2013. A systematic approach on data pre-processing in data mining. Int. J. Adv. Comput. Technol. (IJACT).
  • Bohannon, P., et al., 2007. Conditional functional dependencies for data cleaning. In: Proceedings of the IEEE 23rd...
  • Begoli, E., Horey, J., 2012. Design principles for effective knowledge discovery from big data. In: Proceedings of the...
  • Bertsekas, D.P., 1999. Nonlinear...
  • Bakshi, K., 2012. Considerations for big data: architecture and approach. In: Proceedings of IEEE Aerospace...
  • Beldiceanu, N., 2014. Toward sustainable development in constraint programming. Constraints.
  • Candela, L., et al., 2012. Managing big data through hybrid data infrastructures. ERCIM News.
  • Chaudhuri, S., 2012. What next?: a half-dozen data management research goals for big data and the cloud. In:...
  • Chaudhuri, S., et al., 2011. An overview of business intelligence technology. Commun. ACM.
  • Chen, H., et al., 2010. Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM...
  • Chen, J., 2013. Big data challenge: a data management perspective. Front. Comput. Sci.
  • Craig, T., et al., 2011. Privacy and Big Data.
  • Chen, M., et al., 2014. Big data: a survey. Mob. Netw. Appl.
  • Chen, T. A sentence vector based over-sampling method for imbalanced emotion classification.
  • Chapman, A.D., 2005. Principles of Data Quality.
  • Chapman, A.D., 2005. Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data.
  • Dahan, H., Cohen, S., Rokach, L., & Maimon, O., 2014. Proactive Data Mining Using Decision Trees Proactive Data Mining...
  • Dittrich, J., et al., 2011. MOVIES: indexing moving objects by shooting index images. GeoInformatica.
  • Dalcin, E.C., 2005. Data Quality Concepts and Techniques Applied to Taxonomic Databases.
  • Davies, D.L., et al., 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence.