A survey of big data management: Taxonomy and state-of-the-art
Introduction
Over the last few years, the volume of data worldwide has exploded with the growing use of digital devices that continuously generate massive amounts of heterogeneous, structured or unstructured data, resulting in what is now called “big data” (Kambatla et al., 2014). Big data refers to rapidly growing amounts of data for which traditional database mechanisms have become inefficient in terms of storage, processing and analysis (Manyika et al., 2011). Managing big data with diverse data formats is a main basis for competition in business and management. Nonetheless, it has also become a new challenge for Information and Communication Technologies in both science and industry, encouraging the pursuit of data-centric architectures and operational models (Han et al., 2014).
Meanwhile, traditional data storage and processing were typically fed with relatively clean data sets generated by a limited number of sources; hence, the results tended to be accurate. However, the evolution of big data has revealed a serious management problem, as standard tools and procedures are not designed to manage such massive data volumes (Philip Chen and Zhang, 2014). At the same time, current infrastructures are not yet capable of addressing the distributed computational needs of managing big data and exploiting large quantities and varieties of data (Candela et al., 2012). This is due not only to the growth in the volume of data sets but also to their complexity and volatility, which make processing and analysis very hard to achieve through traditional data management techniques and technologies. Consequently, it is very challenging for current infrastructures to sustain huge amounts of data (Russom, 2011).
Current techniques and technologies designed to handle big data management problems mostly emphasize certain characteristics of big data, such as volume, variety and velocity (Philip Chen and Zhang, 2014). Moreover, big data comprise complex data that are massively produced and managed in geographically dispersed repositories (Kambatla et al., 2014). Such complexity motivates the development of advanced management techniques and technologies for dealing with the challenges of big data. However, these advances have not yet been studied in detail, and a comprehensive survey of this domain is required. Although several studies related to big data management exist (Han et al., 2014; Russom, 2013; McAfee and Brynjolfsson, 2012; Chaudhuri, 2012; Borkar et al., 2012), none focuses directly on the technical aspects of big data management or describes existing techniques in storage, preprocessing, processing and security. Moreover, this survey analyzes several problems inhibiting big data management and reviews the corresponding solutions by devising a taxonomy.
This survey focuses primarily on big data aspects in the context of data management. It provides broad coverage of existing work on storage, preprocessing, processing and security. In addition, it offers a comprehensive taxonomy of existing techniques and technologies and highlights the importance of typical big data management challenges related to storage, preprocessing, processing and security. The survey aims to be a useful guide to challenges and solutions in big data management and a point of reference for future work in this area. Furthermore, it summarizes the benefits that can be achieved if techniques are adopted for specific application areas of management, such as storage, preprocessing, analysis and/or security. Most significantly, this research serves as a guide for researchers in exploring suitable big data management techniques and in developing augmented techniques in response to the insufficiency of existing solutions.
In order to achieve the aims mentioned above, we carry out our investigation by answering the following questions related to recent advances in big data management: (a) How do big data management techniques optimize storage resources to meet rapid growth and fast retrieval requirements? (b) How are pre-processing tools and technologies, such as cleansing and transformation, managed to support upcoming trends in big data? (c) How is big data analytics performed to deal with abundant information that could impact the business? (d) How does security influence the big data management process?
The contributions are as follows:
- A comprehensive review of big data management techniques with respect to data storage, pre-processing, processing and security
- A discussion on a taxonomy of big data management process flow with focus on the problems and available solutions related to storage, pre-processing, processing and security
- A comparison of different big data management techniques for storage, pre-processing and processing based on parameters including availability, scalability, integrity, heterogeneity, resource optimization and velocity
- A discussion on future directions and challenges regarding big data management
The rest of the article is organized as follows: Section 2 provides a general overview of big data management. Section 3 presents a taxonomy of techniques for the storage, pre-processing, processing and security aspects of big data. Section 4 discusses techniques for storage, pre-processing and processing and analyzes their capability to meet big data management requirements, such as availability, scalability, integrity, heterogeneity, resource optimization and velocity. Section 5 highlights challenges and future research directions for big data management, and Section 6 concludes the study.
Section snippets
Overview of big data management
Big data management is a new discipline, where data management techniques, tools and platforms including storage, pre-processing, processing and security can be applied. However, data management is a broad practice that encompasses other data disciplines, such as data warehousing, data integration, data quality and data governance (McAfee and Brynjolfsson, 2012). Thus, big data management is a complex process, particularly when abundant data originating from heterogeneous sources are to be used
Taxonomy of big data management
This section discusses the components involved in big data management techniques. Based on the required constituents of big data management as illustrated in Fig. 1, the process commences by transforming big data from original format to computer formats. It progresses with applying big data operations towards achieving decision-making. We propose a big data management process flow as a layered component diagram that shows all steps big data must undergo in order to accomplish the management
Discussion
Numerous studies have addressed a number of significant challenges and techniques pertaining to big data management, as discussed in the previous section. With respect to big data management, the following parameters are considered essential requirements when analyzing techniques for storage, pre-processing and processing: availability, scalability, integrity, heterogeneity, resource optimization and velocity. These requirements are considered because they are matters that describe storage,
Future directions
Although research on big data management has already achieved much, many hard problems remain to be solved. To help researchers get a better grasp of future research directions in the field of big data management, more insight into future research challenges and opportunities is provided as follows:
Conclusion
The volume of data at present is huge and continues to increase every day. The variety of data being generated is also expanding. Therefore, the need for effective management techniques and technologies to handle big data is becoming crucial. This study presented a comprehensive survey of big data management and proposed the management process flow as a taxonomy. Big data management was discussed in terms of data storage, pre-processing, processing and security and state-of-the-art techniques for each
Acknowledgements
The authors would like to thank the University of Malaya for grant “Big Data and Mobile Cloud for Collaborative Experiments”, Project Number: RP012C-13AFR, Malaysian Ministry of Higher Education under the University of Malaya High Impact Research Grant “Mobile Cloud Computing: Device and Connectivity”, Project Number: M.C/625/1/HIR/MOE/FCSIT/03 and Bantuan Kecil Penyelidikan (BKP) Grant with project number: BK074-2015.
References (138)
- Storage-optimizing clustering algorithms for high-dimensional tick data. Expert Syst. Appl. (2014)
- Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. (2015)
- The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. (2015)
- Trends in big data analytics. J. Parallel Distrib. Comput. (2014)
- A novel clustering approach: artificial Bee Colony (ABC) algorithm. Appl. Soft Comput. (2011)
- Data quality management, data usage experience and acquisition intention of big data analytics. Int. J. Inf. Manag. (2014)
- Big data's impact on privacy, security and consumer welfare. Telecommun. Policy (2014)
- Data integration in fuzzy XML documents. Inf. Sci. (2014)
- The big data security challenge. Netw. Secur. (2015)
- On establishing nonlinear combinations of variables from small to big data for use in later processing. Inf. Sci. (2014)
- Managing data replication and placement based on availability. AASRI Procedia
- Lean big data integration in systems biology and systems pharmacology. Trends Pharmacol. Sci.
- Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci.
- Scalability in manufacturing systems design and operation: state-of-the-art and future developments roadmap. CIRP Ann. – Manuf. Technol.
- P2P data replication and trustworthiness for a JXTA-Overlay P2P system using fuzzy logic. Appl. Soft Comput.
- Application of data mining techniques to storage management and online distribution of satellite images
- Techniques about data replication for mobile ad-hoc network databases. Int J. Multidiscip. Sci. Eng.
- Symbiotic routing in future data centers. ACM SIGCOMM Comput. Commun. Rev.
- OFDM for optical communications. J. Light Technol.
- Big Data computing and clouds: trends and future directions. J. Parallel Distrib. Comput.
- DIPAR: a framework for implementing big data science in organizations
- A systematic approach on data pre-processing in data mining. Int J. Adv. Comput. Technol. (IJACT)
- Toward sustainable development in constraint programming. Constraints
- Managing big data through hybrid data infrastructures. ERCIM News
- An overview of business intelligence technology. Commun. ACM
- Big data challenge: a data management perspective. Front. Comput. Sci.
- Privacy and Big Data
- Big data: a survey. Mob. Netw. Appl.
- A sentence vector based over-sampling method for imbalanced emotion classification
- Principles of Data Quality
- Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data
- MOVIES: indexing moving objects by shooting index images. GeoInformatica
- Data Quality Concepts and Techniques Applied to Taxonomic Databases
- A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence