A survey of big data management: Taxonomy and state-of-the-art

doi:10.1016/j.jnca.2016.04.008

Journal of Network and Computer Applications

Volume 71, August 2016, Pages 151-166

https://doi.org/10.1016/j.jnca.2016.04.008

Abstract

The rapid growth of emerging applications and the evolution of cloud computing technologies have significantly enhanced the capability to generate vast amounts of data. Thus, managing such voluminous amounts of data has become a great challenge in this big data era. The recent advancements in big data techniques and technologies have enabled many enterprises to handle big data efficiently. However, these advances have not yet been studied in detail and a comprehensive survey of this domain is still lacking. With a focus on big data management, this survey aims to investigate feasible techniques for managing big data by emphasizing storage, pre-processing, processing and security. Moreover, the critical aspects of these techniques are analyzed by devising a taxonomy in order to identify the problems and the proposals made to alleviate them. Furthermore, big data management techniques are also summarized. Finally, several future research directions are presented.

Introduction

Over the last few years, the volume of data worldwide has exploded with the amplified use of various digital devices that continuously generate massive amounts of heterogeneous, structured or unstructured data, resulting in what is now called “big data” (Kambatla et al., 2014). Big data refers to rapidly growing amounts of data for which traditional database mechanisms have become inefficient in terms of storage, processing and analysis (Manyika et al., 2011). Managing big data with diverse data formats is a key basis for competition in business and management. Nonetheless, it has also become a new challenge for Information and Communication Technologies in both science and industry, encouraging the pursuit of data-centric architectures and operational models (Han et al., 2014).

Meanwhile, traditional data storage and processing were typically fed with relatively clean data sets generated by limited sources; hence, the results tended to be accurate. However, the evolution of big data has revealed a serious management problem, as standard tools and procedures are not designed to manage such massive data volumes (Philip Chen and Zhang, 2014). At the same time, current infrastructures are not yet capable of addressing the distributed computational needs of managing big data and exploiting large quantities and varieties of data (Candela et al., 2012). This is due not only to the growth in the volumes of data sets but also to their complexity and volatility, which make processing and analysis very hard to achieve through traditional data management techniques and technologies. Consequently, it is very challenging for current infrastructures to sustain huge amounts of data (Russom, 2011).

Current techniques and technologies designed to handle big data management problems mostly emphasize certain characteristics of big data, such as volume, variety and velocity (Philip Chen and Zhang, 2014). Moreover, big data comprise complex data that are massively produced and managed in geographically dispersed repositories (Kambatla et al., 2014). Such complexity motivates the development of advanced management techniques and technologies for dealing with the challenges of big data. However, these advances have not yet been studied in detail and a comprehensive survey of this domain is required. Although several studies related to big data management exist (Han et al., 2014; Russom, 2013; McAfee and Brynjolfsson, 2012; Chaudhuri, 2012; Borkar et al., 2012), none has focused directly on the technical aspects of big data management or provided a description of existing techniques in storage, preprocessing, processing and security. Moreover, this survey analyzes several problems inhibiting big data management and reviews the corresponding solutions by devising a taxonomy.

This survey focuses primarily on big data aspects in the context of data management. A broad coverage of existing work on storage, preprocessing, processing and security is provided. In addition, this survey offers added value by means of a comprehensive taxonomy of existing techniques and technologies and by highlighting the importance of typical big data management challenges related to storage, preprocessing, processing and security. Moreover, this survey aims to be a useful guide to challenges and solutions in big data management and also a point of reference for future work on big data management. Furthermore, this survey summarizes the benefits that can be achieved if techniques are adopted for specific application areas of management, such as storage, preprocessing, analysis and/or security. Most significantly, this research serves as a guide for researchers in the search for suitable big data management techniques and in the development of augmented techniques in response to the insufficiency of existing solutions.

In order to achieve the aims mentioned above, we carry out our investigation by answering the following questions related to recent advances in big data management: (a) How do big data management techniques optimize storage resources to meet rapid growth and fast retrieval requirements? (b) How are pre-processing tools and technologies, such as cleansing and transformation, managed to support upcoming trends in big data? (c) How is big data analytics performed to deal with abundant information that could impact the business? (d) How does security affect the big data management process?

The contributions are as follows:

• A comprehensive review of big data management techniques with respect to data storage, pre-processing, processing and security
• A discussion on a taxonomy of big data management process flow with focus on the problems and available solutions related to storage, pre-processing, processing and security
• A comparison of different big data management techniques for storage, pre-processing and processing based on parameters including availability, scalability, integrity, heterogeneity, resource optimization and velocity
• A discussion on future directions and challenges regarding big data management

The rest of the article is organized as follows: Section 2 provides a general overview of big data management. A taxonomy of techniques for the storage, pre-processing, processing and security aspects of big data is presented in Section 3. Section 4 discusses techniques for storage, pre-processing and processing and analyzes their capability to meet big data management requirements, such as availability, scalability, integrity, heterogeneity, resource optimization and velocity. Section 5 highlights challenges and future research directions for big data management and Section 6 concludes the study.

Section snippets

Overview of big data management

Big data management is a new discipline, where data management techniques, tools and platforms including storage, pre-processing, processing and security can be applied. However, data management is a broad practice that encompasses other data disciplines, such as data warehousing, data integration, data quality and data governance (McAfee and Brynjolfsson, 2012). Thus, big data management is a complex process, particularly when abundant data originating from heterogeneous sources are to be used

Taxonomy of big data management

This section discusses the components involved in big data management techniques. Based on the required constituents of big data management as illustrated in Fig. 1, the process commences by transforming big data from original format to computer formats. It progresses with applying big data operations towards achieving decision-making. We propose a big data management process flow as a layered component diagram that shows all steps big data must undergo in order to accomplish the management

Discussion

Numerous studies have addressed a number of significant challenges and techniques pertaining to big data management, as discussed in the previous section. With respect to big data management, six parameters are considered essential requirements of analysis techniques for storage, pre-processing and processing: availability, scalability, integrity, heterogeneity, resource optimization and velocity. These requirements are considered because they are matters that describe storage,

Future directions

Although research on big data management has already achieved much, there are still a lot of hard problems that remain to be solved. In order to help researchers get a better grasp of future research directions in the field of big data management, more insight into future research challenges and opportunities is provided as follows:

Conclusion

Data at present is huge and continues to increase every day. The variety of data being generated is also expanding. Therefore, the need for effective management techniques and technologies to handle big data is becoming crucial. This study presented a comprehensive survey of big data management and proposed the management process flow as taxonomy. Big data management was discussed in terms of data storage, pre-processing, processing and security and state-of-the-art techniques for each

Acknowledgements

The authors would like to thank the University of Malaya for grant “Big Data and Mobile Cloud for Collaborative Experiments”, Project Number: RP012C-13AFR, Malaysian Ministry of Higher Education under the University of Malaya High Impact Research Grant “Mobile Cloud Computing: Device and Connectivity”, Project Number: M.C/625/1/HIR/MOE/FCSIT/03 and Bantuan Kecil Penyelidikan (BKP) Grant with project number: BK074-2015.

References (138)

K. Buza et al. Storage-optimizing clustering algorithms for high-dimensional tick data. Expert Syst. Appl. (2014)

A. Gandomi et al. Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manag. (2015)

I.A.T. Hashem. The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. (2015)

K. Kambatla. Trends in big data analytics. J. Parallel Distrib. Comput. (2014)

D. Karaboga et al. A novel clustering approach: artificial Bee Colony (ABC) algorithm. Appl. Soft Comput. (2011)

O. Kwon et al. Data quality management, data usage experience and acquisition intention of big data analytics. Int. J. Inf. Manag. (2014)

N. Kshetri. Big data's impact on privacy, security and consumer welfare. Telecommun. Policy (2014)

J. Liu et al. Data integration in fuzzy XML documents. Inf. Sci. (2014)

G. Lafuente. The big data security challenge. Netw. Secur. (2015)

J.M. Mendel et al. On establishing nonlinear combinations of variables from small to big data for use in later processing. Inf. Sci. (2014)

B. Meroufel et al. Managing data replication and placement based on availability. AASRI Procedia (2013)

A. Ma’ayan. Lean big data integration in systems biology and systems pharmacology. Trends Pharmacol. Sci. (2014)

C.L. Philip Chen et al. Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. (2014)

G. Putnik. Scalability in manufacturing systems design and operation: state-of-the-art and future developments roadmap. CIRP Ann. – Manuf. Technol. (2013)

E. Spaho. P2P data replication and trustworthiness for a JXTA-Overlay P2P system using fuzzy logic. Appl. Soft Comput. (2013)

Agrawal, D., El Abbadi, A., Antony, S., Das, S., 2010. Data management challenges in cloud computing infrastructures,...

D.R. Azevedo et al. Application of data mining techniques to storage management and online distribution of satellite images

R. Azeem et al. Techniques about data replication for mobile ad-hoc network databases. Int J. Multidiscip. Sci. Eng. (2012)

Ahamed, B.B., Ramkumar, T., Hariharan, S., 2014. Data integration progression in large data source using mapping...

H. Abu-Libdeh. Symbiotic routing in future data centers. ACM SIGCOMM Comput. Commun. Rev. (2011)

J. Armstrong. OFDM for optical communications. J. Light Technol. (2009)

Achtert, E., et al., 2007. On exploring complex relationships of correlation clusters. In: Proceedings of the IEEE...

Agrawal, D., Aggarwal, C.C., 2001. On the design and quantification of privacy preserving data mining algorithms. In:...

M.D. Assunção. Big Data computing and clouds: trends and future directions. J. Parallel Distrib. Comput. (2014)

Assunçaoa, M.D., et al., 2013. Big Data Computing and Clouds: Challenges, Solutions, and Future Directions. arXiv...

Borkar, V., Carey, M.J., Li, C., 2012. Inside Big Data management: ogres, onions, or parfaits? In: Proceedings of the...

Baker, T, 2014. Designing and managing Big Data – How are you researching your outcomes. In SimTecT...

Buza, K., Buza, A., Kis, P.B., 2011. A distributed genetic algorithm for graph-based clustering Man-Machine...

L.E. Bautista Villalpando et al. DIPAR: a framework for implementing big data science in organizations

S. Baskar et al. A systematic approach on data pre-processing in data mining. Int J. Adv. Comput. Technol. (IJACT) (2013)

Bohannon, P., et al., 2007. Conditional functional dependencies for data cleaning. In: Proceedings of the IEEE 23rd...

Begoli, E., Horey, J., 2012. Design principles for effective knowledge discovery from big data. In: Proceedings of the...

Bertsekas, D.P., 1999. Nonlinear...

Bakshi, K., 2012. Considerations for big data: architecture and approach. In: Proceedings of IEEE Aerospace...

N. Beldiceanu. Toward sustainable development in constraint programming. Constraints (2014)

L. Candela et al. Managing big data through hybrid data infrastructures. ERCIM News (2012)

Chaudhuri, S., 2012. What next?: a half-dozen data management research goals for big data and the cloud. In:...

S. Chaudhuri et al. An overview of business intelligence technology. Commun. ACM (2011)

Chen, H., et al., 2010. Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM...

J. Chen. Big data challenge: a data management perspective. Front. Comput. Sci. (2013)

T. Craig et al. Privacy and Big Data (2011)

M. Chen et al. Big data: a survey. Mob. Netw. Appl. (2014)

T. Chen. A sentence vector based over-sampling method for imbalanced emotion classification

A.D. Chapman. Principles of Data Quality (2005)

A.D. Chapman. Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data (2005)

Dahan, H., Cohen, S., Rokach, L., & Maimon, O., 2014. Proactive Data Mining Using Decision Trees Proactive Data Mining...

J. Dittrich et al. MOVIES: indexing moving objects by shooting index images. GeoInformatica (2011)

E.C. Dalcin. Data Quality Concepts and Techniques Applied to Taxonomic Databases (2005)

D.L. Davies et al. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence (1979)

