Veracity handling and instance reduction in big data using interval type-2 fuzzy sets☆
Introduction
With advances in technology, the number of data-generating sources is increasing steadily, and so is the amount of data generated every day. Information systems produce a mammoth quantity of records every second, every hour and every day. The term “Big Data” describes data so large, and growing so quickly, that present analytical tools are not sufficient to retrieve information from it (Chen et al., 2012). However, growth in capacity is limited by the evolution of hardware and technologies, while the rise in data volume is practically unlimited. Recently, various notable methods have been developed in the literature to deal with big data (Shukla and Muhuri, 2019, Enríquez et al., 2017, Osman, 2019, Ardagna et al., 2018, Shukla and Muhuri, 2019). The characterization of big data was first described by industry analyst Doug Laney in 2001 (Laney, 2001). He theorized that Volume, Variety, and Velocity are the three dimensions of big data and collectively termed them the Three V’s (3V’s). Three further dimensions were added in subsequent years: veracity, variability and value. Thus, collectively there are six dimensions of big data (Chen et al., 2012, Zikopoulos et al., 2012), as presented in Fig. 1.
The six V’s are briefly described below:
- 1. Volume refers to the size or scale of the data and is the most obvious characteristic of big data. With the significant increase in the sources of data, the volume of data keeps increasing exponentially. Big data sizes are reported in multiples of petabytes and exabytes.
- 2. Variety refers to the different formats of the data: structured data that resides in relational databases and file systems, and unstructured data such as text documents, email, video and audio. These formats have to be connected and correlated during the analysis phase in order to extract useful information.
- 3. Velocity is the rate at which data arrives at a particular instant of time, and it can be viewed from multiple angles.
- 4. Veracity, a term coined by IBM, is defined as the ambiguity of the data. For example, many companies that deal with product reviews cannot trust the reviews written by reviewers. Thus, there is a need for new tools that can deal with uncertainty.
- 5. Variability, introduced by SAS, refers to variations in the data flow rates. Since data flow rates, i.e. big data velocity, are not consistent and change frequently, there is a pressing need to pair, clean and remodel the data collected from various sources.
- 6. Value refers to the low value density of big data, as articulated by Oracle. Data in its original form has low value relative to its volume; however, high value can be attained when large volumes of such data are analyzed.
For many real-world applications, huge datasets have suddenly become available, which makes information extraction and data analysis convenient. The most widely used technique to retrieve information is clustering, which groups items with similar features together; the centroid of each cluster acts as a representative point of that cluster. Further information may also be analyzed from the cluster centroid, which is treated as a sampled data point. However, clustering algorithms require a significant amount of computation because of the associated similarity/dissimilarity calculations. In addition, as the dataset grows in size, the distance computations in clustering add extra computational overhead. Behavior similar to that of a cluster centroid can be observed in the defuzzified value of type-2 fuzzy sets (T2 FSs). T2 FSs are of higher order than the traditional fuzzy sets (FSs), or type-1 fuzzy sets (T1 FSs), originally proposed by Zadeh in 1965. T1 FSs are used to handle the uncertainty due to multiple interpretations of a point. A T1 FS associates the uncertain parameters with a membership function whose values lie within [0, 1]. However, T1 FSs suffer from interpretability issues (Casillas et al., 2013) because the membership value itself remains uncertain. T2 FSs resolve this issue.
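As a rough illustration of the type-1 case described above, the following sketch evaluates a Gaussian membership function. The Gaussian shape, the function name `gaussian_mf`, and the linguistic term "about 50" are assumptions made only for this example and are not taken from the paper.

```python
import math


def gaussian_mf(x, mean, sigma):
    """Type-1 Gaussian membership grade of x; the result always lies in [0, 1]."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Hypothetical fuzzy term "about 50" for some numeric attribute.
grades = {x: gaussian_mf(x, mean=50.0, sigma=10.0) for x in (30, 40, 50, 60, 70)}

for x, grade in grades.items():
    print(f"x = {x}: membership grade = {grade:.3f}")
```

Each point receives exactly one crisp grade. A T2 FS relaxes precisely this: the grade itself is allowed to be uncertain.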
Shukla et al. (2020) conducted an extensive bibliometric analysis of the publications on T2 FSs. The authors found that the field of T2 FSs has grown tremendously during the last two decades; it is now a mature research field and should no longer be referred to as an emerging area. Mendel (2013) explained the details of general type-2 FSs (GT2 FSs). There are several applications where GT2 FSs have been utilized effectively, at the cost of added computational overhead (Shukla and Muhuri, 2019, Wu and Mendel, 2018, Mendel, 2013). One popular kind of T2 FS is the interval type-2 fuzzy set (IT2 FS), defined in Liang and Mendel (2000) and Wu and Mendel (2019), which has been used extensively in various applications (Shukla et al., 2017, Gaxiola et al., 2014, Olivas et al., 2019, Olivas et al., 2016, Olivas et al., 2017, Sanchez et al., 2015, Jarraya et al., 2016, Muhuri and Shukla, 2017, Baklouti et al., 2018, Soto et al., 2018). IT2 FSs are simpler to represent and computationally easier to handle because they have a unit secondary membership value throughout the domain. An IT2 FS is characterized by its footprint of uncertainty (FOU), which is bounded by two membership functions called the upper membership function (UMF) and the lower membership function (LMF). This FOU, when defuzzified, gives a single point that can be used to represent all the points modeled in the FOU. Utilizing this concept, we study instance reduction in big datasets using FOUs rather than clustering. The reason is that, if modeled properly, a FOU has the potential to encompass all the points within a particular vicinity and to accommodate all future points generated within its spread. Thus, FOUs may not only help in instance reduction but also handle the veracity of big datasets, including the uncertainty caused by their exponential growth.
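The idea of a FOU bounded by an LMF and a UMF, and of reducing it to one representative point, can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction used in the paper: the FOU is assumed to be Gaussian with a shared mean and an uncertain standard deviation (the `spread` parameter is hypothetical), and defuzzification uses the Nie-Tan closed-form method, chosen here only for brevity.

```python
import math


def gaussian_mf(x, mean, sigma):
    """Gaussian membership grade in [0, 1]."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def build_fou(points, spread=0.5):
    """Return (lmf, umf) for a Gaussian FOU fitted to the points.

    Assumption for this sketch: both bounds share the sample mean; the UMF
    uses sigma * (1 + spread) and the LMF uses sigma * (1 - spread).
    """
    n = len(points)
    mean = sum(points) / n
    sigma = math.sqrt(sum((p - mean) ** 2 for p in points) / n) or 1.0

    def lmf(x):
        return gaussian_mf(x, mean, sigma * (1 - spread))

    def umf(x):
        return gaussian_mf(x, mean, sigma * (1 + spread))

    return lmf, umf


def nie_tan_defuzzify(lmf, umf, lo, hi, steps=1000):
    """Nie-Tan method: centroid of the average of the LMF and the UMF."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    weights = [(lmf(x) + umf(x)) / 2 for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


# Six nearby instances are reduced to one representative value.
points = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]
lmf, umf = build_fou(points)
representative = nie_tan_defuzzify(lmf, umf, min(points) - 3, max(points) + 3)
print(f"representative point: {representative:.4f}")
```

Because the LMF is narrower than the UMF around the same mean, the LMF never exceeds the UMF, and the region between them plays the role of the FOU. A new instance that falls inside this region does not require the FOU to be rebuilt.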
The major contributions of this paper are as follows:
- (1) This paper proposes a novel method to handle the veracity characteristic of Big Data with the help of type-2 fuzzy sets.
- (2) The proposed method efficiently handles veracity issues and reduces the instances in a big dataset to a manageable extent.
- (3) The method is scalable to ever-growing big datasets, as a FOU can accommodate continuously generated streaming data within its vicinity bounded by the LMF and UMF.
- (4) The outcome of applying FOUs is compared with existing clustering-based methods. The relationship between the clusters and the FOUs is studied by comparing their centroids and defuzzified values, respectively.
- (5) To examine the validity of our results, we consider two large datasets: the Census Dataset and the Household Power Consumption Dataset. Further, the scalability of the proposed approach is validated by adding a large number of instances to the datasets.
- (6) Three different approaches are considered for the experiments:
  - 1) Dataset as a Cluster & FOU (DC-FOU): The first approach clusters the datasets using the k-means algorithm and, for each cluster, generates its equivalent FOU. The cluster centroid and the defuzzified centroid of the FOU are compared to explore the possibility of any similarity.
  - 2) Attribute as a Cluster & FOU (AC-FOU): In the second approach, each attribute is treated as a cluster and its corresponding FOU is generated to compare the results.
  - 3) Support Vector Regression to assess Cluster & FOU (SVRC-FOU): The third approach deals with the application side of the proposed approaches, where a machine learning algorithm, support vector regression, is used to assess the first two approaches.
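The DC-FOU comparison described in the list above can be sketched on one-dimensional toy data. Everything specific here is an assumption for illustration: the deterministic k-means initialisation, the triangular FOU (feet at the cluster minimum and maximum, peak at the median, LMF as a scaled-down UMF), the Nie-Tan defuzzification, and all function names. The paper's own FOU generation method is not shown in this excerpt.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd iterations on 1-D data; returns (centroids, clusters)."""
    ordered = sorted(values)
    # Deterministic initialisation: evenly spaced picks from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


def triangular_mf(x, a, b, c):
    """Triangle with feet a and c and peak b (a <= b <= c, a < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fou_defuzzified(cluster, lmf_height=0.5, steps=2000):
    """Nie-Tan defuzzified value of a hypothetical triangular FOU for one cluster."""
    a, c = min(cluster), max(cluster)
    if a == c:  # degenerate cluster: a single distinct value
        return a
    b = sorted(cluster)[len(cluster) // 2]  # median-like peak
    xs = [a + (c - a) * i / steps for i in range(steps + 1)]
    umf = [triangular_mf(x, a, b, c) for x in xs]
    lmf = [lmf_height * u for u in umf]  # LMF lies below the UMF everywhere
    weights = [(lo + up) / 2 for lo, up in zip(lmf, umf)]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


data = [1.0, 0.8, 1.7, 0.9, 1.1, 9.0, 9.3, 8.7, 9.1, 8.9]
centroids, clusters = kmeans_1d(data, k=2)
comparison = [(c, fou_defuzzified(pts)) for c, pts in zip(centroids, clusters) if pts]
for centroid, defuzzified in comparison:
    print(f"k-means centroid = {centroid:.3f}, FOU defuzzified value = {defuzzified:.3f}")
```

On this toy data the two values are close for each cluster but need not coincide, since the FOU summarises the spread of the cluster rather than its arithmetic mean. An AC-FOU-style variant would apply `fou_defuzzified` to each attribute column instead of to each cluster.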
The rest of the paper is organized as follows. Section 2 provides the literature survey and discusses the motivation for the proposed work. Section 3 presents the related mathematical background and basic definitions. Section 4 discusses the details of all the proposed approaches. Section 5 presents the simulation and experimental results. Finally, Section 6 concludes the paper with a discussion of the overall findings of the work and its future scope.
Section snippets
Literature review
This section briefly discusses the existing works that are relevant to our proposed problem.
Data has been increasing over a very long period. Much can be done with this amount of data, such as prediction and extraction. Zhai et al. (2014) presented a comprehensive study on big data research and its possible application areas. The authors also explained the underexplored area of big data dimensionality and its scope in the field of computational intelligence (CI). Jin and
Mathematical background and definitions
In this section, we give the basic definitions and mathematical formulations of the terms and techniques used in this paper.
Proposed architecture
The proposed architecture is divided into three different approaches to study the nature and behavior of the clusters and the FOUs. In the first approach, Dataset as a Cluster & FOU (DC-FOU), the dataset is divided into groups or clusters using the k-means clustering algorithm, and for each cluster its equivalent FOU is generated. The cluster centroid and the defuzzified centroid of the FOU are compared to explore the possibility of any similarity. In the second approach, Attribute as
Simulations and experimental results
We have performed a number of experiments on our proposed approaches and evaluate the results in this section. For the experiments, we consider two large datasets: the US Census 1990 dataset (Anon, 1990) and the household power consumption dataset (Anon, 2012).
Conclusion
This paper has discussed a novel way to handle the veracity issue of big data based on the concept of type-2 fuzzy sets. More importantly, it introduces the concept of reducing big data instances with the help of footprints of uncertainty, which are generated when modeling T2 FSs. The instance reduction ability of the proposed FOU-based method is compared with that of an existing clustering algorithm. FOUs are capable of encompassing all the points in the specific vicinity bounded by the LMF and UMF. The
Acknowledgments
The authors gratefully acknowledge the valuable comments received from the reviewers and the editors, which have helped them improve the paper significantly. The third author is grateful to the Department of Science and Technology, Government of India, for financial support in the form of an INSPIRE Fellowship.
References (70)
- et al. Entity reconciliation in big data sources: A systematic mapping study. Expert Syst. Appl. (2017)
- A novel big data analytics framework for smart cities. Future Gener. Comput. Syst. (2019)
- et al. Context-aware data quality assessment for big data. Future Gener. Comput. Syst. (2018)
- et al. Big-data clustering with interval type-2 fuzzy uncertainty modeling in gene expression datasets. Eng. Appl. Artif. Intell. (2019)
- Fuzzy sets. Inf. Control (1965)
- et al. Recommendations on designing practical interval type-2 fuzzy systems. Eng. Appl. Artif. Intell. (2019)
- et al. Interval type-2 fuzzy weight adjustment for backpropagation neural networks with application in time series prediction. Inform. Sci. (2014)
- et al. Interval type-2 fuzzy logic for dynamic parameter adaptation in a modified gravitational search algorithm. Inform. Sci. (2019)
- et al. Ant colony optimization with dynamic parameter adaptation based on interval type-2 fuzzy logic systems. Appl. Soft Comput. (2017)
- et al. Generalized type-2 fuzzy systems for controlling a mobile robot and a performance comparison with interval type-2 and type-1 fuzzy systems. Expert Syst. Appl. (2015)
- Semi-elliptic membership function: Representation, generation, operations, defuzzification, ranking and its application to the real-time task scheduling problem. Eng. Appl. Artif. Intell.
- A beta basis function interval type-2 fuzzy neural network for time series applications. Eng. Appl. Artif. Intell.
- A novel parallel distance metric-based approach for diversified ranking on large graphs. Future Gener. Comput. Syst.
- A novel fuzzy similarity measure and prevalence estimation approach for similarity profiled temporal association pattern mining. Future Gener. Comput. Syst.
- Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput.
- Context-aware decision making under uncertainty for voice-based control of smart home. Expert Syst. Appl.
- FGSN: fuzzy granular social networks–model and applications. Inform. Sci.
- Time aware knowledge extraction for microblog summarization on twitter. Inf. Fusion
- A review on type-2 fuzzy logic applications in clustering, classification and pattern recognition. Appl. Soft Comput.
- Fuzzy granular gravitational clustering algorithm for multivariate data. Inform. Sci.
- ClusFuDE: Forecasting low dimensional numerical data using an improved method based on automatic clustering, fuzzy relationships and differential evolution. Eng. Appl. Artif. Intell.
- Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit.
- Interval type-2 fuzzy membership function generation methods for pattern recognition. Inform. Sci.
- Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit.
- Business intelligence and analytics: From big data to big impact. MIS Q.
- General Type-2 fuzzy decision making and its application to travel time selection. J. Intell. Fuzzy Systems
- 3D data management: Controlling data volume, velocity and variety. META Group Res. Note
- Harness the Power of Big Data the IBM Big Data Platform
- Interpretability Issues in Fuzzy Modeling, Vol. 128
- A bibliometric overview of the field of type-2 fuzzy sets and systems. IEEE Comput. Intell. Mag.
- General type-2 fuzzy logic systems made simple: a tutorial. IEEE Trans. Fuzzy Syst.
- Similarity measures for closed general type-2 fuzzy sets: overview, comparisons, and a geometric approach. IEEE Trans. Fuzzy Syst.
- Interval type-2 fuzzy logic systems: theory and design. IEEE Trans. Fuzzy Syst.
- NSGA-II based multi-objective pollution routing problem with higher order uncertainty
- Dynamic parameter adaptation in particle swarm optimization using interval type-2 fuzzy logic. Soft Comput.
☆ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103315.