Veracity handling and instance reduction in big data using interval type-2 fuzzy sets

doi:10.1016/j.engappai.2019.103315

Engineering Applications of Artificial Intelligence

Volume 88, February 2020, 103315

https://doi.org/10.1016/j.engappai.2019.103315

Abstract

In the context of big data, veracity refers to the uncertainty present in a dataset. A continuous flow of unstructured, noisy data may introduce abnormalities that make the dataset unusable. In this paper, we propose a novel method to handle the veracity characteristic of big data using the concept of the footprint of uncertainty (FOU) in interval type-2 fuzzy sets (IT2 FSs). The proposed method handles the veracity issue in big data and reduces the instances to a manageable extent. We compare the results with existing clustering-based methods and examine the relationship between the clusters and the FOUs by comparing their centroids and defuzzified values. To scrutinize the validity of our results, we also perform a number of additional experiments by appending extra instances to the datasets. To check its consistency and efficacy, the proposed methodology is assessed from three different aspects. Experimental results validate that the proposed method can suitably handle the veracity issue in big datasets and is efficient in reducing the instances.

Introduction

With advances in technology, the number of data sources is increasing steadily. As a result, the amount of data generated each day is also increasing. Information systems produce a mammoth quantity of records every second, every hour and every day. The term “Big Data” is used to describe data that is so large, and growing so rapidly, that present analytical tools are not sufficient to retrieve information from it (Chen et al., 2012). However, the growth in capacity is limited by the evolution of hardware and technologies, while the rise in data volume is effectively unlimited. Recently, various notable methods have been developed in the literature to deal with big data (Shukla and Muhuri, 2019, Enríquez et al., 2017, Osman, 2019, Ardagna et al., 2018, Shukla and Muhuri, 2019). The characterization of big data was first described by industry analyst Doug Laney (Laney, 2001). He proposed Volume, Variety, and Velocity as the three dimensions of big data and collectively termed them the Three V’s (3V’s). The other dimensions, identified in subsequent years, are veracity, variability and value. Thus, collectively there are six dimensions of big data (Chen et al., 2012, Zikopoulos et al., 2012), as presented in Fig. 1.

The above 6 V’s are briefly described below:

1. Volume refers to the size or scale of data and is the most obvious characteristic of big data. With the significant increase in the sources of data, the volume of data keeps increasing exponentially. Big data sizes are reported in multiple petabytes and exabytes.
2. Variety refers to the different formats of data: structured data that resides in relational databases and file systems, and unstructured data such as text documents, email, video and audio. Big data has to be connected and correlated during the analysis phase in order to extract useful information from it.
3. Velocity is defined as the sheer rate at which data arrives at a particular instant of time; it can be viewed from multiple angles.
4. Veracity, a term coined by IBM, is defined as the ambiguity of the data. For example, many companies that deal with product reviews cannot fully trust the reviews written by reviewers. Thus, there is a need for new tools that can deal with uncertainty.
5. Variability, introduced by SAS, refers to variations in the data flow rates. Since data flow rates (i.e., big data velocity) are not consistent and change frequently, there is a pressing need to pair, clean and remodel the data collected from various sources.
6. Value: big data is considered to have low value density, as articulated by Oracle. Data in its original form has low value relative to its volume. If large volumes of such data are analyzed, high value can be attained.

For many real-world applications, the sudden availability of huge datasets has made information extraction and data analysis more convenient. The most widely used technique for retrieving information is clustering, which groups items with similar features together; the centroid of each cluster acts as its representative point. Further information may be derived from the cluster centroid, which is treated as a sampled data point. However, clustering algorithms require a significant amount of computation because of the associated similarity/dissimilarity calculations. In addition, as the dataset grows in size, the distance computations in clustering add further computational overhead. Behavior similar to that of a cluster centroid can be observed in the defuzzified value of type-2 fuzzy sets (T2 FSs). T2 FSs are of higher order than the traditional fuzzy sets (FSs), or type-1 fuzzy sets (T1 FSs), originally proposed by Zadeh in 1965. T1 FSs were used to handle the uncertainty due to multiple interpretations of a point. A T1 FS associates the uncertain parameters with a membership function that takes values within [0, 1]. However, T1 FSs suffer from interpretability issues (Casillas et al., 2013) because the membership value itself is still uncertain. T2 FSs were introduced to resolve this.
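To make the idea of a centroid acting as the representative of its cluster concrete, the following is a minimal sketch of plain k-means (Lloyd's algorithm) in standard-library Python. The function name `kmeans`, the deterministic initialisation and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm on a list of tuples; returns (centroids, labels)."""
    # Assumption for reproducibility: seed the centroids with k points
    # taken at evenly spaced positions in the input order.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: every point is compared with every centroid.
        # These distance computations are the cost that grows with the data.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:  # keep the old centroid of an empty cluster
                new_centroids.append(centroids[j])
                continue
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well separated groups of three points each.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids, labels = kmeans(data, k=2)
print(labels)     # cluster index of each point
print(centroids)  # two points that stand in for the six originals
```

Here six instances are reduced to two representative points. The assignment step computes `len(points) * k` distances per iteration, which is the overhead the paragraph above refers to.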

Shukla et al. (2020) conducted an extensive bibliometric analysis of the publications on T2 FSs. The authors found that the field of T2 FSs has grown tremendously during the last two decades; it is now a mature research field and should no longer be referred to as an emerging area. Mendel (2013) explained the details of general type-2 FSs (GT2 FSs). There are several applications where GT2 FSs have been effectively utilized, at the cost of added computational overhead (Shukla and Muhuri, 2019, Wu and Mendel, 2018, Mendel, 2013). One popular class of T2 FSs is the interval type-2 fuzzy sets (IT2 FSs), defined in Liang and Mendel (2000) and Wu and Mendel (2019), which have been used extensively in various applications (Shukla et al., 2017, Gaxiola et al., 2014, Olivas et al., 2019, Olivas et al., 2016, Olivas et al., 2017, Sanchez et al., 2015, Jarraya et al., 2016, Muhuri and Shukla, 2017, Baklouti et al., 2018, Soto et al., 2018). IT2 FSs are simpler to represent and computationally cheaper because they have a unit secondary membership value throughout the domain. IT2 FSs are characterized by the footprint of uncertainty (FOU), which is bounded by two membership functions called the upper membership function (UMF) and the lower membership function (LMF). This FOU, when defuzzified, gives a single point that can be used to represent all the points covered by the FOU. Utilizing this concept, we study instance reduction in big datasets using FOUs rather than clustering. This is because, if modeled properly, an FOU has the potential to encompass all the points within a particular vicinity and to accommodate future points generated within its spread. Thus, FOUs may not only help in instance reduction but also handle the veracity of big datasets, including the uncertainty caused by their exponential growth.
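As a rough illustration of how an FOU bounded by an LMF and a UMF can collapse to a single representative point, the sketch below builds an FOU from a Gaussian membership function with an uncertain standard deviation and defuzzifies it with the Nie-Tan method (the centroid of the average of the LMF and UMF). Both choices are assumptions made for this example: this excerpt does not state how the paper generates or defuzzifies its FOUs, and the names `build_fou`, `nie_tan_defuzzify` and the `spread` parameter are hypothetical.

```python
from math import exp
from statistics import mean, stdev


def gaussian(x, m, s):
    """Gaussian membership grade of x for mean m and standard deviation s."""
    return exp(-0.5 * ((x - m) / s) ** 2)


def build_fou(values, spread=0.5):
    """Return (lmf, umf) callables that bound an assumed Gaussian FOU.

    The mean is fixed and the standard deviation is treated as uncertain:
    the narrower Gaussian is the LMF and the wider one is the UMF, so
    lmf(x) <= umf(x) holds everywhere.
    """
    m, s = mean(values), stdev(values)
    s_lo, s_hi = s * (1 - spread), s * (1 + spread)
    return (lambda x: gaussian(x, m, s_lo)), (lambda x: gaussian(x, m, s_hi))


def nie_tan_defuzzify(lmf, umf, lo, hi, n=1001):
    """Centroid of the average of LMF and UMF, sampled on n grid points."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    weights = [(lmf(x) + umf(x)) / 2 for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


# Hypothetical readings of one attribute.
values = [4.8, 5.1, 5.0, 4.9, 5.2]
lmf, umf = build_fou(values)
point = nie_tan_defuzzify(lmf, umf, lo=3.0, hi=7.0)
print(point)  # a single value standing in for all five readings
```

Because this particular FOU is symmetric about the sample mean, the defuzzified point coincides with the mean up to numerical error. An iterative type-reduction method such as Karnik-Mendel could be substituted for the Nie-Tan step.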

The major contributions of this paper are as follows:

(1) This paper proposes a novel method to handle the veracity characteristic of big data with the help of type-2 fuzzy sets.
(2) The proposed method efficiently handles veracity issues and reduces the instances in big data to a manageable extent.
(3) The method is scalable to ever-growing big datasets, as an FOU can accommodate continuously generated streaming data within its vicinity bounded by the LMF and UMF.
(4) The outcome after applying FOUs is compared with existing clustering-based methods. The relationship between the clusters and the FOUs is studied by comparing their centroids and defuzzified values, respectively.
(5) To examine the validity of our results, we have considered two large datasets: the Census dataset and the Household Power Consumption dataset. Further, the scalability of the proposed approach is validated by adding a large number of instances to the datasets.
(6) Three different approaches are considered for the experiments.
- 1) Dataset as a Cluster & FOU (DC-FOU): The first approach clusters the datasets using the k-means algorithm, and for each cluster, its equivalent FOU is generated. The cluster centroid and the defuzzified centroid of the FOU are compared to explore the possibility of any similarity.
- 2) Attribute as a Cluster & FOU (AC-FOU): In the second approach, each attribute is simulated as a cluster and its corresponding FOU is generated to compare the results.
- 3) Support Vector Regression to assess Cluster & FOU (SVRC-FOU): The third approach deals with the application of the proposed approaches, where we use a machine learning algorithm, support vector regression, to assess the first two approaches.
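The comparison described for the first approach (cluster centroid versus defuzzified FOU value) can be sketched as follows. This is only an illustration under stated assumptions: the clusters are taken as already labelled (for example by k-means), the FOU of each attribute is an assumed pair of triangular membership functions, and defuzzification uses the centroid of the averaged LMF and UMF. The helper names `tri`, `fou_defuzzified`, `compare` and the parameters `shrink` and `lmf_height` are hypothetical and do not come from the paper.

```python
from statistics import mean


def tri(x, a, b, c, height=1.0):
    """Triangular membership with feet a and c, peak b and the given height."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return height * (x - a) / (b - a)
    return height * (c - x) / (c - b)


def fou_defuzzified(values, shrink=0.5, lmf_height=0.8, n=2001):
    """Defuzzified value of an assumed triangular FOU built from the values."""
    lo, hi, peak = min(values), max(values), mean(values)
    if lo == hi:  # degenerate attribute: every value is identical
        return lo
    # UMF: full-height triangle over the observed range, peaked at the mean.
    # LMF: lower triangle with the same peak and feet pulled towards the peak,
    # which keeps the LMF below the UMF everywhere.
    lo_in = peak - shrink * (peak - lo)
    hi_in = peak + shrink * (hi - peak)
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    weights = [
        (tri(x, lo, peak, hi) + tri(x, lo_in, peak, hi_in, lmf_height)) / 2
        for x in xs
    ]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


def compare(clusters):
    """Map each cluster label to a list of (centroid, defuzzified) per attribute."""
    result = {}
    for label, points in clusters.items():
        columns = list(zip(*points))  # one tuple of values per attribute
        result[label] = [(mean(col), fou_defuzzified(col)) for col in columns]
    return result


# Hypothetical clusters, assumed to come from a prior k-means run.
clusters = {
    0: [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0)],  # symmetric attributes
    1: [(8.0, 1.0), (8.5, 1.5), (11.0, 5.0)],    # skewed attributes
}
for label, pairs in compare(clusters).items():
    for attr, (centroid, defuzz) in enumerate(pairs):
        print(label, attr, round(centroid, 3), round(defuzz, 3))
```

With this construction the two values agree for symmetric attributes and drift apart for skewed ones. Treating every attribute of the whole dataset as its own cluster, as in the second approach, would amount to calling `fou_defuzzified` once per column.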

The rest of the paper is organized as follows. Section 2 provides the literature survey and discusses the motivation of the proposed work. In Section 3, the related mathematical background and basic definitions are presented. Section 4 discusses the details of all the proposed approaches. We then present the simulation and experimental results in Section 5. Finally, Section 6 concludes the paper with a discussion of the overall findings of the work and its future scope.

Section snippets

Literature review

This section briefly discusses the existing works that are relevant to our proposed problem.

Data volumes have been increasing over a very long period. Many things can be done with this much data, such as prediction and extraction. Zhai et al. (2014) presented a comprehensive study on big data research and its possible application areas. The authors also explained the under-explored area of big data dimensionality and its scope in the field of computational intelligence (CI). Jin and

Mathematical background and definitions

In this section, we give the basic definitions and mathematical formulations related to the terms and techniques used in this paper.

Proposed architecture

The proposed architecture is divided into three different approaches to study the nature and behavior of the clusters and the FOUs. In the first approach, Dataset as a Cluster & FOU (DC-FOU), the dataset is divided into groups or clusters using the k-means clustering algorithm, and for each cluster its equivalent FOU is generated. The cluster centroid and the defuzzified centroid of the FOU are compared to explore the possibility of any similarity. In the second approach, Attribute as

Simulations and experimental results

We have performed a number of experiments on our proposed approaches and evaluate the results in this section. For the experiments, we have considered two large datasets. One is the US Census 1990 dataset (Anon, 1990) and the other is the household power consumption dataset (Anon, 2012).

Conclusion

This paper has discussed a novel way to handle the veracity issue of big data based on the concept of type-2 fuzzy sets. More importantly, it introduces the concept of reducing big data instances with the help of the footprint of uncertainty, which is generated when modeling T2 FSs. The instance reduction ability of the proposed FOU-based method is compared with that of an existing clustering algorithm. FOUs are capable of encompassing all the points in the specific vicinity bounded by the LMF and UMF. The

Acknowledgments

The authors gratefully acknowledge the valuable comments received from the reviewers and the editors, which have helped them improve the paper significantly. The third author is grateful to the Department of Science and Technology, Government of India, for financial support in the form of an INSPIRE Fellowship.

References (70)

Enríquez J.G. et al. Entity reconciliation in big data sources: A systematic mapping study. Expert Syst. Appl. (2017)
Osman A.M.S. A novel big data analytics framework for smart cities. Future Gener. Comput. Syst. (2019)
Ardagna D. et al. Context-aware data quality assessment for big data. Future Gener. Comput. Syst. (2018)
Shukla A.K. et al. Big-data clustering with interval type-2 fuzzy uncertainty modeling in gene expression datasets. Eng. Appl. Artif. Intell. (2019)
Zadeh L.A. Fuzzy sets. Inf. Control (1965)
Wu D. et al. Recommendations on designing practical interval type-2 fuzzy systems. Eng. Appl. Artif. Intell. (2019)
Gaxiola F. et al. Interval type-2 fuzzy weight adjustment for backpropagation neural networks with application in time series prediction. Inform. Sci. (2014)
Olivas F. et al. Interval type-2 fuzzy logic for dynamic parameter adaptation in a modified gravitational search algorithm. Inform. Sci. (2019)
Olivas F. et al. Ant colony optimization with dynamic parameter adaptation based on interval type-2 fuzzy logic systems. Appl. Soft Comput. (2017)
Sanchez M.A. et al. Generalized type-2 fuzzy systems for controlling a mobile robot and a performance comparison with interval type-2 and type-1 fuzzy systems. Expert Syst. Appl. (2015)
Muhuri P.K. et al. Semi-elliptic membership function: Representation, generation, operations, defuzzification, ranking and its application to the real-time task scheduling problem. Eng. Appl. Artif. Intell. (2017)
Baklouti N. et al. A beta basis function interval type-2 fuzzy neural network for time series applications. Eng. Appl. Artif. Intell. (2018)
Li J. et al. A novel parallel distance metric-based approach for diversified ranking on large graphs. Future Gener. Comput. Syst. (2018)
Radhakrishna V. et al. A novel fuzzy similarity measure and prevalence estimation approach for similarity profiled temporal association pattern mining. Future Gener. Comput. Syst. (2018)
Tripathy B.K. et al. Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis. Appl. Soft Comput. (2016)
Chahuara P. et al. Context-aware decision making under uncertainty for voice-based control of smart home. Expert Syst. Appl. (2017)
Kundu S. et al. FGSN: fuzzy granular social networks–model and applications. Inform. Sci. (2015)
De Maio C. et al. Time aware knowledge extraction for microblog summarization on twitter. Inf. Fusion (2016)
Melin P. et al. A review on type-2 fuzzy logic applications in clustering, classification and pattern recognition. Appl. Soft Comput. (2014)
Sanchez M.A. et al. Fuzzy granular gravitational clustering algorithm for multivariate data. Inform. Sci. (2014)
Gupta C. et al. ClusFuDE: Forecasting low dimensional numerical data using an improved method based on automatic clustering, fuzzy relationships and differential evolution. Eng. Appl. Artif. Intell. (2018)
Hathaway R.J. et al. Scalable visual assessment of cluster tendency for large data sets. Pattern Recognit. (2006)
Choi B.I. et al. Interval type-2 fuzzy membership function generation methods for pattern recognition. Inform. Sci. (2009)
Chen H. et al. Business intelligence and analytics: From big data to big impact. MIS Q. (2012)
Shukla A.K. et al. General Type-2 fuzzy decision making and its application to travel time selection. J. Intell. Fuzzy Systems (2019)
Laney D. 3D data management: Controlling data volume, velocity and variety. META Group Res. Note (2001)
Zikopoulos P. et al. Harness the Power of Big Data: The IBM Big Data Platform (2012)
Interpretability Issues in Fuzzy Modeling, Vol. 128 (2013)
Shukla A.K. et al. A bibliometric overview of the field of type-2 fuzzy sets and systems. IEEE Comput. Intell. Mag. (2020)
Mendel J.M. General type-2 fuzzy logic systems made simple: a tutorial. IEEE Trans. Fuzzy Syst. (2013)
Wu D. et al. Similarity measures for closed general type-2 fuzzy sets: overview, comparisons, and a geometric approach. IEEE Trans. Fuzzy Syst. (2018)
Liang Q. et al. Interval type-2 fuzzy logic systems: theory and design. IEEE Trans. Fuzzy Syst. (2000)
Shukla A.K. et al. NSGA-II based multi-objective pollution routing problem with higher order uncertainty
Olivas F. et al. Dynamic parameter adaptation in particle swarm optimization using interval type-2 fuzzy logic. Soft Comput. (2016)


^☆: No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103315.
