Elsevier

Information Sciences

Volume 539, October 2020, Pages 104-135

Object similarity measures and Pawlak’s indiscernibility on decision tables

https://doi.org/10.1016/j.ins.2020.05.030

Highlights

  • We consider comparisons between granularities induced by object similarity measures and classical Pawlak indiscernibility on decision tables.

  • We interpret object similarity measures as potential refinements of the granularity induced by the indiscernibility relation.

  • We introduce the notion of (A, ρ, ν)-object measure, where A is an attribute subset, and ρ and ν are collections of numerical maps depending on pairs of values.

  • We apply the general results of the first part of our work to the study of several classical per-attribute similarities.

  • We analyze the relationship between granularity, the (A, ρ, ν)-object measures and the previous specific per-attribute similarities.

Abstract

In this paper we investigate the mathematical foundations of the notion of similarity between objects in relation to the granulations on a decision table D. First of all, we compare the endogenous granulation induced by Pawlak's indiscernibility with the exogenous granulation induced by a similarity measure ζ defined on pairs of objects and assuming values in the unit interval. To this aim, the starting point of our analysis is the introduction of the notion of refinement of the granulation induced by an attribute subset A through the object similarity measure ζ. More in detail, we say that ζ refines the granulation induced by A if ζ assumes value 1 on a pair of objects if and only if they are A-indiscernible. Next, starting from two given families ρ and ν of numerical maps defined on pairs of admissible values of D, we determine a broad class of potential similarity measures on the objects of D refining, sometimes under specific additional hypotheses, the A-granulation on the object set of D. For this class of similarity measures, we establish several mathematical properties. Finally, we focus our attention on the analysis of specific pairs of numerical map families ρ and ν that have been classically studied in the literature and, for each of them, we exhibit the main properties with respect to the aforementioned refinement of granulation.

Introduction

The notion of similarity is a fundamental tool in data mining [3], [4], where it has been used in the shared nearest neighbor (SNN) clustering approach [28] (a methodology useful for finding clusters of different shapes, sizes and densities in high-dimensional data [25]), document clustering [17], information retrieval [45] and anomaly detection [1]. Furthermore, this notion also occurs in various branches of theoretical computer science, such as granular computing [32], [36] and rough set theory (briefly RST) [15], [27], [44].

More in detail, in RST similarity is closely related to the analysis of approximations of subsets of a given object set U, which in many contexts agrees with the object set of some decision table D.

The underlying idea of RST is the following: based on a given clustering criterion, with any element u ∈ U we associate an object subset N(u) (called a neighborhood in [24], [44]), and we use these neighborhoods to construct inner and outer approximations of specific subsets of U.

In particular, whenever an equivalence relation R on U is assigned, a classical example of neighborhood N(u) is the equivalence class [u]_R of each element u ∈ U. Clearly, in this case we also have the further property that U = ∪{N(u) : u ∈ U}, i.e. U is the union of the neighborhoods of all its elements. In other terms, the neighborhood family {N(u) : u ∈ U} is a covering of U.

However, in many more general situations, from an intuitive point of view, the fact that an object v belongs to N(u) means that u and v are similar in relation to some fixed quantitative measurement, which does not necessarily induce an equivalence relation on U. In these more general cases, with any object u ∈ U we associate not just a single neighborhood N(u), but a whole neighborhood system N(u), whose members are neighborhoods N_r(u) of u, depending on a numerical parameter r.

The notion of neighborhood system occurs in a natural way within numerous research scopes, such as fuzzy set theory [18], [48], database theory [16], [23], [24], social networks theory [14], decision theory [20], graph theory [8].

A classical example of neighborhood system occurs in RST, where we start with the indiscernibility relation induced by a fixed attribute subset, and next we construct the so-called lower and upper approximations of any object subset X, by means of which we obtain the usual notion of Pawlak’s exactness [29], [31], [46], [47].

On the other hand, starting from a covering of U, Zhu [50] introduced a new kind of neighborhood, which made it possible to consider new covering-based rough sets; these have recently been investigated from both a matroidal and a lattice-theoretic point of view [2], [5], [9], [19], [21], [26], [38], [39].

In general, a neighborhood system N(u) describes many possible ways for another object v ∈ U to be similar to u. In other terms, N(u) outlines several levels of similarity with u, each in relation to a given choice of the numerical parameter r.

In concrete cases one expects that there exists an acceptable numerical parameter r for which u ∈ N_r(u) ∈ N(u), for any u ∈ U. With such a further assumption of reflexivity at the level r, it follows that the neighborhood family N_r(U) ≔ {N_r(u) : u ∈ U} is a covering of U.

Clearly, it is also natural to require a condition of symmetry with respect to the similarity between any two objects: if u is similar to v relative to a fixed numerical parameter r, then v is also similar to u relative to the same r. In terms of neighborhoods: if v ∈ N_r(u), then u ∈ N_r(v).

A reflexive and symmetric relation is sometimes called a tolerance relation [35], [36], [37], [41], and it has been linked with the problem of reducing the number of attributes, where the aim consists of deleting as many incompatibilities and redundancies as possible.

Another way to define a neighborhood system associated with any element u ∈ U has been provided in [3], [27], [49], where a measure ζ : U × U → [0,1], a threshold r ∈ [0,1], the neighborhoods of the form N(ζ|u,r) ≔ {v ∈ U : ζ(u,v) ≥ r} and the corresponding neighborhood system N(ζ|u) ≔ {N(ζ|u,r) : r ∈ [0,1]} are given.
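The threshold-based neighborhoods just described can be sketched in a few lines of code. The universe, the measure `zeta`, and the thresholds below are illustrative assumptions, not taken from the paper; the sketch only makes explicit that N(ζ|u,r) collects all objects with similarity at least r to u, and that smaller thresholds yield (weakly) larger neighborhoods.

```python
# Sketch: neighborhoods N(zeta | u, r) induced by a similarity measure
# and a threshold. Toy universe and measure are illustrative assumptions.

def neighborhood(zeta, U, u, r):
    """N(zeta | u, r) = {v in U : zeta(u, v) >= r}."""
    return {v for v in U if zeta(u, v) >= r}

def neighborhood_system(zeta, U, u, thresholds):
    """The family {N(zeta | u, r) : r in thresholds}, indexed by r."""
    return {r: neighborhood(zeta, U, u, r) for r in thresholds}

# Toy universe: objects are integers; similarity decays with distance.
U = {0, 1, 2, 3, 4}
zeta = lambda u, v: 1.0 / (1 + abs(u - v))

# Threshold 1 isolates u itself; lower thresholds enlarge the neighborhood.
assert neighborhood(zeta, U, 2, 1.0) == {2}
assert neighborhood(zeta, U, 2, 0.5) == {1, 2, 3}
assert neighborhood(zeta, U, 2, 0.5) <= neighborhood(zeta, U, 2, 0.2)
```

Note that reflexivity of `zeta` (value 1 on the diagonal) guarantees u ∈ N(ζ|u,r) for every r, so the family {N(ζ|u,r) : u ∈ U} is a covering of U.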

These types of neighborhood systems are the starting point of the present paper. In the next subsection we will describe the idea and the motivations underlying our work, highlighting the strict relation between granulation and the measures we will study in this paper. Let us finally point out that the possibility to work with some measures related to granulation has been successfully used in fuzzy set theory [40], [42], [43].

In the present paper, taking inspiration from Yao's terminology [45], given a decision table D with object set U, we call any symmetric numerical map ζ : U × U → [0,1] a D-object proximity measure and, when it also satisfies the condition ζ(u,u) = 1 for each u ∈ U, a D-object similarity measure. In the latter case, it is natural to think of the objects u and v as completely similar with respect to the criterion induced by the map ζ when ζ(u,v) = 1 and, on the contrary, as completely dissimilar when ζ(u,v) = 0.

Now, as we have already remarked in the previous subsection, when a D-object proximity measure ζ, an object u and a threshold r in the unit interval [0,1] are given, it is usual to consider the neighborhood N(ζ|u,r) of all objects v ∈ U having similarity at least r with respect to u and, moreover, the corresponding neighborhood system N(ζ|r) ≔ {N(ζ|u,r) : u ∈ U}.

The first aim of our work is to provide a detailed mathematical investigation of most of the comparisons that are implicitly made in the literature between the neighborhood system N(ζ|r) and the set system of all A-indiscernibility granules. However, in order to provide a suitable justification for the mathematical formalism used throughout the paper, we now briefly discuss our interpretation of the granulations on U induced by ζ and by Pawlak's A-indiscernibility, respectively.

In this regard, it is worth noticing that the A-indiscernibility relation classifies the A-granules in a quite rigid way since, given two objects u and v, one can only establish whether u and v are A-indiscernible (in symbols, u ≡_A v) or not, without quantifying, in the case of discernibility, how discernible they are. In formal terms, we can consider the D-object similarity measure ind_A : U × U → [0,1] defined by

ind_A(u,v) ≔ 1 if u ≡_A v, and ind_A(u,v) ≔ 0 otherwise.

Hence, as the above object similarity map ind_A takes only two possible values, namely 1 for the maximum similarity level (which corresponds to the indiscernibility between two objects) and 0 for any other form of dissimilarity, the rigidity of the similarity induced by the A-granulation is evident. On the other hand, whenever we have a symmetric map ζ : U × U → [0,1] such that

u ≡_A v ⇔ ζ(u,v) = 1

for any (u,v) ∈ U × U, then we can think of ζ as an object similarity measure which refines the granulation of the object set U induced by ind_A. Let us explain the motivation for such terminology. On the one hand, when this condition holds, the more the number ζ(u,v) approaches 1, the closer the objects u and v are to being A-indiscernible. On the other hand, when we consider any two non-A-indiscernible objects u and v, their level of dissimilarity with respect to the measure ind_A is always 0, while the value ζ(u,v), ranging in [0,1), provides more refined information about the dissimilarity of u and v. Hence, when the D-object similarity measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1, it becomes a more flexible tool for conveying information concerning the A-granulation of U. Furthermore, the aforementioned refinement can be observed from two different perspectives.
  • In the first case, let u ∈ U be fixed. Then, when we choose any two distinct objects w, w′ ∈ U ∖ [u]_A, from a mere examination of the A-indiscernibility partition we are not able to state which of the two objects has a greater or lesser similarity level with u in relation to the A-granulation of U. In fact, we have that ind_A(u,w) = ind_A(u,w′) = 0. Now, if the object similarity measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1 and, in addition, it also results that 0 ≤ ζ(u,w) < ζ(u,w′) < 1, then we can interpret these inequalities as expressing a better outer similarity between w′ and u, with respect to that between w and u, in relation to the A-granulation of U.

  • In the second case, let u, v ∈ U belong to the same A-indiscernibility granule (that is, u ≡_A v), and let z ∉ [u]_A. Then, if we simply use the measure ind_A, as in the previous case we have that ind_A(u,z) = ind_A(v,z) = 0. On the other hand, as before, if ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1 and, in addition, it also results that 0 ≤ ζ(u,z) < ζ(v,z) < 1, we can interpret these inequalities as expressing a better inner similarity between v and z, with respect to that between u and z, in relation to the A-granulation of U.

Then, based on the interpretation given above, we introduce the following fundamental notion of our work.

Definition 1.1

We say that the D-object similarity measure ζ : U × U → [0,1] refines the A-granulation of U if, for all u, v ∈ U, u ≡_A v ⇔ ζ(u,v) = 1.
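Definition 1.1 can be made concrete with a small computational sketch. The toy decision table below (objects as dictionaries from attribute names to values) and the graded measure `zeta` are illustrative assumptions; the sketch only checks the biconditional "ζ(u,v) = 1 iff u and v are A-indiscernible".

```python
# Sketch of Definition 1.1: zeta refines the A-granulation when
# zeta(u, v) = 1 holds exactly on A-indiscernible pairs.
# Toy table and measures are illustrative assumptions.

def a_indiscernible(u, v, A):
    """u and v agree on every attribute in A."""
    return all(u[a] == v[a] for a in A)

def ind_A(u, v, A):
    """The rigid two-valued measure ind_A induced by A-indiscernibility."""
    return 1.0 if a_indiscernible(u, v, A) else 0.0

def refines(zeta, U, A):
    """True iff zeta(u, v) == 1 <=> u and v are A-indiscernible."""
    return all((zeta(u, v) == 1.0) == a_indiscernible(u, v, A)
               for u in U for v in U)

# Toy universe with two attributes.
U = [{'a': 0, 'b': 0}, {'a': 0, 'b': 1}, {'a': 1, 'b': 1}]
A = ['a', 'b']

# A graded measure: fraction of attributes of A on which u and v agree.
# It hits 1 exactly on A-indiscernible pairs, so it refines the A-granulation.
zeta = lambda u, v: sum(u[x] == v[x] for x in A) / len(A)
assert refines(zeta, U, A)

# By contrast, ind over the smaller subset {'a'} assigns 1 to pairs
# that are discernible over A, so it does not refine the A-granulation.
assert not refines(lambda u, v: ind_A(u, v, ['a']), U, A)
```

The graded values of `zeta` in [0,1) are exactly the "outer/inner similarity" information that the two-valued measure ind_A discards.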

We now conclude the present subsection by discussing a further interpretation of the refinement of the granulation in terms of endogenous granulation and exogenous granulation. In this regard, when we fix an attribute subset A, the object similarity measure ind_A induces on the object set U a type of granulation that it is natural to call endogenous. In fact, such a granulation is obtained simply by taking the A-indiscernibility granules, which are induced directly by the intrinsic nature of the given decision table D. On the other hand, when we choose an appropriate object similarity measure ζ : U × U → [0,1] and a fixed threshold r ∈ [0,1], we can consider the set system N(ζ|r) ≔ {N(ζ|u,r) : u ∈ U}. It easily follows that N(ζ|r) is a covering of U because u ∈ N(ζ|u,r) for all u ∈ U. Therefore we can interpret N(ζ|r) as a granulation of U, where any two granules N(ζ|u,r) and N(ζ|v,r) may also have a non-empty intersection. However, the important point in our discussion is that we can interpret N(ζ|r) as a granulation of an exogenous type. In fact, it is induced by the object similarity measure ζ, whose nature is (a priori) not directly related to the way in which the attributes act on the objects of the given decision table. Thus, in general, it is natural to ask what the links may be between the endogenous granulation induced by ind_A and the exogenous granulation induced by ζ.

In this regard, we can also have more refined interrelations between the object similarity measures ind_A and ζ, beyond the condition u ≡_A v ⇔ ζ(u,v) = 1. However, at a first level of analysis, this condition appears to be one of the most natural starting points from which to undertake a comparison between the two aforementioned endogenous and exogenous granulations.

In fact, if the measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1, then its exogenous granulation of lower threshold 1 coincides with the endogenous A-granulation.

On the other hand, when we choose a threshold r ∈ [0,1), the size of any neighborhood N(ζ|u,r) potentially increases with respect to N(ζ|u,1) = [u]_A. This leads to a situation more complex than the level-1 granulation induced by ζ, corresponding exactly to the greater amount of information that the measure ζ provides with respect to the A-indiscernibility measure ind_A.

To conclude this subsection, let us spend a few words concerning dissimilarity measures. For these measures there is no univocal definition in the literature. However, also in this case the dissimilarity between two objects u and v is usually given by means of a numerical measure ϕ(u,v) of the degree to which the two objects differ. Hence, the dissimilarity is lower for more similar pairs of objects. In particular, we interpret the value ϕ(u,v) = 0 as null dissimilarity (and therefore maximal similarity) between the objects u and v. Therefore, it is natural to consider any metric ϕ : U × U → [0, ∞) as a specific type of object dissimilarity measure. Nevertheless, a formal and complete investigation of object dissimilarity measures in relation to both metrics and refinements of the A-granulation goes far beyond the scope of the present paper, and it will be the subject of future works.

In the first part of this paper, we will work on the mathematical foundations of the previous extension of the A-granulation. This analysis will be undertaken independently of any specific D-object similarity measure, and the corresponding results will be provided in Section 3.

The above notion of refinement of the A-granulation is the basic interpretative idea of the present paper. Nevertheless, once this interpretative standpoint is assumed, a more technical problem arises: how can we determine a broad family of D-object similarity measures refining the A-granulation on the object set U?

In this regard, drawing on and generalizing what has been done in [3], [27], [49], we will introduce a class of D-object proximity measures starting from two given families of numerical maps whose domain agrees with the value set of a decision table.

Let us discuss how to obtain the aforementioned family of D-object proximity measures. Notice, furthermore, that most of the numerical maps we will consider in the paper are frequently used in data mining and related fields (for further details, we refer the reader to [3]).

Given a decision table D with attribute set Ω ≔ Con ∪ Dec, where Con denotes the set of all the condition attributes and Dec that of all the decision attributes, we may define an object proximity (or similarity) map by means of numerical maps defined attribute by attribute. We call such a map a b-numerical map of D, where b is any attribute of Ω, and the family ρ = {ρ_b : b ∈ Ω} a value numerical map family of D. One example is provided by the classical overlap map family δ = {δ_b : b ∈ Ω} (see [3], [27]), where

δ_b(s,t) ≔ 1 if s = t, and δ_b(s,t) ≔ 0 otherwise,

for any b ∈ Ω, inducing the object map Σ_{A,δ} : U × U → [0,1] defined by

Σ_{A,δ}(u,v) ≔ (1/|A|) ∑_{a∈A} δ_a(a(u), a(v)),

for all u, v ∈ U, where A is a non-empty attribute subset.
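The overlap-based map Σ_{A,δ} above admits a direct implementation. The concrete objects and attribute names below are illustrative assumptions; the code computes the fraction of attributes of A on which two objects agree.

```python
# Sketch of the overlap family delta and the induced object map
# Sigma_{A,delta}(u, v) = (1/|A|) * sum over a in A of delta_a(a(u), a(v)).
# Objects are modelled as dicts from attribute names to values;
# the concrete table is an illustrative assumption.

def delta(s, t):
    """Overlap map: 1 on a match of values, 0 on a mismatch."""
    return 1.0 if s == t else 0.0

def sigma_overlap(u, v, A):
    """Sigma_{A,delta}: fraction of attributes of A on which u and v agree."""
    return sum(delta(u[a], v[a]) for a in A) / len(A)

u = {'color': 'red', 'size': 'S', 'shape': 'round'}
v = {'color': 'red', 'size': 'M', 'shape': 'round'}
A = ['color', 'size', 'shape']

# u and v match on 2 of the 3 attributes in A.
assert sigma_overlap(u, v, A) == 2 / 3
# The value 1 is attained exactly on A-indiscernible pairs.
assert sigma_overlap(u, u, A) == 1.0
```

Since Σ_{A,δ}(u,v) = 1 exactly when u and v agree on all of A, this measure refines the A-granulation in the sense of Definition 1.1.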

Now, at a first level of generalization, in order to extend the previous example and to include a wide range of classical similarity measures that have been studied in the literature (see again [3], [27]), it suffices to replace the previous overlap family δ with any value numerical map family ρ of D.

However, in this paper we will introduce an even more general context than the one just described.

More in detail, given two value numerical map families ρ = {ρ_b : b ∈ Ω} and ν = {ν_b : b ∈ Ω} of D and a fixed non-empty attribute subset A, we introduce the notion of (A,ρ,ν)-object measure of D, i.e. the induced map Σ_{A,ρ,ν} : U × U → ℝ defined by

Σ_{A,ρ,ν}(u,v) ≔ ( ∑_{a∈A} ρ_a(a(u), a(v)) ) / ( ∑_{a∈A} ν_a(a(u), a(v)) ),

for all (u,v) ∈ U × U. Then it is easy to verify that all the similarity measures described in [3] can be expressed in terms of (A,ρ,ν)-object measures, for appropriate value numerical map families ρ and ν. Hence the notion of (A,ρ,ν)-object measure becomes a unifying concept, within whose perspective we can frame and investigate several classical similarity measures. Furthermore, in order to help the reader grasp the theoretical features of our work, throughout the paper we provide some simple examples whose value is purely illustrative of the theory.
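The general (A,ρ,ν)-object measure can be sketched as a quotient of two per-attribute sums. The families and the toy objects below are illustrative assumptions; choosing ρ as the overlap family and ν as the constant family 1 recovers the normalized overlap count Σ_{A,δ}.

```python
# Sketch of an (A, rho, nu)-object measure:
# Sigma_{A,rho,nu}(u, v) = sum_a rho_a(a(u), a(v)) / sum_a nu_a(a(u), a(v)).
# The map families and the toy table are illustrative assumptions.

def sigma(u, v, A, rho, nu):
    """Sigma_{A,rho,nu}(u, v) as a quotient of per-attribute sums."""
    num = sum(rho[a](u[a], v[a]) for a in A)
    den = sum(nu[a](u[a], v[a]) for a in A)
    return num / den

u = {'a': 1, 'b': 2}
v = {'a': 1, 'b': 3}
A = ['a', 'b']

overlap = lambda s, t: 1.0 if s == t else 0.0
one = lambda s, t: 1.0  # constant per-attribute map

rho = {x: overlap for x in A}
nu = {x: one for x in A}

# With rho = overlap and nu = 1, this is the normalized overlap count:
# u and v match on 1 of the 2 attributes of A.
assert sigma(u, v, A, rho, nu) == 0.5
assert sigma(u, u, A, rho, nu) == 1.0
```

Different choices of ρ and ν (e.g. frequency-weighted maps) yield the classical measures discussed below, all within this single quotient form.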

In Section 5, we will demonstrate some general properties of the (A,ρ,ν)-object measures. Nevertheless, the most substantial part of our work consists of analyzing the basic mathematical properties of specific (A,ρ,ν)-object measures coming from data mining and related fields (see [3], [27], [49]). It should be noted here that the goal of our paper is in a certain sense complementary to the spirit of [3], [27], [49]. As a matter of fact, our definition of (A,ρ,ν)-object measure allows us to collect several possible cases of similarity maps, among which those studied in [3], [27], [49], and, furthermore, to investigate in a more manageable and more general way the mathematical properties of these maps, above all with regard to the notion of refinement of the A-granulation. However, the reader interested in algorithmic comparisons between various D-object similarity measures already existing in the literature should consult the excellent works of Liu et al. [27], [34]. On the other hand, from the perspective of this paper, we have tried to provide researchers and practitioners of data mining, granular computing and related fields dealing with these families of maps with a precise and coherent mathematical foundation for this topic.

More in detail, based on [3], [27], [49], we consider seven similarity value maps and compare the corresponding induced object maps. These measures are: overlap [3], which is the translation in terms of numerical maps of the classical indiscernibility relation; the Eskin measure [12], which is a numerical map assigning more weight to mismatches occurring on attributes assuming many values; inverse occurrence frequency (briefly IOF) [33], which assigns a lower similarity to mismatches on more frequent values; occurrence frequency (briefly OF) [33], which assigns a lower similarity to less frequent values; the Goodall3 measure [13], which assigns a high similarity to a match whenever the matching values are infrequent, regardless of the frequencies of the other values; the Goodall4 measure, which is the complement of the previous one; and, finally, the Lin measure, which assigns a higher weight to matches on frequent values and a lower weight to mismatches on infrequent values.

We first prove some properties that have been simply stated in [3], [27], [49] and, next, we determine when these maps are object proximity measures (IOF, Goodall3, Goodall4) and when they are object similarity measures (overlap, Eskin, OF, Lin). Next, after fixing a non-empty attribute subset A, we investigate their connection with the corresponding granulation, finding some sufficient conditions under which the above measures refine it. For example, in general, the IOF measure turns out to be an object proximity measure which assumes value 1 if and only if either the two objects are A-indiscernible or they are not indiscernible and, for any attribute which discerns them with respect to A, at least one of the values assumed by u or v on the given attribute occurs only once. Thus, a specific condition on the attribute subset A makes explicit the cases where the granulation induced by A can be refined by that induced by the IOF measure.
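The IOF behaviour just described can be illustrated per attribute. The formulation below (1 on a match, 1/(1 + ln f(s)·ln f(t)) on a mismatch, with f the occurrence frequency of a value) is one common form of IOF from the data-mining literature and is an assumption here, as is the toy value column; the point is that a mismatch involving a value of frequency 1 still scores exactly 1, which is why IOF is in general only a proximity measure.

```python
# Sketch of a per-attribute IOF (inverse occurrence frequency) map in one
# common formulation: 1 on a match, 1/(1 + ln f(s) * ln f(t)) on a mismatch,
# where f counts occurrences of a value. Toy column is an illustrative
# assumption.

import math
from collections import Counter

def iof(s, t, freq):
    """IOF similarity of two values given the column frequencies `freq`."""
    if s == t:
        return 1.0
    return 1.0 / (1.0 + math.log(freq[s]) * math.log(freq[t]))

column = ['x', 'x', 'x', 'y', 'y', 'z']   # 'z' occurs only once
freq = Counter(column)

# A mismatch on two frequent values gets similarity strictly below 1 ...
assert iof('x', 'y', freq) < 1.0
# ... but a mismatch involving a frequency-1 value still scores 1, since
# ln(1) = 0 -- the case discussed above for refining the A-granulation.
assert iof('x', 'z', freq) == 1.0
```

This is exactly why an extra condition on A is needed before the object map induced by IOF refines the A-granulation: value 1 can occur on discernible pairs.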

In Section 2 we provide the basic notions about Pawlak's decision tables, which will be useful for the remaining part of the paper.

In Section 3 we introduce the notions of object proximity and of object similarity measures and, next, analyze the links between indiscernibility relation, dependency relation, exactness and the above measures. Furthermore, when the object similarity measure ζ refines the A-granulation, we find the greatest subinterval of the unit interval for which the neighborhoods N(ζ|u,r) agree with [u]A and, in the general case, we study its main properties.

In Section 4 we study overlap maps and their induced object similarity measure. This will be the basic starting model whose generalization leads in a natural way to the definition and the investigation of the notion of (A,ρ,ν)-object measure of D.

In Section 5 we provide a formal definition of per-attribute numerical maps defined on pairs of the admissible values of D and, next, starting from two arbitrary per-attribute numerical map families ρ and ν and from a given attribute subset A, we define the notion of (A,ρ,ν)-object measure.

Finally, in Section 6 we consider some classical per-attribute numerical map families and study their basic properties and those of the corresponding induced object measures, analyzing the cases where they are proximity or similarity measures. In the latter case, we ask whether they refine the granulation and, in the negative case, we find some additional sufficient conditions on the attribute subset A so that the corresponding (A,ρ,ν)-object similarity measure refines the A-granulation.

Section snippets

Background on decision tables

In this section we will provide some basic notation and also deal with background notions and properties of Pawlak’s decision tables [29]. The main notions concerning decision tables are dependency and exactness. In particular, we will prove that these notions may be related (see Theorem 2.2). In Table 1 we give some specific notations we will use in the paper.

If X is any finite set, we denote by P(X) the power set of X and we set P*(X) ≔ P(X) ∖ {∅}. We use the notation |X| to denote the number of

Object proximity and similarity measures

In this section, we will introduce the notion of object proximity measure and that of object similarity measure. These notions have been inspired by the terminology used by Yao in [44], [45]. The analysis that we will undertake in the present section concerns the connections between indiscernibility relation (also from the point of view of dependency and exactness) and object proximity/similarity measures.

The underlying idea is derived mainly from [27], [49], and it consists of verifying

Similarity measures induced by overlap maps

In this section we establish the basic properties of the D-object similarity measure which permit us to express the classical indiscernibility conditions in terms of numerical maps. Clearly, overlap induces an example of object similarity measure. Such a measure will be a model whose generalization leads to the definition of the object measures which we will study in what follows.

For any attribute a ∈ Ω, let δ_a : Λ_a × Λ_a → {0,1} be the overlap map on the set Λ_a, that is, δ_a(s,t) ≔ 1 if s = t, and δ_a(s,t) ≔ 0 otherwise, for all (

Object numerical measures induced by value numerical map families

In this section we use the particular object similarity measure defined in (13) as a reference model to deal with more general cases. As a matter of fact, we will introduce the notion of (A,ρ,ν)-object measure, where A is a fixed non-empty attribute subset and ρ and ν are collections of numerical maps defined on Λ_a × Λ_a, for each a ∈ A. Our main aim is to introduce a unifying perspective which takes into account the various measures introduced in [3], [27], [49]. Nevertheless, as we will see in the

Some object measures induced by classical value measures

The present section is devoted to the study of some object measures induced by classical value measures. For each of these value measures, we will provide basic mathematical properties, some of which have been also established in [3] and for which it is not easy to find a formal proof in literature. Therefore, in line with our attempt to provide a mathematical foundation for the theory of similarity measures, we considered it appropriate to give a demonstration for the aforementioned properties

Conclusions

The underlying idea of our paper consists of providing a mathematical investigation of the notion of similarity between objects of a decision table in relation to Pawlak’s indiscernibility. As a matter of fact, such a notion has been classically studied from a topological point of view, where it has been described through the introduction of neighborhoods.

In our perspective, when one has a given universe U of objects, the similarity between two objects u,vU may be measured by means of a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We are extremely grateful to the anonymous reviewers who helped us to improve the quality of our paper with their thorough suggestions. A particular thanks goes to the referee who suggested to us the introduction of the information table ID,ζA.

References (50)

  • X. Li et al.

    Generalized three-way decision models based on subset evaluation

    Int. J. Approx. Reason.

    (2017)
  • R. Liu et al.

    Shared-nearest-neighbor-based clustering by fast search and find of density peaks

    Inf. Sci.

    (2018)
  • F. Liu et al.

    A comparison study of similarity measures for covering-based neighborhood classifiers

    Inf. Sci.

    (2018)
  • Z. Pawlak et al.

    Rough sets and Boolean reasoning

    Inf. Sci.

    (2007)
  • M.J. Benitez-Caballero et al.

    Bireducts with tolerance relations

    Inf. Sci.

    (2018)
  • S. Wang et al.

    Four matroidal structures of covering and their relationships with rough sets

    Int. J. Approx. Reason.

    (2013)
  • S. Xia et al.

    Granular ball computing classifiers for efficient, scalable and robust learning

    Inf. Sci.

    (2019)
  • X. Xie et al.

    A novel incremental attribute reduction approach for dynamic incomplete decision systems

    Int. J. Approx. Reason.

    (2018)
  • J. Yang et al.

    Knowledge distance measure in multigranulation spaces of fuzzy equivalence relations

    Inf. Sci.

    (2018)
  • J. Yang et al.

    Optimal granularity selection based on cost-sensitive sequential three-way decisions with rough fuzzy sets

    Knowl. Based Syst.

    (2019)
  • Y.Y. Yao

    Relational interpretations of neighborhood operators and rough set approximation operators

    Inf. Sci.

    (1998)
  • Y.Y. Yao

    Neighborhood systems and approximate retrieval

    Inf. Sci.

    (2006)
  • Y. Yao et al.

    Covering based rough set approximations

    Inf. Sci.

    (2012)
  • Z. Zhang

    A rough set approach to intuitionistic fuzzy soft set based decision marking

    Appl. Math. Model.

    (2012)
  • W. Zhu

    Topological approaches to covering rough sets

    Inf. Sci.

    (2007)