Elsevier

Information Sciences

Volume 539, October 2020, Pages 104-135

Object similarity measures and Pawlak’s indiscernibility on decision tables

https://doi.org/10.1016/j.ins.2020.05.030

Highlights

  • We consider comparisons between granularities induced by object similarity measures and classical Pawlak indiscernibility on decision tables.

  • We interpret object similarity measures as potential refinements of the granularity induced by the indiscernibility relation.

  • We introduce the notion of (A, ρ, ν)-object measure, where A is an attribute subset, and ρ and ν are collections of numerical maps depending on pairs of values.

  • We apply the general results of the first part of our work to the study of several classical per-attribute similarities.

  • We analyze the relationship between granularity, the (A, ρ, ν)-object measures and the previous specific per-attribute similarities.

Abstract

In this paper we investigate the mathematical foundations of the notion of similarity between objects in relation to the granulations on a decision table D. First of all, we compare the endogenous granulation induced by Pawlak's indiscernibility with the exogenous granulation induced by a similarity measure ζ defined on pairs of objects and assuming values in the unit interval. To this aim, the starting point of our analysis is the introduction of the notion of refinement of the granulation induced by an attribute subset A through the object similarity measure ζ. More in detail, we say that ζ refines the granulation induced by A if ζ assumes value 1 on a pair of objects if and only if they are A-indiscernible. Next, starting from two given families ρ and ν of numerical maps defined on pairs of admissible values of D, we determine a broad class of potential similarity measures on the objects of D refining, sometimes under specific additional hypotheses, the A-granulation on the object set of D. For this class of similarity measures, we establish several mathematical properties. Finally, we focus our attention on the analysis of specific pairs of numerical map families ρ and ν that have been classically studied in the literature and, for each of them, we exhibit the main properties with respect to the aforementioned refinement of granulation.

Introduction

The notion of similarity is a fundamental tool in data mining [3], [4], where it has been used in the shared nearest neighbor (SNN) clustering approach [28] (a methodology useful for finding clusters of different shapes, sizes and densities in high-dimensional data [25]), document clustering [17], information retrieval [45] and anomaly detection [1]. Furthermore, this notion also occurs in various branches of theoretical computer science, such as granular computing [32], [36] and rough set theory (briefly RST) [15], [27], [44].

More in detail, in RST similarity is closely related to the analysis of approximations of subsets of a given object set U, which in many contexts agrees with the object set of some decision table D.

The underlying idea of RST is the following: based on a given clustering criterion, with any element u ∈ U we associate an object subset N(u) (called a neighborhood in [24], [44]), and we use these neighborhoods to construct inner and outer approximations of specific subsets of U.

In particular, whenever an equivalence relation R on U is assigned, a classical example of neighborhood N(u) is the equivalence class [u]_R of each element u ∈ U. Clearly, in this case we also have the further property that U = ∪{N(u) : u ∈ U}, i.e. U is the union of the neighborhoods of all its elements. In other terms, the neighborhood family {N(u) : u ∈ U} is a covering of U.

However, in many more general situations, from an intuitive point of view, the fact that an object v belongs to N(u) means that u and v are similar in relation to some fixed quantitative measurement, which does not necessarily induce an equivalence relation on U. In these more general cases, with any object u ∈ U we associate not just a single neighborhood N(u), but a whole neighborhood system N(u), whose members are neighborhoods N_r(u) of u, depending on a numerical parameter r.

The notion of neighborhood system occurs in a natural way within numerous research scopes, such as fuzzy set theory [18], [48], database theory [16], [23], [24], social networks theory [14], decision theory [20], graph theory [8].

A classical example of neighborhood system occurs in RST, where we start with the indiscernibility relation induced by a fixed attribute subset, and next we construct the so-called lower and upper approximations of any object subset X, by means of which we obtain the usual notion of Pawlak’s exactness [29], [31], [46], [47].

On the other hand, starting from a covering of U, Zhu [50] introduced a new kind of neighborhood, which made it possible to consider new covering-based rough sets; these have recently been investigated from both a matroidal and a lattice-theoretic point of view [2], [5], [9], [19], [21], [26], [38], [39].

In general, a neighborhood system N(u) describes many possible ways for another object v ∈ U to be similar to u. In other terms, N(u) outlines several levels of similarity with u, each in relation to a given choice of the numerical parameter r.

In concrete cases one expects that there exists an acceptable numerical parameter r for which u ∈ N_r(u) ∈ N(u), for any u ∈ U. With such a further assumption of reflexivity at the level r, it follows that the neighborhood family N_r(U) ≔ {N_r(u) : u ∈ U} is a covering of U.

Clearly, it is also natural to require a condition of symmetry with respect to the similarity between any two objects: if u is similar to v relative to a fixed numerical parameter r, then v is also similar to u relative to the same r. In terms of neighborhoods: if v ∈ N_r(u), then u ∈ N_r(v).

A reflexive and symmetric relation is sometimes called a tolerance relation [35], [36], [37], [41], and it has been linked with the problem of reducing the number of attributes, where the aim consists of deleting as many incompatibilities and redundancies as possible.

Another way to define a neighborhood system associated with any element u ∈ U has been provided in [3], [27], [49], where a measure ζ : U × U → [0,1], a threshold r ∈ [0,1], the neighborhoods of the form N(ζ|u,r) ≔ {v ∈ U : ζ(u,v) ≥ r} and the corresponding neighborhood system N(ζ|u) ≔ {N(ζ|u,r) : r ∈ [0,1]} are given.
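The threshold-based neighborhoods just described can be sketched in a few lines of code. The universe, the measure `zeta`, and the thresholds below are illustrative assumptions, not taken from the paper; the sketch only makes explicit that N(ζ|u,r) collects all objects with similarity at least r to u, and that smaller thresholds yield (weakly) larger neighborhoods.

```python
# Sketch: neighborhoods N(zeta | u, r) induced by a similarity measure
# and a threshold. Toy universe and measure are illustrative assumptions.

def neighborhood(zeta, U, u, r):
    """N(zeta | u, r) = {v in U : zeta(u, v) >= r}."""
    return {v for v in U if zeta(u, v) >= r}

def neighborhood_system(zeta, U, u, thresholds):
    """The family {N(zeta | u, r) : r in thresholds}, indexed by r."""
    return {r: neighborhood(zeta, U, u, r) for r in thresholds}

# Toy universe: objects are integers; similarity decays with distance.
U = {0, 1, 2, 3, 4}
zeta = lambda u, v: 1.0 / (1 + abs(u - v))

# Threshold 1 isolates u itself; lower thresholds enlarge the neighborhood.
assert neighborhood(zeta, U, 2, 1.0) == {2}
assert neighborhood(zeta, U, 2, 0.5) == {1, 2, 3}
assert neighborhood(zeta, U, 2, 0.5) <= neighborhood(zeta, U, 2, 0.2)
```

Note that reflexivity of `zeta` (value 1 on the diagonal) guarantees u ∈ N(ζ|u,r) for every r, so the family {N(ζ|u,r) : u ∈ U} is a covering of U.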

These types of neighborhood systems are the starting point of the present paper. In the next subsection we will describe the idea and the motivations underlying our work, highlighting the strict relation between granulation and the measures we will study in this paper. Let us finally point out that the possibility to work with some measures related to granulation has been successfully used in fuzzy set theory [40], [42], [43].

In the present paper, taking inspiration from Yao's terminology [45], given a decision table D with object set U, we call any symmetric numerical map ζ : U × U → [0,1] a D-object proximity measure and, when it also satisfies the condition ζ(u,u) = 1 for each u ∈ U, a D-object similarity measure. In the latter case, it is natural to think of the objects u and v as completely similar with respect to the criterion induced by the map ζ when ζ(u,v) = 1 and, on the contrary, as completely dissimilar when ζ(u,v) = 0.

Now, as we have already remarked in the previous subsection, when a D-object proximity measure ζ, an object u and a threshold r in the unit interval [0,1] are given, it is usual to consider the neighborhood N(ζ|u,r) of all objects v ∈ U having similarity at least r with respect to u and, moreover, the corresponding neighborhood system N(ζ|r) ≔ {N(ζ|u,r) : u ∈ U}.

The first aim of our work is to provide a detailed mathematical investigation of most of the comparisons that are implicitly made in the literature between the neighborhood system N(ζ|r) and the set system of all A-indiscernibility granules. However, in order to provide a suitable justification for the mathematical formalism used throughout the paper, we now briefly discuss our interpretation of the granulations on U induced by ζ and by Pawlak's A-indiscernibility, respectively.

In this regard, it is worth noticing that the A-indiscernibility relation classifies the A-granules in a quite rigid way since, given two objects u and v, one can only establish whether u and v are A-indiscernible (in symbols, u ≡_A v) or not, without quantifying, in the case of discernibility, how discernible they are. In formal terms, we can consider the D-object similarity measure ind_A : U × U → [0,1] defined by

ind_A(u,v) ≔ 1 if u ≡_A v, and ind_A(u,v) ≔ 0 otherwise.

Hence, as the above object similarity map ind_A takes only two possible values, namely 1 for the maximum similarity level (which corresponds to the indiscernibility between two objects) and 0 for any other form of dissimilarity, the rigidity of the similarity induced by the A-granulation is evident. On the other hand, whenever we have a symmetric map ζ : U × U → [0,1] such that

u ≡_A v ⇔ ζ(u,v) = 1

for any (u,v) ∈ U × U, then we can think of ζ as an object similarity measure which refines the granulation of the object set U induced by ind_A. Let us explain the motivation for such terminology. On the one hand, when this condition holds, the more the number ζ(u,v) approaches 1, the closer the objects u and v are to being A-indiscernible. On the other hand, when we consider any two non-A-indiscernible objects u and v, their level of dissimilarity with respect to the measure ind_A is always 0, while the value ζ(u,v), ranging in [0,1), provides more refined information about the dissimilarity of u and v. Hence, when the D-object similarity measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1, it becomes a more flexible tool for conveying information concerning the A-granulation of U. Furthermore, the aforementioned refinement can be observed from two different perspectives.
  • In the first case, let u ∈ U be fixed. Then, when we choose any two distinct objects w, w′ ∈ U ∖ [u]_A, from a mere examination of the A-indiscernibility partition we are not able to state which of the two objects has a greater or lesser similarity level with u in relation to the A-granulation of U. In fact, we have that ind_A(u,w) = ind_A(u,w′) = 0. Now, if the object similarity measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1 and, in addition, it also results that 0 ≤ ζ(u,w) < ζ(u,w′) < 1, then we can interpret these inequalities as expressing a better outer similarity between w′ and u, with respect to that between w and u, in relation to the A-granulation of U.

  • In the second case, let u, v ∈ U belong to the same A-indiscernibility granule (that is, u ≡_A v), and let z ∉ [u]_A. Then, if we simply use the measure ind_A, as in the previous case we have that ind_A(u,z) = ind_A(v,z) = 0. On the other hand, as before, if ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1 and, in addition, it also results that 0 ≤ ζ(u,z) < ζ(v,z) < 1, we can interpret these inequalities as expressing a better inner similarity between v and z, with respect to that between u and z, in relation to the A-granulation of U.

Then, based on the interpretation given above, we introduce the following fundamental notion of our work.

Definition 1.1

We say that the D-object similarity measure ζ : U × U → [0,1] refines the A-granulation of U if, for all u, v ∈ U, u ≡_A v ⇔ ζ(u,v) = 1.
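Definition 1.1 can be made concrete with a small computational sketch. The toy decision table below (objects as dictionaries from attribute names to values) and the graded measure `zeta` are illustrative assumptions; the sketch only checks the biconditional "ζ(u,v) = 1 iff u and v are A-indiscernible".

```python
# Sketch of Definition 1.1: zeta refines the A-granulation when
# zeta(u, v) = 1 holds exactly on A-indiscernible pairs.
# Toy table and measures are illustrative assumptions.

def a_indiscernible(u, v, A):
    """u and v agree on every attribute in A."""
    return all(u[a] == v[a] for a in A)

def ind_A(u, v, A):
    """The rigid two-valued measure ind_A induced by A-indiscernibility."""
    return 1.0 if a_indiscernible(u, v, A) else 0.0

def refines(zeta, U, A):
    """True iff zeta(u, v) == 1 <=> u and v are A-indiscernible."""
    return all((zeta(u, v) == 1.0) == a_indiscernible(u, v, A)
               for u in U for v in U)

# Toy universe with two attributes.
U = [{'a': 0, 'b': 0}, {'a': 0, 'b': 1}, {'a': 1, 'b': 1}]
A = ['a', 'b']

# A graded measure: fraction of attributes of A on which u and v agree.
# It hits 1 exactly on A-indiscernible pairs, so it refines the A-granulation.
zeta = lambda u, v: sum(u[x] == v[x] for x in A) / len(A)
assert refines(zeta, U, A)

# By contrast, ind over the smaller subset {'a'} assigns 1 to pairs
# that are discernible over A, so it does not refine the A-granulation.
assert not refines(lambda u, v: ind_A(u, v, ['a']), U, A)
```

The graded values of `zeta` in [0,1) are exactly the "outer/inner similarity" information that the two-valued measure ind_A discards.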

We now conclude the present subsection by discussing a further interpretation of the refinement of the granulation in terms of endogenous granulation and exogenous granulation. In this regard, when we fix an attribute subset A, the object similarity measure ind_A induces on the object set U a type of granulation that it is natural to call endogenous. In fact, such a granulation is obtained simply by taking the A-indiscernibility granules, which are induced directly by the intrinsic nature of the given decision table D. On the other hand, when we choose an appropriate object similarity measure ζ : U × U → [0,1] and a fixed threshold r ∈ [0,1], we can consider the set system N(ζ|r) ≔ {N(ζ|u,r) : u ∈ U}. It easily follows that N(ζ|r) is a covering of U because u ∈ N(ζ|u,r) for all u ∈ U. Therefore we can interpret N(ζ|r) as a granulation of U, where any two granules N(ζ|u,r) and N(ζ|v,r) may also have a non-empty intersection. However, the important point in our discussion is that we can interpret N(ζ|r) as a granulation of an exogenous type. In fact, it is induced by the object similarity measure ζ, whose nature is (a priori) not directly related to the way in which the attributes act on the objects of the given decision table. Thus, in general, it is natural to ask what the links may be between the endogenous granulation induced by ind_A and the exogenous granulation induced by ζ.

In this regard, we can also have more refined interrelations between the object similarity measures ind_A and ζ, beyond the condition u ≡_A v ⇔ ζ(u,v) = 1. However, at a first level of analysis, this condition appears to be one of the most natural starting points from which to undertake a comparison between the two aforementioned endogenous and exogenous granulations.

In fact, if the measure ζ satisfies the condition u ≡_A v ⇔ ζ(u,v) = 1, then its exogenous granulation of lower threshold 1 coincides with the endogenous A-granulation.

On the other hand, when we choose a threshold r ∈ [0,1), the size of any neighborhood N(ζ|u,r) potentially increases with respect to N(ζ|u,1) = [u]_A. This leads to a situation more complex than the level-1 granulation induced by ζ, corresponding exactly to the greater amount of information that the measure ζ provides with respect to the A-indiscernibility measure ind_A.

To conclude this subsection, let us spend a few words concerning dissimilarity measures. For these measures there is no univocal definition in the literature. However, also in this case the dissimilarity between two objects u and v is usually given by means of a numerical measure ϕ(u,v) of the degree to which the two objects differ. Hence, the dissimilarity is lower for more similar pairs of objects. In particular, we interpret the value ϕ(u,v) = 0 as null dissimilarity (and therefore maximal similarity) between the objects u and v. Therefore, it is natural to consider any metric ϕ : U × U → [0, ∞) as a specific type of object dissimilarity measure. Nevertheless, a formal and complete investigation of object dissimilarity measures in relation to both metrics and refinements of the A-granulation goes far beyond the scope of the present paper, and it will be the subject of future works.

In the first part of this paper, we will work on the mathematical foundations of the previous extension of the A-granulation. This analysis will be undertaken independently of any specific D-object similarity measure, and the corresponding results will be provided in Section 3.

The above notion of refinement of the A-granulation is the basic interpretative idea of the present paper. Nevertheless, once this interpretative standpoint is assumed, a more technical problem arises: how can we determine a broad family of D-object similarity measures refining the A-granulation on the object set U?

In this regard, drawing on and generalizing what has been done in [3], [27], [49], we will introduce a class of D-object proximity measures starting from two given families of numerical maps whose domain agrees with the value set of a decision table.

Let us discuss how to obtain the aforementioned family of D-object proximity measures. Notice, furthermore, that most of the numerical maps we will consider in the paper are frequently used in data mining and related fields (for further details, we refer the reader to [3]).

Given a decision table D with attribute set Ω ≔ Con ∪ Dec, where Con denotes the set of all the condition attributes and Dec that of all the decision attributes, we may define an object proximity (or similarity) map by means of numerical maps defined attribute by attribute. We call such a map a b-numerical map of D, where b is any attribute of Ω, and the family ρ = {ρ_b : b ∈ Ω} a value numerical map family of D. One example is provided by the classical overlap map family δ = {δ_b : b ∈ Ω} (see [3], [27]), where

δ_b(s,t) ≔ 1 if s = t, and δ_b(s,t) ≔ 0 otherwise,

for any b ∈ Ω, inducing the object map Σ_{A,δ} : U × U → [0,1] defined by

Σ_{A,δ}(u,v) ≔ (1/|A|) ∑_{a∈A} δ_a(a(u), a(v)),

for all u, v ∈ U, where A is a non-empty attribute subset.
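The overlap-based map Σ_{A,δ} above admits a direct implementation. The concrete objects and attribute names below are illustrative assumptions; the code computes the fraction of attributes of A on which two objects agree.

```python
# Sketch of the overlap family delta and the induced object map
# Sigma_{A,delta}(u, v) = (1/|A|) * sum over a in A of delta_a(a(u), a(v)).
# Objects are modelled as dicts from attribute names to values;
# the concrete table is an illustrative assumption.

def delta(s, t):
    """Overlap map: 1 on a match of values, 0 on a mismatch."""
    return 1.0 if s == t else 0.0

def sigma_overlap(u, v, A):
    """Sigma_{A,delta}: fraction of attributes of A on which u and v agree."""
    return sum(delta(u[a], v[a]) for a in A) / len(A)

u = {'color': 'red', 'size': 'S', 'shape': 'round'}
v = {'color': 'red', 'size': 'M', 'shape': 'round'}
A = ['color', 'size', 'shape']

# u and v match on 2 of the 3 attributes in A.
assert sigma_overlap(u, v, A) == 2 / 3
# The value 1 is attained exactly on A-indiscernible pairs.
assert sigma_overlap(u, u, A) == 1.0
```

Since Σ_{A,δ}(u,v) = 1 exactly when u and v agree on all of A, this measure refines the A-granulation in the sense of Definition 1.1.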

Now, at a first level of generalization, in order to extend the previous example and to include a wide range of classical similarity measures that have been studied in the literature (see again [3], [27]), it suffices to replace the previous overlap family δ with any value numerical map family ρ of D.

However, in this paper we will introduce an even more general context than the one just described.

More in detail, given two value numerical map families ρ = {ρ_b : b ∈ Ω} and ν = {ν_b : b ∈ Ω} of D and a fixed non-empty attribute subset A, we introduce the notion of (A,ρ,ν)-object measure of D, i.e. the induced map Σ_{A,ρ,ν} : U × U → ℝ defined by

Σ_{A,ρ,ν}(u,v) ≔ ( ∑_{a∈A} ρ_a(a(u), a(v)) ) / ( ∑_{a∈A} ν_a(a(u), a(v)) ),

for all (u,v) ∈ U × U. Then it is easy to verify that all the similarity measures described in [3] can be expressed in terms of (A,ρ,ν)-object measures, for appropriate value numerical map families ρ and ν. Hence the notion of (A,ρ,ν)-object measure becomes a unifying concept, within whose perspective we can frame and investigate several classical similarity measures. Furthermore, in order to help the reader grasp the theoretical features of our work, throughout the paper we provide some simple examples whose value is purely illustrative of the theory.
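The general (A,ρ,ν)-object measure can be sketched as a quotient of two per-attribute sums. The families and the toy objects below are illustrative assumptions; choosing ρ as the overlap family and ν as the constant family 1 recovers the normalized overlap count Σ_{A,δ}.

```python
# Sketch of an (A, rho, nu)-object measure:
# Sigma_{A,rho,nu}(u, v) = sum_a rho_a(a(u), a(v)) / sum_a nu_a(a(u), a(v)).
# The map families and the toy table are illustrative assumptions.

def sigma(u, v, A, rho, nu):
    """Sigma_{A,rho,nu}(u, v) as a quotient of per-attribute sums."""
    num = sum(rho[a](u[a], v[a]) for a in A)
    den = sum(nu[a](u[a], v[a]) for a in A)
    return num / den

u = {'a': 1, 'b': 2}
v = {'a': 1, 'b': 3}
A = ['a', 'b']

overlap = lambda s, t: 1.0 if s == t else 0.0
one = lambda s, t: 1.0  # constant per-attribute map

rho = {x: overlap for x in A}
nu = {x: one for x in A}

# With rho = overlap and nu = 1, this is the normalized overlap count:
# u and v match on 1 of the 2 attributes of A.
assert sigma(u, v, A, rho, nu) == 0.5
assert sigma(u, u, A, rho, nu) == 1.0
```

Different choices of ρ and ν (e.g. frequency-weighted maps) yield the classical measures discussed below, all within this single quotient form.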

In Section 5, we will demonstrate some general properties of the (A,ρ,ν)-object measures. Nevertheless, the most substantial part of our work consists of analyzing the basic mathematical properties of specific (A,ρ,ν)-object measures coming from data mining and related fields (see [3], [27], [49]). It should be noted here that the goal of our paper is in a certain sense complementary to the spirit of [3], [27], [49]. As a matter of fact, our definition of (A,ρ,ν)-object measure allows us to collect several possible cases of similarity maps, among which those studied in [3], [27], [49], and, furthermore, to investigate in a more manageable and more general way the mathematical properties of these maps, above all with regard to the notion of refinement of the A-granulation. However, the reader interested in algorithmic comparisons between various D-object similarity measures already existing in the literature should consult the excellent works of Liu et al. [27], [34]. On the other hand, from the perspective of this paper, we have tried to provide researchers and practitioners of data mining, granular computing and related fields dealing with these families of maps with a precise and coherent mathematical foundation for this topic.

More in detail, based on [3], [27], [49], we consider seven similarity value maps and compare the corresponding induced object maps. These measures are: overlap [3], which is the translation in terms of numerical maps of the classical indiscernibility relation; the Eskin measure [12], which is a numerical map assigning more weight to mismatches occurring on attributes assuming many values; inverse occurrence frequency (briefly IOF) [33], which assigns a lower similarity to mismatches on more frequent values; occurrence frequency (briefly OF) [33], which assigns a lower similarity to less frequent values; the Goodall3 measure [13], which assigns a high similarity to a match whenever the matching values are infrequent, regardless of the frequencies of the other values; the Goodall4 measure, which is the complement of the previous one; and, finally, the Lin measure, which assigns a higher weight to matches on frequent values and a lower weight to mismatches on infrequent values.

We first prove some properties that have been simply stated in [3], [27], [49] and, next, we determine when these maps are object proximity measures (IOF, Goodall3, Goodall4) and when they are object similarity measures (overlap, Eskin, OF, Lin). Next, after fixing a non-empty attribute subset A, we investigate their connection with the corresponding granulation, finding some sufficient conditions under which the above measures refine it. For example, in general, the IOF measure turns out to be an object proximity measure which assumes value 1 if and only if either the two objects are A-indiscernible or they are not indiscernible and, for any attribute which discerns them with respect to A, at least one of the values assumed by u or v on the given attribute occurs only once. Thus, a specific condition on the attribute subset A makes explicit the cases where the granulation induced by A can be refined by that induced by the IOF measure.
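The IOF behaviour just described can be illustrated per attribute. The formulation below (1 on a match, 1/(1 + ln f(s)·ln f(t)) on a mismatch, with f the occurrence frequency of a value) is one common form of IOF from the data-mining literature and is an assumption here, as is the toy value column; the point is that a mismatch involving a value of frequency 1 still scores exactly 1, which is why IOF is in general only a proximity measure.

```python
# Sketch of a per-attribute IOF (inverse occurrence frequency) map in one
# common formulation: 1 on a match, 1/(1 + ln f(s) * ln f(t)) on a mismatch,
# where f counts occurrences of a value. Toy column is an illustrative
# assumption.

import math
from collections import Counter

def iof(s, t, freq):
    """IOF similarity of two values given the column frequencies `freq`."""
    if s == t:
        return 1.0
    return 1.0 / (1.0 + math.log(freq[s]) * math.log(freq[t]))

column = ['x', 'x', 'x', 'y', 'y', 'z']   # 'z' occurs only once
freq = Counter(column)

# A mismatch on two frequent values gets similarity strictly below 1 ...
assert iof('x', 'y', freq) < 1.0
# ... but a mismatch involving a frequency-1 value still scores 1, since
# ln(1) = 0 -- the case discussed above for refining the A-granulation.
assert iof('x', 'z', freq) == 1.0
```

This is exactly why an extra condition on A is needed before the object map induced by IOF refines the A-granulation: value 1 can occur on discernible pairs.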

In Section 2 we provide the basic notions about Pawlak's decision tables, which will be useful for the remaining part of the paper.

In Section 3 we introduce the notions of object proximity and of object similarity measures and, next, analyze the links between indiscernibility relation, dependency relation, exactness and the above measures. Furthermore, when the object similarity measure ζ refines the A-granulation, we find the greatest subinterval of the unit interval for which the neighborhoods N(ζ|u,r) agree with [u]A and, in the general case, we study its main properties.

In Section 4 we study overlap maps and their induced object similarity measure. This will be the basic starting model whose generalization leads in a natural way to the definition and the investigation of the notion of (A,ρ,ν)-object measure of D.

In Section 5 we provide a formal definition of per-attribute numerical maps defined on pairs of the admissible values of D and, next, starting from two arbitrary per-attribute numerical map families ρ and ν and from a given attribute subset A, we define the notion of (A,ρ,ν)-object measure.

Finally, in Section 6 we consider some classical per-attribute numerical map families and study their basic properties and those of the corresponding induced object measures, analyzing the cases where they are proximity or similarity measures. In the latter case, we ask whether they refine the granulation and, in the negative case, we find some additional sufficient conditions on the attribute subset A so that the corresponding (A,ρ,ν)-object similarity measure refines the A-granulation.

Section snippets

Background on decision tables

In this section we will provide some basic notation and also deal with background notions and properties of Pawlak’s decision tables [29]. The main notions concerning decision tables are dependency and exactness. In particular, we will prove that these notions may be related (see Theorem 2.2). In Table 1 we give some specific notations we will use in the paper.

If X is any finite set, we denote by P(X) the power set of X and we set P*(X) ≔ P(X) ∖ {∅}. We use the notation |X| to denote the number of

Object proximity and similarity measures

In this section, we will introduce the notion of object proximity measure and that of object similarity measure. These notions have been inspired by the terminology used by Yao in [44], [45]. The analysis that we will undertake in the present section concerns the connections between indiscernibility relation (also from the point of view of dependency and exactness) and object proximity/similarity measures.

The underlying idea is derived mainly from [27], [49], and it consists of verifying

Similarity measures induced by overlap maps

In this section we establish the basic properties of the D-object similarity measure which permit us to express the classical indiscernibility conditions in terms of numerical maps. Clearly, overlap induces an example of object similarity measure. Such a measure will be a model whose generalization leads to the definition of the object measures which we will study in what follows.

For any attribute a ∈ Ω, let δ_a : Λ_a × Λ_a → {0,1} be the overlap map on the set Λ_a, that is, δ_a(s,t) ≔ 1 if s = t, and δ_a(s,t) ≔ 0 otherwise, for all (

Object numerical measures induced by value numerical map families

In this section we use the particular object similarity measure defined in (13) as a reference model to deal with more general cases. As a matter of fact, we will introduce the notion of (A,ρ,ν)-object measure, where A is a fixed non-empty attribute subset and ρ and ν are collections of numerical maps defined on Λ_a × Λ_a, for each a ∈ A. Our main aim is to introduce a unifying perspective which takes into account the various measures introduced in [3], [27], [49]. Nevertheless, as we will see in the

Some object measures induced by classical value measures

The present section is devoted to the study of some object measures induced by classical value measures. For each of these value measures, we will provide basic mathematical properties, some of which have been also established in [3] and for which it is not easy to find a formal proof in literature. Therefore, in line with our attempt to provide a mathematical foundation for the theory of similarity measures, we considered it appropriate to give a demonstration for the aforementioned properties

Conclusions

The underlying idea of our paper consists of providing a mathematical investigation of the notion of similarity between objects of a decision table in relation to Pawlak’s indiscernibility. As a matter of fact, such a notion has been classically studied from a topological point of view, where it has been described through the introduction of neighborhoods.

In our perspective, when one has a given universe U of objects, the similarity between two objects u,vU may be measured by means of a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We are extremely grateful to the anonymous reviewers who helped us to improve the quality of our paper with their thorough suggestions. A particular thanks goes to the referee who suggested to us the introduction of the information table ID,ζA.

References (50)

  • X. Li et al.

    Generalized three-way decision models based on subset evaluation

    Int. J. Approx. Reason.

    (2017)
  • R. Liu et al.

    Shared-nearest-neighbor-based clustering by fast search and find of density peaks

    Inf. Sci.

    (2018)
  • F. Liu et al.

    A comparison study of similarity measures for covering-based neighborhood classifiers

    Inf. Sci.

    (2018)
  • Z. Pawlak et al.

    Rough sets and Boolean reasoning

    Inf. Sci.

    (2007)
  • M.J. Benitez-Caballero et al.

    Bireducts with tolerance relations

    Inf. Sci.

    (2018)
  • S. Wang et al.

    Four matroidal structures of covering and their relationships with rough sets

    Int. J. Approx. Reason.

    (2013)
  • S. Xia et al.

    Granular ball computing classifiers for efficient, scalable and robust learning

    Inf. Sci.

    (2019)
  • X. Xie et al.

    A novel incremental attribute reduction approach for dynamic incomplete decision systems

    Int. J. Approx. Reason.

    (2018)
  • J. Yang et al.

    Knowledge distance measure in multigranulation spaces of fuzzy equivalence relations

    Inf. Sci.

    (2018)
  • J. Yang et al.

    Optimal granularity selection based on cost-sensitive sequential three-way decisions with rough fuzzy sets

    Knowl. Based Syst.

    (2019)
  • Y.Y. Yao

    Relational interpretations of neighborhood operators and rough set approximation operators

    Inf. Sci.

    (1998)
  • Y.Y. Yao

    Neighborhood systems and approximate retrieval

    Inf. Sci.

    (2006)
  • Y. Yao et al.

    Covering based rough set approximations

    Inf. Sci.

    (2012)
  • Z. Zhang

    A rough set approach to intuitionistic fuzzy soft set based decision marking

    Appl. Math. Model.

    (2012)
  • W. Zhu

    Topological approaches to covering rough sets

    Inf. Sci.

    (2007)