Record linkage based on a three-way decision with the use of granular descriptors

https://doi.org/10.1016/j.eswa.2018.12.038

Highlights

  • A three-way decision is proposed to solve the record linkage problem.

  • A granule describing the distribution of uncertain data points is used in the three-way decision.

  • Coverage of uncertainty is defined by entropy and memberships.

  • The specificity of granule construction is controlled by a new parameter.

Abstract

Record linkage is a typical two-class recognition problem in data mining. To improve classification performance on this problem, this paper proposes to apply three-way classification to identify uncertain points (regions) for further clerical investigation in decision-making. The three-way decision process is realized by a two-phase approach. In the first phase, an information granule is constructed to describe the uncertain region in the data space. In the second phase, the constructed granule is used to discriminate between certain points (those with a high likelihood of belonging to one of the classes) and uncertain points (those requiring clerical attention). Uncertain points are investigated manually; certain points are classified by a generic binary classifier. Synthetic data and publicly available data are used to demonstrate the performance of the proposed approach. Finally, the proposed approach is shown to be effective in applications involving real-world record linkage data.

Introduction

Record linkage is an important part of data processing (Fellegi and Sunter, 1969, Reyes-Galaviz et al., 2017, Vatsalan et al., 2017). In the big data era, we often need to process several disparate data sources simultaneously, e.g., in health care systems, government records, publications and other areas. Because many records coming from different sources belong to the same entity (Christen, 2012a, Christen, 2012b), it is not advisable to run data analytics on all these raw data sources directly. The major purpose of record linkage is to match and merge all records of an entity, which refines the data sources and makes data analysis more efficient. Generally, record linkage is regarded as a two-class decision-making problem, which attempts to distinguish matched and not-matched record pairs. However, in real-world applications, records cannot always be judged correctly by such a binary decision process due to several compelling factors (Berkel, 1988), e.g., data noise, handwriting errors, and address changes. These factors may cause one entity's records to be classified as not-matched, or two entities' records to be classified as matched, especially for points (record pairs) with values located close to the classification boundary. In this situation, further human investigation is required to guarantee the system's credibility. We refer to a decision-making process involving “clerical investigation” as a three-way decision (Liang et al., 2017, Liang and Liu, 2015). To correctly merge and refine data from multiple sources, in this study, we propose to regard record linkage as a three-way decision problem.

The previously mentioned factors affecting the final decision-making are related to uncertainty. By taking uncertainty into account in pattern recognition, the three-way decision process extends the generic binary decision to three alternatives viewed as acceptance, rejection and an uncertain region. With further investigation of the uncertain region, the effect of uncertainty on decision-making is reduced (Yao, 2011, Liang et al., 2015). Therefore, the key step in applying the three-way decision process to record linkage is to distinguish these three regions, especially the uncertain region. In existing studies reported in the literature, the uncertain region is usually viewed as the intersection of the negative parts of two binary decision models. For example, for a two-class problem recognizing ClassA and ClassB, the uncertain region consists of data points that are in both not-ClassA (the negative part of the binary model classifying ClassA) and not-ClassB. To implement this discrimination mechanism theoretically, fuzzy and rough set models are two commonly used alternatives in three-way modeling (Pawlak, 1982, Tsang et al., 2013). For example, rough set-based positive, negative, and boundary regions and their extensions were used to describe the three regions in (Yao, 2015). Shadowed sets are developed on the basis of fuzzy sets, and they generalize decision-making with fuzzy sets into a three-valued approximation process whose results can support the outcomes of the three-way decision (Pedrycz, 1998). More recently, the game-theoretic rough set model was also developed on the basis of rough sets to deal with this issue (Herbert, 2011). This model provides a mechanism to configure suitable parameters describing probabilistic rough sets, which reflect the regions in the three-way decision. Furthermore, by introducing game theory to construct an objective function minimizing the uncertainty degree of the three probabilistic rough set-based regions, the final thresholds are determined to distinguish certain and uncertain regions in the three-way decision (Azam, 2014).

Summarizing the methods mentioned above, we can regard the process of describing the uncertain region as falling within the realm of Granular Computing (GrC) (Bargiela and Pedrycz, 2016, Al-Hmouz et al., 2018, Yao, 2013). For example, in traditional two-class applications, the belongingness (membership) of data can be expressed by an information granule-based index, such as a probability or a membership function. Based on an objective function maximizing the classification performance, an optimal threshold is selected as the boundary dividing the index space into two certain granules representing the two classes. For the three-way decision, uncertain points are more likely to have an index value around the threshold of the two-way decision. Therefore, researchers often move the threshold in opposite directions. This leads to a new interval that extends the previously constructed two granules into three information granules. By optimizing the interval (granule) to contain as many “uncertain” points as possible, the classification accuracy is improved. The three-way decision is thus realized by granular computing on a one-dimensional index, which means classifiers are formed by index thresholds that distinguish certain and uncertain regions. This formulation is simple but is poorly explained as a classifier. In particular, in the classification of high-dimensional data, the classifier's morphology more likely depends on the data distribution. In Al-Hmouz et al. (2018), a distribution-based attempt was made to divide the regions in the three-way decision: each coordinate is divided into three regions, and the decision is made by a comprehensive analysis of all coordinates. One noticeable problem is that this division of the data space yields 9 regions for two-dimensional data, and $3^n$ regions for $n$-dimensional datasets.

Based on the above description, the objective of this paper can be summarized as the development of a new GrC method that reflects the data distribution of certain and uncertain points in the three-way decision, and that can be applied to improve record linkage performance. In this paper, we propose a two-phase approach to realize the three-way decision with the use of granular descriptors. In the first phase, we construct an information granule that directly describes the distribution of uncertain points in the data space. This information granule divides the data into certain and uncertain regions without generating any redundant regions (Al-Hmouz et al., 2018). The originality of this process lies in two aspects: the coverage function is calculated from a newly defined entropy-based uncertainty degree, and the specificity index is controlled by a parameter used in the construction of the information granule. Based on this granule, we identify uncertain points for “clerical investigation”. In the second phase, all the remaining certain points are classified by a generic two-way decision. The proposed three-way decision approach is applied to both synthetic data and publicly available data, and its effectiveness is validated by several evaluation indexes. Finally, real-world record linkage datasets with different noise levels are discussed in this study.
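To give a feel for an entropy-based uncertainty degree, the sketch below scores each data point by the normalized Shannon entropy of its membership degrees. This is an assumption made for illustration only: the text here does not give the paper's exact definition, and the function name `uncertainty_degree` and the example memberships are hypothetical.

```python
import numpy as np

def uncertainty_degree(memberships, eps=1e-12):
    """Normalized Shannon entropy of each row of a membership matrix.

    memberships: array of shape (N, c) whose rows sum to 1 (for example,
    a fuzzy partition matrix). This is an assumed stand-in for the
    paper's uncertainty degree, not its actual definition.
    Returns values in [0, 1]: near 0 for crisp rows, 1 for uniform rows.
    """
    u = np.clip(np.asarray(memberships, dtype=float), eps, 1.0)
    entropy = -np.sum(u * np.log(u), axis=1)
    return entropy / np.log(u.shape[1])  # divide by the maximum entropy log(c)

U = np.array([[1.0, 0.0],    # clearly in the first cluster
              [0.5, 0.5],    # on the boundary between clusters
              [0.9, 0.1]])   # mostly in the first cluster
print(uncertainty_degree(U))
```

Under this assumption, points near the classification boundary receive values close to 1, so a granule built to cover high-uncertainty points would concentrate around the boundary.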

The paper is organized as follows. Section 2 discusses the problem formulation of three-way decision. Section 3 presents the process of constructing an information granule. Considering that the function of the granule is to describe the uncertain region, the coverage and specificity functions in granular construction are specialized by uncertainty degree and a control parameter. Section 4 forms the classification rules of three-way decision by granular principle, and also elaborates on several evaluation indexes. Section 5 reports some experimental studies concerning two-dimensional synthetic data, publicly available datasets, and real-world record linkage data. Finally, Section 6 concludes this paper.

Section snippets

Problem formulation

In pattern recognition, two-class classification is a standard problem. Assume the dataset is composed of $N$ data points, $X = \{x_k \mid x_k \in \mathbb{R}^n,\ k = 1, 2, \dots, N\}$, where $n$ is the dimensionality of the data space and $N$ is the number of data points; the two classes (ClassA and ClassB) are denoted as $C_A$ and $C_B$, respectively. Assuming a certain classifier has been constructed, for a given input $x_k$ its output is denoted by $y_k$. The well-known two-class decision (Liu, Liang, & Wang, 2016) problem could be expressed in the

Granular description of the uncertain region

According to the formulated problem, the three-way decision is necessary and critical for classification. The key task of the three-way decision is to find a way to describe the uncertain points and to separate the certain and uncertain regions. Since information granules are sound data descriptors in pattern recognition and knowledge discovery (Pedrycz et al., 2015, Gacek and Pedrycz, 2015), we propose to construct an information granule to describe the shape of the uncertain

Classification rules of three-way decision in record linkage

After obtaining the granule Ω describing the uncertain region, we can separate the data points into uncertain points for “clerical investigation” and certain points for classification. This process is called the three-way decision. The excluded dataset is defined as $X_0$, and the remaining data as $X'$. Denoting the number of points in $X_0$ as $N_0$, the uncertain rate reflecting the percentage of uncertain points can be expressed as γ%, as below.

$$\gamma\% = \frac{N_0}{N} \times 100\%$$

Generally, the uncertain rate of a real record
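As a small worked example of the uncertain rate, the sketch below splits a toy dataset into the excluded set and the remaining set using a boolean mask, then computes the percentage of uncertain points. The mask stands in for the test "point lies inside the granule Ω"; the function name `uncertain_rate`, the mask and the data are all made up for illustration.

```python
import numpy as np

def uncertain_rate(is_uncertain):
    """Percentage of points flagged as uncertain: N0 / N * 100."""
    is_uncertain = np.asarray(is_uncertain, dtype=bool)
    n0 = int(is_uncertain.sum())          # N0: number of uncertain points
    return 100.0 * n0 / is_uncertain.size

# Hypothetical mask: True where a point falls inside the uncertain granule.
mask = np.array([True, False, False, True, False, False, False, False])
X = np.arange(16, dtype=float).reshape(8, 2)   # toy data, N = 8 points

X0 = X[mask]         # excluded points, sent to clerical investigation
X_prime = X[~mask]   # remaining points, passed to the binary classifier

print(uncertain_rate(mask))  # 25.0
```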

Synthetic data

To discuss the proposed three-way decision, we generated a two-dimensional synthetic dataset to experiment with the record linkage application. The dataset contains data coming from two classes in which 500 data points are generated randomly following a Gaussian distribution. First, the original data are normalized to $[0,1]^2$. Then, the data are partitioned into two clusters with the use of the FCM algorithm. The partition matrix and cluster centers are produced, which will be utilized in further
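The preprocessing steps described above (Gaussian data, normalization to the unit square, FCM with two clusters) can be sketched as below. Several details are assumptions rather than the paper's settings: the 500 points are taken as 250 per class, the class means and spread are arbitrary, and `fcm` is a minimal textbook fuzzy c-means with fuzzifier `m = 2`, not the authors' implementation.

```python
import numpy as np

def fcm(X, c=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means. Returns (U, V): partition matrix and centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)               # rows of U sum to 1
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                       # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # standard FCM update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

rng = np.random.default_rng(42)
# Assumed layout: 250 points per class; means and scale are illustrative.
class_a = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(250, 2))
class_b = rng.normal(loc=[4.0, 4.0], scale=0.8, size=(250, 2))
X = np.vstack([class_a, class_b])

# Min-max normalization of each coordinate to [0, 1].
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

U, V = fcm(X, c=2)
print(U.shape, V.shape)  # (500, 2) (2, 2)
```

Each row of `U` holds a point's membership degrees in the two clusters, and `V` holds the cluster centers, which are the two outputs the text says are used in later steps.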

Conclusions

In this paper, an approach based on a granular description of the uncertain region is proposed as a three-way decision-making process and applied to improve the performance of record linkage. First, an information granule covering the uncertain points that need “clerical investigation” is constructed using the principle of justifiable granularity. In this process, FCM clustering is applied to generate the partition matrix and membership degrees. A function calculating a data point's uncertainty degree is defined by

CRediT authorship contribution statement

Tinghui Ouyang: Methodology, Investigation, Validation, Writing - original draft. Witold Pedrycz: Conceptualization, Supervision, Writing - review & editing. Nick J. Pizzi: Data curation, Funding acquisition.

Acknowledgment

The work in this paper is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) in the form of an NSERC Strategic Grant (No. STPGP 462980).

References (32)

  • R. Al-Hmouz et al. Development of multimodal biometric systems with three-way and fuzzy set-based decision mechanisms. International Journal of Fuzzy Systems (2018)

  • A. Bargiela et al. Granular computing. Handbook on computational intelligence: volume 1: fuzzy logic, systems, artificial neural networks, and learning systems (2016)

  • B.V. Berkel et al. Triphone analysis: A combined method for the correction of orthographical and typographical errors

  • P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering (2012)

  • P. Christen. Data matching - concepts and techniques for record linkage, entity resolution, and duplicate detection (2012)

  • I.P. Fellegi et al. A theory for record linkage. Journal of the American Statistical Association (1969)