Record linkage based on a three-way decision with the use of granular descriptors
Introduction
Record linkage is an important part of data processing (Fellegi and Sunter, 1969, Reyes-Galaviz et al., 2017, Vatsalan et al., 2017). In the big data era, we often have to process several disparate data sources simultaneously, e.g., in health care systems, government records, publications and other areas. Because many records coming from different sources belong to the same entity (Christen, 2012a, Christen, 2012b), it is not advisable to run data analytics on all these raw data sources directly. The major purpose of record linkage is to match and merge all records of an entity, which refines the data sources and makes data analysis more efficient. Generally, record linkage is regarded as a two-class decision-making problem, which attempts to distinguish matched and not-matched record pairs. However, in real-world applications, records cannot always be classified correctly by such a binary decision process due to several compelling factors (Berkel, 1988), e.g., data noise, handwriting errors, and address changes. These factors may cause the records of one entity to be classified as not-matched, or the records of two different entities to be classified as matched, especially for points (record pairs) whose values lie close to the classification boundary. In this situation, further human investigation is required to guarantee the system's credibility. We call a decision-making process that involves such “clerical investigation” a three-way decision (Liang et al., 2017, Liang and Liu, 2015). To correctly merge and refine data from multiple sources, in this study we propose to regard record linkage as a three-way decision problem.
The previously mentioned factors affecting the final decision are in fact related to uncertainty. By taking uncertainty into account in pattern recognition, the three-way decision process extends the generic binary decision to three alternatives: acceptance, rejection and an uncertain region. Further investigation of the uncertain region then reduces the effect of uncertainty on decision making (Yao, 2011, Liang et al., 2015). The key step in applying the three-way decision process to record linkage is therefore to distinguish these three regions, especially the uncertain region. In existing studies, the uncertain region is usually viewed as the overlap of the negative parts of two binary decision models. For example, for a two-class problem recognizing ClassA and ClassB, the uncertain region consists of data points that fall in both not-ClassA (the negative part of the binary model classifying ClassA) and not-ClassB. To implement this discrimination mechanism theoretically, fuzzy sets and rough sets are two commonly used alternatives in three-way modeling (Pawlak, 1982, Tsang et al., 2013). For example, rough set-based positive, negative, and boundary regions and their extensions were utilized to describe the three regions in (Yao, 2015). Shadowed sets are developed on the basis of fuzzy sets, and they generalize decision-making with fuzzy sets into a three-valued approximation process whose results can support the outcomes of three-way decision (Pedrycz, 1998). Recently, a newly proposed model, the game-theoretic rough set, was also developed on the basis of rough sets to deal with this issue (Herbert, 2011). The model proposes a mechanism to configure suitable parameters describing probabilistic rough sets, which reflect the regions in three-way decision.
Furthermore, by introducing game theory to construct an objective function minimizing the uncertainty degree of three probabilistic rough set-based regions, the final thresholds are effectively determined to distinguish certain and uncertain regions in three-way decision (Azam, 2014).
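As an illustration of the view described above, in which the uncertain region collects the points rejected by both binary models, the following minimal Python sketch combines two binary decisions into a three-way outcome. The function name, the score inputs and the threshold values are hypothetical and are not taken from the cited studies; deferring points accepted by both models is an additional assumption of this sketch.

```python
def three_way_from_binary(score_a, score_b, thr_a=0.7, thr_b=0.7):
    """Combine two binary models (ClassA vs. rest, ClassB vs. rest).

    score_a, score_b: hypothetical confidence scores in [0, 1].
    thr_a, thr_b: hypothetical acceptance thresholds of the two models.
    """
    in_a = score_a >= thr_a  # positive part of the ClassA model
    in_b = score_b >= thr_b  # positive part of the ClassB model
    if in_a and not in_b:
        return "ClassA"
    if in_b and not in_a:
        return "ClassB"
    # not-ClassA and not-ClassB: the uncertain region described in the text.
    # Assumption: points accepted by both models are also deferred.
    return "uncertain"


print(three_way_from_binary(0.9, 0.1))
print(three_way_from_binary(0.5, 0.5))
print(three_way_from_binary(0.2, 0.8))
```

Only the points returned as "uncertain" would be passed on for further investigation.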
Summarizing the methods mentioned above, we can regard the process of describing the uncertain region as falling within the realm of Granular Computing (GrC) (Bargiela and Pedrycz, 2016, Al-Hmouz et al., 2018, Yao, October 2013). For example, in traditional two-class applications, the belongingness (membership) of data can be expressed by an information granule-based index, such as a probability or a membership function. Based on an objective function maximizing the classification performance, an optimal threshold is selected as the boundary dividing the index space into two certain granules representing the two classes. For three-way decision, uncertain points are more likely to have an index value close to the threshold of the two-way decision. Therefore, researchers often move the threshold in opposite directions. This leads to a new, fuzzy interval that extends the two previously constructed granules into three information granules. Optimizing the interval (granule) so that it contains as many “uncertain” points as possible improves the classification accuracy. In this sense, the three-way decision is realized by granular computing on a one-dimensional index, which means that classifiers are formed by index thresholds that distinguish certain and uncertain regions. This formation is simple but offers little explanation of the classifier's structure. In particular, when classifying high-dimensional data, the classifier's morphology is more likely to depend on the data distribution. In Al-Hmouz et al. (2018), an attempt based on data distribution was made to divide the regions in three-way decision: each coordinate is divided into three regions and the decision is made by a comprehensive analysis of all coordinates. One noticeable problem is that this division of the data space yields 9 regions for two-dimensional data and $3^n$ regions for n-dimensional datasets.
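The threshold-widening idea described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single index in [0, 1] where larger values favor acceptance; the names `three_way_by_interval`, `threshold` and `delta` are hypothetical, and the interval here is crisp rather than fuzzy. The second function only counts the regions produced when every coordinate is split into three parts.

```python
def three_way_by_interval(index, threshold=0.5, delta=0.1):
    """Three-way decision on a one-dimensional index.

    The two-way threshold is moved in opposite directions by delta,
    giving the interval (threshold - delta, threshold + delta) that
    plays the role of the uncertain granule (hypothetical parameters).
    """
    lower, upper = threshold - delta, threshold + delta
    if index >= upper:
        return "accept"
    if index <= lower:
        return "reject"
    return "uncertain"


def coordinate_wise_regions(n_dims):
    """Number of regions when each of n_dims coordinates is split in three."""
    return 3 ** n_dims


print(three_way_by_interval(0.9), three_way_by_interval(0.55))
print(coordinate_wise_regions(2), coordinate_wise_regions(10))
```

The count grows exponentially with the dimensionality, which is the scaling problem noted for the coordinate-wise division.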
Based on the above description, the objective of this paper can be summarized as the development of a new GrC method that reflects the data distribution of certain and uncertain points in three-way decision and that can be applied to improve record linkage performance. In this paper, we propose a two-phase approach that realizes the three-way decision with the use of granular descriptors. In the first phase, we construct an information granule that directly describes the distribution of uncertain points in the data space. This information granule divides the data into certain and uncertain regions without generating additional redundant regions (Al-Hmouz et al., 2018). The originality of this process lies in two aspects: the coverage function is calculated from a newly defined entropy-based uncertainty degree, and the specificity index is controlled by a parameter used in the construction of the information granule. Based on this granule, we identify uncertain points for “clerical investigation”. In the second phase, all the remaining certain points are distinguished by a generic two-way decision. The proposed three-way decision approach is applied to both synthetic data and publicly available data, and its effectiveness is validated by several evaluation indexes. Finally, real-world record linkage datasets with different noise levels are discussed in this study.
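The two-phase flow outlined above can be illustrated with a deliberately simplified Python sketch. It does not reproduce the paper's granule construction or its newly defined uncertainty degree: as a stand-in, it uses the plain Shannon entropy of a point's two class memberships and a fixed cutoff to flag uncertain points. The names `membership_entropy`, `two_phase_decision` and `entropy_cutoff` are hypothetical.

```python
import math


def membership_entropy(u):
    """Shannon entropy (in bits) of a membership vector that sums to 1.

    Assumption: used here only as a stand-in for an entropy-based
    uncertainty degree; the paper defines its own measure.
    """
    return -sum(p * math.log2(p) for p in u if p > 0.0)


def two_phase_decision(memberships, entropy_cutoff=0.9):
    """Phase 1: defer high-entropy points. Phase 2: two-way decision."""
    decisions = []
    for u in memberships:
        if membership_entropy(u) >= entropy_cutoff:
            decisions.append("uncertain")  # sent to clerical investigation
        else:
            decisions.append("ClassA" if u[0] >= u[1] else "ClassB")
    return decisions


print(two_phase_decision([(0.95, 0.05), (0.5, 0.5), (0.1, 0.9)]))
```

In this simplified form the uncertain region is defined on a single index again; the approach in the paper instead describes that region by a granule in the data space.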
The paper is organized as follows. Section 2 discusses the problem formulation of three-way decision. Section 3 presents the process of constructing an information granule. Considering that the function of the granule is to describe the uncertain region, the coverage and specificity functions in granular construction are specialized by uncertainty degree and a control parameter. Section 4 forms the classification rules of three-way decision by granular principle, and also elaborates on several evaluation indexes. Section 5 reports some experimental studies concerning two-dimensional synthetic data, publicly available datasets, and real-world record linkage data. Finally, Section 6 concludes this paper.
Problem formulation
In pattern recognition, two-class classification is a standard problem. Assume the dataset is composed of N data points, $X = \{x_k \mid x_k \in \mathbb{R}^n,\ k = 1, 2, \dots, N\}$, where n is the dimensionality of the data space and N is the number of data points. The two classes (ClassA and ClassB) are denoted as $C_A$ and $C_B$, respectively. Assuming a certain classifier has been constructed, for a given input $x_k$ its output is denoted by $y_k$. The well-known two-class decision (Liu, Liang, & Wang, 2016) problem could be expressed in the
Granular description of the uncertain region
According to the formulated problem, three-way decision is necessary and critical for classification. The key task of three-way decision is to find a way to describe the uncertain points and to separate certain and uncertain regions. Since information granules are sound data descriptors in pattern recognition and knowledge discovery (Pedrycz et al., 2015, Gacek and Pedrycz, 2015), we propose to construct an information granule to describe the shape of the uncertain
Classification rules of three-way decision in record linkage
With the granule Ω describing the uncertain region, we can separate data points into uncertain points for “clerical investigation” and certain points for classification. This process is called the three-way decision. The excluded dataset is denoted as $X_0$, and the remaining data as $X'$. Letting $N_0$ be the number of points in $X_0$, the uncertain rate reflecting the percentage of uncertain points can be expressed as γ%, as below. Generally, the uncertain rate of a real record
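The split into excluded and remaining data, and the resulting uncertain rate, can be sketched in Python as follows. The sketch assumes that the rate is the share of excluded points among all N points, expressed as a percentage, which is how the text describes it. The membership test for Ω is passed in as a hypothetical predicate `in_omega`, since the construction of the granule is not shown here.

```python
def split_by_granule(points, in_omega):
    """Split points into X0 (inside the granule) and X' (the rest).

    in_omega: hypothetical predicate returning True for uncertain points.
    """
    x0 = [p for p in points if in_omega(p)]
    x_prime = [p for p in points if not in_omega(p)]
    return x0, x_prime


def uncertain_rate(n_uncertain, n_total):
    """Percentage of uncertain points: 100 * N0 / N (assumed form)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_uncertain / n_total


# Toy example: a box-shaped granule around (0.5, 0.5), chosen for illustration.
points = [(0.5, 0.5), (0.1, 0.1), (0.9, 0.9), (0.45, 0.55)]
box = lambda p: abs(p[0] - 0.5) <= 0.1 and abs(p[1] - 0.5) <= 0.1
x0, x_prime = split_by_granule(points, box)
print(len(x0), len(x_prime), uncertain_rate(len(x0), len(points)))
```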
Synthetic data
To discuss the proposed three-way decision, we generated a two-dimensional synthetic dataset to experiment with the record linkage application. The dataset contains data coming from two classes, in which 500 data points are generated randomly following a Gaussian distribution. First, the original data are normalized to $[0,1]^2$. Then, the data are partitioned into two clusters with the use of the FCM algorithm. The partition matrix and cluster centers are produced, which will be utilized in further
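The preprocessing steps described above (Gaussian data, normalization to the unit square, FCM with two clusters) can be sketched in plain Python. The class means, the unit standard deviation, the split of 250 points per class, the fuzzifier m = 2 and the random seeds are all assumptions made for this illustration; the source does not state them. The FCM updates are the standard membership and center formulas.

```python
import random


def make_data(n_per_class=250, seed=0):
    """Two Gaussian classes in 2-D (assumed means and spread)."""
    rng = random.Random(seed)
    means = [(0.0, 0.0), (3.0, 3.0)]
    return [(rng.gauss(mx, 1.0), rng.gauss(my, 1.0))
            for mx, my in means for _ in range(n_per_class)]


def min_max_normalize(points):
    """Scale every coordinate to [0, 1]; assumes non-constant coordinates."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
            for p in points]


def fcm(points, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy C-Means; returns (partition matrix as list of rows, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # initialize centers at random data points
    n_dims = len(points[0])
    u = []
    for _ in range(iters):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
        u = []
        for p in points:
            d = [max(sum((a - b) ** 2 for a, b in zip(p, v)) ** 0.5, 1e-12)
                 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # Center update: weighted mean of the points with weights u_ik^m.
        new_centers = []
        for i in range(c):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centers.append(tuple(
                sum(wk * p[dim] for wk, p in zip(w, points)) / total
                for dim in range(n_dims)))
        shift = max(abs(a - b) for v, nv in zip(centers, new_centers)
                    for a, b in zip(v, nv))
        centers = new_centers
        if shift < tol:  # stop once the centers no longer move
            break
    return u, centers


data = min_max_normalize(make_data())
partition, centers = fcm(data)
print(len(data), centers)
```

Each row of `partition` holds the two membership degrees of one point; rows with nearly equal memberships correspond to points lying between the two clusters.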
Conclusions
In this paper, an approach based on granular description of the uncertain region is proposed as a three-way decision-making process, which is applied to improve the performance of record linkage. First, an information granule covering the uncertain points that need “clerical investigation” is constructed using justifiable granulation. In this process, FCM clustering is applied to generate the partition matrix and membership degrees. A function calculating a data point's uncertainty degree is defined by
CRediT authorship contribution statement
Tinghui Ouyang: Methodology, Investigation, Validation, Writing - original draft. Witold Pedrycz: Conceptualization, Supervision, Writing - review & editing. Nick J. Pizzi: Data curation, Funding acquisition.
Acknowledgment
The work in this paper is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) in the form of an NSERC Strategic Grant (No. STPGP 462980).
References (32)
- et al. Analyzing uncertainties of probabilistic rough set regions with game-theoretic rough sets. International Journal of Approximate Reasoning (2014)
- et al. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Information Sciences (2016)
- et al. Three-way decisions based on decision-theoretic rough sets under linguistic assessment with the aid of group decision making. Applied Soft Computing (2015)
- et al. A novel three-way decision model based on incomplete information system. Knowledge-Based Systems (2016)
- et al. Model of selecting prediction window in ramps forecasting. Renewable Energy (2017)
- et al. Predictive model of yaw error in a wind turbine. Energy (2017)
- et al. Data description: A general framework of information granules. Knowledge-Based Systems (2015)
- et al. A supervised gradient-based learning algorithm for optimized entity resolution. Data and Knowledge Engineering (2017)
- The superiority of three-way decisions in probabilistic rough set models. Information Sciences (2011)
- et al. Selection of time window for wind power ramp prediction based on risk model. Energy Conversion and Management (2016)
- Development of multimodal biometric systems with three-way and fuzzy set-based decision mechanisms. International Journal of Fuzzy Systems
- Granular computing. Handbook on computational intelligence: volume 1: fuzzy logic, systems, artificial neural networks, and learning systems
- Triphone analysis: A combined method for the correction of orthographical and typographical errors
- A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering
- Data matching - concepts and techniques for record linkage, entity resolution, and duplicate detection
- A theory for record linkage. Journal of the American Statistical Association