Knowledge-Based Systems

Volume 176, 15 July 2019, Pages 97-109
PatternFinder: Pattern discovery for truth discovery

https://doi.org/10.1016/j.knosys.2019.03.027

Highlights

  • First study to discover patterns for truth discovery.

  • An optimization problem is formulated to discover the patterns.

  • An iterative algorithm is proposed to jointly and iteratively learn the variables.

  • Experimental results show the effectiveness and efficiency of the proposed algorithm.

Abstract

Truth discovery methods infer truths from multiple sources. These methods usually resolve conflicts based on information at the entity level. However, due to incompleteness and the difficulty of entity matching, the information on an individual entity is often insufficient. This motivates pattern discovery, which aims to mine useful patterns across entities from a global perspective. In this paper, we introduce pattern discovery for truth discovery and formulate it as an optimization problem. To solve this problem, we propose an algorithm called PatternFinder that jointly and iteratively learns the variables. We also propose an optimized grouping strategy to enhance its efficiency. Experimental results on simulated and real-world datasets demonstrate the advantage of the proposed methods, which outperform the state-of-the-art baselines in terms of both effectiveness and efficiency.

Introduction

It is critical to identify correct information from multi-source conflicting data. Such a task is called truth discovery [1], [2]. A straightforward truth discovery approach is to conduct majority voting or averaging. However, the most significant shortcoming of such approaches is that they assume all the sources are equally reliable. As information quality usually varies a lot among different sources [2], such approaches may not achieve correct results.

To improve the performance, various truth discovery methods [1], [3], [4], [5], [6], [7], [8] have been proposed. These methods share a common principle: if an entity’s information provided by a source is often supported by other sources, the source is regarded as reliable, and in turn, its information is more likely to be true. It follows that for one entity, its correct values are found by resolving the conflicts among multiple sources. Setting aside duplication among sources [3], for a set of entities, the greater the number of sources that provide information for each entity, the more likely we are to identify the reliable sources and find the truths.
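This mutual reinforcement between source reliability and truth estimation can be sketched as a small iterative procedure. The following is a minimal toy sketch of the general principle, not the algorithm of any cited method; the data layout and function name are illustrative assumptions.

```python
from collections import Counter

def iterative_truth_discovery(claims, n_iters=10):
    """Toy mutual-reinforcement truth discovery (illustrative only).

    claims: dict mapping (entity, source) -> claimed value.
    Returns (truths, source_weights).
    """
    sources = {s for (_, s) in claims}
    entities = {e for (e, _) in claims}
    weights = {s: 1.0 for s in sources}  # start with all sources equal

    truths = {}
    for _ in range(n_iters):
        # Truth step: for each entity, pick the value with the
        # highest total weight of supporting sources.
        for e in entities:
            votes = Counter()
            for (ent, s), v in claims.items():
                if ent == e:
                    votes[v] += weights[s]
            truths[e] = max(votes, key=votes.get)
        # Weight step: a source's weight is the fraction of its
        # claims that agree with the current truth estimates.
        for s in sources:
            own = [(e, v) for (e, src), v in claims.items() if src == s]
            agree = sum(1 for e, v in own if truths[e] == v)
            weights[s] = agree / len(own) if own else 0.0
    return truths, weights
```

With sufficient per-entity evidence (several sources per entity), the weight step separates reliable from unreliable sources; with only one claim per entity, every claim trivially agrees with itself, which is exactly the failure mode the following discussion addresses.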

Unfortunately, when the evidence is insufficient on the entity level, the picture is different. The insufficient evidence makes it difficult for existing methods to identify the reliable sources and find the truths, especially for entities covered mostly by the unreliable sources. For the circumstance where there is insufficient evidence on the entity level [9], [10], we first analyze its causes in three aspects, i.e., long-tail phenomenon, mismatching, and incompleteness.

  • Long-tail phenomenon. The phenomenon where the entities’ information is provided by very few sources is common in applications [4], [9]. For one source, it may contain the information about a large number of entities. However, most of the entities in this source may not have the corresponding information in the other sources.

  • Mismatching. In many applications, it is common that each source has its own entity identifier and becomes an isolated island of information [10]. To identify each entity’s information from multiple sources, a natural way is to first conduct entity matching. However, due to erroneous values in the multi-source noisy data, it is hard to correctly link the records [11], [12], [13], [14], which also results in insufficient evidence for the entities.

  • Incompleteness. Due to the incomplete entry, inaccurate extraction or heterogeneous schemas, it is very prevalent that sources only provide information for a subset of attributes about a given entity [15]. Thus, even if entity matching is feasible and effective, enough information for each attribute of the entities cannot be guaranteed.

Due to insufficient evidence on the entity level in these three aspects, it is challenging to find the truths in multi-source unaligned data. We use an example to illustrate how existing methods work and to motivate our approach.

Example 1

Table 1 contains six records collected from three hospitals {X1,X2,X3}. Each record o specifies a patient described by four attributes: name, age, condition, and measure, among which the condition denotes the clinical symptom of the patient and measure denotes the therapeutic drug for the patient. All erroneous values are marked in italics and their correct values are given in the following brackets. Note that we do not know whether the records provided by the different hospitals refer to the same patient, and missing values are represented as “-”.

Based on this example, we can see that o1,o2,o4,o6 tend to refer to different patients, as they have dissimilar attribute values. Although o3 and o5 may refer to the same person (same name and measure), the wrong condition value “feven” in o3 and the missing age in o5 leave too little information to link them. As a result, o1,o2,o3,o4,o5,o6 will each be treated as a record for a separate entity. However, a single piece of information per patient is insufficient for the existing methods [3], [4], [5], [6], [7], [8], [16], [17], [18], [19] to infer the truths and identify the reliable sources. Given o1 provided by X1, there are two circumstances: (1) X2 and X3 may provide the same information as o1, supporting o1 as true; (2) X2 and X3 may provide information conflicting with o1, in which case the information provided by the most reliable source is taken as true. Hence, more evidence is needed from X2 and X3. Without such evidence about these patients, existing methods will consider o1,o2,o3,o4,o5,o6 all to be true, and fail to find the true values of o3, o4, and o6, e.g., the true value “fever” for the erroneous condition “feven” of o3. Moreover, treating all these records as true leads to the conclusion that all the sources are reliable, whereas in fact X2 and X3 are not very reliable, as they contain several errors.

Observations. The above example indicates that truth discovery methods will become less effective when faced with long-tail phenomena, mismatching, and incompleteness issues on the entity level. Fortunately, such entities may still find counterparts that share similar patterns. Consider hospitals and social forums as examples. Patients from different hospitals may be different, but the properties (e.g., symptoms, medical history, demographics) of patients with the same disease could be quite comparable; multiple online social forums may attract overlapping but different sets of users, in which user communities and community patterns may be shared across platforms. Therefore, when the evidence is not sufficient on the entity level, the latent patterns shared among different entities would be helpful to discover the truths of the entities.

Pattern Discovery. Motivated by these observations, in this study, we propose to leverage pattern discovery for truth discovery on multi-source unaligned data. A pattern is a triple consisting of an applied set, an attribute set, and a value combination over the attribute set. For each pattern, the applied set precisely describes the scope that the pattern applies to. If a record is in the applied set of a pattern, its values on the attribute set should match the value combination of the pattern.

Example 2

Table 2 shows two patterns whose attribute set contains the condition and measure. For the first pattern, its applied set is {o1,o3,o5}, and the value combination is (fever, febrifuge). It states that for {o1,o3,o5}, their values for the condition and measure should be “fever” and “febrifuge”, respectively. Considering o3, the value “feven” for the condition will be corrected to “fever”. Similarly, the second pattern states that for {o2,o4,o6}, their values for the condition and measure should be “stroke” and “thrombolytic”, respectively. The errors in o4 and o6 will then also be corrected.
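The correction described in Example 2 can be expressed mechanically: given a pattern triple, overwrite the pattern’s attributes for every record in its applied set. The sketch below is illustrative only; the record layout and function name are assumptions, not the paper’s implementation.

```python
def apply_pattern(records, pattern):
    """Apply one pattern (applied_set, attribute_set, value_combination).

    records: dict mapping record id -> dict of attribute -> value.
    Returns a corrected copy; the input records are left untouched.
    """
    applied_set, attrs, values = pattern
    corrected = {rid: dict(rec) for rid, rec in records.items()}
    for rid in applied_set:
        # Overwrite (or fill in) each attribute in the attribute set
        # with the pattern's value combination.
        for attr, val in zip(attrs, values):
            corrected[rid][attr] = val
    return corrected
```

For instance, applying the first pattern of Example 2 to a record set in which o3 carries the erroneous condition “feven” replaces it with “fever” while leaving records outside the applied set unchanged.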

Based on this example, we can infer that, when the evidence is insufficient on the entity level, matching the corresponding patterns can help to improve the performance of truth discovery. Therefore, it is crucial to design algorithms for all the entities across sources so that the patterns shared among them can be automatically discovered. However, discovering proper patterns raises several challenges.

  • As no oracle tells which attribute can make up the attribute set of the patterns, the question is how to infer the attribute set so that it can accurately cover all and only the significant attributes.

  • With errors in the multi-source records, the concern is how to generate the value combinations concisely enough to be close to the true ones.

  • We need to accurately apply the patterns to each record, and efficiently find the applied set for each pattern.

In this paper, we jointly address these issues. First, to obtain the attribute set, we assign an attribute weight to each attribute. The higher the weight of an attribute, the higher the possibility that it belongs to the attribute set. Second, to ensure the accuracy of the value combinations, a source weight is assigned to each source, indicating that the information provided by sources with higher weights is more reliable. Third, to find the applied set for each pattern, we aim to infer the latent groups which share the same pattern. The patterns can then be discovered by inferring the group-level representatives and applied to the group members. In summary, we model the pattern discovery problem by an optimization framework, where the latent groups, the group-level representatives, the source weights, and the attribute weights are defined as four sets of variables. The objective is to minimize the overall weighted deviation between the group-level representatives and the multi-group records. We propose an algorithm to solve the optimization problem by iteratively updating the four sets of variables. Benefiting from the iterative procedure, we can achieve high-quality patterns on significant attributes provided by reliable sources. By using a global analysis of all the entities, the proposed algorithm can take advantage of more evidence from the entities. The generated patterns will be useful in detecting the high-level behavior of entities observed from multiple perspectives, such as user community patterns of multiple social networks [20], patient symptoms recorded and diagnosed by multiple hospitals [21], and traffic features captured by multiple sensors [22].
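The four-variable iteration described above has the shape of block coordinate descent: fix three sets of variables and update the fourth. The sketch below makes that structure concrete. It is a simplified illustration, not PatternFinder itself; the specific update rules (source-weighted voting for representatives, a log-ratio source-weight update, an exponential attribute-weight update) are common choices in the truth discovery literature and are assumptions here.

```python
import math
from collections import Counter, defaultdict

def pattern_finder_sketch(records, sources, n_groups, n_iters=10):
    """Block-coordinate sketch of the four-variable update (illustrative).

    records: dict rid -> dict of attribute -> value (None = missing).
    sources: dict rid -> source id.
    """
    attrs = sorted({a for r in records.values() for a in r})
    src_ids = sorted(set(sources.values()))
    attr_w = {a: 1.0 / len(attrs) for a in attrs}
    src_w = {s: 1.0 for s in src_ids}
    rids = sorted(records)
    assign = {rid: i % n_groups for i, rid in enumerate(rids)}  # init groups
    reps = {}

    def deviation(rec, rep):
        # Weighted count of attributes where the record disagrees
        # with the group-level representative (missing values ignored).
        return sum(attr_w[a] for a in attrs
                   if rec.get(a) is not None and rec.get(a) != rep.get(a))

    for _ in range(n_iters):
        # (1) Representatives: per group and attribute, source-weighted vote.
        for g in range(n_groups):
            rep = {}
            for a in attrs:
                votes = Counter()
                for rid in rids:
                    if assign[rid] == g and records[rid].get(a) is not None:
                        votes[records[rid][a]] += src_w[sources[rid]]
                if votes:
                    rep[a] = max(votes, key=votes.get)
            reps[g] = rep
        # (2) Groups: each record joins its closest representative.
        for rid in rids:
            assign[rid] = min(range(n_groups),
                              key=lambda g: deviation(records[rid], reps[g]))
        # (3) Source weights: lower total deviation -> higher weight.
        dev = defaultdict(float)
        for rid in rids:
            dev[sources[rid]] += deviation(records[rid], reps[assign[rid]])
        total = sum(dev.values()) + 1e-9
        for s in src_ids:
            src_w[s] = -math.log((dev[s] + 1e-9) / total)
        # (4) Attribute weights: attributes that rarely disagree with the
        # representatives are more significant.
        mism = {a: 0 for a in attrs}
        cnt = {a: 0 for a in attrs}
        for rid in rids:
            rep = reps[assign[rid]]
            for a in attrs:
                v = records[rid].get(a)
                if v is not None:
                    cnt[a] += 1
                    if v != rep.get(a):
                        mism[a] += 1
        raw = {a: math.exp(-mism[a] / cnt[a]) if cnt[a] else 1.0
               for a in attrs}
        z = sum(raw.values())
        for a in attrs:
            attr_w[a] = raw[a] / z
    return assign, reps, src_w
```

On data shaped like Example 1, such an iteration groups the fever records and the stroke records together and recovers the correct value combinations as the group-level representatives.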

Contributions. We summarize our contributions as follows.

  • We study pattern discovery for truth discovery. We formally define patterns and then formulate an optimization problem to discover the patterns.

  • We propose an iterative algorithm called PatternFinder to solve the problem by jointly inferring the latent groups, the group-level representatives, the source weights, and the attribute weights.

  • To improve the efficiency, we enhance PatternFinder by an optimized grouping strategy.

  • We present extensive experiments with simulated and real-world datasets. The experimental results clearly demonstrate the advantages of PatternFinder compared to the baselines.

Organization. We analyze related work in Section 2 and define the problem of pattern discovery in Section 3. Section 4 describes the overall solution and the main component PatternFinder, followed by the experimental results in Section 5. We conclude the paper with final remarks in Section 6.

Section snippets

Related work

The traditional problem of truth discovery has been studied for years to resolve conflicts among multiple sources [1], [4]. Existing approaches [3], [4], [5], [6], [7], [8], [16], [18], [19], [23], [24], [25], [26], [27], [28], [29] adopt a common principle that if the information provided by a source is often supported by other sources, the source is regarded as a reliable one, and in turn, its information is more likely to be true. Thus, for one entity, the value provided by reliable sources is more likely to be true.

Problem definition

We first define useful terms for the multi-source data and then give the problem definition.

An entity is a person or thing of interest. An attribute is a feature used to describe the entity. A data item is a paired entity-attribute. A data source describes the place where information about data items can be collected. A claim is a value of a data item provided by a data source. A record contains all the claims about an entity provided by a data source. Suppose there are K data sources.
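The vocabulary above maps naturally onto simple data structures. The sketch below is one possible encoding for illustration; the class and field names are hypothetical, not notation from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Claim:
    """A value for one data item (entity-attribute pair) from one source."""
    entity: str
    attribute: str
    value: str
    source: str

@dataclass
class Record:
    """All the claims about one entity provided by one data source."""
    entity: str
    source: str
    values: dict = field(default_factory=dict)  # attribute -> value

    def claims(self):
        # Expand the record into its individual claims.
        return [Claim(self.entity, a, v, self.source)
                for a, v in self.values.items()]
```

Under this encoding, a data source is simply a collection of Record objects sharing the same source id, and conflicts arise when claims for the same data item carry different values.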

Methodology

In this section, we formally introduce the approach of pattern discovery for truth discovery. We first provide the whole solution overview in Section 4.1. To achieve accurate patterns, we propose an optimization framework in Section 4.2 and solve it through the iterative algorithm PatternFinder in Section 4.3. To improve the efficiency, we then develop a scalable strategy for PatternFinder in Section 4.4. Finally, we discuss how to generate patterns and truths according to the output of PatternFinder.

Experiments

In this section, we evaluate the proposed methods using both simulated and real-world datasets. The experimental results clearly demonstrate the advantages of the proposed methods in pattern discovery and truth discovery in terms of both effectiveness and efficiency. We first discuss the experimental setup in Section 5.1, and then present experimental results for the simulated and real-world datasets in Sections 5.2 and 5.3, respectively.

Conclusion

In this paper, we introduce pattern discovery for truth discovery of multi-source unaligned data. We model this pattern discovery problem as a task of inferring latent groups using a general optimization framework. In this model, the objective is to minimize the overall weighted deviation between the group-level representatives and the multi-group records, where each source is weighted by its reliability and each attribute is weighted by its significance. We developed a four-step iterative algorithm, PatternFinder, to solve this optimization problem.

Acknowledgments

This paper was partially supported by NSF IIS-1747614 and IIS-1553411, NSFC grants U1509216, U1866602, and 61602129, and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.

References (44)

  • X. Yin, et al., Truth discovery with multiple conflicting information providers on the web, IEEE Trans. Knowl. Data Eng. (2008).

  • X.S. Fang, Q.Z. Sheng, X. Wang, A.H.H. Ngu, Value veracity estimation for multi-truth objects via a graph-based...

  • S.E. Whang, et al., Pay-as-you-go entity resolution, IEEE Trans. Knowl. Data Eng. (2013).

  • P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE Trans. Knowl. Data Eng. (2012).

  • A.K. Elmagarmid, et al., Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. (2007).

  • H. Köpcke, et al., Evaluation of entity resolution approaches on real-world match problems, PVLDB (2010).

  • N. Koudas, S. Sarawagi, D. Srivastava, Record linkage: similarity measures and algorithms, in: Proc. of SIGMOD, 2006, ...

  • C. Ye, H. Wang, J. Li, H. Gao, S. Cheng, Crowdsourcing-enhanced missing values imputation based on Bayesian network, ...

  • Y. Li, Q. Li, J. Gao, L. Su, B. Zhao, W. Fan, J. Han, On the discovery of evolving truth, in: Proc. of SIGKDD, 2015, ...

  • X. Wang, Q.Z. Sheng, L. Yao, X. Li, X.S. Fang, X. Xu, B. Benatallah, Truth discovery via exploiting implications from...

  • H. Zhang, Q. Li, F. Ma, H. Xiao, Y. Li, J. Gao, L. Su, Influence-aware truth discovery, in: Proc. of CIKM, 2016, pp. ...

  • B. Zhao, et al., A Bayesian approach to discovering truth from conflicting sources for data integration, PVLDB (2012).