Probabilistic correlation-based similarity measure on text records☆
Introduction
Unstructured text records are prevalent in databases and information systems, such as personal information management (PIM) systems and scientific literature digital libraries (e.g., CiteSeer). Various applications, for example similarity search [12], duplicate record detection [8], and information integration [1], rely on similarity evaluation among these unstructured records of text values. Table 1 shows an example of an unstructured record database which stores several citation records as text attributes. Due to varied information formats, such as abbreviations and missing data, evaluating the similarity of unstructured records in the real world is not easy.
Since unstructured text records are short text strings (as shown in Table 1), we can apply approximate string matching techniques such as edit distance [21] to measure their similarity. However, these character-based matching approaches capture only limited similarity and fail in many cases, such as varying word orders and incomplete information formats. Alternatively, we can treat each unstructured record as a text document and apply full-text retrieval techniques to measure record similarity. Specifically, records are represented by a set of weighted token features, and similarity is computed over these features. Cohen [4] proposes a word-token-based cosine similarity with tf∗idf weighting, which can detect the similarity of records despite varying word orders and missing data. Gravano et al. [9] propose a more effective approach that uses q-grams as record tokens, which can also handle spelling errors in records.
Unfortunately, the characteristics of unstructured text records differ from those of strings in full texts. First, due to the short length of text records, most words appear only once in a record; that is, the term frequency (tf) is 1 in most cases for such short text records in databases. We show the statistics of term frequency in Table 2: the vast majority of tokens, even q-gram tokens, appear only once in a record. Therefore, only the inverse document frequency (idf) [27] takes effect in the tf∗idf [24] weighting scheme, and no local features of each record are considered. Moreover, cosine similarity, the matching similarity measure popular for full text, assumes that tokens are independent of each other and ignores the correlations between them. Because of the varied representation formats of unstructured text records, such as abbreviations and missing data, the latent correlations of records can hardly be detected by token matching alone.

Example 1. Consider records No. 3 and 4 in Table 1 with the different author representations "Sudipto Guha, Nick Koudas, Amit Marathe, Divesh Srivastava" and "S. Guha, et al.", respectively. Under the cosine similarity, which is based on the dot product of two record vectors, there is only one matching token, "Guha", and the similarity value is low. Even worse, there is no matching token at all between the different representations of the same conference, "Very Large Data Bases" and "VLDB", so the cosine similarity between these two representations is 0. As a consequence, the cosine similarity of records No. 3 and 4 is low, although they actually describe the same citation entity. Cohen et al. [5] conclude that the full-text retrieval techniques, tf∗idf and cosine similarity, do not show the best performance when applied directly to text records in databases.
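The failure mode in Example 1 is easy to reproduce. The sketch below (not the paper's implementation) computes a binary-vector cosine similarity over word tokens; since tf is almost always 1 in short records, this is close to the tf∗idf cosine discussed above:

```python
import math

def tokens(s):
    """Lowercase word tokens of a record, punctuation stripped."""
    return [w.strip(".,") for w in s.lower().split()]

def cosine(r1, r2):
    """Binary-vector cosine over word tokens: the dot product of two
    0/1 record vectors, normalized by their lengths."""
    t1, t2 = set(tokens(r1)), set(tokens(r2))
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

# No shared token at all between the two venue representations:
print(cosine("Very Large Data Bases", "VLDB"))  # 0.0

# Only "guha" matches between the two author representations:
print(cosine("Sudipto Guha, Nick Koudas, Amit Marathe, Divesh Srivastava",
             "S. Guha, et al."))  # about 0.18
```

Any token-matching measure of this kind scores the two venue strings at exactly zero, which motivates looking beyond matched tokens.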
Motivated by the unsuitability of string matching and full-text retrieval techniques for measuring the similarity of text attribute records, in this paper we focus on developing similarity metrics based on token correlations, and we perform similarity evaluation over records directly, without data cleaning. Rather than only matching the tokens of records, our approach considers the correlations between tokens, which helps to discover more relationships between short text records with limited information. The correlations between tokens are estimated from the probability that tokens appear in the same records. These token correlations are then utilized in two aspects, i.e., intra-correlation and inter-correlation. Intra-correlations capture the correlations among tokens within a record and are used in token weighting: rather than simply assigning equal term frequencies to tokens, we derive the discriminative importance of each token from the degree of its correlation with the other tokens in the record. Inter-correlations capture the correlations of tokens across two records, which can reveal relationships between records beyond their matched tokens. Based on these token correlations, we can evaluate the similarity of text records in more diverse formats, for example with abbreviations and missing data.
Our contributions in this paper are summarized as follows:
- We develop a dictionary to capture the probabilistic correlations of tokens, and represent text records with consideration of both token frequencies and correlations. Highly correlated tokens are merged as phrase tokens to reduce the dictionary size.
- We propose a probabilistic correlation-based feature weighting scheme, namely correlation weight, by considering the intra-correlation of tokens in a record. Instead of term frequency, which equals 1 in most records and thus has no discriminative ability, the intra-correlation serves as the local feature of tokens in a record.
- We design a probabilistic correlation-based similarity function, called correlation similarity, by utilizing the inter-correlation of tokens in two records. In particular, we prove that the existing cosine similarity can be interpreted as a special case of the proposed correlation similarity.
- We extend the existing semantic-based word similarity in WordNet to a semantic-based record similarity, named semantic-based similarity (sbs). In particular, we combine the sbs method with our correlation similarity and propose the semantic-based correlation similarity (scor).
- We report an extensive experimental evaluation that demonstrates the superiority of the proposed approach, as well as our semantic-based extension, over existing measures.
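To illustrate the phrase-token idea in the first contribution, highly correlated adjacent tokens can be merged into a single phrase token. The sketch below uses a hypothetical greedy threshold rule; the paper's dictionary construction may differ in detail:

```python
def merge_phrases(record_tokens, corr, threshold=0.9):
    """Greedily merge adjacent token pairs whose correlation exceeds
    `threshold` into one phrase token. `corr` maps an ordered token
    pair to its correlation value (an illustrative sketch)."""
    merged, i = [], 0
    while i < len(record_tokens):
        if (i + 1 < len(record_tokens)
                and corr.get((record_tokens[i], record_tokens[i + 1]), 0.0)
                    >= threshold):
            merged.append(record_tokens[i] + " " + record_tokens[i + 1])
            i += 2
        else:
            merged.append(record_tokens[i])
            i += 1
    return merged

corr = {("very", "large"): 0.95, ("data", "bases"): 0.92}
print(merge_phrases(["very", "large", "data", "bases"], corr))
# ['very large', 'data bases']
```

Merging in this way shrinks the token dictionary and lets multi-word names behave as single features.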
The rest of this paper is organized as follows. We illustrate the probabilistic correlation of tokens in Section 2. Section 3 presents our probabilistic correlation-based weighting scheme and also the probabilistic correlation-based similarity function. In Section 4, we discuss the effectiveness of our approach from a methodological perspective. The extension on semantic-based similarity is introduced in Section 5. Section 6 demonstrates the performance of our approach through an experimental evaluation. In Section 7, we discuss some related work. Finally, we conclude this paper in Section 8. An early extended abstract of this paper is reported in [26].
Probabilistic correlation
The cosine similarity measure [4] makes an assumption that tokens in records are independent of each other, and the correlations between tokens are ignored. In practice, however, token correlations do exist, for example, the token “International” has a high probability of appearing together with the token “Conference” in citation records. In this section, we develop a model of correlations between tokens by considering the conditional probability of token co-occurrence.
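One simple way to estimate such conditional probabilities from record co-occurrence is sketched below. The estimator P(t1 | t2) = (#records containing both t1 and t2) / (#records containing t2) is an assumption standing in for the paper's exact definition:

```python
from collections import Counter
from itertools import permutations

def build_correlations(records):
    """Estimate P(t1 | t2) over a list of tokenized records as the
    fraction of records containing t2 that also contain t1."""
    df = Counter()  # document frequency of each token
    co = Counter()  # co-occurrence counts of ordered token pairs
    for rec in records:
        toks = set(rec)
        df.update(toks)
        co.update(permutations(toks, 2))
    return {(t1, t2): co[(t1, t2)] / df[t2] for (t1, t2) in co}

records = [
    ["international", "conference", "data", "engineering"],
    ["international", "conference", "management", "data"],
    ["journal", "data", "engineering"],
]
corr = build_correlations(records)
print(corr[("conference", "international")])  # 1.0
print(corr[("international", "data")])        # 2/3
```

In this toy corpus, "international" always appears together with "conference", so P(conference | international) = 1, matching the intuition in the text.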
Record similarity measure
In this section, we illustrate our probabilistic correlation-based record similarity measure. Firstly, we discuss the weighting scheme of tokens in the records. The correlation between tokens is used in the weighting of tokens. Then we introduce our correlation similarity function which is also based on the correlation between tokens.
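To make the two ingredients concrete, the sketch below implements an illustrative correlation weight (one plus the average intra-record correlation) and a correlation similarity that also credits correlated, non-matching token pairs. Both formulas are stand-ins, not the paper's exact definitions; with an empty correlation table and duplicate-free records, the similarity reduces to the binary-token cosine:

```python
import math

def correlation_weight(toks, corr):
    """Illustrative intra-correlation weight: 1 plus the average
    correlation of a token with the other tokens of its record."""
    w = {}
    for t in toks:
        others = [max(corr.get((t, u), 0.0), corr.get((u, t), 0.0))
                  for u in toks if u != t]
        w[t] = 1.0 + (sum(others) / len(others) if others else 0.0)
    return w

def correlation_similarity(r1, r2, corr):
    """Correlated (not only identical) token pairs contribute to the
    score, which is normalized by the two self-similarities."""
    def pair(a, b):
        return 1.0 if a == b else max(corr.get((a, b), 0.0),
                                      corr.get((b, a), 0.0))
    def score(a, wa, b, wb):
        return sum(wa[x] * wb[y] * pair(x, y) for x in a for y in b)
    w1, w2 = correlation_weight(r1, corr), correlation_weight(r2, corr)
    return score(r1, w1, r2, w2) / math.sqrt(
        score(r1, w1, r1, w1) * score(r2, w2, r2, w2))

# Hypothetical correlations linking the abbreviation to the full name.
corr = {("vldb", "very"): 0.4, ("vldb", "large"): 0.4,
        ("vldb", "data"): 0.4, ("vldb", "bases"): 0.4}
full = ["very", "large", "data", "bases"]
print(correlation_similarity(full, ["vldb"], corr))  # ≈ 0.8; cosine gives 0
```

Unlike cosine similarity, the two venue representations from Example 1 now receive a nonzero score through their correlated token pairs.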
Methodology analysis
In this section, we analyze the effectiveness of our approach from a methodological perspective, especially in dealing with various information formats of unstructured text records, such as abbreviation and incomplete information.
Since we study the similarity of unstructured text records with short length and limited information, our correlation similarity function relaxes the constraint of token matching in the cosine similarity function, by considering the further inter-correlations of tokens
Integrating with semantic-based similarity from external sources
Our proposed correlation technique can successfully obtain a part of the relationships among tokens that appear together, without any external sources. Nevertheless, when external knowledge bases are available, the ontology-based or semantic-based approach can retrieve more token relationships. In this section, we present a novel approach to incorporate our correlation-based similarity with the semantic-based similarity.
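A minimal sketch of such a combination is given below, with a hypothetical linear blend `scor` over two word-pair similarity tables; the `sem` table stands in for a WordNet-derived word similarity, and the paper's actual combination may differ in form:

```python
def token_overlap(r1, r2, table):
    """Average, over r1's tokens, of the best score against any token
    of r2 (1.0 for an exact match). Note this toy score is asymmetric."""
    def pair(a, b):
        return 1.0 if a == b else max(table.get((a, b), 0.0),
                                      table.get((b, a), 0.0))
    return sum(max(pair(a, b) for b in r2) for a in r1) / len(r1)

def scor(r1, r2, corr, sem, alpha=0.5):
    """Hypothetical linear blend of the correlation-based signal and
    the semantic signal, controlled by alpha."""
    return (alpha * token_overlap(r1, r2, corr)
            + (1 - alpha) * token_overlap(r1, r2, sem))

corr = {("vldb", "bases"): 0.6}  # learned from record co-occurrence
sem = {("large", "big"): 0.9}    # e.g. a WordNet-derived synonym score
print(scor(["vldb", "large"], ["bases", "big"], corr, sem))  # ≈ 0.375
```

The blend lets the co-occurrence signal and the external semantic signal compensate for each other: here neither table alone covers both token pairs, but their combination credits both.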
Experimental evaluation
In this section, we report our experimental results. Section 6.1 evaluates the performance of our probabilistic correlation-based techniques under various settings, and Section 6.2 presents the comparison with existing approaches.
Dataset. We employ two datasets in our experiments, Cora and Restaurant.1 The first dataset Cora, prepared by McCallum et al. [19], consists of 1295 citation records of 122 research papers. The average length of records in
Related work
In this paper, we concentrate on similarity measures for unstructured text records in databases and address several issues specific to this application: the records are always short, and the information in such short records is limited.
One solution for addressing the problem is to conduct data cleaning and formatting first, by segmenting unstructured text records into structured entities with certain attributes [3] or recovering missing values [30]. Then, the similarity evaluation
Conclusions
In this paper, we propose a novel similarity measure for text records based on the probabilistic correlation of tokens. We define the probabilistic correlation between two word tokens as the probability that these tokens appear in the same records. Then we merge words with high correlations into phrase and extend the correlation between phrase tokens. A feature weighting scheme is performed based on the intra-correlation of tokens in a record. Furthermore, we develop a correlation-based
Acknowledgment
This work is supported in part by China NSFC Grant Nos. 61202008, 61232018 and 61370055, Hong Kong RGC/NSFC Project No. N HKUST637/13, National Grand Fundamental Research 973 Program of China Grant No. 2012-CB316200, a Microsoft Research Asia Gift Grant, and a Google Faculty Award 2013.
References (31)
- E. Agichtein, S. Sarawagi, Scalable information extraction and integration, in: Tutorial of KDD’06, ...
- M. Bilenko, R.J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: KDD’03, 2003, pp. ...
- V. Borkar, K. Deshmukh, S. Sarawagi, Automatic segmentation of text into structured records, in: SIGMOD’01, 2001, pp. ...
- W.W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, ...
- W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IJCAI’03, ...
- W.W. Cohen, S. Sarawagi, Exploiting dictionaries in named entity extraction: combining semi-Markov extraction, ...
- S. Deerwester et al., Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. (1990)
- A.K. Elmagarmid et al., Duplicate record detection: a survey, IEEE Trans. Knowl. Data Eng. (2007)
- L. Gravano, P.G. Ipeirotis, N. Koudas, D. Srivastava, Text joins in an RDBMS for web data integration, in: WWW’03, ...
- S. Guha, N. Koudas, A. Marathe, D. Srivastava, Merging the results of approximate match operations, in: VLDB’04, 2004, ...
- G.R. Hjaltason, H. Samet, Index-driven similarity search in metric spaces (survey article), ACM Trans. Database Syst.