Probabilistic correlation-based similarity measure on text records☆
Introduction
Unstructured text records are prevalent in databases and information systems, such as personal information management (PIM) systems and scientific literature digital libraries (e.g., CiteSeer). Various applications, for example similarity search [12], duplicate record detection [8], and information integration [1], rely on similarity evaluation among these unstructured records of text values. Table 1 shows an example of an unstructured record database which stores several citation records as text attributes. Due to varied information formats, such as abbreviations and missing data, evaluating the similarity of unstructured records in the real world is not easy.
Since unstructured text records are short text strings (as shown in Table 1), we can apply approximate string matching techniques such as edit distance [21] to measure their similarity. However, these character-based matching approaches capture only limited similarity and fail in many cases, such as varying word orders and incomplete information formats. Alternatively, we can treat each unstructured record as a text document and apply full-text retrieval techniques to measure record similarity. Specifically, records are represented by a set of weighted token features, and similarity is computed over these features. Cohen [4] proposes a word-token-based cosine similarity with tf∗idf weighting, which can detect the similarity of records despite varying word orders and missing data. Gravano et al. [9] propose a more effective approach that uses q-grams as record tokens, which can also handle spelling errors in records.
Unfortunately, the characteristics of unstructured text records differ from those of strings in full texts. First, due to the short length of text records, most words appear only once in a record; that is, the term frequency (tf) is 1 in most cases for such short text records in databases. We show the statistics of term frequency in Table 2: the vast majority of tokens, even q-gram tokens, appear only once in a record. Therefore, only the inverse document frequency (idf) [27] takes effect in the tf∗idf [24] weighting scheme, and no local features of each record are considered. Moreover, cosine similarity, the matching similarity measure popular for full text, assumes that tokens are independent of each other and ignores the correlations between them. Because of the varied representation formats of unstructured text records, such as abbreviations and missing data, the latent correlations of records can hardly be detected by token matching alone.

Example 1. Consider records No. 3 and 4 in Table 1 with the different author representations "Sudipto Guha, Nick Koudas, Amit Marathe, Divesh Srivastava" and "S. Guha, et al.", respectively. Under the cosine similarity, which is based on the dot product of two record vectors, there is only one matching token, "Guha", and the similarity value is low. Even worse, there is no matching token at all between the different representations of the same conference, "Very Large Data Bases" and "VLDB", so the cosine similarity between these two representations is 0. As a consequence, the cosine similarity of records No. 3 and 4 is low, although they actually describe the same citation entity. Cohen et al. [5] conclude that the full-text retrieval techniques, tf∗idf and cosine similarity, do not show the best performance when applied directly to text records in databases.
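The failure mode in Example 1 is easy to reproduce. The sketch below (not the paper's implementation) computes a binary-vector cosine similarity over word tokens; since tf is almost always 1 in short records, this is close to the tf∗idf cosine discussed above:

```python
import math

def tokens(s):
    """Lowercase word tokens of a record, punctuation stripped."""
    return [w.strip(".,") for w in s.lower().split()]

def cosine(r1, r2):
    """Binary-vector cosine over word tokens: the dot product of two
    0/1 record vectors, normalized by their lengths."""
    t1, t2 = set(tokens(r1)), set(tokens(r2))
    if not t1 or not t2:
        return 0.0
    return len(t1 & t2) / math.sqrt(len(t1) * len(t2))

# No shared token at all between the two venue representations:
print(cosine("Very Large Data Bases", "VLDB"))  # 0.0

# Only "guha" matches between the two author representations:
print(cosine("Sudipto Guha, Nick Koudas, Amit Marathe, Divesh Srivastava",
             "S. Guha, et al."))  # about 0.18
```

Any token-matching measure of this kind scores the two venue strings at exactly zero, which motivates looking beyond matched tokens.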
Motivated by the unsuitability of string matching and full-text retrieval techniques for measuring the similarity of text attribute records, in this paper we focus on developing similarity metrics based on token correlations, and we perform similarity evaluation over records directly, without data cleaning. Rather than only matching the tokens of records, our approach considers the correlations between tokens, which helps to discover more relationships between short text records with limited information. The correlations between tokens are estimated from the probability that tokens appear in the same records. These token correlations are then utilized in two aspects, i.e., intra-correlation and inter-correlation. Intra-correlations capture the correlations among tokens within a record and are used in token weighting: rather than simply assigning equal term frequencies to tokens, we derive the discriminative importance of each token from the degree of its correlation with the other tokens in the record. Inter-correlations capture the correlations of tokens across two records, which can reveal relationships between records beyond their matched tokens. Based on these token correlations, we can evaluate the similarity of text records in more diverse formats, for example with abbreviations and missing data.
Our contributions in this paper are summarized as follows:
- We develop a dictionary to capture the probabilistic correlations of tokens, and represent text records with consideration of both token frequencies and correlations. Highly correlated tokens are merged as phrase tokens to reduce the dictionary size.
- We propose a probabilistic correlation-based feature weighting scheme, namely correlation weight, by considering the intra-correlation of tokens in a record. Instead of term frequency, which equals 1 in most records and thus has no discriminative ability, the intra-correlation serves as the local feature of tokens in a record.
- We design a probabilistic correlation-based similarity function, called correlation similarity, by utilizing the inter-correlation of tokens in two records. In particular, we prove that the existing cosine similarity can be interpreted as a special case of the proposed correlation similarity.
- We extend the existing semantic-based word similarity in WordNet to a semantic-based record similarity, named semantic-based similarity (sbs). In particular, we combine the sbs method with our correlation similarity and propose the semantic-based correlation similarity (scor).
- We report an extensive experimental evaluation that demonstrates the superiority of the proposed approach, as well as our semantic-based extension, over existing measures.
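To illustrate the phrase-token idea in the first contribution, highly correlated adjacent tokens can be merged into a single phrase token. The sketch below uses a hypothetical greedy threshold rule; the paper's dictionary construction may differ in detail:

```python
def merge_phrases(record_tokens, corr, threshold=0.9):
    """Greedily merge adjacent token pairs whose correlation exceeds
    `threshold` into one phrase token. `corr` maps an ordered token
    pair to its correlation value (an illustrative sketch)."""
    merged, i = [], 0
    while i < len(record_tokens):
        if (i + 1 < len(record_tokens)
                and corr.get((record_tokens[i], record_tokens[i + 1]), 0.0)
                    >= threshold):
            merged.append(record_tokens[i] + " " + record_tokens[i + 1])
            i += 2
        else:
            merged.append(record_tokens[i])
            i += 1
    return merged

corr = {("very", "large"): 0.95, ("data", "bases"): 0.92}
print(merge_phrases(["very", "large", "data", "bases"], corr))
# ['very large', 'data bases']
```

Merging in this way shrinks the token dictionary and lets multi-word names behave as single features.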
The rest of this paper is organized as follows. We illustrate the probabilistic correlation of tokens in Section 2. Section 3 presents our probabilistic correlation-based weighting scheme and also the probabilistic correlation-based similarity function. In Section 4, we discuss the effectiveness of our approach from a methodological perspective. The extension on semantic-based similarity is introduced in Section 5. Section 6 demonstrates the performance of our approach through an experimental evaluation. In Section 7, we discuss some related work. Finally, we conclude this paper in Section 8. An early extended abstract of this paper is reported in [26].
Probabilistic correlation
The cosine similarity measure [4] makes an assumption that tokens in records are independent of each other, and the correlations between tokens are ignored. In practice, however, token correlations do exist, for example, the token “International” has a high probability of appearing together with the token “Conference” in citation records. In this section, we develop a model of correlations between tokens by considering the conditional probability of token co-occurrence.
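One simple way to estimate such conditional probabilities from record co-occurrence is sketched below. The estimator P(t1 | t2) = (#records containing both t1 and t2) / (#records containing t2) is an assumption standing in for the paper's exact definition:

```python
from collections import Counter
from itertools import permutations

def build_correlations(records):
    """Estimate P(t1 | t2) over a list of tokenized records as the
    fraction of records containing t2 that also contain t1."""
    df = Counter()  # document frequency of each token
    co = Counter()  # co-occurrence counts of ordered token pairs
    for rec in records:
        toks = set(rec)
        df.update(toks)
        co.update(permutations(toks, 2))
    return {(t1, t2): co[(t1, t2)] / df[t2] for (t1, t2) in co}

records = [
    ["international", "conference", "data", "engineering"],
    ["international", "conference", "management", "data"],
    ["journal", "data", "engineering"],
]
corr = build_correlations(records)
print(corr[("conference", "international")])  # 1.0
print(corr[("international", "data")])        # 2/3
```

In this toy corpus, "international" always appears together with "conference", so P(conference | international) = 1, matching the intuition in the text.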
Record similarity measure
In this section, we illustrate our probabilistic correlation-based record similarity measure. Firstly, we discuss the weighting scheme of tokens in the records. The correlation between tokens is used in the weighting of tokens. Then we introduce our correlation similarity function which is also based on the correlation between tokens.
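To make the two ingredients concrete, the sketch below implements an illustrative correlation weight (one plus the average intra-record correlation) and a correlation similarity that also credits correlated, non-matching token pairs. Both formulas are stand-ins, not the paper's exact definitions; with an empty correlation table and duplicate-free records, the similarity reduces to the binary-token cosine:

```python
import math

def correlation_weight(toks, corr):
    """Illustrative intra-correlation weight: 1 plus the average
    correlation of a token with the other tokens of its record."""
    w = {}
    for t in toks:
        others = [max(corr.get((t, u), 0.0), corr.get((u, t), 0.0))
                  for u in toks if u != t]
        w[t] = 1.0 + (sum(others) / len(others) if others else 0.0)
    return w

def correlation_similarity(r1, r2, corr):
    """Correlated (not only identical) token pairs contribute to the
    score, which is normalized by the two self-similarities."""
    def pair(a, b):
        return 1.0 if a == b else max(corr.get((a, b), 0.0),
                                      corr.get((b, a), 0.0))
    def score(a, wa, b, wb):
        return sum(wa[x] * wb[y] * pair(x, y) for x in a for y in b)
    w1, w2 = correlation_weight(r1, corr), correlation_weight(r2, corr)
    return score(r1, w1, r2, w2) / math.sqrt(
        score(r1, w1, r1, w1) * score(r2, w2, r2, w2))

# Hypothetical correlations linking the abbreviation to the full name.
corr = {("vldb", "very"): 0.4, ("vldb", "large"): 0.4,
        ("vldb", "data"): 0.4, ("vldb", "bases"): 0.4}
full = ["very", "large", "data", "bases"]
print(correlation_similarity(full, ["vldb"], corr))  # ≈ 0.8; cosine gives 0
```

Unlike cosine similarity, the two venue representations from Example 1 now receive a nonzero score through their correlated token pairs.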
Methodology analysis
In this section, we analyze the effectiveness of our approach from a methodological perspective, especially in dealing with various information formats of unstructured text records, such as abbreviation and incomplete information.
Since we study the similarity of unstructured text records with short length and limited information, our correlation similarity function relaxes the constraint of token matching in the cosine similarity function, by considering the further inter-correlations of tokens
Integrating with semantic-based similarity from external sources
Our proposed correlation technique can successfully obtain a part of the relationships among tokens that appear together, without any external sources. Nevertheless, when external knowledge bases are available, the ontology-based or semantic-based approach can retrieve more token relationships. In this section, we present a novel approach to incorporate our correlation-based similarity with the semantic-based similarity.
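A minimal sketch of such a combination is given below, with a hypothetical linear blend `scor` over two word-pair similarity tables; the `sem` table stands in for a WordNet-derived word similarity, and the paper's actual combination may differ in form:

```python
def token_overlap(r1, r2, table):
    """Average, over r1's tokens, of the best score against any token
    of r2 (1.0 for an exact match). Note this toy score is asymmetric."""
    def pair(a, b):
        return 1.0 if a == b else max(table.get((a, b), 0.0),
                                      table.get((b, a), 0.0))
    return sum(max(pair(a, b) for b in r2) for a in r1) / len(r1)

def scor(r1, r2, corr, sem, alpha=0.5):
    """Hypothetical linear blend of the correlation-based signal and
    the semantic signal, controlled by alpha."""
    return (alpha * token_overlap(r1, r2, corr)
            + (1 - alpha) * token_overlap(r1, r2, sem))

corr = {("vldb", "bases"): 0.6}  # learned from record co-occurrence
sem = {("large", "big"): 0.9}    # e.g. a WordNet-derived synonym score
print(scor(["vldb", "large"], ["bases", "big"], corr, sem))  # ≈ 0.375
```

The blend lets the co-occurrence signal and the external semantic signal compensate for each other: here neither table alone covers both token pairs, but their combination credits both.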
Experimental evaluation
In this section, we report our experimental results. Section 6.1 evaluates the performance of our probabilistic correlation-based techniques under various settings, and Section 6.2 presents the comparison with existing approaches.
Dataset. We employ two datasets in our experiments, Cora and Restaurant.1 The first dataset Cora, prepared by McCallum et al. [19], consists of 1295 citation records of 122 research papers. The average length of records in
Related work
In this paper, we concentrate on similarity measures for unstructured text records in databases and address several issues specific to this application: the records are always short, and the information in such short records is limited.
One solution for addressing the problem is to conduct data cleaning and formatting first, by segmenting unstructured text records into structured entities with certain attributes [3] or recovering missing values [30]. Then, the similarity evaluation
Conclusions
In this paper, we propose a novel similarity measure for text records based on the probabilistic correlation of tokens. We define the probabilistic correlation between two word tokens as the probability that these tokens appear in the same records. Then we merge words with high correlations into phrase and extend the correlation between phrase tokens. A feature weighting scheme is performed based on the intra-correlation of tokens in a record. Furthermore, we develop a correlation-based
Acknowledgment
This work is supported in part by China NSFC Grant Nos. 61202008, 61232018 and 61370055, Hong Kong RGC/NSFC Project No. N HKUST637/13, National Grand Fundamental Research 973 Program of China Grant No. 2012-CB316200, a Microsoft Research Asia Gift Grant, and a Google Faculty Award 2013.
References (31)
- E. Agichtein, S. Sarawagi, Scalable information extraction and integration, in: Tutorial of KDD’06, ...
- M. Bilenko, R.J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: KDD’03, 2003, pp. ...
- V. Borkar, K. Deshmukh, S. Sarawagi, Automatic segmentation of text into structured records, in: SIGMOD’01, 2001, pp. ...
- W.W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, ...
- W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IJCAI’03, ...
- W.W. Cohen, S. Sarawagi, Exploiting dictionaries in named entity extraction: combining semi-Markov extraction, ...
- S. Deerwester et al., Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. (1990)
- A.K. Elmagarmid et al., Duplicate record detection: a survey, IEEE Trans. Knowl. Data Eng. (2007)
- L. Gravano, P.G. Ipeirotis, N. Koudas, D. Srivastava, Text joins in an RDBMS for web data integration, in: WWW’03, ...
- S. Guha, N. Koudas, A. Marathe, D. Srivastava, Merging the results of approximate match operations, in: VLDB’04, 2004, ...
- G.R. Hjaltason, H. Samet, Index-driven similarity search in metric spaces (survey article), ACM Trans. Database Syst.