Elsevier

Information Sciences

Volume 289, 24 December 2014, Pages 8-24

Probabilistic correlation-based similarity measure on text records

https://doi.org/10.1016/j.ins.2014.08.007

Abstract

Large-scale unstructured text records, such as scientific citation records or news highlights, are stored in text attributes in databases and information systems. Approximate string matching techniques for full text retrieval, e.g., edit distance and cosine similarity, can be adopted to evaluate the similarity of unstructured text records. However, these techniques do not show the best performance when applied directly, owing to the differences between unstructured text records and full text. In particular, the information in short text records is limited, and varied information formats, such as abbreviations and missing data, greatly affect record similarity evaluation.

In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching the tokens of two records, our similarity evaluation enriches the information of records by considering correlations of tokens. The probabilistic correlation between two tokens is defined as the probability that they appear together in the same records. We then compute token weights and discover correlations of records based on these probabilistic token correlations. Extensive experimental results demonstrate the effectiveness of our proposed approach.

Introduction

Unstructured text records are prevalent in databases and information systems, such as personal information management systems (PIM) and scientific literature digital libraries (e.g., CiteSeer). Various applications, such as similarity search [12], duplicate record detection [8], and information integration [1], rely on the similarity evaluation among these unstructured records of text values. Table 1 shows an example of an unstructured record database that stores several citation records as text attributes. Owing to varied information formats such as abbreviations and missing data, it is not easy to evaluate the similarity of unstructured records in the real world.

Since unstructured text records are text strings of short length (as shown in Table 1), we can apply approximate string matching techniques such as edit distance [21] to measure their similarity. However, these character-based matching approaches can only capture limited similarity and fail in many cases, such as varied word orders and incomplete information. Instead of character-based string matching, we can also treat each unstructured record as a text document and apply full text retrieval techniques to measure record similarity. Specifically, records are represented by a set of weighted token features, and similarity is computed on these features. Cohen [4] proposes a word-token-based cosine similarity with tfidf, which can detect the similarity of records with varied word orders and missing data. Gravano et al. [9] propose a more effective approach that uses q-grams as tokens, which can handle spelling errors in records.
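
To make the two tokenization schemes concrete, the following short Python sketch contrasts word tokens with q-gram tokens; the function names are illustrative and are not taken from [4] or [9].

    def word_tokens(record):
        # Word tokens as used in the tfidf cosine similarity of Cohen [4].
        return record.lower().split()

    def q_grams(record, q=3):
        # Overlapping q-grams as in Gravano et al. [9]; they tolerate
        # spelling errors because a single typo perturbs only q grams.
        s = record.lower()
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    print(word_tokens("Very Large Data Bases"))   # ['very', 'large', 'data', 'bases']
    print(q_grams("Bases"))                       # ['bas', 'ase', 'ses']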

Unfortunately, the characteristics of unstructured text records differ from those of strings in full texts. First, due to the short length of text records, most words appear only once in a record; that is, the term frequency (tf) is 1 in most cases for such short text records in databases. We show the statistics of term frequency in Table 2: more than 90% of tokens, even q-gram tokens, appear only once in a record. Therefore, only the inverse document frequency (idf) [27] takes effect in the tfidf [24] weighting scheme, and no local features of each record are considered. Moreover, cosine similarity, the popular matching similarity measure used for full text, is based on the assumption that tokens are independent of each other, so the correlations between tokens are ignored. Due to the varied information representation formats of unstructured text records, such as abbreviations and missing data, latent correlations of records can hardly be detected by considering only the matching of tokens.
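
The degeneration of tfidf on short records can be checked directly. The sketch below, written against illustrative records modeled on Table 1 (not the actual dataset), counts how often the term frequency of a token within a record exceeds 1 and shows that the tfidf weight then reduces to idf.

    import math
    from collections import Counter

    # Illustrative records modeled on Table 1 (not the actual Cora data).
    records = [
        "S. Guha, N. Koudas, A. Marathe, D. Srivastava. Merging the results of approximate match operations. VLDB 2004",
        "S. Guha, et al. Merging the results of approximate match operations. Very Large Data Bases, 2004",
    ]
    corpus = [r.lower().split() for r in records]

    # Fraction of tokens whose term frequency within their record is exactly 1.
    tfs = [tf for toks in corpus for tf in Counter(toks).values()]
    print(sum(tf == 1 for tf in tfs) / len(tfs))

    # With tf = 1 almost everywhere, the tfidf weight of a token is just its idf.
    def tfidf(token, toks, corpus):
        tf = Counter(toks)[token]
        df = sum(token in t for t in corpus)
        return tf * math.log(len(corpus) / df)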

Example 1

Consider records No. 3 and 4 in Table 1 with the different author representations "Sudipto Guha, Nick Koudas, Amit Marathe, Divesh Srivastava" and "S. Guha, et al." respectively. Using the cosine similarity, which is based on the dot product of two record vectors, we have only one matching token, "Guha", and the similarity value is low. Even worse, there is no matching token at all between the different representations of the same conference, "Very Large Data Bases" and "VLDB", and the cosine similarity between these two representations is 0. As a consequence, the cosine similarity of records No. 3 and 4, which actually describe the same citation entity, is low. Cohen et al. [5] conclude that the full text retrieval techniques, tfidf and cosine similarity, do not show the best performance when applied directly to text records in databases.
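
A minimal sketch of the behaviour described in Example 1: with a plain word-token cosine similarity, the two author representations share only the token "Guha", and "Very Large Data Bases" versus "VLDB" shares nothing, so the similarity collapses to 0. The cosine function below is a standard unweighted variant, shown only for illustration.

    import math
    from collections import Counter

    def cosine(a, b):
        # Unweighted word-token cosine similarity: dot product of the
        # term-frequency vectors divided by their norms.
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[t] * vb[t] for t in va)
        na = math.sqrt(sum(v * v for v in va.values()))
        nb = math.sqrt(sum(v * v for v in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    print(cosine("Very Large Data Bases", "VLDB"))                 # 0.0: no matching token
    print(cosine("Sudipto Guha Nick Koudas Amit Marathe Divesh Srivastava",
                 "S Guha et al"))                                  # ~0.18: only "guha" matches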

Motivated by the unsuitability of string matching and full text retrieval techniques for measuring similarity between text attribute records, in this paper we focus on developing similarity metrics based on the correlation of tokens, and perform the similarity evaluation over records directly, without data cleaning. Rather than merely matching tokens of records, our approach considers correlations between tokens, which helps to discover more correlations among short text records with limited information. The correlation between tokens is estimated from the probability that the tokens appear in the same records. These token correlations are then utilized in two aspects, i.e., intra-correlation and inter-correlation. The intra-correlation considers the correlations of tokens within a record and is utilized in token weighting: rather than assigning equal term frequencies to tokens, we derive the discriminative importance of each token from the degree of its correlation with the other tokens in the record. The inter-correlation represents the correlations of tokens between two records, which can discover further correlations of records beyond matched tokens. Based on these token correlations, we can evaluate the similarity of text records with more diverse formats, for example with abbreviations and missing data.

Our contributions in this paper are summarized as follows:

  • We develop a dictionary to capture the probabilistic correlations of tokens, and represent text records with consideration of both token frequencies and correlations. Highly correlated tokens are merged into phrase tokens to reduce the dictionary size.

  • We propose a probabilistic correlation-based feature weighting scheme, namely correlation weight, by considering the intra-correlation of tokens in a record. Instead of the term frequency, which equals 1 in most records and thus has no discriminative ability, the intra-correlation serves as the local feature of tokens in a record.

  • We design a probabilistic correlation-based similarity function, called correlation similarity, by utilizing the inter-correlation of tokens in two records. In particular, we prove that the existing cosine similarity can be interpreted as a special case of the proposed correlation similarity.

  • We extend the existing semantic-based word similarity in WordNet to a semantic-based record similarity, named semantic-based similarity (sbs). In particular, we combine the sbs method with our correlation similarity and propose the semantic-based correlation similarity (scor).

  • We report an extensive experimental evaluation, demonstrating the superiority of the proposed approach compared with existing measures as well as with our semantic-based measure.

The rest of this paper is organized as follows. We illustrate the probabilistic correlation of tokens in Section 2. Section 3 presents our probabilistic correlation-based weighting scheme and also the probabilistic correlation-based similarity function. In Section 4, we discuss the effectiveness of our approach from a methodological perspective. The extension on semantic-based similarity is introduced in Section 5. Section 6 demonstrates the performance of our approach through an experimental evaluation. In Section 7, we discuss some related work. Finally, we conclude this paper in Section 8. An early extended abstract of this paper is reported in [26].

Section snippets

Probabilistic correlation

The cosine similarity measure [4] assumes that tokens in records are independent of each other, so the correlations between tokens are ignored. In practice, however, token correlations do exist; for example, the token “International” has a high probability of appearing together with the token “Conference” in citation records. In this section, we develop a model of correlations between tokens by considering the conditional probability of token co-occurrence.
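
As a rough illustration of this idea, the sketch below estimates a conditional co-occurrence probability between tokens by counting records. The precise definition of the probabilistic correlation is given in this section of the paper; the estimator here is an assumption for illustration only.

    from collections import defaultdict

    def cooccurrence_correlation(corpus):
        # corpus: list of tokenized records (lists of tokens).
        record_count = defaultdict(int)   # number of records containing token t
        pair_count = defaultdict(int)     # number of records containing both t and u
        for toks in corpus:
            uniq = set(toks)
            for t in uniq:
                record_count[t] += 1
                for u in uniq:
                    if t != u:
                        pair_count[(t, u)] += 1
        def corr(t, u):
            # Assumed estimator: P(t occurs | u occurs), by record counting.
            return pair_count[(t, u)] / record_count[u] if record_count[u] else 0.0
        return corr

    corpus = [r.lower().split() for r in [
        "international conference on management of data",
        "international conference on data engineering",
        "very large data bases",
    ]]
    corr = cooccurrence_correlation(corpus)
    print(corr("international", "conference"))   # 1.0: the tokens always co-occur
    print(corr("conference", "data"))            # 2/3: "data" also appears without it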

Record similarity measure

In this section, we illustrate our probabilistic correlation-based record similarity measure. First, we discuss the weighting scheme of tokens in the records, where the correlation between tokens is used to weight the tokens. Then we introduce our correlation similarity function, which is also based on the correlation between tokens.
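
The section goes on to define both ingredients precisely; the hedged sketch below only illustrates their shape under simple assumptions. It reuses the corr(t, u) estimator from the previous sketch, assumes that a token's intra-correlation weight grows with its correlation to the other tokens of the same record, and assumes an inter-correlation similarity in which unmatched token pairs contribute according to corr(t, u). Neither formula is the paper's exact definition.

    import math

    def correlation_weight(toks, corr):
        # Assumed intra-correlation weight: 1 plus the total correlation of a
        # token with the other tokens of the same record (replacing tf, which
        # is 1 almost everywhere in short records).
        return {t: 1.0 + sum(corr(t, u) for u in toks if u != t) for t in set(toks)}

    def correlation_similarity(a, b, corr):
        # Assumed inter-correlation similarity: a cosine-like form in which a
        # pair of different tokens still contributes corr(t, u).
        wa, wb = correlation_weight(a, corr), correlation_weight(b, corr)
        pair = lambda t, u: 1.0 if t == u else corr(t, u)
        dot = sum(wa[t] * wb[u] * pair(t, u) for t in wa for u in wb)
        na = math.sqrt(sum(v * v for v in wa.values()))
        nb = math.sqrt(sum(v * v for v in wb.values()))
        return dot / (na * nb) if na and nb else 0.0

    # If corr(t, u) is 0 whenever t != u, the weights reduce to 1 and only
    # matched tokens survive in the dot product, i.e. the ordinary set-based
    # cosine similarity is recovered as a special case.

Under this sketch, records No. 3 and 4 from Example 1 would gain similarity through unmatched but correlated token pairs, in line with the intuition described above.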

Methodology analysis

In this section, we analyze the effectiveness of our approach from a methodological perspective, especially in dealing with varied information formats of unstructured text records, such as abbreviations and incomplete information.

Since we study the similarity of unstructured text records with short length and limited information, our correlation similarity function relaxes the constraint of token matching in the cosine similarity function, by considering the further inter-correlations of tokens

Integrating with semantic-based similarity from external sources

Our proposed correlation technique can successfully capture part of the relationships among tokens that appear together, without any external sources. Nevertheless, when external knowledge bases are available, ontology-based or semantic-based approaches can retrieve more token relationships. In this section, we present a novel approach to combine our correlation-based similarity with semantic-based similarity.
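
For the word-level building block, a WordNet-based similarity can be obtained with standard tooling. The sketch below uses NLTK's path similarity over the best-matching synset pair as a stand-in for the word similarity that sbs builds on; this is an assumption for illustration rather than the paper's exact measure.

    from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus

    def wordnet_word_similarity(w1, w2):
        # Best path similarity over all synset pairs of the two words;
        # returns 0.0 when either word is missing from WordNet.
        scores = [s1.path_similarity(s2)
                  for s1 in wn.synsets(w1)
                  for s2 in wn.synsets(w2)]
        scores = [s for s in scores if s is not None]
        return max(scores, default=0.0)

    print(wordnet_word_similarity("conference", "proceedings"))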

Experimental evaluation

In this section, we report the experimental results. Section 6.1 evaluates the performance of our probabilistic correlation-based techniques under various settings, and Section 6.2 presents the comparison with existing approaches.

Datasets. We employ two datasets in our experiments, Cora and Restaurant. The first dataset, Cora, prepared by McCallum et al. [19], consists of 1295 citation records of 122 research papers. The average length of records in

Related work

In this paper, we concentrate on the similarity measure for unstructured text records in databases and address several issues in this application, i.e., the length of each record is short and the information in such a short record is limited.

One solution for addressing the problem is to conduct data cleaning and formatting first, by segmenting unstructured text records into structured entities with certain attributes [3] or recovering missing values [30]. Then, the similarity evaluation

Conclusions

In this paper, we propose a novel similarity measure for text records based on the probabilistic correlation of tokens. We define the probabilistic correlation between two word tokens as the probability that they appear in the same records. We then merge words with high correlations into phrases and extend the correlation to phrase tokens. A feature weighting scheme is built on the intra-correlation of tokens in a record. Furthermore, we develop a correlation-based

Acknowledgment

This work is supported in part by China NSFC Grant No. 61202008, 61232018, 61370055, Hong Kong RGC/NSFC Project No. N HKUST637/13, National Grand Fundamental Research 973 Program of China Grant No. 2012-CB316200, Microsoft Research Asia Gift Grant, Google Faculty Award 2013.

References (31)

  • E. Agichtein, S. Sarawagi, Scalable information extraction and integration, in: Tutorial of KDD’06,...
  • M. Bilenko, R.J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: KDD’03, 2003, pp....
  • V. Borkar, K. Deshmukh, S. Sarawagi, Automatic segmentation of text into structured records, in: SIGMOD’01, 2001, pp....
  • W.W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity,...
  • W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IJCAI’03...
  • W.W. Cohen, S. Sarawagi, Exploiting dictionaries in named entity extraction: combining semi-markov extraction...
  • S.C. Deerwester et al., Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. (1990)
  • A.K. Elmagarmid et al., Duplicate record detection: a survey, IEEE Trans. Knowl. Data Eng. (2007)
  • L. Gravano, P.G. Ipeirotis, N. Koudas, D. Srivastava, Text joins in an rdbms for web data integration, in: WWW’03,...
  • S. Guha, N. Koudas, A. Marathe, D. Srivastava, Merging the results of approximate match operations, in: VLDB’04, 2004,...
  • R. Gupta, S. Sarawagi, Creating probabilistic databases from information extraction models, in: VLDB’06, 2006, pp....
  • G.R. Hjaltason et al., Index-driven similarity search in metric spaces (survey article), ACM Trans. Database Syst. (2003)
  • T. Hofmann, Probabilistic latent semantic analysis, in: Proceedings of the 15th Annual Conference on Uncertainty in...
  • T. Hofmann, Probabilistic latent semantic indexing, in: SIGIR’99, 1999, pp....
  • R. Jin, J.Y. Chai, L. Si, Learn to weight terms in information retrieval using category information, in: ICML’05, 2005,...

A preliminary, extended abstract of this paper appears in [26].
