Documents similarity measurement using field association terms

https://doi.org/10.1016/S0306-4573(03)00019-0

Abstract

Conventional approaches to text analysis and information retrieval, which measure document similarity by considering all of the information in texts, are relatively inefficient for processing large text collections in heterogeneous subject areas. This paper outlines a new text manipulation system, FA-Sim, that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. FA-Sim is based on flexible text matching procedures carried out in various contexts and at various field ranks. FA-Sim measures text similarity by using specific field association (FA) terms instead of comparing all text information. Similarity measurement between texts is faster, and the similarity obtained is higher, with FA-Sim than with the two other analysis methods compared. As a result, Recall and Precision improved significantly, by 39% and 37% respectively, over these two traditional methods.

Introduction

There are many document retrieval systems and there have been many advances in areas such as keyword retrieval, similar file retrieval, automatic document classification and document summarization (Dillon & Gray, 1983; Fagan, 1989; Fuhr, 1989; Griffiths, Robinson, & Willet, 1984; Spark, 1971).

To use linear text words in information retrieval, it is important to recognize text portions that are semantically homogeneous (i.e. texts that are sufficiently similar to establish a clear relation). Identifying semantically homogeneous text portions allows the generation of text links that relate texts and transform them into structural text representations, providing selected text and creating link paths. Identifying semantically related text portions also allows efficient search and retrieval of relevant texts. To find semantic homogeneity in unrestricted text environments, the texts themselves are the principal basis for text analysis. Due to the syntactic and semantic ambiguities inherent in natural language texts (Griffiths et al., 1984; Jin & Tackkim, 1995), comparing words (rather than extensive text) from different texts is usually adequate.

Much recent retrieval system evaluation is based mainly on user criteria (Croft, 1986) such as type of output presentation, amount of user effort needed for search and level of coverage of the target collection. The most important measures are:

  • (1) ability of the system to retrieve wanted information,

  • (2) ability of the system to reject unwanted information.


Several evaluation studies use a test methodology based mainly on Recall and Precision values that apply to a set of test similarities (Salton & Lesk, 1968; Salton & McGill, 1983). Recall is the proportion of relevant material actually retrieved; Precision is the proportion of retrieved material which is relevant. Generating Recall and Precision requires:

  • (1) differentiation between similar retrieved documents and similar documents that are not retrieved,

  • (2) differentiation between similar documents that are relevant to input and similar documents that are not relevant to input.
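
The two definitions above can be illustrated with a minimal sketch. It assumes that the retrieved documents and the relevant documents are available as sets of identifiers; the function name `recall_precision` and the document ids are hypothetical and do not come from the paper.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one input.

    retrieved: ids of documents the system returned as similar
    relevant:  ids of documents judged relevant to the input
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved
    # Guard against empty sets to avoid division by zero.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical document ids, for illustration only.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
recall, precision = recall_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant, so Recall is about 0.67 and Precision is 0.50.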


The techniques used for document classification and similar file retrieval are based on Vector-Space models (Salton & McGill, 1983) and Probabilistic models (Fuhr, 1989). These methods make it possible to retrieve and classify texts according to arbitrary databases without referring to systematically classified information.

Readers generally identify the subject of a text (document field) when they notice specific terms (field association (FA) terms) in that text. These specific FA terms are single or compound terms occurring in document fields (Atlam, Morita, Fuketa, & Aoe, 2002; Fuketa, Lee, Tsuji, Okada, & Aoe, 2000; Tom, 1997; Tsuji, Nigazawa, Okada, & Aoe, 1999; Yang & Pederson, 1997). For example, the document fields <Politics> or <Baseball> can usually be recognized as the subjects of texts when the reader finds “election” or “catcher”. FA terms are useful in measuring similarity between document fields without comparing all information in those documents.
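
The idea of recognizing a document field from FA terms can be sketched as a simple lookup. This is only an illustration under stated assumptions: the table `FA_TERMS`, the function `guess_fields` and the whitespace tokenization are hypothetical, and the sketch ignores compound terms and the FA term levels that the paper discusses in Section 2.

```python
from collections import Counter

# Hypothetical FA-term table: each term points to the document field it suggests.
FA_TERMS = {
    "election": "Politics",
    "catcher": "Baseball",
}


def guess_fields(text, fa_terms=FA_TERMS):
    """Count how many FA-term occurrences in text point to each document field."""
    counts = Counter()
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        field = fa_terms.get(word)
        if field is not None:
            counts[field] += 1
    return counts


fields = guess_fields("The catcher dropped the ball before the election.")
print(fields.most_common())
```

Only the two FA terms contribute to the counts; every other word in the sentence is ignored, which is the sense in which FA terms avoid comparing all of the information in a text.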

The traditional analysis methods Word Form and Word Stem use text words to measure similarity between document fields (Salton, 1989). This paper introduces a new method, field association similarity (FA-Sim), which uses FA terms in document fields to measure document similarity.

Section 2 of this paper identifies simple FA terms, provides document field trees and discusses how FA terms can be found electronically by an algorithm that determines FA term levels. Section 3 introduces a procedure for indexing FA terms automatically and shows how to calculate term weights. Section 4 introduces pairwise document similarities and identifies similarity measurement by combining a Vector-Space model with FA-Sim. Section 5 describes the FA-Sim algorithm and compares the FA-Sim method with the Word Form and Word Stem methods, using Recall and Precision to measure average similarity. Section 5 also verifies FA-Sim using 38,000 articles on various topics from a data set of 20 Newsgroups from CNN Web Site (1995–2001) and a large English Penn-Treebank Corpus (Treebank Project Release 2, 1995). Section 6 concludes with an indication of possible future work.

Section snippets

Document field tree

A document field tree structure ranks relationships between document fields (Aoe, Morita, & Mochizuki, 1989; Breiman, Friedman, Olshen, & Stone, 1984; Safavian & Landgrebe, 1991). The field tree in Fig. 1, based on Imidas’99 (Dozawa, 1999), contains 14 super-fields, 443 sub-fields and 393 terminal fields. Root names are omitted unless there is conflict between super-fields and sub-fields. In such cases, only terminal fields are described and FA terms and paths are manually assigned. For

Text analysis approach

There are two common methods of representing natural-language texts (Blair & Maron, 1984):

  • 1. A shallow keyword representation method may be used, where individual terms or keywords are assigned to information items and used to represent document content. In keyword representation systems, retrieval decisions are based on comparisons of keywords, and retrieved items contain keyword patterns corresponding to information requests.

  • 2. A deep knowledge representation method may be used where a formal knowledge

Pairwise document similarities

The type of term vectors shown in expression (1) represents the computation of a similarity measure between document items $D_i$ and $D_j$. Assuming that the FA term weight $FAw_{ik}$ of FA term $FAT_k$ in $D_i$ is given by the term frequency $FAtf_{ik}$, or by combining term frequency and inverse document frequency ($FAtf_{ik} \times FAidf_{ik}$), the pairwise document similarity FA-Sim($D_i$, $D_j$) may be computed:

$$\text{FA-Sim}(D_i, D_j) = \sum_{k=1}^{t} FAw_{ik} \cdot FAw_{jk}$$

When document similarity is based on measurement of expression (2), document pairs with highly weighted
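
The inner-product form of FA-Sim can be sketched in a few lines. The sketch assumes that documents are already tokenized, that the list of FA terms is known, and that the sum runs over those FA terms. The helper names `fa_weights` and `fa_sim`, the sample terms and the optional `idf` mapping are hypothetical; the paper's own weighting and indexing procedure (Section 3) is not reproduced here.

```python
from collections import Counter


def fa_weights(tokens, fa_terms, idf=None):
    """Weight vector over fa_terms: tf alone, or tf * idf when idf is supplied.

    idf is an optional, caller-supplied mapping from FA term to an inverse
    document frequency value (an assumption; the snippet gives no idf formula).
    """
    tf = Counter(t for t in tokens if t in fa_terms)
    if idf is None:
        return {t: float(tf[t]) for t in fa_terms}
    return {t: tf[t] * idf.get(t, 0.0) for t in fa_terms}


def fa_sim(w_i, w_j):
    """Inner product of two FA term weight vectors."""
    return sum(w_i[t] * w_j.get(t, 0.0) for t in w_i)


# Hypothetical FA terms and toy documents, for illustration only.
fa_terms = ["election", "vote", "catcher"]
d_i = "election vote vote campaign".split()
d_j = "vote election rally".split()
d_k = "catcher pitcher inning".split()

w_i, w_j, w_k = (fa_weights(d, fa_terms) for d in (d_i, d_j, d_k))
print(fa_sim(w_i, w_j))  # the two documents share FA terms
print(fa_sim(w_i, w_k))  # no FA terms in common
```

Words that are not FA terms ("campaign", "rally", "pitcher") never enter the vectors, so the comparison cost depends on the number of FA terms rather than on the full vocabulary of the documents.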

Method comparison

Two widely used criteria for evaluating information retrieval systems are Recall and Precision. Determining Recall and Precision depends on identifying documents which are similar and relevant and identifying documents which are similar and irrelevant. The present study determines variations in relevance assessments based on average Recall and Precision values. Retrieval effectiveness is measured:

$$\text{Recall} = \frac{\text{Number of relevant similar texts retrieved}}{\text{Total number of relevant texts in collection}}$$

Precision = Number of

Evaluation of results

To verify the efficiency of the FA-Sim method, about 38,000 articles from a data set of 20 Newsgroups from CNN Web Site (1996–2001) were selected. They covered various topics related to sports, computers, politics, economics, etc. The method is also applied to the large Penn-Treebank English Corpus (Treebank Project Release 2 (1995)).

The effectiveness of the document retrieval system is evaluated using pairwise document comparison with three processing methods. Comparison of Recall and Precision

Conclusion

The traditional analysis methods Word Form and Word Stem use text words to measure similarity between document fields. This paper introduces a new method (FA-Sim) which uses specific FA terms in document fields to measure document similarity. With this new method, measurement becomes easier and more effective. Thus, in terms of Recall and Precision, the new method, FA-Sim using these specific FA terms, is superior to some other common methods which measured document similarity by using all

Acknowledgements

The authors gratefully acknowledge support from The Japan Science Society (The Nippon Foundation).

References (33)

  • W.B. Croft et al. Using probabilistic models of information retrieval without relevance information. Journal of Documentation (1976)
  • M. Dillon et al. Fully automatic syntactically based indexing systems. Journal of the American Society for Information Science (ASIS) (1983)
  • Dozawa, T. (1999). Innovative multi information dictionary Imidas’99. Annual Series, Zueisha Publication Co., Japan (in...
  • J. Fagan. The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science (ASIS) (1989)
  • A. Griffiths et al. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation (1984)
  • Jin, Y., & Tackkim, Y. (1995). Noun-sense Disambiguation from the Concept Base in MT. NLPRS, 32,...