Documents similarity measurement using field association terms
Introduction
There are many document retrieval systems and there have been many advances in areas such as keyword retrieval, similar file retrieval, automatic document classification and document summarization (Dillon & Gray, 1983; Fagan, 1989; Fuhr, 1989; Griffiths, Robinson, & Willet, 1984; Spark, 1971).
To use linear text words in information retrieval, it is important to recognize text portions that are semantically homogeneous (i.e. texts that are sufficiently similar to establish a clear relation). Identifying semantically homogeneous text portions allows the generation of text links that relate texts and transform them into structural text representations, providing selected text and creating link paths. Identifying semantically related text portions also allows efficient search and retrieval of relevant texts. To find semantic homogeneity in unrestricted text environments, the texts themselves are the principal basis for text analysis. Because of the syntactic and semantic ambiguities inherent in natural-language texts (Griffiths et al., 1984; Jin & Tackkim, 1995), comparing words (rather than comparing extensive text) from different texts is usually adequate.
Much recent retrieval system evaluation is based mainly on user criteria (Croft, 1986) such as type of output presentation, amount of user effort needed for search and level of coverage of the target collection. The most important measures are:
- (1) ability of the system to retrieve wanted information,
- (2) ability of the system to reject unwanted information.
Several evaluation studies use a test methodology based mainly on Recall and Precision values that apply to a set of test similarities (Salton & Lesk, 1968; Salton & McGill, 1983). Recall is the proportion of relevant material actually retrieved; Precision is the proportion of retrieved material which is relevant. Computing Recall and Precision requires:
- (1) differentiation between similar retrieved documents and similar documents that are not retrieved,
- (2) differentiation between similar documents that are relevant to input and similar documents that are not relevant to input.
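The two definitions above can be illustrated with a minimal sketch. The function name `recall_precision` and the document identifiers are hypothetical and are not taken from the paper; the code only restates Recall as the proportion of relevant material retrieved and Precision as the proportion of retrieved material that is relevant.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: identifiers of documents returned by the system
    relevant:  identifiers of documents judged relevant to the input
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved

    # Guard against empty sets so the ratios are always defined.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical identifiers, for illustration only.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
recall, precision = recall_precision(retrieved_docs, relevant_docs)
```

Here two of the three relevant documents are retrieved and two of the four retrieved documents are relevant, so Recall is 2/3 and Precision is 1/2.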
Techniques for document classification and similar file retrieval are based on Vector-Space models (Salton & McGill, 1983) and Probabilistic models (Fuhr, 1989). These methods make it possible to retrieve and classify texts from arbitrary databases without referring to systematically classified information.
Readers generally identify the subject of a text (document field) when they notice specific terms (field association (FA) terms) in that text. These specific FA terms are single or compound terms occurring in document fields (Atlam, Morita, Fuketa, & Aoe, 2002; Fuketa, Lee, Tsuji, Okada, & Aoe, 2000; Tom, 1997; Tsuji, Nigazawa, Okada, & Aoe, 1999; Yang & Pederson, 1997). For example, the document fields <Politics> or <Baseball> can usually be recognized as the subjects of texts when the reader finds “election” or “catcher”. FA terms are useful in measuring similarity between document fields without comparing all the information in those documents.
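As a rough illustration of how FA terms point to document fields, the sketch below counts field evidence in a text using a small lookup table. The dictionary `FA_TERMS` and the function `guess_fields` are hypothetical; only the two term-field pairs ("election" for Politics, "catcher" for Baseball) come from the example above, and the paper's own procedure for determining FA terms may differ.

```python
from collections import Counter

# Hypothetical FA-term dictionary: term -> document field.
# Only these two pairs are taken from the text; a real dictionary would be far larger.
FA_TERMS = {
    "election": "Politics",
    "catcher": "Baseball",
}


def guess_fields(text):
    """Count how many FA-term occurrences in `text` point to each field."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        field = FA_TERMS.get(word)
        if field is not None:
            counts[field] += 1
    return counts


fields = guess_fields("The catcher dropped the ball, but the catcher recovered.")
```

Only the words listed in the dictionary contribute, so the field is inferred without comparing the full text of the document.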
The traditional analysis methods, Word Form and Word Stem, use text words to measure similarity between document fields (Salton, 1989). This paper introduces a new method, field association similarity (FA-Sim), which uses FA terms in document fields to measure document similarity.
Section 2 of this paper identifies simple FA terms, provides document field trees and discusses how FA terms can be found electronically using an algorithm that determines FA term levels. Section 3 introduces a procedure for indexing FA terms automatically and explains how to calculate term weights. Section 4 introduces pairwise document similarities and defines a similarity measurement by combining a Vector-Space model with FA-Sim. Section 5 describes the FA-Sim algorithm and compares the FA-Sim method with the Word Form and Word Stem methods, using Recall and Precision to measure average similarity. Section 5 also verifies FA-Sim using 38,000 articles on various topics from a data set of 20 Newsgroups from the CNN Web Site (1995–2001) and a large English Penn-Treebank Corpus (Treebank Project Release 2, 1995). Section 6 concludes with an indication of possible future work.
Section snippets
Document field tree
A document field tree structure ranks relationships between document fields (Aoe, Morita, & Mochizuki, 1989; Breiman, Friedman, Olshen, & Stone, 1984; Safavian & Landgrebe, 1991). The field tree in Fig. 1, based on Imidas’99 (Dozawa, 1999), contains 14 super-fields, 443 sub-fields and 393 terminal fields. Root names are omitted unless there is conflict between super-fields and sub-fields. In such cases, only terminal fields are described and FA terms and paths are manually assigned. For
Text analysis approach
There are two common methods of representing natural-language texts (Blair & Maron, 1984):
- 1. A shallow keyword representation method may be used, in which individual terms or keywords are assigned to information items and used to represent document content. In keyword representation systems, retrieval decisions are based on comparisons of keywords, and retrieved items contain keyword patterns corresponding to information requests.
- 2. A deep knowledge representation method may be used where a formal knowledge
Pairwise document similarities
The term vectors shown in expression (1) are used to compute a similarity measure between document items Di and Dj. Assuming that the weight FAwik of FA term FATk in Di is given by the term frequency FAtfik or by combining term frequency and inverse document frequency (FAtfik×FAidfik), the pairwise document similarity FA-Sim(Di,Dj) may be computed:
When document similarity is based on the measurement of expression (2), document pairs with highly weighted
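Expressions (1) and (2) are not reproduced in this excerpt, so the following is only a sketch under stated assumptions: it uses the cosine measure common in Vector-Space models as a stand-in for FA-Sim, and it assumes the inverse document frequency is log(N/df). The function names `fa_weights` and `cosine_similarity` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def fa_weights(doc_terms, doc_freq, n_docs, use_idf=True):
    """Weight each FA term by tf, or by tf x idf (idf = log(N/df) is an assumption)."""
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / doc_freq[term]) if use_idf else 1.0
        weights[term] = freq * idf
    return weights


def cosine_similarity(w_i, w_j):
    """Cosine of two sparse weight vectors (assumed stand-in for expression (2))."""
    dot = sum(w * w_j[t] for t, w in w_i.items() if t in w_j)
    norm_i = math.sqrt(sum(w * w for w in w_i.values()))
    norm_j = math.sqrt(sum(w * w for w in w_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)


# Hypothetical documents, each already reduced to its FA terms.
docs = {
    "D1": ["election", "vote", "election"],
    "D2": ["election", "parliament"],
    "D3": ["catcher", "pitcher"],
}
# Number of documents in which each FA term occurs.
doc_freq = Counter(term for terms in docs.values() for term in set(terms))
weights = {name: fa_weights(terms, doc_freq, len(docs)) for name, terms in docs.items()}

sim_12 = cosine_similarity(weights["D1"], weights["D2"])  # share "election"
sim_13 = cosine_similarity(weights["D1"], weights["D3"])  # no shared FA terms
```

Because only FA terms enter the vectors, documents with no FA terms in common score zero, while documents sharing highly weighted FA terms score closer to one.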
Method comparison
Two widely used criteria for evaluating information retrieval systems are Recall and Precision. Determining Recall and Precision depends on identifying documents which are similar and relevant and identifying documents which are similar and irrelevant. The present study determines variations in relevance assessments based on average Recall and Precision values. Retrieval effectiveness is measured:
Evaluation of results
To verify the efficiency of the FA-Sim method, about 38,000 articles from a data set of 20 Newsgroups from the CNN Web Site (1996–2001) were selected. The articles covered various topics related to sports, computers, politics, economics, etc. The method is also applied to the large Penn-Treebank English Corpus (Treebank Project Release 2 (1995)).
The effectiveness of the document retrieval system is evaluated by using pairwise document comparison with three processing methods. Comparison of Recall and Precision
Conclusion
The traditional analysis methods, Word Form and Word Stem, use text words to measure similarity between document fields. This paper introduces a new method (FA-Sim) which uses specific FA terms in document fields to measure document similarity. With this new method, measurement becomes easier and more effective. In terms of Recall and Precision, FA-Sim, which uses these specific FA terms, is superior to some other common methods which measure document similarity by using all
Acknowledgements
The authors gratefully acknowledge support from The Japan Science Society (The Nippon Foundation).
References
- et al. Similarity measurement using term negative weight to word similarity. Information Processing and Management Journal (2000).
- et al. A new method for selecting English field association terms of compound words and its knowledge representation. Information Processing and Management Journal (2002).
- Models for retrieval with probabilistic indexing. Information Processing and Retrieval (1989).
- et al. A document classification method by using field association words. Information Science Journal (2000).
- et al. Term weighting approaches in automatic text retrieval. Information Processing and Management (1988).
- et al. Developing a new similarity measure from two different perspectives. Information Processing and Management Journal (2001).
- et al. An efficient retrieval algorithm of collocate information using tree structure. Transaction of The IPSJ (1989).
- et al. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM (1984).
- Breiman, L., Friedman, J. H., Olshen, R.-A., & Stone, C. J. (1984). Classification and regression trees. Chapman &...
- Croft, W. B., 1986. User-specified domain knowledge for document retrieval. In Proceedings of the ACM conference on...
- Using probabilistic models of information retrieval without relevance information. Journal of Documentations.
- Fully automatic syntactically based indexing systems. Journal of the American Society for the Information Science (ASIS).
- The effectiveness of a non-syntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science (ASIS).
- Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation.