A fast procedure for the calculation of similarity coefficients in automatic classification

doi:10.1016/0306-4573(81)90026-1

Information Processing & Management

Volume 17, Issue 2, 1981, Pages 53-60

https://doi.org/10.1016/0306-4573(81)90026-1

Abstract

A fast algorithm is described for comparing the lists of terms representing documents in automatic classification experiments. The speed of the procedure arises from the fact that all of the non-zero-valued coefficients for a given document are identified together, using an inverted file to the terms in the document collection. The complexity and running time of the algorithm are compared with previously described procedures.
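The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the in-memory dictionary used as the inverted file, and the choice of the Dice coefficient are all assumptions made for illustration, since the abstract does not name a specific coefficient. The point shown is that, for each document, only the postings lists of its own terms are scanned, so every pair with a non-zero coefficient is found together and pairs sharing no terms are never touched.

```python
from collections import defaultdict


def build_inverted_file(term_sets):
    """Map each term to the list of ids of the documents that contain it."""
    inverted = defaultdict(list)
    for doc_id, terms in enumerate(term_sets):
        for term in terms:
            inverted[term].append(doc_id)
    return inverted


def nonzero_similarities(docs):
    """Return {(i, j): coefficient} for every pair i < j sharing a term.

    The Dice coefficient 2*|A & B| / (|A| + |B|) is an assumption here;
    any coefficient based on the number of common terms would fit.
    """
    term_sets = [set(terms) for terms in docs]
    inverted = build_inverted_file(term_sets)
    sims = {}
    for i, terms in enumerate(term_sets):
        # Count the terms that document i shares with each later document
        # by walking the postings lists of document i's own terms only.
        common = defaultdict(int)
        for term in terms:
            for j in inverted[term]:
                if j > i:
                    common[j] += 1
        # Every document reached above has a non-zero coefficient with i.
        for j, shared in common.items():
            sims[(i, j)] = 2 * shared / (len(terms) + len(term_sets[j]))
    return sims


# Hypothetical toy collection: each document is a list of index terms.
docs = [
    ["cluster", "document", "retrieval"],
    ["cluster", "similarity"],
    ["spectra", "infrared"],
    ["document", "similarity", "retrieval"],
]
sims = nonzero_similarities(docs)
```

In this toy collection the third document shares no terms with the others, so it never appears in `sims`. A pairwise comparison of all term lists would still have examined it against every other document.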


Cited by (26)

Computer-assisted IR spectra prediction - linked similarity searches for structures and spectra
1997, Analytica Chimica Acta
The prediction of IR spectra of organic compounds in the range between 2250 and 550 cm⁻¹ containing carbon, nitrogen, oxygen and halogen atoms based on a spectroscopic database is outlined. Structure similarity searches are performed to determine appropriate reference molecules whose spectra are then used for the prediction of the spectrum of the query molecule. The performance and reliability of the prediction system was extensively tested by a ‘leave one out’ procedure.
Recent trends in hierarchic document clustering: A critical review
1988, Information Processing and Management
This article reviews recent research into the use of hierarchic agglomerative clustering methods for document retrieval. After an introduction to the calculation of interdocument similarities and to clustering methods that are appropriate for document clustering, the article discusses algorithms that can be used to allow the implementation of these methods on databases of nontrivial size. The validation of document hierarchies is described using tests based on the theory of random graphs and on empirical characteristics of document collections that are to be clustered. A range of search strategies is available for retrieval from document hierarchies and the results are presented of a series of research projects that have used these strategies to search the clusters resulting from several different types of hierarchic agglomerative clustering method. It is suggested that the complete linkage method is probably the most effective method in terms of retrieval performance; however, it is also difficult to implement in an efficient manner. Other applications of document clustering techniques are discussed briefly; experimental evidence suggests that nearest neighbor clusters, possibly represented as a network model, provide a reasonably efficient and effective means of including interdocument similarity information in document retrieval systems.
The calculation of intermolecular similarity coefficients using an inverted file algorithm
1982, Analytica Chimica Acta
Molecular similarity analysis
2013, Chemoinformatics for Drug Discovery
High-speed rough clustering for very large document collections
2010, Journal of the American Society for Information Science and Technology
Similarity methods in chemoinformatics
2009, Annual Review of Information Science and Technology
