Abstract
The automated analysis of documents is an important task given the rapid increase in the availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation that facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to the clustered structure that this representation assumes, and where the vector representation may be unsuitable for text analysis. As a proof of concept, we previously presented a framework in which the optimal alignments of documents could be used to visualise the relationships within small sets of documents. In this paper we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories are also generated using conventional methods of document representation. To facilitate this comparison, we propose a simple procedure for evaluating the accuracy of the generated version histories.
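To illustrate the contrast drawn in the abstract, the following minimal sketch compares a term-frequency vector similarity with an alignment-based distance on two short texts. It is not the authors' implementation: the function names `cosine_similarity` and `alignment_distance` are hypothetical, whitespace tokenisation is assumed, and the alignment is scored as a unit-cost word-level edit distance, which may differ from the scoring scheme used in the paper.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between term-frequency vectors (word order is ignored)."""
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_distance(doc_a: str, doc_b: str) -> int:
    """Word-level edit distance with unit costs (an assumed scoring scheme)."""
    a, b = doc_a.split(), doc_b.split()
    # prev[j] holds the distance between the words of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


v1 = "the cat sat on the mat"
v2 = "the mat sat on the cat"
print(cosine_similarity(v1, v2), alignment_distance(v1, v2))
```

Because the two example texts contain the same words in a different order, their term-frequency vectors are identical, whereas the alignment distance is non-zero. This is the kind of ordering information that a version history could plausibly exploit.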
Cite this article
Woon, W.L., Wong, KS.D. String alignment for automated document versioning. Knowl Inf Syst 18, 293–309 (2009). https://doi.org/10.1007/s10115-008-0130-x
DOI: https://doi.org/10.1007/s10115-008-0130-x