String alignment for automated document versioning

  • Regular Paper
Knowledge and Information Systems

Abstract

The automated analysis of documents is an important task given the rapid increase in the availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation which facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof of concept, we previously presented a framework in which the optimal alignments of documents were used to visualise the relationships within small sets of documents. In this paper, we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories are also generated using conventional methods of document representation. To facilitate this comparison, we propose a simple procedure for evaluating the accuracy of the generated version histories.
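The contrast between the two representations can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the documents are tokenised by whitespace, and a unit-cost word-level edit distance is assumed as a stand-in for the score of an optimal alignment.

```python
from collections import Counter
from math import sqrt


def term_frequency_cosine(a: str, b: str) -> float:
    """Cosine similarity between term-frequency vectors (word order is ignored)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm = sqrt(sum(c * c for c in ta.values())) * sqrt(sum(c * c for c in tb.values()))
    return dot / norm if norm else 0.0


def word_edit_distance(a: str, b: str) -> int:
    """Word-level edit distance: cost of an optimal alignment under the
    assumed unit costs for insertion, deletion and substitution."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # delete wx
                            curr[j - 1] + 1,            # insert wy
                            prev[j - 1] + (wx != wy)))  # match or substitute
        prev = curr
    return prev[-1]


v1 = "the cat sat on the mat"
v2 = "the mat sat on the cat"      # same words as v1, different order
v3 = "the cat sat on the red mat"  # v1 with one inserted word

print(term_frequency_cosine(v1, v2), word_edit_distance(v1, v2))
print(term_frequency_cosine(v1, v3), word_edit_distance(v1, v3))
```

In this toy example the term-frequency vectors of `v1` and `v2` are identical, so the vector representation cannot tell them apart, whereas the alignment cost registers the reordering. In a versioning setting, small alignment costs between pairs of documents would suggest adjacent revisions.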

Author information

Corresponding author

Correspondence to Wei Lee Woon.

About this article

Cite this article

Woon, W.L., Wong, KS.D. String alignment for automated document versioning. Knowl Inf Syst 18, 293–309 (2009). https://doi.org/10.1007/s10115-008-0130-x

  • DOI: https://doi.org/10.1007/s10115-008-0130-x
