String alignment for automated document versioning

Woon, Wei Lee; Wong, Kuok-Shoong Daniel

doi:10.1007/s10115-008-0130-x

Regular Paper
Published: 11 March 2008

Knowledge and Information Systems, volume 18, pages 293–309 (2009)

Wei Lee Woon¹ &
Kuok-Shoong Daniel Wong²

Abstract

The automated analysis of documents is an important task given the rapid increase in availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation which facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof-of-concept, we previously presented a framework in which the optimal alignments of documents could be used for visualising the relationships within small sets of documents. In this paper, we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories generated using conventional methods of document representation are also produced. To facilitate this comparison, a simple procedure for evaluating the accuracy of the version histories thus generated is proposed.
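
The contrast drawn in the abstract between term-frequency vectors and alignment-based comparison can be illustrated with a small sketch. This is not the authors' implementation: the function names, the whitespace tokenisation, the unit edit costs, and the example sentences are all assumptions made for illustration. It compares a cosine similarity over term occurrence counts with a word-level Levenshtein distance, one simple form of string alignment.

```python
from collections import Counter
import math


def term_frequency_cosine(a: str, b: str) -> float:
    """Cosine similarity between term-occurrence-frequency vectors.

    Hypothetical helper: tokens are obtained by whitespace splitting.
    """
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over word tokens with unit costs (an assumption)."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]  # deleting the first i words of x
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Illustrative sentences only, not data from the paper.
v1 = "the cat sat on the mat"
v2 = "the mat sat on the cat"        # same words, different order
v3 = "the cat sat on the mat today"  # v1 with one word appended

print(term_frequency_cosine(v1, v2), word_edit_distance(v1, v2))
print(term_frequency_cosine(v1, v3), word_edit_distance(v1, v3))
```

In this toy case the term-frequency representation cannot tell `v1` from `v2`, because both contain the same words with the same counts, whereas the alignment-based distance registers the reordering as two substitutions. Conversely, the single appended word in `v3` costs one edit. Sensitivity to word order of this kind is what makes alignment a plausible basis for ordering successive revisions of a document.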

Author information

Authors and Affiliations

Masdar Institute of Science and Technology, M.I.T., 1-175, 77 Mass. Ave., Cambridge, MA, 02139, USA
Wei Lee Woon
Malaysia University of Science and Technology, GL33, Block C, Dataran Usahawan Kelana, 17 Jln.SS7/26, 47301, Petaling Jaya, Malaysia
Kuok-Shoong Daniel Wong

Corresponding author

Correspondence to Wei Lee Woon.

About this article

Cite this article

Woon, W.L., Wong, KS.D. String alignment for automated document versioning. Knowl Inf Syst 18, 293–309 (2009). https://doi.org/10.1007/s10115-008-0130-x

Received: 07 May 2007
Revised: 30 September 2007
Accepted: 19 January 2008
Published: 11 March 2008
Issue Date: March 2009
DOI: https://doi.org/10.1007/s10115-008-0130-x
