Abstract
The automated analysis of documents is an important task given the rapid increase in the availability of digital texts. In an earlier publication, we presented a framework in which the edit distances between documents were used to reconstruct the version history of a set of documents. However, one problem we encountered was the high computational cost of calculating these edit distances. In addition, the number of document comparisons required scales quadratically with the number of documents. In this paper we propose a simple approximation that retains many of the benefits of the method but greatly reduces the time required to calculate these edit distances. To test the utility of this approximation, the accuracy of the results obtained with it is compared with the original results.
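To make the cost argument concrete, the following is a minimal sketch of the baseline the abstract describes, not the approximation proposed in the paper. It assumes a character-level Levenshtein distance; the function names `edit_distance` and `pairwise_distances` are illustrative and do not come from the paper. Each comparison takes time proportional to the product of the two document lengths, and a collection of n documents needs n(n - 1)/2 such comparisons.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(docs: list[str]) -> dict[tuple[int, int], int]:
    """Compare every unordered pair: n * (n - 1) / 2 calls for n documents."""
    return {(i, j): edit_distance(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}
```

With four documents this already requires six comparisons, and doubling the collection roughly quadruples that count, which is the scaling problem the abstract refers to.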
© 2014 Springer International Publishing Switzerland
Cite this paper
Woon, W.L., Wong, KS.D., Aung, Z., Svetinovic, D. (2014). Document Versioning Using Feature Space Distances. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8835. Springer, Cham. https://doi.org/10.1007/978-3-319-12640-1_59
DOI: https://doi.org/10.1007/978-3-319-12640-1_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12639-5
Online ISBN: 978-3-319-12640-1
eBook Packages: Computer Science, Computer Science (R0)