Abstract
Management of large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In its simplest form, documents are generated, duplicated, and updated through emails and web pages. Although redundancy may increase reliability to some degree, uncontrolled redundancy degrades retrieval performance and may be useless if the returned documents are obsolete. Document similarity matching algorithms do not provide information on the differences between documents, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach. S2S matching is composed of a structural phase and a syntactic phase for comparing documents. First, in the structural phase, documents are decomposed into components according to their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically based on the structural organization of a document. Second, in the syntactic phase, the matching algorithm uses a heuristic look-ahead algorithm for matching consecutive tokens with a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.
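The two phases described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the paragraph-based decomposition, the MD5 fingerprints, the `window` parameter, and all function names are assumptions made for illustration, and the replay check in `apply_ops` is only a stand-in for whatever the paper means by a verification patch.

```python
import hashlib
from typing import Dict, List, Tuple


def split_components(text: str) -> List[str]:
    """Structural decomposition (assumption: blank-line-separated paragraphs)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def digest(component: str) -> str:
    """Coarse fingerprint of a component (assumption: whitespace-normalized MD5)."""
    normalized = " ".join(component.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def structural_map(old: List[str], new: List[str]) -> List[Tuple[int, int]]:
    """Pair identical components by fingerprint, with no word-level comparison."""
    index: Dict[str, List[int]] = {}
    for j, comp in enumerate(new):
        index.setdefault(digest(comp), []).append(j)
    pairs = []
    for i, comp in enumerate(old):
        candidates = index.get(digest(comp))
        if candidates:
            pairs.append((i, candidates.pop(0)))
    return pairs


def lookahead_match(a: List[str], b: List[str], window: int = 4) -> List[Tuple[str, str]]:
    """Greedy token alignment with bounded look-ahead (heuristic, not an optimal LCS)."""
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
            continue
        # Search up to `window` tokens ahead for a point where the streams resynchronize.
        resync = None
        for k in range(1, window + 1):
            if j + k < len(b) and a[i] == b[j + k]:
                resync = ("insert", k)
                break
            if i + k < len(a) and a[i + k] == b[j]:
                resync = ("delete", k)
                break
        if resync is None:
            # No resynchronization found: treat the pair as a substitution.
            ops.append(("delete", a[i]))
            ops.append(("insert", b[j]))
            i += 1
            j += 1
        elif resync[0] == "insert":
            ops.extend(("insert", tok) for tok in b[j:j + resync[1]])
            j += resync[1]
        else:
            ops.extend(("delete", tok) for tok in a[i:i + resync[1]])
            i += resync[1]
    ops.extend(("delete", tok) for tok in a[i:])
    ops.extend(("insert", tok) for tok in b[j:])
    return ops


def apply_ops(ops: List[Tuple[str, str]]) -> List[str]:
    """Replay the operations; the result should equal the new token list."""
    return [tok for op, tok in ops if op != "delete"]
```

In this sketch, components paired by `structural_map` need no further work, and only the unpaired ones would be tokenized and passed to `lookahead_match`. Comparing `apply_ops(ops)` against the new token list is a cheap way to confirm that the heuristic alignment is at least a valid edit script, even when it is not the shortest one.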
Cite this article
Aygün, R.S. S2S: structural-to-syntactic matching similar documents. Knowl Inf Syst 16, 303–329 (2008). https://doi.org/10.1007/s10115-007-0108-0