S2S: structural-to-syntactic matching similar documents

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Managing large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In the simplest case, documents are generated, duplicated, and updated through emails and web pages. Although redundancy may increase reliability to some degree, uncontrolled redundancy degrades retrieval performance and may be useless if the returned documents are obsolete. Document similarity matching algorithms do not report how documents differ, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which compares documents in a structural phase followed by a syntactic phase. First, in the structural phase, documents are decomposed into components according to their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically, following the structural organization of a document. Second, in the syntactic phase, a heuristic look-ahead algorithm matches consecutive tokens with a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.
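The two-phase idea can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in the abstract. The sketch assumes that components are paragraphs separated by blank lines, that components are compared by MD5 digest, that leftover components are paired in document order, and that the look-ahead is bounded by a hypothetical `window` parameter. All function names are made up for illustration, and the verification patch is not modeled.

```python
import hashlib
import re


def split_components(text):
    """Structural decomposition (assumption: paragraphs split on blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def digest(component):
    """Fingerprint a component so it can be compared without reading its words."""
    return hashlib.md5(component.encode("utf-8")).hexdigest()


def structural_match(old_parts, new_parts):
    """Coarse phase: pair identical components by digest.

    Returns (pairs, unmatched_old, unmatched_new) as index lists.
    """
    positions = {}
    for i, part in enumerate(old_parts):
        positions.setdefault(digest(part), []).append(i)
    pairs, used_old, unmatched_new = [], set(), []
    for j, part in enumerate(new_parts):
        candidates = positions.get(digest(part), [])
        if candidates:
            i = candidates.pop(0)
            pairs.append((i, j))
            used_old.add(i)
        else:
            unmatched_new.append(j)
    unmatched_old = [i for i in range(len(old_parts)) if i not in used_old]
    return pairs, unmatched_old, unmatched_new


def lookahead_match(a, b, window=4):
    """Fine phase: greedy token alignment with a bounded look-ahead.

    On a mismatch, search up to `window` tokens ahead in both sequences
    for the nearest position where the tokens agree again.
    """
    ops, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
            continue
        best = None
        for di in range(window + 1):
            for dj in range(window + 1):
                if di == 0 and dj == 0:
                    continue
                if i + di < len(a) and j + dj < len(b) and a[i + di] == b[j + dj]:
                    if best is None or di + dj < best[0] + best[1]:
                        best = (di, dj)
        if best is None:
            # No resynchronization point in the window: treat as a substitution.
            ops.append(("delete", a[i]))
            ops.append(("insert", b[j]))
            i += 1
            j += 1
        else:
            di, dj = best
            ops.extend(("delete", t) for t in a[i:i + di])
            ops.extend(("insert", t) for t in b[j:j + dj])
            i += di
            j += dj
    ops.extend(("delete", t) for t in a[i:])
    ops.extend(("insert", t) for t in b[j:])
    return ops


def s2s_diff(old_text, new_text):
    """Run the coarse phase, then token-match only the leftover components."""
    old_parts = split_components(old_text)
    new_parts = split_components(new_text)
    pairs, rest_old, rest_new = structural_match(old_parts, new_parts)
    # Assumption: leftover components are paired in document order.
    token_ops = [
        lookahead_match(old_parts[i].split(), new_parts[j].split())
        for i, j in zip(rest_old, rest_new)
    ]
    return pairs, token_ops
```

In this sketch the potential saving comes from the first phase: components whose digests agree are never tokenized, so the word-level matcher only runs on the parts of the documents that actually changed.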

Author information

Corresponding author

Correspondence to Ramazan S. Aygün.

About this article

Cite this article

Aygün, R.S. S2S: structural-to-syntactic matching similar documents. Knowl Inf Syst 16, 303–329 (2008). https://doi.org/10.1007/s10115-007-0108-0

  • DOI: https://doi.org/10.1007/s10115-007-0108-0
