S2S: structural-to-syntactic matching similar documents

Aygün, Ramazan S.

doi:10.1007/s10115-007-0108-0

Regular Paper
Published: 12 October 2007

Volume 16, pages 303–329 (2008)

Knowledge and Information Systems

Ramazan S. Aygün¹

Abstract

Managing large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In the simplest case, documents are generated, duplicated, and updated through emails and web pages. Although redundancy may increase reliability to some degree, uncontrolled redundancy degrades retrieval performance and may be useless if the returned documents are obsolete. Document similarity matching algorithms do not report the differences between documents, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which compares documents in a structural phase followed by a syntactic phase. First, in the structural phase, documents are decomposed into components according to their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically, following the structural organization of a document. Second, the syntactic matching algorithm uses a heuristic look-ahead algorithm for matching consecutive tokens with a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.
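
The two-phase idea can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details are not given in the abstract; it only shows the general shape of a coarse structural pass followed by a token-level pass. Everything here is an assumption made for illustration: components are taken to be blank-line-separated paragraphs, the structural phase pairs identical components by digest, leftover components are paired in order, the look-ahead is a simple greedy resynchronization within a fixed window, and `rebuild_new` is only a sanity check that the edits reproduce the new token sequence, not the "verification patch" the abstract mentions.

```python
import hashlib


def split_components(text):
    """Structural decomposition (assumed): blank-line-separated paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def digest(component):
    """Fingerprint a component so it can be compared without reading its words."""
    return hashlib.md5(component.encode("utf-8")).hexdigest()


def structural_phase(old_parts, new_parts):
    """Coarse phase: pair identical components by digest.

    Returns (matched index pairs, unmatched old indices, unmatched new indices).
    """
    new_index = {}
    for j, part in enumerate(new_parts):
        new_index.setdefault(digest(part), []).append(j)
    matched, used = [], set()
    for i, part in enumerate(old_parts):
        for j in new_index.get(digest(part), []):
            if j not in used:
                matched.append((i, j))
                used.add(j)
                break
    matched_old = {i for i, _ in matched}
    old_rest = [i for i in range(len(old_parts)) if i not in matched_old]
    new_rest = [j for j in range(len(new_parts)) if j not in used]
    return matched, old_rest, new_rest


def lookahead_diff(a, b, window=4):
    """Fine phase (assumed heuristic): greedy token diff with bounded look-ahead."""
    i = j = 0
    ops = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("keep", a[i]))
            i += 1
            j += 1
            continue
        # Search for the nearest position where the two sequences agree again.
        best = None
        for di in range(window + 1):
            for dj in range(window + 1):
                if di == 0 and dj == 0:
                    continue
                if i + di < len(a) and j + dj < len(b) and a[i + di] == b[j + dj]:
                    if best is None or di + dj < best[0] + best[1]:
                        best = (di, dj)
        if best is None:
            best = (1, 1)  # no resync point in the window: treat as a substitution
        di, dj = best
        ops += [("del", t) for t in a[i:i + di]]
        ops += [("add", t) for t in b[j:j + dj]]
        i += di
        j += dj
    ops += [("del", t) for t in a[i:]]
    ops += [("add", t) for t in b[j:]]
    return ops


def rebuild_new(ops):
    """Sanity check helper: kept and added tokens, in order, should equal the new tokens."""
    return [tok for op, tok in ops if op in ("keep", "add")]


def two_phase_compare(old_text, new_text):
    """Run the coarse phase, then diff only the components it could not match."""
    old_parts = split_components(old_text)
    new_parts = split_components(new_text)
    matched, old_rest, new_rest = structural_phase(old_parts, new_parts)
    diffs = []
    # Assumption: leftover components are paired in document order;
    # unpaired leftovers (unequal counts) are ignored in this sketch.
    for i, j in zip(old_rest, new_rest):
        diffs.append(lookahead_diff(old_parts[i].split(), new_parts[j].split()))
    return matched, diffs
```

For two three-paragraph documents that differ in one word of the middle paragraph, the structural phase pairs the first and last paragraphs directly, and only the middle paragraph reaches the token-level pass. Skipping word-level work on unchanged components is the intuition behind doing the coarse comparison first.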

Author information

Authors and Affiliations

Computer Science Department, University of Alabama in Huntsville, Huntsville, AL, 35899, USA
Ramazan S. Aygün

Corresponding author

Correspondence to Ramazan S. Aygün.

About this article

Cite this article

Aygün, R.S. S2S: structural-to-syntactic matching similar documents. Knowl Inf Syst 16, 303–329 (2008). https://doi.org/10.1007/s10115-007-0108-0

Received: 06 September 2006
Revised: 03 August 2007
Accepted: 01 September 2007
Published: 12 October 2007
Issue Date: September 2008
DOI: https://doi.org/10.1007/s10115-007-0108-0
