Abstract
Document matching has become a crucial task for data integration. A considerable number of algorithms for comparing XML documents have been proposed in the literature. Yet the existing approaches fall short in their ability to identify structural similarities between fuzzy XML documents. To fill this gap, we provide an integrated comparison approach for measuring the structural similarity of fuzzy XML documents. First, we propose a new fuzzy XML document tree model to represent fuzzy XML documents. Second, we offer a similarity measure based on element/attribute features to identify matching nodes. Third, we present an effective algorithm based on the tree edit distance to detect the structural similarities between fuzzy XML document trees represented with the proposed model. Finally, experimental results demonstrate that our approach can efficiently measure the structural similarity of fuzzy XML documents.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61370075 & 61572118) and the Program for New Century Excellent Talents in University (NCET-05-0288).
Cite this article
Zhao, Z., Ma, Z. A methodology for measuring structure similarity of fuzzy XML documents. Computing 99, 493–506 (2017). https://doi.org/10.1007/s00607-017-0553-x
DOI: https://doi.org/10.1007/s00607-017-0553-x