Abstract
Integrating data instances, especially on the web, involves analyzing and matching data from two or more sources, including XML sources. XML sources in particular introduce new challenges to the integration process because of their dynamic and irregular structure. In this context, one of the hardest steps is determining which XML instances are similar. This paper presents a group of algorithms that prepare XML instances for comparison. We analyze the benefit of these algorithms over existing XML comparison approaches.
This work is partially supported by the DIGITEX Project of CNPq Foundation. CTInfo Process Nr.: 550.845/2005-4.
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gonçalves, R., dos Santos Mello, R. (2007). Improving XML Instances Comparison with Preprocessing Algorithms. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_2
DOI: https://doi.org/10.1007/978-3-540-74469-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74467-2
Online ISBN: 978-3-540-74469-6
eBook Packages: Computer Science (R0)