Improving XML Instances Comparison with Preprocessing Algorithms

  • Conference paper
Database and Expert Systems Applications (DEXA 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4653)

Abstract

Integrating data instances, especially on the web, involves analyzing and matching data from two or more sources, including XML sources. XML sources in particular introduce new challenges to the integration process because of their dynamic and irregular structure. In this context, one of the hardest steps is determining which XML instances are similar. This paper presents a group of algorithms that prepare XML instances for comparison. We analyze the benefit of these algorithms over existing XML comparison approaches.

This work is partially supported by the DIGITEX Project of CNPq Foundation. CTInfo Process Nr.: 550.845/2005-4.



Editor information

Roland Wagner, Norman Revell, Günther Pernul

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gonçalves, R., dos Santos Mello, R. (2007). Improving XML Instances Comparison with Preprocessing Algorithms. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_2

  • DOI: https://doi.org/10.1007/978-3-540-74469-6_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74467-2

  • Online ISBN: 978-3-540-74469-6

  • eBook Packages: Computer Science, Computer Science (R0)
