Discovering mappings in hierarchical data from multiple sources using the inherent structure

  • Regular Paper
Knowledge and Information Systems

Abstract

Unprecedented amounts of media data are publicly accessible. However, it is increasingly difficult to integrate relevant media from multiple and diverse sources for effective applications. The functioning of a multimodal integration system requires metadata, such as ontologies, that describe media resources and media components. Such metadata are generally application-dependent and this can cause difficulties when media needs to be shared across application domains. There is a need for a mechanism that can relate the common and uncommon terms and media components. In this paper, we develop an algorithm to mine and automatically discover mappings in hierarchical media data, metadata, and ontologies, using the structural information inherent in these types of data. We evaluate the performance of this algorithm for various parameters using both synthetic and real-world data collections and show that the structure-based mining of relationships provides high degrees of precision.

Author information

Corresponding author

Correspondence to K. Selçuk Candan.

Additional information

K. Selçuk Candan is an Associate Professor in the Department of Computer Science and Engineering at Arizona State University. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park. He received the 1997 ACM DC Chapter Samuel N. Alexander Fellowship award for his Ph.D. work. His research interests include the development of indexing and retrieval schemes for multimedia and Web information and the management of dynamic, heterogeneous, and distributed data. He has published various articles in respected journals and conferences in these areas. He has also served as a program committee member, chairperson, and guest editor for various workshops, conferences, and journals. He received his B.S. degree in computer science, ranked first in the department, from Bilkent University in Turkey in 1993. http://www.public.asu.edu/~candan

Jong Wook Kim received his B.S. from Korea University, Seoul, Korea, in 1998 and his M.S. from KAIST, Daejon, Korea, in 2000. He is currently a Ph.D. student in the Department of Computer Science and Engineering, Arizona State University, AZ, USA. His primary research interests are web data mining, information retrieval, and database systems. His current research concentrates on mining web communities such as discussion boards.

Huan Liu earned his Ph.D. in Computer Science at the University of Southern California in 1989 and his Bachelor of Engineering in the Electrical Engineering and Computer Science Department at Shanghai Jiao Tong University in 1983. He conducted research at Telecom (Telstra) Australia Research Laboratories in Melbourne, Australia. In January 1994, he joined the School of Computing at the National University of Singapore and became an Associate Professor. Since January 2000, he has been with the Department of Computer Science and Engineering at Arizona State University as an Associate Professor. He is a senior member of IEEE and a member of ACM and AAAI. His principal research interests include machine learning, feature and subset selection, data preprocessing, bioinformatics, and data (including text and web) mining. He has worked on real-world data mining applications and published extensively in journals, conference proceedings, book chapters, and books. He serves on the editorial boards of journals, a handbook of data mining, and an encyclopedia of data mining and warehousing.

Reshma Suvarna is currently employed at Honeywell as a Senior Software Engineer. Her work at Honeywell is in the area of Aerospace Electronic Systems. She received her Master's degree in Computer Science and Engineering from Arizona State University in 2003. In addition to her current work in Aerospace Electronic Systems, she is interested in data mining, web mining, and software engineering research.

About this article

Cite this article

Candan, K.S., Kim, J.W., Liu, H. et al. Discovering mappings in hierarchical data from multiple sources using the inherent structure. Knowl Inf Syst 10, 185–210 (2006). https://doi.org/10.1007/s10115-005-0230-9

  • DOI: https://doi.org/10.1007/s10115-005-0230-9
