Learning element similarity matrix for semi-structured document analysis

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the building blocks of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance the model's accuracy while preserving the learnability of the element similarity matrix. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested the performance on two different XML document collections. The SLVM obtained via learning was found to significantly outperform the conventional Vector Space Model and edit-distance-based methods. In addition, the similarity matrix, obtained as a by-product, can provide higher-level knowledge about the semantic relationships between the XML elements.
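
The abstract does not spell out how the element similarity matrix enters the document similarity, so the following is only an illustrative sketch under stated assumptions. It assumes each document is stored as one term-weight vector per element type and that similarity is a bilinear form in which the element similarity matrix weights the dot products between element vectors. The function name `slvm_similarity`, the two element types, and all toy values are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def slvm_similarity(doc_x: Matrix, doc_y: Matrix, elem_sim: Matrix) -> float:
    """Assumed bilinear similarity: sum over element pairs (i, j) of
    elem_sim[i][j] * <doc_x[i], doc_y[j]>.

    doc_x[i] is the term-weight vector of element type i in document x.
    This form is an assumption made for illustration only.
    """
    total = 0.0
    for i, row in enumerate(elem_sim):
        for j, weight in enumerate(row):
            if weight:
                total += weight * dot(doc_x[i], doc_y[j])
    return total


# Toy data: two hypothetical element types (say, title and body), three terms.
doc_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

# Identity matrix: only matching element types are compared.
identity = [[1.0, 0.0], [0.0, 1.0]]
# All-ones matrix: structure is ignored, as in a flat Vector Space Model.
all_ones = [[1.0, 1.0], [1.0, 1.0]]

sim_identity = slvm_similarity(doc_a, doc_b, identity)
sim_flat = slvm_similarity(doc_a, doc_b, all_ones)

# Flat VSM check: sum the element vectors first, then take one dot product.
flat_a = [sum(col) for col in zip(*doc_a)]
flat_b = [sum(col) for col in zip(*doc_b)]
sim_vsm = dot(flat_a, flat_b)
```

Under this assumed form, the two extreme matrices bracket the design space: the identity matrix treats element types as unrelated, while the all-ones matrix collapses to the flat Vector Space Model. A learned matrix would sit somewhere in between, with off-diagonal entries reflecting how related two element types are.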

Author information

Corresponding author

Correspondence to Jianwu Yang.

Additional information

An abridged version of this manuscript appeared in the Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, Compiègne, France, September 2005.

About this article

Cite this article

Yang, J., Cheung, W.K. & Chen, X. Learning element similarity matrix for semi-structured document analysis. Knowl Inf Syst 19, 53–78 (2009). https://doi.org/10.1007/s10115-008-0138-2

  • DOI: https://doi.org/10.1007/s10115-008-0138-2
