Abstract
Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the building blocks of an XML document. In this paper, instead of defining the element similarity matrix heuristically, we propose to learn it using a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to improve the model's accuracy while preserving the learnability of the element similarity. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested it on two different XML document collections. The learned SLVM was found to significantly outperform the conventional Vector Space Model and edit-distance-based methods. In addition, the similarity matrix, obtained as a by-product, can provide higher-level knowledge about the semantic relationships between the XML elements.
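To make the role of the element similarity matrix concrete, the following is a minimal sketch of one plausible reading of the abstract, not the formulation from the paper itself. It assumes each document is stored as one term-weight vector per XML element, and that document similarity is a sum of element-to-element dot products weighted by the element similarity matrix. The function name `slvm_similarity`, the data layout, and the toy values are all hypothetical.

```python
def slvm_similarity(dx, dy, m_e):
    """Bilinear document similarity weighted by an element similarity matrix.

    dx, dy: lists of per-element term-weight vectors (assumed layout:
            dx[i] is the term vector found under XML element i).
    m_e:    square matrix; m_e[i][j] is the assumed similarity between
            element i and element j.
    """
    n_elements = len(m_e)
    total = 0.0
    for i in range(n_elements):
        for j in range(n_elements):
            # Dot product of element i's terms in dx with element j's terms in dy.
            dot = sum(a * b for a, b in zip(dx[i], dy[j]))
            total += m_e[i][j] * dot
    return total


# Toy data: 2 elements (say, title and abstract) over a 3-term vocabulary.
dx = [[1.0, 0.0, 0.0],   # term 0 appears under element 0
      [0.0, 2.0, 0.0]]   # term 1 appears under element 1
dy = [[0.0, 1.0, 0.0],   # term 1 appears under element 0
      [1.0, 0.0, 0.0]]   # term 0 appears under element 1

# Identity matrix: only terms under the same element can match.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Non-zero off-diagonal entries let terms match across related elements.
relaxed = [[1.0, 0.5], [0.5, 1.0]]

strict_score = slvm_similarity(dx, dy, identity)
relaxed_score = slvm_similarity(dx, dy, relaxed)
```

With the identity matrix the two toy documents share no evidence, because their common terms sit under different elements. The off-diagonal entries of the second matrix let those terms contribute, which is the kind of cross-element relationship a learned matrix could capture.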
Additional information
An abridged version of this manuscript appeared in the Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence, Compiègne, France, September 2005.
About this article
Cite this article
Yang, J., Cheung, W.K. & Chen, X. Learning element similarity matrix for semi-structured document analysis. Knowl Inf Syst 19, 53–78 (2009). https://doi.org/10.1007/s10115-008-0138-2
DOI: https://doi.org/10.1007/s10115-008-0138-2