Learning element similarity matrix for semi-structured document analysis

Yang, Jianwu; Cheung, William K.; Chen, Xiaoou

doi:10.1007/s10115-008-0138-2

Regular Paper
Published: 08 May 2008

Volume 19, pages 53–78, (2009)

Knowledge and Information Systems

Jianwu Yang¹,
William K. Cheung² &
Xiaoou Chen¹

Abstract

Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a representation recently proposed for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the building blocks of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix using a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to improve model accuracy while preserving the learnability of the element similarity matrix. For performance evaluation, we applied the similarity learning to k-nearest neighbors search and similarity-based clustering, and tested it on two different XML document collections. The SLVM obtained via learning was found to significantly outperform the conventional Vector Space Model and edit-distance-based methods. In addition, the similarity matrix, obtained as a by-product, can provide higher-level knowledge about the semantic relationships between the XML elements.
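
The abstract does not state how the element similarity matrix enters the document similarity, so the following is only a minimal sketch under an assumed form: each document is an (elements x terms) matrix of term weights, and the matrix acts as a bilinear weight between the per-element term vectors of two documents. The function name `slvm_similarity`, the toy documents, and the two example matrices are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    """Plain inner product of two equal-length term vectors."""
    return sum(a * b for a, b in zip(u, v))


def slvm_similarity(doc_x: Matrix, doc_y: Matrix, elem_sim: Matrix) -> float:
    """Assumed form: sum over element pairs (i, j) of
    elem_sim[i][j] * <doc_x[i], doc_y[j]>.

    doc_x[i] is the term-weight vector of the text under element i.
    elem_sim[i][j] says how strongly text under element i should be
    matched against text under element j.
    """
    total = 0.0
    for i, row_x in enumerate(doc_x):
        for j, row_y in enumerate(doc_y):
            total += elem_sim[i][j] * dot(row_x, row_y)
    return total


# Hypothetical toy data: 2 elements (say "title", "abstract"), 3 terms.
doc_a = [[1.0, 0.0, 0.0],   # terms under element 0
         [0.0, 1.0, 1.0]]   # terms under element 1
doc_b = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0]]

# Identity matrix: only text under the same element is compared.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Off-diagonal weights: text under different elements is partly comparable.
cross = [[1.0, 0.5], [0.5, 1.0]]

sim_identity = slvm_similarity(doc_a, doc_b, identity)
sim_cross = slvm_similarity(doc_a, doc_b, cross)
```

With the identity matrix the two toy documents share only one term under the same element; the off-diagonal weights also credit terms that appear under different elements. In this assumed form, learning the matrix would amount to fitting those off-diagonal weights from data rather than setting them by hand.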

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Jianwu Yang & Xiaoou Chen
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
William K. Cheung

Corresponding author

Correspondence to Jianwu Yang.

Additional information

An abridged version of this manuscript appeared in the Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, Compiègne, France, September 2005.

About this article

Cite this article

Yang, J., Cheung, W.K. & Chen, X. Learning element similarity matrix for semi-structured document analysis. Knowl Inf Syst 19, 53–78 (2009). https://doi.org/10.1007/s10115-008-0138-2

Received: 07 August 2007
Revised: 14 February 2008
Accepted: 24 February 2008
Published: 08 May 2008
Issue Date: April 2009
DOI: https://doi.org/10.1007/s10115-008-0138-2
