Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Wang, Chenguang; Song, Yangqiu; Li, Haoran; Zhang, Ming; Han, Jiawei

doi:10.1007/s10618-018-0581-y

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Published: 14 July 2018

Volume 32, pages 1735–1767, (2018)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Chenguang Wang¹,
Yangqiu Song ORCID: orcid.org/0000-0002-7818-6090²,
Haoran Li³,
Ming Zhang³ &
…
Jiawei Han⁴

1633 Accesses
13 Citations
Explore all metrics

Abstract

Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial to human. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

PathSimExt: Revisiting PathSim in Heterogeneous Information Networks

W-PathSim: Novel Approach of Weighted Similarity Measure in Content-Based Heterogeneous Information Networks by Applying LDA Topic Modeling

Notes

Note that in the HIN for text, entity features include both tf of words and named entities.
http://qwone.com/~jason/20Newsgroups/.
https://developers.google.com/freebase/.
We do not test longer meta-paths, because Theorem 2 in Sun et al. (2011) has demonstrated their limit on effective propagation of semantic knowledge.
We keep 20 and 43 entity types for 20NG and GCAT respectively, because they have relatively larger number of instances.

References

Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. In: Mining text data. Springer, pp 163–222
Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Proceedings of the IEEE annual symposium on foundations of computer science (FOCS), pp 475–486
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Zachary I (2007) A nucleus for a web of open data. Springer, Dbpedia
Google Scholar
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the web. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2670–2676
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 59–68
Berant J, Chou A, Frostig R, Liang P (2013) Semantic parsing on Freebase from question-answer pairs. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1533–1544
Berg C, Christensen JP, Ressel P (1984) Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100 of graduate texts in mathematics, 1st edn. Springer, Berlin
Book Google Scholar
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res (JMLR) 3:993–1022
MATH Google Scholar
Bollacker KD, Evans C, Paritosh P, Sturge T, Taylor J (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1247–1250
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Book Google Scholar
Budanitsky A, Hirst G (2006) Evaluating wordnet-based measures of lexical semantic relatedness. Comput Linguist 32(1):13–47
Article Google Scholar
Cai H, Zheng VW, Chang KC (2018) A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Trans Knowl Data Eng (TKDE)
Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string distance metrics for name-matching tasks. In: Proceedings of the international joint conference on artificial intelligence (IJCAI) workshop on information integration, pp 73–78
Cui P, Wang X, Pei J, Wenwu Z (2017) A survey on network embedding. CoRR. arXiv:1711.08752
Do Q, Roth D, Sammons M, Tu Y, Vydiswaran VGV (2009) Robust, light-weight approaches to compute lexical similarity. In: Computer science research and technical reports. University of Illinois, pp 94–94
Dong X, Gabrilovich E, Heitz G, Horn W, Lao N, Murphy K, Strohmann T, Sun S, Zhang W (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 601–610
Dong Y, Chawla NV, Swami A (2017) metapath2vec: scalable representation learning for heterogeneous networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 135–144
Etzioni O, Cafarella M, Downey D (2004) Webscale information extraction in knowitall (preliminary results). In: Proceedings of the international world wide web conference (WWW), pp 100–110
Fu T-Y, Lee W-C, Lei Z (2017) Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1797–1806
Gabrilovich E, Markovitch S (2005) Feature generation for text categorization using world knowledge. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1048–1053
Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 1606–1611
Ganesan P, Garcia-Molina H, Widom J (2003) Exploiting hierarchical domain structure to compute similarity. ACM Trans Inf Syst (TOIS) 21(1):64–93
Article Google Scholar
Goyal P, Ferrara E (2018) Graph embedding techniques, applications, and performance: a survey. Knowl Based Syst 151:78–94
Article Google Scholar
Han J, Sun Y, Yan X, Yu PS (2010) Mining knowledge from databases: an information network analysis approach. In: Proceedings of the ACM special interest group on management of data (SIGMOD), pp 1251–1252
Hassan S, Mihalcea R, Banea C (2007) Random walk term weighting for improved text classification. In: Proceedings of the international conference on semantic computing (ICSC), pp 242–249
He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Proceedings of the neural information processing systems (NIPS), pp 507–514
Hotho A, Staab S, Stumme G (2003) Ontologies improve text document clustering. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 541–544
Hu J, Fang L, Cao Y, Zeng H-J, Li H, Yang Q, Chen Z (2008) Enhancing text clustering by leveraging Wikipedia semantics. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval (SIGIR), pp 179–186
Hu X, Sun N, Zhang C, Chua T-S (2009a) Exploiting internal and external semantics for the clustering of short texts using world knowledge. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 919–928
Hu X, Zhang X, Lu C, Park EK, Zhou X (2009b) Exploiting Wikipedia as external knowledge for document clustering. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 389–396
Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the New Zealand computer science research student conference (NZCSRSC), pp 49–56
Huang Z, Mamoulis N (2017) Heterogeneous information network embedding for meta path based proximity. CoRR. arXiv:1701.05291
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Springer, Berlin
Google Scholar
Kim J, Rousseau F, Vazirgiannis M (2015) Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 775–780
Kong X, Yu PS, Ding Y, Wild DJ (2012) Meta path-based collective classification in heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1567–1571
Kong X, Zhang J, Yu PS (2013) Inferring anchor links across multiple heterogeneous social networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 179–188
Lakkaraju P, Gauch S, Speretta M (2008) Document similarity based on concept tree distance. In: Proceedings of the nineteenth ACM conference on hypertext and hypermedia, pp 127–132
Lang K (1995) Newsweeder: learning to filter netnews. In: Proceedings of the international conference on machine learning (ICML), pp 331–339
Chapter Google Scholar
Lao N, Cohen WW (2010) Relational retrieval using a combination of path-constrained random walks. Mach Learn 81(1):53–67
Article MathSciNet Google Scholar
Lao N, Mitchell T, Cohen WW (2011) Random walk inference and learning in a large scale knowledge base. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 529–539
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res (JMLR) 5:361–397
Google Scholar
Li Y, Wang C, Han F, Han J, Roth D, Yan X (2013) Mining evidences for named entity disambiguation. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1070–1078
Lu Q, Getoor L (2003) Link-based classification. In: Proceedings of the international conference on machine learning (ICML), pp 496–503
Luss R, d’Aspremont A (2008) Support vector machine classification with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 953–960
McCallum A, Nigam K, et al (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the association for the advancement of artificial intelligence (AAAI) workshop, vol 752, pp 41–48
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the neural information processing systems (NIPS), pp 3111–3119
Mitchell TM, Cohen WW, Hruschka ER Jr, Talukdar PP, Betteridge J, Carlson A, Mishra BD, Gardner M, Kisiel B, Krishnamurthy J, Lao N, Mazaitis K, Mohamed T, Nakashole N, Platanios EA, Ritter A, Samadi M, Settles B, Wang RC, Wijaya DT, Gupta A, Chen X, Saparov A, Greaves M, Welling J (2015) Never-ending learning. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2302–2310
Mooney RJ (2007) Learning for semantic parsing. In: Proceedings of the international conference on computational linguistics and intelligent text processing (CICLing), pp 311–324
Chapter Google Scholar
Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127–152
Article MathSciNet Google Scholar
Paul C, Rettinger A, Mogadala A, Knoblock CA, Pedro S (2016) Efficient graph-based document similarity. In: International conference on the semantic web. latest advances and new domains, pp 334–349
Ponzetto SP, Strube M (2007) Deriving a large-scale taxonomy from Wikipedia. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 1440–1445
Ratinov L, Roth D (2009) Design challenges and misconceptions in named entity recognition. In: Proceedings of the SIGNLL conference on computational natural language learning (CoNLL), pp 147–155
Rousseau F, Kiagias E, Vazirgiannis M (2015) Text categorization as a graph classification problem. In: Annual meeting of the association for computational linguistics (ACL), pp 1702–1712
Sahami M (1998) Using machine learning to improve information access. PhD thesis, Stanford University
Schuhmacher M, Ponzetto SP (2014) Knowledge-based graph document modeling. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 543–552
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Article MathSciNet Google Scholar
Shang J, Qu M, Liu J, Kaplan LM, Han J, Peng J (2016) Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. CoRR. arXiv:610.09769
Siolas G, d’Alché-Buc F (2000) Support vector machines based on a semantic kernel for text categorization. In: Proceedings of international joint conference on neural networks (IJCNN), pp 205–209
Song Y, Roth D (2015) Unsupervised sparse vector densification for short text similarity. In: Proceedings of the annual conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL-HLT), pp 1275–1280
Song Y, Pan S, Liu S, Zhou MX, Qian W (2009) Topic and keyword re-ranking for LDA-based topic modeling. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 1757–1760
Song Y, Wang H, Wang Z, Li H, Chen W (2011) Short text conceptualization using a probabilistic knowledgebase. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 2330–2336
Song Y, Wang S, Wang H (2015) Open domain short text conceptualization: a generative + descriptive modeling approach. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3820–3826
Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res (JMLR) 3:583–617
MathSciNet MATH Google Scholar
Strehl A, Ghosh J, Mooney R (2000) Impact of similarity measures on web-page clustering. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 58–64
Suchanek FM, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the international world wide web conference (WWW), pp 697–706
Sun Y, Han J, Yan X, Yu PS, Tianyi W (2011) Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc VLDB Endow (PVLDB) 4(11):992–1003
Google Scholar
Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2012) Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1348–1356
Sun Y, Norick B, Han J, Yan X, Yu PS, Yu X (2013) Pathselclus: integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Trans Knowl Discov Data (TKDD) 7(3):11:1–11:23
Google Scholar
Tang J, Qu M, Mei Q (2015) PTE: predictive text embedding through large-scale heterogeneous text networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1165–1174
Traverso I, Vidal M-E, Kämpgen B, Sure-Vetter Y (2016) GADES: a graph-based semantic similarity measure. In: International conference on semantic systems (SEMANTiCS), pp 101–104
Tu K, Cui P, Wang X, Wang F, Zhu W (2018) Structural deep embedding for hyper-networks. In: Proceedings of the association for the advancement of artificial intelligence (AAAI)
Wan X, Peng Y (2005) The earth mover’s distance as a semantic measure for document similarity. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 301–302
Wang C, Duan N, Zhou M, Zhang M (2013) Paraphrasing adaptation for web search ranking. In: Annual meeting of the association for computational linguistics (ACL), pp 41–46
Wang C, Song Y, El-Kishky A, Roth D, Zhang M, Han J (2015a) Incorporating world knowledge to document clustering via heterogeneous information networks. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1215–1224
Wang C, Song Y, Li H, Zhang M, Han J (2015b) Knowsim: a document similarity measure on structured heterogeneous information networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 506–513
Wang C, Song Y, Roth D, Wang C, Han J, Ji H, Zhang M (2015c) Constrained information-theoretic tripartite graph clustering to identify semantically similar relations. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), pp 3882–3889
Wang C, Song Y, Li H, Zhang M, Han J (2016a) Text classification with heterogeneous information network kernels. In: Proceedings of the association for the advancement of artificial intelligence (AAAI), pp 2130–2136
Wang C, Song Y, Roth D, Zhang M, Han J (2016) World knowledge as indirect supervision for document clustering. ACM Trans Knowl Discov Data (TKDD) 11(2):13:1–13:36
Google Scholar
Wang C, Song Y, Li H, Sun Y, Zhang M, Han J (2017) Distant meta-path similarities for text-based heterogeneous information networks. In: Proceedings of the ACM international conference on information and knowledge management (CIKM), pp 1629–1638
Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 713–721
Wang P, Hu J, Zeng H-J, Chen L, Chen Z (2007) Improving text classification by using encyclopedia knowledge. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 332–341
Wang W, Do DB, Lin X (2005) Term graph model for text classification. In: Proceedings of the international conference on advanced data mining and applications (ADMA), pp 19–30
Chapter Google Scholar
Ying Y, Campbell C, Girolami M (2009) Analysis of SVM with indefinite kernels. In: Proceedings of the neural information processing systems (NIPS), pp 2205–2213
Zelnik-Manor L, Perona P (2005) Self-tuning spectral clustering. In: Saul LK, Weiss Y, Bottou L (eds) Proceedings of the neural information processing systems (NIPS), pp 1601–1608
Zhang J, Kong X, Yu PS (2013) Predicting social links for new users across aligned heterogeneous social networks. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 1289–1294
Zhang J, Kong X, Yu PS (2014) Transferring heterogeneous links across location-based social networks. In: Proceedings of the ACM international conference on web search and data mining (WSDM), pp 303–312

Download references

Acknowledgements

Chenguang Wang, Haoran Li, and Ming Zhang gratefully acknowledge the support by the National Natural Science Foundation of China (NSFC Grant Nos. 61772039, 91646202 and 61472006). Yangqiu Song was supported by China 973 Fundamental R&D Program (No. 2014CB340304) and the Early Career Scheme (ECS, No. 26206717) from Research Grants Council in Hong Kong. Jiawei Han was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and Grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied by these agencies. We also thank the conference versions’ and journal version’s anonymous reviewers for their valuable comments and suggestions that help improve the quality of this manuscript.

Author information

Authors and Affiliations

Amazon AI, 2100 University Ave, East Palo Alto, CA, USA
Chenguang Wang
Department of CSE, HKUST, Clear Water Bay, Hong Kong
Yangqiu Song
School of EECS, Peking University, Beijing, China
Haoran Li & Ming Zhang
Department of CS, UIUC, Urbana, IL, 61801, USA
Jiawei Han

Authors

Chenguang Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yangqiu Song
View author publications
You can also search for this author in PubMed Google Scholar
Haoran Li
View author publications
You can also search for this author in PubMed Google Scholar
Ming Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jiawei Han
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yangqiu Song.

Additional information

Responsible editor: Hanghang Tong

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wang, C., Song, Y., Li, H. et al. Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks. Data Min Knowl Disc 32, 1735–1767 (2018). https://doi.org/10.1007/s10618-018-0581-y

Download citation

Received: 03 July 2017
Accepted: 02 July 2018
Published: 14 July 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s10618-018-0581-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Abstract

Access this article

Similar content being viewed by others

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

PathSimExt: Revisiting PathSim in Heterogeneous Information Networks

W-PathSim: Novel Approach of Weighted Similarity Measure in Content-Based Heterogeneous Information Networks by Applying LDA Topic Modeling

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Abstract

Access this article

Similar content being viewed by others

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

PathSimExt: Revisiting PathSim in Heterogeneous Information Networks

W-PathSim: Novel Approach of Weighted Similarity Measure in Content-Based Heterogeneous Information Networks by Applying LDA Topic Modeling

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation