HINMINE: heterogeneous information network mining with information retrieval heuristics

Journal of Intelligent Information Systems

Abstract

The paper presents an approach to mining heterogeneous information networks by decomposing them into homogeneous networks. The proposed HINMINE methodology builds on previous work that classifies nodes in a heterogeneous network in two steps. In the first step, the heterogeneous network is decomposed into one or more homogeneous networks using different connecting nodes. We improve this step with new methods inspired by the weighting of bag-of-words vectors commonly used in information retrieval. These methods assign larger weights to nodes that are more informative and characteristic for a specific class of nodes. In the second step, the resulting homogeneous networks are used to classify data either by network propositionalization or by label propagation. We propose an adaptation of the label propagation algorithm to handle imbalanced data and test several classification algorithms in propositionalization. The new methodology is tested on three data sets with different properties. For each data set, we perform a series of experiments that compare the heuristics used in the first step and the classifiers that can be used in the second step when performing network propositionalization. Our results show that HINMINE, using different network decomposition methods, can significantly improve the performance of the resulting classifiers, and that the modified label propagation algorithm is beneficial when the data set is imbalanced.
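The two steps described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the toy paper-author incidence matrix, the IDF-style weight log(n / degree), the per-class rescaling of the label matrix as the imbalance adjustment, and the function names `decompose`, `idf_weights`, `balance_labels` and `propagate_labels`. The propagation step follows the standard iterative scheme of Zhou et al. (2004), not necessarily the exact variant used in the paper.

```python
import numpy as np

def idf_weights(B):
    """IDF-style weight per connecting node (assumed form): log(n / degree)."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    degree = B.sum(axis=0)
    return np.log(n / np.maximum(degree, 1.0))

def decompose(B, weights=None):
    """Step 1: build a homogeneous network over base nodes.

    B[i, m] = 1 if base node i is linked to connecting node m.
    Two base nodes are joined with weight equal to the sum of the
    weights of the connecting nodes they share.
    """
    B = np.asarray(B, dtype=float)
    if weights is None:
        weights = np.ones(B.shape[1])
    A = (B * weights) @ B.T
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

def balance_labels(Y):
    """Assumed imbalance adjustment: scale each class column by 1 / class size."""
    Y = np.asarray(Y, dtype=float)
    counts = Y.sum(axis=0)
    return Y / np.where(counts > 0, counts, 1.0)

def propagate_labels(A, Y, alpha=0.6, iters=100):
    """Step 2: iterate F <- alpha * S F + (1 - alpha) * Y,
    with S the symmetrically normalized adjacency matrix."""
    d = A.sum(axis=1)
    safe = np.where(d > 0, d, 1.0)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(safe), 0.0)
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F

# Toy data: 5 papers (base nodes), 3 authors (connecting nodes).
# Author 2 is linked to every paper, so it carries no class information.
B = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1],
              [0, 1, 1]])

A_plain = decompose(B)                     # every shared author counts as 1
A_weighted = decompose(B, idf_weights(B))  # the ubiquitous author gets weight 0

# Paper 0 is labeled class 0, paper 2 is labeled class 1, the rest are unlabeled.
Y = np.zeros((5, 2))
Y[0, 0] = 1.0
Y[2, 1] = 1.0

F = propagate_labels(A_weighted, balance_labels(Y))
pred = F.argmax(axis=1)
```

In this toy network the unweighted decomposition links every pair of papers through the ubiquitous author, while the IDF-style weighting removes those links and leaves two groups of papers, each containing one labeled node.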

Notes

  1. https://knowledgepit.fedcsis.org/contest/view.php?id=107

  2. https://aminer.org/citation

  3. http://grouplens.org/datasets/hetrec-2011/

  4. https://www.imdb.com

Acknowledgments

This research was supported by the European Commission through the Human Brain Project (Grant number 604102) and three National Research Agency grants: the research programmes Knowledge Technologies (P2-0103), Artificial intelligence and intelligent systems (P2-0209) and project Development and applications of new semantic data mining methods in life sciences (J2-5478). Our thanks go to Miha Grčar for previous work on this topic, which inspired the research described in this paper.

Author information

Corresponding author

Correspondence to Jan Kralj.

About this article

Cite this article

Kralj, J., Robnik-Šikonja, M. & Lavrač, N. HINMINE: heterogeneous information network mining with information retrieval heuristics. J Intell Inf Syst 50, 29–61 (2018). https://doi.org/10.1007/s10844-017-0444-9

  • DOI: https://doi.org/10.1007/s10844-017-0444-9
