Automatic feature selection for supervised learning in link prediction applications: a comparative study

Pecli, Antonio; Cavalcanti, Maria Claudia; Goldschmidt, Ronaldo

doi:10.1007/s10115-017-1121-6

Regular Paper
Published: 25 October 2017

Volume 56, pages 85–121, (2018)
Knowledge and Information Systems

Antonio Pecli¹,
Maria Claudia Cavalcanti¹ &
Ronaldo Goldschmidt¹

Abstract

In recent years, considerable attention has been devoted to the link prediction (LP) problem in complex networks. The problem consists of predicting the likelihood that an association will appear in the future between two nodes of a network that are not currently connected. One of the most important approaches to the LP problem is based on supervised machine learning (ML) techniques for classification. Although many works have reported promising results with this approach, choosing the set of features (variables) used to train the classifiers remains a major challenge. In this article, we report on the effects of three automatic variable selection strategies (Forward, Backward and Evolutionary) applied to the feature-based supervised learning approach in LP applications. The experimental results show that these strategies lead to better classification models than classifiers built with the complete set of variables. The experiments were performed over three datasets (Microsoft Academic Network, Amazon and Flickr), each containing more than twenty features, both topological and domain-specific. We also describe the specification and implementation of the process used to support the experiments. It combines the feature selection strategies, six classification algorithms (SVM, K-NN, naïve Bayes, CART, random forest and multilayer perceptron) and three evaluation metrics (Precision, F-Measure and Area Under the Curve). Moreover, the process includes a novel approach, inspired by ML voting committees, that suggests sets of features to represent data in LP applications: it mines the log of the experiments to identify sets of features frequently selected to produce high-performing classification models. The experiments revealed interesting correlations between frequently selected features and datasets.
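The Forward strategy mentioned in the abstract can be illustrated with a minimal sketch of greedy wrapper-style selection. Everything below is an assumption made for illustration: the feature names, the `toy_score` function (which stands in for training a classifier and measuring, say, AUC on held-out data) and the `min_gain` stopping threshold are hypothetical and are not taken from the article or from PredLig.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[str], float]:
    """Greedy forward selection: start from the empty set and repeatedly add
    the single feature that most improves `score`, stopping when the best
    available addition improves the score by no more than `min_gain`."""
    selected: List[str] = []
    remaining = list(features)
    best = float("-inf")
    while remaining:
        # Score every subset obtained by adding one remaining feature.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        cand_score, cand_feature = max(candidates, key=lambda c: c[0])
        if cand_score - best <= min_gain:
            break  # no remaining feature helps enough
        selected.append(cand_feature)
        remaining.remove(cand_feature)
        best = cand_score
    return selected, best


# Hypothetical per-feature contributions, in hundredths of a score point.
# In a real wrapper, score() would train and evaluate a classifier instead.
POINTS = {"common_neighbors": 20, "adamic_adar": 18, "same_venue": 10, "noise": -5}


def toy_score(subset: FrozenSet[str]) -> float:
    points = 50 + sum(POINTS[f] for f in subset)
    # Assumed redundancy: the two neighborhood features carry the same signal,
    # so adding the second one contributes nothing.
    if {"common_neighbors", "adamic_adar"} <= subset:
        points -= 18
    return points / 100


selected, best = forward_selection(list(POINTS), toy_score)
print(selected, best)
```

The Backward strategy would follow the same pattern in reverse, starting from the full set and removing one feature at a time. Because the search is greedy, neither variant is guaranteed to find the best subset, which is one motivation for also considering an evolutionary search.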

Notes

Also known as feature extraction or feature engineering.
https://github.com/alpecli/predlig.
http://academic.research.microsoft.com.
https://snap.stanford.edu/data/com-Amazon.html.
https://snap.stanford.edu/data/web-flickr.html.
PredLig’s code is available for download at https://github.com/alpecli/predlig.
https://github.com/AKSW/mexproject.
Bias is the set of characteristics that collectively influence the way an algorithm searches for hypotheses that separate the classes of a problem.
It is important to note that we applied the Wilcoxon signed-ranks test 108 times, independently. Each time, the test verified whether there was a statistically significant difference between two algorithms: a classification algorithm and a modified version of itself (the combination of the algorithm with a feature selection configuration).
Table 9 highlights in bold font the experiment executions associated with the 26 experiment configurations that revealed significant difference in the hypothesis test.
In fact, ES2 was the only FS configuration that significantly improved SVM’s performance.
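
The paired comparison described in the note on the Wilcoxon signed-ranks test can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the two score lists are hypothetical values (integer percentage points), and the exact two-sided p-value obtained by enumerating all sign assignments is a choice made here for small samples; the article does not say which variant of the test was used.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_signed_rank(
    baseline: Sequence[float], modified: Sequence[float]
) -> Tuple[float, float]:
    """Return (W, p) for paired samples, where W = min(W+, W-) and p is the
    exact two-sided p-value computed by enumerating every sign assignment.
    Zero differences are discarded; tied absolute differences share the
    average of the ranks they span."""
    diffs = [m - b for b, m in zip(baseline, modified) if m != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next absolute difference is tied.
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis each difference is equally likely to be
    # positive or negative, so all 2**n sign patterns are equiprobable.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Hypothetical scores of one classifier without and with feature selection.
without_fs = [70, 65, 80, 72, 68, 75]
with_fs = [74, 66, 79, 78, 73, 77]
w, p = wilcoxon_signed_rank(without_fs, with_fs)
print(w, p)
```

The enumeration grows as 2^n, so this form is only practical for a small number of pairs; larger samples would normally rely on a normal approximation or a statistics library.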

Acknowledgements

This work has been partially supported by CNPq (307647/2012-9) and by CAPES (student scholarship).

Author information

Authors and Affiliations

Military Institute of Engineering (IME), Praca Gen. Tiburcio 80, Rio de Janeiro, Brazil
Antonio Pecli, Maria Claudia Cavalcanti & Ronaldo Goldschmidt

Corresponding author

Correspondence to Ronaldo Goldschmidt.

About this article

Cite this article

Pecli, A., Cavalcanti, M.C. & Goldschmidt, R. Automatic feature selection for supervised learning in link prediction applications: a comparative study. Knowl Inf Syst 56, 85–121 (2018). https://doi.org/10.1007/s10115-017-1121-6

Received: 01 April 2016
Revised: 08 June 2017
Accepted: 10 October 2017
Published: 25 October 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s10115-017-1121-6
