Abstract
This paper presents GraphDBLP, a system that models the DBLP bibliography as a graph database for performing graph-based queries and social network analyses. GraphDBLP also enriches the DBLP data through semantic keyword similarities computed via word-embedding. In this paper, we discuss how the system was formalized as a multi-graph, and how similarity relations were identified through word2vec. We also provide three meaningful queries for exploring the DBLP community to (i) investigate author profiles by analysing their publication records; (ii) identify the most prolific authors on a given topic, and (iii) perform social network analyses over the whole community. To date, GraphDBLP contains 5+ million nodes and 24+ million relationships, enabling users to explore the DBLP data by referencing more than 3.3 million publications, 1.7 million authors, and more than 5 thousand publication venues. Through the use of word-embedding, more than 7.5 thousand keywords and related similarity values were collected. GraphDBLP was implemented on top of the Neo4j graph database. The whole dataset and the source code are publicly available to foster the improvement of GraphDBLP in the whole computer science community.
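As a rough illustration of the data model summarised above, the following Python sketch stores a toy bibliography as a labelled multi-graph and answers a query of type (ii), ranking authors by the number of their publications on a given topic. All node names, the edge labels `AUTHORED` and `CONTAINS`, and the helper `prolific_authors` are hypothetical and chosen only for this example; they are not the GraphDBLP schema, nor its Neo4j queries.

```python
from collections import Counter

# Toy labelled multi-graph: each edge is (source, label, target).
# Several edges may connect the same pair of nodes under different labels.
edges = [
    ("alice", "AUTHORED", "paper1"),
    ("alice", "AUTHORED", "paper2"),
    ("bob", "AUTHORED", "paper2"),
    ("bob", "AUTHORED", "paper3"),
    ("paper1", "CONTAINS", "graph_database"),
    ("paper2", "CONTAINS", "graph_database"),
    ("paper3", "CONTAINS", "word_embedding"),
]


def targets(source, label):
    """Return the nodes reached from `source` through edges carrying `label`."""
    return [t for s, l, t in edges if s == source and l == label]


def prolific_authors(keyword):
    """Rank authors by how many of their publications contain `keyword`."""
    counts = Counter()
    for author, label, paper in edges:
        if label == "AUTHORED" and keyword in targets(paper, "CONTAINS"):
            counts[author] += 1
    return counts.most_common()


ranking = prolific_authors("graph_database")
print(ranking)
```

Storing edges as labelled triples keeps the sketch close to the multi-graph view: a query is a filter on labels followed by a traversal, which is the kind of pattern a graph database evaluates natively.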
Notes
In this work, venues include conferences and journals.
A multi-graph is a graph where multiple edges between two nodes are permitted and might be specified through labels. Our notation was inspired by [17].
In addition to single words, n-grams can also be mapped to vectors. An n-gram is a sequence of n consecutive words. As outlined in Section 3.2, frequent co-occurrences of n consecutive words are identified and replaced by a single token, e.g., machine learning is replaced by machine_learning.
A similar (but reversed) problem is addressed by the Skip-n-gram model, i.e., training a neural network to predict the representation of n context words from the representation of w. The Skip-n-gram approach can be summarised as “predicting the context given a word”, while CBOW, in a nutshell, is “predicting the word given a context”.
Py2neo Python library. Available: http://py2neo.org/
Though the same result could be achieved by adding a property to the node, the use of multiple labels allows one to immediately access the nodes with the desired label.
Performed using the stop-word dictionary provided by the NLTK framework [10].
The edges are selected using the Similarity label.
The idea is inspired by [7], though its authors compute the weight of triples through arithmetic functions.
The lower quartile is the 25th percentile while the upper quartile is the 75th percentile.
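The n-gram replacement and the two word2vec training schemes described in the notes above can be illustrated with a small standard-library sketch. It merges a given bigram into a single token and then lists the (input, target) pairs that CBOW and Skip-n-gram would be trained on, using a window of one word on each side. The function names, the fixed phrase list and the window size are assumptions made for this example; a real pipeline would detect frequent phrases statistically and train an actual neural model.

```python
def merge_phrases(tokens, phrases):
    """Replace each listed bigram by a single underscore-joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def context(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_pairs(tokens, window=1):
    """CBOW: predict the word given its context."""
    return [(context(tokens, i, window), w) for i, w in enumerate(tokens)]


def skipgram_pairs(tokens, window=1):
    """Skip-n-gram: predict each context word given the word."""
    return [(w, c) for i, w in enumerate(tokens) for c in context(tokens, i, window)]


tokens = merge_phrases(
    ["machine", "learning", "on", "graphs"],
    phrases={("machine", "learning")},
)
print(tokens)
print(cbow_pairs(tokens))
print(skipgram_pairs(tokens))
```

The two pair lists contain the same information in opposite directions: CBOW groups the context on the input side, whereas Skip-n-gram emits one pair per context word with the centre word as input.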
References
Adomavicius G, Sankaranarayanan R, Sen S, Tuzhilin A (2005) Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans Inf Syst (TOIS) 23(1):103–145
Aggarwal C C (2011) An introduction to social network data analytics. Socl Netw Data Anal 1–15
Albanese M, d’Acierno A, Moscato V, Persia F, Picariello A (2013) A multimedia recommender system. ACM Trans Internet Technol (TOIT) 13(1):3
Amato F, Moscato V, Picariello A, Piccialli F (2017) Sos: a multimedia recommender system for online social networks. Fut Gen Comput Syst
Angles R, Gutierrez C (2008) Survey of graph database models. ACM Comput Surv (CSUR) 40(1):1
Bao J, Zheng Y, Wilkie D, Mokbel M (2015) Recommendations in location-based social networks: a survey. GeoInformatica 19(3):525–565
Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A (2004) The architecture of complex weighted networks. Proc Natl Acad Sci USA 101(11):3747–3752
Belák V, Lam S, Hayes C (2012) Cross-community influence in discussion fora. ICWSM 12:34–41
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media Inc.
Boselli R, Cesarini M, Marrara S, Mercorio F, Mezzanzanica M, Pasi G, Viviani M (2017) Wolmis: a labor market intelligence system for classifying web job vacancies. J Intell Inf Syst. https://doi.org/10.1007/s10844-017-0488-x
Boselli R, Cesarini M, Mercorio F, Mezzanzanica M (2017) Using machine learning for labour market intelligence. In: Altun Y, Das K, Mielikäinen T, Malerba D, Stefanowski J, Read J, Zitnik M, Ceci M, Dzeroski S (eds) Machine learning and knowledge discovery in databases - European conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part III, Lecture Notes in Computer Science, vol 10536. Springer, pp 330–342. https://doi.org/10.1007/978-3-319-71273-4_27 (to appear in print)
Boselli R, Cesarini M, Mercorio F, Mezzanzanica M, Vaccarino A (2017) A pipeline for multimedia twitter analysis through graph databases: preliminary results. In: DATA 2017 - the international conference on data technologies and applications. https://doi.org/10.5220/0006490703430349
Cattell R (2011) Scalable sql and nosql data stores. ACM Sigmod Record 39(4):12–27
Chikhaoui B, Chiazzaro M, Wang S (2015) A new granger causal model for influence evolution in dynamic social networks: the case of dblp. In: AAAI, pp 51–57
Colace F, De Santo M, Greco L, Moscato V, Picariello A (2015) A collaborative user-centered framework for recommending items in online social networks. Comput Hum Behav 51:694–704
Consens M P, Mendelzon A O (1990) Graphlog: a visual formalism for real life recursion. In: Proceedings of the ninth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. ACM, pp 404–416
Deng H, King I, Lyu M R (2008) Formal models for expert finding on dblp bibliography data. In: Eighth IEEE international conference on data mining, 2008. ICDM’08. IEEE, pp 163–172
Diederich J, Balke W T, Thaden U (2007) Demonstrating the semantic growbag: automatically creating topic facets for faceteddblp. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, pp 505–505
Distributed graph database (2017) http://titan.thinkaurelius.com/
Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25
Elmacioglu E, Lee D (2005) On six degrees of separation in dblp-db and more. ACM SIGMOD Record 34(2):33–40
Girvan M, Newman M E (2002) Community structure in social and biological networks. Proc Nat Acad Sci 99(12):7821–7826
Han J, Haihong E, Le G, Du J (2011) Survey on nosql database. In: 2011 6th international conference on pervasive computing and applications (ICPCA). IEEE, pp 363–366
Jiang M, Cui P, Chen X, Wang F, Zhu W, Yang S (2015) Social recommendation with cross-domain transferable knowledge. IEEE Trans Knowl Data Eng 27(11):3084–3097
Le T, Zhang D (2015) Dblpminer: a tool for exploring bibliographic data. In: 2015 IEEE international conference on information reuse and integration (IRI). IEEE, pp 435–442
Lee S, Song SI, Kahng M, Lee D, Lee SG (2011) Random walk based entity ranking on graph for multidimensional recommendation. In: Proceedings of the fifth ACM conference on recommender systems. ACM, pp 93–100
Ley M (2009) Dblp: some lessons learned. Proc VLDB Endow 2(2):1493–1500
Li X, Chen H (2013) Recommendation as link prediction in bipartite graphs: a graph kernel-based machine learning approach. Decis Support Syst 54(2):880–890
Liu L, Tang J, Han J, Jiang M, Yang S (2010) Mining topic-level influence in heterogeneous networks. In: Proceedings of the 19th ACM international conference on information and knowledge management. ACM, pp 199–208
Marrara S, Pasi G, Viviani M, Cesarini M, Mercorio F, Mezzanzanica M, Pappagallo M A language modelling approach for discovering novel labour market occupations from the web. In: Sheth AP, Ngonga A, Wang Y, Chang E, Slezak D, Franczyk B, Alt R, Tao X, Unland R (eds) Proceedings of the international conference on web intelligence. ACM, Leipzig, pp 1026–1034. https://doi.org/10.1145/3106426.3109035
Mehmood Y, Barbieri N, Bonchi F, Ukkonen A (2013) Csi: community-level social influence analysis. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 48–63
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mikolov T, Yih WT, Zweig G (2013) Linguistic regularities in continuous space word representations. In: Hlt-naacl, vol 13, pp 746–751
Moreira C, Calado P, Martins B (2015) Learning to rank academic experts in the dblp dataset. Expert Syst 32(4):477–493
Nascimento M A, Sander J, Pound J (2003) Analysis of sigmod’s co-authorship graph. ACM Sigmod Record 32(3):8–10
Newman M E (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256
Newman M E (2004) Who is the best connected scientist? A study of scientific coauthorship networks. In: Complex networks. Springer, pp 337–370
Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Disc 24(3):515–554
Pham T A N, Li X, Cong G, Zhang Z (2015) A general graph-based model for recommendation in event-based social networks. In: 2015 IEEE 31st international conference on data engineering (ICDE). IEEE, pp 567–578
Ricci F, Rokach L, Shapira B, Kantor P B (2015) Recommender systems handbook. Springer
Scott J (2017) Social network analysis. Sage
Stonebraker M (2010) Sql databases v. nosql databases. Commun ACM 53(4):10–11
Tagarelli A, Interdonato R (2013) Ranking vicarious learners in research collaboration networks. In: International conference on Asian digital libraries. Springer, pp 93–102
Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816
Tesoriero C (2013) Getting started with orientDB. Packt Publishing Ltd
Watts D J, Strogatz S H (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442
Webber J (2012) A programmatic introduction to neo4j. In: Proceedings of the 3rd annual conference on systems, programming, and applications: software for humanity. ACM, pp 217–218
Wu Y, Cao N, Gotz D, Tan Y P, Keim D A (2016) A survey on visual analytics of social media data. IEEE Trans Multimed 18(11):2135–2148
Zaiane O R, Chen J, Goebel R (2007) Dbconnect: mining research community on dblp data. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 74–81
About this article
Cite this article
Mezzanzanica, M., Mercorio, F., Cesarini, M. et al. GraphDBLP: a system for analysing networks of computer scientists through graph databases. Multimed Tools Appl 77, 18657–18688 (2018). https://doi.org/10.1007/s11042-017-5503-2
DOI: https://doi.org/10.1007/s11042-017-5503-2