Abstract
Natural Language Processing (NLP) techniques enable automated analysis of large document collections, making it possible to compare researcher profiles quantitatively based on their publications. This paper proposes a novel system for measuring researcher similarity that combines several techniques, including topic modelling, Word2vec and Word Mover Distance calculations on publication abstracts. The proposed method, implemented in Python, matches researchers based on the text of their documents by evaluating the semantic meanings of words and topics. The distances between researchers are calculated over various text features in a hierarchical structure. Results show that the system successfully identifies existing co-authorships in sample data even after co-authorship properties have been removed, and suggests valid potential academic collaboration links from related research areas irrespective of previous collaboration activity.
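To make the idea of a word-level distance feeding a researcher-level distance concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions: the 2-D vectors in `TOY_VECTORS` are made up and stand in for trained Word2vec embeddings, the document distance is the relaxed lower bound on Word Mover Distance (each word sends all of its weight to its nearest counterpart) and not the exact optimal-transport solution, and the aggregation in `researcher_distance` is a hypothetical choice, since the abstract does not specify how the hierarchy is combined.

```python
import math
from collections import Counter

# Hypothetical toy embeddings; a real system would load trained Word2vec vectors.
TOY_VECTORS = {
    "topic": (1.0, 0.0),
    "model": (0.9, 0.1),
    "lda": (0.8, 0.2),
    "protein": (0.0, 1.0),
    "gene": (0.1, 0.9),
}


def nbow(tokens):
    """Normalised bag-of-words weights over the tokens that have a vector."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst):
    """Send each source word's full weight to its nearest destination word."""
    return sum(
        weight * min(euclidean(TOY_VECTORS[w], TOY_VECTORS[x]) for x in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed Word Mover Distance: the larger of the two one-sided bounds."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    if not a or not b:
        raise ValueError("each document needs at least one known word")
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def researcher_distance(abstracts_a, abstracts_b):
    """Assumed aggregation: match each abstract to the other researcher's
    closest abstract, average those distances, and symmetrise."""
    def directed(src, dst):
        return sum(min(relaxed_wmd(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(abstracts_a, abstracts_b)
                  + directed(abstracts_b, abstracts_a))


if __name__ == "__main__":
    alice = [["topic", "model"], ["lda", "topic"]]
    bob = [["lda", "model"]]
    carol = [["protein", "gene"]]
    print("alice vs bob:  ", researcher_distance(alice, bob))
    print("alice vs carol:", researcher_distance(alice, carol))
```

In this toy setting the two researchers who write about topic models end up closer to each other than either is to the researcher who writes about genes. For exact Word Mover Distance on real embeddings, an existing routine such as gensim's `KeyedVectors.wmdistance` could replace `relaxed_wmd`.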
Notes
1. Downloaded from https://github.com/idio/wiki2vec/.
2. Check https://dev.elsevier.com/sc_apis.html for more details.
Acknowledgements
Dr Joel Nothman from the Sydney Informatics Hub provided valuable suggestions and feedback on this work.
Prof. Nick Enfield, director of SSSHARC, Faculty of Arts and Social Sciences, the University of Sydney, initiated the question and supported this work.
The major development was conducted under the Capstone student project program initiated by the School of IT, the University of Sydney.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sun, C., Ng, K.T.J., Henville, P., Marchant, R. (2019). Hierarchical Word Mover Distance for Collaboration Recommender System. In: Islam, R., et al. Data Mining. AusDM 2018. Communications in Computer and Information Science, vol 996. Springer, Singapore. https://doi.org/10.1007/978-981-13-6661-1_23
DOI: https://doi.org/10.1007/978-981-13-6661-1_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6660-4
Online ISBN: 978-981-13-6661-1
eBook Packages: Computer Science, Computer Science (R0)