Abstract
Scientific literature recommender systems (SLRSs) provide papers to researchers according to their scientific interests. Such systems rely on inter-researcher similarity measures that are usually computed from publication contents (i.e., by extracting paper topics and citations). We highlight two major issues related to this design. The required full-text access and processing are expensive and hardly feasible. Moreover, clues about meetings, encounters, and informal exchanges between researchers (which relate to a social dimension) have not been exploited to date. To tackle these issues, we propose an original SLRS based on a threefold contribution. First, we argue the case for defining inter-researcher similarity measures that build on publicly available metadata. Second, we define topical and social measures that we combine to issue socio-topical recommendations. Third, we conduct an evaluation with 71 volunteer researchers to check researchers’ perception against socio-topical similarities. Experimental results show a significant 11.21% accuracy improvement of socio-topical recommendations compared to baseline topical recommendations.
Notes
It may be argued that Google Scholar (http://scholar.google.com) and derivatives, such as ArnetMiner (Tang et al. 2008) (http://arnetminer.org), meet this need. These search engines are certainly helpful for finding documents related to a query (e.g., bibliometrics). However, they cannot take a researcher’s name as input and recommend to him/her papers or other researchers that would be relevant to his/her overall scientific activity (as we intend to do in this paper).
Either subject to charges, like the ACM Portal (http://portal.acm.org) and SpringerLink (http://springerlink.com), or free, like CiteSeerX (http://citeseerx.ist.psu.edu), DBLP (http://www.informatik.uni-trier.de/~ley/db), or arXiv (http://arxiv.org).
TREC stands for the Text REtrieval Conference (see Voorhees and Harman 2005).
Available for download at http://trec.nist.gov/trec_eval.
A demonstration can be seen at http://www.irit.fr/~Guillaume.Cabanac/expeSimT.
“in the absence of significance tests, performance differences of less than 5% must be disregarded … broadly characterize performance differences, assumed significant, as noticeable if the difference is of the order of 5–10%, and as material if it is more than 10%.” Spärck Jones (1974) as cited by Sanderson (2010, p. 313).
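The rule of thumb quoted in the last note can be written down as a small helper. The following is only a minimal sketch: the function names `relative_percent` and `characterize_difference` are hypothetical, and the handling of the exact 5% and 10% boundaries is an assumption, since the quotation only says "of the order of 5–10%". The rule is stated for the case where no significance test is available, so applying it to a result that was tested for significance is purely illustrative.

```python
def relative_percent(baseline: float, candidate: float) -> float:
    """Relative difference of candidate over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(candidate - baseline) / baseline * 100.0


def characterize_difference(percent: float) -> str:
    """Label a relative performance difference (in percent).

    Thresholds follow the quoted rule of thumb:
      below 5%     -> to be disregarded
      5% to 10%    -> noticeable
      above 10%    -> material
    Boundary handling (5.0 and 10.0 count as "noticeable") is an assumption.
    """
    if percent < 0:
        raise ValueError("percent must be non-negative")
    if percent < 5.0:
        return "disregard"
    if percent <= 10.0:
        return "noticeable"
    return "material"


if __name__ == "__main__":
    # The abstract reports an 11.21% accuracy improvement.
    print(characterize_difference(11.21))
```

Under these assumptions, the 11.21% improvement reported in the abstract falls in the "material" band.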
References
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749. doi:10.1109/TKDE.2005.99.
Agarwal, N., Haque, E., Liu, H., & Parsons, L. (2005). Research paper recommender systems: A subspace clustering approach. In W. Fan, Z. Wu, & J. Yang (Eds.), WAIM’05: Proceedings of the 6th international conference on web-age information management. LNCS (Vol. 3739, pp. 475–491). New York: Springer. doi:10.1007/11563952_42.
Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2), 9–15. doi:10.1145/1480506.1480508.
Balabanović, M., & Shoham, Y. (1997). Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3), 66–72. doi:10.1145/245108.245124.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12), 29–38. doi:10.1145/138859.138861.
Ben Jabeur, L., Tamine, L., & Boughanem, M. (2010). A social model for Literature Access: Towards a weighted social network of authors. In RIAO’10: Proceedings of the 9th international conference on information retrieval and its applications. CDROM.
Biryukov, M. (2008). Co-author network analysis in DBLP: Classifying personal names. In MCO’08: Proceedings of the 2nd international conference on modelling, computation and optimization in information systems and management sciences. Communications in computer and information science (Vol. 14, pp. 399–408). New York: Springer. doi:10.1007/978-3-540-87477-5_43.
Bogers, T., & van den Bosch, A. (2008). Recommending scientific articles using CiteULike. In RecSys’08: Proceedings of the 4th ACM conference on recommender systems, ACM, New York, NY, USA (pp. 287–290). doi:10.1145/1454008.1454053.
Buckley, C., & Voorhees, E. M. (2000). Evaluating evaluation measure stability. In SIGIR’00: Proceedings of the 23rd international ACM SIGIR conference, ACM, New York, NY, USA (pp. 33–40). doi:10.1145/345508.345543.
Buckley, C., & Voorhees, E. M. (2005). Retrieval system evaluation. In E. M. Voorhees & D. K. Harman (Eds.), TREC: Experiment and evaluation in information retrieval (Chap. 3, pp. 53–75). Cambridge, MA: MIT Press.
Cazella, S. C., & Campos Alvares, L. O. (2005). Modeling user’s opinion relevance to recommending research papers. In UM’05: Proceedings of the 10th international conference on user modeling. LNCS (Vol. 3538, pp. 327–331). New York: Springer. doi:10.1007/11527886_42.
Cleverdon, C. W. (1962). Report on the testing and analysis of an investigation into the comparative efficiency of indexing systems. ASLIB Cranfield Research Project, Cranfield, UK.
Deng, H., King, I., & Lyu, M. R. (2008). Formal models for expert finding on DBLP bibliography data. In ICDM’08: Proceedings of the 8th IEEE international conference on data mining (pp. 163–172). Washington, DC: IEEE Computer Society. doi:10.1109/ICDM.2008.29.
Dolamic, L., & Savoy, J. (2010). When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1), 200–203. doi:10.1002/asi.21186.
Easley, D., & Kleinberg, J. (2010). Networks, crowds, and markets: Reasoning about a highly connected world. New York: Cambridge University Press.
Elmacioglu, E., & Lee, D. (2005). On six degrees of separation in DBLP-DB and more. SIGMOD Record, 34(2), 33–40. doi:10.1145/1083784.1083791.
Fox, C. (1989). A stop list for general text. SIGIR Forum, 24(1–2), 19–21. doi:10.1145/378881.378888.
Fox, E. A., & Shaw, J. A. (1993). Combination of multiple searches. In D. K. Harman (Ed.), TREC-1: Proceedings of the first text retrieval conference, NIST, Gaithersburg, MD, USA (pp. 243–252).
Garfield, E. (1955). Citation indexes for science: A new dimension in documentation through association of ideas. Science, 122(3159), 108–111. doi:10.1126/science.122.3159.108.
Garfield, E. (1996). What is the primordial reference for the phrase ‘Publish or perish’? The Scientist, 10(12), 11. http://www.the-scientist.com/article/display/17052.
Garfield, E. (2006). The history and meaning of the journal impact factor. Journal of the American Medical Association, 295(1), 90–93. doi:10.1001/jama.295.1.90.
Glenisson, P., Glänzel, W., Janssens, F., & Moor, B. D. (2005a). Combining full text and bibliometric information in mapping scientific disciplines. Information Processing and Management, 41(6), 1548–1572. doi:10.1016/j.ipm.2005.03.021.
Glenisson, P., Glänzel, W., & Persson, O. (2005b). Combining full-text analysis and bibliometric indicators. A pilot study. Scientometrics, 63(1), 163–180. doi:10.1007/s11192-005-0208-0.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. B. (1992). Using collaborative filtering to weave an Information Tapestry. Communications of the ACM, 35(12), 61–70. doi:10.1145/138859.138867.
Gori, M., & Pucci, A. (2006). Research paper recommender systems: A random-walk based approach. In WI’06: Proceedings of the 5th IEEE/WIC/ACM international conference on web intelligence, IEEE Computer Society, Los Alamitos, CA, USA (pp. 778–781). doi:10.1109/WI.2006.149.
Herlocker, J. L., Konstan, J. A., Terveen, L. G., & Riedl, J. T. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5–53. doi:10.1145/963770.963772.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572. doi:10.1073/pnas.0507655102.
Hirsch, J. E. (2010). An index to quantify an individual’s scientific research output that takes into account the effect of multiple coauthorship. Scientometrics, 85(3), 741–754. doi:10.1007/s11192-010-0193-9.
Huang, Z., Yan, Y., Qiu, Y., & Qiao, S. (2009). Exploring emergent semantic communities from DBLP bibliography database. In N. Memon & R. Alhajj (Eds.), ASONAM’09: Proceedings of the 1st international conference on advances in social network analysis and mining, IEEE Computer Society (pp. 219–224). doi:10.1109/ASONAM.2009.6.
Hubert, G., & Mothe, J. (2009). An adaptable search engine for multimodal information retrieval. Journal of the American Society for Information Science and Technology, 60(8), 1625–1634. doi:10.1002/asi.21091.
Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In SIGIR’93: Proceedings of the 16th annual international ACM SIGIR conference, ACM Press, New York, NY, USA (pp. 329–338). doi:10.1145/160688.160758.
Hurtado Martín, G., Cornelis, C., & Naessens, H. (2009). Training a personal alert system for research information recommendation. In J. P. Carvalho, D. Dubois, U. Kaymak, & J. M. C. Sousa (Eds.), IFSA/EUSFLAT’09: Proceedings of the joint 2009 international Fuzzy systems association world congress and 2009 European Society of fuzzy logic and technology conference (pp. 408–413).
Hurtado Martín, G., Schockaert, S., Cornelis, C., & Naessens, H. (2010). Metadata impact on research paper similarity. In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, & I. Frommholz (Eds.), ECDL’10: Proceedings of the 14th European conference on research and advanced technology for digital libraries. LNCS (Vol. 6273, pp. 457–460). New York: Springer. doi:10.1007/978-3-642-15464-5_56.
Janas, J. M. (1977). Automatic recognition of the part-of-speech for English texts. Information Processing & Management, 13(4), 205–213. doi:10.1016/0306-4573(77)90001-2.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446. doi:10.1145/582415.582418.
Karoui, H., Kanawati, R., & Petrucci, L. (2006). COBRAS: Cooperative CBR system for bibliographical reference recommendation. In ECCBR’06: Proceedings of the 8th European conference on advances in case-based reasoning. LNCS (Vol. 4106, pp. 76–90). New York: Springer. doi:10.1007/11805816_8.
Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1–2), 1–224. doi:10.1561/1500000012.
Klas, C. P., & Fuhr, N. (2000). A new effective approach for categorizing web documents. In Proceedings of the 22nd BCS-IRSG colloquium on IR research.
Lee, J. H. (1997). Analyses of multiple evidence combination. In SIGIR’97: Proceedings of the 20th annual international ACM SIGIR conference, ACM Press, New York, NY, USA (pp. 267–276). doi:10.1145/258525.258587.
Ley, M. (2002). The DBLP computer science bibliography: Evolution, research issues, perspectives. In A. H. F. Laender & A. L. Oliveira (Eds.), SPIRE’02: Proceedings of the 9th international conference on string processing and information retrieval. LNCS (Vol. 2476, pp. 1–10). New York: Springer. doi:10.1007/3-540-45735-6_1.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 5–54.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A., & Riedl, J. (2002). On the recommending of citations for research papers. In CSCW’02: Proceedings of the 2002 ACM conference on computer supported cooperative work, ACM, New York, NY, USA (pp. 116–125). doi:10.1145/587078.587096.
McNee, S. M., Kapoor, N., & Konstan, J. A. (2006). Don’t look stupid: Avoiding pitfalls when recommending research papers. In CSCW’06: Proceedings of the 2006 20th anniversary conference on computer supported cooperative work, ACM, New York, NY, USA (pp. 171–180). doi:10.1145/1180875.1180903.
Micarelli, A., Sciarrone, F., & Marinilli, M. (2007). Web document modeling. In P. Brusilovsky, A. Kobsa, & W. Nejdl (Eds.), The adaptive web. LNCS (Vol. 4321, pp. 155–192). New York: Springer. doi:10.1007/978-3-540-72079-9_5.
Milgram, S. (1967). The small-world problem. Psychology Today, 1(1), 61–67.
Mimno, D., & McCallum, A. (2007). Mining a digital library for influential authors. In JCDL’07: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries, ACM, New York, NY, USA (pp. 105–106). doi:10.1145/1255175.1255196.
Mittelbach, F., & Goossens, M. (2005). LaTeX companion (2nd ed.). Boston, MA: Pearson Education.
Montaner, M., López, B., & de la Rosa, J. L. (2003). A taxonomy of recommender agents on the Internet. Artificial Intelligence Review, 19(4), 285–330. doi:10.1023/A:1022850703159.
Naak, A., Hage, H., & Aïmeur, E. (2009). A multi-criteria collaborative filtering approach for research paper recommendation in papyres. In MCETECH’09: Proceedings of the 4th international conference on E-technologies: Innovation in an open world. LNBIP (Vol. 26, pp. 25–39). New York: Springer. doi:10.1007/978-3-642-01187-0_3.
Porcel, C., López-Herrera, A. G., & Herrera-Viedma, E. (2009). A recommender system for research resources based on fuzzy linguistic modeling. Expert Systems with Applications, 36(3), 5173–5183. doi:10.1016/j.eswa.2008.06.038.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Powley, B., & Dale, R. (2007). Evidence-based information extraction for high accuracy citation and author name identification. In RIAO’07: Proceedings of the 8th conference on information retrieval and its applications. CID, CDROM.
Reips, U. D. (2002). Standards for Internet-based experimenting. Experimental Psychology, 49(4), 243–256. doi:10.1026//1618-3169.49.4.243.
Reips, U. D. (2007). The methodology of Internet-based experiments. In A. N. Joinson, K. Y. A. McKenna, T. Postmes, & U. D. Reips (Eds.), The Oxford handbook of Internet psychology. New York: Oxford University Press (Chap. 24, pp. 373–390).
Reips, U. D., & Lengler, R. (2005). The Web experiment list: A Web service for the recruitment of participants and archiving of Internet-based experiments. Behavior Research Methods, 37(2), 287–292.
Reitz, F., & Hoffmann, O. (2010). An analysis of the evolving coverage of computer science sub-fields in the DBLP digital library. In M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, & I. Frommholz (Eds.), ECDL’10: Proceedings of the 14th European conference on research and advanced technology for digital libraries. LNCS (Vol. 6273, pp. 216–227). New York: Springer. doi:10.1007/978-3-642-15464-5_23.
Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58. doi:10.1145/245108.245121.
Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., & Steyvers, M. (2010). Learning author-topic models from text corpora. ACM Transactions on Information Systems, 28(1), 4:1–4:38. doi:10.1145/1658377.1658381.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In UAI’04: Proceedings of the 20th annual conference on uncertainty in artificial intelligence, AUAI Press, Arlington, Virginia (pp. 487–494).
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523. doi:10.1016/0306-4573(88)90021-0.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. doi:10.1145/361219.361220.
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375. doi:10.1561/1500000009.
Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In SIGIR’05: Proceedings of the 28th annual international ACM SIGIR conference, ACM, New York, NY, USA (pp. 162–169). doi:10.1145/1076034.1076064.
Spärck Jones, K. (1973). Index term weighting. Information Storage and Retrieval, 9(11), 619–633. doi:10.1016/0020-0271(73)90043-0.
Spärck Jones, K. (1974). Automatic indexing. Journal of Documentation, 30(4), 393–432. doi:10.1108/eb026588.
Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. doi:10.2307/2331554.
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008). ArnetMiner: Extraction and mining of academic social networks. In KDD’08: Proceeding of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA (pp. 990–998). doi:10.1145/1401890.1402008.
Travers, J., & Milgram, S. (1969). An experimental study of the small world problem. Sociometry, 32(4), 425–443. doi:10.2307/2786545.
Tsatsaronis, G., Varlamis, I., Stamou, S., Nørvåg, K., & Vazirgiannis, M. (2009). Semantic relatedness hits bibliographic data. In WIDM’09: Proceeding of the 11th international workshop on Web information and data management, ACM, New York, NY, USA (pp. 87–90). doi:10.1145/1651587.1651607.
Voorhees, E. M. (2002). The philosophy of information retrieval evaluation. In C. Peters, M. Braschler, J. Gonzalo, & M. Kluck (Eds.), CLEF’01: Second workshop of the cross-language evaluation forum. LNCS (Vol. 2406, pp. 355–370). New York: Springer. doi:10.1007/3-540-45691-0_34.
Voorhees, E. M., & Harman, D. K. (2005). TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. doi:10.2307/3001968.
Yan, E., & Ding, Y. (2009). Applying centrality measures to impact analysis: A coauthorship network analysis. Journal of the American Society for Information Science and Technology, 60(10), 2107–2118. doi:10.1002/asi.21128.
Yang, Z., Hong, L., & Davison, B. D. (2010). Topic-driven multi-type citation network analysis. In RIAO’10: Proceedings of the 9th international conference on information retrieval and its applications. CDROM.
Zamparelli, R. (1998). Internet publications: Pay-per-use or pay-per-subscription? In C. Nikolaou & C. Stephanidis (Eds.), ECDL’98: Proceedings of the 2nd European conference on research and advanced technology for digital libraries. LNCS (Vol. 1513, pp. 635–636). New York: Springer. doi:10.1007/3-540-49653-X_38.
Zhou, D., Orshanskiy, S. A., Zha, H., & Giles, C. L. (2007). Co-ranking authors and documents in a heterogeneous network. In ICDM’07: Proceedings of the 7th IEEE international conference on data mining (pp. 739–744). doi:10.1109/ICDM.2007.57.
Acknowledgments
The constructive criticisms and suggestions of the referees are warmly acknowledged. I am also grateful to the 71 volunteer researchers who took part in the experiment reported in this paper. Their feedback, comments, and insightful advice have been a source of stimulating thinking. Finally, I am indebted to Anaïs Lefeuvre for her involvement in this work as a research assistant.
Cite this article
Cabanac, G. Accuracy of inter-researcher similarity measures based on topical and social clues. Scientometrics 87, 597–620 (2011). https://doi.org/10.1007/s11192-011-0358-1
DOI: https://doi.org/10.1007/s11192-011-0358-1