research-article

Beyond keyword search: discovering relevant scientific literature

Authors:

Khalid El-Arini,

Carlos Guestrin

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 439 - 447

https://doi.org/10.1145/2020408.2020479

Published: 21 August 2011

Abstract

In scientific research, it is often difficult to express information needs as simple keyword queries. We present a more natural way of searching for relevant scientific literature. Rather than a string of keywords, we define a query as a small set of papers deemed relevant to the research task at hand. By optimizing an objective function based on a fine-grained notion of influence between documents, our approach efficiently selects a set of highly relevant articles. Moreover, as scientists trust some authors more than others, results are personalized to individual preferences. In a user study, researchers found the papers recommended by our method to be more useful, trustworthy and diverse than those selected by popular alternatives, such as Google Scholar and a state-of-the-art topic modeling approach.
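The selection step described in the abstract can be illustrated with a toy sketch. The abstract does not state the objective or the optimizer, so everything below is an assumption made for illustration: the `influence` table of scores in [0, 1], the probabilistic-coverage objective in `coverage`, and the greedy loop in `greedy_select` are hypothetical and are not taken from the paper. The sketch only shows how a query given as a set of papers, combined with an objective that rewards covering each query paper, can favor a diverse result set over the top individually scored candidates.

```python
from typing import Dict, List

# Hypothetical data: influence[c][q] is an assumed score in [0, 1] for how
# strongly candidate paper c is related to query paper q.
Influence = Dict[str, Dict[str, float]]


def coverage(selected: List[str], query: List[str], influence: Influence) -> float:
    """Assumed objective: for each query paper, the probability that at
    least one selected paper influences it, summed over the query set."""
    total = 0.0
    for q in query:
        miss = 1.0  # probability that no selected paper covers q
        for c in selected:
            miss *= 1.0 - influence.get(c, {}).get(q, 0.0)
        total += 1.0 - miss
    return total


def greedy_select(query: List[str], influence: Influence, k: int) -> List[str]:
    """Pick up to k candidates, each time adding the paper with the
    largest marginal gain in the assumed coverage objective."""
    selected: List[str] = []
    candidates = set(influence) - set(query)
    for _ in range(min(k, len(candidates))):
        base = coverage(selected, query, influence)
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(candidates),
            key=lambda c: coverage(selected + [c], query, influence) - base,
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example: B scores higher than C on its own, but overlaps with A.
example_influence: Influence = {
    "A": {"q1": 0.9, "q2": 0.1},
    "B": {"q1": 0.8, "q2": 0.1},
    "C": {"q1": 0.1, "q2": 0.7},
}
example_query = ["q1", "q2"]

if __name__ == "__main__":
    print(greedy_select(example_query, example_influence, k=2))
```

In this toy data, candidate B has a higher standalone score than C, yet the second pick is C because B mostly repeats what A already covers. That is the kind of diversity a coverage-style objective tends to produce; whether the paper's objective behaves the same way is not stated in the abstract.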

Index Terms

Beyond keyword search: discovering relevant scientific literature
1. Information systems
  1. Information retrieval
2. Mathematics of computing
  1. Probability and statistics

Published In

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2011

1446 pages

ISBN:9781450308137

DOI:10.1145/2020408

General Chair: Chid Apte (IBM Research)

Program Chairs: Joydeep Ghosh (UT Austin), Padhraic Smyth (UC Irvine)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '11

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2011

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
