Abstract
Scholarly digital libraries provide access to scientific publications and are useful resources for researchers searching for literature on specific subject areas. CiteSeerX is one such digital library search engine: it provides access to more than 10 million academic documents and has nearly one million users and three million hits per day. Artificial Intelligence (AI) technologies are used in many components of CiteSeerX, including Web crawling, document ingestion, and metadata extraction. CiteSeerX also uses an unsupervised algorithm called noun phrase chunking (NP-Chunking) to extract keyphrases from documents. However, NP-Chunking often extracts many unimportant noun phrases. In this paper, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in CiteSeerX for extracting high-quality keyphrases. To perform user evaluations of the keyphrases predicted by the different models, we integrate a voting interface into CiteSeerX. We describe the development and deployment of the keyphrase extraction models and their maintenance requirements.
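To make the NP-Chunking baseline concrete, the following is a minimal sketch of noun phrase chunking over part-of-speech-tagged tokens. It is not the CiteSeerX implementation: the Penn Treebank tag sets, the pattern (zero or more adjectives followed by one or more nouns), and the function name `np_chunks` are assumptions made for illustration, and the input is assumed to be already tagged.

```python
from typing import List, Tuple

# Penn Treebank tag sets (assumed tagging scheme for this sketch).
ADJ_TAGS = {"JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def np_chunks(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal spans matching (adjective)* (noun)+ as candidate phrases."""
    phrases: List[str] = []
    current: List[str] = []   # tokens of the span being built
    has_noun = False          # whether the current span already contains a noun

    def flush() -> None:
        # Keep the span only if it contains at least one noun, then reset.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for token, tag in tagged:
        if tag in NOUN_TAGS:
            current.append(token)
            has_noun = True
        elif tag in ADJ_TAGS:
            # An adjective after a noun starts a new candidate span.
            if has_noun:
                flush()
            current.append(token)
        else:
            flush()
    flush()
    return phrases


# Hand-tagged example sentence (tags are illustrative, not tagger output).
tagged_sentence = [
    ("We", "PRP"), ("investigate", "VBP"), ("supervised", "JJ"),
    ("keyphrase", "NN"), ("extraction", "NN"), ("models", "NNS"),
    ("for", "IN"), ("a", "DT"), ("digital", "JJ"), ("library", "NN"),
    ("search", "NN"), ("engine", "NN"), (".", "."),
]
print(np_chunks(tagged_sentence))
```

Because the chunker returns every span that matches the pattern and does no ranking, it keeps unimportant noun phrases alongside useful ones, which is the weakness that motivates the supervised models.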
Notes
1. We used the Stanford NLP part-of-speech tagger.
Acknowledgements
We thank the National Science Foundation (NSF) for supporting this research through grants CNS-1853919, IIS-1914575, and IIS-1813571. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of NSF. We also thank our anonymous reviewers for their constructive feedback.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Patel, K., Caragea, C., Wu, J., Giles, C.L. (2020). Keyphrase Extraction in Scholarly Digital Library Search Engines. In: Ku, WS., Kanemasa, Y., Serhani, M.A., Zhang, LJ. (eds) Web Services – ICWS 2020. ICWS 2020. Lecture Notes in Computer Science, vol 12406. Springer, Cham. https://doi.org/10.1007/978-3-030-59618-7_12
DOI: https://doi.org/10.1007/978-3-030-59618-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59617-0
Online ISBN: 978-3-030-59618-7
eBook Packages: Computer Science (R0)