
Integrating Linguistic and Citation Information with Transformer for Predicting Top-Cited Papers

  • Conference paper
Web Information Systems and Technologies (WEBIST 2022)

Abstract

Academic literature contains many different types of data, including text, citations, figures, and tables. Released in 2017 for natural language processing, the Transformer model is now widely used in fields as diverse as image processing and network science. Transformer models pre-trained on large datasets have been shown to perform well on a range of tasks when fine-tuned on small amounts of additional data. Most studies of classification and regression on academic literature have designed individually customized features for specific tasks without fully considering interactions among the data, and customizing features for each task is costly and complex. This paper addresses this issue by proposing a basic framework that can be applied consistently to various tasks using the diverse information in academic literature. Specifically, we propose an end-to-end fusion method that combines the linguistic and citation information of academic literature using two Transformer models. The experiments were conducted on datasets covering 67 disciplines extracted from the Web of Science, one of the largest databases of academic literature, and classified papers in the top 20% by citations five years after publication. The results show that the proposed method improves the F-measure by 0.028 on average compared to using only citation or linguistic information. Repeated experiments on the 67 datasets also showed that the proposed model has the smallest standard deviation of F-values; in other words, our proposed method shows the best average performance and is stable, with small variance in the F-value. We also conducted a comparative analysis between the datasets’ characteristics and the proposed method’s performance. The results show that the performance of our proposed method correlates only weakly with the datasets’ characteristics; in other words, our proposed method is highly versatile. Based on the above, our proposed method is superior in terms of F-value, learning stability, and generality.
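As a rough, hedged illustration of the fusion idea described in the abstract (not the authors' implementation), the sketch below concatenates a SciBERT [CLS] embedding of an abstract with a precomputed citation-graph embedding and feeds the result to a small classification head. The citation embedding, its dimension, the concatenation-based fusion, and the head architecture are illustrative assumptions; the paper's actual model may differ.

```python
# Minimal sketch (not the authors' code) of fusing linguistic and citation
# representations for binary top-cited classification. SciBERT encodes the
# abstract; the citation embedding here is a zero-vector stand-in for a
# Graph-BERT-style representation (an assumption for illustration only).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class FusionClassifier(nn.Module):
    def __init__(self, citation_dim=128, hidden=256):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        text_dim = self.text_encoder.config.hidden_size  # 768 for SciBERT
        self.head = nn.Sequential(
            nn.Linear(text_dim + citation_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # top-20%-cited vs. not
        )

    def forward(self, input_ids, attention_mask, citation_emb):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]              # [CLS] token embedding
        fused = torch.cat([cls, citation_emb], dim=-1) # simple concatenation fusion
        return self.head(fused)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
batch = tokenizer(["An example abstract about solar cells."],
                  padding=True, truncation=True, return_tensors="pt")
citation_emb = torch.zeros(1, 128)  # placeholder citation-graph embedding
logits = FusionClassifier()(batch["input_ids"], batch["attention_mask"], citation_emb)
```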


Notes

  1. Web of Science: https://www.webofknowledge.com.

  2. SciBERT: https://github.com/allenai/scibert.


Acknowledgement

This article is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and supported by JSPS KAKENHI Grant Number JP21K17860 and JP21K12068. The funders had no role in the study design, data collection and analysis, decision to publish, or manuscript preparation.

Author information

Correspondence to Masanao Ochi.

Appendices

A Queries from Web of Science

We list in Tables 3, 4, 5 and 6 the queries we used to create the datasets from the Web of Science. The “Dataset ID” used in these tables is common across all of them. Each dataset is defined by multiple queries, and we registered an article in the dataset if it matched any one of those queries.

Table 3. Queries from Web of Science Dataset 1–20.
Table 4. Queries from Web of Science Dataset 21–40.
Table 5. Queries from Web of Science Dataset 41–60.
Table 6. Queries from Web of Science Dataset 61–70.

B Network and Linguistic Features for Each Dataset

We present the features of each extracted dataset in Table 7. We targeted relatively small research areas when selecting queries; as a result, the extracted datasets contain between 1,140 and 31,758 papers.

Two types of features are available: one set for citation information and one for linguistic information. The citation features are mainly network-related indicators; these are the first six (Number of Articles, Number of Nodes, Number of Edges, Network Density, Average Degree, and Gini Coefficient of Degree Distribution). The linguistic features are the last two (Number of Abstracts, Word Perplexity).

  • Number of Articles: the number of papers in each dataset.
  • Number of Nodes: the number of nodes in the citation network, including papers cited by papers in the dataset.
  • Number of Edges: the number of edges in the network.
  • Network Density: with N nodes and E edges, the density is \(D = \frac{2E}{N(N-1)}\).
  • Average Degree: the average number of edges per node, calculated as \(\frac{E}{N}\).
  • Gini Coefficient of Degree Distribution: the Gini coefficient of the degree distribution, indicating whether edges are concentrated on a few nodes.
  • Number of Abstracts: the number of papers in the dataset for which abstract text is available. The language model may not be fully utilized if this value is low.
  • Word Perplexity: an indicator of lexical diversity. If V is the vocabulary of a dataset D and \(P(w_i)\) denotes the proportion of occurrences of word \(w_i\), the perplexity is calculated as \(PP(D) = \prod_{i=1}^{|V|} P(w_i)^{-P(w_i)}\).
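As a concrete, hedged illustration of these definitions (not code from the paper), the sketch below computes network density, average degree, the Gini coefficient of the degree distribution, and word perplexity. The random graph and whitespace tokenization are placeholder assumptions standing in for a real citation network and abstract text.

```python
# Sketch of the dataset-level features defined above, using networkx for the
# citation graph and a plain token list for the abstracts.
import math
from collections import Counter
import networkx as nx

def network_features(g: nx.Graph):
    n, e = g.number_of_nodes(), g.number_of_edges()
    density = 2 * e / (n * (n - 1))   # D = 2E / (N(N-1))
    avg_degree = e / n                # E / N, as defined above
    degrees = sorted(d for _, d in g.degree())
    # Gini coefficient of the degree distribution (sorted ascending)
    cum = sum((i + 1) * d for i, d in enumerate(degrees))
    gini = (2 * cum) / (n * sum(degrees)) - (n + 1) / n
    return density, avg_degree, gini

def word_perplexity(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    # PP(D) = prod_i P(w_i)^(-P(w_i)) = exp(word entropy in nats)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

g = nx.gnm_random_graph(1000, 5000)   # stand-in for a citation network
print(network_features(g))
print(word_perplexity("the transformer model is widely used".split()))
```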

Table 7. Network and linguistic features for each dataset.

C Result for Each Dataset

We present our results for each dataset in Table 8. Checking the results per dataset, we find that the proposed method achieves the best F-value on only 19 of the 67 datasets, while the “Graph-BERT” method performs best on 34 of the 67 datasets.

Table 8. Classification Results for each dataset.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ochi, M., Shiro, M., Mori, J., Sakata, I. (2023). Integrating Linguistic and Citation Information with Transformer for Predicting Top-Cited Papers. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2022. Lecture Notes in Business Information Processing, vol 494. Springer, Cham. https://doi.org/10.1007/978-3-031-43088-6_7


  • DOI: https://doi.org/10.1007/978-3-031-43088-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43087-9

  • Online ISBN: 978-3-031-43088-6

  • eBook Packages: Computer Science (R0)
