Abstract
Academic literature contains many types of data, including text, citations, figures, and tables. The Transformer model, introduced in 2017 for natural language processing, is now widely used in fields as diverse as image processing and network science. Transformer models pretrained on large datasets have been shown to perform well on different tasks after additional training on small amounts of new data. Most studies of classification and regression on academic literature have designed features customized to individual tasks without fully considering interactions among data types, and customizing features for each task is costly and complex. This paper addresses this issue by proposing a basic framework that can be applied consistently to various tasks drawing on the diverse information in academic literature. Specifically, we propose an end-to-end fusion method that combines the linguistic and citation information of academic papers using two Transformer models. Experiments were conducted on a dataset covering 67 disciplines extracted from the Web of Science, one of the largest databases of academic literature, with the task of classifying papers in the top 20% of citations five years after publication. The results show that the proposed method improves the F-measure by 0.028 on average compared to using only citation or only linguistic information. Repeated experiments on the 67 datasets also show that the proposed model has the smallest standard deviation of F-measures; in other words, it achieves the best average performance with the most stable results. We also analyzed the relationship between dataset characteristics and the performance of the proposed method and found only weak correlations, indicating that the method is highly versatile. Based on the above, our proposed method is superior in terms of F-measure, learning stability, and generality.
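To make the idea of fusing two Transformers concrete, the following PyTorch sketch combines a SciBERT text embedding with an embedding from a citation-side Transformer by concatenation before a binary classifier (top 20% cited vs. not). The late-fusion-by-concatenation design, the `citation_encoder` placeholder, and its dimensionality are assumptions for illustration only; the paper's exact architecture may differ.

```python
# Minimal sketch of a two-Transformer fusion classifier (illustrative, not the authors' code).
import torch
import torch.nn as nn
from transformers import AutoModel


class FusionClassifier(nn.Module):
    def __init__(self, citation_encoder: nn.Module, citation_dim: int = 128):
        super().__init__()
        # Language side: SciBERT, pretrained on scientific text.
        self.text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        # Citation side: any Transformer over citation-graph inputs (placeholder module).
        self.citation_encoder = citation_encoder
        hidden = self.text_encoder.config.hidden_size
        # Binary head: top-20%-cited vs. not.
        self.classifier = nn.Linear(hidden + citation_dim, 2)

    def forward(self, input_ids, attention_mask, citation_inputs):
        # [CLS] token embedding as the document-level text representation.
        text_vec = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # (batch, citation_dim) embedding from the citation Transformer.
        cite_vec = self.citation_encoder(citation_inputs)
        # Late fusion by concatenation, trained end to end.
        return self.classifier(torch.cat([text_vec, cite_vec], dim=-1))
```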
Notes
1. Web of Science: https://www.webofknowledge.com.
2. SciBERT: https://github.com/allenai/scibert.
References
Acuna, D.E., Allesina, S., Kording, K.P.: Predicting scientific success. Nature 489(7415), 201–202 (2012). https://doi.org/10.1038/489201a
Alon, U., Yahav, E.: On the bottleneck of graph neural networks and its practical implications. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=i80OPhOCVH2
Ayaz, S., Masood, N., Islam, M.A.: Predicting scientific impact based on h-index. Scientometrics 114(3), 993–1010 (2017). https://doi.org/10.1007/s11192-017-2618-1
Bai, X., Zhang, F., Lee, I.: Predicting the citations of scholarly paper. J. Informetrics 13(1), 407–418 (2019). https://doi.org/10.1016/j.joi.2019.01.010, http://www.sciencedirect.com/science/article/pii/S1751157718301767
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: EMNLP (2019)
Cao, X., Chen, Y., Liu, K.R.: A data analytic approach to quantifying scientific impact. J. Informetrics 10(2), 471–484 (2016). https://doi.org/10.1016/j.joi.2016.02.006, http://www.sciencedirect.com/science/article/pii/S1751157715301346
Cohan, A., Feldman, S., Beltagy, I., Downey, D., Weld, D.S.: SPECTER: document-level representation learning using citation-informed transformers. In: ACL (2020)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021). https://openreview.net/forum?id=YicbFdNTTy
Garfield, E., Sher, I.H.: New factors in the evaluation of scientific literature through citation indexing. Am. Doc. 14(3), 195–201 (1963). https://doi.org/10.1002/asi.5090140304, https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.5090140304
Hirsch, J.E.: An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences 102(46), 16569–16572 (2005). https://doi.org/10.1073/pnas.0507655102, https://www.pnas.org/content/102/46/16569
Li, G., Müller, M., Thabet, A., Ghanem, B.: DeepGCNs: can GCNs go as deep as CNNs? (2019)
Miró, Ò., et al.: Analysis of h-index and other bibliometric markers of productivity and repercussion of a selected sample of worldwide emergency medicine researchers. Emergency Med. J. 34(3), 175–181 (2017). https://doi.org/10.1136/emermed-2016-205893, https://emj.bmj.com/content/34/3/175
Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines (2021)
Mousavi, S.M., Ellsworth, W.L., Zhu, W., Chuang, L.Y., Beroza, G.C.: Earthquake transformer-an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nat. Commun. 11(1), 1–12 (2020)
Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 2014–2023. PMLR, New York, New York, USA, 20–22 June 2016. https://proceedings.mlr.press/v48/niepert16.html
Ochi, M., Shiro, M., Mori, J., Sakata, I.: Which is more helpful in finding scientific papers to be top-cited in the future: Content or citations? Case analysis in the field of solar cells 2009. In: Mayo, F.J.D., Marchiori, M., Filipe, J. (eds.) Proceedings of the 17th International Conference on Web Information Systems and Technologies, WEBIST 2021, 26–28 October 2021, pp. 360–364. SCITEPRESS (2021). https://doi.org/10.5220/0010689100003058
Ochi, M., Shiro, M., Mori, J., Sakata, I.: Classification of the top-cited literature by fusing linguistic and citation information with the transformer model. In: Decker, S., Mayo, F.J.D., Marchiori, M., Filipe, J. (eds.) Proceedings of the 18th International Conference on Web Information Systems and Technologies, WEBIST 2022, Valletta, Malta, 25–27 October 2022, pp. 286–293. SCITEPRESS (2022). https://doi.org/10.5220/0011542200003318
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab, November 1999. http://ilpubs.stanford.edu:8090/422/, previous number = SIDL-WP-1999-0120
Park, I., Yoon, B.: Technological opportunity discovery for technological convergence based on the prediction of technology knowledge flow in a citation network. J. Informetrics 12(4), 1199–1222 (2018). https://doi.org/10.1016/j.joi.2018.09.007, https://www.sciencedirect.com/science/article/pii/S1751157718300907
Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (2015)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. https://arxiv.org/abs/1908.10084
Sasaki, H., Hara, T., Sakata, I.: Identifying emerging research related to solar cells field using a machine learning approach. J. Sustain. Dev. Energy Water Environ. Syst. 4, 418–429 (2016). https://doi.org/10.13044/j.sdewes.2016.04.0032
Schreiber, M.: How relevant is the predictive power of the h-index? A case study of the time-dependent hirsch index. J. Informetrics 7(2), 325–329 (2013). https://doi.org/10.1016/j.joi.2013.01.001, http://www.sciencedirect.com/science/article/pii/S1751157713000035
Stegehuis, C., Litvak, N., Waltman, L.: Predicting the long-term citation impact of recent publications. J. Informetrics 9 (2015). https://doi.org/10.1016/j.joi.2015.06.005
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5446, https://aclanthology.org/W18-5446
Yan, E., Guns, R.: Predicting and recommending collaborations: an author-, institution-, and country-level analysis. J. Informetrics 8(2), 295–309 (2014). https://doi.org/10.1016/j.joi.2014.01.008, https://www.sciencedirect.com/science/article/pii/S1751157714000091
Yan, G., Liang, S., Zhang, Y., Liu, F.: Fusing transformer model with temporal features for ECG heartbeat classification. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 898–905. IEEE (2019)
Yi, Z., Ximeng, W., Guangquan, Z., Jie, L.: Predicting the dynamics of scientific activities: a diffusion-based network analytic methodology. Proceedings of the Association for Information Science and Technology, vol. 55, no. 1, pp. 598–607 (2018). https://doi.org/10.1002/pra2.2018.14505501065, https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/pra2.2018.14505501065
Zhang, J., Zhang, H., Xia, C., Sun, L.: Graph-BERT: only attention is needed for learning graph representations. CoRR abs/2001.05140 (2020). https://arxiv.org/abs/2001.05140
Acknowledgement
This article is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and supported by JSPS KAKENHI Grant Numbers JP21K17860 and JP21K12068. The funders had no role in the study design, data collection and analysis, decision to publish, or manuscript preparation.
Appendices
A Queries from Web of Science
We list in Tables 3, 4, 5 and 6 the queries used to create the datasets from the Web of Science. The tables share a common "Dataset ID". Each dataset is defined by multiple queries, and an article is registered in the dataset if it matches at least one of them.
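As an illustration of the "match any query" rule, the following sketch shows one possible way to assemble a dataset. The record format, the phrase matching on title and abstract, and the dataset IDs are hypothetical assumptions, not the actual Web of Science export format or the authors' pipeline.

```python
# Illustrative sketch of registering articles that match at least one query.
def matches(record: dict, query: str) -> bool:
    # Treat a query as a phrase searched in title and abstract (assumption).
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    return query.lower() in text


def build_dataset(records: list[dict], queries: list[str]) -> list[dict]:
    # An article belongs to the dataset if it matches ANY one of the dataset's queries.
    return [r for r in records if any(matches(r, q) for q in queries)]


# Hypothetical usage, keyed by the "Dataset ID" shared across Tables 3-6:
# dataset_queries = {"D01": ["perovskite solar cell", "organic photovoltaics"]}
# datasets = {ds_id: build_dataset(all_records, qs) for ds_id, qs in dataset_queries.items()}
```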
B Network and Linguistic Features for Each Dataset
We present the features extracted for each dataset in Table 7. We targeted relatively small research areas when selecting queries; as a result, the extracted datasets contain between 1,140 and 31,758 papers.
Two types of features are used: those derived from citation information and those derived from linguistic information. The citation features are mainly network-related indicators, namely the first six (Number of Articles, Number of Nodes, Number of Edges, Network Density, Average Degree, and Gini Coefficient of Degree Distribution); the linguistic features are the last two (Number of Abstracts, Word Perplexity).
- Number of Articles: the number of papers in each dataset.
- Number of Nodes: the number of nodes in the citation network, including papers cited by those in the dataset.
- Number of Edges: the number of edges in the network.
- Network Density: with N nodes and E edges, the density is \(D = \frac{2E}{N(N-1)}\).
- Average Degree: the average number of edges per node, calculated as \(\frac{E}{N}\).
- Gini Coefficient of Degree Distribution: the Gini coefficient of the degree distribution, indicating whether edges are concentrated on particular nodes.
- Number of Abstracts: the number of abstracts available in the dataset. If this value is low, the language model may not be fully utilized.
- Word Perplexity: an indicator of lexical diversity. If V is the vocabulary of a dataset D and \(P(w_i)\) denotes the proportion of occurrences of word \(w_i\), the perplexity is \(PP(D) = \prod_{i=1}^{|V|} P(w_i)^{-P(w_i)}\).
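The following is a minimal sketch, not the authors' code, of how these features could be computed for one dataset, assuming the citation network is a directed networkx graph and the abstracts are plain strings tokenized on whitespace (both assumptions for illustration).

```python
import math
from collections import Counter

import networkx as nx


def network_features(G: nx.DiGraph) -> dict:
    n, e = G.number_of_nodes(), G.number_of_edges()
    density = 2 * e / (n * (n - 1)) if n > 1 else 0.0  # D = 2E / (N(N-1)), as in Appendix B
    avg_degree = e / n if n > 0 else 0.0               # average number of edges per node
    degrees = sorted(d for _, d in G.degree())
    total = sum(degrees)
    if total == 0:
        gini = 0.0
    else:
        # Gini coefficient of the (sorted) degree distribution:
        # 0 = edges evenly spread, values near 1 = edges concentrated on a few nodes.
        cum = sum((i + 1) * d for i, d in enumerate(degrees))
        gini = (2 * cum) / (len(degrees) * total) - (len(degrees) + 1) / len(degrees)
    return {"nodes": n, "edges": e, "density": density,
            "average_degree": avg_degree, "degree_gini": gini}


def word_perplexity(abstracts: list[str]) -> float:
    # PP(D) = prod_i P(w_i)^{-P(w_i)} = exp of the unigram entropy of the corpus.
    counts = Counter(w for text in abstracts for w in text.lower().split())
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)
```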
C Result for Each Dataset
We present the results per dataset in Table 8. Examining each dataset individually, the proposed method achieves the best F-measure on only 19 of the 67 datasets, whereas the Graph-BERT method performs best on 34 of the 67 datasets.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ochi, M., Shiro, M., Mori, J., Sakata, I. (2023). Integrating Linguistic and Citation Information with Transformer for Predicting Top-Cited Papers. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2022. Lecture Notes in Business Information Processing, vol 494. Springer, Cham. https://doi.org/10.1007/978-3-031-43088-6_7
DOI: https://doi.org/10.1007/978-3-031-43088-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43087-9
Online ISBN: 978-3-031-43088-6