Abstract
Academic literature contains many types of data, including text, citations, figures, and tables. The Transformer model, introduced in 2017 for natural language processing, is now widely used in fields as diverse as image processing and network science. Transformer models pretrained on large datasets have been shown to perform well on different tasks after additional training on small amounts of new data. Most studies of classification and regression on academic literature have designed features customized to individual tasks without fully considering interactions among data types, and customizing features for each task is costly and complex. This paper addresses this issue by proposing a basic framework that can be applied consistently to various tasks drawing on the diverse information in academic literature. Specifically, we propose an end-to-end fusion method that combines the linguistic and citation information of academic papers using two Transformer models. Experiments were conducted on a dataset covering 67 disciplines extracted from the Web of Science, one of the largest databases of academic literature, with the task of classifying papers in the top 20% of citations five years after publication. The results show that the proposed method improves the F-measure by 0.028 on average compared to using only citation or only linguistic information. Repeated experiments on the 67 datasets also show that the proposed model has the smallest standard deviation of F-measures; in other words, it achieves the best average performance with the most stable results. We also analyzed the relationship between dataset characteristics and the performance of the proposed method and found only weak correlations, indicating that the method is highly versatile. Based on the above, our proposed method is superior in terms of F-measure, learning stability, and generality.
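To make the idea of fusing two Transformers concrete, the following PyTorch sketch combines a SciBERT text embedding with an embedding from a citation-side Transformer by concatenation before a binary classifier (top 20% cited vs. not). The late-fusion-by-concatenation design, the `citation_encoder` placeholder, and its dimensionality are assumptions for illustration only; the paper's exact architecture may differ.

```python
# Minimal sketch of a two-Transformer fusion classifier (illustrative, not the authors' code).
import torch
import torch.nn as nn
from transformers import AutoModel


class FusionClassifier(nn.Module):
    def __init__(self, citation_encoder: nn.Module, citation_dim: int = 128):
        super().__init__()
        # Language side: SciBERT, pretrained on scientific text.
        self.text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        # Citation side: any Transformer over citation-graph inputs (placeholder module).
        self.citation_encoder = citation_encoder
        hidden = self.text_encoder.config.hidden_size
        # Binary head: top-20%-cited vs. not.
        self.classifier = nn.Linear(hidden + citation_dim, 2)

    def forward(self, input_ids, attention_mask, citation_inputs):
        # [CLS] token embedding as the document-level text representation.
        text_vec = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # (batch, citation_dim) embedding from the citation Transformer.
        cite_vec = self.citation_encoder(citation_inputs)
        # Late fusion by concatenation, trained end to end.
        return self.classifier(torch.cat([text_vec, cite_vec], dim=-1))
```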
Notes
1. Web of Science: https://www.webofknowledge.com.
2. SciBERT: https://github.com/allenai/scibert.
References
Acuna, D.E., Allesina, S., Kording, K.P.: Predicting scientific success. Nature 489(7415), 201–202 (2012). https://doi.org/10.1038/489201a
Alon, U., Yahav, E.: On the bottleneck of graph neural networks and its practical implications. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=i80OPhOCVH2
Ayaz, S., Masood, N., Islam, M.A.: Predicting scientific impact based on h-index. Scientometrics 114(3), 993–1010 (2017). https://doi.org/10.1007/s11192-017-2618-1
Bai, X., Zhang, F., Lee, I.: Predicting the citations of scholarly paper. J. Informetrics 13(1), 407–418 (2019). https://doi.org/10.1016/j.joi.2019.01.010, http://www.sciencedirect.com/science/article/pii/S1751157718301767
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: EMNLP (2019)
Cao, X., Chen, Y., Liu, K.R.: A data analytic approach to quantifying scientific impact. J. Informetrics 10(2), 471–484 (2016). https://doi.org/10.1016/j.joi.2016.02.006, http://www.sciencedirect.com/science/article/pii/S1751157715301346
Cohan, A., Feldman, S., Beltagy, I., Downey, D., Weld, D.S.: SPECTER: document-level representation learning using citation-informed transformers. In: ACL (2020)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021). https://openreview.net/forum?id=YicbFdNTTy
Garfield, E., Sher, I.H.: New factors in the evaluation of scientific literature through citation indexing. Am. Doc. 14(3), 195–201 (1963). https://doi.org/10.1002/asi.5090140304, https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.5090140304
Hirsch, J.E.: An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences 102(46), 16569–16572 (2005). https://doi.org/10.1073/pnas.0507655102, https://www.pnas.org/content/102/46/16569
Li, G., Müller, M., Thabet, A., Ghanem, B.: DeepGCNs: can GCNs go as deep as CNNs? (2019)
Miró, Ò., et al.: Analysis of h-index and other bibliometric markers of productivity and repercussion of a selected sample of worldwide emergency medicine researchers. Emergency Med. J. 34(3), 175–181 (2017). https://doi.org/10.1136/emermed-2016-205893, https://emj.bmj.com/content/34/3/175
Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines (2021)
Mousavi, S.M., Ellsworth, W.L., Zhu, W., Chuang, L.Y., Beroza, G.C.: Earthquake transformer-an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nat. Commun. 11(1), 1–12 (2020)
Niepert, M., Ahmed, M., Kutzkov, K.: Learning convolutional neural networks for graphs. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 2014–2023. PMLR, New York, New York, USA, 20–22 June 2016. https://proceedings.mlr.press/v48/niepert16.html
Ochi, M., Shiro, M., Mori, J., Sakata, I.: Which is more helpful in finding scientific papers to be top-cited in the future: Content or citations? Case analysis in the field of solar cells 2009. In: Mayo, F.J.D., Marchiori, M., Filipe, J. (eds.) Proceedings of the 17th International Conference on Web Information Systems and Technologies, WEBIST 2021, 26–28 October 2021, pp. 360–364. SCITEPRESS (2021). https://doi.org/10.5220/0010689100003058
Ochi, M., Shiro, M., Mori, J., Sakata, I.: Classification of the top-cited literature by fusing linguistic and citation information with the transformer model. In: Decker, S., Mayo, F.J.D., Marchiori, M., Filipe, J. (eds.) Proceedings of the 18th International Conference on Web Information Systems and Technologies, WEBIST 2022, Valletta, Malta, 25–27 October 2022, pp. 286–293. SCITEPRESS (2022). https://doi.org/10.5220/0011542200003318
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab, November 1999. http://ilpubs.stanford.edu:8090/422/, previous number = SIDL-WP-1999-0120
Park, I., Yoon, B.: Technological opportunity discovery for technological convergence based on the prediction of technology knowledge flow in a citation network. J. Informetrics 12(4), 1199–1222 (2018). https://doi.org/10.1016/j.joi.2018.09.007, https://www.sciencedirect.com/science/article/pii/S1751157718300907
Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (2015)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. https://arxiv.org/abs/1908.10084
Sasaki, H., Hara, T., Sakata, I.: Identifying emerging research related to solar cells field using a machine learning approach. J. Sustain. Dev. Energy Water Environ. Syst. 4, 418–429 (2016). https://doi.org/10.13044/j.sdewes.2016.04.0032
Schreiber, M.: How relevant is the predictive power of the h-index? A case study of the time-dependent hirsch index. J. Informetrics 7(2), 325–329 (2013). https://doi.org/10.1016/j.joi.2013.01.001, http://www.sciencedirect.com/science/article/pii/S1751157713000035
Stegehuis, C., Litvak, N., Waltman, L.: Predicting the long-term citation impact of recent publications. J. Informetrics 9 (2015). https://doi.org/10.1016/j.joi.2015.06.005
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5446, https://aclanthology.org/W18-5446
Yan, E., Guns, R.: Predicting and recommending collaborations: an author-, institution-, and country-level analysis. J. Informetrics 8(2), 295–309 (2014). https://doi.org/10.1016/j.joi.2014.01.008, https://www.sciencedirect.com/science/article/pii/S1751157714000091
Yan, G., Liang, S., Zhang, Y., Liu, F.: Fusing transformer model with temporal features for ECG heartbeat classification. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 898–905. IEEE (2019)
Yi, Z., Ximeng, W., Guangquan, Z., Jie, L.: Predicting the dynamics of scientific activities: a diffusion-based network analytic methodology. Proceedings of the Association for Information Science and Technology, vol. 55, no. 1, pp. 598–607 (2018). https://doi.org/10.1002/pra2.2018.14505501065, https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/pra2.2018.14505501065
Zhang, J., Zhang, H., Xia, C., Sun, L.: Graph-BERT: only attention is needed for learning graph representations. CoRR abs/2001.05140 (2020). https://arxiv.org/abs/2001.05140
Acknowledgement
This article is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and supported by JSPS KAKENHI Grant Numbers JP21K17860 and JP21K12068. The funders had no role in the study design, data collection and analysis, decision to publish, or manuscript preparation.
Appendices
A Queries from Web of Science
We list in Tables 3, 4, 5 and 6 the queries used to create the datasets from the Web of Science. The tables share a common "Dataset ID". Each dataset is defined by multiple queries, and an article is registered in the dataset if it matches at least one of them.
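As an illustration of the "match any query" rule, the following sketch shows one possible way to assemble a dataset. The record format, the phrase matching on title and abstract, and the dataset IDs are hypothetical assumptions, not the actual Web of Science export format or the authors' pipeline.

```python
# Illustrative sketch of registering articles that match at least one query.
def matches(record: dict, query: str) -> bool:
    # Treat a query as a phrase searched in title and abstract (assumption).
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    return query.lower() in text


def build_dataset(records: list[dict], queries: list[str]) -> list[dict]:
    # An article belongs to the dataset if it matches ANY one of the dataset's queries.
    return [r for r in records if any(matches(r, q) for q in queries)]


# Hypothetical usage, keyed by the "Dataset ID" shared across Tables 3-6:
# dataset_queries = {"D01": ["perovskite solar cell", "organic photovoltaics"]}
# datasets = {ds_id: build_dataset(all_records, qs) for ds_id, qs in dataset_queries.items()}
```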
B Network and Linguistic Features for Each Dataset
We present the features extracted for each dataset in Table 7. We targeted relatively small research areas when selecting queries; as a result, the extracted datasets contain between 1,140 and 31,758 papers.
Two types of features are used: those derived from citation information and those derived from linguistic information. The citation features are mainly network-related indicators, namely the first six (Number of Articles, Number of Nodes, Number of Edges, Network Density, Average Degree, and Gini Coefficient of Degree Distribution); the linguistic features are the last two (Number of Abstracts, Word Perplexity).
- Number of Articles: the number of papers in each dataset.
- Number of Nodes: the number of nodes in the citation network, including papers cited by those in the dataset.
- Number of Edges: the number of edges in the network.
- Network Density: with N nodes and E edges, the density is \(D = \frac{2E}{N(N-1)}\).
- Average Degree: the average number of edges per node, calculated as \(\frac{E}{N}\).
- Gini Coefficient of Degree Distribution: the Gini coefficient of the degree distribution, indicating whether edges are concentrated on particular nodes.
- Number of Abstracts: the number of abstracts available in the dataset. If this value is low, the language model may not be fully utilized.
- Word Perplexity: an indicator of lexical diversity. If V is the vocabulary of a dataset D and \(P(w_i)\) denotes the proportion of occurrences of word \(w_i\), the perplexity is \(PP(D) = \prod_{i=1}^{|V|} P(w_i)^{-P(w_i)}\).
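The following is a minimal sketch, not the authors' code, of how these features could be computed for one dataset, assuming the citation network is a directed networkx graph and the abstracts are plain strings tokenized on whitespace (both assumptions for illustration).

```python
import math
from collections import Counter

import networkx as nx


def network_features(G: nx.DiGraph) -> dict:
    n, e = G.number_of_nodes(), G.number_of_edges()
    density = 2 * e / (n * (n - 1)) if n > 1 else 0.0  # D = 2E / (N(N-1)), as in Appendix B
    avg_degree = e / n if n > 0 else 0.0               # average number of edges per node
    degrees = sorted(d for _, d in G.degree())
    total = sum(degrees)
    if total == 0:
        gini = 0.0
    else:
        # Gini coefficient of the (sorted) degree distribution:
        # 0 = edges evenly spread, values near 1 = edges concentrated on a few nodes.
        cum = sum((i + 1) * d for i, d in enumerate(degrees))
        gini = (2 * cum) / (len(degrees) * total) - (len(degrees) + 1) / len(degrees)
    return {"nodes": n, "edges": e, "density": density,
            "average_degree": avg_degree, "degree_gini": gini}


def word_perplexity(abstracts: list[str]) -> float:
    # PP(D) = prod_i P(w_i)^{-P(w_i)} = exp of the unigram entropy of the corpus.
    counts = Counter(w for text in abstracts for w in text.lower().split())
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)
```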
C Result for Each Dataset
We present the results per dataset in Table 8. Examining each dataset individually, the proposed method achieves the best F-measure on only 19 of the 67 datasets, whereas the Graph-BERT method performs best on 34 of the 67 datasets.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ochi, M., Shiro, M., Mori, J., Sakata, I. (2023). Integrating Linguistic and Citation Information with Transformer for Predicting Top-Cited Papers. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2022. Lecture Notes in Business Information Processing, vol 494. Springer, Cham. https://doi.org/10.1007/978-3-031-43088-6_7
DOI: https://doi.org/10.1007/978-3-031-43088-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43087-9
Online ISBN: 978-3-031-43088-6