Abstract
Text representation is the groundwork for in-depth analysis and mining of information in scientific papers. It directly affects the performance of downstream tasks such as scientific paper classification, clustering, and similarity calculation. However, recent research has mainly considered citation networks and partial structural information, which is insufficient for representing scientific papers. To improve the performance of text representation models, this paper proposes MV-HATrans, a text representation model that combines multi-viewpoint information, such as the semantic information of a knowledge graph and structural information. The model extracts word information from three aspects: contextual content, part of speech, and word meaning from WordNet. By combining a hierarchical attention mechanism with a transformer, the model produces a full-text representation of scientific papers. Finally, this paper applies the proposed model to automatic quality assessment on AAPR, a binary dataset that indicates whether scientific papers were accepted or not. Results show that, in the quality classification of scientific papers, adopting part-of-speech information and semantic information based on WordNet definitions yields a prediction accuracy of 70.14%. Among all the structural modules, authors and abstracts contribute the most to the quality classification of scientific papers, especially authors, at 9.51%.
Acknowledgements
The authors warmly thank the reviewers for their valuable suggestions. This research was partly supported by the Basic and Applied Basic Research Fund of Guangdong Province (No. 2019B1515120085), the National Natural Science Foundation of China [Grant Number: 71373291], and the Science and Technology Planning Project of Guangdong Province (China) [Grant Number: 2016B030303003]. Jiayi Luo and Ying Xiao are co-second authors.
Cite this article
Lu, Y., Luo, J., Xiao, Y. et al. Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment. Scientometrics 126, 6937–6963 (2021). https://doi.org/10.1007/s11192-021-04028-4
DOI: https://doi.org/10.1007/s11192-021-04028-4