Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection

Xu, Jin; Zhang, Chengzhi; Ma, Shutian

doi:10.1007/978-3-030-31624-2_8

Conference paper
First Online: 18 September 2019

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11772)

Abstract

The CL-SciSumm Shared Task proposed a novel approach: generating a scientific summary from the cited text spans (CTS) in a target paper. This approach first requires identifying CTS in the reference paper according to each citation sentence (citance). CTS identification has therefore attracted the attention of many scholars, since the identified sentences are ultimately aggregated to generate the summary. Prior studies have viewed this task as a text classification problem, in which feature selection is a key step for modeling the linkage between CTS and citance. However, most studies have built features arbitrarily and applied them directly to model training, so the effectiveness of individual features has rarely been evaluated. Performance variation caused by different classifiers is barely taken into consideration either. To further improve the performance of CTS identification, this paper builds an ensemble system based on two steps of feature selection. In the first step, we construct a set of features and use correlation analysis to select those that are more highly correlated with CTS. In the second step, we assign each of several basic classifiers (SVM, Decision Tree and Logistic Regression) its best-performing feature set. Experimental results demonstrate that our proposed systems surpass the previous best-performing one.
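
The two-step selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: Pearson correlation is used as the correlation measure, greedy forward selection is used to find each classifier's feature set, and majority voting is used to combine the classifiers. The feature names, the threshold and the `score` callable (standing in for, say, a validation F1 of one classifier) are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def select_by_correlation(features, labels, threshold):
    """Step 1 (assumed form): keep features whose |r| with the label
    reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, labels)) >= threshold]


def greedy_subset(score, candidates):
    """Step 2 (assumed form): greedy forward selection for one classifier.
    `score(subset)` is a hypothetical callable that evaluates the classifier
    trained on `subset` and returns a number to maximise."""
    chosen, best = [], float("-inf")
    while True:
        trials = [(score(chosen + [c]), c) for c in candidates if c not in chosen]
        if not trials:
            break
        top_score, top_name = max(trials)
        if top_score <= best:
            break  # no remaining feature improves the score
        chosen.append(top_name)
        best = top_score
    return chosen


def majority_vote(predictions):
    """Combine per-classifier label lists into one voted label list."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data: hypothetical similarity features for six candidate sentences;
# label 1 marks a sentence that belongs to the cited text span.
features = {
    "jaccard": [0.9, 0.8, 0.1, 0.2, 0.7, 0.1],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.3, 0.3],
    "tfidf_cosine": [0.8, 0.6, 0.3, 0.1, 0.9, 0.2],
}
labels = [1, 1, 0, 0, 1, 0]

kept = select_by_correlation(features, labels, threshold=0.3)
print(kept)
```

In a full system, `greedy_subset` would be called once per classifier (SVM, Decision Tree, Logistic Regression) with that classifier's own scoring function, and `majority_vote` would merge their predictions. A weighted or stacked combination would fit the abstract's description of an ensemble equally well.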

Acknowledgements

This work is supported by the Major Projects of the National Social Science Fund (No. 17ZDA291).

Author information

Authors and Affiliations

Department of Information Management, Nanjing University of Science and Technology, Nanjing, 210094, China
Jin Xu, Chengzhi Zhang & Shutian Ma

Corresponding author

Correspondence to Chengzhi Zhang.

Editor information

Editors and Affiliations

Fudan University, Shanghai, China
Qi Zhang
Fuzhou University, Fuzhou, China
Xiangwen Liao
Shandong University, Qingdao, China
Zhaochun Ren

About this paper

Cite this paper

Xu, J., Zhang, C., Ma, S. (2019). Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection. In: Zhang, Q., Liao, X., Ren, Z. (eds) Information Retrieval. CCIR 2019. Lecture Notes in Computer Science, vol 11772. Springer, Cham. https://doi.org/10.1007/978-3-030-31624-2_8

DOI: https://doi.org/10.1007/978-3-030-31624-2_8
Published: 18 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31623-5
Online ISBN: 978-3-030-31624-2
eBook Packages: Computer Science, Computer Science (R0)
