Abstract
The CL-SciSumm Shared Task proposed a novel approach: generating a scientific summary based on the cited text spans (CTS) in the target paper. This mechanism first requires identifying the CTS in the reference paper that correspond to each citation sentence (citance). CTS identification has therefore attracted the attention of many scholars, since the identified sentences are ultimately aggregated for summary generation. Prior studies have treated this task as a text classification problem, in which feature selection is a key step for modeling the linkage between CTS and citance. However, most studies build features arbitrarily and apply them directly to model training, so the effectiveness of individual features has rarely been evaluated. Performance variation caused by different classifiers is also barely taken into consideration. To further improve the performance of CTS identification, this paper builds an ensemble system based on two steps of feature selection. In the first step, we construct a set of features and perform correlation analysis to select those that are more highly correlated with CTS. In the second step, we assign each of several basic classifiers (SVM, Decision Tree, and Logistic Regression) its best-performing feature set. Experimental results demonstrate that our proposed systems surpass the previous best-performing one.
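The first selection step and the final ensemble decision can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the toy data, the 0.3 correlation threshold, and the use of simple majority voting to combine classifiers are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as uninformative.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, names, threshold=0.3):
    """Keep features whose absolute correlation with the label reaches the threshold.

    The threshold value is an assumption, not a figure taken from the paper.
    """
    selected = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        r = pearson(column, labels)
        if abs(r) >= threshold:
            selected.append((name, r))
    return selected


def majority_vote(predictions):
    """Combine 0/1 predictions from several classifiers by simple majority.

    Majority voting is assumed here; the abstract does not name the ensemble rule.
    """
    return 1 if sum(predictions) * 2 > len(predictions) else 0


# Toy data: each row is one (citance, candidate sentence) pair.
# Feature names are hypothetical placeholders.
names = ["jaccard_similarity", "sentence_position", "constant_feature"]
rows = [
    [0.9, 0.1, 5],
    [0.8, 0.7, 5],
    [0.2, 0.2, 5],
    [0.1, 0.6, 5],
]
labels = [1, 1, 0, 0]  # 1 = sentence is a cited text span

for name, r in select_features(rows, labels, names):
    print(f"{name}: r = {r:.3f}")
```

In the second step described in the abstract, each base classifier would be trained on the feature subset that works best for it; `majority_vote` only stands in for how their individual decisions might then be combined.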
Acknowledgements
This work is supported by the Major Projects of the National Social Science Fund (No. 17ZDA291).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, J., Zhang, C., Ma, S. (2019). Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection. In: Zhang, Q., Liao, X., Ren, Z. (eds) Information Retrieval. CCIR 2019. Lecture Notes in Computer Science, vol 11772. Springer, Cham. https://doi.org/10.1007/978-3-030-31624-2_8
DOI: https://doi.org/10.1007/978-3-030-31624-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31623-5
Online ISBN: 978-3-030-31624-2
eBook Packages: Computer Science (R0)