Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11772)

Abstract

The CL-SciSumm Shared Task proposed a novel approach: generating a scientific summary based on the cited text spans (CTS) in the target paper. This mechanism first requires identifying CTS in the reference paper according to the citation sentence (citance). CTS identification has therefore attracted the attention of many scholars, since the identified sentences are ultimately aggregated for summary generation. Prior studies viewed this task as a text classification problem, in which feature selection is one key step for modeling the linkage between CTS and citance. However, most studies built features arbitrarily and applied them directly to model training, so the effectiveness of individual features has rarely been evaluated. Performance variation caused by different classifiers is barely taken into consideration either. To further improve the performance of CTS identification, this paper builds an ensemble system based on two steps of feature selection. In the first step, we construct a set of features and perform correlation analysis to select those that are more highly correlated with CTS. In the second step, we assign each of several basic classifiers (SVM, Decision Tree and Logistic Regression) its best-performing feature set. Experimental results demonstrate that the proposed systems surpass the previous best-performing one.
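
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Pearson correlation as the correlation measure, an arbitrary threshold of 0.1, and hypothetical feature names (jaccard, tfidf_cosine, sentence_length). The `evaluate` callable in step 2 is also an assumption: it stands in for whatever validation score (for example, cross-validated F1) is used to compare feature sets for a given classifier.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_by_correlation(features, labels, threshold=0.1):
    """Step 1: keep features whose |r| with the CTS label reaches the threshold.

    `features` maps a feature name to its column of values; `labels` holds
    1 for a (citance, sentence) pair that is a CTS and 0 otherwise.
    The threshold value is an assumption made for this sketch.
    """
    scores = {name: pearson(col, labels) for name, col in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


def best_subset_per_classifier(candidates, classifiers, evaluate):
    """Step 2: give each classifier the candidate feature set it scores best on.

    `evaluate(classifier, subset)` is a hypothetical callable returning a
    validation score; higher is better.
    """
    return {
        clf: max(candidates, key=lambda subset: evaluate(clf, subset))
        for clf in classifiers
    }


# Toy data with hypothetical feature names, for illustration only.
features = {
    "jaccard": [0.9, 0.8, 0.1, 0.2, 0.7, 0.1],
    "sentence_length": [12, 30, 12, 30, 20, 20],
    "tfidf_cosine": [0.7, 0.9, 0.2, 0.1, 0.8, 0.3],
}
labels = [1, 1, 0, 0, 1, 0]

kept, scores = select_by_correlation(features, labels)
print(kept)  # features that pass the step-1 filter
```

In this toy data, sentence_length is distributed identically across both classes, so the filter drops it, while the two similarity features are retained. The features that survive step 1 would then form the candidate subsets passed to `best_subset_per_classifier`, once per basic classifier.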

Notes

  1. https://tac.nist.gov/2014/BiomedSumm/index.html
  2. http://wing.comp.nus.edu.sg/~cl-scisumm2019
  3. https://github.com/WING-NUS/scisumm-corpus
  4. https://blog.csdn.net/shijiebei2009/article/details/39696523/
  5. http://tartarus.org/~martin/PorterStemmer/
  6. https://radimrehurek.com/gensim/models/word2vec.html
  7. https://radimrehurek.com/gensim/models/doc2vec.html
  8. https://pypi.org/project/lda/
  9. https://pypi.org/project/pytextrank/
  10. https://wordnet.princeton.edu/download
  11. http://scikit-learn.org/

Acknowledgements

This work is supported by the Major Projects of the National Social Science Fund (No. 17ZDA291).

Corresponding author

Correspondence to Chengzhi Zhang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, J., Zhang, C., Ma, S. (2019). Ensemble System for Identification of Cited Text Spans: Based on Two Steps of Feature Selection. In: Zhang, Q., Liao, X., Ren, Z. (eds) Information Retrieval. CCIR 2019. Lecture Notes in Computer Science, vol 11772. Springer, Cham. https://doi.org/10.1007/978-3-030-31624-2_8

  • DOI: https://doi.org/10.1007/978-3-030-31624-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31623-5

  • Online ISBN: 978-3-030-31624-2

  • eBook Packages: Computer Science, Computer Science (R0)
