Data Augmentation on Problem and Method Sentence Classification Task in Scientific Paper: A Mechanism Analysis Study

  • Conference paper
Wisdom, Well-Being, Win-Win (iConference 2024)

Part of the book series: Lecture Notes in Computer Science (volume 14598)

Abstract

Billions of scientific papers create the need to identify the essential parts of this massive body of text. Scientific research is an activity that proceeds from putting forward problems to applying methods. To capture the main idea of scientific papers, we focus on extracting problem and method sentences. Annotating sentences in scientific papers is labor-intensive, so the resulting datasets are small, which limits model learning. To tackle this challenge, data augmentation has been adopted because it can generate synthetic data with minor variations, thereby expanding the original training dataset. Various data augmentation methods exist, such as those based on random word replacement or back translation. Nevertheless, their suitability for sentence classification tasks in scientific papers remains unexplored. Thus, this paper constructs two manually annotated datasets and evaluates the performance of data augmentation methods on them. Furthermore, this paper examines the mechanisms underlying the effects of these methods. Previous studies have suggested that data augmentation can diminish a model's reliance on high-frequency patterns. Therefore, this paper uses attention values to represent the model's dependence on words and analyzes how data augmentation methods alter the attention values of individual words within sentences. The experimental results indicate that data augmentation methods can improve the macro F1 score in sentence classification tasks. Moreover, data augmentation methods effectively reduce the attention values assigned to stop words, commonly used words in scientific papers, and commonly used words in method and problem sentences.
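To make two of the terms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It illustrates augmentation by random word replacement and the macro F1 score used for evaluation. The synonym table, the function names, and the "problem"/"method"/"other" labels are hypothetical choices made only for this illustration; the abstract does not specify them.

```python
import random

# Hypothetical toy synonym table; a real setup would draw on a lexical resource.
SYNONYMS = {
    "propose": ["present", "introduce"],
    "method": ["approach", "technique"],
    "problem": ["issue", "challenge"],
}


def random_word_replacement(sentence, n_replacements=1, seed=None):
    """Return a copy of `sentence` with up to `n_replacements` words swapped for synonyms."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Only words present in the synonym table are eligible for replacement.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


original = "We propose a method for this problem"
augmented = random_word_replacement(original, n_replacements=1, seed=0)
print(augmented)
```

Because macro F1 averages the per-class scores without weighting by class size, a rare class counts as much as a frequent one, which matters when problem and method sentences are a small share of the annotated data.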

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 72074113).

Author information

Corresponding author

Correspondence to Chengzhi Zhang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Zhang, Y., Zhang, C. (2024). Data Augmentation on Problem and Method Sentence Classification Task in Scientific Paper: A Mechanism Analysis Study. In: Sserwanga, I., et al. Wisdom, Well-Being, Win-Win. iConference 2024. Lecture Notes in Computer Science, vol 14598. Springer, Cham. https://doi.org/10.1007/978-3-031-57867-0_2

  • DOI: https://doi.org/10.1007/978-3-031-57867-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57866-3

  • Online ISBN: 978-3-031-57867-0

  • eBook Packages: Computer Science, Computer Science (R0)
