Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10959)

Abstract

To identify salient rhetorical elements in scholarly text, we previously defined a set of Discourse Segment Types: semantically defined spans of discourse at the level of a clause with a single rhetorical purpose, such as Hypothesis, Method or Result. In this paper, we use machine learning methods to predict these Discourse Segment Types in a corpus of biomedical research papers. The initial experiment used features related to verb type and form, obtaining F-scores ranging from 0.41 to 0.65. To improve our results, we explored a variety of methods for balancing classes before applying classification algorithms. We also performed an ablation study and a stepwise approach for feature selection. Through these feature selection processes, we were able to reduce our 37 features to the 9 most informative ones, while maintaining F1 scores in the range of 0.63–0.65. Next, we performed an experiment with a reduced set of target classes. Using only verb tense features with logistic regression, a decision tree classifier and a random forest classifier, we predicted whether a segment was a Result/Method or a Fact/Implication, with F1 scores above 0.8. Interestingly, the findings from this machine learning approach are in line with a reader experiment, which found a correlation between verb tense and a biomedical reader’s interpretation of discourse segment type. This suggests that experimental and concept-centric discourse in biology texts can be distinguished by humans or machines, using verb tense as a key feature.
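As an illustration of the stepwise feature selection mentioned above, the following is a minimal sketch of a greedy forward search. It is not the authors' procedure: the abstract does not say whether the search ran forward or backward, and the function `forward_stepwise`, the scorer `toy_score` and the weights in `TOY_WEIGHTS` are hypothetical stand-ins. In practice the scorer would be something like a cross-validated F1 for a classifier trained on the candidate subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_stepwise(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    max_features: int,
) -> List[str]:
    """Greedily add the feature that most improves `score`; stop when nothing helps."""
    selected: List[str] = []
    best = score(tuple(selected))
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score every one-feature extension of the current subset.
        top_score, top_feature = max(
            (score(tuple(selected + [f])), f) for f in candidates
        )
        if top_score <= best:  # no candidate improves the score
            break
        selected.append(top_feature)
        best = top_score
    return selected


# Hypothetical scorer: pretends that only two features carry any signal.
TOY_WEIGHTS = {"past": 0.30, "present": 0.20}


def toy_score(subset: Tuple[str, ...]) -> float:
    return sum(TOY_WEIGHTS.get(f, 0.0) for f in subset)


chosen = forward_stepwise(
    ["modal", "past", "gerund", "present"], toy_score, max_features=9
)
print(chosen)
```

An ablation study works in the opposite direction: it starts from the full feature set and measures how much the score drops when each feature, or feature class, is removed.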

Author information

Corresponding author

Correspondence to Corey A. Harper.

Appendices

Appendix 5.1. Starting Feature List and Descriptions

| Feature class | Feature | Included in experiment # |
|---|---|---|
| Frequently Used Verb | Top 10 Verb | 1 |
| Frequently Used Verb | ‘Show’ Verb | 1, 3 |
| Verb Tense | Future | 1, 3, 4 |
| Verb Tense | Gerund | 1, 4 |
| Verb Tense | Past | 1, 2, 3, 4 |
| Verb Tense | Past participle | 1, 4 |
| Verb Tense | Past perfect | 1, 4 |
| Verb Tense | Past progressive | 1, 3, 4 |
| Verb Tense | Present | 1, 2, 4 |
| Verb Tense | Present perfect | 1, 4 |
| Verb Tense | Present progressive | 1, 4 |
| Verb Tense | To-infinitive | 1, 2, 3, 4 |
| Verb Class | Cause and effect | 1 |
| Verb Class | Change and growth | 1 |
| Verb Class | Discourse verb | 1 |
| Verb Class | Interpretation | 1, 2, 3 |
| Verb Class | Investigation | 1, 2, 3 |
| Verb Class | None | 1 |
| Verb Class | Observation | 1, 3 |
| Verb Class | Prediction | 1 |
| Verb Class | Procedure | 1, 2, 3 |
| Verb Class | Properties | 1, 3 |
| Modality Marker | Modal | 1, 2, 3 |
| Modality Marker | Verb class interpretation | 1, 2, 3 |
| Modality Marker | Ruled by verb class interpretation | 1, 2, 3 |
| Modality Marker | Reference internal | 1 |
| Modality Marker | Reference external | 1 |
| Modality Marker | First person | 1 |
| Modality Marker | Modal significant_ly | 1 |
| Modality Marker | Possible_ility_ly | 1 |
| Modality Marker | Potential_ly | 1 |
| Modality Marker | UN_Likely | 1 |
| Modality Marker | Sum_Adverbs_YesNO | 1 |
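The feature list above can be read as a mapping from each feature to the experiments that used it. The sketch below encodes the table as data and looks up the feature set for a given experiment. The names `FEATURES` and `features_for` are illustrative and do not come from the paper; the rows are copied from the table as extracted.

```python
# (feature class, feature, experiments that included it), copied from the table above
FEATURES = [
    ("Frequently Used Verb", "Top 10 Verb", {1}),
    ("Frequently Used Verb", "'Show' Verb", {1, 3}),
    ("Verb Tense", "Future", {1, 3, 4}),
    ("Verb Tense", "Gerund", {1, 4}),
    ("Verb Tense", "Past", {1, 2, 3, 4}),
    ("Verb Tense", "Past participle", {1, 4}),
    ("Verb Tense", "Past perfect", {1, 4}),
    ("Verb Tense", "Past progressive", {1, 3, 4}),
    ("Verb Tense", "Present", {1, 2, 4}),
    ("Verb Tense", "Present perfect", {1, 4}),
    ("Verb Tense", "Present progressive", {1, 4}),
    ("Verb Tense", "To-infinitive", {1, 2, 3, 4}),
    ("Verb Class", "Cause and effect", {1}),
    ("Verb Class", "Change and growth", {1}),
    ("Verb Class", "Discourse verb", {1}),
    ("Verb Class", "Interpretation", {1, 2, 3}),
    ("Verb Class", "Investigation", {1, 2, 3}),
    ("Verb Class", "None", {1}),
    ("Verb Class", "Observation", {1, 3}),
    ("Verb Class", "Prediction", {1}),
    ("Verb Class", "Procedure", {1, 2, 3}),
    ("Verb Class", "Properties", {1, 3}),
    ("Modality Marker", "Modal", {1, 2, 3}),
    ("Modality Marker", "Verb class interpretation", {1, 2, 3}),
    ("Modality Marker", "Ruled by verb class interpretation", {1, 2, 3}),
    ("Modality Marker", "Reference internal", {1}),
    ("Modality Marker", "Reference external", {1}),
    ("Modality Marker", "First person", {1}),
    ("Modality Marker", "Modal significant_ly", {1}),
    ("Modality Marker", "Possible_ility_ly", {1}),
    ("Modality Marker", "Potential_ly", {1}),
    ("Modality Marker", "UN_Likely", {1}),
    ("Modality Marker", "Sum_Adverbs_YesNO", {1}),
]


def features_for(experiment):
    """Return (feature class, feature) pairs used in the given experiment number."""
    return [(cls, name) for cls, name, used_in in FEATURES if experiment in used_in]


for experiment in (1, 2, 3, 4):
    print(experiment, len(features_for(experiment)))
```

Read this way, experiment 2 uses nine features, which matches the reduced feature set described in the abstract, and experiment 4 draws only on the Verb Tense class.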

Appendix 5.2. Description of Sampling Methods Used

| Sampling method | Description | Type |
|---|---|---|
| RandomUnderSampler | Undersamples the majority classes by randomly picking samples | Undersampler |
| TomekLinks | Undersamples the majority classes by removing Tomek’s links | Undersampler |
| ClusterCentroids | Undersamples the majority classes by replacing a cluster of majority samples with the cluster centroid of a KMeans algorithm | Undersampler |
| CondensedNearestNeighbor | Undersamples the majority classes using the condensed nearest neighbor method | Undersampler |
| OneSidedSelection | Uses the one-sided selection method on the majority classes | Undersampler |
| InstanceHardnessThreshold | Removes samples with lower probabilities from the majority class | Undersampler |
| RandomOverSampler | Randomly generates new samples from the minority classes | Oversampler |
| SMOTE | Synthetic Minority Oversampling Technique; generates new samples of the minority class by interpolation | Oversampler |
| SMOTEborderline | Generates new samples of the minority class specific to the borders between two classes | Oversampler |
| SMOTEborderline2 | Generates new samples of the minority class specific to the borders between two classes | Oversampler |
| SMOTETomek | Combines use of SMOTE on the minority class and Tomek links on the majority class | Over- and undersampler |
| SMOTEENN | Combines use of SMOTE on the minority class and Edited Nearest Neighbors on the majority class | Over- and undersampler |
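To make the difference between the two families concrete, the sketch below implements simplified versions of random undersampling, random oversampling and the interpolation step behind SMOTE, using only the Python standard library. It is not the implementation used in the paper. The function names and the toy data are hypothetical, and a full SMOTE would also choose the neighbour among the k nearest same-class points rather than receive it as an argument.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(rows, labels):
    """Collect feature rows per class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_undersample(rows, labels, seed=0):
    """Shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(members) for members in groups.values())
    return [(row, label)
            for label, members in groups.items()
            for row in rng.sample(members, target)]


def random_oversample(rows, labels, seed=0):
    """Grow every class to the size of the largest class by duplicating rows."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(members) for members in groups.values())
    return [(row, label)
            for label, members in groups.items()
            for row in members + rng.choices(members, k=target - len(members))]


def interpolate(point, neighbour, rng):
    """SMOTE-style synthetic sample on the segment between two same-class points."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(point, neighbour)]


# Hypothetical toy data: six "Result" rows and two "Hypothesis" rows.
rows = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.0, 0.7],
        [0.3, 1.0], [0.1, 0.6], [1.0, 0.0], [0.8, 0.2]]
labels = ["Result"] * 6 + ["Hypothesis"] * 2

under_counts = Counter(label for _, label in random_undersample(rows, labels))
over_counts = Counter(label for _, label in random_oversample(rows, labels))
synthetic = interpolate(rows[6], rows[7], random.Random(0))
print(under_counts, over_counts, synthetic)
```

Undersampling discards data from the larger classes, while oversampling repeats or synthesizes data for the smaller ones. The combined methods in the table apply an oversampler first and then clean the result with an undersampler.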

Appendix 5.3. Accuracy, Precision, Recall and F1 Scores of All 36 Models Tested

LR: logistic regression; DTC: decision tree classifier; RFC: random forest classifier.

| Classifier | Class balancer | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| LR | No Class Balancer | 0.62 | 0.68 | 0.63 | 0.64 |
| DTC | No Class Balancer | 0.64 | 0.64 | 0.64 | 0.64 |
| RFC | No Class Balancer | 0.64 | 0.65 | 0.65 | 0.64 |
| LR | RandomUnderSampler | 0.58 | 0.64 | 0.58 | 0.59 |
| DTC | RandomUnderSampler | 0.55 | 0.64 | 0.55 | 0.56 |
| RFC | RandomUnderSampler | 0.57 | 0.63 | 0.56 | 0.57 |
| LR | TomekLinks | 0.63 | 0.68 | 0.63 | 0.64 |
| DTC | TomekLinks | 0.64 | 0.64 | 0.64 | 0.64 |
| RFC | TomekLinks | 0.64 | 0.64 | 0.64 | 0.64 |
| LR | ClusterCentroids | 0.55 | 0.64 | 0.55 | 0.55 |
| DTC | ClusterCentroids | 0.35 | 0.48 | 0.35 | 0.32 |
| RFC | ClusterCentroids | 0.38 | 0.47 | 0.38 | 0.35 |
| LR | CondensedNearestNeighbor | 0.62 | 0.67 | 0.62 | 0.62 |
| DTC | CondensedNearestNeighbor | 0.53 | 0.59 | 0.53 | 0.53 |
| RFC | CondensedNearestNeighbor | 0.55 | 0.60 | 0.55 | 0.55 |
| LR | OneSidedSelection | 0.60 | 0.65 | 0.60 | 0.61 |
| DTC | OneSidedSelection | 0.47 | 0.47 | 0.47 | 0.46 |
| RFC | OneSidedSelection | 0.48 | 0.43 | 0.48 | 0.45 |
| LR | InstanceHardnessThreshold | 0.46 | 0.58 | 0.46 | 0.50 |
| DTC | InstanceHardnessThreshold | 0.37 | 0.61 | 0.37 | 0.41 |
| RFC | InstanceHardnessThreshold | 0.40 | 0.61 | 0.40 | 0.44 |
| LR | RandomOverSampler | 0.63 | 0.68 | 0.63 | 0.64 |
| DTC | RandomOverSampler | 0.60 | 0.64 | 0.60 | 0.61 |
| RFC | RandomOverSampler | 0.61 | 0.64 | 0.61 | 0.61 |
| LR | SMOTE | 0.63 | 0.68 | 0.63 | 0.64 |
| DTC | SMOTE | 0.62 | 0.64 | 0.63 | 0.63 |
| RFC | SMOTE | 0.63 | 0.64 | 0.63 | 0.63 |
| LR | SMOTEborderline | 0.63 | 0.68 | 0.63 | 0.65 |
| DTC | SMOTEborderline | 0.63 | 0.64 | 0.63 | 0.63 |
| RFC | SMOTEborderline | 0.62 | 0.63 | 0.62 | 0.62 |
| LR | SMOTEborderline2 | 0.63 | 0.68 | 0.63 | 0.64 |
| DTC | SMOTEborderline2 | 0.63 | 0.64 | 0.63 | 0.63 |
| RFC | SMOTEborderline2 | 0.62 | 0.64 | 0.62 | 0.62 |
| LR | SMOTETomek | 0.63 | 0.68 | 0.63 | 0.65 |
| DTC | SMOTETomek | 0.63 | 0.64 | 0.63 | 0.63 |
| RFC | SMOTETomek | 0.63 | 0.65 | 0.63 | 0.63 |
| LR | SMOTEENN | 0.50 | 0.63 | 0.50 | 0.52 |
| DTC | SMOTEENN | 0.42 | 0.65 | 0.42 | 0.45 |
| RFC | SMOTEENN | 0.44 | 0.63 | 0.44 | 0.46 |
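In most rows above, recall is equal or nearly equal to accuracy. That is what a support-weighted average over classes produces, since weighted recall reduces to accuracy by construction. The page does not state which averaging was used, so the following sketch is only one plausible reading. The function `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1 over all classes."""
    total = len(y_true)
    support = Counter(y_true)    # number of true instances per class
    predicted = Counter(y_pred)  # number of predictions per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives

    scores = {"accuracy": sum(hits.values()) / total,
              "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n_true in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / total  # each class counts in proportion to its support
        scores["precision"] += weight * p
        scores["recall"] += weight * r
        scores["f1"] += weight * f
    return scores


# Hypothetical toy labels, not taken from the corpus.
y_true = ["Result", "Result", "Result", "Method", "Method", "Fact"]
y_pred = ["Result", "Result", "Method", "Method", "Fact", "Fact"]
scores = weighted_scores(y_true, y_pred)
print(scores)
```

Under this averaging, precision can still differ from accuracy, and the weighted F1 is not guaranteed to lie between weighted precision and weighted recall.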

Copyright information

© 2018 Springer Nature Switzerland AG

Cite this paper

Cox, J., Harper, C.A., de Waard, A. (2018). Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles. In: González-Beltrán, A., Osborne, F., Peroni, S., Vahdati, S. (eds) Semantics, Analytics, Visualization. SAVE-SD 2017, SAVE-SD 2018. Lecture Notes in Computer Science, vol 10959. Springer, Cham. https://doi.org/10.1007/978-3-030-01379-0_7

  • DOI: https://doi.org/10.1007/978-3-030-01379-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01378-3

  • Online ISBN: 978-3-030-01379-0

  • eBook Packages: Computer Science, Computer Science (R0)
