Abstract
To define salient rhetorical elements in scholarly text, we have earlier defined a set of Discourse Segment Types: semantically defined spans of discourse at the level of a clause with a single rhetorical purpose, such as Hypothesis, Method or Result. In this paper, we use machine learning methods to predict these Discourse Segment Types in a corpus of biomedical research papers. The initial experiment used features related to verb type and form, obtaining F-scores ranging from 0.41 to 0.65. To improve our results, we explored a variety of methods for balancing classes before applying classification algorithms. We also performed an ablation study and a stepwise approach for feature selection. Through these feature selection processes, we were able to reduce our 37 features to the 9 most informative ones, while maintaining F1 scores in the range of 0.63–0.65. Next, we performed an experiment with a reduced set of target classes. Using only verb tense features with logistic regression, a decision tree classifier and a random forest classifier, we predicted whether a segment was of type Result/Method or Fact/Implication, with F1 scores above 0.8. Interestingly, the findings from this machine learning approach are in line with a reader experiment, which found a correlation between verb tense and a biomedical reader’s interpretation of discourse segment type. This suggests that experimental and concept-centric discourse in biology texts can be distinguished by humans or machines, using verb tense as a key feature.
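
As an illustration of the ablation idea mentioned in the abstract, the following is a minimal sketch: each feature is left out in turn and the resulting drop in score is recorded. The function names, the feature names and the `TOY_WEIGHTS` values are hypothetical stand-ins; in practice `score_fn` would train a classifier on the given features and return its F1 score. This is not the authors' implementation.

```python
from typing import Callable, Dict, Sequence


def ablation_study(
    features: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Return, for each feature, the score lost when that feature is left out."""
    baseline = score_fn(features)
    drops = {}
    for held_out in features:
        remaining = [f for f in features if f != held_out]
        drops[held_out] = baseline - score_fn(remaining)
    return drops


# Hypothetical stand-in for "train a classifier and return its F1 score".
# The weights are invented purely so the example runs without a dataset.
TOY_WEIGHTS = {"past": 0.30, "present": 0.25, "modal": 0.05, "first_person": 0.0}


def toy_score(features: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in features)


drops = ablation_study(list(TOY_WEIGHTS), toy_score)
# Features whose removal costs the most score are treated as the most informative.
ranked = sorted(drops, key=drops.get, reverse=True)
print(ranked)
```

Features with a drop close to zero are candidates for removal, which is one way a larger feature set could be reduced to a smaller one without losing much performance.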
References
Burns, G.A.P.C., Dasigi, P., de Waard, A., Hovy, E.H.: Automated detection of discourse segment and experimental types from the text of cancer pathway results sections. Database 2016 (2016). baw122. https://doi.org/10.1093/database/baw122
Dasigi, P., Burns, G.A.P.C., Hovy, E.H., de Waard, A.: Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks. arXiv preprint arXiv:1702.05398 (2017). https://arxiv.org/abs/1702.05398
de Waard, A.: Manually curated dataset of papers into segments and DSTs: “Discourse Segment Type vs. Linguistic Features”. Mendeley Data, vol. 3 (2017). http://dx.doi.org/10.17632/4bh33fdx4v.3
de Waard, A., Pander Maat, H.: Verb form indicates discourse segment type in biological research papers: experimental evidence. J. Engl. Acad. Purp. 11(4), 357–366 (2012)
de Waard, A., Buitelaar, P., Eigner, T.: Identifying the epistemic value of discourse segments in biology texts. In: Bunt, H., Petukhova, V., Wubben, S. (eds.) Proceedings of the Eighth International Conference on Computational Semantics (IWCS-8 2009), pp. 351–354. Association for Computational Linguistics, Stroudsburg (2009)
de Waard, A.: Realm traversal in biological discourse: from model to experiment and back again. In: Multidisciplinary Perspectives on Signalling Text Organisation, MAD 2010, Moissac, 17–20 March 2010, p. 136 (2010). https://hal.archives-ouvertes.fr/hal-01391515/document#page=139
de Waard, A., Pander Maat, H.: A classification of research verbs to facilitate discourse segment identification in biological text. In: Proceedings from the Interdisciplinary Workshop on Verbs. The Identification and Representation of Verb Features, Pisa, Italy (2010). http://linguistica.sns.it/Workshop_verb/papers/de%20Waard_verb2010_submission_69.pdf
Elhassan, T., Aljurf, M., Al-Mohanna, F., Shoukri, M.: Classification of imbalance data using tomek link (T-Link) combined with random under-sampling (RUS) as a data reduction Method. J. Informat. Data Min. 1(2), 1–12 (2016). http://datamining.imedpub.com/classification-of-imbalance-data-using-tomek-linktlink-combined-with-random-undersampling-rus-as-a-data-reduction-method.pdf
Liakata, M., Thomson, P., de Waard, A., et al.: A three-way perspective on scientific discourse annotation for knowledge extraction. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 37–46, Jeju, Republic of Korea, 12 July 2012 (2012). http://www.aclweb.org/anthology/W12-4305
Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017). http://jmlr.org/papers/v18/16-365
Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. Proc. VLDB Endow. 11(3), 269–282 (2017)
de Waard, A., Pander Maat, H.: Epistemic modality and knowledge attribution in scientific discourse: a taxonomy of types and overview of features. In: Proceedings of the Workshop on Detecting Structure in Scholarly Discourse (ACL 2012), pp. 47–55. Association for Computational Linguistics, Stroudsburg, PA, USA (2012). https://dl.acm.org/citation.cfm?id=2391180
Voorhoeve, P.M., et al.: A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors. Cell 124(6), 1169–1181 (2006). https://www.ncbi.nlm.nih.gov/pubmed/16564011
Appendices
Appendix 5.1. Starting Feature List and Descriptions
Feature class | Feature | Included in experiment # |
---|---|---|
Frequently Used Verb | Top 10 Verb | 1 |
Frequently Used Verb | ‘Show’ Verb | 1, 3 |
Verb Tense | Future | 1, 3, 4 |
Verb Tense | Gerund | 1, 4 |
Verb Tense | Past | 1, 2, 3, 4 |
Verb Tense | Past participle | 1, 4 |
Verb Tense | Past perfect | 1, 4 |
Verb Tense | Past progressive | 1, 3, 4 |
Verb Tense | Present | 1, 2, 4 |
Verb Tense | Present perfect | 1, 4 |
Verb Tense | Present progressive | 1, 4 |
Verb Tense | To-infinitive | 1, 2, 3, 4 |
Verb Class | Cause and effect | 1 |
Verb Class | Change and growth | 1 |
Verb Class | Discourse verb | 1 |
Verb Class | Interpretation | 1, 2, 3 |
Verb Class | Investigation | 1, 2, 3 |
Verb Class | None | 1 |
Verb Class | Observation | 1, 3 |
Verb Class | Prediction | 1 |
Verb Class | Procedure | 1, 2, 3 |
Verb Class | Properties | 1, 3 |
Modality Marker | Modal | 1, 2, 3 |
Modality Marker | Verb class interpretation | 1, 2, 3 |
Modality Marker | Ruled by verb class interpretation | 1, 2, 3 |
Modality Marker | Reference internal | 1 |
Modality Marker | Reference external | 1 |
Modality Marker | First person | 1 |
Modality Marker | Modal significant_ly | 1 |
Modality Marker | Possible_ility_ly | 1 |
Modality Marker | Potential_ly | 1 |
Modality Marker | UN_Likely | 1 |
Modality Marker | Sum_Adverbs_YesNO | 1 |
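
The table above can be read as a lookup from feature to the experiments that use it. The sketch below encodes the rows as data and selects the feature subset for a given experiment number; the names `FEATURES` and `features_for` are illustrative and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# (feature class, feature) -> experiments that include it, copied from the table.
FEATURES: Dict[Tuple[str, str], Set[int]] = {
    ("Frequently Used Verb", "Top 10 Verb"): {1},
    ("Frequently Used Verb", "'Show' Verb"): {1, 3},
    ("Verb Tense", "Future"): {1, 3, 4},
    ("Verb Tense", "Gerund"): {1, 4},
    ("Verb Tense", "Past"): {1, 2, 3, 4},
    ("Verb Tense", "Past participle"): {1, 4},
    ("Verb Tense", "Past perfect"): {1, 4},
    ("Verb Tense", "Past progressive"): {1, 3, 4},
    ("Verb Tense", "Present"): {1, 2, 4},
    ("Verb Tense", "Present perfect"): {1, 4},
    ("Verb Tense", "Present progressive"): {1, 4},
    ("Verb Tense", "To-infinitive"): {1, 2, 3, 4},
    ("Verb Class", "Cause and effect"): {1},
    ("Verb Class", "Change and growth"): {1},
    ("Verb Class", "Discourse verb"): {1},
    ("Verb Class", "Interpretation"): {1, 2, 3},
    ("Verb Class", "Investigation"): {1, 2, 3},
    ("Verb Class", "None"): {1},
    ("Verb Class", "Observation"): {1, 3},
    ("Verb Class", "Prediction"): {1},
    ("Verb Class", "Procedure"): {1, 2, 3},
    ("Verb Class", "Properties"): {1, 3},
    ("Modality Marker", "Modal"): {1, 2, 3},
    ("Modality Marker", "Verb class interpretation"): {1, 2, 3},
    ("Modality Marker", "Ruled by verb class interpretation"): {1, 2, 3},
    ("Modality Marker", "Reference internal"): {1},
    ("Modality Marker", "Reference external"): {1},
    ("Modality Marker", "First person"): {1},
    ("Modality Marker", "Modal significant_ly"): {1},
    ("Modality Marker", "Possible_ility_ly"): {1},
    ("Modality Marker", "Potential_ly"): {1},
    ("Modality Marker", "UN_Likely"): {1},
    ("Modality Marker", "Sum_Adverbs_YesNO"): {1},
}


def features_for(experiment: int) -> List[Tuple[str, str]]:
    """Return the (class, feature) pairs the table marks for one experiment."""
    return [key for key, experiments in FEATURES.items() if experiment in experiments]


for number in (1, 2, 3, 4):
    print(number, len(features_for(number)))
```

Read this way, experiment 2 uses nine features, which appears consistent with the nine-feature set mentioned in the abstract, and experiment 4 uses only Verb Tense features.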
Appendix 5.2. Description of Sampling Methods Used
Sampling method | Description | Type |
---|---|---|
RandomUnderSampler | Undersamples the majority classes by randomly picking samples | Undersampler |
Tomeklinks | Undersamples the majority classes by removing Tomek’s links | Undersampler |
ClusterCentroids | Undersamples the majority classes by replacing a cluster of the majority samples by the cluster centroid of a KMeans algorithm | Undersampler |
CondensedNearestNeighbor | Undersamples the majority classes using the condensed nearest neighbor method | Undersampler |
OneSidedSelection | Uses one-sided selection method on majority classes | Undersampler |
InstanceHardnessThreshold | Samples with lower probabilities are removed from the majority class | Undersampler |
RandomOverSampler | Oversamples the minority classes by randomly duplicating existing samples | Oversampler |
SMOTE | Synthetic Minority Oversampling Technique; generates new samples of minority class by interpolation | Oversampler |
SMOTEborderline | Generates new samples of minority class specific to the borders between two classes | Oversampler |
SMOTEborderline2 | Generates new samples of minority class specific to the borders between two classes | Oversampler |
SMOTETomek | Combines use of SMOTE on minority class and Tomek Links on majority class | Over and undersampler |
SMOTEENN | Combines use of SMOTE on minority class and Edited Nearest Neighbors on majority class | Over and undersampler |
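
To make two of the ideas in the table concrete, the following is a simplified, standard-library sketch of random undersampling and of SMOTE-style interpolation. It is not the imbalanced-learn implementation used for the samplers named above. The function names, the toy points and the class labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def random_undersample(X: Sequence[Point], y: Sequence[str], rng: random.Random):
    """Randomly keep only as many samples per class as the smallest class has."""
    by_class: Dict[str, List[Point]] = defaultdict(list)
    for point, label in zip(X, y):
        by_class[label].append(point)
    target = min(len(points) for points in by_class.values())
    X_out, y_out = [], []
    for label, points in by_class.items():
        for point in rng.sample(points, target):
            X_out.append(point)
            y_out.append(label)
    return X_out, y_out


def smote_like(minority: Sequence[Point], n_new: int, k: int, rng: random.Random) -> List[Point]:
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the other minority samples by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy data: six majority-class points and two minority-class points.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.1),
     (1.0, 1.0), (0.9, 0.8)]
y = ["Result"] * 6 + ["Hypothesis"] * 2

rng = random.Random(0)
X_under, y_under = random_undersample(X, y, rng)
minority = [p for p, label in zip(X, y) if label == "Hypothesis"]
new_points = smote_like(minority, n_new=4, k=1, rng=rng)
print(len(X_under), len(new_points))
```

Undersampling discards majority-class data, while the interpolation step adds synthetic minority-class points; the combined methods in the table apply one of each.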
Appendix 5.3. Accuracy, Precision, Recall and F1 Scores of All 36 Models Tested
Classifier | Class balancer | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
LR | No Class Balancer | 0.62 | 0.68 | 0.63 | 0.64 |
DTC | No Class Balancer | 0.64 | 0.64 | 0.64 | 0.64 |
RFC | No Class Balancer | 0.64 | 0.65 | 0.65 | 0.64 |
LR | RandomUnderSampler | 0.58 | 0.64 | 0.58 | 0.59 |
DTC | RandomUnderSampler | 0.55 | 0.64 | 0.55 | 0.56 |
RFC | RandomUnderSampler | 0.57 | 0.63 | 0.56 | 0.57 |
LR | Tomeklinks | 0.63 | 0.68 | 0.63 | 0.64 |
DTC | Tomeklinks | 0.64 | 0.64 | 0.64 | 0.64 |
RFC | Tomeklinks | 0.64 | 0.64 | 0.64 | 0.64 |
LR | ClusterCentroids | 0.55 | 0.64 | 0.55 | 0.55 |
DTC | ClusterCentroids | 0.35 | 0.48 | 0.35 | 0.32 |
RFC | ClusterCentroids | 0.38 | 0.47 | 0.38 | 0.35 |
LR | CondensedNearestNeighbor | 0.62 | 0.67 | 0.62 | 0.62 |
DTC | CondensedNearestNeighbor | 0.53 | 0.59 | 0.53 | 0.53 |
RFC | CondensedNearestNeighbor | 0.55 | 0.60 | 0.55 | 0.55 |
LR | OneSidedSelection | 0.60 | 0.65 | 0.60 | 0.61 |
DTC | OneSidedSelection | 0.47 | 0.47 | 0.47 | 0.46 |
RFC | OneSidedSelection | 0.48 | 0.43 | 0.48 | 0.45 |
LR | InstanceHardnessThreshold | 0.46 | 0.58 | 0.46 | 0.50 |
DTC | InstanceHardnessThreshold | 0.37 | 0.61 | 0.37 | 0.41 |
RFC | InstanceHardnessThreshold | 0.40 | 0.61 | 0.40 | 0.44 |
LR | RandomOverSampler | 0.63 | 0.68 | 0.63 | 0.64 |
DTC | RandomOverSampler | 0.60 | 0.64 | 0.60 | 0.61 |
RFC | RandomOverSampler | 0.61 | 0.64 | 0.61 | 0.61 |
LR | SMOTE | 0.63 | 0.68 | 0.63 | 0.64 |
DTC | SMOTE | 0.62 | 0.64 | 0.63 | 0.63 |
RFC | SMOTE | 0.63 | 0.64 | 0.63 | 0.63 |
LR | SMOTEborderline | 0.63 | 0.68 | 0.63 | 0.65 |
DTC | SMOTEborderline | 0.63 | 0.64 | 0.63 | 0.63 |
RFC | SMOTEborderline | 0.62 | 0.63 | 0.62 | 0.62 |
LR | SMOTEborderline2 | 0.63 | 0.68 | 0.63 | 0.64 |
DTC | SMOTEborderline2 | 0.63 | 0.64 | 0.63 | 0.63 |
RFC | SMOTEborderline2 | 0.62 | 0.64 | 0.62 | 0.62 |
LR | SMOTETomek | 0.63 | 0.68 | 0.63 | 0.65 |
DTC | SMOTETomek | 0.63 | 0.64 | 0.63 | 0.63 |
RFC | SMOTETomek | 0.63 | 0.65 | 0.63 | 0.63 |
LR | SMOTEENN | 0.50 | 0.63 | 0.50 | 0.52 |
DTC | SMOTEENN | 0.42 | 0.65 | 0.42 | 0.45 |
RFC | SMOTEENN | 0.44 | 0.63 | 0.44 | 0.46 |
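
In most rows of the table, accuracy and recall coincide. That is what a support-weighted average of per-class recall produces, since it is mathematically equal to accuracy. Whether the scores above were computed with weighted averaging is an assumption; the sketch below only shows how such weighted scores can be derived from true and predicted labels. The function name and the toy labels are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall and F1 over all classes."""
    n = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, count in support.items():
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = count / n         # each class contributes in proportion to its support
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    totals["accuracy"] = sum(true_pos.values()) / n
    return totals


# Toy labels, invented for illustration.
y_true = ["Result", "Result", "Result", "Method", "Fact", "Fact"]
y_pred = ["Result", "Result", "Method", "Method", "Fact", "Result"]
scores = weighted_scores(y_true, y_pred)
print({name: round(value, 2) for name, value in scores.items()})
```

With this averaging, precision can exceed recall for the same model, as it does in many rows of the table, because the two are weighted by true class sizes but computed against different denominators.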
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Cox, J., Harper, C.A., de Waard, A. (2018). Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles. In: González-Beltrán, A., Osborne, F., Peroni, S., Vahdati, S. (eds) Semantics, Analytics, Visualization. SAVE-SD 2017, SAVE-SD 2018. Lecture Notes in Computer Science, vol 10959. Springer, Cham. https://doi.org/10.1007/978-3-030-01379-0_7
DOI: https://doi.org/10.1007/978-3-030-01379-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01378-3
Online ISBN: 978-3-030-01379-0
eBook Packages: Computer Science, Computer Science (R0)