Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles

Cox, Jessica; Harper, Corey A.; de Waard, Anita

doi:10.1007/978-3-030-01379-0_7

Conference paper
First Online: 31 October 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10959)

Abstract

To identify salient rhetorical elements in scholarly text, we previously defined a set of Discourse Segment Types: semantically defined spans of discourse at the level of a clause, each with a single rhetorical purpose, such as Hypothesis, Method or Result. In this paper, we use machine learning methods to predict these Discourse Segment Types in a corpus of biomedical research papers. The initial experiment used features related to verb type and form and obtained F-scores ranging from 0.41 to 0.65. To improve these results, we explored a variety of class-balancing methods before applying classification algorithms. We also performed an ablation study and a stepwise approach for feature selection. Through these feature selection processes, we reduced our 37 features to the 9 most informative ones while maintaining F1 scores in the range of 0.63–0.65. Next, we performed an experiment with a reduced set of target classes. Using only verb tense features with logistic regression, a decision tree classifier and a random forest classifier, we predicted whether a segment was a Result/Method or a Fact/Implication, with F1 scores above 0.8. Interestingly, the findings from this machine learning approach are in line with a reader experiment, which found a correlation between verb tense and a biomedical reader’s interpretation of discourse segment type. This suggests that experimental and concept-centric discourse in biology texts can be distinguished by humans or machines, using verb tense as a key feature.
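
The two feature selection strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the feature names and the toy scoring function are assumptions made for illustration. In practice `score` would train a classifier on the given feature subset and return a cross-validated F1.

```python
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[Sequence[str]], float]


def ablation_study(features: Sequence[str], score: Scorer) -> Dict[str, float]:
    """For each feature, report how much the score drops when it is removed."""
    baseline = score(features)
    drops = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        drops[feature] = baseline - score(reduced)
    return drops


def forward_stepwise(features: Sequence[str], score: Scorer,
                     min_gain: float = 1e-9) -> List[str]:
    """Greedily add the feature with the largest gain until nothing helps."""
    selected: List[str] = []
    remaining = list(features)
    best = score(selected)
    while remaining:
        gain, winner = max((score(selected + [f]) - best, f) for f in remaining)
        if gain <= min_gain:
            break
        selected.append(winner)
        remaining.remove(winner)
        best = score(selected)
    return selected


# Hypothetical stand-in for model evaluation: each feature adds a fixed,
# made-up amount to the score. These numbers are NOT results from the paper.
TOY_WEIGHTS = {"past": 0.30, "present": 0.25, "modal": 0.10, "first_person": 0.0}


def toy_score(subset: Sequence[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in subset)


drops = ablation_study(list(TOY_WEIGHTS), toy_score)
kept = forward_stepwise(list(TOY_WEIGHTS), toy_score)
print(drops)
print(kept)
```

With the toy scorer, the uninformative feature shows no drop under ablation and is never picked by the stepwise search. That is the pattern one would look for when pruning a feature list.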

References

Burns, G.A.P.C., Dasigi, P., de Waard, A., Hovy, E.H.: Automated detection of discourse segment and experimental types from the text of cancer pathway results sections. Database 2016 (2016). baw122. https://doi.org/10.1093/database/baw122
Dasigi, P., Burns, G.A.P.C., Hovy, E.H., de Waard, A.: Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks. arXiv preprint arXiv:1702.05398. https://arxiv.org/abs/1702.05398 (2017)
de Waard, A.: Manually curated dataset of papers into segments and DSTs: “Discourse Segment Type vs. Linguistic Features”. Mendeley Data, vol. 3 (2017). http://dx.doi.org/10.17632/4bh33fdx4v.3
de Waard, A., Pander Maat, H.: Verb form indicates discourse segment type in biological research papers: experimental evidence. J. Engl. Acad. Purp. 11(4), 357–366 (2012)
de Waard, A., Buitelaar, P., Eigner, T.: Identifying the epistemic value of discourse segments in biology texts. In: Bunt, H., Petukhova, V., Wubben, S. (eds.) Proceedings of the Eighth International Conference on Computational Semantics (IWCS-8 2009), pp. 351–354. Association for Computational Linguistics, Stroudsburg (2009)
de Waard, A.: Realm traversal in biological discourse: from model to experiment and back again. In: Multidisciplinary Perspectives on Signalling Text Organisation, MAD 2010, Moissac, 17–20 March 2010, p. 136 (2010). https://hal.archives-ouvertes.fr/hal-01391515/document#page=139
de Waard, A., Pander Maat, H.: A classification of research verbs to facilitate discourse segment identification in biological text. In: Proceedings from the Interdisciplinary Workshop on Verbs. The Identification and Representation of Verb Features, Pisa, Italy (2010). http://linguistica.sns.it/Workshop_verb/papers/de%20Waard_verb2010_submission_69.pdf
Elhassan, T., Aljurf, M., Al-Mohanna, F., Shoukri, M.: Classification of imbalance data using tomek link (T-Link) combined with random under-sampling (RUS) as a data reduction Method. J. Informat. Data Min. 1(2), 1–12 (2016). http://datamining.imedpub.com/classification-of-imbalance-data-using-tomek-linktlink-combined-with-random-undersampling-rus-as-a-data-reduction-method.pdf
Liakata, M., Thomson, P., de Waard, A., et al.: A three-way perspective on scientific discourse annotation for knowledge extraction. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 37–46, Jeju, Republic of Korea, 12 July 2012 (2012). http://www.aclweb.org/anthology/W12-4305
Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017). http://jmlr.org/papers/v18/16-365
Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: rapid training data creation with weak supervision. Proc. VLDB Endow. 11(3), 269–282 (2017)
de Waard, A., Pander Maat, H.: Epistemic modality and knowledge attribution in scientific discourse: a taxonomy of types and overview of features. In: Proceedings of the Workshop on Detecting Structure in Scholarly Discourse (ACL 2012), pp. 47–55. Association for Computational Linguistics, Stroudsburg, PA, USA (2012). https://dl.acm.org/citation.cfm?id=2391180
Voorhoeve, P.M., et al.: A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors. Cell 124(6), 1169–1181 (2006). https://www.ncbi.nlm.nih.gov/pubmed/16564011

Author information

Authors and Affiliations

Elsevier, Amsterdam, Netherlands
Jessica Cox, Corey A. Harper & Anita de Waard

Corresponding author

Correspondence to Corey A. Harper.

Editor information

Editors and Affiliations

Oxford e-Research Centre, University of Oxford, Oxford, UK
Alejandra González-Beltrán
Walton Hall, KMi, Open University, Milton Keynes, UK
Francesco Osborne
Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni
University of Bonn, Bonn, Germany
Sahar Vahdati

Appendices

Appendix 5.1. Starting Feature List and Descriptions

Feature class	Feature	Included in experiment #
Frequently Used Verb	Top 10 Verb	1
Frequently Used Verb	‘Show’ Verb	1, 3
Verb Tense	Future	1, 3, 4
Verb Tense	Gerund	1, 4
Verb Tense	Past	1, 2, 3, 4
Verb Tense	Past participle	1, 4
Verb Tense	Past perfect	1, 4
Verb Tense	Past progressive	1, 3, 4
Verb Tense	Present	1, 2, 4
Verb Tense	Present perfect	1, 4
Verb Tense	Present progressive	1, 4
Verb Tense	To-infinitive	1, 2, 3, 4
Verb Class	Cause and effect	1
Verb Class	Change and growth	1
Verb Class	Discourse verb	1
Verb Class	Interpretation	1, 2, 3
Verb Class	Investigation	1, 2, 3
Verb Class	None	1
Verb Class	Observation	1, 3
Verb Class	Prediction	1
Verb Class	Procedure	1, 2, 3
Verb Class	Properties	1, 3
Modality Marker	Modal	1, 2, 3
Modality Marker	Verb class interpretation	1, 2, 3
Modality Marker	Ruled by verb class interpretation	1, 2, 3
Modality Marker	Reference internal	1
Modality Marker	Reference external	1
Modality Marker	First person	1
Modality Marker	Modal significant_ly	1
Modality Marker	Possible_ility_ly	1
Modality Marker	Potential_ly	1
Modality Marker	UN_Likely	1
Modality Marker	Sum_Adverbs_YesNO	1
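
The last column of the table above can be read as a lookup from experiment number to feature subset. The sketch below transcribes the rows and filters them by experiment. The string layout and helper names (`load_features`, `features_for`) are illustrative assumptions, not part of the original study.

```python
from typing import List, Set, Tuple

# Rows transcribed from Appendix 5.1 as "feature class;feature;experiments".
APPENDIX_5_1 = """\
Frequently Used Verb;Top 10 Verb;1
Frequently Used Verb;'Show' Verb;1,3
Verb Tense;Future;1,3,4
Verb Tense;Gerund;1,4
Verb Tense;Past;1,2,3,4
Verb Tense;Past participle;1,4
Verb Tense;Past perfect;1,4
Verb Tense;Past progressive;1,3,4
Verb Tense;Present;1,2,4
Verb Tense;Present perfect;1,4
Verb Tense;Present progressive;1,4
Verb Tense;To-infinitive;1,2,3,4
Verb Class;Cause and effect;1
Verb Class;Change and growth;1
Verb Class;Discourse verb;1
Verb Class;Interpretation;1,2,3
Verb Class;Investigation;1,2,3
Verb Class;None;1
Verb Class;Observation;1,3
Verb Class;Prediction;1
Verb Class;Procedure;1,2,3
Verb Class;Properties;1,3
Modality Marker;Modal;1,2,3
Modality Marker;Verb class interpretation;1,2,3
Modality Marker;Ruled by verb class interpretation;1,2,3
Modality Marker;Reference internal;1
Modality Marker;Reference external;1
Modality Marker;First person;1
Modality Marker;Modal significant_ly;1
Modality Marker;Possible_ility_ly;1
Modality Marker;Potential_ly;1
Modality Marker;UN_Likely;1
Modality Marker;Sum_Adverbs_YesNO;1
"""

Row = Tuple[str, str, Set[int]]


def load_features(text: str) -> List[Row]:
    """Parse 'class;feature;1,2,3' lines into (class, feature, {experiments})."""
    rows = []
    for line in text.strip().splitlines():
        feature_class, feature, experiments = line.split(";")
        rows.append((feature_class, feature,
                     {int(n) for n in experiments.split(",")}))
    return rows


def features_for(experiment: int, rows: List[Row]) -> List[Tuple[str, str]]:
    """Return (class, feature) pairs marked as included in the experiment."""
    return [(cls, name) for cls, name, exps in rows if experiment in exps]


ROWS = load_features(APPENDIX_5_1)
for experiment in (1, 2, 3, 4):
    print(experiment, len(features_for(experiment, ROWS)))
```

Read this way, experiment 2 selects nine rows, which appears to correspond to the nine most informative features mentioned in the abstract. Experiment 4 selects only Verb Tense rows, consistent with the verb-tense-only experiment.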

Appendix 5.2. Description of Sampling Methods Used

Sampling method	Description	Type
RandomUnderSampler	Undersamples the majority classes by randomly picking samples	Undersampler
Tomeklinks	Undersamples the majority classes by removing Tomek’s links	Undersampler
ClusterCentroids	Undersamples the majority classes by replacing a cluster of majority samples with the cluster centroid of a KMeans algorithm	Undersampler
CondensedNearestNeighbor	Undersamples the majority classes using the condensed nearest neighbor method	Undersampler
OneSidedSelection	Uses one-sided selection method on majority classes	Undersampler
InstanceHardnessThreshold	Samples with lower probabilities are removed from the majority class	Undersampler
RandomOverSampler	Randomly generates new samples from the minority classes	Oversampler
SMOTE	Synthetic Minority Oversampling Technique; generates new samples of minority class by interpolation	Oversampler
SMOTEborderline	Generates new samples of minority class specific to the borders between two classes	Oversampler
SMOTEborderline2	Generates new samples of minority class specific to the borders between two classes	Oversampler
SMOTETomek	Combines use of SMOTE on minority class and Tomek Links on majority class	Over and undersampler
SMOTEENN	Combines use of SMOTE on minority class and Edited Nearest Neighbors on majority class	Over and undersampler
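
To make two of the descriptions above concrete, here is a minimal pure-Python sketch of random undersampling and of SMOTE-style interpolation. It is a simplified illustration, not the imbalanced-learn implementation. The function names are hypothetical and the toy data are synthetic, not drawn from the corpus.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Point = List[float]


def random_undersample(X: Sequence[Point], y: Sequence[str],
                       seed: int = 0) -> Tuple[List[Point], List[str]]:
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n_min):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smote_like(minority: Sequence[Point], n_new: int, k: int = 2,
               seed: int = 0) -> List[Point]:
    """Interpolate between a minority sample and one of its k nearest
    minority neighbours. Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to other
        synthetic.append([b + u * (o - b) for b, o in zip(base, other)])
    return synthetic


# Synthetic toy data: four majority points and two minority points.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = ["Result"] * 4 + ["Hypothesis"] * 2

X_bal, y_bal = random_undersample(X, y)
new_points = smote_like([p for p, lab in zip(X, y) if lab == "Hypothesis"],
                        n_new=3, k=1)
print(sorted(y_bal))
print(new_points)
```

Undersampling discards majority samples, while the interpolation step creates new minority points on the line segment between existing ones. Note that interpolating binary indicator features, such as verb tense flags, yields fractional values.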

Appendix 5.3. Accuracy, Precision, Recall and F1 Scores of All 36 Models Tested

Classifier	Class balancer	Accuracy	Precision	Recall	F1
LR	No Class Balancer	0.62	0.68	0.63	0.64
DTC	No Class Balancer	0.64	0.64	0.64	0.64
RFC	No Class Balancer	0.64	0.65	0.65	0.64
LR	RandomUnderSampler	0.58	0.64	0.58	0.59
DTC	RandomUnderSampler	0.55	0.64	0.55	0.56
RFC	RandomUnderSampler	0.57	0.63	0.56	0.57
LR	Tomeklinks	0.63	0.68	0.63	0.64
DTC	Tomeklinks	0.64	0.64	0.64	0.64
RFC	Tomeklinks	0.64	0.64	0.64	0.64
LR	ClusterCentroids	0.55	0.64	0.55	0.55
DTC	ClusterCentroids	0.35	0.48	0.35	0.32
RFC	ClusterCentroids	0.38	0.47	0.38	0.35
LR	CondensedNearestNeighbor	0.62	0.67	0.62	0.62
DTC	CondensedNearestNeighbor	0.53	0.59	0.53	0.53
RFC	CondensedNearestNeighbor	0.55	0.60	0.55	0.55
LR	OneSidedSelection	0.60	0.65	0.60	0.61
DTC	OneSidedSelection	0.47	0.47	0.47	0.46
RFC	OneSidedSelection	0.48	0.43	0.48	0.45
LR	InstanceHardnessThreshold	0.46	0.58	0.46	0.50
DTC	InstanceHardnessThreshold	0.37	0.61	0.37	0.41
RFC	InstanceHardnessThreshold	0.40	0.61	0.40	0.44
LR	RandomOverSampler	0.63	0.68	0.63	0.64
DTC	RandomOverSampler	0.60	0.64	0.60	0.61
RFC	RandomOverSampler	0.61	0.64	0.61	0.61
LR	SMOTE	0.63	0.68	0.63	0.64
DTC	SMOTE	0.62	0.64	0.63	0.63
RFC	SMOTE	0.63	0.64	0.63	0.63
LR	SMOTEborderline	0.63	0.68	0.63	0.65
DTC	SMOTEborderline	0.63	0.64	0.63	0.63
RFC	SMOTEborderline	0.62	0.63	0.62	0.62
LR	SMOTEborderline2	0.63	0.68	0.63	0.64
DTC	SMOTEborderline2	0.63	0.64	0.63	0.63
RFC	SMOTEborderline2	0.62	0.64	0.62	0.62
LR	SMOTETomek	0.63	0.68	0.63	0.65
DTC	SMOTETomek	0.63	0.64	0.63	0.63
RFC	SMOTETomek	0.63	0.65	0.63	0.63
LR	SMOTEENN	0.50	0.63	0.50	0.52
DTC	SMOTEENN	0.42	0.65	0.42	0.45
RFC	SMOTEENN	0.44	0.63	0.44	0.46
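
One way to read the table above is to compare each class balancer against the unbalanced baseline for the same classifier. The sketch below transcribes only the F1 column, in the order LR, DTC, RFC, and lists the balancers that beat the baseline. The dictionary layout and variable names are illustrative assumptions. The table has 39 rows, presumably the 36 models in the title (12 sampling methods × 3 classifiers) plus the three unbalanced baselines.

```python
from typing import Dict, List, Tuple

CLASSIFIERS = ("LR", "DTC", "RFC")

# F1 scores transcribed from Appendix 5.3, one (LR, DTC, RFC) tuple per balancer.
F1: Dict[str, Tuple[float, float, float]] = {
    "No Class Balancer": (0.64, 0.64, 0.64),
    "RandomUnderSampler": (0.59, 0.56, 0.57),
    "Tomeklinks": (0.64, 0.64, 0.64),
    "ClusterCentroids": (0.55, 0.32, 0.35),
    "CondensedNearestNeighbor": (0.62, 0.53, 0.55),
    "OneSidedSelection": (0.61, 0.46, 0.45),
    "InstanceHardnessThreshold": (0.50, 0.41, 0.44),
    "RandomOverSampler": (0.64, 0.61, 0.61),
    "SMOTE": (0.64, 0.63, 0.63),
    "SMOTEborderline": (0.65, 0.63, 0.62),
    "SMOTEborderline2": (0.64, 0.63, 0.62),
    "SMOTETomek": (0.65, 0.63, 0.63),
    "SMOTEENN": (0.52, 0.45, 0.46),
}

baseline = F1["No Class Balancer"]

# For each classifier, collect the balancers whose F1 exceeds the baseline.
improved: Dict[str, List[str]] = {name: [] for name in CLASSIFIERS}
for balancer, scores in F1.items():
    if balancer == "No Class Balancer":
        continue
    for name, score, base in zip(CLASSIFIERS, scores, baseline):
        if score > base:
            improved[name].append(balancer)

# The single lowest F1 in the table, as (score, balancer, classifier).
worst = min((score, balancer, name)
            for balancer, scores in F1.items()
            for name, score in zip(CLASSIFIERS, scores))

print(improved)
print(worst)
```

On these figures only logistic regression improves over its baseline, by 0.01 with SMOTEborderline and SMOTETomek. Several undersamplers lower F1 markedly for the tree-based classifiers.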

About this paper

Cite this paper

Cox, J., Harper, C.A., de Waard, A. (2018). Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles. In: González-Beltrán, A., Osborne, F., Peroni, S., Vahdati, S. (eds) Semantics, Analytics, Visualization. SAVE-SD 2017, SAVE-SD 2018. Lecture Notes in Computer Science, vol 10959. Springer, Cham. https://doi.org/10.1007/978-3-030-01379-0_7

DOI: https://doi.org/10.1007/978-3-030-01379-0_7
Published: 31 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01378-3
Online ISBN: 978-3-030-01379-0
eBook Packages: Computer Science, Computer Science (R0)