Abstract
Supervised machine learning methods depend heavily on the quality of the training dataset and the underlying model. In particular, neural network models, which have shown great success on natural language problems, require a large dataset to learn their vast number of parameters. However, it is not always easy to build a large labelled dataset. For example, due to the complex nature of tweets and the manual labour involved, it is hard to create a large Twitter dataset labelled for misogyny. In this paper, we propose to regularise a long short-term memory (LSTM) classifier using a pretrained LSTM-based language model (LM) to build an accurate classification model from a small training set. We give transfer learning (TL) a Bayesian interpretation and show that TL can be viewed as an uncertainty regularisation technique in Bayesian inference. We show that an LM pretrained on a sequence of datasets, ranging from general to task-specific domains, can effectively regularise an LSTM classifier when only a small training dataset is available. Empirical analysis with two small Twitter datasets reveals that an LSTM model trained in this way can outperform state-of-the-art classification models.
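To make the regularisation idea concrete, here is a minimal sketch, not the authors' implementation, of one common reading of "TL as uncertainty regularisation": a Gaussian prior centred on the pretrained LM weights becomes, under MAP estimation, an L2 penalty that pulls the classifier's shared parameters toward the LM rather than toward zero. The module names (`embedding`, `lstm`, `head`), dimensions, and the penalty `weight` are illustrative assumptions.

```python
# Hedged sketch: regularising an LSTM classifier toward pretrained LM weights.
# A Gaussian prior centred at the LM weights gives, under MAP estimation, an
# L2 penalty ||theta - theta_LM||^2 in place of the usual ||theta||^2.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)   # task-specific layer

    def forward(self, x):                              # x: (batch, seq_len) token ids
        _, (h, _) = self.lstm(self.embedding(x))
        return self.head(h[-1])                        # logits from last hidden state

def lm_anchored_penalty(model, lm_state, weight=1e-3):
    """L2 penalty anchoring parameters shared with the LM to their pretrained values."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in lm_state:                           # embedding + LSTM weights only
            penalty = penalty + (p - lm_state[name].detach()).pow(2).sum()
    return weight * penalty

# Usage inside a training step (lm_state is the fine-tuned LM's state_dict,
# assumed here to use the same parameter names as the classifier):
#   loss = nn.functional.cross_entropy(model(batch_x), batch_y) \
#          + lm_anchored_penalty(model, lm_state)
```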



References
Ahluwalia R, Soni H, Callow E, Nascimento A, De Cock M (2018) Detecting hate speech against women in English tweets. EVALITA Eval NLP Speech Tools Ital 12:194
Amnesty International (2018) Toxic twitter—a toxic place for women. https://bit.ly/2FZYQhV
Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, international world wide web conferences steering committee, pp 759–760
Bartlett J, Norrie R, Patel S, Rumpel R, Wibberley S (2014) Misogyny on twitter. Demos. Retrieved from analysis and policy observatory website https://apo.org.au/node/39610
Bashar MA, Nayak R, Suzor N, Weir B (2018) Misogynistic tweet detection: modelling CNN with small datasets. In: The 16th Australasian data mining conference
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network. In: International conference on machine learning, pp 1613–1622
Bouchard G, Triggs B (2004) The tradeoff between generative and discriminative classifiers. In: 16th IASC international symposium on computational statistics (COMPSTAT’04), pp 721–728
Bradbury J, Merity S, Xiong C, Socher R (2016) Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 785–794
Dai AM, Le QV (2015) Semi-supervised sequence learning. In: Advances in neural information processing systems, pp 3079–3087
Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Downey A (2012) Think Bayes: Bayesian statistics made simple. Green Tea Press, Needham
Dragiewicz M, Burgess J, Matamoros-Fernández A, Salter M, Suzor NP, Woodlock D, Harris B (2018) Technology facilitated coercive control: domestic violence and the competing roles of digital media platforms. Fem Med Stud 18:1–17
Fadaee M, Bisazza A, Monz C (2017) Data augmentation for low-resource neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers), vol 2, pp 567–573
Fersini E, Nozza D, Rosso P (2018) Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). In: Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA’18), Turin, Italy
Gal Y (2016) Uncertainty in deep learning. University of Cambridge, Cambridge
Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in neural information processing systems, pp 1019–1027
Gitari ND, Zuping Z, Damien H, Long J (2015) A lexicon-based approach for hate speech detection. Int J Multimed Ubiquitous Eng 10(4):215–230
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hoerl AE, Kennard RW (1970) Ridge regression: applications to nonorthogonal problems. Technometrics 12(1):69–82
Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers), vol 1, pp 328–339
Jozefowicz R, Vinyals O, Schuster M, Shazeer N, Wu Y (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1746–1751
Kwok I, Wang Y (2013) Locate the hate: detecting tweets against blacks. In: Twenty-seventh AAAI conference on artificial intelligence
Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, pp 556–562
Lewis DD (1998) Naive (Bayes) at forty: the independence assumption in information retrieval. In: European conference on machine learning. Springer, pp 4–15
Li Y, Algarni A, Zhong N (2010) Mining positive and negative patterns for relevance feature discovery. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Washington, pp 753–762
Li Y, Gal Y (2017) Dropout inference in bayesian neural networks with alpha-divergences. In: Proceedings of the 34th international conference on machine learning, JMLR.org, vol 70, pp 2052–2061
Liaw A, Wiener M et al (2002) Classification and regression by randomForest. R News 2(3):18–22
Liu P, Li W, Zou L (2019) NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In: Proceedings of the 13th international workshop on semantic evaluation, pp 87–91
Logeswaran L, Lee H (2018) An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893
Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. Association for Computational Linguistics, vol 1, pp 142–150
MacKay DJ (1992) A practical bayesian framework for backpropagation networks. Neural Comput 4(3):448–472
McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med 22(3):276–282
Melis G, Dyer C, Blunsom P (2017) On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589
Merity S, Keskar NS, Socher R (2017) Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182
Merity S, Xiong C, Bradbury J, Socher R (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843
Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Molina-González MD, Plaza-del Arco FM, Martín-Valdivia M, Ureña López L (2019) Ensemble learning to detect aggressiveness in Mexican Spanish tweets. In: Proceedings of the first workshop for Iberian languages evaluation forum (IberLEF 2019), CEUR WS proceedings
Pitsilis GK, Ramampiaro H, Langseth H (2018) Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding with unsupervised learning. Technical report, OpenAI
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533
Sharif Razavian A, Azizpour H, Sullivan J, Carlsson S (2014) CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 806–813
Silva L, Mondal M, Correa D, Benevenuto F, Weber I (2016) Analyzing the targets of hate in online social media. In: Tenth International AAAI conference on web and social media
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843–852
Sundermeyer M, Schlüter R, Ney H (2012) LSTM neural networks for language modeling. In: Thirteenth annual conference of the international speech communication association
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112
Suzor N, Van Geelen T, Myers West S (2018) Evaluating the legitimacy of platform governance: a review of research and a shared research agenda. Int Commun Gaz 80(4):385–400
The Online Hate Index: Innovation Brief (2018) Technical report, the Anti-Defamation League’s Center for Technology and Society. https://www.adl.org/media/10894/download
Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 384–394
Wang B, Wang A, Chen F, Wang Y, Kuo C-CJ (2019) Evaluating word embedding models: methods and experimental results. APSIPA Trans Signal Inf Process 8:e19
Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers), vol 2, pp 707–712
Wang W, Chen L, Thirunarayan K, Sheth AP (2014) Cursing in english on twitter. In: Proceedings of the 17th ACM conference on computer supported cooperative work & social computing. ACM, pp 415–425
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10(Feb):207–244
Xiang G, Fan B, Wang L, Hong J, Rose C (2012) Detecting offensive tweets via topical feature discovery over a large scale twitter corpus. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, pp 1980–1984
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer, pp 818–833
Zhang KW, Bowman SR (2018) Language modeling teaches you more syntax than translation does: lessons learned through auxiliary task analysis. arXiv preprint arXiv:1809.10040
Zhang Z, Luo L (2019) Hate speech detection: a solved problem? The challenging case of long tail on twitter. Semant Web 10(5):925–945
Acknowledgements
This research was partially supported by the QUT IFE Catapult fund. Suzor is the recipient of an Australian Research Council DECRA Fellowship (project number DE160101542).
Appendix A: Description of evaluation measures
- True Positive (TP): instances the model classifies as positive that actually are positive.
- True Negative (TN): instances the model classifies as negative that actually are negative.
- False Positive (FP): instances the model classifies as positive that actually are negative.
- False Negative (FN): instances the model classifies as negative that actually are positive.
- Accuracy (Ac): the percentage of correctly classified instances, calculated as \(\frac{\hbox {TP} + \hbox {TN}}{\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN}}\).
- Precision (Pr): a model’s ability to return only relevant instances, calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FP}}\).
- Recall (Re): a model’s ability to identify all relevant instances, calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FN}}\).
- \(F_1\) Score (\(F_1\)): a single metric combining precision and recall via their harmonic mean, calculated as \(2 \times \frac{\hbox {precision} \times \hbox {recall}}{\hbox {precision} + \hbox {recall}}\).
- Cohen Kappa (CK): Cohen’s kappa score measures inter-rater and intra-rater reliability for categorical items [37]. It is calculated as \(\frac{\hbox {OA}-\hbox {AC}}{1-\hbox {AC}}\), where OA is the relative observed agreement between predicted and actual labels and AC is the probability of agreement by chance.
- Area Under Curve (AUC): the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the model’s classification threshold varies. AUC summarises the overall performance of a classification model across all thresholds; a code sketch illustrating these measures follows this list.
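The sketch below computes these measures directly from the confusion-matrix counts, mirroring the formulas above. The toy labels and scores are invented for illustration; only the AUC is delegated to scikit-learn's `roc_auc_score`, since it integrates over all thresholds.

```python
# Illustrative computation of the evaluation measures from binary predictions.
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels (toy data)
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # predicted labels (thresholded)
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]    # model scores, used for AUC

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
n  = len(y_true)
oa = (tp + tn) / n                                              # observed agreement
ac = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2   # chance agreement
kappa = (oa - ac) / (1 - ac)

auc = roc_auc_score(y_true, y_score)
print(accuracy, precision, recall, f1, kappa, auc)
```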
Cite this article
Bashar, M.A., Nayak, R. & Suzor, N. Regularising LSTM classifier by transfer learning for detecting misogynistic tweets with small training set. Knowl Inf Syst 62, 4029–4054 (2020). https://doi.org/10.1007/s10115-020-01481-0