Supervised machine learning methods depend heavily on the quality of the training dataset and the underlying model. In particular, neural network models, which have shown great success on natural language problems, require a large dataset to learn their vast number of parameters. However, it is not always easy to build a large labelled dataset. For example, due to the complex nature of tweets and the manual labour involved, it is hard to create a large Twitter dataset labelled for misogyny. In this paper, we propose to regularise a long short-term memory (LSTM) classifier using a pre-trained LSTM-based language model (LM) to build an accurate classification model with a small training set. We explain transfer learning (TL) with a Bayesian interpretation and show that TL can be viewed as an uncertainty regularisation technique in Bayesian inference. We show that an LM pre-trained on a sequence of datasets, ordered from general to task-specific domains, can be used to regularise an LSTM classifier effectively when only a small training dataset is available. Empirical analysis with two small Twitter datasets reveals that an LSTM model trained in this way can outperform the state-of-the-art classification models.

Appendix A: Description of evaluation measures
True Positive (TP): True positives are instances the model classifies as positive that actually are positive.
True Negative (TN): True negatives are instances the model classifies as negative that actually are negative.
False Positive (FP): False positives are instances the model classifies as positive that actually are negative.
False Negative (FN): False negatives are instances the model classifies as negative that actually are positive.
Accuracy (Ac): It is the proportion of correctly classified instances, calculated as \(\frac{\hbox {TP} + \hbox {TN}}{\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN}}\).
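Precision (Pr): It measures a model’s ability to return only relevant instances, i.e. the fraction of instances classified as positive that actually are positive. It is calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FP}}\).
Recall (Re): It measures a model’s ability to identify all relevant instances, i.e. the fraction of actual positives that the model classifies as positive. It is calculated as \(\frac{\hbox {TP}}{\hbox {TP} + \hbox {FN}}\).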
\(F_1\) Score (\(F_1\)): A single metric that combines recall and precision using the harmonic mean. \(F_1\) Score is calculated as \(2 \times \frac{\hbox {precision} \times \hbox {recall}}{\hbox {precision} + \hbox {recall}}\).
Cohen Kappa (CK): Cohen’s kappa score is used to measure inter-rater and intra-rater reliability for categorical items [37]. It is calculated as \(\frac{\hbox {OA}-\hbox {AC}}{1-\hbox {AC}}\), where OA is the relative observed agreement between predicted labels and actual labels and AC is the probability of agreement by chance.
Area Under Curve (AUC): The area under the receiver operating characteristic (ROC) curve is called the area under the curve (AUC). The ROC curve plots the true positive rate against the false positive rate as the model’s threshold for classifying an instance as positive is varied. AUC therefore summarises the performance of a classification model across all thresholds in a single number.
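To make the definitions above concrete, the following is a minimal Python sketch of how these measures could be computed for a binary classifier. It is an illustration only, not the evaluation code used in the paper. The function names (`confusion_counts`, `evaluation_measures`, `roc_auc`) and the toy labels are hypothetical. The \(F_1\) score uses the standard harmonic-mean form, the chance agreement AC for Cohen's kappa is estimated from the marginal label frequencies, and AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary task (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def evaluation_measures(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and Cohen's kappa from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumption: report 0.0 in that case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # OA is the observed agreement (equal to accuracy here); AC is the
    # agreement expected by chance, estimated from the marginal frequencies.
    oa = accuracy
    ac = ((tp + fp) / total) * ((tp + fn) / total) \
        + ((tn + fn) / total) * ((tn + fp) / total)
    kappa = (oa - ac) / (1 - ac) if ac != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with made-up labels and scores (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
measures = evaluation_measures(*confusion_counts(y_true, y_pred))
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

The pairwise AUC computation is quadratic in the number of instances, which is acceptable for small datasets. For larger ones, a rank-based or library implementation would be preferable.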
