3.1 Datasets
The availability of labeled datasets for text classification has been a main driving force behind the rapid progress of this research field. In this section, we summarize the characteristics of these datasets by domain and give an overview in Table 2, including the number of categories, average sentence length, the size of each dataset, related papers, data sources, and applications.
3.1.1 Sentiment Analysis (SA).
SA is the process of analyzing and reasoning about subjective, emotionally colored text. Its goal is to determine whether a text supports a particular point of view, which distinguishes it from traditional text classification that analyzes the objective content of a text. SA can be binary or multi-class. Binary SA divides texts into two categories, positive and negative. Multi-class SA assigns texts multi-level or fine-grained labels. The SA datasets include Movie Review (MR) [226, 257], Stanford Sentiment Treebank (SST) [227], Multi-Perspective Question Answering (MPQA) [229, 258], IMDB [230], Yelp [231], Amazon Reviews (AM) [93], NLP&CC 2013 [111], Subj [250], CR [251], SS-Twitter [259], SS-Youtube [259], SE1604 [260], and so on. Here we detail several datasets.
MR. MR is a movie review dataset in which each review corresponds to one sentence. The corpus has 5,331 positive and 5,331 negative samples. It is commonly evaluated with 10-fold cross-validation by random splitting.
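As a minimal sketch of the 10-fold protocol mentioned for MR, the snippet below shuffles sample indices once and yields ten train/test partitions. The function name `k_fold_indices` and the fixed seed are illustrative assumptions, not part of any published MR evaluation script.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random splitting, reproducible via seed
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# MR has 5,331 positive and 5,331 negative samples (10,662 in total).
N_MR = 5331 + 5331
splits = list(k_fold_indices(N_MR, k=10))
```

Each sample appears in exactly one test fold, and the reported score is usually the average over the ten folds.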
SST. SST is an extension of MR. It has two versions. SST-1 provides fine-grained labels with five classes and has 8,544 training texts and 2,210 test texts. SST-2 has 9,613 texts with binary labels, partitioned into 6,920 training texts, 872 development texts, and 1,821 test texts.
MPQA. MPQA is an opinion dataset. It has two class labels and is also used for the opinion polarity detection sub-task. MPQA includes 10,606 sentences extracted from news articles from various news sources. It should be noted that it contains 3,311 positive texts and 7,293 negative texts without labels of each text.
IMDB reviews. The IMDB review dataset is developed for binary sentiment classification of film reviews, with the same number of reviews in each class. It is split evenly into training and test sets of 25,000 reviews each.
Yelp reviews. The Yelp review dataset is drawn from the Yelp Dataset Challenges in 2013, 2014, and 2015. It has two versions. Yelp-2 is used for negative and positive sentiment classification and includes 560,000 training texts and 38,000 test texts. Yelp-5 is used to detect fine-grained sentiment labels and has 650,000 training and 50,000 test texts across all classes.
AM. AM is a popular corpus formed by collecting product reviews from the Amazon website [232]. It has two versions. Amazon-2, with two classes, includes 3,600,000 training texts and 400,000 test texts. Amazon-5, with five classes, includes 3,000,000 training texts and 650,000 test texts.
3.1.2 News Classification (NC).
News content is one of the most important information sources and has a critical influence on people. An NC system helps users obtain vital knowledge in real time. News classification applications mainly encompass recognizing news topics and recommending related news according to user interest. The news classification datasets include 20 Newsgroups (20NG) [34], AG News (AG) [93, 234], R8 [235], R52 [235], Sogou News (Sogou) [136], and so on. Here we detail several datasets.
20NG. 20NG is a newsgroup text dataset. It includes 18,846 texts in 20 categories, with the same number of texts in each category.
AG. AG News is collected by an academic news search engine, and the dataset keeps the four largest classes. It uses the title and description fields of each news article. AG contains 120,000 texts for training and 7,600 texts for testing.
R8 and R52. R8 and R52 are two subsets of Reuters [252]. R8 has 8 categories, divided into 5,485 training files and 2,189 test files. R52 has 52 categories, split into 6,532 training files and 2,568 test files.
Sogou. Sogou combines two datasets, the SogouCA and SogouCS news sets. The label of each text is the domain name in the URL.
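A minimal sketch of deriving a label from the domain name in a URL, as described for Sogou, is given below. The helper name `domain_label` and the example.com URLs are hypothetical; how the host name is further mapped to a news category is not specified here and is left out.

```python
from urllib.parse import urlparse

def domain_label(url):
    """Return the lowercased host name of a URL, used here as the raw label.

    Returns None when the URL has no scheme or host part.
    """
    return urlparse(url).hostname

# Hypothetical URLs for illustration only.
urls = [
    "http://sports.example.com/2012/a.html",
    "HTTP://Finance.Example.com:8080/x",
]
labels = [domain_label(u) for u in urls]
```

Using `hostname` rather than `netloc` drops any port and normalizes case, so the same site always yields the same label string.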
3.1.3 Topic Labeling (TL).
Topic analysis attempts to capture the meaning of a text by identifying its underlying themes. Topic labeling is one of the essential components of topic analysis; it assigns one or more subjects to each document to simplify the analysis. The topic labeling datasets include DBpedia [238], Ohsumed [239], Yahoo answers (YahooA) [93], EUR-Lex [240], Amazon670K [241], Bing [244], Fudan [245], and PubMed [261]. Here we detail several datasets.
DBpedia. DBpedia is a large-scale multilingual knowledge base generated from the most commonly used infoboxes of Wikipedia. A new DBpedia version is published each month, adding or deleting classes and properties in every version. The most prevalent version of DBpedia has 14 classes and is divided into 560,000 training texts and 70,000 test texts.
Ohsumed. Ohsumed is drawn from the MEDLINE database. It includes 7,400 texts and has 23 cardiovascular disease categories. All texts are medical abstracts, each labeled with one or more classes.
YahooA. YahooA is a topic labeling dataset with 10 classes. It includes 140,000 training texts and 5,000 test texts. Each text contains three elements: the question title, the question context, and the best answer.
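Since each YahooA text has three fields, a classifier needs some way to turn them into one input. The sketch below simply joins the non-empty fields with a separator; this joining strategy and the function name `build_input` are assumptions for illustration, not a preprocessing step prescribed by the dataset.

```python
def build_input(title, content, best_answer, sep=" "):
    """Concatenate the three fields into one string, skipping empty ones."""
    parts = [title, content, best_answer]
    return sep.join(p.strip() for p in parts if p and p.strip())

# Toy record: the question context is empty, so only two fields are joined.
text = build_input("Why is the sky blue?", "", "Rayleigh scattering.")
```

The same pattern applies to datasets with two fields, such as the title and description of an AG news item.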
3.1.4 Question Answering (QA).
The QA task can be divided into two types: extractive QA and generative QA. Extractive QA provides multiple candidate answers for each question, and the system must choose the right one. Thus, text classification models can be used for the extractive QA task. The QA discussed in this paper is all extractive QA. A QA system can apply a text classification model to recognize the correct answer and treat the others as candidates. The question answering datasets include Stanford Question Answering Dataset (SQuAD) [246], TREC-QA [248], WikiQA [249], Subj [250], CR [251], MS MARCO [262], and Quora [263]. Here we detail several datasets.
SQuAD. SQuAD is a set of question-answer pairs obtained from Wikipedia articles. It has two versions. SQuAD1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with more than 50,000 unanswerable questions written by crowd workers to look similar to answerable ones [264].
TREC-QA. TREC-QA includes 5,452 training texts and 500 test texts. It has two versions: TREC-6 contains 6 categories, and TREC-50 has 50 categories.
WikiQA. The WikiQA dataset includes questions with no correct answer, so a system must also evaluate whether any candidate answer is correct.
MS MARCO. MS MARCO, released by Microsoft, contains questions and answers. The questions and part of the answers are sampled from actual web texts by the Bing search engine; the other answers are generative. It is used for developing generative QA systems.
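The extractive QA setting described at the start of this subsection can be sketched as a classification problem over (question, candidate) pairs. In the snippet below, `to_classification_rows`, `select_answer`, and the word-overlap scorer are hypothetical names and a toy stand-in for a trained classifier; they only illustrate the data framing.

```python
def to_classification_rows(question, candidates, correct_index=None):
    """Turn one question into (question, candidate, label) rows.

    label is 1 for the correct candidate and 0 otherwise. With
    correct_index=None (a question without a correct answer), all labels are 0.
    """
    return [(question, c, int(i == correct_index)) for i, c in enumerate(candidates)]

def overlap_score(question, candidate):
    """Toy scorer: number of shared lowercase tokens (stand-in for a classifier)."""
    return len(set(question.lower().split()) & set(candidate.lower().split()))

def select_answer(question, candidates, score_fn=overlap_score):
    """Return the index of the highest-scoring candidate."""
    scores = [score_fn(question, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

question = "what color is the sky"
candidates = ["the sky is blue", "grass is green"]
rows = to_classification_rows(question, candidates, correct_index=0)
best = select_answer(question, candidates)
```

Training uses the labeled rows; at inference time the model scores every candidate and the top-scoring one is returned.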
3.1.5 Natural Language Inference (NLI).
NLI is used to predict whether the meaning of one text can be deduced from another. Paraphrasing is a generalized form of NLI: it measures the semantic similarity of a sentence pair to decide whether one sentence is a paraphrase of the other. The NLI datasets include Stanford Natural Language Inference (SNLI) [181], Multi-Genre Natural Language Inference (MNLI) [265], Sentences Involving Compositional Knowledge (SICK) [266], Microsoft Research Paraphrase (MSRP) [267], Semantic Textual Similarity (STS) [268], Recognising Textual Entailment (RTE) [269], SciTail [270], etc. Here we detail several of the primary datasets.
SNLI. SNLI is widely used for NLI tasks. It contains 570,152 human-annotated sentence pairs, split into training, development, and test sets and annotated with three categories: neutral, entailment, and contradiction.
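A sentence-pair example with one of the three SNLI categories can be represented as below. This is only a sketch: the field names, the label ordering, and the toy sentences are assumptions chosen for illustration and do not reflect any official file format.

```python
# Label ordering is an arbitrary assumption; only the three names come from the text.
NLI_LABELS = ("entailment", "neutral", "contradiction")
LABEL_TO_ID = {name: i for i, name in enumerate(NLI_LABELS)}

def encode_pair(premise, hypothesis, label):
    """Pack a sentence pair and its integer class id into one record."""
    if label not in LABEL_TO_ID:
        raise ValueError(f"unknown NLI label: {label!r}")
    return {"premise": premise, "hypothesis": hypothesis, "label": LABEL_TO_ID[label]}

example = encode_pair(
    "A man is playing a guitar.",   # toy premise
    "Nobody is playing music.",     # toy hypothesis
    "contradiction",
)
```

A paraphrase dataset with binary labels fits the same record shape with a two-entry label tuple.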
MNLI. MNLI is an expansion of SNLI that covers a broader range of written and spoken text genres. It includes 433,000 sentence pairs annotated with textual entailment labels.
SICK. SICK contains almost 10,000 English sentence pairs, labeled as neutral, entailment, or contradiction.
MSRP. MSRP consists of sentence pairs, usually for the text-similarity task. Each pair is annotated with a binary label indicating whether the two sentences are paraphrases. It includes 4,076 training pairs and 1,725 test pairs.
3.1.6 Multi-Label (ML) Datasets.
In multi-label classification, an instance has multiple labels, and each label can only take one of multiple classes. Many datasets are built for multi-label text classification, including Reuters [252], Reuters Corpus Volume I (RCV1) [255], RCV1-2K [255], Arxiv Academic Paper Dataset (AAPD) [117], Patent, Web of Science (WOS-11967) [271], AmazonCat-13K [272], BlurbGenreCollection (BGC) [273], etc. Here we detail several datasets.
Reuters. Reuters is a widely used text classification dataset drawn from the Reuters financial news service. It has 90 training classes, 7,769 training texts, and 3,019 test texts, containing both multi-label and single-label texts. There are also several Reuters subsets, such as R8, R52, RCV1, and RCV1-v2.
RCV1 and RCV1-2K. RCV1 is collected from Reuters news articles from 1996-1997 and is human-labeled with 103 categories. It consists of 23,149 training texts and 784,446 test texts. The RCV1-2K dataset has the same features as RCV1, but its label set has been expanded with new labels and contains 2,456 labels.
AAPD. AAPD is a large dataset in the computer science field for multi-label text classification, collected from a website.¹ It has 55,840 papers, each with an abstract and the corresponding subjects, with 54 labels in total. The aim is to predict the corresponding subjects of each paper according to the abstract.
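For a multi-label task such as predicting the subjects of a paper, targets are commonly encoded as multi-hot vectors with one position per label. The sketch below assumes a tiny three-label vocabulary with illustrative subject names; the real AAPD label set has 54 entries, and the helper name `multi_hot` is hypothetical.

```python
def multi_hot(labels, label_index):
    """Encode a set of labels as a 0/1 vector over the full label vocabulary."""
    vec = [0] * len(label_index)
    for name in labels:
        vec[label_index[name]] = 1
    return vec

# Illustrative label vocabulary (an assumption, not the actual AAPD label list).
label_index = {"cs.CL": 0, "cs.LG": 1, "stat.ML": 2}

# A paper tagged with two subjects gets two active positions.
target = multi_hot(["cs.LG", "stat.ML"], label_index)
```

A model then outputs one score per position, and every position above a chosen threshold is predicted as an assigned subject.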
Patent Dataset. The Patent Dataset is obtained from USPTO,² the patent system granting U.S. patents, and contains textual details such as title and abstract. It contains 100,000 real-world U.S. patents with multiple hierarchical categories.
WOS-11967. WOS-11967 is crawled from the Web of Science and consists of abstracts of published papers with two labels for each example. It is shallower but significantly broader, with fewer classes in total.
3.1.7 Others.
There are also datasets for other applications, such as SemEval-2010 Task 8 [274], ACE 2003-2004 [275], TACRED [276], NYT-10 [277], FewRel [278], Dialog State Tracking Challenge 4 (DSTC 4) [279], ICSI Meeting Recorder Dialog Act (MRDA) [280], Switchboard Dialog Act (SwDA) [281], and so on.