1 Introduction

In law, a judgment is a decision made by a court that resolves a controversy and determines the rights and liabilities of the parties in a legal action or proceeding. Most courts now store their judgments electronically. In 2013, the China Judgment Online System, the world's largest website for sharing judgment documents, was officially launched. To date, over 23 million electronic judgment documents have been recorded, and more than 70K new judgments are indexed every day. This huge collection of judicial documents is of great importance not only for improving judicial justice and openness, but also for court administrators, who rely on it for record keeping and for future reference in decision making and judgment writing. Furthermore, sharing and deeper analysis of these judgments is a key step in building judicial information systems and improving the legislative system.

In the China Judgment Online System, judicial cases are indexed under five major types according to the cause of action: administrative, criminal, civil, compensation, and execution cases. Beneath each type there usually lie further organizational hierarchies. Grouping by keywords is one of the most commonly used methods; for example, keywords in criminal cases include illegal possession, surrender, joint offence, penalty, etc. Keyword matching does contribute to a better organized judgment documenting system, but it also has limits. On one hand, a list of keywords must be manually created and maintained, and enumerating all keywords exhaustively is difficult, which incurs extra human labor. On the other hand, grouping by keywords does not meet all demands of real applications: when new classification requirements arise, it is almost impossible to categorize all documents manually.

Text classification, an important task in natural language processing, involves assigning a text document to one of several predefined classes or topics. It has been widely studied in the text mining and information retrieval communities and is commonly applied in diverse domains such as automatic news categorization, spam detection and filtering, opinion mining, and document indexing. In this paper, we propose a machine learning approach to automatically classify Chinese judgment documents into predefined categories. Our work: (1) proposes an automated method to construct a list of judicial-specific stop words, (2) proposes an effective strategy for representing Chinese judgment documents and reducing feature dimensionality while keeping as much important information as possible, and (3) achieves high performance using machine learning algorithms to classify Chinese judgment documents.

To evaluate the performance of our approach, we manually label 6,735 Chinese judgment documents related to liabilities for product quality into 13 categories based on the statutory standard of industry division. With the experimental results on this dataset, we demonstrate the contributions of this paper by answering the following research questions:

(1) How is the performance of the classifier improved by domain stop words list construction and text preprocessing in Chinese judgment documents classification?

(2) What kind of features should be selected, and how can we benefit from dimensionality reduction?

(3) Which machine learning algorithm achieves better performance for Chinese judgment documents classification?

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents our approach in detail. Section 4 describes our experiments and evaluation results, and Sect. 5 concludes with a discussion of future work.

2 Related Work

In recent years, the problem of text classification has gained increasing attention due to the large amounts of text data created in information-centric applications, and this attention has been accompanied by extensive research on classification methods and algorithms. In this section, we provide an overview of key techniques for text classification.

Technically, text data differs from other forms of data, such as relational or quantitative data, in many respects. Its most important characteristic is that it is sparse and high dimensional [1]. Text data can be analyzed at different levels of representation. Bag-of-words (BOW) simply represents a document as the collection of words it contains, ignoring word order. TF-IDF takes both term frequency and document frequency into consideration to determine the importance of each word and is commonly used to represent documents. Strzalkowski demonstrated that proper term weighting is important and that different types of terms, as well as terms derived by different means, should be differentiated [2]. Jiang integrated rich document representations to derive high-quality information from unstructured data and improve text classification [3]. Liu studied document representation based on a semantically smoothed topic model [4]. Document representations have also been studied in many other applications: Yang proposed a novel approach for business document representation in e-commerce [5], and Arguello introduced two document representation models for blog recommendation [6].

The bag-of-words (BOW) representation retains a great deal of useful information, but it is also troublesome because BOW vectors are very high dimensional. To provide a lower-dimensional representation, many forms of dimensionality reduction have been studied to find the semantic space and its relationship to the BOW representation. Two techniques stand out. The first is latent semantic indexing (LSI), which is based on singular value decomposition: it finds a latent semantic space and constructs a low-rank approximation of the original matrix while preserving the similarity between documents [7]. The second is topic models, which provide a probabilistic framework for the dimensionality reduction task [8]. Hofmann proposed PLSI, a crucial step in topic modeling that extends LSI to a probabilistic setting; it is a good basis for text analysis but contains a large number of parameters that grows linearly with the number of documents [9]. Latent Dirichlet Allocation (LDA) models a generative process for the topics in each document and therefore greatly reduces the number of parameters to be learned, improving on PLSI [10].

A wide variety of machine learning algorithms have been designed for text classification and applied in many settings. Apte proposed an automated method for learning decision rules for text categorization [11]. Baker introduced a text classification method based on distributional clustering of words [12]. Drucker studied support vector machines for spam categorization [13]. Ng and Jordan compared discriminative and generative classifiers [14]. Sun presented supervised latent semantic indexing for document categorization [15].

3 Approach

In this section, we present our approach to Chinese judgment documents classification in detail. Section 3.1 gives an overview of the workflow. Section 3.2 introduces text preprocessing. Section 3.3 introduces our document representation method and approaches for reducing the dimensionality of the resulting feature vectors. Section 3.4 introduces the classifiers used in our work.

3.1 Overview

Figure 1 presents an overview of the workflow we use for Chinese judgment documents classification. To analyze Chinese judgment documents in depth, the approach starts by setting a clear classification goal and then, based on that goal, accessing the China Judgment Online System to obtain a collection of to-be-classified judgment documents with a certain cause of action. To build a classification model and evaluate its performance, we put considerable human effort into labeling a proportion of those judgment documents according to the classes defined in the goal-setting process. The labeled judgment documents are then separated into two parts, a training dataset and a test dataset, one for model training and the other for performance evaluation. The question of how large a portion should be used for training has been well discussed in [16]; we choose 70% in our work.

Fig. 1. Overview of the workflow for Chinese judgment documents classification

Unlike documents written in English, Chinese documents require different text preprocessing methods because of large differences in morphology, grammar, syntax, etc. Beyond that, judgment documents have their own characteristics. Court administrators generally follow a certain format when writing a judgment, so not all of the content in a judgment is useful for our classification goal. For this reason, extracting only the content needed for classification helps reduce noise and thus improves performance; Sect. 3.2.1 introduces the method we use to extract content. As mentioned, Chinese documents must be segmented into words before they can be represented; for this purpose we employ Chinese word segmentation, a standard natural language processing technique, detailed in Sect. 3.2.2. While it is fairly easy to use a published stop words list, such generic lists are often insufficient for specific applications. To improve the performance of classifying Chinese judgment documents, we propose an algorithm to construct a judgment-domain-specific stop words list in Sect. 3.2.3, and then remove all stop words in Sect. 3.2.4.

After text preprocessing, we employ TF-IDF, which is widely used in document classification research, to represent documents; this step is also called feature extraction. To further reduce feature dimensionality and improve performance, we also apply feature selection methods; Sect. 3.3 introduces them in detail. Supervised machine learning algorithms, namely Naive Bayes (NB), Decision Tree, Random Forest, and Support Vector Machine (SVM), are used to build the classification model.

3.2 Text Preprocessing

In this section, we focus on the methods used to preprocess Chinese judgment documents. Since Chinese documents differ from English-written ones in terms of morphology, grammar, syntax, etc., extensive preprocessing methods are required.

3.2.1 Content Extraction

After accessing the China Judgment Online System and obtaining a collection of judgment documents, we investigated a large number of them and found that, no matter what category a judgment belongs to, all of them follow a certain format, as court administrators usually use a similar pattern when writing judgments. Moreover, not all of the content in a judgment is relevant to a given classification objective; we are concerned with different parts of the content for different purposes. In text classification, irrelevant information in a to-be-analyzed text is called noise, which affects experimental results and degrades performance, especially on short texts such as judgment documents. For these reasons, we consider extracting particular contents necessary for reducing noise and thereby improving performance.

In our work, regular expressions are used to extract particular parts of a judgment document. A regular expression, abbreviated regex, represents a pattern that matching text must conform to. For extracting different parts of Chinese judgment documents, different regular expressions are developed. Table 1 presents the most used regular expressions we have studied and summarized for extracting different parts of a judgment, and a minimal extraction sketch follows the table.

Table 1. A summary of the most used regular expressions for extracting different contents from a judgment
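To make the extraction step concrete, the following is a minimal Python sketch. The pattern shown is an illustrative assumption, not one of the expressions from Table 1: it captures the span between the markers 本院认为 ("this court holds") and 判决如下 ("rules as follows"), which commonly delimit the court-opinion section of a judgment.

```python
import re

# Illustrative pattern only; the expressions actually used are in Table 1.
# re.S lets '.' match newlines so the section can span multiple lines.
COURT_OPINION = re.compile(r"本院认为(.*?)判决如下", re.S)

def extract_opinion(judgment_text: str) -> str:
    """Return the court-opinion section of a judgment, or '' if absent."""
    match = COURT_OPINION.search(judgment_text)
    return match.group(1).strip() if match else ""
```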

3.2.2 Word Segmentation

Tokenization of raw text is a standard preprocessing step for many natural language processing (NLP) tasks; it usually involves punctuation splitting and the separation of some affixes, such as possessives. Unlike English, Chinese requires a more extensive preprocessing step known as word segmentation. As the most fundamental task in Chinese NLP, word segmentation has been studied for many years. It involves splitting a paragraph of text into a sequence of words, and sometimes includes part-of-speech tagging, semantic dependency analysis, and named entity recognition.

Currently, a number of Chinese word segmentation systems exist, including Jieba, ICTCLAS, SCWS, LTP, and NLPIR, most of which achieve satisfactory performance. LTP [17], the Language Technology Platform, has the highest accuracy among them and offers multithreaded processing, but it has two disadvantages: (1) results are returned in XML format, which requires further processing and incurs extra time cost, especially on large datasets, and (2) tasks occasionally fail. Jieba is one of the most efficient systems and is easily integrated into Python programs, but it is less accurate. Taking all factors into consideration, we combine the two: LTP serves as the main tool while Jieba acts as a fallback when LTP tasks fail. To improve performance, all segmented documents are stored in our database so they need not be re-segmented.
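As a minimal illustration of the fallback path, the snippet below segments a sentence with Jieba; jieba.lcut returns the tokens as a Python list. The sample sentence is ours, and the LTP side is omitted since its invocation details depend on the deployed service.

```python
import jieba  # pip install jieba

# Segment a (sample) judgment sentence into a list of words.
text = "原告与被告因产品质量纠纷诉至本院"
words = jieba.lcut(text)
print(words)  # e.g. ['原告', '与', '被告', '因', '产品质量', '纠纷', '诉至', '本院']
```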

3.2.3 Judicial Specific Stop-Words

While it is easy to use a published stop words list, for preprocessing judgment documents such generic lists are insufficient. For example, terms like plaintiff, defendant, argue, libel, and court occur in almost every judgment, so they should be treated as potential stop words for judgment retrieval and classification, yet commonly used stop words lists do not contain such domain-specific terms. To construct a judicial stop words list, we take the 6,735 documents as our corpus and apply the following two methods (a code sketch follows the list):

1. Use terms that occur frequently across judgment documents (low-IDF terms) as stop words. Inverse Document Frequency (IDF) is the inverse of the fraction of documents in the collection that contain a specific term and is often used to measure a term's importance: if a term occurs in almost every document, it carries little information and can be regarded as a potential stop word. After word segmentation, we sum the frequency of each unique word by scanning all documents, sort the terms in descending order, and take the top N terms as stop words, where N is chosen manually after human inspection. To ensure the resulting stop words are judicially specific, we filter out all common Chinese words in advance. The benefit of this approach is that it is intuitive and easy to implement.

2. Use the least frequent terms as stop words. Terms that are extremely infrequent may not be useful for text mining and retrieval; in judgments, such terms tend to be location names, personal names, idiosyncratic expressions, etc., which are rarely relevant to a classification objective. Removing these terms significantly reduces the overall feature space.
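The sketch below implements both statistics under stated assumptions: document frequency stands in for the low-IDF criterion of method 1, and the cutoffs top_n and min_df are illustrative values, not the ones chosen in the paper.

```python
from collections import Counter

def build_judicial_stopwords(segmented_docs, common_words,
                             top_n=2000, min_df=2):
    """segmented_docs: list of word lists; common_words: generic Chinese
    stop words to exclude. top_n and min_df are illustrative cutoffs."""
    df = Counter()
    for doc in segmented_docs:
        df.update(set(doc))  # count each term once per document
    # Method 1: terms frequent across documents (low IDF), minus common words.
    frequent = [w for w, _ in df.most_common() if w not in common_words][:top_n]
    # Method 2: extremely rare terms (names, places, one-off expressions).
    rare = [w for w, c in df.items() if c < min_df]
    return set(frequent) | set(rare)
```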

3.2.4 Stop-Words Removal

Removing as many stop words as possible significantly reduces noise and contributes greatly to better classification performance. Different text mining goals call for different preprocessing steps. After a raw document is segmented into a list of terms, each term is treated as a data stream that passes through every procedure defined in the text preprocessing workflow. Specifically, each procedure returns null if the term is recognized as a stop word of the corresponding type; otherwise it returns the term itself. In this way every word is filtered, and no stop words are left for further analysis (see the sketch after the list below). In total, seven kinds of stop words, handled by the following four removal procedures, are considered for Chinese judgment documents classification:

(1) Numbers, Chinese numerals, English letters, and special symbols. Since no word segmentation system is 100% accurate, errors occur at this step; as a result, tokens containing numbers, Chinese numerals, English letters, or special symbols appear in the segmentation results, and they are not useful for judgment classification.

(2) Judicial stop words. Once the set of judicial stop words has been constructed as described in the previous section, we remove every word in every document that matches a judicial stop word.

(3) Location names. Our investigation of judgment documents shows that locations, such as provinces, cities, districts, villages, and streets, occur in almost every judgment. Using a location name database, we remove most of them; for the rest, we check whether the last word of a term denotes a location and, if so, remove the term.

(4) Human names. Human name recognition is one of the named entity recognition (NER) tasks in natural language processing, and some NLP systems can accurately label human names in a given text. In judgments, however, human names are also among the least frequent terms, so invoking such systems just for stop words removal is a waste of resources: removing the least frequent terms removes human names along with them.
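The following sketch shows the filtering pipeline described above: each procedure maps a term to None when it recognizes a stop word of its type and passes the term through otherwise. The example procedure is illustrative; the actual procedures also consult the judicial stop words list and the location name database.

```python
import re

def drop_numeric_or_latin(term):
    """Illustrative procedure: drop tokens containing digits or Latin letters."""
    return None if re.search(r"[0-9A-Za-z]", term) else term

def remove_stopwords(terms, procedures):
    """Pass each term through every procedure; keep it only if none returns None."""
    kept = []
    for term in terms:
        for proc in procedures:
            term = proc(term)
            if term is None:
                break
        if term is not None:
            kept.append(term)
    return kept

# Usage: filtered = remove_stopwords(words, [drop_numeric_or_latin])
```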

3.3 Feature Extraction

The input to text classification is the content of judgments; however, a sequence of words cannot be fed directly to machine learning algorithms. To address this, text must be transformed into a numerical form that can be computed on, a process called document representation. In this section, we select TF-IDF for document representation; feature reduction methods are introduced at the end of the section.

3.3.1 Document Representation

Although stop words removal eliminates a large amount of noise, many words carrying very little meaningful information remain. Therefore, we use TF-IDF to quantify how much meaning each term carries.

TF-IDF, short for term frequency–inverse document frequency, is one of the most widely used methods for transforming a text into a feature vector in text mining and retrieval. The TF-IDF weight is composed of two factors: term frequency (TF) counts how many times a word appears in a document, and inverse document frequency (IDF) measures how informative a term is. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one, so term frequency is normalized by the total number of words in the document. Term frequency alone treats all terms as equally important, whereas inverse document frequency weights down frequent terms (including stop words) and scales up rare ones. The TF-IDF of a word \(w\) in document \(d\) is calculated as \(\text{TF-IDF}(w,d) = \text{TF}(w,d) \cdot \text{IDF}(w)\), where \(\text{TF}(w,d) = \frac{\text{frequency of } w \text{ in } d}{\text{total number of words in } d}\) and \(\text{IDF}(w) = \log_e \frac{\text{total number of documents in the corpus}}{\text{number of documents containing } w}\). Each term is regarded as a feature of the document, with its TF-IDF weight as the feature value; in this way every document is transformed into a feature vector.
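A minimal sketch with scikit-learn's TfidfVectorizer, assuming documents have already been segmented and the words joined with spaces. Note that scikit-learn's IDF uses a smoothed variant, \(\log\frac{1+n}{1+\text{df}} + 1\), rather than the plain formula above, so the weights differ slightly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-segmented documents, joined with spaces (sample data).
docs = ["原告 诉称 产品 质量 存在 缺陷", "被告 辩称 产品 合格"]

# token_pattern=r"(?u)\S+" keeps every whitespace-separated token,
# including single Chinese characters, which the default pattern drops.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
X = vectorizer.fit_transform(docs)  # sparse document-term TF-IDF matrix
print(X.shape, len(vectorizer.vocabulary_))
```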

3.3.2 Feature Reduction

As with the representation of other kinds of documents, TF-IDF generates a large feature vector, since it includes every term that occurs in any document. Feeding this vector directly to machine learning classifiers is therefore computationally inefficient. Hence, we study feature reduction for Chinese judgment classification in order to achieve better performance while keeping as much important information as possible. Three feature reduction methods are studied and used in our work, as follows (a code sketch follows the list):

1. Minimum document frequency. Document frequency (DF) is the number of documents containing a given word. If the DF of a term is extremely low, the term is unlikely to be meaningful for text classification. By raising the minimum document frequency required during document representation, we can easily filter out a large number of features and obtain a lower-dimensional feature vector.

2. Principal component analysis (PCA). PCA is a statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components, defined so that the first principal component has the largest possible variance. The first K principal components are chosen as the new vector basis. In this way, feature dimensionality can be reduced greatly while maintaining as much information as possible.

3. Truncated singular value decomposition (SVD). The dimensionality of documents is reduced by projecting the bag-of-words vectors into a semantic space. Specifically, truncated SVD constructs a low-rank approximation of the original matrix while preserving the similarity between documents.
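A sketch of the three reductions with scikit-learn, assuming docs holds the full segmented corpus (so the vocabulary exceeds n_components); min_df=5 and n_components=500 correspond to settings examined in Sect. 4.3, but are otherwise illustrative.

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Minimum document frequency: drop terms occurring in fewer than 5 documents.
vectorizer = TfidfVectorizer(min_df=5, token_pattern=r"(?u)\S+")
X = vectorizer.fit_transform(docs)

# 3. Truncated SVD operates directly on the sparse TF-IDF matrix.
X_svd = TruncatedSVD(n_components=500).fit_transform(X)

# 2. PCA requires a dense matrix, so densify first (feasible once min_df
#    has already shrunk the vocabulary).
X_pca = PCA(n_components=500).fit_transform(X.toarray())
```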

3.4 Classifiers

The most widely used machine learning algorithms at present include Naive Bayes (NB), Decision Tree, Random Forest, and Support Vector Machine (SVM). To study the performance different classifiers achieve in classifying Chinese judgment documents, after document representation and dimensionality reduction we apply off-the-shelf implementations of these algorithms to train classification models that assign an unseen record to one of the predefined categories. In our work, we train a classifier with each of these algorithms and compare their results.
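A minimal sketch of training and comparing the four classifiers with scikit-learn; the paper does not specify an implementation or hyperparameters, so defaults are shown (LinearSVC stands in for SVM, and MultinomialNB assumes non-negative TF-IDF features rather than PCA/SVD output).

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

classifiers = {
    "NB": MultinomialNB(),          # requires non-negative features
    "DT": DecisionTreeClassifier(),
    "RFC": RandomForestClassifier(),
    "SVM": LinearSVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)               # from the hold-out split (Sect. 4.1)
    print(name, clf.score(X_test, y_test))  # overall accuracy
```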

4 Evaluation

To evaluate the performance of Chinese judgment documents classification, we experiment on separating judgments related to liabilities for product quality into their specific industries. In this section, we introduce the experimental dataset and evaluation metrics; experiments and results are presented at the end of the section.

4.1 Dataset

To train a classification model and test its accuracy and performance, a gold-standard dataset of Chinese judgment documents is required. As no such dataset exists, we made a substantial effort to manually label 6,735 judgment documents related to liabilities for product quality into 13 categories based on the statutory standard of industry division.

The hold-out method is used to separate the dataset. To avoid introducing extra error during splitting and to maintain a consistent data distribution, we use stratified sampling rather than random sampling to separate the training and test datasets; with stratified sampling, the ratio of each category is the same in both. However, results from a single hold-out split may be unstable and unreliable, so multiple stratified samplings are performed and the average of all results is reported. Setting the split ratio involves a trade-off: the more samples the training dataset contains, the closer the trained model is to one trained on the full dataset, but the fewer samples remain for testing, which can make results inaccurate and unstable; conversely, a larger test dataset means a smaller training set and thus a more distorted model. There is no perfect solution; a common practice is to use 2/3 to 4/5 of the samples for training and the rest for testing. In our work, we set the ratio to 70%. Figure 2 illustrates the distribution of our datasets for Chinese judgment documents classification: among all labeled judgment documents, 70% are used for building the classification model and 30% for performance evaluation.

Fig. 2. Datasets for Chinese judgment documents classification
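A one-line sketch of a single stratified 70/30 split with scikit-learn; the experiments average over several such splits, which can be reproduced by varying random_state.

```python
from sklearn.model_selection import train_test_split

# One stratified hold-out split: each category keeps the same ratio
# in the training and test sets (X: feature matrix, y: category labels).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```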

4.2 Evaluation Metrics

To evaluate the performance of Chinese judgment documents classification, three evaluation metrics are employed as follows:

1. Overall accuracy. The percentage of records in the test dataset that are classified correctly.

2. Precision and recall for each category. Given a category c, its precision is the percentage of records classified by the algorithm as c that indeed belong to c, and its recall is the percentage of records belonging to c that are correctly classified.

3. F-measure. For each category, the F-measure is calculated as \(F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It captures the balance between precision and recall: the higher the F-measure of a category, the better the classifier performs on that category. A sketch of computing these metrics follows the list.
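All three metrics can be obtained with scikit-learn, as in the following sketch (clf is any classifier trained above).

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
print("overall accuracy:", accuracy_score(y_test, y_pred))
# Per-category precision, recall, F1-score and support, plus averages.
print(classification_report(y_test, y_pred))
```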

4.3 Experiments and Results

This section presents the experimental results that answer our research questions. NB stands for Naive Bayes, DT for Decision Tree, RFC for Random Forest Classifier, and SVM for Support Vector Machine.

RQ (1): How is the performance of the classifier improved by domain stop words list construction and text preprocessing in Chinese judgment documents classification?

We first analyze the document frequency of each word; Fig. 3 presents the number of words at each document frequency. There are 54,633 words that occur in only one document, 9,804 words that occur in only two documents, 315 words that occur in 1,000–6,700 documents, and so on. From these data, words with extremely high document frequency account for only a small fraction of the vocabulary. Based on the words sorted by document frequency, we construct the judicial stop words list; among them, 2,170 judicial stop words with high document frequency are filtered out.

Fig. 3. Number of words with certain document frequency

With stop words removal in text preprocessing, the number of feature dimensions is reduced from 98,750 to 68,155. Figure 4 presents the overall accuracy of each classifier with and without text preprocessing. Text preprocessing removes a large amount of noise and thus yields a lower-dimensional feature vector; as Fig. 4 illustrates, the overall accuracy improves considerably no matter which classifier is used. Specifically, the overall accuracy of NB improves by 0.89%, DT by 6.22%, RFC by 7.2%, and SVM by 1.14%.

Fig. 4. Overall accuracies of classifiers with text preprocessing and without text preprocessing

As for efficiency, Table 2 shows that the time cost of training the classification model is reduced on average; with SVM in particular, performance improves greatly as feature dimensions decrease. Taking both overall accuracy and time cost into consideration, text preprocessing does substantially improve the performance of Chinese judgment classification.

Table 2. Time cost of classifiers with text preprocessing and without text preprocessing

RQ (2): What kind of features should be selected, and how can we benefit from dimensionality reduction?

We reduce feature dimensions using the three dimensionality reduction methods. Figure 3 shows that words with low document frequency, which may not be meaningful for classification, make up most of the vocabulary. We experimented with the minimum document frequency parameter in the feature extraction step and tried PCA and truncated SVD for further reduction; D denotes the number of feature dimensions after reduction. Figure 5(a) presents the overall accuracy of each classifier under the different approaches, and Fig. 5(b) presents the corresponding running time.

As Fig. 5(a) illustrates, for each classifier the overall accuracy does not change much as the minimum document frequency varies, although SVM shows a general decreasing trend as feature dimensions get lower. The overall accuracy of RFC peaks when D is 9,654 (Min_df of 5) and when D is 500 with PCA or SVD; given this increasing trend, PCA and SVD can help improve the accuracy of RFC. For DT, PCA and SVD show no positive effect on overall accuracy, but with Min_df, DT's overall accuracy improves as feature dimensions are reduced. For NB, SVD generally works better than PCA, but neither improves overall accuracy.

Fig. 5. Classification results using different dimensional reduction approaches

Meanwhile, regarding the time cost of training each classifier, the time cost of SVM decreases as feature dimensions are reduced, and for the other classifiers performance improves to a certain degree with PCA and SVD. Taking both overall accuracy and time cost into consideration, SVM achieves the best overall accuracy but spends more time in model training; with minimum document frequency plus PCA or SVD for further reduction, better efficiency can be achieved at a small cost in accuracy.

RQ (3): Which machine learning algorithm achieves better performance for Chinese judgment documents classification?

Figure 6(a), (b), and (c) illustrate the precision, recall, and F1-score of each category, respectively, for the different classifiers; in each figure, the last cluster represents the average result. Sorting the classifiers by average precision, SVM takes first place, RFC beats DT, and DT beats NB; sorting by average recall gives the same order. Since the F1-score measures the balance of precision and recall, a higher F1-score means better performance; therefore, as Fig. 6(c) shows, the overall performance order for Chinese judgment documents classification is SVM > RFC > DT > NB. Drilling down to each category, the differences between classifiers can be traced back to the different training dataset size of each category, also known as the support. On categories with larger datasets, e.g., Pharmaceutical Industry and Chemical Industry, SVM achieves better results than the others, while on smaller datasets RFC is more stable. Based on the experimental results, SVM achieves the best performance, with an F1-score of 87%.

Fig. 6. Classification results

5 Conclusion

This paper explores approaches to automatically classifying Chinese judgment documents using a variety of machine learning algorithms. Unlike other documents, Chinese judgments follow a certain format; as a result, for each classification goal, extracting the related content from the original judgments is necessary to remove irrelevant information. To improve classification performance, we first construct a judicial stop words list by statistically analyzing both the words that occur frequently across all documents and the words with the lowest frequencies. Second, we use three dimensionality reduction methods, minimum document frequency, PCA, and truncated SVD, to reduce feature dimensions while keeping as much information as possible; the experimental results demonstrate their effectiveness in improving the performance of Chinese judgment classification. Third, four machine learning algorithms are applied to document classification, among which SVM achieves the best performance, with an average F1-score of 87%.

These methods can readily be applied to the classification of other kinds of judgments. To realize a more open, just, and functional judicial system, more classification tasks on other kinds of judgments and deeper analysis of judgments should be carried out. Since a judgment may belong to multiple classes, multi-label text classification methods need to be explored to achieve better performance. In addition, a more capable search engine for Chinese judgment documents, built on automatic classification, should be studied and developed.