1 Introduction
Legal case retrieval is a specialized
Information Retrieval (IR) task, which aims to retrieve relevant cases given a query case. It is of vital importance in pursuing legal justice in various legal systems. In common law, precedents are fundamental for legal reasoning, following the doctrine of
stare decisis. Meanwhile, although prior cases are not always cited directly in some other law systems (e.g., Germany [
19], China), they are still critical for supporting the decision-making process. In recent years, China has established a system of similar case retrieval
1 and continuously expanded the scope of compulsory retrieval
2 for consistency in legal decisions. With the rapid growth of digitalized case documents, how to identify relevant cases effectively has drawn increasing research attention in both the legal and IR communities. Several benchmark datasets have been constructed in recent years, such as COLIEE [
34], AILA [
5], and LeCaRD [
27]. They provide binary or graded relevance labels for training and evaluating legal case retrieval models [
43,
56]. However, we still lack a solid understanding of
relevance in legal case retrieval, especially how users make relevance judgments in this scenario, which may hinder future progress in this area.
Relevance is a key notion in IR. Generally,
relevant information is defined as information that pertains to the matter of the problem at hand [
40]. It has different manifestations, including the relevance calculated by algorithms of IR systems (i.e., system relevance) and the relevance that a user assesses (i.e., user relevance) [
50]. IR evaluation is generally based on comparing system relevance with user relevance, taking user relevance as the gold standard [10, 40]. However, things become more complex in legal case retrieval. Since the task is oriented to judicial applications, relevance judgments in legal case retrieval should follow legal standards, in other words, consider validity in the legal domain. For instance, the relevance assessed by a user might not agree with the authoritative legal rule, and consequently the corresponding result may fail to satisfy the information need in legal practice. Table
1 gives an example of considering domain validity when making relevance judgments. According to the legal rules, the “theft” case is more relevant than the “credit card fraud” case to the query case, in which the defendant used a stolen credit card. Therefore, domain validity should also be considered when discussing user relevance in legal case retrieval, which has not received due attention.
In the research line around relevance in IR, understanding how users determine relevance is essential for investigating the concept of relevance [
16,
40]. In the legal domain, there also exist some user studies specific to the e-discovery task [
32] to investigate the factors that affect user relevance judgments [
9,
53,
54]. However, the task definition of e-discovery [
32] differs from that of legal case retrieval. The e-discovery task is to retrieve various “electronically stored information” (e.g., letters, e-mails) regarding a request in legal proceedings, while the legal case retrieval task aims to identify relevant legal cases that support the decision process of a given query case. Correspondingly, the definitions of relevance in the two tasks differ. As far as we know, existing research efforts in legal case retrieval have mainly been devoted to developing retrieval models (i.e., system relevance) and to theoretical discussions [45, 48], lacking an empirical and quantitative investigation of the user side. Concerning user relevance in legal case retrieval, we propose the first two research questions in this article:
•
RQ1: What factors will affect the process of making relevance judgments in legal case retrieval?
•
RQ2: How do users allocate their attention when making relevance judgments?
Given the manifestations of relevance, we further put efforts into interpreting the gap between user relevance and system relevance in legal case retrieval. Specifically, our third research question is:
•
RQ3: Do retrieval models pay attention to contents similar to those that users attend to? Can we utilize the user’s attention to improve the model for legal case retrieval?
To address these research questions, we conducted a laboratory user study that involved 72 participants with different levels of domain expertise. Beyond relevance judgments, details of the decision-making process were collected, including user feedback, highlights, and multi-aspect reasons. With the collected data, we systematically investigated the process of making relevance judgments and the corresponding reasons. Furthermore, we investigated system relevance in an explainable way by comparing the models’ attention with users’ and thus proposed a two-stage framework to improve retrieval models for legal case retrieval. With a better understanding of relevance judgments in legal case retrieval, our study provides implications for developing better case retrieval systems in legal practice.
5 The User View of Relevance
Regarding RQ2, we attempt to investigate users’ understanding of relevance according to their semi-structured reasons and text annotations.
5.1 Criteria for Relevance Judgments
To understand users’ actual criteria for relevance, we inspect their semi-structured explanations. We only consider the sessions in which the user provides concrete content in the corresponding text field for the optional aspects.
11 The external legal expert (Ph.D., majoring in criminal law) annotates the correctness of the written content in each aspect based on the courts’ decisions.
As a result, besides the provided aspects (i.e., key element, key circumstance, cause, issue, and application of law), no additional effective aspects for relevance judgment are proposed in the study. Specifically, the reasons written in the “other” field are mostly detailed interpretations of the aspects above.
The importance of each provided aspect is as shown in Table
9. In general, users at all levels of domain expertise report following the instructions, as reflected in the importance scores. Recall that the two required aspects (i.e., KE and KC) are used in the relevance instructions. Comparing the two aspects, the “key element” is significantly more important than the “key circumstance.” The results reflect that users can recognize the roles of these two aspects in relevance judgment, where “key element” is more general and qualitative while “key circumstance” is more specific.
Comparing the importance scores of the optional aspects, we note that the overall trend of importance is consistent between the CU and NCU groups, though the differences in importance scores are smaller within the NCU group. The results indicate that users majoring in law can generally understand the meanings and roles of these aspects in determining case relevance, but users lacking specific criminal-law knowledge might hardly distinguish among them. However, the trend of importance assigned by the NLU group is the opposite. For one thing, they might hardly understand the actual legal meanings of these aspects. For another, this group considers these aspects much less often, as shown in Table
10.
Table
10 provides more details about the usage of the optional aspects, including the frequency and correctness. Among the optional aspects, “cause” is utilized the most frequently. Since the “cause” is the standard expression of criminal charges in the study, i.e., the legal characterization of the case, it helps determine the relevance between cases. Comparing the domain expertise groups, users with law backgrounds can identify the causes correctly with high probability, and the NCU group uses this aspect more often. These results are reasonable. As a fundamental legal concept, the cause is not too difficult for users majoring in law to identify and utilize. Without more specific knowledge of criminal law, users majoring in other legal specialties rely on the relationship between causes as a significant aspect for determining case relevance. Meanwhile, users majoring in criminal law can better understand more fine-grained points than the cause (e.g., key circumstances) and thus refer to the cause less often. However, judging from the accuracy and usage frequency in the NLU group, identifying the cause is still difficult for users without law backgrounds. As for the “issue,” it seems quite difficult for both the NCU and NLU groups to generalize. Last but not least, although the “application of law” is clearly defined, users still interpret it differently. This result suggests that “application of law” is still too general to apply in practice.
To sum up, users with law backgrounds can understand legal relevance better and make use of various legal aspects consistently. On the other hand, it is difficult for users without law backgrounds to comprehend the legal meanings of these concepts, and thus their understanding of case relevance might differ from the requirements in the law.
5.2 User Attention
Besides the general relevance criteria, we attempt to understand users’ cognitive process of relevance judgment in a fine-grained way. In our user study, participants were instructed to highlight the contents they paid attention to while making relevance judgments. We consider the explicit text annotations made by highlighting as an indicator of user attention, following previous research [
6,
24,
33]. We study the consistency of text annotations under different conditions and then investigate the patterns of user attention distribution based on these annotations. In particular, we inspect how different biases influence the attention allocation during relevance judgment, including the
positional bias, which is widely discussed in general web search [
24], and the
structural bias, which is caused by the internal structure of a legal case.
5.2.1 Consistency.
To measure the consistency of the text annotations, we employ the overlap coefficient [
51], which enables us to compare the annotations of different lengths, following previous studies [
6,
33]. The metric is calculated as follows:
\[ Overlap(A_1, A_2) = \frac{|A_1 \cap A_2|}{\min (|A_1|, |A_2|)}, \]
where
\(A_1\) and
\(A_2\) are two sets of words annotated by two users. In our study, we split the case document into words using the Chinese word segmentation toolkit
12 and filter out Chinese stopwords. If a user highlighted part of a word, we consider the whole word annotated. The overlap coefficient is calculated between each pair of users for each query or candidate case. Table
11 gives the average values of the overlap coefficient. Similar to Section
4, we mainly investigate the effects of domain expertise, query type, and case relevance on this metric.
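For concreteness, a minimal sketch of this computation follows; the segmentation toolkit (jieba) and the stopword list shown here are placeholders for the unnamed tools used in the study.

```python
from itertools import combinations

import jieba  # placeholder for the Chinese word segmentation toolkit used in the study

STOPWORDS = {"的", "了", "在"}  # placeholder stopword list


def annotated_words(case_text: str, highlight_spans: list[tuple[int, int]]) -> set[str]:
    """Map one user's highlighted character spans to the set of (whole) words they touch."""
    annotated, pos = set(), 0
    for word in jieba.cut(case_text):
        start, end = pos, pos + len(word)
        # a partially highlighted word counts as fully annotated
        if word not in STOPWORDS and any(s < end and e > start for s, e in highlight_spans):
            annotated.add(word)
        pos = end
    return annotated


def overlap_coefficient(a1: set[str], a2: set[str]) -> float:
    """Overlap coefficient between two users' annotated word sets."""
    if not a1 or not a2:
        return 0.0
    return len(a1 & a2) / min(len(a1), len(a2))


def mean_pairwise_overlap(user_sets: list[set[str]]) -> float:
    """Average overlap coefficient over all user pairs for one query or candidate case."""
    pairs = list(combinations(user_sets, 2))
    return sum(overlap_coefficient(a, b) for a, b in pairs) / len(pairs)
```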
As shown in Table
11, there exist significant differences among domain expertise groups in the consistency of query case annotations, and the consistency in the NLU group is significantly lower than in the other two user groups. As expected, users without a legal background seem more confused by the query case description and pay attention to more diverse contents than those majoring in law. However, all domain expertise groups achieve similar intra-group agreement on candidate case highlights. The result indicates that although different domain groups vary in understanding relevance criteria, users show consistent patterns, such as matching, when comparing two cases. On the other hand, the number of causes involved in the query case significantly impacts the consistency of user annotations in both query and candidate cases. We assume that a query case might be more complicated if multiple causes are involved, and thus users might focus on different contents in the whole case description. The difference in candidate case annotations is less significant than that in query case annotations. It might be because users consider the matching signals between two cases more in candidate case annotations, which might weaken the influence of divergent query case understanding. As expected, the relevance label of the candidate case has no significant effect on the consistency of query case annotations, which also validates our experimental settings. However, more disagreement in user attention appears when users judge an irrelevant case. This is reasonable, as the evidence for determining irrelevance might be more dispersed.
5.2.2 Distribution in Vertical Position.
Based on the highlights, we further investigate the positional patterns of user attention allocation in the candidate cases. Since users might annotate words or sentences, we consider the short sentence as the unit here to reduce potential biases. In detail, each case is segmented into a list of short sentences by commas. We use the comma instead of the period because a whole sentence (split by period) that contains multiple circumstances might be rather long, while a short sentence (split by comma) usually involves a single point. Then, the short sentences that include highlighted texts are denoted as “1” and the others as “0.” In that way, we obtain a binary vector for a candidate case based on a user’s annotations. In total, we construct 540 vectors for all the candidate cases in our study. To observe the distribution over vertical positions, we divide each vector evenly into 10 bins and consider the proportion of “1” in each bin as the annotation ratio. In our study, each case is divided into 10 vertical intervals, as shown in Figure
4.
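A minimal sketch of this construction follows; the comma-based splitting and even binning follow the description above, while the rounding at bin boundaries is an implementation assumption.

```python
import re


def annotation_vector(case_text: str, highlights: list[str]) -> list[int]:
    """Mark each comma-separated short sentence as 1 if it contains any highlighted text."""
    short_sentences = [s for s in re.split(r"[，,]", case_text) if s.strip()]
    return [1 if any(h and h in s for h in highlights) else 0 for s in short_sentences]


def binned_annotation_ratio(vector: list[int], n_bins: int = 10) -> list[float]:
    """Proportion of annotated short sentences within each of n_bins vertical intervals."""
    ratios = []
    for b in range(n_bins):
        start = round(b * len(vector) / n_bins)
        end = round((b + 1) * len(vector) / n_bins)
        chunk = vector[start:end]
        ratios.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return ratios
```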
Generally, the vertical intervals can be grouped into three areas, i.e., the top 30%, 30%–80%, and the last 20%. The first 30% attracts the most attention, and the ratio drops considerably after 30%. In the intervals from 30% to 80%, the ratio decreases gradually. Interestingly, the ratio shows a sharp increase at the beginning of the last 20% of intervals. We explain these patterns by combining position bias and document structure. On the one hand, users tend to read the beginning of the document for a preliminary relevance judgment, and user attention decays with vertical position, similar to previous findings in web search [
24]. On the other hand, a case document is semi-structured, usually composed of six basic components (also referred to as
sections) [
44], i.e.,
Party Information,
Procedural Posture,
Facts,
Holdings,
Decision, and
End of Document. Since the first two sections (i.e., Party Information and Procedural Posture) usually account for a small proportion of sentences in the whole document, the beginning of the “Facts” falls in the latter part of the top 30% area. Given that, users might pay much attention to this area for an overview of the case. In particular, the “Facts” contains more detailed information following the case summary, such as arguments from both sides, evidence, and so forth. Compared with the case summary, the detailed information might be less important for users’ judgments, which also explains the decrease in the annotation ratio in the middle area. It is worth mentioning that there is an increase in the 80%–90% interval. We think it might be because this area usually involves contents of the “Holdings” and “Decision,” which are court opinions and should be a significant reference for relevance judgment.
Furthermore, we investigate the effects of domain expertise on the distribution. Results are shown in Figure
4(a). We conduct the Kruskal-Wallis test in each interval with the B-H adjustment for p-values. As a result, we observe significant differences in the 0%–10%, 30%–40%, and 40%–50% intervals after the adjustment. Users with lower domain expertise seem more likely to be affected by position bias. For instance, the NLU group annotates more content at the very beginning of the document, and the corresponding ratio drops earlier and faster. As for the middle area of the case (i.e., 30%–50%), the CU group pays more attention than the other two groups. With more specific knowledge of criminal law, these users might compare more detailed information beyond the brief case summary when making relevance judgments.
In this part, we also examine whether users’ attention allocation differs when they consider a candidate case to be relevant or not. As shown in Figure
4(b), we observe significant differences in several intervals by the Mann-Whitney U test with B-H adjustment. Generally, the change of the annotation ratio across vertical positions seems sharper when the case is relevant. In particular, users pay more attention to the top area that usually involves the case summary and less attention to the areas that contain details if they think the candidate is relevant. We think that users might be confident when they judge a candidate to be relevant and thus mainly focus on the general but key points. On the other hand, they might be more cautious and consider more details for their irrelevance judgments.
5.2.3 Distribution in Components of the Case Document.
We further investigate the annotation ratio in different components with the assumption that the internal document structure would influence user attention allocation. Similarly, we segment texts in a component (i.e., section) into short sentences by the comma and calculate the annotation ratio, i.e., the proportion of highlighted sentences in each. Results are shown in Figure
5. Generally, the “Facts” and “Holdings” are the principal parts of a case document and tend to draw the most user attention in our study. Specifically, the “Facts” describes the full case story and the “Holdings” contains the court’s analysis of the case, both of which are fundamental to determining case relevance. Compared with the “Facts” section, the “Holdings” section is usually more concise, including the key points for the court to make decisions, and thus has the highest annotation ratio. On the other hand, the “Decision” section that incorporates the final sentence might be too general for comparing the relevance between cases, though it is generally considered a core part of a case document.
As shown in Figure
5(a), different distributions occur across domain expertise groups. Users without law backgrounds show different behavioral patterns compared with those majoring in law, especially in the first three sections. They allocate much more attention to the “Party Information” and the “Procedural Posture” than the other user groups. These sections are mainly composed of basic information about the parties and prior background, rarely mentioning the concrete story of the current case, and thus seem less significant for relevance judgment. Meanwhile, the NLU group shows less interest in concrete case circumstances than the other groups. We assume that it is difficult for users without professional legal training to identify key points in the lengthy case story description. Given that, they might prefer contents that involve explicit legal items (e.g., charges), even though some are not actually related to the current case, such as the previous verdicts or criminal records mentioned in the “Party Information” and “Procedural Posture” sections. These different patterns also indicate that users without law backgrounds might focus on information that is less helpful for the relevance judgment task. On the other hand, we observe similar patterns across sections under different relevance conditions, as shown in Figure
5(b). Combined with the observations in vertical positions (see Figure
4(b)), we think that the general attention allocation over the broad sections is similar, though it might vary at more specific positions depending on whether users consider the case relevant or not.
5.2.4 Summary.
We focus on investigating how users allocate attention when making relevance judgments based on their highlights, including inter-user consistency and distribution patterns. Regarding consistency, more disagreement occurs in query case understanding when users lack domain knowledge or when multiple causes are involved. Meanwhile, an irrelevant case might involve more inconsistent user attention than a relevant one. Regarding the attention distribution, we observe the impacts of both positional and structural biases. In vertical positions, users tend to pay much attention to the top 30% of the document, followed by a sharp drop. The middle area (i.e., 30%–80%) receives less attention, and the annotation ratio decreases gradually in this area. There is an increase in the last area of the document, which might be related to the document structure. In terms of case structure, users mainly focus on the “Facts” and “Holdings” sections. Furthermore, these attention distribution patterns also differ with domain expertise and case relevance. One of the challenges in legal case retrieval is to process lengthy legal case documents [
43,
56], and we believe that these findings can provide inspirations for the related research.
6 The System View of Relevance
In this section, we focus on system relevance. To address RQ3, we first compare the distribution of attention weights in retrieval models with that of users. Then we attempt to improve the performance of retrieval models with the help of user attention.
6.1 Model Attention vs. User Attention
We consider two categories of models, including traditional
bag-of-words (BOW) models and dense models. Specifically, we inspect the tf-idf [
38] and BM25 [
36] among the BOW models and BERT [
12] among the dense models. These models are representative of each category and widely adopted in legal case retrieval [
27,
34,
37,
43].
6.1.1 Experimental Settings.
In addition to the collected dataset in our study, we use LeCaRD [
27] for training and validation. In the following experiments, we denote LeCaRD and the dataset collected in the user study as
Dataset-L and
Dataset-U, respectively. As for the BOW models (i.e., tf-idf and BM25), the
idf is calculated based on all of the documents in LeCaRD. Since the cases in the user study are mostly drawn from LeCaRD, we think it can represent the vocabulary distribution. The parameters in BM25 are set to default values [
35]. As for the dense models (i.e., BERT), we utilize the version that is pre-trained on
\(6.63M\) Chinese criminal case documents [
60] (referred to as
Criminal-BERT). We then fine-tune it with a binary sentence-pair classification task for relevance prediction. These experimental settings are consistent with those in LeCaRD [
27]. In particular, we filter out the query cases selected for the user study, along with all of their candidate cases, from LeCaRD and then divide the remaining data randomly into training and validation sets at a ratio of
\(4:1\). In that way, we have 73 queries for training and the remaining 18 for validation. Under each query case, there are 30 candidate cases with four-level relevance labels. We convert the four-level labels into a binary scale for simplicity. Similar to the above analysis, the cases labeled as 3 and 4 are relevant, and the others are not relevant in the binary scale. For the relevance prediction task, we utilize the three core sections of a case document, i.e., “Facts,” “Holdings,” and “Decision,” and concatenate the query case with each section, respectively. Since the three sections are from the same case, they share the label. For each section type, we fine-tune a BERT model correspondingly. Models for the three sections are trained under the same setting. In detail, we truncate the texts symmetrically for each query-section input pair if it exceeds the maximum input length of BERT. The Adam optimizer is used, and the learning rate is set to
\(1e-5\). According to the validation data, all models converge after around two epochs. Note that we use the query-section pair instead of the query-document pair as input. Since the case document is always long, especially the “Facts” section, only a part of the “Facts” would be considered in the traditional query-document pair under the length limitation of BERT (i.e., 512 tokens). Given that, we utilize the three sections separately in the experiments.
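A minimal sketch of this fine-tuning setup, assuming the Hugging Face Transformers interface, is shown below; the checkpoint path is a placeholder, and the “longest_first” truncation strategy is used here to approximate the symmetric truncation described above.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# placeholder checkpoint; the study starts from a BERT pre-trained on Chinese criminal cases
MODEL_PATH = "path/to/criminal-bert"

tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)


def encode_pair(query: str, section: str, max_len: int = 512):
    # "longest_first" trims the longer side token by token,
    # approximating the symmetric truncation described above
    return tokenizer(query, section, truncation="longest_first",
                     max_length=max_len, padding="max_length", return_tensors="pt")


def train_step(query: str, section: str, label: int) -> float:
    model.train()
    outputs = model(**encode_pair(query, section), labels=torch.tensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```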
The Dataset-U serves as the testing set for all methods. Metrics for ranking and classification are utilized for evaluation. Different from Dataset-L, each query case in Dataset-U only involves four candidate cases. Therefore, we focus on evaluating the entire ranking list with MAP. Meanwhile, we also utilize the micro-average of precision, recall, and F1 scores as evaluation metrics, following recent benchmarks for legal case retrieval [34]. Since the tf-idf and BM25 methods only produce ranking scores, they are evaluated with the ranking metric (i.e., MAP). In contrast, the BERT models are trained for classification. To calculate the ranking metric, we rank the results according to the predicted probability of being relevant.
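For reference, a minimal sketch of the MAP computation over each query’s ranked candidate list follows, assuming binary relevance labels and ranking by predicted probability as described above.

```python
def average_precision(scores: list[float], labels: list[int]) -> float:
    """AP for one query, with candidates ranked by descending predicted relevance probability."""
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda x: -x[0])]
    hits, precisions = 0, []
    for rank, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(runs: list[tuple[list[float], list[int]]]) -> float:
    """MAP over all query cases; each run is (predicted scores, binary relevance labels)."""
    return sum(average_precision(s, l) for s, l in runs) / len(runs)
```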
The performance in Dataset-U is shown in Table
12. Note that our focus is to inspect the attention weights rather than to compare the performance of models. We examine the performance to validate the experimental settings (e.g., model training) before calculating the specific attention weights. As expected, the BERT models outperform the BOW ones by a non-trivial margin. Comparing tf-idf and BM25, BM25 achieves better ranking performance. Comparing the BERT models of different sections, the metrics for ranking and classification do not show a consistent trend. We think all three sections are informative for relevance prediction but might not play the same roles. We therefore also analyze them separately in the following experiments. To sum up, the performance of the different models shows a similar trend to those in previous studies [
27,
43]. Therefore, we think that the experimental settings are reasonable and further analyze their attention weights.
6.1.2 Model Attention.
Similar to user attention, we would like to understand what the models focus on when calculating the similarity score. Therefore, we calculate the attention/importance weight of each term as a representation of the model attention. The details of calculating the attention weights are as follows. The attention mechanism [
49] has been well applied in neural models. In particular, BERT is composed of multi-head transformers, which are structured based on self-attention. In self-attention, each word assigns weights to other words, and the corresponding weight could be interpreted as importance or attention. We extract the attention maps from BERT referring to the visualization tool [
52] and use the average value across multiple heads. By concatenating the query and section as input, we can calculate the query-to-query, section-to-section, and query-to-section attention maps. Given the input pair
\([CLS] \langle Q \rangle [SEP] \langle S \rangle [SEP]\), the attention weight on each term of the candidate section
\(s_j\) is calculated as
where
\(\omega\) denotes the weights in the attention map, and
n and
m denote the length of query
Q and section
S in the input, respectively. Following previous work [
6], the former part in Equation (
2) represents the attention from the query to a term in the section, indicating the matching signal, while the latter part represents the self-attention weight of the section, indicating the importance of the term within the section. For each term in the section, we investigate its role in relevance prediction by summing the two kinds of attention weights. As for the query terms, assuming that users have no idea about the candidate case when they read the query, we focus on the query-to-query attention to represent the process of query case understanding, as in Equation (
3):
Regarding each section type, we use the corresponding BERT model that has been fine-tuned on LeCaRD (i.e., BERT-F/BERT-H/BERT-D) to infer the attention weights. Considering the limited input length, we first segment the query and section into text spans of no more than 254 characters when constructing the input pairs. Once the attention weights of each span are obtained, we concatenate them to obtain the weights of the whole query or section. In this way, we can make full use of the entire query or section.
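The following sketch illustrates one plausible implementation of this extraction with Hugging Face Transformers, assuming that the two parts of Equation (2) are sums over the head-averaged attention of the last layer; the checkpoint path, layer choice, and placement of the min-max normalization are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

# placeholder path; the study fine-tunes one model per section type (BERT-F/H/D)
MODEL_PATH = "path/to/fine-tuned-bert"

tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertModel.from_pretrained(MODEL_PATH, output_attentions=True)


@torch.no_grad()
def span_attention_weights(query_span: str, section_span: str):
    """Per-token attention weights for a query span and a section span (each <= 254 chars)."""
    enc = tokenizer(query_span, section_span, return_tensors="pt", truncation=True)
    attn = model(**enc).attentions[-1].mean(dim=1)[0]  # average heads of the last layer
    token_type = enc["token_type_ids"][0]              # 0: query side, 1: section side
    q_idx = (token_type == 0).nonzero().flatten()
    s_idx = (token_type == 1).nonzero().flatten()
    # section term: query-to-section (matching) + section-to-section (importance)
    section_w = attn[q_idx][:, s_idx].sum(dim=0) + attn[s_idx][:, s_idx].sum(dim=0)
    # query term: query-to-query attention only
    query_w = attn[q_idx][:, q_idx].sum(dim=0)

    def min_max(x):  # normalize to [0, 1] for comparability across models
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return min_max(query_w), min_max(section_w)
```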
On the other hand, attention is not well defined in the traditional BOW models. Nevertheless, we use the weight of each word to represent its importance in the model. To be specific, the
\(tf \cdot idf\) value is considered to represent the word importance within a text span (i.e., self-attention within a query or a candidate section). As for BM25, given the section containing
\(\lbrace s_1, s_2, \ldots , s_k \rbrace\) words, the contribution of each word
\(s_j\) in the matching score is measured by
\[ w(s_j) = idf(s_j) \cdot \frac{freq(s_j, S) \cdot (k+1)}{freq(s_j, S) + k \cdot \left(1 - b + b \cdot \frac{|S|}{avgsl}\right)}, \]
where
\(freq(s_j, S)\) denotes the frequency of word
\(s_j\) in section
S,
\(|S|\) is the length of the section in words, and
avgsl is the average length of sections in the collection. The parameters
k and
b are set as default [
27]. Note that the attention weight is calculated over terms (i.e., characters) for BERT and over words for tf-idf and BM25; we assign the word weight to each character of the word for the BOW models to align the units. Last but not least, all the weights of each query/section are normalized to the
\([0, 1]\) range by
min-max for comparability across different models.
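A minimal sketch of these BOW term weights follows, assuming precomputed idf values on the collection and common default values for the parameters k and b.

```python
def tfidf_weights(words: list[str], idf: dict[str, float]) -> dict[str, float]:
    """tf-idf importance of each word within a text span (query or candidate section)."""
    return {w: words.count(w) * idf.get(w, 0.0) for w in set(words)}


def bm25_weights(section_words: list[str], idf: dict[str, float],
                 avgsl: float, k: float = 1.2, b: float = 0.75) -> dict[str, float]:
    """Per-word BM25 contribution within a candidate section, following the formula above."""
    weights = {}
    for w in set(section_words):
        freq = section_words.count(w)
        denom = freq + k * (1 - b + b * len(section_words) / avgsl)
        weights[w] = idf.get(w, 0.0) * freq * (k + 1) / denom
    return weights


def min_max_normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale the weights of one query/section to the [0, 1] range."""
    lo, hi = min(weights.values()), max(weights.values())
    return {w: (v - lo) / (hi - lo + 1e-9) for w, v in weights.items()}
```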
6.1.3 Comparison between Model and User.
We attempt to compare the attention of models and users by inspecting the similarity of their distributions. Similar to Section
5.2, the distribution of user attention is represented by their text annotations. In detail, for each term in the query or section, “1” denotes being highlighted, and “0” denotes the opposite. Taking the “0/1” vector as the representation of the user attention observation, we measure the similarity between two vectors via log-likelihood, inspired by the evaluation of click models. The log-likelihood
\(ll(m,u,t)\) between the model attention and the user attention on a text span is calculated as follows:
\[ ll(m,u,t) = \frac{1}{|T|} \sum _{i=1}^{|T|} \big [ o_{ui} \log \hat{\omega }_{mi} + (1 - o_{ui}) \log (1 - \hat{\omega }_{mi}) \big ], \]
where \(o_{ui}\) denotes whether the user \(u\) highlights the \(i\)-th term, \(\hat{\omega }_{mi}\) denotes the normalized attention weight of model \(m\), and \(|T|\) refers to the length of the text span in terms. To ensure numerical stability, the model weight \(\hat{\omega }\) is clipped between 0.00001 and 0.99999 in Equation (5).
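For illustration, a minimal sketch of this computation follows, with the clipping bounds described above.

```python
import math


def attention_log_likelihood(model_weights: list[float], user_highlights: list[int]) -> float:
    """Log-likelihood of one user's binary highlights under the model's normalized weights."""
    ll = 0.0
    for w, o in zip(model_weights, user_highlights):
        w = min(max(w, 1e-5), 1 - 1e-5)  # clip to [0.00001, 0.99999] for numerical stability
        ll += o * math.log(w) + (1 - o) * math.log(1 - w)
    return ll / len(model_weights)
```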
First, we inspect the similarity between model attention and all users’ on the query and candidate cases, as shown in Table
13. Besides the similarity in each section, we concatenate the three sections according to their original order in the case document and measure the overall similarity in the case (referred to as “Merge”). Furthermore, we also average the weights of BM25 and tf-idf in the candidates to consider the internal term importance and matching signal simultaneously (referred to as “combine”). As for the query case, since the query span in the input of the three BERT models is identical, we average their attention weights. Results are shown in Table
13, where a higher value of log-likelihood indicates higher similarity with user attention. On the query case, BERT outperforms tf-idf significantly in terms of similarity with user attention. This suggests that the dense model (e.g., BERT) might be better at query understanding, while the frequency-based models can hardly identify the essential information in the query case. On the candidate case, we find that BM25 achieves better agreement with user attention than tf-idf, indicating that the matching signal is more vital in determining relevance. Furthermore, combining the two models improves the similarity, which suggests that both word importance and the matching signal are useful for relevance judgment. In general, considering all three sections is beneficial, except for the results on the “Decision” section. The exception might be related to its much lower annotation ratio (see Figure
5) and its vocabulary being distinct from the query case description. Compared with the BOW models, the BERT model still performs better most of the time in terms of consistency with user attention. Specifically, significance tests are conducted between BERT and the other models in the “Merge” column of the candidate case, and BERT achieves significantly higher similarity with user attention. Overall, its better agreement with user attention on both query and candidate cases is also consistent with its better performance in relevance prediction in Table
12. It is worth mentioning that the gap between BERT and the BOW models (e.g., the “combine”) on the candidate case is not as pronounced as that on the query case. As an explanation, we think the matching signals, which the traditional BOW models can also capture, play a dominant role in the candidate case.
Further, we investigate the differences in attention similarity under different conditions. Similarly, domain expertise and query type are considered as the independent variables for the user and task properties, respectively. As shown in Table
14, significant differences are observed among domain expertise groups on both query and candidate cases. In particular, both types of models seem to be much more similar to the NLU users in attention distribution. We thus believe that these retrieval models are mainly based on basic textual features (e.g., keyword matching) and rarely incorporate legal knowledge in relevance prediction, similar to the users without law backgrounds. Unlike domain expertise, the query type factor has little significant effect on the similarity coefficient. The difference only occurs for the BERT model on the query case, indicating that a query case involving multiple causes might be more confusing for models, especially the dense model. Besides, we are also interested in whether there is any difference in model attention when it makes a correct or wrong prediction. As shown in Table
14, the attention distribution of the model
13 is significantly closer to that of the user on the candidate case when it makes a correct relevance prediction.
Given the above observations, we further investigate the attention similarity on the three specific sections of the candidate case. We mainly consider the domain expertise and prediction correctness factors, since query type seems to have no effect on this similarity coefficient on the candidate case. Results are shown in Table
15. We find that the significant differences with respect to domain expertise or prediction correctness mostly occur in the “Facts” section. As one of the implications, these results inspire us to improve relevance prediction performance by exploiting professional users’ attention, especially on the “Facts” section.
6.2 Proposed Method
Inspired by the above observations, we propose a two-stage framework that utilizes the attention of users majoring in law for relevance prediction in legal case retrieval. Experimental results on two datasets demonstrate the effectiveness of the proposed method.
6.2.1 Framework.
As illustrated in Figure
6, the proposed framework generally consists of two stages. The first stage aims to optimize the model attention weights with user attention. Given the model that has been tuned in Stage 1, the following stage fine-tunes the model for the target task (i.e., case relevance prediction) with a sentence pair classification task. Details of each stage are given as follows.
In Stage 1, the attention weights are represented in a similar way as described in Section
6.1.2. The attention weights for the terms in a section segment are a combination of query-to-section and section-to-section attention, following Equation (
2), denoted as
\([A_{s1}, \ldots , A_{sm}]\). Meanwhile, the attention weights of the query segment (denoted as
\([A_{q1}, \ldots , A_{qn}]\)) are based on the query-to-query attention, following Equation (
3). On the other hand, we consider the user annotation ratio as the representation of user attention on the query and section segments, denoted as
\(\mathbf {U_q}=[U_{q1}, \ldots , U_{qn}]\) and
\(\mathbf {U_s}=[U_{s1}, \ldots , U_{sm}]\). The annotation ratio of each term is calculated as the fraction of users whose highlights cover that term.
The objective of Stage 1 is to make the model attention close to the user attention, in other words, to minimize the loss
\(\mathcal {L}(\mathbf {A_q}, \mathbf {A_s}, \mathbf {U_q}, \mathbf {U_s})\). In particular, we consider three types of loss functions in the following experiments. Taking the raw value of the annotation ratio as the observation of the user attention distribution, we optimize an L1 loss (Equation (7)).
Furthermore, we apply a softmax function to each attention representation, and the resulting distributions are denoted as
\(\mathbf {\tilde{A}_q}\),
\(\mathbf {\tilde{A}_s}\),
\(\mathbf {\tilde{U}_q}\), and
\(\mathbf {\tilde{U}_s}\), respectively. Given the normalized distributions, we attempt to minimize the Kullback-Leibler divergence loss (Equation (
8)) or the MSE loss (Equation (
9)).
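As a concrete illustration, the following PyTorch sketch gives one plausible implementation of the three Stage-1 loss functions, assuming the query and section parts are simply summed; the reduction choices are assumptions rather than the exact formulation of Equations (7)–(9).

```python
import torch
import torch.nn.functional as F


def stage1_loss(a_q: torch.Tensor, a_s: torch.Tensor,
                u_q: torch.Tensor, u_s: torch.Tensor, kind: str = "L1") -> torch.Tensor:
    """Align model attention (a_q, a_s) with user annotation ratios (u_q, u_s)."""
    if kind == "L1":
        # raw annotation ratios as the targets
        return F.l1_loss(a_q, u_q) + F.l1_loss(a_s, u_s)
    # softmax-normalized distributions for the KLD and MSE variants
    a_q, a_s = F.softmax(a_q, dim=-1), F.softmax(a_s, dim=-1)
    u_q, u_s = F.softmax(u_q, dim=-1), F.softmax(u_s, dim=-1)
    if kind == "KLD":
        return (F.kl_div(a_q.log(), u_q, reduction="sum")
                + F.kl_div(a_s.log(), u_s, reduction="sum"))
    if kind == "MSE":
        return F.mse_loss(a_q, u_q) + F.mse_loss(a_s, u_s)
    raise ValueError(f"unknown loss type: {kind}")
```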
Given the model optimized in Stage 1, we further fine-tune it for relevance prediction with a sentence-pair classification task in Stage 2. Following the standard procedure, the query and section are separated by the [SEP] token to construct the input in the form of
\([CLS] \langle Q \rangle [SEP] \langle S \rangle [SEP]\). Then the final hidden state vector of the [CLS] token is fed into a fully connected layer for binary classification, which optimizes a cross-entropy loss. The text pair is truncated symmetrically in this stage if it exceeds the maximum length, which makes the results comparable to those in Table
12. Since we focus on investigating the attention mechanism in this article, the more complicated models are beyond the research scope and left for future work.
Different from the process of extracting attention from the fine-tuned model in Section
6.1, the proposed framework could be viewed as a reverse process. It first tunes the parameters by optimizing the attention distributions and then fine-tunes the model for relevance prediction.
6.2.2 Experimental Settings.
In the proposed framework, the first stage requires users’ highlights as labels, and thus only the Dataset-U is involved. The query cases are divided randomly at 4:1 into training (12 queries and 48 candidates) and validation sets. According to the former analysis of domain expertise, users without legal knowledge are much more likely to make incorrect relevance judgments. Meanwhile, their attention distribution is also significantly different from that of users majoring in law. Therefore, we only use the annotations of the CU and NCU groups to construct labels, to avoid noisy guidance. In order to make full use of user annotations, we divide the query and candidate sections into segments of no more than 254 characters and construct the input based on each pair of query and section segments. In particular, we filter out the input pairs that involve a segment without any user annotation to ensure numerical stability. Table
16 shows the statistics of the filtered dataset used in Stage 1. In this stage, we utilize the Criminal-BERT [
60] as the base model. As for the training process, we use the Adam optimizer and set the learning rate as
\(1e-5\). We select the stopping point according to the loss on the validation set and adopt the corresponding checkpoints for Stage 2. The second stage requires the final relevance labels for fine-tuning. Therefore, we can also utilize the Dataset-L in this stage. The pre-processing of the dataset is the same as in Section
6.1.1, including data filtering, train/dev set partition, text-pair truncation, and so forth. Similarly, the Adam optimizer is utilized, and the learning rate is set to
\(1e-5\) during training. The main difference is that the fine-tuning process is conducted on the model saved in Stage 1 rather than on the initial Criminal-BERT. According to the loss on the validation set, this stage always converges after about two epochs, and we adopt the best epoch on the validation set for evaluation. To validate the effectiveness of the proposed framework, we consider the model fine-tuned directly from the Criminal-BERT as the baseline.
The Dataset-U is considered as the testing dataset for evaluating relevance prediction. Besides, we also compare the performance on the validation set of Dataset-L. Similar to the previous sections, the evaluation metrics include the ranking metric (e.g., MAP) and classification metrics (e.g., micro-precision, recall, F1). The models of each section category are trained and evaluated separately.
6.2.3 Results.
The performance of relevance prediction on both datasets is shown in Table
17. Among the models, “base” refers to the baseline model that is fine-tuned directly from the Criminal-BERT, while “ts” refers to the proposed two-stage method. Specifically, “L1/KLD/MSE” refer to the three types of loss functions in Stage 1, respectively. In general, the two-stage models outperform the baselines on the “Facts” and “Holdings” sections, suggesting the effectiveness of optimizing model attention via user attention (Stage 1). Moreover, the proposed framework achieves improvements on both datasets, i.e., with and without user highlights. This result is encouraging. Since it is much more time-consuming for annotators to provide fine-grained text annotations than a mere relevance label, an affordable dataset that involves user highlights might be relatively small. The proposed framework can adapt to the limited data size, as the second stage can utilize additional data without user highlights. As a result, it also works well on data without user highlighting. The result shows that we can exploit limited user highlights to improve the general legal case retrieval task.
Among the different sections, the improvements on the “Facts” section are the most notable. The result is consistent with the former analysis in Table
15, where the significant differences mainly occur in the “Facts” section. The results on the “Decision” section seem somewhat anomalous. Given that only a tiny proportion of “Decision” sections contain user annotations (see Table
16), we think that such limited data are likely to be misleading (e.g., cause over-fitting) in the first stage and further hurt the performance of the entire model. Therefore, we mainly look at the “Facts” and “Holdings” sections in the following analysis.
Furthermore, we inspect the attention weights of different models by calculating the similarity with the annotation ratio of the users with law backgrounds (i.e., CU and NCU groups). The similarity is measured via log-likelihood, as described in Section
6.1. Since all the models are trained and evaluated on the truncated texts, we calculate the similarity based on the same texts. Results are shown in Table
18. Compared with the “base” model, the attention weights in our proposed methods are more similar to those of the users. This trend is consistent in both query and candidate cases. The results suggest the effectiveness of integrating user attention into model attention in the proposed method.
6.3 Summary
Regarding RQ3, we investigate the similarity between models and users in their attention distributions. Generally, the BERT model is more likely to agree with users in attention allocation than the traditional BOW models on both query and candidate cases. Specifically, we find that the model’s attention is more similar to that of users without law backgrounds than that of professional users. Meanwhile, the model attention is closer to the users’ when it makes a correct prediction. Inspired by these findings, we propose a two-stage framework that utilizes professional users’ attention distribution for legal case retrieval. Experimental results on two distinct datasets demonstrate encouraging improvements.
7 Conclusions
In this article, we work on understanding relevance judgments in legal case retrieval from multiple perspectives. We conduct a laboratory user study centered on legal relevance that involves 72 participants with distinct domain expertise. With the collected data, we make an in-depth investigation into the process of making relevance judgments and attempt to interpret the user relevance and system relevance in this search scenario. In particular, we have made several interesting findings with regard to the research questions.
Regarding RQ1, we inspect whether the properties of the user, query, and result affect the process of making relevance judgments. In conclusion, the user’s domain expertise significantly affects measures of both subjective user feedback and objective performance. Specifically, users without law backgrounds are more likely to make mistakes and tend to perceive greater task difficulty. The query type (i.e., the number of causes involved in the query case) seems not to make any difference in user feedback, while performance drops under the multi-cause condition. As for the result property, we find that users might make more mistakes and feel less confident when they encounter a potentially irrelevant case. Besides, it is worth mentioning that accuracy and inter-user agreement are distinct measures of performance in legal case retrieval. In our study, although the users majoring in law achieve similar accuracy, the users outside the corresponding legal specialty show greater disagreement in relevance judgments. Meanwhile, we find that users without law backgrounds might make identical mistakes and thus significantly hurt the accuracy of relevance judgments, even though they show better inter-user agreement than some professional users.
Regarding
RQ2, we investigate how users understand legal relevance based on their semi-structured reasons (Section
5.1) and fine-grained text annotations (Section
5.2). As for the generalized relevance criteria, users report following the relevance instructions well and distinguishing the importance of “key element” and “key circumstance” in determining case relevance. Besides, the “cause” is sometimes considered to support the decisions, especially by users specializing in other areas of law, while “issue” and “application of law” seem less helpful in legal practice. Moreover, we observe that users without law backgrounds can hardly understand these legal aspects or their roles in relevance judgments. On the other hand, taking user highlights as the indicator of their attention, we inspect the inter-user consistency and observe various patterns of attention distribution. According to the
Overlap metric, users majoring in law achieve higher consistency in query understanding, and multiple causes in the query or a potentially irrelevant candidate might lead to more disagreement. Different from general web search [
24,
58], the attention distribution in vertical positions can be divided into three groups (i.e., 0%–30%, 30%–80%, and 80%–100%), which might result from positional and structural biases. Furthermore, different patterns of attention distribution can be observed under different domain expertise and relevance judgments.
Regarding RQ3, we extract the attention weights of retrieval models and compare them with users’ attention. Generally speaking, the neural retrieval model (i.e., BERT) seems to be closer to users than the BOW models in terms of attention distribution. Specifically, the model attention is more similar to users without law backgrounds, who are more likely to make mistakes in relevance judgments. Besides, the similarity between the attention distributions decreases when the model makes incorrect relevance predictions. Last but not least, we propose a two-stage framework that utilizes the attention of professional users for legal case retrieval. The experimental results show its improvements.
Our work sheds light on relevance in legal case retrieval. It has promising implications for related research, such as the construction of datasets and the design of retrieval models. For instance, given the effects of domain expertise, relevance annotations for legal case retrieval should be made by users with professional legal training. More discussion or additional annotators might be needed if the annotators do not major in the specific area of law. Besides, since query type and case relevance are also influential factors, they should be considered when collecting labels, for example, by designing a quality assurance mechanism. Moreover, understanding how users make relevance judgments over the entire case document and how they differ from models could further support the development of retrieval models, for example, by considering positional and structural biases. Specifically, the internal structure of the case document also exists in other legal systems [
21,
41], where our findings might be exploited. Beyond legal case retrieval, our methodology and findings could also benefit other similar professional search scenarios, such as patent retrieval, medical search, scientific literature search, and so forth.
8 Limitations and Future Work
We acknowledge some potential limitations of our work. One limitation lies in the base dataset, LeCaRD [
27], based on which the tasks in our study are designed. The dataset is built for legal case retrieval in the Chinese legal system, and some results might need to be re-examined for different legal systems (e.g., the common law). Meanwhile, LeCaRD is not perfect; for example, it contains query cases lacking a final judgment and involves some wrong labels. Since prior cases are not directly cited in the Chinese legal system and no case citation information can be utilized, the relevance labels in LeCaRD are determined by expert judgments with final court decisions as gold references. Given that, a case is unsuitable for inclusion in the public labeled dataset if it lacks a final decision (e.g., a remanded case). LeCaRD has also been updated several times to correct mistakes in its previous versions. In this article, we keep the reported results consistent with the latest version of LeCaRD, even though the updates have not influenced the main conclusions.
Another potential limitation is that the size of the collected data is limited, as in most user studies [
31,
44]. Besides, as an attempt to understand the relationship between user relevance and system relevance, the retrieval models considered in this article are mostly fundamental. More complicated retrieval models are beyond the research scope and left for future work. Moreover, as an approximation of user attention, the highlights might still differ from users’ real attention.
There are several promising directions for future work. Besides the laboratory user study, a large-scale crowd-sourcing study is worth exploring. In particular, with a larger-scale dataset, it would be valuable to further investigate how the factors that affect dataset construction influence downstream applications, such as retrieval system evaluation. This article focuses on the general distribution of attention, while it would also be an interesting direction to perform a linguistic analysis characterizing the differences in the “important” content identified by retrieval models and users. To obtain more precise and fine-grained user attention, an eye-tracking study is also promising. Last but not least, it is worth further investigating how to incorporate the relevance-judging process and domain knowledge into more complicated retrieval models for this task.