DOI: 10.1145/3404835.3463250
short-paper

LeCaRD: A Legal Case Retrieval Dataset for Chinese Law System

Published: 11 July 2021

Abstract

Legal case retrieval is vital for ensuring justice in different kinds of law systems and has recently received increasing attention in information retrieval (IR) research. However, the relevance judgment criteria of previous retrieval datasets are either not applicable to cases without citation relationships or not instructive enough for future datasets to follow. Moreover, most existing benchmark datasets pay little attention to how queries are selected. In this paper, we construct the Chinese Legal Case Retrieval Dataset (LeCaRD), which contains 107 query cases and over 43,000 candidate cases. Queries and results are drawn from criminal cases published by the Supreme People's Court of China. In particular, to address the difficulty of defining relevance, we propose a series of relevance judgment criteria designed by our legal team, and the corresponding candidate case annotations are conducted by legal experts. We also develop a novel query sampling strategy that takes both query difficulty and diversity into consideration. For dataset evaluation, we implement several existing retrieval models on LeCaRD as baselines. The dataset is now available to the public together with the complete data processing details.
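
As a rough illustration of the kind of lexical baseline such an evaluation might include, the following Python sketch ranks candidate cases for a query case with Okapi BM25. It is not the authors' implementation: the function names, the parameter values k1 = 1.5 and b = 0.75, and the toy token lists are assumptions, and real Chinese case text would first need word segmentation.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized candidate against a tokenized query with Okapi BM25.

    Assumes docs_tokens is a non-empty list of non-empty token lists.
    Repeated query terms are counted once (query term frequency is ignored).
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: number of candidates that contain each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query_tokens, docs_tokens):
    """Return candidate indices ordered from highest to lowest BM25 score."""
    scores = bm25_scores(query_tokens, docs_tokens)
    return sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)


# Hypothetical, pre-tokenized toy candidates (not taken from LeCaRD).
candidates = [
    ["theft", "night", "store"],
    ["fraud", "bank"],
    ["theft", "theft", "vehicle"],
]
ranking = rank(["theft", "vehicle"], candidates)
```

Here `ranking` lists the candidate indices in descending score order; a candidate that shares no term with the query receives a score of zero and falls to the bottom.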

Supplementary Material

MP4 File (1494.mp4)
Presentation video of LeCaRD

    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. legal case retrieval
    2. query sampling
    3. relevance judgment criteria

    Qualifiers

    • Short-paper

    Funding Sources

    • the National Key Research and Development Program of China
    • Beijing Academy of Artificial Intelligence (BAAI)
    • Natural Science Foundation of China
    • Tsinghua University Guoqiang Research Institute

    Conference

    SIGIR '21

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2025) Uncertainty-aware evidential learning for legal case retrieval with noisy correspondence. Information Sciences (121915). DOI: 10.1016/j.ins.2025.121915. Online publication date: Jan-2025
    • (2025) The Use of Artificial Intelligence in Chinese Humanities and Social Sciences Research. KI in Medien, Kommunikation und Marketing (277-300). DOI: 10.1007/978-3-658-46344-1_20. Online publication date: 15-Feb-2025
    • (2024) Predicting Critical Path of Labor Dispute Resolution in Legal Domain by Machine Learning Models Based on SHapley Additive exPlanations and Soft Voting Strategy. Mathematics 12:2 (272). DOI: 10.3390/math12020272. Online publication date: 14-Jan-2024
    • (2024) LEEC for judicial fairness. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (7527-7535). DOI: 10.24963/ijcai.2024/833. Online publication date: 3-Aug-2024
    • (2024) Enhancing Criminal Case Matching through Diverse Legal Factors. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2379-2383). DOI: 10.1145/3626772.3657960. Online publication date: 10-Jul-2024
    • (2024) LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2251-2260). DOI: 10.1145/3626772.3657887. Online publication date: 10-Jul-2024
    • (2024) Explicitly Integrating Judgment Prediction with Legal Document Retrieval: A Law-Guided Generative Approach. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2210-2220). DOI: 10.1145/3626772.3657717. Online publication date: 10-Jul-2024
    • (2024) Event Grounded Criminal Court View Generation with Cooperative (Large) Language Models. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2221-2230). DOI: 10.1145/3626772.3657698. Online publication date: 10-Jul-2024
    • (2024) MileCut: A Multi-view Truncation Framework for Legal Case Retrieval. Proceedings of the ACM Web Conference 2024 (1341-1349). DOI: 10.1145/3589334.3645349. Online publication date: 13-May-2024
    • (2024) A Circumstance-Aware Neural Framework for Explainable Legal Judgment Prediction. IEEE Transactions on Knowledge and Data Engineering 36:11 (5453-5467). DOI: 10.1109/TKDE.2024.3387580. Online publication date: Nov-2024
