Incorporating Structural Information into Legal Case Retrieval

Published: 08 November 2023

Abstract

Legal case retrieval has received increasing attention in recent years. Compared with ad hoc retrieval, however, it poses two distinct challenges. First, case documents are lengthy and have complex legal structures, so most existing dense retrieval models struggle to encode an entire document and capture its structural information. Most existing methods simply truncate the document to meet the input length limit of pre-trained language models (PLMs), which leads to information loss. Second, the definition of relevance in the legal domain differs from that in the general domain, and previous semantic-based or lexical-based methods fail to provide a comprehensive understanding of the relevance of legal cases. In this article, we propose a Structured Legal case Retrieval (SLR) framework, which incorporates internal and external structural information to address these two challenges. To avoid truncating long legal documents, SLR uses the internal structural information, that is, the organization pattern of legal documents, to split a case document into segments. By dividing the document-level semantic matching task into segment-level subtasks, SLR can process each segment with a method suited to its characteristics. In this way, the key elements of a case document are highlighted without losing other content. To better capture relevance in the legal domain, we investigate the connections between criminal charges appearing in a large-scale case corpus and generate a charge-wise relation graph. The similarity between criminal charges can then be pre-computed as external structural information to enhance the recognition of relevant cases. Finally, a learning-to-rank algorithm integrates the features collected from internal and external structures to output the final retrieval results.
Experimental results on public legal case retrieval benchmarks demonstrate the superior effectiveness of SLR over existing state-of-the-art baselines, including traditional bag-of-words and neural-based methods. Furthermore, we conduct a case study to visualize how the proposed model focuses on key elements and improves retrieval performance.
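
The pipeline described in the abstract (segment-level matching, pre-computed charge similarity, and feature combination) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the article: the section markers `FACTS:`, `REASONING:`, and `JUDGMENT:`, the use of Jaccard token overlap as a stand-in for the per-segment matchers, the use of Jaccard overlap of case sets as the charge similarity, and the fixed weights passed to `score`, which stand in for a trained learning-to-rank model.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical section markers; real case documents use other headings.
SECTION_MARKERS = ("FACTS:", "REASONING:", "JUDGMENT:")


def split_into_segments(text):
    """Split a case document into named segments at the section markers."""
    segments = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_MARKERS:
            current = stripped.rstrip(":").lower()
            segments[current] = []
        elif current is not None and stripped:
            segments[current].append(stripped)
    return {name: " ".join(lines) for name, lines in segments.items()}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segment_features(query_segs, cand_segs):
    """One token-overlap feature per segment (stand-in for a real matcher)."""
    feats = []
    for marker in SECTION_MARKERS:
        name = marker.rstrip(":").lower()
        q = set(query_segs.get(name, "").lower().split())
        c = set(cand_segs.get(name, "").lower().split())
        feats.append(jaccard(q, c))
    return feats


def build_charge_similarity(corpus_charges):
    """Pre-compute charge-to-charge similarity from corpus co-occurrence.

    Assumed measure: Jaccard overlap of the sets of cases containing
    each charge. The article's actual graph-based measure may differ.
    """
    cases_by_charge = defaultdict(set)
    for case_id, charges in enumerate(corpus_charges):
        for charge in charges:
            cases_by_charge[charge].add(case_id)
    sim = {}
    for a, b in combinations(sorted(cases_by_charge), 2):
        s = jaccard(cases_by_charge[a], cases_by_charge[b])
        sim[(a, b)] = sim[(b, a)] = s
    for charge in cases_by_charge:
        sim[(charge, charge)] = 1.0
    return sim


def charge_feature(query_charges, cand_charges, sim):
    """Best pairwise similarity between the charges of two cases."""
    return max(
        (sim.get((q, c), 0.0) for q in query_charges for c in cand_charges),
        default=0.0,
    )


def score(features, weights):
    """Linear combination standing in for a trained learning-to-rank model."""
    return sum(w * f for w, f in zip(weights, features))
```

To rank candidates under these assumptions, one would call `split_into_segments` on the query and on each candidate, concatenate `segment_features(...)` with `charge_feature(...)` into a single feature list, and sort the candidates by `score`. In a real system the weights would be learned from relevance labels rather than fixed by hand.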

Cited By

  • (2024) LawLLM: Law Large Language Model for the US Legal System. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4882–4889. DOI: 10.1145/3627673.3680020. Online publication date: 21-Oct-2024
  • (2024) Vectorizing Judicial Texts: A Novel Approach for Analysing POCSO Judgments using Text Embeddings and Transformers. 2024 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings), 1–5. DOI: 10.1109/AIBThings63359.2024.10863047. Online publication date: 7-Sep-2024
  • (2024) A systematic review of multidimensional relevance estimation in information retrieval. WIREs Data Mining and Knowledge Discovery 14:5. DOI: 10.1002/widm.1541. Online publication date: 7-May-2024

Published In

ACM Transactions on Information Systems, Volume 42, Issue 2
March 2024
897 pages
EISSN:1558-2868
DOI:10.1145/3618075

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 November 2023
Online AM: 19 July 2023
Accepted: 12 June 2023
Revised: 28 April 2023
Received: 04 January 2023
Published in TOIS Volume 42, Issue 2

Author Tags

  1. Legal case retrieval
  2. structural information
  3. relevance

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China
