Unsupervised approaches for measuring textual similarity between legal court case reports

  • Original Research
Artificial Intelligence and Law

Abstract

In the domain of legal information retrieval, an important challenge is to compute the similarity between two legal documents. Precedents (statements from prior cases) play an important role in the Common Law system, where lawyers frequently need to refer to relevant prior cases. Measuring document similarity is one of the most crucial aspects of any document retrieval system, since it determines the speed, scalability and accuracy of the system. Text-based and network-based methods for computing similarity among case reports have been proposed in prior work, but not without pitfalls. Since legal citation networks are generally highly disconnected, network-based metrics are not well suited to them. To date, only a few text-based methods and the predominant embedding-based methods have been employed, for instance TF-IDF-based approaches and approaches based on Word2Vec (Mikolov et al. 2013) and Doc2Vec (Le and Mikolov 2014). We investigate the performance of 56 different methodologies for computing textual similarity across court case statements when applied to a dataset of Indian Supreme Court cases. Among the 56 methods, thirty are adaptations of existing methods and twenty-six are our proposed methods. The methods studied include models such as BERT (Devlin et al. 2018) and Law2Vec (Ilias 2019). We observe that the more traditional methods (such as TF-IDF and LDA) that rely on a bag-of-words representation perform better than the more advanced context-aware methods (such as BERT and Law2Vec) for computing document-level similarity. Finally, we nominate, via empirical validation, five of our best-performing methods as appropriate for measuring similarity between case reports. Among these five, two are adaptations of existing methods and the other three are our proposed methods.
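To make the bag-of-words family of methods mentioned above concrete, the following is a minimal sketch of TF-IDF weighting followed by cosine similarity. It is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the helper names (`tfidf_vectors`, `cosine`) and the three toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization, no stop-word removal.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed IDF (an illustrative choice).
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "case statements"; real case reports are far longer.
docs = [
    "the appellant challenged the order of the high court under article 226",
    "the high court dismissed the petition filed under article 226",
    "the tenant failed to pay rent for the leased premises",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # two statements sharing legal vocabulary
sim_unrelated = cosine(vecs[0], vecs[2])  # statements sharing only a function word
```

In this toy setting the first two statements score higher than the first and third because they share several content terms, which is the kind of lexical overlap a bag-of-words representation captures.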

Notes

  1. https://en.wikipedia.org/wiki/List_of_national_legal_systems/.

  2. https://en.wikipedia.org/wiki/Common_law/.

  3. The 20 Newsgroups dataset can be obtained from https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups.

  4. python-crfsuite can be found online at https://python-crfsuite.readthedocs.io/en/latest/.

  5. Note that Indian Supreme Court case documents are, unfortunately, not divided into sections or subsections, which makes it even more difficult to identify the various legal issues or rhetorical sections in a document.

  6. The mean number of paragraphs in a document was 26.7, and the mean number of words per paragraph was 58.6.

  7. LexisNexis, a well known legal search system (https://www.lexisnexis.com/), is known to be assisted by this principle.

  8. We have used the Word2vec implementation from Gensim, an open-source package available at (https://radimrehurek.com/gensim/models/word2vec.html).

  9. We used the Doc2vec implementation from Gensim, an open-source package available at (https://radimrehurek.com/gensim/models/doc2vec.html).

  10. The pre-trained BERT model is available at https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip.

References

  • Ahmad WU, Bai X, Peng N, Chang K (2018) Learning robust, transferable sentence representations for text classification. CoRR abs/1810.00681, arXiv:1810.00681

  • Ainslie J, Ontanon S, Alberti C, Cvicek V, Fisher Z, Pham P, Ravula A, Sanghai S, Wang Q, Yang L (2020) ETC: Encoding long and structured inputs in transformers. arXiv:2004.08483

  • Backstrom L, Kleinberg J (2014) Romantic partnerships and the dispersion of social ties: A network analysis of relationship status on Facebook. In: Proc. ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW), pp 831–841

  • Batet M, Sánchez D, Valls A, Gibert K (2013) Semantic similarity estimation from multiple ontologies. Appl Intell 38(1):29–44

  • Belinkov Y, Mohtarami M, Cyphers S, Glass J (2015) VectorSLU: A continuous word vector approach to answer selection in community question answering systems. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

  • Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. arXiv:2004.05150

  • Bhattacharya P, Paul S, Ghosh K, Ghosh S, Wyner A (2019) Identification of rhetorical roles of sentences in Indian legal judgments. In: Proceedings of International Conference on Legal Knowledge and Information Systems (JURIX)

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Brüninghaus S, Ashley KD (2001) Improving the representation of legal case texts with information extraction methods. In: Proceedings of the International Conference on Artificial Intelligence and Law, ICAIL ’01, pp 42–51

  • Chen Q, Peng Y, Lu Z (2018) BioSentVec: creating sentence embeddings for biomedical texts. CoRR abs/1810.09302, http://arxiv.org/abs/1810.09302

  • Corrêa Júnior EA, Marinho VQ, dos Santos LB (2017) NILC-USP at SemEval-2017 task 4: a multi-view ensemble for Twitter sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp 611–615

  • Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  • Ekbal A, Haque R, Bandyopadhyay S (2008) Named entity recognition in Bengali: a conditional random field approach. In: Proceedings of the International Joint Conference on Natural Language Processing: Volume-II

  • Galgani F, Compton P, Hoffmann A (2012) Towards automatic generation of catchphrases for legal case reports. Springer, Berlin Heidelberg, pp 414–425

  • Iacobacci I, Pilehvar MT, Navigli R (2016) Embeddings for word sense disambiguation: an evaluation study. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 897–907

  • Ilias C (2019) Law2Vec - Legal Word Embeddings by Ilias Chalkidis. https://archive.org/details/Law2Vec

  • Kumar S, Reddy PK, Reddy VB, Singh A (2011) Similarity analysis of legal judgments. In: Proc. ACM Compute Conference, pp 17:1–17:4

  • Kumar S, Reddy PK, Reddy VB, Suri M (2013) Finding similar legal judgements under common law system. Springer, Berlin, pp 103–116

  • Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Jebara T, Xing EP (eds) Proc. International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings, pp 1188–1196

  • Liu B, Niu D, Wei H, Lin J, He Y, Lai K, Xu Y (2019) Matching article pairs with graphical decomposition and convolutions. In: Proceedings of the Conference of the Association for Computational Linguistics (ACL), pp 6284–6294

  • Luo F, Xiao H, Chang W (2011) Product named entity recognition using conditional random fields. In: 2011 Fourth international conference on business intelligence and financial engineering, IEEE, pp 86–89

  • Mandal A, Chaki R, Saha S, Ghosh K, Pal A, Ghosh S (2017a) Measuring similarity among legal court case documents. In: Proceedings of the 10th Annual ACM India Compute Conference, Association for Computing Machinery, Compute ’17, pp 1–9, https://doi.org/10.1145/3140107.3140119

  • Mandal A, Ghosh K, Bhattacharya A, Pal A, Ghosh S (2017b) Overview of the FIRE 2017 IRLeD track: information retrieval from legal documents. In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, pp 63–68

  • Mandal A, Ghosh K, Pal A, Ghosh S (2017c) Automatic catchphrase identification from legal court case documents. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM ’17, pp 2187–2190, http://doi.acm.org/10.1145/3132847.3133102

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781

  • Minocha A, Singh N, Srivastava A (2015) Finding relevant Indian judgments using dispersion of citation network. In: Proc. International Conference on World Wide Web (WWW) Companion, pp 1085–1088

  • Pappagari R, Żelasko P, Villalba J, Carmiel Y, Dehak N (2019) Hierarchical transformers for long document classification. arXiv:1910.10781

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  • Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. CoRR abs/1802.05365

  • Pvs A, Karthik G (2006) Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Shallow parsing for South Asian languages 21:21–24

  • Reimers N, Gurevych I (2019) Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 3982–3992

  • Santos E, Santos EE, Nguyen H, Pan L, Korah J (2011) A large-scale distributed framework for information retrieval in large dynamic search spaces. Appl Intell 35(3):375–398

  • Silfverberg M, Ruokolainen T, Lindén K, Kurimo M (2014) Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 259–264

  • Sugathadasa K, Ayesha B, de Silva N, Perera AS, Jayawardana V, Lakmal D, Perera M (2018) Legal document retrieval using document vector embeddings and deep learning. In: Science and Information Conference, Springer, pp 160–175

  • Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for Computational Linguistics, pp 173–180

  • Tran V, Le Nguyen M, Tojo S, Satoh K (2020) Encoded summarization: summarizing documents into continuous vector space for legal case retrieval. Artificial Intelligence and Law pp 1–27

  • Zhang P, Koppaka L (2007) Semantics-based legal citation network. In: Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL), pp 123–130

Funding

The first author is supported by the Visvesvaraya PhD scheme from the Ministry of Electronics and Information Technology (Grant No. VISPHDMEITY-1570).

Author information

Corresponding author

Correspondence to Arpan Mandal.

Additional information

This manuscript is an extended version of our prior work, Mandal et al. (2017a), “Measuring Similarity among Legal Court Case Documents”, presented at the ACM COMPUTE conference 2017.

About this article

Cite this article

Mandal, A., Ghosh, K., Ghosh, S. et al. Unsupervised approaches for measuring textual similarity between legal court case reports. Artif Intell Law 29, 417–451 (2021). https://doi.org/10.1007/s10506-020-09280-2

  • DOI: https://doi.org/10.1007/s10506-020-09280-2
