Unsupervised approaches for measuring textual similarity between legal court case reports

Mandal, Arpan; Ghosh, Kripabandhu; Ghosh, Saptarshi; Mandal, Sekhar

doi:10.1007/s10506-020-09280-2

Original Research
Published: 04 January 2021

Volume 29, pages 417–451, (2021)

Artificial Intelligence and Law

Arpan Mandal ORCID: orcid.org/0000-0001-8376-429X¹,
Kripabandhu Ghosh²,
Saptarshi Ghosh³ &
Sekhar Mandal¹

Abstract

In the domain of legal information retrieval, an important challenge is to compute the similarity between two legal documents. Precedents (statements from prior cases) play an important role in the Common Law system, where lawyers frequently need to refer to relevant prior cases. Measuring document similarity is one of the most crucial aspects of any document retrieval system, since it determines the speed, scalability and accuracy of the system. Text-based and network-based methods for computing similarity among case reports have been proposed in prior works, but not without pitfalls. Since legal citation networks are generally highly disconnected, network-based metrics are not well suited to them. To date, only a few text-based methods and the predominant embedding-based methods have been employed, for instance TF-IDF based approaches, and Word2Vec (Mikolov et al. 2013) and Doc2Vec (Le and Mikolov 2014) based approaches. We investigate the performance of 56 different methodologies for computing textual similarity across court case statements when applied to a dataset of Indian Supreme Court cases. Among the 56 methods, thirty are adaptations of existing methods and twenty-six are our proposed methods. The methods studied include models such as BERT (Devlin et al. 2018) and Law2Vec (Ilias 2019). We observe that the more traditional methods (such as TF-IDF and LDA) that rely on a bag-of-words representation perform better than the more advanced context-aware methods (like BERT and Law2Vec) for computing document-level similarity. Finally, we nominate, via empirical validation, five of our best-performing methods as appropriate for measuring similarity between case reports. Among these five, two are adaptations of existing methods and the other three are our proposed methods.
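
To make the bag-of-words idea concrete, the following is a minimal sketch of TF-IDF weighting followed by cosine similarity, written with the Python standard library only. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) and the three toy sentences are all assumptions made for illustration, and none of the text comes from the Indian Supreme Court dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: raw term counts and a smoothed IDF, log((1 + N) / (1 + df)) + 1.
    Other weighting schemes are equally valid.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in term_freq.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "case reports", invented for this sketch.
toy_docs = [
    "The appellant challenged the land acquisition order under the statute.",
    "The land acquisition order was challenged by the appellant before the court.",
    "Income tax assessment of the company was reopened by the officer.",
]

toy_vectors = tfidf_vectors(toy_docs)
print("doc0 vs doc1:", round(cosine_similarity(toy_vectors[0], toy_vectors[1]), 3))
print("doc0 vs doc2:", round(cosine_similarity(toy_vectors[0], toy_vectors[2]), 3))
```

In this sketch the first two toy documents share most of their content words, so their score is higher than that of the first and third, which overlap only on a function word. Word order plays no part in the score, which is what a bag-of-words representation means.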

Notes

https://en.wikipedia.org/wiki/List_of_national_legal_systems/.
https://en.wikipedia.org/wiki/Common_law/.
The 20 Newsgroups dataset can be obtained from https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups.
python-crfsuite can be found online at https://python-crfsuite.readthedocs.io/en/latest/.
Note that Indian Supreme Court case documents are, unfortunately, not divided into sections or subsections, which makes it even more difficult to identify the various legal issues or rhetorical sections in a document.
The mean number of paragraphs in a document was 26.7, and the mean number of words per paragraph was 58.6.
LexisNexis, a well known legal search system (https://www.lexisnexis.com/), is known to be assisted by this principle.
We used the Word2vec implementation from Gensim, an open-source package available at https://radimrehurek.com/gensim/models/word2vec.html.
We used the Doc2vec implementation from Gensim, an open-source package available at https://radimrehurek.com/gensim/models/doc2vec.html.
The pre-trained BERT model is available at https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip.

References

Ahmad WU, Bai X, Peng N, Chang K (2018) Learning robust, transferable sentence representations for text classification. CoRR abs/1810.00681, arXiv:1810.00681
Ainslie J, Ontanon S, Alberti C, Cvicek V, Fisher Z, Pham P, Ravula A, Sanghai S, Wang Q, Yang L (2020) ETC: Encoding long and structured inputs in transformers. arXiv:2004.08483
Backstrom L, Kleinberg J (2014) Romantic partnerships and the dispersion of social ties: A network analysis of relationship status on facebook. In: Proc. ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW), pp 831–841
Batet M, Sánchez D, Valls A, Gibert K (2013) Semantic similarity estimation from multiple ontologies. Applied intelligence 38(1):29–44
Belinkov Y, Mohtarami M, Cyphers S, Glass J (2015) VectorSLU: A continuous word vector approach to answer selection in community question answering systems. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. arXiv:2004.05150
Bhattacharya P, Paul S, Ghosh K, Ghosh S, Wyner A (2019) Identification of rhetorical roles of sentences in indian legal judgments. In: Proceedings of International Conference on Legal Knowledge and Information Systems (JURIX)
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
Brüninghaus S, Ashley KD (2001) Improving the representation of legal case texts with information extraction methods. In: Proceedings of the International Conference on Artificial Intelligence and Law, ICAIL ’01, pp 42–51
Chen Q, Peng Y, Lu Z (2018) Biosentvec: creating sentence embeddings for biomedical texts. CoRR abs/1810.09302, http://arxiv.org/abs/1810.09302
Corrêa Júnior EA, Marinho VQ, dos Santos LB (2017) NILC-USP at SemEval-2017 task 4: a multi-view ensemble for twitter sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp 611–615
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Ekbal A, Haque R, Bandyopadhyay S (2008) Named entity recognition in Bengali: a conditional random field approach. In: Proceedings of the International Joint Conference on Natural Language Processing: Volume-II
Galgani F, Compton P, Hoffmann A (2012) Towards automatic generation of catchphrases for legal case reports. Springer, Berlin Heidelberg, pp 414–425
Iacobacci I, Pilehvar MT, Navigli R (2016) Embeddings for word sense disambiguation: an evaluation study. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 897–907
Ilias C (2019) Law2Vec - Legal Word Embeddings by Ilias Chalkidis. https://archive.org/details/Law2Vec
Kumar S, Reddy PK, Reddy VB, Singh A (2011) Similarity analysis of legal judgments. In: Proc. ACM Compute Conference, pp 17:1–17:4
Kumar S, Reddy PK, Reddy VB, Suri M (2013) Finding similar legal judgements under common law system. Springer, Berlin, pp 103–116
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Jebara T, Xing EP (eds) Proc. International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings, pp 1188–1196
Liu B, Niu D, Wei H, Lin J, He Y, Lai K, Xu Y (2019) Matching article pairs with graphical decomposition and convolutions. In: Proceedings of the Conference of the Association for Computational Linguistics (ACL), pp 6284–6294
Luo F, Xiao H, Chang W (2011) Product named entity recognition using conditional random fields. In: 2011 Fourth international conference on business intelligence and financial engineering, IEEE, pp 86–89
Mandal A, Chaki R, Saha S, Ghosh K, Pal A, Ghosh S (2017a) Measuring similarity among legal court case documents. In: Proceedings of the 10th Annual ACM India Compute Conference, Association for Computing Machinery, Compute ’17, pp 1–9, https://doi.org/10.1145/3140107.3140119
Mandal A, Ghosh K, Bhattacharya A, Pal A, Ghosh S (2017b) Overview of the FIRE 2017 irled track: information retrieval from legal documents. In: Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation, pp 63–68
Mandal A, Ghosh K, Pal A, Ghosh S (2017c) Automatic catchphrase identification from legal court case documents. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, New York, NY, USA, CIKM ’17, pp 2187–2190, http://doi.acm.org/10.1145/3132847.3133102
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781
Minocha A, Singh N, Srivastava A (2015) Finding relevant indian judgments using dispersion of citation network. In: Proc. International Conference on World Wide Web (WWW) Companion, pp 1085–1088
Pappagari R, Żelasko P, Villalba J, Carmiel Y, Dehak N (2019) Hierarchical transformers for long document classification. arXiv:1910.10781
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in python. J Machine Learn Res 12:2825–2830
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. CoRR abs/1802.05365
Pvs A, Karthik G (2006) Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Shallow parsing for south asian languages 21:21–24
Reimers N, Gurevych I (2019) Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 3982–3992
Santos E, Santos EE, Nguyen H, Pan L, Korah J (2011) A large-scale distributed framework for information retrieval in large dynamic search spaces. Appl Intell 35(3):375–398
Silfverberg M, Ruokolainen T, Lindén K, Kurimo M (2014) Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Baltimore, Maryland, pp 259–264
Sugathadasa K, Ayesha B, de Silva N, Perera AS, Jayawardana V, Lakmal D, Perera M (2018) Legal document retrieval using document vector embeddings and deep learning. In: Science and Information Conference, Springer, pp 160–175
Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, Association for computational Linguistics, pp 173–180
Tran V, Le Nguyen M, Tojo S, Satoh K (2020) Encoded summarization: summarizing documents into continuous vector space for legal case retrieval. Artificial Intelligence and Law pp 1–27
Zhang P, Koppaka L (2007) Semantics-based legal citation network. In: Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL), pp 123–130

Funding

The first author is supported by the Visvesvaraya PhD scheme from the Ministry of Electronics and Information Technology (Grant No. VISPHDMEITY-1570).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Howrah, Shibpur, India
Arpan Mandal & Sekhar Mandal
Department of Computational and Data Sciences (CDS), Indian Institutes of Science Education and Research, Kolkata, West Bengal, India
Kripabandhu Ghosh
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Saptarshi Ghosh

Corresponding author

Correspondence to Arpan Mandal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This manuscript is an extended version of our prior work, Mandal et al. (2017a), “Measuring Similarity among Legal Court Case Documents”, ACM COMPUTE conference 2017.

About this article

Cite this article

Mandal, A., Ghosh, K., Ghosh, S. et al. Unsupervised approaches for measuring textual similarity between legal court case reports. Artif Intell Law 29, 417–451 (2021). https://doi.org/10.1007/s10506-020-09280-2

Accepted: 14 December 2020
Published: 04 January 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s10506-020-09280-2
