A sequence labeling model for catchphrase identification from legal case documents

  • Original Research
Artificial Intelligence and Law

Abstract

In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally very lengthy, containing complex sentence structures, and reading them fully is a strenuous task even for legal practitioners. Having a concise overview of these documents can relieve legal practitioners from the task of reading the complete case statements. Legal catchphrases are (multi-word) phrases that provide a concise overview of the contents of a case document, and automated generation of catchphrases is a challenging problem in legal analytics. In this paper, we propose a novel supervised neural sequence tagging model for the extraction of catchphrases from legal case documents. Specifically, we show that incorporating document-specific information along with a sequence tagging model can enhance the performance of catchphrase extraction. We perform experiments over a set of Indian Supreme Court case documents, for which the gold-standard catchphrases (annotated by legal practitioners) are obtained from a popular legal information system. The performance of our proposed method is compared with that of several existing supervised and unsupervised methods, and our proposed method is empirically shown to be superior to all baselines.
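
To make the sequence tagging formulation concrete, the following minimal sketch shows one way gold-standard catchphrases could be converted into per-token labels for training a tagger. The B/I/O labelling scheme, the function name `bio_tags`, and the example sentence are assumptions made for illustration; the abstract does not state which scheme or preprocessing the proposed model uses.

```python
from typing import List


def bio_tags(tokens: List[str], catchphrases: List[List[str]]) -> List[str]:
    """Label each token B (begins a catchphrase), I (inside one) or O (outside).

    Hypothetical helper: the B/I/O scheme is an assumption, not taken from the paper.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Match longer phrases first so a short phrase cannot split a longer one.
    for phrase in sorted(catchphrases, key=len, reverse=True):
        p = [w.lower() for w in phrase]
        n = len(p)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Only label spans that no earlier phrase has already claimed.
            span_free = all(tag == "O" for tag in tags[start:start + n])
            if span_free and lowered[start:start + n] == p:
                tags[start] = "B"
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


# Invented example sentence and catchphrases, for illustration only.
tokens = "The court granted anticipatory bail under the penal code".split()
gold = [["anticipatory", "bail"], ["penal", "code"]]
tags = bio_tags(tokens, gold)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence tagger is then trained to predict one such label per token, and contiguous B/I runs in its output are read back as extracted catchphrases.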

Code availability

The implementation of the proposed model (D2V-BiGRU-CRF) is available at https://github.com/amarnamarpan/D2V-BiGRU-CRF.

Notes

  1. https://legal.thomsonreuters.com/en/products/westlaw.

  2. https://www.manupatrafast.com/.

  3. Accuracy is a well-known set-based evaluation metric for classification algorithms: it measures the fraction of instances that a model classifies correctly. In the present context, accuracy measures the fraction of catchphrases that a method identifies correctly.

  4. The dataset is available online at https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.

  5. https://scienceie.github.io/index.html.

  6. https://nlp.stanford.edu/pubs/FTDDataset_v1.txt.

  7. The GitHub URL of our noun phrase extractor is https://github.com/amarnamarpan/NNP-extractor.

  8. https://en.wikipedia.org/wiki/Linear_regression.

  9. https://en.wikipedia.org/wiki/Logistic_regression.

  10. The complete list of POS tags can be found at www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.

  11. Available at https://pypi.org/project/python-crfsuite/.

  12. To get Viterbi accuracy scores in pyCRFsuite, one can use the ‘-i’ option while tagging.

  13. Available at http://radimrehurek.com/gensim/index.html.

  14. Available online at https://keras.io/.

  15. Available at https://www.cs.waikato.ac.nz/ml/weka/.

  16. github.com/dnmilne/wikipediaminer.

  17. To compute the ROUGE recall score, we use the implementation found at https://pypi.org/project/rouge-score/.
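
The two evaluation measures mentioned in notes 3 and 17 can be illustrated with a short sketch. It assumes that accuracy is computed over the gold-standard set (the fraction of gold catchphrases that also appear among the predictions), and it uses a simplified clipped unigram recall as a stand-in for ROUGE recall; it does not call the rouge-score package. The function names and the example phrases are hypothetical.

```python
from collections import Counter
from typing import Iterable


def catchphrase_accuracy(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold catchphrases found among the predictions (assumed definition)."""
    gold_set = {g.lower() for g in gold}
    pred_set = {p.lower() for p in predicted}
    if not gold_set:
        return 0.0
    return len(gold_set & pred_set) / len(gold_set)


def unigram_recall(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Clipped unigram overlap divided by the number of gold unigrams.

    A simplified stand-in for ROUGE-1 recall, not the rouge-score implementation.
    """
    gold_counts = Counter(w for g in gold for w in g.lower().split())
    pred_counts = Counter(w for p in predicted for w in p.lower().split())
    total = sum(gold_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred_counts[word]) for word, count in gold_counts.items())
    return overlap / total


# Invented phrases, for illustration only.
gold = ["anticipatory bail", "penal code", "writ petition", "natural justice"]
predicted = ["anticipatory bail", "penal provisions", "natural justice"]

acc = catchphrase_accuracy(predicted, gold)
rec = unigram_recall(predicted, gold)
print(f"accuracy={acc:.3f} unigram_recall={rec:.3f}")
```

The exact-match measure gives no credit for "penal provisions", whereas the unigram measure rewards the shared word "penal"; this is the usual reason for reporting an overlap-based recall alongside a set-based score.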

Acknowledgements

The authors acknowledge faculty members from The West Bengal National University of Juridical Sciences (www.nujs.edu), and Rajiv Gandhi School of Intellectual Property Law (www.iitkgp.ac.in/department/IP) for insightful discussions. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’. The first author is supported by the Visvesvaraya PhD scheme from the Ministry of Electronics and Information Technology (Grant No. VISPHDMEITY-1570).

Funding

The first author is supported by the Ministry of Electronics and Information Technology, Government of India, through the fellowship “Visvesvaraya PhD Scheme for Electronics and IT”. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’.

Author information

Corresponding author

Correspondence to Arpan Mandal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mandal, A., Ghosh, K., Ghosh, S. et al. A sequence labeling model for catchphrase identification from legal case documents. Artif Intell Law 30, 325–358 (2022). https://doi.org/10.1007/s10506-021-09296-2

  • DOI: https://doi.org/10.1007/s10506-021-09296-2
