A sequence labeling model for catchphrase identification from legal case documents

Mandal, Arpan; Ghosh, Kripabandhu; Ghosh, Saptarshi; Mandal, Sekhar

doi:10.1007/s10506-021-09296-2

Original Research
Published: 30 July 2021

Volume 30, pages 325–358, (2022)
Artificial Intelligence and Law

Arpan Mandal (ORCID: orcid.org/0000-0001-8376-429X)¹,
Kripabandhu Ghosh²,
Saptarshi Ghosh³ &
Sekhar Mandal¹

Abstract

In a Common Law system, legal practitioners need frequent access to prior case documents that discuss relevant legal issues. Case documents are generally very lengthy, containing complex sentence structures, and reading them fully is a strenuous task even for legal practitioners. Having a concise overview of these documents can relieve legal practitioners from the task of reading the complete case statements. Legal catchphrases are (multi-word) phrases that provide a concise overview of the contents of a case document, and automated generation of catchphrases is a challenging problem in legal analytics. In this paper, we propose a novel supervised neural sequence tagging model for the extraction of catchphrases from legal case documents. Specifically, we show that incorporating document-specific information along with a sequence tagging model can enhance the performance of catchphrase extraction. We perform experiments over a set of Indian Supreme Court case documents, for which the gold-standard catchphrases (annotated by legal practitioners) are obtained from a popular legal information system. The performance of our proposed method is compared with that of several existing supervised and unsupervised methods, and our proposed method is empirically shown to be superior to all baselines.

Code availability

The implementation of the proposed model (D2V-BiGRU-CRF) is available at https://github.com/amarnamarpan/D2V-BiGRU-CRF.
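
The repository above is the authoritative implementation. As a rough illustration of the sequence tagging formulation mentioned in the abstract, the sketch below turns gold-standard catchphrases into token-level labels that a tagger could be trained on. The B/I/O label scheme, the function name `bio_tags`, whitespace tokenization, and the example sentence are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import List


def bio_tags(tokens: List[str], catchphrases: List[List[str]]) -> List[str]:
    """Label each token B (begins a catchphrase), I (inside one) or O (outside).

    Hypothetical helper: the B/I/O scheme is an assumption, not the paper's
    documented labeling scheme.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Match longer phrases first so a short phrase cannot claim part of a long one.
    for phrase in sorted(catchphrases, key=len, reverse=True):
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if window_free and lowered[start:start + n] == target:
                tags[start] = "B"
                for k in range(start + 1, start + n):
                    tags[k] = "I"
    return tags


# Made-up example sentence and catchphrases, for illustration only.
tokens = "The appellant sought a writ of habeas corpus before the High Court".split()
gold = [["habeas", "corpus"], ["writ"]]
tags = bio_tags(tokens, gold)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence tagger is then trained to predict one such label per token, and contiguous B/I runs in its output are read back as extracted catchphrases.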

Notes

https://legal.thomsonreuters.com/en/products/westlaw.
https://www.manupatrafast.com/.
Accuracy is a well-known set-based evaluation metric for classification algorithms; it measures the fraction of instances that a model classifies correctly. In the present context, accuracy measures the fraction of catchphrases that a method identifies correctly.
The dataset is available online at https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports.
https://scienceie.github.io/index.html.
https://nlp.stanford.edu/pubs/FTDDataset_v1.txt.
Our noun phrase extractor is available on GitHub at https://github.com/amarnamarpan/NNP-extractor.
https://en.wikipedia.org/wiki/Linear_regression.
https://en.wikipedia.org/wiki/Logistic_regression.
The complete list of POS tags can be found at www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.
Available at https://pypi.org/project/python-crfsuite/.
To get Viterbi accuracy scores in pyCRFsuite, one can use the ‘-i’ option while tagging.
Available at http://radimrehurek.com/gensim/index.html.
Available online at https://keras.io/.
Available at https://www.cs.waikato.ac.nz/ml/weka/.
github.com/dnmilne/wikipediaminer.
To compute the ROUGE recall score, we use the implementation found at https://pypi.org/project/rouge-score/.
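
The set-based accuracy described in the notes above can be sketched as follows. This is a minimal illustration under one reading of that note, namely that accuracy is the fraction of gold-standard catchphrases that also appear among the extracted ones. The function name `catchphrase_accuracy`, the case-insensitive exact matching, and the example phrases are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Iterable


def catchphrase_accuracy(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold catchphrases that occur in the predicted set.

    Hypothetical helper: phrases are compared as lower-cased, stripped strings,
    which is an assumed matching rule.
    """
    gold_set = {p.strip().lower() for p in gold}
    pred_set = {p.strip().lower() for p in predicted}
    if not gold_set:
        return 0.0  # nothing to identify, so report zero rather than divide by zero
    return len(gold_set & pred_set) / len(gold_set)


# Made-up phrases for illustration only.
gold = ["habeas corpus", "writ petition", "fundamental rights", "preventive detention"]
predicted = ["Habeas Corpus", "fundamental rights", "high court"]
score = catchphrase_accuracy(predicted, gold)
print(score)
```

Because the comparison is on sets, the order in which a method ranks its catchphrases does not affect the score.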

Acknowledgements

The authors acknowledge faculty members from The West Bengal National University of Juridical Sciences (www.nujs.edu), and Rajiv Gandhi School of Intellectual Property Law (www.iitkgp.ac.in/department/IP) for insightful discussions. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’. The first author is supported by the Visvesvaraya PhD scheme from the Ministry of Electronics and Information Technology (Grant No. VISPHDMEITY-1570).

Funding

The first author is supported by the fellowship “Visvesvaraya PhD Scheme for Electronics and IT” of the Ministry of Electronics and Information Technology, Government of India. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through the project titled ‘Smart Legal Consultant: AI-based Legal Analytics’.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology Shibpur, Howrah, India
Arpan Mandal & Sekhar Mandal
Department of Computational and Data Sciences (CDS), Indian Institute of Science Education and Research (IISER) Kolkata, Kolkata, West Bengal, India
Kripabandhu Ghosh
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Saptarshi Ghosh

Corresponding author

Correspondence to Arpan Mandal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mandal, A., Ghosh, K., Ghosh, S. et al. A sequence labeling model for catchphrase identification from legal case documents. Artif Intell Law 30, 325–358 (2022). https://doi.org/10.1007/s10506-021-09296-2

Accepted: 24 June 2021
Published: 30 July 2021
Issue Date: September 2022
DOI: https://doi.org/10.1007/s10506-021-09296-2
