Benchmarks for Indian Legal NLP: A Survey

Kalamkar, Prathamesh; Venugopalan, Janani; Raghavan, Vivek

doi:10.1007/978-3-031-36190-6_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13856)

Included in the following conference series:

JSAI International Symposium on Artificial Intelligence

Abstract

Legal text differs significantly from the general English text (e.g., Wikipedia, news) used to train most natural language processing (NLP) algorithms. As a result, state-of-the-art algorithms (e.g., GPT-3, BERT derivatives) need additional effort (e.g., fine-tuning and further pre-training) to achieve optimal performance on legal text. Hence there is a need to create separate NLP datasets and benchmarks for legal text that are challenging and focus on tasks specific to legal systems. This will spur innovation in applications of NLP for legal text and will benefit both the AI community and the legal fraternity. This paper presents an empirical review of existing work on the use of NLP for Indian legal text and proposes ideas for creating new benchmarks for Indian Legal NLP.

Supported by Ek Step Foundation.

Acknowledgements

This paper is funded by the EkStep Foundation.

Author information

Authors and Affiliations

Thoughtworks India Pvt. Ltd., Chennai, India
Prathamesh Kalamkar & Janani Venugopalan
Ek Step Foundation, Bangalore, India
Vivek Raghavan

Authors

Prathamesh Kalamkar
Janani Venugopalan
Vivek Raghavan

Corresponding author

Correspondence to Prathamesh Kalamkar.

Editor information

Editors and Affiliations

Kansai University, Suita, Japan
Katsutoshi Yada
Tokyo Metropolitan University, Tokyo, Japan
Yasufumi Takama
Keio University, Tokyo, Japan
Koji Mineshima
National Institute of Informatics, Tokyo, Japan
Ken Satoh

About this paper

Cite this paper

Kalamkar, P., Venugopalan, J., Raghavan, V. (2023). Benchmarks for Indian Legal NLP: A Survey. In: Yada, K., Takama, Y., Mineshima, K., Satoh, K. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2021. Lecture Notes in Computer Science, vol 13856. Springer, Cham. https://doi.org/10.1007/978-3-031-36190-6_3

DOI: https://doi.org/10.1007/978-3-031-36190-6_3
Published: 19 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36189-0
Online ISBN: 978-3-031-36190-6
eBook Packages: Computer Science, Computer Science (R0)
