Benchmarks for Indian Legal NLP: A Survey

  • Conference paper
New Frontiers in Artificial Intelligence (JSAI-isAI 2021)

Abstract

Legal text differs significantly from the general-domain English text (e.g. Wikipedia, news) used to train most natural language processing (NLP) algorithms. As a result, state-of-the-art algorithms (e.g. GPT-3, BERT derivatives) need additional effort (e.g. fine-tuning and further pre-training) to achieve optimal performance on legal text. Hence there is a need to create separate NLP datasets and benchmarks for legal text that are challenging and focus on tasks specific to legal systems. This will spur innovation in applications of NLP to legal text and will benefit both the AI community and the legal fraternity. This paper presents an empirical review of existing work on the use of NLP for Indian legal text and proposes ideas for creating new benchmarks for Indian Legal NLP.

Supported by EkStep Foundation.

Notes

  1. https://paperswithcode.com/area/natural-language-processing.

  2. https://indiaai.gov.in/article/ai-is-set-to-reform-justice-delivery-in-india.

  3. https://www.news18.com/news/explainers/explained-cji-ramana-says-4-5-crore-cases-pending-heres-what-has-been-fuelling-backlog-3977411.html.

  4. https://github.com/thunlp/LegalPapers.

  5. https://njdg.ecourts.gov.in/njdgnew/index.php.

  6. https://ncrb.gov.in/en/crime-and-criminal-tracking-network-systems-cctns.

  7. https://indiankanoon.org/.

  8. https://www.mhpolice.maharashtra.gov.in/Citizen/MH/PublishedFIRs.aspx.

  9. https://www.indiacode.nic.in/.

Acknowledgements

This paper is funded by the EkStep Foundation.

Author information

Corresponding author

Correspondence to Prathamesh Kalamkar.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Kalamkar, P., Venugopalan, J., Raghavan, V. (2023). Benchmarks for Indian Legal NLP: A Survey. In: Yada, K., Takama, Y., Mineshima, K., Satoh, K. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2021. Lecture Notes in Computer Science, vol 13856. Springer, Cham. https://doi.org/10.1007/978-3-031-36190-6_3

  • DOI: https://doi.org/10.1007/978-3-031-36190-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36189-0

  • Online ISBN: 978-3-031-36190-6

  • eBook Packages: Computer Science, Computer Science (R0)
