Extracting indices from Japanese legal documents

Le, Tho Thi Ngoc; Shirai, Kiyoaki; Nguyen, Minh Le; Shimazu, Akira

doi:10.1007/s10506-015-9168-8

Published: 08 September 2015

Artificial Intelligence and Law, volume 23, pages 315–344 (2015)

Abstract

This article addresses the problem of automatically extracting legal indices, which express the important contents of legal documents. Legal indices are not limited to single-word keywords and compound-word (or phrase) keywords; they also include clause keywords. We approach index extraction using the structural information of Japanese sentences, i.e. chunks and clauses. Based on the assumption that legal indices are composed of important tokens from the documents, we treat index extraction as the problem of collecting chunks and clauses that contain as many important tokens as possible. Each token is assigned a weight, a statistical score such as TF–IDF or Okapi BM25, to indicate its importance. The importance of a chunk or clause is the average weight of the tokens it contains, and highly weighted chunks and clauses are recognized as the indices for legal documents. Experimental results on Japanese National Pension Act data show that our proposed method achieves better performance (8.6 % higher F1-score) than TextRank, the most popular unsupervised method, in extracting single-word and compound-word keywords. In addition, the approach is also applicable to extracting clause keywords with high performance.
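The scoring idea in the abstract (average the token weights of each chunk or clause, then keep the highly weighted ones) can be illustrated with a minimal sketch. It is not the article's Algorithm 1: the function names, the token weights, and the fixed `threshold` are hypothetical, and chunks are assumed to arrive already segmented as lists of tokens.

```python
def chunk_weight(chunk, weights):
    """Average weight of the tokens in one chunk or clause (unknown tokens weigh 0)."""
    if not chunk:
        return 0.0
    return sum(weights.get(token, 0.0) for token in chunk) / len(chunk)


def extract_indices(chunks, weights, threshold):
    """Return chunks whose average token weight reaches the threshold, best first.

    `threshold` is a hypothetical cut-off; the source only says that highly
    weighted chunks and clauses are recognized as indices.
    """
    scored = [(chunk_weight(chunk, weights), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored if score >= threshold]


# Illustrative token weights (e.g. TF-IDF scores computed elsewhere); invented values.
example_weights = {"被保険者": 0.42, "年金": 0.35, "の": 0.02, "は": 0.01}
example_chunks = [["被保険者", "は"], ["年金", "の"], ["被保険者"]]
candidates = extract_indices(example_chunks, example_weights, threshold=0.2)
```

Averaging rather than summing keeps long chunks from winning merely because they contain more tokens, which matches the stated goal of collecting chunks dense in important tokens.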

Notes

See online version at http://www.oxforddictionaries.com/.
Independent words (analogous to free morphemes in English) are words which have meaning and can stand alone in a sentence. In Japanese, independent words can be: nouns, pronouns, verbs, adjectives, adjectival nouns, adverbs, adnominal adjectives, conjunctions, or interjections.
Auxiliary words are analogous to bound morphemes in English. They usually follow independent words to express the variation in meaning or to make clear the relations between and among independent words. In Japanese, auxiliary words can be either auxiliary verbs or particles.
Note that the CBAP program identifies only the approximate boundaries of clauses. Hence, in our experiments, we do not use CBAP but separate clauses manually.
The annotators are persons with experience in annotating several legal documents, including the Japanese National Pension Act (JNPA). They extracted indices from the main part of the JNPA. Two annotators worked separately on the JNPA document, then discussed their annotations and reached agreement. We used the agreed extraction results as the gold standard for evaluation.
Cabocha is available at https://code.google.com/p/cabocha/.
KNP is available at http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KNP.
The law documents were obtained from the Japanese government web page corpus as updated on July 1st, 2013.
\( \textit{Precision} = \frac{\#\,\textit{correct keywords}}{\#\,\textit{extracted keywords}} \); \( \textit{Recall} = \frac{\#\,\textit{correct keywords}}{\#\,\textit{annotated keywords}} \); \( F1\text{-}\textit{score} = 2 \times \frac{\textit{Precision} \times \textit{Recall}}{\textit{Precision} + \textit{Recall}} \).
The manual of MeCab is available at http://taku910.github.io/mecab/.
NTCIR-1, also called Test Collection 1, is a test collection for retrieval testing of Japanese information retrieval systems, and was created in the NTCIR Project at the Research and Development Department of the National Center for Science Information Systems (NACSIS). This data set includes both Japanese and English scientific article abstracts. More details can be found at http://research.nii.ac.jp/ntcir/index-en.html.
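The precision, recall, and F1-score definitions given in the notes can be computed as in the following sketch. It assumes that a keyword counts as correct only when it matches an annotated keyword exactly; the source does not state the matching criterion, and the function name is hypothetical.

```python
def precision_recall_f1(extracted, annotated):
    """Precision, recall and F1 over two keyword collections.

    Assumption: "correct" means exact string match with an annotated keyword.
    """
    extracted, annotated = set(extracted), set(annotated)
    correct = len(extracted & annotated)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with four extracted keywords of which two are among three annotated keywords, precision is 2/4, recall is 2/3, and the F1-score is 4/7.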

References

Ashley KD, Brüninghaus S (2003) A predictive role for intermediate legal concepts. In: Proceedings of conference on JURIX’03, pp 1–10
Biber D, Johansson S, Leech G, Conrad S, Finegan E (1999) Longman grammar of spoken and written English. Pearson Education, England, pp 120
Blair DC, Maron ME (1985) An evaluation of retrieval effectiveness for a full-text document-retrieval system. Commun ACM 28(3):289–299
Brüninghaus S, Ashley KD (1999) Toward adding knowledge to learning algorithms for indexing legal cases. In: Proceedings of conference on ICAIL’99, pp 9–17
Frank E, Paynter GW, Witten IH, Gutwin C, Nevill-Manning CG (1999) Domain-specific keyphrase extraction. In: Proceedings of 16th international joint conference on artificial intelligence, pp 668–673
Grabmair M, Ashley KD (2011) Facilitating case comparison using value judgments and intermediate legal concepts. In: Proceedings of conference on ICAIL’11, pp 161–170
Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of conference on EMNLP-ACL’03, pp 216–223
Katayama T (2007) Legal engineering—an engineering approach to laws in e-society age. In: Proceedings of the 1st international workshop on JURISIN, pp 1–5
Kudo T, Matsumoto Y (2002) Japanese dependency analysis using cascaded chunking. In: Proceedings of conference on ACL-CoNLL’02, pp 63–69
Kurohashi S, Nagao M (1994a) A syntactic analysis method of long Japanese sentences based on coordinate structures’ detection. Nat Lang Process 1(1):35–57 (In Japanese)
Kurohashi S, Nagao M (1994b) A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Comput Linguist 20(4):507–534
Le TTN, Nguyen ML, Shimazu A (2013) Unsupervised keyword extraction for Japanese legal documents. In: Proceedings of conference on JURIX’13, pp 97–106
Litvak M, Last M (2008) Graph-based keyword extraction for single-document summarization. In: Proceedings of conference on COLING’08, pp 17–24
Liu Z, Li P, Zheng Y, Sun M (2009) Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of conference on EMNLP-ACL’09, pp 257–266
Liu Z, Huang W, Zheng Y, Sun M (2010) Automatic keyphrase extraction via topic decomposition. In: Proceedings of conference on EMNLP-ACL’10, pp 366–376
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge
Maruyama T, Kashioka H, Kumano T, Tanaka H (2004) Development and evaluation of Japanese clause boundaries annotation program. Nat Lang Process 11(3):39–68 (in Japanese)
Mathieu J (1999) Adaptation of a keyphrase extractor for Japanese text. In: Proceedings of conference on CAIS’99, pp 182–189
Matsuo Y, Ishizuka M (2004) Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools 13:157–169
Maxwell KT, Schafer B (2008) Concept and context in legal information retrieval. In: Proceedings of conference on JURIX’08, pp 63–72
Mihalcea R, Tarau P (2004) TextRank: bringing order into texts. In: Proceedings of conference on EMNLP-ACL’04, pp 404–411
Moens MF, Angheluta R (2003) Concept extraction from legal cases: the use of a statistic of coincidence. In: Proceedings of conference on ICAIL’03, pp 142–146
Nakagawa H, Mori T (2002) A simple but powerful automatic term extraction method. In: COLING-02 on COMPUTERM 2002: 2nd international workshop on computational terminology, vol 14, pp 1–7
Nakagawa H, Mori T (2003) Automatic term recognition based on statistics of compound nouns and their components. Terminology 9(2):201–219
Ogawa Y, Matsuda T (1997) Overlapping statistical word indexing: a new indexing method for Japanese text. In: Proceedings of conference on ACM-SIGIR’97, pp 226–234
Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1994) Okapi at TREC-3. In: Proceedings of TREC-3, pp 109–126
Saravanan M, Ravindran B, Raman S (2009) Improving legal information retrieval using an ontological framework. Artif Intell Law 17(2):101–124
Suzuki Y, Fukumoto F, Sekiguchi Y (1997) Keyword extraction of radio news using term weighting for speech recognition. In: Proceedings of conference on ACM-NLPRS’97, pp 301–306
Suzuki Y, Fukumoto F, Sekiguchi Y (1998) Keyword extraction using term-domain interdependence for dictation of radio news. In: Proceedings of conference on COLING’98, pp 1272–1276
Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retr 2:303–336
Turney PD (1999) Learning to extract keyphrases from text. National Research Council of Canada, Institute for Information Technology, technical report ERB-1057
Wan X, Xiao J (2008) Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of conference on AAAI’08, pp 855–860
Wu W, Zhang B, Ostendorf M (2010) Automatic generation of personalized annotation tags for twitter users. In: Proceedings of conference on HLT-ACL’10, pp 689–692
Yoshida M, Nakagawa H (2005) Automatic term extraction based on perplexity of compound words. In: Proceedings of conference on IJCNLP’05, pp 269–279
Zhao WX, Jiang J, He J, Song Y, Achananuparp P, Lim EP, Li X (2011) Topical keyphrase extraction from twitter. In: Proceedings of conference on HLT-ACL’11, pp 379–388

Author information

Authors and Affiliations

School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Tho Thi Ngoc Le, Kiyoaki Shirai, Minh Le Nguyen & Akira Shimazu

Corresponding author

Correspondence to Tho Thi Ngoc Le.

Additional information

This article is an improved and extended version of the paper "Unsupervised Keyword Extraction for Japanese Legal Documents", presented at JURIX’13.

Appendices

Appendix 1: Japanese stopwords

We list 44 Japanese stopwords in Table 13.

Table 13 List of 44 Japanese stopwords used in extracting keywords

Appendix 2: Applying proposed approach on Japanese scientific articles

In this section, we present the experimental results of applying our proposed approach to Japanese keyword extraction in another domain. We use the Japanese article abstracts from the NTCIR-1 data set (see Notes) for the experiments. The NTCIR-1 data set includes 332,918 Japanese article abstracts with keywords assigned by the authors. All abstracts are used as the corpus for calculating token weights, and 1000 abstracts with their corresponding keywords are randomly chosen as the test set. The total number of keywords in the selected abstracts is 4366, of which only 2833 actually appear in the abstracts. We evaluated extraction performance against the keywords that occur in the chosen abstracts, i.e. 2833. All algorithm settings and evaluation criteria are the same as for the legal data set JNPA. We compute the weights of tokens using the TF–IDF weighting scheme, where IDF scores are computed with different corpora:

1. 496,997 Mainichi Shimbun articles from years 1991 to 1995;
2. 332,918 Japanese scientific article abstracts from NTCIR-1 data set;
3. All news articles and scientific abstracts (829,915 documents in total).
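Computing IDF from interchangeable background corpora, as in the three settings listed above, might look like the sketch below. The smoothed IDF formula, the handling of tokens unseen in the background corpus, and all names are assumptions; the source does not give the exact TF–IDF variant used.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Return an IDF function estimated from a background corpus of token lists.

    Assumption: smoothed IDF, log((N + 1) / (df + 1)) + 1, so tokens unseen in
    the background corpus still receive a finite (and maximal) score.
    """
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))

    def idf(token):
        return math.log((n_docs + 1) / (doc_freq[token] + 1)) + 1.0

    return idf


def tf_idf(doc_tokens, idf):
    """Weight each distinct token of a non-empty document by relative TF times IDF."""
    counts = Counter(doc_tokens)
    return {tok: (n / len(doc_tokens)) * idf(tok) for tok, n in counts.items()}


# Toy stand-ins for the news corpus, the abstract corpus, and their union.
news = [["election", "result"], ["model", "election"]]
science = [["model", "parsing"], ["model", "corpus"]]
abstract = ["model", "parsing", "model"]
weights_by_corpus = {
    name: tf_idf(abstract, build_idf(corpus))
    for name, corpus in [("news", news), ("science", science), ("all", news + science)]
}
```

Because only the IDF function changes between settings, the same document can be re-weighted against each background corpus without touching the rest of the extraction pipeline.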

The baseline for evaluation is TextRank. The parameter settings of TextRank are the same as described in Sect. 5.3. The performance of TextRank on the NTCIR-1 data set with different parameters is plotted in Fig. 11. As shown in Fig. 11, TextRank performs well with cut-off threshold \(T = 1/3\) and percentage of highly ranked tokens \(S=0\). Although TextRank still does not work satisfactorily with its original co-occurrence window size \(W=2\), its performance improves as the window size increases. As with the JNPA data, TextRank reaches stable performance from a window size of 6 for all settings of the other two parameters.
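To make the role of the window size \(W\) concrete, the following is a rough sketch of a TextRank-style ranking over a token co-occurrence graph. It is not the configuration of Sect. 5.3: the damping factor, iteration count, and function names are assumptions, \(T\) is interpreted here as the fraction of ranked tokens kept, and the parameter \(S\) is not modeled.

```python
from collections import defaultdict


def textrank_scores(tokens, window=2, damping=0.85, iterations=50):
    """Score tokens on an undirected co-occurrence graph.

    Two distinct tokens are linked when they occur within `window` positions
    of each other (window=2 links adjacent tokens only).
    """
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != tok:
                neighbors[tok].add(tokens[j])
                neighbors[tokens[j]].add(tok)
    scores = {tok: 1.0 for tok in neighbors}
    for _ in range(iterations):
        # Each token receives an equal share of every neighbor's current score.
        scores = {
            tok: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[tok])
            for tok in neighbors
        }
    return scores


def top_tokens(scores, fraction=1 / 3):
    """Keep the top `fraction` of ranked tokens (hypothetical reading of T)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]
```

A larger `window` adds edges between tokens that are farther apart, so the graph becomes denser; this is the parameter varied in the experiments described above.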

Table 14 Results of chunk-based keyword extraction approach in comparison to graph-based ranking approach TextRank on NTCIR-1 data

Table 14 shows the details of the keyword extraction results of TextRank and our approach. From this table, we observe that TextRank achieves better performance than our proposal on the NTCIR-1 data. Looking at the annotated keywords, we found that keywords in scientific documents are usually composed of more than one token. That is why TextRank obtains better performance at parameter \(S=0\), i.e. when keytokens are not considered in the extraction process. Therefore, we also examined the performance of our approach when excluding keytokens from the extraction process. As a result, our overall performance improves. However, compared to TextRank, this improvement is very small, i.e. only 0.3 %. Hence, we conclude that on general Japanese text, the performance of our approach is similar to that of TextRank. Because our approach achieves performance similar to TextRank on general text while achieving better performance on legal text, we conclude that our proposed approach is more suitable for Japanese legal text.

Appendix 3: Extraction performance when changing order in proposed approach

In this section, we present the extraction performance when the order of processing in Algorithm 1 is changed. Specifically, unnecessary words are removed before the chunk/clause weight is calculated. We ran experiments on extracting keywords using the Cabocha parser and the TF–IDF weighting scheme. The extraction performance is shown in Table 15. From this table, we see that the overall extraction performance decreases when auxiliary words are removed before calculating the average weights of chunks. The reason for this decrease is that more keywords are extracted while the number of correct keywords does not increase. Hence, the F1-scores are lower than those reported in the paper.
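The effect of the processing order can be illustrated with a small sketch. It assumes that the original order weighs the full chunk first and strips auxiliary words afterwards, that auxiliary words carry low weights, and that selection uses a fixed threshold; the function names, weights, and threshold are all hypothetical and do not reproduce Algorithm 1.

```python
def average_weight(tokens, weights):
    """Mean token weight; an empty token list scores 0."""
    if not tokens:
        return 0.0
    return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)


def weigh_then_strip(chunk, weights, auxiliary, threshold):
    """Assumed original order: score the whole chunk, then drop auxiliary words."""
    if average_weight(chunk, weights) < threshold:
        return None
    return [tok for tok in chunk if tok not in auxiliary]


def strip_then_weigh(chunk, weights, auxiliary, threshold):
    """Changed order: drop auxiliary words first, then score what remains."""
    content = [tok for tok in chunk if tok not in auxiliary]
    if not content or average_weight(content, weights) < threshold:
        return None
    return content


# Invented weights: the particle "の" is given a low weight on purpose.
toy_weights = {"年金": 0.5, "の": 0.1}
toy_chunk = ["年金", "の"]
original_order = weigh_then_strip(toy_chunk, toy_weights, {"の"}, threshold=0.4)
changed_order = strip_then_weigh(toy_chunk, toy_weights, {"の"}, threshold=0.4)
```

In this toy case the chunk averages 0.3 with the particle included and is rejected, but averages 0.5 once the particle is stripped and is accepted. Under these assumptions, stripping first raises chunk averages and lets more candidates through, which is consistent with the observation that more keywords are extracted without more of them being correct.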

Table 15 Extraction performance of proposed approach when removing unnecessary words is done before the calculation of the chunk/clause weight

About this article

Cite this article

Le, T.T.N., Shirai, K., Nguyen, M.L. et al. Extracting indices from Japanese legal documents. Artif Intell Law 23, 315–344 (2015). https://doi.org/10.1007/s10506-015-9168-8

Issue Date: December 2015