Extracting indices from Japanese legal documents

Artificial Intelligence and Law

Abstract

This article addresses the problem of automatically extracting legal indices, which express the important contents of legal documents. Legal indices are not limited to single-word keywords and compound-word (or phrase) keywords; they also include clause keywords. We approach index extraction using structural information of Japanese sentences, i.e. chunks and clauses. Based on the assumption that legal indices are composed of important tokens from the documents, extracting legal indices is treated as a problem of collecting chunks and clauses that contain as many important tokens as possible. Each token is assigned a weight, which is a statistical score such as TF–IDF or Okapi BM25, to indicate its importance. The importance of a chunk or clause is determined by the average weight of the tokens included in that chunk or clause. Highly weighted chunks and clauses are then recognized as the indices for legal documents. The experimental results on Japanese National Pension Act data show that our proposed method achieves better performance (8.6 % higher on F1-score) than TextRank, the most popular unsupervised method for extracting single-word and compound-word keywords. In addition, this approach is also applicable to extracting clause keywords with high performance.
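The core idea of the abstract, scoring a chunk by the average weight of its tokens and keeping the highly weighted chunks, can be sketched as follows. This is only an illustration under stated assumptions: the token weights are made-up values standing in for TF–IDF or BM25 scores, the chunk segmentation is given by hand, and the function names and the fixed `threshold` are hypothetical rather than taken from the article.

```python
from statistics import mean


def score_chunks(chunks, token_weights):
    """Return (chunk, average token weight) pairs, highest average first."""
    scored = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks to avoid averaging nothing
        avg = mean(token_weights.get(tok, 0.0) for tok in chunk)
        scored.append((chunk, avg))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def extract_indices(chunks, token_weights, threshold):
    """Keep chunks whose average weight reaches the (assumed) threshold."""
    return [c for c, avg in score_chunks(chunks, token_weights) if avg >= threshold]


# Hypothetical weights: content words score high, particles score low.
weights = {"年金": 2.0, "保険料": 1.8, "給付": 1.5, "の": 0.1, "を": 0.1}
# Hypothetical chunks, each a tuple of tokens.
chunks = [("年金", "の"), ("保険料",), ("給付", "を"), ("の", "を")]

indices = extract_indices(chunks, weights, threshold=1.0)
print(indices)
```

A chunk made only of particles gets a low average and is dropped, while a chunk dominated by a heavily weighted content word is kept.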

Notes

  1. See online version at http://www.oxforddictionaries.com/.

  2. Independent words (analogous to free morphemes in English) are words which have meaning and can stand alone in a sentence. In Japanese, independent words can be: nouns, pronouns, verbs, adjectives, adjectival nouns, adverbs, adnominal adjectives, conjunctions, or interjections.

  3. Auxiliary words are analogous to bound morphemes in English. They usually follow independent words to express the variation in meaning or to make clear the relations between and among independent words. In Japanese, auxiliary words can be either auxiliary verbs or particles.

  4. Note that the CBAP program identifies only the approximate boundaries of clauses. Hence, in our experiments, we do not use CBAP but separate clauses manually.

  5. The annotators are persons experienced in annotating legal documents, including the Japanese National Pension Act (JNPA). They extracted indices from the main part of the JNPA. Two annotators worked separately on the JNPA document, then discussed their annotations and reached agreement. We used the agreed extraction results as the gold standard for evaluation.

  6. Cabocha is available at https://code.google.com/p/cabocha/.

  7. KNP is available at http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KNP.

  8. The law documents are obtained from the Japanese government web page corpus, which was updated on July 1st, 2013.

  9. \(\textit{Precision} = \frac{\#\,\text{correct keywords}}{\#\,\text{extracted keywords}}\); \(\textit{Recall} = \frac{\#\,\text{correct keywords}}{\#\,\text{annotated keywords}}\); \(F1\text{-score} = 2 \times \frac{\textit{Precision} \times \textit{Recall}}{\textit{Precision} + \textit{Recall}}\).

  10. The manual of MeCab is available at http://taku910.github.io/mecab/.

  11. NTCIR-1, also called Test Collection 1, is a test collection for retrieval testing of Japanese information retrieval systems, and was created in the NTCIR Project at the Research and Development Department of the National Center for Science Information Systems (NACSIS). This data set includes both Japanese and English scientific article abstracts. More details can be found at http://research.nii.ac.jp/ntcir/index-en.html.
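The evaluation measures defined in Note 9 can be computed as in the short sketch below. It assumes that a keyword counts as correct when it exactly matches an annotated keyword as a string; the article's matching criterion may differ, and the function name and sample keywords are hypothetical.

```python
def precision_recall_f1(extracted, annotated):
    """Compute Precision, Recall and F1-score over two keyword collections."""
    extracted, annotated = set(extracted), set(annotated)
    correct = len(extracted & annotated)  # assumed: exact string match
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 extracted keywords, 5 annotated, 2 in common.
extracted = ["年金", "保険料", "場合", "規定"]
annotated = ["年金", "保険料", "給付", "被保険者", "老齢基礎年金"]
p, r, f1 = precision_recall_f1(extracted, annotated)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```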

References

  • Ashley KD, Brüninghaus S (2003) A predictive role for intermediate legal concepts. In: Proceedings of conference on JURIX’03, pp 1–10

  • Biber D, Johansson S, Leech G, Conrad S, Finegan E (1999) Longman grammar of spoken and written English. Pearson Education, England, pp 120

  • Blair DC, Maron ME (1985) An evaluation of retrieval effectiveness for a full-text document-retrieval system. Commun ACM 28(3):289–299

  • Brüninghaus S, Ashley KD (1999) Toward adding knowledge to learning algorithms for indexing legal cases. In: Proceedings of conference on ICAIL’99, pp 9–17

  • Frank E, Paynter GW, Witten IH, Gutwin C, Nevill-Manning CG (1999) Domain-specific keyphrase extraction. In: Proceedings of 16th international joint conference on artificial intelligence, pp 668–673

  • Grabmair M, Ashley KD (2011) Facilitating case comparison using value judgments and intermediate legal concepts. In: Proceedings of conference on ICAIL’11, pp 161–170

  • Hulth A (2003) Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of conference on EMNLP-ACL’03, pp 216–223

  • Katayama T (2007) Legal engineering—an engineering approach to laws in e-society age. In: Proceedings of the 1st international workshop on JURISIN, pp 1–5

  • Kudo T, Matsumoto Y (2002) Japanese dependency analysis using cascaded chunking. In: Proceedings of conference on ACL-CoNLL’02, pp 63–69

  • Kurohashi S, Nagao M (1994a) A syntactic analysis method of long Japanese sentences based on coordinate structures’ detection. Nat Lang Process 1(1):35–57 (In Japanese)

  • Kurohashi S, Nagao M (1994b) A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Comput Linguist 20(4):507–534

  • Le TTN, Nguyen ML, Shimazu A (2013) Unsupervised keyword extraction for Japanese legal documents. In: Proceedings of conference on JURIX’13, pp 97–106

  • Litvak M, Last M (2008) Graph-based keyword extraction for single-document summarization. In: Proceedings of conference on COLING’08, pp 17–24

  • Liu Z, Li P, Zheng Y, Sun M (2009) Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of conference on EMNLP-ACL’09, pp 257–266

  • Liu Z, Huang W, Zheng Y, Sun M (2010) Automatic keyphrase extraction via topic decomposition. In: Proceedings of conference on EMNLP-ACL’10, pp 366–376

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

  • Maruyama T, Kashioka H, Kumano T, Tanaka H (2004) Development and evaluation of Japanese clause boundaries annotation program. Nat Lang Process 11(3):39–68 (in Japanese)

  • Mathieu J (1999) Adaptation of a keyphrase extractor for Japanese text. In: Proceedings of conference on CAIS’99, pp 182–189

  • Matsuo Y, Ishizuka M (2004) Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools 13:157–169

  • Maxwell KT, Schafer B (2008) Concept and context in legal information retrieval. In: Proceedings of conference on JURIX’08, pp 63–72

  • Mihalcea R, Tarau P (2004) TextRank: bringing order into texts. In: Proceedings of conference on EMNLP-ACL’04, pp 404–411

  • Moens MF, Angheluta R (2003) Concept extraction from legal cases: the use of a statistic of coincidence. In: Proceedings of conference on ICAIL’03, pp 142–146

  • Nakagawa H, Mori T (2002) A simple but powerful automatic term extraction method. In: COLING-02 on COMPUTERM 2002: 2nd international workshop on computational terminology, vol 14, pp 1–7

  • Nakagawa H, Mori T (2003) Automatic term recognition based on statistics of compound nouns and their components. Terminology 9(2):201–219

  • Ogawa Y, Matsuda T (1997) Overlapping statistical word indexing: a new indexing method for Japanese text. In: Proceedings of conference on ACM-SIGIR’97, pp 226–234

  • Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1994) Okapi at TREC-3. In: Proceedings of TREC-3, pp 109–126

  • Saravanan M, Ravindran B, Raman S (2009) Improving legal information retrieval using an ontological framework. Artif Intell Law 17(2):101–124

  • Suzuki Y, Fukumoto F, Sekiguchi Y (1997) Keyword extraction of radio news using term weighting for speech recognition. In: Proceedings of conference on ACM-NLPRS’97, pp 301–306

  • Suzuki Y, Fukumoto F, Sekiguchi Y (1998) Keyword extraction using term-domain interdependence for dictation of radio news. In: Proceedings of conference on COLING’98, pp 1272–1276

  • Turney PD (2000) Learning algorithms for keyphrase extraction. Inf Retr 2:303–336

  • Turney PD (1999) Learning to extract keyphrases from text. National Research Council of Canada, Institute for Information Technology, technical report ERB-1057

  • Wan X, Xiao J (2008) Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of conference on AAAI’08, pp 855–860

  • Wu W, Zhang B, Ostendorf M (2010) Automatic generation of personalized annotation tags for twitter users. In: Proceedings of conference on HLT-ACL’10, pp 689–692

  • Yoshida M, Nakagawa H (2005) Automatic term extraction based on perplexity of compound words. In: Proceedings of conference on IJCNLP’05, pp 269–279

  • Zhao WX, Jiang J, He J, Song Y, Achananuparp P, Lim EP, Li X (2011) Topical keyphrase extraction from twitter. In: Proceedings of conference on HLT-ACL’11, pp 379–388

Author information

Corresponding author

Correspondence to Tho Thi Ngoc Le.

Additional information

This article is an improved and extended version of the paper "Unsupervised Keyword Extraction for Japanese Legal Documents", presented at JURIX’13.

Appendices

Appendix 1: Japanese stopwords

We list 44 Japanese stopwords in Table 13.

Table 13 List of 44 Japanese stopwords used in extracting keywords

Appendix 2: Applying proposed approach on Japanese scientific articles

In this section, we present the experimental results of our proposed approach on extracting Japanese keywords in another domain. We use the Japanese article abstracts from the NTCIR-1 data set (see Note 11) for the experiments. The NTCIR-1 data set includes 332,918 Japanese article abstracts with keywords assigned by the authors. All abstracts are used as the corpus for calculating token weights, and 1000 abstracts with their corresponding keywords are randomly chosen as the test set. The selected abstracts have 4366 keywords in total, of which only 2833 actually appear in the abstracts. We evaluate extraction performance against the keywords that appear in the chosen abstracts, i.e. 2833. All algorithm settings and evaluation criteria are the same as for the legal data set JNPA. We compute the weights of tokens using the TF–IDF weighting scheme, where IDF scores are computed with different corpora:

  1. 496,997 Mainichi Shimbun articles from years 1991 to 1995;

  2. 332,918 Japanese scientific article abstracts from the NTCIR-1 data set;

  3. All news articles and scientific abstracts (829,915 documents in total).
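To illustrate why the choice of reference corpus matters, the sketch below computes IDF from different toy corpora and applies it to the same document. The formula IDF(t) = log(N / df(t)) is one common variant and is an assumption here, since the exact TF–IDF definition used in the article is not shown in this section; the corpora, tokens and function names are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """corpus: list of token lists. Returns IDF(t) = log(N / df(t))."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each token once per document
    return {tok: math.log(n_docs / count) for tok, count in df.items()}


def tfidf(doc, idf, default_idf=0.0):
    """Weight each token of doc by term frequency times IDF."""
    tf = Counter(doc)
    return {tok: freq * idf.get(tok, default_idf) for tok, freq in tf.items()}


# Toy stand-ins for a news corpus and a scientific-abstract corpus.
news = [["election", "method"], ["weather", "method"], ["keyword", "election"]]
science = [["keyword", "extraction"], ["keyword", "method"], ["keyword", "graph"]]
doc = ["keyword", "extraction", "method"]

w_news = tfidf(doc, idf_table(news))
w_science = tfidf(doc, idf_table(science))
w_all = tfidf(doc, idf_table(news + science))
print(w_news, w_science, w_all, sep="\n")
```

In this toy setting, a token that occurs in every scientific abstract receives zero weight under the in-domain IDF but a positive weight under the news IDF, which is the kind of effect that comparing the three corpora can expose.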

The baseline for evaluation is TextRank. The parameter settings of TextRank are the same as described in Sect. 5.3. The performance of TextRank on the NTCIR-1 data set with different parameters is plotted in Fig. 11. As shown in Fig. 11, TextRank performs well with cut-off threshold \(T = 1/3\) and percentage of highly ranked tokens \(S=0\). Although TextRank still does not work satisfactorily with its original co-occurrence window size \(W=2\), its performance improves as the window size increases. As with the JNPA data, TextRank's performance stabilizes from a window size of 6 for all settings of the other two parameters.
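For readers unfamiliar with the baseline, the following is a minimal sketch of TextRank-style token ranking over an undirected co-occurrence graph, showing the role of the window size \(W\). It is not the configuration evaluated in the article: the parameters \(T\) and \(S\) are not implemented, the damping factor of 0.85 and the fixed iteration count are conventional assumptions, and the token sequence is hypothetical.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by PageRank-style scores on a co-occurrence graph."""
    # Two tokens are linked if they occur within `window` consecutive tokens.
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != tok:
                neighbors[tok].add(other)
                neighbors[other].add(tok)

    scores = {tok: 1.0 for tok in neighbors}
    for _ in range(iterations):
        # Each token receives an equal share of each neighbor's score.
        scores = {
            tok: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[tok])
            for tok in neighbors
        }
    return scores


# Hypothetical token sequence after stopword and part-of-speech filtering.
tokens = ["pension", "insurance", "pension", "benefit", "pension", "payment"]
for w in (2, 3):
    ranked = sorted(textrank(tokens, window=w).items(), key=lambda kv: -kv[1])
    print(w, ranked)
```

A larger window adds edges between tokens that are further apart, which changes the graph and therefore the ranking.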

Table 14 Results of chunk-based keyword extraction approach in comparison to graph-based ranking approach TextRank on NTCIR-1 data

Fig. 11 The performance of TextRank on NTCIR-1 data with different values of parameters

Table 14 shows the details of the keyword extraction results of TextRank and our approach. From this table, we observe that TextRank achieves better performance than our proposal on NTCIR-1 data. On inspecting the annotated keywords, we found that keywords in scientific documents are usually composed of more than one token. This is why TextRank performs better at parameter \(S=0\), i.e. when keytokens are not considered in the extraction process. Therefore, we also examined the performance of our approach when keytokens are excluded from the extraction process. As a result, our overall performance improves. However, compared with TextRank, this improvement is very small, i.e. only 0.3 %. Hence, we conclude that on general Japanese text, the performance of our approach is similar to that of TextRank. Because our approach performs similarly to TextRank on general text but better on legal text, we conclude that our proposed approach is more suitable for Japanese legal text.

Appendix 3: Extraction performance when changing order in proposed approach

In this section, we present the extraction performance when the order of processing in Algorithm 1 is changed. Specifically, unnecessary words are removed before the chunk/clause weight is calculated. We run experiments on extracting keywords using the Cabocha parser and the TF–IDF weighting scheme. The extraction performance is shown in Table 15. From this table, we see that the overall extraction performance decreases when auxiliary words are removed before calculating the average weights of chunks. The reason for this decrease is that more keywords are extracted while the number of correct keywords does not improve. Hence, the F1-scores are lower than those reported in the main text.
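The toy example below illustrates one plausible way the processing order can change what gets extracted; it is an assumption-laden sketch, not a reproduction of Algorithm 1. The weights, the auxiliary-word set, the threshold and the function names are all hypothetical.

```python
def average_weight(tokens, weights):
    """Average token weight of a chunk; 0.0 for an empty chunk."""
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def strip_auxiliary(tokens, auxiliary):
    """Drop auxiliary words (assumed here to be a fixed set of particles)."""
    return [t for t in tokens if t not in auxiliary]


weights = {"場合": 0.9, "の": 0.1}  # hypothetical token weights
auxiliary = {"の", "を"}            # hypothetical auxiliary words
threshold = 0.8                     # hypothetical cut-off
chunk = ["場合", "の"]

# Order A: weight the full chunk first, (0.9 + 0.1) / 2, then strip.
weight_then_strip = average_weight(chunk, weights)
# Order B: strip first, then weight only the remaining token.
strip_then_weight = average_weight(strip_auxiliary(chunk, auxiliary), weights)

print(weight_then_strip >= threshold, strip_then_weight >= threshold)
```

Under these made-up numbers, the low-weight auxiliary word keeps a moderately weighted chunk below the cut-off in order A, while order B lets it through, so more candidates are extracted without necessarily adding correct ones.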

Table 15 Extraction performance of proposed approach when removing unnecessary words is done before the calculation of the chunk/clause weight

Cite this article

Le, T.T.N., Shirai, K., Nguyen, M.L. et al. Extracting indices from Japanese legal documents. Artif Intell Law 23, 315–344 (2015). https://doi.org/10.1007/s10506-015-9168-8
