Estimating Legal Document Structure by Considering Style Information and Table of Contents

Hatsutori, Yoichi; Yoshikawa, Katsumasa; Imai, Haruki

doi:10.1007/978-3-319-61572-1_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10247)

Included in the following conference series:

JSAI International Symposium on Artificial Intelligence

Abstract

Text analytics is used to analyze diverse documents. For example, corporations must analyze legal documents (such as contracts, ordinances, regulations, and global standards) to manage their business risk and meet compliance requirements. However, because such documents are often stored or published without a common structure, they must be preprocessed before text analytics can be applied. Two forms of preprocessing are particularly useful: (1) extracting text, and (2) estimating document structure (such as chapters, sections, and subsections), which defines the range of topics or articles in a document. This paper presents a preprocessing method that estimates document structure from documents without a common structure. The proposed method follows a rule-based approach and consists of three algorithms: (1) one based on style information, such as bold font; (2) one based on numbered objects, such as sections; and (3) one based on a document's Table of Contents, which summarizes the document's structure. Evaluated on 102 documents, the proposed method estimated document structure with 96.6% accuracy.
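To make the rule-based idea concrete, the following is a minimal Python sketch of two of the three cues named in the abstract: numbered objects and the Table of Contents. It is not the authors' implementation. The regular expression, the function names `heading_level` and `estimate_structure`, and the choice to use the Table of Contents as a filter on candidate headings are all assumptions made for illustration. Style information such as bold font is not modeled here, because plain text does not carry it.

```python
import re
from typing import List, Optional, Tuple

# Assumed numbering convention: "1", "1.2", "1.2.3", optionally followed
# by a dot, then whitespace and a title. Real legal documents use many
# other schemes (e.g. "Article 5", "(a)"), which this sketch ignores.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def heading_level(line: str) -> Optional[Tuple[int, str]]:
    """Return (depth, title) if the line looks like a numbered heading."""
    match = NUMBERED.match(line.strip())
    if match is None:
        return None
    depth = match.group(1).count(".") + 1  # "1.2" has depth 2
    return depth, match.group(2)


def estimate_structure(
    lines: List[str], toc_titles: Optional[List[str]] = None
) -> List[Tuple[int, int, str]]:
    """Return (line index, depth, title) for each estimated heading.

    If toc_titles is given, a numbered line is kept only when its title
    also appears in the Table of Contents (hypothetical filtering rule).
    """
    toc = None
    if toc_titles is not None:
        toc = {title.strip().lower() for title in toc_titles}
    headings = []
    for index, line in enumerate(lines):
        found = heading_level(line)
        if found is None:
            continue
        depth, title = found
        if toc is not None and title.strip().lower() not in toc:
            continue  # numbered sentence, not a heading listed in the ToC
        headings.append((index, depth, title))
    return headings


sample = [
    "1 Definitions",
    "In this Agreement the following terms apply.",
    "1.1 Parties",
    "2 copies of this Agreement shall be signed.",
    "2 Term",
]
toc = ["Definitions", "Parties", "Term"]
with_toc = estimate_structure(sample, toc)
without_toc = estimate_structure(sample)
```

In this sample, the numbering rule alone mistakes the body sentence beginning with "2 copies" for a heading, while the Table of Contents filter rejects it. This illustrates why combining several cues can be more reliable than using any one of them alone.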

Notes

1. https://tika.apache.org/

Author information

Authors and Affiliations

IBM Research, Tokyo, IBM Japan Ltd., 19-21 Nihonbashi Hakozaki-cho Chuo-ku, Tokyo, 103-8510, Japan
Yoichi Hatsutori, Katsumasa Yoshikawa & Haruki Imai

Corresponding author

Correspondence to Yoichi Hatsutori.

Editor information

Editors and Affiliations

Graduate School of Business Sciences, University of Tsukuba, Tokyo, Japan
Setsuya Kurahashi
Fujitsu Laboratories Ltd., Kanagawa, Japan
Yuiko Ohta
Chiba University, Chiba, Japan
Sachiyo Arai
National Institute of Informatics, Tokyo, Japan
Ken Satoh
Ochanomizu University, Tokyo, Japan
Daisuke Bekki

About this paper

Cite this paper

Hatsutori, Y., Yoshikawa, K., Imai, H. (2017). Estimating Legal Document Structure by Considering Style Information and Table of Contents. In: Kurahashi, S., Ohta, Y., Arai, S., Satoh, K., Bekki, D. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2016. Lecture Notes in Computer Science, vol 10247. Springer, Cham. https://doi.org/10.1007/978-3-319-61572-1_18

DOI: https://doi.org/10.1007/978-3-319-61572-1_18
Published: 08 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61571-4
Online ISBN: 978-3-319-61572-1
eBook Packages: Computer Science, Computer Science (R0)
