Estimating Legal Document Structure by Considering Style Information and Table of Contents

  • Conference paper
In: New Frontiers in Artificial Intelligence (JSAI-isAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10247)

Abstract

Text analytics is used to analyze diverse documents. For example, corporations must analyze legal documents (such as contracts, ordinances, regulations, and global standards) to manage business risk and meet compliance requirements. However, because such documents are often stored or published without a common structure, they must be preprocessed before text analytics can be applied. Two forms of preprocessing are particularly useful: (1) extracting text, and (2) estimating document structure (such as chapters, sections, and subsections), which defines the range of topics or articles in a document. This paper presents a preprocessing method that estimates document structure from documents without a common structure. The proposed method follows a rule-based approach and consists of three algorithms: (1) one based on style information, such as bold font; (2) one based on numbered objects, such as sections; and (3) one based on a document’s Table of Contents, which summarizes the document’s structure. The method was evaluated on 102 documents and estimated document structure with 96.6% accuracy.
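
To make the second algorithm (numbered objects) more concrete, the following is a minimal Python sketch of rule-based heading detection. It is an illustration only and not the authors' implementation: the regular expression, the function names (parse_number, is_successor, estimate_outline), and the sequence rule are all assumptions introduced here. The sketch treats a line as a heading only if it starts with a dotted decimal number that continues the numbering seen so far.

```python
import re
from typing import List, Optional, Tuple

# Assumed heading shape: a dotted decimal number ("1", "2.3", "2.3.1"),
# an optional trailing dot, whitespace, then a title.
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")

Heading = Tuple[int, str, str]  # (depth, number, title)


def parse_number(line: str) -> Optional[Tuple[List[int], str]]:
    """Split a candidate heading into its numeric parts and its title."""
    m = HEADING_RE.match(line)
    if m is None:
        return None
    parts = [int(p) for p in m.group(1).split(".")]
    return parts, m.group(2).strip()


def is_successor(prev: List[int], cur: List[int]) -> bool:
    """Check whether `cur` plausibly follows `prev` in a numbered sequence."""
    if not prev:
        return cur == [1]          # the first heading must be "1"
    if cur == prev + [1]:
        return True                # first child, e.g. 1 -> 1.1
    k = len(cur)
    # Sibling (1.1 -> 1.2) or a return to a shallower level (1.2 -> 2).
    return (
        k <= len(prev)
        and cur[:-1] == prev[: k - 1]
        and cur[-1] == prev[k - 1] + 1
    )


def estimate_outline(lines: List[str]) -> List[Heading]:
    """Keep only numbered lines that continue the heading sequence."""
    outline: List[Heading] = []
    prev: List[int] = []
    for line in lines:
        parsed = parse_number(line)
        if parsed is None:
            continue
        parts, title = parsed
        if is_successor(prev, parts):
            outline.append((len(parts), ".".join(map(str, parts)), title))
            prev = parts
    return outline


if __name__ == "__main__":
    sample = [
        "1. Definitions",
        "The parties agree as follows.",
        "1.1 Scope",
        "42 participants signed the agreement.",
        "1.2 Term",
        "2. Obligations",
    ]
    for depth, number, title in estimate_outline(sample):
        print("  " * (depth - 1) + f"{number} {title}")
```

On the sample input, the line starting with "42" is rejected because 42 does not continue the sequence after 1.1, while 1, 1.1, 1.2, and 2 are kept with depths 1, 2, 2, and 1. A body line that happens to start with the next expected number would still be accepted, which is one reason a practical system might combine this rule with style information and the Table of Contents, as the abstract describes.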

Notes

  1. https://tika.apache.org/

Corresponding author

Correspondence to Yoichi Hatsutori.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hatsutori, Y., Yoshikawa, K., Imai, H. (2017). Estimating Legal Document Structure by Considering Style Information and Table of Contents. In: Kurahashi, S., Ohta, Y., Arai, S., Satoh, K., Bekki, D. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2016. Lecture Notes in Computer Science, vol 10247. Springer, Cham. https://doi.org/10.1007/978-3-319-61572-1_18

  • DOI: https://doi.org/10.1007/978-3-319-61572-1_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61571-4

  • Online ISBN: 978-3-319-61572-1

  • eBook Packages: Computer Science, Computer Science (R0)
