Abstract
Text analytics is used to analyze diverse documents. For example, corporations must analyze legal documents (such as contracts, ordinances, regulations, and global standards) to manage business risk and meet compliance requirements. However, because such documents are often stored or published without a common structure, they must be preprocessed before text analytics can be applied. Two forms of preprocessing are particularly useful: (1) extracting text, and (2) estimating document structure (such as chapters, sections, and subsections), which defines the range of each topic or article in a document. This paper presents a preprocessing method that estimates document structure from documents without a common structure. The proposed method follows a rule-based approach and consists of three algorithms: (1) one based on style information, such as bold font; (2) one based on numbered objects, such as sections; and (3) one based on a document’s Table of Contents, which summarizes the document’s structure. The method was evaluated on 102 documents and estimated document structure with 96.6% accuracy.
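To make the idea of a rule based on numbered objects concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes that headings begin with a dotted section number such as "1", "2.3", or "4.1.2" and takes the nesting depth from the number of components. The regular expression and the names `NUMBERED_HEADING`, `heading_level`, and `estimate_structure` are hypothetical and do not come from the paper.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rule: a dotted section number ("1", "2.3", "4.1.2"),
# an optional trailing dot, whitespace, then a non-empty title.
NUMBERED_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def heading_level(line: str) -> Optional[Tuple[int, str, str]]:
    """Return (depth, number, title) if the line looks like a numbered heading."""
    match = NUMBERED_HEADING.match(line)
    if match is None:
        return None
    number, title = match.group(1), match.group(2)
    # Depth is the count of dot-separated components: "2.1.3" -> 3.
    depth = number.count(".") + 1
    return depth, number, title


def estimate_structure(lines: List[str]) -> List[Tuple[int, str, str]]:
    """Collect every line that matches the numbered-heading rule, in order."""
    found = []
    for line in lines:
        heading = heading_level(line)
        if heading is not None:
            found.append(heading)
    return found


if __name__ == "__main__":
    sample = [
        "1 Introduction",
        "Text analytics is used to analyze diverse documents.",
        "1.1 Background",
        "2. Method",
        "2.1.3 Details",
    ]
    for depth, number, title in estimate_structure(sample):
        print("  " * (depth - 1) + number + " " + title)
```

A rule this simple also matches body text that happens to start with a number (for example, "102 documents were used"), so a practical method would presumably need further checks, such as sequence consistency or the style and Table of Contents cues mentioned in the abstract.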
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hatsutori, Y., Yoshikawa, K., Imai, H. (2017). Estimating Legal Document Structure by Considering Style Information and Table of Contents. In: Kurahashi, S., Ohta, Y., Arai, S., Satoh, K., Bekki, D. (eds) New Frontiers in Artificial Intelligence. JSAI-isAI 2016. Lecture Notes in Computer Science, vol 10247. Springer, Cham. https://doi.org/10.1007/978-3-319-61572-1_18
DOI: https://doi.org/10.1007/978-3-319-61572-1_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61571-4
Online ISBN: 978-3-319-61572-1
eBook Packages: Computer Science, Computer Science (R0)