Abstract
AI techniques for contract analysis can significantly reduce human workload. This paper presents element tagging on insurance policies (ETIP), a named entity recognition (NER) problem with lengthy, nested entities. Compared with standard NER, ETIP must handle entities of different types that range from a short phrase to a long sentence, as well as phrase or clause entities that may be nested. We present a novel hybrid framework that combines deep learning with heuristic filtering to recognize these lengthy nested elements. First, a convolutional neural network is constructed to select good initial candidates: sliding windows with high softmax probability. Then, a concatenation operator on adjacent candidate segments is introduced to create phrase, clause, or sentence candidates. We design an effective voting strategy to resolve classification conflicts among the concatenated candidates and present a theoretical proof of F1-score optimization. For the experiments, we collected a large dataset of Chinese insurance contracts to test the performance of the proposed method. An extensive set of experiments investigates how sliding window candidates can work effectively with our filtering and voting strategy. The optimal parameters are determined by statistical analysis of the experimental data. The results show that our method performs promisingly on the ETIP problem.
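The filter, concatenate, and vote steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the threshold, the adjacency rule, or the voting rule, so the `Window` type, the 0.9 threshold, the "touching or overlapping" merge rule, the majority vote with a probability tie-break, and the labels `coverage` and `exclusion` are all assumptions made for illustration. The CNN itself is left out; its softmax output per sliding window is taken as given input.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Window:
    start: int    # index of the first token covered by the window
    end: int      # index one past the last token covered
    label: str    # class with the highest softmax probability (hypothetical names)
    prob: float   # that highest softmax probability


def filter_windows(windows, threshold):
    """Keep only windows whose top softmax probability reaches the threshold."""
    return [w for w in windows if w.prob >= threshold]


def concatenate_adjacent(windows):
    """Group windows that touch or overlap into runs of adjacent candidates."""
    groups = []
    for w in sorted(windows, key=lambda w: (w.start, w.end)):
        if groups and w.start <= max(x.end for x in groups[-1]):
            groups[-1].append(w)
        else:
            groups.append([w])
    return groups


def vote(group):
    """Majority vote over window labels; ties go to the larger summed probability."""
    counts = Counter(w.label for w in group)
    mass = Counter()
    for w in group:
        mass[w.label] += w.prob
    label = max(counts, key=lambda name: (counts[name], mass[name]))
    return (min(w.start for w in group), max(w.end for w in group), label)


def tag_elements(windows, threshold=0.9):
    """Filter, concatenate, and vote; the 0.9 default is an assumed value."""
    kept = filter_windows(windows, threshold)
    return [vote(g) for g in concatenate_adjacent(kept)]


# Made-up window scores standing in for CNN output.
example = [
    Window(0, 5, "coverage", 0.95),
    Window(3, 8, "coverage", 0.92),
    Window(6, 11, "exclusion", 0.91),
    Window(11, 16, "coverage", 0.40),   # dropped by the filter
    Window(20, 25, "exclusion", 0.97),
]
print(tag_elements(example))
# [(0, 11, 'coverage'), (20, 25, 'exclusion')]
```

In this toy run the first three windows merge into one long candidate whose conflicting labels are settled by the vote, which is the kind of conflict the abstract's voting strategy is meant to resolve.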






Acknowledgements
Kai Zhang and Yuxuan Sun contributed equally to this work. This work was supported by the National Natural Science Foundation of China (No. 61902346) and the National Innovation and Entrepreneurship Training Program for College Students (No. 201913021001, 201913021002).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Sun, L., Zhang, K., Sun, Y. et al. ETIP: a lengthy nested NER problem for Chinese insurance policy analysis. Pattern Anal Applic 23, 1755–1765 (2020). https://doi.org/10.1007/s10044-020-00885-6
DOI: https://doi.org/10.1007/s10044-020-00885-6