CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding

Liu, Shu; Jin, Yongnan; Lu, Harry; Zhao, Shangqing; Lan, Man; Chen, Yuefeng; Yuan, Hao

doi:10.1007/978-3-031-44696-2_55

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

In recent years, visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts and patents. However, high-quality domain-specific datasets are available only for English. For other languages such as Chinese, it is hard to use the existing English datasets because writing norms and layout formats differ significantly. To mitigate this issue, we introduce the Chinese Commercial Contracts (CCC) dataset to explore better visual document layout understanding models for Chinese commercial contracts. The dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-training task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https://github.com/yysirs/CLUPT.
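
As a rough illustration of what "incorporating textual and layout information" into a decoder target might look like, the following minimal Python sketch flattens the elements of one page into a single tagged string. It is not the CLUPT implementation: the `LayoutElement` class, the `<loc_N>` tag format, the 1000-bin coordinate quantization, and the reading-order heuristic are all assumptions made for this example. Only the four element types (text, tables, seals, handwriting) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Element types named in the abstract; everything else here is hypothetical.
CATEGORIES = ("text", "table", "seal", "handwriting")


@dataclass
class LayoutElement:
    category: str                    # one of CATEGORIES
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
    text: str = ""                   # transcription; empty for e.g. seals


def quantize(value: int, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, value * bins // size))


def serialize(elements: List[LayoutElement], width: int, height: int,
              bins: int = 1000) -> str:
    """Flatten page elements into one target string for a language decoder."""
    parts = []
    # Approximate reading order: top-to-bottom, then left-to-right.
    for el in sorted(elements, key=lambda e: (e.box[1], e.box[0])):
        if el.category not in CATEGORIES:
            raise ValueError(f"unknown category: {el.category}")
        x0, y0, x1, y1 = el.box
        coords = (
            quantize(x0, width, bins), quantize(y0, height, bins),
            quantize(x1, width, bins), quantize(y1, height, bins),
        )
        loc = "".join(f"<loc_{c}>" for c in coords)
        parts.append(f"<{el.category}>{loc}{el.text}</{el.category}>")
    return "".join(parts)


if __name__ == "__main__":
    page = [
        LayoutElement("seal", (1000, 500, 1500, 750)),
        LayoutElement("text", (100, 50, 900, 100), "Party A"),
    ]
    print(serialize(page, width=2000, height=1000))
```

Quantizing coordinates to a fixed number of bins keeps the location vocabulary small and independent of image resolution, which is one common way to let a text decoder emit positions as ordinary tokens.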

Acknowledgements

We appreciate the support from the National Natural Science Foundation of China with the Main Research Project on Machine Behavior and Human-Machine Collaborated Decision Making Methodology (72192820 & 72192824), the Pudong New Area Science & Technology Development Fund (PKX2021-R05), the Science and Technology Commission of Shanghai Municipality (22DZ2229004), and the Shanghai Trusted Industry Internet Software Collaborative Innovation Center.

Author information

Authors and Affiliations

School of Computer Science and Technology, East China Normal University, Shanghai, China
Shu Liu, Yongnan Jin, Shangqing Zhao & Man Lan
Shanghai Institute of AI for Education, East China Normal University, Shanghai, China
Man Lan
Shanghai Qibao Dwight High School, Shanghai, China
Harry Lu
Shanghai Transsion Co., Ltd., Shanghai, China
Yuefeng Chen & Hao Yuan
YiJin Tech Co., Ltd., Shanghai, China
Yongnan Jin

Corresponding author

Correspondence to Man Lan.

Editor information

Editors and Affiliations

Emory University, Atlanta, GA, USA
Fei Liu
Microsoft Research Asia, Beijing, China
Nan Duan
Soochow University, Suzhou, China
Qingting Xu
Soochow University, Suzhou, China
Yu Hong

Cite this paper

Liu, S. et al. (2023). CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_55

DOI: https://doi.org/10.1007/978-3-031-44696-2_55
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science; Computer Science (R0)
