CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)

Abstract

In recent years, visual document understanding tasks have become increasingly popular due to the growing demand for commercial applications, especially for processing complex image documents such as contracts and patents. However, high-quality domain-specific datasets are currently available only for English. For other languages such as Chinese, existing English datasets are hard to reuse because writing norms and layout formats differ significantly. To mitigate this issue, this paper introduces the Chinese Commercial Contracts (CCC) dataset for exploring better visual document layout understanding models for Chinese commercial contracts. The dataset contains 10,000 images, each containing various elements such as text, tables, seals, and handwriting. Moreover, we propose the Chinese Layout Understanding Pre-train Transformer (CLUPT) model, which is pre-trained on the proposed CCC dataset by incorporating textual and layout information into the pre-training task. Based on the VisionEncoder-LanguageDecoder model structure, our model can perform end-to-end Chinese document layout understanding tasks. The data and code are available at https://github.com/yysirs/CLUPT.
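As an illustration only, the sketch below shows one way the textual and layout information of a page could be combined into a single target sequence for a language decoder to generate. It is not the format used by CLUPT or the CCC dataset: the `LayoutElement` class, the tag strings, the `<loc_N>` coordinate tokens, and the 1000-bin quantization are all assumptions made for this example. Only the four element types (text, tables, seals, handwriting) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Element types named in the abstract; the tag spellings are hypothetical.
CATEGORIES = ("text", "table", "seal", "handwriting")


@dataclass
class LayoutElement:
    """Hypothetical annotation record, not the CCC annotation schema."""
    category: str                    # one of CATEGORIES
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels
    text: str = ""                   # transcription; may be empty (e.g. seals)


def quantize(value: int, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, value * bins // size))


def serialize(elements: List[LayoutElement], width: int, height: int,
              bins: int = 1000) -> str:
    """Turn layout elements into one string a decoder could be trained to emit."""
    parts = []
    # Assumed reading order: top-to-bottom, then left-to-right.
    for el in sorted(elements, key=lambda e: (e.box[1], e.box[0])):
        if el.category not in CATEGORIES:
            raise ValueError(f"unknown category: {el.category}")
        x0, y0, x1, y1 = el.box
        coords = [quantize(x0, width, bins), quantize(y0, height, bins),
                  quantize(x1, width, bins), quantize(y1, height, bins)]
        loc = "".join(f"<loc_{c}>" for c in coords)
        parts.append(f"<{el.category}>{loc}{el.text}</{el.category}>")
    return "".join(parts)


page = [
    LayoutElement("seal", (500, 800, 700, 1000)),
    LayoutElement("text", (100, 50, 900, 100), "合同编号"),
]
target = serialize(page, width=1000, height=2000)
print(target)
```

Quantizing coordinates relative to the page size keeps the location vocabulary fixed regardless of image resolution, which is one common reason to encode boxes this way in sequence-generation models.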

Notes

  1. https://github.com/PaddlePaddle/PaddleOCR

  2. https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/PPOCRLabel

Acknowledgements

We appreciate the support from the National Natural Science Foundation of China through the Main Research Project on Machine Behavior and Human-Machine Collaborated Decision Making Methodology (72192820 & 72192824), the Pudong New Area Science & Technology Development Fund (PKX2021-R05), the Science and Technology Commission of Shanghai Municipality (22DZ2229004), and the Shanghai Trusted Industry Internet Software Collaborative Innovation Center.

Corresponding author

Correspondence to Man Lan.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, S. et al. (2023). CCC: Chinese Commercial Contracts Dataset for Documents Layout Understanding. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_55

  • DOI: https://doi.org/10.1007/978-3-031-44696-2_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
