
DuIE: A Large-Scale Chinese Dataset for Information Extraction

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11839)

Abstract

Information extraction is an important foundation for knowledge graph construction and for many natural language understanding applications. As with many other artificial intelligence tasks, high-quality annotated datasets are essential for training a high-performance information extraction system. Existing datasets, however, are mostly built for English. To promote research in Chinese information extraction and to evaluate the performance of related systems, we build a large-scale, high-quality dataset named DuIE and make it publicly available. To achieve high data quality at a large data scale, we design an efficient coarse-to-fine procedure that includes candidate generation and crowdsourcing annotation. DuIE contains 210,000 sentences and 450,000 instances covering 49 types of commonly used relations, reflecting the real-world scenario. We also hosted an open competition based on DuIE, which attracted 1,896 participants. The competition results demonstrated the potential of this dataset for promoting information extraction research.
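As a rough illustration of the scale figures quoted in the abstract, the following minimal Python sketch derives two averages from them. Only the three counts come from the abstract; the helper name `average` and the variable names are hypothetical, and the results are plain averages that say nothing about how instances are actually distributed across sentences or relation types.

```python
# Counts quoted in the abstract; everything else here is illustrative.
NUM_SENTENCES = 210_000
NUM_INSTANCES = 450_000
NUM_RELATION_TYPES = 49


def average(total: int, count: int) -> float:
    """Return total / count, rejecting a non-positive denominator."""
    if count <= 0:
        raise ValueError("count must be positive")
    return total / count


# Plain averages only: the abstract does not give the real distribution.
instances_per_sentence = average(NUM_INSTANCES, NUM_SENTENCES)
instances_per_relation = average(NUM_INSTANCES, NUM_RELATION_TYPES)

print(f"{instances_per_sentence:.2f} instances per sentence on average")
print(f"{instances_per_relation:.0f} instances per relation type on average")
```

An average above one instance per sentence suggests that many sentences carry more than one annotated instance, although the abstract itself does not state this.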


Notes

  1. http://lic2019.ccf.org.cn/

  2. https://baike.baidu.com/

  3. https://baijiahao.baidu.com

  4. http://ai.baidu.com/broad/download


Author information

Corresponding author

Correspondence to Wei He.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, S. et al. (2019). DuIE: A Large-Scale Chinese Dataset for Information Extraction. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_72

  • DOI: https://doi.org/10.1007/978-3-030-32236-6_72

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32235-9

  • Online ISBN: 978-3-030-32236-6

  • eBook Packages: Computer Science, Computer Science (R0)
