DuIE: A Large-Scale Chinese Dataset for Information Extraction

Li, Shuangjie; He, Wei; Shi, Yabing; Jiang, Wenbin; Liang, Haijin; Jiang, Ye; Zhang, Yang; Lyu, Yajuan; Zhu, Yong

doi:10.1007/978-3-030-32236-6_72

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11839)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Information extraction is an important foundation for knowledge graph construction, as well as for many natural language understanding applications. As with many other artificial intelligence tasks, high-quality annotated datasets are essential for training a high-performance information extraction system. Existing datasets, however, are mostly built for English. To promote research in Chinese information extraction and to evaluate the performance of related systems, we build a large-scale, high-quality dataset, named DuIE, and make it publicly available. We design an efficient coarse-to-fine procedure, comprising candidate generation and crowdsourcing annotation, to achieve high data quality at a large data scale. DuIE contains 210,000 sentences and 450,000 instances covering 49 types of commonly used relations, reflecting real-world scenarios. We also hosted an open competition based on DuIE, which attracted 1,896 participants. The competition results demonstrated the potential of this dataset for promoting information extraction research.

Author information

Authors and Affiliations

Baidu Inc., Beijing, 100193, China
Shuangjie Li, Wei He, Yabing Shi, Wenbin Jiang, Haijin Liang, Ye Jiang, Yang Zhang, Yajuan Lyu & Yong Zhu

Corresponding author

Correspondence to Wei He.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jie Tang
National University of Singapore, Singapore, Singapore
Min-Yen Kan
Peking University, Beijing, China
Dongyan Zhao
Peking University, Beijing, China
Sujian Li
Zhengzhou University, Zhengzhou, China
Hongying Zan

About this paper

Cite this paper

Li, S. et al. (2019). DuIE: A Large-Scale Chinese Dataset for Information Extraction. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_72

DOI: https://doi.org/10.1007/978-3-030-32236-6_72
Published: 30 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32235-9
Online ISBN: 978-3-030-32236-6
eBook Packages: Computer Science, Computer Science (R0)
