Wukong-CMNER: A Large-Scale Chinese Multimodal NER Dataset with Images Modality

  • Conference paper
  • In: Database Systems for Advanced Applications (DASFAA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13945)

Abstract

To date, Multimodal Named Entity Recognition (MNER) has been performed almost exclusively on English corpora. Chinese text is not naturally segmented into words, which makes Chinese NER more challenging; nevertheless, Chinese MNER has received far less attention. Thus, we first construct Wukong-CMNER, a multimodal NER dataset for Chinese that pairs text with images and contains 55,423 annotated image-text pairs. Based on this dataset, we propose a lexicon-based prompting visual clue extraction (LPE) module to capture entity-related visual clues from the image. We further introduce a novel cross-modal alignment (CA) module that uses contrastive learning to make the representations of the two modalities more consistent. Through extensive experiments, we observe that: (1) performance improves discernibly as we move from unimodal to multimodal input, verifying the necessity of integrating visual clues into Chinese NER; (2) the cross-modal alignment module further improves model performance; and (3) both modules are decoupled from the subsequent prediction step, yielding a plug-and-play framework that extends Chinese NER models to the Chinese MNER task. Combined with W2NER [11], LPE and CA achieve state-of-the-art (SOTA) results on Wukong-CMNER, demonstrating their effectiveness.
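
The abstract names contrastive learning as the mechanism behind the CA module but does not spell out the objective. For intuition, the sketch below shows a symmetric InfoNCE-style text-image contrastive loss, a standard formulation of cross-modal alignment; the function name, pooled inputs, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a contrastive cross-modal alignment loss
# (symmetric InfoNCE, as commonly used for text-image alignment).
# Everything here is an illustrative assumption, not the paper's code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_repr, image_repr, temperature=0.07):
    """text_repr, image_repr: (batch, dim) pooled text/image
    representations for matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    t = F.normalize(text_repr, dim=-1)
    v = F.normalize(image_repr, dim=-1)

    # Similarity matrix: entry (i, j) compares text i with image j.
    logits = t @ v.t() / temperature

    # Matched pairs sit on the diagonal and serve as positives;
    # all other pairs in the batch act as in-batch negatives.
    targets = torch.arange(t.size(0), device=t.device)

    # Symmetric loss: text-to-image and image-to-text directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2
```

Because a loss of this kind only shapes the encoders' representations, the aligned features can be handed to any downstream Chinese NER decoder (e.g., W2NER [11]) unchanged, which is consistent with the plug-and-play claim above.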

References

  1. Chen, D., Li, Z., Gu, B., Chen, Z.: Multimodal named entity recognition with image attributes and image knowledge. In: Jensen, C.S., et al. (eds.) DASFAA 2021. LNCS, vol. 12682, pp. 186–201. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73197-7_12

  2. Chen, S., Aguilar, G., Neves, L., Solorio, T.: Can images help recognize entities? A study of the role of images for multimodal NER. arXiv:2010.12712 (2020)

  3. Chen, X., et al.: Good visual guidance makes a better extractor: hierarchical visual prefix for multimodal entity and relation extraction. arXiv:2205.03521 (2022)

  4. Ding, R., Xie, P., Zhang, X., Lu, W., Li, L., Si, L.: A neural multi-digraph model for Chinese NER with gazetteers. In: ACL, pp. 1462–1467 (2019)

  5. Levow, G.-A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: SIGHAN, pp. 108–117 (2006)

  6. Gu, J., et al.: Wukong: 100 million large-scale Chinese cross-modal pre-training dataset and a foundation framework. arXiv:2202.06767 (2022)

  7. Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y.G., Huang, X.: CNN-based Chinese NER with lexicon rethinking. In: IJCAI, pp. 4982–4988 (2019)

  8. Gui, T., et al.: A lexicon-based graph neural network for Chinese NER. In: EMNLP, pp. 1040–1050 (2019)

  9. He, H., Choi, J.D.: The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders. arXiv:2109.06939 (2021)

  10. He, H., Sun, X.: F-score driven max margin neural network for named entity recognition in Chinese social media. arXiv:1611.04234 (2016)

  11. Li, J., Fei, H., Liu, J., Wu, S., Zhang, M., Teng, C., Ji, D., Li, F.: Unified named entity recognition as word-word relation classification. In: AAAI, vol. 36, pp. 10965–10973 (2022)

  12. Li, X., Yan, H., Qiu, X., Huang, X.: FLAT: Chinese NER using flat-lattice transformer. In: ACL, pp. 6836–6842 (2020)

  13. Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon enhanced Chinese sequence labeling using BERT adapter. In: ACL, pp. 5847–5858 (2021)

  14. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 10012–10022 (2021)

  15. Lu, D., Neves, L., Carvalho, V., Zhang, N., Ji, H.: Visual attention model for name tagging in multimodal social media. In: ACL, pp. 1990–1999 (2018)

  16. Ma, R., Peng, M., Zhang, Q., Huang, X.: Simplify the usage of lexicon in Chinese NER. In: ACL, pp. 5951–5960 (2020)

  17. Mengge, X., Bowen, Y., Tingwen, L., Yue, Z., Erli, M., Bin, W.: Porous lattice-based transformer encoder for Chinese NER. In: COLING (2019)

  18. Moon, S., Neves, L., Carvalho, V.: Multimodal named entity recognition for short social media posts. In: NAACL-HLT, pp. 852–860 (2018)

  19. Peng, N., Dredze, M.: Named entity recognition for Chinese social media with jointly trained embeddings. In: EMNLP, pp. 548–554 (2015)

  20. Sui, D., Tian, Z., Chen, Y., Liu, K., Zhao, J.: A large-scale Chinese multimodal NER dataset with speech clues. In: ACL, pp. 2807–2818 (2021)

  21. Sun, L., et al.: RIVA: a pre-trained tweet multimodal model based on text-image relation for multimodal NER. In: COLING, pp. 1852–1862 (2020)

  22. Sun, L., Wang, J., Zhang, K., Su, Y., Weng, F.: RpBERT: a text-image relation propagation-based BERT model for multimodal NER. In: AAAI, vol. 35, pp. 13860–13868 (2021)

  23. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv:1904.09223 (2019)

  24. Wang, X., et al.: ITA: image-text alignments for multi-modal named entity recognition. arXiv:2112.06482 (2021)

  25. Wang, X., et al.: Prompt-based entity-related visual clue extraction and integration for multimodal named entity recognition. In: Database Systems for Advanced Applications. DASFAA 2022. LNCS, vol. 13247, pp. 297–305. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-00129-1_24

  26. Wang, X., et al.: CAT-MNER: multimodal named entity recognition with knowledge-refined cross-modal attention. In: ICME, pp. 1–6. IEEE (2022)

  27. Weischedel, R., et al.: OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia (2013)

  28. Wu, S., Song, X., Feng, Z.: MECT: multi-metadata embedding based cross-transformer for Chinese named entity recognition. In: ACL, pp. 1529–1539 (2021)

  29. Wu, Z., Zheng, C., Cai, Y., Chen, J., Leung, H., Li, Q.: Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In: MM, pp. 1038–1046 (2020)

  30. Xu, B., Huang, S., Sha, C., Wang, H.: MAF: a general matching and alignment framework for multimodal named entity recognition. In: WSDM, pp. 1215–1223 (2022)

  31. Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: deep contextualized entity representations with entity-aware self-attention. In: EMNLP (2020)

  32. Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: ICCV, pp. 4683–4693 (2019)

  33. Yu, J., Jiang, J., Yang, L., Xia, R.: Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In: ACL (2020)

  34. Zhang, D., Wei, S., Li, S., Wu, H., Zhu, Q., Zhou, G.: Multi-modal graph fusion for named entity recognition with targeted visual guidance. In: AAAI, vol. 35, pp. 14347–14355 (2021)

  35. Zhang, H., Koh, J.Y., Baldridge, J., Lee, H., Yang, Y.: Cross-modal contrastive learning for text-to-image generation. In: CVPR, pp. 833–842 (2021)

  36. Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: AAAI (2018)

  37. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: ACL, pp. 1554–1564 (2018)

  38. Zheng, C., Wu, Z., Wang, T., Cai, Y., Li, Q.: Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Trans. Multimedia 23, 2520–2532 (2020)

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China under Grant No. 61772534 and by the Public Computing Cloud, Renmin University of China.

Author information

Corresponding author

Correspondence to Biao Qin.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bao, X., Wang, S., Qi, P., Qin, B. (2023). Wukong-CMNER: A Large-Scale Chinese Multimodal NER Dataset with Images Modality. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_43

  • DOI: https://doi.org/10.1007/978-3-031-30675-4_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30674-7

  • Online ISBN: 978-3-031-30675-4

  • eBook Packages: Computer Science, Computer Science (R0)
