
Cross-Modal Contrastive Learning for Event Extraction

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13945)

Abstract

Event extraction aims to extract triggers and arguments from text. Recent advanced methods leverage information from other modalities (e.g., images and videos) in addition to text to enhance event extraction. However, the modalities are often misaligned at the event level, which hurts model performance. To address this issue, we first construct a new multi-modal event extraction benchmark, the Text Video Event Extraction (TVEE) dataset, containing 7,598 text-video pairs. The texts are automatically extracted from video captions and align closely with the video content in most cases. Second, we present a Cross-modal Contrastive Learning for Event Extraction (CoCoEE) model that extracts events from multi-modal data by contrasting text-video and event-video representations. We conduct extensive experiments on our TVEE dataset and the existing VM2E2 benchmark. The results show that the proposed model outperforms baseline methods in terms of F-score, and that the proposed cross-modal contrastive learning method also improves event extraction in each individual modality. The dataset and code will be released upon acceptance.
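The abstract describes CoCoEE as contrasting text-video and event-video representations. The following is a minimal, self-contained sketch of a symmetric InfoNCE-style cross-modal contrastive loss over paired embeddings, the kind of objective such a description suggests; the embedding dimension, temperature, and symmetric two-direction formulation are assumptions for illustration, not the paper's exact loss.

    # Sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss.
    # Hyperparameters and dimensions are illustrative assumptions, not the
    # actual CoCoEE configuration.
    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                     video_emb: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
        """Contrastive loss for a batch of aligned (text, video) pairs.

        text_emb, video_emb: (batch_size, dim) embeddings; row i of each
        tensor is assumed to come from the same text-video pair.
        """
        # Normalize so the dot product is a cosine similarity.
        text_emb = F.normalize(text_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)

        # Pairwise similarities: entry (i, j) compares text i with video j.
        logits = text_emb @ video_emb.t() / temperature

        # Matching pairs lie on the diagonal.
        targets = torch.arange(text_emb.size(0), device=text_emb.device)

        # Contrast in both directions: text-to-video and video-to-text.
        loss_t2v = F.cross_entropy(logits, targets)
        loss_v2t = F.cross_entropy(logits.t(), targets)
        return (loss_t2v + loss_v2t) / 2

    if __name__ == "__main__":
        # Toy usage with random embeddings standing in for encoder outputs.
        text = torch.randn(8, 256)
        video = torch.randn(8, 256)
        print(cross_modal_contrastive_loss(text, video).item())

The same loss shape can in principle be reused for event-video pairs by swapping in event representations on the text side, which is how the abstract's "event-video" contrast is read here.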


Notes

  1. https://www.youtube.com/c/ondemandnews.

  2. https://cloud.tencent.com/product/ocr-catalog.

  3. https://huggingface.co/bert-base-uncased.

  4. https://huggingface.co/t5-base (see the loading sketch after these notes).

  5. Because the multi-modal evaluation focuses only on event type extraction and cannot reflect the performance of every module, we perform the ablation study on the text evaluation and the video evaluation.
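Notes 3 and 4 point to the bert-base-uncased and t5-base checkpoints on Hugging Face. Below is a minimal sketch of loading them with the transformers library; how CoCoEE actually wires these models into its pipeline is not stated in this excerpt, so the encoding step is only illustrative.

    # Minimal sketch: loading the checkpoints referenced in notes 3 and 4.
    # The use of BERT hidden states as a text-side representation is an
    # assumption for illustration, not a description of the CoCoEE model.
    import torch
    from transformers import (BertModel, BertTokenizer,
                              T5ForConditionalGeneration, T5Tokenizer)

    bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert_encoder = BertModel.from_pretrained("bert-base-uncased")

    t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")
    t5_model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # Encode a caption-style sentence with BERT; the hidden states could
    # serve as a text representation in a cross-modal contrastive setup.
    inputs = bert_tokenizer("Protesters gathered outside the parliament.",
                            return_tensors="pt")
    with torch.no_grad():
        text_states = bert_encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    print(text_states.shape)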


Acknowledgments

Supported by the National Key Research and Development Program of China (No. 2022YFF0712400), the National Natural Science Foundation of China (No. 62276063), and the Natural Science Foundation of Jiangsu Province under Grant No. BK20221457.

Author information

Correspondence to Meng Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, S., Ju, M., Zhang, Y., Zheng, Y., Wang, M., Qi, G. (2023). Cross-Modal Contrastive Learning for Event Extraction. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_51


  • DOI: https://doi.org/10.1007/978-3-031-30675-4_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30674-7

  • Online ISBN: 978-3-031-30675-4

  • eBook Packages: Computer Science, Computer Science (R0)
