Abstract
Event extraction aims to identify event triggers and their arguments in text. Recent methods leverage information from modalities beyond text (e.g., images and videos) to enhance event extraction. However, the different modalities are often misaligned at the event level, negatively impacting model performance. To address this issue, we first construct a new multi-modal event extraction benchmark, the Text Video Event Extraction (TVEE) dataset, containing 7,598 text-video pairs. The texts are automatically extracted from video captions, which in most cases are closely aligned with the video content. Second, we present a Cross-modal Contrastive Learning for Event Extraction (CoCoEE) model that extracts events from multi-modal data by contrasting text-video and event-video representations. We conduct extensive experiments on our TVEE dataset and the existing VM2E2 benchmark. The results show that our model outperforms baseline methods in F-score. Furthermore, the proposed cross-modal contrastive learning also improves event extraction in each individual modality. The dataset and code will be released upon acceptance.
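As background for the contrastive objective named in the abstract, the sketch below shows a standard symmetric InfoNCE-style loss for aligning paired text and video embeddings. It is a minimal illustration of the general technique only, assuming PyTorch and hypothetical (batch, dim) encoder outputs; the paper's actual CoCoEE objective and encoders are not reproduced here.

```python
# Minimal sketch of a symmetric InfoNCE-style cross-modal contrastive loss,
# as commonly used to align paired representations (e.g., text and video).
# Illustrative only; not the paper's exact CoCoEE formulation.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_emb: torch.Tensor,
                                 video_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """text_emb, video_emb: (batch, dim) embeddings of paired samples.
    Matched pairs sit on the diagonal of the similarity matrix."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: pull matched pairs together in both directions,
    # text-to-video and video-to-text, pushing mismatched pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example: random embeddings standing in for a batch of 8 text-video pairs.
loss = cross_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```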
Notes
Because the multi-modal evaluation focuses only on event type extraction and thus cannot reflect the performance of every module, we perform the ablation study on the text and video evaluations.
Acknowledgments
Supported by the National Key Research and Development Program of China (No. 2022YFF0712400), the National Natural Science Foundation of China (No. 62276063), and the Natural Science Foundation of Jiangsu Province under Grant No. BK20221457.