
Transformer-Based Spiking Neural Networks for Multimodal Audiovisual Classification


Abstract:

Spiking neural networks (SNNs), as brain-inspired neural networks, have received noteworthy attention due to their advantages of low power consumption, high parallelism, and high fault tolerance. While SNNs have shown promising results on uni-modal data tasks, their deployment in multimodal audiovisual classification remains limited, and their ability to capture correlations between the visual and audio modalities needs improvement. To address these challenges, we propose a novel model called the spiking multimodal transformer (SMMT), which combines SNNs and Transformers for multimodal audiovisual classification. The SMMT model integrates uni-modal subnetworks for the visual and auditory modalities with a novel spiking cross-attention module for fusion, strengthening the correlation between the two modalities. This approach achieves competitive accuracy on multimodal classification tasks with low energy consumption, making it an effective and energy-efficient solution. Extensive experiments on a public event-based data set (N-TIDIGIT&MNIST-DVS) and two self-made audiovisual data sets of real-world objects (CIFAR10-AV and UrbanSound8K-AV) demonstrate the effectiveness and energy efficiency of the proposed SMMT model in multimodal audiovisual classification tasks. Our constructed multimodal audiovisual data sets can be accessed at https://github.com/Guo-Lingyue/SMMT.
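The abstract describes the architecture only at a high level (uni-modal spiking subnetworks fused by a spiking cross-attention module), so the following is a minimal, hypothetical sketch of what such a fusion step could look like, not the paper's implementation. The `LIF` neuron, the per-time-step attention scheme, the layer sizes, and the class names are all illustrative assumptions.

```python
# Hypothetical sketch of spiking cross-attention fusion (NOT the paper's code).
# Visual spike tokens query audio spike tokens at each time step; the fused
# currents are converted back to spikes by a simple leaky integrate-and-fire
# (LIF) step. Shapes and the LIF formulation are illustrative assumptions.
import torch
import torch.nn as nn

class LIF(nn.Module):
    """Minimal leaky integrate-and-fire neuron applied per time step."""
    def __init__(self, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x):                       # x: (T, B, D) input currents
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau       # leaky integration
            s = (v >= self.v_th).float()        # fire when threshold is crossed
            v = v * (1.0 - s)                   # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)              # (T, B, D) binary spike trains

class SpikingCrossAttention(nn.Module):
    """Cross-attention where one modality queries the other, per time step."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lif = LIF()

    def forward(self, vis_spikes, aud_spikes):  # each: (T, B, N, D)
        T, B, N, D = vis_spikes.shape
        out = []
        for t in range(T):                      # attend within each time step
            q = vis_spikes[t]                   # visual tokens as queries
            kv = aud_spikes[t]                  # audio tokens as keys/values
            fused, _ = self.attn(q, kv, kv)
            out.append(fused)
        out = torch.stack(out)                  # (T, B, N, D) fused currents
        return self.lif(out.flatten(2)).view(T, B, N, D)

# Usage: 4 time steps, batch of 2, 16 tokens per modality, 128-dim features.
vis = (torch.rand(4, 2, 16, 128) > 0.8).float()
aud = (torch.rand(4, 2, 16, 128) > 0.8).float()
fused_spikes = SpikingCrossAttention()(vis, aud)
print(fused_spikes.shape)  # torch.Size([4, 2, 16, 128])
```

A symmetric branch with audio queries attending over visual keys/values could be added in the same way; the paper itself should be consulted for the actual fusion design and training details.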
Page(s): 1077 - 1086
Date of Publication: 24 October 2023
