Abstract
We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and comprises images from existing datasets along with brand-new, expert-curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments using state-of-the-art transformer methods. Our findings show that audiovisual fusion methods outperform methods that use only images or only audio for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 encompassing three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.
The first two authors contributed equally. https://github.com/visipedia/ssw60.
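The paper's fusion baselines are transformer-based; the specific architectures are not described in this abstract. As a minimal, generic illustration of the idea behind modality fusion for classification, the sketch below shows score-level (late) fusion, where per-class probabilities from a visual model and an audio model are combined with a weighted average. The function name, weighting scheme, and toy logits are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def late_fusion(visual_logits, audio_logits, alpha=0.5):
    """Score-level (late) fusion: softmax each modality's logits,
    then take a weighted average of the class probabilities.
    alpha weights the visual modality; (1 - alpha) weights audio."""
    def softmax(x):
        # Numerically stable softmax over the last axis.
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return alpha * softmax(visual_logits) + (1 - alpha) * softmax(audio_logits)

# Toy example: 2 clips, 3 bird classes (made-up logits).
v = np.array([[2.0, 0.5, 0.1],
              [0.2, 1.5, 0.3]])
a = np.array([[1.8, 0.2, 0.4],
              [0.1, 0.3, 2.0]])
probs = late_fusion(v, a)
preds = probs.argmax(axis=-1)  # fused per-clip predictions
```

When the two modalities disagree (as in the second clip above), the fused prediction follows whichever modality is more confident, which is one reason fusion can beat either single-modality classifier.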
Acknowledgment
Serge Belongie is supported in part by the Pioneer Centre for AI, DNRF grant number P1. These investigations would not be possible without the help of the passionate birding community contributing their knowledge and data to the Macaulay Library; thank you!
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Van Horn, G., Qian, R., Wilber, K., Adam, H., Mac Aodha, O., Belongie, S. (2022). Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8