Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13668)


Abstract

We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, its counterparts in audio and video remain relatively unexplored. To encourage advances in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and comprises images from existing datasets together with brand-new, expert-curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments using state-of-the-art transformer methods. Our findings show that audiovisual fusion methods outperform methods that use exclusively image-based or audio-based inputs for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.

The first two authors contributed equally. Code and dataset: https://github.com/visipedia/ssw60.
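To make the fusion experiments described in the abstract concrete, below is a minimal late-fusion sketch in PyTorch. It is purely illustrative, not the paper's actual method: the function name, the blending weight `alpha`, and the choice of late fusion over softmax probabilities are all assumptions, whereas the paper benchmarks transformer-based fusion architectures described in the full text.

```python
# A minimal late-fusion sketch, assuming two independently trained 60-way
# classifiers (e.g., a vision transformer on video frames and an audio
# spectrogram model on sound). All names here are hypothetical.
import torch

def fuse_predictions(visual_logits: torch.Tensor,
                     audio_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Blend per-modality scores over the same 60 bird species.

    visual_logits, audio_logits: [batch, 60] unnormalized class scores.
    alpha: assumed weight on the visual modality (0 = audio only,
    1 = visual only).
    """
    visual_probs = visual_logits.softmax(dim=-1)
    audio_probs = audio_logits.softmax(dim=-1)
    return alpha * visual_probs + (1.0 - alpha) * audio_probs

# Example: fused predictions for a batch of 4 clips.
v = torch.randn(4, 60)   # stand-in for video-model logits
a = torch.randn(4, 60)   # stand-in for audio-model logits
pred = fuse_predictions(v, a).argmax(dim=-1)  # predicted species index
```

Late fusion of per-modality probabilities is only one simple baseline; the paper also evaluates fusion inside transformer architectures and modality transfer experiments, where models trained on one modality are evaluated or adapted on another.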



Acknowledgment

Serge Belongie is supported in part by the Pioneer Centre for AI, DNRF grant number P1. These investigations would not be possible without the help of the passionate birding community contributing their knowledge and data to the Macaulay Library; thank you!

Author information

Corresponding author

Correspondence to Grant Van Horn.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4391 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Van Horn, G., Qian, R., Wilber, K., Adam, H., Mac Aodha, O., Belongie, S. (2022). Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_16

  • DOI: https://doi.org/10.1007/978-3-031-20074-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20073-1

  • Online ISBN: 978-3-031-20074-8

  • eBook Packages: Computer Science, Computer Science (R0)
