Feature set aggregator: unsupervised representation learning of sets for their comparison

Multimedia Tools and Applications

Abstract

Unsupervised representation learning of unlabeled multimedia data is an important yet challenging problem for indexing, clustering, and retrieval of such data. There have been many attempts to learn representations from a collection of unlabeled 2D images. In contrast, less attention has been paid to unsupervised representation learning for unordered sets of high-dimensional feature vectors, which are often used to describe multimedia data. One example is a set of local visual features describing a 2D image. This paper proposes a novel algorithm called Feature Set Aggregator (FSA) for accurate and efficient comparison among sets of high-dimensional features. FSA learns a representation, or embedding, of unordered feature sets via optimization using a combination of two training objectives, namely set reconstruction and set embedding, which are carefully designed for set-to-set comparison. Experimental evaluation under three multimedia information retrieval scenarios, using 3D shapes, 2D images, and text documents, demonstrates the efficacy and generality of the proposed algorithm.
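
To make the idea of comparing unordered feature sets concrete, the following is a minimal sketch and not the FSA algorithm itself. It assumes two generic, untrained stand-ins: mean pooling as a set embedding, and a symmetric Chamfer distance as a set reconstruction error. The function names (`mean_pool`, `cosine_similarity`, `chamfer_distance`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(feature_set: List[Vector]) -> List[float]:
    """Aggregate an unordered set of feature vectors into one vector by averaging.

    Assumption: this fixed pooling rule stands in for a learned set embedding.
    """
    n = len(feature_set)
    dim = len(feature_set[0])
    return [sum(v[d] for v in feature_set) / n for d in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def chamfer_distance(set_a: List[Vector], set_b: List[Vector]) -> float:
    """Symmetric Chamfer distance between two sets of vectors.

    For each element, take the squared Euclidean distance to its nearest
    neighbour in the other set, average per direction, and sum both directions.
    Assumption: used here only as an example of a set reconstruction error.
    """
    def sq_dist(u: Vector, v: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    def one_way(src: List[Vector], dst: List[Vector]) -> float:
        return sum(min(sq_dist(u, v) for v in dst) for u in src) / len(src)

    return one_way(set_a, set_b) + one_way(set_b, set_a)


# Toy data (hypothetical): the same set in two different orders, plus a different set.
set_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
set_a_shuffled = [set_a[2], set_a[0], set_a[1]]
set_b = [[-1.0, 0.0], [0.0, -1.0]]

# Similarity of the pooled embeddings, and the set-level distance.
similarity_ab = cosine_similarity(mean_pool(set_a), mean_pool(set_b))
distance_ab = chamfer_distance(set_a, set_b)
```

Both quantities are unchanged when the elements of a set are reordered, which is the property any set-to-set comparison needs. FSA, by contrast, learns its embedding through optimization rather than relying on a fixed pooling rule like the one above.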

Author information

Corresponding author

Correspondence to Takahiko Furuya.

About this article

Cite this article

Furuya, T., Ohbuchi, R. Feature set aggregator: unsupervised representation learning of sets for their comparison. Multimed Tools Appl 78, 35157–35178 (2019). https://doi.org/10.1007/s11042-019-08078-y

  • DOI: https://doi.org/10.1007/s11042-019-08078-y
