Abstract
Unsupervised representation learning of unlabeled multimedia data is an important yet challenging problem for indexing, clustering, and retrieval of such data. There have been many attempts to learn representations from collections of unlabeled 2D images. In contrast, less attention has been paid to unsupervised representation learning for unordered sets of high-dimensional feature vectors, which are often used to describe multimedia data. One example is a set of local visual features describing a 2D image. This paper proposes a novel algorithm called Feature Set Aggregator (FSA) for accurate and efficient comparison among sets of high-dimensional features. FSA learns a representation, or embedding, of unordered feature sets by optimizing a combination of two training objectives, namely set reconstruction and set embedding, both carefully designed for set-to-set comparison. Experimental evaluation under three multimedia information retrieval scenarios using 3D shapes, 2D images, and text documents demonstrates the efficacy as well as the generality of the proposed algorithm.
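As a rough illustration of what comparing unordered feature sets through fixed-length embeddings involves, the following sketch uses plain average pooling and cosine similarity. It is not the FSA algorithm, whose learned encoder and training objectives are not detailed in this abstract; the function names `aggregate` and `cosine_similarity` and the random toy data are assumptions made purely for illustration.

```python
import math
import random


def aggregate(feature_set):
    """Average-pool a set of equal-length feature vectors into one vector.

    Averaging ignores element order, so the result is the same (up to
    floating-point rounding) for any permutation of the set.
    """
    n = len(feature_set)
    dim = len(feature_set[0])
    return [sum(vec[d] for vec in feature_set) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data (hypothetical): two sets of 8-dimensional features with
# different cardinalities, plus a shuffled copy of the first set.
rng = random.Random(0)
set_a = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
shuffled_a = set_a[:]
rng.shuffle(shuffled_a)
set_b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(7)]

# Each set, whatever its size or ordering, maps to one 8-dimensional vector.
emb_a = aggregate(set_a)
emb_shuffled = aggregate(shuffled_a)
emb_b = aggregate(set_b)

# Set-to-set comparison reduces to vector-to-vector comparison.
sim_same = cosine_similarity(emb_a, emb_shuffled)
sim_diff = cosine_similarity(emb_a, emb_b)
```

Here `sim_same` is essentially 1 because shuffling a set does not change its pooled embedding, while `sim_diff` compares two unrelated sets of different sizes. A learned aggregator such as the one the abstract proposes would replace the fixed averaging step with a trained mapping.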
Cite this article
Furuya, T., Ohbuchi, R. Feature set aggregator: unsupervised representation learning of sets for their comparison. Multimed Tools Appl 78, 35157–35178 (2019). https://doi.org/10.1007/s11042-019-08078-y
DOI: https://doi.org/10.1007/s11042-019-08078-y