DOI: 10.1145/2964284.2964307
research-article

Shorter-is-Better: Venue Category Estimation from Micro-Video

Published: 01 October 2016

Abstract

According to our statistics on over 2 million micro-videos, only 1.22% of them are associated with venue information, which greatly hinders location-oriented applications and personalized services. To alleviate this problem, we aim to label these bite-sized video clips with venue categories. This is, however, nontrivial for three reasons: 1) no benchmark dataset is available; 2) micro-videos suffer from insufficient information, low quality, and information loss; and 3) the relatedness among venue categories is complex. To this end, we propose a scheme comprising two components. In the first, we crawl a representative set of micro-videos from Vine and extract a rich set of features from the textual, visual, and acoustic modalities. In the second, we build a tree-guided multi-task multi-modal learning model to estimate the venue category of each unseen micro-video. This model jointly learns a common space from the multiple modalities and leverages the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories. Extensive experiments validate our model. As a side contribution, we have released our data, code, and the parameters involved.
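
The hierarchical regularization mentioned in the abstract can be pictured with a tree-guided group-lasso-style penalty, in which every node of the category tree groups the tasks (venue categories) beneath it. The following is a minimal sketch under stated assumptions, not the authors' released code: the function name `tree_guided_penalty`, the toy two-level hierarchy, and the unit node weights are all hypothetical, and the paper's actual objective may differ.

```python
import numpy as np

def tree_guided_penalty(W, groups, weights):
    """Sum over tree nodes of weighted, per-feature L2 norms.

    W       : (n_features, n_tasks) coefficient matrix, one column per category.
    groups  : list of task-index lists, one per tree node (hypothetical layout).
    weights : one non-negative weight per node (assumed to be given).
    """
    total = 0.0
    for task_idx, node_weight in zip(groups, weights):
        # For each feature, take the L2 norm of its coefficients over the
        # tasks under this node, then sum those norms across features.
        total += node_weight * np.linalg.norm(W[:, task_idx], axis=1).sum()
    return total

# Toy hierarchy (an assumption): 4 leaf categories under 2 parent categories.
groups = [[0], [1], [2], [3], [0, 1], [2, 3]]
weights = [1.0] * len(groups)

# One active feature shared by the two categories under the first parent.
W = np.zeros((3, 4))
W[0] = [3.0, 4.0, 0.0, 0.0]

penalty = tree_guided_penalty(W, groups, weights)
print(penalty)
```

In a penalty of this shape, categories that share a parent are encouraged to select the same features, which is one plausible way a predefined hierarchy can regularize relatedness among categories.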

    Published In

    MM '16: Proceedings of the 24th ACM international conference on Multimedia
    October 2016
    1542 pages
    ISBN:9781450336031
    DOI:10.1145/2964284
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. micro-video analysis
    2. multi-modal multi-task learning
    3. venue category estimation

    Qualifiers

    • Research-article

    Funding Sources

    • CUC Engineering Project
    • Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative
    • National Key Technology Research and Development Program of the Ministry of Science and Technology of China
    • China Scholarship Council

    Conference

    MM '16
    Sponsor:
    MM '16: ACM Multimedia Conference
    October 15 - 19, 2016
    Amsterdam, The Netherlands

    Acceptance Rates

    MM '16 Paper Acceptance Rate 52 of 237 submissions, 22%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 28
    • Downloads (Last 6 weeks): 3
    Reflects downloads up to 07 Mar 2025

    Cited By

    • (2024) Query-Oriented Micro-Video Summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:6 (4174-4187). DOI: 10.1109/TPAMI.2024.3355402. Online publication date: Jun-2024
    • (2024) SADCMF: Self-Attentive Deep Consistent Matrix Factorization for Micro-Video Multi-Label Classification. IEEE Transactions on Multimedia 26 (10331-10341). DOI: 10.1109/TMM.2024.3406196. Online publication date: 2024
    • (2024) Dual-Domain Aligned Deep Hierarchical Matrix Factorization Method for Micro-Video Multi-Label Classification. IEEE Transactions on Multimedia 26 (2598-2607). DOI: 10.1109/TMM.2023.3301224. Online publication date: 2024
    • (2024) Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations. IEEE Transactions on Circuits and Systems for Video Technology 34:7 (5440-5451). DOI: 10.1109/TCSVT.2023.3349202. Online publication date: Jul-2024
    • (2024) Deep Matrix Factorization With Complementary Semantic Aggregation for Micro-Video Multi-Label Classification. IEEE Signal Processing Letters 31 (1685-1689). DOI: 10.1109/LSP.2023.3340097. Online publication date: 2024
    • (2024) Multimodal semantic enhanced representation network for micro-video event detection. Knowledge-Based Systems 301 (112255). DOI: 10.1016/j.knosys.2024.112255. Online publication date: Oct-2024
    • (2024) Multimodal deep hierarchical semantic-aligned matrix factorization method for micro-video multi-label classification. Information Processing & Management 61:5 (103798). DOI: 10.1016/j.ipm.2024.103798. Online publication date: Sep-2024
    • (2024) Demsasa: micro-video scene classification based on denoising multi-shots association self-attention. Pattern Analysis and Applications 27:4. DOI: 10.1007/s10044-024-01378-6. Online publication date: 29-Nov-2024
    • (2024) Context-aware focal alignment network for micro-video multi-label classification. Pattern Analysis & Applications 27:4. DOI: 10.1007/s10044-024-01376-8. Online publication date: 14-Nov-2024
    • (2024) A deep low-rank semantic factorization method for micro-video multi-label classification. Multimedia Systems 30:4. DOI: 10.1007/s00530-024-01428-3. Online publication date: 5-Aug-2024
