research-article

Shorter-is-Better: Venue Category Estimation from Micro-Video

Authors:

Jianglong Zhang,

Xianglin Huang,

Tat Seng Chua

MM '16: Proceedings of the 24th ACM international conference on Multimedia

Pages 1415 - 1424

https://doi.org/10.1145/2964284.2964307

Published: 01 October 2016

Abstract

According to our statistics on over 2 million micro-videos, only 1.22% of them are associated with venue information, which greatly hinders location-oriented applications and personalized services. To alleviate this problem, we aim to label bite-sized video clips with venue categories. This is, however, nontrivial for three reasons: 1) no benchmark dataset is available; 2) micro-videos suffer from insufficient information, low quality, and information loss; and 3) venue categories are related in complex ways. To this end, we propose a scheme comprising two components. In the first, we crawl a representative set of micro-videos from Vine and extract a rich set of features from the textual, visual, and acoustic modalities. In the second, we build a tree-guided multi-task multi-modal learning model to estimate the venue category of each unseen micro-video. This model jointly learns a common space from the multiple modalities and leverages the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories. Extensive experiments validate our model. As a side contribution, we have released our data, code, and parameters.
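
The abstract does not state the model's objective, so the following is only a minimal sketch of how a category hierarchy can regularize related tasks. It assumes a standard tree-guided group lasso penalty, in which every node of the hierarchy groups the venue categories beneath it and the penalty sums the weighted L2 norms of each feature's weights over each group. The function name, the toy hierarchy, the node weights, and the weight matrix are all hypothetical and are not taken from the paper.

```python
import math


def tree_group_penalty(W, groups):
    """Tree-guided group lasso penalty (illustrative assumption, not the paper's exact objective).

    W      : list of rows, one per feature; each row holds one weight per venue category.
    groups : list of (node_weight, category_indices), one entry per tree node.
    Returns the sum over nodes and features of node_weight * L2 norm of the
    feature's weights restricted to the categories under that node.
    """
    total = 0.0
    for node_weight, members in groups:
        for row in W:
            total += node_weight * math.sqrt(sum(row[k] ** 2 for k in members))
    return total


# Hypothetical toy hierarchy: root -> {Food -> {cafe, bar}, park}
CATEGORIES = ["cafe", "bar", "park"]
GROUPS = [
    (1.0, [0]),          # leaf: cafe
    (1.0, [1]),          # leaf: bar
    (1.0, [2]),          # leaf: park
    (0.5, [0, 1]),       # internal node: Food
    (0.25, [0, 1, 2]),   # root
]

# Hypothetical weights: 2 features x 3 categories
W = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
]

if __name__ == "__main__":
    print(tree_group_penalty(W, GROUPS))
```

Because sibling categories such as the hypothetical "cafe" and "bar" share an internal node, a feature that is useful for one is cheaper to use for the other, which is the intuition behind letting a hierarchy encode task relatedness.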

Index Terms

Information systems → Information systems applications → Multimedia information systems

Information & Contributors

Information

Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia

October 2016

1542 pages

ISBN:9781450336031

DOI:10.1145/2964284

General Chairs: Alan Hanjalic (Delft University of Technology), Cees Snoek (Qualcomm Research Netherlands / University of Amsterdam), Marcel Worring (University of Amsterdam)

Moderator: Dick Bulterman (CWI / VU University Amsterdam)

Program Chairs: Benoit Huet (EURECOM), Aisling Kelliher (Virginia Tech), Yiannis Kompatsiaris (CERTH-ITI), Jin Li (Microsoft)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

CUC Engineering Project
Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative
National Key Technology Research and Development Program of the Ministry of Science and Technology of China
China Scholarship Council

Conference

MM '16

Sponsor:

SIGMM

MM '16: ACM Multimedia Conference

October 15 - 19, 2016

Amsterdam, The Netherlands

Acceptance Rates

MM '16 Paper Acceptance Rate 52 of 237 submissions, 22%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
