Abstract
Automatic head pose estimation from real-world video sequences is of great interest to the computer vision community, since pose provides prior knowledge for tasks such as face detection and classification. However, developing pose estimation algorithms requires large, labeled, real-world video databases on which computer vision systems can be trained and tested. Manual labeling of each frame is tedious, time-consuming, and often difficult due to the high uncertainty in head pose angle estimates, particularly in unconstrained environments that include arbitrary facial expressions, occlusions, and illumination. To overcome these difficulties, a semi-automatic framework is proposed for labeling temporal head pose in real-world video sequences. The proposed multi-stage labeling framework first detects a subset of frames with distinct head poses over a video sequence, which an expert then labels manually to obtain the ground truth for those frames. The framework provides a continuous head pose label and a corresponding confidence value over the pose angles. An interpolation scheme over the video sequence then estimates i) labels for the frames without manual labels and ii) corresponding confidence values for the interpolated labels. These confidence values permit an automatic head pose estimation framework to determine the subset of frames to use for further processing, depending on the labeling accuracy required. Experiments performed on an in-house, labeled, large, real-world face video database (which will be made publicly available) show that the proposed framework achieves 96.98% labeling accuracy when manual labeling is performed on only 30% of the video frames.
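To make the pipeline concrete, below is a minimal Python sketch of keyframe-based label interpolation with per-frame confidence. This is not the paper's actual multi-stage scheme: the linear interpolant, the exponential confidence decay, and all names (`interpolate_pose_labels`, `tau`) are illustrative assumptions, shown only to clarify how sparse manual labels can yield a dense labeling plus confidences.

```python
import numpy as np

def interpolate_pose_labels(labeled, n_frames, tau=15.0):
    """Interpolate sparse manual head pose labels over a video sequence.

    labeled  -- dict mapping frame index -> manually labeled yaw angle (degrees)
    n_frames -- total number of frames in the sequence
    tau      -- decay constant (in frames) controlling how quickly confidence
                falls off with distance from the nearest labeled frame

    Returns (poses, confidences), one value per frame.
    """
    key_frames = np.array(sorted(labeled))
    key_poses = np.array([labeled[f] for f in key_frames])
    frames = np.arange(n_frames)

    # Linear interpolation between the manually labeled key frames;
    # frames outside the labeled range are clamped to the nearest label.
    poses = np.interp(frames, key_frames, key_poses)

    # Confidence decays exponentially with temporal distance to the
    # nearest manually labeled frame (1.0 exactly at labeled frames).
    dist = np.min(np.abs(frames[:, None] - key_frames[None, :]), axis=1)
    confidences = np.exp(-dist / tau)
    return poses, confidences

# Example: manual yaw labels on 3 of 100 frames.
poses, conf = interpolate_pose_labels({0: -30.0, 40: 10.0, 99: 45.0}, 100)
# A downstream system keeps only frames labeled with sufficient confidence.
keep = np.where(conf >= 0.8)[0]
```

The thresholding in the last line mirrors how a downstream pose estimation system could select only reliably labeled frames, trading frame coverage against labeling accuracy as described in the abstract.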
*Figures 1–18 accompany the online version of this article (see the DOI below).*
Cite this article
Demirkus, M., Clark, J.J. & Arbel, T. Robust semi-automatic head pose labeling for real-world face video sequences. Multimed Tools Appl 70, 495–523 (2014). https://doi.org/10.1007/s11042-012-1352-1