Elsevier

Pattern Recognition

Volume 45, Issue 7, July 2012, Pages 2489-2498

Video fingerprinting using Latent Dirichlet Allocation and facial images

https://doi.org/10.1016/j.patcog.2011.12.022

Abstract

This paper investigates the possibility of extracting latent aspects of a video in order to develop a video fingerprinting framework. Semantic visual information about humans, more specifically face occurrences in video frames, is used for this purpose, along with a generative probabilistic model, namely Latent Dirichlet Allocation (LDA). The latent variables, namely the video topics, are modeled as a mixture of distributions of faces in each video. The method also involves a clustering approach based on the Scale Invariant Feature Transform (SIFT) for clustering the detected faces and adapts the bag-of-words concept into a bag-of-faces one, in order to ensure exchangeability between topic distributions. Experimental results on three different data sets yield low misclassification rates of the order of 2% and false rejection rates of 0%. These rates provide evidence that the proposed method performs very efficiently for video fingerprinting.

Highlights

  • Latent Dirichlet Allocation for Perceptual Hashing.

  • Scale Invariant Feature Transform (SIFT) Based Face Clustering.

  • Faceword definition for video content.

Introduction

Video fingerprinting [1], also known as content-based copy detection, robust perceptual hashing [2], or near-replica detection [3], refers to methods that try to identify whether a given video is a replica, or a near replica, of one of the videos existing in a video database. The need for efficient video fingerprinting algorithms stems from the enormous amount of video content and the scale of illegal video copying and distribution. Video sharing web sites such as YouTube need such algorithms in order to automatically check the intellectual property rights of videos uploaded to their database. Video fingerprinting is used in many applications, such as copyright protection, multimedia database management and broadcast monitoring. Most methods calculate a feature vector (i.e. the video fingerprint or perceptual hash) for the video in question and compare it, by means of a similarity or distance function, against a set of vectors stored in a database.
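The lookup step just described can be sketched as a nearest-neighbour search under a distance function. The sketch below is illustrative only: the Hamming distance, the acceptance threshold and all names are assumptions, not the scheme of any particular cited method.

```python
def hamming_distance(a, b):
    """Number of differing positions between two binary fingerprints."""
    return sum(x != y for x, y in zip(a, b))

def identify(query_fp, database, threshold):
    """Return the id of the closest stored fingerprint, or None when no
    stored fingerprint lies within the acceptance threshold (i.e. the
    query video is not a replica of anything in the database)."""
    best_id, best_dist = None, None
    for video_id, fp in database.items():
        d = hamming_distance(query_fp, fp)
        if best_dist is None or d < best_dist:
            best_id, best_dist = video_id, d
    return best_id if best_dist is not None and best_dist <= threshold else None

# Toy database of two (hypothetical) binary fingerprints.
db = {"v1": [0, 1, 1, 0], "v2": [1, 1, 0, 0]}
print(identify([0, 1, 1, 1], db, threshold=1))  # -> v1
```

The threshold trades false acceptances against false rejections; real systems choose it from a validation set of attacked copies.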

Many methods exist for image perceptual hashing. In [4], Monga et al. use non-negative matrix factorization and random projections to create a perceptual hash. In [5], a detailed theoretical study of non-negative matrix factorization for perceptual hashing is reported. In [6], a method based on virtual watermark detection is used for perceptual hashing, under the assumption that the virtual watermark detector responds similarly to images of similar content. Finally, in [7], the Radon transform is used to create the image perceptual hashes.

However, the extension of image fingerprinting methods to video data (e.g. on a frame-by-frame basis) is neither straightforward nor efficient, due to the temporal dimension. A limited number of video fingerprinting or replica detection techniques have been proposed in the literature so far. In [8], Lee et al. use a novel binary fingerprint obtained through a feature selection algorithm called symmetric pairwise boosting. The binary fingerprints are obtained by filtering and quantizing perceptually significant features extracted from an input video clip. Oostveen et al. have proposed a spatio-temporal fingerprint based on the luminance difference in spatio-temporal blocks [9]. Coskun et al. have proposed two robust video hashing algorithms for copy identification that are based on the Discrete Cosine Transform (DCT) [10]. Hampapur and Bolle compare various global video descriptors based on motion, color and spatio-temporal intensity distribution [11]. Law-To et al. propose a technique for video copy tracking based on labels of local descriptor behavior computed along the video [12]. Their aim was to distinguish copies within a collection of highly similar videos, as well as to link similar videos, in order to reduce redundancy in video collections or gather the associated metadata. Changick and Vasudev have proposed a copy detection scheme where each video frame is partitioned into 2×2 blocks by intensity averaging [13]. Their spatio-temporal approach combines spatial matching of ordinal signatures obtained from the partitions of each video frame and temporal matching of signatures from the temporal partition trails. Lee and Yoo present a video fingerprinting method based on 2-D Oriented Principal Component Analysis (2D-OPCA) of affine covariant regions [14].
According to this method, in order to achieve robustness against geometric transformations, the fingerprints are extracted from local regions that are covariant with a class of affine transformations. For reliable local fingerprint matching, only spatio-temporally consistent matches are taken into account. Finally, in [15], the same authors propose a novel video fingerprinting method based on the centroid of gradient orientations in a pixel neighborhood. By these means, they achieve robustness against common video processing operations such as frame rate change, lossy compression and others.

Latent Dirichlet Allocation (LDA) is a generative probabilistic model introduced in [16]. It is a powerful method for capturing the statistical properties of a collection of conditionally independent and identically distributed random variables. The main idea behind LDA is that such a set of random variables can be represented by a mixture of probability distributions, a result known as the de Finetti theorem [17]. This approach has already been applied to text modeling with good results [18]. It has been proven that LDA performs better than the pLSI (probabilistic Latent Semantic Indexing) algorithm in the context of text modeling [16]. Moreover, this framework has recently been used in the context of image processing [19], [20], [21], [22], [23], [24], [25], where LDA is employed for image classification, image retrieval and other image analysis tasks. LDA has also been used in video analysis and description [26], [27], [28], [29], [30] for video summarization, scene categorization and other tasks.

In this paper, we investigate the possibility of extracting latent aspects of a video in order to develop a video fingerprinting framework. Semantic visual information about humans, more specifically face occurrences in video frames, is used for this purpose, along with a generative probabilistic model (LDA). The latent variables, namely the video topics, are modeled as a mixture of distributions of faces in each video. The method also involves a clustering approach based on the Scale Invariant Feature Transform (SIFT) for clustering the detected faces and adapts the bag-of-words concept into a bag-of-faces one, in order to ensure exchangeability between topic distributions.

The novelty of this paper lies mainly in the use of latent aspects of the video content, aiming at extracting the underlying video topics and using them in video fingerprinting. In more detail, this paper includes the following novelties:

  • The use of face occurrences in a video, to be called “facewords”, for describing this video. This framework can be easily extended to cases without humans, since “facewords” can be replaced by animals, or even by objects and scene artifacts, provided that an appropriate detector is used.

  • The use of latent semantic analysis for video fingerprinting. Although many papers use probabilistic latent semantic analysis (pLSA) for a number of image and video processing tasks, only recent publications such as [19], [28] have utilized the LDA algorithm. To the best of our knowledge, none has used this framework for video fingerprinting.

  • The face clustering technique, which makes use of SIFT features evaluated on facial images.

The paper is organized as follows: in Section 2, an outline of the proposed framework is presented, along with a step-by-step description of how its main parts integrate. In Section 3, we introduce the face detection, facial feature extraction and face clustering methodology, the creation of the universal vocabulary and a procedure for augmenting the video database with new videos. Section 4 reviews the LDA framework. In Section 5, we explain the training phase, as well as the query mechanism, of the proposed fingerprinting framework. Experimental results and complexity analysis are presented in Section 6. Finally, conclusions are drawn in Section 7.


The video fingerprinting framework outline

This section briefly describes the various parts of the proposed framework, which can be partitioned into three sequential video analysis tasks (i.e. face detection, face clustering and latent semantics extraction from video). Each of these tasks operates in both the training and the testing phase of the framework, in order to extract a feature vector (i.e. the video fingerprint). A schematic representation of the framework is illustrated in Fig. 1.
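The three sequential tasks can be composed into a single fingerprint-extraction pipeline. The sketch below is a simplified illustration under stated assumptions: the function names, the callable stand-ins for the detector and the cluster assignment, and the normalised faceword histogram as the intermediate representation are all ours, not an implementation from the paper.

```python
def extract_fingerprint(video_frames, detect_faces, assign_faceword, n_words):
    """Pipeline sketch: detect faces in each frame, map each detected face
    to a faceword (a face-cluster id in 0..n_words-1), and summarise the
    video as a normalised faceword histogram, from which latent topics
    would later be inferred."""
    counts = [0] * n_words
    for frame in video_frames:
        for face in detect_faces(frame):
            counts[assign_faceword(face)] += 1
    total = sum(counts) or 1  # avoid division by zero for face-free videos
    return [c / total for c in counts]

# Toy stand-ins for the detector and the faceword assignment.
frames = [["faceA", "faceB"], ["faceA"]]
fp = extract_fingerprint(
    frames,
    detect_faces=lambda frame: frame,
    assign_faceword=lambda face: 0 if face == "faceA" else 1,
    n_words=3,
)
```

In the actual framework the detector and the faceword assignment are, respectively, the face detection and SIFT-based clustering stages described in Section 3.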

In the training phase, the video database is

Face detection, face clustering and data organization

This section outlines the facewords used to characterize a video and details the proposed framework for video fingerprinting. As already mentioned, two steps are undertaken for each video:

(a) Face detection. Faces are important semantic features for movies and humans often recognize a movie based on the actors that appear therein. Thus, we use actors' facial images as the basis of our video fingerprinting framework. These facial images can be interpreted as facewords.

(b) Face
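The clustering step groups face descriptors so that occurrences of the same actor map to the same faceword. As an illustrative sketch only, the minimal k-means below clusters toy descriptor vectors; in the actual framework the descriptors would be SIFT features computed on the detected facial images (e.g. via OpenCV's `cv2.SIFT_create`), and the paper's clustering procedure may differ from plain k-means.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means, used here to group face descriptor vectors into
    k clusters (facewords). Returns a cluster label per input point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centres from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centre if a cluster empties out
                centers[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return [min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points]

# Toy "descriptors": two faces of one actor, two of another.
labels = kmeans([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2)
```

Each resulting cluster id serves as one faceword in the universal vocabulary.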

Latent Dirichlet Allocation

Many latent semantic analysis approaches have been proposed so far for multimedia analysis [36]. Latent Dirichlet Allocation (LDA) [16] is a recently proposed approach within this framework that has produced good analysis and modeling results. In our case, we aim to use LDA to reveal the latent aspects of a video, based on actors' appearances. As already briefly explained, LDA is a framework that has, until now, been used mainly in text retrieval and mining. LDA uses the following structures:

  1. A finite universal
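Under the bag-of-faces representation, each video reduces to counts of facewords drawn from the universal vocabulary, discarding the order of occurrence; this exchangeability is exactly what LDA requires. The sketch below illustrates the representation (the function name and toy ids are ours).

```python
from collections import Counter

def bag_of_faces(faceword_ids, vocab_size):
    """Represent a video as a vector of faceword counts over the universal
    vocabulary. The order of face occurrences is discarded, which gives
    the exchangeability that the LDA mixture model assumes."""
    counts = Counter(faceword_ids)
    return [counts.get(w, 0) for w in range(vocab_size)]

# A video whose detected faces were assigned facewords 2, 0, 2, 1, 2:
print(bag_of_faces([2, 0, 2, 1, 2], vocab_size=4))  # -> [1, 1, 3, 0]
```

Stacking these vectors over all videos yields the faceword-by-video matrix used in training.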

Training through variational inference

In order to train our model, we introduce into the LDA algorithm the histograms of facewords for each video, which are produced by the face detection/clustering procedure. The faceword-by-video matrix created by this process is considered as the first estimate of the matrix β in the LDA framework. Training the model in fact involves solving the inference problem of computing the posterior distribution of the hidden vectors θ and z, given a video v and the Dirichlet parameters α and β: p(θ, z | v, α, β) = p(θ, z, v | α, β) / p(v | α, β).
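The initialisation of β described above can be sketched as row-normalising the count matrix into probability distributions over facewords. This is a simplified, assumed initialisation for illustration; the subsequent variational updates of the full LDA training loop are not shown.

```python
def initial_beta(count_matrix):
    """First estimate of the topic-faceword matrix beta: normalise each
    row of a faceword count matrix into a probability distribution over
    the facewords (a simplified, assumed initialisation; the variational
    EM iterations that refine beta are omitted)."""
    beta = []
    for row in count_matrix:
        total = sum(row) or 1  # guard against an all-zero row
        beta.append([c / total for c in row])
    return beta

# Two rows of faceword counts become two distributions over facewords.
beta = initial_beta([[2, 2], [0, 4]])
```

Each row of β then sums to one, as required for a distribution over facewords.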

Video databases

The performance of the method has been evaluated on three video data sets, namely Video Clips (VC), Movies (M) and the MUSCLE-VCD-2007 database (MVCD). The VC data set includes short, low-quality videos, randomly collected over the Internet (YouTube). It consists of 332 videos, each 2–5 min long (approximately 4000–7000 frames per video clip). The M data set consists of eight high-quality, full-length movies of approximately 2 h each (approximately 150,000 video frames). The small number of movies

Discussion, conclusions and future work

In this work, a new framework for video fingerprinting has been presented. The intuition behind this work is that actor instances (i.e. mapped to facewords) carry a significant amount of information and can be used to capture very distinctive video features, thus characterizing each video uniquely. In this context, by applying a generative probabilistic model, namely Latent Dirichlet Allocation, we aim at discovering latent aspects of a video (video topics), based on the semantic


References (42)

  • J. Seo et al.

    A robust image fingerprinting system using the radon transform

    Signal Processing: Image Communication

    (2004)
  • D. Kundur et al.

    Video fingerprinting and encryption principles for digital rights management

    Proceedings of the IEEE

    (2004)
  • C. De Roover et al.

    Robust video hashing based on radial projections of key frames

    IEEE Transactions on Signal Processing

    (2005)
  • A. Kołcz et al.

    Improved robustness of signature-based near-replica detection via lexicon randomization

  • V. Monga et al.

    Robust and secure image hashing via non-negative matrix factorizations

    IEEE Transactions on Information Forensics and Security

    (2007)
  • F. Khelifi et al.

    Analysis of the security of perceptual image hashing based on non-negative matrix factorization

    IEEE Signal Processing Letters

    (2010)
  • F. Khelifi et al.

    Perceptual image hashing based on virtual watermark detection

    IEEE Transactions on Image Processing

    (2010)
  • S. Lee et al.

    Robust video fingerprinting based on symmetric pairwise boosting

    IEEE Transactions on Circuits and Systems for Video Technology

    (2009)
  • J. Oostveen et al.

    Feature extraction and a database strategy for video fingerprinting

  • B. Coskun et al.

    Spatio-temporal transform based video hashing

    IEEE Transactions on Multimedia

    (2006)
  • A. Hampapur et al.

    Comparison of sequence matching techniques for video copy detection

  • J. Law-To et al.

    Video copy detection on the internet: the challenges of copyright and multiplicity

  • K. Changick et al.

    Spatiotemporal sequence matching for efficient video copy detection

    IEEE Transactions on Circuits and Systems for Video Technology

    (2005)
  • S. Lee et al.

    Robust video fingerprinting based on 2D-OPCA of affine covariant regions

  • S. Lee et al.

    Robust video fingerprinting for content-based video identification

    IEEE Transactions on Circuits and Systems for Video Technology

    (2008)
  • D. Blei et al.

    Latent Dirichlet Allocation

    The Journal of Machine Learning Research

    (2003)
  • B. de Finetti, Theory of Probability, volume I, Bulletin of the American Mathematical Society, vol. 2...
  • X. Wei et al.

    LDA-based document models for ad-hoc retrieval

  • L. Fei-Fei et al.

    A Bayesian hierarchical model for learning natural scene categories

  • F. Monay et al.

    On image auto-annotation with latent space models

  • J. Sivic et al.

    Discovering object categories in image collections


    Vretos Nicholas graduated from the Department of Informatics of The University Pierre et Marie Curie in Paris (Paris VI) in 2002. He is currently a researcher and studies towards the Ph.D. degree at the Department of Informatics, in the Artificial Intelligence Information Analysis (AIIA) laboratory, at the Aristotle University of Thessaloniki. He has published more than 10 conference papers and Book Chapters. His research interests include digital signal processing, face detection/recognition, object tracking, image and video semantic content analysis, 3D Face Recognition, 3D Facial Expressions Recognition and Video Fingerprinting.

    Nikos Nikolaidis received the Diploma of Electrical Engineering and the Ph.D. degree in electrical engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1991 and 1997, respectively. From 1992 to 1996, he was Teaching Assistant at the Departments of Electrical Engineering and Informatics at the Aristotle University of Thessaloniki. From 1998 to 2002, he was a Postdoctoral Researcher and Teaching Assistant at the Department of Informatics, Aristotle University of Thessaloniki, where he is currently an Assistant Professor. He is the co-author of the book 3-D Image Processing Algorithms (New York: Wiley, 2000). He has co-authored 11 book chapters, 29 journal papers, and 90 conference papers. His research interests include computer graphics, image and video processing and analysis, computer vision, copyright protection of multimedia, and 3-D image processing. Dr. Nikolaidis currently serves as Associate Editor for the International Journal of Innovative Computing Information and Control, the International Journal of Innovative Computing Information and Control Express Letters and the EURASIP Journal on Image and Video Processing.

    Ioannis Pitas received the Diploma of Electrical Engineering in 1980 and the Ph.D. degree in Electrical Engineering in 1985 both from the Aristotle University of Thessaloniki, Greece. Since 1994, he has been a Professor at the Department of Informatics, Aristotle University of Thessaloniki. From 1980 to 1993 he served as Scientific Assistant, Lecturer, Assistant Professor, and Associate Professor in the Department of Electrical and Computer Engineering at the same University. He served as a Visiting Research Associate or Visiting Assistant Professor at several Universities. He has published over 607 papers and contributed in 27 books in his areas of interest and edited or co-authored another 7. He has also been an invited speaker and/or member of the program committee of several scientific conferences and workshops. In the past he served as Associate Editor or co-Editor of four international journals and General or Technical Chair of three international conferences. His current interests are in the areas of digital image and video processing and analysis, multidimensional signal processing, watermarking and computer vision.
