Exploring heuristic and optimum branching algorithms for image phylogeny

https://doi.org/10.1016/j.jvcir.2013.07.011

Highlights

  • We explore one heuristic algorithm for finding the evolution history of images.

  • We explore one optimum branching algorithm for the same task.

  • Applications are tracking image broadcasting or the chain of image distribution.

  • 350,000 test cases show the effectiveness of the proposed methods.

  • Our solution finds the ancestry connections in a set and the original document.

Abstract

Currently, multimedia objects can be easily created, stored, (re)transmitted, and edited, for good or bad. Consequently, there has been increasing interest in finding the structure of temporal evolution within a set of documents and how documents are related to one another over time. This process, known in the literature as Multimedia Phylogeny, aims at finding the phylogeny tree(s) that best explain the creation process of a set of near-duplicate documents (e.g., images/videos) and their ancestry relationships. Solutions to this problem have direct applications in forensics, security, copyright enforcement, news tracking services, and other areas. In this paper, we explore one heuristic and one optimum branching algorithm for reconstructing the evolutionary tree associated with a set of image documents. This can help experts track, for instance, the source of child pornography image broadcasting or the chain of image distribution over time. We compare the algorithms with the state-of-the-art solution on 350,000 test cases and discuss the advantages and disadvantages of each in a real scenario.

Introduction

Nowadays, popular images and videos spread rapidly on the web through blogs, news sites, and social media, a phenomenon captured by the popular expression “to go viral”. It is straightforward to find exact duplicates among available media, but many types of media objects, such as images and videos, can undergo small modifications during redistribution that change them without interfering with their semantic meaning. Some examples include A/D or D/A conversions, (de)coding, transmission noise, and small edits or corrections such as brightness adjustments and cropping. The resulting objects are called near-duplicates and are the subject of an active research area [1], [2], [3], [4], [5].

While most of these changes are natural and not necessarily harmful, sometimes the distribution itself may cause copyright infringement or even be a criminal action [6]. In some situations, the spreading pattern of an image or video can help companies to understand demographics and effectiveness of an ad campaign or a product.

These scenarios motivated the advent of a new research subfield called Multimedia Phylogeny [7], [8], [9], [10], [11], with the objective of investigating the history and evolutionary process of digital objects.

Multimedia Phylogeny goes beyond the detection of near-duplicate objects, which only seeks to determine whether the objects in a set are near-duplicates of one another. In Multimedia Phylogeny, we are interested in finding the structure of modifications of a set of multimedia objects, including their causal and ancestry relationships and the source of modifications (root or patient zero) of a set of related documents, thereby reconstructing the order and the transformations that originally created the near-duplicate set [7], [8], [9], [10], [11]. Fig. 1 illustrates one case for images. The underlying question in Multimedia Phylogeny is how to find the relationships between each pair of images in order to point out which document generated another one across time and which transformation parameters were used.

Solutions to this problem have applications in different areas, such as the ones listed below:

  • Security: the modification graph of a set of documents might provide information about suspects’ behavior and point out the directions of online content distribution.

  • Forensics: better results might be achieved if the forensic analysis is performed on the original document instead of on a near-duplicate [6], [12]. In addition, forensic experts might focus their attention on individuals associated with the redistribution of content closer to the root of the tree, as they are more likely to be the ones who created the content in the first place.

  • Copyright enforcement: traitor tracing without the requirement of active source control solutions such as the ones based on either watermarking or fingerprinting approaches.

  • News tracking services: Multimedia Phylogeny might be a valuable tool for mining applications. The near-duplicate relationships can feed news tracking services with key elements for determining the opinion formation process across time and space [7], [8].

  • Information retrieval: content-based retrieval systems might be developed to allow the display of similar photographs coming from different photographers without any metadata analysis.

Current approaches reconstruct the phylogeny tree in two steps. The first consists of calculating a dissimilarity matrix whose entries compare every pair of documents (e.g., images) in a set. In the second step, this matrix is used to reconstruct the tree. In this sense, Rosa et al. [8] decide, for every pair of images, what the right direction should be, while Dias et al. [9], [11] use a global method to construct a minimum spanning tree.
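To make the second step concrete, here is a minimal sketch of a Kruskal-style global construction over a dissimilarity matrix. It is an illustration under assumptions, not the exact procedure of [9], [11]: the convention that M[i][j] is the cost of image i being the parent of image j, the function name oriented_kruskal, and the toy matrix are all hypothetical.

```python
def oriented_kruskal(M):
    """Build a rooted tree from an asymmetric dissimilarity matrix.

    Assumption: M[i][j] is the cost of image i being the parent of image j.
    Returns a parent list where parent[r] == r marks the root r.
    """
    n = len(M)
    parent = list(range(n))  # every image starts as the root of its own tree

    def root(v):
        # Follow parent pointers up to the root of v's current tree.
        while parent[v] != v:
            v = parent[v]
        return v

    # Consider all directed pairs (i -> j), cheapest first.
    edges = sorted((M[i][j], i, j) for i in range(n) for j in range(n) if i != j)
    added = 0
    for cost, i, j in edges:
        if added == n - 1:
            break
        # Accept i -> j only if j has no parent yet and i is in another tree,
        # so the result stays a forest of rooted trees with no cycles.
        if parent[j] == j and root(i) != j:
            parent[j] = i
            added += 1
    return parent


# Toy matrix (hypothetical values): image 0 generated 1 and 2, image 1 generated 3.
M = [
    [0, 1, 2, 6],
    [5, 0, 7, 3],
    [8, 9, 0, 10],
    [11, 12, 13, 0],
]
print(oriented_kruskal(M))  # [0, 0, 0, 1]
```

The design choice worth noting is that the matrix is asymmetric, so the direction of each accepted edge matters: an edge is only allowed to attach a node that is still the root of its own tree.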

Following the same methodology discussed in previous works, in this paper we explore one heuristic and one optimum branching algorithm for reconstructing the evolutionary tree associated with a set of image documents. The heuristic solution is inspired by Prim’s classic minimum spanning tree algorithm [13], while the optimum branching algorithm is based on the classical Chu–Liu, Bock and Edmonds algorithm [14], [15], [16]. We compare both methods to the state-of-the-art Oriented Kruskal algorithm [11] using the same dissimilarity matrices for image near-duplicate sets and discuss the pros and cons of each approach.
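As an illustration of what a Prim-inspired heuristic on a directed dissimilarity matrix could look like, the sketch below grows a tree from a candidate root by repeatedly attaching the cheapest outside node, then keeps the root that gives the lowest total cost. This is one plausible reading, not the paper's exact procedure: the names oriented_prim and best_prim, the try-every-root strategy, and the convention that M[i][j] is the cost of i being the parent of j are assumptions.

```python
def oriented_prim(M, root):
    """Grow a directed tree from `root`, always taking the cheapest edge
    that goes from a node already in the tree to a node outside it.

    Assumption: M[i][j] is the cost of image i being the parent of image j.
    Returns (parent list, total cost); parent[root] == root.
    """
    n = len(M)
    parent = [None] * n
    parent[root] = root
    in_tree = {root}
    total = 0
    while len(in_tree) < n:
        cost, i, j = min(
            (M[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        parent[j] = i
        in_tree.add(j)
        total += cost
    return parent, total


def best_prim(M):
    """Try every image as the root and keep the cheapest resulting tree."""
    return min((oriented_prim(M, r) for r in range(len(M))), key=lambda t: t[1])


# Toy matrix (hypothetical values).
M = [
    [0, 1, 2, 6],
    [5, 0, 7, 3],
    [8, 9, 0, 10],
    [11, 12, 13, 0],
]
print(best_prim(M))  # ([0, 0, 0, 1], 6)
```

Because each step is greedy, a sketch like this gives no optimality guarantee on directed graphs, which is why it is a heuristic.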

Related work

In the past decade, we have seen increasing progress in the development of efficient and effective systems to identify the cohabiting versions of a given document in the wild [4], [17], [18]. However, only recently have there been first attempts to go beyond the detection of near-duplicates and identify the structure of relationships within a set of near-duplicates.

During its lifetime, a multimedia object might undergo different processing stages whereby each processing operator might

Review on Multimedia Phylogeny

In this section, we present a formal definition of phylogeny tree and discuss how we can build it from a set of image transformation estimations and a tree building algorithm.

New methods for reconstructing an image phylogeny tree

This paper’s main contribution lies in exploring two methods for reconstructing an image phylogeny tree from a set of near-duplicate images. Both approaches operate upon a set of n near-duplicate images. To build the dissimilarity matrix M, we use the same setup proposed in [11].
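For readers unfamiliar with optimum branchings, the following is a compact sketch of the textbook cycle-contraction scheme usually attributed to Chu–Liu and Edmonds, applied to an n × n matrix M. Several things here are assumptions made for illustration: M[i][j] is taken as the cost of i being the parent of j, the graph is complete, the unknown root is handled by simply trying every root, and all function names are hypothetical. It is not presented as the implementation used in the paper.

```python
def find_cycle(best_in):
    """Return the set of nodes on a cycle of the child -> parent map, or None."""
    for start in best_in:
        path, v = [], start
        while v in best_in and v not in path:
            path.append(v)
            v = best_in[v]
        if v in path:
            return set(path[path.index(v):])
    return None


def min_arborescence(edges, root):
    """edges maps (parent, child) -> cost. Returns a dict child -> parent
    describing a minimum-cost spanning arborescence rooted at `root`."""
    # Step 1: every non-root node picks its cheapest incoming edge.
    best_in = {}
    for (u, v), w in edges.items():
        if v != root and (v not in best_in or w < edges[(best_in[v], v)]):
            best_in[v] = u
    cycle = find_cycle(best_in)
    if cycle is None:
        return best_in  # the greedy choice is already a tree
    # Step 2: contract the cycle into one fresh node c. An edge entering the
    # cycle at v is re-priced by the cycle edge it would replace.
    c = object()
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        if u in cycle and v in cycle:
            continue
        if v in cycle:
            key, w2 = (u, c), w - edges[(best_in[v], v)]
        elif u in cycle:
            key, w2 = (c, v), w
        else:
            key, w2 = (u, v), w
        if key not in new_edges or w2 < new_edges[key]:
            new_edges[key], origin[key] = w2, (u, v)
    # Step 3: solve the contracted problem, then map edges back.
    sub = min_arborescence(new_edges, root)
    result = {}
    for v, u in sub.items():
        orig_u, orig_v = origin[(u, v)]
        result[orig_v] = orig_u
    # Cycle nodes keep their cycle edge, except the one entered from outside.
    for v in cycle:
        result.setdefault(v, best_in[v])
    return result


def optimum_branching(M):
    """Try each node as root; return (parent list, cost) of the cheapest tree."""
    n = len(M)
    edges = {(i, j): M[i][j] for i in range(n) for j in range(n) if i != j}
    best = None
    for r in range(n):
        tree = min_arborescence(edges, r)
        cost = sum(M[p][ch] for ch, p in tree.items())
        if best is None or cost < best[1]:
            best = ([r if v == r else tree[v] for v in range(n)], cost)
    return best


# Toy matrix (hypothetical values); nodes 1 and 2 form a cheap 2-cycle.
M = [
    [0, 5, 6, 2],
    [10, 0, 1, 8],
    [10, 1, 0, 8],
    [10, 4, 7, 0],
]
print(optimum_branching(M))  # ([0, 3, 1, 0], 7)
```

In the toy matrix, nodes 1 and 2 each prefer the other as parent, so the greedy first step creates a cycle; the contraction step is what resolves it by choosing the cheapest way to enter the cycle from outside.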

Experiments and methodology

We follow the methodology introduced by Dias et al. [11] for the validation of the algorithms in this paper.

Conclusions and future work

In this paper, we discussed two algorithms for reconstructing the evolution tree associated with a set of images: a heuristic extension of the classic Prim minimum spanning tree algorithm to directed graphs (Best-Prim) and one optimum branching algorithm (Chu–Liu, Bock and Edmonds). To our knowledge, this is the first time such approaches are discussed for this problem with such a complete validation. In addition, this is the first time an exact algorithm (Chu–Liu, Bock and Edmonds) is

Acknowledgments

This work was partially supported by São Paulo Research Foundation - FAPESP (grant 2010/05647–4), National Council for Scientific and Technological Development - CNPq (grants 307018/2010–5, 304352/2012–8, 306730/2012–0, and 477692/2012–5), Microsoft, and the European Union through the Rewind project. The Rewind project acknowledges the financial support of the Future and Emerging Technologies (FET) program within the Seventh Framework Program for Research of the European Commission (under

References (30)

  • H. Bay et al., Speeded-Up Robust Features (SURF), Comput. Vision Image Understanding (2008)
  • Y. Maret, Efficient Duplicate Detection Based on Image Analysis, Ph.D. Thesis, École Polytechnique Fédérale de...
  • E. Valle, Local-descriptor Matching for Image Identification Systems, Ph.D. Thesis, Université de Cergy-Pontoise,...
  • E. Valle, M. Cord, S. Philipp-Foliguet, High-dimensional descriptor indexing for large multimedia databases, in:...
  • A. Joly et al., Content-based copy retrieval using distortion-based probabilistic similarity search, IEEE Trans. Multimedia (TMM) (2007)
  • H.-s. Kim et al., BASIL: effective near-duplicate image detection using gene sequence alignment
  • A. Rocha et al., Vision of the unseen: current trends and challenges in digital image and video forensics, ACM Comput. Surv. (CSUR) (2011)
  • L. Kennedy et al., Internet image archaeology: automatically tracing the manipulation history of photographs on the web
  • A.D. Rosa et al., Exploring image dependencies: a new challenge in image forensics
  • Z. Dias et al., First steps toward image phylogeny
  • Z. Dias et al., Video phylogeny: Recovering near-duplicate video relationships
  • Z. Dias et al., Image phylogeny by minimal spanning trees, IEEE Trans. Inf. Forensics Secur. (TIFS) (2012)
  • S. Goldenstein, A. Rocha, High-profile forensic analysis of images, in: International Conference on Crime Detection...
  • R.C. Prim, Shortest connection networks and some generalizations, Bell Syst. Tech. J. (1957)
  • Y.J. Chu et al., On the shortest arborescence of a directed graph, Sci. Sin. (1965)