Discovering salient regions on 3D photo-textured maps: Crowdsourcing interaction data from multitouch smartphones and tablets
Introduction
We have developed a smartphone/tablet app for viewing and manipulating 3D models gathered with an Autonomous Underwater Vehicle (AUV). The app is freely available and has been downloaded and used by a large number of users. The question this paper attempts to answer is: “Can we employ crowdsourcing to perform salient interest point detection from users not specifically tasked to find these points?” A diagram depicting the high-level system presented in this work is shown in Fig. 1.
Saliency, particularly visual saliency, is a popular construct from the field of biological vision and broadly describes an organism's ability to focus attention on a subset of its sensory input for further processing. In this work, data subsetting is the most relevant part of the visual saliency process. While scientists and non-experts will have differing opinions on high-level, top-down definitions of saliency, rapid bottom-up visual saliency is much less task and operator dependent [49]. This paper focuses on such processing in the context of a long-term environment-monitoring program using AUVs. At the Australian Centre for Field Robotics there is an ongoing program to perform benthic monitoring with an AUV [73]. This program deploys an AUV in unstructured natural environments where it gathers data for human review. One of the major bottlenecks in this process is the vast amount of data gathered by the AUV, which is capable of gathering orders of magnitude more data than previous techniques. Traditionally, divers used hand-held cameras to gather visual data in underwater environments, and issues of decompression, airtime, and safety severely limited the quantity of data that scientists could gather. With the AUV in its current configuration, monitoring images can be gathered at up to 4 Hz. A typical field campaign lasting two weeks can result in hundreds of thousands of images requiring review.
The challenge of how to deal with this massive image archive is being explored on several fronts. A large effort has gone into unsupervised clustering [64], human hand labeling [46], and supervised classification [4]. This work presents an alternative for gathering large amounts of human review data quickly and inexpensively. The assertion we present in this paper is that human visual saliency can be modeled by proxy through the exploratory motions of a large number of users in a 3-D environment.
Capturing human curiosity and exploration for robotic platforms is non-trivial. The well-established approach is to use visual saliency measures, but it is not clear that they can predict what people find interesting in a 3D scene and how they will choose to interact with it. This paper presents two alternative measures of human interest, both based on the motion of the viewpoint controlled by the operator, and compares them to traditional saliency measures. Through the crowdsourcing of many remote smart phone/tablet users we gather data to identify visual saliency on 3-D photo-mosaic maps. Human experiments with ground truth from eye tracking are used to validate our results.
Crowdsourcing has emerged as a successful model for solving tasks by leveraging the human intelligence of large groups of remote users in a distributed fashion. The term crowdsourcing was coined in 2006 and first appeared in scientific literature in 2008 as “an online, distributed problem-solving and production model” [6]. The crowdsourcing model has since been adapted to outsource difficult steps in many computational tasks [32]. Recently the computer vision community has begun using crowdsourcing to solve challenging vision problems.
In parallel, researchers have started harnessing the power of data-mining over massive user bases to answer many new questions. Search engines and social networking use the interaction from millions of users to refine and improve advertising and site usability [17]. Researchers have used this data to learn about the demographics of users, social trends, and behavioral patterns [41], [65]. With the rise of smart phones and ‘app stores’ mobile platforms have quickly become a practical means of gathering massive amounts of user data. App analytics is attempting to turn the millions of smart phones in use into a distributed network of data sources.
Traditional crowdsourcing of vision tasks relies upon motivating users through community good will [59], financial incentives (Mechanical Turk), or competitive/entertainment incentives [1], [2] by turning a task into a game. The intended motivation for users of our app was education and entertainment. The app was advertised in the education section of the Apple iTunes app store, and its description and screenshots offered the promise of exploring images from the deep sea. We attempted to capitalize on public interest in science, especially exploratory science, to motivate downloads. A novel aspect of our approach is that the motivation of users was more decoupled from the task than in a traditional crowdsourcing model. To work with such user data we propose a novel paradigm from big data analytics in which the answers to questions are inferred from the data of many users. The power of our data-mining approach to crowdsourcing is that data is collected from a much larger pool of users. A full discussion of the motivations and demographics of users on various crowdsourcing platforms is beyond the scope of this paper; however, Kaufmann et al. [30] present a review of the studies on Mechanical Turk. While these studies reflect a diverse user pool, they also show that the Mechanical Turk user base is a fraction of the size of the potential smart phone app user pool [31], [56]. Using the smart phone platform gives us access to a much more general audience. To further broaden the app's appeal, we do not ask users to explicitly identify things they find interesting. Rather, we attempt to infer interest from patterns of interaction and in doing so free the user from an artificially constrained task. Without asking users to answer a specific question, their motivations for participating can be much more varied. This potentially gives access to a much larger ‘crowd’.
We present two novel metrics for calculating saliency from human user interaction data. The first uses the camera’s frustum to histogram observed points; the second uses a hidden Markov model (HMM) to classify interaction data spatially into a saliency map. These techniques are compared to several state-of-the-art visual saliency techniques and validated using human gaze-tracking data. The paper is laid out as follows. Section 2 discusses prior work. Section 3 presents the developed app as a platform for crowdsourcing. Section 4 lays out the two interaction-based formulations for saliency. Section 5 discusses the human trials used for validation. Section 6 presents results, and Section 7 concludes and presents future work.
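The first metric can be illustrated with a short sketch: each logged camera pose defines a view frustum, and model points that fall inside it accumulate observation counts. All names below, and the choice of testing raw model vertices against 4×4 view-projection matrices, are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: per-vertex "observation counts" from logged camera frustums.
import numpy as np

def frustum_histogram(vertices, view_proj_matrices):
    """vertices: (N, 3) model points; view_proj_matrices: iterable of (4, 4)
    clip-space transforms, one per logged interaction frame.
    Returns an (N,) count of how often each vertex fell inside the frustum."""
    n = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((n, 1))])      # homogeneous coords
    counts = np.zeros(n)
    for m in view_proj_matrices:
        clip = homo @ m.T                              # project to clip space
        w = clip[:, 3]
        ndc = clip[:, :3] / w[:, None]                 # perspective divide
        # a vertex is visible when it lies in front of the camera (w > 0)
        # and inside the normalized device coordinate cube
        inside = (w > 0) & np.all(np.abs(ndc) <= 1.0, axis=1)
        counts[inside] += 1
    return counts
```

Normalizing the resulting counts over all users yields a per-vertex observation histogram that can be rendered directly as a saliency map on the model.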
Crowdsourced vision
Tools such as LabelMe, ImageNet, BUBL and other systems which leverage Amazon’s Mechanical Turk have provided solutions to the problem of image-labeling using human computation [18], [14], [33]. Mechanical Turk has become a particularly popular platform for crowdsourcing for vision. It offers flexibility and there has been research into assessing, processing, and rectifying image labelings from large groups of human sources [63], [71], [55]. All the aforementioned systems deal with image
Mobile app
For this work we have created an app that allows users to explore and navigate a 3-D photo-textured model of the seafloor. A screenshot of it running can be seen in Fig. 2. Written in Objective-C and using OpenGL ES (OpenGL for Embedded Systems), it runs on both phones and tablets. The app, named SeafloorExplore, is downloadable for free and was released into the Apple iTunes app store in 2012. The app itself uses virtual texturing [44],
Interaction-based saliency methods
Building upon the work in interaction-based saliency discussed in Section 2.3 this section will propose two novel frameworks that attempt to capture the notion of saliency using 3D camera motions (as described in Section 3.2) crowdsourced from the developed app. While traditional visual image saliency experiments rely on static images, this paper presents two key differences to that work. First, 3-D photo-textured models are used which means that not only do intensity and color play a role in
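As a simplified illustration of the HMM idea, the sketch below decodes a two-state "dwell"/"travel" chain over quantized camera speeds with the Viterbi algorithm. The states, transition and emission probabilities, and the 1-D observation model are all assumptions chosen for illustration; the paper's formulation classifies interaction data spatially, which this sketch only approximates.

```python
# Hedged sketch: Viterbi decoding of a two-state HMM over camera-speed symbols.
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """obs: sequence of discrete observation indices.
    log_A: (S, S) log transition matrix, log_B: (S, O) log emissions,
    log_pi: (S,) log initial distribution. Returns most likely state path."""
    S = log_A.shape[0]
    T = len(obs)
    delta = np.zeros((T, S))                  # best log-probability so far
    back = np.zeros((T, S), dtype=int)        # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # (prev, cur) scores
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):            # trace backpointers
        path[t] = back[t + 1][path[t + 1]]
    return path

# Illustrative parameters: "dwell" (state 0) favors slow motion (symbol 0),
# "travel" (state 1) favors fast motion (symbol 1); sticky transitions.
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_B = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))
```

Frames decoded as "dwell" can then be binned spatially to build a saliency map, with the sticky transitions suppressing spurious single-frame state flips.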
Experiments
To validate that the proposed technique is an effective proxy for human interest we set up a traditional visual saliency eye tracking experiment. Eye tracking has long been used in psychology, human computer interaction, and vision research to experimentally measure human attention [51], [60], [53]. Most commonly we see the results of these experiments represented as heat maps that capture various metrics about eye fixations on the screen. There are several competing metrics for what to
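The fixation heat maps mentioned above can be sketched as a sum of duration-weighted Gaussian kernels centered on the recorded fixations. The grid dimensions and the kernel bandwidth below are illustrative assumptions, not the parameters used in the experiments.

```python
# Hedged sketch: eye-tracking fixations rendered as a duration-weighted heat map.
import numpy as np

def fixation_heatmap(fixations, width, height, sigma=25.0):
    """fixations: list of (x, y, duration) tuples in pixel coordinates.
    Returns an (height, width) map normalized to [0, 1]."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for x, y, dur in fixations:
        # isotropic Gaussian kernel, weighted by fixation duration
        heat += dur * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    if heat.max() > 0:
        heat /= heat.max()                    # normalize for comparison
    return heat
```

Maps built this way can be compared against interaction-derived saliency maps with standard similarity scores such as correlation or AUC.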
Results
The results are generated across data gathered on missions from three separate field deployments.
Conclusions and future work
In this paper we have presented a novel technique for extracting saliency from crowdsourced interaction data. We have developed two algorithms to extract interest metrics from camera motions. To our knowledge this system is the first of its kind in using a distributed smart phone app to gather saliency data for 3D maps. The proposed system provides an alternative to traditional visual saliency by harnessing the modality of touch to provide a proxy for interest. Our results show that comparable
Acknowledgments
This work is supported by the New South Wales State Government and the Integrated Marine Observing System (IMOS) through the DIISR National Collaborative Research Infrastructure Scheme. The authors of this work would like to thank the Australian Institute for Marine Science and the Tasmanian Aquaculture and Fisheries Institute (TAFI) for making ship time available to support this study. The crews of the R/V Solander and R/V Challenger were instrumental in facilitating successful deployment and
References (75)
- et al., Visual attention for region of interest coding in JPEG 2000, J. Vis. Commun. Image Represent. (2003)
- et al., A hierarchical neural system with attentional top-down enhancement of the spatial resolution for object recognition, Vis. Res. (2000)
- An integrated model of eye movements and visual encoding, Cognit. Syst. Res. (2001)
- et al., Modeling visual attention via selective tuning, Artif. Intell. (1995)
- et al., Modeling attention to salient proto-objects, Neural Netw. (2006)
- et al., Labeling images with a computer game
- et al., Peekaboom: a game for locating objects in images
- B. Baccot, V. Charvillat, R. Grigoras, C. Plesca, Visual attention metadata from pictures browsing, in: Ninth...
- et al., Automated species detection: an experimental approach to kelp detection from sea-floor AUV images
- A. Borji, L. Itti, Exploiting local and global patch rarities for saliency detection, in: Proc. IEEE Conference on...
- Crowdsourcing as a model for problem solving: an introduction and cases, Converg.: Int. J. Res. New Media Technol.
- Crowdsourced automatic zoom and scroll for video retargeting
- Combining content-based analysis and crowdsourcing to improve user interaction with zoomable video
- Hierarchical geometric models for visible surface algorithms, Commun. ACM
- Multimodal semantics extraction from user-generated videos, Adv. MultiMedia
- What is a hidden Markov model?, Nat. Biotech.
- Category independent object proposals
- Interactions with big data analytics, Interactions
- Shallow-depth 3D interaction: design and evaluation of one-, two- and three-touch techniques
- Graph-based visual saliency
- The sponge gardens of Ningaloo Reef, Western Australia, Open Marine Biol. J.
- Image signature: highlighting sparse salient regions, IEEE Trans. Pattern Anal. Machine Intell.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Machine Intell.
- Generation and visualization of large-scale three-dimensional reconstructions from underwater robotic surveys, J. Field Robotics
- CrowdForge: crowdsourcing complex work