1 Introduction

Gathering information with new and better sensors is positive since users can access more information, but it is necessary to understand what the most vital information is in a given situation. To accomplish this, users need a good understanding of the whole system. Data overload may be a serious problem, and how to support human cognition, e.g. with computers, is fundamental to ensure good situation awareness and good user performance. Regardless of the type of system, it is also necessary to have a good understanding of the user and the context. The ecological approach [1] and representation design [2] describe a cognitive triad between environment, interface and user. There is a reciprocal coupling between the user and the environment, which is often mediated by a user interface. The interface effectiveness is determined by the mapping between the environment and the interface (correspondence) and the mapping between the user and the interface (coherence). To develop an effective and user-friendly system, all three of these parts must be taken into account. Information that reaches the user has often been acquired with some type of sensor system that involves signal processing, acting as a filter between the environment and the interface. In order to understand the complete picture and study sensor-related aspects, the model has to be extended to include environment, interface and human aspects. Since the sensor is a central part of our research, we add the sensor to the representation (Fig. 1).

Fig. 1. The relation between user, interface, environment and sensor. Icons were adopted from Iconshock [3].

Even though the whole system always has to be taken into account, the main focus here is on the abilities and limitations of users and their performance in extracting correct information from sensor data.

Seeing an object can mean different things, but one way to analyze observers’ ability to perform visual tasks is to use the Johnson criteria [4, 5], which distinguish between detection (i.e. whether there is something of potential interest), recognition (e.g. the difference between a human and a car) and identification (e.g. whether it is a friend or foe). According to the Johnson criteria, the possible detection distance is calculated based on how many pixels an object must cover. Detecting static objects requires 2 × 2 pixels, orientation requires 2.8 × 2.8 pixels, recognition 8 × 8 pixels, and identification 12.8 × 12.8 pixels [6]. However, these should be interpreted as values under the best possible conditions. There is also a variety of factors that must be considered, including the contrast between objects and background, atmospheric disturbances, the number of objects in the picture, light, contextual clues, color and type of optics. Moreover, performance is affected by the type of task, the experience of the participants and their level of training for the specific task, motivation, and the relative importance of quick decisions versus correct results [4]. The methods Triangle Orientation Discrimination (TOD), Targeting Task Performance (TTP) and Thermal Range Model (TRM) could also be considered. For further descriptions of these methods, see Näsström et al. [7], Wittenstein [8], and Vollmerhausen and Jacobs [9].
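As a rough illustration of how such pixel requirements translate into range estimates, the following Python sketch computes the maximum range at which an object covers the required number of pixels under a simple pixels-on-target model; the 2.3 m object width and 0.3 mrad instantaneous field of view (IFOV) are hypothetical values chosen for illustration only, not parameters from the cited works or from the sensors used later in this paper.

def max_range_m(object_size_m, required_pixels, ifov_mrad):
    # Pixels on target = object size / (range * IFOV), with IFOV in radians,
    # so the maximum range is object size / (required pixels * IFOV).
    return object_size_m / (required_pixels * ifov_mrad / 1000.0)

# Hypothetical 2.3 m wide vehicle viewed by a sensor with 0.3 mrad IFOV.
for task, pixels in [("detection", 2.0), ("orientation", 2.8),
                     ("recognition", 8.0), ("identification", 12.8)]:
    print(f"{task:>15}: {max_range_m(2.3, pixels, 0.3):6.0f} m")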

Even though theoretically calculated values (e.g. the Johnson criteria) can be of some value in getting an indication of which objects can be detected, recognized or identified, experiments with users should be conducted to get a better understanding of a real situation. There is an obvious risk of confusion regarding the interpretation of these concepts, since they are used by researchers in different contexts without a standardized definition. It is therefore necessary to clarify and define the concepts used.

Identification of friend or foe is different from actual identification of a face from memory or a database. In many situations one must be absolutely certain about the identity of a person or vehicle to decide whether to use military force. It is also necessary to have a good understanding of the rules of engagement (ROE), i.e. when military force can and cannot be used. Friendly fire, where a soldier accidentally opens fire on his own troops, is a well-known phenomenon that must be avoided. In other cases, such as intelligence, it is important to describe what is seen according to a predetermined classification scheme and not just describe what users think they see.

In military contexts, it is sometimes important to find a particular type of vehicle among other similar military vehicles, and it is also important to distinguish between military and civilian vehicles. To increase knowledge about this, our work involves assessing actual sensor performance but also investigating how operators use and interpret sensor information. Even though the interest from a human factors perspective is mainly in user performance in detecting and recognizing people and vehicles, we also conduct technology-driven sensor studies [10] and thorough investigations of the real setting [11]. There are many interesting studies focusing on detection, recognition and identification. Colomina and Molina [12] discuss the evolution and use of unmanned aerial systems in photogrammetry and remote sensing, which can be used in both military and civilian operations, e.g. search and rescue missions. Other research with an unmanned aerial vehicle (UAV) and target detection focus takes a more technical approach, e.g. developing algorithms for autonomous target detection [13] or autonomous UAVs for search and rescue [14]. There are also interesting studies using multiple cooperative vehicles [15] or a swarm of unmanned vehicles [16], which show that multiple vehicles can improve performance. Other research has a clearer connection to human factors issues and user performance. Hixson et al. [17] used soldiers to investigate the relation between performance in the laboratory and in the field for tasks including detection, recognition and identification. The results show that perception laboratory performance using real or simulated imagery relates well to imagery performance in the field.

The research question in the first experiment was: how fast and with what degree of correctness can users detect and recognize one selected military vehicle among other similar vehicles, and how is performance affected by type of sensor, camera scan rate of the field of view on the ground (hereafter referred to as scan rate), and distance? The research question in the second experiment was: with what degree of correctness can users recognize eight military vehicles with an infrared sensor, at a camera scan rate of 8 m/s and a distance of 400 meters?

It is important to investigate and understand the sensors’ pros and cons in different situations. Only the infrared sensor can be used at night, while both the visual and the infrared sensor can be used during daytime. However, it is not obvious which sensor is preferable during daytime in different situations, and it is therefore important to investigate this. In some situations it is certainly better to use the visual sensor, but sometimes the vehicle may be partly hidden under e.g. branches or trees, and then it is advantageous to use the infrared sensor also during daytime. From a tactical perspective it may be advantageous to fly the unmanned aerial vehicle at night, but then only the infrared sensor can be used. Also, at night there are significantly fewer civilian vehicles in motion and fewer vehicles that give heat signatures, which facilitates detection and recognition of military vehicles. If performance decreases with one of the sensors, quantifying that decrease is important. It is preferable to use a high camera scan rate since larger geographic areas can be covered, but if it results in decreased performance it may be necessary to use a lower speed. Even though user performance is expected to decline at increased camera scan rates, it is important to objectively quantify the decrease. If the unmanned aerial vehicle flies at high altitude there are tactical advantages such as a lower risk of the UAV being detected, but if it results in decreased performance it is not recommended.

Here, two experiments were conducted as part of a larger study where the overall goal is to investigate how different sensors should be used in unmanned aerial vehicles to gather information. The purpose of these two experiments was to investigate subjects’ performance in vehicle detection and recognition from a simulated unmanned aerial vehicle. In the first experiment, detection and recognition of one selected vehicle among a total of eight vehicles was investigated at two different camera scan rates (seen from the UAV) with a visual and an IR sensor. In the second experiment, recognition of all eight vehicles was investigated at a camera scan rate of 8 m/s with an IR sensor. Although the results here are presented and analyzed strictly in relation to these experiments, they can later be analyzed and compared with other experiments. This information can also be used to better understand how information from different sensors can be aggregated to increase performance. However, this is not the focus here and is therefore not presented in this paper.

2 Experiment 1 – Detection and Recognition of Selected Vehicle

In the first experiment, detection and recognition of one selected vehicle among a total of eight vehicles was investigated.

2.1 Method

Participants watched synthetic video sequences captured from a UAV. All video sequences were generated by a sensor simulation system [10]. The task was to detect and recognize a selected vehicle among other vehicles. A within-group design with two visualizations (visual and IR) × two distances (400 and 520 meters) × two camera scan rates (8 and 12 m/s) was used.

Subjects

Twelve subjects (5 women and 7 men) between 25 and 48 years of age participated in the experiment. Half of the participants had a military background and the other participants were well acquainted with military activities through their civilian jobs. However, none of the participants were experts on the vehicles presented in these experiments, and they were therefore trained before the experiments started. All had adequate vision with or without correction.

Apparatus

The video sequences were presented on a Dell Latitude 7240 with a 12.5 inch display at a resolution of 1366 × 768 pixels. The computer had a 4th generation Intel® Core i5/i7 processor. Self-developed software was used to present stimuli and record participants’ response times.

Stimuli

A total of eight videos (640 × 480 pixels) were generated, simulating a clear sunny day with shadows from targets on the ground, to depict sensor information from a visual and an infrared sensor (Fig. 2). The overall mission was similar to a real UAV flying along a predefined path with vehicles stationary on the ground.

Fig. 2. Still images from the visual sensor (left) and the IR sensor (right).

The task was to detect and recognize one selected vehicle among a total of eight vehicles. The eight vehicles were BMP-3, BTR-80, MT-LB, SA-19, T-72, TOS-1, Ural 4320 ammunition truck, and Ural 4320 fuel truck (Fig. 3).

Fig. 3. The eight vehicles used.

Four scenarios were generated with the visual and infrared sensors, respectively. Each scenario had 18 areas with different target positions. The same areas and positions were used for the visual and infrared scenarios. A total of eight videos were generated according to the aforementioned design. The visual and infrared scenarios were presented in a balanced order between subjects, and within each sensor the four scenarios were presented in a randomized order.
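As a minimal sketch (assumed, not the authors’ actual implementation) of this presentation-order logic, the following Python snippet alternates which sensor block comes first based on participant number and shuffles the four scenarios within each sensor block.

import random

SCENARIOS = [1, 2, 3, 4]

def presentation_order(participant_id, seed=None):
    rnd = random.Random(seed)
    # Balance sensor block order between subjects (alternating by participant).
    sensors = ["visual", "infrared"] if participant_id % 2 == 0 else ["infrared", "visual"]
    order = []
    for sensor in sensors:
        scenarios = SCENARIOS[:]
        rnd.shuffle(scenarios)            # randomized order within each sensor block
        order += [(sensor, s) for s in scenarios]
    return order

print(presentation_order(participant_id=3, seed=42))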

Procedure

After welcoming the participants individually and briefing them about the experiment’s purpose and procedure, they received written information and had the opportunity to ask the experiment leader questions. An introduction was then given to make sure that the participants were familiar with the situation and the test material. They were introduced to both the visual and infrared image visualizations and received training, which consisted of two three-minute scenarios, one with visual and one with infrared stimuli. The participants watched the videos and responded by first pressing the space bar, whereby the response time (RT) was recorded, and then using the left mouse button to annotate the selected vehicle’s position in the image. The annotations were later used to calculate the number of correct answers. The participants were instructed to always focus on the screen with the stimuli. Because the task was mentally demanding, it was divided into eight separate videos with the possibility to rest before continuing with the next one.
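The self-developed response software is not described in detail here; the following is only a minimal sketch, assuming a pygame-based event loop, of the response logic described above: the space bar logs the response time, after which a left mouse click annotates the vehicle position (video playback is omitted for brevity).

import time
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))   # matches the video resolution
trial_start = time.perf_counter()
rt, click_pos = None, None

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE and rt is None:
            rt = time.perf_counter() - trial_start      # detection response time (s)
        elif event.type == pygame.MOUSEBUTTONDOWN and event.button == 1 and rt is not None:
            click_pos = event.pos                       # annotated vehicle position
            running = False
        elif event.type == pygame.QUIT:
            running = False
    pygame.display.flip()

print("RT:", rt, "annotation:", click_pos)
pygame.quit()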

2.2 Results

The results include statistical analysis of the time to detect targets and of recognition of the selected vehicle. The data were analyzed with a three-way repeated measures ANOVA [18] with type of visualization (visual and infrared), camera scan rate (8 and 12 m/s), and distance (400 and 520 meters) as within-subject factors. Tukey HSD was used for post hoc testing [19].
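The following Python sketch shows how such a three-way repeated measures ANOVA can be set up with statsmodels’ AnovaRM; the data frame is synthetic and only illustrates the long format with one mean value per participant and condition (this is not the authors’ analysis code).

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for subject in range(1, 13):                  # 12 participants
    for sensor in ("visual", "infrared"):
        for scan_rate in (8, 12):             # m/s
            for distance in (400, 520):       # meters
                rows.append({"subject": subject, "sensor": sensor,
                             "scan_rate": scan_rate, "distance": distance,
                             # synthetic mean proportion correct per condition
                             "prop_correct": rng.uniform(0.6, 1.0)})
df = pd.DataFrame(rows)

# Fully balanced within-subject design: one value per subject and condition.
res = AnovaRM(df, depvar="prop_correct", subject="subject",
              within=["sensor", "scan_rate", "distance"]).fit()
print(res)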

Detection

The ability to detect targets was measured by response time (RT) and analyzed by repeated measures ANOVA. The results showed no significant main effects on response time (p > .05).

Recognition of one selected vehicle

The ability to recognize the one selected vehicle was analyzed by repeated measures ANOVA, where the mean value for each condition was used for each participant. The results showed a main effect of type of sensor, F(1, 11) = 9.02, p < .05, where participants’ performance was lower with the infrared sensor than with the visual sensor (Fig. 4).

Fig. 4. Mean and standard error of mean for proportion correct answers for the visual and infrared (IR) sensor.

There was also a significant main effect of camera scan rate, F(1, 11) = 8.75, p < .05, where the higher camera scan rate caused more errors (Fig. 5). There was no significant main effect of distance, and no significant interaction effects (p > .05).

Fig. 5. Mean and standard error of mean for proportion correct answers for 12 m/s and 8 m/s.

3 Experiment 2 – Recognition of Eight Vehicles

In the second experiment, recognition of a total of eight vehicles was investigated.

3.1 Method

From Experiment 1, the scenario with the infrared sensor, a distance of 400 meters, and a camera scan rate of 8 m/s was selected. For this setting, recognition of eight different vehicles was investigated. In this experiment the focus was only on the proportion of correctly recognized vehicles. The subjects watched the video sequences for five seconds and then reported their answers; no response time was measured.

Subjects

Twelve subjects (4 women and 8 men) participated in the experiment. Five of the participants had a military background and the other participants were well acquainted with military activities through their civilian jobs. However, none of the participants were experts on the vehicles presented in these experiments, and they were therefore trained before the experiments started. All had adequate vision with or without correction.

Apparatus

See Experiment 1 for a technical description. Superlab [20] was used to present the video sequences and to record the proportion of correct answers.

Stimuli

Two video sequences (640 × 480 pixels) with a total of nine stops were used, where the subjects’ task was to recognize vehicles among a total of eight vehicles. The same eight vehicles were used as in Experiment 1, but in this experiment the task was to recognize all eight vehicles, not only one selected vehicle.

Procedure

Overall, the procedure was the same as in Experiment 1. However, there were some differences due to the different design. In Experiment 2, the video sequences were paused at predefined occasions, and one vehicle was indicated by a circle. The subjects answered by pressing a number from 1 to 8 on the keyboard, and then the next stimulus was indicated by a circle. The procedure was repeated 1–4 times at each scenario stop, depending on the number of vehicles at that particular stop.

3.2 Results

The results include statistical analysis of the recognition of the eight vehicles, which was analyzed with a one-way ANOVA [18]. Tukey HSD was used for post hoc testing [19]. The results showed a significant effect of vehicle type, F(7, 77) = 5.54, p < .001 (Fig. 6).
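As a minimal sketch of a Tukey HSD comparison between vehicle types, the following snippet uses statsmodels’ pairwise_tukeyhsd on synthetic per-subject proportions; note that this simple call treats the vehicle groups as independent samples rather than repeated measures, so it only approximates the analysis reported above.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
vehicles = ["BMP-3", "BTR-80", "MT-LB", "SA-19", "T-72", "TOS-1",
            "Ural ammo", "Ural fuel"]
scores, groups = [], []
for v in vehicles:
    scores += list(rng.uniform(0.6, 1.0, size=12))   # 12 subjects per vehicle
    groups += [v] * 12

print(pairwise_tukeyhsd(np.array(scores), np.array(groups), alpha=0.05))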

Fig. 6. Mean and standard error of mean for proportion correct answers of the eight vehicles.

Tukey HSD showed that the BMP-3 was significantly harder to recognize than most of the other vehicles (except the MT-LB). The T-72, BTR-80, TOS-1, Ural 4320 ammunition truck, and SA-19 were recognized in 90% of cases or more, while the BMP-3, MT-LB and Ural 4320 fuel truck were more difficult. The BMP-3 was mainly confused with the BTR-80 and MT-LB, the MT-LB was mainly confused with the BTR-80, and the Ural 4320 fuel truck was mainly confused with the Ural 4320 ammunition truck. For a more detailed description, see the confusion matrix (Table 1). The first column shows the vehicle name and the second column shows the percentage of correct recognition. Columns three to eight show which other vehicles the vehicle (in the first column) was confused with.

Table 1. Confusion matrix showing which other vehicles each of the eight vehicles was confused with. All numbers are percentages, and each row adds up to 100%.
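As an illustration of how a row-normalized confusion matrix like Table 1 can be computed from trial-level responses, the following sketch uses pandas; the few example trials are made up and do not reflect the experimental data.

import pandas as pd

trials = pd.DataFrame({
    "true":     ["BMP-3", "BMP-3", "BTR-80", "MT-LB", "T-72"],
    "response": ["BTR-80", "BMP-3", "BTR-80", "BTR-80", "T-72"],
})

# Rows: true vehicle; columns: reported vehicle; values in percent per row.
confusion = pd.crosstab(trials["true"], trials["response"], normalize="index") * 100
print(confusion.round(1))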

4 Discussion

The experiments presented here are part of a larger study where user performance in detecting and recognizing people and vehicles is investigated, technology-driven sensor studies are performed [10], and thorough investigations of the real setting are conducted [11]. The combination of improved technical knowledge, understanding of the real environment, and user performance is seen as a good interdisciplinary combination to better understand how a final system can make a difference in real settings.

The purpose of these two experiments was to investigate subjects’ performance in vehicle detection and recognition from a simulated unmanned aerial vehicle. The results show that the ability to recognize vehicles is affected by the type of sensor, the camera scan rate, and the type of vehicle to be recognized. User performance in recognizing the selected vehicle among a total of eight vehicles was significantly lower with the infrared sensor than with the visual sensor, and significantly lower at a camera scan rate of 12 m/s than at 8 m/s. The results also show that recognition performance varied between 67% and 100% depending on the type of vehicle. The results from the second experiment clearly show that vehicle recognition with the infrared sensor is problematic, even at a short distance (400 meters) and a slow camera scan rate (8 m/s). The results also show that certain types of vehicles are particularly difficult to recognize, which is important operational information in military contexts.

In situations where the vehicles are placed in open terrain, as in these experiments, it is advantageous to use the visual sensor. However, the infrared sensor allows detection of vehicles in situations where there is no clear view, in situations with low visibility, and at night. In these situations the heat signature captured by the infrared sensor can be used to detect and recognize vehicles. Another possibility is to combine information, either by switching between the two sensor images or by fusing the sensor images into one image. However, this was not investigated in these experiments and is therefore not reported here.

From a scientific perspective it is important to understand perceptual and cognitive possibilities and limitations. As a part of this, we investigated how the type of information presented (visual and infrared), camera scan rate and distance affected user performance. Although there are a number of other factors that affect performance, this contributes to knowledge about vehicle detection and recognition in this military context. For practical and economic reasons it is not always possible to conduct field studies, and therefore laboratory studies can be used as an important complement. The results presented here can also be correlated with results from similar field studies (not yet performed). Results from laboratory experiments are especially valuable if they can predict performance in real environments, which for us is a future challenge.

It is also important to use systematic methods for data collection and result analysis, which makes it possible to compare and analyze the results relative to other scientific results. The results from our experiments can later be analyzed and correlated with calculated values from e.g. the Johnson criteria, or related to the sensors’ technical performance, to get an understanding of the correlation between human performance and technical performance. In this study, no specific sensors have been presented, but the results from this work can be used for evaluating the existing sensors that the simulation was based on. This work remains to be done and is therefore not presented here.

Even though these experiments and prior experiments [21] give a good understanding of user performance in detecting and recognizing people and vehicles from an unmanned aerial vehicle, there are some limitations. In these studies we used a predefined flight path, as is often the case in real settings, but it would be interesting to let the users manually control the sensor direction and give them the possibility to zoom in on targets. Also, in these experiments it was daytime with strong sunshine, which gave clear shadows of the vehicles. It would be interesting to compare the results achieved in this study with results from a daytime scenario with cloudy weather, without clear and sharp shadows visible. Night-time scenarios would also be interesting to investigate. In this study, the vehicles were placed on open surfaces in the terrain, but it would be interesting to see how different camouflage (such as nets or trees) would affect the ability to recognize vehicles, especially with the visual camera.

Detection and recognition were investigated in this study, but it would be interesting to also investigate identification of people and vehicles in a setting similar to the experiments presented here. One limitation of this study is that although the visual and infrared sensor data are realistic, no scientific verification has been made to confirm the similarity between the stimulus material used and real data from sensors. However, one researcher compared the simulated videos with real sensor information and confirmed that the material looked similar [10]. In the future, this procedure needs to be improved with standardized objective measures.

The information presented from this study is important since user performance and technical knowledge can be aggregated and used to understand operational performance and limitations. Issues such as camera scan rate, type of sensor, flight altitude, weather conditions, and time of day are important and can be put in a broader context to understand how a task can best be solved.