Abstract
Monitoring public space with imaging sensors in order to track objects or persons is often associated with privacy concerns. We present an approach based on a Dynamic Vision Sensor (DVS) that achieves this tracking without requiring the creation of conventional grey-value or color images. A DVS produces an event stream that only contains the changes in the scene.
The presented tracking approach considers the scenario of fixed-mounted sensors. The method clusters events and traces the resulting cluster centers to accomplish the object tracking. We show the usability of this approach with a first proof-of-concept test.
This work is part of the project “plsm”, which is funded by the European Regional Development Fund under the grant number EFRE-0801082.
1 Introduction
In the field of computer vision, automated object detection and tracking are challenging topics. Over the past decades, various approaches have been developed: methods based on evaluating the optical flow are considered in [1,2,3], variations of the Kalman filter in [4,5,6], and deep-learning-based techniques in [7,8,9].
However, these approaches use conventional, frame-based (grey-value) images captured by classical CCD or CMOS imagers [10]. Depending on the application domain, this type of recording can quickly conflict with the privacy awareness of potential users (especially in in-home environments) or, in the case of public places, lead to complex legal issues [11].
The described use case of object tracking in this paper is part of a project whose goal is to improve the planning of public open space by including the specific user behavior in the basic urban design process. For this purpose, it is planned to construct a distributed, sensor-based system in order to automatically derive various parameters of the considered area. In the first step, we focus on the task of object detection and tracking to derive information about the number of users and their movements.
To overcome privacy concerns and legal restrictions, we suggest the use of an alternative image sensor, the so-called Dynamic Vision Sensor (DVS). This type of sensor is biologically inspired and does not operate in a frame-based manner. Instead, it transmits the changes within a scene asynchronously as they happen.
The paper is structured as follows: In Sect. 2 the DVS and its functionality are described. A filtering and clustering approach for object tracking based on a DVS is presented in the subsequent section. In Sect. 4 a simple proof-of-concept comparison of the DVS solution to a classical image-processing solution is presented. Section 5 concludes with a short summary.
2 Dynamic Vision Sensor
CCD- or CMOS imagers typically operate at a fixed frame rate and produce a constant data stream independent of changes in the considered scene. This can lead to high redundancies in the individual captured frames. In contrast to this, the pixels of a Dynamic Vision Sensor operate independently and asynchronously, based on relative light intensity changes in the scene. This approach is, as a part of neuromorphic engineering, borrowed from biology. For this reason, DVSs are also called “silicon retinas”.
Each pixel of a DVS only transmits information (called an event) when it measures a change in intensity greater than a pre-defined threshold. As a consequence, a static scene generates no sensor output at all.
The output of this sensor is typically encoded as a sparse stream in an Address-Event Representation (AER). Each event in this stream includes [12]:
- (x, y)-Coordinate: The pixel coordinate in the sensor array that triggered the event.
- Timestamp: The time at which the event occurred, typically with a resolution in the range of milliseconds.
- Meta information: Depends on the specific sensor model, e.g. the polarity of an event (ON: change from dark \(\rightarrow \) bright, OFF: change from bright \(\rightarrow \) dark) or the brightness value at the moment of the event generation (greyscale value).
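The event fields listed above could be represented in software roughly as follows. This is a minimal sketch; the class name `Event` and its field names are illustrative assumptions and do not correspond to any particular sensor SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One AER event; field names are illustrative, not taken from a sensor SDK."""
    x: int                           # pixel column in the sensor array
    y: int                           # pixel row in the sensor array
    t_ms: float                      # time of occurrence in milliseconds
    polarity: Optional[bool] = None  # True = ON (dark -> bright), False = OFF
    grey: Optional[int] = None       # brightness at event time, if provided


# Events arrive asynchronously; sorting by timestamp restores stream order.
stream = sorted(
    [Event(11, 20, 0.9, grey=130), Event(10, 20, 0.4, polarity=True)],
    key=lambda e: e.t_ms,
)
```

Both meta fields are optional here because, as stated above, their availability depends on the sensor model.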
Lichtsteiner et al. mention in [12] that the first sensor of this type was developed in the mid-1990s. An overview of subsequent developments can be found in [13]. In the scope of this work, we used the “CeleX-IV” sensor, developed by Hillhouse Technology [14]. This sensor offers a \(768\times 640\) pixel array resolution, a high-speed AER output with 200 Meps (million events per second) and a high dynamic range of \({\approx }120\) dB.
Figure 1 shows an example scene captured with this sensor. In Fig. 1a the scene is displayed as a greyscale image, whereas Fig. 1b shows the visualization of a 60 ms time window of event data as a binary image. Each pixel, where at least one event occurred in the time window, is set to white. Figure 1c illustrates the spatiotemporal information within the stream of events. Each of the six colors in this figure represents a time window of 60 ms (total 360 ms) of events. The burst of events at the position of the moving human and the tree waving in the wind are clearly visible in these visualizations.
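The binary visualization described for Fig. 1b can be sketched in a few lines. The function name and the `(x, y, t_ms)` tuple layout are assumptions made for this illustration only.

```python
def events_to_binary_image(window_events, width, height):
    """Set every pixel with at least one event in the window to white (255)."""
    image = [[0] * width for _ in range(height)]
    for x, y, _t_ms in window_events:
        image[y][x] = 255  # repeated events at one pixel have no further effect
    return image


# Three events, two of them at the same pixel, on a tiny 3 x 2 sensor.
image = events_to_binary_image([(1, 0, 0.0), (1, 0, 5.0), (2, 1, 10.0)], 3, 2)
```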
3 Event-Clustering as a Basic Tracking Approach
Based on the inherent properties of the event-based vision sensor, we propose the processing chain in Fig. 2 to track moving objects. It uses a neighborhood-based event filter as a pre-processing step, followed by hierarchical clustering and tracing of the cluster centroids. These steps are explained in the following subsections. Our implementation slices the continuous event stream into non-overlapping blocks of a fixed time length (referred to in the following as sliding time windows) and processes each of these blocks.
In addition to the privacy benefits (no grey- or color-value information of the scene is needed) offered by the sparsely populated event stream of a DVS, this approach makes it possible to achieve a solution with low computational and power requirements. An important point is that the static background of the scene does not have to be considered. Especially in the context of sensor networks, this can be a great advantage.
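Slicing the stream into non-overlapping blocks of fixed duration might look as follows. This is a sketch under the assumption that events are `(x, y, t_ms)` tuples; the function name is hypothetical, and windows without events are simply absent from the result.

```python
from collections import defaultdict


def slice_into_windows(events, window_ms):
    """Group (x, y, t_ms) events into non-overlapping blocks of window_ms."""
    windows = defaultdict(list)
    for x, y, t_ms in events:
        # Integer division assigns each event to exactly one block.
        windows[int(t_ms // window_ms)].append((x, y, t_ms))
    return dict(sorted(windows.items()))


events = [(0, 0, 0.0), (1, 1, 59.9), (2, 2, 60.0), (3, 3, 185.0)]
windows = slice_into_windows(events, 60.0)
```

With a 60 ms window, the first two events fall into block 0, the third into block 1 and the last into block 3; block 2 contains no events and is omitted.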
3.1 Event-Filtering
Figures 1b and c clearly show that there is significant sensor noise in the recorded signal, which prevents a sensible use of clustering approaches. Therefore, we suggest a simple pre-processing filter that exploits the spatial and temporal neighborhood. For each event, the number of other events in the von Neumann neighborhood (4-neighborhood) within the current sliding time window is counted; this count is denoted by \(f(\text {event}_x, \text {event}_y)\).
Figure 3a clarifies the considered spatio-temporal neighborhood for an event. An event is rejected when \(f(\text {event}_x, \text {event}_y) < \text {threshold}\).
We suggest setting the threshold value depending on the width of the underlying sliding time window. We have chosen the value empirically and set it to 1/8 of the sliding window in ms. The filtering result is shown in Figs. 3b and c (compare with the unfiltered version in Fig. 1).
This filtering drastically reduces the number of events that must be processed in the next step, while preserving most of the events from the desired objects. The effect on the average number of events per sliding window is shown in Fig. 3d for various recordings (compare with Sect. 4). In this practical example, the average event count per sliding window was reduced by about 96%.
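The filtering step can be illustrated with the following sketch. It assumes that the count covers all events at the four direct neighbor pixels within the window and that the event's own pixel is excluded; the exact neighborhood is defined by Fig. 3a, so this reading is an assumption. The function name and tuple layout are hypothetical.

```python
from collections import Counter

# Offsets of the von Neumann (4-)neighborhood, centre pixel excluded (assumption).
NEIGHBOURS = ((0, 1), (0, -1), (1, 0), (-1, 0))


def filter_events(window_events, window_ms):
    """Keep events whose neighbourhood support reaches window_ms / 8."""
    threshold = window_ms / 8.0
    per_pixel = Counter((x, y) for x, y, _ in window_events)
    kept = []
    for x, y, t_ms in window_events:
        support = sum(per_pixel[(x + dx, y + dy)] for dx, dy in NEIGHBOURS)
        if support >= threshold:  # rejected when support < threshold
            kept.append((x, y, t_ms))
    return kept


# 16 ms window -> threshold 2: only (5, 5) has two neighbouring events.
window = [(5, 5, 1.0), (5, 6, 2.0), (4, 5, 3.0), (20, 20, 4.0)]
kept = filter_events(window, 16.0)
```

Counting events per pixel once keeps the filter linear in the number of events of a window.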
3.2 Hierarchical Clustering
The next step in the processing chain consists of clustering the pre-filtered events to obtain semantically related groups of events. As the number of clusters (moving objects) in the scene is not known a priori, a hierarchical clustering approach is used.
We use hierarchical single-link clustering (provided by the “fastcluster” library [15]) based on the Euclidean distance of the (x, y)-coordinates of the pre-filtered events. The clustering break point is controlled by a pre-defined cutoff distance. Only clusters consisting of a minimum number of events are considered for further processing. Figure 4 illustrates the result of the clustering step at two different sliding windows in the form of color-coded clusters.
Table 1 summarizes all parameters of the presented approach and their selected values.
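Cutting a single-link dendrogram at a cutoff distance yields the connected components of the graph that links all points no farther apart than the cutoff. The sketch below uses this equivalence with a plain union-find structure instead of calling the fastcluster library, so it is a stand-in for illustration rather than the implementation used here. The parameter values in the example are placeholders, not the values from Table 1.

```python
from math import hypot


def single_link_clusters(points, cutoff, min_size):
    """Single-link clusters at a distance cutoff, dropping small clusters."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the cutoff distance.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if hypot(dx, dy) <= cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return [g for g in groups.values() if len(g) >= min_size]


points = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (50, 50)]
clusters = single_link_clusters(points, cutoff=1.5, min_size=2)
```

The pairwise loop is quadratic in the number of events, which is one reason why the preceding filter step matters.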
3.3 Cluster Path Tracking
For each cluster resulting from the previous step, the center point is calculated. Based on this center point, the objects are traced over time, i.e. over successive sliding time windows. Two center points from clusters in consecutive sliding time windows are considered semantically linked when their Euclidean distance is smaller than a pre-defined threshold (see Table 1). If no center point from the previous window lies within this distance, the corresponding cluster is interpreted as a new object.
The result of tracking these cluster center points is exemplified in Fig. 5, which shows the tracked path between the two sliding windows displayed in Figs. 4a and b.
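The linking of center points over consecutive windows could be sketched as follows. The greedy nearest-neighbor assignment, the function names and the track representation (a list of `(window_index, center)` pairs) are assumptions of this illustration; the text above only fixes the distance criterion.

```python
from math import hypot


def centroid(cluster):
    """Mean (x, y) position of the events in one cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def link_centroids(tracks, centroids, window_index, max_dist):
    """Extend tracks from the previous window; unmatched centers start new tracks."""
    unmatched = list(centroids)
    for track in tracks:
        last_index, last = track[-1]
        # Only tracks seen in the directly preceding window can be extended.
        if last_index != window_index - 1 or not unmatched:
            continue
        best = min(unmatched, key=lambda c: hypot(c[0] - last[0], c[1] - last[1]))
        if hypot(best[0] - last[0], best[1] - last[1]) < max_dist:
            track.append((window_index, best))
            unmatched.remove(best)
    tracks.extend([(window_index, c)] for c in unmatched)
    return tracks


tracks = []
link_centroids(tracks, [(10, 10)], 0, max_dist=5)
link_centroids(tracks, [(12, 10), (100, 100)], 1, max_dist=5)
link_centroids(tracks, [(13, 10)], 2, max_dist=5)
```

After the three calls, the first track contains three linked center points, while the far-away center at (100, 100) forms its own track with a single entry.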
4 Proof of Concept
The presented cluster-based tracking approach on event-based vision information currently focuses on the special use case of a fixed-mounted sensor and moving objects in the monitored scene. Because the research area of event-based computer vision is fairly new, there is a lack of well-known standard databases covering various use cases and different DVS resolutions and characteristics.
Hu et al. [16] describe an approach to convert “classical” frame-based vision datasets into synthetic event-based datasets. However, the converted databases do not address the described use case of object tracking, and the conversion simulates a DVS with a smaller sensor resolution than the one used in our practical experiments (see Sect. 2). Hence, creating synthetic converted data for our specific sensor would produce artefacts. We therefore decided to use a small, self-recorded database for the first proof of concept of the proposed approach. Table 2 briefly summarizes the considered scenes within this dataset.
The following subsections present an alternative tracking approach using a frame-based imaging technique which is compared with the proposed DVS-clustering method.
4.1 Comparative Approach: Difference Image
Due to privacy concerns that need to be considered (compare with the project description in the introduction), it is not possible to use “classical” greyscale or color images to monitor the desired space. One possible option from the field of image processing is the approach of using difference images and binarization.
For this purpose, a recording of the background (the scene without any objects, see Fig. 6a) is taken. Each frame of the actual recording (see Fig. 6b) is compared with this background by calculating the difference between the two images. To address the privacy concerns, this difference image can be binarized (see Fig. 6c), so that no restoration of color or greyscale values is possible.
Similar to the described filtering of DVS events, this approach also allows filtering as an additional step. In this case, morphological operations are one possible choice. Figure 6d shows the filtered result obtained with a morphological opening operation using a 3 \(\times \) 3 cross-shaped structuring element.
Based on these images, well-known computer vision object trackers can be used. For comparison, we used the implementations in the OpenCV library [17].
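The difference-image pipeline (difference, binarization, opening with a 3 × 3 cross) can be illustrated in plain Python on nested lists. This is a didactic sketch only: in practice, library routines would be used, and the binarization threshold of 30 as well as the handling of image borders (out-of-image pixels are ignored) are assumptions.

```python
# Offsets of a 3 x 3 cross-shaped structuring element (centre included).
CROSS = ((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0))


def binary_difference(frame, background, thresh):
    """1 where frame and background differ by more than thresh, else 0."""
    return [[1 if abs(f - b) > thresh else 0 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]


def _apply(img, op):
    """Apply min (erosion) or max (dilation) over the cross neighbourhood."""
    h, w = len(img), len(img[0])
    return [[op(img[r + dr][c + dc] for dr, dc in CROSS
                if 0 <= r + dr < h and 0 <= c + dc < w)
             for c in range(w)] for r in range(h)]


def opening(img):
    """Morphological opening: erosion followed by dilation."""
    return _apply(_apply(img, min), max)


background = [[0] * 5 for _ in range(5)]
frame = [row[:] for row in background]
for r, c in ((2, 2), (1, 2), (3, 2), (2, 1), (2, 3), (0, 4)):
    frame[r][c] = 200  # a cross-shaped object plus one isolated noise pixel

mask = binary_difference(frame, background, thresh=30)
opened = opening(mask)
```

In this toy example the opening removes the isolated pixel at the image corner and preserves the cross-shaped object.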
4.2 Comparison: Event-Clustering and OpenCV-Tracker
In contrast to the presented clustering procedure on the DVS event data, the OpenCV tracker implementations require a bounding box that contains the object to be tracked as an input parameter. We therefore decided to compare the two approaches on the basis of the tracked path of this selected object. This means that, for this proof-of-concept comparison, single-object tracking is performed, although the DVS clustering-based approach could track multiple objects in a scene.
For both approaches, the algorithmically determined object center is compared to a manually defined ground-truth position. For the DVS clustering, this object center is the cluster center point; for the OpenCV trackers, the center point of the points within the returned object bounding box is used. By estimating this center point over consecutive sliding time windows (or the generated and filtered binary images), an object path is determined. An example is given by the red line in Fig. 5.
For the quantitative comparison of these paths with the ground-truth path, the dynamic time warping (DTW) distance is used [18]. This distance measure allows similarity to be determined even for paths of different lengths.
Figure 7 shows the calculated distances for the DVS-clustering approach and for six different object trackers implemented in OpenCV (for details, see the OpenCV documentation). The distance values for each of the OpenCV trackers are averaged over 15 different executions with identical initial parameters to compensate for stochastic effects within some of the trackers. In the considered footage (compare with Table 2), our event-based approach is, with one major exception, comparable to or better than the OpenCV trackers.
The presented approach fails in Rec7 due to the event filtering step. Figure 8a shows the unfiltered events generated by the object and Fig. 8b contains the corresponding filtered events, whereas Figs. 8c and d show that too many events are removed in the further course of the recording. As a result, the minimum cluster size condition (compare with Sect. 3.2) is not met, and the continuous tracking of the object is lost.
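A textbook formulation of the DTW distance between two 2D paths is sketched below. The Euclidean local cost and the unconstrained step pattern are assumptions; the variant and any normalization actually used for the evaluation are not specified in the text.

```python
from math import hypot, inf


def dtw_distance(path_a, path_b):
    """Classic dynamic time warping distance between two lists of (x, y) points."""
    n, m = len(path_a), len(path_b)
    # cost[i][j]: minimal accumulated cost aligning the first i and j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = hypot(path_a[i - 1][0] - path_b[j - 1][0],
                      path_a[i - 1][1] - path_b[j - 1][1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat point of path_b
                                 cost[i][j - 1],      # repeat point of path_a
                                 cost[i - 1][j - 1])  # advance both paths
    return cost[n][m]


tracked = [(0, 0), (1, 0), (2, 0)]
ground_truth = [(0, 0), (1, 0), (1, 0), (2, 0)]
distance = dtw_distance(tracked, ground_truth)
```

Because one point may be aligned with several points of the other path, the two paths in the example have a distance of zero although their lengths differ.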
5 Conclusion
We presented an initial approach to track moving objects by clustering and tracing their center points based on event data from a Dynamic Vision Sensor. Our method is simple and fast, while respecting possible privacy concerns of depicted persons.
The method is currently implemented on standard PC hardware. Since filtering the event data significantly reduces the number of events, outsourcing the filter stage to an FPGA would enable implementation on less powerful microprocessors.
Another important aspect for further research is the creation and publication of a larger event-based database with corresponding ground truth annotations to allow a systematic evaluation that goes beyond the presented proof-of-concept test.
Also, the improvement of the presented approach itself is an aspect for further research. The time information for each event (which has a resolution of milliseconds) is mostly unused in the current approach, which represents potential for additional improvement.
References
1. Ranftl, A., Alonso-Fernandez, F., Karlsson, S.: Face tracking using optical flow. In: International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, pp. 1–5 (2015)
2. Liu, Y., Lu, Y., Shi, Q., Ding, J.: Optical flow based urban road vehicle tracking. In: Ninth International Conference on Computational Intelligence and Security, Leshan, pp. 391–395 (2013)
3. Dan, L., Dai-Hong, J., Rong, B., Jin-Ping, S., Wen-Jing, Z., Chao, W.: Moving object tracking method based on improved Lucas-Kanade sparse optical flow algorithm. In: International Smart Cities Conference (ISC2), Wuxi, pp. 1–5 (2017)
4. Bukey, C.M., Kulkarni, S.V., Chavan, R.A.: Multi-object tracking using Kalman filter and particle filter. In: IEEE International Conference on Power, Control, Signals and Instrumentation Engineering (ICPCSI), Chennai, pp. 1688–1692 (2017)
5. Mu, X., Che, J., Hu, T., Wang, Z.: A video object tracking algorithm combined Kalman filter and adaptive least squares under occlusion. In: 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, pp. 6–10 (2016)
6. Najafzadeh, N., Fotouhi, M., Kasaei, S.: Object tracking using Kalman filter with adaptive sampled histogram. In: 23rd Iranian Conference on Electrical Engineering, Tehran, Iran, pp. 781–786 (2015)
7. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. Technical report arXiv:1804.02767 (2018)
8. Mocanu, B., Tapu, R., Zaharia, T.: Single object tracking using offline trained deep regression networks. In: Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), Montreal, pp. 1–6 (2017)
9. Behrendt, K., Novak, L., Botros, R.: A deep learning approach to traffic lights: detection, tracking, and classification. In: IEEE International Conference on Robotics and Automation (ICRA), Singapore, pp. 1370–1377 (2017)
10. Mehta, S., Patel, A., Mehta, J.: CCD or CMOS image sensor for photography. In: International Conference on Communications and Signal Processing (ICCSP), Melmaruvathur, pp. 0291–0294 (2015)
11. Mahmood Rajpoot, Q., Jensen, C.: Video surveillance: privacy issues and legal compliance. In: Promoting Social Change and Democracy through Information Technology. IGI Global (2015). ISBN 9781466685024
12. Lichtsteiner, P., Posch, C., Delbruck, T.: A 128 \(\times \) 128 120 dB 15 \(\upmu \)s latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 43(2), 566–576 (2008)
13. Delbrück, T., Linares-Barranco, B., Culurciello, E., Posch, C.: Activity-driven, event-based vision sensors. In: Proceedings of 2010 IEEE International Symposium on Circuits and Systems, Paris, pp. 2426–2429 (2010)
14. Guo, M., Huang, J., Chen, S.: Live demonstration: a 768 \(\times \) 640 pixels 200Meps dynamic vision sensor. In: IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, p. 1 (2017)
15. Müllner, D.: Fastcluster: fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Softw. 53(9), 1–18 (2013)
16. Hu, Y., Liu, H., Pfeiffer, M., Delbruck, T.: DVS benchmark datasets for object tracking, action recognition, and object recognition. J. Front. Neurosci. 10, 405–410 (2016)
17. Bradski, G.: The OpenCV library. Dr. Dobb's J. Softw. Tools 25, 120–125 (2000)
18. Müller, M.: Dynamic time warping. In: Information Retrieval for Music and Motion, pp. 69–84. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74048-3_4. ISBN 9783540740483
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bolten, T., Pohle-Fröhlich, R., Tönnies, K.D. (2019). Application of Hierarchical Clustering for Object Tracking with a Dynamic Vision Sensor. In: Rodrigues, J., et al. Computational Science – ICCS 2019. ICCS 2019. Lecture Notes in Computer Science(), vol 11540. Springer, Cham. https://doi.org/10.1007/978-3-030-22750-0_13
DOI: https://doi.org/10.1007/978-3-030-22750-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22749-4
Online ISBN: 978-3-030-22750-0
eBook Packages: Computer Science, Computer Science (R0)