1 Introduction

Recent advances in sensing technologies have resulted in ever-larger numbers of scientific and technical data sets being created at ever-higher fidelity. As with many other data types, this trend holds true for trajectory data, broadly defined as timestamped sequences of positions for uniquely identified objects. Trajectory data can be obtained from many different sources. Remote sensing systems track the movement of many objects through areas of sensor coverage. Individual objects such as taxi cabs, aircraft, seacraft, mobile phones, humans carrying GPS receivers, and even the occasional bear, turtle or penguin also record and broadcast their own position and movement. Video sensors (traffic or security cameras) combined with machine vision algorithms can track the appearance, motion and disappearance of objects in a scene. As with so many other real-world data sources, the ongoing explosion of trajectory data shows no signs of abating.

Trajectories are consumed and analyzed by a broad variety of research communities. Biologists use them to study wildlife movement [14]. Sociologists study human behavior [17]. Transit agencies and travel websites mine taxi trajectories to search for better driving directions [16]. Molecular dynamicists track molecules and individual atoms to study proteins, polymers, and transitions between different molecular conformations.

As a specific example, consider the Automatic Identification System (AIS) currently used for maritime ship tracking and cooperative collision avoidance. Given the amount of world commerce that depends on container ships and tankers, marine traffic safety and security is of incalculable importance – a perfect domain for trajectory analysis. Pallotta et al. [13] present an algorithm for automatically determining shipping routes given AIS ship trajectories for the purposes of anomaly detection and route prediction.

Both prediction and anomaly detection can be used to aid collision avoidance. However, such algorithms can also be extended to address the broader question of classifying ship behavior. Establishing patterns of normal and anomalous behavior is as useful to the large-scale problems of efficiency, security and resource allocation as tracking nearby objects for collision avoidance is at small scale.

Commercial airlines use similar pattern-finding analyses [1] to optimize the use of airspace resources such as flight paths, fuel and takeoff/landing permits. They are aided by the Aircraft Situation Display to Industry (ASDI) data set, a metadata-rich feed that provides frequent updates on nearly all non-sensitive civilian traffic in US airspace.

All of these analyses incorporate an assumption that we can identify and isolate patterns within the data. This brings us back to classic machine learning problems: finding instances of specifically described behavior, clustering items into groups (with or without ground truth), finding items similar to a provided exemplar, and labeling new data based on the information in a training set. Metadata (any information in the data feed beyond position, time, and object ID) and derived data (speed, heading, distance from origin, and other quantities computed from the original data) help with this, but to date most of the job of labeling trajectories as normal or anomalous still relies on human knowledge and manual effort. Since humans still do not scale well, we turn to algorithmic approaches.

Many approaches to trajectory analysis compare trajectories point by point, computing a distance that involves every position along each trajectory. Amongst others, these approaches include hidden Markov models [3], Hausdorff-like distances [8], Bayesian models [10], the earth-mover’s distance [4], and Fourier descriptors [2]. Alternatively, the approach in [15] uses trajectory features to provide comparisons that appeal to human intuition while significantly reducing the computational cost, allowing timely analysis of large sets of trajectory data.

In this paper we report on an experimental analysis of a large set of aircraft trajectories using feature vectors. For clarity, we formalize our definition of a trajectory as follows:

A trajectory \({\mathrm Z}\) is a sequence of n points \({\mathrm z_1} \ldots {\mathrm z_n}\). Each point \({\mathrm z_i}\) exists in a normed vector space and is annotated with a timestamp \(t_i\). We assume that the moving object described by \({\mathrm Z}\) moves directly from each point \({\mathrm z_i}\) to its successor \({\mathrm z_{i+1}}\) with uniform velocity, leaving \({\mathrm z_i}\) at \(t_i\) and arriving at \({\mathrm z_{i+1}}\) at \(t_{i+1}\).
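The uniform-velocity assumption makes positions at arbitrary times straightforward to recover. As an illustrative sketch (the class and method names are our own, not from any particular library), a trajectory can be stored as parallel lists of points and timestamps, with linear interpolation between samples:

```python
import bisect

class Trajectory:
    """A timestamped sequence of points; motion between successive
    points is assumed to be straight-line at uniform velocity."""

    def __init__(self, points, timestamps):
        # parallel lists: (x, y) tuples and strictly increasing times
        assert len(points) == len(timestamps) >= 2
        self.points = points
        self.timestamps = timestamps

    def position_at(self, t):
        """Linearly interpolate the position at time t."""
        ts = self.timestamps
        if not ts[0] <= t <= ts[-1]:
            raise ValueError("t outside trajectory time span")
        i = bisect.bisect_right(ts, t)
        if i == len(ts):              # t equals the final timestamp
            return self.points[-1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        f = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
        return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
```

Requests outside the trajectory's time span raise an error rather than extrapolating, which matches the definition above: the object exists only between \(t_1\) and \(t_n\).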

We now present a brief survey of previous work on trajectory analysis and clustering before describing our own experiments and their results.

2 Previous Work

The problem of classifying trajectories in an unsupervised manner has been tackled in a number of different ways, mostly centered on classic machine learning techniques. Hidden Markov Models (HMMs) have been applied to groups of trajectories [12] in order to divide them into different patterns. This has been reasonably effective, although the model-building process can be time-consuming and it can sometimes be difficult to understand the underlying reasons for the assignment of trajectories to patterns. Similar approaches using Bayesian models have also been employed [10]. One can also define a distance metric between two individual trajectories and apply clustering techniques that operate on pairwise distances alone [4]. Some clustering techniques can be applied given only a distance matrix over all of the items, but that matrix can be large and computationally expensive to build, and relying on it limits the types of clustering techniques that can be used.

One of the primary problems in applying many types of unsupervised learning is that many of the techniques, especially clustering, most naturally apply to n-dimensional vectors of data. The problem then becomes one of translating the trajectory representation into a vector of real numbers to which a simple and meaningful distance metric can be applied for clustering purposes. Annoni and Forster [2] have demonstrated this using Fourier descriptors to build the representations of the data. A related approach [11] represents the trajectories not as a vector of points but as a vector of line segments extracted from the straighter parts of the trajectory. However, that approach is inherently focused on trajectories that are naturally geographically aligned.

The unsupervised learning method we propose here has much in common with these previous works, but is tailored to have these key properties:

  1. It must be fast enough to work at scales greater than \(10^6\) trajectories.

  2. It must be generic enough to use with diverse types of trajectories.

  3. It must not require a large number of user-tunable parameters in order to work effectively.

  4. The separation into different clusters should have a physically meaningful and intuitive interpretation.

  5. It must not require geometric alignment of input trajectories in order to perform similarity calculations.

3 Overview

3.1 Intended Workflow

Our intent was to model a workflow that would let an investigator make sense of a large body of trajectories through a process of generating cluster labels, inspecting the results, extracting a subset and computing new cluster labels. In order to create such a workflow we need a useful characterization of a trajectory’s features and an unsupervised clustering algorithm that can operate on those features. For the purposes of this paper we will focus on the clustering step rather than the query and subset steps of the proposed workflow.

3.2 Feature Vectors

We describe trajectories using feature vectors in order to focus the clustering on characteristics of our own choice instead of point-by-point alignment. We tried two different kinds of feature vectors: one using the idea of distance geometry and one using several different geometric features. The main advantage of feature vectors over a divergence score computed directly between trajectories is that they allow us to use popular clustering algorithms that require (and benefit from) normed vector spaces.

Distance Geometry. We found feature vectors based on distance geometry [6] to be most expressive of the available options. We compute a distance geometry signature as follows:

  1. Select an integer \(k \ge 1\).

  2. For each \(1 \le j \le k\), select \(j+1\) uniformly spaced sample points along the trajectory \({\mathrm Z}\) to yield sample points \(s_0, s_1, \ldots, s_j\), where \(s_0\) and \(s_j\) are the beginning and end points of the trajectory, respectively. Interpolate between trajectory points \(z_i\) as necessary to construct these points.

  3. For each \(j\) where \(1 \le j \le k\), calculate the straight-line distances \(d_{j,01}, d_{j,12}, \dots, d_{j,(j-1)j}\) between consecutive sample points, one for each of the \(j\) sub-trajectories.

  4. Normalize the distances by dividing each \(d_{j,(m-1)m}\) by the arc length of the portion of the trajectory between \(s_{m-1}\) and \(s_m\). For example, \(d_{1,01}\) will be divided by the length of the entire trajectory, while \(d_{j,(m-1)m}\) will be divided by one \(j\)th of the overall length.

  5. Arrange these normalized distances into a vector. The actual order does not matter as long as it is consistent from one trajectory to the next. For convenience, we usually place \(d_{1,01}\) first, followed by \(d_{2,01}\) and \(d_{2,12}\), and so forth, with the distances \(d_{k,(k-1)k}\) at the end of the vector.

Observe that at a conceptual level, these signatures quantify how straight the trajectory is at different scales and different positions. The resulting signatures allow shape-based comparison independent of translation, rotation, reflection and uniform scale. This proves to be a major advantage. At the same time, it is also the major disadvantage of distance geometry since it is unaware of any property more complex than point-to-point distance. Finally, we note that a distance geometry signature with parameter k (and thus up to \(k+1\) sample points) yields a feature vector with \(\frac{k(k+1)}{2}\) components. This raises the specter of the curse of dimensionality as we experiment with larger values of k.
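The five-step procedure above can be sketched in a few lines of Python for planar trajectories. This is a minimal illustration under our own naming; it assumes the trajectory has positive arc length.

```python
import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def _point_at_fraction(points, cum, frac):
    """Point at a given fraction (0..1) of total arc length."""
    target = frac * cum[-1]
    for i in range(1, len(cum)):
        if target <= cum[i]:
            seg = cum[i] - cum[i - 1]
            f = 0.0 if seg == 0 else (target - cum[i - 1]) / seg
            (x0, y0), (x1, y1) = points[i - 1], points[i]
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))
    return points[-1]

def dg_signature(points, k):
    """Distance geometry signature: k(k+1)/2 normalized straight-line
    distances over 1..k uniform subdivisions of the trajectory."""
    cum = [0.0]                      # cumulative arc length at each point
    for a, b in zip(points, points[1:]):
        cum.append(cum[-1] + _dist(a, b))
    total = cum[-1]
    sig = []
    for j in range(1, k + 1):
        # j + 1 uniformly spaced sample points s_0 .. s_j
        samples = [_point_at_fraction(points, cum, m / j)
                   for m in range(j + 1)]
        for m in range(1, j + 1):
            # normalize by the arc length of this portion: total / j
            sig.append(_dist(samples[m - 1], samples[m]) / (total / j))
    return sig
```

A perfectly straight trajectory yields a signature of all ones at every scale, while an out-and-back trajectory drives the coarsest (\(j=1\)) component toward zero, which is exactly the multiscale straightness measure described above.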

Geometric Features. Besides distance geometry signatures, it is often useful to define other functions that measure specific geometric features within a trajectory. These provide traction when an analyst has a specific property in mind that helps differentiate among trajectory behaviors. We have used the following features in the past:

  • End-to-end distance of the trajectory

  • Total length of trajectory

  • Total curvature (sum of signed turn angles)

  • Properties of the convex hull of the trajectory including:

    • Area

    • Aspect ratio

    • Centroid

    • Perimeter

In general, there are not many requirements for a geometric feature. The two primary ones are that it should be simple to calculate in one pass through the data, and that it should be relatively insensitive to point sampling frequency and spatial noise on the trajectory. The second condition is usually the more difficult one to fulfill. For example, the total curvature is relatively insensitive to fluctuations in the heading because the errors tend to cancel each other out. However, if one were interested in the total amount of absolute heading change between points (to attempt to distinguish straight flights from meandering flights), any sort of fluctuation in the heading will strongly affect the outcome. This measure is also very sensitive to the sampling rate along the curve.
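Several of the features listed above can be computed in one pass, as required. A minimal sketch for a 2-D trajectory, assuming at least three points and our own naming:

```python
import math

def geometric_features(points):
    """End-to-end distance, total length, and total curvature
    (sum of signed turn angles, radians) of a 2-D trajectory."""
    end_to_end = math.dist(points[0], points[-1])
    length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    curvature = 0.0
    for (ax, ay), (bx, by), (cx, cy) in zip(points, points[1:], points[2:]):
        h1 = math.atan2(by - ay, bx - ax)   # heading of first segment
        h2 = math.atan2(cy - by, cx - bx)   # heading of second segment
        # wrap the turn angle to [-pi, pi) so turns stay signed and minimal
        turn = (h2 - h1 + math.pi) % (2 * math.pi) - math.pi
        curvature += turn
    return {"end_to_end": end_to_end, "length": length,
            "total_curvature": curvature}
```

Note that summing the signed turn angles (rather than their absolute values) is what gives total curvature its noise resistance: small alternating heading jitters cancel instead of accumulating.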

3.3 Clustering

Two principles guide our selection of a clustering algorithm. First, in analysis and sensemaking tasks, the category of “things not similar to any we have seen before” is especially important since it may indicate the presence of new or overlooked classes of behavior. Since this is the very definition of an outlier, we prefer clustering algorithms that can explicitly identify outliers. Second, since we have no idea how many interesting classes exist in the data, we prefer nonparametric clustering algorithms that can choose a number of clusters on the fly.

Both of these criteria point strongly toward density-based clustering algorithms such as DBSCAN [7]. We choose HDBSCAN [5] for both of these properties as well as its ability to identify clusters with varying densities.
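To illustrate why explicit outlier labels matter, here is a minimal, unoptimized sketch of classic DBSCAN — not the HDBSCAN variant we actually use, and not McInnes's implementation. Points that are not density-reachable from any core point receive the label -1, which is the "things not similar to any we have seen before" category discussed above.

```python
def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns one label per point; -1 marks noise
    (points not density-reachable from any core point)."""
    def d(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def neighbors(i):
        return [j for j in range(len(X)) if d(X[i], X[j]) <= eps]

    labels = [None] * len(X)          # None = not yet visited
    cluster = -1
    for i in range(len(X)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:    # j is also a core point: expand
                queue.extend(nb)
    return labels
```

HDBSCAN removes the global `eps` parameter by building a cluster hierarchy over varying density levels, which is why it satisfies our second criterion (no fixed number of clusters, few tunable parameters) better than this fixed-radius sketch.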

4 Implementation and Results

4.1 Data Set

We tested our approach on a database of aircraft trajectories. These were originally collected from the Aircraft Situation Display to Industry (ASDI) feed that originates at the US Federal Aviation Administration.

We perform minimal preprocessing of the data. First, we discard obviously invalid points (those missing coordinates, an ID or a timestamp). Second, we assemble points into trajectories by grouping them by flight IDs and sorting by timestamp. We begin a new trajectory for a given flight ID when we find a delay of more than 20 min or a distance of more than 100 km between successive points. These numbers were arbitrarily chosen but correspond well with observed practice at busy airports. Third, we discard trajectories that last less than one hour as uninteresting.
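The gap-splitting rule can be sketched as follows. This is an illustrative sketch with our own names, and it assumes planar coordinates in kilometers rather than the raw latitude/longitude positions in the ASDI feed (a real implementation would use great-circle distances).

```python
import math

MAX_GAP_SECONDS = 20 * 60      # 20 min between successive points
MAX_GAP_KM = 100.0             # 100 km between successive points

def split_into_trajectories(records):
    """records: list of (timestamp_s, x_km, y_km) for ONE flight ID,
    already sorted by timestamp. Start a new trajectory whenever the
    time or distance gap between successive points is too large."""
    trajectories, current = [], []
    for rec in records:
        if current:
            t0, x0, y0 = current[-1]
            t1, x1, y1 = rec
            too_late = (t1 - t0) > MAX_GAP_SECONDS
            too_far = math.hypot(x1 - x0, y1 - y0) > MAX_GAP_KM
            if too_late or too_far:
                trajectories.append(current)
                current = []
        current.append(rec)
    if current:
        trajectories.append(current)
    return trajectories
```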

We tried two separate databases of trajectories. The smaller one contains 83,431 trajectories that fall on July 10 and 11, 2013. These dates are notable for a large system of thunderstorms that swept through the central and eastern United States and caused widespread disruption at several major airports. The larger set contains 359,940 trajectories from July 1–15, 2014. All trajectories were at least one hour long.

4.2 Feature Vectors

First, we tried clustering trajectories using distance geometry signatures with a maximum of 5 \((k=4)\), 10 \((k=9)\) and 15 \((k=14)\) sample points. Since the length of each feature vector is proportional to the square of the number of sample points we expected runtimes to increase accordingly. The resulting feature vectors had dimension 10, 45 and 105.

We also tried feature vectors made from various combinations of a trajectory’s length, end-to-end distance and convex hull aspect ratio.

4.3 Runtime

For the purposes of this paper, runtime is not our primary concern. However, most current implementations of HDBSCAN have worst-case algorithmic complexity \({\mathrm O(n^2)}\) with respect to the number of items being labeled. We emphasize worst case because in our experience the runtime depended more on the dimension of the feature space than on the number of items. See Table 1 for our observations.

Table 1. Typical runtimes and cluster size results for selected data sets. N refers to the total number of trajectories and DG refers to the maximum number of distance geometry points used. We believe that the large difference in runtime between 10 and 15 distance geometry samples is a symptom of choosing too many sample points for distance geometry.

5 Discussion

Our first observation was that distance geometry was far better at discriminating different behavior than any of the combinations of individual features that we tried. We attribute this strength to its multiscale representation of trajectory shape, an attribute not yet captured with our approach to custom features. The geometric features were able to distinguish the approximate straightness or roundness of a trajectory but little beyond that.

Second, we note that the runtime of the clustering algorithm depends only indirectly upon the number of items being labeled. We conjecture instead that runtime is dominated by the HDBSCAN step where a minimum spanning tree is computed over a (notionally complete) distance matrix. Although McInnes’s HDBSCAN implementation is heavily optimized and makes use of spatial data structures and parallelization wherever possible, the minimum spanning tree algorithm is still \({\mathrm O(E)} = {\mathrm O(n^2)}\) for a complete graph. We further conjecture that this distance graph grows denser as distances between feature vectors become more uniform – a known symptom of the curse of dimensionality. We do not yet understand fully how to choose the number of distance geometry samples for an entire corpus. If we choose too few samples then we can identify only coarse classes of behavior. If we choose too many, too much behavior winds up lumped into the large central cluster.

Fig. 1. The “noise” cluster from HDBSCAN contains plenty of interesting behavior. Since the members of this cluster are (by definition) unlike other items in the database we will need additional foraging and sensemaking tools for the user. These additional tools may be as simple as clustering using different feature measures. This particular cluster contains 2,153 trajectories and was identified with 10-point distance geometry over 83,000 trajectories.

Fig. 2. Out-and-back flights (those that leave one place, fly to another, then return to their origin) are particularly easy for distance geometry to pick out. This set contains 557 trajectories identified with 10-point distance geometry over 83,000 trajectories.

Fig. 3. A portion of a cluster of 683 flights exhibiting strong back-and-forth behavior typical of mapping and survey flights. This particular set of trajectories leaves from and returns to a small airfield near Bend, Oregon.

Fig. 4. Many instances of a single flight with a distinct shape that travels from Chicago to Detroit. Note that distance geometry pulls in additional flights with very similar shapes (Chicago to New York; New Haven, Connecticut to Philadelphia; and a single flight from Manchester, New Hampshire to Philadelphia). The differing sizes and orientations of these flights illustrate the invariance of distance geometry signatures under rigid transformation.

Now we turn to the actual clustering results. In each case, the cluster labels were dominated by one very large class comprising about 90 % of the input trajectories. Upon inspecting the distance geometry signatures we found that the members of this class were flights that took off, flew directly to their destination, maneuvered very briefly and then landed. This makes sense: it is the most efficient way to operate a commercial airline where time and fuel are both major expenses.

The outlier class also contained a wealth of diverse, interesting behavior. It typically contained about 5 % of the trajectories in the data set. Since the outliers are (by definition) data points that did not fit into any cluster, further exploration must incorporate other approaches. Here we appeal to our notional workflow where an analyst has information foraging and query tools available.

Further classes identified using distance geometry tend to be compact and well defined. As we had hoped, distance geometry was able to identify trajectories that not only shared common behavior but shared it at similar distances along the trajectory. Figures 1, 2, 3, 4 and 5 show some examples. In each figure, the red end of a trajectory is its origin and the blue end is its destination.

Fig. 5. Part of a cluster of 17 flights that all include a holding pattern (oval loop) at approximately the same point in their travel. This cluster was extracted using 15 distance geometry samples (a 105-dimensional feature space), offering very high sensitivity to localized features at the cost of runtime and the influence of the curse of dimensionality. This cluster came from the larger set of 360,000 trajectories.

The invariance of distance geometry signatures to rigid transformations was not as great a strength as we had hoped. While an out-and-back trajectory 100 km long has the same shape as one 2000 km long, an observer might reasonably conclude that their behavior was quite different. The former is a quick hop lasting less than an hour. The latter is a multi-hour odyssey. This suggests that we remove some of the invariance by adding a term to the feature vector that represents trajectory length. If done right, this would discourage the clustering algorithm from grouping trajectories with similar shape but vastly different scales. As usual, “if done right” is likely to be the hard part.
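One simple way to remove some of the invariance, sketched here under our own naming, is to append a log-scaled length term to each signature:

```python
import math

def add_length_term(signature, trajectory_length_km, weight=1.0):
    """Append a log-scaled length feature so that trajectories with
    similar shapes but very different extents are pushed apart."""
    return signature + [weight * math.log10(trajectory_length_km)]
```

The log keeps a 100 km hop and a 2000 km odyssey separable without letting absolute length swamp the shape features, but the weight still has to be tuned against the unit-scale distance geometry components, which is the “if done right” part.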

6 Future Work

The approach we have described in the previous sections forms a fundamental base from which many potential improvements and extensions could derive. It is important to first note that trajectory clustering, like many applied machine learning techniques, can be difficult to judge on an objective basis. In some cases there is ground truth associated with the trajectories (such as aircraft type) that can be used, but such labels do not necessarily divide the trajectories into obviously different shapes. In other cases the results of the clustering could be judged by very different metrics depending on the interests of the user. Any notion of improvement must be tempered by the application.

One of the primary difficulties in using many unsupervised machine learning algorithms is that a number of choices must ultimately be made about how the data is structured and how algorithm parameters are set. This work focuses primarily on distance geometry features because of their generality in describing trajectory shapes, but any numerical features could be used. However, with too many features the algorithms run much more slowly; worse, the effects of important features are lost to the “curse of dimensionality” [9]. Techniques based on variable correlation studies could be used to reduce the dimensionality of the feature space, and linear algebra techniques such as Principal Component Analysis (PCA) can create more economical feature sets. Another potential improvement would be a weighted metric on the feature space that scales distances along each dimension by a per-dimension weight, since not all dimensions are equally informative.
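The weighted metric suggested above could be as simple as a per-dimension scaling of the Euclidean distance. This is only a sketch; the weights themselves would have to come from the analyst or from a correlation study.

```python
def weighted_distance(u, v, weights):
    """Euclidean distance with a non-negative weight per dimension;
    a weight of 0 removes that feature's influence entirely."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)) ** 0.5

def prescale(vec, weights):
    """Equivalently, pre-scale each feature vector by sqrt(weight) once
    and hand the scaled vectors to any Euclidean clustering code."""
    return [a * w ** 0.5 for a, w in zip(vec, weights)]
```

The pre-scaling form is attractive in practice because it requires no change to the clustering algorithm itself: HDBSCAN sees ordinary Euclidean distances over the rescaled vectors.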