1 Introduction

We consider the following space–time layout problem: Given an unorganized collection of N photographs taken by \(N_H\) photographers, capturing some dynamic event at \(N_V\) time steps roughly simultaneously, we would like to organize the collection into a space–time table. An example of such a space–time-organized table is illustrated in Fig. 1a. Each row represents the set of photographs taken from about the same position (likely by an individual photographer), and each column represents photographs taken at about the same time. We assume that the photographers took shots of a rather short event that occurred in a rather limited space–time domain, so the commonality among the photographs is high. Such a collection is likely to be found on social media networks, where the images typically lack metadata (GPS location, EXIF time tag, etc.). Moreover, as shown in [9], even if an image does contain a time tag, it cannot be used for our task due to insufficient accuracy.

We were inspired by two recent works. Dekel et al. [9] presented a photograph sequencing technique: given a set of images captured roughly from the same viewpoint, it determines the temporal order of the images. Averbuch et al. [2] presented a spatial ordering technique, where the images are captured roughly at the same moment but from different unknown viewpoints, and the challenge is to find the spatial order of the images. Our work aims to perform both spatial and temporal ordering simultaneously.

Fig. 1

a Space–time table generated by our algorithm using the wall dataset. b The embedding of the input points into two 3D domains. The coloring of the points represents the ground-truth clustering of the images into space and time clusters

Organizing such a collection of photographs boils down to two apparently independent clustering problems: one clusters the photographs by their spatial feature, namely the view direction, and the other by their temporal feature, which relies on the dynamic object in the scene. Without enforcing any constraints on the clustering, a typical result of two such independent clusterings is imperfect.

The space–time layout problem can also be regarded as an embedding problem. Here, the photograph collection has to be embedded into a given table. In this sense, our work is most similar to the recent works [12, 30] that present a method for arranging images into a grid according to their similarities. However, unlike these methods, we use two separate features to organize the images into rows and columns. The premise of our work is that better results can be achieved by using both features together than by treating each separately.

The technique that we present builds on the self-organizing map (SOM) [16], a neural network that embeds high-dimensional training data into a two-dimensional discrete domain. We introduce BiSOM, a variation of SOM that considers two features (space and time) rather than one, to embed the given photograph collection in the table. As we shall show, BiSOM does not consider the two features independently. The key advantage of using a SOM is that, unlike other clustering schemes, the intrinsic order among the clusters is accounted for in the organization of the clusters across the table. We develop an objective function that evaluates the quality of a given space–time table and define an optimization problem to distribute the input images into a coherent table.

2 Related work

Space–time Analysis Temporal analysis of visual data has been studied extensively in the context of video synchronization and alignment [5, 10]. These methods try to align sequences of the same dynamic scene captured by uncalibrated cameras from different viewpoints. Unlike these methods, in our problem the temporal order of the photographs is unknown. Dekel et al. [8, 9] introduced a novel method for temporally ordering a set of still images taken asynchronously by a set of uncalibrated cameras that are colocated in space and time.

Viewpoint analysis has received considerably more attention. It has primarily been studied in the context of structure-from-motion (SfM) [31], where the viewing parameters are recovered together with the reconstruction of the scene by considering the correspondences among many images of the same scene. More recent methods [4, 13, 20, 29], known as Photo Tourism, reconstruct a static scene from an unorganized set of tourist photographs, including the spatial positions of the photographers.

Averbuch et al. [2] present a technique to spatially order a set of images capturing a static scene, taken simultaneously by a group of people situated around the scene. This technique handles a single column of our space–time table, independently of the others, without analyzing the temporal domain.

The above space–time organization schemes are imperfect, so our technique focuses on employing global constraints to improve their separation performance.

Self-Organizing Maps Self-organizing maps (SOMs) were first introduced by Kohonen [16] as a method for visualizing low-dimensional views of high-dimensional data. Kiang [15] later showed how to extend self-organizing map networks for clustering analysis. SOMs have been further used in computer vision for image segmentation [23], image classification [18], clustering 2D shapes [11], face recognition [3] and more.

An important property of SOMs is that they are topology preserving, i.e., data vectors that are close in the input space will be close in the resulting map. This makes SOMs extremely valuable in data exploration tasks. Moehrmann et al. [21] use arbitrary features for clustering the images, and the resulting SOM is displayed to the user with one representative image per unit, allowing the user to navigate and explore the image collection quickly and intuitively.

Over the years, many modifications of the traditional SOM have been proposed [1, 6, 17], all presenting changes that affect only the learning rule. In our work, we rely on the SOM intuition and develop a novel technique that considers two features, rather than only one, when organizing the input elements into a low-dimensional domain.

Grid Layout Arranging images on a grid is a fundamental task in infographics, data visualization and collection exploration. A common approach to this task is to first perform a standard graph layout without the grid constraint and then align the images onto the grid points [12, 24, 30]. All of the above methods use a single metric function to measure the pairwise distance between images. Reinert et al. [25] introduce a method for organizing images on a grid according to two features, but their method is user-supervised and uses multiple image selections to define the separating features.

In our work, we solve the grid layout problem by using two separate and independent features and arrange the images into rows and columns according to space and time features, respectively. Furthermore, it should be emphasized that all the above methods consider a rather fuzzy measure of the correct grid layout, while in our work the ground-truth layout is well defined.

3 Overview

Given an unorganized set of N images \(\left\{ {{I_1}, \ldots , {I_N}} \right\} \), we would like to organize the images into a space–time table of size \(N_H\times N_V\), where each row represents images taken from about the same location and each column represents images taken at about the same time. Defining such a table is equivalent to clustering the images into (horizontal) rows \({\left\{ H_i\right\} }_{i=1}^{N_H}\) and into (vertical) columns \({\left\{ V_i\right\} }_{i=1}^{N_V}\).

First, the images are analyzed, and for each image two feature descriptors \({x_i^H}\) and \({x_i^V}\) are extracted. These descriptors aim to represent the horizontal and vertical properties of the images: images taken from close viewpoints have similar \({x_i^H}\) descriptors, while images taken from distant viewpoints have dissimilar descriptors. Similarly, the descriptor \({x_i^V}\) aims to be similar for images taken at about the same time. We assume that the features that separate time and space are known, but inaccurate; if they were accurate, the problem of clustering the images into rows and columns would be trivial. Further details on feature extraction are given in Sect. 7.

The \({x_i^H}\) and \({x_i^V}\) descriptors may be high-dimensional vectors. To visualize them, we compute all pairwise distances to define two similarity matrices and apply MDS to embed the input images into two 3D domains (Fig. 1b). Notice how the correct clustering in each domain is non-trivial.
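For illustration, a minimal sketch of this embedding step is given below, assuming the two pairwise distance matrices have already been computed; the use of scikit-learn's MDS and all variable names here are our own choices, not part of the original implementation.

```python
import numpy as np
from sklearn.manifold import MDS

def embed_for_visualization(dist, dim=3, seed=0):
    """Embed N images into a dim-dimensional domain from an (N, N) distance matrix."""
    mds = MDS(n_components=dim, dissimilarity='precomputed', random_state=seed)
    return mds.fit_transform(dist)              # (N, dim) coordinates

# D_H, D_V: pairwise distances of the spatial / temporal descriptors
# pts_H = embed_for_visualization(D_H)          # spatial embedding, as in Fig. 1b
# pts_V = embed_for_visualization(D_V)          # temporal embedding, as in Fig. 1b
```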

We then apply our BiSOM algorithm, described in Sect. 5. Our technique uses a grid of neurons, where each neuron is associated with two weight vectors \({m_i^H}\) and \({m_i^V}\); the network learns and clusters the data by optimizing the grid. During the training process, the neurons iteratively adjust their weights to the input points, while maintaining their inner relations, namely the grid structure. Besides the clustering of the images into space clusters and time clusters, the BiSOM algorithm outputs the order among the clusters, thus defining the temporal order of the images and the spatial order of the viewpoints. The output of the BiSOM algorithm is an assignment of the input elements to a grid of neurons, which defines an initial guess for our space–time table.

Finally, in Sect. 6 we develop an objective function based on the Normalized Cut [28] criterion that evaluates the quality of a given table. We use a heuristic relaxation process, which locally optimizes this objective function, while maintaining the order among the clusters, and yields the final space–time table (Fig. 1a).

4 Unary-SOM

The self-organizing map (SOM), introduced by Kohonen [16], is a method for embedding high-dimensional inputs into a two-dimensional grid. Let us denote the set of input elements as \(\left\{ {{x_i}} \right\} _{i = 1}^N\), where each element is a single feature vector \({x_i} \in {\mathbb {R}^D}\). Similarly, each neuron is associated with a weight vector of the same dimension, \({m_i} \in {\mathbb {R}^D}\). The neurons are organized in a four-connected grid structure of size \(N_H \times N_V\), reflecting the target dimensions of the table.

The training process is iterative: at each iteration t, all input elements are fed one by one to the network in a random order. The neuron closest to the input element, in terms of Euclidean distance, is chosen as the winning neuron \(m_c(t)\). The weights of \(m_c(t)\) and the neurons around it are updated to be closer to the value of the input element. At the end of the training process, the network of neurons has weights that reflect the input data; in effect, the input data are embedded into the structure of the neurons. We call this SOM Unary-SOM, since it considers a single feature only, whereas the BiSOM we develop later considers two features. The training algorithm of a Unary-SOM can be described as follows:

Algorithm 1: Unary-SOM training

The update of the neighboring neurons' weights is critical and is performed by the following rule:

$$\begin{aligned} {m_i}\left( {t + 1} \right) = {m_i}\left( t \right) + \alpha \left( t \right) \cdot {h_i}\left( t \right) \cdot \left[ {x\left( t \right) - {m_i}\left( t \right) } \right] \end{aligned}$$
(1)

The neighborhood function \(h_i\) can be considered as a smoothing kernel over the grid of neurons. The Unary-SOM uses an isotropic Gaussian centered around \(m_c(t)\), where \(r_i\) are the coordinates of the i-th neuron on the neuron grid and \(r_c\) are the coordinates of \(m_c(t)\):

$$\begin{aligned} {h_i}\left( t \right) = \exp \left\{ { - \frac{{{{\left\| {{r_i} - {r_c}} \right\| }^2}}}{{2{\sigma ^2}\left( t \right) }}} \right\} \end{aligned}$$
(2)

Figure 2a shows the architecture of a \(10\times 10\) grid of neurons and the definition of the neighborhood function \(h_i\). In general, the farther a neuron is from \({m_c}(t)\), the lower the amplitude of the Gaussian is, and hence, the lower is the updating rate of the neuron’s weight vector.

The Gaussian width also decreases over time: \(\sigma \left( t \right) = {\sigma _0} \cdot \exp \left\{ { - \frac{t}{\lambda }} \right\} \) is an exponentially decaying function of time. Generally speaking, in the first iterations of the training process the neighborhood update region covers the entire grid of neurons; as training progresses, the neighborhood size decreases until, eventually, only a single neuron is updated.

Furthermore, the amplitude of the update step also decays exponentially over time, as defined by the learning rate function \(\alpha \left( t \right) = {\alpha _0} \cdot \exp \left\{ { - \frac{t}{\lambda }} \right\} \).
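The following is a compact NumPy sketch of the Unary-SOM training loop described above (Eqs. 1 and 2, with the decaying \(\sigma(t)\) and \(\alpha(t)\)); the default parameter values are placeholders and not the ones used in our experiments.

```python
import numpy as np

def train_unary_som(X, N_H, N_V, T=1000, sigma0=3.0, alpha0=0.5, lam=200.0, seed=0):
    """X: (N, D) input vectors. Returns the neuron weights as an (N_H, N_V, D) array."""
    rng = np.random.default_rng(seed)
    M = rng.uniform(X.min(), X.max(), size=(N_H, N_V, X.shape[1]))      # neuron weights m_i
    U, V = np.meshgrid(np.arange(N_H), np.arange(N_V), indexing='ij')   # grid coordinates r_i
    for t in range(T):
        sigma = sigma0 * np.exp(-t / lam)          # shrinking neighborhood width
        alpha = alpha0 * np.exp(-t / lam)          # decaying learning rate
        for x in rng.permutation(X):               # feed inputs one by one, in random order
            d = np.linalg.norm(M - x, axis=2)
            uc, vc = np.unravel_index(np.argmin(d), d.shape)            # winning neuron m_c
            h = np.exp(-((U - uc) ** 2 + (V - vc) ** 2) / (2 * sigma ** 2))   # Eq. (2)
            M += alpha * h[..., None] * (x - M)    # Eq. (1)
    return M
```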

Fig. 2

Architecture of Unary-SOM (a) and BiSOM (b). The gray tone of the neurons represents the single feature that the Unary-SOM is trained to. The red and blue tones are the horizontal and vertical descriptors of the BiSOM algorithm. The current input element \(x_i\) is used to train the network, and the current winning neuron \(m_c\) is marked in white. Notice the isotropic (a) and anisotropic (b) neighborhood functions centered around the winning neuron

5 BiSOM

The BiSOM algorithm is based upon the Unary-SOM. As the name implies, it is designed to deal with input elements that are defined by a bi-vector with two descriptors, whereas the Unary-SOM deals with a single descriptor. We denote the descriptors of each input point as \({x_i^H},{x_i^V} \in {\mathbb {R}^D}\). Similarly, the neurons are also bi-vectors of the same dimensions, \({m_i^H},{m_i^V} \in {\mathbb {R}^D}\). The BiSOM algorithm is described in Algorithm 2.

Algorithm 2: BiSOM training
Fig. 3

Training process of the Unary-SOM (top row) and BiSOM (bottom row) algorithm, demonstrating the adjustment of the neural network grid to the input points (gray circles)

The major difference between the Unary-SOM and BiSOM is the update step, which consists of two consecutive steps:

$$\begin{aligned} {m_i^H}\left( {t + 1} \right)&= {m_i^H}\left( t \right) + \alpha \left( t \right) \cdot {h_i^H}\left( t \right) \cdot \left[ {x^H\left( t \right) - {m_i^H}\left( t \right) } \right] \\ {m_i^V}\left( {t + 1} \right)&= {m_i^V}\left( t \right) + \alpha \left( t \right) \cdot {h_i^V}\left( t \right) \cdot \left[ {x^V\left( t \right) - {m_i^V}\left( t \right) } \right] \end{aligned}$$
(3)

First, the neuron's horizontal weight vector \({m_i^H}\left( t \right) \) is updated using only \({x^H}\left( t \right) \) and the neighborhood function \({h_i^H}\left( t \right) \). The neuron's vertical weight vector is then updated in a similar manner.

Both neighborhood functions \({h_i^H} , {h_i^V}\) are defined as anisotropic Gaussians with a different scale in each dimension. We further discuss the scale factor \({\mathscr {F}}\) in Sect. 5.1. The location \(r_i=({u}_i,{v}_i)\) of the i-th neuron gives the row and column in which the neuron resides on the grid.

$$\begin{aligned} {h_i^H}\left( t \right)&= \exp \left\{ - \frac{{{\left( {u_i - u_c} \right) }^2}}{{\mathscr {F}}\cdot 2{\sigma }^2(t)} - \frac{{{\left( {v_i - v_c} \right) }^2}}{2{\sigma }^2(t)} \right\} \\ {h_i^V}\left( t \right)&= \exp \left\{ - \frac{{{\left( {u_i - u_c} \right) }^2}}{2{\sigma }^2(t)} - \frac{{{\left( {v_i - v_c} \right) }^2}}{\mathscr {F}\cdot 2{\sigma }^2(t)} \right\} \end{aligned}$$
(4)
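Below is a sketch of the two consecutive BiSOM update steps (Eqs. 3 and 4), reusing the NumPy import, the grid coordinates U, V and the decayed \(\sigma(t)\), \(\alpha(t)\) from the Unary-SOM sketch above. Since the text does not spell out how the winning neuron is chosen from two descriptors, summing the two distances is our assumption.

```python
def bisom_update(MH, MV, xH, xV, U, V, sigma, alpha, F=0.125):
    """One BiSOM update for a single input element (xH, xV).
    MH, MV: (N_H, N_V, D) horizontal / vertical neuron weights, updated in place."""
    # winning neuron: here, closest under the sum of the two descriptor distances
    d = np.linalg.norm(MH - xH, axis=2) + np.linalg.norm(MV - xV, axis=2)
    uc, vc = np.unravel_index(np.argmin(d), d.shape)
    du2, dv2 = (U - uc) ** 2, (V - vc) ** 2
    # anisotropic neighborhoods of Eq. (4):
    # h^H decays faster across rows, h^V decays faster across columns
    hH = np.exp(-(du2 / (F * 2 * sigma ** 2) + dv2 / (2 * sigma ** 2)))
    hV = np.exp(-(du2 / (2 * sigma ** 2) + dv2 / (F * 2 * sigma ** 2)))
    MH += alpha * hH[..., None] * (xH - MH)     # first step of Eq. (3)
    MV += alpha * hV[..., None] * (xV - MV)     # second step of Eq. (3)
```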

Figure 3 demonstrates the training process of both algorithms on a toy example. We use a set of 64 points as our input elements, where the (x, y) coordinates of each point are the \({x_i^V},{x_i^H}\) descriptors, respectively. To adapt the Unary-SOM to our setup, we concatenate the two descriptors into one large single descriptor \(x_i=[x_i^H ,x_i^V]\). The Unary-SOM algorithm (top row) is trained according to the input elements' locations, ignoring the valuable knowledge about their horizontal and vertical distributions. Conversely, the BiSOM algorithm (bottom row) maintains the neurons' grid structure during the whole process. The time complexity of Algorithm 2 is similar to that of Algorithm 1, since the only difference between the algorithms is the update step. For a full demonstration of the training process, see the accompanying video.

5.1 \(\mathscr {F}\) factor

The factor \(\mathscr {F} \in \left( {0,1} \right] \) defines the ratio between the major and minor axes of the anisotropic Gaussian in the update steps. In the first update step, we update the horizontal weights of the neurons using the neighborhood function \({h_i^H}\), with the horizontal axis as the major axis of the Gaussian (see Fig. 2b). The update magnitude decays faster across adjacent rows than across adjacent columns, thus encouraging neurons in \(m_c\)'s row to have similar H-weights. Similarly, in the second step, the neighborhood function \({h_i^V}\) yields a larger update magnitude for neurons residing in the winning column than for adjacent columns.

Basically, BiSOM is a generalized version of Unary-SOM that allows us to give different weights to the horizontal and vertical update steps. The ratio \( \mathscr {F}\) plays an important role in the training process and has a significant impact on the result.

Setting \(\mathscr {F} \rightarrow 1\) reduces the neighborhood function back to an isotropic Gaussian with \({\sigma _H} = {\sigma _V} = \sigma \left( t \right) \). The BiSOM algorithm then degenerates into a Unary-SOM whose descriptor is the concatenated vector \({m_i} = \left[ {m_i^H,m_i^V} \right] \). The horizontal and vertical features are treated equally during the update process; thus, each neuron is encouraged to be equally similar to each of its four neighbors. Eventually, this breaks down the grid structure, meaning that neurons residing in the same row/column are not necessarily similar.

On the other hand, setting \(\mathscr {F} \rightarrow 0\) degenerates the anisotropic Gaussian into a one-dimensional Gaussian. This implies that the first update step updates only the H-weights of the neurons in the winning row (the H-neighbors), and similarly for the second step. Eventually, the order between the rows and columns is violated.

In our work, we compromise between these two extreme settings, allowing the neurons to move independently enough to preserve the data structure on the one hand, while maintaining the grid structure on the other. To measure the trade-off between these two requirements, we compute the grid stress, the average distance between each neuron and its four neighbors, and the input stress, the average distance between each input element and its matching neuron. We would like to minimize both stress measures.
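The two stress measures can be computed as in the sketch below (NumPy as above); since the distance used to match an input element to its neuron is not specified for the two-descriptor case, summing the two descriptor distances is again our assumption.

```python
def grid_stress(MH, MV):
    """Average distance between each neuron and its four grid neighbors."""
    M = np.concatenate([MH, MV], axis=2)
    d_cols = np.linalg.norm(M[:, 1:] - M[:, :-1], axis=2)   # neighbors along a row
    d_rows = np.linalg.norm(M[1:, :] - M[:-1, :], axis=2)   # neighbors along a column
    return (d_cols.sum() + d_rows.sum()) / (d_cols.size + d_rows.size)

def input_stress(MH, MV, XH, XV):
    """Average distance between each input element and its matching (winning) neuron."""
    dists = [(np.linalg.norm(MH - xH, axis=2) + np.linalg.norm(MV - xV, axis=2)).min()
             for xH, xV in zip(XH, XV)]
    return float(np.mean(dists))
```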

Fig. 4

Grid stress and input stress as a function of \(\mathscr {F}\)

In Fig. 4, we elaborate on this and compute the grid stress and the input stress for different values of \(\mathscr {F}\), using different sets of input elements. The input stress increases linearly as \(\mathscr {F}\) increases, while the grid stress is mostly constant but rises sharply for \(\mathscr {F}<0.1\). We empirically set \(\mathscr {F}=0.125\) and use this value throughout all our experiments.

5.2 Assignment

At the end of the training process, we obtain a map of neurons, where each neuron represents an entry in the space–time table. Each input element is then assigned to the table entry of its nearest neuron. The assignment is not one-to-one, and some cells might remain empty or become overcrowded.

6 Refinement

In this section, we define an objective function \(f(T,S_H,S_V)\) that evaluates the quality of a space–time table T according to the similarity matrices \(S_H\) and \(S_V\). We then use a greedy local search to relax our initial guess and yield the final space–time table. We briefly summarize the N-Cut technique below and then elaborate on how it is used to define our objective function.

6.1 Normalized Cuts

Given a fully connected undirected weighted graph G(V, E), where the vertices V are points in the feature space and the edge weight w(i, j) is associated with the similarity between vertices i and j, the Normalized Cut algorithm [28] partitions the vertex set V into two disjoint sets A, B by minimizing the Normalized Cut criterion:

$$\begin{aligned} N_{{\text {cut}}}(A,B) = \frac{{{\text {cut}}(A,B)}}{{{\text {ass}}(A,V)}} + \frac{{{\text {cut}}(A,B)}}{{{\text {ass}}(B,V)}} \end{aligned}$$
(5)

The cut is defined as the degree of dissimilarity between the two disjoint sets A and B. Similarly, the association ass(A, V) is the partial volume of A within the entire set V:

$$\begin{aligned} {\text {cut}}\left( {A,B} \right) = \sum \limits _{a \in A,\,b \in B} {w(a,b)} \end{aligned}$$
(6)
$$\begin{aligned} {\text {ass}}\left( {A,V} \right) = \sum \limits _{a \in A,\,v \in V} {w(a,v)} \end{aligned}$$
(7)

Shi and Malik [28] show that minimizing the Normalized Cut criterion is naturally related to maximizing the Normalized Association criterion, given by:

$$\begin{aligned} N_{{\text {ass}}}(A,B) = \frac{{{\text {ass}}(A,A)}}{{{\text {ass}}(A,V)}} + \frac{{{\text {ass}}(B,B)}}{{{\text {ass}}(B,V)}} \end{aligned}$$
(8)

They further extend this criterion to K-way partitioning of the data into K disjoint sets:

$$\begin{aligned} {N_{{\text {ass}}}^K}({A_1}, \ldots , {A_K}) = \sum \limits _{i = 1}^K {\frac{{{\text {ass}}({A_i},{A_i})}}{{{\text {ass}}({A_i},V)}}} \end{aligned}$$
(9)

This implies that the best partition of the set V to K disjoint sets is the partition that maximizes the \(N_{{\text {ass}}}^K\) criterion calculated according to the similarity matrix.
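The K-way criterion of Eq. (9) is straightforward to evaluate from a similarity matrix; a small sketch follows (variable names are ours).

```python
import numpy as np

def normalized_association(S, labels):
    """K-way normalized association N_ass^K (Eq. 9).
    S: (N, N) similarity matrix; labels: length-N integer cluster assignment."""
    score = 0.0
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        ass_AA = S[np.ix_(idx, idx)].sum()   # ass(A_k, A_k)
        ass_AV = S[idx, :].sum()             # ass(A_k, V)
        score += ass_AA / ass_AV
    return score
```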

6.2 The objective function

The objective function primarily aims to evaluate the clusters' quality. We use the well-known N-Cut scheme [28], which evaluates the clustering structure directly from the original data. The assignment of the input elements to a space–time table T is equivalent to two separate partitions: a partition of the elements into horizontal clusters \(\left\{ {{H_1},{H_2}, \ldots , {H_{{N_H}}}} \right\} \) and a partition into vertical clusters \(\left\{ {{V_1},{V_2}, \ldots , {V_{{N_V}}}} \right\} \). The correctness of the H/V-clusters is measured using the \(N_{{\text {ass}}}^{K}\) criterion, according to the corresponding similarity matrix \(S_H\) or \(S_V\), respectively.

$$\begin{aligned} f\left( T,S_H,S_V \right) = N_{{\text {ass}}}^{N_H}\left( {H_1}, \ldots , {H_{N_H}} \right) + N_{{\text {ass}}}^{N_V}\left( {V_1}, \ldots , {V_{N_V}} \right) - \mathscr {P} \end{aligned}$$
(10)

To generate a smooth, uniform and coherent space–time table, we define a penalty term \(\mathscr {P}\), which decreases the objective function when the table T does not satisfy the global constraints. Let us denote the number of elements in the i-th row of the table by \({n^{r}_i}\); similarly, \({n^{c}_j}\) is the number of elements in the j-th column, and \({n^{e}_{i,j}} \) is the number of elements in cell [i, j]. For a uniform distribution of the input elements in the table, the optimal number of elements in each row, column and cell is \(\frac{N}{N_H}\), \(\frac{N}{N_V}\) and \(\frac{N}{N_H \cdot N_V}\), respectively. The penalty term is then defined by:

$$\begin{aligned} \mathscr {P} = {W_{r}} \sum \limits _{i = 1}^{{N_H}} {g\left( n^{r}_i - \frac{N}{N_H} \right) } + {W_{c}} \sum \limits _{j = 1}^{{N_V}} {g\left( n^{c}_j - \frac{N}{N_V} \right) } + {W_{e}} \sum \limits _{i,j} {g\left( n^{e}_{i,j} - \frac{N}{N_H \cdot N_V} \right) } \end{aligned}$$
(11)

The penalty term penalizes the objective function for any deviation from the optimal number of elements in each row, column or cell. The single-penalty function \(g(\cdot)\) is an arbitrary monotonic symmetric function around 0 that satisfies \(g\left( 0 \right) = 0\); in our work, we use \(g\left( x \right) = 1 - {{\text {e}}^{ - {x^2}}}\). We set the penalty weights \({W_{e}}=1.5\), \({W_{r}}={W_{c}}=0.5\) empirically to generate a smooth and uniform table.
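Putting the pieces together, the sketch below evaluates the penalty of Eq. (11) and the full objective, assuming the form of Eq. (10) above (the two \(N_{\text{ass}}^{K}\) scores minus the penalty); normalized_association is the helper from the previous sketch.

```python
def penalty(counts, N, W_e=1.5, W_r=0.5, W_c=0.5):
    """Penalty term of Eq. (11). counts: (N_H, N_V) array, number of elements per cell."""
    g = lambda x: 1.0 - np.exp(-x ** 2)          # single-penalty function g
    N_H, N_V = counts.shape
    return (W_r * g(counts.sum(axis=1) - N / N_H).sum()
            + W_c * g(counts.sum(axis=0) - N / N_V).sum()
            + W_e * g(counts - N / (N_H * N_V)).sum())

def table_objective(S_H, S_V, row_labels, col_labels, counts):
    """Objective f(T, S_H, S_V) of Eq. (10) for a candidate table."""
    return (normalized_association(S_H, row_labels)
            + normalized_association(S_V, col_labels)
            - penalty(counts, len(row_labels)))
```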

6.3 The refinement process

The refinement process assumes that most of the elements are assigned to their correct entry in the table, with respect to the ground-truth assignment. Furthermore, we assume that the mismatches, if they exist, are small and local; namely, a mismatched element is likely to be assigned to one of the cells adjacent to the correct one. These assumptions are reasonable because the typical behavior of BiSOM guarantees that adjacent neurons have similar weights.

At each iteration of the refinement process, the silhouette index [27] of each element is computed for all clusters. We then consider only the moves with a large \({s_i}\left( j \right) \) and perform small local moves and swaps in the table, aiming to maximize the objective function.
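The paragraph above leaves several details open, in particular how \(s_i(j)\) selects candidate moves. The following sketch shows one plausible greedy variant, where elements with a poor silhouette in either domain are tried in adjacent cells and a move is kept only if the objective improves; D_H, D_V are precomputed distance matrices, and table_objective is the helper from the previous sketch.

```python
from sklearn.metrics import silhouette_samples

def refine(cells, D_H, D_V, S_H, S_V, N_H, N_V, n_rounds=20):
    """cells: (N, 2) integer array of (row, col) assignments; returns the refined table."""
    def score(c):
        counts = np.zeros((N_H, N_V), dtype=int)
        np.add.at(counts, (c[:, 0], c[:, 1]), 1)
        return table_objective(S_H, S_V, c[:, 0], c[:, 1], counts)
    for _ in range(n_rounds):
        s = np.minimum(silhouette_samples(D_H, cells[:, 0], metric='precomputed'),
                       silhouette_samples(D_V, cells[:, 1], metric='precomputed'))
        improved = False
        for i in np.argsort(s):                                # worst-fitting elements first
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # adjacent cells only
                cand = cells.copy()
                cand[i] = np.clip(cells[i] + (dr, dc), 0, (N_H - 1, N_V - 1))
                if score(cand) > score(cells):
                    cells, improved = cand, True
        if not improved:
            break
    return cells
```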

7 Implementation details

Feature extraction is a key initial step of our method. We experimented with a number of spatial and temporal features, according to the nature of each scene. It should be emphasized that automatic feature selection is an independent, challenging problem that is not in the scope of our work. The assumption is that the features used lead to a reasonable but imperfect clustering. We now elaborate on each feature category separately:

Fig. 5

Example from the red car dataset. a Static SIFT matches used to compute to the transformation between the images. b Blended registered images and the dynamic SIFT matches used to compute the disparity between the images

Fig. 6

Example from the dancer dataset (a), and the IDSC matches found using the two contours (b)

Fig. 7

a, b and c Three typical layouts of the synthetic input points using \(\sigma =0.1\). d Rand index computed on the clustering result on each layout as a function of \(\sigma \)

7.1 Temporal features

The temporal feature aims to separate the images according to their time of acquisition. We define a metric that measures the time difference between every two images in the set. Notice that we cannot determine which of any two images was taken first; thus, the final order of the images in the space–time table can be reversed.

First, we extract the dynamic object in each image using the Grabcut algorithm [26]. Next, according to the nature of the motion in each scene, we manually choose a different kind of descriptor.

For scenes with a fixed object moving along a linear path (i.e., a car traveling on a road), we use SIFT point tracking, similar to [32]. First, we extract the SIFT points from each image and divide them into static and dynamic points according to their locations on the foreground or background (Fig. 5a). Next, we compute a projective transformation between each pair of images according to the matches between the static points only. We use this transformation to align each pair of images. Finally, after registering both images, we compute the disparity between the dynamic matches that reflects the temporal difference between the two images (Fig. 5b).
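As an illustration of this pipeline, the OpenCV-based sketch below computes the temporal disparity for one image pair; the ratio-test threshold, the use of a homography for the projective transformation, and the classification of a match as static or dynamic by its foreground mask are our reading of the steps above, not code from the original system.

```python
import cv2
import numpy as np

def temporal_disparity(img_a, img_b, fg_mask_a):
    """Average displacement of the dynamic SIFT matches after registering the
    static background of img_a onto img_b (sketch; fg_mask_a is the GrabCut mask)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    knn = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m, n in knn if m.distance < 0.75 * n.distance]        # ratio test
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    dyn = np.array([fg_mask_a[int(y), int(x)] > 0 for x, y in pts_a])   # dynamic vs. static
    H, _ = cv2.findHomography(pts_a[~dyn], pts_b[~dyn], cv2.RANSAC)     # static matches only
    warped = cv2.perspectiveTransform(pts_a[dyn].reshape(-1, 1, 2), H).reshape(-1, 2)
    return float(np.linalg.norm(warped - pts_b[dyn], axis=1).mean())    # temporal distance
```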

For scenes demonstrating a human in a fixed position changing his posture during the motion (i.e., ballet dancer), we use contour-based descriptors (Fig. 6). We use the foreground/background separation and extract the contour of the dynamic object. We compare the contours of each two images by using the inner-distance shape context (IDSC) [19].

7.2 Spatial features

The spatial feature aims to separate the images according to the viewpoint of the photographer.

We used the number of static SIFT matches as a metric for measuring the similarity between two images: the farther apart the images were captured, the fewer SIFT matches are found between them. We also used the GIST descriptor [22] to measure the spatial distance between two images. Furthermore, we experimented with simpler metrics, such as a Chi-square distance on the color histograms of the images.

We display the results using the first feature, although our method showed similar performance with any of the other spatial features.
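A minimal sketch of the first spatial metric (the number of static SIFT matches), assuming the descriptors of the background keypoints have already been separated as above and OpenCV is imported as in the previous sketch; the ratio-test threshold is again our choice.

```python
def spatial_similarity(static_des_a, static_des_b, ratio=0.75):
    """Number of static SIFT matches between two images (higher = closer viewpoints)."""
    knn = cv2.BFMatcher().knnMatch(static_des_a, static_des_b, k=2)
    return sum(1 for m, n in knn if m.distance < ratio * n.distance)

# Simpler alternative mentioned above: Chi-square distance between color histograms,
# e.g. cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CHISQR).
```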

8 Results

8.1 Synthetic data

First, we demonstrate the strengths of our method on synthetic data. We generate sets of 64 points divided into 8 clusters. The points in each cluster are sampled from a 3D anisotropic Gaussian, where the average variance of the Gaussian is denoted by \(\sigma \). The cluster centers are typically organized in one of three configurations: scattered, diagonal and horseshoe. We observed that the diagonal and horseshoe configurations are more typical when dealing with space or time features, as in our work (see Fig. 1b).
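A sketch of this synthetic-data generation, under the assumption (ours) that the per-axis variances of each anisotropic Gaussian are drawn at random so that their average equals \(\sigma\):

```python
import numpy as np

def synthetic_clusters(centers, sigma=0.1, pts_per_cluster=8, seed=0):
    """Sample pts_per_cluster points from a 3D anisotropic Gaussian around each
    of the given cluster centers; returns the points and their ground-truth labels."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k, c in enumerate(centers):
        axis_var = rng.dirichlet(np.ones(3)) * 3.0 * sigma   # anisotropic, mean variance sigma
        points.append(rng.multivariate_normal(c, np.diag(axis_var), size=pts_per_cluster))
        labels.append(np.full(pts_per_cluster, k))
    return np.vstack(points), np.concatenate(labels)

# Eight cluster centers laid out in a 'scattered', 'diagonal' or 'horseshoe' configuration
# are passed as `centers`, once for the horizontal and once for the vertical descriptors.
```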

Table 1 Dataset evaluation. Each dataset consists of N images organized into a table of \(N_H\) by \(N_V\). The table displays the horizontal and vertical Rand index and the total swapping distance of the resulting table relative to the ground truth

For each configuration layout, we generate two different point clouds defining our horizontal and vertical descriptors \({x_i^H}\), \({x_i^V}\), with different values of \(\sigma \). We compare the result of our BiSOM algorithm to the results of two separate N-Cut schemes applied to the horizontal and vertical descriptors separately. To measure the correctness of the clustering result, we compute the Rand index [14], comparing the clustering result to the ground truth.

Figure 7d shows the average Rand index of the horizontal and vertical clustering results as a function of \(\sigma \). We can see that our BiSOM algorithm, which considers the two features together, outperforms the N-Cut scheme, which considers each feature independently.

8.2 Real visual data

To demonstrate the effectiveness of our method and to evaluate it, we generated 11 datasets of outdoor and indoor scenes. These datasets capture real-life dynamic scenes that occur in a rather limited space–time domain. The events were captured asynchronously with a multitude of capture devices, including DSLR cameras and mobile phones. The photographers might have moved slightly during the dynamic event; therefore, images taken by the same photographer may differ slightly in viewpoint. Furthermore, the photographers did not capture the dynamic event synchronously; hence, the input images may have temporal differences in acquisition.

Fig. 8

dancer—a girl strikes dance poses at the beach. Notice the large variance in viewpoints and poses. The temporal order is not recovered in this example, due to the nature of the scene

Fig. 9

beer—a guy picking up a beer bottle to drink. Notice the temporal order (right to left) and the spatial order of the viewpoints (top to bottom)

Fig. 10

bag—a girl bending to pick up her bag

Fig. 11

Ground-truth table of ladder dataset (a), and the generated space–time table (b)

Fig. 12

ballet—a ballet dancer performing a jump on stage, captured by an audience seated in an amphitheater around the stage. This dataset is computer rendered

Fig. 13

red car—a red car driving down the street from right to left

Fig. 14

boats—a couple of small boats traveling down the river

Fig. 15

mall—people walking at the mall. Notice the three different groups of people walking in different directions

Fig. 16

slides—kids slide down the slide

Fig. 17

bridge—a boy walking on a bridge at the playground

Fig. 18

basketball—failure case. In this dataset the audience capturing the scene was spread sparsely on the side of the field. The spatial descriptor failed to reflect the distances between the viewpoints, and as a result our method failed to organize the images in a coherent table

The problem of assigning the photographs to the space–time table can be separated into two sub-problems, which we evaluate separately: (i) correctly clustering the elements into space clusters and time clusters and (ii) correctly ordering the clusters, namely recovering the temporal and spatial order among them.

To evaluate the clustering result, we use the Rand index described above. To evaluate the ordering result, we use the swapping metric [7], which measures the distance from a given permutation of the clusters to the correct one. Table 1 displays the Rand index and the swapping distance for each of the sets.
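The clustering part of this evaluation can be reproduced with the standard Rand index; for the ordering part, one common reading of a swapping distance is the number of adjacent-swap inversions needed to sort the recovered cluster order, which we sketch below (the exact metric of [7] may differ).

```python
import numpy as np
from sklearn.metrics import rand_score

def swapping_distance(recovered_order):
    """Number of adjacent swaps needed to sort the recovered cluster order into 0..K-1."""
    p = np.asarray(recovered_order)
    return sum(int(p[i] > p[j]) for i in range(len(p)) for j in range(i + 1, len(p)))

# horizontal_rand = rand_score(gt_row_labels, predicted_row_labels)
# vertical_rand   = rand_score(gt_col_labels, predicted_col_labels)
# Since the direction of time may be reversed (Sect. 7.1), one may also take the
# minimum of the swapping distance over the reversed cluster order.
```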

Figures 8, 9, 10, 11, 12, 13, 14, 15, 16 and 17 show the resulting space–time tables generated by our algorithm. For full-resolution images, see our supplementary material. Notice how the order of the rows defines the spatial order of the photographers, while the order of the columns defines the temporal order of the event.

Our method can also handle more realistic cases, where the ground-truth table contains a different number of images in each row and column. To generate such a dataset, we modified the ladder dataset by removing some of the images and duplicating others. Figure 11 shows the ground-truth table and the resulting space–time table generated by our algorithm. Although the objective function leads to a result with some local mismatches, the final space–time table is more uniform and coherent than the ground-truth table.

8.3 Comparison to other methods

First, we compare our results to the Unary-SOM method introduced in Sect. 4, using the space descriptor and the time descriptor as a single concatenated descriptor; the rest of the parameters of the neural network and the training process are identical to those of the BiSOM method. Next, we compare our results with the N-Cut [28] method, by first applying the N-Cut algorithm on the space descriptors to obtain the space clusters, and similarly for the time clusters. We further experiment with IsoMatch [12], a grid layout technique that organizes the set of images into a full grid. Although the resulting table is always full and uniform, it does not reflect the temporal and spatial content of the event. Table 1 displays the comparison results on all datasets. A major drawback of all the alternative methods is their inability to provide the temporal or spatial order among the clusters, meaning that the rows and columns of the space–time table are not ordered correctly.

9 Discussion and conclusions

We have presented a technique to organize a set of photographs into a space–time table. The embedding of the photographs into the table reveals information regarding the clustering of the images into space and time clusters and the order among the clusters.

We introduced BiSOM, which builds upon SOM and considers two independent features rather than one, and showed that it outperforms SOM. We also showed that using the space and time features independently, to solve two independent clustering problems, is inferior to using both features coupled within BiSOM. Furthermore, unlike other methods, our method reveals the temporal and spatial order among the clusters.

Regarding the limitations of our work, we assume that the number of clusters is given and use it to initialize the size of our grid. Furthermore, note that our method relies on the extracted features and descriptors used to measure the distances among the images. This implies that we are bound by the common limitations of feature extraction methods. Figure 18 shows a failure case, where the images are captured over a wide baseline, which means that the appearance of the dynamic object in adjacent viewpoints is significantly different.

Although our work focuses on space–time separation, the technique is applicable to any set of features. Given a set of images, we may use style and content descriptors to embed the images into a style–content table. Another possible application is organizing human shapes according to their body shapes and postures. Furthermore, another possible avenue for research is extending BiSOM to higher dimensions, e.g., using three descriptors and a three-dimensional grid of neurons to organize a set of tiles according to shape, color and texture features.