1 Introduction

The object detection task entails localizing and classifying objects in an image. Deep learning methods originate from the two-stage approach [5, 9, 32] that uses encoded visual features of an input image to first search for image regions that potentially contain objects, and then classify and refine each proposal region in isolation. Over the years, alternate approaches have challenged these initial design choices: end-to-end formulations [6, 19, 31, 43, 44] yield a real-time and fully-differentiable alternative to the two-stage approach, multi-resolution detectors [22] boost the detection accuracy for small objects, and learnable proposals [34] supersede the need for dense candidate regions. Even novel object detection paradigms have emerged, such as anchorless [43] and set-based methods [6, 34] that bypass non-maximum suppression, as well as attention-based methods [6] that explicitly learn context between object proposals.

However, many design paradigms have never been questioned, first and foremost the learnable class prototypes and the Euclidean embedding space in the classifier head. Recently, alternative classification space formulations such as hyperbolic embeddings have outperformed their Euclidean counterparts in an increasing number of domains. Gains were especially observed in tasks with an underlying hierarchical representation [24, 27], including the task of image classification [14]. These successes are attributed to the exponentially growing distance ratio of hyperbolic spaces, which enables them to match the rate of growth in tree-like structures. In this paper, we analyze whether visual object detectors also benefit from a hyperbolic classification space where visual features encode information about both object categorization and bounding-box regression.

Fig. 1. General object detection architecture using classification score embeddings in the hyperboloid model. It is built on object proposal features from an arbitrary detection neck, for instance a transformer-decoder or RoI head, using operations in Euclidean space. Finally, it outputs the classification logits computed in the learned hyperbolic metric space, i.e., it calculates hyperbolic distances (Eq. 5) to the learned class prototypes on the hyperboloid.

We incorporate a hyperbolic classification head into various object detectors, as can be seen in Fig. 1, including Sparse R-CNN [34], CenterNet [43], and Deformable DETR [44]. We evaluate the performance on closed-set object detection as well as long-tailed and zero-shot object detection, and analyze how the head copes with the task of distinguishing foreground and background representations and how it interacts with localization. We observe a latent class hierarchy emerging from visual features, resulting in fewer and better classification errors, while simultaneously boosting the overall object detection performance.

In summary, our main contributions in this work are:

  • Formulation of a hyperbolic classification head for two-stage, keypoint-based, and transformer-based multi-object detection architectures.

  • Evaluation of hyperbolic classification on closed-set, long-tailed, and zero-shot object detection.

2 Related Work

In this section, we review the relevant works in the research area of object detection architectures, with special interest in their classification heads, as well as hyperbolic embeddings in the image domain.

Object Detection: Early deep object detectors were anchor-based methods [5, 32, 33, 34] that first select image regions potentially containing an object from a set of search locations and then employ a Region of Interest (RoI) head to extract visual features, which are used for classification and bounding box regression. Due to their class-agnostic Region Proposal Network (RPN), such methods are nowadays widely applied in zero-shot object detection [15, 30, 41].

Anchor-less detectors such as keypoint-based or center-based methods [16, 43] embody an end-to-end approach that directly predicts objects as class-heatmaps. Additional regression heads then estimate bounding box offset and height/width [43], instance masks [17, 26] or temporal correspondences [13] in an image sequence.

In recent years, an increasing number of works employ set-based training objectives [6, 7, 34, 44] that compute loss functions from a one-to-one matching between groundtruth and proposal boxes. This objective is widely used with transformer-based detection heads [6, 7, 44] which process the feature map as a sequence of image patches. Such methods detect objects using cross-attention between learned object queries and visual embedding keys, as well as self-attention between object queries to capture their interrelations in the scene context.

Hyperbolic Embeddings: Hyperbolic geometry defines spaces with negative constant Gaussian curvature. Consequently, the distance-ratio of hyperbolic spaces increases exponentially [3], a property that is exploited by recent works to capture parent-child relations within the data [27]. The Poincaré ball model is currently the most widely used formulation of hyperbolic space for embedding tasks [14, 21, 27], due to its intuitive visualization in 2D and its embedding space that is constrained based on the Euclidean norm. Nickel et al. [27] pioneered the learning of tree-structured graph embeddings in the Poincaré ball, and surpassed Euclidean embeddings in terms of generalization ability and representation capacity. Poincaré embeddings were introduced in the image domain by Khrulkov et al. [14] for image classification and person re-identification tasks. They observed that mini-batches of image embeddings generated from ResNet-18 [12] feature extractors on Mini-ImageNet [8] form a hyperbolic group, as the embeddings’ Cayley graph is \(\delta \)-hyperbolic. Motivated by this observation, they mapped the penultimate layer of a ResNet-18 [12] backbone onto the Poincaré ball and performed a hyperbolic multi-logistic regression, which they showed to achieve superior results compared to Euclidean methods for 5-shot classification on Mini-ImageNet [8]. They further proposed a model for person re-identification tasks on proposal regions, disregarding the localization task of object detection.

For zero-shot image classification, [21] transformed image embeddings from a ResNet-101 [12] into hyperbolic vectors on the Poincaré ball model and performed classification based on distances to Poincaré embeddings of WordNet relations [27] as class prototypes. In doing so, they outperformed all Euclidean baseline models in terms of hierarchical precision on ImageNet. In a later work, Nickel et al. [28] found that the hyperboloid model, also called the Lorentz model, learns embeddings for large taxonomies more efficiently and more accurately than Poincaré embeddings. Furthermore, the distance computation on the hyperboloid model is numerically stable, as it avoids the stereographic projection.

Building on embeddings in the hyperboloid model [28], we perform object detection with a hyperbolic detection head for two-stage, keypoint-based, and transformer-based detectors. Compared to image classification, this additionally requires localizing objects and distinguishing between foreground and background image regions.

3 Technical Approach

SoTA closed-set object detection architectures employ a linear Euclidean layer design [1, 32, 37] that projects visual feature vectors \(\textbf{v}_i \in \mathbb {R}^{D}\) onto a parameter matrix \(\textbf{W} \in \mathbb {R}^{D \times C}\) composed of C class prototypes, such that \(p(y = c | \textbf{v}_i) = \textit{softmax}( \textbf{W}^T \textbf{v}_i)_c\).
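As a point of reference for the Euclidean baseline, the softmax over prototype projections can be sketched as follows. This is a minimal pure-Python illustration rather than the detectors' actual implementation; the function name `linear_class_probs` and the toy values of `W` and `v` are assumptions made for the example.

```python
import math


def linear_class_probs(W, v):
    """Return softmax(W^T v) for a D x C prototype matrix W and a D-dim feature v."""
    num_dims, num_classes = len(W), len(W[0])
    # One logit per class: the dot product of v with the class prototype (column of W).
    logits = [sum(W[d][c] * v[d] for d in range(num_dims)) for c in range(num_classes)]
    # Subtract the maximum logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (hypothetical values): D = 2 feature dimensions, C = 3 classes.
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
v = [2.0, 0.5]
probs = linear_class_probs(W, v)
```

The feature `v` points mostly along the first prototype, so the first class receives the largest probability.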

We propose to learn class prototypes in hyperbolic space to embed latent hierarchies with lower distortion. In the remainder of this section, we introduce hyperbolic embeddings in Sect. 3.1, the modifications to the classification losses in Sect. 3.2, and describe the incorporation into two-stage detectors, keypoint-based methods, and transformer-based architectures in Sect. 3.3.

3.1 Hyperbolic Embeddings

In this work, we analyze the n-dimensional hyperboloid model \(\mathbb {H}^{n} \), also called the Lorentz model, as a classification space for object detection. It is one of several isometric models of hyperbolic space, i.e., a Riemannian manifold with constant negative curvature \(\kappa <0\). We investigate whether this non-zero curvature is desirable to capture latent hierarchical information due to the exponential growth of volume with distance to a point [3]. The limit case with \(\kappa =0\) would recover the behavior of Euclidean geometry and hence the baseline object detectors.

The n-dimensional hyperboloid model \(\mathbb {H}^{n}\) represents points on the upper sheet of a two-sheeted n-dimensional hyperboloid. It is defined by the Riemannian manifold \(\mathcal {L} = (\mathbb {H}^n, \textbf{g}_l)\) with

$$\begin{aligned} \mathbb {H}^n = \{ \textbf{x} \in \mathbb {R}^{n+1}: \langle \textbf{x}, \textbf{x} \rangle _{l} = -1, x_0 > 0 \}, \end{aligned}$$
(1)

and

$$\begin{aligned} \textbf{g}_l(\textbf{x}) = \text {diag}(-1, 1, \dots , 1) \in \mathbb {R}^{(n+1) \times (n+1)}, \end{aligned}$$
(2)

such that the Lorentzian scalar product \(\langle \cdot , \cdot \rangle _{l}\) is given by

$$\begin{aligned} \langle \textbf{x}, \textbf{y} \rangle _{l} = -x_0 y_0 + \sum _{i=1}^{n} x_i y_i, \quad \textbf{x}, \textbf{y} \in \mathbb {R}^{n+1}. \end{aligned}$$
(3)
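The Lorentzian scalar product and the membership condition of Eq. (1) can be checked numerically with a few lines of code. The sketch below is illustrative only; the helper names `lorentz_inner` and `on_hyperboloid` and the tolerance `tol` are assumptions, not part of the original formulation.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian scalar product: -x_0 * y_0 + sum_{i >= 1} x_i * y_i."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def on_hyperboloid(x, tol=1e-9):
    """Check both conditions of Eq. (1): <x, x>_l = -1 and x_0 > 0."""
    return x[0] > 0 and abs(lorentz_inner(x, x) + 1.0) < tol


# The origin of H^2 and a point at hyperbolic distance 0.5 from it.
origin = [1.0, 0.0, 0.0]
point = [math.cosh(0.5), math.sinh(0.5), 0.0]
```

Both `origin` and `point` satisfy the constraint, whereas a vector such as `[1.0, 1.0, 0.0]` has a Lorentzian norm of zero and therefore does not lie on the hyperboloid.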

To transform visual features \(\textbf{v} \in \mathbb {R}^{n+1}\) - extracted from euclidean modules like convolutional layers or MLP - into points on the hyperboloid, we apply the exponential map as follows:

$$\begin{aligned} \exp ^k_0 (\textbf{v}) = \sinh \left( \Vert \textbf{v} \Vert \right) \frac{\textbf{v}}{\Vert \textbf{v} \Vert }. \end{aligned}$$
(4)
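A minimal sketch of this mapping is given below. It assumes unit negative curvature and the common convention in which the Euclidean feature is treated as a tangent vector at the hyperboloid origin \((1, 0, \dots , 0)\): the spatial components follow Eq. (4), and the leading coordinate is set to \(\cosh (\Vert \textbf{v} \Vert )\) so that the result satisfies Eq. (1). The function name `exp_map_origin` and the small-norm guard `eps` are assumptions made for illustration.

```python
import math


def exp_map_origin(v, eps=1e-12):
    """Map a Euclidean vector v onto the hyperboloid via the exponential map at the origin.

    Assumption: curvature -1, with v interpreted as the spatial part of a
    tangent vector at the origin (1, 0, ..., 0).
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm < eps:
        # The zero vector maps to the origin of the hyperboloid.
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    # Time-like coordinate cosh(||v||), spatial coordinates sinh(||v||) * v / ||v||.
    return [math.cosh(norm)] + [scale * c for c in v]


x = exp_map_origin([0.3, -0.4])
```

With this convention, an n-dimensional feature yields a point with n + 1 coordinates whose Lorentzian norm is -1 up to floating-point error.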

The distance between two points on the hyperboloid is then given by

$$\begin{aligned} d_{\mathbb {H}^n}(\textbf{x}_i, \textbf{t}_c) = \text {arccosh}\left( -\langle \textbf{x}_i, \textbf{t}_c\rangle _{l} \right) , \quad \textbf{x}_i, \textbf{t}_c \in \mathbb {H}^{n}. \end{aligned}$$
(5)
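The distance of Eq. (5) can be sketched as follows. Clamping the argument of `acosh` to at least 1 is an added safeguard against floating-point round-off and is an assumption of this sketch, as is the helper name `hyperbolic_distance`.

```python
import math


def hyperbolic_distance(x, t):
    """Distance between two hyperboloid points: arccosh(-<x, t>_l)."""
    inner = -x[0] * t[0] + sum(a * b for a, b in zip(x[1:], t[1:]))
    # Round-off can push -inner slightly below 1, where acosh is undefined.
    return math.acosh(max(1.0, -inner))


# Points on H^2 along one geodesic through the origin.
origin = [1.0, 0.0, 0.0]
a = [math.cosh(0.7), math.sinh(0.7), 0.0]
b = [math.cosh(0.2), math.sinh(0.2), 0.0]
```

For points on the same geodesic through the origin, the distance reduces to the difference of their arc-length parameters, which gives a simple sanity check for the implementation.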

When performing gradient optimization over the tangent space at a hyperbolic embedding \(\textbf{x} \in \mathbb {H}^{n}\), we apply the exponential map at \(\textbf{x}\) to the tangent vector, as shown in [38].
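One common way to realize such an update is Riemannian gradient descent on the hyperboloid: rescale the Euclidean gradient by the inverse metric, project it onto the tangent space at \(\textbf{x}\), and follow the exponential map. The sketch below illustrates this recipe under the assumption of unit negative curvature; it is not necessarily the exact procedure of [38], and the function names and the learning rate `lr` are hypothetical.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian scalar product of Eq. (3)."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_at(x, u, eps=1e-12):
    """Exponential map at x for a tangent vector u (assumes curvature -1)."""
    norm = math.sqrt(max(lorentz_inner(u, u), 0.0))
    if norm < eps:
        return list(x)
    return [math.cosh(norm) * xi + math.sinh(norm) * ui / norm
            for xi, ui in zip(x, u)]


def riemannian_sgd_step(x, euclidean_grad, lr=0.1):
    """One gradient step that keeps x on the hyperboloid (illustrative sketch)."""
    # Rescale by the inverse metric diag(-1, 1, ..., 1): flip the first component.
    h = [-euclidean_grad[0]] + list(euclidean_grad[1:])
    # Project onto the tangent space at x: u = h + <x, h>_l * x.
    ip = lorentz_inner(x, h)
    u = [hi + ip * xi for hi, xi in zip(h, x)]
    # Move along the geodesic in the direction of the negative gradient.
    return exp_map_at(x, [-lr * ui for ui in u])


x = [math.cosh(0.5), math.sinh(0.5), 0.0]
y = riemannian_sgd_step(x, [0.2, -0.1, 0.4])
```

Because the update follows a geodesic, the new point `y` satisfies the hyperboloid constraint without any re-normalization step.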

3.2 Focal Loss Integration

The state-of-the-art object detectors compared in this work use a focal classification loss [19] that relies on the sigmoid function for computing the binary cross entropy. Since distance metrics only span the non-negative real numbers, we compute logits by shifting the distances by a negative value as

$$\begin{aligned} s_{i, c} = \varDelta - \frac{\varDelta }{d_{min}} d_{i, c}, \end{aligned}$$
(6)

where \(s_{i, c}\) is the classification score for proposal i and class c, and \(d_{i, c} = d_{\mathbb {H}^n}(\textbf{x}_i, \textbf{t}_c)\) is the respective distance between the transformed visual feature vector and the prototype for class c. \(\varDelta \) is an offset that shifts the active regions of the activation function. The scaling parameter \(d_{min}\) defines the distance that corresponds to a classification confidence of \(p=0.5\), since \(s_{i, c} = 0\) for \(d_{i, c} = d_{min}\). It is set to the minimum inter-class distance for fixed class prototypes, or to a scalar constant (here \(d_{min}=1\)) for learnable class prototypes.
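The effect of Eq. (6) can be illustrated with a short sketch. The value `delta=1.0` used below is a hypothetical choice for illustration only, and the function names are assumptions; `d_min=1.0` follows the setting stated for learnable prototypes.

```python
import math


def distance_to_logit(d, delta=1.0, d_min=1.0):
    """Eq. (6): s = delta - (delta / d_min) * d."""
    return delta - (delta / d_min) * d


def sigmoid(s):
    """Logistic function used by the binary focal loss."""
    return 1.0 / (1.0 + math.exp(-s))


# Confidence as a function of the distance to a class prototype.
confidences = [sigmoid(distance_to_logit(d)) for d in (0.0, 0.5, 1.0, 2.0, 4.0)]
```

A feature that coincides with the prototype receives the logit `delta`, a feature at distance `d_min` receives a confidence of 0.5, and the confidence decays monotonically as the distance grows.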

3.3 Generalization to Existing Object Detectors

In this section, we briefly revisit the two-stage, keypoint-based, and transformer-based object detectors that we use for the experiments with a special focus on their classification loss.

Two-stage detectors such as Sparse R-CNN [34] first extract a set of N proposal boxes that potentially contain objects from latent image features. A RoI head then processes each proposal feature vector separately to extract classification features and bounding box regression values. In the case of Sparse R-CNN [34], the RoI head takes the form of a dynamic instance interactive head, where each head is conditioned on a learned proposal feature vector. The classification scores enter the matching cost via a focal loss term. For our experiments, we replace the classification head by the hyperbolic MLR module as described in Sect. 3.1.

Keypoint-based detectors formulate object detection as a keypoint estimation problem. CenterNet [43] uses these keypoints as the centers of bounding boxes. The final layer outputs a per-class heatmap as well as center point regression, bounding box width, height, etc. We also evaluate CenterNet as a representative of one-stage detectors. It outperforms RetinaNet [19], YOLOv3 [31], and FCOS [36] on the COCO test-dev benchmark when using ResNet-50 as feature extractor or alternatives with a comparable number of parameters. We modify the classification heatmap to regress towards classification embeddings for each pixel. These embeddings are then transformed into hyperbolic space, and class heatmaps are generated by computing distance fields to each hyperbolic class prototype.
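The distance-field construction for keypoint-based detectors can be sketched on a toy grid as follows. This is a simplified pure-Python illustration under several assumptions: unit negative curvature, the exponential map at the origin with a \(\cosh \) leading coordinate, a sigmoid applied to the logits of Eq. (6) with a hypothetical `delta=1.0`, and prototypes obtained by mapping hand-picked Euclidean vectors. None of the names below come from the CenterNet code base.

```python
import math


def exp_map_origin(v, eps=1e-12):
    """Exponential map at the hyperboloid origin (assumes curvature -1)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm < eps:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in v]


def hyperbolic_distance(x, t):
    """Eq. (5), with a clamp against floating-point round-off."""
    inner = -x[0] * t[0] + sum(a * b for a, b in zip(x[1:], t[1:]))
    return math.acosh(max(1.0, -inner))


def class_heatmaps(embeddings, prototypes, delta=1.0, d_min=1.0):
    """Turn an H x W grid of Euclidean embeddings into C heatmaps of size H x W."""
    heatmaps = []
    for t in prototypes:
        heatmap = []
        for row in embeddings:
            scores = []
            for v in row:
                d = hyperbolic_distance(exp_map_origin(v), t)
                logit = delta - (delta / d_min) * d  # Eq. (6)
                scores.append(1.0 / (1.0 + math.exp(-logit)))
            heatmap.append(scores)
        heatmaps.append(heatmap)
    return heatmaps


# Toy setup: a 1 x 2 pixel grid with 2-dimensional embeddings and two classes.
prototypes = [exp_map_origin([1.0, 0.0]), exp_map_origin([-1.0, 0.0])]
embeddings = [[[1.0, 0.0], [-0.9, 0.1]]]
heat = class_heatmaps(embeddings, prototypes)
```

In this toy setup, the first pixel responds most strongly in the heatmap of the first class and the second pixel in the heatmap of the second class.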

Transformer-based methods were pioneered by the DETR [6] architecture, which utilizes a transformer encoder-decoder model that processes feature map tokens of ResNet-encoded image features. Each feature vector is then independently decoded into prediction box coordinates and class labels by a 3-layer perceptron with ReLU activation [6]. Deformable DETR [44] (DDETR) improves the computational efficiency of DETR by proposing a deformable attention module that additionally allows aggregating multi-scale feature maps. The decoder consists of cross-attention modules that extract features as values, where the query elements are the N object queries and the key elements are taken from the output feature maps of the encoder. These are followed by self-attention modules among object queries.

4 Experimental Evaluation

Datasets. COCO [20] is currently the most widely used object detection dataset and benchmark. The images cover complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentation to aid in precise object localization [20]. We use annotations for 80 “thing” object types from the 2017 train/val split, with a total of 886,284 labeled instances in 122,266 images. The scenes range from dining table close-ups to complex traffic scenes.

LVIS [10] builds on the COCO 2017 images but distinguishes 1203 object categories with a total of 1,270,141 labeled instances in 100,170 training images alone. The class occurrences follow a discrete power law distribution, and the dataset is used as a benchmark for long-tailed object detection.

COCO 65/15 reuses the images and annotations from COCO 2017, but holds out images with instances from 15 object types in the training set. We use the class selection as well as dataset-split from [30].

Evaluation Metrics. We evaluate our proposed hyperbolic classification head for 2D object detection on the challenging COCO 2017 test-dev and the long-tailed LVIS v1 benchmark. Additionally, we evaluate the visual-to-semantic mapping performance on the zero-shot detection task using the class split proposed in [30]. Our evaluation metric is the mean average precision (mAP), which measures the area under the precision-recall curve for detections averaged over thresholds for IoU \(\in [0.5 : 0.05 : 0.95]\) (COCO’s standard metric). For closed-set object detection, we compare the mean over all classes. For long-tailed object detection on the LVIS v1 dataset, we also provide the mean over frequent, common, and rare classes. For zero-shot evaluation, we report average precision as well as recall for the 65 seen and the 15 unseen classes separately. We additionally report \(AP_{cat}\), a modification of the average precision metric that counts a detection as a true positive if the IoU between the predicted and the groundtruth bounding box exceeds a threshold \(\in [0.5 : 0.05 : 0.95]\) and the detection is assigned the class label of any class with the same super-category in the COCO 2017 stuff [4] label hierarchy as the groundtruth class.
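The relaxed true-positive criterion behind \(AP_{cat}\) can be sketched as follows. This is an illustrative reading of the definition above, not the evaluation code used for the experiments; the box format `(x1, y1, x2, y2)`, the function names, and the small `SUPERCATEGORY` excerpt are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy excerpt of a class-to-supercategory mapping (illustrative only).
SUPERCATEGORY = {"car": "vehicle", "truck": "vehicle", "dog": "animal"}


def is_true_positive_cat(pred_box, pred_label, gt_box, gt_label, iou_thr=0.5):
    """True if the boxes overlap enough and both labels share a supercategory."""
    same_cat = SUPERCATEGORY[pred_label] == SUPERCATEGORY[gt_label]
    return same_cat and iou(pred_box, gt_box) > iou_thr


# IoU thresholds 0.5, 0.55, ..., 0.95 over which the metric is averaged.
IOU_THRESHOLDS = [0.5 + 0.05 * k for k in range(10)]
```

Under this criterion, a truck prediction on a car instance counts as correct if the box is well localized, whereas a dog prediction on the same box does not.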

Training Protocol. We use the PyTorch [29] framework for implementing all the architectures, and we train our models on a system with an Intel Xeon@2.20GHz processor and NVIDIA TITAN RTX GPUs. All the methods use the ResNet-50 backbone with weights pre-trained on the ImageNet dataset [8] for classification, and extract multi-scale feature maps using an FPN with randomly initialized parameters. The hyperparameter settings, loss configuration, and training strategy principally follow the baseline configurations for maximum comparability, i.e., [32] for Faster R-CNN KGE, [43] for CenterNet KGE, and [6] for DETR KGE configurations. Please refer to the supplementary material for a detailed overview of hyperparameter settings and schedules. We train all the networks using an Adam optimizer with weight decay [23] and gradient clipping. We use multi-scale training for all the models with a minimum training size of 480 pixels side length, with random flipping and cropping for augmentation.

4.1 Benchmark Results

COCO 2017 Benchmark. To compare the behaviors of hyperbolic classification heads with their baseline configurations, we evaluate the methods on the challenging COCO 2017 dataset. All object detectors were trained on 122,000 training images and tested on 5,000 validation images following the standard protocol [34, 43, 44].

Table 1. COCO 2017 val results for methods using linear layers (top row) compared to hyperbolic (bottom row) classification heads.
Table 2. COCO 2017 val results using single-scale testing. Each section compares the results given identical networks but a linear classification head (top row) with a hyperbolic classification head (bottom row). The error metrics \(E_x\) were computed as proposed by Bolya et al. [2].

Baselines: We incorporate our hyperbolic classification heads into the two-stage, keypoint-based, and transformer-based object detector architectures as described in Sect. 3.3. Consequently, we compare these methods against the standard configurations using a linear classification head in the Euclidean embedding space.

Discussions: The results in Table 1 indicate that hyperbolic classification heads consistently outperform their Euclidean baseline configurations on the COCO 2017 val benchmark. The Sparse R-CNN configuration achieves a substantial increase in the mean average precision of \(+1.2\%\), without changes to the architecture and training strategy, only by modifying the algebra of the embedding space. The hyperbolic classification head’s impact on various aspects of object detection is shown in Table 2 for the COCO 2017 val set. Surprisingly, the main benefits of the hyperbolic embeddings in Table 2 arise from a sharper contrast between the background and foreground detections, as the false positive error \(E_{FP}\) is consistently lower than for the Euclidean baseline. The marginally increased classification error \(E_{cls}\) for DDETR and Sparse R-CNN in the hyperbolic variants is more than compensated by the lower number of false positives. Additionally, the \(mAP_{cat}\) is consistently higher for the hyperbolic methods, which we find the most striking result. This suggests that even though classification errors occur in both methods, the hyperbolic classification space appears to inherently learn a semantic structure such that classification errors are more often within the same category than for Euclidean methods. We argue that it therefore makes “better” detection errors, as the misclassifications are still within the same supercategory and therefore more related to the true class.

LVIS V1 Benchmark. The purpose of the LVIS v1 experiments is to study the behavior of hyperbolic classification heads with a large set of object types and imbalanced class occurrences. Table 3 shows the results on the LVIS v1 val set for baseline methods as well as their counterparts using a hyperbolic classification head.

Table 3. LVIS val results for various models with linear classification head and hyperbolic classification head. All methods were trained by repeat factor resampling [10] by a factor of 0.001.

Baselines: We compare our method against CenterNet2 using a federated loss [42] that computes a binary cross-entropy loss value over a subset of \(|S|=50\) classes. The subset S changes every iteration and is composed of all object types in the mini-batch’s groundtruth, padded with randomly sampled negative classes. Additionally, we trained a Faster R-CNN model using an EQLv2 loss [35], a mechanism to compensate for class imbalances in the dataset by equalizing the gradient ratio between positives and negatives for each class.
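The construction of the class subset S can be sketched as follows. The sketch assumes uniform sampling of the negative padding classes and keeps all groundtruth classes even if they exceed the subset size; both choices are simplifying assumptions of this example, and the sampling distribution used in [42] may differ. The function name `sample_federated_classes` is hypothetical.

```python
import random


def sample_federated_classes(gt_classes, num_classes, subset_size=50, rng=None):
    """Build the class subset S for one training iteration (illustrative sketch)."""
    rng = rng or random.Random()
    # All object types present in the mini-batch groundtruth are always kept.
    positives = sorted(set(gt_classes))
    if len(positives) >= subset_size:
        return positives
    # Pad with negatives; uniform sampling is an assumption of this sketch.
    positive_set = set(positives)
    negatives = [c for c in range(num_classes) if c not in positive_set]
    padding = rng.sample(negatives, subset_size - len(positives))
    return positives + padding


# Example: a mini-batch containing three distinct classes out of 1203 LVIS categories.
S = sample_federated_classes([3, 7, 7, 1000], num_classes=1203, rng=random.Random(0))
```

The binary cross-entropy would then be evaluated only on the columns of the score matrix indexed by `S`.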

Discussions: We observe a consistent improvement in detection accuracy with fine-grained object types in hyperbolic embedding space, as both hyperbolic methods outperform their Euclidean counterparts on precision for frequent classes \(AP_f\). However, the hyperbolic classifiers perform worse on rare and common classes. This effect is largest for the Faster R-CNN model trained with an EQLv2 [35] loss, which was initially proposed for Euclidean embedding spaces. We therefore suggest that methods optimized in the Euclidean space cannot be straightforwardly applied to the hyperbolic space. An improved class-balancing strategy needs to be designed for hyperbolic embeddings for long-tailed object detection, one that addresses both the class imbalance by sampling and the impact that negatives have on the gradients.

Zero-Shot Evaluation. Next, we assess the zero-shot abilities of hyperbolic embeddings on the COCO 2017 dataset using the 65/15 class split proposed by Rahman et al. [30]. Zero-shot object detection requires learning a mapping from visual to semantic feature space, such that the detector recognizes unseen object types given only their semantic representations. We investigate the behavior with semantic representations from word embeddings learned from the Wikipedia corpus by the word2vec method [25] and rely on the formulation for hyperbolic space by Leimeister et al. [18]. Recent advances in zero-shot object detection rely on synthesizing unseen classes during training [11], or learn a reprojection of the semantic embedding vectors [40]. However, we reserve these tuning strategies of the network architecture and training pipeline for future work and focus on the straightforward mapping from vision features to semantic class prototypes and a reprojection of embedding vectors as baselines.

Table 4. Precision and recall on seen as well as unseen classes for COCO 2017 65/15 split. An IoU threshold of 0.5 is used for computing recall and average precision. HM refers to the harmonic mean of seen and unseen classes.
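For reference, the harmonic mean (HM) combines the seen and unseen scores so that a model must perform well on both to obtain a high value. The sketch below is a generic illustration; the function name is an assumption, and the example values are arbitrary and not taken from Table 4.

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of a seen-class score and an unseen-class score."""
    if seen + unseen == 0:
        return 0.0
    return 2.0 * seen * unseen / (seen + unseen)


# Arbitrary illustrative values: the HM is dominated by the weaker of the two scores.
balanced = harmonic_mean(0.5, 0.5)
skewed = harmonic_mean(0.9, 0.1)
```

Although both pairs have the same arithmetic mean, the skewed pair yields a much lower harmonic mean than the balanced one.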

Baselines: We compare our method against object detectors using word2vec word embeddings as class prototypes and trained with a polarity loss [30] and a reconstruction loss [41]. The latter additionally learns a reprojection of semantic embedding vectors into the visual domain. We further provide results for a Sparse R-CNN method trained using word2vec word embeddings but using a classifier head in Euclidean space. This method was trained with the same hyperparameters as its hyperbolic variant. We train our hyperbolic classifier using a focal loss [19].

Discussions: The zero-shot performance of the hyperbolic classification head is shown in Table 4. The hyperbolic configurations outperform their naive baselines on average precision for seen and unseen classes, even though the baseline methods rely on more sophisticated training losses. The recall of groundtruth boxes appears to be dependent on the choice of loss function, as the Faster R-CNN baseline using the reconstruction loss proposed in [41] achieves higher recall even though it yields the lowest precision on detecting unseen objects. The zero-shot performance using the hyperbolic Sparse R-CNN architecture shows superior results compared to all the baseline models and loss functions, even its euclidean counterpart trained with the exact same setting. We take this as an indication that the hyperbolic embedding space does not negatively affect the recall.

4.2 Qualitative Results

In Fig. 2, we show qualitative detection results from two classifiers trained on the COCO 2017 train set and classes. Detections on the left (a) were generated by object detection architectures using learnable class prototypes in the Euclidean space, while detections on the right (b) were generated by detectors using our proposed hyperbolic MLR classifier. Interestingly, the accuracies of the two detectors are comparable, even though Euclidean class prototypes resulted in a false positive in Fig. 2(a). Another noticeable difference is the composition of the top-3 class scores, which provides insight into the structure of the embedding spaces. While most bounding boxes were classified correctly by both classifiers, the learned hyperbolic embeddings appear to have grouped categorically similar concepts, as the detector predicts classes of vehicle types for the car object instance. For the person instance, there are no equivalent categorical classes, so a living thing neighborhood and a neighborhood of frequently co-occurring classes containing bicycle seem to have emerged in the embedding space.

Fig. 2. Qualitative results for Sparse R-CNN model trained on full COCO 2017 classes with an (a) Euclidean and (b) hyperbolic classification head.

Fig. 3. Qualitative results for CenterNet2 model trained on LVIS v1 classes with an (a) Euclidean and (b) hyperbolic classification head. The children are wearing ballet skirts, a rare class in the LVIS v1 dataset with \(<10\) training samples.

Fig. 4. Qualitative results for a Faster R-CNN model trained on the seen classes of the COCO 2017 65/15 split with an (a) Euclidean and (b) hyperbolic classification head on word2vec semantic embeddings. The image shows a seen car instance as well as an unseen airplane instance.

The example predictions for LVIS v1 classes are shown in Fig. 3. The hyperbolic method recognizes even partly visible objects, and its top-3 predictions are more semantically similar, such as headband and bandanna for the child in the center. This indicates that the embeddings capture more conceptual similarities than the Euclidean classifier. However, “rare” classes are missing from the top-3 predictions; as a result, the hyperbolic method fails to assign the correct class ballet skirt for the tutu worn by the child on the right.

Figure 4 presents the zero-shot results for an unseen airplane instance. Here, the hyperbolic model also yields categorically similar predictions when given semantic word embeddings. Surprisingly, this effect is not as strong for the Euclidean classifier (a), even though it maps the visual features to word embeddings. However, we note that the confidence of the hyperbolic method is considerably lower (i.e., the distance in embedding space is larger) for the unseen class and the airplane bounding box appears less accurate. This could be mitigated by using more sophisticated zero-shot pipelines such as the reconstruction loss [41] or synthetic training samples [11].

5 Conclusions

In this work, we proposed to use hyperbolic embedding spaces to learn class prototypes in object detection. We extended two-stage, keypoint-based, and transformer-based object detectors to incorporate hyperbolic classification heads. Evaluations on closed-set, long-tailed, as well as zero-shot settings showed that the hyperbolic methods outperformed their Euclidean baselines on classes with sufficient training samples. The hyperbolic classification heads resulted in considerably fewer false positives and produced “better” classification errors, i.e., misclassified labels were more often within the same supercategory as the true classes. We therefore conclude that hyperbolic geometry provides a promising embedding space for object detection and deserves future work to design optimized class-balancing and zero-shot training frameworks.