Fast SIMDized Kalman filter based track fit

doi:10.1016/j.cpc.2007.10.001

Computer Physics Communications

Volume 178, Issue 5, 1 March 2008, Pages 374-383

https://doi.org/10.1016/j.cpc.2007.10.001

Abstract

Modern high energy physics experiments have to process terabytes of input data produced in particle collisions. The core of many data reconstruction algorithms in high energy physics is the Kalman filter, so the speed of Kalman filter based algorithms is of crucial importance in on-line data processing. This is especially true for the combinatorial track finding stage, where the Kalman filter based track fit is used very intensively. Developing fast reconstruction algorithms that use the maximum available power of processors is therefore important, in particular for the initial selection of events which carry signals of interesting physics.

One such powerful feature, supported by almost all up-to-date PC processors, is a SIMD instruction set, which allows several data items to be packed into one register and operated on simultaneously, thus achieving more operations per clock cycle. The novel Cell processor extends the parallelization further by combining a general-purpose PowerPC processor core with eight streamlined coprocessing elements which greatly accelerate vector processing applications.

In the investigation described here, after a significant memory optimization and a comprehensive numerical analysis, the Kalman filter based track fitting algorithm of the CBM experiment has been vectorized using inline operator overloading. Thus the algorithm continues to be flexible with respect to any CPU family used for data reconstruction.

As a result of all these changes, the SIMDized Kalman filter based track fitting algorithm takes 1 μs per track, which is 10000 times faster than the initial version. Porting the algorithm to a Cell Blade computer gives a further speedup by a factor of 10.

Finally, we compare performance of the tracking algorithm running on three different CPU architectures: Intel Xeon, AMD Opteron and Cell Broadband Engine.

Introduction

Finding particle trajectories is usually the most time consuming part of modern experiments in high energy physics [1]. In many present experiments with high track densities and complicated event topologies, a Kalman filter [1], [2] based track fit is already used at this combinatorial part of the event reconstruction. Therefore the speed of the track fitting algorithm becomes very important for the total processing time.

CBM [3] is a dedicated heavy-ion experiment with fixed target to investigate the properties of highly compressed baryonic matter as it is produced in nucleus-nucleus collisions at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany. Large track densities (on average 500 tracks in the main tracker for a typical central Au + Au collision) together with the presence of a non-homogeneous magnetic field make the reconstruction of events challenging. The track reconstruction procedure in the CBM experiment is based on the cellular automaton track finder and the Kalman filter track fitter [4], [5]. To achieve a high track finding efficiency the Kalman filter fitting algorithm is intensively used within the track finder.

Motivated by the idea of using the SIMD unit of modern processors (e.g., [6]), we have investigated here a chain of modifications of the Kalman filter based track fitting algorithm in order to increase the speed of the track finding stage of the event reconstruction. In the CBM experiment the track fitting algorithm based on the conventional Kalman filter is implemented using scalar instructions. Thus we have started with the double precision scalar version of the conventional Kalman filter based track fitting algorithm.

The algorithm uses a 70 MB map of the magnetic field and therefore permanently accesses the main memory, which is slow relative to the cache. However, as in other high energy physics experiments, the non-homogeneous magnetic field of the CBM experiment is smooth enough to be locally approximated by a polynomial of fourth order. With the polynomial approximation of the magnetic field, the algorithm operates within the cache, which results in a significant increase in speed without degradation of the tracking precision.
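The idea of replacing a large field map with a local fourth-order polynomial can be sketched as follows. This is a minimal illustration, not the CBM implementation: the type name FieldPoly4, the choice of a polynomial in two coordinates (x, y), and the ordering of the 15 coefficients are all assumptions made for the example. The point is that a handful of coefficients fits in cache, whereas a lookup in a 70 MB map does not.

```cpp
#include <array>
#include <cstddef>

// Hypothetical local approximation of one field component by a polynomial
// of total degree 4 in (x, y): sum over i + j <= 4 of c_ij * x^i * y^j.
// Assumed coefficient layout: rows of fixed j, each row ordered by rising i:
//   j = 0: c[0..4], j = 1: c[5..8], j = 2: c[9..11], j = 3: c[12..13], j = 4: c[14]
struct FieldPoly4 {
    std::array<float, 15> c{};  // 15 coefficients, zero-initialized

    float eval(float x, float y) const {
        float result = 0.0f;
        float ypow = 1.0f;   // current power of y
        std::size_t k = 0;   // index of the first coefficient in the row
        for (int j = 0; j <= 4; ++j) {
            const int deg = 4 - j;  // highest power of x in this row
            float row = 0.0f;
            // Horner scheme in x for the row with fixed power of y
            for (int i = deg; i >= 0; --i) {
                row = row * x + c[k + static_cast<std::size_t>(i)];
            }
            result += row * ypow;
            k += static_cast<std::size_t>(deg + 1);
            ypow *= y;
        }
        return result;
    }
};
```

In use, one such object per field component and detector region would be filled once from the field map, and eval(x, y) would then replace the map lookup inside the fit.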

In order to further optimize memory usage, the precision of all data in the algorithm has been changed from double to single. As a result, twice as much data can be stored in the cache and, later, twice as much data can be packed into a SIMD register, effectively doubling the throughput. The conventional Kalman filter algorithm exhibits unstable behavior when using only single precision numbers (see also [7], [8]). Therefore the Kalman filter algorithm has been specially investigated and modified in order to avoid such instability due to roundoff errors. In addition, the algorithm has been mathematically and numerically optimized, especially in the parts of initial track parameter estimation and propagation in the magnetic field [5].

In the next step, the algorithm has been adapted to use a SIMD instruction set. The adaptation has been done by inline operator overloading, which keeps the source code of the algorithm unchanged. Therefore, both versions, scalar and SIMDized, are equivalent and can be selected by a compile time option. Furthermore, this approach gives a unified way of dealing with different CPU families which implement different SIMD instruction sets.
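The operator overloading approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of this work: the type F4, the helper extrapolate, and the use of the compiler macro __SSE__ to select the implementation are all hypothetical. The SSE intrinsics used (_mm_set1_ps, _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps, _mm_sub_ps, _mm_mul_ps) are the standard ones from xmmintrin.h; a plain array fallback is used when SSE is not available.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>

// Hypothetical 4-wide single precision vector wrapping an SSE register.
struct F4 {
    __m128 v;
    explicit F4(float s) : v(_mm_set1_ps(s)) {}  // broadcast one value
    explicit F4(__m128 r) : v(r) {}
    static F4 load(const float* p) { return F4(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
};
inline F4 operator+(F4 a, F4 b) { return F4(_mm_add_ps(a.v, b.v)); }
inline F4 operator-(F4 a, F4 b) { return F4(_mm_sub_ps(a.v, b.v)); }
inline F4 operator*(F4 a, F4 b) { return F4(_mm_mul_ps(a.v, b.v)); }

#else

// Portable fallback with the same interface, one lane at a time.
struct F4 {
    float v[4];
    explicit F4(float s) { for (int i = 0; i < 4; ++i) v[i] = s; }
    static F4 load(const float* p) {
        F4 r(0.0f);
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
};
inline F4 operator+(F4 a, F4 b) { for (int i = 0; i < 4; ++i) a.v[i] += b.v[i]; return a; }
inline F4 operator-(F4 a, F4 b) { for (int i = 0; i < 4; ++i) a.v[i] -= b.v[i]; return a; }
inline F4 operator*(F4 a, F4 b) { for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i]; return a; }

#endif

// The algorithm is written once in terms of ordinary operators.
// With T = float it is scalar code; with T = F4 the same source
// processes four tracks at once.
// Example step: straight-line extrapolation x' = x + tx * dz.
template <typename T>
inline T extrapolate(T x, T tx, T dz) {
    return x + tx * dz;
}
```

Calling extrapolate(1.0f, 0.5f, 2.0f) runs the scalar version, while extrapolate(F4::load(x), F4::load(tx), F4(2.0f)) applies the same formula to four packed values. Supporting another instruction set then only requires another definition of the vector type and its operators.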

Finally, the SIMDized version of the algorithm has been ported to the Cell processor [9], [10]. Initially designed for a game console, the Cell processor promises extremely high computing capabilities. The Cell processor consists of a general-purpose PowerPC processor core (PPE) connected to eight special-purpose streamlined coprocessing synergistic processing elements (SPE), which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. Cell combines the considerable floating point resources required for demanding numerical algorithms with a power efficient software-controlled memory hierarchy. The current implementation of Cell is most often noted for its extremely high performance single precision arithmetic. Even though single precision is widely considered insufficient for many scientific applications, it is fully adequate for the reformulated Kalman algorithm. The Cell processor is particularly compelling because it is expected to be produced in high volumes and to be cost competitive with commodity PC CPUs.

Using the IBM Cell Broadband Engine SDK [9], [10], the algorithm was first ported to the PPE and modified to use the AltiVec vector instructions [11], and then ported to the SPE with the corresponding SPE specific vector instructions. After extensive tests on a Cell simulator, the algorithm has been run on a Cell Blade computer.

In the end, the performance of the SIMDized version of the Kalman filter based track fitting routine has been evaluated on three different computer architectures: Intel Xeon, AMD Opteron and Cell Broadband Engine.

Section snippets

SIMD architecture

There are three important classes of computer architectures based upon the number of concurrent instruction and data streams:

• Single instruction, single data stream (SISD): a single instruction stream on scalar data.
• Single instruction, multiple data streams (SIMD): multiple data streams against a single instruction stream to perform operations which may be naturally parallelized.
• Multiple instruction, multiple data (MIMD): many functional units perform different operations on different data.

The

Cell Broadband Engine

Cell¹ [9], [10] is a microprocessor architecture jointly developed by a Sony, Toshiba, and IBM alliance known as STI. Cell combines a general-purpose Power-architecture core of modest performance with multiple streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation. The resulting

Kalman filter method

The Kalman filter method [1], [2] is intended for finding the optimum estimate $r$ of an unknown vector $r^{t}$ according to the measurements $m_{k}$, $k = 1 \dots n$, of the vector $r^{t}$.

The Kalman filter starts with a certain initial approximation $r = r_{0}$ and refines the vector r, consecutively adding one measurement after the other. The optimum value is attained after the addition of the last measurement.

The vector $r^{t}$ can change from one measurement to the next: $r_{k}^{t} = A_{k} r_{k - 1}^{t} + ν_{k},$ where $A_{k}$ —a linear operator, $ν_{k}$ —a process

Speedup of the algorithm

The Kalman filter method is used both in the track finding and track fitting routines of the CBM experiment. The track finder is based on the cellular automaton method [4]. The algorithm creates short track segments (triplets) locally in neighboring detector planes and links them into track candidates, which are then selected using the $χ^{2}$-criterion. The Kalman filter based routines are used at all stages of the track finder in order to reliably estimate parameters of the track segments and

Results and discussion

The Kalman filter based track fitting algorithm has been tested on simulated data of the CBM experiment [3], [4].

In the CBM experiment with forward geometry the natural choice of the state vector³ is: $r = \{x, y, t_{x}, t_{y}, q / p\},$ where $x$ and $y$ are track coordinates at the reference $z$-plane, $t_{x} = \tan θ_{x}$ is the track slope in the $xz$ plane, $t_{y} = \tan θ_{y}$ is the track slope in the $yz$ plane, $q / p$ is the inverse particle

Conclusion

The Kalman filter based track fitting algorithm, the core algorithm of the event reconstruction software in high energy physics experiments, has been significantly optimized and adapted to a vector form implementing different SIMD instruction sets: SSE2 of the Intel and AMD CPUs, AltiVec of the PPE and the specialized SIMD instruction set of the SPE of the Cell processor. Overloading basic scalar operators by corresponding vector instructions keeps the source of the algorithm unchanged, thus

Acknowledgements

We wish to thank M. Engler, J. Franz and J. Jordan from the IBM Laboratory, Böblingen, for giving us the possibility to test the algorithm on the Cell Blade system. We would like to thank also Drs. T. Steinbeck and R. Weis from the Kirchhoff Institute for Physics, University of Heidelberg, for their technical assistance.

We acknowledge the support of the European Community-Research Infrastructure Activity under the FP6 “Structuring the European Research Area” programme (HadronPhysics, contract

References (14)

I. Kisel, Event reconstruction in the CBM experiment, Nucl. Instr. Methods A (2006)
S. Gorbunov et al., Analytic formula for track extrapolation in non-homogeneous magnetic field, Nucl. Instr. Methods A (2006)
R. Frühwirth, Data Analysis Techniques for High-Energy Physics (2000)
R.E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME, Series D, J. Basic Eng. (1960)
Compressed Baryonic Matter Experiment, Technical Status Report, GSI, Darmstadt, 2005; 2006 Update
IA-32 Intel Architecture Optimization Reference Manual, Intel, June...
M.S. Grewal et al., Kalman Filtering: Theory and Practice using MATLAB (2001)


