1 Introduction
Automatic Number Plate Recognition (ANPR) systems have gained significant attention, particularly in the context of intelligent transportation systems, with widespread implementation in various countries. These systems play a pivotal role in tasks such as traffic law enforcement, traffic monitoring, and vehicle park management. Beyond conventional applications, ANPR systems are instrumental in facilitating tasks like toll collection, entrance and exit management in vehicle parks, and enforcing security measures in restricted areas such as military campsites and protected sanctuaries. Their versatile utility extends to fraud prevention and heightened security measures in specific regions, aiding in locating missing vehicles or those associated with criminal activities.
The deployment of ANPR systems significantly reduces the need for extensive human labor, time, and resources that would otherwise be required for similar tasks. Moreover, manual intervention in such activities introduces the risk of erroneous interpretations, while reading license plates of moving vehicles efficiently poses practical challenges for human operators.
The unique challenges in the Bangladeshi ANPR landscape stem from the variability in license plate designs and the scarcity of labeled data. Traditional approaches often fall short in delivering consistent and accurate results in such dynamic and diverse scenarios.
The major contributions of this research are:
• Hybrid Architecture: The combination of YOLO and Transformers showcases the novelty of a hybrid architecture in which the strengths of object detection and sequence-based recognition are seamlessly integrated for comprehensive ANPR, as sketched below.
• Dataset Enrichment: We collected 40,000 real-world images from various locations across Bangladesh to address the scarcity of labeled Bangladeshi vehicle images. Despite this, we encountered certain edge cases for which real-world data was unavailable; to cover these cases, we generated 60,000 synthetic images.
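To make the detector-recognizer hand-off concrete, the following is a minimal sketch of the hybrid flow: a YOLO model proposes plate bounding boxes, and each crop is passed to a transformer-based recognizer. The hub checkpoint, confidence threshold, and the `recognize` stub are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
from PIL import Image

# Minimal sketch of the hybrid flow. The hub checkpoint and confidence
# threshold are placeholders; a plate-finetuned detector is assumed in practice.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def recognize(plate_crop: Image.Image) -> str:
    """Stub for the transformer-based recognizer (a DONUT sketch is given in the Results section)."""
    raise NotImplementedError

image = Image.open("vehicle.jpg")
detections = detector(image).xyxy[0]              # rows: x1, y1, x2, y2, confidence, class
for x1, y1, x2, y2, conf, cls in detections.tolist():
    if conf < 0.5:                                # keep confident plate proposals only
        continue
    plate_text = recognize(image.crop((x1, y1, x2, y2)))
    print(plate_text)
```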
2 Related Work
The Automatic Number Plate Recognition (ANPR) system has been a focus of research for many years, with researchers around the world exploring various methods to enhance its development. Abdullah et al. [1] utilized YOLOv3 for license plate detection and ResNet-20 for character recognition. Their dataset consisted of 1,500 license plate images and 6,400 character images for training the localization and recognition models, respectively. They reported an accuracy of 92.7%. However, their approach only targeted plates from the Dhaka Metropolitan Area, limiting its ability to generalize to other cities. Dhar et al. [11] proposed a Shape Validation Technique for license plate detection, followed by tilt correction and Connected Component Analysis to segment text, characters, and digits. For recognition, they employed an AdaBoost classifier using two key features: Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP). Their dataset included 2,800 images across 14 different classes, achieving an accuracy of 97.2%. Sarif et al. [27] proposed a system that uses YOLOv3 for license plate localization and a custom segmentation algorithm to extract text, characters, and digits from the plates. These segmented elements were then fed into a CNN model for recognition, achieving 97.5% accuracy. However, their model was tested on only 16 different classes, which is insufficient for real-world scenarios involving Bangladeshi vehicle license plates. Additionally, the dataset primarily consisted of private vehicles from Dhaka, making their claims less robust when applied to license plates of commercial vehicles or those from other regions. Saif et al. [26] proposed using the YOLOv3 model for both number plate localization and recognition. Their dataset, however, was limited to just 1,050 images of private vehicles. While they reported an accuracy of 99.5%, this claim does not hold for commercial vehicle license plates, which were not included in their dataset. Additionally, their accuracy measurement was based on a binary evaluation of the entire license plate, rather than a more granular, character-level approach.
Kumari et al. [18] proposed an approach that applies image preprocessing techniques followed by Contour Tracing and Edge Detection for license plate localization. For character segmentation and recognition, they utilized neural network models, aiming to enhance the accuracy of the overall system. Ahmed et al. [3] and Choudhary et al. [9] primarily focused on the recognition aspect of license plates. In [3], Ahmed et al. employed Horizontal and Vertical Projection along with Gray Level Occurrence to extract readable text from plates. In contrast, Choudhary et al. [10] used a combined CNN-LSTM model for character segmentation and recognition, achieving a claimed success rate of 99.64%. Venkateswari et al. [31] focused on license plate localization, utilizing the highest Horizontal and Vertical histogram values to extract the Region of Interest (ROI) for accurate plate detection. In [30], Surekha et al. reported achieving an accuracy of 97%. They performed several image preprocessing operations and compared Morphological Processing with Edge Processing for license plate area extraction. For character extraction, they utilized Connected Component Analysis and recognized the characters using a supervised learning model.
Most of the proposed systems are not well-suited for Bangladeshi vehicle license plates, as many are tailored to specific regions, languages, and types of license plates. However, some prior work has been conducted specifically for Bangladeshi license plates. For instance, Nooruddin et al. [21] proposed utilizing color features in conjunction with MinPool and MaxPool features to enhance license plate detection. Amin et al. [5] proposed a system that combines Edge Detection, Binary Thresholding, and Hough Transformation for license plate localization, followed by Optical Character Recognition (OCR) for recognizing text in the Bangla language. However, their approach did not achieve notable accuracy and lacks generalizability across different contexts. In their paper, Baten et al. [8] proposed a method that leverages a unique feature of the Bangla language known as "Matra", along with Connected Component Analysis, for text detection and segmentation. They then employed Template Matching for the recognition phase. However, they provided limited information regarding their dataset and the accuracy of their approach. Abedin et al. [2] proposed using Contour Properties for both license plate detection and character segmentation, followed by a CNN model for character recognition. They reported an overall accuracy of 92% with a processing time of 0.11 seconds. However, their dataset primarily consisted of private vehicles, and they did not account for all vehicle categories or focus on performance under night conditions. Rahman et al. [23] concentrated solely on the recognition task, requiring manual extraction of license plates and individual characters from the images. They then utilized a CNN model to recognize the characters. Their dataset consisted of 1,750 images, which involved considerable effort to compile.
In [7], Azam et al. focused primarily on noise removal from images to enhance the detection of license plate regions, achieving a detection accuracy of 94%. Their approach included the use of a frequency domain mask to eliminate rain strokes, a contrast enhancement method, Radon transform for tilt correction, and an image entropy-based technique to filter license plate regions. Hossain et al. [13] developed a system based on various image processing operations, utilizing the Sobel edge operator, dilation, erosion, boundary features, and horizontal and vertical projection to extract license plate regions. They then divided the extracted plate region into two halves, using boundary features for character segmentation and Template Matching for recognition. However, their system struggles with ambiguous character recognition and images tilted beyond 10 degrees. They claimed 90% accuracy. Chowdhury et al. [10] extracted the license plate region using color information and segmented it into two halves based on centroid data, followed by character extraction using bounding box parameters. They used a Support Vector Machine (SVM) for character recognition and claimed a 99.3% accuracy rate. However, their system was limited to private vehicle images and struggled when the license plate was out of focus or when the image quality was not ideal. Furthermore, their testing was restricted to only 14 classes, limiting its applicability. In [15], Islam et al. used Horizontal and Vertical projections along with geometric properties to extract license plate regions after preprocessing. Character localization was performed using Connected Component Analysis and bounding box techniques. For character recognition, they employed an SVM model using features extracted with Histogram of Oriented Gradients (HOG). While they achieved high recognition accuracy, their system did not account for non-ideal conditions: it failed when image resolution was low and struggled to detect license plates from commercial vehicles. Ahsan et al. [4] proposed a system that uses Template Matching to localize the license plate region, employs Spatial Super Resolution techniques to enhance image quality, and utilizes the Bounding Box method for character segmentation. They used AlexNet for character recognition, achieving a high accuracy of 98.2%. However, they did not provide details about the number of classes AlexNet was trained on. Additionally, the Template Matching technique often struggles to detect targets when the license plate is tilted in the image.
Quadri et al. [22] employed a Smearing algorithm to extract the license plate region, followed by row and column segmentation for Optical Character Recognition (OCR) to recognize the text from the plate. Shidore et al. [28] utilized the Sobel Filter, Morphological Operations, Connected Component Analysis, and Vertical Projection Analysis for license plate detection, and employed an SVM for character recognition. Lekhana et al. [19] presented an approach that combines Spectral Analysis with Connected Component Analysis for detecting license plate regions, followed by the use of an SVM for character recognition. Astari et al. [6] reported achieving significant accuracy in their paper, where they proposed a system utilizing color features and a hybrid classifier combining a Decision Tree and an SVM for license plate detection and recognition. Wang et al. [33] employed Image Processing techniques for the license plate localization and segmentation stages, and used a Convolutional Neural Network (CNN) model for character recognition. Jain et al. [16] utilized Image Processing techniques with Sobel Edge Detection for license plate localization, followed by Optical Character Recognition (OCR) to recognize the characters on the license plate. Lin et al. [20] employed the YOLOv2 model for vehicle and license plate localization, used classic Image Processing operations for segmentation, and implemented a custom LPR-CNN model for character recognition.
3 Dataset
The Bangladesh Road Transport Authority (BRTA) serves as the regulatory agency tasked with overseeing, managing, and enforcing discipline and safety in the country’s road transport sector. In 2012, BRTA launched a new vehicle license plate system called the Retro-Reflective License Plate, widely known as the digital license plate, as part of its digitalization efforts. Since its rollout, it has become mandatory for vehicles to display this license plate on their rear.
The digital license plates are classified into two categories: one for private vehicles and the other for trading vehicles. Private vehicle plates have a white background with black text (Fig. 1a), while trading vehicle plates feature a green background with black text (Fig. 1b). Each plate contains two separate rows of text, characters, and numbers.
In the top row, the first word indicates the district where the vehicle was registered. The optional second word identifies the area if the vehicle is registered in a metropolitan zone. The only character in this row, separated by a hyphen, denotes the category of the vehicle.
In the bottom row, the first two digits represent the vehicle’s class registration number, followed by four additional digits separated by a hyphen, which together constitute the vehicle’s serial number. It is mandatory for the license plates to display information in the Bangla language.
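As a concrete illustration of this two-row layout, the sketch below validates and splits a plate string into its fields. The romanized example, regular expression, and field names are our own assumptions for readability; actual plates are written in Bangla script.

```python
import re

# Illustrative pattern for a transliterated plate, e.g. "DHAKA METRO-GA 11-2233".
# Real plates use Bangla script; the field names are hypothetical, not BRTA terminology.
PLATE_RE = re.compile(
    r"^(?P<district>[A-Z]+)"                    # registration district, e.g. DHAKA
    r"(?: (?P<metro_area>[A-Z]+))?"             # optional metropolitan area, e.g. METRO
    r"-(?P<category>[A-Z]+)\s+"                 # vehicle-category letter after the hyphen
    r"(?P<class_no>\d{2})-(?P<serial>\d{4})$"   # class registration number + four-digit serial
)

def parse_plate(text: str):
    match = PLATE_RE.match(text.strip().upper())
    return match.groupdict() if match else None

print(parse_plate("Dhaka Metro-GA 11-2233"))
# {'district': 'DHAKA', 'metro_area': 'METRO', 'category': 'GA', 'class_no': '11', 'serial': '2233'}
```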
We collected a comprehensive dataset of vehicle and license plate images specific to Bangladeshi vehicles, along with their corresponding annotations.
The dataset was significantly enriched by contributions from Hossain et al. [14], which includes a combination of images sourced from Nooruddin et al. [21] and additional images collected by the authors. The first subset of this dataset comprises approximately 2,800 images designed for vehicle localization, while the second subset contains around 4,000 license plate images, which were cropped from the initial dataset for focused analysis. Another dataset was introduced by Shomee et al. [29], who compiled a detailed collection comprising 1,928 images for vehicle localization (Fig. 2) and an additional 2,662 license plate images. The second subset includes 720 synthetic images and 1,942 manually cropped images, which were derived from the localization dataset.
We combined these two datasets, along with their annotations, and integrated them with our own collected images to create a more comprehensive dataset for vehicle and license plate recognition tasks. For localization, both datasets included bounding box annotations for license plates. However, text extraction posed a greater challenge due to a mismatch in the number of annotation classes across the datasets.
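One pragmatic way to resolve such a class mismatch is to normalize both label maps into a single merged vocabulary before training. The sketch below illustrates the idea with hypothetical label maps; the class names and ids shown are not the actual inventories of the source datasets.

```python
# Hypothetical label maps for the two source datasets; the real inventories
# differ in size, which is the mismatch described above.
labels_a = {0: "Dhaka", 1: "Metro", 2: "Ga", 3: "1"}
labels_b = {0: "dhaka", 1: "metro", 2: "kha", 3: "1", 4: "2"}

# Build one merged vocabulary over the normalized class names.
merged = sorted({v.lower() for v in labels_a.values()} | {v.lower() for v in labels_b.values()})
merged_id = {name: idx for idx, name in enumerate(merged)}

def remap(old_id: int, source_labels: dict) -> int:
    """Map a dataset-specific class id to its id in the merged vocabulary."""
    return merged_id[source_labels[old_id].lower()]

print(remap(2, labels_a), remap(2, labels_b))  # "ga" and "kha" land on different merged ids
```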
We collected 40,000 vehicle images from various regions of Bangladesh, each annotated for license plate detection and text extraction. Upon merging all available data, the dataset revealed a significant imbalance, with approximately 75% of the vehicles registered in the Dhaka metropolitan area. Addressing this imbalance with real-world data proved challenging, so we generated 70,000 synthetic license plate images (Fig. 3) to ensure a more representative distribution from other districts, improving the overall dataset diversity. For synthetic data generation, we primarily adhered to the BRTA’s standard vehicle registration plate structure. However, recognizing that many vehicles in Bangladesh do not comply with the proper BRTA format (Fig. 4), we also generated a subset of synthetic images featuring irregular license plates to better reflect real-world variations.
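The following is a minimal sketch of how such synthetic plates can be rendered with Pillow, under the two-row BRTA-style layout and the white/green backgrounds described earlier. The font path, vocabulary, image size, and sampling scheme are illustrative assumptions rather than the exact generation pipeline used for our dataset.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Illustrative (incomplete) vocabulary; a full generator would cover every
# BRTA district, metro area, and vehicle-category letter in Bangla script.
DISTRICTS = ["ঢাকা", "চট্টগ্রাম", "রাজশাহী", "খুলনা"]
CATEGORIES = ["গ", "খ", "হ", "ল"]
BANGLA_DIGITS = "০১২৩৪৫৬৭৮৯"

def random_plate_text():
    district = random.choice(DISTRICTS)
    top = f"{district} মেট্রো-{random.choice(CATEGORIES)}" if random.random() < 0.5 \
        else f"{district}-{random.choice(CATEGORIES)}"
    digits = "".join(random.choices(BANGLA_DIGITS, k=6))
    return top, f"{digits[:2]}-{digits[2:]}"          # class number + serial number

def render_plate(top, bottom, font_path="NotoSansBengali-Bold.ttf", trading=False):
    bg = (0, 128, 0) if trading else (255, 255, 255)  # green for trading, white for private
    img = Image.new("RGB", (520, 220), bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 64)          # any Bangla-capable TrueType font
    draw.text((260, 65), top, font=font, fill="black", anchor="mm")
    draw.text((260, 160), bottom, font=font, fill="black", anchor="mm")
    return img

top, bottom = random_plate_text()
render_plate(top, bottom, trading=False).save("synthetic_plate.png")
```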
5 Results
To conduct the experiments, we used NVIDIA A4000 and NVIDIA A5000 GPUs. The machine has 32 GB of RAM and an Intel(R) Core(TM) i7-14700K CPU.
Table 1 presents a comparison of various object detection models fine-tuned for the specific task of license plate detection. The results indicate that YOLOv10 outperformed the other models in terms of accuracy. Therefore, if accuracy is the primary consideration, YOLOv10 emerges as the optimal choice for this application.
However, we used YOLOv5 in deployment because its inference time is lower; the comparison is shown in Table 2. In object detection tasks, the performance metric utilized is Intersection over Union (IoU). A prediction is considered a positive result if the IoU exceeds 50%, while predictions with an IoU below this threshold are classified as negative results. From these results, we calculated the precision, recall, and F1 score.
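A minimal sketch of this evaluation logic is given below: each prediction is matched greedily to an unmatched ground-truth box, counted as a true positive when IoU exceeds 0.5, and precision, recall, and F1 are derived from the resulting counts. This is a simplified, single-class version of the procedure, not the exact evaluation script used in our experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_metrics(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching at a fixed IoU threshold (simplified single-class eval)."""
    matched, tp = set(), 0
    for pred in predictions:
        candidates = [i for i in range(len(ground_truths)) if i not in matched]
        if not candidates:
            continue
        best = max(candidates, key=lambda i: iou(pred, ground_truths[i]))
        if iou(pred, ground_truths[best]) > threshold:
            matched.add(best)
            tp += 1
    fp, fn = len(predictions) - tp, len(ground_truths) - tp
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```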
As demonstrated in Table 3, the DONUT model significantly outperformed other approaches in real-world scenarios. Additionally, the DONUT model exhibits impressive speed, requiring only 200 ms to extract license plate numbers from images when using an NVIDIA A5000 GPU, while it takes approximately 1.5 to 2 seconds on an average CPU.
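For reference, the snippet below shows how such a timing measurement can be taken with a DONUT-style vision encoder-decoder via Hugging Face Transformers. The checkpoint path, task prompt, and maximum length are placeholders: a public base checkpoint such as `naver-clova-ix/donut-base` would first need fine-tuning on plate crops before producing plate text.

```python
import time
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Placeholder path for a DONUT checkpoint fine-tuned on license plate crops.
ckpt = "path/to/plate-finetuned-donut"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt).to("cuda").eval()

crop = Image.open("plate_crop.jpg").convert("RGB")
pixel_values = processor(crop, return_tensors="pt").pixel_values.to("cuda")
# The task prompt token is task-specific; "<s>" is used here only as an illustration.
prompt_ids = processor.tokenizer("<s>", add_special_tokens=False,
                                 return_tensors="pt").input_ids.to("cuda")

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=32)
elapsed_ms = (time.perf_counter() - start) * 1000
plate_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(plate_text, f"({elapsed_ms:.0f} ms)")
```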
7 Conclusion
This paper primarily seeks to identify the optimal solution for an automatic license plate recognition system specifically designed for vehicles in Bangladesh. To achieve this objective, we conducted a series of experiments with several state-of-the-art models, assessing their performance in various scenarios.
Among the models evaluated, the DONUT model yielded exceptionally strong results, demonstrating significant efficacy in this domain. Consequently, we developed a hybrid system that integrates YOLOv5 with the DONUT model. This hybrid approach strikes an optimal balance between accuracy and inference speed, making it particularly suitable for our application.
Currently, our model operates exclusively on still images, effectively extracting license plate information. However, we envision future enhancements that will enable our system to process video feeds, allowing for real-time recognition and display of results. This advancement would significantly enhance the practical applicability of our automatic license plate recognition system in real-world settings.