1 Introduction
Automatic Number Plate Recognition (ANPR) systems have gained significant attention, particularly in the context of intelligent transportation systems, with widespread implementation in various countries. These systems play a pivotal role in tasks such as traffic law enforcement, traffic monitoring, and vehicle park management. Beyond conventional applications, ANPR systems are instrumental in facilitating tasks like toll collection, entrance and exit management in vehicle parks, and enforcing security measures in restricted areas such as military campsites and protected sanctuaries. Their versatile utility extends to fraud prevention and heightened security measures in specific regions, aiding in locating missing vehicles or those associated with criminal activities.
The deployment of ANPR systems significantly reduces the need for extensive human labor, time, and resources that would otherwise be required for similar tasks. Moreover, manual intervention in such activities introduces the risk of erroneous interpretations, while reading license plates of moving vehicles efficiently poses practical challenges for human operators.
The unique challenges in the Bangladeshi ANPR landscape stem from the variability in license plate designs and the scarcity of labeled data. Traditional approaches often fall short in delivering consistent and accurate results in such dynamic and diverse scenarios.
The major contributions of this research are:
• Hybrid Architecture: The combination of YOLO and Transformers showcases the novelty of a hybrid architecture in which the strengths of object detection and sequence-based recognition are seamlessly integrated for comprehensive ANPR, as sketched below.
• Dataset Enrichment: We collected 40,000 real-world images from various locations across Bangladesh to address the scarcity of labeled Bangladeshi vehicle images. Despite this, we encountered certain edge cases for which real-world data was unavailable; to cover these cases, we generated 60,000 synthetic images.
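To make the detector-recognizer hand-off concrete, the following is a minimal sketch of the hybrid flow: a YOLO model proposes plate bounding boxes, and each crop is passed to a transformer-based recognizer. The hub checkpoint, confidence threshold, and the `recognize` stub are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
from PIL import Image

# Minimal sketch of the hybrid flow. The hub checkpoint and confidence
# threshold are placeholders; a plate-finetuned detector is assumed in practice.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def recognize(plate_crop: Image.Image) -> str:
    """Stub for the transformer-based recognizer (a DONUT sketch is given in the Results section)."""
    raise NotImplementedError

image = Image.open("vehicle.jpg")
detections = detector(image).xyxy[0]              # rows: x1, y1, x2, y2, confidence, class
for x1, y1, x2, y2, conf, cls in detections.tolist():
    if conf < 0.5:                                # keep confident plate proposals only
        continue
    plate_text = recognize(image.crop((x1, y1, x2, y2)))
    print(plate_text)
```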
2 Related Work
The Automatic Number Plate Recognition (ANPR) system has been a focus of research for many years, with researchers around the world exploring various methods to enhance its development. Abdullah et al. [1] utilized YOLOv3 for license plate detection and ResNet-20 for character recognition. Their dataset consisted of 1,500 license plate images and 6,400 character images for training the localization and recognition models, respectively. They reported an accuracy of 92.7%. However, their approach only targeted plates from the Dhaka Metropolitan Area, limiting its ability to generalize to other cities. Dhar et al. [11] proposed a Shape Validation Technique for license plate detection, followed by tilt correction and Connected Component Analysis to segment text, characters, and digits. For recognition, they employed an AdaBoost classifier using two key features: Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP). Their dataset included 2,800 images across 14 different classes, achieving an accuracy of 97.2%. Sarif et al. [27] proposed a system that uses YOLOv3 for license plate localization and a custom segmentation algorithm to extract text, characters, and digits from the plates. These segmented elements were then fed into a CNN model for recognition, achieving 97.5% accuracy. However, their model was tested on only 16 different classes, which is insufficient for real-world scenarios involving Bangladeshi vehicle license plates. Additionally, the dataset primarily consisted of private vehicles from Dhaka, making their claims less robust when applied to license plates of commercial vehicles or those from other regions. Saif et al. [26] proposed using the YOLOv3 model for both number plate localization and recognition. Their dataset, however, was limited to just 1,050 images of private vehicles. While they reported an accuracy of 99.5%, this claim does not hold for commercial vehicle license plates, which were not included in their dataset. Additionally, their accuracy measurement was based on a binary evaluation of the entire license plate, rather than a more granular, character-level approach.
Kumari et al. [18] proposed an approach that applies image preprocessing techniques followed by Contour Tracing and Edge Detection for license plate localization. For character segmentation and recognition, they utilized neural network models, aiming to enhance the accuracy of the overall system. Ahmed et al. [3] and Choudhary et al. [9] primarily focused on the recognition aspect of license plates. In [3], Ahmed et al. employed Horizontal and Vertical Projection along with Gray Level Occurrence to extract readable text from plates. In contrast, Choudhary et al. [10] used a combined CNN-LSTM model for character segmentation and recognition, achieving a claimed success rate of 99.64%. Venkateswari et al. [31] focused on license plate localization, utilizing the highest Horizontal and Vertical histogram values to extract the Region of Interest (ROI) for accurate plate detection. In [30], Surekha et al. reported achieving an accuracy of 97%. They performed several image preprocessing operations and compared Morphological Processing with Edge Processing for license plate area extraction. For character extraction, they utilized Connected Component Analysis and recognized the characters using a supervised learning model.
Most of the proposed systems are not well-suited for Bangladeshi vehicle license plates, as many are tailored to specific regions, languages, and types of license plates. However, some prior work has been conducted specifically for Bangladeshi license plates. For instance, Nooruddin et al. [21] proposed utilizing color features in conjunction with MinPool and MaxPool features to enhance license plate detection. Amin et al. [5] proposed a system that combines Edge Detection, Binary Thresholding, and Hough Transformation for license plate localization, followed by Optical Character Recognition (OCR) for recognizing text in the Bangla language. However, their approach did not achieve notable accuracy and lacks generalizability across different contexts. In their paper, Baten et al. [8] proposed a method that leverages a unique feature of the Bangla language known as "Matra", along with Connected Component Analysis, for text detection and segmentation. They then employed Template Matching for the recognition phase. However, they provided limited information regarding their dataset and the accuracy of their approach. Abedin et al. [2] proposed using Contour Properties for both license plate detection and character segmentation, followed by a CNN model for character recognition. They reported an overall accuracy of 92% with a processing time of 0.11 seconds. However, their dataset primarily consisted of private vehicles, and they did not account for all vehicle categories or focus on performance under night conditions. Rahman et al. [23] concentrated solely on the recognition task, requiring manual extraction of license plates and individual characters from the images. They then utilized a CNN model to recognize the characters. Their dataset consisted of 1,750 images, which involved considerable effort to compile.
In [7], Azam et al. focused primarily on noise removal from images to enhance the detection of license plate regions, achieving a detection accuracy of 94%. Their approach included the use of a frequency domain mask to eliminate rain strokes, a contrast enhancement method, Radon transform for tilt correction, and an image entropy-based technique to filter license plate regions. Hossain et al. [13] developed a system based on various image processing operations, utilizing the Sobel edge operator, dilation, erosion, boundary features, and horizontal and vertical projection to extract license plate regions. They then divided the extracted plate region into two halves, using boundary features for character segmentation and Template Matching for recognition. However, their system struggles with ambiguous character recognition and images tilted beyond 10 degrees. They claimed 90% accuracy. Chowdhury et al. [10] extracted the license plate region using color information and segmented it into two halves based on centroid data, followed by character extraction using bounding box parameters. They used a Support Vector Machine (SVM) for character recognition and claimed a 99.3% accuracy rate. However, their system was limited to private vehicle images and struggled when the license plate was out of focus or when the image quality was not ideal. Furthermore, their testing was restricted to only 14 classes, limiting its applicability. In [15], Islam et al. used Horizontal and Vertical projections along with geometric properties to extract license plate regions after preprocessing. Character localization was performed using Connected Component Analysis and bounding box techniques. For character recognition, they employed an SVM model using features extracted with Histogram of Oriented Gradients (HOG). While they achieved high recognition accuracy, their system did not account for non-ideal conditions: it failed when image resolution was low and struggled to detect license plates from commercial vehicles. Ahsan et al. [4] proposed a system that uses Template Matching to localize the license plate region, employs Spatial Super Resolution techniques to enhance image quality, and utilizes the Bounding Box method for character segmentation. They used AlexNet for character recognition, achieving a high accuracy of 98.2%. However, they did not provide details about the number of classes AlexNet was trained on. Additionally, the Template Matching technique often struggles to detect targets when the license plate is tilted in the image.
Quadri et al. [22] employed a Smearing algorithm to extract the license plate region, followed by row and column segmentation for Optical Character Recognition (OCR) to recognize the text from the plate. Shidore et al. [28] utilized the Sobel Filter, Morphological Operations, Connected Component Analysis, and Vertical Projection Analysis for license plate detection, and employed an SVM for character recognition. Lekhana et al. [19] presented an approach that combines Spectral Analysis with Connected Component Analysis for detecting license plate regions, followed by the use of an SVM for character recognition. Astari et al. [6] reported achieving significant accuracy in their paper, where they proposed a system utilizing color features and a hybrid classifier combining a Decision Tree and an SVM for license plate detection and recognition. Wang et al. [33] employed Image Processing techniques for the license plate localization and segmentation stages, and used a Convolutional Neural Network (CNN) model for character recognition. Jain et al. [16] utilized Image Processing techniques with Sobel Edge Detection for license plate localization, followed by Optical Character Recognition (OCR) to recognize the characters on the license plate. Lin et al. [20] employed the YOLOv2 model for vehicle and license plate localization, used classic Image Processing operations for segmentation, and implemented a custom LPR-CNN model for character recognition.
3 Dataset
The Bangladesh Road Transport Authority (BRTA) serves as the regulatory agency tasked with overseeing, managing, and enforcing discipline and safety in the country’s road transport sector. In 2012, BRTA launched a new vehicle license plate system called the Retro-Reflective License Plate, widely known as the digital license plate, as part of its digitalization efforts. Since its rollout, it has become mandatory for vehicles to display this license plate on their rear.
The digital license plates are classified into two categories: one for private vehicles and the other for trading vehicles. Private vehicle plates have a white background with black text (Fig. 1a), while trading vehicle plates feature a green background with black text (Fig. 1b). Each plate contains two separate rows of text, characters, and numbers.
In the top row, the first word indicates the district where the vehicle was registered. The optional second word identifies the area if the vehicle is registered in a metropolitan zone. The only character in this row, separated by a hyphen, denotes the category of the vehicle.
In the bottom row, the first two digits represent the vehicle’s class registration number, followed by four additional digits separated by a hyphen, which together constitute the vehicle’s serial number. It is mandatory for the license plates to display information in the Bangla language.
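As a concrete illustration of this two-row layout, the sketch below validates and splits a plate string into its fields. The romanized example, regular expression, and field names are our own assumptions for readability; actual plates are written in Bangla script.

```python
import re

# Illustrative pattern for a transliterated plate, e.g. "DHAKA METRO-GA 11-2233".
# Real plates use Bangla script; the field names are hypothetical, not BRTA terminology.
PLATE_RE = re.compile(
    r"^(?P<district>[A-Z]+)"                    # registration district, e.g. DHAKA
    r"(?: (?P<metro_area>[A-Z]+))?"             # optional metropolitan area, e.g. METRO
    r"-(?P<category>[A-Z]+)\s+"                 # vehicle-category letter after the hyphen
    r"(?P<class_no>\d{2})-(?P<serial>\d{4})$"   # class registration number + four-digit serial
)

def parse_plate(text: str):
    match = PLATE_RE.match(text.strip().upper())
    return match.groupdict() if match else None

print(parse_plate("Dhaka Metro-GA 11-2233"))
# {'district': 'DHAKA', 'metro_area': 'METRO', 'category': 'GA', 'class_no': '11', 'serial': '2233'}
```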
We collected a comprehensive dataset of vehicle and license plate images specific to Bangladeshi vehicles, along with their corresponding annotations.
The dataset was significantly enriched by contributions from Hossain et al. [14], which includes a combination of images sourced from Nooruddin et al. [21] and additional images collected by the authors. The first subset of this dataset comprises approximately 2,800 images designed for vehicle localization, while the second subset contains around 4,000 license plate images, which were cropped from the initial dataset for focused analysis. Another dataset was introduced by Shomee et al. [29], who compiled a detailed collection comprising 1,928 images for vehicle localization (Fig. 2) and an additional 2,662 license plate images. The second subset includes 720 synthetic images and 1,942 manually cropped images, which were derived from the localization dataset.
We combined these two datasets, along with their annotations, and integrated them with our own collected images to create a more comprehensive dataset for vehicle and license plate recognition tasks. For localization, both datasets included bounding box annotations for license plates. However, text extraction posed a greater challenge due to a mismatch in the number of annotation classes across the datasets.
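One pragmatic way to resolve such a class mismatch is to normalize both label maps into a single merged vocabulary before training. The sketch below illustrates the idea with hypothetical label maps; the class names and ids shown are not the actual inventories of the source datasets.

```python
# Hypothetical label maps for the two source datasets; the real inventories
# differ in size, which is the mismatch described above.
labels_a = {0: "Dhaka", 1: "Metro", 2: "Ga", 3: "1"}
labels_b = {0: "dhaka", 1: "metro", 2: "kha", 3: "1", 4: "2"}

# Build one merged vocabulary over the normalized class names.
merged = sorted({v.lower() for v in labels_a.values()} | {v.lower() for v in labels_b.values()})
merged_id = {name: idx for idx, name in enumerate(merged)}

def remap(old_id: int, source_labels: dict) -> int:
    """Map a dataset-specific class id to its id in the merged vocabulary."""
    return merged_id[source_labels[old_id].lower()]

print(remap(2, labels_a), remap(2, labels_b))  # "ga" and "kha" land on different merged ids
```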
We collected 40,000 vehicle images from various regions of Bangladesh, each annotated for license plate detection and text extraction. Upon merging all available data, the dataset revealed a significant imbalance, with approximately 75% of the vehicles registered in the Dhaka metropolitan area. Addressing this imbalance with real-world data proved challenging, so we generated 70,000 synthetic license plate images (Fig. 3) to ensure a more representative distribution from other districts, improving the overall dataset diversity. For synthetic data generation, we primarily adhered to the BRTA’s standard vehicle registration plate structure. However, recognizing that many vehicles in Bangladesh do not comply with the proper BRTA format (Fig. 4), we also generated a subset of synthetic images featuring irregular license plates to better reflect real-world variations.
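The following is a minimal sketch of how such synthetic plates can be rendered with Pillow, under the two-row BRTA-style layout and the white/green backgrounds described earlier. The font path, vocabulary, image size, and sampling scheme are illustrative assumptions rather than the exact generation pipeline used for our dataset.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Illustrative (incomplete) vocabulary; a full generator would cover every
# BRTA district, metro area, and vehicle-category letter in Bangla script.
DISTRICTS = ["ঢাকা", "চট্টগ্রাম", "রাজশাহী", "খুলনা"]
CATEGORIES = ["গ", "খ", "হ", "ল"]
BANGLA_DIGITS = "০১২৩৪৫৬৭৮৯"

def random_plate_text():
    district = random.choice(DISTRICTS)
    top = f"{district} মেট্রো-{random.choice(CATEGORIES)}" if random.random() < 0.5 \
        else f"{district}-{random.choice(CATEGORIES)}"
    digits = "".join(random.choices(BANGLA_DIGITS, k=6))
    return top, f"{digits[:2]}-{digits[2:]}"          # class number + serial number

def render_plate(top, bottom, font_path="NotoSansBengali-Bold.ttf", trading=False):
    bg = (0, 128, 0) if trading else (255, 255, 255)  # green for trading, white for private
    img = Image.new("RGB", (520, 220), bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 64)          # any Bangla-capable TrueType font
    draw.text((260, 65), top, font=font, fill="black", anchor="mm")
    draw.text((260, 160), bottom, font=font, fill="black", anchor="mm")
    return img

top, bottom = random_plate_text()
render_plate(top, bottom, trading=False).save("synthetic_plate.png")
```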
5 Results
To conduct the experiments, we used NVIDIA A4000 and NVIDIA A5000 GPUs. The machine has 32 GB of RAM and an Intel(R) Core(TM) i7-14700K CPU.
Table 1 presents a comparison of various object detection models fine-tuned for the specific task of license plate detection. The results indicate that YOLOv10 outperformed the other models in terms of accuracy. Therefore, if accuracy is the primary consideration, YOLOv10 emerges as the optimal choice for this application.
However, we used YOLOv5 in deployment because its inference time is lower; the comparison is shown in Table 2. In object detection tasks, the performance metric utilized is Intersection over Union (IoU). A prediction is considered a positive result if the IoU exceeds 50%, while predictions with an IoU below this threshold are classified as negative results. From these results, we calculated the precision, recall, and F1 score.
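A minimal sketch of this evaluation logic is given below: each prediction is matched greedily to an unmatched ground-truth box, counted as a true positive when IoU exceeds 0.5, and precision, recall, and F1 are derived from the resulting counts. This is a simplified, single-class version of the procedure, not the exact evaluation script used in our experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_metrics(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching at a fixed IoU threshold (simplified single-class eval)."""
    matched, tp = set(), 0
    for pred in predictions:
        candidates = [i for i in range(len(ground_truths)) if i not in matched]
        if not candidates:
            continue
        best = max(candidates, key=lambda i: iou(pred, ground_truths[i]))
        if iou(pred, ground_truths[best]) > threshold:
            matched.add(best)
            tp += 1
    fp, fn = len(predictions) - tp, len(ground_truths) - tp
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```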
As demonstrated in Table 3, the DONUT model significantly outperformed other approaches in real-world scenarios. Additionally, the DONUT model exhibits impressive speed, requiring only 200 ms to extract license plate numbers from images when using an NVIDIA A5000 GPU, while it takes approximately 1.5 to 2 seconds on an average CPU.
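For reference, the snippet below shows how such a timing measurement can be taken with a DONUT-style vision encoder-decoder via Hugging Face Transformers. The checkpoint path, task prompt, and maximum length are placeholders: a public base checkpoint such as `naver-clova-ix/donut-base` would first need fine-tuning on plate crops before producing plate text.

```python
import time
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Placeholder path for a DONUT checkpoint fine-tuned on license plate crops.
ckpt = "path/to/plate-finetuned-donut"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt).to("cuda").eval()

crop = Image.open("plate_crop.jpg").convert("RGB")
pixel_values = processor(crop, return_tensors="pt").pixel_values.to("cuda")
# The task prompt token is task-specific; "<s>" is used here only as an illustration.
prompt_ids = processor.tokenizer("<s>", add_special_tokens=False,
                                 return_tensors="pt").input_ids.to("cuda")

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=32)
elapsed_ms = (time.perf_counter() - start) * 1000
plate_text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(plate_text, f"({elapsed_ms:.0f} ms)")
```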
7 Conclusion
This paper primarily seeks to identify the optimal solution for an automatic license plate recognition system specifically designed for vehicles in Bangladesh. To achieve this objective, we conducted a series of experiments with several state-of-the-art models, assessing their performance in various scenarios.
Among the models evaluated, the DONUT model yielded exceptionally strong results, demonstrating significant efficacy in this domain. Consequently, we developed a hybrid system that integrates YOLOv5 with the DONUT model. This hybrid approach strikes an optimal balance between accuracy and inference speed, making it particularly suitable for our application.
Currently, our model operates exclusively on still images, effectively extracting license plate information. However, we envision future enhancements that will enable our system to process video feeds, allowing for real-time recognition and display of results. This advancement would significantly enhance the practical applicability of our automatic license plate recognition system in real-world settings.