Leveraging OCR-Driven Information Extraction for Accurate Product Type Classification from Thai Receipt Data: An Ensemble Learning Approach

Nuankaew, Wongpanya S.; Autarach, Apitarat; Meesri, Teerapakorn; Nuankaew, Pratya

doi:10.1007/978-981-96-0695-5_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15432)

Included in the following conference series:

International Conference on Multi-disciplinary Trends in Artificial Intelligence

Abstract

This study investigates OCR-driven information extraction and ensemble learning for product type classification from Thai receipt data to enhance family expense management. Using 1,305 receipt images from Thailand, we extracted and preprocessed 5,087 product names across five categories. We compared base classification algorithms with ensemble learning algorithms, focusing on their performance in handling OCR-extracted Thai text. Results demonstrated the superiority of ensemble methods, particularly Majority Voting and Extra Trees, in classifying product types. Majority Voting achieved a weighted average F1-score of 91.74% and accuracy of 91.92%, while Extra Trees recorded the highest overall accuracy at 92.05%. This study contributes to the field by addressing the unique challenges of Thai language OCR and product classification, offering insights into effective ensemble learning strategies for receipt data analysis.
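
The two ideas the abstract leans on, hard majority voting over base classifiers and the weighted average F1-score, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tie-breaking rule, and the category labels ("food", "drink", "household") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard majority voting: per sample, return the label most models chose.

    Assumption: ties are broken in favour of the earliest model in the list.
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = [preds[i] for preds in predictions_per_model]
        counts = Counter(votes)
        best = max(counts.values())
        # Take the first label, in model order, that reaches the top count.
        voted.append(next(v for v in votes if counts[v] == best))
    return voted


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)  # support = number of true samples of this class
    return total / len(y_true)


# Hypothetical labels and predictions from three base classifiers.
y_true = ["food", "drink", "food", "household"]
model_a = ["food", "drink", "drink", "household"]
model_b = ["food", "food", "food", "household"]
model_c = ["drink", "drink", "food", "food"]

voted = majority_vote([model_a, model_b, model_c])
score = weighted_f1(y_true, voted)
```

In this toy case each base model makes at least one mistake, yet the vote recovers every true label, which is the intuition behind preferring an ensemble over any single classifier.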

Acknowledgement

This research project was supported by the Thailand Science Research and Innovation Fund and the University of Phayao. It also received support from many advisors, academics, researchers, students, and staff. The authors thank everyone for their support and cooperation in completing this research.

Special thanks also go to the generative AI tools Claude, ChatGPT, Sci Space, Gemini, and Perplexity for their invaluable assistance in gathering information, reading, and providing guidance throughout this research.

Author information

Authors and Affiliations

School of Information and Communication Technology, University of Phayao, Phayao, 56000, Thailand
Wongpanya S. Nuankaew, Apitarat Autarach, Teerapakorn Meesri & Pratya Nuankaew

Corresponding author

Correspondence to Pratya Nuankaew.

Editor information

Editors and Affiliations

Mahasarakham University, Mahasarakham, Thailand
Chattrakul Sombattheera
Duke Kunshan University, Kunshan, China
Paul Weng
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Jun Pang

Ethics declarations

The researchers declare that there is no conflict of interest for this research.

About this paper

Cite this paper

Nuankaew, W.S., Autarach, A., Meesri, T., Nuankaew, P. (2025). Leveraging OCR-Driven Information Extraction for Accurate Product Type Classification from Thai Receipt Data: An Ensemble Learning Approach. In: Sombattheera, C., Weng, P., Pang, J. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2024. Lecture Notes in Computer Science, vol 15432. Springer, Singapore. https://doi.org/10.1007/978-981-96-0695-5_7

DOI: https://doi.org/10.1007/978-981-96-0695-5_7
Published: 20 February 2025
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0694-8
Online ISBN: 978-981-96-0695-5
eBook Packages: Computer Science, Computer Science (R0)
