Abstract
This study investigates OCR-driven information extraction and ensemble learning for product type classification from Thai receipt data to enhance family expense management. Using 1,305 receipt images from Thailand, we extracted and preprocessed 5,087 product names across five categories. We compared base classification algorithms with ensemble learning algorithms, focusing on their performance in handling OCR-extracted Thai text. The ensemble methods outperformed the base classifiers, with Majority Voting and Extra Trees standing out. Majority Voting achieved a weighted average F1-score of 91.74% and an accuracy of 91.92%, while Extra Trees recorded the highest overall accuracy at 92.05%. This study contributes to the field by addressing the unique challenges of Thai-language OCR and product classification, offering insights into effective ensemble learning strategies for receipt data analysis.
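As a rough illustration of the two ideas the abstract relies on, the sketch below shows hard majority voting over several classifiers' predictions and the support-weighted F1-score used to report results. It is a minimal sketch in plain Python, not the authors' pipeline: the category labels, the three classifier outputs, and the function names are all hypothetical and chosen only for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard (majority) voting.

    predictions: list of lists, one inner list of labels per classifier.
    Ties are broken in favour of the earliest classifier in the list.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Take the first vote (in classifier order) that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn)  # tp + fn is the class support
    return total / len(y_true)


# Hypothetical toy data: labels and classifier outputs are invented.
y_true = ["food", "drink", "food", "household", "drink"]
clf_a = ["food", "drink", "drink", "household", "drink"]
clf_b = ["food", "food", "food", "household", "drink"]
clf_c = ["drink", "drink", "food", "food", "drink"]

y_vote = majority_vote([clf_a, clf_b, clf_c])
print("voted labels:", y_vote)
print("accuracy:", accuracy(y_true, y_vote))
print("weighted F1:", weighted_f1(y_true, y_vote))
```

In this toy case each classifier makes at least one mistake, but the mistakes fall on different samples, so the vote recovers the correct label everywhere. That is the intuition behind combining base classifiers; real OCR-extracted text will not behave this cleanly.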
Acknowledgement
This research project was supported by the Thailand Science Research and Innovation Fund and the University of Phayao. It also received support from many advisors, academics, researchers, students, and staff. The authors thank everyone for their support and cooperation in completing this research.
Special thanks also go to the generative AI tools Claude, ChatGPT, Sci Space, Gemini, and Perplexity for their invaluable assistance in gathering information, reading, and providing guidance throughout this research.
Ethics declarations
The researchers declare that there is no conflict of interest for this research.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Nuankaew, W.S., Autarach, A., Meesri, T., Nuankaew, P. (2025). Leveraging OCR-Driven Information Extraction for Accurate Product Type Classification from Thai Receipt Data: An Ensemble Learning Approach. In: Sombattheera, C., Weng, P., Pang, J. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2024. Lecture Notes in Computer Science, vol 15432. Springer, Singapore. https://doi.org/10.1007/978-981-96-0695-5_7
DOI: https://doi.org/10.1007/978-981-96-0695-5_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0694-8
Online ISBN: 978-981-96-0695-5
eBook Packages: Computer Science (R0)