Leveraging OCR-Driven Information Extraction for Accurate Product Type Classification from Thai Receipt Data: An Ensemble Learning Approach

  • Conference paper
Multi-disciplinary Trends in Artificial Intelligence (MIWAI 2024)

Abstract

This study investigates OCR-driven information extraction and ensemble learning for product type classification from Thai receipt data to enhance family expense management. Using 1,305 receipt images from Thailand, we extracted and preprocessed 5,087 product names across five categories. We compared base classification algorithms with ensemble learning algorithms, focusing on their performance in handling OCR-extracted Thai text. Results demonstrated the superiority of ensemble methods, particularly Majority Voting and Extra Trees, in classifying product types. Majority Voting achieved a weighted average F1-score of 91.74% and accuracy of 91.92%, while Extra Trees recorded the highest overall accuracy at 92.05%. This study contributes to the field by addressing the unique challenges of Thai language OCR and product classification, offering insights into effective ensemble learning strategies for receipt data analysis.
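The two evaluation ideas named in the abstract, majority voting and the weighted average F1-score, can be illustrated with a small standard-library sketch. This is not the authors' implementation: the category labels, the three "base model" prediction lists, and the function names are hypothetical, and the toy data is deliberately constructed so that the base models err on different samples.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, pick the label most models predicted.

    Ties go to the label encountered first among the votes, because
    Counter.most_common orders equal counts by first occurrence.
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = [preds[i] for preds in predictions_per_model]
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total  # tp + fn is the class support
    return score


# Hypothetical toy data: three product categories, three base classifiers.
y_true = ["food", "food", "drink", "drink", "household", "household"]
model_a = ["food", "food", "drink", "food", "household", "household"]
model_b = ["food", "drink", "drink", "drink", "household", "food"]
model_c = ["food", "food", "food", "drink", "household", "household"]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("A", model_a), ("B", model_b),
                    ("C", model_c), ("vote", ensemble)]:
    print(name, round(accuracy(y_true, preds), 4),
          round(weighted_f1(y_true, preds), 4))
```

Because each base model in this toy set is wrong on a different sample, the vote corrects every individual error. On real OCR-extracted text, errors are often correlated across models, so the gain from voting is typically smaller.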

Acknowledgement

This research project was supported by the Thailand Science Research and Innovation Fund and the University of Phayao. It also received support from many advisors, academics, researchers, students, and staff. The authors thank everyone for their support and cooperation in completing this research.

The authors also extend special thanks to the generative AI tools Claude, ChatGPT, Sci Space, Gemini, and Perplexity for their assistance in gathering information, reading, and providing guidance throughout this research.

Author information

Corresponding author

Correspondence to Pratya Nuankaew.

Ethics declarations

The researchers declare that there is no conflict of interest for this research.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Nuankaew, W.S., Autarach, A., Meesri, T., Nuankaew, P. (2025). Leveraging OCR-Driven Information Extraction for Accurate Product Type Classification from Thai Receipt Data: An Ensemble Learning Approach. In: Sombattheera, C., Weng, P., Pang, J. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2024. Lecture Notes in Computer Science, vol 15432. Springer, Singapore. https://doi.org/10.1007/978-981-96-0695-5_7

  • DOI: https://doi.org/10.1007/978-981-96-0695-5_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0694-8

  • Online ISBN: 978-981-96-0695-5

  • eBook Packages: Computer Science, Computer Science (R0)
