Abstract
Natural language processing (NLP) is a developing field with growing potential to simplify accounting-related tasks. This research studies a novel NLP approach that classifies invoices into categories based on the invoice text description. Preprocessing consists of three parts: text cleaning, semantic enrichment using the labels as an information source, and text augmentation. A total of 12 training datasets were prepared from the raw invoice data, each the output of a unique combination of the preprocessing steps. Each training dataset was then modelled with one traditional classifier, Linear Support Vector Machine (LSVM), and two deep learning classifiers, Bi-directional Long Short-Term Memory (Bi-LSTM) and Bidirectional Encoder Representations from Transformers (BERT). Overall, the best approach yielded an improvement of up to 6.7 percentage points (ppts) in accuracy and 20 ppts in macro F1 score. Retaining only English text for modelling reduced noise and overfitting. Using label data to semantically enrich invoice text descriptions improved the model’s generalizability. For short-text augmentation, the lexical synonym substitution approach preserved semantics more effectively than the word embedding approach. BERT outperformed Bi-LSTM and LSVM, and performance improved further as the training data increased, confirming the superiority of deep learning classifiers over traditional classifiers. Multi-class balancing through lexical-based data augmentation improved model generalizability, as evidenced by a high macro F1 score. These findings contribute to the automation of invoice text classification, which remains largely a manual task in practice.
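The class balancing by lexical synonym substitution described above can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym table, the sample invoice descriptions, the category names, and the helper names `augment` and `balance` are all hypothetical, and the abstract does not state which lexical resource was used.

```python
import random
from collections import Counter

# Hypothetical synonym table standing in for a lexical resource;
# the abstract does not name the resource the authors used.
SYNONYMS = {
    "repair": ["fix", "servicing"],
    "fee": ["charge", "cost"],
    "purchase": ["procurement", "acquisition"],
}

def augment(text, rng):
    """Return a copy of text with one known word swapped for a synonym."""
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens) if tok in SYNONYMS]
    if not candidates:
        return text  # no substitutable word, so the text is returned unchanged
    i = rng.choice(candidates)
    tokens[i] = rng.choice(SYNONYMS[tokens[i]])
    return " ".join(tokens)

def balance(samples, seed=0):
    """Add augmented copies of minority-class texts until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    balanced = list(samples)  # original samples are kept as they are
    for label, texts in by_label.items():
        for _ in range(target - counts[label]):
            balanced.append((augment(rng.choice(texts), rng), label))
    return balanced

# Toy, made-up invoice descriptions with an imbalanced label distribution.
samples = [
    ("office chair purchase", "furniture"),
    ("desk purchase", "furniture"),
    ("cabinet purchase", "furniture"),
    ("aircon repair fee", "maintenance"),
]
balanced = balance(samples)
print(Counter(label for _, label in balanced))
```

Each augmented copy differs from its source by a single word, which is one plausible reason a dictionary-based substitution keeps the meaning of a short description more reliably than a nearest-neighbour lookup in embedding space.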
The classification approach is well suited to be integrated with other artificial intelligence solutions like Optical Character Recognition (OCR) and Robotic Process Automation (RPA) to form a completely automated invoice processing system. Since invoice classification is a repetitive and non-value-added process, the combination of this novel text classification method with RPA can reduce overhead costs by approximately 90%.
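The abstract reports gains in both accuracy and macro F1 score. A small worked example, using made-up labels rather than the paper's data, shows why the two metrics can move by different amounts on imbalanced invoice categories: macro F1 averages the per-class F1 scores with equal weight, so a neglected minority class pulls it down sharply. Taking the class set from the true labels only is an assumption of this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Classes are taken from y_true only (an assumption of this sketch).
    """
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced case: 8 "utilities" invoices, 2 "travel" invoices,
# and a classifier that always predicts the majority class.
y_true = ["utilities"] * 8 + ["travel"] * 2
y_pred = ["utilities"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.800
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.444
```

Here the majority-class predictor reaches 80% accuracy, yet the "travel" class has an F1 of 0, which halves the macro average of the 0.889 scored on "utilities".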
Acknowledgement
This work was supported in part by Sunway University and Sunway Business School under Kick Start Grant Scheme (KSGS) NO: GRTIN-KSGS-DBA[S]-02-2022. This work is also part of the Sustainable Business Research Cluster and Research Centre for Human-Machine Collaboration (HUMAC) at Sunway University. We also wish to thank those who have supported this research.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chi, W.W., Tang, T.Y., Salleh, N.M., Hwang, H.J. (2023). A Novel Natural Language Processing Strategy to Improve Digital Accounting Classification Approach for Supplier Invoices ERP Transaction Process. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023. ICCSA 2023. Lecture Notes in Computer Science, vol 13956. Springer, Cham. https://doi.org/10.1007/978-3-031-36805-9_38
DOI: https://doi.org/10.1007/978-3-031-36805-9_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36804-2
Online ISBN: 978-3-031-36805-9
eBook Packages: Computer Science (R0)