Abstract:
Receipt data extraction and digitization remains difficult because receipts vary widely: they are often crumpled or soiled, and the overall scan quality of the images is low. The major problems the industry faces today in this domain are: (i) the lack of generalization in standard OCR solutions and in custom pipelines built on open-source APIs such as Tesseract; (ii) the high cost, yet low accuracy, of commercially available solutions; and (iii) the requirement for organizations to supply large volumes of hand-annotated images for training. In this paper we explain a strategy to overcome these limitations and to build a holistic pipeline for text detection and extraction that is deployable in the real world. We survey traditional methods as well as recent CNN-based architectures, and then explain the application of the novel Connectionist Text Proposal Network (CTPN) architecture to the specific task of text detection in scanned, text-heavy images. We also compare the CTPN outcomes against those of a trained state-of-the-art SSD on a sample dataset; the comparison shows that CTPN is the more suitable algorithm for this use case.
Date of Conference: 26-28 November 2020
Date Added to IEEE Xplore: 08 February 2021
ISBN Information:
Print on Demand (PoD) ISSN: 2164-7011