Abstract
Digital data has grown rapidly over the last few years, and images containing quotes now spread virally across online social media platforms. Misquotes spread just as quickly, which highlights how little responsibility web users take when circulating poorly cited quotes. It is therefore important to authenticate the content of quote images circulated online: the text within such images must be retrieved and verified before use, so that a fake quote or misquote can be distinguished from an authentic one. This paper uses Optical Character Recognition (OCR) to convert textual images into machine-readable text, but no OCR tool extracts text from images with perfect accuracy. We therefore propose a post-processing method for the retrieved text that improves the accuracy of the text detected in images. Google Cloud Vision is used to recognize text in images, and post-processing the extracted text improved text recognition accuracy by approximately 3.5%. A web-based text similarity approach (using URLs and domain names) is used to examine the authenticity of the content of the quoted images. Approximately 96.26% accuracy is achieved in classifying quoted images as verified or misquoted. A ground truth dataset of authentic site names has also been created. The experiments use images with quotes by famous celebrities and global leaders. A comparative analysis shows the effectiveness of the proposed algorithm.
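The verification step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the trusted domain list, the function names, the use of Python's difflib ratio as the similarity measure, and the 0.8 threshold are all assumptions made for illustration. The sketch takes OCR output and a list of (URL, snippet) pairs, such as a web search might return, and marks the quote as verified only when a sufficiently similar snippet comes from a domain on the trusted list.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical placeholder for a ground truth list of authentic site names.
TRUSTED_DOMAINS = {"quotes.example.org", "archive.example.net"}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace to reduce OCR noise."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def domain_of(url):
    """Return the host part of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_quote(ocr_text, search_results, threshold=0.8):
    """Label a quote 'verified' if a trusted domain carries a similar snippet.

    search_results is a list of (url, snippet) pairs; threshold is an
    assumed cut-off, not a value taken from the paper.
    """
    for url, snippet in search_results:
        if domain_of(url) in TRUSTED_DOMAINS and similarity(ocr_text, snippet) >= threshold:
            return "verified"
    return "misquoted"


if __name__ == "__main__":
    results = [
        ("https://www.quotes.example.org/q/1",
         "Be the change that you wish to see in the world."),
        ("https://random.example.com/x",
         "Be the change that you wish to see in the world."),
    ]
    print(classify_quote("Be the change that you wish to see in the world", results))
```

Requiring both a similar snippet and a trusted domain means that a quote repeated only on unlisted sites stays unverified, which mirrors the role of the ground truth list of authentic site names described in the abstract.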
Acknowledgments
This publication is an outcome of the R&D work undertaken in the project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
About this article
Cite this article
Banerjee, S., Kaur, S. & Kumar, P. Quote examiner: verifying quoted images using web-based text similarity. Multimed Tools Appl 80, 12135–12154 (2021). https://doi.org/10.1007/s11042-020-10270-4
DOI: https://doi.org/10.1007/s11042-020-10270-4