research-article

Text Recognition Using Anonymous CAPTCHA Answers

Authors: Alexander Shishkin, Anastasia Bezzubtseva, Valentina Fedorova, Gleb Gusev

WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining

Pages 537 - 545

https://doi.org/10.1145/3336191.3371795

Published: 22 January 2020

Abstract

Internet companies use crowdsourcing to collect the large amounts of data needed to build products based on machine learning. A significant source of labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, collects new labeled data. An important component of such an approach to data collection is reducing the noisy labels produced by bots and non-qualified users.

In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed to obtain recognized text of good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized text using a minimal number of user guesses. Finally, we report substantial improvements in an optical character recognition model after implementing our approach at Yandex.
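To make the idea of incremental relabeling concrete, the following is a minimal sketch, not the method of the paper. It assumes a plain majority vote as the aggregation step and a fixed agreement threshold as the stopping rule, whereas the abstract describes learned models that use meta features of tasks and guesses. The function names and the parameters `min_guesses`, `max_guesses`, and `threshold` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def normalize(guess: str) -> str:
    """Lowercase and collapse whitespace so trivially different guesses match."""
    return " ".join(guess.lower().split())


def aggregate(guesses: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized guess and its share of all guesses."""
    counts = Counter(normalize(g) for g in guesses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(guesses)


def label_incrementally(
    stream: Iterable[str],
    min_guesses: int = 3,      # hypothetical: never stop before this many guesses
    max_guesses: int = 10,     # hypothetical: guess budget per image
    threshold: float = 0.7,    # hypothetical: required share of the top answer
) -> Tuple[Optional[str], int]:
    """Request guesses one at a time and stop once the top answer is confident.

    Returns (label, number_of_guesses_used). The label is None when the
    budget or the stream runs out before the threshold is reached.
    """
    collected: List[str] = []
    for guess in stream:
        collected.append(guess)
        if len(collected) < min_guesses:
            continue
        answer, share = aggregate(collected)
        if share >= threshold:
            return answer, len(collected)
        if len(collected) >= max_guesses:
            break
    return None, len(collected)
```

For example, `label_incrementally(["Hello", "hello ", "he1lo", "HELLO"])` stops after the fourth guess, when three of four normalized guesses agree, and returns `("hello", 4)`. A learned stopping rule would replace the fixed threshold with a model score.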

Cited By

P. Leonidou, A. Constantinides, M. Belk, C. Fidas, and A. Pitsillides. 2021. Eye Gaze and Interaction Differences of Holistic Versus Analytic Users in Image-Recognition Human Interaction Proof Schemes. In HCI for Cybersecurity, Privacy and Trust. 66--75. Online publication date: 3 July 2021. https://doi.org/10.1007/978-3-030-77392-2_5

Published In

WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining

January 2020, 950 pages

ISBN: 9781450368223

DOI: 10.1145/3336191

General Chairs: James Caverlee (Texas A&M University), Xia "Ben" Hu (Texas A&M University)

Program Chairs: Mounia Lalmas (Spotify), Wei Wang (University of California, Los Angeles)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining

February 3 - 7, 2020

Houston, TX, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
