
1 Introduction

The Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) [31] was developed to protect systems from denial of service (DoS) attacks by malicious automated programs. It is a Turing test that discriminates human users from malicious bots using tasks that humans can perform easily but machines cannot. Two-dimensional (2D) CAPTCHAs are the most commonly used form of CAPTCHA to identify humans. However, rapid advances in machine learning have enabled bots to break 2D text-based CAPTCHAs [12, 25] and 2D image-based CAPTCHAs [29] with high accuracy. In particular, advances in Optical Character Recognition (OCR) have led to sophisticated attacks on 2D text-based CAPTCHAs. These attacks have motivated the development of three-dimensional (3D) CAPTCHAs, which are relatively hard for computers to decode.

3D CAPTCHAs are categorized into two types: 3D model-based CAPTCHAs and 3D text-based CAPTCHAs. Previous 3D text-based CAPTCHAs were formed by a simple extrusion of letters, making them recognizable at a single glance. However, simple extrusion is vulnerable to OCR attacks because it produces the same visual effect as distorted 2D text. 3D model-based CAPTCHAs improve security at the cost of usability, which is the strength of 3D text-based CAPTCHAs. They rely on the mental rotation ability [4, 27], which enables users to find answers by inferring the orientation of various 3D objects [14, 26, 33], such as vehicles and animals. However, 3D model-based CAPTCHAs have a low correct response rate, because they require users not only to recognize the 3D object but also to judge and infer the answer through elaborate operations. As a result, obtaining the right answer takes considerably longer than with 3D text-based CAPTCHAs. Moreover, for users accustomed to conventional text-based CAPTCHAs, 3D model-based CAPTCHAs cause usability issues.

This paper proposes a new type of 3D CAPTCHA, called DotCHA, which satisfies both security and usability requirements. DotCHA combines the strengths of text-based CAPTCHAs and 3D model-based CAPTCHAs. It presents different letters that become legible at different rotation angles. Each letter is composed of many small spheres instead of a single solid model, which makes it resistant to segmentation attacks. DotCHA offers familiar usability to users of text-based CAPTCHAs while preserving the security of 3D model-based CAPTCHAs. In Sect. 2, we briefly review previous work. Section 3 describes the generation of DotCHA, and Sect. 4 evaluates DotCHA under several attack scenarios. The prototype implementation is available at https://suzikim.github.io/DotCHA/.

2 Related Work

In this section, we briefly discuss three main types of CAPTCHAs, as shown in Fig. 1, which are closely related to DotCHA: 2D text-based, 3D model-based, and interactive CAPTCHAs.

Fig. 1. Examples of previous 2D text-based, 3D model-based, and interactive CAPTCHAs.

2.1 2D Text-Based CAPTCHAs

2D text-based CAPTCHAs are the most widely used form, owing to their ease of use and simple structure. A sequence of letters and digits is presented to a user, who must identify the correct text to pass the test. Noise and distortion are usually added to the letters to make the test robust against automated attacks.

Gimpy and EZ-Gimpy [2] are based on the human ability to read heavily distorted and corrupted text. Gimpy picks 10 random dictionary words and arranges them so that they overlap with each other. Users have to identify at least three words to pass the test. EZ-Gimpy uses only one random dictionary word, but instead increases security through deformation, blurring, noise, and distortion of the letters. Mori and Malik [23] broke Gimpy and EZ-Gimpy using object recognition algorithms and dictionary attacks.

Baffle text [6] mitigates dictionary attacks by using pseudorandom but pronounceable words. Users are asked to infer the correct answer from words with missing parts of letters. MSN Passport CAPTCHA [7] has eight characters, including letters and digits, which are heavily warped.

The prime advantages of 2D text-based CAPTCHAs are that they are easy to generate and identify. However, they are also easily recognized through OCR attacks [28, 34]. More advanced 2D text-based CAPTCHAs have been introduced [10, 20]; however, they too have been broken by machine learning attacks [12, 25].

2.2 3D Model-Based CAPTCHAs

3D CAPTCHAs are designed to defeat bots that use machine learning to identify 2D text-based CAPTCHAs easily, by minimizing legibility. Most 3D model-based CAPTCHAs are based on the rotation of 3D models [14, 26, 33]. They exploit a cognitive ability of humans called mental rotation [4, 27], an inherent characteristic of human nature that enables humans to compare two models in different orientations. However, users find it difficult to deal with an unfamiliar model and therefore take a long time to solve the problem.

3D text-based CAPTCHAs are more familiar to users because they originate from the text-based CAPTCHAs that users are accustomed to. Ince et al. [16] introduce a cubic-style 3D CAPTCHA, which contains six letters, one on each side of a cube. However, it is vulnerable to segmentation attacks because the letters are simply attached to each side of the cube without interference. Several CAPTCHAs use a sequence of 3D letter models created by extrusion and warping [15, 19]. Ye et al. [37] demonstrate that CAPTCHAs with such simple distorted forms can be broken with a high degree of accuracy.

2.3 Interactive CAPTCHAs

Interactive CAPTCHAs rely on user interaction to mitigate automated attacks. They require users to solve the problem through cognitive abilities and human actions. The 3D model-based CAPTCHAs [16, 33] that require rotating the 3D model to obtain the answer also belong to this category.

2D images are the most commonly used sources in interactive CAPTCHAs. Gossweiler et al. [13] introduce a 2D image-based interactive CAPTCHA that requires users to rotate a randomly oriented image to its upright orientation. SEIMCHA [21] applies geometric transformations to 2D images, and users are required to identify the upright orientation of the image. Gao et al. [11] introduce a CAPTCHA that asks users to solve a jigsaw puzzle made from a 2D image divided into small, shuffled pieces. In rotateCAPTCHA [32], which combines the orientation and puzzle-solving techniques, users have to recreate the original image by rotating the segmented image.

CAPTCHaStar [9] requires users to move the cursor to reposition a set of randomly scattered small squares until the correct shape is recognized. Our DotCHA adopts the movement of small scattered objects from CAPTCHaStar to ensure resistance to random guessing attacks [18]. The main problem with interactive CAPTCHAs, which emphasize security over usability, is that they are difficult and time-consuming to solve, because users are not accustomed to solving visual problems. Our DotCHA combines the strengths of text-based and interactive CAPTCHAs to satisfy both security and usability.

3 Generation of DotCHA

DotCHA is designed to satisfy both security and usability by improving on the security of 2D text-based CAPTCHAs and the usability of 3D model-based CAPTCHAs. In a 2D text-based CAPTCHA, all letters are visible at once and are legible because of their clear lettering; this makes it vulnerable to OCR attacks. In DotCHA, each letter is legible only at a unique rotation angle, achieved by twisting 3D extruded models around a center axis, as shown in Fig. 3(c); this makes the technique robust to OCR attacks. In addition, we remove visually unnecessary parts of the models so that each letter cannot be read in any direction other than its unique one, as shown in Fig. 3(d).

Fig. 2. System pipeline to generate DotCHA. Target letters are extruded and twisted into a 3D model, then split into small unit blocks. Redundant blocks, which do not affect the perception of the letters, are removed, and the remaining blocks are converted to spheres. Finally, noise spheres are added around the model.

Fig. 3. Sample result of the DotCHA generation.

To improve usability, DotCHA contains only 3D letter models instead of other visual representations, such as images or object models. The rotation axis of DotCHA is fixed to a single axis, shown as a black bar in Fig. 3, to reduce the burden and confusion of users who might otherwise wander around the 3D space. We replace the remaining cubes with spheres to prevent direction attacks, which guess the unique orientation of each letter by aligning the edges of small cubes; more details are given in Sect. 3.3. Figure 2 shows the pipeline for generating a DotCHA from given target letters.

3.1 Extrusion and Twist of 3D Model

We use the molecular construction method [17, 22] to engrave the given letters on a solid rectangular parallelepiped model. Molecular construction [22] is a technique in which a model is divided into smaller units that together form the larger model. The basic idea of generating a DotCHA is to cut a solid cube model into small unit blocks and then delete unnecessary blocks or add missing blocks to represent the given letters. Figure 3(b) and (c) show the results of extrusion and twist, respectively. One letter has a size of \(k \times k \times k\), and it retains the same shape as an extrusion of a 2D letter pattern.

The 3D letter models are then rotated around the center axis of the rectangular parallelepiped model at unique angles to ensure that the correct answers are not recognized from a single direction. If the number of letters used in a CAPTCHA is n, the correct answer of the DotCHA can be identified only if the machine finds all n directions.
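The extrusion and per-letter rotation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the voxel representation as coordinate triples, the snapping of rotated centers back to the grid, and the function names are all assumptions.

```python
import math

def extrude(pattern, k):
    """Extrude a k x k binary letter pattern along the x axis into a set of
    k x k x k voxel coordinates (molecular-construction style unit blocks)."""
    return {(x, y, z) for x in range(k)
            for y in range(k) for z in range(k) if pattern[y][z]}

def rotate_x(voxels, angle_deg, k):
    """Rotate voxel centers by a unique angle around the letter's central x axis,
    snapping the rotated centers back to the integer grid."""
    c = (k - 1) / 2.0                       # center of the k x k cross-section
    a = math.radians(angle_deg)
    out = set()
    for x, y, z in voxels:
        dy, dz = y - c, z - c
        ry = c + dy * math.cos(a) - dz * math.sin(a)
        rz = c + dy * math.sin(a) + dz * math.cos(a)
        out.add((x, round(ry), round(rz)))
    return out
```

With n letters, each letter model gets its own angle, so an attacker must recover all n view directions to read the whole answer.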

3.2 Removal of Redundant Blocks

Although not all letters are visible at once, which defends against segmentation attacks, a twisted model is still vulnerable to OCR attacks. We therefore remove a set of unit blocks from the models so that each letter is recognized only in one particular direction, not in arbitrary directions. The blocks are removed based on two conditions. First, we remove unnecessary blocks that do not affect the shape of the letters: if two blocks are placed side by side along one axis, the letter remains recognizable in a certain direction even if one of them is discarded. Second, the blocks are removed evenly, preserving the balance between directions, because clustered blocks make the letters easy to identify from an arbitrary direction.

We use a multigraph G, which allows multiple edges between a pair of vertices. Each unit block is represented as a vertex of G. Two unit blocks are connected by two types of edges, depending on whether the blocks share the same coordinate along the y or z axis. Since the rotation axis is fixed to the x axis and overlap in x coordinates does not affect the recognition of a letter, we ignore the x axis in G. We score all vertices according to the scoring function S of vertex v as follows:

$$\begin{aligned} S(v) = \alpha \cdot \left| N_R(v) \right| + \left| N_G(v) \right| \end{aligned}$$
(1)

where \(N_G(v)\) is the set of vertices adjacent to v in graph G and \(N_R(v)\) is the set of neighboring vertices. For a given vertex v, a neighboring vertex is one whose Euclidean distance from v is at most k. The first term captures the dispersion of blocks in the cube: it counts the blocks that exist around block v. The second term counts the blocks placed along the y and z axes. \(\alpha \) is a constant that balances the two terms; we used \(\alpha =0.3\) in our experiments.

We iterate over the vertices in descending order of score to decide whether to remove each block. At each iteration, the block is discarded from the DotCHA model unless it is the only block placed along its y or z axis; otherwise, we keep the block and continue to the next iteration. The iterations stop when their number exceeds \(\mu nk^3\), where \(\mu \) represents the removal ratio and \(nk^3\) is the volume of the bounding box. A large \(\mu \) makes it take long for the user to find the correct answer, while a small \(\mu \) weakens security; an appropriate balance of \(\mu \) is therefore important. We used \(\mu =0.8\) in our experiments.
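The scoring function of Eq. (1) and the greedy removal pass can be sketched as below. This is one plausible reading of the paper's description, not its implementation: in particular, treating \(N_G(v)\) as the set of distinct blocks sharing a y or z coordinate, and the tie-breaking order of equal scores, are assumptions.

```python
import math

def score(v, voxels, k, alpha=0.3):
    """S(v) = alpha * |N_R(v)| + |N_G(v)|: blocks within Euclidean distance k,
    plus blocks sharing the same y or z coordinate (x is ignored)."""
    n_r = sum(1 for u in voxels if u != v and math.dist(u, v) <= k)
    n_g = sum(1 for u in voxels if u != v and (u[1] == v[1] or u[2] == v[2]))
    return alpha * n_r + n_g

def prune(voxels, k, n, mu=0.8, alpha=0.3):
    """Visit blocks in descending score order, discarding a block unless it is
    the last one on its y or z line (removing it would erase part of a letter).
    At most mu * n * k^3 blocks are considered."""
    voxels = set(voxels)
    budget = int(mu * n * k ** 3)
    ranked = sorted(voxels, key=lambda v: score(v, voxels, k, alpha), reverse=True)
    for v in ranked[:budget]:
        same_y = any(u != v and u[1] == v[1] for u in voxels)
        same_z = any(u != v and u[2] == v[2] for u in voxels)
        if same_y and same_z:
            voxels.discard(v)
    return voxels
```

Note that the scores are computed once against the initial block set; recomputing them after each removal would be an equally plausible variant.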

3.3 Prototype Implementation

Since the direction of cube blocks can be inferred from their edges, the orientation of the letters can be easily identified. To hide the orientation of the model, we convert the unit blocks into spheres, producing a result similar to that of the scatter-type method [3]. The post-processing involves three parameters that balance usability and security:

  • Sphere radius (\(\rho \)): the radius of the sphere converted from the unit block. The edge length of a unit cube is 1, and \(\rho =1\) means that a sphere fits exactly into the cube without any cutoff or margins.

  • Sphere offset (\(\sigma \)): the location offset of the center of the sphere from the center of the unit block. \(\sigma =0\) means that the centers of the sphere and unit cube are identical, and \(\sigma =0.5\) means that the center of the sphere lies on the surface of the unit cube.

  • Noise (\(\delta \)): the number of noise spheres.

After the redundant blocks are removed, \(\delta \) noise spheres are added to the model to prevent recognition by automated machines. The noise region is three times larger than \(nk^3\) and excludes the bounding box of DotCHA. This exploits motion parallax, which gives users a perception of depth from the relative motion between models [8]. The noise spheres appear to move faster or slower than the letter spheres, so users can distinguish them through the human visual system's depth perception. We set \(\rho \) smaller than half the edge of the unit block to avoid edge detection attacks, and each sphere is randomly translated within the range \((0, \sigma )\) to avoid pixel-count attacks.
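The sphere conversion and noise placement can be sketched as follows. This is a hedged sketch: the exact shape of the noise region and the reading of \(\delta \) as a density (so that the sphere count scales with the region volume \(3nk^3\)) are assumptions made for illustration.

```python
import random

def place_spheres(voxels, rho=0.4, sigma=0.3):
    """Convert unit blocks to spheres of radius rho, jittering each center by a
    random offset in (0, sigma) per axis so edges and pixel counts stop lining
    up with the cube grid."""
    return [(tuple(c + random.uniform(0, sigma) for c in v), rho) for v in voxels]

def add_noise(spheres, n, k, delta, rho=0.4):
    """Scatter noise spheres around, but never inside, the n*k x k x k letter
    bounding box. The count delta * 3nk^3 treats delta as a density (assumption)."""
    count = int(delta * 3 * n * k ** 3)
    noise = []
    while len(noise) < count:
        p = (random.uniform(-k, (n + 1) * k),
             random.uniform(-k, 2 * k),
             random.uniform(-k, 2 * k))
        if 0 <= p[0] <= n * k and 0 <= p[1] <= k and 0 <= p[2] <= k:
            continue                      # reject points inside the letters' box
        noise.append((p, rho))
    return spheres + noise
```

With \(n=6\), \(k=10\), and \(\delta =0.001\), this reading yields 18 noise spheres per challenge.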

The rotation axis of DotCHA is fixed to the x axis to reduce the burden and confusion for users. In addition, we support both automatic and interactive rotation to improve usability. DotCHA is implemented with the Three.js library, a lightweight 3D engine, on the HTML5 Canvas, so it is supported by the majority of browsers. We use a \(k=10\) letter pattern with the Consolas font. In addition, to defend against segmentation attacks, random clutter spheres are added to the background, making it difficult to separate the foreground from the background to identify the letters.

4 Experiments

We analyzed the security of DotCHA by considering several different attack scenarios. The goal of all the attacks is twofold: (1) to find the correct view directions to identify the letters and (2) to read the letters in the selected view directions. For the first goal, we tested whether a particular view direction can be characterized by the attacks. We tested the second goal by applying OCR to read the sampled images. We used \(n=6\) letters per DotCHA, and random letter combinations were used to avoid dictionary attacks.

4.1 Finding the Correct View Directions

As mentioned in Sect. 3.1, an automated attack should find the \(n=6\) correct view directions to identify the correct answer. We sampled 30 different views including six ground truth views and scored them through pixel counting and edge detection.

Table 1. Result of scoring with pixel counting. Thirty views have been ranked through pixel counting, and the views with the largest and smallest pixel counts are shown in order. In addition, the pixel counting results of the correct views are shown in the right column with the correct letters. The pixel counting attack failed to find the correct views, showing that the correct view cannot be distinguished by pixel counting.

Score with Pixel Counting. The pixel counting attack is based on two assumptions: first, the wider the overlap between the spheres, the clearer the shape of the letters; second, the narrower the overlap between the spheres, the more information that can be represented. We counted the number of non-background pixels and checked whether the correct views can be distinguished from incorrect views by pixel count. As Table 1 shows for \(\rho =0.5\), \(\sigma =0.2\), and \(\delta =0.001\), the correct view directions cannot be identified by pixel counting: the correct answers are ranked arbitrarily, regardless of pixel count. When converting the unit blocks into spheres, we added a random offset to each sphere's position, and this makes DotCHA robust to pixel counting.
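The attack itself is simple to state in code. The sketch below, an assumption of how such a scorer would be written (the paper does not give one), takes rendered views as binary rasters and ranks them by foreground pixel count; on DotCHA, the ground-truth views do not stand out in this ranking.

```python
def pixel_count(image):
    """Number of non-background (non-zero) pixels in one rendered view."""
    return sum(px != 0 for row in image for px in row)

def rank_views(views):
    """Indices of the candidate views sorted by pixel count, highest first."""
    return sorted(range(len(views)),
                  key=lambda i: pixel_count(views[i]), reverse=True)
```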

Table 2. Result of scoring with edge counting. Thirty views have been ranked through edge counting, and the views with the largest and smallest edge counts are shown in order. In addition, the edge counting results of the correct views are shown in the right column with the correct letters. The edge counting attack failed to find the correct views, showing that the correct view cannot be distinguished by edge counting. \(\rho =0.5\), \(\sigma =0.2\), and \(\delta =0.001\) were used for edge counting.

Score with Edge Detection. This criterion aims to find the correct view directions by detecting edges from the original images. We ran Canny edge detection [5] on every sampled image. A DotCHA model was projected onto a 2D text form after removing unnecessary pixel information via edge detection. We counted the number of pixels in the edge-detected images to distinguish the correct view directions from the irrelevant view directions.

There was no correlation between the correct view directions and the number of edges, as shown in Table 2. When converting the unit blocks into spheres, we made each sphere smaller than its unit block. As a result, the spheres were separated from each other, and edge detection revealed the edges of the spheres themselves, which lowered the prominence of the letter edges.
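A minimal stand-in for the edge-count score can be written without an image library: on a binarized render, count the foreground pixels that border the background. This is a crude boundary detector used here only to illustrate the scoring idea; the experiments above used Canny edge detection [5], not this sketch.

```python
def edge_pixels(image):
    """Count foreground pixels with at least one background (or out-of-bounds)
    4-neighbour: the boundary pixels a detector like Canny would fire on."""
    h, w = len(image), len(image[0])
    count = 0
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            neighbours = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(not (0 <= j < h and 0 <= i < w) or not image[j][i]
                   for j, i in neighbours):
                count += 1
    return count
```

Because each sphere contributes its own closed boundary, this count is dominated by sphere outlines rather than letter strokes, which is why it fails to single out the correct views.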

4.2 Reading the Letters from the Correct View Directions

In the previous subsection, we showed that it is difficult to find the correct view directions automatically. In this subsection, we test whether the letters can be read from the correct view directions, assuming that a machine has somehow found them.

Pixel-Count Attack. A pixel-count attack [34, 36] segments the letters using the vertical histogram of a CAPTCHA image and counts the pixels of each segmented letter. The pixel count is then looked up in a precomputed table that stores the count for every letter.

The most important part of segmentation is the removal of background clutter. However, this is difficult for DotCHA, because the clutter spheres look similar to the spheres that form the letter model. As a result, they disturb segmentation through the vertical histogram, as shown in Fig. 4.
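A vertical-histogram segmenter can be sketched as below; this is an illustrative reconstruction of the standard technique [34, 36], not code from the paper. It splits letters at empty (or near-empty) columns, which is exactly the structure DotCHA's clutter spheres destroy by filling the gaps between letters.

```python
def vertical_histogram(image):
    """Column-wise foreground pixel counts of a binary raster."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]

def segment(hist, gap=0):
    """Split the image into letter spans at columns whose count is <= gap.
    Clutter raises the valleys above gap, so the split points disappear."""
    segments, start = [], None
    for x, h in enumerate(hist):
        if h > gap and start is None:
            start = x
        elif h <= gap and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(hist)))
    return segments
```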

Fig. 4. Vertical histograms of DotCHA, when \(\rho =0.35\) and \(\delta =0.0015\). Noise remains in the image and disturbs the segmentation by affecting the histogram. As the range of the random offset increases, segmentation becomes harder, making it more difficult to identify the letters.

Table 3. Pixel counts from different views of the letter ‘D’, when \(\rho =0.4\), \(\sigma =0.3\), and \(\delta =0.001\). The pixel counts vary depending on the rotation angle, making the scheme resistant to pixel counting attacks.

DotCHA requires as many segmentations as there are letters, which is a costly overhead to repeat. Moreover, segmentation can be thwarted by increasing the range of the random sphere offset. Even if segmentation works well, DotCHA is still resistant to pixel-count attacks, because the number of pixels varies with the view direction and clutter, as depicted in Table 3. Furthermore, even if some segmentations succeed, it is impossible to guess the entire word from only a few letters, because DotCHA does not use dictionary words.

Recognition by Using OCR. We conducted recognition tests using two well-known OCR engines: Google Tesseract [30] and ABBYY FineReader 14 [1]. We performed two types of attacks: entering whole words into the OCR engines for automated recognition, and entering segmented individual letters into the OCR engines.

In both attacks, the OCR engines could not completely recognize any of the words, even from the correct view images. Since the spheres were small, they were not connected to each other; as a result, it was difficult for OCR to identify the letters, just as with the scatter-type CAPTCHA, which is resistant to OCR [3]. We then shrank the images so that the individual spheres disappeared in the downsampled result. Even then, the letters were only partially recognized, with a success rate of just 3.3%, which shows that DotCHA provides reasonable resistance to OCR attacks.
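The shrinking step used to merge the sphere dots into strokes can be sketched as a block-majority downsample. The factor and the majority rule are assumptions for illustration; the experiments above simply shrank the image before feeding it to the OCR engines.

```python
def downsample(image, f):
    """Shrink a binary raster by factor f: each f x f block becomes one pixel,
    set when at least half of the block is foreground. Nearby sphere dots merge
    into solid strokes that an OCR engine can partially read."""
    h, w = len(image) // f, len(image[0]) // f
    out = []
    for by in range(h):
        row = []
        for bx in range(w):
            s = sum(image[by * f + j][bx * f + i]
                    for j in range(f) for i in range(f))
            row.append(1 if 2 * s >= f * f else 0)
        out.append(row)
    return out
```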

Table 4. Values of parameters \(\mu \), \(\rho \), \(\sigma \), and \(\delta \) for the survey.

4.3 User Study

We ran a user study to compare the response time and accuracy of solving a 2D text-based CAPTCHA and our DotCHA according to the usability metrics [35]. Fifty participants, recruited online, took part in our web-based survey, and all participants completed eight unsupervised tests on their own devices: six DotCHA challenges (named T1 to T6) and two 2D text-based CAPTCHA challenges (T7 and T8). All DotCHA challenges had \(n=6\) letters and were randomly generated using the parameter values in Table 4. T7 and T8 were generated from reCAPTCHA with a single word. To familiarize the participants with the challenges, one practice DotCHA demo was shown at the beginning of the survey. The participants were not told whether they had passed or failed each challenge.

Fig. 5. Success response rates of the survey.

Figure 5 plots the success response rates of the tests. The 2D text-based CAPTCHAs (T7 and T8) have shorter response times than DotCHA, while the DotCHA challenges achieve higher success rates at the cost of slightly longer response times. The average success rates of the 2D text-based CAPTCHAs (T7 and T8) are 39.5% and 63.2%, respectively, while those of the DotCHA challenges (T2 and T6) are 80.0% and 89.5%. Considering the extra time overhead of repeated attempts after a failed challenge, the slightly longer response time is acceptable.

Although T6 is the most difficult test, with high \(\mu \), \(\sigma \), and \(\delta \) and a low \(\rho \), the participants acquired the correct answer more often than in the other DotCHA tests. Also, in the case of T1, the easiest test, it took longer to acquire the correct answer than in T2 and T4. This result suggests that the parameters have no significant effect on the users' perception, so they can be adjusted with more emphasis on security.

5 Conclusion

In this paper, we have proposed a new type of 3D text-based CAPTCHA, called DotCHA, which attempts to overcome the limitations of existing 2D and 3D approaches. It is a scatter-type CAPTCHA, which shows different letters according to the rotation angle, and the user should rotate the 3D model to identify the letters. We demonstrated that DotCHA is robust against several types of attacks.

There is a general consensus that it is hard to design a CAPTCHA that simultaneously combines good usability and security [24]. To improve the usability of DotCHA while preserving security, we combined the automated rotation and interactive systems. As we demonstrated, DotCHA defeats several types of attacks even when the correct view direction is given.

To further improve security, two additional strategies are possible: adding a background and using a set of small particles instead of a single sphere. A complicated background protects the CAPTCHA against machines because of the difficulty of separating the foreground from the background to identify the letters. In addition, a set of small particles increases the scatter ratio and strengthens the defense against pixel-count and edge detection attacks.