1 Introduction

When a major criminal (a drug dealer, kidnapper, or serial killer) operates in a neighborhood, the work of the police can be compared to finding a needle in a haystack. The community wants to help, but the number of calls can be overwhelming, and the citizens' noblest intentions to contribute can translate into countless unsubstantiated clues. More importantly, the police cannot follow up on every tip from the community because of limited resources. But what if, instead of treating tips as unrelated data, we grouped and analyzed them to identify patterns?

Recent years have shown us that the active collaboration of a large community, also known as crowdsourcing, can play a decisive role in solving challenging tasks [1, 2]. Examples include finding a lost boat in thousands of satellite images [3], studying migration patterns of birds [4], searching for anomalous archaeological patterns to locate the lost tomb of Genghis Khan [5], propagating information to bring relief in natural disasters [6], tracking stolen vehicles using social media [7], and aiding the transparency and accountability of the justice system [8].

In this paper we formalize the idea of crowdsourcing criminal detection: using the citizens' tips to rank the houses in a community according to the likelihood that they accommodate a criminal. We show that if reasonable assumptions are met, the strategy provably succeeds at locating houses hosting criminals. We extend the model to account for complicating factors such as geographic proximity and personal resentment or prejudice, and we present other settings where similar strategies may be applied with very promising results. We complement our theoretical findings with experiments that illustrate our approach and show its effectiveness.

Organization of the Paper. In Sect. 2 we introduce our model and our main results, which we prove in Sect. 3. In Sect. 4 we present experiments that support our theory. In Sect. 5 we give a brief discussion of our findings, along with simple generalizations and other settings where our ideas may be applied.

2 Model and Main Results

Suppose there is a criminal living in one of the \({{n}}+1\) houses of a city. The goal is to identify \({{h_\star }}\), the house that hosts the criminal. The police receive \({{m}}\) tips from the citizens, and each tip suggests one house suspected to be \({{h_\star }}\). In the end we select the most suggested house, \({{\hat{h}}}\).

Let \({{\mathscr {H}}}=\{{{h}}_1,{{h}}_2,\dots ,{{h}}_{{n}},{{h_\star }}\}\) denote the set of all houses. Suppose that if a citizen provides a tip, he will independently suggest \({{h_\star }}\) with probability \({{p_\star }}\), and \({{h}}_j\) with probability \({{p}}_j\). We will assume without loss of generality that \({{p}}_1 \ge {{p}}_2 \ge \cdots \ge {{p}}_{{n}}\). This way, \({{h}}_1\) is the most suspicious (with highest probability of being suggested) among the innocent houses. Intuitively, \({{p_\star }}\) models the accuracy of the citizens’ perception and \({{p}}_1\) models their level of prejudice or other sources of inaccuracy.

Our main result is presented in the following theorem. It essentially states that as long as \({{p_\star }}\) (the citizens’ accuracy) is slightly larger than \({{p}}_1\) (the level of prejudice), then with high probability the most suggested house will indeed be the one hosting the criminal.

Theorem 1

Let \({{\epsilon }}\in (0,1)\), and suppose that

$$\begin{aligned} {{p_\star }}- {{p}}_1 \ \ge \ \sqrt{ \frac{2}{{{m}}} \log \frac{{{n}}}{{{\epsilon }}} }. \end{aligned}$$
(1)

Then \(\mathsf {P}\big ( {{\hat{h}}}= {{h_\star }}\big ) \ge 1 - {{\epsilon }}\).

The proof of Theorem 1 is given in Sect. 3. Equivalently, Theorem 1 states that as long as we have enough tips to overcome the gap between \({{p_\star }}\) and \({{p}}_1\), we will identify \({{h_\star }}\) with high probability. This result is related to survey sampling. For a fixed \({{n}}\), the gap required by (1) is \(O(1/ \sqrt{{{m}}})\). A conservative two-sample test for a difference in proportions \(q_1\) and \(q_2\) declares the two proportions distinguishable if their confidence intervals do not overlap, and the width of each confidence interval is \(O(1/\sqrt{{{m}}})\), where \({{m}}\) is the sample size.
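To make this concrete, the following minimal Python sketch (ours, not part of the original experiments) simulates the basic model: it draws \({{m}}\) i.i.d. tips from the categorical distribution \(({{p}}_1,\dots ,{{p}}_{{n}},{{p_\star }})\), checks how often the most suggested house is \({{h_\star }}\), and prints the failure bound \({{n}}e^{-\frac{{{m}}}{2}({{p_\star }}-{{p}}_1)^2}\) used in the proof. All numbers are illustrative.

```python
import numpy as np

def success_rate(p, m, trials=2000, seed=0):
    """Estimate P(h_hat = h_star) in the basic model.

    p: suggestion probabilities over the n+1 houses; by convention
       the last entry is p_star (the house hosting the criminal).
    m: number of tips per replicate.
    """
    rng = np.random.default_rng(seed)
    star = len(p) - 1
    wins = 0
    for _ in range(trials):
        counts = rng.multinomial(m, p)   # m independent tips
        wins += counts.argmax() == star  # pick the most suggested house
    return wins / trials

n, m = 9, 500
p = np.full(n + 1, 0.09)     # innocent houses: p_1 = ... = p_n = 0.09
p[-1] = 1 - 0.09 * n         # p_star = 0.19, so the gap is 0.10
bound = n * np.exp(-m / 2 * (p[-1] - p[0]) ** 2)
print(f"empirical success: {success_rate(p, m):.3f}; "
      f"Hoeffding failure bound: {bound:.3f}")
```

With a gap of 0.10 and \({{m}}=500\), the empirical success rate is essentially 1 while the bound is roughly 0.74, illustrating that the condition in (1) is sufficient but conservative.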

2.1 Geographic Dependency

In practice, it is more likely that citizens perceive suspicious activities on houses that they frequently see, e.g., neighboring houses or houses on their way to work. We can model this by weighting the inherent probabilities \(\{ {{p}}_1, {{p}}_2, \dots , {{p}}_{{n}}, {{p_\star }}\}\) by the exposure that citizens have to the houses, e.g., by the distances between citizens and houses.

To this end we introduce the matrix \({{\varvec{\mathrm{G}}}}\) that encodes the geographic dependency. Essentially, \({{\varvec{\mathrm{G}}}}\) specifies the proximity of each citizen to each house, and this determines the probability that each citizen perceives suspicious activities in each house. More precisely, let \({{\varvec{\mathrm{G}}}}\in \mathbb {R}^{{{m}}\times ({{n}}+1)}\). If citizen \({i}\) lives in house \({j}\), then \({{\varvec{\mathrm{G}}}}_{ij}:=0\). Otherwise, \({{\varvec{\mathrm{G}}}}_{ij}\) encodes how close citizen \({i}\) is to house \({j}\): the closer citizen \({i}\) is to house \({j}\), the larger \({{\varvec{\mathrm{G}}}}_{ij}\). The setup in the previous section is the particular case where all entries in \({{\varvec{\mathrm{G}}}}\) are equal.

Example 1

Consider a street with \({{n}}+1\) houses. Suppose there is one citizen living in each house, and that each reported one tip to the police, such that \({{m}}={{n}}+1\). Suppose we measure the geographic dependency between citizen \({i}\) and house \({{h}}_j\) using the number of houses between \({{h}}_i\) and \({{h}}_j\), such that

$$\begin{aligned} {{\varvec{\mathrm{G}}}}_{ij} \ := \ \left\{ \begin{array}{rcl} {{n}}- |{i}- {j}| &{} &{} \text {if}~{i}\ne {j}\\ 0 &{} &{} \text {if}~{i}= {j}. \end{array} \right. \end{aligned}$$

In this case, the closer \({{h}}_i\) is to \({{h}}_j\), the more likely it is that citizen \({i}\) notices suspicious activities in house \({{h}}_j\).
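Such matrices are easy to construct in code. The sketch below is our own illustration; both functional forms in it are assumptions (the form above for the Difference setting of Example 1, and the inverse-distance variant used later in the experiments of Sect. 4).

```python
import numpy as np

def street_G(n, kind="difference"):
    """Geographic dependency for a street of n+1 houses, one citizen
    per house. Both forms below are assumed reconstructions:
      kind="difference": G_ij = n - |i - j|  (Example 1)
      kind="euclidean":  G_ij = 1 / |i - j|  (experiment setting (c))
    The diagonal is 0: a citizen never reports their own house.
    """
    idx = np.arange(n + 1)
    dist = np.abs(idx[:, None] - idx[None, :])  # houses apart
    if kind == "difference":
        G = (n - dist).astype(float)
    else:
        G = np.zeros_like(dist, dtype=float)
        G[dist > 0] = 1.0 / dist[dist > 0]
    np.fill_diagonal(G, 0.0)
    return G
```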

Let \({{\varvec{\mathrm{p}}}}\) be the diagonal matrix with diagonal elements taking the values in \(\{{{p}}_1, {{p}}_2, \dots , {{p}}_{{n}}, {{p_\star }}\}\). Let the \(({i},{j})^{th}\) entry of \({{\varvec{\mathrm{P}}}}:={{\varvec{\mathrm{G}}}}{{\varvec{\mathrm{p}}}}\) (with normalized rows) denote the probability that citizen \({i}\) suggests \({{h}}_j\) (given that citizen \({i}\) provided a tip). In particular, we use \({{\varvec{\mathrm{P}}}}_{i \star }\) to denote the probability that citizen \({i}\) suggests \({{h_\star }}\). In order to present our next result, let us introduce the set of \({{\gamma }}\)-perceptive citizens, defined as

$$\begin{aligned} \textstyle {{\mathscr {C}}}_{{\gamma }}\ := \ \left\{ \ {i}\ : \ {{\varvec{\mathrm{P}}}}_{i \star } - {{\varvec{\mathrm{P}}}}_{ij} \ge {{\gamma }}\quad \forall \ {j}\right\} . \end{aligned}$$

Intuitively, \({{\mathscr {C}}}_{{\gamma }}\) is the set of citizens that are at least \({{\gamma }}\) more likely to suggest \({{h_\star }}\) than any other house.
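Given \({{\varvec{\mathrm{G}}}}\) and the suspiciousness levels, both \({{\varvec{\mathrm{P}}}}\) and \({{\mathscr {C}}}_{{\gamma }}\) are straightforward to compute. The sketch below (ours) reuses street_G from the previous example; star denotes the column index of \({{h_\star }}\).

```python
import numpy as np

def suggestion_matrix(G, p):
    """P := G diag(p) with rows normalized to sum to 1.
    Row i is citizen i's suggestion distribution over the houses."""
    P = G * p                                # scale column j by p_j
    return P / P.sum(axis=1, keepdims=True)

def gamma_perceptive(P, gamma, star):
    """Citizens i with P[i, star] - P[i, j] >= gamma for every
    innocent house j (i.e., every column other than star)."""
    others = np.delete(P, star, axis=1)
    gaps = P[:, [star]] - others
    return np.where((gaps >= gamma).all(axis=1))[0]
```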

The next theorem is a generalization of Theorem 1. It states that if there are enough tips from sufficiently perceptive citizens, then with high probability the most suggested house will indeed be the one hosting the criminal.

Theorem 2

Let \({{\epsilon }}\in (0,1)\), and for \({{k}}\in \{1,\dots ,{{m}}\}\) define

$$\begin{aligned} {{\gamma }}_k \ := \ \frac{{{m}}-{{k}}}{{{k}}} \ + \ \sqrt{ \frac{2}{{{k}}} \log \frac{{{n}}}{{{\epsilon }}} }. \end{aligned}$$
(2)

If there exists a \({{k}}\) such that \(|{{\mathscr {C}}}_{{{\gamma }}_k}| \ge {{k}}\), then \(\mathsf {P}\big ( {{\hat{h}}}= {{h_\star }}\big ) \ge 1 - {{\epsilon }}\).

The proof of Theorem 2 is given in Sect. 3. Notice the double dependency on \({{k}}\) in Theorem 2. First, \({{k}}\) determines the number of perceptive citizens required, that is, the number of citizens in \({{\mathscr {C}}}_{{{\gamma }}_k}\). Second, it determines how perceptive each of them must be, which is given by \({{\gamma }}_k\). The larger \({{k}}\), the more perceptive citizens are required, but the less perceptive each needs to be.

Also notice that \({{r}}:=\frac{{{m}}-{{k}}}{{{k}}}\) represents the ratio of non-perceptive to perceptive citizens (among those that provided tips). So in words, Theorem 2 states that as long as there is a group \({{\mathscr {C}}}_{{{\gamma }}_k}\) of \({{k}}\) perceptive citizens that are more likely to suggest \({{h_\star }}\) than any other house by a little more than \({{r}}\), then with high probability we will identify \({{h_\star }}\). This little more is given by \(\sqrt{\frac{2}{{{k}}} \log \frac{{{n}}}{{{\epsilon }}}}\). In a nutshell, Theorem 2 requires having enough citizens that provide tips (at least \({{k}}\)) with sufficient accuracy (at least \({{\gamma }}_k\)).

Finally, observe that \({{\gamma }}_k\) is monotonically decreasing in \({{k}}\). This implies that \({{\mathscr {C}}}_{{{\gamma }}_{k+1}}\) allows citizens with less perception than \({{\mathscr {C}}}_{{{\gamma }}_k}\), which in turn implies

$$\begin{aligned} {{\mathscr {C}}}_{{{\gamma }}_1} \subset {{\mathscr {C}}}_{{{\gamma }}_2} \subset {{\mathscr {C}}}_{{{\gamma }}_3} \subset \cdots . \end{aligned}$$

So the question is: as \({{k}}\) grows and \({{\gamma }}_k\) shrinks, will \({{\mathscr {C}}}_{{{\gamma }}_k}\) grow enough to contain at least \({{k}}\) citizens? This will depend on \({{\varvec{\mathrm{P}}}}\), which in turn depends on \({{\varvec{\mathrm{p}}}}\) and \({{\varvec{\mathrm{G}}}}\). Fortunately, given \({{\varvec{\mathrm{p}}}}\) and \({{\varvec{\mathrm{G}}}}\), we can iteratively test whether \({{\mathscr {C}}}_{{{\gamma }}_k}\) has at least \({{k}}\) elements. If so, by Theorem 2 we will identify \({{h_\star }}\) with high probability. See Fig. 1 to build some intuition.
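This iterative test is easy to implement: scan \({{k}}=1,\dots ,{{m}}\), compute \({{\gamma }}_k\) as in (2), and stop at the first \({{k}}\) for which \(|{{\mathscr {C}}}_{{{\gamma }}_k}| \ge {{k}}\). A minimal sketch (ours), reusing gamma_perceptive from above:

```python
import numpy as np

def gamma_k(k, m, n, eps):
    """gamma_k := (m - k)/k + sqrt((2/k) log(n/eps)), as in (2)."""
    return (m - k) / k + np.sqrt(2.0 / k * np.log(n / eps))

def theorem2_test(P, star, eps=0.05):
    """Return the first k with |C_{gamma_k}| >= k, or None.
    If some k qualifies, Theorem 2 guarantees identification of
    h_star with probability at least 1 - eps."""
    m, houses = P.shape
    n = houses - 1
    for k in range(1, m + 1):
        if len(gamma_perceptive(P, gamma_k(k, m, n, eps), star)) >= k:
            return k
    return None
```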

We point out that Theorem 2 considers the worst-case scenario in which all non-perceptive citizens may even be providing tips collaboratively and adversarially to confuse the police. More about this is discussed in Sect. 5.

Fig. 1.

Theorem 2 asks for a set \({{\mathscr {C}}}_{{{\gamma }}_k}\) with at least \({{k}}\) citizens, such that each of these citizens has a gap between \({{\varvec{\mathrm{P}}}}_{i \star }\) and any \({{\varvec{\mathrm{P}}}}_{ij}\) at least as large as \({{\gamma }}_k\). If such a set exists, then with high probability we will identify \({{h_\star }}\). Notice that \({{\gamma }}_k\) is monotonically decreasing in \({{k}}\). This implies that \({{\mathscr {C}}}_{{{\gamma }}_1} \subset {{\mathscr {C}}}_{{{\gamma }}_2} \subset \cdots \). So the question is: as \({{k}}\) grows and \({{\gamma }}_k\) shrinks, will \({{\mathscr {C}}}_{{{\gamma }}_k}\) grow enough to contain at least \({{k}}\) citizens? In this figure, \({{\mathscr {C}}}_{{{\gamma }}_3}\) only contains 2 citizens (represented as points). It follows that \(|{{\mathscr {C}}}_{{{\gamma }}_3}|=2<3={{k}}\), so \({{\mathscr {C}}}_{{{\gamma }}_3}\) is not large enough to satisfy the conditions of Theorem 2. On the other hand, \({{\mathscr {C}}}_{{{\gamma }}_4}\) contains 5 citizens. This time \(|{{\mathscr {C}}}_{{{\gamma }}_4}|=5>4={{k}}\), and so \({{\mathscr {C}}}_{{{\gamma }}_4}\) satisfies the conditions of Theorem 2. Since there is a set that satisfies these conditions, namely \({{\mathscr {C}}}_{{{\gamma }}_4}\), we conclude that with high probability we will identify \({{h_\star }}\).

2.2 Tipping Prior

The matrix \({{\varvec{\mathrm{P}}}}\) determines how the vote of each citizen would be distributed if he provided a tip. In this section we add one simple layer to our model to account for the distribution of citizens that provide tips. To this end, observe that

$$\begin{aligned} {{\varvec{\mathrm{P}}}}_{ij} \ = \ \mathsf {P}(\text {citizen}~{i}~\text {suggests}~{{h}}_j \ | \ \text {citizen}~{i}~\text {provides a tip}) \end{aligned}$$

by definition. Letting \({{\pi }}_i\) denote the probability that citizen \({i}\) provides a tip, it follows that:

$$\begin{aligned} \mathsf {P}(\text {citizen}~{i}~\text {suggests}~{{h}}_j) \ = \ {{\varvec{\mathrm{P}}}}_{ij} \ {{\pi }}_i. \end{aligned}$$

It is then clear that the number of citizens that suggest \({{h}}_j\), and hence the outcome of our procedure, will depend on \({{\pi }}_i\). This probability can be modeled in different ways. For instance, it is reasonable to assume that citizens are more likely to provide a tip if they live near \({{h_\star }}\). In this case, we can model \({{\pi }}_i\) as

$$\begin{aligned} \mathrm{(i)}&\ {{\pi }}_i \ \propto \ 1 / {{d}}_{i\star }, \qquad \text {or} \\ \mathrm{(ii)}&\ {{\pi }}_i \ \propto \ \exp {(-{{d}}_{i\star }^2)}, \end{aligned}$$

where \({{d}}_{i\star }\) denotes the distance between citizen \({i}\) and \({{h_\star }}\). For example, (ii) corresponds to a Gaussian decay in \({{\pi }}_i\) as citizens get farther from \({{h_\star }}\).

We point out that \({{\varvec{\mathrm{G}}}}\) does not capture this information. Without taking into account \({{\pi }}_i\), this model could yield very poor performance. To see this, suppose that citizen \({i}\) is so far from \({{h_\star }}\) that \({{\varvec{\mathrm{P}}}}_{i \star }\) is much smaller than \({{\varvec{\mathrm{P}}}}_{ij}\) for some houses \({{h}}_j\) neighboring citizen \({i}\). This does not mean that citizen \({i}\) suspects any of these houses; in fact, citizen \({i}\) may not suspect criminal activities in any house at all. In this case, citizen \({i}\) is unlikely to provide a tip, which equates to \({{\pi }}_i\) being small. But if we ignore \({{\pi }}_i\) and still ask this citizen to provide a tip, it is very likely (because \({{\varvec{\mathrm{P}}}}_{i \star }\) is very small) that he will suggest one of his neighboring houses, contaminating the information provided to the police.
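A tipping prior is easy to fold into a simulation: first decide who tips (a Bernoulli draw per citizen with probability \({{\pi }}_i\)), then let each tipper suggest a house from the corresponding row of \({{\varvec{\mathrm{P}}}}\). The sketch below (ours) assumes distances are measured in houses along a street, as in Example 1.

```python
import numpy as np

def sample_tips(P, pi, rng=None):
    """Citizen i tips with probability pi[i]; a tipper suggests
    house j with probability P[i, j]. Returns suggestion counts."""
    rng = np.random.default_rng(rng)
    m, houses = P.shape
    counts = np.zeros(houses, dtype=int)
    for i in range(m):
        if rng.random() < pi[i]:                     # does citizen i tip?
            counts[rng.choice(houses, p=P[i])] += 1  # which house?
    return counts

# Prior (ii): Gaussian decay with distance from h_star (sharp, since
# the exponent is not rescaled here, exactly as written in (ii)).
star = 7
d = np.abs(np.arange(10) - star)   # distances in houses (10 houses)
pi = np.exp(-(d ** 2))
```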

3 Proofs

3.1 Proof of Theorem 1

Let \({{N}}_\star \) and \({{N}}_j\) denote the number of suggestions that \({{h_\star }}\) and \({{h}}_j\) receive, respectively. We want to show that with high probability the criminal lives in \({{\hat{h}}}\), the most suggested house. Union bounding over \({{\mathscr {H}}}\backslash \{{{h_\star }}\}\), we have that

$$\begin{aligned} \mathsf {P}\big ( {{\hat{h}}}\ne {{h_\star }}\big ) \ = \ \mathsf {P}\Big ( \bigcup _{{j}=1}^{{{n}}} \left\{ {{N}}_\star \le {{N}}_j \right\} \Big ) \ \le \ \sum _{{j}=1}^{{n}}\mathsf {P}\big ( {{N}}_\star \le {{N}}_j \big ). \end{aligned}$$
(3)

Let \({{Z}}_j:=\frac{1}{{{m}}}({{N}}_\star -{{N}}_j)\) such that \(\mathsf {P}({{N}}_\star \le {{N}}_j) = \mathsf {P}({{Z}}_j \le 0)\). Letting

$$\begin{aligned} {{Z}}_{ij} \ := \ \left\{ \begin{array}{rcl} 1 &{} &{} \text {if}~{i}^{th}~\text {citizen suggested house}~{{h_\star }}\\ -1 &{} &{} \text {if}~{i}^{th}~\text {citizen suggested house}~{{h}}_j \\ 0 &{} &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(4)

it is clear that \({{Z}}_j \ = \ \frac{1}{{{m}}}\sum _{{i}=1}^{{m}}{{Z}}_{ij}\). Since citizens suggest independently, the \({{Z}}_{ij}\)’s are i.i.d. random variables with mean \({{p_\star }}-{{p}}_j\). Using Hoeffding’s inequality [9] we obtain

$$\begin{aligned} \mathsf {P}\big ( {{Z}}_j \le 0 \big ) = \mathsf {P}\big ( \mathsf {E}[{{Z}}_j] - {{Z}}_j \ge ({{p_\star }}-{{p}}_j) \big ) \ \le \ e^{ -\frac{{{m}}}{2} ({{p_\star }}-{{p}}_j)^2 } \ \le \ e^{ -\frac{{{m}}}{2}({{p_\star }}-{{p}}_1)^2 }, \end{aligned}$$

where the last inequality follows because \({{p}}_1 \ge {{p}}_j\) \(\forall {j}\) by assumption. Going back to (3), we have that

$$\begin{aligned} (3) \ = \ \sum _{{j}=1}^{{n}}\mathsf {P}\big ( {{Z}}_j \le 0 \big ) \ \le \ \sum _{{j}=1}^{{n}}e^{ -\frac{{{m}}}{2}({{p_\star }}-{{p}}_1)^2 } \ = \ {{n}}e^{ -\frac{{{m}}}{2}({{p_\star }}-{{p}}_1)^2 } \ \le \ {{\epsilon }}, \end{aligned}$$

where the last inequality follows by (1).    \(\square \)
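For a sense of scale, (1) can be solved for the number of tips required. With the illustrative values \({{n}}=99\), \({{\epsilon }}=0.05\), and a gap \({{p_\star }}-{{p}}_1 = 0.1\) (our numbers, not from the paper):

$$\begin{aligned} {{m}}\ \ge \ \frac{2}{({{p_\star }}-{{p}}_1)^2} \log \frac{{{n}}}{{{\epsilon }}} \ = \ \frac{2}{(0.1)^2} \log \frac{99}{0.05} \ \approx \ 200 \times 7.59 \ \approx \ 1518 \ \text {tips}. \end{aligned}$$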

3.2 Proof of Theorem 2

Let \({{\mathscr {C}}}_{{{\gamma }}_k}\) be a set satisfying the conditions of Theorem 2. We start as before:

$$\begin{aligned} \mathsf {P}\big ( {{\hat{h}}}\ne {{h_\star }}\big ) \ = \ \mathsf {P}\Big ( \bigcup _{{j}=1}^{{n}}\left\{ {{N}}_\star \le {{N}}_j \right\} \Big ) \ \le \ \sum _{{j}=1}^{{n}}\mathsf {P}\big ( {{N}}_\star \le {{N}}_j \big ). \end{aligned}$$
(5)

In the worst-case scenario, all citizens are most likely to suggest the same house (other than \({{h_\star }}\)), which we assume without loss of generality to be \({{h}}_1\) (equivalently, \({{\varvec{\mathrm{P}}}}_{i1} \ge {{\varvec{\mathrm{P}}}}_{ij}\) \(\forall {i},{j}\)). It follows that \(\mathsf {P}( {{N}}_\star \le {{N}}_j ) \ \le \ \mathsf {P}( {{N}}_\star \le {{N}}_1 ) \ \forall {j}\), which further implies

$$\begin{aligned} (5) \ \le \ {{n}}\mathsf {P}\big ( {{N}}_\star \le {{N}}_1 \big ) \ = \ {{n}}\mathsf {P}\big ( {{Z}}_1 \le 0 \big ), \end{aligned}$$
(6)

where the last equality follows by letting \({{Z}}_1 := \frac{1}{{{m}}}({{N}}_\star -{{N}}_1)\). Defining \({{Z}}_{ij}\) as in (4), we can write

$$\begin{aligned} {{Z}}_1 \ = \ \frac{1}{{{m}}}\sum _{{i}=1}^{{m}}{{Z}}_{i1} \ = \ \frac{1}{{{m}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1} \ + \ \frac{1}{{{m}}} \sum _{{i}\notin {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1}. \end{aligned}$$

In the worst case scenario, all the non-perceptive citizens will suggest \({{h}}_1\), whence \({{Z}}_{i1}=-1\) for every \({i}\notin {{\mathscr {C}}}_{{{\gamma }}_k}\). Then

$$\begin{aligned} {{Z}}_1 \ \ge \ \frac{1}{{{m}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1} \ - \ \frac{{{m}}-{{k}}}{{{m}}}, \end{aligned}$$

which implies

$$\begin{aligned} \mathsf {P}\big ( {{Z}}_1 \le 0 \big ) \ \le \ \mathsf {P}\left( \frac{1}{{{m}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1} \ \le \ \frac{{{m}}-{{k}}}{{{m}}} \right) \ = \ \mathsf {P}\left( \frac{1}{{{k}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1} \ \le \ \frac{{{m}}-{{k}}}{{{k}}} \right) . \end{aligned}$$
(7)

Letting \({{Z'_1}}:=\frac{1}{{{k}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} {{Z}}_{i1}\) we obtain

$$\begin{aligned} (7) \ = \ \mathsf {P}\left( {{Z'_1}}\le \frac{{{m}}-{{k}}}{{{k}}} \right) \ = \ \mathsf {P}\left( \mathsf {E}[{{Z'_1}}]-{{Z'_1}}\ge \mathsf {E}[{{Z'_1}}]-\frac{{{m}}-{{k}}}{{{k}}} \right) , \end{aligned}$$
(8)

and by Hoeffding’s inequality [9],

$$\begin{aligned} (8) \ \le \ e^{ -\frac{{{k}}}{2} (\mathsf {E}[{{Z'_1}}]-\frac{{{m}}-{{k}}}{{{k}}})^2 } \ \le \ \frac{{{\epsilon }}}{{{n}}}, \end{aligned}$$

where the last inequality follows because \(\mathsf {E}[{{Z'_1}}] \ = \ \frac{1}{{{k}}} \sum _{{i}\in {{\mathscr {C}}}_{{{\gamma }}_k}} ({{\varvec{\mathrm{P}}}}_{i \star } - {{\varvec{\mathrm{P}}}}_{i1})\), and by the definition of \({{\mathscr {C}}}_{{{\gamma }}_k}\) every term of this sum is at least \({{\gamma }}_k\), whence \(\mathsf {E}[{{Z'_1}}] \ge {{\gamma }}_k\); substituting the definition (2) of \({{\gamma }}_k\) then gives \(e^{ -\frac{{{k}}}{2} ({{\gamma }}_k-\frac{{{m}}-{{k}}}{{{k}}})^2 } = e^{-\log \frac{{{n}}}{{{\epsilon }}}} = \frac{{{\epsilon }}}{{{n}}}\). We thus conclude that \(\mathsf {P}({{Z}}_1 \le 0) \le \frac{{{\epsilon }}}{{{n}}}\). Substituting this in (6), we obtain the desired result.    \(\square \)

Fig. 2.

Left: Phase transition diagram of the success rate at identifying \({{h_\star }}\) as a function of the number of tips \({{m}}\) and the gap between \({{p_\star }}\) and \({{p}}_1\), for four different settings. The gray level at each pair \(({{m}},{{p_\star }}-{{p}}_1)\) indicates the success rate over 10,000 replicates: the brightest gray represents \(100\,\%\) accuracy; the darkest represents \(0\,\%\). Each \(({{m}},{{p_\star }}-{{p}}_1)\) pair was selected randomly. For each \({{m}}\), all pairs above the black point have at least \(95\,\%\) accuracy. The curve is the best exponential fit to these points. These curves represent the discriminant between at least \(95\,\%\) accuracy (above curve) and less than \(95\,\%\) accuracy (below curve). Intuitively, if we are above the curve, i.e., if we have enough tips and enough gap between \({{p_\star }}\) and \({{p}}_1\), we will likely identify \({{h_\star }}\). Right: Comparison of the discriminants at \(95\,\%\) accuracy for the four settings on the left. The lower the curve the better, because then fewer tips and a smaller gap are required to identify \({{h_\star }}\). The Euclidean setting with prior (i) requires more tips and a larger gap, which means identifying \({{h_\star }}\) is more difficult. Prior (i) corresponds to the basic model where all citizens are equally likely to provide tips. Prior (ii) corresponds to the more realistic model where citizens near \({{h_\star }}\) are more likely to provide tips. Under this model, identifying \({{h_\star }}\) requires fewer tips and a smaller gap. This can be appreciated by comparing the two Euclidean settings.

4 Experiments

In this section we present a series of experiments to study the behavior of our detection scheme for different geographic dependency matrices \({{\varvec{\mathrm{G}}}}\), which, together with the inherent suspiciousness levels of the houses \({{\varvec{\mathrm{p}}}}\), determine the likelihood that citizens perceive suspicious activities. We will test the following cases of \({{\varvec{\mathrm{G}}}}\):

(a) Constant. This is equivalent to the most basic setup described at the beginning of Sect. 2, where each citizen suggests each house independently and identically according to \({{\varvec{\mathrm{p}}}}\).

(b) Difference. The setup described in Example 1, where the geographic dependency is given by the number of houses in between.

(c) Euclidean. The same setup as in Example 1, but with the geographic dependency given by the inverse distance, measured in number of houses, i.e.,

$$\begin{aligned} {{\varvec{\mathrm{G}}}}_{ij} \ := \ \left\{ \begin{array}{rcl} 1/|{i}- {j}| &{} &{} \text {if}~{i}\ne {j}\\ 0 &{} &{} \text {if}~{i}= {j}. \end{array} \right. \end{aligned}$$

In each trial, we first generate a vector in \((0,1)^{{{n}}+1}\) with independent entries drawn uniformly at random. Next we normalize and sort this vector to obtain \({{p_\star }}\ge {{p}}_1 \ge {{p}}_2 \ge \cdots \ge {{p}}_{{n}}\). The location of \({{h_\star }}\) in the street (and of the rest of the houses) is selected uniformly at random. In each trial, \({{m}}\) citizens provide a tip. The citizens that provide tips are distributed independently over the \({{n}}+1\) houses (sampled with replacement) according to two tipping priors:

$$\begin{aligned} \mathrm{(i)}&\ \ {{\pi }}_i \ := \ \mathsf {P}(\text {citizen}~{i}~\text {provides a tip}) \ = \ \text {same for every}~{i}, \\ \mathrm{(ii)}&\ \ {{\pi }}_i \ := \ \mathsf {P}(\text {citizen}~{i}~\text {provides a tip}) \ = \ \exp {(-\frac{{{d}}_{i \star }^2}{400})}, \end{aligned}$$

where \({{d}}_{i \star }\) denotes the distance (measured in number of houses) between \({{h}}_i\) and \({{h_\star }}\), and the constant \(400 = 20^2\) sets a decay scale of roughly 20 houses. Setting (i) corresponds to the basic model where all citizens are equally likely to provide tips. As discussed in Sect. 2.2, setting (ii) corresponds to the more realistic scenario where citizens near \({{h_\star }}\) are more likely to provide tips.

Each citizen that provides a tip suggests a house suspected to host criminal activities according to \({{\varvec{\mathrm{P}}}}\). Recall that a citizen living in house \({{h}}_i\) suggests house \({{h}}_j\) with probability \({{\varvec{\mathrm{P}}}}_{ij}\). Since \({{\varvec{\mathrm{P}}}}={{\varvec{\mathrm{G}}}}{{\varvec{\mathrm{p}}}}\), this probability depends on the house where the citizen lives through the geographic dependency matrix \({{\varvec{\mathrm{G}}}}\). We then select the most suggested house and verify whether it corresponds to \({{h_\star }}\). We run 10,000 replicates of this experiment for different values of \({{m}}\) and \(\{{{p}}_1,\dots ,{{p}}_n,{{p_\star }}\}\). The results are summarized in Fig. 2.
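For completeness, one replicate of this pipeline can be sketched as follows (our reconstruction, reusing street_G and suggestion_matrix from Sect. 2; parameter choices beyond those stated above are assumptions):

```python
import numpy as np

def one_trial(n, m, kind="euclidean", prior="gaussian", rng=None):
    """One replicate: True iff the most suggested house is h_star."""
    rng = np.random.default_rng(rng)
    # Drawing i.i.d. uniforms and normalizing is distributionally
    # equivalent to "normalize, sort, then place at random positions";
    # the house with the largest probability is h_star.
    p = rng.uniform(size=n + 1)
    p /= p.sum()
    star = int(p.argmax())
    P = suggestion_matrix(street_G(n, kind), p)

    if prior == "gaussian":            # prior (ii): decay around h_star
        w = np.exp(-(np.arange(n + 1.0) - star) ** 2 / 400.0)
    else:                              # prior (i): uniform tipping
        w = np.ones(n + 1)
    tippers = rng.choice(n + 1, size=m, p=w / w.sum())  # with replacement

    counts = np.zeros(n + 1, dtype=int)
    for i in tippers:
        counts[rng.choice(n + 1, p=P[i])] += 1          # tip from row i
    return counts.argmax() == star

rate = np.mean([one_trial(50, 300, rng=s) for s in range(200)])
print(f"success rate: {rate:.3f}")
```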

As predicted by our theory, \({{h_\star }}\) can be consistently identified as long as there are enough tips and a sufficient gap between \({{p_\star }}\) and \({{p}}_1\). Observe that under prior (i), the Euclidean setting demands more tips and a larger gap than the other settings. This is because the Euclidean matrix \({{\varvec{\mathrm{G}}}}\) decays faster with distance. We can interpret this as the houses being farther apart from one another. This suggests, in accordance with intuition, that it is easier to find \({{h_\star }}\) in denser areas, like highly populated cities, where people live close together.

5 Conclusions and Discussion

In this paper we introduce a simple model to identify houses hosting criminals. We prove that under reasonable assumptions, a crowdsourcing strategy will succeed at this task with high probability. Our experiments support our theoretical findings. We now give some simple generalizations of the models described in Sect. 2, along with other settings to which our ideas may be easily extended.

Increasing our Odds. Recall that \({{p_\star }}\) and \({{p}}_1\) denote the underlying probabilities that a citizen suggests \({{h_\star }}\) and \({{h}}_1\), where \({{h}}_1\) is the most suspicious among the innocent houses. As shown by Theorems 1 and 2, the gap between \({{p_\star }}\) and \({{p}}_1\), together with the number of citizens that provide tips (\({{m}}\)), determines whether our strategy will work. These quantities can be influenced in our favor through media campaigns that promote participation (to increase \({{m}}\)), encourage citizens to be more aware (to increase \({{p_\star }}\)), and discourage unfounded suggestions, bias, or prejudice (to restrict \({{p}}_1\)).

Organized Crime. It is also possible that the city hosts not one but several criminals. Moreover, these criminals could be organized and determined to collaborate in an optimal way to avoid detection. In this case, it is in the criminals' best interest to suggest the most suspicious innocent house, \({{h}}_1\). This can be modeled by letting the rows of \({{\varvec{\mathrm{P}}}}\) corresponding to criminals take the value 1 in the column corresponding to \({{h}}_1\), and zeros elsewhere. In fact, Theorem 2 is shown assuming that all the citizens not in \({{\mathscr {C}}}_{{{\gamma }}_k}\) suggest \({{h}}_1\). Recall that \({{\mathscr {C}}}_{{{\gamma }}_k}\) denotes the set of perceptive citizens that are at least \({{\gamma }}_k\) more likely to suggest \({{h_\star }}\) than any other house.

This implies that Theorem 2 holds regardless of whether the citizens not in \({{\mathscr {C}}}_{{{\gamma }}_k}\) are criminals or not. We thus conclude that as long as there are enough honest citizens (at least \({{k}}\)) with sufficient accuracy (at least \({{\gamma }}_k\)), then with high probability we will find a house hosting a criminal. Hence, we can easily generalize our model to include several criminals' houses. The pattern of identified houses can help detect criminal networks.

Observe that one implicit requirement of Theorem 2 is that the set \({{\mathscr {C}}}_{{{\gamma }}_k}\) contains more than half of the citizens. This can be seen mathematically: if \({{k}}\le \frac{{{m}}}{2}\), then \(\frac{{{m}}-{{k}}}{{{k}}} \ge 1\), whence (2) gives \({{\gamma }}_k>1\), which implies \({{\mathscr {C}}}_{{{\gamma }}_k}=\emptyset \). In other words, Theorem 2 requires that there are more perceptive citizens than not. This is precisely because Theorem 2 considers the worst-case adversarial scenario. If there are more organized criminals than honest citizens, then with high probability \({{h}}_1\) will receive more suggestions than \({{h_\star }}\).

Detecting Corruption. Of course, none of the ideas discussed above will work if the police force is corrupt. Fortunately, the same ideas can be adapted to detect patterns of corruption, or equivalently, to find the most honorable policemen. Consider, for example, the following scenario. Suppose a citizen runs a red light and is caught by a policeman. It is the policeman's duty to assign a ticket and report it in the system. But if the policeman is corrupt, he will take a bribe, and there will be no record of the transaction.

Suppose instead that citizen \({i}\) runs a red light and is caught by a policeman. Another citizen \({i'}\) sees that a policeman (who can be identified by the police car) is interacting with the first citizen, and reports this to the system (anonymously, through a website, a cell phone app, a text message, a phone call, etc.). Citizen \({i'}\) does not know the nature of the interaction between the policeman and citizen \({i}\); he only reports that an interaction occurred.

If many citizens report interactions involving a certain policeman, but there are no corresponding reports of fines in the system, this would suggest that the policeman took bribes. If there are many cases suggesting that a particular policeman took bribes, it is likely he did. This would also allow us to identify the most honorable policemen: the ones whose interactions with citizens (reported by citizens) match the fines in the system (reported by the policeman). We can then analyze the hierarchical structure of the corrupt policemen to determine patterns of corruption at higher levels. We leave this for future study.