1 Introduction

In recent years, more and more open government and administrative data have been made available on the Web. One particular example is crime data. Based on our recent investigation, five of the ten largest U.S. cities now publish their crime data online, including New York CityFootnote 1, Los AngelesFootnote 2, ChicagoFootnote 3, and PhiladelphiaFootnote 4. These crime data reveal information that was difficult to obtain before, and which is of particular interest to civic organizations and ordinary citizens concerned about public safety. Researchers have since started proposing crime prediction methods that leverage these data [8, 20, 23]. The outcome of crime prediction can be used in many scenarios. For example, police patrols can be planned to target the locations and times at which crimes are likely to happen; residents and tourists, on the other hand, can avoid such locations and times when planning their outdoor activities.

In this paper, we consider crime prediction as a recommendation problem. Crime prediction deals with predicting the time and location at which future crimes are likely to happen, given past crime data [24]. We show that crime prediction can be properly modeled using existing recommendation system techniques, and that this modeling achieves accurate results. We note first that our work is not related to criminal profiling or crime series prediction, as studied in several works [6, 12]. We consider, instead, the spatial and temporal factors hidden in past crime records. To the best of our knowledge, no previous work has positioned crime prediction as a recommendation problem.

Previous works have shown that techniques such as kernel density estimation (KDE) can be used to map crime hotspots [8, 9], based on the observation that past crime records indicate areas where crimes concentrate. While hotspot detection is effective for understanding the geographical distribution of crimes, it does not consider their temporal aspects. We could simply assume no temporal effect and use the hotspot map as our sole guide, but previous studies show that incorporating the temporal aspect steadily increases prediction accuracy [21]. A number of works propose spatial-temporal crime prediction methods that output the probability that a crime will happen at a certain location on the next day [1, 22, 23]. However, this may not be very useful in practice: if the probability turns out to be high, the police would need to patrol the location all day. In our study, we want to know not only the day, but also the hour, at which a crime is likely to happen. This is necessary for effective planning of police patrols.

With finer spatial and temporal units, however, a problem emerges: data sparsity. Intuitively, given a limited number of crimes in the studied period, if the granularity is small, many spatial-temporal units will have no crime records. Indeed, if we map 80 weeks of thefts in San FranciscoFootnote 5, starting from January 1st, 2016, into 200 m \(\times \) 200 m blocks and the 24 \(\times \) 7 \(=\) 168 h of the week, about 87% of the spatial-temporal units have no criminal incidents. For the assault crime type the corresponding number is 94%. Yet, the fact that a spatial-temporal unit had no crime incident before does not mean it will have no crime in the future. According to routine activity theory [5], for a crime to occur, three elements should converge in time and space: a motivated offender, a suitable target or victim, and the absence of a capable guardian. Based on this theory, crime occurrence in an area is mostly caused by the criminals in that area, who commit crimes when an opportunity is noticed (neglecting the guardian aspect for simplicity). We can thus assume that a crime occurring in a given area at a given hour is influenced by two factors: criminals being present in that area, which is a property of the area, and people's daily routine at that hour (e.g., commuting). We find that these two factors can be properly modeled as a recommendation problem.

A typical recommendation system has three components: users, items, and user-item interactions. A user-item interaction may be a review rating given by a user for an item, or a purchase record. Among the various recommendation techniques, two main groups of methods are collaborative filtering and context-based rating prediction, both of which deal with sparse data [11, 17]. It is not uncommon to have 99% sparsity in a product review datasetFootnote 6. Recommendation techniques that mitigate sparsity are thus suitable for our fine-grained crime prediction case.

We give a detailed explanation of our model in Sect. 3. In Sect. 4 we discuss context-based recommendation techniques that use Twitter data, and in Sect. 5 we present the experimental results. We focus on two common crime types, theft and assault, in the city of San Francisco, though our approach can easily be extended to other crime types and cities. We summarize our contributions as follows:

  • We model crime prediction as a recommendation problem. This modeling addresses the data sparsity issue that arises when fine-grained prediction is desired. We also show that collaborative filtering and context-based rating prediction techniques from recommender systems can be applied to crime prediction.

  • We run extensive experiments on real-world crime data, comparing a number of recommendation approaches. We find that recommendation techniques are effective for predicting future crimes. For example, we show that with the predictions of recommendation methods, 20% of the man-hours suffice to capture around 70% of future thefts in San Francisco.

2 Related Works

Computational spatial-temporal analysis of crime has been studied extensively [24]. The temporal patterns studied include crime rate changes over long time periods (e.g., five years), seasonal and annually recurring patterns, and the sequential behavior of criminals [10]. Due to the lack of fine-grained crime records, however, hourly temporal patterns have rarely been studied. With more general applicability, kernel density estimation (KDE) has been used to map crime hotspots [3]. The crime-based navigation system proposed by Galbrun et al., for example, maps a risk index to geographical points using KDE over past crime records [7]. Boni and Gerber propose to extend KDE with evolutionary optimization to improve crime prediction accuracy [1]. A problem with KDE mapping is that when the granularity is small, there is not enough training data to properly learn a model [9]. Indeed, Boni and Gerber try to mitigate this problem by down-sampling the negative examples [1]. Another popular technique for crime forecasting is ARIMA [4]; however, it does not consider the spatial factors hidden in the data.

Crime prediction using Web data as context has recently started attracting attention in the research community, following the availability of fine-grained spatial-temporal data. Wang et al. propose incorporating tweets into a crime prediction model [21]. They find that adding tweets to the prediction model improves prediction accuracy; however, their tweets are limited to a single news account. Gerber later conducts another study on incorporating tweets into crime prediction models [8]. Similarly, he transforms tweets using LDA and adds them to an existing model based on kernel density estimation (KDE). His experiments with 25 crime types show that incorporating tweets improves prediction accuracy for 19 of them, while for a few other crime types the accuracy decreases. Wang et al. use multiple data sources in a crime prediction model [20]. They use the total crime numbers provided by the Chicago administration and divide the city into community areas as the unit of study. Zhao and Tang propose to incorporate crime complaint records, weather, Foursquare, and taxi flow data into a prediction model [23], targeting the city of New York. Their unit of study is a 2 km \(\times \) 2 km grid, so their method is difficult to apply at a finer granularity because of data sparsity. Yang et al. build a crime prediction model using both tweets and Foursquare data, cast as a binary classification problem [22], for which they divide the city into large grids and generate evenly spaced negative samples.

A common problem in these studies is that crime data becomes very sparse at finer granularity. For this reason, these works either treat days or weeks as atomic temporal units and blocks of more than 1 km as the basic spatial unit, or need down-sampling procedures to deal with the sparsity. In this paper, in contrast, we approach the crime prediction task with hourly temporal units and a small spatial unit. We then show that existing recommendation techniques can operate under such sparse conditions.

3 Crime Prediction as a Recommendation Problem

In this section, we discuss how crime prediction can be modeled as a recommendation problem. More specifically, we discuss how spatial and temporal factors can be modeled as users and items. We also show that, with this modeling, techniques used in recommendation systems can be applied to crime prediction.

3.1 Defining User and Item

Similar to a recommendation problem, which has two latent factors, user and item, crime prediction also has two latent factors, namely time and location. Just as the rating a user gives to an item reflects the interaction between the user and the item, the number of crimes at a location at a given time reflects the criminal interplay between the location and time factors. It is thus natural to consider time and location as the user and item of a recommendation problem. Here we face a design choice: should we model time as the user or the item (and location as the other factor)? We propose to model time as the item and location as the user, rather than the other way around, for two reasons. First, in a typical recommendation problem, the number of users is far larger than the number of items. In our problem, we also find that, given a small granularity, the number of locations, typically thousands, is much larger than the number of time units, which in our study is \(24\times 7=168\) h in a week. Second, when we consider that crimes are mostly caused by the criminals living in the neighborhood, it is more appropriate to represent the human factor by the location rather than by the time. In this way, locations, like users, carry inherent properties that are independent of each other, while time units, like items, carry relative properties that are revealed by comparison.

As mentioned, the advantage of modeling crime prediction as a recommendation problem is that we can use a fine spatial-temporal granularity and apply existing recommendation techniques that are effective for sparse data. Therefore, in our work, we choose a spatial-temporal granularity that is finer than in most existing works. Specifically, we use \(200\times 200\) m blocks and the \(24\times 7=168\) h of the week. For the city we study, San Francisco, this spatial granularity results in around 2,000 blocks that contain at least one theft or assault record. The total number of block-hour units is around 396k for theft and 324k for assault. The crime numbers and sparsity are shown in Table 1. The sparsity is calculated as the ratio of block-hour units that have zero crimes. This level of sparsity is close to that of typical recommendation problems.

Table 1. Crime number and sparsity for crimes in SF.
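
For illustration, the following minimal sketch shows how a single record can be mapped to such a block-hour unit; the reference corner, the equirectangular distance approximation, and the example coordinates are assumptions of the sketch, not part of our pipeline description. Crimes and, later, tweets are assigned in the same way.

```python
from datetime import datetime
from math import cos, floor, radians

# Reference corner for the grid (approximate south-west corner of San Francisco).
LAT0, LON0 = 37.70, -122.52
BLOCK_M = 200  # block side length in meters

def block_index(lat, lon):
    """Map a geo-coordinate to a 200 m x 200 m block id using an
    equirectangular approximation (adequate at city scale)."""
    x = (lon - LON0) * 111_320 * cos(radians(LAT0))  # meters east of reference
    y = (lat - LAT0) * 110_540                       # meters north of reference
    return (floor(x / BLOCK_M), floor(y / BLOCK_M))

def hour_of_week(ts):
    """Map a timestamp to one of the 24 x 7 = 168 hours of the week."""
    t = datetime.fromisoformat(ts) if isinstance(ts, str) else ts
    return t.weekday() * 24 + t.hour

# Example: one (hypothetical) incident record mapped to its block-hour unit.
unit = (block_index(37.7793, -122.4193), hour_of_week("2016-03-04T22:15:00"))
```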

We argue that a block-hour unit with zero crimes does not mean that no crime will happen at that location and time. In product recommendation, even though a user's rating for an item is absent, the user may still buy and rate the item (which is why the recommendation is needed). Similarly, even if no crime happened at a location and hour in the past records, it is still possible for criminals to commit a crime at that location and hour. We need techniques to infer this potential from the sparse records. In the following section, we briefly review the collaborative filtering technique used in recommendation systems, which can help us solve this problem.

3.2 Inferring Crime Potential

Collaborative filtering (CF) is a technique widely used in product recommendation. It is based on item or user similarities in the rating data, and accordingly there are item-based CF and user-based CF. With item-based CF, items similar to those the user has rated highly are ranked and recommended to the user. With user-based CF, user similarity is calculated based on the ratings users give to items, and items liked by similar users are ranked and recommended to the user. Generally speaking, item-based CF provides better recommendations than user-based CF, because items are easier to compare. In the following explanation, we assume item-based CF, but adapting it to user-based CF is straightforward.

The first step of CF is to calculate the similarity matrix. Typically, given item i and j, and their rating vectors, \(\overrightarrow{i}\) and \(\overrightarrow{j}\), their similarity is calculated as the cosine similarity:

$$ sim(i,j)=\text {cos}(\overrightarrow{i}, \overrightarrow{j})=\frac{\overrightarrow{i}\cdot \overrightarrow{j}}{||\overrightarrow{i}||_2 * ||\overrightarrow{j}||_2} $$

Then, the prediction of the rating user u gives to item i can be calculated as the weighted sum:

$$ P_{u,i}=\frac{\sum _{j=1}^N sim(i,j) * R_{u,j}}{\sum _{j=1}^N sim(i,j)} $$

where N is the number of items user u has rated, and \(R_{u,j}\) is the rating user u has given to item j. By this calculation, the absent ratings for all remaining items can be inferred. In our crime prediction case, the crime potential at a particular location and time for which no crime was recorded can be inferred in the same way.
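
For illustration, a minimal NumPy sketch of this item-based CF computation in our setting, with blocks as rows (users) and hours as columns (items); the toy matrix and the treatment of unobserved units as zeros are assumptions of the sketch.

```python
import numpy as np

def item_based_cf(R):
    """Predict all entries of a (blocks x hours) crime-count matrix R with
    item-based CF: cosine similarity between hour columns, then a
    similarity-weighted average over each block's observed hours."""
    norms = np.linalg.norm(R, axis=0, keepdims=True) + 1e-12
    sim = (R.T @ R) / (norms.T * norms)   # hour-hour cosine similarities
    rated = (R > 0).astype(float)         # which hours each block has crimes in
    numer = R @ sim.T                     # sum_j sim(i,j) * R[u,j]
    denom = rated @ sim.T + 1e-12         # sum_j sim(i,j) over the rated j only
    return numer / denom

# Toy example: 4 blocks x 5 hours
R = np.array([[2., 0, 1, 0, 0],
              [0,  1, 0, 0, 3],
              [1,  0, 2, 0, 0],
              [0,  0, 0, 1, 0]])
P = item_based_cf(R)  # P[u, i] estimates the crime potential of block u at hour i
```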

4 Incorporating Context Information

In addition to the user, item, and rating model, a group of works in recommendation systems deals with another dimension of information: contextual information. For example, on an online shopping website, users not only give a rating to the product they purchased, but also provide a detailed review in natural language explaining how the assessment was made. Such a review is considered contextual information for the product and has been extensively studied in recommendation [11, 13, 14]. Incorporating context information in recommendation often not only improves rating prediction accuracy, but also reveals insights about the hidden factors behind the rating, for example, why two users agree on one product but disagree on another. In this section, we discuss how crime prediction can be modeled as a context-based recommendation problem, and which techniques can be applied.

4.1 Generating Context Information for Crime

In our model, which treats blocks as users, hours as items, and crime counts as ratings, we do not have contextual information like the review texts on a shopping website. We need a second source that provides spatial-temporal text data. Microblogs like Twitter generate millions of geo-tagged short texts every day, and we can use such data to associate time and location with textual content. Admittedly, unlike product reviews, which are mostly descriptions of the product that lead to the ratings, the tweets posted at a given time and location usually do not contain any description of crimes. However, previous works show that tweet text is to some extent related to crime and can be used to improve crime prediction [8, 21]. For example, Gerber found that criminal damage is correlated with tweets on sports-oriented and museum-oriented topics [8]. In this paper, we likewise consider that tweets contain implicit information related to crime at a certain hour and location, and can thus be used as context information. We collect geo-tagged tweets and assign them block indexes and hours based on their geo-coordinates and timestamps. Note that since we generalize temporal patterns to the hours of a week, the tweets do not need to be collected in the same period as the crimes.

4.2 Solutions for Context-Based Crime Prediction

Once we create context information for time and location, we can apply context-based recommendation methods. Here we demonstrate how two existing context-based methods can be applied to our crime prediction problem. First, we show a tensor decomposition-based technique, then we discuss a latent topic analysis technique.

Tensor Decomposition Analysis. Context information adds another dimension to the spatial-temporal data. The simplest way to use the context information is to build a linear model that, for each spatial-temporal unit, weights its context vector with a parameter \(\mathbf {w}\) indicating the effect of each of the K context elements:

$$ f(\mathbf {x}_{ij})=\mathbf {w}^T\cdot \mathbf {x}_{ij}=\sum _k w_k\cdot x_{ijk} $$

The parameter \(\mathbf {w}\) can be learned using standard optimization techniques. However, this approach does not consider the spatial-temporal effects. In context-based review rating prediction, Li et al. propose a model that incorporates the user and item effects by introducing a bias parameter \(\mathbf {w}_{ij}\) representing user-item effects [11]. In their model, the prediction function is defined as follows:

$$\begin{aligned} f(\mathbf {x}_{ij})&=(\mathbf {w}^0+\mathbf {w}_{ij})^T \cdot \mathbf {x}_{ij} \\&= \sum _{k=1}^K(w^0+w_{ijk})\cdot x_{ijk} \end{aligned}$$

A practical problem with the above model is that, given a large number of users and reviews, the number of parameters \(\mathbf {w}_{ij}\) increases drastically, making the model intractable. It also suffers from data sparsity, because a user usually reviews only a few of all items, so there is not enough training data to learn the model. The same is true in our crime prediction problem, because at a fine granularity most time-location units have neither crime nor tweet data. To deal with this problem, Li et al. propose a tensor-decomposition-based technique [11]. First, consider \(\mathbf {w}_{ij}\) as a three-dimensional tensor \(\mathbf {W}\in \mathbb {R}^{M\times N\times K}\). This tensor can be decomposed into three low-rank matrices, \(\mathbf {U}\in \mathbb {R}^{M\times D}\), \(\mathbf {V}\in \mathbb {R}^{N\times D}\), and \(\mathbf {P}\in \mathbb {R}^{K\times D}\). \(\mathbf {W}\) can then be reconstructed by multiplying the three matrices together:

$$ \mathbf {W}=\mathbf {I}\times \mathbf {U}\times \mathbf {V}\times \mathbf {P} $$

where \(\mathbf {I}\) is an identity tensor. In this model, the number of parameters becomes \(D\times (M+N+K)\), which is significantly fewer than the \(M\times N\times K\) parameters of the full model. With this decomposition, the prediction function becomes:

$$\begin{aligned} f(\mathbf {x}_{ij})&=(\mathbf {w}^0+\mathbf {w}_{ij})^T \cdot \mathbf {x}_{ij} \\&= \sum _{k=1}^K(w^0+\sum _{f=1}^D u_{if}\cdot v_{jf} \cdot p_{kf})\cdot x_{ijk} \end{aligned}$$

After deriving the gradients of \(\mathbf {U}\), \(\mathbf {V}\), and \(\mathbf {P}\), the model can be learned using gradient-descent-style optimization.
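
As an illustration, here is a minimal NumPy sketch of the decomposed prediction function together with one plain stochastic gradient step on the squared error; the variable names, learning rate, and regularization are our own simplifying assumptions, not the exact optimizer of [11].

```python
import numpy as np

def td_predict(U, V, P, w0, x_ij, i, j):
    """f(x_ij) = sum_k (w0_k + sum_f U[i,f] V[j,f] P[k,f]) * x_ijk."""
    w_ij = (U[i] * V[j]) @ P.T            # length-K context weights of unit (i, j)
    return ((w0 + w_ij) * x_ij).sum()

def td_sgd_step(U, V, P, w0, x_ij, i, j, y, lr=0.01, reg=0.001):
    """One SGD step on the squared error for unit (i, j) with observed count y.
    U, V, P are (M, D), (N, D), (K, D) arrays; w0 is a length-K array."""
    err = td_predict(U, V, P, w0, x_ij, i, j) - y
    s = x_ij @ P                          # length-D: sum_k P[k,f] * x_ijk
    gU = err * (s * V[j]) + reg * U[i]
    gV = err * (s * U[i]) + reg * V[j]
    gP = err * np.outer(x_ij, U[i] * V[j]) + reg * P
    w0 -= lr * err * x_ij                 # shared bias weights, updated in place
    U[i] -= lr * gU
    V[j] -= lr * gV
    P -= lr * gP
    return err ** 2

# Toy usage with random factors and one hypothetical context vector.
rng = np.random.default_rng(0)
M, N, K, D = 100, 168, 200, 8
U, V, P = (0.1 * rng.standard_normal(s) for s in [(M, D), (N, D), (K, D)])
w0 = np.zeros(K)
x = rng.standard_normal(K)
loss = td_sgd_step(U, V, P, w0, x, i=3, j=42, y=2.0)
```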

In [11], bag-of-words (BOW) is used to represent review texts. We instead use word embeddings as the text representation, a more recent advance in text processing that is considered to better capture the semantics of the text [15]. More specifically, we use the GloVe word vectors trained on 2 billion tweets, which have 1.2 million word entries [16]. For the tweets assigned to a block-hour unit, we first look up the GloVe vector of each word token and then take the average of the word vectors as the unit vector. This vector is then used as the context data \(\mathbf {x}_{ij}\).
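
For illustration, a minimal sketch of this averaging step; the file name follows the publicly distributed GloVe Twitter vectors, and the whitespace tokenizer is a simplification of the preprocessing used in our experiments.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a plain-text file (word followed by floats)."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def unit_vector(tweets, glove, dim=200):
    """Average the GloVe vectors of all word tokens in a block-hour unit's tweets."""
    tokens = [w for t in tweets for w in t.lower().split()]
    hits = [glove[w] for w in tokens if w in glove]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)

# glove = load_glove("glove.twitter.27B.200d.txt")  # pre-trained Twitter vectors [16]
# x_ij = unit_vector(tweets_of_unit, glove)         # context vector of unit (i, j)
```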

Latent Topic Analysis. In recent years, latent topic analysis of texts has been extensively studied. The commonly used model, Latent Dirichlet Allocation (LDA) [2], discovers a K-dimensional topic distribution \(\theta _d\) for each text document d, where the words in document d discuss topic k with probability \(\theta _{d,k}\). Each topic k is in turn associated with a distribution \(\phi _k\) indicating the probability with which each word is used for that topic. Finally, \(\theta _d\) is assumed to be drawn from a Dirichlet distribution.
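
For illustration, a minimal sketch of fitting LDA and reading off \(\theta _d\) and \(\phi _k\), here with the gensim library and hypothetical toy documents; this is background only, not part of the HFT model described next.

```python
from gensim import corpora, models

# Toy corpus: one token list per document (e.g., the tweets of one block-hour unit).
docs = [["game", "stadium", "crowd"], ["museum", "exhibit", "art"], ["game", "fans"]]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, passes=10, random_state=0)
theta_d = lda.get_document_topics(bows[0], minimum_probability=0.0)  # topic mixture of doc 0
phi_k = lda.show_topic(0)                                            # top words of topic 0
```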

Recently, it has been found that the latent topics in review text can be used together with the hidden factors in a user-item recommendation setting [13, 14]. In the following we give a quick review of the technique proposed by McAuley and Leskovec [14], called Hidden Factors as Topics (HFT). First, the standard latent factor model predicts a rating using an offset parameter \(\alpha \), biases \(\beta _u\) and \(\beta _i\), and K-dimensional user and item latent factors \(\gamma _u\) and \(\gamma _i\):

$$ rec(u,i)=\alpha +\beta _u+\beta _i+\gamma _u\cdot \gamma _i $$

The parameters are typically estimated by minimizing the regularized Mean Squared Error (MSE):

$$ \hat{\varTheta } = \mathop {\mathop {\text{ arg } \text{ min }}}\limits _\varTheta \frac{1}{|\mathcal {T}|} \sum _{r_{u,i}\in \mathcal {T}} (rec(u,i) -r_{u,i})^2 + \lambda \varOmega (\varTheta ) $$

where \(\varOmega (\varTheta )\) is a regularizer that penalizes complex models.

The task of latent topic analysis in rating prediction is to find the correspondence between the topics in the review text and the ratings. In other words, it is desirable that a high rating, driven by a high \(\gamma _{i,k}\), corresponds to a high value of a particular topic \(\theta _{i,k}\). The transformation is not straightforward, however, since \(\theta _i\) represents a probability distribution while \(\gamma _i\) can take any value in \(\mathbb {R}^K\). McAuley and Leskovec propose the following formula for the transformation:

$$ \theta _{i,k} = \frac{exp(\mathcal {K}\gamma _{i,k})}{\sum _{k'}exp(\mathcal {K}\gamma _{i,k'})} $$

where the parameter \(\mathcal {K}\) represents the ‘peakiness’ of the transformation. A large \(\mathcal {K}\) pushes the calculation to consider only the most important topics. The parameters are estimated using the objective function

$$ f(\mathcal {T}|\varTheta , \varPhi , \mathcal {K}, z) = \sum _{r_{u,i}\in \mathcal {T}} (rec(u,i) -r_{u,i})^2 - \mu l(\mathcal {T}|\theta ,\phi ,z) $$

where the corpus likelihood \(l(\mathcal {T}|\theta ,\phi ,z)\) replaces the typical regularizer \(\varOmega (\varTheta )\). Effectively, when there are few ratings for a product i, the corpus likelihood regularizer pushes \(\gamma _i\) and \(\gamma _u\) towards zero, causing the prediction to rely on the review texts.
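
For illustration, a minimal NumPy sketch of the latent factor prediction and of the softmax-style transformation above; this is not the authors' reference implementation (which we actually use, see below), only a restatement of the two formulas in code.

```python
import numpy as np

def rec(alpha, beta_u, beta_i, gamma_u, gamma_i):
    """Latent factor prediction: offset + user bias + item bias + factor product."""
    return alpha + beta_u + beta_i + gamma_u @ gamma_i

def item_topics(gamma_i, kappa):
    """HFT-style transform of item factors gamma_i into a topic distribution:
    theta_{i,k} = exp(kappa * gamma_{i,k}) / sum_k' exp(kappa * gamma_{i,k'})."""
    z = kappa * gamma_i
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

gamma_i = np.array([0.2, -0.1, 0.7])
print(item_topics(gamma_i, kappa=5.0))   # a larger kappa gives a more peaked mixture
```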

In our crime prediction setting, we apply the HFT technique to the crime and context information. The tweets posted at an hour and a block are concatenated and used as the review text, and the method requires no further feature transformation. We use the implementation made available online by the authors in our evaluationFootnote 7. For fitting the model, this implementation uses L-BFGS, a quasi-Newton method for non-linear optimization problems with many variables.

5 Evaluation

We evaluate various prediction methods, including recommendation and non-recommendation methods, on real-world crime datasets. In this section, we first present the experimental setup and then discuss the results.

5.1 Experimental Dataset

In this paper, we focus on crimes in the city of San Francisco. We obtain crime data from the DataSF websiteFootnote 8, which contains crime records from 2003 until the present. For the purpose of our study, we take a subset of 100 weeks starting from the beginning of 2016. Out of 39 crime categories, we use theft and assault, which are the most common property crime and violent crime. For the selected period, there are about 151k thefts and 42k assaults, compared to 313k incidents of all other crime types. In the following experiments, we use the first 80 weeks as training data and the last 20 weeks as testing data, except for the experiment on the effect of training size (for which the training period is shorter). In areas with concentrated crime, the crime numbers are abnormally high; to avoid distorting the predictions, we cap the crime number of each block-hour unit at a maximum of 5.
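
For illustration, a minimal pandas sketch of this preprocessing; the column names and toy records are hypothetical, with the block and hour indexes assumed to come from the mapping sketched in Sect. 3.1.

```python
import pandas as pd

# df: one row per incident, with hypothetical columns 'block' (block id),
# 'hour_of_week' (0-167), and 'week' (week index, 0 = first week of 2016).
df = pd.DataFrame({
    "block": [12, 12, 7, 12],
    "hour_of_week": [130, 130, 22, 130],
    "week": [3, 85, 10, 4],
})

def build_matrix(records, cap=5):
    """Aggregate incidents into a block x hour count matrix, capping each
    block-hour count at `cap` to avoid distortion from extreme units."""
    counts = records.groupby(["block", "hour_of_week"]).size().unstack(fill_value=0)
    return counts.clip(upper=cap)

train = build_matrix(df[df.week < 80])    # first 80 weeks
test = build_matrix(df[df.week >= 80])    # last 20 weeks
```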

As contextual data, we collect tweets posted in the San Francisco area using the Twitter Filter APIFootnote 9, between 2016 and 2017. Note that, as discussed above, the contextual data need not cover the same period as the crime data, as long as it can be organized into spatial-temporal unitsFootnote 10. Some accounts repeatedly send tweets, most likely automatically, which can have an undesirable effect on predictions. To avoid these bots, we remove tweets from the top 1% most frequently posting users. As a result, we obtain about 371k tweets with geo-coordinates and timestamps. We assign these tweets to block-hour units as described in Sect. 4.1.
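
For illustration, a minimal pandas sketch of the bot-filtering step; the column name and the toy data are hypothetical.

```python
import pandas as pd

def drop_frequent_posters(tweets, top_frac=0.01):
    """Drop tweets from the top 1% most frequently posting accounts,
    a simple heuristic against bot-like accounts."""
    counts = tweets["user"].value_counts()
    cutoff = counts.quantile(1 - top_frac)   # posts-per-user threshold
    bots = counts[counts > cutoff].index
    return tweets[~tweets["user"].isin(bots)]

toy = pd.DataFrame({"user": ["a"] * 8 + ["b", "c"]})
print(len(drop_frequent_posters(toy)))       # tweets from heavy poster 'a' are removed
```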

5.2 Non-recommendation Methods

We compare the recommendation methods described in the previous sections, namely item-based CF (CF-item), user-based CF (CF-user), tensor decomposition (TD), and Hidden Factors as Topics (HFT), with four non-recommendation baselines used in previous crime prediction work: historical sum, ARIMA, VAR, and KDE.

Historical Sum. A straightforward way to predict crime is to assume that the same number of crimes is likely to happen at the same location and hour in the future as in the past. Due to its simplicity and effectiveness, the historical crime number has been used in practical systems, for example in [19]. With this approach, the time and location of future crimes are predicted to be the same as the time and location of crimes in the training data.

ARIMA. The autoregressive integrated moving average (ARIMA) model is a common method for time series forecasting and has previously been used in crime prediction [4]. It consists of an autoregression (AR) component that captures lagged patterns and a moving average (MA) component that captures long-term trends. Normally the parameters (p, d, q) need to be specified; in our experiments, we use the auto.arima function of the R package forecast to automatically find the optimal parameters.
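
Our experiments use R's forecast::auto.arima; purely as an illustration, an analogous automatic (p, d, q) search in Python with the pmdarima package, on a hypothetical toy series for one block, could look as follows.

```python
import numpy as np
import pmdarima as pm

# Toy stand-in for one block's hourly crime-count series (shortened to 10 weeks).
y = np.random.poisson(0.05, size=10 * 168).astype(float)

# auto_arima performs an automatic (p, d, q) search, analogous to R's auto.arima.
model = pm.auto_arima(y, seasonal=False, suppress_warnings=True, error_action="ignore")
next_week = model.predict(n_periods=168)    # forecast the next 168 hours
```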

VAR. Vector autoregression (VAR) is a popular forecasting method, particularly in econometrics [18]. It combines multiple signals in an AR model, which means we can use it to incorporate contextual information. We use the GloVe-transformed tweets described in Sect. 4.2 as the contextual vector, and the VAR function of the R package vars to automatically find the optimal parameters and make predictions.
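
Our experiments use R's vars::VAR; purely as an illustration, an analogous sketch in Python with statsmodels, on toy data combining one crime series with a 3-dimensional stand-in for the GloVe context, could look as follows.

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
crime = np.zeros(500)
for t in range(1, 500):                       # toy autocorrelated hourly crime series
    crime[t] = 0.6 * crime[t - 1] + rng.poisson(0.1)
context = rng.standard_normal((500, 3))       # toy stand-in for the GloVe context vector
data = np.column_stack([crime, context])      # one row per hour: crime + context signals

# Fit a VAR model with the lag order selected automatically by AIC.
results = VAR(data).fit(maxlags=24, ic="aic")
forecast = results.forecast(data[-results.k_ar:], steps=168)[:, 0]  # crime column only
```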

Kernel Density Estimation (KDE). KDE is a popular interpolation method for estimating crime in areas that lack previous crime records, by exploiting geographical correlations. With a pre-defined grid, the probability of crime occurring at a point is estimated based on its distances to the locations of previous crime records:

$$ f(p)=k(p,h)=\frac{1}{Ph} \sum _{j=1}^P K \left( \frac{||p-p_j||}{h} \right) $$

where p is the point at which the density is estimated, h is the bandwidth parameter that controls the smoothness of the estimation, P is the total number of crimes, K is the kernel density function, and \(||\cdot ||\) is the Euclidean distance between two points. Following [8], we use the R package ks to estimate k(p, h), with the standard normal density function and the default bandwidth estimator Hpi. After estimating the density at each grid point, we transform the density into a probability indicating whether a crime will occur in a spatial-temporal unit:

$$ Pr\left( occur=T|f(p) \right) =\frac{1}{1+e^{-f(p)}} $$

Since KDE does not incorporate temporal effects, we fit a separate KDE model for each hour.
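
Our experiments use the R package ks as described above; the following is only an analogous sketch with scipy's Gaussian KDE (automatic bandwidth, hypothetical toy coordinates) to illustrate the density-to-probability transformation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_crime_probability(train_points, grid_points):
    """Estimate a spatial crime density from past incident coordinates and
    squash it into an occurrence probability with the logistic function above."""
    density = gaussian_kde(train_points.T)(grid_points.T)   # f(p) at each grid point
    return 1.0 / (1.0 + np.exp(-density))                   # Pr(occur = T | f(p))

# One KDE per hour of the week, since KDE itself carries no temporal information.
pts = np.random.randn(200, 2)               # toy past incidents (x, y) for one hour
grid = np.array([[0.0, 0.0], [1.0, 2.0]])   # toy evaluation points
probs = kde_crime_probability(pts, grid)
```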

5.3 Recommendation Performance

As a standard recommendation system evaluation metric, we measure the mean absolute error (MAE) of the recommendation and non-recommendation methods. MAE is calculated as the mean absolute difference between the predicted and actual crime numbers, over the block-hour units in the training data that have non-zero crimes. Since the historical sum and CF methods preserve the actual crime numbers of the training data and thus have an MAE of 0, they are not measured here. Moreover, since the KDE method outputs the likelihood of future crime rather than a crime number, it is not measured either. The MAE of the remaining methods is shown in Table 2.

Table 2. MAE of recommendation and non-recommendation methods

As we can see from Table 2, HFT achieves the best model fit by considering both the hidden factors and the contextual data. VAR performs better than ARIMA, which shows that incorporating tweets as contextual information can effectively improve prediction accuracy.

5.4 Predicting Future Crimes

The MAE calculated above indicates how well a model fits the training data, but it does not tell us about the prediction of future crimes. Our next experiment therefore studies the effectiveness of the methods for predicting future crimes. We follow [8] and use the surveillance plot as our measure of future crime prediction effectiveness. A surveillance plot consists of a number of surveillance points, each giving the ratio of crimes captured by a particular set of selected block-hour units. Since each block-hour unit corresponds to a man-hour, a surveillance point can be interpreted as the percentage of crimes captured with a given amount of man-hours. To calculate the value of a surveillance point sp(k), we first rank the block-hour units by the prediction, from high to low, then take the top k% of units in the test data and sum the amount of crime captured. The point value is thus calculated as

$$ sp(k)=\frac{\sum _{i=1}^{k\% \times |U'|} u_i}{\sum _{i=1}^{|U'|} u_i} $$

where \(u_i\) is the number of crimes in unit i, and \(U'\) is the set of ranked units.
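
For illustration, a minimal NumPy sketch of computing the surveillance points and the area under the resulting plot; the toy predictions are hypothetical, and the AUC here is a simple mean-based approximation of the area under the 100-point curve.

```python
import numpy as np

def surveillance_points(pred, actual):
    """Compute sp(k) for k = 1..100: rank block-hour units by predicted risk,
    then report the fraction of test-period crimes captured by the top k%."""
    order = np.argsort(-pred)                 # highest predicted risk first
    captured = np.cumsum(actual[order])       # crimes captured so far, in rank order
    cutoffs = (np.arange(1, 101) / 100 * len(pred)).astype(int) - 1
    return captured[cutoffs] / actual.sum()

pred = np.random.rand(1000)                   # toy predictions for 1000 units
actual = np.random.poisson(0.2, size=1000).astype(float)   # toy test-period counts
sp = surveillance_points(pred, actual)
auc = sp.mean()                               # approximate area under the plot
print(sp[19], sp[49], auc)                    # sp(20), sp(50), AUC
```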

To draw a surveillance plot, we plot 100 surveillance points with their values. The surveillance plots of four selected methods, i.e., KDE, VAR, CF-item, and HFT, are shown in Fig. 1. As an overall measure, the Area Under the Curve (AUC) is calculated for each surveillance plot. Table 3 shows the AUC of all presented methods, as well as the surveillance levels at 20% and 50%, sp(20) and sp(50), which can be interpreted as the ratio of crimes captured with 20% and 50% of the man-hours.

Fig. 1. Surveillance plots of selected prediction methods

Table 3. Future crime prediction accuracy with recommendation and non-recommendation methods

According to Fig. 1 and Table 3, HFT achieves the highest accuracy among all methods, followed closely by TD. This reveals the power of considering both the spatial-temporal hidden factors and the context information. With HFT, 70% of thefts can be captured with 20% of the man-hours, and 88% with 50%. VAR performs only a little better than ARIMA, indicating that tweets alone are not a strong indicator of crime. KDE, despite not considering temporal factors, achieves higher accuracy than the CF methods, and the highest among the non-recommendation methods; nevertheless, it captures 13% fewer crimes than HFT at 20% of the man-hours. The CF methods do not achieve very high accuracy, with item-based CF performing much better than user-based CF. This is likely because CF methods consider only one of the two latent factors.

5.5 Effect of Training Data Size

Finally, we investigate the effect of the training data size on prediction accuracy. In particular, we aim to find out whether a very small training set can still provide good predictions. We therefore create training datasets composed of the 1, 3, 5, 10, and 20 weeks preceding the testing dataset, which is the same as in the previous experiment. We select the best performing recommendation and non-recommendation methods, i.e., HFT and KDE, for comparison. The AUC results are shown in Fig. 2.

Fig. 2. AUC for selected prediction methods with different training sizes

According to Fig. 2, HFT steadily outperforms KDE across the different training sizes. Both methods reach near-optimal performance, compared to Table 3, with only one quarter (20 weeks) of the training data. However, according to [14], HFT should perform much better when there is little training data, with the prediction being driven by the contextual information. We do not observe this in our results, perhaps again because tweets are a weak predictor of crime.

6 Conclusion

We have proposed to model crime prediction as a recommendation problem. This modeling allows a finer spatial-temporal granularity while using recommendation system techniques to mitigate the data sparsity problem. We presented how to model the spatial and temporal factors as users and items, and discussed several recommendation techniques that can be applied, including collaborative filtering methods and context-based rating prediction methods. Our extensive experiments with crime and contextual data collected for the city of San Francisco show that recommendation methods such as HFT can outperform traditional crime prediction methods such as KDE. Using HFT, we can capture 70% of thefts in San Francisco with 20% of the man-hours. We have also tested the effect of small training data sizes. In this work, we focused on thefts and assaults in San Francisco; more crime types and cities can be studied in the future.