1 Introduction

The demographic trend clearly shows an increasing concentration of people in huge cities. By 2030, 9% of the world population is expected to live in just 41 mega-cities, each with more than 10M inhabitants. Thus, the growing availability of data [2] makes it possible to discover new and interesting aspects of cities and city life at an unprecedentedly fine granularity.

A fundamental challenge that policy makers and urban planners face is land use classification, which plays an important role in infrastructure planning and development, real-estate evaluation, and the authorization of business permits. More specifically, policy makers and urban planners need to associate urban areas with specific human activities (e.g., residential, industrial, business, nightlife and others). However, traditional survey-based approaches to classifying areas are too time-consuming and costly to apply to modern huge cities. Therefore, automatic approaches using novel sources of data (e.g., data from mobile phones, LBSNs, etc.) have been proposed. For example, [19] designed supervised and unsupervised approaches to infer New York City (NYC) land use from check-ins. A check-in usually consists of latitude and longitude coordinates associated with additional metadata such as the venue where the user checked in, comments and photos. Such data can be extracted from LBSNs like FoursquareFootnote 1, a social network application that provides the number and type of activities present in a target area (e.g., Arts & Entertainment, Nightlife Spot, etc.). The approach essentially used feature vectors, mainly consisting of the number of check-ins together with the activity inferred from the Foursquare category of the place (e.g., eating if the check-in was made in a restaurant). As a gold standard, the authors used data provided by the NYC Department of City Planning in 2013, mapped on a grid of 200 \(\times \) 200 m.

In this paper, we represent geographical areas in two different ways: (i) as a bag-of-concepts (BOC), e.g., Arts and Entertainment, College and University, Event, Food, extracted from the Foursquare description of the area; and (ii) as the same concepts organized in a tree, reflecting the hierarchical category structure of Foursquare activities. We design kernels combining BOC vectors with Tree Kernels (TKs) [6, 9, 10, 17] applied to concept trees and use them in Support Vector Machines (SVMs). This way, our model (i) can learn complex structural and semantic patterns encoded in our hierarchical conceptualization of an area and (ii) substantially improves the accuracy of standard classification methods based on BOC. Our GeoTK represents an interesting novelty, as we show that TKs can not only capture semantic information from natural language text, e.g., as shown for semantic role labeling [12] and question answering [3, 15], but also convey conceptual features from the hierarchy above to perform semantic inference, such as deciding the predominant activity of an area. Our approach is widely applicable as (i) it can use any hierarchical category structure for POI categories (e.g., OpenStreetMap POI data); and (ii) many cities offer open access to their land use data.

Finally, we carry out a study with different granularities of the areas to be analyzed. This also enables us to analyze the trade-off between the precision in targeting the area of interest and the accuracy of the estimation. More specifically, we divide the NYC area into squares with edges of 50, 100, 200 and 250 m and, for each cell, we classify its predominant land use class (e.g., Residential, Commercial, Manufacturing, etc.). Our extensive experimentation, including a comparative study as well as the use of several machine learning models, shows that GeoTKs are very effective and improve the state of the art by up to 18% in Macro-F1.

The remainder of this paper is organized as follows: Sect. 2 introduces the related work, Sect. 3 describes the task and the related data, Sect. 4 presents our hierarchical tree representation and our GeoTK. Then, Sect. 5 illustrates the evaluation of our approach, and finally Sect. 6 draws some conclusions.

2 Related Work

Several works have targeted land use inference by means of different sources of information. For example, [18] built a framework that, using human mobility patterns derived from taxicab trajectories and Points Of Interest (POIs), classifies the functionality of an area for the city of Beijing. The model is similar to the one used for topic discovery in textual documents, where the functionality of an area is the topic, the region is the document, and POIs and mobility patterns are the metadata and the words, respectively. Specifically, [18] used an advanced model combining Latent Dirichlet Allocation (LDA) with Dirichlet Multinomial Regression (DMR) in order to also incorporate information coming from the POIs (metadata). Hence, for each region, after parameter estimation with DMR, they obtain a vector representing the intensity of each topic. This vector is then used to aggregate regions having similar functions via k-means clustering.

Similarly, [1] proposed a spatio-temporal approach for the detection of functional regions. They exploited three different clustering algorithms using different sets of features extracted from Foursquare POIs and check-in activities in Manhattan (New York). This task helps to better understand how the functionality of a city region changes over time. Other works have used geo-tagged data from social networks: for example, [8] used tweets as input data to predict the land use of certain areas of Manhattan. Moreover, they tried to infer POIs from tweet patterns, first clustering the surface with a Self-Organizing Map, then characterizing each region with a specific tweet pattern, and finally using k-means to infer land use. Again, [19] used check-in data to compare unsupervised and supervised approaches to land use inference.

Finally, some works have also used Call Detail Records (CDRs) [7, 8, 13, 16], which are typically collected by mobile phone operators for billing purposes. These records register the time and type of each communication (e.g., incoming calls, Internet, outgoing SMS) and the radio base station handling it. For example, [16] used CDRs together with a Random Forest classifier to build a time-varying land use classification for the city of Boston. The intuition behind this work is to mine a time-variant relation between movement patterns and land use. In particular, they perform a Random Forest prediction and then compare it with the predictions obtained for the neighboring regions, applying a sort of consensus validation (e.g., they modify the prediction if a certain number of neighbors belong to a different uniform function). This way, they model different land uses for different temporal slots of the day.

Compared to the state of the art, the main novelties introduced by our work are the following: (i) we model the hierarchical semantic information of Foursquare using GeoTK, thus adding powerful structural features to our classification models; and (ii) we study how the size of the grid impacts the accuracy of different models, thus investigating the trade-off between granularity of the analysis and accuracy. It should also be noted that, in contrast to previous work, GeoTK does not rely on external resources (e.g., mobile phone data) or heavy feature engineering in addition to the structural kernel model.

3 Datasets

We use the shape file of New York provided by the NYC governmentFootnote 2. This file is publicly available and contains the entire shape of New York divided into its five boroughs: Manhattan, Brooklyn, Staten Island, Bronx, and Queens. Then, we build a grid over the entire city to enable our classification task. The goal is to infer the land use of a region given a target label and a feature representation of the region. In the next subsections, we describe (i) the land use data and labels utilized by our approach, and (ii) the Foursquare POIs used to obtain a feature representation of a region.

3.1 Land Use

In our study, we use MapPLUTO, a freely available dataset provided by the NYC government, which contains precise geo-referenced information for each city borough. For example, it provides the precise category and shape of each building in the city (Fig. 1). More specifically, it contains the following land use categories: (i) One and Two Family Buildings, (ii) Multi-Family Walk-Up Buildings, (iii) Multi-Family Elevator Buildings, (iv) Mixed Residential and Commercial Buildings, (v) Commercial and Office Buildings, (vi) Industrial and Manufacturing Buildings, (vii) Transportation and Utility, (viii) Public Facilities and Institutions, (ix) Open Space and Outdoor Recreation, (x) Parking Facilities, and (xi) Vacant Land. Land use information is very fine-grained, and in most cases only one land use is assigned to each building, thus making it very difficult to determine the land use with POI information alone. A reasonable trade-off between classification accuracy and the desired area granularity consists in segmenting the regions into squared cells: each cell may cover more than one land use, but we consider the predominant class as its primary use.

Fig. 1. Example of land use distribution in New York City.

3.2 Foursquare’s Point of Interests

We extracted 206,602 POIs covering the entire NYC. As for the land use data, we have several sources of information, but we focused on the ten macro-categories of the POIs, each specialized into at most four levels of detail. These levels follow a hierarchical structureFootnote 3, where each category has a finite number of subcategories as node children. For instance, the first level of main POI categories consists of: (i) Arts and Entertainment, (ii) College and University, (iii) Event, (iv) Food, (v) Nightlife Spot, (vi) Outdoors and Recreation, (vii) Professional and Other Places, (viii) Residence, (ix) Shop and Service, and (x) Travel and Transport. The second level includes 437 categories, whereas the third level contains a smaller number, 345.

4 Semantic Structural Models for Land Use Analysis

Previous works [4, 13, 14] have mainly used features extracted from LBSNs (e.g., Foursquare POIs) in the XGboost algorithm [5]. However, these feature vectors have several limitations, such as (i) the small amount of information available for the target area and (ii) their inherent scalar nature, which does not capture the existence and the type of relations between different POIs. Here, we propose a more powerful approach based on TKs applied to a semantic structure derived from the hierarchical organization of the Foursquare categories.

4.1 Bag-of-Concepts

The most straightforward way to represent an area by means of Foursquare data is to use its POIs. Every venue is hierarchically categorized (e.g., Professional and Other Places \(\rightarrow \) Medical Center \(\rightarrow \) Doctor’s Office) and the categories are used to produce an aggregated representation of the area. We build this feature representation by aggregating all the venues together, i.e., for each grid cell, we count the occurrences of each macro-level category (e.g., Food) among the POIs it contains. This way, we generate Bag-of-Concepts (BOC) feature vectors, counting the number of activities under each macro-category.
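As an illustration, the BOC construction can be sketched as follows (a minimal sketch: the `boc_vector` function and the category-path input format are our own illustrative choices, not the paper's actual code):

```python
from collections import Counter

# The ten Foursquare macro-categories listed in Sect. 3.2.
MACRO_CATEGORIES = [
    "Arts and Entertainment", "College and University", "Event", "Food",
    "Nightlife Spot", "Outdoors and Recreation",
    "Professional and Other Places", "Residence", "Shop and Service",
    "Travel and Transport",
]

def boc_vector(poi_category_paths):
    """Build a bag-of-concepts vector for one grid cell.

    Each POI is given as its category path, e.g.
    ["Food", "Asian Restaurant", "Chinese Restaurant"]; we count the
    occurrences of the top-level (macro) category of every POI.
    """
    counts = Counter(path[0] for path in poi_category_paths)
    return [counts.get(cat, 0) for cat in MACRO_CATEGORIES]

# Example: a cell containing two Food venues and one Nightlife venue.
cell_pois = [
    ["Food", "Asian Restaurant", "Chinese Restaurant"],
    ["Food", "Pizza Place"],
    ["Nightlife Spot", "Bar"],
]
vec = boc_vector(cell_pois)
```

The resulting vector has one slot per macro-category, so every cell is mapped to a fixed-length representation regardless of how many POIs it contains.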

4.2 Hierarchical Tree Representation of Foursquare POIs

Every LBSN (e.g., Foursquare) has its own hierarchy of categories, which is used to characterize each location and activity (e.g., restaurants or shops) in the database. Thus, each POI in Foursquare is associated with a hierarchical path, which semantically describes the type of location/activity (e.g., for Chinese Restaurant, we have the path Food \(\rightarrow \) Asian Restaurant \(\rightarrow \) Chinese Restaurant). The path is much more informative than the target POI name alone, as it provides feature combinations following the structure and node-proximity information, e.g., Food & Asian Restaurant or Asian Restaurant & Chinese Restaurant are valid features, whereas Food & Chinese Restaurant is not.

In this work, we propose a tree structure, the Geo-Tree (GT), whose nodes are Foursquare categories and whose edges are those of the hierarchical category tree of Foursquare. Our structure is basically composed of all paths associated with the POIs found in the target grid cell. More precisely, we connect all these paths under a new root node. This way, the first level of the root’s children corresponds to the most general categories (e.g., Arts & Entertainment, Event, Food, etc.), the second level of our tree corresponds to the second level of the Foursquare hierarchy, and so on. The terminal nodes are the finest-grained category descriptions of the area (e.g., College Baseball Diamond or Southwestern French Restaurant). For example, Fig. 2 illustrates the semantic structure of a grid cell obtained by combining the category chains of all its venues. Given such a representation, we can encode all its substructures in kernel machines using TKs, as described in the next section.
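The merge of category paths under a shared root can be sketched as follows (a minimal sketch: the nested-dict tree representation and the function name `build_geo_tree` are our own illustrative choices):

```python
def build_geo_tree(poi_category_paths):
    """Merge the category paths of all POIs in a cell into one Geo-Tree.

    The tree is a nested dict rooted at "ROOT"; shared prefixes of the
    Foursquare category paths are merged, as in Fig. 2, so sibling POIs
    under the same category hang off the same internal node.
    """
    root = {}
    for path in poi_category_paths:
        node = root
        for category in path:
            node = node.setdefault(category, {})
    return {"ROOT": root}

# Two Asian restaurants share the Food -> Asian Restaurant prefix.
tree = build_geo_tree([
    ["Food", "Asian Restaurant", "Chinese Restaurant"],
    ["Food", "Asian Restaurant", "Thai Restaurant"],
    ["Nightlife Spot", "Bar"],
])
```

Note that, unlike the BOC vector, this structure preserves which fine-grained categories co-occur under the same parent, which is exactly the proximity information the TKs exploit.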

Fig. 2. Example of Geo-Tree built according to the hierarchical categorization of Foursquare venues.

4.3 Geographical Tree Kernels (GeoTK)

Structural kernels are a very effective means for automatic feature engineering [11]. In kernel machines, both learning and classification algorithms depend only on the evaluation of inner products between instances, which corresponds to computing similarity scores. In several cases, these similarity scores can be efficiently and implicitly computed by kernel functions by exploiting the following dual formulation of the classification function: \(\sum _{i=1..l}y_i\alpha _iK(o_i,o)+b,\) where \(o_i\) are the training objects, o is the example to classify, and \(K(o_i,o)\) is a kernel function that implicitly defines the mapping from the objects to feature vectors \(\varvec{x_i}\). In the case of tree kernels, K determines the shape of the substructures describing trees.

4.4 Tree Kernels

In the majority of machine learning approaches, data examples are transformed into feature vectors, which in turn are used in dot products for carrying out both the learning and classification steps. Kernel Machines (KMs) allow for replacing the dot product with kernel functions, which compute the dot product directly from the examples (i.e., they avoid the transformation of examples into vectors).

Given two input trees, TKs evaluate the number of substructures, also called fragments, that they have in common. More formally, let \(\mathcal {F} = \{ f_1, f_2, \dots , f_{|\mathcal {F}|} \} \) be the space of all possible tree fragments and \(\chi _i(n)\) an indicator function equal to 1 if the fragment \(f_i\) is rooted in n, and 0 otherwise. TKs over \({T_1}\) and \({T_2}\) are defined by \( TK(T_1, T_2) = \sum _{n_1 \in N_{T_1}} \sum _{n_2 \in N_{T_2}} \varDelta (n_1, n_2), \) where \(N_{T_1}\) and \(N_{T_2}\) are the sets of nodes of \(T_1\) and \(T_2\) and

$$\begin{aligned} \varDelta (n_1, n_2) = \sum _{i=1}^{|\mathcal {F}|}\chi _i (n_1) \chi _i (n_2) \end{aligned}$$
(1)

represents the number of common fragments rooted at nodes \(n_1\) and \(n_2\). The number and type of fragments generated depend on the tree kernel function used, which, in turn, is defined by \(\varDelta (n_1, n_2)\).
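The double summation above can be sketched directly (a minimal sketch: the `Node` class and the trivial label-matching delta are illustrative placeholders; any \(\varDelta \) from the following subsections can be plugged in):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def all_nodes(tree):
    """Collect every node of a tree by pre-order traversal."""
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def tree_kernel(t1, t2, delta):
    """TK(T1, T2): sum delta(n1, n2) over all pairs of nodes."""
    return sum(delta(n1, n2)
               for n1 in all_nodes(t1)
               for n2 in all_nodes(t2))

# A trivial delta that only counts label matches turns TK into a
# bag-of-node-labels kernel (used here just to exercise the double sum).
label_match = lambda n1, n2: 1.0 if n1.label == n2.label else 0.0
```

With this skeleton, STK, STK\(_b\) and PTK differ only in the `delta` argument passed to `tree_kernel`.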

Syntactic Tree Kernels (STK). The STK is computed by using \(\varDelta _{STK}(n_1,n_2)\) in Eq. 1, defined as follows (in a syntactic tree, each node is associated with a production rule):

  1. (i)

    if the productions at \(n_1\) and \(n_2\) are different \(\varDelta _{STK}(n_1,n_2)=0\);

  2. (ii)

    if the productions at \(n_1\) and \(n_2\) are the same, and \(n_1\) and \(n_2\) have only leaf children then \(\varDelta _{STK}(n_1,n_2)=\lambda \); and

  3. (iii)

    if the productions at \(n_1\) and \(n_2\) are the same, and \(n_1\) and \(n_2\) are not pre-terminals then \(\varDelta _{STK}(n_1,n_2)=\lambda \prod _{j=1}^{l(n_1)} (1 + \varDelta _{STK}(c_{n_1}^j,c_{n_2}^j))\),

where \(l(n_1)\) is the number of children of \(n_1\) and \(c_{n}^j\) is the j-th child of node n. Note that, since the productions are the same, \(l(n_1)=l(n_2)\). The computational complexity of STK is \(O(|N_{T_1}| |N_{T_2}|)\), but the average running time tends to be linear, i.e., \(O(|N_{T_1}| + |N_{T_2}|)\), for natural language syntactic trees [10].

Finally, by adding the following step:

  1. (0)

    if the nodes \(n_1\) and \(n_2\) are the same then \(\varDelta _{STK}(n_1,n_2)=\lambda \),

individual nodes are also counted by \(\varDelta _{STK}\). We call this kernel STK\(_b\).
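The three STK steps (plus step (0) for STK\(_b\)) can be sketched recursively as follows (a minimal sketch under our own conventions: productions are encoded as a node label plus the tuple of its children's labels, and the `Node` class is an illustrative placeholder):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def production(n):
    """Encode the production rule at n as (label, child labels)."""
    return (n.label, tuple(c.label for c in n.children))

def delta_stk(n1, n2, lam=0.4, count_nodes=False):
    """Delta of the Syntactic Tree Kernel (STK_b when count_nodes=True)."""
    if not n1.children and not n2.children:
        # step (0): count matching individual nodes (STK_b only).
        return lam if (count_nodes and n1.label == n2.label) else 0.0
    if production(n1) != production(n2):
        return 0.0          # step (i): different productions
    if all(not c.children for c in n1.children):
        return lam          # step (ii): pre-terminal, same production
    # step (iii): same production, recurse on the aligned children.
    result = lam
    for c1, c2 in zip(n1.children, n2.children):
        result *= 1.0 + delta_stk(c1, c2, lam, count_nodes)
    return result
```

For two identical Geo-Tree fragments ROOT \(\rightarrow \) Food \(\rightarrow \) Pizza Place, step (iii) yields \(\lambda (1+\lambda )\), i.e., 0.56 with \(\lambda = 0.4\).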

The Partial Tree Kernel (PTK). The PTK [10] generalizes a large class of tree kernels, as it computes one of the most general tree substructure spaces. Given two trees, PTK considers any connected subset of nodes as a possible feature of the substructure space. It is computed with Eq. 1 using the following \(\varDelta _{PTK}\) function:

if the labels of \(n_1\) and \(n_2\) are different \(\varDelta _{PTK}(n_1,n_2)=0;\)

else \(\small \displaystyle \varDelta _{PTK}(n_1,n_2)= \mu \Big (\lambda ^2 + \sum _{\varvec{I}_1,\varvec{I}_2,l(\varvec{I}_1)=l(\varvec{I}_2)} \lambda ^{\scriptscriptstyle d(\varvec{I}_1)+d(\varvec{I}_2)} \prod _{j=1}^{l(\varvec{I}_1)} \varDelta _{\scriptscriptstyle PTK}(c_{n_1}(\varvec{I}_{1j}),c_{n_2}(\varvec{I}_{2j}))\Big ), \)

where \(\mu , \lambda \in [0,1]\) are two decay factors, and \(\varvec{I}_1\) and \(\varvec{I}_2\) are two sequences of indices that select subsequences of children u, \(\varvec{I} = (i_1,...,i_{|u|})\), from the sequences of children s, with \(1 \le i_1< ... < i_{|u|} \le |s|\), i.e., such that \(u=s_{i_1}..s_{i_{|u|}}\); \(d(\varvec{I}) = i_{|u|}-i_1 + 1\) is the distance between the first and last selected child.
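A naive reading of the \(\varDelta _{PTK}\) equation can be sketched by explicitly enumerating all equal-length child subsequences (a minimal sketch only: this is exponential in the branching factor, whereas [10] gives an efficient dynamic-programming formulation; the `Node` class is an illustrative placeholder):

```python
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def delta_ptk(n1, n2, mu=1.0, lam=0.4):
    """Naive Delta_PTK mirroring the equation above term by term."""
    if n1.label != n2.label:
        return 0.0
    total = lam ** 2
    c1, c2 = n1.children, n2.children
    # Sum over all pairs of equal-length index sequences I1, I2.
    for k in range(1, min(len(c1), len(c2)) + 1):
        for I1 in combinations(range(len(c1)), k):
            for I2 in combinations(range(len(c2)), k):
                d1 = I1[-1] - I1[0] + 1   # d(I1)
                d2 = I2[-1] - I2[0] + 1   # d(I2)
                term = lam ** (d1 + d2)
                for j in range(k):
                    term *= delta_ptk(c1[I1[j]], c2[I2[j]], mu, lam)
                total += term
    return mu * total
```

For example, with \(\mu = 1\) and \(\lambda = 0.4\), two matching leaves contribute \(\lambda ^2 = 0.16\), and two matching Food \(\rightarrow \) Pizza Place fragments contribute \(\lambda ^2 + \lambda ^4 = 0.1856\).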

When the PTK is applied to the semantic Geo-Tree of Fig. 2, it can generate effective fragments, e.g., those in Fig. 3.

Fig. 3. Some of the exponentially many fragment features from the tree of Fig. 2.

Combination of TKs and Feature Vectors. Our TKs do not consider the frequencyFootnote 4 of the POIs present in a given grid cell. Thus, it may be useful to enrich the feature space with further information encoded in a feature vector. To this end, we need a kernel that combines tree structures and feature vectors. More specifically, given two geographical areas \({x}^a\) and \({x}^b\), we define the combination as: \( K({x}^a,{x}^b) = TK(\mathbf {t}^a, \mathbf {t}^b) + KV(\mathbf {v}^a, \mathbf {v}^b) \), where TK is any structural kernel function applied to the tree representations \(\mathbf {t}^a\) and \(\mathbf {t}^b\) of the geographical areas, and KV is a kernel applied to the feature vectors \(\mathbf {v}^a\) and \(\mathbf {v}^b\) extracted from \({x}^a\) and \({x}^b\) using any available data source (e.g., text, social media, mobile phone and census data).
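The additive combination can be sketched as follows (a minimal sketch: the pair representation of an area and the plain dot product standing in for KV are our own illustrative choices; any tree kernel and vector kernel can be substituted):

```python
def combined_kernel(area_a, area_b, tk, kv):
    """K(x^a, x^b) = TK(t^a, t^b) + KV(v^a, v^b).

    Each area is a pair (tree, feature_vector); `tk` is any tree
    kernel and `kv` any vector kernel over the BOC vectors.
    """
    (t_a, v_a), (t_b, v_b) = area_a, area_b
    return tk(t_a, t_b) + kv(v_a, v_b)

# A plain dot product as the vector kernel KV.
dot = lambda u, v: float(sum(x * y for x, y in zip(u, v)))
```

Since a sum of two valid kernels is itself a valid kernel, the combination can be plugged directly into the SVM dual formulation of Sect. 4.3.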

5 Experiments

We test the effectiveness of our approach on the land use classification task, where the goal is to assign to each area its predominant land use class, as in previous work [16, 19]. We first test several models on Manhattan using several grid sizes, then we evaluate the best models on all NYC boroughs and, finally, we apply the best models to the entire NYC, also enabling comparisons with previous work.

5.1 Experimental Setup

We performed our experiments on data from the NYC boroughs, evaluating grids of various dimensions: \(50\times 50\), \(100\times 100\), \(200\times 200\) and \(250\times 250\) m. We applied a pre-processing step to filter out cells for which it is not possible to perform land use classification. In particular, from each grid, we removed the cells (i) that cover areas without a specified land use (e.g., cells in the sea) and (ii) for which we have no POIs (e.g., cells from Central Park). For each grid, we created training, validation and test sets, randomly sampling 60%, 20% and 20% of the cells, respectively. We labelled the dataset following the same category aggregation strategy proposed by [19], assigning the predominant land use class to each grid cell. Note that, given the categories described in Sect. 3.1, we merged (i) One & Two Family Buildings, (ii) Multi-Family Walk-Up Buildings and (iii) Multi-Family Elevator Buildings into a single general Residential category. Then, we also aggregated (i) Industrial & Manufacturing, (ii) Public Facilities & Institutions, (iii) Parking Facilities and (iv) Vacant Land into a new category called Other. Thus, the aggregated dataset contains six different classes: (i) Residential, (ii) Commercial and Office Buildings, (iii) Mixed Residential and Commercial Buildings, (iv) Open Space and Outdoor Recreation, (v) Transportation and Utility, and (vi) Other. The names and distribution of examples in the training and test sets (for the \(200\times 200\) grid) are shown in Table 1. Compared to the original categorization, this new taxonomy has a lower granularity, thus facilitating the identification of the predominant class in each cell.
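The category aggregation and predominant-class labelling can be sketched as follows (a minimal sketch: the area-per-category input format and the function names are our own illustrative choices; the category names follow Sect. 3.1):

```python
RESIDENTIAL = {
    "One and Two Family Buildings",
    "Multi-Family Walk-Up Buildings",
    "Multi-Family Elevator Buildings",
}
OTHER = {
    "Industrial and Manufacturing Buildings",
    "Public Facilities and Institutions",
    "Parking Facilities",
    "Vacant Land",
}

def aggregate(category):
    """Map an original land use category to one of the six classes."""
    if category in RESIDENTIAL:
        return "Residential"
    if category in OTHER:
        return "Other"
    return category  # the remaining four categories are kept as-is

def cell_label(land_use_areas):
    """Assign a cell the predominant (largest total area) class.

    `land_use_areas` maps each original land use category to the area
    (in square meters) it covers inside the cell.
    """
    totals = {}
    for category, area in land_use_areas.items():
        cls = aggregate(category)
        totals[cls] = totals.get(cls, 0.0) + area
    return max(totals, key=totals.get)
```

Note how aggregation can change the winning class: two residential sub-categories that individually lose to a commercial one may dominate once merged.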

Table 1. Distribution of land use classes in the training and test set for NYC.

To train our models, we adopted SVM-Light-TKFootnote 5, which allows us to use structural kernels [10] in SVM-lightFootnote 6. We experimented with linear, polynomial and radial basis function kernels applied to standard feature vectors. We measured the performance of our classifiers with Accuracy, Macro-Precision, Macro-Recall and Macro-F1 (Macro indicates the average over all categories).
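For clarity, the macro-averaged metrics we report can be computed as follows (a pure-Python sketch of the standard definitions; the function name and input format are our own choices):

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the label set."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because each class contributes equally to the average regardless of its frequency, Macro-F1 penalizes models that only do well on the dominant Residential class.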

Fig. 4. Accuracy of common machine learning models on different cell sizes in Manhattan.

Fig. 5. Accuracy of GeoTKs according to different cell sizes in Manhattan.

5.2 Results for Land Use Classification

We trained multi-class classifiers using common learning algorithms such as Logistic Regression (LogReg), XGboost [5], and SVMs using linear, polynomial and radial basis function kernels, named SVM-{Lin, Poly, Rbf}, respectively, as well as our structural semantic models, indicated with STK, STK\(_b\) and PTK. We also combined kernels with a simple summation, e.g., PTK+Poly indicates an SVM using such a kernel combination.

We first tested our models individually on Manhattan alone using different grid sizes. Figures 4 and 5 show the accuracy of the multi-classifiers for different models according to different granularities of the sampling grid. We note that SVM-Poly, XGboost and LogReg show comparable accuracy. PTK and STK\(_b\) perform slightly worse than the feature vector models. Interestingly, the kernel combinations in Fig. 6 provide the best results. This is an important finding, as XGboost is acknowledged to be the state of the art for land use classification. Additionally, when the grid cells become larger, the accuracy of TKs degrades faster than that of kernels based on feature vectors, mainly because the conceptual tree becomes too large. After the preliminary experiments above, we selected the most accurate models on Manhattan and tested them on the other boroughs of NYC. Table 2 shows that TKs are more accurate than vector-based models and that the combinations further improve both.

Fig. 6. Accuracy of kernel combinations using BOC vectors and GeoTKs according to different cell sizes in Manhattan.

In the final experiments, we tested our best models on the entire NYC with a grid of 200 \(\times \) 200 m. We first tuned the following parameters on a validation set: (i) the decay factors \(\mu \) and \(\lambda \) for the TKs; (ii) the C value for all the SVM approaches, along with the kernel-specific parameters, i.e., the degree in Poly and \(\gamma \) in RBF kernels; and (iii) the main parameters of XGBoost, such as the maximum depth of the trees and the minimum sum of the weights of all observations in a child node.

Table 2. Accuracy of the best models for each New York borough and cell size.

Table 3 shows the results in terms of Accuracy, Macro-F1, Macro-Precision and Macro-Recall. The baseline model is obtained by always classifying an example with the label Residential, which is the most frequent. We note that: (i) all the feature vector and TK combinations show high accuracy, demonstrating the superiority of GeoTK over all the other models; and (ii) STK\(_b\)+Poly (polynomial kernel of degree 2) achieved the highest accuracy, improving over XGBoost by up to 4.2 and 6.5 absolute percentage points in accuracy and F1, respectively; the latter corresponds to an improvement of up to 18% over the state of the art.

Finally, Zhan et al. [19] report the result obtained on the same dataset using check-in data from Foursquare. Although an exact comparison cannot be carried out due to possible differences in the experimental setting (e.g., Foursquare data changing over time), we note that our model is 1.8 absolute percentage points better.

Table 3. Classification results on New York City.

6 Conclusions

In this paper, we have introduced a novel semantic representation of POIs to better exploit geo-social data for the primary land use classification of urban areas. This gives urban planners and policy makers the possibility to better administer and renew a city in terms of infrastructure, resources and services. Specifically, we encode data from LBSNs into a tree structure, the Geo-Tree, and use such representations in kernel machines. The latter can thus perform accurate classification exploiting hierarchical substructures of concepts as features. Our extensive comparative study on the areas of New York and its boroughs shows that TKs applied to Geo-Trees are very effective, improving the state of the art by up to 18% in Macro-F1.