1 Introduction

In recent years, the amount of available data has exploded in many fields of knowledge. Companies such as Facebook collect many kinds of information from their users.

Some of the data collected in recent years is geographical and demographic. From income to the number of people living in a single room, we can access vast amounts of information regarding the social fabric of different places.

One of the problems most commonly associated with habitability and social justice is the displacement of people by market forces. When a neighborhood changes drastically to make room for new businesses and a new class of inhabitants, we call this process gentrification [7].

Gentrification is a phenomenon that happens in many cities around the world. One of the prime examples is San Francisco [9], where the people who provide services can no longer afford to live in the city and are in turn displaced to other cities far from the urban center.

It is difficult to create generic models to predict and analyze gentrification, because its effects can differ greatly from city to city. Whereas in New Orleans gentrification can happen near the French Quarter, in Mexico City it happens near the zoo.

In this work, we test different Machine Learning models to predict which places will gentrify within a narrow period. The applications of this work range from profitable real estate developments to new public housing policies that accommodate the people who lose their homes due to gentrification.

1.1 Gentrification

Population displacement is an important issue in cities and knowledge centers, where the pressure for urban life is accelerating. These cities attract new businesses, highly skilled workers and large corporations. Generally, a process of transformation arises in the popular neighborhoods of these cities, where a medium- and high-income population replaces the low-income population living in the area. The incoming population renovates the houses, either on its own account or through private investment (real estate agencies and banks). All this drives supply, demand and the cost of housing. As a result, residents may feel pressured to move to more affordable locations. This process is known as gentrification [1].

There are generally three dynamic factors that are relevant for gentrification: (a) movement of people, (b) public policies and investments and (c) private capital flows. These elements are by no means exclusive; they depend heavily on each other, and each of them is mediated by conceptions of class, place and scale [1].

Gentrification is typically the result of investment in a community by local government, community activists, or business groups, and can often stimulate economic development, attract business, and reduce crime rates [4].

Several studies analyze gentrification. Mainly, they focus on analyzing the advantages and disadvantages from a social perspective and its impact on the communities affected by this phenomenon.

Because there is no universally accepted definition of gentrification, in this work we consider the following elements to be an integral part of the gentrification process [1]:

  • Population movements, which can take the form of direct or indirect displacement. In direct displacement, people are forced to leave their homes through violent actions, known as forced evictions. Indirect displacement results from socioeconomic factors: old residents are forced to leave their homes because rents increase, because real estate taxes rise along with the market value of the property, or because the transformations make them no longer feel comfortable in their neighborhood.

  • Dwellings, whether degraded or not, are usually rehabilitated, renovated or given a different land use.

  • Commercial businesses such as restaurants, beauty salons, art galleries and bars are established in spaces previously occupied by traditional family businesses.

  • Old facilities (warehouses, factories, railway stations, commercial ports) are converted into facilities of different use (commercial, housing, offices and services) for medium and high-income social groups.

1.2 Gentrification in Mexico City

The effects of gentrification in Mexico City (CDMX) have been reported in some newspapers. Such is the case of [3], which cites Doctores, Obrera, Tabacalera and Alamos as neighborhoods that are undergoing a gentrification process due to their commercial offer, service offer and commuting convenience.

On the other hand, [2] cites the neighborhoods Condesa, Cuauhtemoc, Hipodromo de la Condesa, Roma Sur, Roma Norte, Juarez, Doctores, Guerrero, Santa Maria La Ribera and Centro as undergoing a gentrification process due to increases in annual property taxes, rents and real estate sale prices.

According to Forbes [5], the neighborhoods that have potential because of their geographical position, transportation services, entertainment and gastronomy offerings are Juarez, Doctores, San Rafael, Santa Maria La Ribera, Irrigacion and Escandon.

1.3 Machine Learning and Gentrification

There have not been many works using Machine Learning to predict or analyze gentrification. Recently, Reades et al. [8] used a Machine Learning algorithm to predict gentrification in London. However, they acknowledged the limitations of their work. Furthermore, variables that are important for predicting gentrification in one city might be useless in a different setting. In New Orleans, the gentrified neighborhoods are near the ocean [6], whereas Mexico City has no ocean, and other coastal cities might have waterfronts that are not prone to gentrification.

To our knowledge, this is the first time that Machine Learning algorithms have been used to predict and analyze gentrification in Mexico City. As such, it is still a work in progress: gentrification is an ongoing phenomenon, and new governmental policies might induce gentrification in areas that had not been considered before.

2 Methodology

In this work, we extracted data from different sources of information. Once we had all the data, we tested multiple Machine Learning algorithms and compared their metrics in order to decide which is best suited to analyze the problem of gentrification with the data that we have.

2.1 Data Sources

As mentioned before, the geographical coverage of this work is limited to Mexico City, the most populated city in Mexico. Mexico City has been the subject of enough studies on gentrification and has plenty of data associated with commerce and social fabric. This availability of data allows us to model the city as accurately as possible.

Fig. 1. CDMX neighborhoods

Fig. 1 shows the 1,436 neighborhoods of CDMX that are considered in the present work.

The data that we use was collected in 2000, 2010 and 2016, and mostly comes from the Population and Housing Censuses conducted by INEGI, Mexico's National Institute of Statistics and Geography.

Table 1 shows the neighborhoods that are defined as gentrified in the present work ([3, 5]).

Table 1. Gentrified neighborhoods

Fig. 2. Gentrified neighborhoods

Figure 2 shows the location of the gentrified neighborhoods on a Google map. The gentrified neighborhoods are in the center of Mexico City, in the delegations Cuauhtemoc, Miguel Hidalgo and Benito Juarez.

2.2 Data

The different data sources we used are shown in Fig. 3.

  • Inventario Nacional de Viviendas 2016: National inventory of living quarters.

  • Censo de Población y Vivienda: national census in Mexico.

  • SCINCE: System to query the census information.

  • DENUE: National directory of economic activities.

  • Softec: Private database with information regarding the cost of the square meter.

  • Shapefiles: Different shapefiles for the geographical units in Mexico City.

Fig. 3. Data sources

2.3 Data Processing

Since the data that we used comes from a variety of sources, it required heavy pre-processing so that the same locations could be matched across the different tables.

In some cases, neighborhoods were named differently across sources and we did not have a unifying code to identify them. In other cases, for temporal data, neighborhoods ceased to exist or altogether new neighborhoods were created in the years between the different censuses.
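The name-matching step described above could be sketched as follows. This is a minimal illustration and not the pipeline used in this work: the helper names `normalize_name` and `match_tables`, the normalization rules (removing accents, case and extra spaces) and the numeric values are all assumptions made for the example.

```python
import unicodedata


def normalize_name(name):
    """Return a comparison key: no accents, upper case, single spaces."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def match_tables(left, right):
    """Join two {neighborhood name: value} tables on the normalized name."""
    left_by_key = {normalize_name(k): v for k, v in left.items()}
    right_by_key = {normalize_name(k): v for k, v in right.items()}
    matched = {
        key: (left_by_key[key], right_by_key[key])
        for key in left_by_key
        if key in right_by_key
    }
    # Names present in only one table need manual review.
    unmatched = sorted(set(left_by_key) ^ set(right_by_key))
    return matched, unmatched


# Placeholder values, not real census or directory figures.
census_table = {"Santa María la Ribera": 10, "Juárez": 20, "Obrera": 30}
directory_table = {"SANTA MARIA LA RIBERA": 5, "juarez ": 7, "Doctores": 9}

matched, unmatched = match_tables(census_table, directory_table)
```

Neighborhoods that disappear or are created between censuses would end up in `unmatched` and would still require a manual decision; the sketch only handles spelling differences.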

All the pre-processed data was stored in an Amazon-hosted database.

Once the data was pre-processed, we carried out an exploratory analysis and then used the algorithm described in Sect. 3 to create the gentrification indicator.

2.4 Exploratory Analysis

Since we had many sources of information, we created many different graphs. In this work we include those that we consider most important for representing the problem of gentrification.

These are not the only variables we used, only the ones we chose to present in this paper. For further analysis, please contact any of the corresponding authors.

Fig. 4. Percentage of people living in a different city in 2005

In Fig. 4 we indicate how many people were living in a different state at the time of the previous census. This shows how popular these areas are among newcomers. The map clearly shows that gentrified neighborhoods are rather popular among people who recently moved to Mexico City.

Fig. 5. Average school education

In Fig. 5 we show the average education level in those areas. Gentrified areas attract people with higher education, who may in turn have higher incomes. This is in line with the classic gentrification analyses usually given in the literature.

3 Algorithm Description

In this section we describe the modeling process based on the data defined in the previous section. We define the type of problem, the target variable and the evaluation of the model.

We want to estimate a binary categorical target variable with the values "Gentrified" and "Not Gentrified".

We address the problem by fitting a Random Forest Classifier model.

3.1 Random Forest Classifier

By fitting the Random Forest Classifier (RFC) model, we estimate the probability that each neighborhood of Mexico City belongs to class 1, "Gentrified", or class 0, "Not Gentrified", given the values of the predictor variables, which we denote by X.

Equation 1 denotes the probability of belonging to class 1 given X, and Eq. 2 denotes the probability of belonging to class 0.

$$\begin{aligned} P(X)= Pr( Y = 1 | X) \end{aligned}$$
(1)
$$\begin{aligned} 1 - P(X)= Pr( Y = 0 | X) \end{aligned}$$
(2)

In the data set we have 16 observations with label 1 and 1,420 observations with label 0.

Fitting the model involves two stages: (a) training and (b) testing. As a first step, the data is divided into two sets, one for each stage: the training set and the test set. The training set is used to fit the model and estimate the parameters that will allow predictions to be made on new data. The test set is used to evaluate the model on data that it has never seen, and the accuracy of the model is estimated by comparing each prediction with the real value of the target variable. The data is divided so that the training set contains 80% of the observations and the test set the remaining 20%.
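The 80%/20% division can be sketched as below. The text does not state how the split was drawn, so splitting each class separately (a stratified split), the function name `stratified_split` and the fixed random seed are assumptions of this example. Stratification is a common choice when one class has very few observations, because it guarantees that both sets contain some of them.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split observation indices into train and test sets, class by class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Reserve about test_fraction of this class for the test set.
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes from the data set: 16 observations with label 1, 1,420 with label 0.
labels = [1] * 16 + [0] * 1420
train_idx, test_idx = stratified_split(labels)
```

Under these assumptions the test set receives 3 of the 16 gentrified neighborhoods, which illustrates how few positive examples are available to evaluate the model.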

We fit a Random Forest Classifier model with 45 trees. Next, we present the metrics used to measure the performance of the model.

3.2 Model Evaluation

Fig. 6 shows the confusion matrix, as well as the fitting error and the prediction error.

Fig. 6. Random forest confusion matrix

The anti-diagonal shows the errors of the model. False positives (FP) are the cases where the model classifies as Gentrified a neighborhood that is not, and false negatives (FN) are the cases where the model classifies as Not Gentrified a neighborhood that is gentrified.

Table 2. Detail of precision metrics

Table 2 shows the Classification Accuracy, Sensitivity, Precision and F1 Score metrics.

Since the classes are not balanced, that is, the proportion of observations with label 1 is much lower than the proportion with label 0, the metrics that matter in this case are Sensitivity, Precision and F1 Score.
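A short calculation shows why Classification Accuracy alone is not informative here. Consider a trivial classifier, introduced only for illustration, that labels every neighborhood as Not Gentrified. With the class sizes of the data set (16 observations with label 1 and 1,420 with label 0), its accuracy would be

$$\begin{aligned} \text{Accuracy} = \frac{1420}{1436} \approx 0.989 \end{aligned}$$

even though it does not identify a single gentrified neighborhood, so its Sensitivity is 0.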

Precision is the percentage of neighborhoods estimated as Gentrified that are actually gentrified, and Sensitivity is the percentage of actually gentrified neighborhoods that were estimated as Gentrified. F1 Score is a metric calculated from Sensitivity and Precision.
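These metrics can be computed directly from the four cells of a confusion matrix, as in the sketch below. The function name `classification_metrics` is hypothetical, and the counts in the example are arbitrary illustrative values; they are not taken from Fig. 6 or Table 2. F1 Score is computed as the harmonic mean of Precision and Sensitivity, which is its standard definition.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are real positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): share of real positives that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + sensitivity
    # F1: harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }


# Arbitrary illustrative counts, not results from this work.
example = classification_metrics(tp=3, fp=1, fn=1, tn=95)
```

With these counts, accuracy is 0.98 while precision, sensitivity and F1 are all 0.75, which shows how accuracy can look much better than the metrics that focus on the minority class.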

According to these metrics, in training, 100% of the neighborhoods estimated as gentrified are actually gentrified, and 100% of the neighborhoods that are gentrified were estimated as gentrified by the algorithm.

In the test set, on the other hand, 66% of the neighborhoods estimated as gentrified are actually gentrified, and 100% of the neighborhoods that are gentrified were estimated as gentrified by the algorithm.

In summary, on the training set the model identified every gentrified neighborhood and did so with a precision of 100%. On the test set it still identified every gentrified neighborhood, but with a precision of 66%.

4 Results

Once we have evaluated the model, it is time to create a map showing the probability of gentrification for each of the analyzed neighborhoods.

Due to the nature of this problem, it is more important to analyze the probabilities of belonging to class 1 than to obtain a model that perfectly estimates the classes. If the model assigns a neighborhood a high probability of being gentrified, it is because it finds patterns in the data that support this probability.

The group of "gentrifiable" neighborhoods is defined as those neighborhoods whose probability of belonging to class 1, the class with the label Gentrified, is greater than 5%. Table 3 shows the list of gentrifiable neighborhoods ordered by their probability.
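The selection rule above can be sketched as follows. Two assumptions are made for the example: the class 1 probability of a neighborhood is taken to be the fraction of the 45 trees that vote for class 1 (implementations may instead average per-tree class probabilities), and the neighborhood names and votes are made-up placeholders. The function names `forest_probability` and `gentrifiable` are hypothetical.

```python
def forest_probability(tree_votes):
    """Fraction of trees voting for class 1 (assumed hard voting)."""
    return sum(tree_votes) / len(tree_votes)


def gentrifiable(probabilities, threshold=0.05):
    """Keep neighborhoods above the threshold, highest probability first."""
    selected = [(name, p) for name, p in probabilities.items() if p > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Placeholder votes from a 45-tree forest (1 = Gentrified, 0 = Not Gentrified).
votes = {
    "neighborhood_a": [1] * 9 + [0] * 36,
    "neighborhood_b": [1] * 2 + [0] * 43,
    "neighborhood_c": [1] * 3 + [0] * 42,
}
probabilities = {name: forest_probability(v) for name, v in votes.items()}
ranking = gentrifiable(probabilities)
```

Under the hard-voting assumption, a 5% threshold on a 45-tree forest means that at least 3 trees must vote for class 1, since 2 of 45 is about 4.4% and 3 of 45 is about 6.7%.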

Table 3. Gentrifiable neighborhoods

The map in Fig. 7 shows the probability of gentrification, with the neighborhoods of highest probability in wine color and those of lower probability in pink.

Fig. 7. Gentrification map

This is the first time that such a map has been created for Mexico City, and it matches quite well some of the spots where inhabitants "feel" that the city is becoming gentrified.

5 Conclusions and Future Work

After building the model and analyzing the data, we can estimate which neighborhoods are likely to become gentrified. The Random Forest also highlights several important variables, which can be used in later social studies and in the design of new housing policies regarding gentrification in other neighborhoods.

In future work, we will create a mapping between cities in order to transfer what a model has learned to a different city. In this way, even if we do not know whether a city has gentrified neighborhoods, we can create estimators to calculate an index for previously unknown places.