
1 Introduction

Tourism plays an important role in the economic development of many geographical areas. Adequate analysis of tourists’ behavior is required to encourage innovation and to promote products in line with market trends and needs. Researchers have studied tourist flows through interviews and opinion polls conducted at the entrances of sites of interest, such as museums and national parks. This data gathering technique is expensive, is limited in spatial and temporal coverage, and does not offer a long-term basis for policies and marketing strategies.

Direct interviews require staff to conduct the interviews and the related analysis. Hence, they can be performed only for a limited number of days to contain costs. Since stratified information about the gender, nationality, age, etc. of visitors is of special relevance, the sample size needed to obtain statistically significant results from direct interviews would be prohibitively large.

Tickets at the entrance of the attractions are another way to count the number of visitors. However, a huge number of natural and monumental areas are just open spaces and there is no possibility to count people through ticketing. This is typically the case of many highly recognized public areas inscribed on the World Heritage List [1].

The aforementioned considerations motivate the search for new, effective methods to achieve a comprehensive understanding of tourists’ travel behavior. A solution comes from the huge amount of georeferenced information voluntarily posted by tourists on social media during their trips. Georeferencing refers to the association of GPS coordinates with a piece of digital data. For digital images posted on social media, the geotag tells where each image was taken. Geotagging also allows the images to be organized and arranged geographically on a map.

There are, by now, several very popular web photo sharing services (Flickr [2], Instagram [3], 500Pixels [4], etc.). Among these we chose to work with Flickr because of the rich information attached to each picture posted on this platform and the ready availability of a suitable Application Program Interface (API) for data extraction. The most relevant information is of course the record of geographical coordinates, but the dates on which each picture was taken and posted, together with information about the user (age, nationality, gender, etc.), are also collected. This information can then be used to build statistics and recommendation systems for tourist attractions and services. The relevance of such systems in the e-tourism domain is discussed in [5]. The authors of [6] present a method to capture travel information from geotagged photos on Flickr. First, they conduct a cluster analysis to identify the major areas of interest visited by inbound tourists in Hong Kong. Then, tourist movements are reconstructed from the time information of the pictures, extracted on a daily basis. To enrich this information with a model of tourist flow from one location to another, a Markov model is employed. With this approach the authors discovered different travel routes followed by tourists. The authors of [7] propose a recommendation system that combines topic models and Markov chains. Their final tourist behavior model provides a set of personalized travel routes that match the user’s current location, interests, and spare time. Another travel route recommendation system is proposed in [8], where the authors describe an approach to build structured models of human travel as a function of spatially-varying latent properties of locations and travel distances. Similarly to [6], the authors assume that human travel can be treated as a Markov process. By discretizing the desirability of locations and the distance between them, it is possible to analyze affinities between locations. In addition, individuals are grouped into clusters with distinct travel models. The latent models are then exploited to make predictions in new situations. Another approach in this context is the one described by Torrisi et al. [9]. The authors propose to use Flickr data to build a Visual Analytics tool that analyzes the distribution of tourists with respect to gender and nationality (e.g., domestic vs. foreign), also considering the date of the visit (e.g., distribution of pictures per month/year).

In this paper we introduce a simple model to learn tourist behavior by exploiting Flickr georeferenced data. Flickr data for selected areas are collected and organized to perform visual analysis. We process the information of the georeferenced images to infer the tourist flow through a graph representation whose nodes correspond to parts of the area to be monitored. Each image is associated with a specific node of the graph according to its GPS coordinates. Two cells, i.e., two nodes in the graph, are connected by a weighted edge: the weight is proportional to the number of distinct Flickr users that have posted photos taken in both cells. Each connection in the graph is hence equivalent to a likely route covered by tourists during a trip. This procedure allows us to infer information about the major routes followed by tourists during their visits. Statistical association rules that model the tourist traffic among neighboring sites are obtained using the Apriori data mining algorithm [10]. The most followed paths are then selected for further analysis. The proposed method helps tourism managers plan the transportation system, the activities of tour operators, and the positioning of accommodation and info points.

2 Proposed Method

In this section we detail the method used to extract insights about tourist behavior from social images.

2.1 Data Extraction and Exploration

Social media platforms generate a huge amount of data every day. An automated procedure that gives constant access to this data for tourism management purposes is hence very helpful. In addition to the browsing mode, Flickr allows users to download public data through its web-based API [11]. To use the API, a structured query is submitted to the service and a response is returned in XML, XML-RPC, JSON or PHP format. Flickr offers clear documentation and an intuitive API Explorer that allows users to try API methods in the browser. Tutorials and examples of how to use the API with different programming languages are also available.
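As an illustration of such a structured query, the following minimal sketch builds (without sending it) a request for geotagged photos using the `flickr.photos.search` method and a JSON response. It is not the code used in this work: the API key, the coordinates, the date, and the page size are placeholder assumptions, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_search_url(api_key, bbox, min_taken_date, page=1):
    """Build a flickr.photos.search request URL for geotagged photos.

    bbox is (min_lon, min_lat, max_lon, max_lat), i.e. the bottom-left
    and top-right corners of the area of interest.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,                      # placeholder, not a real key
        "bbox": ",".join(f"{v:.6f}" for v in bbox),
        "has_geo": 1,                            # only geotagged photos
        "min_taken_date": min_taken_date,        # date filter (assumed value)
        "extras": "geo,date_taken",              # ask for coordinates and shot date
        "per_page": 250,                         # page size (assumed value)
        "page": page,
        "format": "json",
        "nojsoncallback": 1,                     # plain JSON, no JSONP wrapper
    }
    return f"{REST_ENDPOINT}?{urlencode(params)}"


# Illustrative box roughly surrounding a mountain area (values are assumptions).
example_url = build_search_url("YOUR_API_KEY", (14.7, 37.5, 15.3, 38.0), "2005-01-01")
```

The resulting URL can be passed to any HTTP client; paging through the results is done by increasing `page`.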

The first step in extracting data from a social platform is the selection of an area of interest. The Flickr API offers two ways to retrieve geotagged pictures. The first retrieves the pictures located inside a circle specified by the latitude and longitude of its center and by a radius. The second retrieves the pictures located inside a bounding box specified by the coordinates of its bottom-left and top-right corners. There is however a problem: the Flickr server does not provide more than about 4000 distinct records for each query. This required us to adopt a simple recursive subdivision strategy using the bounding box approach. In particular, the initial box that includes the entire area of interest is split in a quadtree-like fashion until we have reasonable confidence that all the pictures in the area have been retrieved. This first set of records is then refined and integrated using other Flickr API methods. In particular, using flickr.photos.getInfo and flickr.people.getInfo, we complete the record of each picture with more information about time and location, and about the gender and origin of the photographer.
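The recursive subdivision can be sketched as follows. This is only an illustration of the quadtree-like idea, under the assumption that a callable `count_photos` (hypothetical, supplied by the caller) returns the number of photos the service reports for a box; the `max_depth` guard is also an assumption added to guarantee termination.

```python
from typing import Callable, List, Tuple

# (min_lon, min_lat, max_lon, max_lat): bottom-left and top-right corners
BBox = Tuple[float, float, float, float]


def split_bbox(bbox: BBox) -> List[BBox]:
    """Split a bounding box into its four quadrants."""
    min_lon, min_lat, max_lon, max_lat = bbox
    mid_lon = (min_lon + max_lon) / 2.0
    mid_lat = (min_lat + max_lat) / 2.0
    return [
        (min_lon, min_lat, mid_lon, mid_lat),  # south-west
        (mid_lon, min_lat, max_lon, mid_lat),  # south-east
        (min_lon, mid_lat, mid_lon, max_lat),  # north-west
        (mid_lon, mid_lat, max_lon, max_lat),  # north-east
    ]


def collect_boxes(
    bbox: BBox,
    count_photos: Callable[[BBox], int],
    limit: int = 4000,
    max_depth: int = 12,
) -> List[BBox]:
    """Return boxes covering bbox, each reporting at most `limit` photos.

    A box is kept as-is when its count is within the limit (or when the
    depth guard is reached); otherwise it is split into four quadrants
    that are processed recursively.
    """
    if count_photos(bbox) <= limit or max_depth == 0:
        return [bbox]
    boxes: List[BBox] = []
    for child in split_bbox(bbox):
        boxes.extend(collect_boxes(child, count_photos, limit, max_depth - 1))
    return boxes
```

Each returned box can then be queried page by page, and the results merged while discarding duplicate photo identifiers.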

We apply Visual Analytics methods to the data obtained so far in order to generate insights for further analysis. Latitude-longitude pairs give a representation of the spatial distribution of photographers inside the area of interest. By combining this chart with the nationality of the photographers it is also possible to obtain the proportion of photographers from each region, or to roughly distinguish between domestic and foreign visitors. However, one drawback of this type of representation is that it does not provide a density estimate of the tourists present at a given point. In other words, photos taken at the same point are superimposed in a trivial 2D visualization. To overcome this problem, we estimate a probability density function to investigate the properties of the data points. In particular, the Parzen-Rosenblatt window method is used [12]. Intuitively, this approach counts the number of samples belonging to a specified square region \(R_{(lon,lat)}\) of size \(h \times h\) surrounding a geolocalized point of interest with GPS coordinates (lon, lat). In our study, we use a Gaussian kernel centered on the geolocalized point of interest to compute the Parzen-Rosenblatt probability density function. Hence, given the whole set of image coordinates \(x_1, \ldots, x_n\), the probability density function at a location \(x=(lon,lat)\) is estimated as:

$$\begin{aligned} p(x) = \frac{1}{n} \sum _{i=1}^{n} \frac{1}{2\pi h^2} \exp \left( -\frac{||x-x_{i}||^2}{2h^2}\right) \end{aligned}$$
(1)

The probability density obtained in this way can be superimposed on a map to give information about the locations most visited by tourists. We also produce other charts useful to analyze the distribution of photographers with respect to gender or the date of the visit.
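A direct transcription of Eq. (1) is sketched below, using only the Python standard library. As in the equation, longitude and latitude are treated as planar coordinates and \(h\) is expressed in the same units; the function name and the sample coordinates are illustrative assumptions, not the data of this study.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def parzen_density(x: Point, samples: Sequence[Point], h: float) -> float:
    """Gaussian Parzen-Rosenblatt estimate of p(x), as in Eq. (1)."""
    if not samples:
        raise ValueError("at least one sample is required")
    norm = 1.0 / (2.0 * math.pi * h * h)       # 2D Gaussian normalization
    total = 0.0
    for lon_i, lat_i in samples:
        sq_dist = (x[0] - lon_i) ** 2 + (x[1] - lat_i) ** 2
        total += norm * math.exp(-sq_dist / (2.0 * h * h))
    return total / len(samples)                # average over the n samples


# Hypothetical photo coordinates: two nearby shots and one isolated shot.
photos = [(15.00, 37.70), (15.01, 37.70), (14.90, 37.80)]
dense = parzen_density((15.005, 37.70), photos, h=0.01)
sparse = parzen_density((14.95, 37.75), photos, h=0.01)
```

Evaluating the function on a regular grid of locations yields the surface that is overlaid on the map; the bandwidth `h` controls how smooth that surface is.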

2.2 Data Mining Techniques to Analyze Travel Preferences

The analytics tools described above are useful to understand the interest of tourists in a given area. Summaries and statistics of this data can give tourism managers an idea of the tourist population. However, this kind of analysis does not provide details about the paths most often followed by tourists during their trips. To this aim we take into account the location where the pictures were acquired as well as the time information of the collected images. In this way we are able to reconstruct tourists’ trips considering the sites they have visited and the duration of the trip (e.g., daily or weekly excursion).

For this kind of analysis the density distribution described in the previous sub-section is not very helpful. What is needed here is some kind of “discretization” of the locations inside the area of interest. We hence build a list of “sites”, which in this context are simply clusters of the locations where the pictures have been taken. These clusters are obtained with a naive, but effective, method: the area of interest is partitioned into square cells of uniform size. The pictures belonging to a “site” are those whose coordinates fall within the corresponding cell. This approach raises an issue: how should the scale (“granularity”) of the cells of the discretizing grid be selected? The choice of the proper scale is application-dependent: in large areas (e.g., natural parks, wildlife areas, etc.) it is reasonable to choose quite large cells (cell diameters in the order of thousands of meters), whereas in smaller areas (urban city centers) the cell diameter should not exceed hundreds of meters.
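The discretization step can be sketched as a simple mapping from coordinates to grid cells. In this sketch the number of cells per side, the zero-based (row, column) indexing starting from the bottom-left corner, and the function name are all assumptions; cells are equal in degrees, so they are only approximately square on the ground.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def cell_of(lon: float, lat: float, bbox: BBox,
            cells_per_side: int) -> Optional[Tuple[int, int]]:
    """Return the (row, col) grid cell containing a point, or None if outside."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
        return None
    col = int((lon - min_lon) / (max_lon - min_lon) * cells_per_side)
    row = int((lat - min_lat) / (max_lat - min_lat) * cells_per_side)
    # Points lying exactly on the top or right border belong to the last cell.
    col = min(col, cells_per_side - 1)
    row = min(row, cells_per_side - 1)
    return (row, col)
```

All pictures mapped to the same cell form one “site”; changing `cells_per_side` changes the granularity of the analysis.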

We keep track of the trips by taking into account the temporal sequence of the images of each photographer (Fig. 1). From all the trips carried out in the area of interest in a specific time interval by the same photographer, we obtain a weighted graph of the routes followed by the photographers. Formally, the proposed method is modeled by a directed graph \(G=(N,E)\), where N is the set of nodes and E is the set of edges. A directed edge from i to j indicates tourist flow from site i to site j. Two main parameters are involved in modeling the grid graph:

  • M: it represents the granularity of the grid. The higher the value of M, the greater the number of nodes in the grid graph (while the area of each site of interest decreases). In the example in Fig. 1, M is equal to 5.

  • \(\varDelta t\): it indicates the duration of the trip. In this way we can choose to model trips on a daily or weekly basis.

Fig. 1. Reconstruction of tourists’ trips. Yellow points represent the images taken by a photographer inside the area of interest. Each image is associated with a cell of the grid according to its GPS coordinates. The trip of the photographer is then reconstructed using the temporal sequence of the shot images. (Color figure online)
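The trip reconstruction and the weighted directed graph described above can be sketched as follows. Two details are assumptions made for this illustration only: a trip is closed when a photo is taken more than \(\varDelta t\) days after the first photo of the trip, and an edge is created between consecutive distinct cells of a trip. Edge weights count distinct photographers.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Hashable, List, Tuple

Photo = Tuple[str, datetime, Hashable]   # (user id, time taken, cell)
Trip = Tuple[str, List[Hashable]]        # (user id, ordered list of visited cells)


def build_trips(photos: List[Photo], max_days: int) -> List[Trip]:
    """Group each user's photos into trips lasting at most max_days."""
    by_user = defaultdict(list)
    for user, taken, cell in photos:
        by_user[user].append((taken, cell))
    trips: List[Trip] = []
    for user, shots in by_user.items():
        shots.sort(key=lambda shot: shot[0])      # temporal sequence
        current: List[Hashable] = []
        start = None
        for taken, cell in shots:
            if start is None or taken - start > timedelta(days=max_days):
                if current:
                    trips.append((user, current))
                current, start = [], taken        # open a new trip
            if not current or current[-1] != cell:
                current.append(cell)              # skip repeats of the same cell
        if current:
            trips.append((user, current))
    return trips


def flow_graph(trips: List[Trip]) -> Dict[Tuple[Hashable, Hashable], int]:
    """Weight of edge (i, j) = number of distinct users who moved from i to j."""
    users_per_edge = defaultdict(set)
    for user, cells in trips:
        for src, dst in zip(cells, cells[1:]):
            users_per_edge[(src, dst)].add(user)
    return {edge: len(users) for edge, users in users_per_edge.items()}
```

The dictionary returned by `flow_graph` is a sparse form of the weighted adjacency matrix: missing keys correspond to zero entries.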

The graph is represented through a weighted adjacency matrix. Each entry of the matrix indicates the amount of traffic flow between a pair of sites. To understand the data, an analyst may look at the graph to obtain a comprehensive view of all the paths followed by tourists inside the selected area. A limitation of this kind of visualization is that it erases the “sequentiality” of each photographer’s visit to the cells. In other words, given a pair of sites (i, j), the preferred routes cannot be directly estimated from the adjacency matrix. Some tourists could stop at a site, others might continue for one or more sites. Understanding how often a particular path (composed of 2 or more nodes) is taken by travelers is useful information for tourism managers in planning and managing the territory. For this reason we need to extract association rules between sets of nodes that have a certain probability of being jointly visited during a trip. To learn association rules among tourist sites we employ the Apriori algorithm [10]. The Apriori algorithm attempts to infer frequent subsets of sites which are common to different tourists with a minimum support. To this aim, Apriori uses a bottom-up approach, where frequent subsets of tourist sites are extended one site at a time and groups of candidates are tested against the data. This iterative process continues until no further successful extensions of the site subsets are found. The frequent itemsets of tourist sites computed by the Apriori algorithm provide association rules that highlight general trends about which sites are visited jointly, each with a specific confidence score. In our setting, for a given rule \(s^i_{(lon,lat)} \rightarrow s^j_{(lon,lat)}\), the confidence estimates the likelihood that the site \(s^j_{(lon,lat)}\) is visited during the same trip by a photographer who has visited the site \(s^i_{(lon,lat)}\).
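To make the bottom-up procedure concrete, the following is a compact pure-Python sketch of Apriori over trips, where each trip is reduced to the set of sites it visits. It is an illustration under these assumptions rather than the implementation used in this work; function names and thresholds are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, List, Tuple

Itemset = FrozenSet[Hashable]


def apriori(transactions: Iterable[Iterable[Hashable]],
            min_support: float) -> Dict[Itemset, float]:
    """Return every frequent itemset with its support (fraction of trips)."""
    trips = [frozenset(t) for t in transactions]
    n = len(trips)

    def support(itemset: Itemset) -> float:
        return sum(1 for t in trips if itemset <= t) / n

    # Level 1: frequent single sites.
    level = {frozenset([item]) for t in trips for item in t}
    level = {s for s in level if support(s) >= min_support}
    frequent: Dict[Itemset, float] = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


def association_rules(frequent: Dict[Itemset, float], min_confidence: float
                      ) -> List[Tuple[Itemset, Itemset, float, float]]:
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent,
                                  supp, confidence))
    return rules
```

Because itemsets are unordered, a pair of sites yields two rules with the same support but generally different confidences, one for each choice of antecedent.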

The last step of our analysis considers, one at a time, the strongest association rules generated by the Apriori algorithm. The aim is to get more details on the tourist flow between the nodes (sites) of the obtained association rules. In other words, it would be useful to have a comprehensive knowledge of the routes covered by tourists to move from one site of interest to another along these “most travelled paths”. Some tourists follow the main path, while others may choose auxiliary routes. For example, a tourist could leave the main route and choose a customized way because it offers particular naturalistic details that he or she wants to see and photograph. Given the main path, which paths branch off from it? How many tourists take these detours? Which part of the route is most appropriate for locating an accommodation facility for tourists? These are just some examples of questions that require travel information along specific paths. The easiest and quickest way to get this type of information involves the use of buffers in ArcGIS [13]. From the geometric point of view, a buffer is a polygon whose perimeter identifies the territory located at a distance from the path of interest between a minimum and a maximum value. Our analysis includes the creation of buffers of increasing size around the main path related to the association rules, in order to identify different bands of territory and observe how the tourist flow evolves across them.
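The idea of concentric buffer bands can be illustrated without a GIS package by computing the distance of each photo from the path. The sketch below is a stand-in for the ArcGIS buffer tool, not its API: it assumes planar coordinates in meters (e.g., after a map projection), a path given as a polyline, and hypothetical default values of four bands of 50 m each.

```python
import math
from typing import List, Optional, Sequence, Tuple

XY = Tuple[float, float]  # planar coordinates in meters (assumption)


def point_segment_distance(p: XY, a: XY, b: XY) -> float:
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:                       # degenerate segment
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Projection parameter of p on the segment, clamped to [0, 1].
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def distance_to_path(p: XY, path: Sequence[XY]) -> float:
    """Distance from p to the closest segment of the polyline."""
    return min(point_segment_distance(p, a, b) for a, b in zip(path, path[1:]))


def buffer_band(p: XY, path: Sequence[XY],
                step: float = 50.0, n_bands: int = 4) -> Optional[int]:
    """Index of the band containing p (0 = closest), or None if outside all."""
    band = int(distance_to_path(p, path) // step)
    return band if band < n_bands else None


def band_counts(points: Sequence[XY], path: Sequence[XY],
                step: float = 50.0, n_bands: int = 4) -> Tuple[List[int], int]:
    """Count photos per band and the photos outside every band."""
    counts, outside = [0] * n_bands, 0
    for p in points:
        band = buffer_band(p, path, step, n_bands)
        if band is None:
            outside += 1
        else:
            counts[band] += 1
    return counts, outside
```

Comparing the counts across bands shows how quickly the number of photos drops when moving away from the main path.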

3 Case Study

The case study considered here is the area around Mount Etna, an active volcano on the east coast of Sicily. The summit area of the volcano is a protected natural area of approximately 59,000 ha. The natural environment surrounding the highest active volcano in Europe is very attractive for visitors from all over the world, but because of its large extension and its openness there is no direct way to observe tourists’ behavior.

The bounding box used for data extraction includes the entire area of the volcano and the twenty small towns in its proximity. Flickr allows a search filter based on the date on which the photos were taken. Another parameter that can be used in the search is “accuracy”, a measure of how accurately the GPS coordinates of an image match the point where the image was taken. It can take any value between 1 (World level) and 16 (Street level). We conducted experiments with different levels of accuracy. Filtering images with the accuracy parameter only has the effect of reducing, at higher levels, the number of retrieved images; in the few experiments we ran, it did not affect the quality of the visual content of the images. Since a large dataset is important for the present application, we chose the minimum accuracy level. The final dataset is composed of 30,692 images taken by 2932 different photographers. Figure 2 shows some insights from the collected data. In particular, Fig. 2(a) reports the distribution of pictures inside the area of interest. From this information it is simple to infer that the most visited sites are located on the eastern and southern sides of the volcano. More details about the tourist concentration in these areas are obtained through the Parzen-Rosenblatt probability density estimate (Fig. 2(b)). The two peaks obtained with this approach correspond to two sites on the volcano. The highest one is related to “Rifugio Sapienza”, the starting point of many guided tours, which also hosts the entrance to the cableway facilities. The second peak corresponds to the central area surrounding the main active crater of the volcano. This area is of particular interest for tourists since Mount Etna is an active volcano and, during eruptions, tourists go as close to the crater as safety permits.
Some details about the photographers are shown in Fig. 2(c) and (d). The histograms report the gender distribution and the seasonal distribution. Both distributions agree with the general distribution of Flickr users. Indeed, there is no evidence that Mount Etna gets more visits from men than from women. The summer peak, on the other hand, can be explained by the milder climatic conditions on the mountain during July, August and September. These findings, obtained by applying the proposed techniques, confirm prior expectations about tourists’ interests on Mount Etna.

Fig. 2. Some examples of Visual Analytics charts of the collected dataset. (a) Distribution of photographs inside the area of interest and (b) its Parzen-Rosenblatt probability density estimation. (c) Gender and (d) monthly distribution of photographers.

3.1 Analysis of Tourist Movements

The next step of our proposal consists in analyzing the tourist traffic within the considered area. As described in Sect. 2.2, the area of interest is subdivided into a grid of \(M \times M\) cells. The value of M should be chosen carefully. To this aim, experts of the area of interest may be consulted. We tried three granularity levels: \(M=3,5,7\). The best results were obtained with \(M=5\). Experts confirmed that a \(3 \times 3\) grid has cells that are too large and loses important details. On the other hand, a \(7 \times 7\) grid produces too many empty cells and makes further analysis unnecessarily complex.

The second parameter needed to model the trajectories followed by tourists is the duration of the trip, measured in days. The longer the trip, the higher the probability that completely unconnected areas are jointly visited. Different choices were considered and we experimentally selected only trips of at most 7 days. Figure 3 shows the graph obtained with the method of Sect. 2.2. The weight of each edge indicates how many photographers have covered the route. To make the graph more readable, we report only the arcs with a weight \(\ge 10\). As can be seen from the graph in Fig. 3, tourist traffic is focused on the central part of the map, which corresponds to the path that brings tourists to the top of the volcano.

With the Apriori algorithm we obtained two association rules which confirm that the edge between the red nodes in Fig. 3 is the most travelled one. In particular the two symmetric rules that characterize the path between the two red nodes have \(Support = 8\%, Confidence = 22\% \) in one case and \(Support=8\%, Confidence=25\% \) in the other case.
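As a consistency check, and assuming the standard definitions of support and confidence over the set of trips, the two rules share the same support because they involve the same pair of sites, while their confidences differ because the two antecedent sites are visited with different frequencies. Denoting by \(s^i\) the antecedent of the rule with 22% confidence and by \(s^j\) the antecedent of the other rule, the implied site supports below are back-computed from the rounded values reported above, so they are only approximate:

$$\begin{aligned} \mathrm{conf}(s^i \rightarrow s^j) = \frac{\mathrm{supp}(\{s^i, s^j\})}{\mathrm{supp}(\{s^i\})} \;\Rightarrow\; \mathrm{supp}(\{s^i\}) \approx \frac{0.08}{0.22} \approx 0.36, \qquad \mathrm{supp}(\{s^j\}) \approx \frac{0.08}{0.25} = 0.32 \end{aligned}$$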

In the final step of our study we zoom in on the edge between the red nodes. The idea is to analyze which paths are chosen when moving between these two sites. The Etna paths run over recent and historical lava flows, through wooded areas or areas without tree vegetation. They often present a variety of slopes resulting from the changing morphology of the volcano. The paths that we focus on are those that connect “Rifugio Sapienza” (node (4, 3) in the graph) with “Torre del Filosofo” (node (3, 3) in the graph). More specifically, “Rifugio Sapienza”, at about 2000 m above sea level, is an area easily reached by car or bus, and local guides offer many trekking opportunities toward the higher areas starting from there. “Torre del Filosofo” (English: Philosopher’s Tower), at about 2900 m above sea level, is the mythical place where, according to legend, Empedocles killed himself by jumping into the lava. More realistically, it is a hill from which the main crater, at 3000 m above sea level, can be easily and safely observed. To investigate tourist behavior in this area we used ArcGIS to identify the main route taken by the guides. We considered four different buffers around the main path (Fig. 4(a)), with a maximum radius of 200 m from the center of the path and incremental steps of 50 m. Since the GPS coordinates recorded by acquisition devices may differ by several meters from the actual position, the first buffer covers a distance of 50 m around the main path. Figure 4 shows the details of the areas belonging to each buffer. The box in Fig. 4(a) and (b) contains 7259 images. Most of the pictures are located outside the path of interest and the four buffers. This means that tourists move largely away from the main road to explore the surroundings. Only 2333 photos fall within the four buffers. The first and the second buffers each contain 32% of these photos; the percentage then decreases to 19% in the third buffer and to 17% in the last one.
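The counts reported above make the share of photos taken near the main path explicit. Using only the two totals given in the text:

$$\begin{aligned} \frac{2333}{7259} \approx 0.32, \qquad \frac{7259 - 2333}{7259} = \frac{4926}{7259} \approx 0.68 \end{aligned}$$

That is, roughly one photo in three lies within 200 m of the main path, while about two thirds are taken farther away.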

Fig. 3. Travel model of photographers inside the Mount Etna area. (a) The area of interest is subdivided into 25 different sites of interest. (b) Weighted graph of the paths followed by tourists during their trips. Red points represent the endpoints of the path that is most covered by tourists in both directions. (Color figure online)

Fig. 4. Analysis of the path between the “Rifugio Sapienza” and “Torre del Filosofo” sites. (a) The dark blue line refers to the main path to be investigated. Four different buffers are considered with an incremental radius from the center of the main route. (b) Images taken inside the area of interest are shown on the map with the colour of the corresponding buffer. (c) Zoom of the final section of the route. (Color figure online)

The data considered in the case study reveal, through the proposed tool, some interesting insights. Our analysis confirms that the tourist population on Mount Etna is mainly distributed on the eastern and southern slopes of the volcano. Although the Etna area has undergone continuous tourism development, its slopes have different degrees of anthropization. The west side of the mountain is strongly linked to traditional agricultural activities; it is less developed from the tourism point of view, although it is the most suitable for nature-oriented tourism. The southern and eastern sides, on the contrary, offer more accommodation facilities. The presence of nature trails and ski facilities further contributes to making this area the most popular choice among tourists. The seasonal distribution of visits helps to understand when the presence of tourists is highest. Our analysis confirms the seasonal nature of tourism in the Etna territory. It highlights that winter tourism (from November to March) is practiced almost exclusively by winter sports enthusiasts, who usually take fewer photographs than tourists who prefer hiking, which is feasible mostly from April to October. The second part of our methodology produces a travel model of the tourist movements within the area of interest.

With the buffer analysis we noticed that a considerable number of pictures are taken outside the main path of interest. This indicates that many tourists customize their way to the northern part of the volcano.

4 Conclusion

In this paper we have proposed a framework, composed of different methods, that analyzes the tourist flow in an area of interest and extracts useful clues that help tourism managers carry out reliable tourism planning. In particular, we based our analysis on the data that are available today on social media platforms. We have built a tool that analyzes the spatial distribution of tourists. Using data mining techniques, we have also proposed a method to infer the tourist flow within the area of interest. The method has been applied to a specific case study, the area around Mount Etna, one of the sites included in the World Heritage List. The findings support and confirm the experts’ knowledge about this region.