1 Introduction

Place is an underestimated concept that is frequently used in everyday life, in sentences such as “This is my favourite place”, “I finally found my place in the world” or “Lay a place at the table for Mr. Twist”. In everyday language, we use the word place to refer to a city (e.g., New York), a public space (e.g., Central Park), a shop or even the seat we usually take at the table. These examples show that the concept of place spans a wide range of scales, from a single point to a wide area.

Place is also one of the main concepts in Geography, where it takes on a more structured representation. In particular, according to Cresswell [6, 7], the concept of place embodies three different aspects:

  • location: the physical absolute point in the space, identified by a set of coordinates;

  • locale: the visible features and settings of a place, such as streets, shops, parks and so on;

  • sense of place: the set of emotions and feelings that a place inspires in people. These sentiments can be subjective or shared: they are subjective when they are based on someone’s personal biography, and shared when a group of people feels the same sentiment towards a place.

Starting from Cresswell’s intuition, it is clear that a complete definition of a place cannot ignore a systematic analysis of the three aspects listed above. First, such an analysis involves identifying a place to focus on and then collecting a set of observations about it, for example feelings and emotions. Finally, the data must be processed in order to find the sense of place (SOP).

To accomplish such a task automatically, it is necessary to analyse a large amount of data that carries enough information to define a place in terms of its three aspects. Social networks are good sources of data for this analysis: millions of users post their activities, emotions, interests and opinions every day. For this reason the scientific community has become increasingly interested in these data, treating people in social networks as social sensors in fields such as politics, economics and sociology. Moreover, the possibility of associating a geographical reference with a post (the so-called geo-tag), or of inferring the location from significant hash-tags, has allowed scientists to develop map-based data analyses, which can also be used to identify disaster-affected areas or regions with high crime rates, as in the works of Cerutti et al. [5] and Ristea et al. [12], respectively.

To detect the SOP, we used Twitter as our source of information. Since Twitter allows tweets posted within a specific geographical region to be extracted, it was easy for us to fix both location and locale, leaving free the SOP, which we extracted from the tweets. We collected all tweets containing the word nyc (New York City) and located over the area of New York. Then, following the idea proposed in [14], we applied Latent Dirichlet Allocation (LDA) to the collected data. The assumption is that the topics generated by LDA can summarise the several SOPs shared by social sensors, since topics capture words that frequently co-occur with each other. The topics are used both to select tweets expressing the SOP and to visualise them on a map.

The remainder of this paper is structured as follows: Sect. 2 describes some works that use georeferenced data extracted from social networks and some applications of LDA to them; Sect. 3 describes the experiment we carried out and the results we obtained in detecting the SOP for the city of New York by applying LDA; finally, Sect. 4 concludes the article.

2 Related Works

Specifying a geographical reference in a shared post is nowadays a habit for the users of the most famous social networks, such as Twitter, Facebook and Instagram. In addition to these well-known platforms, other location-based services have emerged in recent years. Among them, we mention Trendsmap, which shows on a map the latest trends emerging from Twitter; Ushahidi, which collects and visualises information from crisis witnesses and gives users the possibility to respond; and FixMyStreet, which allows UK citizens to report street problems (potholes, unsafe walls, broken lampposts) to the local authorities. FirstLife [2] is a more interactivity-oriented service that focuses on the user as a citizen, giving them the possibility to interact with a map on which they can share events and news and even bring people together. Moreover, the data are associated with a temporal dimension, which allows users to filter and order the information according to time.

As previously mentioned, the widespread adoption of social networks and the possibility of acquiring the data posted by users through the RESTful APIs provided by the social networks themselves have encouraged the scientific community to analyse these data in different fields of interest.

Sakaki et al. [13] built on the intuition of treating users as social sensors in order to implement event detection. Through a semantic analysis of a collection of tweets and the application of location estimation methods, they were able to approximate the centres of earthquakes and the trajectories of typhoons. Cataldi et al. [4] extracted in real time the most emerging topics expressed by the community, based on the interests of a specific user in a particular time frame. Allisio et al. [1] exploited the temporal and spatial information associated with tweets in order to produce a daily estimate of the degree of happiness of the main Italian cities. An interactive map shows the data obtained by combining Sentiment Analysis and visualisation techniques. With reference to Cresswell’s definition of place, this work can be considered an experiment aimed at extracting the sense of place associated with a location.

Besides Sentiment Analysis techniques, Latent Dirichlet Allocation (LDA) [3] has also been successfully applied to data extracted from social networks. LDA is a probabilistic generative model that treats a document as a finite mixture of topics, where a topic is a distribution over the vocabulary. In detail, each topic captures word co-occurrences inside documents, which makes it possible to explore the document collection. Pennacchiotti and Gurumurthy [11] used LDA to discover users’ interests. In their model, users are represented as a mixture of topics, so the model can be used to suggest friends or people to follow simply by comparing topics. Zhang et al. [15] proposed a model called SSN-LDA (Simple Social Network LDA), which is able to find communities. In this case, the latent variables (topics) are the communities. Eisenstein et al. [9] argue that word co-occurrences are affected by geography. According to the authors, people living in one geographical region use a different vocabulary from people living in another. Thus, they treat the geographical area as a latent variable. Lau et al. [10] proposed an LDA-based method to track emerging events in microblogs. Finally, Di Caro et al. [8] proposed a framework called TMine which defines a navigable tag-flag: a kind of topic with the associated words.

3 Experiments

As described in the Introduction, we would like to extract the SOP defined by Cresswell [6, 7] from social networks. To address this research question, we applied Latent Dirichlet Allocation (LDA) [3] to find common topics expressed in users’ posts. Our idea is that topics can capture the SOP expressed by people regarding a place or a city. For instance, we may capture the fact that there is an ongoing concert in a park. To validate our assumption, we fixed location and locale to New York, searching for all tweets containing the word nyc (New York City) with the constraint that they are geo-located over the area of New York. The only free parameter is the SOP, which is extracted from the tweets.

Table 1. Some statistics about the dataset

3.1 Dataset Creation

We downloaded a set of 449054 tweets using the Twitter APIs. This set includes all the tweets in which the character sequence “nyc” appears somewhere in the tweet (either in the text or in the hash-tags).

Then, analysing the tweets, we noticed that some of them reported news or other information irrelevant to our task. Since the SOP concerns the sentiments expressed towards the city (e.g., a street, a park or a monument), we decided to filter out the tweets that express a neutral sentiment. To perform sentiment classification, we used Python’s TextBlob library, which includes a pre-trained classifier. After the classification, we found 395467 tweets expressing a non-neutral sentiment. However, many of them were duplicates due to re-tweets. For instance, we found the tweet “So this happened, gotta love NYC” 1620 times. We decided to remove those duplicates to create a dataset containing only unique tweets. The resulting dataset is composed of 120538 tweets.
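A minimal sketch of the two filtering steps above, under assumptions that the text does not state: a tweet is treated as neutral when its polarity score is exactly 0, the score comes from TextBlob’s default analyser, and duplicates are detected by comparing lowercased, whitespace-normalised text. All function names are hypothetical.

```python
from typing import Callable, Iterable, List


def textblob_polarity(text: str) -> float:
    """Polarity in [-1, 1] from TextBlob's default analyser (needs `pip install textblob`)."""
    from textblob import TextBlob  # imported lazily so the other helpers run without it
    return TextBlob(text).sentiment.polarity


def keep_non_neutral(tweets: Iterable[str],
                     polarity: Callable[[str], float] = textblob_polarity,
                     threshold: float = 0.0) -> List[str]:
    """Keep tweets whose absolute polarity exceeds `threshold` (assumed neutrality cut-off)."""
    return [t for t in tweets if abs(polarity(t)) > threshold]


def drop_duplicates(tweets: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each tweet, ignoring case and repeated whitespace."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = " ".join(tweet.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

The scorer is passed in as a parameter so that a different classifier can be swapped in without changing the filtering logic.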

Finally, we used a regular expression to remove from the set those tweets in which the character sequence “nyc” was part of another word and was not used as the acronym of “New York City”. We applied this regular expression to both the text and the hash-tags of the tweets. We thus obtained the final dataset of 21808 tweets, from which we extracted the topics using LDA. Table 1 contains some statistics about the collected dataset: the average length of tweets (expressed in terms of number of words, including hash-tags); the average number of hash-tags per tweet; and the vocabulary size of the collected dataset, that is, the number of different words that appear in the tweets.
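The exact regular expression is not reported; the sketch below shows one plausible form, assuming that “nyc” counts as the acronym only when it is delimited by word boundaries. The names NYC_TOKEN and mentions_nyc are hypothetical.

```python
import re

# \b is a word boundary, so "nyc" must not be glued to letters, digits or "_".
NYC_TOKEN = re.compile(r"\bnyc\b", flags=re.IGNORECASE)


def mentions_nyc(text: str) -> bool:
    """True when "nyc" appears as a standalone token or hash-tag (e.g. "#nyc")."""
    return NYC_TOKEN.search(text) is not None
```

With this pattern a tweet containing “#NYC” is kept, while one containing only “nycfc” is discarded. A side effect of the word-boundary choice is that hash-tags such as “#nyc2018” are discarded too, which may or may not match the original filter.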

We then extracted the frequent words contained in those tweets, both to see whether there are words that express a sentiment towards New York City and to analyse the content of the dataset. We started by removing stopwords, user names and links. Then, we lowercased the text of the tweets and stemmed the words. To extract and plot the frequent words we used the WordCloud tool. The resulting word cloud is shown in Fig. 1. In the image, we can find words that express a sentiment towards New York City, such as “love”, “happen” and “deadly”. Furthermore, the word cloud shows words related to weather, school and a march against weapons, meaning that those three topics were the most discussed in the days on which we collected the tweets.
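As an illustration of the counting behind such a word cloud, the following sketch tallies unigrams and adjacent-word bigrams over token lists that have already been cleaned and stemmed. The function name is hypothetical, and how the WordCloud tool counts internally is not described in the text.

```python
from collections import Counter
from typing import Iterable, List


def ngram_frequencies(token_lists: Iterable[List[str]]) -> Counter:
    """Count unigrams and adjacent-word bigrams over already cleaned token lists."""
    counts: Counter = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # unigrams
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))  # bigrams
    return counts
```

If the tool in question is the Python wordcloud package (an assumption), a frequency dictionary of this kind can be passed to its `WordCloud.generate_from_frequencies` method.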

Fig. 1. The figure represents the frequent words (unigrams and bigrams) present in the tweet dataset.

3.2 Topic Extraction

We used the Latent Dirichlet Allocation (LDA) model to extract the underlying topics present in the tweet dataset. LDA takes the number of topics as input and extracts, for each topic, a probability distribution over the vocabulary. Then, the top-k words of each distribution are selected to represent the topic.
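The top-k selection can be sketched as follows, assuming each topic is available as a mapping from words to probabilities. The function name and the alphabetical tie-breaking rule are illustrative choices, not part of the described method.

```python
from typing import Dict, List


def top_k_words(topic: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k most probable words of one topic (ties broken alphabetically)."""
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

For example, with the toy distribution {"love": 0.4, "nyc": 0.3, "snow": 0.2, "school": 0.1} and k = 2, the topic would be represented by “love” and “nyc”.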

In detail, our pipeline to extract the topics is the following: first, each tweet text was lowercased and tokenized, preserving user names, hash-tags and URLs. Then, we filtered out stopwords, user names and URLs. Finally, we stemmed the words. We also filtered out the words whose global frequency is less than 5. The resulting bags of words are given as input to the LDA model.
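The pipeline above might look roughly like the sketch below. The tokenizer pattern, the tiny stop-word list and the function names are assumptions made for illustration; the stemmer is left as a pluggable function (identity by default) because the text does not say which stemmer was used.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

# Tiny illustrative stop-word list; a real run would use a complete one.
STOPWORDS = {"a", "an", "and", "the", "in", "is", "to", "of", "i", "this"}

# A token is a URL, an @username, a #hash-tag or a plain word.
TOKEN = re.compile(r"https?://\S+|@\w+|#\w+|\w+")


def identity(word: str) -> str:
    """Placeholder stemmer that returns the word unchanged."""
    return word


def preprocess(text: str, stem: Callable[[str], str] = identity) -> List[str]:
    """Lowercase, tokenize, drop stopwords/usernames/URLs, then stem."""
    tokens = TOKEN.findall(text.lower())
    kept = [t for t in tokens
            if not t.startswith(("http://", "https://", "@")) and t not in STOPWORDS]
    return [stem(t) for t in kept]


def bags_of_words(texts: Iterable[str], min_freq: int = 5,
                  stem: Callable[[str], str] = identity) -> List[Counter]:
    """Build one bag of words per tweet, dropping globally rare words."""
    docs = [preprocess(t, stem) for t in texts]
    global_freq = Counter(tok for doc in docs for tok in doc)
    return [Counter(tok for tok in doc if global_freq[tok] >= min_freq) for doc in docs]
```

Any callable that maps a word to its stem can be passed as `stem`, for instance the `stem` method of a Porter stemmer object.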

We used the Gensim implementation of LDA and tuned its hyperparameters. We performed a random search over the number of output topics (trying 5, 10, 20 and 50), the number of passes through the corpus (trying 1, 10, 50 and 100), and the number of steps of the Expectation-Maximization algorithm (trying 100, 500 and 1000). We obtained good results by setting the number of topics to 20, the number of passes to 50 and the number of steps to 1000.
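A hedged sketch of such a random search with Gensim follows. It assumes that the “number of steps” corresponds to the `iterations` argument of Gensim’s `LdaModel`, and that the token lists have already been pre-processed; the helper names and the fixed random seeds are illustrative. How the candidate models were compared is not reported, so no scoring step is shown.

```python
import random
from typing import Dict, List

# Candidate values reported in the text.
SEARCH_SPACE = {
    "num_topics": [5, 10, 20, 50],
    "passes": [1, 10, 50, 100],
    "iterations": [100, 500, 1000],  # assumed to be the "steps" mentioned in the text
}


def sample_configs(n_trials: int, seed: int = 0) -> List[Dict[str, int]]:
    """Draw `n_trials` random hyperparameter combinations from SEARCH_SPACE."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            for _ in range(n_trials)]


def train_lda(token_lists: List[List[str]], config: Dict[str, int]):
    """Fit a Gensim LDA model on tokenised tweets (needs `pip install gensim`)."""
    from gensim.corpora import Dictionary  # lazy imports: only needed when training
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    return LdaModel(corpus=corpus, id2word=dictionary, random_state=0, **config)
```

The configuration reported as giving good results would correspond to `{"num_topics": 20, "passes": 50, "iterations": 1000}` under this mapping.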

3.3 Topic Selection and Analysis

Once we had extracted the topics, we had to select those that express the sense of place. We gave two annotators the extracted topics, together with some tweets associated with them (to help understand the topics), and asked them to judge whether each topic expresses the SOP. We then considered only the topics that both annotators judged to express the SOP. Table 2 shows some of the selected topics, each with an associated tweet.

Table 2. The table reports six topics with their first 5 words. For each topic, we also include an associated tweet that expresses the sense of place.

We decided to analyse the tweets associated with the selected topics to better understand how the SOP is spatially and temporally distributed over the city. We started by plotting the tweets that carry geographical information on a map, labelling them with their associated topic. From Fig. 2, we can see that the dominant topic is the blue one, which expresses people’s love for the city (see the pop-up in the image).
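The text does not say how a tweet is associated with a topic. One common choice, assumed here, is to label each tweet with its most probable topic and then count tweets per label to find the dominant one. The input format, a list of (topic id, probability) pairs per tweet, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def dominant_topic(doc_topics: List[Tuple[int, float]]) -> int:
    """Return the id of the most probable topic for one tweet."""
    return max(doc_topics, key=lambda pair: pair[1])[0]


def topic_counts(all_doc_topics: Iterable[List[Tuple[int, float]]]) -> Counter:
    """Count how many tweets are labelled with each topic."""
    return Counter(dominant_topic(doc) for doc in all_doc_topics)
```

The most frequent label, `topic_counts(...).most_common(1)`, would then play the role of the dominant topic on the map.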

Fig. 2. The figure shows the tweets over New York. The colours represent the topics. From the map, we can see that the dominant topic is the blue one. (Color figure online)

Fig. 3. The figure shows the tweets posted on Monday 19/03/2018.

We conducted a second analysis, dividing the tweets by posting date and by posting hour, to see whether an SOP could emerge on a particular day of the week (e.g., Monday) or at a particular time (e.g., from 4 pm to 8 pm). From the split of tweets by day, we noticed that on Monday tourists tweeted that they will miss New York (see Fig. 3) and that the schools will be closed due to snow (see Fig. 4) on Wednesday.
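The two splits can be sketched as below, assuming each tweet is available as a (timestamp, text) pair and that the day is divided into fixed four-hour slots. The slot width and the function names are assumptions chosen to match the “4 pm to 8 pm” example.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def group_by_weekday(tweets: Iterable[Tuple[datetime, str]]) -> Dict[str, List[str]]:
    """Group tweet texts by the day of the week on which they were posted."""
    groups = defaultdict(list)
    for posted_at, text in tweets:
        groups[WEEKDAYS[posted_at.weekday()]].append(text)
    return dict(groups)


def group_by_time_slot(tweets: Iterable[Tuple[datetime, str]],
                       slot_hours: int = 4) -> Dict[str, List[str]]:
    """Group tweet texts into fixed-width time slots such as "16:00-20:00"."""
    groups = defaultdict(list)
    for posted_at, text in tweets:
        start = (posted_at.hour // slot_hours) * slot_hours
        groups[f"{start:02d}:00-{start + slot_hours:02d}:00"].append(text)
    return dict(groups)
```

Each group can then be inspected separately, for example by listing its most frequent words or its dominant topic.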

Fig. 4. The figure shows the tweets posted on Friday 16/03/2018.

For the time-of-day analysis, unfortunately, we did not find a set of tweets expressing the same content. We believe that this is due to the nature of Twitter, where people tend to share their opinions, thoughts and news throughout the day.

4 Conclusions

In this paper, we created a dataset of tweets about New York City in order to extract the sense of place defined by Cresswell [6, 7]. In detail, we fixed location and locale to New York and tried to extract the sense of place (SOP) from the users’ posts. The SOP is detected by using Latent Dirichlet Allocation (LDA) [3] to extract topics that summarise the several sentiments that people express towards the city. Finally, we showed that it is possible to capture the SOP from tweets and that it may depend on the day of the week.

As future work, we plan to improve the pre-processing phase, since some unfiltered tweets do not express the SOP. Furthermore, we are interested in applying other LDA models to unveil the information present inside tweets.