Abstract
Geo-social media data involve various kinds of inhomogeneities. These can concern, amongst others, the users, but also spatial distributions or the fact that the most frequently used hashtags, keywords or emojis often have little relevance in the context under investigation. In order to properly tackle and reduce these inhomogeneities and to strive for a less distorted analysis, normalisation of geo-social media data is expedient. Various measures exist that are frequently used in research for this purpose. This paper presents four of these measures and compares them with each other, both theoretically and practically, the latter through three exemplary case studies that highlight the potentials and limitations of each measure. This comparison involves the relatively new typicality measure, which was developed specifically for this type of data following the dimensions commonly used to describe geo-social media data (temporal, spatial, social and thematic dimension).
1 Introduction and motivation
Since the rise of social media platforms such as X (previously Twitter and referred to as such in the following), Flickr or Instagram, not only have their use and their penetration of daily life increased, but also their utilisation in science for a wide variety of analyses. These include emotion or opinion analyses (Camacho et al. 2021; Gallegos et al. 2016; Han and Wang 2019; Hu et al. 2021; Lyu et al. 2022; Resch et al. 2016), the study of tourist behaviour (Encalada et al. 2019; Paolanti et al. 2021; Su et al. 2020; Teles da Mota and Pickering 2020; Zhang et al. 2020), road traffic or transportation analysis (Gu et al. 2016; Kuflik et al. 2017; Suat-Rojas et al. 2022; Zhañay et al. 2018; Zhang et al. 2018), disaster response (Shelton et al. 2014; Zahra et al. 2017) and crisis management and mental health during the COVID-19 pandemic (Abbas et al. 2021). The spectrum is enormously broad. Occasionally, depending on the case study, the spatial location of the posts is also included in the investigations. However, data from (geo-)social media are subject to various inhomogeneities and thus distortions.
Inhomogeneity in social media can be observed right from the user base, as social media users do not represent an average of the real population. There are several reasons for this. First of all, only a part of the total population uses social media, which is globally approximately 62% (Kemp 2024). The majority of content is generated by only a relatively small proportion of all users, around 25% (Antelmi et al. 2019). Secondly, social media users are younger than average: in April 2021, nearly 40% of all Twitter users belonged to the age group 25–34 (statistica 2022), whereas in the same year only 15% of the world's population belonged to that age group (United Nations n.d.). Thirdly, there is also a spatial inhomogeneity in terms of users, namely that the majority of social media users live in technologically developed countries. For example, although India has four times as many inhabitants as the USA, Twitter had three times as many users from the USA as from India in January 2022 (statistica 2024). Lastly, there are other factors influencing access to social media platforms, including technological skills or income levels. This phenomenon is known as the “digital divide” (Taubenböck et al. 2018).
In the spatial context, there is another inhomogeneity apart from the user-related one already described: the spatial distribution of posts from geo-social media is not homogeneous (Toivonen et al. 2019; Zhang and Zhu 2018), but rather represents touristic hotspots, population density or even the accessibility of geo-social media. But even if areas have similar geographical granularities, the information richness of contents, the credibility as well as user and usage characteristics can still vary widely (Zahra et al. 2017). This spatial inhomogeneity is illustrated in Fig. 1, where the spatial distribution of geo-referenced tweets in Europe from 2020 and the population density of the European Union (EU) are juxtaposed. The agglomerations of tweets in map (a) can be identified in map (b) as the most densely populated areas.
Fig. 1 (a) Spatial density of georeferenced Twitter posts in Europe in 2020 (except November); (b) population density in the European Union in 2021 (eurostat n.d.b) and in the UK in 2011 (data.gov.uk 2024). Reference units are each a 50 × 50 km grid
Another phenomenon that appears when analysing (geo-)social media data is that high absolute numbers of certain hashtags or emojis rarely signify relevance in the context being studied. The ten most frequently used emojis worldwide in the year 2021 are presented in Fig. 2. While studies show that there are differences in emoji usage across regions and languages (Barbieri et al. 2016; Chandra Guntuku et al. 2019), the most frequently occurring ones consistently include emojis from Fig. 2. In the authors’ experience as well, even when a geo-social media dataset is pre-filtered by space, time and/or topic, those emojis are among the most frequently used ones in this dataset. Although they all show some kind of liking or disliking and thus provide emotional information, this is not topic-specific. This means that concepts, symbols, activities etc. related to the topic under investigation cannot be recognised or derived from these ten most frequently used emojis despite their dominance.
Fig. 2 The 10 most frequently used emojis in 2021 (Daniel 2021)
All these inhomogeneities and distortions require a normalisation of geo-social media data in order to obtain conclusive and relevant findings within analyses by considering relative rather than absolute differences. For this purpose, various measures exist that are commonly used in research work. This paper presents these measures and compares them with each other, both theoretically and practically, the latter through three exemplary case studies. The comparison involves the relatively new typicality measure introduced by Hauthal et al. (2021), which was developed specifically for data from geo-social media.
2 Related work
In a number of scientific publications, geo-social media data are similarly viewed, described and processed in structuring ways (e.g. Di Minin et al. 2015; Dunkel et al. 2019; Yuan et al. 2013). Comparable or partially identical modelling exists for context-related computer systems (Baldauf et al. 2007; Etzion and Niblett 2010; Zimmermann et al. 2007). Although the actual terminology of these models differs, the dimensions can basically be described as follows:
- Spatial dimension (Where)
- Temporal dimension (When)
- Thematic dimension (What)
- Social dimension (Who)
The question words Where, When, What and Who have a descriptive character. In contrast, question words with an explanatory character, i.e. How and Why, are not found separately in such models or are integrated into other dimensions, although they actually make up a large part of many geo-social media analyses (Dunkel et al. 2019). A completely different system for categorising geo-social media data is proposed by Chen et al. (2017), involving three entities (network, geographic information, text and content).
The inhomogeneities described in the previous chapter can also be transferred to the dimensions mentioned. Each dimension defines different attributes of geo-social media data, all of which possess their own complexity and make their own specific methods of processing, analysis and normalisation necessary. For example, spatial information, i.e. geographical coordinates, can hardly be treated in the same way when it comes to normalisation as thematic information, i.e. written language.
Aside from normalisation, specific measures or indicators have already been developed in research work for the investigation of particular subjects. To give just a few examples: Du et al. (2018) developed a measure called social media self-control failure (SMSCF) to examine the extent to which users of social media indulge in its temptations. The indicator Social Media Political Participation Scale captures political action via social media in terms of active, expressive form, but also in terms of cognitive use (Waeterloos et al. 2021). To determine social media users' awareness of cybercrime, the psychometric measure Cybercrime Awareness on Social Media Scale (CASM-S) was created (Arpaci and Aslan 2023).
For the purpose of normalising data from geo-social media, various measures have been established. The universal measure of relative frequency is used quite commonly, for example to spatially normalise a selection of tweets by the total number of tweets in the area or by population numbers (Dahal et al. 2019). In many studies, the numerical statistic term frequency-inverse document frequency (tf-idf) can be found. For instance, Wang et al. (2017) visualise feature words from tweets in word clouds based on their tf-idf value. But tf-idf is also used for ranking hashtags or keywords within social media datasets (Habibi and Cahyo 2019; Khan et al. 2021). Furthermore, this statistical measure is often part of hashtag recommendation systems (Ben-Lhachemi et al. 2019; Kumar et al. 2021). Another statistical measure that is used in the context of social media analysis is signed chi-score. For example, it has been applied in existing studies on social media to examine efforts of politicians and their used hashtags when employing a media effect called framing (Hemphill et al. 2013), to measure political polarisation (Hemphill et al. 2016) or to explore landscape preferences (Dunkel et al. 2023) and experienced tranquillity in landscapes (Wartmann et al. 2019). The typicality measure, unlike the three previously described measures, was developed specifically for the normalisation of geo-social media data. It was used in the investigation of emojis (Hauthal et al. 2021; Levi et al. 2024) and also in a case study looking at the migration crisis in the EU (Mukherjee et al. 2022). Poorthuis et al. (2016), Shelton et al. (2015) and Shelton et al. (2014) use the less common odds ratio for spatial normalisation of Twitter data. Poorthuis et al. (2016) point out that normalisation can be done incorrectly. For example, when normalising based on population data, it is assumed that tweeting is done uniformly across the population, which is not the reality. 
The solution of Poorthuis et al. (2016), Shelton et al. (2015) and Shelton et al. (2014) is to normalise based on the number of people who tweet instead.
3 Measures for normalising geo-social media data
In the following section, the four measures mentioned above will be presented in more detail. Subsequently, similarities and differences will be highlighted.
3.1 Relative frequency
Relative frequency is a measure in descriptive statistics and represents a classification number. It reflects the proportion of elements in a sample set that have a certain characteristic value. Relative frequency is calculated by dividing the absolute frequency n of features with a certain characteristic in an underlying set by the total number N of samples in this set. Therefore, relative frequency is a ratio in the form of a fractional number and can take values between 0 and 1. Relative frequency can be converted into a percentage for easier interpretation.
Relative frequency is usually considered helpful in classifying data into spatial units. However, Visvalingam (1978) is critical of the use of ratios (including proportions and percentages) in a spatial context and explains two problems. First, ratios are strongly influenced by the sample size, and the spatial distribution of all samples is usually not constant. This tends to affect the final ratio more than the number of features studied. The second problem is the extreme values that can result from ratios due to large and small numerical differences. For example, if there is only one person within a spatial unit, the only possible values would be 0% or 100%. Although the possible values are extreme, this phenomenon could be common in certain areas, for example in the desert. On the other hand, ratios would also provide near-average results for variations in the feature set if the sample size is sufficiently large.
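The definition and the small-sample problem described above can be illustrated with a minimal Python sketch. The function name `relative_frequency` and the three spatial units with their counts are hypothetical and only serve as an illustration.

```python
def relative_frequency(n: int, total: int) -> float:
    """Relative frequency n / N of a characteristic within a set of N samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and total")
    return n / total


# Hypothetical spatial units: (posts with the feature, all posts in the unit)
units = {"city": (480, 12_000), "village": (3, 40), "desert": (1, 1)}

shares = {name: relative_frequency(n, total) for name, (n, total) in units.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With a single post in the hypothetical desert unit, the ratio can only be 0% or 100%, whereas the large city unit yields a moderate value even though it contains far more posts with the feature.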
3.2 Signed chi-score
The problems associated with ratios in a spatial context described above led Visvalingam (1983) to propose the use of the signed chi-square measure, which was originally developed for area-based policy research and partially mitigates these problems. For a single variable, over- or underrepresentation is indicated by a positive or negative sign, respectively, incorporating the observed value obs of a category and the expected value exp of this category in the calculation.
The major concern with the signed chi-square measure is the formulation of the expectation exp, which can also be regarded as a reference dataset. In terms of policy research or demographic research, statistical data such as census data have been found to be sufficient as a reference dataset to formulate expectation. However, Visvalingam (1976) discusses the issue of expectation further and warns that distorted distributions might alter the rankings significantly. She states that if the formula is used with an expectation based on political theory or social theory, the numerical formulation of expectation needs to be adequately justified.
The following variation of signed chi-score, as proposed by Visvalingam (1983), has been popularised by Wood et al. (2007) in the context of geo-social media for the calculation of so-called chi expectation surfaces:
Wood et al. (2007) also introduce normalisation as part of the calculation. Meanwhile, a modified and simplified form of this is in use, e.g. by Dunkel et al. (2023) or Gugulica and Burghardt (2023):
Visvalingam (1983) had proposed the use of Eq. (2) when variables under consideration are “mutually exclusive and together exhaustive”. She explains this phrase with the case of unemployment. In terms of policy research, the ideal situation of unemployment would be zero (expected); however, the use of Eq. (2) would lead to a division by zero. Therefore, she suggested using employment (with expectation set to 100) to infer the spatial distribution of unemployed versus the set expectation, since employment and unemployment can never apply to an individual simultaneously.
The previous example highlights an issue with the signed chi-score measure: the formulation of expectation. There are no requirements or rules for selecting this reference dataset. Wood et al. (2007) use population numbers as the expectation, Dunkel et al. (2023) use a random selection of geo-social media posts, and Gugulica and Burghardt (2023) use the overall number of posts in the study area. Thus, the expectation that is used as a reference dataset when calculating this measure can, in practice, be chosen arbitrarily; it could be related or unrelated to the topic, could be a superordinate dataset, etc. However, the choice of expected values has a major influence on the calculated results. This is why Visvalingam (1976, 1978, 1981, 1983) puts a lot of emphasis on detailed discussion when formulating expectations and a contextualised view of the results.
All signed chi-score values presented in chapter 4 are calculated using Eq. (4), as this equation has become established for the context of geo-social media data analysis.
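As a rough illustration of how such a score behaves, the following Python sketch uses the quantities obs, exp, ∑obs and ∑exp that appear in this paper. The way they are combined here, chi = (obs · norm − exp) / √exp with norm = ∑exp / ∑obs, is an assumption about one commonly used normalised form and should not be read as a verbatim reproduction of the equations referred to above. The function name `signed_chi` and all counts are hypothetical.

```python
import math


def signed_chi(obs: float, exp: float, sum_obs: float, sum_exp: float) -> float:
    """Signed chi-score with observations rescaled to the total of the expectation.

    Assumed form (not taken verbatim from the paper):
        chi = (obs * norm - exp) / sqrt(exp),  norm = sum_exp / sum_obs
    """
    if exp <= 0 or sum_obs <= 0:
        # exp = 0 would lead to a division by zero
        raise ValueError("exp and sum_obs must be positive")
    norm = sum_exp / sum_obs
    return (obs * norm - exp) / math.sqrt(exp)


# Hypothetical counts for one emoji: 1% expected share in the reference dataset
over = signed_chi(obs=30, exp=100, sum_obs=1_000, sum_exp=10_000)   # 3% observed
under = signed_chi(obs=5, exp=100, sum_obs=1_000, sum_exp=10_000)   # 0.5% observed

print(over, under)
```

A positive result indicates over-representation and a negative result under-representation relative to the chosen expectation. Swapping in a different reference dataset for `exp` and `sum_exp` changes the values, which is the sensitivity discussed above.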
3.3 Tf-idf
The tf-idf measure is a statistical measure used in information retrieval and text mining to assess the relevance of a term in a document within a document collection (Leskovec et al. 2014). tf-idf is calculated by multiplying the term frequency tf and the inverse document frequency idf.
The term frequency tf is determined by dividing the number of occurrences f of a term t within a document d by the maximum occurrence number of any term k in this document:
$$ tf_{t,d} = \frac{f_{t,d}}{\max_{k} f_{k,d}} $$
The inverse document frequency idf is the logarithm of a fraction made up of the total number N of all documents in the collection and the number n of documents within the collection that contain the term t:
$$ idf_{t} = \log \frac{N}{n} $$
Since the number of documents containing the term t within the total collection is included in the calculation of tf-idf in a counterbalancing way, the fact is taken into account that some words are naturally used more often than others. If a word occurs in only a few documents in a collection, but in these documents it tends to occur multiple times, then this word has a high tf-idf score (Leskovec et al. 2014).
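The calculation described in this section can be sketched in a few lines of Python. The function names, the toy document collection and the choice of the base-2 logarithm are assumptions made for illustration; the text above only specifies "the logarithm".

```python
import math
from collections import Counter


def tf(term: str, document: list[str]) -> float:
    """Occurrences of the term divided by the count of the most frequent term."""
    counts = Counter(document)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def idf(term: str, collection: list[list[str]]) -> float:
    """Logarithm (base 2 assumed) of N / n, n = documents containing the term."""
    n = sum(1 for doc in collection if term in doc)
    if n == 0:
        raise ValueError("term does not occur in the collection")
    return math.log2(len(collection) / n)


def tf_idf(term: str, document: list[str], collection: list[list[str]]) -> float:
    return tf(term, document) * idf(term, collection)


# Hypothetical collection of four tokenised documents
docs = [
    ["sun", "sea", "sun", "sun"],
    ["sea", "city"],
    ["sea", "coffee"],
    ["sea", "run"],
]

print(tf_idf("sun", docs[0], docs))  # frequent in one document, rare in the collection
print(tf_idf("sea", docs[0], docs))  # present in every document
```

In this toy collection, "sea" occurs in every document, so its idf and therefore its tf-idf is zero, while "sun" occurs several times in a single document and receives a high score.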
As mentioned previously, tf-idf was developed for the field of information retrieval. Such a weighting scheme was needed to achieve high recall and high precision (Salton and Buckley 1988). In other words, the aim is a weighting scheme that retrieves all relevant documents (high recall) and rejects all non-relevant documents (high precision). All classes of weighting schemes based on tf-idf perform well at this task, even in multiple languages (Harman 2005). However, idf is considered to be a heuristic measure, which has led several authors to try to underpin its robustness with theoretical explanations (Robertson 2004). Vector space models (Salton and McGill 1983), probability theory (Hiemstra 2000) and the inference network model (Croft and Turtle 1992), amongst others, have been used for this purpose. The large number of theoretical models is itself evidence that there is little consensus on why tf-idf weighting schemes work so well. Therefore, using the weighting scheme in a geo-social media context could lead to issues, some of which are described hereafter.
Yamasaki et al. (2015) see problems with the use of tf-idf for hashtag analyses. On the one hand, a hashtag usually appears only once per tweet, i.e. per document, which means that the term frequency is always 1. On the other hand, hashtags can be considered rare in the entire corpus, thus inverse document frequency loses its value. Yahav et al. (2019) raise similar concerns in the field of sentiment detection in social media, as tf-idf is used outside of its intended purpose. Content from social media consists of very short texts, which tf-idf is not designed for. In addition, there is rarely a large collection of documents available for reference, and the discourse between content creators does not only involve textual messages, but also writing styles, symbols, signs or abbreviations as a remedy for interaction beyond language. All this leads to bias in sentiment analysis involving tf-idf. Both works mentioned address the common issue of the corpus and its intended use. Tf-idf is based on the intuition that the user making a query has a priori knowledge of what is to be contained in a certain document (Hiemstra 2000). This, however, is not the case for geo-social media because new hashtags or tags could be created and thereby change the corpus. Essentially, the corpus of geo-social media would continuously suffer from this sparse data problem (Hiemstra 2000). However, efforts have already been made to adapt the principle of tf-idf to social media (called Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU)) by considering low data density as well as hashtag relevance in the calculation (Otsuka et al. 2014).
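The first concern can be made concrete with a small Python sketch. It assumes hypothetical tweets that are reduced to their sets of hashtags, so that each hashtag occurs at most once per tweet, and a base-2 logarithm for idf. Under these assumptions the term frequency of every present hashtag is 1 and the tf-idf score collapses to the idf value alone.

```python
import math

# Hypothetical tweets, each reduced to its set of hashtags (one occurrence each)
tweets = [
    {"#sunset", "#beach"},
    {"#sunset"},
    {"#sunrise", "#coffee"},
    {"#sunset", "#city"},
]


def tf(tag: str, tweet: set[str]) -> float:
    # Every hashtag occurs once, so the maximum count within the tweet is also 1
    return 1.0 if tag in tweet else 0.0


def idf(tag: str, collection: list[set[str]]) -> float:
    n = sum(tag in tweet for tweet in collection)
    return math.log2(len(collection) / n)


scores = {tag: tf(tag, tweet) * idf(tag, tweets) for tweet in tweets for tag in tweet}

for tag, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(tag, round(score, 3))
```

All hashtags that occur in only one tweet receive the same maximum score, regardless of how relevant they are to the tweet they appear in.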
3.4 Typicality
Typicality is a measure developed to quantify relative differences and is specifically adapted to the characteristics of geo-social media data. It was first used in Hauthal et al. (2021), but only as a tool and its actual properties and behaviour were not canvassed in that publication despite its novelty.
Typicality makes it possible to determine how typical or atypical an object under investigation, such as an emoji or hashtag, is in a sub-dataset, in relation to the total dataset the sub-dataset is taken from. The formation of this sub-dataset is based on the different dimensions that can be used to describe data from geo-social media and were introduced earlier. The development of typicality was inspired and influenced by these dimensions. Thus, the selection of a sub-dataset can be spatial (e.g. a country within a continent), temporal (e.g. a month within a year), thematic (e.g. one of several topics), social (e.g. one of several user groups) or any combination of these dimensions. The extent of the total dataset can be chosen flexibly; the only condition is that the sub-dataset is a subset of the total dataset.
Two relative frequencies are involved in the calculation of typicality: the relative frequency fS of the object under investigation in the sub-dataset and the relative frequency fT of that object under investigation in the total dataset. The difference between these two values is formed and divided by the latter relative frequency for normalisation purposes. This means that typicality t is formulated as follows:
$$ t = \frac{f_{S} - f_{T}}{f_{T}} $$
The values resulting from the calculation have the following meaning:
- t > 0: the object under investigation is typical within the analysed sub-dataset
- t < 0: the object under investigation is atypical within the analysed sub-dataset
- t = 0: in this case, fS and fT are identical, which is why the analysed object is neither typical nor atypical within the analysed sub-dataset
The term a-/typical thus refers to the occurrence of an object under investigation in the sub-dataset and not in the totality of the dataset.
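The three cases listed above can be reproduced with a minimal Python sketch of the unscaled typicality, i.e. the difference between fS and fT divided by fT. The function name `typicality`, its parameter names and the counts are hypothetical. The input check reflects the condition that the sub-dataset must be part of the total dataset.

```python
def typicality(count_sub: int, total_sub: int, count_total: int, total_total: int) -> float:
    """Unscaled typicality t = (f_S - f_T) / f_T of one object under investigation."""
    if not (0 <= count_sub <= count_total and 0 < total_sub <= total_total):
        raise ValueError("the sub-dataset must be a subset of the total dataset")
    if count_total == 0:
        raise ValueError("the object does not occur in the total dataset")
    f_s = count_sub / total_sub        # relative frequency in the sub-dataset
    f_t = count_total / total_total    # relative frequency in the total dataset
    return (f_s - f_t) / f_t


# Hypothetical emoji occurring in 1% of the total dataset
typical = typicality(30, 1_000, 100, 10_000)   # 3% in the sub-dataset
neutral = typicality(10, 1_000, 100, 10_000)   # 1% in the sub-dataset
absent = typicality(0, 1_000, 100, 10_000)     # not used in the sub-dataset

print(typical, neutral, absent)
```

An object that never occurs in the sub-dataset yields the lower bound of −1, whereas there is no upper bound for objects that are much more frequent in the sub-dataset than in the total dataset.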
Typicality can take values from −1 to +infinity. This range of values makes it difficult to compare typicality values across datasets. One solution is to scale the positive and negative range in such a way that the positive maximum value becomes +1 and the negative maximum value −1. Thus, a symmetrical value range from −1 to +1 can be achieved, with +1 indicating the most typical object under investigation and −1 the most atypical.
For this type of normalisation, we recommend applying a sigmoid function to the calculated typicality values, as it does not squeeze values in the upper positive range, so that peaks and nuances remain clearly recognisable after scaling. We have calculated normalised typicality tn as follows:
All typicality values presented in chapter 4 are scaled using Eq. (8).
3.5 Comparison
All four introduced measures have in common that they make it possible to determine relative differences in a normalising way. Nevertheless, there are various similarities and differences amongst the measures.
While relative frequency and tf-idf can only take positive values, typicality and signed chi-score can also be negative, as they indicate whether the occurrence under investigation is a-/typical or over-/under-represented, respectively. Both typicality and signed chi-score calculations include a reference dataset, but these measures differ strongly with respect to the requirements for this reference dataset. For signed chi-score, basically any dataset can be used as a reference. However, the choice of this reference dataset strongly influences the results and therefore signed chi-score values cannot be considered comparable across use cases. By contrast, in the case of typicality, there is a strict requirement: the sub-dataset under consideration must be part of the reference, i.e. the total dataset.
Relative frequency and tf-idf have the exclusively positive value range in common, but differ in their basic meaning. Relative frequency indicates the proportion of elements with a certain characteristic in a total dataset, whereas tf-idf returns a weighting that indicates the relevance of a term within a collection of documents. Tf-idf is designed for the analysis of language, whereas relative frequency can be used much more universally, but does not provide a weighting. The outlined comparison is summarised in Table 1.
4 Examples
In the following subsections, by using three examples, the measures relative frequency, tf-idf, signed chi-score as well as typicality are both demonstrated and compared. The focus of the three examples is slightly more on the typicality measure than on the other measures, as this is still fairly unexplored and it is worth obtaining findings and empirical values on it.
The following three examples make use of two datasets, each having different characteristics with respect to its data source (i.e. the accessed geo-social media platform), its spatial extent, the subject under investigation as well as the formation of the sub-dataset. An overview of all properties as well as details about the respective typicality calculations is given in Table 2. What the three examples have in common is the object under investigation: in each case, emojis are examined, which, in the opinion of the authors, are well suited for the intended purposes due to their characteristics described in chapter 1. This also ensures consistency across the three examples. The subjects under investigation may seem simple and obvious, but this is precisely what makes them suitable as examples, as it is relatively easy to assess and verify the obtained results.
Emojis have proven to be a rich source of information in (geo-)social media (Hauthal et al. 2021; Levi et al. 2024). The choice of words is not random, and neither is the choice of emojis (Na'aman et al. 2017). Emojis can express emotional states, they can function as decoration or as a substitute for lexical units; they can also reveal information not explicitly mentioned in the text or provide contextual information on images posted in (geo-)social media (Danesi 2017; Ge and Gretzel 2018; Illendula et al. 2018; Novak et al. 2015; Pohl et al. 2017).
For the calculations of tf-idf in all three examples, an emoji was regarded as a term and all posts in a sub-dataset were considered as a document collection. Only posts that contain emojis at all were taken into account for the calculation of tf-idf, as emojis are the focus of our investigation and posts without emojis are therefore not relevant.
4.1 Example 1: thematic formation of sub-datasets
The first example uses a global dataset from Instagram that is related to the subject sunrise and sunset by filtering for keywords that mean sunrise or sunset in the four languages English, German, French and Dutch. The dataset covers the time period August 2017–mid-January 2018. The two investigated sub-datasets were formed thematically: one refers to sunrise, the other to sunset, based on the mentioned keywords. Together, the two sub-datasets form the total dataset. The objects under investigation are the 100 most frequently used emojis per sub-dataset.
For the calculation of tf-idf, the number of all emoji-containing posts per sub-dataset is required. Since a privacy-aware approach based on the HyperLogLog data abstraction format (Dunkel et al. 2020) was used for this example, this number could not be reconstructed. Only the number of all posts (i.e. posts with and without emojis) was known, which is why we worked with an approximate value that is 56.5% of this total number. According to a study, 56.5% of all Instagram posts in July 2017 contained one or more emojis (Kmieckowiak 2017).
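The approximation described above amounts to a single multiplication, sketched here in Python. The constant name, the function name and the post count of 200,000 are hypothetical; only the share of 56.5% is taken from the text.

```python
EMOJI_POST_SHARE = 0.565  # share of Instagram posts containing emojis (Kmieckowiak 2017)


def estimate_emoji_posts(total_posts: int) -> int:
    """Approximate the number of emoji-containing posts from the number of all posts."""
    return round(total_posts * EMOJI_POST_SHARE)


# Hypothetical sub-dataset with 200,000 posts in total
n_emoji_posts = estimate_emoji_posts(200_000)
print(n_emoji_posts)
```

The resulting estimate takes the place of the number of documents in the collection when calculating idf, so any deviation of the true share from 56.5% shifts all tf-idf values of a sub-dataset in the same direction.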
In the following, Figs. 3 and 4 compare different measures. The emoji clouds in both figures show the 20 emojis with the highest values per measure for sunrise and sunset. The size ratios of the displayed emojis correspond to the relative value differences within each emoji cloud. All emojis are generally displayed in light blue (except for typicality), unless they are also found in one of the two typicality emoji clouds, in which case they are dark blue. Dark blue therefore indicates an overlap with the 20 most typical emojis for sunrise or respectively sunset.
Figure 3 compares the four measures relative frequency, tf-idf, typicality and signed chi-score, with the latter taking the total dataset used to calculate the typicality values as the expectation. The 20 emojis with the highest relative frequency hardly differ for the two thematic sub-datasets (90% overlap) and also provide very general findings, namely that the subject under investigation is a natural phenomenon, with the sun playing the central role, that is photographed and appreciated. Tf-idf provides similar results, except that the maximum value stands out significantly more. In contrast, the emojis with the 20 highest typicality values for the two sub-datasets differ significantly from each other (10% overlap). They reveal much more detailed information about activities that are usually associated with the events of sunrise (drinking coffee in the morning, jogging, mountain hiking) and sunset (perception in urban spaces, at the beach or social events), during which the photographs are taken. More insights on emojis as contextual indicators based on the dataset used here can be found in Hauthal et al. (2021).
The emojis with the highest signed chi-score values are identical to the top 20 typicality emojis, including their rank order. Only the relative ratios between the emojis are slightly different, but this may be due to the scaling of the typicality values by the sigmoid function presented earlier. This high similarity can be explained by the use of the same reference dataset when calculating the two measures.
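The close relationship between the two measures under a shared reference dataset can be checked numerically. The sketch below assumes the normalised signed chi-score form chi = (obs · ∑exp / ∑obs − exp) / √exp, which is an assumption and not a verbatim reproduction of the equation used in this paper, together with unscaled typicality. Under this assumption, both measures share the numerator fS − fT and chi equals t · √exp, so the signs always agree while the magnitudes are weighted differently. All emoji names and counts are hypothetical.

```python
import math

# Hypothetical totals: emoji uses in the sub-dataset and in the total dataset
sum_obs, sum_exp = 1_000, 10_000

# Hypothetical counts per emoji: (obs in the sub-dataset, exp in the total dataset)
counts = {"coffee": (30, 100), "heart": (110, 1_000), "moon": (2, 80)}

results = {}
for name, (obs, exp) in counts.items():
    f_s, f_t = obs / sum_obs, exp / sum_exp
    t = (f_s - f_t) / f_t                                   # unscaled typicality
    chi = (obs * sum_exp / sum_obs - exp) / math.sqrt(exp)  # assumed chi form
    results[name] = (t, chi)

for name, (t, chi) in results.items():
    print(f"{name}: t = {t:+.3f}, chi = {chi:+.3f}")
```

The factor √exp gives more weight to emojis that are frequent in the reference dataset, which is one possible source of differing relative ratios between the two measures.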
Figure 4 presents further emoji clouds resulting from signed chi-score calculations, but this time using a different expectation (exp), i.e. a different reference dataset. For this purpose, we used a Twitter dataset containing all geo-referenced posts from Europe in 2020 (without thematic pre-filtering) and used it in three ways for the generation of a reference dataset, i.e. the formulation of an expectation (the numbers below correspond to the numbers in the header of Fig. 4):
1. all tweets are assigned to the topic of sunrise or sunset based on the same four-language keywords used for filtering the Instagram dataset under investigation (= 2 thematic sub-datasets, the total quantity of those forming the total dataset)
$$ \rightarrow exp = {\text{frequency of investigated emoji in the respective reference sub-dataset}}, \, \sum exp = {\text{frequency of investigated emoji in the total reference dataset}}$$
2. no thematic filtering of the Twitter reference dataset
$$ \rightarrow exp = {\text{frequency of investigated emoji in the reference dataset}}, \, \sum exp = {\text{number of all emojis in the reference dataset}}$$
3. for exp: all tweets are assigned to the topic of sunrise or sunset based on the same four-language keywords used for filtering the Instagram dataset under investigation, plus temporal filtering to reduce the amount of data: only January 2020
$$ \rightarrow exp = {\text{number of all sunrise- or respectively sunset-related tweets in the reference dataset (i.e. the investigated emoji is not taken into account here)}}, \, \sum exp = {\text{number of all tweets in the reference dataset (without thematic filtering)}}$$
All three expectations have in common that, although they use data from a different geo-social media platform, they are still quite similar to the observed dataset because both originate from geo-social media. However, the expectations are formulated in different ways around the topic, ranging from a typicality-identical and therefore highly topic-related approach to no thematic filtering, as in the case of expectation 2.
The emoji clouds for expectation 1 are different from the typicality emoji clouds, although it would have been expected that the reference dataset used would be similar due to the same procedure for its generation. The top 20 emojis for signed chi-score with expectation 1 (the huge bar in the emoji cloud for sunrise is a giant minus sign) are more general in nature and less topic specific, as with relative frequency. In contrast, the emoji clouds for expectation 2 show more overlap with the typicality emoji clouds, but more in the range of top 10–top 20, while the emoji clouds for expectation 3 overlap completely with the typicality emoji clouds. This may be due to the fact that there is the same exp value for all emojis related to sunrise, as well as for all emojis related to sunset, and ∑exp is the same for every emoji examined. When, in case of expectation 3, comparing the signed chi-score calculations for the different emojis, obs and ∑obs make the main difference and therefore have a stronger impact on the calculation.
This example demonstrates on the one hand very well the thematic superficiality of high-frequency emojis addressed in chapter 1, and also shows how the typicality measure can be used in a straightforward way to get initial, yet in-depth insights into a topic. On the other hand, it is shown how strongly the formulation of the expectation, i.e. the selection of the reference dataset, influences the results of the signed chi-score calculations. Only three possible formulations of the expectation were outlined here. There are numerous other possibilities and therefore further possible outcomes.
4.2 Example 2: temporal formation of sub-datasets
The second example uses a Twitter dataset that covers Europe and contains all tweets from 2020 (with the month of November missing). Only tweets containing emojis are considered. There was no thematic pre-filtering. The dataset is temporally divided into 11 sub-datasets by month (eleven instead of twelve due to the data gap in November). The following calculations are not based on the number of emojis (although this is sometimes formulated in this way for the sake of simplicity), but on the number of tweets containing a specific emoji or respectively containing emojis at all.
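The counting rule described above (tweets containing an emoji rather than emoji occurrences) can be sketched as follows. The input structure, the emoji names and the function names are hypothetical and chosen purely for illustration; they are not taken from the processing pipeline used for the paper.

```python
from collections import Counter, defaultdict

# Hypothetical input: (month, emojis found in the tweet, possibly repeated).
tweets = [
    (3, ["medical_mask", "medical_mask", "microbe"]),
    (3, ["tears_of_joy"]),
    (3, ["medical_mask"]),
    (6, ["raised_fist", "red_heart"]),
    (6, ["tears_of_joy"]),
    (6, []),                                # no emoji: not considered
]

def monthly_counts(tweets):
    """Count tweets per month that contain any emoji / a specific emoji."""
    totals = Counter()                      # month -> tweets with at least one emoji
    per_emoji = defaultdict(Counter)        # month -> emoji -> tweets containing it
    for month, emojis in tweets:
        if not emojis:
            continue
        totals[month] += 1
        for emoji in set(emojis):           # each emoji counts once per tweet
            per_emoji[month][emoji] += 1
    return totals, per_emoji

def relative_frequency(emoji, month, totals, per_emoji):
    """Share of emoji-bearing tweets in a month that contain the given emoji."""
    return per_emoji[month][emoji] / totals[month]

totals, per_emoji = monthly_counts(tweets)
print(relative_frequency("medical_mask", 3, totals, per_emoji))
```

Using a set per tweet is what keeps a repeated emoji from being counted more than once.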
Example 2 was inspired by our previous work done in Levi et al. (2024). The aim is to examine six different emojis. Three of them (medical mask, microbe and syringe emoji) are likely to have a connection to the COVID-19 pandemic, which made its way to Europe in 2020 and is the focus, i.e. the subject under investigation of this example, even though no corresponding thematic filtering of the dataset has been carried out. Furthermore, we will also investigate the raised fist emoji, representing the Black Lives Matter (BLM) movement, which was already founded in 2013 but experienced an upswing in 2020, as well as the two emojis face with tears of joy and red heart, which are generally the most frequently used emojis of all (Daniel 2021) and can therefore be considered topic-neutral due to their universality. The last three emojis mentioned are intended to provide a contrast to the first three COVID-19-prone ones.
Figure 5 provides an overview of the timeline of reported cases of the COVID-19 disease in Europe as well as the absolute number of tweets in the analysed dataset with hashtags containing "covid" or "corona" in order to give an impression of the development of the pandemic as well as the absolute quantities of this topic on Twitter. In addition, the time frames of the first two COVID-19 waves that hit Europe are indicated.
Timeline of 2020 with reported COVID-19 cases in Europe (European Centre for Disease Prevention and Control 2022) and the quantities of this topic in the analysed Twitter dataset
Figure 6 shows the relative frequency, tf-idf, typicality and signed chi-score of the six investigated emojis for the eleven months in 2020; in the case of signed chi-score, the total dataset used to calculate the typicality values serves as the expectation. The relative frequency (see Fig. 6a) shows the percentage share of the respective emoji among all emojis used in the corresponding month. The popularity of the emojis face with tears of joy and red heart is evident in Fig. 6a, as a result of which there are greater variations between months than with the other emojis, for which the relative frequency of use can be described as nearly constant. The temporal pattern of tf-idf is similar (see Fig. 6b). However, in the case of typicality (see Fig. 6c), the two emojis face with tears of joy and red heart show hardly any variations. The values for both hover consistently around zero, i.e. they are neither typical nor atypical in all months of 2020. In contrast, the microbe and medical mask emoji have a positive typicality peak in March, and continue to be positive in April and May, when the first COVID-19 wave took its course in Europe and brought with it a series of protective measures, such as mandatory face masks. In the other months, both emojis are atypical, except for the medical mask emoji in July, when measures were relaxed and masks once again became a subject of discussion. Although stricter measures were enforced again in the second wave from autumn onwards and, above all, the number of reported cases increased rapidly, this is not evident in the typicality curve of these two emojis. The tension and uncertainty caused by the first wave seemed to have been overcome. The syringe emoji is largely atypical in 2020 and only takes on a positive typicality in November and particularly in December.
This can be explained by the fact that the first COVID-19 vaccine was authorised for use in the EU in the second half of December and, of course, there were corresponding discussions and reports in advance. The raised fist emoji has a positive typicality peak in June and is otherwise atypical (apart from a very low positive typicality in December). This reflects the death of African-American George Floyd during a violent arrest in Minneapolis at the end of May 2020, an event that received a lot of media attention and led to numerous worldwide protests against police violence organised by the BLM movement in the following month. The signed chi-score presented in Fig. 6d shows the same trend for all emojis as typicality, although with different values. As in the previous example, this is due to the identical reference dataset used as the expectation in the calculation. Sorting the typicality and signed chi-score values for each emoji results in the same order of months, i.e. of sub-datasets, for both measures. This means that the relationships between the values, and also the ± signs, correspond to each other across the two measures.
Two further calculations of signed chi-score were carried out with other reference datasets, i.e. expectations, that are intended to place the calculations in a COVID-19 context, which does not yet exist, as the observed dataset is thematically unfiltered. The expectations are formulated using the following data (the numbers below correspond to the numbers in the header of Fig. 7):
1. posts from the same Twitter dataset containing at least one hashtag
$$ \rightarrow exp = {\text{number of tweets per month with hashtags containing either "covid" or "corona"}}, \, \sum\nolimits_{exp} = {\text{number of all tweets per month with any hashtag}}$$
2. reported COVID-19 cases in Europe (European Centre for Disease Prevention and Control 2022)
$$ \rightarrow exp = {\text{number of cases per month}}, \, \sum\nolimits_{exp} = {\text{number of all cases in 2020 (except November)}}$$
The results of the calculations based on expectations 1 and 2 are shown in Fig. 7. The signed chi-score values for expectation 1 (see Fig. 7a) show the peaks that can also be identified in the diagrams in Fig. 6c and d and that can be assigned to the events described earlier. Otherwise, there are hardly any negative values, i.e. emojis are barely underrepresented, especially in January. The universality of the emojis face with tears of joy and red heart does not become evident, as they are subject to strong variations in contrast to typicality. In the case of expectation 2 (see Fig. 7b), the signed chi-score values of all emojis undergo a sharp drop from January to April, where all curves show a downward peak, and from September onwards all emojis are negative, i.e. underrepresented. For all emojis, except for the microbe and medical mask emoji, values in January are significantly higher than in the other months (between ~ 210,000 and ~ 28,000) and are not shown in the diagram in Fig. 7b.
This example demonstrates how emojis, which are of varying popularity and universality, are manifested in different values for the different presented measures, but this is highly dependent on the reference dataset used. The two used expectations 1 and 2, despite their supposed thematic relevance, result in temporal trends that do not consistently reflect developments as they occurred during the pandemic and also the different characteristics of emojis in terms of popularity and universality are not always apparent.
4.3 Example 3: spatial formation of sub-datasets and extent of the typicality total dataset
In the third example, typicality is applied spatially and the effect of different sizes of the typicality total dataset is demonstrated. This example uses the same dataset as the previous one, i.e. a Twitter dataset of 2020 that spatially covers Europe but does not include the month November of that year. Only tweets containing emojis are considered. Some countries in Eastern Europe were cut off by the bounding box used for the data collection and are therefore not completely covered in spatial terms, Iceland is not covered at all. For this reason, these countries were excluded from the following calculations and are not included in the maps in this section.
This third example was also inspired by our previous work described in Levi et al. (2024). It analyses the spatial occurrence of the beer mug emoji, which is intended as an indicator of beer consumption as the subject under investigation. It can be assumed that this emoji is more typical or overrepresented in northern and eastern European countries and not in the south, where wine growing and consumption are more prevalent. The following calculations are not based on the number of emojis (although this is sometimes formulated in this way for the sake of simplicity), but on the number of tweets containing a beer mug emoji or respectively containing emojis in general.
In the typicality and signed chi-score maps shown in the following three figures, the positive and the negative value ranges are each divided into four equally sized intervals. In the case of the typicality maps, these are symmetrical around zero due to the scaling described in Sect. 3.4, which results in a value range from − 1 to + 1. This is not the case with the signed chi-score maps. This type of interval subdivision was carried out in order to facilitate comparability between the maps. The negative ranges are visualised with a bluish scale, the positive ranges with an orange one.
The spatial reference units used for this example are NUTS (Nomenclature des unités territoriales statistiques) units. NUTS is a hierarchical system for classifying the official spatial statistical reference units in the member states of the EU and is closely based on the administrative structure of the individual countries. In the following, the levels NUTS 0 (countries) and NUTS 1 (larger regions of countries) are used.
Figure 8a shows the relative frequency of beer mug emojis among all emojis within a country and Fig. 8b the results of the tf-idf calculations. Figure 8c visualises the typicality of beer mug emojis per country (where the total dataset is the sum of emojis in all countries included in the calculation) and Fig. 8d shows the signed chi-score per country with the just described typicality total dataset as expectation. The proportion of beer mug emojis in Fig. 8a is very low and hardly varies between countries; only Croatia and the United Kingdom are slightly more prominent. The same applies to Fig. 8b. The essence of the maps in Fig. 8c and d is similar: in the same countries, the beer mug emoji is typical/overrepresented or atypical/underrepresented, with similar but not identical proportions. The phenomenon observed in the first two examples, i.e. that sorting the typicality and signed chi-score values calculated from the same reference dataset yields an identical order, does not occur here in the spatial context.
Two further calculations of signed chi-score were carried out with other reference datasets, i.e. expectations, both related to beer or respectively alcohol. The expectations are formulated using the following data (the numbers below correspond to the numbers in the caption of Fig. 9):
1. Beer production in litres per country in the year 2020 (BarthHaas GmbH & Co. KG 2021) (data can be viewed in Supplementary Material A)
$$ \rightarrow exp = {\text{beer production per country}}, \, \sum\nolimits_{exp} = {\text{total beer production of all countries included in the calculation}}$$
2. Cases of death caused by mental and behavioural disorders due to use of alcohol in the year 2020 (eurostat n.d.a) (data can be viewed in Supplementary Material B)
$$ \begin{aligned}&\rightarrow exp = {\text{number of deaths from mental and behavioural disorders due to alcohol consumption per country}}, \\ &\sum\nolimits_{exp} = {\text{total number of deaths from mental and behavioural disorders due to alcohol consumption in all countries included in the calculation}}\end{aligned}$$
The results of the calculations based on expectations 1 and 2 are shown in Fig. 9. Countries for which no data are available in the reference dataset are cross-hatched. The map in Fig. 9a shows a similar picture to Fig. 8c and d, which is that the beer mug emoji is only overrepresented in a few countries, with the United Kingdom being the most prominent. However, it is surprising that in both maps in Fig. 9, the Czech Republic, of all countries, as a well-known beer nation with a long brewing tradition and the world's highest beer consumption per capita, falls into the negative value range. Figure 9b indicates the beer mug emoji as overrepresented in some southern European countries. As the expectation relates to alcohol consumption in general and not beer consumption in particular, it makes sense that the rather wine-oriented countries Spain and Italy fall into the positive range. Apart from the aforementioned consistency and conclusiveness with regard to the UK, the results in Fig. 9 are not very convincing, although the data used as an expectation are in line with the subject under investigation.
In the following, typicality calculations are carried out for a different spatial reference unit, that is, for NUTS 1. The results are shown in Fig. 10. Two total datasets of different extents were used: the sum of all emojis in the countries included in the map (see Fig. 10a) and the sum of all emojis in the superordinate NUTS 0 region, i.e. in the country in which the NUTS 1 region under investigation is located (see Fig. 10b). However, the latter does not work for all considered countries, as some are not members of the EU and therefore no NUTS 1 subdivision was made, or because no NUTS 1 regions were defined despite EU membership. In these cases, the sub-dataset is identical to the total dataset, which is why these countries were excluded from the calculation and are cross-hatched in Fig. 10b. To emphasise the NUTS 0 units, their boundaries are drawn thicker than those of the NUTS 1 regions.
The map in Fig. 10a illustrates in which NUTS 1 regions the beer mug emoji is typical or atypical with regard to the entire area shown on the map. In contrast, in Fig. 10b typicality refers to the respective country and completely different spatial patterns emerge. For example, according to Fig. 10a, the beer mug emoji is completely atypical in the more wine-loving countries of France, Italy and Spain, which makes good sense in a European, i.e. transnational, context. However, Fig. 10b shows that the beer mug emoji is quite typical in some NUTS 1 regions of these countries, as the analysis is now carried out within the country, i.e. intranational, and there are therefore certainly regions that are more inclined towards beer than others.
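The dependence of typicality on the extent of the total dataset can be reproduced with a small sketch. It assumes an unscaled form of typicality, namely the difference between the relative frequency in the sub-dataset and in the total dataset, divided by the relative frequency in the total dataset; the scaling to the range from − 1 to + 1 described in Sect. 3.4 is omitted here. Region names, countries and counts are hypothetical.

```python
def typicality(sub_count, sub_total, total_count, total_total):
    """Unscaled typicality (assumed form): relative deviation of the
    sub-dataset's relative frequency from that of the total dataset."""
    f_sub = sub_count / sub_total
    f_total = total_count / total_total
    return (f_sub - f_total) / f_total

# Hypothetical data: region -> (country, tweets with beer mug, tweets with any emoji)
regions = {
    "A1": ("A", 30, 1_000),
    "A2": ("A", 10, 1_000),
    "B1": ("B", 80, 1_000),
    "B2": ("B", 120, 1_000),
}

def totals(rows):
    """Sum beer-mug tweets and emoji tweets over a collection of regions."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)

# Total dataset = all regions on the map (transnational view).
all_beer, all_total = totals(regions.values())
transnational = {
    name: typicality(beer, total, all_beer, all_total)
    for name, (_, beer, total) in regions.items()
}

# Total dataset = the country containing the region (intranational view).
intranational = {}
for name, (country, beer, total) in regions.items():
    siblings = [r for r in regions.values() if r[0] == country]
    c_beer, c_total = totals(siblings)
    intranational[name] = typicality(beer, total, c_beer, c_total)

print(transnational["A1"], intranational["A1"])
```

In this toy data, region A1 is atypical when compared with the whole map but typical within its own country, which mirrors the change of pattern between the two perspectives.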
In the previous explanations, wine-affine regions were repeatedly mentioned. All typicality calculations performed for the beer mug emoji were also carried out for the wine glass emoji for comparison purposes and the resulting maps can be viewed in Supplementary Material C.
The third example likewise highlights the impact of the selection of the reference dataset, both for typicality and signed chi-score calculations. Furthermore, the flexibility of the typicality measure is demonstrated, as different extents of the total dataset allow different perspectives on the analysed data.
5 Discussion and conclusions
The comparison of four measures commonly used for the normalisation of geo-social media data and their application in parallel within three exemplary case studies have revealed interesting findings. On the one hand, the potential of emojis to be a rich source of information was demonstrated. Further insights and discussions on this can be found in Hauthal et al. (2021) and Levi et al. (2024), among others. The behaviour of typicality when analysing hashtags and the difference to emojis in this respect is an interesting aspect of future research. On the other hand, and this is the main contribution of this paper, relevant behaviours of the four measures investigated have emerged, which will be discussed in more detail below.
Although relative frequency is a relative measure, it is not well suited to the characteristics of geo-social media data, as it cannot adequately reflect all aspects of those data during analysis. Still, it is a measure that can be used for high-level overviews, particularly because it is familiar to laypersons and therefore easy to grasp. Besides, it is used in the calculation of typicality.
The results provided by tf-idf in the three examples were very similar to those of relative frequency. We used tf-idf to examine emojis which are actually text, but not text in the sense of tf-idf, i.e. written language. Our datasets were available in such a way that each emoji was only counted once per post, even if it appeared several times in the post. This is a standard procedure when processing and analysing social media data, but it does not comply with the concept of tf-idf, where it is significant when a term appears multiple times in a document. This demonstrates that a one-to-one transfer of tf-idf to social media data only works to a limited extent.
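The consequence of counting each emoji only once per post can be made concrete with a short sketch. It uses a textbook tf-idf variant (term frequency as the share of a term in a document, inverse document frequency as the natural logarithm of N divided by the document frequency) and treats each post as a document; both choices are assumptions for illustration and not necessarily the configuration used in the three examples. The posts are invented.

```python
import math
from collections import Counter

def tf_idf(docs, once_per_doc=False):
    """Return one {term: weight} dict per document.

    docs: list of token lists. With once_per_doc=True, repeated tokens
    inside a document are collapsed before the term frequency is computed.
    """
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(set(doc)) if once_per_doc else Counter(doc)
        length = sum(counts.values())
        weights.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical posts; the first one repeats the same emoji three times.
posts = [
    ["beer", "beer", "beer", "sun"],
    ["sun", "heart"],
    ["heart"],
]

raw = tf_idf(posts)                       # repetitions contribute to tf
once = tf_idf(posts, once_per_doc=True)   # each emoji counted once per post
print(raw[0]["beer"], once[0]["beer"])
```

With once-per-post counting, the repeated emoji loses the extra weight that tf-idf would otherwise assign to it, which is the mismatch with the concept of tf-idf noted above.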
Although there is a clear requirement for typicality that the sub-dataset under investigation must be part of the total dataset, this still allows for a certain degree of freedom, which can have both advantages and disadvantages. Advantageous is the resulting flexibility and the different typicality values in the result, which allow different views on the data. A disadvantage is that there is no standard procedure yet in the sense of selection criteria or recommendations for deciding how broadly or narrowly the total dataset can or should be defined. What influences this decision is the focus of the analysis, e.g. in a spatial sense, whether a phenomenon is to be viewed from a global, regional or local perspective. The latter issue can be stated similarly about the number of objects to be investigated. In the case of example 1 (Sect. 4.1), a pre-selection was made: typicality was calculated only for the 100 most frequently used emojis per sub-dataset, since an emoji that appears only a few times in a total number of nearly 16 million emojis can hardly be called typical, even if it was previously claimed that the absolute number does not necessarily correlate with relevance. So far, there is no uniform procedure or threshold for this pre-selection.
Signed chi-score is more generic with regard to the reference dataset, i.e. the expectation. The fact that, in principle, arbitrary datasets can be selected for this purpose means that additional factors can be included and the analysis can thus be placed in a wider context, which is not possible with typicality. However, the previous examples have shown that the effects of the expectation selection on the results are difficult to predict. Nevertheless, the formulation of an expectation according to typicality principles is possible (i.e. the data to be analysed (= sub-dataset) are part of the expectation (= total dataset)) and provides conclusive results, as demonstrated in the previous chapter. This type of expectation formulation emerged from the development of typicality and its comparison with signed chi-score; in retrospect, it offers a structured approach to choosing expectations for signed chi-score.
Taking up the typicality focus of the three examples, it can be concluded that typicality is a statistical measure tuned to the particularities and characteristics of geo-social media data. It is easy to use, efficient and flexible. These properties stem from the various possibilities it offers for forming sub-datasets. Moreover, it can be used for a wide variety of objects under investigation such as hashtags, emojis, topics etc. and is not only suitable for one purpose, like language analyses as in the case of tf-idf. Another strength is the clear requirement regarding the total dataset, which nevertheless grants flexibility and also comparability due to this uniform rule. The three included application examples have shown that typicality provides nuanced insights across the dimensions of geo-social media data.
Based on the results of this paper, we would like to encourage other researchers to make conscious use of the described dimensions of geo-social media in order to enable a structured and versatile approach to data analysis. At the same time, we would like to appeal to researchers to select measures carefully and, if a reference dataset is included, to critically discuss and question it, as it strongly influences the results, as our case studies have shown. One option could be to use different reference datasets and draw conclusions from a synthesis of the different results.
Regarding future work, it should be noted that it may be necessary to carry out investigations at several spatial levels, as shown in Fig. 10. Ebert et al. (2022) argue that several scales should always be included in an analysis in order to be able to assess how robust the relationship between spatially aggregated parameters is. Distributions can be very inhomogeneous locally, so the choice of spatial reference units and the management of spatial dependencies are essential. In this context, the modifiable areal unit problem, known as MAUP, also comes into play, which describes a potential source of error in spatial analyses when using aggregated data (Wong 2009). Different scales, but also different shapes of the reference units, lead to different analysis results, even if the same data are involved. This also applies to reference datasets. However, the issue described should not only be seen as a problem, but should also be treated as an opportunity to gain new perspectives on the data.
Data availability
Data are available on request from the corresponding author.
Change history
02 January 2025
The original online version of this article was revised: to update the list in the section 4.
References
Abbas J, Wang D, Su Z, Ziapour A (2021) The role of social media in the advent of covid-19 pandemic: crisis management, mental health challenges and implications. Risk Manag Healthc Policy 14:1917–1932. https://doi.org/10.2147/RMHP.S284313
Antelmi A, Malandrino D, Scarano V (2019) Characterizing the behavioral evolution of twitter users and the truth behind the 90-9-1 Rule. In: Liu L. (Ed.) Companion Proceedings of the 2019 world wide web conference: pp. 1035–1038. https://doi.org/10.1145/3308560.3316705.
Arpaci I, Aslan O (2023) Development of a scale to measure cybercrime-awareness on social media. J Comp Inform Syst 63(3):695–705. https://doi.org/10.1080/08874417.2022.2101160
Baldauf M, Dustdar S, Rosenberg F (2007) A survey on context-aware systems. Int J Ad Hoc Ubiq Co 2(4):263. https://doi.org/10.1504/ijahuc.2007.014070
Barbieri F, Kruszewski G, Ronzano F, Saggion H (2016) How cosmopolitan are emoji’s? In: MM’16. Proceedings of the 2016 ACM multimedia conference, October 15-19, Amsterdam, The Netherlands. https://doi.org/10.1145/2964284.2967278
Ben-Lhachemi N, Nfaoui EH, Boumhidi J (2019) Hashtag recommender system based on LSTM neural reccurent network In: 2019 3rd international conference on intelligent computing in data sciences (ICDS) pp. 1–6. https://doi.org/10.1109/ICDS47004.2019.8942380
BarthHaas GmbH & Co. KG (2021) Barth Haas Report Hops 2020/2021. https://www.barthhaas.com/fileadmin/user_upload/kampagnen/barthhaas_bericht/BarthHaas_Report_Hops_2020_21.pdf.
Camacho K, Portelli R, Shortridge A, Takahashi B (2021) Sentiment mapping: point pattern analysis of sentiment classified twitter data. Cartogr Geogr Inf Sc 48(3):241–257. https://doi.org/10.1080/15230406.2020.1869999
Chandra Guntuku S, Li M, Tay L, Ungar L H (2019) Studying cultural differences in emoji usage across the east and the west In: Proceedings of the international AAAI conference on web and social media, pp. 226–235. https://doi.org/10.1609/icwsm.v13i01.3224
Chen S, Lin L, Yuan X (2017) Social media visual analytics. Comp Graph Forum 36(3):563–587. https://doi.org/10.1111/cgf.13211
Croft WB, Turtle HR (1992) Text retrieval and inference. In: Jacobs P.S. (Ed.) Text-based intelligent systems. Current research and practice in information extraction and retrieval. Hillsdale, Erlbaum pp. 127–155
Dahal B, Kumar SAP, Li Z (2019) Topic modeling and sentiment analysis of global climate change tweets, 9: 1. https://doi.org/10.1007/s13278-019-0568-8.
Danesi M (2017) The semiotics of emoji. London, Oxford, New York, New Delhi, Sydney: Bloomsbury Academic an imprint of Bloomsbury Publishing Plc (Bloomsbury advances in semiotics)
Daniel J (2021) The most frequently used emoji of 2021. https://home.unicode.org/emoji/emoji-frequency/. Accessed 25 April 2024
data.gov.uk (2024) Find open data: UK gridded population 2011 based on Census 2011 and Land Cover Map 2015. https://www.data.gov.uk/dataset/ca2daae8-8f36-4279-b15d-78b0463c61db/uk-gridded-population-2011-based-on-census-2011-and-land-cover-map-2015. Accessed 23 August 2024.
Di Minin E, Tenkanen H, Toivonen T (2015) Prospects and challenges for social media data in conservation science. Front Environ Sci 3:268. https://doi.org/10.3389/fenvs.2015.00063
Du J, van Koningsbruggen GM, Kerkhof P (2018) A brief measure of social media self-control failure. Comput Hum Behav 84(2):68–75. https://doi.org/10.1016/j.chb.2018.02.002
Dunkel A, Andrienko G, Andrienko N, Burghardt D, Hauthal E, Purves R (2019) A conceptual framework for studying collective reactions to events in location-based social media. Int J Geogr Inf Sci 33(4):780–804. https://doi.org/10.1080/13658816.2018.1546390
Dunkel A, Löchner M, Burghardt D (2020) Privacy-aware visualization of volunteered geographic information (VGI) to analyze spatial activity: a benchmark implementation. ISPRS Int J Geo-Inf 9(10):607. https://doi.org/10.3390/ijgi9100607
Dunkel A, Hartmann MC, Hauthal E, Burghardt D, Purves RS, Estima J (2023) From sunrise to sunset: exploring landscape preference through global reactions to ephemeral events captured in georeferenced social media. PLoS ONE 18(2):e0280423. https://doi.org/10.1371/journal.pone.0280423
Ebert T, Gebauer JE, Brenner T, Bleidorn W, Gosling SD, Potter J, Rentfrow PJ (2022) Are regional differences in psychological characteristics and their correlates robust? Applying spatial-analysis techniques to examine regional variation in personality. Persp Psychol Sc 17(2):407–441. https://doi.org/10.1177/1745691621998326
Encalada L, Ferreira CC, Boavida-Portugal I, Rocha J (2019) Mining big data for tourist hot spots: geographical patterns of online footprints In: geospatial challenges in the 21st Century. Cham, Springer pp. 99–123. https://doi.org/10.1007/978-3-030-04750-4_6
Etzion O and Niblett P (2010) Event processing in action. Manning Publications Co
European Centre for Disease Prevention and Control (2022) Download historical data (to 20 June 2022) on the weekly number of new reported COVID-19 cases and deaths worldwide. https://www.ecdc.europa.eu/en/publications-data/download-historical-data-20-june-2022-weekly-number-new-reported-covid-19-cases. Accessed 25 April 2024
eurostat (n.d.a) Data Browser: Causes of death–deaths by country of residence and occurrence. https://ec.europa.eu/eurostat/databrowser/view/hlth_cd_aro__custom_10969866/default/table?lang=en. Accessed 25 April 2024
eurostat (n.d.b) GISCO: Geographical information and maps. https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat.
Gallegos L, Lerman K, Huang A, Garcia D (2016) Geography of Emotion. In: Bourdeau J et al (Eds.) WWW’16 companion. In Proceedings of the 25th international conference on world wide web pp. 569–574. https://doi.org/10.1145/2872518.2890084
Ge J, Gretzel U (2018) Emoji rhetoric: a social media influencer perspective. J Marketing Manage 34(15–16):1272–1295. https://doi.org/10.1080/0267257X.2018.1483960
Gu Y, Qian Z, Chen F (2016) From twitter to detector: real-time traffic incident detection using social media data. Transport Res Part c: Emerg Technol 67(5):321–342. https://doi.org/10.1016/j.trc.2016.02.011
Gugulica M, Burghardt D (2023) Mapping indicators of cultural ecosystem services use in urban green spaces based on text classification of geosocial media data. Ecosyst Serv 60(2):101508. https://doi.org/10.1016/j.ecoser.2022.101508
Habibi M, Cahyo PW (2019) Clustering user characteristics based on the influence of hashtags on the instagram platform. Indones J Comput Cybern Syst 13(4):399. https://doi.org/10.22146/ijccs.50574
Han X, Wang J (2019) Using social media to mine and analyze public sentiment during a disaster: a case study of the 2018 Shouguang City flood in China. ISPRS Int J Geo-Inf 8(4):185. https://doi.org/10.3390/ijgi8040185
Harman D (2005) The History of IDF and its influences on IR and other fields. In: Tait JI (Ed.) Charting a new course: natural language processing and information retrieval. Essays in Honour of Karen Spärck Jones (The Kluwer International Series on Information Retrieval, 16). Dordrecht, Springer pp. 69–79.
Hauthal E, Dunkel A, Burghardt D (2021) Emojis as contextual indicants in location-based social media posts. ISPRS Int J Geo-Inf 10(6):407. https://doi.org/10.3390/ijgi10060407
Hemphill L, Culotta A, Heston M (2013) Framing in social media: how the US congress uses twitter hashtags to frame political issues. SSRN Electron J 7(1):315. https://doi.org/10.2139/ssrn.2317335
Hemphill L, Culotta A, Heston M (2016) #Polar Scores: measuring partisanship using social media content. J Inform Technol Polit 13(4):365–377. https://doi.org/10.1080/19331681.2016.1214093
Hiemstra D (2000) A probabilistic justification for using tf×idf term weighting in information retrieval. Int J Dig Libr 3(2):131–139. https://doi.org/10.1007/s007999900025
Hu T, Wang S, Luo W, Zhang M, Huang X, Yan Y et al (2021) Revealing public opinion towards COVID-19 vaccines with twitter data in the United States: spatiotemporal perspective. J Med Internet Res 23(9):e30854. https://doi.org/10.2196/30854
Illendula A, Manohar KV, Yedulla MR (2018) Which emoji talks best for my picture? In: 2018 IEEE/WIC/ACM international conference on web intelligence (WI) pp. 514–519. https://doi.org/10.1109/WI.2018.00-44
Kemp S (2024) Digital 2024: global overview report. https://datareportal.com/reports/digital-2024-global-overview-report. Accessed 4 September 2024
Khan HU, Nasir S, Nasim K, Shabbir D, Mahmood A (2021) Twitter trends: a ranking algorithm analysis on real time data. Expert Syst Appl 164(3):113990. https://doi.org/10.1016/j.eswa.2020.113990
Kmieckowiak T (2017) Emojis Lead up to 47.7% more interactions on instagram. https://www.quintly.com/blog/instagram-emoji-study. Accessed 23 August 2024
Kuflik T, Minkov E, Nocera S, Grant-Muller S, Gal-Tzur A, Shoor I (2017) Automating a framework to extract and analyse transport related social media content: the potential and the challenges. Transport Res Part c: Emerg Technol 77(2):275–291. https://doi.org/10.1016/j.trc.2017.02.003
Kumar N, Baskaran E, Konjengbam A, Singh M (2021) Hashtag recommendation for short social media texts using word-embeddings and external knowledge. Know Inf Syst 63(1):175–198. https://doi.org/10.1007/s10115-020-01515-7
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of massive datasets, 2nd edn. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9781139924801
Levi S, Hauthal E, Mukherjee S, Ostermann FO (2024) Visualizing emoji usage in geo-social media across time, space, and topic. Front Commun 9:465. https://doi.org/10.3389/fcomm.2024.1303629
Lyu H, Wang J, Wu W, Duong V, Zhang X, Dye TD, Luo J (2022) Social media study of public opinions on potential COVID-19 vaccines: informing dissent, disparities, and dissemination. Intell Med 2(1):1–12. https://doi.org/10.1016/j.imed.2021.08.001
Mukherjee S, Hauthal E, Burghardt D (2022) Analyzing the EU migration crisis as reflected on twitter. KN J Cartogr Geogr Inf 72(3):213–228. https://doi.org/10.1007/s42489-022-00114-6
Na’aman N, Provenza H, Montoya O (2017) Varying linguistic purposes of emoji in (twitter) context. In: Ettinger A et al (Eds.) Proceedings of ACL, student research workshop, pp. 136–141. https://doi.org/10.18653/v1/P17-3022
Novak PK, Smailović J, Sluban B, Mozetič I (2015) Sentiment of emojis. PLoS One 10(12):e0144296. https://doi.org/10.1371/journal.pone.0144296
Otsuka E, Wallace SA, Chiu D (2014) Design and evaluation of a Twitter hashtag recommendation system. In: Desai BC et al (Eds.) Proceedings of the 18th international database engineering & applications symposium–IDEAS’14, pp. 330–333. https://doi.org/10.1145/2628194.2628238
Paolanti M, Mancini A, Frontoni E, Felicetti A, Marinelli L, Marcheggiani E, Pierdicca R (2021) Tourism destination management using sentiment analysis and geo-location information: a deep learning approach. Inform Technol Tour 23(2):241–264. https://doi.org/10.1007/s40558-021-00196-4
Pohl H, Domin C, Rohs M (2017) Beyond just text: semantic emoji similarity modeling to support expressive communication. ACM Trans Comput-Human Interact 24(1):1–42. https://doi.org/10.1145/3039685
Poorthuis A, Zook M, Shelton T, Graham M, Stephens M (2016) Using geotagged digital social data in geographic research. In: Clifford N et al (eds) Key methods in geography. Sage, London, pp 248–269
Resch B, Summa A, Zeile P, Strube M (2016) Citizen-centric urban planning through extracting emotion information from twitter in an interdisciplinary space-time-linguistics algorithm. Urban Plan 1(2):114–127. https://doi.org/10.17645/up.v1i2.617
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503–520. https://doi.org/10.1108/00220410410560582
Salton G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, New York (McGraw-Hill Computer Science Series)
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inform Process Manag 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0
Shelton T, Poorthuis A, Graham M, Zook M (2014) Mapping the data shadows of Hurricane Sandy: uncovering the sociospatial dimensions of ‘big data.’ Geoforum 52(3):167–179. https://doi.org/10.1016/j.geoforum.2014.01.006
Shelton T, Poorthuis A, Zook M (2015) Social media and the city: rethinking urban socio-spatial inequality using user-generated geographic information. Landscape Urban Plan 142(1):198–211. https://doi.org/10.1016/j.landurbplan.2015.02.020
Statista (2022) Distribution of Twitter users worldwide as of April 2021, by age group. https://www.statista.com/statistics/283119/age-distribution-of-global-twitter-users/. Accessed 25 April 2024
Statista (2024) Leading countries based on number of X (formerly Twitter) users as of January 2024. https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/. Accessed 25 April 2024
Su X, Spierings B, Dijst M, Tong Z (2020) Analysing trends in the spatio-temporal behaviour patterns of mainland Chinese tourists and residents in Hong Kong based on Weibo data. Curr Issues Tour 23(12):1542–1558. https://doi.org/10.1080/13683500.2019.1645096
Suat-Rojas N, Gutierrez-Osorio C, Pedraza C (2022) Extraction and analysis of social networks data to detect traffic accidents. Inform 13(1):26. https://doi.org/10.3390/info13010026
Taubenböck H, Staab J, Zhu X, Geiß C, Dech S, Wurm M (2018) Are the poor digitally left behind?: indications of urban divides based on remote sensing and twitter data. ISPRS Int J Geo-Inf 7(8):304. https://doi.org/10.3390/ijgi7080304
Teles da Mota V, Pickering C (2020) Using social media to assess nature-based tourism: current research and future trends. J Outdoor Recreat Tour 30:100295. https://doi.org/10.1016/j.jort.2020.100295
Toivonen T, Heikinheimo V, Fink C, Hausmann A, Hiippala T, Järv O et al (2019) Social media data for conservation science: a methodological overview. Biol Conserv 233:298–315. https://doi.org/10.1016/j.biocon.2019.01.023
United Nations (n.d.) UN Population Division Data Portal: Interactive access to global demographic indicators. https://population.un.org/dataportal/home. Accessed 25 April 2024
Visvalingam M (1978) The signed chi-square measure for mapping. Cartogr J 15(2):93–98. https://doi.org/10.1179/caj.1978.15.2.93
Visvalingam M (1981) The signed chi-score measure for the classification and mapping of polychotomous data. Cartogr J 18(1):32–43. https://doi.org/10.1179/caj.1981.18.1.32
Visvalingam M (1983) Area-based social indicators: signed chi-square as an alternative to ratios. Social Indic Res 13(3):311–329. https://doi.org/10.1007/BF00318102
Visvalingam M (1976) Chi-square as an alternative to ratios for statistical mapping and analysis. Working Paper. University of Durham, Department of Geography, Census Research Unit, Durham
Waeterloos C, Walrave M, Ponnet K (2021) Designing and validating the social media political participation scale: an instrument to measure political participation on social media. Technol Soc 64(1):101493. https://doi.org/10.1016/j.techsoc.2020.101493
Wang Y, Mohd Pozi MS, Yasui G, Kawai Y, Sumiya K, Akiyama T (2017) Visualization of spatio-temporal events in geo-tagged social media. In: Brosset D et al (Eds.) Web and wireless geographical information systems. Proceedings of 15th international symposium, W2GIS, pp. 137–152. https://doi.org/10.1007/978-3-319-55998-8_9
Wartmann FM, Tieskens KF, van Zanten BT, Verburg PH (2019) Exploring tranquillity experienced in landscapes based on social media. Appl Geogr 113:102112. https://doi.org/10.1016/j.apgeog.2019.102112
Wong DW (2009) Modifiable areal unit problem. In: Thrift NJ et al (Eds.) International encyclopedia of human geography. Amsterdam, London, Oxford, Elsevier, pp. 169–174
Wood J, Dykes J, Slingsby A, Clarke K (2007) Interactive visual exploration of a large spatio-temporal dataset: reflections on a geovisualization mashup. IEEE T Vis Comp Gr 13(6):1176–1183. https://doi.org/10.1109/TVCG.2007.70570
Yahav I, Shehory O, Schwartz D (2019) Comments mining with TF-IDF: the inherent bias and its removal. IEEE T Knowl Data En 31(3):437–450. https://doi.org/10.1109/TKDE.2018.2840127
Yamasaki T, Hu J, Aizawa K, Mei T (2015) Power of tags: predicting popularity of social media in geo-spatial and temporal contexts. In: Ho Y-S et al (Eds.) Advances in multimedia information processing–PCM, pp. 49–158. https://doi.org/10.1007/978-3-319-24078-7_15
Yuan Q, Cong G, Ma Z, Sun A, Thalmann NM (2013) Who, where, when and what: discover spatio-temporal topics for twitter users. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 605–613. https://doi.org/10.1145/2487575.2487576
Zahra K, Ostermann FO, Purves RS (2017) Geographic variability of Twitter usage characteristics during disaster events. Geospatial Inf Sc 20(3):231–240. https://doi.org/10.1080/10095020.2017.1371903
Zhañay BA, Cordero GO, Cordero MO, Urigüen M-IA (2018) A text mining approach to discover real-time transit events from twitter. In: Botto-Tobar M et al (Eds.) Information and communication technologies of Ecuador (TIC.EC). TICEC (Advances in intelligent systems and computing, vol 884). Cham, Springer. https://doi.org/10.1007/978-3-030-02828-2_12
Zhang G, Zhu A-X (2018) The representativeness and spatial bias of volunteered geographic information: a review. Ann GIS 24(3):151–162. https://doi.org/10.1080/19475683.2018.1501607
Zhang Z, He Q, Gao J, Ni M (2018) A deep learning approach for detecting traffic accidents from social media data. Transp Res Part C: Emerg Technol 86(1):580–596. https://doi.org/10.1016/j.trc.2017.11.027
Zhang K, Chen D, Li C (2020) How are tourists different? Reading geo-tagged photos through a deep learning model. J Qual Assur Hosp Tour 21(2):234–243. https://doi.org/10.1080/1528008X.2019.1653243
Zimmermann A, Lorenz A, Oppermann R (2007) An operational definition of context. In: Kokinov B et al (Eds.) Modeling and using context. Proceedings of 6th international and interdisciplinary conference, CONTEXT 2007, Roskilde, Denmark, August 20–24 (Lecture notes in computer science, 4635). Berlin, Springer, pp. 558–571
Funding
Open Access funding enabled and organized by Projekt DEAL. No funding was received for conducting this study.
Author information
Contributions
E.H. and S.M. contributed equally to the study. Conceptualisation: E.H., S.M.; Methodology: E.H.; Formal analysis and investigation: E.H., S.M.; Writing—original draft preparation: E.H.; Writing—review and editing: S.M., D.B.; Resources: E.H., S.M.; Supervision: D.B.
Ethics declarations
Conflict of interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hauthal, E., Mukherjee, S. & Burghardt, D. Normalising inhomogeneities in geo-social media data – a comparison of different measures. Soc. Netw. Anal. Min. 14, 230 (2024). https://doi.org/10.1007/s13278-024-01395-7
DOI: https://doi.org/10.1007/s13278-024-01395-7