
1 Introduction

Overweight and obesity are now a global phenomenon, found in economically developed and developing countries (e.g., the United States [1], European countries [2], South Africa [3], China [4]) as well as in regions that experience a double burden with the concomitant problem of malnutrition [5]. While there are ongoing debates on a possible plateau or even decrease of overweight and obesity in the next generation, updated prevalence data for children suggest that severe obesity is on the rise [6]. There is a plethora of interventions to prevent overweight and obesity in both children [7] and adults [8], and an equally impressive number of interventions for treatment [9, 10]. Yet, individuals struggle to achieve a healthy weight over a sustained period of time. For example, a review of weight management interventions found a weight loss of 1.54 kg over two years [11], which is far from the 5% weight loss recommended to produce health benefits [12]. These challenges have led to the realization that a simple solution would not suffice [13]: the health system needs to cope with the complexity of obesity [14,15,16].

The notion of complexity covers multiple characteristics, such as the vast individual differences (or heterogeneity) between weight-related factors [17, 18], or the nonlinear ways in which factors interact to form a system. The obesity system has been the subject of numerous studies [19,20,21,22]. This system involves factors from a broad array of sectors (e.g., built environment, eating disorders, weight stigma [23, 24]), with interactions within as well as across sectors. Accurately modeling this system facilitates the development of integrated policies building on cross-sectoral efforts [25, 26]. If policies are developed separately along traditional themes (e.g., public planning works on the environment, doctors work on diseases and physiology, mental health experts work on psychology), then we have a heavily fragmented approach to obesity (Fig. 1a). Efforts such as the Foresight Obesity Map [20, 27] or the Provincial Health Services Authority's series of maps [24, 28, 29] thus support the development of synergistic policies working on integrated thematic clusters (Fig. 1b).

Given the importance of developing accurate models of the obesity system, the modeling process often seeks to be comprehensive by including experts and community members [19, 24, 30,31,32,33,34]. While many qualitative modeling processes can produce models in the form of maps [35] (e.g., cognitive/concept mapping, causal loop diagrams), they are generally conducted with a facilitator. Some of the limitations (e.g., costs, the need for a trained facilitator) may be addressed through emerging technologies [36]. However, one limitation remains: participants may not openly express their beliefs (e.g., weight discrimination) when perceiving that these beliefs may not be well received by a facilitator or the research team. In contrast, the naturally occurring exchange of perspectives in social media provides an unobtrusive approach to collecting beliefs on the causes and consequences of obesity. Mining social media may thus provide the views of community members [37,38,39,40].

Fig. 1. The Provincial Health Services Authority's series of maps [24, 28, 29] suggests that typical categories lead to fragmented approaches (a), whereas themes specific to overweight and obesity can support more integrated options (b). These maps are concept maps, as they articulate how concepts (labeled circles) are related (curves).

While obtaining a model via social media can inform policymakers about popular support for possible policies [41], the model may stand in stark contrast with an expert-based model [34]. Identifying and reconciling these differences is an important step to integrate social computing (and specifically social web mining) with policy making. In this paper, we contrast how mining social media instead of expert reports affects the validation of a large conceptual model of obesity. This overarching goal is achieved through three consecutive steps. First, we assemble a social media dataset (consisting of several million tweets) and several expert reports (totaling hundreds of pages). Second, we employ an innovative multi-step process to examine a conceptual model using both the social media dataset and the expert reports. Finally, we contrast the structure of these models using network methods.

The remainder of this paper is organized as follows. In Sect. 2, we provide background information on the application of social web mining to health, and on the use of conceptual models in obesity research. In Sect. 3, we briefly explain our approach to validate a conceptual model from text. In Sect. 4, we perform this inference on both expert reports and tweets, and we examine how the conceptual models differ. Finally, these differences are discussed and contextualized in Sect. 5.

2 Background

2.1 Social Web Mining for Health

The social media of interest in this paper is Twitter, in which users post and interact through short messages known as ‘tweets’. Twitter has been used for many studies on obesity and weight-related behaviors. For instance, Harris and colleagues collected 1,110 tweets and read them to understand how childhood obesity was discussed [42], while Lydecker et al. read 529 tweets to identify the main themes related to fatness [43]. Similarly, So and colleagues analyzed the common features of the 120 tweets that were most frequently shared (i.e., retweeted) to understand what information individuals preferred to relay when it came to obesity [37]. Reading tweets to identify themes (i.e., content analysis) is a typical task to understand the arguments that a specific population uses on a subject of interest. Broader examples in health include the content analysis of 700 tweets [44] and 625 tweets [45] to examine the type of claims that health professionals make online, or an examination of 8,934 tweets documenting cyberincivility among nurses and nursing students [46]. While such content analyses make a valuable contribution to the body of knowledge on arguments in public health, they do not employ computational methods to automate (parts of) the analysis and thus scale it to a larger dataset. Automation can be as simple as counting how many times keywords of interest appear across tweets. Turner-McGrievy and Beets used Hashtagify.me to automatically count keywords in tens of thousands of tweets on weight loss, health, diet, and fitness. By dividing the analysis across time periods, they were able to examine whether there are times of the year when individuals would be more likely to consider weight loss, thus contributing to the timing of interventions [48]. Similarly, Sui et al. used the intensity of topics on Twitter as part of an effort to identify the public interest in intensive obesity treatment [49]. Such studies illustrate the important shift from having humans read and code all tweets to relying on a machine to handle most of a (much larger) dataset. The latter is the focus of data mining applied to the ‘social web’ (i.e., social web mining), which includes social networking sites such as Twitter but also encompasses blogs and micro-blogging. As Twitter has been the social platform of interest for many studies, the term ‘Twitter mining’ has also emerged to refer specifically to the application of social web mining to Twitter [50].

Social web mining started to garner attention in the late 2000s to early 2010s. The application of social web mining to health was discussed in 2010 by Boulos et al. [51] and in 2011 by Paul and Dredze [52], showing how a broad range of public health applications could benefit from mining Twitter. Studies have been able to mine a staggering volume of data, going well over what a team of humans could handle. For example, Eichstaedt et al. mapped 148 million tweets to counties in an effort to relate language patterns to county-level heart disease mortality [53]. At an even larger scale, Ediger and colleagues used a Cray computer to approximate centrality within two hours on a dataset of interactions between Twitter users comprising 1.47 billion edges [54]. While these cases are noteworthy for their volume of data, studies employing social web mining for obesity research typically involve millions of tweets. Using 2.2 million tweets, Chou and colleagues found that tweets (as well as Facebook posts) often stigmatized individuals living with overweight and obesity [38]. In two studies on obesity and weight-related factors, Karami analyzed 6 million [39] and 4.5 million tweets [40]. In a study of health-related statistics, Culotta mined 4.3 million tweets and found that the data was correlated with obesity [56]. Given that obesity is driven by many factors (e.g., eating behaviors, physical activity behaviors), there is also a wealth of large-scale studies on such factors, such as the work of Abbar et al. on 503 million tweets regarding food [57]. Finally, the value proposition of several new platforms is not the analysis of one particular dataset, but rather the ongoing ability to monitor diet or physical activity. This is particularly the case for the Lexicocalorimeter, which measures calories in each US state via Twitter [58], and to a lesser extent for the National Neighborhood Dataset of Zhang et al., which tracks diet and physical activity through Twitter [59].

Several commentaries [60] and reviews [61,62,63] have explored whether this abundance of studies has contributed to public health. Findings depend on the specific aspect of health concerned. Social media has yet to impact practices in public health surveillance [62], but a review centered on chronic disease found a benefit on clinical outcomes in almost half of the studies [61], and a review specific to obesity highlighted a modest impact on weight [63].

2.2 Conceptual Models in Obesity Research

Although our work will involve the identification of themes, our endeavor is very different from the studies reviewed in the previous section, which focused on identifying themes and their variations across time, places, or communities of users. Our objective is to contrast conceptual models that have been automatically extracted from tweets and expert reports. As noted in the introduction, models of complex systems such as obesity support several important policy-making and analytical tasks. In this section, we briefly review the features that models often seek to capture when it comes to complex health systems, and how models are used in obesity research specifically. Penn detailed key characteristics of complex health systems that justify the development of models (emphases added):

“Many problems that society wishes to address in population health are clearly problems of managing complex adaptive systems. They involve making interventions in systems with multiple interacting causal connections, which span domains from physiological to economic. Additionally, of course, the individuals whose health we ultimately wish to improve adapt and change their behavior in response to medical or policy interventions.” [64]

Several of these points were echoed by Silverman in justifying the use of systems-based simulation for population health research [65]. Modeling changes in the heterogeneous health behaviors of individuals often uses the simulation technique of Agent-Based Modeling, and has been done in obesity research on multiple occasions [66,67,68,69,70]. Such models can be very detailed and use widely different architectures to capture the cognitive processes of the agents. Validating them using text is thus an arduous task. Modeling interacting causes across domains has been achieved in obesity research through a variety of techniques. System Dynamics (SD) makes it possible to represent nonlinear interactions between weight-related factors over different time scales and at different strengths [71, 72]. However, much like agent-based modeling, the great level of detail supported by SD makes it difficult to derive or validate such models from text. Fuzzy Cognitive Maps (FCMs) are a simpler alternative that eliminates the notion of time to focus on the different strengths of causal relations [34, 73,74,75]. Such models can be compared [34], but validating them from text still requires a trained analyst [76]. An even greater simplification is to use conceptual rather than simulation models. Conceptual models cannot run scenarios or what-if questions, and cannot ‘generate’ numbers. Instead, their focus is to capture relevant factors and whether they are connected [77]. Conceptual models can be compared [78] and validated using text, as shown in our previous work [77].

There are several types of conceptual models [35]. We recently detailed the differences between causal maps, mind maps, and concept maps [36]. In short, this paper focuses on concept maps (Fig. 1), which are undirected networks representing concepts as nodes and relationships as edges. Like the other forms of conceptual models mentioned above, a concept map supports policy-oriented tasks such as identifying clusters [27] (e.g., to coordinate actors across domains on one problem such as food) or finding feedback loops [24, 28, 29] (e.g., to use as leverage points in an intervention).

Fig. 2. Our process in seven steps to validate a conceptual model using textual data.

3 Validating a Conceptual Model from Text

The process starts with a conceptual model that we seek to validate and a text corpus used for validation. Intuitively, our process uses the concepts’ names to find relevant parts of the corpus and identify which concepts tend to co-occur. Technical aspects include handling variations in language (as we cannot rigidly assume that a concept’s name will appear verbatim), identifying themes, and mapping themes from the corpus back to concepts in the conceptual model. Our process uses seven steps, illustrated with a theoretical example in Fig. 2. The first two steps are performed for each concept node (a code sketch follows the list):

(1.a) We replace all concept names and words from the corpus with their base form (i.e., lemma). This is accomplished through lemmatization, which uses a morphological analysis to remove inflectional endings. This step ensures that minor variations of a term are all mapped to the same one (e.g., ‘flooding’ and ‘floods’ are both mapped to ‘flood’).

(1.b) Each lemmatized concept name is expanded with derivationally related forms. For instance, instead of only searching for ‘flood’ in the corpus, we will also accept words such as ‘deluge’.

(2) For each concept (i.e., the expanded lemma), we retrieve all parts of the corpus that contain it. For instance, the concept ‘flooding’ will lead to retrieving all tweets that include the lemmas ‘flood’ or ‘deluge’.
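The following is a minimal sketch of steps (1.a)–(2), assuming NLTK’s WordNet interface. The function names and the token-level matching are illustrative simplifications rather than our exact implementation, which relies on the libraries listed in Table 1 (e.g., POS tagging via Stanford CoreNLP is omitted here).

```python
# Minimal sketch of steps (1.a)-(2): lemmatize a concept name, expand it
# with WordNet lemmas and their derivationally related forms, and retrieve
# the matching (already lemmatized) tweets. Names are illustrative.
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def expand_concept(name):
    """Steps (1.a)-(1.b): lemma of a concept name plus related forms."""
    lemma = lemmatizer.lemmatize(name.lower(), pos="v")  # 'flooding' -> 'flood'
    forms = {lemma}
    for synset in wordnet.synsets(lemma):
        for wn_lemma in synset.lemmas():
            forms.add(wn_lemma.name().replace("_", " "))
            for related in wn_lemma.derivationally_related_forms():
                forms.add(related.name().replace("_", " "))
    return forms

def retrieve(tweets, forms):
    """Step (2): keep tweets containing any expanded form of the concept."""
    return [t for t in tweets if any(f in t.split() for f in forms)]

print(retrieve(["flood damage reported", "deluge of rain", "sunny all week"],
               expand_concept("flooding")))
```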

Upon completion of step 2, we have related a portion of the corpus to each concept node. We then find the themes in each portion of the corpus using three parameters (a code sketch follows the list):

(3) We apply the Latent Dirichlet Allocation (LDA) model to find prevalent themes. The two parameters for this step are the number of themes and the number of words per theme.

(4) We gather words across themes into a single set of words. This set is cleaned by removing words that are already present in the set of derivationally related forms of the node. In other words, we only look for concepts that the node could be associated with, but not equivalent to.

(5) Since concepts’ names are entities, a concept can only be associated with an entity. Consequently, we remove all non-entities from the set of words.

(6) At this step, we have a set of entities that a concept node could be associated with. However, some of the entities may be noise rather than meaningful associations. We thus sort the entities by tf-idf (term frequency–inverse document frequency), computed over the set of tweets in which each word appears. We use a threshold parameter to identify which entities have a sufficient tf-idf to be selected.
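Steps (3)–(6) can be sketched with scikit-learn as a stand-in implementation. The entity filter of step (5) is omitted for brevity (it would apply a named-entity recognizer to the pooled words), `norm=None` yields unnormalized tf-idf scores compatible with integer thresholds such as those used in our grid search, and all names are illustrative.

```python
# Sketch of steps (3)-(6): LDA themes, pooled top words minus the node's
# own forms, then a tf-idf threshold to filter noise.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def candidate_entities(docs, known_forms, n_themes=5, n_words=10, threshold=2.0):
    counts = CountVectorizer()
    X = counts.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_themes).fit(X)  # step (3)
    vocab = counts.get_feature_names_out()
    pooled = set()
    for topic in lda.components_:                                  # step (4)
        pooled.update(vocab[i] for i in topic.argsort()[-n_words:])
    pooled -= known_forms            # drop the node's own related forms
    vocabulary = sorted(pooled)
    # step (6): unnormalized tf-idf so integer thresholds (e.g., 2-9) apply
    tfidf = TfidfVectorizer(vocabulary=vocabulary, norm=None)
    scores = np.asarray(tfidf.fit_transform(docs).max(axis=0).todense()).ravel()
    return {w for w, s in zip(vocabulary, scores) if s >= threshold}
```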

Upon completion of step 6, we have found the entities that a concept node could be associated with. The final step goes back to the conceptual model to see if the association exists (a sketch follows):

(7) For each node, we compare its associated entities with its connected nodes and their derivationally related forms. If there is a match, then the text corpus has confirmed an association between the two concepts. If no match is found, the association is not confirmed. Note that associated entities that do not match any connected nodes suggest additional connections, which is different from validation, as we seek to confirm existing connections.
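Assuming the map is stored as an undirected networkx graph and that the previous steps produced, for each node, a set of associated entities and a set of expanded forms, step (7) reduces to a neighborhood intersection (variable names assumed):

```python
# Sketch of step (7): an edge (u, v) of the conceptual model is confirmed
# when the entities associated with u overlap the expanded forms of v, or
# vice versa. `associated` and `forms` map each node to a set of words.
import networkx as nx

def confirmed_edges(model, associated, forms):
    confirmed = set()
    for u, v in model.edges:
        if (associated.get(u, set()) & forms.get(v, set())
                or associated.get(v, set()) & forms.get(u, set())):
            confirmed.add((u, v))
    return confirmed

# e.g., len(confirmed_edges(...)) / model.number_of_edges() gives the
# fraction of the map validated by the corpus
```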

Fig. 3. Alternative view of our process, including libraries and APIs.

This process is also depicted in Fig. 3, listing the libraries that can be used for each step. The specific versions of the libraries used in our experiments are included in Sect. 4.

4 Comparing Conceptual Models from Twitter and Expert Reports

4.1 Datasets and Pre-processing

The conceptual model that we seek to validate was developed with the Provincial Health Services Authority (PHSA) of British Columbia to explore the interrelationships involved in obesity and well-being. The model was presented in 2015 at the Canadian Obesity Summit [24] and tested with policy makers in 2016 [29]. The model is now part of the ActionableSystems tool [28] and can be downloaded at https://osf.io/7ztwu/ under ‘Sample maps’ (file Drasic et al (edges).csv). The model consists of 98 nodes and 177 edges. From here on, we will refer to it as ‘the PHSA map’.
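As an illustration, such an edge list can be loaded as an undirected network in a few lines; the CSV layout assumed below (one edge per row, two concept columns) is a guess about the file’s format rather than a documented specification.

```python
# Hypothetical loader for the PHSA map from the edge list distributed with
# ActionableSystems (https://osf.io/7ztwu/); column layout is assumed.
import csv
import networkx as nx

model = nx.Graph()
with open("Drasic et al (edges).csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            model.add_edge(row[0].strip().lower(), row[1].strip().lower())

print(model.number_of_nodes(), model.number_of_edges())  # expect 98 and 177
```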

To validate the PHSA map, we used two datasets. Our first dataset (‘the Twitter dataset’) consists of 6,633,625 tweets in the English language on obesity, collected from Oct. 2, 2018 to Oct. 4, 2018. The number of tweets was chosen to be in line with comparable studies at the interface of natural language processing and obesity research [38,39,40]. The keywords used to collect the tweets included each of the 98 concept names in the PHSA map as well as their synonyms, automatically retrieved through WordNet. For instance, we used not only ‘obesity’ but also words such as ‘fatness’, ‘corpulent’, ‘embonpoint’ and ‘fleshiness’. Similarly, physical activity was expanded to include many forms such as calisthenics, isometrics, jogging, jump rope, and so on. The rationale is that the map contains abstract concepts, but individuals may speak of specific instances or use a variety of words to describe the same abstraction. After collecting a large number of tweets, natural language applications require extensive pre-processing. The impact of each option (and their interactions) on results obtained from Twitter has been extensively described for sentiment analysis [79,80,81] and for more generic tasks such as classification [82]. Some of these options are summarized in Fig. 4 and include the removal of parts deemed unnecessary for analysis (e.g., hashtags, URLs, numbers, non-English words) or the mapping of data into forms that can be more conveniently processed (e.g., expanding acronyms and abbreviations, replacing emojis, spell checking). The pre-processing options used for our dataset are depicted in Fig. 5. These options were chosen specifically for our research question: for instance, we remove stop words because they cannot be meaningful concept names in a model, but other analyses (e.g., attributing tweets to specific writers) may have kept such words. The order of the steps also matters: for instance, we cannot perform part-of-speech tagging and lemmatization (step 5) before ensuring that all the words have been corrected (step 3). After pre-processing, our dataset included 1,791,333 tweets.
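A minimal pre-processing sketch in the spirit of Fig. 5 is shown below, using NLTK as in our pipeline; the regular expression and the placement of the (omitted) spell-correction pass are illustrative.

```python
# Minimal pre-processing sketch: strip URLs, hashtags, mentions, and
# numbers, then remove stop words and lemmatize. Order matters: spelling
# would be corrected before lemmatization (step 3 in Fig. 5).
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+|[#@]\w+|\d+", " ", tweet.lower())
    tokens = [t for t in word_tokenize(tweet) if t.isalpha() and t not in STOP]
    # a spell-correction pass would run here, before lemmatization
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("Tackling #obesity: 2 new studies https://example.org"))
```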

Fig. 4. Typical pre-processing techniques applied to tweets.

Fig. 5. Pre-processing techniques applied to our Twitter dataset in a specific order. We used a Spell Checker library in step 3, the Natural Language Toolkit (NLTK) for steps 1–4, and the Stanford CoreNLP library for step 5.

Fig. 6. Average number of edges confirmed (out of 177 in the PHSA map) for each combination of parameter values over ten experiments.

The second dataset is formed of three reports on obesity: the 2010 report from the White House Task Force on Childhood Obesity [83], the 2013 report to the Provincial Health Services Authority [84], and its 2015 update (whose findings are published in [24]). We combined the three reports with the PyPDF2 library, leading to 310 pages, and we kept 247 pages after removing those that were either blank or only contained images. Pages were then transformed into raw text using the pdftotext library and divided into 4,302 sentences using the full stop (‘.’). Pre-processing was finally applied, using the same script as for tweets, while noting that several options such as removing emojis would not be triggered. The resulting dataset had 3,447 sentences.
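A sketch of this report pipeline follows, under stated assumptions: we combined PDFs with PyPDF2 and extracted text with pdftotext, whereas for brevity the sketch uses PyPDF2 (classic pre-2.0 API) for both, and the file names are made up.

```python
# Sketch of the report pipeline: extract text per page, skip blank or
# image-only pages, and split on the full stop. File names are assumed.
import PyPDF2

sentences = []
for path in ["whitehouse_2010.pdf", "phsa_2013.pdf", "phsa_2015.pdf"]:
    reader = PyPDF2.PdfFileReader(open(path, "rb"))
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText()
        if text.strip():  # blank or image-only pages yield no text
            sentences += [s.strip() for s in text.split(".") if s.strip()]

print(len(sentences))  # sentences then go through the same pre-processing
```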

4.2 Validating the Model for Each Dataset

The methods introduced in Sect. 3 are implemented in Python, relying on the libraries listed in Table 1. While our implementation was able to cope with millions of tweets, we note that a larger volume of data may also require a distributed database architecture and an efficient search engine such as Elasticsearch [85].

Table 1. Libraries used in each step (Sect. 3) of our experiments.

Our approach has three parameters: the number of themes, the number of words per theme, and the tf-idf threshold to eliminate noise. Hyperparameter optimization was thus necessary to use each dataset most efficiently and to fairly compare their potential in validating a model. To optimize performance with expert reports, we performed a grid search by varying the number of topics and words per topic from 5 to 50 in increments of 5, and the tf-idf threshold from 2 to 9 in increments of 1. This resulted in 800 combinations of parameter values. As there is randomness in the LDA model, we performed ten experiments per combination of parameter values, leading to a total of 8,000 experiments. At most, our process validated an average of 136.5 edges (77.11% of the map) using 50 topics, 50 words per topic, and a tf-idf threshold of 8 (Fig. 6).
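The following is an illustrative version of this grid search; `validate`, which stands in for the full seven-step process and returns the number of confirmed edges for one run, is hypothetical.

```python
# Illustrative grid search over the three parameters, with ranges from the
# text: 10 x 10 x 8 = 800 combinations, ten runs each (8,000 experiments).
import itertools
import random
import statistics

def validate(n_topics, n_words, threshold):
    # Hypothetical stand-in for the seven-step process: one stochastic run
    # returning the number of confirmed edges (out of 177).
    return random.randint(0, 177)

def mean_score(params, runs=10):
    # ten runs per combination because LDA is stochastic
    return statistics.mean(validate(*params) for _ in range(runs))

grid = itertools.product(range(5, 55, 5),  # number of topics: 5..50
                         range(5, 55, 5),  # words per topic: 5..50
                         range(2, 10))     # tf-idf threshold: 2..9
best = max(grid, key=mean_score)
print(best)
```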

A grid search was also performed on the Twitter dataset. However, our current implementation takes approximately five days to compute the results for one combination of parameter values (a single experiment), using a server-grade workstation (Dual Xeon Gold 6140). Given this limitation, we used single experiments and a coarser grid. At most, our process validated 101 edges (57.06%) using 50 topics, 50 words per topic, and a tf-idf threshold of 9.

5 Discussion

A focus group with a few participants may only discuss some of the interrelationships at work in overweight and obesity, and participants may avoid sharing opinions that are potentially disapproved of by others. In contrast, social media such as Twitter provide access to a massive number of participants who can use conditions of anonymity to share opinions more freely. Social web mining applied to Twitter thus comes with the potential to explore many interrelationships in an unobtrusive fashion. In particular, crowdsourcing over Twitter holds the promise of easily building large conceptual models, under the assumption that at least some groups of users will touch on each part of the model. Our study questions this potential and promise by analyzing whether millions of tweets are more useful for developing a conceptual model of obesity than a handful of reports.

Although conceptual models can be automatically compared [78], developing a model from each dataset (tweets vs. reports) and comparing them would not tell us which one is ‘better’. Our study question thus requires a reference. We use a previously developed conceptual model of obesity and well-being as this reference, and we establish how much of this model would have been obtained if we used either tweets or reports. In other words, we measured the percentage of the model’s structure that is confirmed by each dataset.

While both datasets were able to cover over half of the model, we note that this took only three expert reports on one side, compared to millions of tweets on the other. In addition, despite the abundance of tweets, the three expert reports touched on more relationships. Within our application context, these results suggest that an exclusive reliance on social media may result in oversimplifying a complex system, thus limiting the potential to automatically develop models using such a source. We note that a comprehensive analysis across subjects and using a variety of maps would be needed to assess whether our results, produced on one model (the Provincial Health Services Authority map) and one application subject (obesity), can be generalized to other models and subjects.

There are several limitations to this study, which we intend to address in our future research. First, one of the premises of big data research is that a large volume may compensate for many imperfections in the individual data points. Although we used a similar number of tweets to other studies at the interface of natural language processing and obesity research [38,39,40], it is possible that some of the interrelationships of the model we seek to validate are rare and thus only detectable in even larger datasets. Repeating this study with significantly larger datasets could elucidate this question. However, we then run into the second issue: our process to validate a causal map against textual data is very computationally intensive. The search space to optimize the result is defined by three parameters, and the process involves randomness, thus requiring several experiments for each combination of parameter values. On a server-grade workstation, a single combination with a CPU-based implementation requires on the order of days. Optimizing results and using larger datasets will thus require implementations that scale, with a particularly promising option consisting of a GPU-based implementation. Alternatively, we may reduce the search space if we can better characterize the impact that parameters generally have on the results, and then devise more computationally efficient processes. For instance, the tf-idf threshold plays an essential role in driving performance (Fig. 6) but may be replaced by additional pre-processing steps preventing the inclusion of noise, such as classifiers removing unwanted documents [87].

6 Conclusion

Both social media data and expert reports may be used to take into account popular perspectives and expert opinions when creating large conceptual models. In the case of obesity, we found that three expert reports confirmed 77% of the model’s relationships, while millions of tweets on obesity and its cognates covered fewer interrelationships. Creating models using social media only may thus result in an oversimplification of complex problems.