Introduction

The field of data science is rapidly growing, and the need to address the data science skills gap has caused an explosion of new program offerings at the undergraduate, graduate, and workplace training levels. Given the relative newness of the field, there is still much work to be done to develop specific curricular and pedagogical approaches to teach data science practices. For example, Donoho (2017) has advanced a vision for data science education in which learners explore authentic problems that require the use of real-world datasets, formulating and carrying out a sustained, coherent investigation that results in a tangible data analysis project. This vision of authenticity and problem-based learning motivates us. While residential education tends to serve smaller, more homogeneous populations that lend themselves to shared motivations and experiences, we are specifically interested in understanding how authentic learning experiences can be delivered to larger, global data science student bodies that are more heterogeneous with respect to culture and background. Indeed, given the prevalence of data science courses available in large-scale online settings such as massive open online courses (MOOCs) (see Hood et al., 2015), we are concerned that a lack of sensitivity to individual learner background with respect to the data being investigated may undermine the equity goals of the global platforms themselves.

In this work, we take one small step towards understanding how to support and motivate global learners with data science activities that are personalized for them. We draw on ideas from pedagogical approaches presented in the literature, including project- and problem-based learning, along with work on culturally responsive pedagogies. At the same time, we must adapt these approaches, which are often high-cost and require significant interaction with learners, to the large scale of our activities. This scale refers not just to the number of students we aim to support, but also to the diversity of cultural and geographic backgrounds among these learners.

To this end, we have conducted two studies to understand how problem-based learning and culturally responsive approaches may be used as models to build culturally relevant personalization activities at scale. We do not comprehensively apply all aspects of these approaches, instead we focus on gathering baseline knowledge about how we might situate learners in data science activities in ways which leverage intrinsic motivation. The diversity of MOOC audiences provides a significant opportunity to experiment with mechanisms for inclusiveness in teaching and learning (Kizilcec & Brooks, 2016) and, through large scale experimental methods, our work aims to identify whether there are benefits to including geographically and culturally sensitive datasets both in summative and formative learning activities.

Our research is situated within informal skills-based data science education, specifically the first two courses within a five-course series on Applied Data Science with Python. The first of these courses focuses on building skills related to data manipulation, cleaning, and processing using the Python language, and serves as a broad overview of the field. The second course includes both theory related to information visualization as well as skills using the associated Python plotting libraries. Both courses are taught on the Coursera platform and are made up primarily of video-based instruction backed by discussion forums. The lectures and the assignments in these courses use the web-based Jupyter notebook authoring environment (Kluyver et al., 2016). This environment encourages the production of computational narratives (Pérez & Granger, 2015): document-based artifacts which blend source code, the output of this source code, and natural language descriptions of the problems being solved.
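To illustrate the style of artifact learners produce, the following is a minimal, hypothetical sketch of one notebook cell; the file name and column names are invented for illustration, and the narrative text would normally appear in an accompanying Markdown cell.

```python
# Computational narrative (sketch): the question being explored would
# normally appear as Markdown, with code and rendered output interleaved.
# Question: "How much warmer were recent summers than earlier ones?"
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("daily_temperatures.csv", parse_dates=["date"])
df["decade"] = (df["date"].dt.year // 10) * 10

# Mean June-August temperature by decade, printed as the cell's output.
summer = df[df["date"].dt.month.isin([6, 7, 8])]
print(summer.groupby("decade")["temperature_c"].mean())
```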

Importantly, our work here is situated with the goal of understanding how we can engage in culturally relevant instruction in scaled diverse learning environments. As the cultural diversity of the learner population grows, the ability to respond in a deeply cultural way is a challenge, and our work explores this largely through geographical proxies of culture. Thus throughout this work we explicitly describe personalization we have done as one which is limited based on location features, and we do not consider other aspects of the student background in our interventions.

In our first study, we explore the relationship between location-based datasets which were geographically related to the location of the learner and the types of engagement the learner had within the course. A bulk of literature points to a lack of persistence and completion in MOOCs (Andres et al., 2018; Kizilcec & Halawa, 2015; Whitehill et al., 2017), and we explored whether these contextual assignments would improve these measures, more specifically aiming to answer the questions:

  1. Do students who are given personalized datasets have higher rates of completion, higher grades, higher levels of satisfaction, or greater interaction with content?

  2. How does receiving either a personalized or non-personalized dataset impact learners’ affect, motivation, and performance?

Whereas our first study is embedded directly within summative course activities, our second study was formative in nature, and was delivered in the form of optional culturally relevant data science problems that students could complete. We increased our tailoring of messages and narrowed our target students to those in the United States and India in order to deepen comparisons between two subpopulations. Specifically, we aimed to answer:

  1. Will learners be more likely to return to and interact with Data Science course materials if they are communicated as culturally responsive problems?

  2. What do learners value (or not value) in these culturally responsive problem sets?

Given the lack of evidence for culturally relevant personalization in the field of scaled online learning, both of our investigations were exploratory in nature, and we engaged in both quantitative and qualitative analysis techniques. Before describing the study details, the next section will provide a brief review of our motivational theoretical lenses – project- and problem-based learning and culturally responsive pedagogies.

Theoretical Lenses

Our goal of supporting and motivating data science learners led us to consider problem-based and project-based learning approaches as a mechanism that could, in part, motivate students as they engage in new activity. Data science practices have many similarities to contemporary descriptions of science and scientific practices. For example, some of the ways data can be used to answer driving questions, along with data collection and analysis practices, are described in the Next Generation Science Standards (NGSS Lead States, 2013). As we think about data science educational activities in the MOOC context, we have looked at how we can incorporate elements of problem-based learning (e.g., Barrows, 1985) or project-based learning (e.g., Blumenfeld et al., 1991; Krajcik & Blumenfeld, 2006). While problem- and project-based learning (PBL) have differences, at a higher level they can both be characterized by curricular designs that contextualize the practices and skills being learned in problems or larger projects that are many times defined and driven by learners themselves. An integral part of this process involves learners making decisions about what to measure, how to measure it, and then becoming involved in the data collection process. This activity can bring learners in close proximity to the data they are using in their investigations, which can serve to motivate the application of scientific practices that allow them to actively explore their driving questions and ultimately develop a solution for their investigation. Despite the potential advantages that PBL approaches may have for student motivation, the literature has also detailed how such approaches can be complex, time-consuming, and difficult to implement in classroom settings (e.g., Spoelstra et al., 2014).

Supporting and motivating data science learners in the MOOC context also led us to examine the work on culturally responsive pedagogies, given the broad diversity of learner backgrounds in these environments. As MOOC environments continue to draw in learners from around the world, it is critical that MOOC instructors actively consider ways to create an environment that is supportive and meaningful to all learners. Culturally responsive pedagogies value the wealth that diverse learners bring into the classroom and encourage instructors to shape their teaching practices and forms of assessment in ways that are responsive to learners’ lived experience (Ladson-Billings, 1995). Much of this work has taken place within urban K-12 settings with African American and Hispanic learners (e.g., Garcia & Chun, 2016), although there are also examples of culturally responsive models for Native American learners in higher education (e.g., Shotton et al., 2013). Despite evidence that instructors can improve outcomes for diverse communities, higher education institutions have struggled to implement culturally responsive pedagogies because they require significant learner-instructor engagement and extensive instructor training (Shotton et al., 2013).

Despite the challenges associated with these two approaches, their potential benefits for learner motivation should be kept in mind. For example, Blumenfeld et al. (1991) provide a classic view that has grounded many PBL approaches, describing how PBL can serve as a motivating mechanism for learners as they engage in varied and novel activities within a challenging problem context that they perceive to be authentic and valuable. Others, such as Engle and Conant (2002), discuss how educators can foster “productive disciplinary engagement” by having learners develop disciplinary problems to explore and then giving them the authority and resources needed to develop solutions to those problems: (1) situating the work within the learner’s context, environment, and interests, and (2) noting that such situated work can increase student motivation to engage in the activity and carry out the practices they are learning in that context. These ideas also fit with the use of “anchor phenomena”–phenomena that are complex, observable, relevant, and compelling to learners, and that can include data to engage students in a range of ideas to drive investigations–to develop instruction that is more coherent and compelling to learners engaged in science practices (Penuel & Bell, 2016). Similarly, culturally responsive pedagogies have the potential to boost motivation, not only in the immediate learning activity, but also in the application of skills after the course is finished (Ladson-Billings, 1995).

Towards Implementing Traditional Pedagogical Approaches at Scale

As we have described, PBL methods and culturally responsive pedagogies have been mostly implemented within K-12 classroom contexts up until now. Perhaps because of the challenges that accompany implementation (and even more so at scale), there have been few efforts to enact a PBL approach in a MOOC. In one example, Verstegen et al. (2016) explored ways in which a PBL MOOC could be structured with more small group interaction, and questions and assignments that emerge from student activity. Haklev (2016) enacted a complex MOOC design that orchestrated the collaboration of several thousand in-service teachers, according to subject matter and grade level taught. Haklev’s (2016) course design supported the work of groups to produce co-constructed artifacts that were tied to the disciplines represented by the group. These efforts have focused mostly on the curricular structure of the MOOC itself, but have not focused on the impact of these curriculum designs on learner motivation. While leveraging culturally responsive pedagogy has been characterized through small individualized approaches to content creation, learning at scale has focused on large homogeneous (e.g. video) content creation.

From a data science education perspective, it is fruitful to consider how learners’ use of data that connects to personally meaningful projects can be motivating (Lee & Wilkerson, 2018). We can envision how customizing data science activities to allow learners to engage with data and problems in a more authentic, personally meaningful manner might have the potential to impact a large, diverse set of learners. For example, in one secondary science classroom, educators supplied learners with a complex governmental dataset from the United States Geological Survey that allowed learners to investigate earthquakes in their area (Kerlin et al., 2010). Such personally meaningful learning in Data Science can be understood as a type of personalized learning. There have been multiple studies on personalization, including Williams et al. (2014), which suggested a MOOClet framework for personalizing the user experience of the edX platform based on learning goals. Kizilcec and Saltarelli (2019) also examined a type of personally meaningful design from a gender perspective: the impact of psychologically inclusive design on science, technology, engineering, and math (STEM) courses. The authors found that the design increased female learners’ enrollment in the courses.

Furthermore, as we consider the range of datasets now available to learners and educators, we can also determine how such data can support the development of a variety of problems for learners to explore, thereby better connecting to learners’ backgrounds, culture, and interests. As we think about more diverse sets of learners in the MOOC context, larger-scale data availability allows for the development of a wider range of problems for learners to explore within a data science educational context. We can then think about how to situate a learner’s activity in different ways—by the data that learners use, and by the types of problems that learners can explore through the Data Science activities that they are learning.

Research Context

We use PBL and culturally responsive approaches as models to think about online Data Science education and how we can support learners to engage in these new disciplinary skills in projects that are authentic and meaningful to them. We do not claim that we comprehensively apply all aspects of these approaches, but we do employ them in support of our overall goal of gathering baseline knowledge about how we might situate learners in Data Science activity in ways that leverage intrinsic motivation. An important goal of our research is to take advantage of the increased availability of large-scale open data and educational technologies used in Data Science contexts to explore aspects of PBL approaches and culturally responsive pedagogies at scale.

In the two studies we present, the principal mechanism for personalizing the learning experience has been the learners’ geographic location, which we determine by using the IP address. While there are potential inaccuracies with this approach, such as the use of virtual private networks (VPNs) to access location-restricted content, or the movement of learners as they change learning locations, we anticipate that the effects are small and uniformly distributed amongst the populations and experimental conditions discussed below. Therefore, here we will use the term “personalized” to mean localized, geographic data that is matched to a learner’s location. In the case of the first study this is geographic data alone, while in the second study we explicitly create new sociocultural content (e.g., not simply language translation) for the learners. In our studies we do not differentiate between students who are paying for the courses (a minimal monthly fee which can lead to a certificate) versus auditing courses (e.g., not paying), and we include in the study populations learners who are currently engaged with the course as well as, for the second study, those who have finished the course content.
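The studies do not specify the lookup mechanism used to map an IP address to a location; a minimal sketch of one common approach, using an offline MaxMind GeoLite2 database (an assumption, along with the database path and function name), is shown below.

```python
# Sketch: resolve an approximate learner location from an IP address using
# an offline GeoLite2 city database. This is an assumed implementation,
# not the one actually used in the courses.
import geoip2.database

def approximate_location(ip_address, db_path="GeoLite2-City.mmdb"):
    with geoip2.database.Reader(db_path) as reader:
        record = reader.city(ip_address)
        return {
            "country": record.country.iso_code,                # e.g., "KR"
            "region": record.subdivisions.most_specific.name,  # e.g., "Seoul"
            "city": record.city.name,
        }

# Learners using VPNs or travelling would be mis-located, which is the
# source of the noise discussed above.
```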

Study 1: Personalized Problems in Large Scale Learning Scenarios

The goal of the first study was to explore the impact of providing personalized data within graded problems relating to information visualization (i.e., course two of the specialization). Learners were asked to create visualizations based on regional weather data and to consider long-term trends with respect to this data; we anticipated that learners could, for instance, explore the weather patterns of their own city and intuitively know whether the results of their queries seemed appropriate.

Design of Study 1

We chose datasets based on weather patterns over a ten-year history. Learners were randomly assigned to one of two conditions: in the personalized condition, they received data from their region of the world, and in the control condition, they received data from the instructor’s region of the world (Ann Arbor, MI, USA). The size of the regions varied and was determined in part by the limitations of the dataset (e.g., rural China weather data was limited, so the region was increased for these learners to a larger portion of China, but learners in a populous region such as Beijing would have seen data from within a small portion of the city).

Learners were given a piece of sample code that showed them how to load the dataset they were to visualize. We grounded the notion that this data was from a given world region by embedding a map of the weather sensors from which the data was collected (see Fig. 1). Learners were then asked to write Python code to draw a line graph from the provided annual data, which recorded the daily high and low temperatures for the given region. Assignment completion was required, but it alone was not sufficient to obtain a certificate in the course.
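A minimal sketch of the kind of solution the assignment calls for is shown below; the file name and column names are placeholders, and the actual assignment provided its own scaffolding code and regional dataset.

```python
# Sketch: plot the daily high and low temperatures for the assigned region.
# File and column names are placeholders for the provided regional dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("region_weather.csv", parse_dates=["date"])
df["day_of_year"] = df["date"].dt.dayofyear

# Highest and lowest recorded temperature for each day of the year.
highs = df.groupby("day_of_year")["temp_max_c"].max()
lows = df.groupby("day_of_year")["temp_min_c"].min()

plt.plot(highs.index, highs.values, label="Daily high")
plt.plot(lows.index, lows.values, label="Daily low")
plt.fill_between(highs.index, lows.values, highs.values, alpha=0.2)
plt.xlabel("Day of year")
plt.ylabel("Temperature (°C)")
plt.legend()
plt.show()
```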

Fig. 1
figure 1

An example assignment for a learner from the city of Seoul, Korea. Bolded text indicates the geographical origin of the dataset, and the large map of the region with markers shows where data was collected

Participants

Participants were recruited through participation in the course from February 14th, 2017 to February 4th, 2018. A total of 3502 learners attempted the assignment.

Approach to Analysis

To address our first research question, we compared the two populations with respect to the number of assignment submissions, final grade, satisfaction, and the number of lecture video clickstream entries. Satisfaction was measured through a 5-star rating that learners gave through the vendor portal after completing the course. Clickstream data analysis was limited to data obtained after the assignment was first accessed (week two of the course). Bayesian statistical analysis was used so that statistical significance was not affected by the large sample size.
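The exact Bayesian model is not detailed here; one simple formulation for the binary submission outcome, offered as an assumed sketch rather than the study’s actual model, is a Beta-Binomial posterior for each condition, with the credible interval of the between-group difference obtained by sampling.

```python
# Sketch: Beta-Binomial posteriors for submission rates in the two
# conditions and a 95% credible interval for their difference.
# The counts below are placeholders, not the study's data.
import numpy as np

rng = np.random.default_rng(0)

def posterior_rate(successes, trials, n_draws=100_000):
    # Beta(1, 1) prior gives a Beta(1 + s, 1 + n - s) posterior.
    return rng.beta(1 + successes, 1 + trials - successes, n_draws)

personalized = posterior_rate(successes=900, trials=1750)  # placeholder
control = posterior_rate(successes=880, trials=1752)       # placeholder

diff = personalized - control
print(np.percentile(diff, [2.5, 50, 97.5]))  # 95% credible interval
```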

To address our second research question, we sent learners a survey that included three open response questions that asked them to reflect on their experience using the dataset that was provided to them for the assignment (i.e., either personalized or control datasets). Questions were related to how learners felt about receiving the dataset (affect), the extent to which receiving the dataset motivated them to continue in the course (motivation), and the extent to which the dataset affected their performance on the assignment (performance).

Survey results were thematically coded, separating responses from the personalized condition and the control condition. An inductive coding process from Creswell (2015) was followed to (1) read open responses (n = 194 surveys), (2) create excerpts by identifying discrete segments of information that related to our constructs of interest (affect, motivation, and performance), (3) engage in an iterative process of grouping related segments, defining labels, and reducing redundancy to identify codes, and (4) derive major ideas and themes from code groups. A summary of the coded data collected is shown in Table 1.

Table 1 Distribution of identified codes across the two conditions

Results

We used a Bayesian analysis and present our results as the difference between groups using credible intervals covering 95% of the posterior probability mass. There was no strong difference in whether an assignment was submitted (0 = unsubmitted, 1 = submitted) between the personalized group and the control group (Fig. 2a). Considering the highly overlapping credible intervals of the two groups of learners, there was also no strong difference in final course grades (which ranged between 0 and 1) between the personalized group and the control group (Fig. 2b). Finally, there was also no strong difference in course ratings between the two groups (Fig. 2c).

Fig. 2
figure 2

Credible intervals for the difference in (a) submission rate, (b) final grade, (c) satisfaction, and (d) clickstream data between the personalized and non-personalized groups. The difference was computed by subtracting values for the non-personalized group from values for the personalized group, and we consider evidence to be strong only if the interval for this difference does not include 0; only the clickstream entries (d) meet this criterion. Positive values imply the personalized group is outperforming the non-personalized group, while negative values imply the non-personalized group is outperforming the personalized group. Thin lines represent 95% credible intervals, while thick lines represent 66% credible intervals

Clickstream data has been used as a proxy for user attraction, webpage visit frequency, and webpage dwell time (Sinha et al., 2014). We use clickstream data derived from lecture video accesses as a proxy for video engagement. There was a surprisingly clear difference (at 95% credible intervals) between the two groups; learners in the personalized group clicked links to lecture videos on average 46 fewer times than those in the control group, though this only represents a 6.3% difference in viewing rate (Fig. 2d).

Open response data served as a window into learners’ experiences using either personalized or non-personalized datasets (summarized in Table 2 and Table 3). Learners in both the personalized and control conditions surfaced the following major themes as being relevant to the way that they felt (affect) about receiving their dataset: pragmatics, improved experience, a real-world dataset, and data science practices. Both groups spoke equally about the dataset they received from a pragmatic perspective. They explained that being provided with a dataset lowered barriers to participation, saved time, and allowed them to focus on the tasks necessary to complete the assignment. “It’s nice to be given a dataset so that we could immediately start to apply Python charting skills (and not spend time looking for data).” It was learners in the personalized group that primarily described how using the local dataset improved their experience, describing benefits that included appreciation of having a personalized experience, a deeper understanding of the topic, and an impression of increased relevance of the task. Learners in both conditions referenced the fact that the dataset they were using for the assignment was a real-world dataset. Similarly, learners in both conditions described how their experience in working with the dataset allowed them to focus on data sciences practices such as plotting and charting.

Table 2 Number of responses coded to the open-ended question “How did you feel about having received a dataset from [The City] for your assignment?”. For the personalized group, [The City] changed depending on the location extracted from learners’ IP address
Table 3 Number of responses to the Likert-scale question “How did you feel about having received a dataset from [The City] for your assignment?”. For the personalized group, [The City] changed depending on the location extracted from learners’ IP address

Learners in both the personalized and non-personalized conditions provided responses that were coded according to the following major themes concerning motivation to complete the course: improved experience, data science practices, pragmatics, and real-world dataset. The distribution of these codes was relatively even between the two groups. These themes relate to the extent to which learners felt the dataset they used motivated them to continue working in the course. Excerpts that were coded “improved experience” provided an expression of benefit related to working with the dataset, including increased enjoyment and interest, familiarity with the data, and the affordance of being able to share outputs with family and friends. A learner in the personalized condition commented, “Since the data was local, I could show the output to the family. Their interest in the graphic was a motivational aid.” With respect to data science practices, learners in both conditions commented that the dataset they received allowed them to practice relevant data science skills, solve problems, and make progress in the course. As with the first survey question, excerpts that were coded “pragmatics” related to how being provided with a dataset was convenient and expedient for them. Similarly, learners in both groups noted that receiving “real world” datasets was motivating because the datasets concerned authentic problems: “The assignment looked more like a real-world problem to me, and immediate application of knowledge is why I chose this course.”

It was learners in the personalized condition who primarily commented on data science practices with respect to the impact of the dataset on performance. These excerpts related to how using the dataset allowed them to gain practice using data science skills that went beyond the course curriculum and completing assignments. “With the dataset, I could practice more with other skills, even [skills] not from the course.” Learners in both conditions provided responses that mapped to the themes of “existing motivation” and “little or no effect/relevance.” With respect to how the dataset might have influenced or motivated performance, learners explained that the dataset did not necessarily alter their performance on the assignment task because they came into the course with a high degree of motivation. Some learners expressed even stronger sentiments, saying that the characteristics of the dataset had little effect on performance: “To me, data is just data, and the source is not related to how well I learn in the course.”

In conclusion, the distributions of submission rates, final grades, and satisfaction after learners viewed the assignment showed no differences between learners in the personalized and control conditions at 95% credible intervals. Further, while there was an effect of the intervention on clicks on lecture videos, the magnitude of this effect was quite small, and it suggests that the intervention slightly reduced clicking on lecture videos for learners who were given the personalized data. Survey results showed that there were some differences in how learners in the personalized condition described their experience working with their dataset. Learners in the personalized condition described an “improved experience” regarding their feelings about receiving the dataset and identified that “data science practices” influenced their performance in the course.

Study 2: Culturally Responsive Problem-Based Email Intervention

The goal of the second study was to explore the potential impact of using culturally responsive data problems and how they might affect a learner’s propensity to re-engage with course materials. A well-known challenge with MOOCs is that completion rates can be low, and while there are many factors involved, we wanted to see if we could ameliorate this issue by using culturally relevant problems that were delivered through email communications. Similar to the first experiment, we considered the context of the learner to be their location, but in this study, the problems learners received could include a broader range of phenomena more relevant to that context, such as entertainment, natural disasters, and environmental sustainability.

Design of Study 2

This study involved a randomized trial with learners who enrolled in the first course in the specialization, which focused on data manipulation methods. Due to the global nature of the MOOC learner population and the representation of this population in our course, we focused on three different regional contexts that were well represented in our subject pool: US-based learners, India-based learners, and a third, more general classification of global learners which did not include learners from the US or India.

All learners received a standardized welcome email when signing up for the course, which was unchanged by researchers. Prior to the start of weeks two, three, and four of the MOOC, US, Indian, and global learners were randomly assigned to receive an email message with a problem that reflected their culture (personalized problem condition), an email with a generic, non-culture-specific problem (global problem condition), an email with no problem (no problem condition), or no email at all (no email condition). Note that global learners did not receive personalized problems because we only created personalized problems for US and Indian users. See Table 4 for explicit treatment assignments. For the problem-type conditions, the weekly email messages were designed as a “call to return” to the course that emphasized the skills that would be learned in a given week and demonstrated the kinds of contextually responsive problems that the coming week’s skills would be used to solve. For all email conditions we randomly included a growth mindset sentence (Dweck, 1986) to evaluate the efficacy of growth mindset framing, making our trial a 2 × 4 factorial design. Growth mindset has been shown in some populations to be beneficial for productive engagement; however, brevity in email calls to action is also a pragmatic consideration, and we aimed to understand (at a pragmatic level) whether there were tradeoffs between these two approaches. Because we found no significant difference when the growth mindset sentence was included, we collapse our results over the growth mindset factor and focus our comparisons on email conditions only.

Table 4 Treatment assignments and randomization probabilities for different users in the trial. The number of actual participants changed per week: 11,850 in week 1, 8,785 in week 2, and 7,802 in week 3

The condition a learner was assigned to in a given week was independent of previous conditions the learners might have been assigned to. Experience in sending email prompts has suggested that the impact of such a prompt is short lived, so considering weekly emails as independent was a reasonable approach.
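As a rough illustration of this weekly, independent assignment, the sketch below uses the condition names described above; the uniform probabilities and code structure are assumptions, and Table 4 gives the actual randomization probabilities.

```python
# Sketch of the weekly re-randomization. Uniform probabilities are an
# assumption; Table 4 lists the probabilities actually used.
import random

def assign_condition(region):
    # Personalized problems existed only for US and Indian learners.
    if region in ("US", "India"):
        email_conditions = ["personalized_problem", "global_problem",
                            "no_problem", "no_email"]
    else:
        email_conditions = ["global_problem", "no_problem", "no_email"]
    email = random.choice(email_conditions)
    growth_mindset = random.random() < 0.5  # second factor of the 2 x 4 design
    return email, growth_mindset

# Each week's assignment is drawn independently of previous weeks.
print(assign_condition("India"))
```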

The email messages were designed with several attributes, including a social identity prime, a high level problem statement, identification of the learning objectives addressed in the problem, context-specific scaffolding sample code for the question provided, and a sign-off from the instructor that includes a link to the solution. Figure 3 shows an annotated example of an Indian contextual problem. All of the emails were created by the instructor of the course. The subject line for each week was the same across all email conditions.

Fig. 3
figure 3

An example of an email sent to learners assigned to the Indian context condition. Sections of the email follow categories previously described, and are annotated for publication to better highlight the different elements of the email structure

Learners who had received emails in any of the eight conditions were sent a follow-up survey with four open response questions that asked them to recall the extent to which they remembered the emails, to describe what they remembered and what they found interesting or disinteresting, and to identify the kind of problem sets and data that the instructor should include in the weekly email format.

Participants

Participants were recruited by signing up for the “Introduction to Applied Data Science with Python” course on Coursera between April 1, 2018 and June 10, 2018. A total of 15,037 unique learners were sent 28,446 emails.

Approach to Analysis

A logistic regression approach was taken for the analysis of data. We tested each condition (problem-type or no-email) in an exploratory fashion, and we used alpha = 0.2 as an indicator of moderate evidence, and alpha = 0.1 as an indicator of strong evidence. Open ended survey responses (n = 76) were coded using the same inductive coding process described for Study 1.
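A minimal sketch of the kind of logistic regression fit is shown below; the analysis file, column names, and use of statsmodels are assumptions about the analysis pipeline rather than a description of it.

```python
# Sketch: logistic regression of a binary engagement outcome on email
# condition for one region/week cell, as summarized in Tables 5 and 6.
# File and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("email_trial.csv")  # hypothetical analysis dataset

subset = df[(df["region"] == "India") & (df["week"] == 2)]
model = smf.logit(
    "clicked_course ~ C(condition, Treatment(reference='no_problem'))",
    data=subset,
).fit()

print(model.params)   # log odds ratios relative to the baseline condition
print(model.pvalues)  # compared against alpha = 0.2 and alpha = 0.1
```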

Results

We found that the region a learner was in (US, India, or Global) significantly affected the propensity for that learner to open emails. Even though the subject line for a given week was the same across all email conditions, US based learners were most likely to open emails (41.7% open rate), while Indian users were least likely to open email (27.0% open rate, p value <0.0001 for difference in proportions between Indian and US learners).
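A difference in open rates of this kind can be checked with a standard two-proportion test; the counts below are placeholders chosen to match the reported rates, not the study’s actual cell sizes.

```python
# Sketch: two-proportion z-test for the US vs. India email open rates.
# Counts are placeholders, not the study's data.
from statsmodels.stats.proportion import proportions_ztest

opened = [2500, 1620]  # emails opened: [US, India] (placeholder)
sent = [6000, 6000]    # emails sent:   [US, India] (placeholder)

z_stat, p_value = proportions_ztest(opened, sent)
print(z_stat, p_value)
```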

A core question we sought to answer concerned the effectiveness of the different kinds of emails at engaging learners such that they would return to the course, in order to understand the impact of communications that included culturally responsive problems. To answer this, we engaged in two regression analyses, the first measuring whether links were clicked in the email (summarized in Table 5) and the second measuring whether learners clicked on anything in the course platform during the following week (summarized in Table 6). We break down both analyses by the week of the course in which learners received the emails. Thus a learner who is in the global population (e.g., not from the US or India) and receives an email at the end of the first week providing stimuli for returning to the course in the following week will be used in the estimate in the upper left cells of the table (depending on which email they might have received). In both tables, positive values indicate an increase in the likelihood of clicking a link in the email (Table 5) or clicking something in the course (Table 6) when compared to the baseline; negative values indicate a decrease in likelihood.

Table 5 Log odds ratios for logistic regression model with outcome being the binary measure of (1) the learner clicks on any link in the email vs (0) the learner does not click on any link in the email. The baseline is no problem emails
Table 6 Log odds ratios for logistic regression model with the outcome being the binary measure of (1) the user clicks anything in the course during the week after receiving an email vs (0) the user does not click anything in the course in the subsequent week. The baseline is no email

For the first analysis, understanding how learners engage with links in the emails, we set our baseline comparison population as those learners who received a non-problem based email. We exclude learners from the control condition, those who did not receive emails, from the analysis, as they are unable to click on links in the emails. We see one strong and repeated result, which is that global learners are more likely to click on a link within the email if it is problem-based (finding 2.1). The other results are mixed; while there seems to be little effect on US learners between getting problem or non-problem-based emails, Indian users are negatively affected in the first week (regardless of whether the email is personalized or not), and positively affected in the second week of the course, especially for personalized emails.

To understand if learners engage with the course after receiving an email, we set our baseline comparison population as those learners who did not receive any email intervention at all (control condition). The intent is to capture the effect of learners having seen an email and re-engaging with the course, but not necessarily doing so directly from a link within the email.

The effects of interventions are much smaller. The first interesting result is that the effect on US-based learners is non-significant across all emails except for the personalized email in the last week of the course, and that the log odds ratios are small and in many cases negative (finding 2.2). This suggests that sending emails of any type to US-based learners has no effect, or even a slight negative effect, on having them return to the course in the subsequent week. The second finding is that the impact of emails on Indian learners in weeks 2 and 3 is largely positive, but the impact of receiving a no problem email (e.g., a reminder to come back to the course) is as good as or better than receiving either global or personalized problem-based emails (finding 2.3).

Open response data (see Table 7 for summary statistics) provided insight into what learners valued or did not value about the content of the email questions. For the survey question regarding what topics learners found most interesting, learners focused on data science practices, such as hypotheses testing and merging data frames. Most learners stated they did not find any topic disinteresting. However, one learner commented that although the details of the problem were interesting (analysis of data for earthquakes in the US), they “thought it was too theoretical for an email.” Some learners expressed frustration that they were unable to engage with the materials presented because they were “busy trying to complete the course and didn’t have enough time to ‘play around’ with the topics presented in the emails.”

Table 7 Distribution of identified codes across the two conditions

Learners also provided insights on the kinds of datasets and problems that they would like to see in the emails. Five themes emerged from our qualitative coding of the data: personal preference, value for learning, professional identity, personal identity, and for the greater good. Learners spoke about how they valued problems related to personal preference or interest, either at the individual level or the collective level: “Anything that relates to our lives and routines, because we will understand them better and it will also interest us more.” Learners articulated aspects of problems and datasets that they valued because they would increase the potential for learning about a data science topic: “The datasets must include instructions for processing for beginners.” Learners also suggested that problem sets and/or data should be related to professional or personal identity. With respect to professional identity, learners commented on how working with relevant datasets could support their professional development: “A particular interest of mine is analysis of time series datasets. As a transportation researcher, I continually deal with cleaning, process, and analysis of time series datasets.” With respect to personal identity, learners suggested that the emails should relate to learners’ own demographic details or individual characteristics: “I’m kind of a tree-hugging environmentalist. I would have been interested in looking at a datasets involving the environment and pollution.” Finally, learners described how datasets and problems could be used for the betterment of individuals within the course and society at large. For instance, one learner suggested that the instructor provide health datasets for learners to use; if through a process of working with these datasets learners see a link from health choices to health outcomes, then learners might make better choices. Several learners suggested that by working with crime data or satellite imagery, learners could work together to investigate or solve a problem that is of importance to a particular region.

Discussion and Future Directions

Findings from Study 1 showed that some learners valued receiving personalized datasets (as shown by the qualitative analysis), but that this finding was not surfaced through the statistical analysis. While the comparison of assignment submission rates, final grades, and course satisfaction between the personalized and control groups did not yield statistically significant results, a difference was observed in the personalized group’s behavior with respect to clickstream entries to lecture videos. It is not altogether clear why learners in the personalized condition engaged in fewer video entry clicks than the control condition (a 6.3% difference), although future studies could analyze video watching behavior at a finer level. Survey results provide a more nuanced understanding of the experience of both the personalized and control groups. The personalized group remarked on how their experience was improved because they enjoyed working with data that was familiar to them, data that related to their geographic location. This observation about the utility of familiar datasets resonates with early PBL discussions (e.g., Blumenfeld et al., 1991) along with more contemporary ideas about how the design of learning tools and activities should connect to learning goals and learner interests to make the activity more meaningful (e.g., Krajcik et al., 2008). From our thematic analysis, we can see that learners identified some benefits from working with a personalized dataset, but that this benefit amounted to a “nice to have” and did not greatly impact either their motivation to continue in the course or their performance in the weather data assignment. This result speaks to the existing motivation that many learners held prior to starting the course. This is a slight contrast with the PBL literature, where personally meaningful activity in K-12 settings is included to motivate learners who may not necessarily be motivated to explore their classroom activities otherwise. Here, we see varying levels of motivation. The relevant, meaningful datasets do motivate some learners, but other learners indicated that they were eager to apply the data science skills that they were learning in the course to workplace tasks--they are already motivated to engage in the activity. This insight resonates with Hood et al.’s (2015) finding that Data Science MOOC learners who were data professionals or seeking a higher education qualification possessed higher self-regulation abilities than those who were not. Finally, learners in both conditions commented on how they appreciated that the dataset they were provided was authentic and was not invented for the purposes of the assignment.

Findings from Study 2 revealed differences among the three groups (US, Indian, and Global) in both learners’ propensity to engage with the emails themselves (i.e., open them) and to engage in course activities after opening the emails. Results from our first analysis were unexpected, and they relate to learners’ willingness to engage with the intervention by opening the email itself. This has immediate implications for personalization: even before content and data are personalized to the learner, one must consider whether the intervention itself is likely to be attended to by the learners who receive it. This speaks further to the idea that cultural context matters, and instructors cannot assume that communications will be received in the same way by all learners. Results from our second analysis show nuance both in terms of how learners in different conditions respond differently (i.e., Indian learners responded very positively to culturally responsive content), and in that the timing of when these datasets and problems are received in the course matters (e.g., Indian learners responded positively in week 3, but not week 1). Learners may have felt overwhelmed by receiving problems from the instructor that were in addition to regular course activities. The finding that the impact of receiving a “no problem email” (e.g., a reminder to return to the course) is as good as or better than receiving a global or personalized email may relate to this “overload” issue, both in terms of the length of the email and the additional task that is provided. In the second study culturally responsive problems were provided to learners in an email, but it might be worth considering alternative formats, such as discussion groups that are embedded within the course.

The qualitative survey results revealed that many learners described positive impact but did so while talking about specific Data Science skills. This suggests that the course content of a given week might have a strong mediating effect on the usefulness of problem-based emails. For instance, the low values for Indian learner return rates in the first week of the course (where the content of the email was about the basics of data cleaning, Table 6) seem to contrast with the higher positive return-rate values for these learners in weeks two and three (where the content of the course covers more advanced techniques, such as dataframe merging or hypothesis testing). It could be that the introductory material is simply not compelling enough to cause learners to return to the course. In addition, the learner comments on the kinds of topics they would like to see (e.g., environmentalism) suggest that our location-based personalization might not target the best learner characteristic for personalization. It does not seem unreasonable to consider MOOC learners, especially in this course, to be focused on the development of career-relevant skills. Thus providing personalized problems in the form of different applied domains of study (e.g., healthcare, automotive, financial) may have a stronger effect than socio-culturally or regionally targeted datasets. That is, learners might be better motivated if we activated and served their career identity rather than their national identity. To summarize, learners valued problems that made connections to their personal lives and professional identities, and datasets that have the potential to improve personal and societal outcomes.

In thinking about how we can incorporate Data Science problems that have more cultural relevance, we have begun to develop an approach that involves collaborating with MOOC learners who have completed our courses in a program called Mentor Academy (Redacted, 2018). The core of this approach is to scale the generation of authentic culturally-relevant content with the size of the learner community, allowing us to deliver more diverse personalized experiences while encouraging the growth of more advanced (e.g. reflective and metacognitive) disciplinary skills. We see this as a promising approach for drawing on the background and experiences of our global learners, one that could allow us to create culturally responsive problems that are based on real world, local datasets to motivate and engage global learners in MOOCs.

Notes on the Jim Greer Festschrift Special Issue

Dr. Jim Greer was the principal dissertation advisor of the first author, and the comments here reflect that author’s memories of working with Jim and opinions of how this research relates to Jim and his career. Jim consistently aimed to use his research agenda to change the world of education for the better. He delighted in the exploration of a new idea, and jumped at the chance to build new technologies to make it easier for communities to form, for learners to find information, and for education to scale but remain personal. During the author’s time at the University of Saskatchewan, Jim led the creation of dozens of new systems not with the goal of commercialization or personal gain, but with the goal of bettering the students, staff, and faculty at his home public university. His passion for systems spilled outside of the technological alone, and he spent the last years of his career transforming the socio-technical structures at the university as the Director of the University Learning Centre.

Not surprisingly, this work parallels a number of themes of Jim’s own work. It is highly contextual, because a system built without context is a missed opportunity for impact. It focuses on scaling education in ways which are portable to related contexts. And it is measured not just through statistical techniques, but by going to the learners themselves to hear their stories. Most of all, however, is that the approach we have taken was pragmatic in nature, with the goal of bettering teaching and learning for our students.