Introduction

The twenty-first century has seen an expansion in the set of tools available for assessing the quality of a given learning environment (Baker and Yacef 2009; Blikstein and Worsley 2016; Martin and Sherin 2013). A number of the traditional tools (test and quiz performance, speeches and essays) are modes of expression that have been around for centuries and remain the more privileged forms of assessment. For all of their pedagogical shortcomings, these forms of assessment have the benefit of being widely accepted and easy to interpret. However, contemporary learning sciences research is increasingly concerned with additional constructs: motivation, engagement, collaboration, creativity, critical thinking, and problem solving, for example. These are constructs that tend to be much harder to quantify using traditional testing instruments and often necessitate adopting an alternative approach that more closely aligns with the design of constructivist-inspired learning environments (Piaget 1973; Schwartz et al. 2009). By virtue of the breadth of interactions students have with collaborators and various technological resources, traditional tools and metrics are probably not well suited for studying making or other constructionist-based learning environments. Instead, studying these environments likely requires the use of multimodal analysis. That said, the analytic strategies used in this paper are not limited to the study of constructionist learning environments. On the contrary, the approaches could prove to be useful across a broad set of learning contexts, and with a variety of data sources.

In this article we leverage multimodal learning analytics (Blikstein and Worsley 2016; Worsley 2012; Worsley and Blikstein 2013) in order to better understand the complexities of student learning in a hands-on learning environment. In particular, we conducted a two-condition experimental study with 20 participants. The two experimental conditions were principle-based reasoning and example-based reasoning (Worsley and Blikstein 2017). Both principle-based reasoning and example-based reasoning represent instances of analogical reasoning (Gentner and Holyoak 1997; Gick and Holyoak 1980). Where they differ is in the level of abstraction of the analogy. In principle-based reasoning, the analogs are engineering principles. In example-based reasoning, the analogs are real-world structures. In prior work, we observed that principle-based reasoning was associated with higher learning gains and higher quality designs (Worsley and Blikstein 2014b). In the present study, we collected multimodal data and combined it with machine learning to compare differences in multimodal behavior between the two experimental conditions and to correlate those behaviors with success and learning.

In what follows we briefly describe prior literature in multimodal learning and multimodal analysis of learning. We then move on to describe the study from which our data is derived and the analytic techniques used to analyze this data. Finally, we present the results from our analyses and discuss the implications of these findings for future assessments of hands-on learning environments. We conclude with remarks about limitations and future research.

Multimodal Analysis in Education

Multimodal analysis in education is not a new concept. On the contrary, multimodal analysis has been the primary means of analysis for decades of researchers trained in audio/video analysis, ethnography, etc. (for examples see Barron et al. 2013). These researchers carefully analyze individual and group behaviors to interpret the inner workings of various learning environments. One exemplar in this line of work is Kress et al. (2001), which examines multimodality among teachers and students in several science education classrooms. In each classroom, video capture, hand-written student-created artifacts, and field notes are used to study the intersection of the text modality with actions, facial expressions, diagrams and guided noticing (Pea et al. 2004). Among the findings reported is the inability of text to accurately represent models of student learning in complex learning environments. Specifically, Kress et al. writes:

“From our data we can demonstrate that attention to one mode alone fails to capture the meaning of a communicative event; not just that it fails to capture all the meaning, but that it fails to capture the meaning.” (Kress et al. 2001, p. 14)

Kress et al. justifies this claim by presenting several analyses based on triangulating among speech, gestures and diagrams, showing how a given utterance only has meaning in the context of the other events and actions that took place during that time. In leveraging Kress et al. (2001), we would argue that if multimodal analysis is seen as a necessity for understanding student learning in a traditional science classroom, such a requirement becomes even more pronounced in a constructionist learning environment.

Accordingly, this article builds upon Kress et al. (2001), but also features several important differences. At a basic level, whereas Kress et al. used video data and student artifacts, we have the advantage of a wide set of sensory tools that can capture user behavior at high frequency and high resolution. Also, a portion of our analysis is similar to Kress et al.’s identification of characteristic multimodal actions among the population of students analyzed. Their analysis identifies six different “conventionalized forms of action” which appear to have specific utility to the user. Our multimodal analyses will also look for common multimodal behaviors among the population of students that participated in our study. However, because we have more detailed behavioral and socio-emotional information about each participant, our common behaviors incorporate a larger set of modalities, including facial expressions and stress. Invoking analysis from behavior-level data is another example of how this work deviates from that of Kress et al. (2001), which is largely dismissive of behavior-based studies. Given the tools available at that time, behavioral analysis may have been empirically fruitless and intractable. However, one of the things that we show, and have shown in previous work (Worsley and Blikstein 2013), is that multimodal analysis has relevance for studying learning and performance. The ability to leverage behavioral data is, partially, a function of the computational tools and high resolution sensor data that allow researchers to construct user representations that are semi-semantic. A discussion of the techniques and sensors used is presented briefly later in this article.

Moreover, when we examine the work of Scherr and Hammer, we see a clear example of how multimodal representations of learners and groups of learners play an important part in interpreting how they are framing a given task. Specifically, Scherr and Hammer study epistemological framing, which is concerned with the role that a student’s perception of the nature of knowledge and of the task has in how they approach that task. Scherr and Hammer (2009) discuss four epistemic frames that are typified by a combination of speech, gaze, posture and engagement (Table 1). While the analyses presented in this paper will not attempt to reproduce the same epistemological frames, Scherr and Hammer’s work highlights the potential for multimodal data to advance our understanding of human cognition and learning.

Table 1 Epistemic Frames adapted from Scherr and Hammer (2009)

Process-Oriented Analysis of Learning

In addition to the importance of multimodal analysis, we also build upon the notion that student knowledge is infused in the processes that students complete. More specifically, we define process as the behaviors, actions and interactions that occur while a student or group of students is completing a given task. Fundamentally, distinguishing between achievement and process has been an essential part of education research for nearly a century (Werner 1937), and is the primary consideration of constructionist learning. For example, Turkle and Papert (1992) provide a prime instance of researchers focusing on learner processes rather than learner outcomes. Specifically, they write,

Using clinical methods inspired by the Piagetian and psychoanalytic traditions, we built up case studies of children using computers in grade-school settings where they were encouraged to explore programming without preconceptions about the “right way” to go about it. We took 40 cases for which we had material both on individual personality and programming style. What we say in this chapter about gender, programming, and intellectual style is based on the analysis of these cases. But we believe that what is most important is not any statistical association between gender and programming styles, but what lies behind the styles and behind the resistance of our intellectual culture to recognize and facilitate them both. (Turkle and Papert 1992).

The use of case studies centered on the belief that process was of primary import, and was more relevant than mere correlations between style and gender. Hence, our use of process is in line with the guiding principles and perspectives of constructionism (e.g. Harel and Papert 1991; Kafai 1995; Lawler and Yazdani 1987).

Beyond the specifics of the constructionist movement, focusing on process has been espoused by several other researchers (Atman and Bursic 1998; Atman et al. 2005; Bamberger and Schön 1983; Lehrer and Schauble 1998; Smith et al. 1994; Toulmin 1999). Among these papers, authors are concerned with analyzing student learning with the understanding that as the student is participating in the study, they are engaging in a learning process. For example, Toulmin (1999) advocates for “knowledge as shared procedures.” As such, the analysis of student expertise should be situated in practices central to a domain, as opposed to solely being grounded in language or a final product. Similarly, Bamberger and Schön (1983) describe learning as a “reflective conversation with materials.” The idea of a conversation encapsulates the ways that individuals interact with their surroundings, both human and non-human, to interpret and make sense of what they observe. Hence it is not enough to simply look at a structure to determine its stability. Instead the individual must engage the structure in something that is akin to a dialogue, applying stimulus to the object and getting feedback from the object. In previous work (Worsley and Blikstein 2014b), we have qualitatively shown process-oriented differences between the two experimental conditions described in this article. Those two experimental conditions are principle-based reasoning and example-based reasoning. For the purpose of demonstrating the utility of multimodal analysis it is not essential to understand the specifics of principle-based reasoning or example-based reasoning. Nonetheless, we will briefly describe the experimental conditions in the paragraphs that follow.

Example-based reasoning is a form of analogical problem solving, in which the participant’s design is modeled after a real-world object. For example, Worsley and Blikstein (2017) describe a student that bases his design on a specific chair in his room. The student then proceeds to make a design that closely resembles that chair.

Principle-based reasoning can also be characterized as a form of analogical problem solving. However, it differs from example-based reasoning, in that the participant bases their design on principles from engineering. Common principles used within the engineering design context are having a wide base, using triangles, and reinforcement. Instead of working towards a specific exemplar, the student attempts to piece their design together by iteratively and systematically applying engineering principles.

Having qualitatively observed marked differences between the two experimental groups, it is the intention of this article to utilize fine-grained quantitative measures to define more process-oriented differences.

Methods

Study Participants

The population of students included twelve high school students and eight undergraduate students. The distribution of high school students and undergraduate students was the same across the two conditions. Nine of the students were women, and eleven were men. One condition had four women while the other had five. Students were recruited through a series of university mailing lists and received $20 for their participation in the study.

Study Description

This study involved dyads of students working to complete an engineering design challenge. The challenge asked students to build a structure that could support a 0.5 lb. weight as high above a table as possible using basic household materials: one paper plate, four straws, five wooden sticks, and 5 ft. of garden wire (see Fig. 1). Because no pencil and paper were provided, students largely perceived this as a making/tinkering task, as opposed to a traditional engineering task for which they would have been expected to develop calculations as part of their design process. This blending of engineering design and tinkering oftentimes characterizes K-16 “making” experiences, where the overall objective is to create a finished and functioning artifact, similar to the work of an engineer, absent the rigorous mathematical calculations that would support the design (Vossoughi and Bevan 2014). Furthermore, the task mirrors the forms of design challenges that are often utilized in Makerspace workshops.

Fig. 1
figure 1

Sample student workspace and building materials

The sequence of events completed for the activity included:

  1. Baseline Sensor Data Collection – before beginning the actual study, students complete baseline electro-dermal activation activities to determine their baseline in both stressful and non-stressful situations.

  2. Pre-test (Fig. 2) – students were asked to generate as many ways as possible to make an unstable structure more stable. The goal of the pre-test was primarily to account for any differences in prior experience, as well as serve as a reference point for assessing how each student’s conceptual intuitions changed as a result of the experiment.

    Fig. 2
    figure 2

    Diagram presented for pre- and post-test

  3. Intervention – students participated in either an example-based reasoning intervention or a principle-based reasoning intervention. During both interventions students were first shown a picture of a ladder (Fig. 3), a bridge (Fig. 4) and an igloo (Fig. 5). In the example-based condition students were asked to generate three ideas of relevant structures from their home, community or school that would be useful in thinking about completing the current task. In the principle-based condition students were asked to generate three mechanisms, or engineering principles, that cause one or more of the three items pictured (Fig. 3, Fig. 4, and Fig. 5) to be structurally sound. The intervention task was three minutes in duration for both conditions.

    Fig. 3
    figure 3

    Ladder picture

    Fig. 4
    figure 4

    Bridge picture

    Fig. 5
    figure 5

    Igloo picture

  4. Initial Design Drawing – students worked individually to create a quick sketch of what they thought their final structure would look like. This task was done as an intermediate step that would highlight if the intervention alone conferred noticeable advantages to one condition or the other.

  5. Building Activity – students were given the materials and had fifteen minutes to complete their structure.

  6. Post-test – students repeated the pre-test task, and were given access to their pre-test data. The pre-test was made available to them in order to let them reflect on their prior designs (i.e. reuse them if they so pleased) and eliminate any concerns that some students may have forgotten their pre-test answers, while others memorized theirs.

  7. Reflection – students verbally reflect on the reasons why their design did or did not work. (See Fig. 6 for overall study design).

    Fig. 6
    figure 6

    Overall design of the studies

Data Collection

Throughout the experiment, we collected multimodal data with a depth camera, high resolution web camera, electro-dermal activation wrist band, and microphone array. The Microsoft Xbox Kinect sensor was responsible for capturing multi-channel audio, frontal images and skeletal tracking data using custom developed software. A Logitech HD C920 was used to capture overhead audio and video. Electro-dermal activation was captured using the Affectiva Q-sensor. All devices were connected to the same computer, except for the Affectiva Q-sensor, which was synchronized with the computer’s clock before each experiment. These multimodal data sources allowed us to hand-annotate students’ actions, as well as quantify speech, hand/wrist movement and electro-dermal activation for each participant. The following sub-sections describe the data collection in more detail.

Hand-Coded Data

Hand coding of human actions occurred at approximately 1-s intervals. A snapshot of each pair’s behavior was generated at 1-s intervals, and labeled based on the Object Manipulation Class that it corresponded to. The snapshots were generated from a custom application that takes pictures using the Xbox Kinect sensor. In many cases, the manipulation class could only be determined several seconds after the action was completed. For example, as a student is preparing to put two pieces close to one another, it is not clear whether this action will be building with those two pieces, or merely prototyping, until seeing what the student ultimately does with those two items. In the case that the items are affixed to one another, the action, beginning from the point at which the student gets hold of the materials, would be classified as a building action. However, if the items were only placed near each other to physically prototype an idea, the entire action sequence would be classified as a prototype action. This approach to coding provides an interpretative lens on each action, when compared to an approach in which a given participant’s behavior must be described at the time of observation. In previous work we demonstrated that this form of coding is effective for studying students in hands-on learning activities (Worsley and Blikstein 2013, 2014a).

Audio Data

Audio data was derived from a combination of audio channels from an overhead web camera, and audio from the Xbox Kinect sensor. Custom software was developed based on the Carnegie Mellon University (CMU) Sphinx Speech Recognition Toolkit (Lee et al. 1990). Specifically, the source code was modified to leverage the program’s voice activity detection feature. Voice activity detection is an automated means for determining when voice-based audio is being generated. Several speech recognition software solutions contain some variant of voice activity detection. The custom software provided voice detection start and stop times for all of the audio channels. Audio was considered to be present if either of the audio sources detected a voice within a given second of time. Thus the final format of this data is a binary representation: every second of the activity is labeled with a zero or one, for the absence or presence of audio at that time stamp. Because the audio channel captured sound from both participants, this piece of data is the same for each person in a pair.
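
To make the representation concrete, the sketch below shows one way the per-second binary audio stream could be assembled from voice activity intervals. It is a minimal illustration, assuming the detector emits (start, stop) times in seconds; the actual custom Sphinx-based software differs in its details.

    def voiced_seconds(intervals, total_seconds):
        # Mark each second that falls inside any voiced (start, stop) interval.
        flags = [0] * total_seconds
        for start, stop in intervals:
            for t in range(int(start), min(int(stop) + 1, total_seconds)):
                flags[t] = 1
        return flags

    def merge_channels(kinect_intervals, webcam_intervals, total_seconds):
        # Audio is counted as present if either source detected a voice in that second.
        kinect = voiced_seconds(kinect_intervals, total_seconds)
        webcam = voiced_seconds(webcam_intervals, total_seconds)
        return [int(a or b) for a, b in zip(kinect, webcam)]

    # Example: speech detected at 2-4 s on one channel and 3-6 s on the other
    # yields ones for seconds 2 through 6 of a 10-second window.
    presence = merge_channels([(2, 4)], [(3, 6)], 10)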

Hand/Wrist Movement

Hand/wrist movement data was also generated from the Xbox Kinect sensor. Once again, a custom-built application was used to store three dimensional data for twelve upper body joints. The application uses native features available from the Kinect for Windows SDK, specifically, the ability to conduct skeletal tracking in the seated position. The custom application stores the data at 10 Hz. From the file generated, we utilize only the left and right wrist, hand and elbow data points. For each successive pair of data points we compute the angular displacement for the vectors that connect: left wrist and left hand; left wrist and left elbow, right wrist and right hand; right wrist and right elbow. The eventual angular displacement that is recorded is an average of the four angular displacements. Using angle as the means for comparison reduces biases introduced by participants having different sized limbs. Accordingly, for each tenth of a second in time we have stored the total angular hand/wrist displacement.
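
A rough sketch of the angular displacement computation is shown below. The joint names and the dictionary layout are illustrative assumptions; the actual application reads the Kinect for Windows SDK skeletal stream directly.

    import numpy as np

    def angle_between(v1, v2):
        # Angle (radians) between two 3-D vectors, clipped to guard against rounding error.
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))

    def frame_displacement(prev, curr):
        # Average angular displacement of the four wrist-centered vectors between two
        # successive 10 Hz frames; `prev` and `curr` map joint names to 3-D points.
        pairs = [("wrist_left", "hand_left"), ("wrist_left", "elbow_left"),
                 ("wrist_right", "hand_right"), ("wrist_right", "elbow_right")]
        angles = [angle_between(curr[b] - curr[a], prev[b] - prev[a]) for a, b in pairs]
        return sum(angles) / len(angles)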

Electro-Dermal Activation

Electro-dermal activation (also referred to as galvanic skin response and/or skin conductance) readings were captured at 8 Hz. Processing of electro-dermal activation involved controlling for individual differences in variance, as well as individual differences in stress response. In practice, this was achieved by collecting baseline data as students completed the task of counting down by 7. We will refer to this as the “math” stress test. As additional baseline data, students also completed a Stroop test, and had their electro-dermal activation recorded during non-task oriented activities. As before, each data point was time-stamped with the local date and time. Each data point was then transformed into an index value by subtracting the mean of the “math” stress test, and then dividing by the standard deviation of the “math” stress test data for that student. As validation that this approach reduced individual bias, when we compared electro-dermal activation index values across the different activities, there were no statistically significant differences between experimental conditions for the baseline data, the Stroop test, or the math test. However, across the intervention, design phase and the building activity, differences were statistically significant.
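
The indexing step amounts to a per-participant z-score against the “math” stress-test recording; a minimal sketch follows (array-based, with no attempt to reproduce the Q-sensor file format).

    import numpy as np

    def eda_index(raw_eda, math_baseline):
        # Subtract the mean of the participant's "math" stress test and divide by
        # its standard deviation, yielding index values comparable across people.
        baseline = np.asarray(math_baseline, dtype=float)
        return (np.asarray(raw_eda, dtype=float) - baseline.mean()) / baseline.std()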

In addition to the multimodal data, we also collected manual annotations for the beginning and end of each phase of the experiment. Start and stop times for each activity were recorded based on the images from the Kinect sensor and the video data. Through these annotations we could group data based on the activity that it is associated with. For example, we know which data corresponds with the intervention phase, the design sketch phase, and the building activity phase, for each participant.

Data Extraction

The analysis algorithm is designed to recognize process similarities between participants, and to test the hypothesis that there are multimodal practices that distinguish principle-based reasoning from example-based reasoning. Within each analysis, the hypothesis is tested in two ways. First, students are clustered based on the similarity of their processes. This approach maintains the temporality of the student behaviors. We refer to this approach as computing the pair-wise “process similarity metric.” Second, we conduct a behavior frequency analysis, which relaxes the temporal relationship. The purpose of this approach is to provide a simplified means for determining process-based differences. However, even in conducting this analysis, we look at the frequencies in aggregate, and also split each participant’s process into three adjacent sections. Looking at behavior frequency at these set intervals moves closer to understanding how student processes differed without aggregating across the entire process.

Figure 7 shows the general steps of the algorithm used for analysis. The algorithm builds on several previous studies (e.g. Berland et al. 2013; Blikstein et al. 2014; Piech et al. 2012; Worsley and Blikstein 2013) and makes every effort to maintain the context of each piece of data by taking temporality into consideration, whenever possible. The paragraphs that follow provide a summary of each step.

Fig. 7
figure 7

General algorithm used for comparing process (from top to bottom)

Time-Stamp

The first step of extracting process data is to ensure that all data is properly time-stamped. Each data stream had been labeled with the local date and time of its occurrence, and stored as a csv. In the case of the electro-dermal activation data, a data synchronization step was either manually completed, or completed through the Q-sensor software. Apart from the electro-dermal activation sensors, all data was collected on the same computer. Having synchronized data sources allowed us to merge the data as needed for the different analyses. Merging the data was completed using custom Python scripts.
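
As an illustration, synchronized streams stored as CSVs could be merged along the lines below. The ‘timestamp’ column name and the 1-second resampling grid are assumptions for the sketch, not a description of the original scripts.

    import pandas as pd

    def load_stream(path, rule="1S"):
        # Read one sensor log and resample it onto a common 1-second grid.
        df = pd.read_csv(path, parse_dates=["timestamp"]).set_index("timestamp")
        return df.resample(rule).mean()

    def merge_streams(paths):
        # Join independently recorded modalities on their synchronized clock times.
        return pd.concat([load_stream(p) for p in paths], axis=1)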

Segment

The time-stamped data is then segmented. Across all three analyses we segment the data every time a pair’s structure is tested. Testing will be described in more detail later, but for now, the reader can interpret testing as representing an instance in which at least one person in the pair is eliciting feedback that will update the students on the current stability of their structure. Testing usually takes the form of a team member placing the weight on their designed structure. As a whole, the segmentation process serves to smooth some of the noise in the data. Instead of having to take into account each of the spikes and troughs that may emerge from any of the data streams, segmentation allows us to look more for trends. Noise reduction is also a consideration for the next step.
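
Conceptually, segmentation splits each merged, time-indexed stream at the annotated testing events, as in the following sketch (the testing timestamps are assumed to come from the hand annotations described earlier).

    def segment_by_tests(df, test_times):
        # Split a time-indexed DataFrame into "test segments" bounded by the
        # moments at which the pair tested their structure.
        bounds = [df.index.min()] + list(test_times) + [df.index.max()]
        return [df.loc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]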

Cluster

The segmentation process yields several “test segments” for each student, and hundreds of unique segments when aggregating across all students. These “test segments” are characterized by the proportion of time spent in each possible behavior. Some of these will be very similar to one another, only differing by an infinitesimal amount, while others will vary quite extensively from one another. The goal of clustering is to identify natural groupings among the various “test segments” and ultimately provide a common set of states, or behaviors, by which to compare individual user sequences. However, before proceeding with clustering, we first standardize the data. Namely, we adjust each value such that all of the data in a given column has a mean of zero and a standard deviation of one. This process eliminates bias in clustering by ensuring that each column contributes equally to the distance metric, which in this case was Euclidean distance. After standardizing the data, we used X-Means clustering to group the data points into a set of clusters that place each “test segment” with the other “test segments” that it is most similar to. Once each “test segment” has been grouped with similar “test segments,” each cluster, or group, can be described based on the average values of all of its members. These values provide the basis for determining common behavioral practices across the three analyses.
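
A sketch of the standardization and clustering step is given below. X-Means is not part of scikit-learn, so the sketch substitutes K-Means with the number of clusters chosen by silhouette score, which is only a rough stand-in for X-Means’ model selection.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def cluster_segments(X, k_range=range(2, 9)):
        # Standardize each column so every feature contributes equally to the
        # Euclidean distance, then group the "test segments".
        Xs = StandardScaler().fit_transform(X)
        best_k, best_labels, best_score = None, None, -np.inf
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
            score = silhouette_score(Xs, labels)
            if score > best_score:
                best_k, best_labels, best_score = k, labels, score
        return best_k, best_labels  # chosen k and one cluster label per "test segment"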

Re-Label

All “test segments” that are put into the same cluster are given the same name. Accordingly, each student’s sequence of “test segments” can now be represented as a list of clusters.

Normalize

In the normalization step, each student’s re-labeled sequence is lengthened so that we can more directly compare them to one another. The two forms of normalization that we use are L-1 normalization and dynamic time warping (Rabiner et al. 1978). In the case of L-1 normalization, each sequence is lengthened so that all participants’ sequences are of equal length. In dynamic time-warping a modification of Levenshtein distance (Levenshtein 1966) is used to find the best match between pairs of sequences. As with L-1 normalization, dynamic time warping preserves the order of the data. However, unlike L-1 normalization, dynamic time warping may change the proportion of total time spent using each cluster type. Because of this, we used the L-1 normalization when examining the frequency of cluster usage, but broke the cluster usage frequency into a beginning, middle and final segment to maintain some of the temporal elements of students’ sequences. For answering questions around the cycles of iterations, and similarity in point-by-point process data, the dynamic time warping algorithm seemed more appropriate, as it preserves the order of the process and is able to capture the distance between each pair of participants.
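
The dynamic time warping step can be sketched as below, using a simple 0/1 mismatch cost over cluster labels in the spirit of Levenshtein distance; the modification actually used in the study may weight costs differently.

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # Minimal-cost alignment of two cluster-label sequences; a cost of 1 is
        # charged whenever aligned labels differ.
        n, m = len(seq_a), len(seq_b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]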

Compare Behavior Frequency

After L-1 normalization, the next step is to compare behavior frequency data across the three metrics of interest: success, experimental condition, and learning. The comparisons are based on Mood Median Tests along each of the individual clusters of “test segments.” However, instead of the traditional Mood Median Test, which computes statistical significance based on a Chi-Square distribution, we use a binomial test. These two tests were used because the data did not meet the requirements for MANOVA and violated the typical requirements of a Chi-Square Test. This step represents the conclusion of one branch of the analysis tree.
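
One plausible reading of this procedure is sketched below: pool both groups, split at the grand median, and apply a binomial test to how often one group falls above it. The function and argument names are illustrative.

    import numpy as np
    from scipy.stats import binomtest

    def median_split_test(group_a, group_b):
        # Mood-style median comparison with a binomial test in place of the
        # usual chi-square approximation.
        a = np.asarray(group_a, dtype=float)
        b = np.asarray(group_b, dtype=float)
        grand_median = np.median(np.concatenate([a, b]))
        above = int(np.sum(a > grand_median))
        return binomtest(above, n=len(a), p=0.5, alternative="two-sided").pvalue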

Compute Distances

After dynamic time warping, distance is computed between each pair of participants.

Group Participants

After completing dynamic time warping, pairwise distances are computed. Those pairwise distances are used to construct an n-by-n matrix. As before, this matrix is standardized before conducting K-Means clustering with k = 2. Thus, each student is put into the group that contains other students whose process was most similar to their own.
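
Putting the pieces together, participant grouping could look roughly like this, reusing a DTW-style distance such as the sketch above; as elsewhere, this is an illustration rather than the exact pipeline.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    def group_participants(sequences, distance):
        # Build the n-by-n matrix of pairwise process distances, standardize it,
        # and split participants into two groups with K-Means (k = 2).
        n = len(sequences)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                D[i, j] = distance(sequences[i], sequences[j])
        D_std = StandardScaler().fit_transform(D)
        return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(D_std)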

Compare Participant Groups

Finally, the groups are compared using a binomial test to determine the probability that individuals were randomly assigned to their specific group. Specifically, it is here that we examine the hypothesis that different groups, as partitioned by experimental condition, success on the activity, or based on post-test score, used markedly different processes from one another.
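
The group comparison can be approximated with a one-sided binomial test on the composition of a group, as sketched below. Taking p as the condition’s share of the whole sample is an assumption; the paper’s exact formulation may differ.

    from scipy.stats import binomtest

    def group_composition_pvalue(k_in_group, group_size, condition_total, sample_total):
        # Probability of a split at least this lopsided if group membership were
        # unrelated to condition (or success, or learning).
        p = condition_total / sample_total
        return binomtest(k_in_group, n=group_size, p=p, alternative="greater").pvalue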

In general, this algorithm is used to identify common behaviors from the first round of clustering, and then to use each student’s sequence of common behaviors as the basis for the second round of clustering, or to examine usage of common behaviors between the two experimental conditions.

In the next three sections (Part 1, Part 2 & Part 3) we outline three separate analyses that all use the same general algorithm presented above. What will differ for each analysis is the type of data analyzed.

Part 1: Qualitative Analysis of Learning, Success and Strategies

One of the common strategies used for video data is to produce annotations of student behaviors. In many respects, producing timestamps each time a student begins a new action is tantamount to transcription, where the modality of interest is user behavior. Prior work on multimodal analysis has leveraged this technique and shown it to be important for characterizing and understanding student learning (Barron et al. 2013; Kress et al. 2001). In similar fashion, in this first section, we describe an analysis in which we timestamped the video data for every instance of six prototypical actions, or Object Manipulation Classes, as outlined in Table 2. As previously noted, actions were coded at 1-s intervals based on the video and frontal image data. Because of the nature of some of the actions, accurate labels could only be determined after the action was completed. The specific actions are based on Worsley and Blikstein (2013, 2014a), which showed that this coding paradigm is a useful way for studying hands-on learning, and that the coding paradigm bears similarity to prior work in engineering education (Atman et al. 1999).

Table 2 Object Manipulation Classes

Common Behavior Analysis

The segmentation that follows hand-coding results in approximately two hundred unique “test segments.” For this analysis each “test segment” is defined based on the proportion of time spent in each of the five Object Manipulation Classes (c-realize, c-plan, c-modify, c-revert, c-nothing). Clustering those “test segments” resulted in four common behaviors, or clusters. Each cluster can be characterized by the relative proportion of time spent in each of the five activities. As an overview, Fig. 8 shows the distribution of the four common “test segment” types. The labels assigned to each region of the pie chart will make more sense based on the data presented in Fig. 9 and in the following paragraphs.

Fig. 8
figure 8

Relative frequency of common behaviors

Fig. 9
figure 9

Characteristics of common behaviors for Part 1

The most frequently occurring cluster, IMPLEMENT, is characterized by significantly above average proportions of c-realize and c-revert. The proportion of c-nothing is below average, as is c-plan. Accordingly, this cluster seems to represent project or idea implementation in the absence of planning and/or modifying.

Because the primary actions for this cluster of “test segments” involve either adding to an existing structure (c-realize), or undoing an existing structure (c-revert), we call this cluster IMPLEMENT. That nearly two-fifths of the “test segments” are characterized by implementing an idea is in line with the fact that the overall task is focused on hands-on manipulation of materials.

The second most frequently occurring cluster, PREPARE, is typified by above average c-plan behavior and above average c-nothing behavior.

At the same time, this cluster also represents below average c-modify, and c-realize and appears to be roughly average for c-revert. We call this cluster PREPARE, as the students seem to principally be concerned with actions that are either explicitly or implicitly indicative of preparing to actually build. The fact that a large proportion of segments is spent doing PREPARE indicates that even though the focus of the activity is geared towards “making,” many students are engaging in reflective processes that help them think about how best to complete the task. Furthermore, since several of the “test segments” are PREPARE segments, students are likely using PREPARE throughout the process.

The third cluster, ADJUST, is characterized by above average c-modify, and below average c-plan and c-realize. The behavior’s average c-modify value is approximately one standard deviation above the mean value for the entire population of “test segments”.

This indicates that when using this “test segment” students are spending a significant proportion of their time adjusting their structure, but may also occasionally spend a portion of the “test segment” doing nothing, or undoing. This, again, seems reasonable. Based on observations made during data collection and video annotation, making adjustments to a structure was a fairly common activity. As can be seen from Fig. 8, ADJUST accounts for nearly 20% of all “test segments” across all users.

An above average proportion of c-modify actions also characterizes the fourth cluster. However, whereas the ADJUST cluster involved c-modify values that were one standard deviation above the mean, SIGNIFICANTLY ADJUST has c-modify values that are closer to two standard deviations above the mean. To compensate for this increase in the proportion of time spent modifying, the proportion of time spent in c-realize, c-plan and c-nothing are all well below average. In this case it appears as though these “test segments” are typified by students only making adjustments to their structures. Again, based on personal observation, this seems like an accurate characterization of several “test segments” as some students tried to make their structure work without a clear sense of how to do so. Because the focus almost exclusively resides in c-modify, we call this cluster SIGNIFICANTLY ADJUST. Ten percent of the “test segments” were grouped into this cluster.

Process Similarity Comparison

Recall that the process similarity comparison groups students based on the pair-wise similarity of their processes. This particular metric maintains the order in which each student completes each action, at the “test segment” level. Figure 10 shows the results of grouping students based on their process similarity, with a focus on comparing the number of students from each condition assigned to a given group.

Fig. 10
figure 10

Composition of groups based on experimental condition as derived from process similarity for Part 1

Seven students from the principle-based condition were assigned to Group A, while the remaining three were assigned to Group B. For the example-based condition, three students were assigned to Group A, while the remaining five were assigned to Group B. According to a binomial test, there is approximately a 12% chance of this, or a less frequent split, happening at random. In addition to looking at experimental condition, we also examine how Group A and Group B differ in terms of success (Fig. 11).

Fig. 11
figure 11

Composition of groups based on success as derived from process similarity for Part 1

Comparing Group A and Group B based on success rates produces a much clearer distinction. Seven of the ten students assigned to Group A succeeded on the activity, whereas only one of the eight students in Group B succeeded on the activity. The likelihood of this happening at random is less than 2%, suggesting that there were substantive process based differences between successful and unsuccessful students when considering their actions. To explore these differences more deeply, we now proceed to compare cluster frequency usage.

Comparing common behavior usage takes on two forms. At the most general level, it involves aggregating cluster frequency across the entirety of each student’s process. However, in order to provide a more fine-grain comparison, we also look at cluster usage in the first, second, and third portions of each student’s process.

Coarse-Grain Cluster Usage Comparison

Figure 12 shows the median cluster frequency usage for the example-based and principle-based conditions. The most pronounced difference between the example-based condition and the principle-based condition is in the IMPLEMENT cluster. The principle-based reasoning condition makes significantly more (p = 0.017) use of the IMPLEMENT cluster than their peers in the example-based reasoning condition. They also make less (p < 0.001 (8.5 × 10⁻⁵)) use of SIGNIFICANTLY ADJUST. These are the only dimensions for which there are statistically significant differences between the two conditions.

Fig. 12
figure 12

Median common behavior usage by condition for Part 1

Figure 13 contains the same analysis, but now with success as the dependent variable. The most pronounced differences appear to be in the IMPLEMENT, ADJUST and SIGNIFICANTLY ADJUST categories. However, tests of statistical significance reveal differences only in IMPLEMENT (p = 0.017) and SIGNIFICANTLY ADJUST (p = 0.017).

Fig. 13
figure 13

Median common behavior usage by success for Part 1

Finally, for the analysis of learning, we find that students who learned more spent relatively more time in IMPLEMENT (Fig. 14). When we compare students with positive learning scores, with those receiving negative learning scores, we find a statistically significant difference (p = 0.0019). Again there is a trend that students who spend more “test segments” in IMPLEMENT spend fewer “test segments” in ADJUST and SIGNIFICANTLY ADJUST, but still spend approximately the same number of “test segments” in PREPARE.

Fig. 14
figure 14

Median common behavior usage by learning score for Part 1

The coarse-grain analysis supports the hypothesis that students significantly differed in their processes when comparing experimental condition, success and learning. However, the coarse-grain analysis provides little in the way of describing where those differences are occurring and whether or not there is any causality in what is observed. To address this, the following section features a fine-grain analysis of cluster frequency usage that splits each student’s process into three equally-sized parts.

Fine-Grain Cluster Usage Analysis

A fine-grained analysis indicates that there are no significant differences between the two conditions. While some places show noticeable differences, those results do not survive post-hoc correction for multiple comparisons using the Benjamini-Hochberg procedure.
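
For reference, the Benjamini-Hochberg step-up procedure used for this post-hoc correction can be sketched as follows.

    import numpy as np

    def benjamini_hochberg(p_values, alpha=0.05):
        # Return a boolean mask of which hypotheses survive control of the
        # false discovery rate at level alpha.
        p = np.asarray(p_values, dtype=float)
        order = np.argsort(p)
        m = len(p)
        thresholds = alpha * np.arange(1, m + 1) / m
        passed = p[order] <= thresholds
        keep = np.zeros(m, dtype=bool)
        if passed.any():
            k_max = int(np.max(np.where(passed)[0]))  # largest k with p_(k) <= alpha*k/m
            keep[order[:k_max + 1]] = True
        return keep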

The fact that this particular algorithm appears to primarily be distinguishing successful students from unsuccessful students is reiterated through a fine-grained analysis of cluster usage. Specifically, there is a statistically significant difference (p = 0.0004) between successful and unsuccessful students in the amount that they use IMPLEMENT during the first third of the activity. Successful students were more likely to use IMPLEMENT, whereas the unsuccessful students were more likely to be in ADJUST or PREPARE. Apart from IMPLEMENT usage in the first third, there were no differences between successful and unsuccessful students.

Much like the case of condition, the fine-grained analysis did not identify any statistically significant differences between students who received positive post-test scores, and those who received negative post-test scores.

Discussion

In this section we have presented results that confirm the hypothesis that student processes differed along several dimensions for the principle- and example-based experimental groups. We began by discussing the four common “test segment” types. These were termed PREPARE, IMPLEMENT, ADJUST and SIGNIFICANTLY ADJUST. We then moved on to show that the process similarity comparison yielded weakly significant results when comparing between experimental conditions. However, when looking at success rate, the process similarity metric did substantially better. We then proceeded to analyze how the cluster usage frequency data could be used to describe the differences observed in the process similarity metric. The coarse-grain analysis found that increased usage of IMPLEMENT correlated with success, learning and the principle-based reasoning experimental condition. In interpreting this information it is important to recall that segmentation was based on when students tested, and not based on the total amount of time spent on the task. Hence any attempt to argue that students spent more time in a given activity is not the appropriate inference to be made. Instead the results should be thought of in terms of the proportion of a student’s test segments that were spent in a given activity, recalling that these can be of variable length.

Moving to the more fine-grain analysis provided additional insight into how successful and unsuccessful students differ in how they start the activity. Namely, successful students were likely to spend more of the first-third in IMPLEMENT, than unsuccessful students. That said, even though the coarse-grain analysis consistently reported that usage of IMPLEMENT was important, the analysis, on the whole still leaves many questions about why the two experimental conditions significantly differed from one another.

Part 2: Multimodal Analysis of Learning, Success and Strategies

In Part 1 of this article we used hand-annotated data to pinpoint differences in how students enacted the engineering design process. In Part 2, we transition into using automated multimodal sensor data. This multimodal data includes audio, hand/wrist movement and electro-dermal activation. Whereas the analysis in Part 1 included the semantics of each user’s actions, the analysis in Part 2 takes a purely behavioral approach, but leverages multiple data streams in order to better capture the context in which each piece of data is recorded. This has commonly been a justification for undertaking multimodal analysis. Furthermore, prior research has studied how student posture and audio can be used as indicators for inferring student epistemological frames (Elby and Hammer 2010; Hutchison and Hammer 2009; Russ et al. 2012; Scherr and Hammer 2009). Accordingly, in this analysis we will examine student behavior at a similar level of granularity and identify the amount of audio, hand/wrist movement and electro-dermal activation that students generate at different points in time.

Common Behavior Analysis

Figure 15 shows the relative frequency of each of the characteristic “test segments.” Again, the labels will be more understandable following the discussion of each common behavior and their accompanying graphical representations which are summarized in Fig. 16.

Fig. 15
figure 15

Relative frequency of common behaviors for Part 2

Fig. 16
figure 16

Characteristics of common behaviors for Part 2

The most common segment, which we call ENGAGED, is characterized by near or below average behavior across all three variables: audio, hand/wrist movement and electro-dermal activation. When examined in comparison to the other common behaviors (Fig. 15), ENGAGED is the most balanced, with no individual modality significantly dominating the others. This cluster represents roughly 60% of all “test segments”.

As we compare the usage across conditions, rate of success and quality of learning, the argument for calling this category ENGAGED will become clearer. For now, suffice it to say that this cluster represents the vast majority of all “test segments”, and that it is typified by average stress, average movement and little speech. One can picture students in ENGAGED concentrating on the task by carefully manipulating the materials without the need for extended discussion, movement or arousal.

The second most frequently occurring “test segment” is one that we have entitled ACTION. This behavior primarily consists of segments in which students use above average hand/wrist movement.

What’s more, this occurs in the absence of high electro-dermal activation, which is normally correlated with body movement. An additional point of interest is the lack of audio associated with this behavior. Students are focused on building and refrain from extensive discussion with one another. Accordingly, one might conjecture that the students are finding other means through which to communicate with one another.

After ACTION, the most frequently occurring state is TALK. This particular cluster represents approximately 18% of all “test segments.”

The amount of audio in this cluster is approximately two standard deviations above the mean. Hand/wrist data is just above the mean, and electro-dermal activation is nearly half a standard deviation below average. Again, this is analogous to ACTION in that students appear to only engage one of the multimodal behaviors at a given time.

The final cluster is one that we call STRESS. This behavior is characterized by extremely large values of electro-dermal activation, as well as above average hand/wrist movement.

As mentioned before, one would expect electro-dermal activation and hand/wrist movement to correlate with one another. Hence we can anticipate that the electro-dermal activation values may be slightly inflated, but are still, more than likely, well above average. On the other hand, the fact that the students are stressed may be causing them to work more frantically, which would result in an increase in hand/wrist movement. STRESS accounts for approximately 10% of students’ “test segments”.

Process Similarity Comparison

Before discussing the specifics of the process differences, we first present results from grouping students based on the similarity of their processes. Process similarity was based on the two participant clusters created from the pair-wise comparison of student sequences. Seven of the eight students assigned to Group A are from the principle-based condition. The inverse pattern is observed for Group B, with seven of the eight individuals in that group coming from the example-based reasoning condition (Fig. 17). The likelihood of this happening randomly is less than 0.003, suggesting that the two conditions did, in fact, utilize markedly different processes.

Fig. 17
figure 17

Composition of groups based on experimental condition as derived from process similarity for Part 2

Having observed that there are salient differences between the processes that the two conditions use, as determined through multimodal data, we now consider the nature of those differences. As in Part 1, we examine cluster usage at coarse- and fine-grains.

Coarse-Grain Cluster Usage Analysis

Figure 18 shows the median normalized frequency of cluster usage by experimental condition. From the figure it is apparent that the median value for ENGAGED is quite different between the two conditions.

Fig. 18
figure 18

Median common behavior usage by condition for Part 2

A test of statistical significance confirms that students in the principle-based reasoning condition were more frequently (p = .0098) in ENGAGED than their peers in the example-based reasoning condition. No other statistically significant differences emerged between the two groups when comparing their cluster frequency.

In terms of success, there are no statistically significant differences in terms of cluster, or behavior, usage (Fig. 19). Much like the case of success, the data does not reveal any significant differences between students who experienced positive learning and negative learning gains (Fig. 20).

Fig. 19
figure 19

Median common behavior usage by success for Part 2

Fig. 20
figure 20

Median common behavior usage by learning score for Part 2

Fine-Grain Cluster Usage Analysis

A fine-grained analysis indicates that there are significant differences between the two conditions during all three portions of the activity. Students in the principle-based reasoning condition are more likely to be in ENGAGED for the first (p < 0.001 (2.2 × 10⁻⁵)) and third (p = 0.0097) thirds than their peers in the example-based reasoning condition. This suggests that the two conditions differed at all three stages, but that the greatest divergence occurred during the first portion of the activity. In particular, many of the students from the example-based reasoning condition primarily spend the first third of the activity in ACTION. In a later section, we discuss the implications of this in more detail.

The fine-grained analysis did not identify any significant differences between successful and unsuccessful students in terms of how frequently they used the different clusters, or common behaviors.

The fine-grained analysis did indicate that students who learned more through the activity were more likely to be in ENGAGED during the first third of the activity. Specifically, the students that received positive post-test scores were much more likely (p < 0.001 (0.0005)) to use ENGAGED than students who received a negative post-test score. This trend continues among the students that received a score of zero, but not at a statistically significant level.

Discussion

In deciphering the differences between the principle-based reasoning condition and the example-based reasoning condition, the current multimodal analysis offers a significant improvement beyond the analysis from Part 1. Specifically, this particular analysis included statistically significant differences in the process similarity metric between the experimental conditions. Furthermore, both the coarse-grain and fine-grained analyses offered some additional insight into identifying the elements of each condition’s process that differed. The coarse and fine-grained analyses showed that students in the principle-based reasoning condition made more extensive use of ENGAGED than students in the example-based reasoning condition. That difference emerged during all three portions of the activity and was most pronounced during the first third. This is telling because it indicates that the differences were not merely the result of students being more or less successful on the activity. In fact, this analysis did not reveal any differences between successful and unsuccessful students at any grain size. Instead, the only other difference was observed from the learning metric. Once again, ENGAGED was positively correlated with student learning.

Realizing that the analyses from Part 1 and Part 2 provided different benefits, in Part 3, we will examine the merits of combining the two approaches with the hope of identifying differences along all three metrics: experimental condition, success and learning.

Part 3: Combined Analysis

Part 1 leveraged semantic-level descriptions of each student’s actions as the means for exploring the hypothesis that processes significantly differed. This analysis concluded that the use of IMPLEMENT was positively correlated with success, learning and the principle-based reasoning condition. However, grouping students based on their processes only yielded significant results in terms of success, and didn’t have a strong correlation with experimental condition. In Part 2, we presented an analysis that involved automatically-derived, behavioral data. This analysis effectively distinguished between students from the different experimental conditions, based on their differential usage of ENGAGED. ENGAGED was also important for predicting student learning. Additionally, the process similarity metric confirmed these results by creating two groups that almost perfectly align with the two experimental conditions. However, the analysis did not provide much in the way of determining the behaviors associated with success. Having garnered different benefits from each analysis, one can’t help but wonder if combining approaches would provide the quintessential multimodal analysis. Thus, in Part 3 we combine data from Part 1, with data from Part 2 in an effort to push the limits of multimodal analysis, and explore the possibility of generating results that highlight differences in condition, success and learning.

The data for this analysis includes an eight dimensional vector (c-plan, c-realize, c-modify, c-revert, c-nothing, audio, hand/wrist movement and electro-dermal activation). As before, we begin by presenting the most common clusters of behavior among the population of research participants.

Common Behavior Analysis

In describing the clusters, we used names that make reference to the cluster names in Part 1 and Part 2, where appropriate. Figure 21 and Fig. 22 contain the relative frequency and characteristics of each behavior, respectively. As before, the following paragraphs will be used to describe each of the common multimodal behaviors.

Fig. 21
figure 21

Relative frequency of common behaviors for Part 3

Fig. 22
figure 22

Characteristics of common behaviors for Part 3

The first cluster that we describe is the IMPLEMENT-ACTION cluster. This cluster represents 37% of all “test segments” and is typified by significant c-realize, slightly above average hand/wrist movement and slightly above average c-revert.

Values for c-plan, electro-dermal activation and audio are also approximately average, while values for c-nothing and c-modify are generally below average. We interpret this data as being associated with students actively working towards physically completing their structure. Above average c-realize very clearly keys the reader into this fact. The additional modalities measured then provide insights into the other behaviors associated with project implementation, namely, hand/wrist movement.

The second most frequent cluster that we describe is the PREPARE-ACTION cluster, which constitutes 17% of all “test segments.” Much like the PREPARE cluster in Part 1, this cluster is characterized by above average c-plan. However, it differs in that c-nothing is not associated with this particular behavior. Furthermore, there is greater hand/wrist movement associated with PREPARE-ACTION than with IMPLEMENT-ACTION. This runs contrary to the initial assumption that ideating about one’s project does not require extensive hand/wrist movement. In this case, the students use more body gestures while planning than while engaging in IMPLEMENT-ACTION. All of the other values are well below average for this common behavior. In particular, this cluster is associated with the lowest average value for c-modify. Because one would not expect a student to make changes to their structure while in the planning phase, this result seems reasonable.

The third cluster is REST-TALK. As the name implies, it is associated with students appearing to be fairly static (high c-nothing) and occasionally engaging in dialogue. This behavior has the highest average audio value and the lowest hand/wrist movement value. As such, combining the data streams provides a new way to think about the multimodal behaviors associated with planning and appearing to do nothing. It also offers validation that the coding of the Object Manipulation Classes was consistent. Specifically, c-nothing is associated with below average values of hand/wrist movement.

The final behavior is ADJUST-STRESS, which accounts for 22% of all “test segments.” Recall that the previous analyses contained two behaviors associated with adjusting (ADJUST and SIGNIFICANTLY ADJUST) and one associated with high electro-dermal activation (STRESS). When the two data sets are combined, one of the common behaviors that emerges lies at the intersection of these previously identified behaviors from Part 1 and Part 2. Within this common behavior, the student is unlikely to participate in c-plan, c-nothing or c-realize, and is instead focused on modifying their design.
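For readers who want to reproduce profiles of this kind, the sketch below shows one plausible way to derive Fig. 21-style frequencies and Fig. 22-style characteristics from the cluster labels: report each cluster’s share of segments and its mean standardized value on every dimension. This is an illustrative reading of how such figures could be computed, not the article’s exact plotting code.

```python
import numpy as np

def cluster_frequencies(labels: np.ndarray) -> np.ndarray:
    """Relative frequency of each common behavior across all test segments."""
    counts = np.bincount(labels)
    return counts / counts.sum()

def cluster_profiles(scaled_segments: np.ndarray, labels: np.ndarray) -> dict:
    """Mean standardized value of each dimension within each cluster.

    Positive entries read as 'above average' (e.g., high c-realize for
    IMPLEMENT-ACTION); negative entries read as 'below average.'
    """
    return {int(k): scaled_segments[labels == k].mean(axis=0)
            for k in np.unique(labels)}
```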

Process Similarity Comparison

Despite the increased contextualization afforded by combining the multimodal sensor data with the hand-coded data, the results of the process similarity analysis are weak. Group A consists of five students from the example-based reasoning condition and four from the principle-based reasoning condition. The remaining seven students are in Group B (Fig. 23).

Fig. 23 Composition of groups based on experimental condition as derived from process similarity for Part 3

We also observed a lack of differentiation when comparing successful students with unsuccessful students (Fig. 24). Similarly, the process similarity metric results do not align with student learning scores. From this standpoint, combining the data streams did not produce the desired result, which is a good indication that this line of analysis may be less fruitful than the two individual analyses in Part 1 and Part 2. Even so, for completeness, we still use the upcoming sections to investigate coarse- and fine-grain differences in common behavior usage.

Fig. 24 Composition of groups based on success as derived from process similarity for Part 3
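As a rough illustration of this grouping step, the sketch below partitions students into two groups from a pairwise process-distance matrix using agglomerative clustering. The process similarity metric itself is the one introduced earlier in the article and is not reproduced here; the choice of average-linkage clustering is an assumption made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def split_into_two_groups(distances: np.ndarray) -> np.ndarray:
    """Partition students into two groups (e.g., Group A and Group B).

    `distances` is a symmetric (n_students x n_students) matrix of pairwise
    process distances; how those distances are computed is defined by the
    process similarity metric described earlier in the article.
    """
    condensed = squareform(distances, checks=False)    # condensed distance vector
    tree = linkage(condensed, method="average")        # hierarchical clustering
    return fcluster(tree, t=2, criterion="maxclust")   # labels 1 and 2, one per student
```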

Coarse-Grain Cluster Usage Analysis

A coarse-grain analysis of cluster usage between conditions suggests that there are no differences between the two conditions (Fig. 25). The median values for ADJUST-STRESS and PREPARE-ACTION appear to be markedly different between conditions, but these differences are not statistically significant. This lack of significant results is somewhat surprising given the consistent results observed in Part 1, where the IMPLEMENT state was associated with success, learning and the principle-based reasoning condition. Here, IMPLEMENT-ACTION occurs at equal rates between conditions. One potential implication of this finding is that students enact IMPLEMENT differently within each condition. Since adding hand/wrist movement distorted the IMPLEMENT-based differences between conditions, one can infer that students in the principle-based reasoning condition likely completed more IMPLEMENT in the context of smaller-scale hand/wrist movements. This inference follows from the previous observation that students in the principle-based reasoning condition spent more time in ENGAGED, which was characterized by lower hand/wrist movement. In this sense, IMPLEMENT may not be enacted in the same way across conditions.

Fig. 25 Median common behavior usage by condition for Part 3
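For the group comparisons reported below, per-student usage of each common behavior can be compared between groups with a rank-based test. The Mann-Whitney U test in this sketch is an assumption made for illustration; the specific test underlying the reported p-values is not restated in this section.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_behavior_usage(group_a: np.ndarray, group_b: np.ndarray):
    """Compare per-student usage fractions of one common behavior between groups.

    group_a, group_b: arrays of usage fractions (e.g., share of test segments
    spent in ADJUST-STRESS) for successful vs. unsuccessful students, or for
    the two experimental conditions.
    """
    statistic, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
    return statistic, p_value
```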

Considering success, there is a statistically significant difference in usage of ADJUST-STRESS (Fig. 26): successful students were less likely (p = 0.0097) to use ADJUST-STRESS than unsuccessful students. This finding differs from Part 1, where ADJUST was not associated with any significant differences. In this case, it appears that adding the additional context of multimodal behavioral data creates a more complete picture of what students are experiencing. For success, there were no statistically significant differences for the other three common behaviors.

Fig. 26 Median common behavior usage by success for Part 3

Finally, like success, learning also negatively correlates with ADJUST-STRESS (Fig. 27). Students who received positive post-test scores were less likely (p = 0.006) to use ADJUST-STRESS than students who received negative post-test scores.

Fig. 27 Median common behavior usage by learning score for Part 3

Fine-Grain Cluster Usage Analysis

Consistent with the coarse-grain analysis, the fine-grain analysis does not reveal any statistically significant differences between the two conditions. This is somewhat expected given that Part 1 did not reveal differences in fine-grain cluster usage, nor did the coarse-grain analysis above. Furthermore, the common behaviors from the combined data did not contain an analog to ENGAGED, which characterized principle-based reasoning in Part 2.

Comparing fine-grain cluster usage between successful and unsuccessful students expands the set of dimensions on which students differed. Specifically, successful students were more likely (p = 0.0097) to use IMPLEMENT-ACTION and less likely (p = 0.0097) to use ADJUST-STRESS during the middle and final thirds, respectively.

Finally, students with positive post-test scores were less likely (p = 0.0097) to use ADJUST-STRESS in the final third of the activity than students with negative post-test scores.
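The fine-grain comparisons above hinge on dividing each student’s activity into thirds and comparing behavior usage within each third. A minimal sketch of that bookkeeping, assuming an equal three-way split of each student’s time-ordered cluster labels (the exact segmentation used in the article may differ), follows:

```python
import numpy as np

def usage_by_thirds(labels: np.ndarray, n_behaviors: int) -> np.ndarray:
    """Fraction of segments spent in each common behavior within the first,
    middle and final thirds of a single student's activity.

    `labels` is that student's time-ordered sequence of cluster assignments.
    Returns a (3, n_behaviors) array of usage fractions, one row per third.
    """
    thirds = np.array_split(labels, 3)
    return np.array([
        np.bincount(chunk, minlength=n_behaviors) / max(len(chunk), 1)
        for chunk in thirds
    ])
```

The per-third usage fractions could then be compared between groups with the same kind of test sketched in the previous section.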

Discussion

This analysis was intended to combine the benefits of the hand-coded analysis, in terms of cluster usage, with the increased precision and process-oriented distinctions associated with the multimodal sensor data. However, this approach failed to harness the benefits of either of the previous analyses and instead provided useful insights along a different dimension. That this analysis did not produce the desired outcome was evident from the results of the process similarity analysis, which failed to align with success, learning or experimental condition. Despite these shortcomings, the combined analysis did reveal a methodology for predicting student success and learning. In particular, the coarse- and fine-grain cluster usage analyses found that ADJUST-STRESS was negatively correlated with success and learning. While one would theoretically anticipate that students who have not succeeded or learned would experience stress as their time begins to expire, and would resort to extensive adjusting to cope with the pending failure of their structure, having a computational tool to detect this has practical utility for supporting constructionist learning environments.

Nonetheless, the real benefit of this analysis comes, in part, from the ability to more closely understand the nuances of multimodal behaviors. For example, one such finding comes from considering the results of Part 3 in relation to those of Part 1. In Part 1, IMPLEMENT was associated with principle-based reasoning, positive learning gains and success. However, when IMPLEMENT was put in the context of hand/wrist movement (i.e. IMPLEMENT-ACTION), the differences identified in Part 1 vanished. It therefore follows that the gestural enactment of IMPLEMENT may have differed by experimental condition, success and post-test score. One interpretation of this difference is that the two experimental conditions differentially impacted the rates of epistemic and pragmatic actions (Kirsh and Maglio 1994). Epistemic actions are modifications to a system or an environment that help uncover information that may be hard to compute mentally. They contrast with pragmatic actions, which are focused on physically moving the participant closer to their goal state. The combined multimodal analysis thus introduced the possibility that students are using building actions in different ways.

In addition to offering important insights into the nuances of multimodal behaviors, the fact that this analysis did not combine the affordances of the two previous analyses is itself informative. More specifically, this unexpected result highlights important aspects and considerations about the nature and complexities of conducting multimodal learning analytics research that would have been overlooked had we excluded the combined analysis.

Conclusion

This article began with an appeal to think more broadly about how to describe when a given learning strategy is effective. We analyzed strategy efficacy in relation to structural success and learning. Previous work had identified clear indications that principle-based reasoning yielded more favorable results than example-based reasoning (Worsley and Blikstein 2014b). Having seen the benefit of principle-based reasoning across metrics, we raised questions about what mediated these differences. In particular, we were interested in determining the practices associated with principle-based reasoning, and suggested that leveraging multimodal data could provide a means for conducting such an analysis.

Based on this assumption we proposed a general algorithm that allowed us to (1) identify common multimodal behaviors, (2) conduct pairwise process comparisons that maintained the temporal elements of the data and (3) get a glimpse of the different behaviors used by different groups (in terms of experimental condition, success and learning). We conducted three analyses using the same general algorithm. The first analysis utilized human-generated time-stamps of individual actions based on a coding scheme from prior work (Worsley and Blikstein 2013, 2014a). From this analysis, we learned that students who spent more of their “test segments” implementing their ideas were more successful, learned more, and were more likely to be from the principle-based reasoning condition. In this light, one could argue that this supports the current “maker” practice of encouraging students to tinker at the expense of thinking. We would suggest, however, that the direction of causality is more in line with the idea that the principle-based reasoning condition enabled students to spend more time in implementation and less time in adjusting, and that this mediated student success and student learning. Put differently, without the initial focus on principles during the intervention, students would not have been able to engage in sustained implementation. However, this level of causality cannot be determined based on the current analysis. Nonetheless, even without being able to establish causality, this analysis revealed a student practice that bore significance across all three metrics of interest. At the very least, then, this form of analysis can be used for producing predictions. One shortcoming, though, was that the pair-wise process comparison was more closely aligned with success than with experimental condition.

Given that a primary focus of these analyses was to uncover the practices that distinguish principle-based reasoning from example-based reasoning, we conducted a second analysis based on automatically derived data from various multimodal sensors. These data streams included audio, hand/wrist movement and electro-dermal activation. This second analysis produced much higher predictive accuracy for distinguishing between the two experimental conditions, both in terms of cluster usage frequency and in the pair-wise process similarity comparisons. Specifically, students who made more extensive use of a behavior pattern that we called ENGAGED were more likely to be from the principle-based reasoning condition. ENGAGED was also associated with better performance on the post-test. However, this analysis was unable to identify distinctions among students whose structures were of different levels of stability. This lack of correlation with success is significant, because it means that students in the principle-based reasoning condition remained in ENGAGED even when they were not successful. Hence, one cannot argue that deviation from ENGAGED was merely the result of students experiencing challenges with their structures. Nor can one argue that only the more knowledgeable students were likely to remain in ENGAGED, as the correlation between learning and ENGAGED only emerged during the final third of the activity. In summary, then, the two sets of data seemed to offer complementary benefits. As such, Part 3 of this article combined the data sets. This combination produced clusters that were far less predictive than those of the two previous analyses. The analysis did, however, surface common associations between the human-coded data and the multimodal behaviors, and in this way served as additional validation of the human coding. Additionally, the analysis uncovered important nuances of multimodal behaviors. Furthermore, the shortcomings of the process similarity results from the final analysis motivate a later discussion of overarching concerns and considerations that one must account for when conducting multimodal learning analytics research.

Looking across the analyses, there are clear instances where each provided novel insights. In this sense, the overall algorithm appears to have relevance for studying learning, success and experimental condition, but homing in on these correlations requires different modes of analysis.

As a whole, this article has shown that success, learning and process are not equivalent, though they may occasionally overlap. Thus, when thinking about measuring the effectiveness of a given learning environment it is important to be clear about which metrics one hopes to optimize. At the same time, this article has provided additional evidence that experimental condition can have an impact on learning, success and process. Because of this, one must be thoughtful about how to develop learning and reasoning approaches that allow the environment to realize the desired outcomes.

Finally, the three analyses provided evidence that multimodal analysis can serve as a means for studying effective practices. Furthermore, they contribute to the argument that conducting research in constructionist environments likely necessitates adopting non-traditional modes of assessment. Reverting to traditional, uni-modal, outcome-based assessments would belie the goals of constructionist learning. The results from Part 2 made this clear by showing that students spent most of their time in ENGAGED, a behavior characterized by a particular combination of modalities. As a result, there is a need for data analysis techniques that go beyond the current strategies used for studying and assessing “making” and instead adopt approaches that provide a broader, more multimodal perspective on learning. At the same time, however, Part 3 demonstrated that simply concatenating different forms of data does not guarantee a successful analysis. This should not be taken to suggest that multimodal learning analytics lacks utility for advancing the field. On the contrary, all three computational analyses make it evident that multimodal analysis has considerable merit. One simply needs to take care in ensuring alignment among methodology, data fusion and hypotheses.

Despite our focus on “making” as the context of study, we argue that the analyses presented here likely have applicability across a broader set of learning environments. Moreover, these approaches could be used to study other elements of a given learning environment. For example, one could study correlations between teaching practices or the physical design of a learning environment and student engagement. As we saw in the work of Scherr and Hammer (2009) and Kress et al. (2001), student learning experiences are often typified by multimodal behaviors. Fortunately, artificial intelligence and machine learning offer great promise for taking such a multimodal perspective.