1 Introduction

Program comprehension is an essential activity [1] for students as they read and understand source code. It is important for students to develop code reading skills [2] early in their program of study because these skills form the basis of many other activities. To build better teaching tools for programming, it is critical to understand how students learn programming concepts. Source code is a rich combination of syntax and semantics. Determining the relative importance of syntax and semantics for a programmer (especially a student learning to program) requires a better understanding of how programmers read and understand code. From a programmer’s own perspective, the question “Where can I go to find what is important?” is an important research problem that is heavily task dependent.

As researchers develop better teaching and learning tools, we propose that the answers to these questions are stronger when grounded in the experiences of students who are learning in the field. To add to the evidence of how students learn, we present an eye tracking study conducted with students in a classroom setting using thirteen short C++ code snippets chosen based on concepts the students learned in class. The number of studies conducted using an eye tracker has increased in recent years [3]; however, much work remains to understand what students actually read while comprehending code. We see this paper as one among many that will eventually feed into a meta-analysis several years from now. In this paper, we focus on C++ because most previous studies were done on Java. Another unique aspect of this paper is the method used to analyze the data: instead of simply looking at a line-level analysis of what students look at, we study how they read chunks of code and how they transition between chunks to answer comprehension questions. The research questions we seek to answer are:

  • RQ 1: How do students perform on comprehension questions related to short C++ code snippets?

  • RQ 2: What sections of code (chunks) do students fixate on, and does this change with program size?

  • RQ 3: What chunks do students transition between during reading?

Our first research question seeks to determine how accurately students perform on the comprehension tasks. In the second and third research questions, we analyze the eye tracking data collected on the C++ programs by segmenting the programs into chunks of interest and linking those chunks to the students’ performance from our first research question.

2 Related Work

In prior work, we discuss the role eye tracking can have in computing education [4]. In this section, we present selected work on program comprehension done using an eye tracker. For a more exhaustive list, we direct the reader to Obaidellah et al. [3].

An eye tracker is a device that monitors where a user is looking on a screen. Eye trackers record raw gazes at various sampling rates; event detection algorithms are then used to identify fixations and saccades. A fixation is a point on the screen where the eyes are relatively stable, while a saccade is the rapid movement from one fixation to the next, indicating navigation. Fixations typically last between 200 and 300 ms, though durations vary. A sequence of fixations and saccades makes up a scan path [5]. A computer program is a set of instructions written to perform a specified task, and comprehension of a program is defined as understanding its lines of code; the code can be in any language, for example C++, Java, or C#. To investigate one way programmers focus on code, studies have examined particular fragments of code known as beacons. Beacons can differ from user to user, indicating that not all programmers read the same code in the same way [1].
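
To make the event detection step concrete, the sketch below shows a minimal dispersion-threshold (I-DT) fixation detector. This is an illustration only: the thresholds, the gaze-sample format, and the algorithm itself are our assumptions for exposition, not the filter used in this study (Sect. 3.3 names the actual filter).

    # Minimal dispersion-threshold (I-DT) fixation detector (illustrative).
    # Thresholds and the (x, y) sample format are assumptions for exposition.
    def detect_fixations(gaze, max_dispersion=35, min_duration_ms=100, hz=60):
        """gaze: (x, y) samples recorded at `hz` samples/second. Returns
        fixations as (start_index, end_index, centroid_x, centroid_y)."""
        def dispersion(window):
            xs, ys = zip(*window)
            return (max(xs) - min(xs)) + (max(ys) - min(ys))

        min_samples = int(min_duration_ms * hz / 1000)
        fixations, start = [], 0
        while start + min_samples <= len(gaze):
            end = start + min_samples
            if dispersion(gaze[start:end]) <= max_dispersion:
                # Grow the window while the points stay tightly clustered.
                while end < len(gaze) and dispersion(gaze[start:end + 1]) <= max_dispersion:
                    end += 1
                xs, ys = zip(*gaze[start:end])
                fixations.append((start, end, sum(xs) / len(xs), sum(ys) / len(ys)))
                start = end
            else:
                start += 1
        return fixations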

To examine the connection between how programmers read code and syntax highlighting, two studies were performed in which programs had specific highlighting. In the first study, by Beelders et al. [6], 31 participants reading C# code were assigned to either a black-and-white group or a syntax-highlighting group. They were given programs with no errors and unlimited time to read them, and from the recorded eye tracking data the authors reported on regressions, fixations, and scan percentage. While this study showed minimal differences, another study by Sarkar [7] found that highlighting was more effective for novice programmers but became less effective as programmers accrued experience. Raina et al. [8] continued this line of work on how programmers explore code, focusing on whether students retain more information when reading in a less linear pattern. Instead of having students read code left to right and top to bottom, they presented code in a segmented pattern. Using an eye tracker, they examined two metrics: reading depth and reading scores. The 19 students were split into a control group and a treatment group, both given the same C++ module; the treatment group received segmented code while the control group received linear code. Subjects given the segmented code achieved higher reading scores and greater reading depth, focusing on and understanding the code better than those who read it linearly.

Sharif et al. [9] performed a study comparing Python and C++. Participants were split into groups based on their knowledge of each language and were given bug-finding tasks. Metrics included fixation duration, fixation counts, time, and accuracy. The study showed that although C++ debugging took longer, accuracy in matching the output to specifications was higher. Overall, however, the analysis found no significant difference between the programming languages; note that this does not mean no difference exists.

Using an eye tracker can also help us better understand how code is reviewed. In 2002, a study was performed that looked at code reviewing [10]. Using an in-house tool, the authors examined fixations on lines; six programs were reviewed by five programmers. The reviewers shared a similar reading pattern: each first read the entire code once, which the authors call a “scan”. After scanning the code, each reviewer went back and focused on the parts of the code they considered important. While this pattern recurred for all reviewers, the results show that the reviewers differed in their individual patterns, which involved recursive styles and focused on different variables.

Studies have also gathered data on the effect of identifier styles, such as underscores and camel case, on comprehension [11, 12]. Naming style affected the time and effort needed to find identifiers: underscores improved the speed of finding identifiers, while camel case, though slower, was more accurate.

3 Experimental Design

This study investigates what students read while they try to understand C++ code snippets. We study reading by analyzing the eye movements of students using an eye tracker. Each student was first asked to take as much time as needed to read a snippet of C++ code; snippets were presented in random order. After each code snippet, a randomly chosen comprehension question related to that snippet was given. We randomized the order of tasks to avoid ordering bias. Interested readers can find our complete replication package at http://seresl.unl.edu/projects/hcii2019.

3.1 Tasks

The C++ tasks given to participants used a variety of constructs at varying levels of difficulty. The 13 C++ programs are shown in Table 1 with their corresponding difficulty levels. Participants were given as much time as they needed to complete each task. After each task, they answered one of three randomly assigned comprehension questions, followed by questions about their confidence in the answer and the perceived difficulty of the task. At the end of the session, participants were also asked whether they had any problems during the test, whether they were given enough time, and about the overall difficulty of all tasks. Each comprehension question was one of the following: a question about what the program outputs, a short answer question, or a multiple choice question. The main purpose of randomly assigning questions was to deter students from sharing answers with each other, although our method also ensured a well-distributed random assignment.

Table 1. C++ programs with constructs used, number of lines of code, and a difficulty rating based on how easy the concepts are for students to grasp.

3.2 Areas of Interest

To analyze the students’ eye movements in a more structured way, we broke each program down into areas of interest (AOIs). An AOI was created for each line in every stimulus, and fixations were mapped to the appropriate AOI. Next, we grouped these line-level AOIs into “chunks”: groups of contiguous lines whose contents logically fit together into a unit that may be of interest to a programmer. The selection of chunks was customized to both the stimulus and the task given to the participant. We further grouped these chunks into cross-stimulus “code categories”, which we used to discover the constructs that groups of participants looked at most frequently across all stimuli. This process is detailed in Sect. 4.4; a sketch of the line-to-chunk mapping follows.
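
The sketch below illustrates the line-to-chunk mapping referenced above. The chunk names and line ranges are hypothetical examples, not the actual mapping used for any stimulus.

    # Sketch: fold line-level AOIs into chunk-level AOIs. Chunk names and
    # line ranges below are hypothetical, not the study's actual mapping.
    chunks = {
        "boilerplate":  range(1, 4),   # e.g. #include and namespace lines
        "signature":    range(4, 5),
        "control_loop": range(5, 9),
        "output":       range(9, 11),
    }

    def line_to_chunk(line_no):
        """Map a 1-based source line to its chunk name (None if unmapped)."""
        for name, lines in chunks.items():
            if line_no in lines:
                return name
        return None

    # A fixation mapped to line 6 falls inside the control loop chunk.
    assert line_to_chunk(6) == "control_loop"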

3.3 Study Variables

The independent variable is the expertise of the participants, with two treatments: novice and non-novice. The type of question asked was a mixed group factor, randomly assigned to each individual and recorded; we use it to report on the accuracy “traits” of each participant. The nine dependent variables measured in this study are given below; a sketch of how the transition metrics can be computed follows the list. A transition refers to a gaze movement between two chunks. Fixations were detected using the Olsson fixation filter [13] with a 60 ms threshold.

  • Accuracy: The number of questions answered correctly out of those presented. (Each participant was presented a total of 13 questions, answering between 2 and 8 questions from each of the three categories.)

  • Fixation Counts: The number of fixations made in a given chunk.

  • Fixation Duration: The sum of the durations of all fixations made in a given chunk.

  • Chunk Fixation Duration Prior Exit: The average time spent fixating on a chunk before a transition is made.

  • Transition Count: The sum of all transitions between chunks for a given code snippet.

  • Vertical Later Chunk: The percentage of transitions between chunks that move to a chunk lower (later) in the code.

  • Vertical Earlier Chunk: The percentage of transitions between chunks that move to a chunk higher (earlier) in the code.

  • Average Chunk Distance: The average number of chunks spanned by a transition (including the target chunk).

  • Mean Fixation Frequency per Category: Where a participant’s eyes rested most often, in terms of both duration and visit count.
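
As referenced above, the sketch below shows one way to compute the transition metrics from a participant’s sequence of fixated chunks; it reflects our reading of the definitions, not the exact analysis script. Chunk indices increase from the top to the bottom of the code, and None marks a fixation that falls outside every chunk.

    # Sketch of the transition metrics, computed from a participant's
    # sequence of fixated chunks (indices increase down the code; None =
    # a fixation outside any chunk). Illustrative, not the study's script.
    def transition_metrics(chunk_seq):
        # Collapse consecutive fixations in the same chunk into one visit.
        visits = [c for i, c in enumerate(chunk_seq)
                  if i == 0 or c != chunk_seq[i - 1]]
        transitions = list(zip(visits, visits[1:]))
        later = [(a, b) for a, b in transitions
                 if a is not None and b is not None and b > a]
        earlier = [(a, b) for a, b in transitions
                   if a is not None and b is not None and b < a]
        distances = [abs(b - a) for a, b in later + earlier]
        n = len(transitions)
        return {
            "transition_count": n,
            "vertical_later_pct": 100 * len(later) / n if n else 0.0,
            "vertical_earlier_pct": 100 * len(earlier) / n if n else 0.0,
            "avg_chunk_distance": sum(distances) / len(distances) if distances else 0.0,
        }

    # 1 -> 3 -> (unmapped) -> 2: one later transition out of three; transitions
    # through unmapped fixations lower the vertical percentages (cf. Sect. 5.3).
    print(transition_metrics([1, 1, 3, None, 2]))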

3.4 Participants

Students were given a questionnaire prior to the study to gather information on their skill level. A total of seventeen students volunteered for the study. One student did not speak English as their primary language; all others reported high proficiency in English. Only two of the students were over the age of 27, with six between the ages of 23 and 27 and nine between the ages of 18 and 22. Ten of the students were female and seven were male. We split the participants into two groups, novices and non-novices, based on their years in the program. Individuals between their first semester and their junior year were placed in the novice group. Those who had completed at least three of the four years of their undergraduate program, along with participants enrolled in the graduate program, were considered beyond novice level and were placed in the non-novice group.

3.5 Eye Tracking Apparatus

We used a Tobii X60 eye tracker: a binocular, non-intrusive, remote eye tracker that samples at 60 Hz. We used it to record gaze positions, fixations, timestamps, durations, validity codes, pupil size, start and end times, and areas of interest for each trial. The eye tracker was positioned on a desk in front of the monitors where students read the code. With an accuracy of roughly 15 pixels and a rate of 60 samples of eye data per second, the Tobii X60 allowed us to measure our study variables accurately. The monitors were 24" displays set at a \(1920 \times 1080\) resolution.

3.6 Study Instrumentation

To assist in the study, we used a Logitech webcam to record both video and audio. Each student sat in front of a dual-monitor configuration: the code snippets were displayed on one monitor while the researcher used the other to control and monitor the study. After the tasks were performed, students were given a post questionnaire. In it, we asked all students if they had enough time to complete the study, and all participants replied that they did. Eleven students rated the overall difficulty of the tasks as average, six as difficult, and one as very difficult. We also asked the students to describe any difficulties they had; several stated that it was hard to remember the code when determining outputs. Some commented that the study was interesting and enjoyable, while others thought it was intense.

4 Post Processing

After the data was collected, we conducted three post processing steps. The first step corrected the eye tracking data for any drift that might have occurred with the tracker. The second step mapped gaze to lines of code. The third step identified chunks and regrouped lines with similar code structures across all stimuli into “coded categories” that enabled us to analyze gaze patterns across multiple stimuli.

4.1 Data Correction

We used Vizmanip, an open source tool that allows the user to visually locate, adjust, and manipulate strands of contiguously recorded fixations on the code snippet images; it is available at https://github.com/SERESLab/fixation-correction-vizmanip. Fixations recorded directly from the eye tracker can drift [14] some distance away from the defined areas of interest. Given that we had a standard definition of the AOIs, we mitigated this problem by selecting sets of 10 or more contiguous fixations from the dataset and shifting them all by a set number of pixels to better align with the identified areas.
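
The correction itself amounts to a rigid shift of a contiguous run of fixations. A minimal sketch, assuming fixations are stored as (x, y) tuples and that the offset was chosen by the annotators:

    # Sketch of drift correction: shift a contiguous run of fixations by a
    # fixed pixel offset. The (x, y) record layout is an assumption.
    def shift_fixations(fixations, start, count, dx, dy):
        """Shift `count` contiguous fixations beginning at index `start`
        by (dx, dy) pixels, returning a corrected copy."""
        out = list(fixations)
        for i in range(start, start + count):
            x, y = out[i]
            out[i] = (x + dx, y + dy)
        return out

    # e.g. move a run of 12 fixations up by 18 pixels to re-align with its AOI.
    corrected = shift_fixations([(100, 210)] * 20, start=3, count=12, dx=0, dy=-18)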

4.2 Mapping of Eye Gaze to Lines

After all the corrections were done, we used eyecode [15], a Python library focused on parsing and manipulating areas of interest in images; it also maps eye gaze fixations to AOIs (in our case, lines). After verifying that eyecode could appropriately create AOIs from our image formats, we ran its automated image parser to generate AOIs at line-level granularity, which emits a tabular list of rectangles that map to lines. Two graduate students manually inspected each generated AOI file, as eyecode does not always work as intended; six stimuli needed manual readjustment where the generated AOIs did not represent a line. Other parameters in eyecode, such as vertical padding, had to be set by trial and error. Such tedious post processing is unfortunately required when presenting code as images. A better approach would be an IDE that supports eye tracking natively; in the future, we plan to use iTrace [16], which enables implicit eye tracking within the IDE and automatically maps gaze to code elements, eliminating this manual mapping.
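
The underlying gaze-to-line mapping is a point-in-rectangle test against the generated line AOIs. The sketch below illustrates that idea; the rectangles are hypothetical examples, and this is not eyecode’s actual API.

    # Sketch of gaze-to-line mapping: assign a fixation to the line AOI
    # rectangle containing it. The rectangles are hypothetical; this
    # illustrates the idea rather than eyecode's actual API.
    # Line AOIs as (line_no, x, y, width, height) in screen pixels.
    line_aois = [
        (1, 200, 100, 900, 24),
        (2, 200, 124, 900, 24),
        (3, 200, 148, 900, 24),
    ]

    def fixation_to_line(fx, fy, aois=line_aois):
        for line_no, x, y, w, h in aois:
            if x <= fx < x + w and y <= fy < y + h:
                return line_no
        return None  # fixation fell outside every line AOI

    assert fixation_to_line(450, 130) == 2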

4.3 Motivation for Chunks

Line-level AOIs alone can provide interesting results; however, we wanted to explore how participants read groups of related lines, or “chunks”. Code snippets are not read the same way as natural language [2], so each chunk was customized to the code snippet and the task given to the participant. Because of these differences, analyzing chunks of related code can provide more insight into participant behavior than restricting the analysis to line-level AOIs. This practice of chunking has some credence in the study of program comprehension. In the bottom-up comprehension model [17], participants read code and mentally group lines into an abstract representation. In the top-down model [17], participants use their knowledge of the program domain to understand its function; one way they can do this is through beacons, recognizable features in the code such as a series of lines that swap two variables [18]. Both models rely on participants processing the code not as a series of individual lines, but as sets of related lines and functionality.

4.4 Identifying Chunks and Categories

Since our focus was method-level program comprehension, the chosen chunks had to be granular enough to decisively determine whether cognition was leading to comprehension. Three of the authors rated two independently formed chunk mappings, and any disagreements were discussed by at least two authors until full agreement was reached. After 90 min of discussing the 13 programs, the number of lines to be grouped per chunk was decided; it took another 60 min of conferring between a different group of authors to decide the names and categorizations granted to each region in each stimulus. For an example of these chunks, see Fig. 3.

The boilerplate at the top of files (#include and namespace statements fall in this category) was grouped into one chunk in every file. Other notable C++ code elements treated this way are method signature lines, method prototypes, public and private variable declarations (without their access modifiers), and return statements found inside control blocks; we felt that transitions among lines within these would be too minor to study for our tasks. Fixations on prototype access modifiers and the main method signature, where they exist in a stimulus, are omitted from our analysis entirely, as they were not the focus of our study. More interesting are tokens that play a key role in understanding assignment and data flow. Data flow tokens, such as control loop parameters and branch statement parameters, are included in our block categorizations for every stimulus. Data flow patterns also played a role in how we grouped areas of interest: if a stimulus contains two related method calls or def-use flows rooted in the main method, we try to separate method calls with disjoint data flow chains into different chunks, especially in more complex files. This dataflow analysis was conducted and agreed upon via manual inspection by two authors.

We further categorized each chunk pattern into code feature categories. These categories represent groupings of code features that exist across many types of stimuli; in theory, they are the important places where participants would look for information about how the code works. We reduced this set to five categories common enough to be tracked across many stimuli:

  • control blocks include if statements, switch statements, and loop statements (typically their predicates only),

  • signatures include method signatures and constructor signatures,

  • initializers include constructor and method declarations, and statements or statement groups that initialize variables,

  • calls include method calls and constructor calls,

  • output includes statements that generate output printed to the console.

Boilerplate lines, return statements, and inline methods were not grouped into these categories. Though they might provide value, we had to keep the groups under study to a minimum to be able to compare all the means.

5 Experimental Results

5.1 Results for RQ1: Accuracy

The number of questions participants answered correctly is shown in Fig. 1. On average, it took a participant 61.20 s to finish reading the code snippet before moving on to the comprehension question.

Fig. 1. Number of questions answered correctly by each participant

Table 2. Question accuracy, non-novice/novice breakdown: inner cells show means by category and their comparisons. The estimated marginal mean (EMMean) shown for each category gives a fairer value for comparing groups than the unweighted means of the inner cells by applying statistical corrections, including weighting the means according to how many questions were answered in a category. They are shown for replication purposes; we do not use them to draw conclusions at this time.

We provide the data in Table 2 to compare results across the different groups of our sample. We use ANOVA, a robust and reliable way to compare the means of two or more samples, to compare the means of three sets of responses across the two groups (novices and non-novices). Each mean represents the responses to one of the three question types: “Program Overview (Overview)”, “What is the Output? (Output)”, and “Give a Summary (Summary)”. First, post-hoc analysis confirmed that, over all participants, a fairly equivalent number of questions was answered across the three question types (70, 74, and 64, respectively). The ANOVA omnibus F-test indicates significant differences between the means of the novices and non-novices, taking into account weighted means across all three categories (F(1, 15) = 4.618, p = .048, effect size r = .485). As expected, non-novices scored significantly higher than novices across all three question types (mean difference = 24.7%, p = .048). We then took a closer look at the individual means to see whether this trend holds across all question types. In particular, novices did better on program overview questions than on output questions by 34.9% (p = .002). This pattern does not carry over to non-novices, who performed statistically the same on overview questions as on output questions (p = .165). However, non-novices answered significantly more output questions correctly than the novice participants did (p = .042).
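
For readers who wish to reproduce the omnibus comparison, a minimal sketch using SciPy follows; the accuracy values are hypothetical placeholders, with the real per-participant data available in the replication package.

    # Minimal sketch of the omnibus group comparison using SciPy. The
    # accuracy values are hypothetical placeholders, not the study's data.
    from scipy import stats

    novice_acc = [0.46, 0.54, 0.38, 0.62, 0.50, 0.42,
                  0.58, 0.46, 0.54, 0.38, 0.50, 0.62]  # 12 novices (hypothetical)
    non_novice_acc = [0.77, 0.69, 0.85, 0.62, 0.77]    # 5 non-novices (hypothetical)

    # One-way ANOVA with two groups yields F(1, 15), matching the df above.
    f_stat, p_value = stats.f_oneway(novice_acc, non_novice_acc)
    df_within = len(novice_acc) + len(non_novice_acc) - 2
    print(f"F(1, {df_within}) = {f_stat:.3f}, p = {p_value:.3f}")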

5.2 Results for RQ2: Fixations in Chunks

Table 3 gives the results of the Mann-Whitney test on each of the dependent variables. Simple mean comparisons revealed that novices looked at method signatures significantly longer than non-novices (p = .036). Non-novices, however, looked at output statements significantly longer than novices, by 22.8% (p = .031). The first two metrics, fixation duration and fixation counts, are relevant to RQ2.

Table 3. Eye movement metrics calculated over all participants, non-novices, and novices. The p-values for the differences between the non-novice and novice means (using the Mann-Whitney test) are shown along with effect sizes

We found the average total fixation duration across all snippets to be 45.445 s. Non-novices on average had a longer fixation duration, with an average code snippet fixation duration of 46.325 s, while novices had an average chunk fixation duration of 45.049 s. A Mann-Whitney test did not find this grand mean difference between novices’ and non-novices’ fixation durations to be statistically significant (p = 0.7647).
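
The group comparisons here and in Sect. 5.3 use the Mann-Whitney test, with Cliff’s delta as the effect size; Cliff’s delta can be derived directly from the U statistic. A sketch with hypothetical durations:

    # Sketch of a Mann-Whitney comparison with Cliff's delta effect size.
    # The durations are hypothetical placeholders, not study data.
    from scipy import stats

    novice_dur = [45.0, 52.1, 40.3, 47.8, 44.2, 49.5,
                  43.0, 46.7, 41.9, 48.4, 44.8, 46.1]   # seconds (hypothetical)
    non_novice_dur = [46.3, 44.9, 48.0, 45.5, 47.2]     # seconds (hypothetical)

    u, p = stats.mannwhitneyu(novice_dur, non_novice_dur, alternative="two-sided")

    # Cliff's delta from the U statistic of the first sample: d = 2U/(nm) - 1.
    n, m = len(novice_dur), len(non_novice_dur)
    d = 2 * u / (n * m) - 1
    print(f"U = {u}, p = {p:.4f}, Cliff's d = {d:.4f}")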

Table 4. WhileClass chunks ranked by count of participants with highest and second highest total fixation visits and total fixation duration

We now move to a discussion of the results we found while observing fixation patterns among named chunks. We chose four stimuli to break down fixation patterns: two with fewer lines of code, WhileClass and PrintPatternR (Tables 4 and 5), and two with more code, Rectangle and SignCheckerClassMR (Tables 7 and 6). We chose to discuss programs with enough complexity to facilitate deeper discussion: both small programs have at least one loop construct, and the larger ones employ def-use flows that pass through multiple methods. See Figs. 2 and 3 for snapshots of selections from both groups.

Table 5. PrintPatternR chunks ranked by count of participants with highest and second highest total fixation visits and total fixation duration
Fig. 2. Chunks of PrintPatternR with chunks 3 and 4 highlighted

After studying the fixation durations of participants, we noticed that in small programs like PrintPatternR and WhileClass, fixations tended to converge on the exact same point, regardless of whether the participant answered correctly and regardless of expertise. See Table 5 for the chunks participants gazed at the longest. 93% of participants fixated the most often and the longest on chunk 3, the inner for loop with the print statement responsible for printing the asterisk pattern. Notably, this chunk was designed to contain not one but two important code categories, loops and print statements, but participants may look here due to its relevance to the overall function of the program. Chunks 2, 3, and 4 of this program stand out as retaining the longest fixation durations and highest visit counts for most participants; boilerplate scored at the top of only one participant’s focal points of attention. A few chunks were tied for second place in the second-most-visited category.

Table 6. SignCheckerClassMR chunks ranked by count of participants with highest and second highest total fixation visits and total fixation duration
Table 7. Rectangle chunks ranked by count of participants with highest and second highest total fixation visits and total fixation duration

We find a few contrasts to small programs like PrintPatternR when we look at large programs such as Rectangle (Table 7) and SignCheckerClassMR (Table 6). We see trends in programs with more information that do not occur in the small programs. For Rectangle, most participants focused on the bodies of inline methods and constructors (see Table 7). The dimension methods received the most fixations and the longest durations for most participants, followed closely by either the area calculation method or the constructor. This seems to show that most participants were concerned with the information offered by the statement code rather than by the declarations and prototypes. In Fig. 3, the program is numbered by chunk with shaded regions; darker hues represent regions that more participants visited the most throughout their sessions. We note that variable and method declarations (outside signatures) did not receive the highest attention of any participant, and the results for these programs do not show the main method gaining much attention either. It is promising that our analysis was able to capture these results.

Fig. 3. Chunks of related code for Rectangle.cpp with top visited chunks highlighted

Looking closely, the most looked-at chunks cover the constructor, its method signature, and its helper method definitions. Our results do not greatly involve the main method: at least one element of the boilerplate code and one of the main method chunks scored in the bottom three most-fixated chunks for eight of our participants, a trend that covered both high and low scorers. Four of the six top scorers for Rectangle had chunk 5 as their most fixated chunk.

5.3 Results for RQ3: Chunk Transitions

We address RQ3 by observing closely the transitions made within various stimuli, by looking at other dependent variables such as fixation counts more closely, and by looking for trends that hold across gaze data for multiple stimuli. The first metric we investigate is the number of transitions between chunks made by a participant during a single task. On average, a participant made 48.63 transitions between chunks during a single task. Non-novices made more transitions on average (50.84) than novices (47.64); a Mann-Whitney test did not find this difference to be statistically significant (p = 0.5091).

Next we analyzed Chunk Fixation Duration Prior Exit. On average, participants spent 0.82 s fixating on a chunk before transitioning to another chunk. Non-novices had a shorter Chunk Fixation Duration Prior Exit, averaging 0.69 s before a transition was made, while novices looked at chunks for a longer 0.88 s. A Mann-Whitney test found this difference to be statistically significant (\(p < 0.001\)); the effect size was small according to Cliff’s delta (d = 0.1952).

For the Vertical Later Chunk metric, on average 45.00% of transitions were made to a vertically lower chunk. Non-novices made fewer transitions to vertically lower chunks, averaging 44.51% of transitions, while for novices such transitions accounted for 45.22% on average; a Mann-Whitney test found these differences not statistically significant (p = 0.7945). Next we analyzed the related metric, Vertical Earlier Chunk. On average, 38.79% of transitions were made to a vertically earlier chunk. The Vertical Later Chunk and Vertical Earlier Chunk percentages do not add to 100% because some transitions are made to lines that are not included in any chunk, or to points that are not mapped to lines. Non-novices made more transitions to vertically earlier chunks, averaging 41.20% of transitions, while for novices the Vertical Earlier Chunk accounted for 37.71% of transitions on average; a Mann-Whitney test found these differences statistically significant (p = 0.0151), with a small effect size according to Cliff’s delta (d = 0.2245). These two metrics show that non-novices are less likely to read code strictly from the top chunk to the bottom chunk and are more flexible in the direction of their transitions. In addition, non-novices transition from chunk to chunk, rather than through lines not included in any chunk, more than novices do.

The average chunk distance of a transition between chunks was 1.49. Non-novices transitioned to chunks that were on average farther away, with an average chunk distance of 1.57, while novices averaged a chunk distance of 1.46. A Mann-Whitney test found these differences statistically significant (p = 0.0080), with a small effect size according to Cliff’s delta (d = 0.2448). The most common chunk distance for a transition was 1, showing that participants most commonly transitioned to chunks close to the chunk currently being fixated on.

We now combine the results obtained from the eye tracker, namely the fixation regions of each participant and the length of each fixation, with the data we have on the locations of chunks in files. We use the Radial Transition Graph Comparison Tool (RTGCT), provided by researchers at the University of Stuttgart Institute for Visualization and Interactive Systems. The tool displays data from fixation files in a tree-annulus style visualization that shows how long a participant’s gaze rested on each part of the code and lets users view the activity of a whole task at once in a single image. Each chunk is colored differently and positioned adjacent to the others along an annulus, with the arc length of its color showing the percentage of the participant’s total task duration taken up by accumulated fixations on that chunk. See Fig. 4.

We observe the output of the tool for two of our largest programs, where we can find some interesting transitions. We first discuss the Rectangle example.

Fig. 4. Output of RTGCT for Rectangle, highlighting inter-chunk transitions between the constructor, dimension methods, and the area method

The top scorers in the non-novice category were P01 and P06, and a few notable trends appear in their results. As Fig. 4 shows, for P01 and P06 the transition rates between the constructor signature and both the area function and the chunk named “dimension methods” (containing the width and height functions for the rectangle) are high compared to transitions involving the main method, boilerplate, and other regions of the program. P01, one high scorer, made 7 transitions between the dimension methods and the area method; P06, the other high scorer, made a striking 10 transitions between the constructor signature and the area method. These transitions are either non-existent or diminished for the other non-novice participants, indicating to us that these two parts of the program may have been important for these two participants.

The SignCheckerClassMR code snippet transitions are visualized in Fig. 5. To properly depict all transitions without hiding any, we used the RTGCT’s “Equal Sectors” mode, which shows all chunks as equally sized segments along the outer ring. In this example, P01 and P07 performed worse than the other participants; a trend of transitioning between the methods and the constructor may have contributed to this.

Fig. 5. Output of RTGCT for SignCheckerClassMR, indicating trends in method declaration lookups, with ring sectors sized equally regardless of duration percentages

5.4 Threats to Validity

We describe the main threats to the validity of our study and the measures taken to mitigate them.

Internal Validity: The 13 C++ programs used in this study are code snippets and might not be representative of real-world programs. To mitigate this, the code snippets varied in length, difficulty, and constructs used, adding variety to our independent variables. Correcting the eye tracking data to account for drift can introduce bias. To mitigate this, only groups of ten or more fixations were moved at a time, and the new location had to be agreed on by two of the authors.

External Validity: A threat to the generalization of our results is that all our participants were students. This was mitigated by including students with widely varying degrees of expertise, ranging from 1 year of study to 5+ years (4 years of baccalaureate plus some years in a graduate program). Another threat is our sample size: we ended the study with comprehension data from 17 participants and viable eye tracking data from 15 participants, and the results we analyzed for non-novices came from only 5 participants. In response, we note that we gathered repeated measures on at least 10 stimuli per participant, collecting a total of 57 eye-gaze patterns and 65 question responses from the non-novices alone, which speaks to the rigor of our per-participant assessments.

Construct Validity: A threat to the validity of this study is that the method we used to break lines into chunks relied on standards agreed upon by the authors about whether certain chunks would remain relevant by the end of the study. These decisions may not generalize to all potential code comprehension analyses, as they were made subject to the data the authors had at their disposal at different points of the study. To mitigate this threat, we carefully synchronized each decision on how to divide lines into chunks for each of our 13 stimuli, and two of the authors met for 90 min before the final decision was made on which chunks would remain. Since we are only measuring our participants on program comprehension, a mono-operation bias can occur. To mitigate this, we used three different types of program comprehension questions (summarization, output, and overview) to vary the exact task being performed.

Conclusion Validity: In all our analyses we use standard statistical measures that are conventional tools in inferential statistics, and we take into account the assumptions of each test. For comparisons of question accuracy we used analysis of variance (ANOVA), whose F-test decides whether the means being compared are equal; for the eye movement metrics we used the Mann-Whitney test with Cliff’s delta as the effect size.

6 Discussion

We found differences between the two levels of expertise in the frequency of eye movements among the chunks we coded. Non-novices spent less time fixating on a chunk before transitioning to another (see Sect. 5.3), tended to transition to chunks farther from their original position, and made more transitions to earlier chunks than novices did. Looking closer at what participants took the most interest in, we found that for the smaller programs (PrintPatternR and WhileClass) over 90% of all participants from both groups fixated on a single segment of code. Larger programs like Rectangle produced situations with little agreement, especially among non-novices, about which chunk received the most fixations, the longest fixation durations, or both. These results were not necessarily isolated to Rectangle.

When looking at fixation data (without considering question responses), non-novices tended to favor output statements over other elements (other than control blocks) most of the time. Interestingly, novices tended to allocate equal amounts of attention across areas other than control blocks; they held their fixations on declarations more than signatures, but this is the only deviation from that pattern we could find. Output statements were the second-least visited category among all the coded categories for novices, and method signatures were the least visited category for both novices and non-novices. For over 50% of the questions, non-novice participants had output statements among their top two most visited categories.

When looking at responses to questions, we realized that we cannot say much about which fixation categories generally lead to better answers, because the better areas to fixate on depend heavily on the content of the stimulus. Our data showed that for some stimuli, those with more complex helper methods, participants who focused longest on method calls received better scores, but focusing on method calls predicted worse scores for a stimulus with more complex control blocks. Future work will need to control for the complexity of code within a stimulus across multiple stimuli, perhaps evening out the complexity of control blocks and of def-use method call chains, to ensure fair comparisons when determining which fixation patterns might lead to better performance. We did not have enough stimuli to make this kind of comparison, even though we noticed performance differences between two stimuli with these structural differences.

7 Conclusions and Future Work

This paper presents an eye tracking study of thirteen C++ programs conducted in a classroom setting with students during the last week of a semester. We find that the link between a student’s expertise and how accurately they answer questions is made much clearer when paired with insight into which visual cues students used the most. The visual cues led us to discover that students agree less on which areas to focus on as program size grows. These insights also showed that the frequency of incorrectly answered questions is significantly affected only in certain stimuli by the areas participants looked at, or perhaps by what they did not look at. Finally, we saw that the performance of non-novice students can be closely linked to both the number of fixations and the transitions made between important segments of the code. More research will be required to determine whether it is the data flow through the constructs or simply the types of constructs available that drives where participants look. We were able to uncover and visualize patterns among top performers that show which transitions may have mattered most as cues leading to better understanding. In addition, more research will be required to learn whether more frequent transitions among coded categories within stimuli are truly linked to better performance, or whether other factors we did not observe contributed more to success. As part of future work, we would like to use the iTrace infrastructure [16] to conduct experiments with industry professionals on real large-scale systems.