1 Introduction

Cyber security has become a topic of major concern since the rise of computing and the internet. The loss of personally identifying information (e.g., [1]), accidents due to code issues (e.g., [2]), and banking errors (e.g., [3]) have increased the demand for safe, secure, and resilient systems. However, no system is completely secure if it operates on an outward-facing network. Security vulnerabilities have pushed computer programmers to minimize threats in a timely manner, reducing system downtime and the amount of information an outsider threat can access. One response to cyber-attacks is the use of automated code repair tools that can patch vulnerabilities, such as ClearView [4], ARMOR [5], and GenProg [6]. Although these tools are still in their infancy and not yet deployable for use at runtime, the design of such systems is critically important for human operators. If operators do not trust an automated code repair tool, they will not deploy it, and vulnerabilities will not be patched in a timely manner. Research on psychological principles that influence trust can help software engineers understand which design aspects lead to proper trust calibration (i.e., increasing user trust as a function of system effectiveness; see [7]), which can improve the accessibility and usability of automated code repair tools. The current study explores the factors that influence human trust in the GenProg [6] automated code repair tool.

1.1 Automated Code Repair Tools

Software is now embedded in almost every aspect of modern life, from industrial control systems and banking software to the “Internet of Things” and wearable technology. This ubiquity of software has led to a need for secure, efficient, and timely code repairs. However, the number of defects in software often exceeds the assets available to deal with them [8]. Technology exists for automatically detecting vulnerabilities in software, such as static analysis [9, 10], intrusion detection [11], and software diversity methods [12]. However, identifying an issue is only half of the problem; the issue must still be fixed. Research in computer science has begun employing evolutionary computation [13] and genetic programming [6] to repair code when predefined test cases fail. Genetic programming utilizes genetic mutations to generate and test candidate repairs to software, reducing the time and money spent on code repair [14]. Potential repairs are constructed offline and retested before being placed into the code.
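
To make this generate-and-validate idea concrete, the sketch below shows a heavily simplified mutate-and-retest loop in R. It is an illustration only, not GenProg's implementation: the "program" is mocked as a vector of statements, the test suite is a pair of toy predicates, and all function and variable names are ours.

```r
# Illustrative generate-and-validate repair loop (not GenProg itself). The program
# is mocked as a character vector of statements and the test suite as simple
# predicates; a real tool would compile and run the patched C program.

program <- c("x = read_input()", "y = x + 1", "y = y - 1", "print(y)")

test_suite <- list(
  function(p) "print(y)" %in% p,      # the output statement must remain
  function(p) !("y = y - 1" %in% p)   # the (mocked) buggy statement must be gone
)

run_tests <- function(p) sum(vapply(test_suite, function(test) test(p), logical(1)))

mutate_program <- function(p) {
  i <- sample(length(p), 1)
  if (runif(1) < 0.5) {
    p[-i]                                            # delete a statement
  } else {
    append(p, p[sample(length(p), 1)], after = i)    # copy a statement from elsewhere
  }
}

repair <- function(p, generations = 500) {
  best <- p
  for (g in seq_len(generations)) {
    candidate <- mutate_program(best)
    if (run_tests(candidate) > run_tests(best)) best <- candidate   # keep improving variants
    if (run_tests(best) == length(test_suite)) return(best)         # all tests pass: patch found
  }
  best
}

repair(program)   # typically returns the program with the buggy statement removed
```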

Advances have been made in the field of genetic programming over the course of the past decade, with the GenProg program expanding from the C programming language to include the Java language [15]. However, little research has explored genetic programming from a human factors perspective. As the code repair software uses genetic mutations to find adequate solutions for bugs, it remains to be seen how programmers perceive such patches and repairs. As noted in the human factors literature, users must trust a system or the system will not be implemented [7].

1.2 Trust in Automation

Trust has been defined as the willingness to be vulnerable to another with the expectation of a positive outcome [16]. Although this definition was originally developed for studying human-human trust, research has extended it to human-automation relationships. Trust is an important issue in automation design because trust facilitates reliance behaviors [7]. Sheridan [17] noted that the level of trust an operator has in an automated system can significantly influence how the system is used. The use of automation and its reliability do not always follow a one-to-one relationship. People may rely too little on capable automated systems (under-trust) or too much on systems that do not operate at their assumed level (over-trust). Over-trust can lead to accidents, such as when drivers do not pay attention while using the autopilot functions of Tesla cars [18]. Conversely, under-trust can also lead to accidents, such as when a human operator who is too tired to drive safely does not use automation that helps alleviate driver fatigue (e.g., driver lane assist). As such, proper trust calibration is desired in contexts in which humans interact with automated aids.

1.3 Trust in Code

Recently, findings from the interpersonal and automation trust literatures have been applied to research on computer programming. Research in the computer and psychological sciences has explored how programmers perceive computer code [19]. However, until recently no research had explored how programmers perceive the trustworthiness of computer code or the psychological processes behind these perceptions. Alarcon et al. [20] performed a cognitive task analysis (CTA) and found three factors that influence perceived trustworthiness and decisions to reuse code: reputation, transparency, and performance. Subsequent research has empirically demonstrated that these factors are present in code and can be experimentally manipulated [21, 22].

To understand the psychological processes occurring when programmers review code, the heuristic-systematic model (HSM) [23] was applied to perceptions of code [19]. The model hypothesizes that two processes occur when deciding whether to trust code: heuristic and systematic processing. Heuristic processing comprises the use of rules or norms to form judgments and make decisions about a referent. In contrast, systematic processing constitutes more effortful processing, involving critical thinking about the referent to reach a decision. For example, Alarcon et al. [21] found that reducing the readability or credibility of computer code led to quicker decisions to abandon the code, indicating heuristic processing. In contrast, degrading the organization of computer code led programmers to spend more time on the code, indicating systematic processing. Programmers trusted code that was degraded in organization because they spent more time evaluating the code and found it was functional. The authors replicated these effects in an additional sample in the same paper.

A key tenet of the HSM is the sufficiency principle [23]. Perceivers are motivated to process efficiently, not wanting to process more information than necessary. This efficiency is coupled with confidence. The model posits that perceivers hold both an actual confidence level and a desired confidence level. If actual confidence is at or above the desired level, processing stops; if it is below the desired level, processing continues. In the HSM, perceivers are motivated to exert the least amount of processing needed to reach their desired confidence level. This balance between processing efficiently and attaining enough information to meet the desired confidence level is the sufficiency principle [23].
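
As a rough computational reading of the sufficiency principle, the hypothetical sketch below keeps gathering evidence only while actual confidence falls short of the desired threshold; the function and variable names are ours, not terms from the HSM literature.

```r
# Hypothetical illustration of the sufficiency principle: keep processing only
# while actual confidence falls short of the desired (sufficiency) threshold.
review_until_sufficient <- function(desired_confidence, gather_evidence) {
  actual_confidence <- 0
  effort <- 0
  while (actual_confidence < desired_confidence) {
    actual_confidence <- gather_evidence(actual_confidence)  # one more processing step
    effort <- effort + 1
  }
  list(confidence = actual_confidence, effort = effort)      # stop once "good enough"
}

# Example: each review step adds a random amount of confidence.
review_until_sufficient(0.8, function(conf) conf + runif(1, 0.1, 0.3))
```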

1.4 Trust in Automated Code Generation

Research in the field of trust in automation has burgeoned in the last two decades. Hoff and Bashir [24] conducted a comprehensive review of the trust in automation literature and found three factors influence trust in automation: human factors, situational factors, and learned trust factors. Biases in trust beliefs—such as perceived trustworthiness—may influence both the human and learned trust factors. Human factors aspects, such as certain personality and dispositional variables, may impact perceived trustworthiness [25]. These trust perceptions are important, as they lead to reliance behavior [7, 26].

There is a great deal of literature covering aspects of the trust process in automation, such as ground collision avoidance software [27], soldier detection [28], and x-ray screening tasks [29]. However, the authors are not aware of any research that has explored these aspects with automated computer code. As mentioned above, research on automated code repair is growing rapidly. It is important to determine the psychological aspects of the human user that facilitate trust in automated code repair tools so that, when such a system is ready to deploy, programmers will trust it appropriately and use it when needed.

1.5 Deleting vs Commenting Out Changes

Humans are more sensitive to errors made by automation, with trust declining more sharply when automation commits errors than when a human commits errors [30]. In initial interactions with automation, users typically monitor the automated system to ensure it is functioning properly. Transparency is a key aspect in determining whether to trust a system, as transparency conveys information to the user about the decision or task the automation is performing [7]. Lyons et al. [31] found that trust in the automatic ground collision avoidance system in fighter jets increased when information about the system was added to the pilots’ display, providing greater transparency. Similarly, Alarcon et al. [21] found transparency was a key factor in whether programmers would reuse code, such that more transparent code was reused more often. Research has consistently demonstrated that transparency is a key factor in software contexts [22, 32]. As such, one code repair behavior that may be important for adaptive computer programs is whether they delete or comment out code. Deleting code can be problematic because the user must refer to older versions of the architecture to retrieve what was deleted, reducing the transparency of the program. In contrast, when the program comments out pieces of code, the modified section remains in the architecture and can be easily reviewed, providing greater transparency. A commented-out section can also be reintroduced if needed by removing the comment characters.
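
To illustrate the two presentation styles, the hypothetical R helper below applies the same repair to a vector of C source lines either by removing the affected lines or by prefixing them with a C comment marker; only the latter leaves the original statements visible to the reviewer.

```r
# Hypothetical illustration of the two repair presentations: the same lines are
# either deleted outright or retained behind C-style comment markers.
apply_repair <- function(src_lines, start, end, mode = c("delete", "comment")) {
  mode <- match.arg(mode)
  if (mode == "delete") {
    src_lines[-(start:end)]                                    # lines vanish from the file
  } else {
    src_lines[start:end] <- paste("//", src_lines[start:end])  # lines remain, but inert
    src_lines
  }
}

buggy <- c("int y = x + 1;", "y = y - 1;", "printf(\"%d\\n\", y);")
apply_repair(buggy, 2, 2, "delete")    # repaired code no longer shows the old statement
apply_repair(buggy, 2, 2, "comment")   # repaired code keeps "// y = y - 1;" for review
```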

1.6 Human vs Computer Repair Trustworthiness

Another important factor to consider when using code is reputation. Reputation is defined as “perceived code trustworthiness cues based on external information,” such as “the source of the code, information available in reviews, and the number of current users of the software” [20, p. 112]. As such, trust in a code repair may depend on whether an automated aid or a human completed the repair. The complexity of truly autonomous systems comes at a cost: greater system complexity often means less control over and predictability of system behavior [33], both of which may challenge human trust in the system. The ability to adapt, learn, and dynamically respond to environmental stimuli is a foundational attribute of a truly autonomous system, yet such capabilities come with inherent expectations and a need for traceability regarding the appropriateness of the system’s decision logic [34]. Automated systems capable of learning and adapting may be able to perform tasks themselves, but the operator must trust the system to perform the task with little oversight for a “reliable” system to achieve maximum benefit. However, trust in automation may differ from trust towards humans [7].

Programmers generally trust other programmers when they review and accept code revisions from them [21]. Accepting a code revision may be based on aspects such as the experience of the programmer who receives the code, the experience of the programmer who made the changes, or the reputation of the repository from which the code was retrieved. However, there may be differences in how programmers trust a computer program that repairs software compared to a human who repairs software. Lewandowsky and colleagues [35] found that automation faults reduced users’ self-confidence in performing a task and reduced trust in their partner when participants perceived they were paired with an automated system. In contrast, when participants perceived they were paired with a human partner, self-confidence was not affected, even though all participants were in fact paired with an automated system. These results demonstrate stronger polarization of trust and distrust in human-machine teaming than in human-human teaming, as performance faults influence trust more in the former than in the latter. They are consistent with other research finding that people are less extreme in their assessments of human-human distrust relative to human-human trust in word elicitation, questionnaire, and paired comparison studies [36]. In contrast, assessments of human-machine trust are less extreme, yet human-machine distrust is more extreme [36]. In other words, humans are more extreme in their distrust of machines and more extreme in their trust of humans.

In the current study, we hypothesize that there are differences in how people trust humans compared to automated systems and the changes made by each to software. Specifically, we expect programmers to want more transparency from automation, such that commented out changes to the code will lead to higher perceptions of trustworthiness and higher use endorsement rates compared to deleted code. We also expect programmers to trust human repairs more, compared to an automated repair program.

2 Study 1

2.1 Study 1 Method

Participants.

A total of 24 student programmers from a Midwest university were recruited to participate in the study in exchange for $50 (USD) in financial remuneration. Participants were required to have a minimum of four years of experience with software development and sufficient knowledge of the C programming language. The sample was primarily male (70.1%) with a mean age of 24.38 (SD = 3.50) years and a mean of 6.29 (SD = 3.37) years of total programming experience; 58% stated they use C on a weekly basis.

Design.

Participants were asked to review and assess repairs made to source code written in C. The study utilized a 2 × 2 between-subjects design with 5 within-subject trials. The between-subjects factors were the type of repair made (i.e., deleting lines of code vs. commenting out lines of code) and the reported source of the repairs (i.e., human vs. automation). The 5 within-subject trials consisted of 5 different pieces of source code and their repairs. Participants were randomly assigned to one of the four conditions and completed each of the five trials. A fully balanced design was achieved, with 6 participants in each condition and 5 trials for each participant.

2.2 Study 1 Measures

Trustworthiness.

We used a single-item measure to assess overall perceived trustworthiness, consistent with previous studies on trust in code [21, 22]. Participants indicated their perceptions of trustworthiness with the item “How trustworthy do you find this repair?” on a scale ranging from 1 = Not at all Trustworthy to 7 = Very Trustworthy. Research has indicated that single-item measures are appropriate when the construct is well defined and multiple items are likely to result in response fatigue [37].

Review Time.

The time spent on the code was assessed with HTML timestamps from each page. A timestamp was used to assess when a participant started reviewing the code repairs and when the participant left the code repair to write their assessments. Each page contained only one stimulus.

Trust Intentions.

We adapted Mayer and Davis’ [38] trust intentions scale to assess intentions to trust the referent (i.e., the software repairer). The scale consists of four items, rated on a 5-point Likert scale (1 = Strongly Disagree to 5 = Strongly Agree). The first and third items were reverse coded. We adapted the items to reflect the referent being assessed. An example item is “I would be comfortable giving [Bill, GenProg] a task or problem which was critical to me, even if I could not monitor their actions.” Participants rated their intentions to trust the referent once before beginning the experiment and once after they had finished reviewing all code stimuli. Additionally, participants were asked with a single item whether they would endorse each code repair for use, with “Use” or “Don’t Use” as response options. This provided a single measure of reviewers’ intention to trust each of the code repairs.
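
For clarity, the sketch below shows one common way such a four-item score can be computed, with items 1 and 3 reverse coded on the 5-point scale; the item values and function name are hypothetical.

```r
# Hypothetical scoring of the four-item trust intentions scale (1-5 Likert).
score_trust_intentions <- function(items) {
  stopifnot(length(items) == 4, all(items %in% 1:5))
  items[c(1, 3)] <- 6 - items[c(1, 3)]   # reverse code items 1 and 3
  mean(items)                            # average the items (one common scoring approach)
}

score_trust_intentions(c(2, 4, 1, 5))    # returns 4.5 after reverse coding
```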

2.3 Study 1 Stimuli

An HTML testbed was created for participants to review the five pieces of code written in C. The testbed presented the code by emulating a diff utility, which provides a side-by-side comparison of two different files, typically source code, and highlights the differences between the two (see Fig. 1). Shortcut buttons were provided at the top of the testbed so participants could quickly navigate to the sections of the code where changes were made.
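
As a rough sketch of this diff-style presentation (the actual testbed was an HTML page, not R code), the snippet below prints two versions of a few lines side by side and flags the changed line.

```r
# Rough sketch of a side-by-side (diff-style) presentation; the real testbed was HTML.
show_diff <- function(old_lines, new_lines) {
  n <- max(length(old_lines), length(new_lines))
  old_lines <- c(old_lines, rep("", n - length(old_lines)))   # pad the shorter version
  new_lines <- c(new_lines, rep("", n - length(new_lines)))
  marker <- ifelse(old_lines == new_lines, "    ", "  | ")    # flag changed lines
  cat(sprintf("%-35s%s%s", old_lines, marker, new_lines), sep = "\n")
}

show_diff(c("int y = x + 1;", "y = y - 1;", "return y;"),
          c("int y = x + 1;", "// y = y - 1;", "return y;"))
```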

Fig. 1. Example of diff stimuli.

Before and after seeing each code repair, participants were shown a list of test cases and whether they passed or failed (see Fig. 2). Stimuli consisted of 5 successful patches that were produced by the GenProg program. The stimuli consisted of code samples ranging from 7874 to 16779 lines of code with between 3 and 18 repaired lines across 1 to 5 locations within the code. The code artifacts were ordered such that the simplest changes were shown first, with changes becoming progressively more complex (based on the number of modified lines) as the participants moved through the study. Changes included replacing lines of code with code found elsewhere in the sample, deleting lines of code, and copying lines of code from elsewhere in the sample. Before and after code comparisons (i.e., diffs) were presented in the color scheme used by Visual Studio Code (a popular, free software development tool). All samples were shown in the format provided by GenProg, which was modified from the original by expanding multiple single-line commands into multi-line commands.

Fig. 2. Example of test cases stimuli.

2.4 Study 1 Procedure

Participants completed demographic information and background surveys on a locally hosted website. Participants were then informed of their task and given a brief description of the source of the code repairs (i.e., GenProg or a human programmer named Bill). After reading the description of the source, participants completed an initial trust intentions survey. Participants then conducted the task, reviewing 5 pieces of repaired code. Unbeknownst to the participants, all repairs were produced by GenProg and were identical across both referent conditions (i.e., Bill vs. GenProg). After reviewing each repair, participants rated their perceived trustworthiness of the repair, indicated whether they would endorse the repair for use, and wrote any remarks they had about the code repair in the provided textboxes. After reviewing all 5 code patches, participants completed the trust intentions questionnaire with regard to their assigned referent a second time.

2.5 Study 1 Results

Missing data were observed for one participant on the fifth and final trial for the trustworthiness, use endorsement, and trust intentions responses. Missing values were multiply imputed with 20 datasets via the bootstrapped EM algorithm provided by the Amelia package [39] in R. Reliabilities, means, standard deviations, and zero-order correlations for trust intentions and perceived code trustworthiness in Study 1 are available from the first author.
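
A minimal sketch of this imputation step is shown below; the data frame and variable names are hypothetical stand-ins for the study data, and the mock data exist only so the call is runnable.

```r
# Minimal sketch of the multiple imputation step; d is a hypothetical stand-in for
# the study data (one row per participant x trial), built here only for illustration.
library(Amelia)

set.seed(42)
d <- data.frame(
  participant_id = factor(rep(1:24, each = 5)),
  source         = factor(rep(c("Bill", "GenProg"), each = 60)),
  repair_type    = factor(rep(rep(c("delete", "comment"), each = 30), times = 2)),
  trial          = rep(1:5, times = 24),
  trust          = sample(1:7, 120, replace = TRUE),
  review_time    = round(runif(120, 60, 600)),   # seconds spent on the diff page
  endorse_use    = rbinom(120, 1, 0.6)           # 1 = "Use", 0 = "Don't Use"
)
d$trust[120] <- NA   # one missing rating, as in the study

imp <- amelia(d, m = 20,                             # 20 imputed datasets (bootstrapped EM)
              idvars = "participant_id",             # identifier, excluded from the model
              noms   = c("source", "repair_type"))   # nominal (categorical) variables

# Each completed dataset in imp$imputations would then be analyzed and results pooled.
```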

Trustworthiness.

A repeated-measures analysis of variance (RM ANOVA) was conducted to analyze differences in perceived trustworthiness of the code repairs as a function of source of repair, type of repair, trial, and their interactions. The design was counterbalanced, with six participants in each condition and 5 observations per participant. Mauchly’s test of sphericity was non-significant, χ²(4) = 0.42, p = .069, indicating the sphericity assumption was not violated. Type III sums of squares were used for interpretation of the effects. Source of repair significantly predicted perceived trustworthiness, F(1, 20) = 5.14, p = .035, ηp² = .20. Estimated marginal means, presented in Fig. 3, indicate that repairs attributed to GenProg (M = 4.87) were perceived as significantly more trustworthy than repairs attributed to Bill (M = 3.44). Neither the type of repair, F(1, 20) = 0.00, p = .993, nor the trial number, F(4, 80) = 1.13, p = .348, explained significant variance in perceived trustworthiness. Finally, the two-way interactions between source of repair and type of repair, F(1, 20) = 0.00, p = .993, source of repair and trial, F(4, 80) = 1.42, p = .235, and type of repair and trial, F(4, 80) = 1.37, p = .250, as well as the three-way interaction between source of repair, type of repair, and trial, F(4, 80) = 0.02, p = .823, were all non-significant.
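
One way such a mixed-design RM ANOVA can be specified is sketched below with the afex package, which reports Type III sums of squares, Mauchly’s test, and the Greenhouse-Geisser correction; the variable names continue the hypothetical data frame from the imputation sketch, and we do not claim this is the exact code used in the study.

```r
# Sketch of the 2 (source) x 2 (repair type) x 5 (trial) mixed-design RM ANOVA on
# perceived trustworthiness, using one completed dataset from the imputation above.
library(afex)

d1 <- imp$imputations[[1]]   # in practice, results are pooled across all 20 datasets

fit_trust <- aov_ez(id      = "participant_id",
                    dv      = "trust",
                    data    = d1,
                    between = c("source", "repair_type"),
                    within  = "trial")
summary(fit_trust)                       # F tests, Mauchly's test, GG-corrected effects
emmeans::emmeans(fit_trust, ~ source)    # estimated marginal means by source of repair
```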

Fig. 3. Student programmers’ perceptions of trustworthiness marginal means and standard errors.

Review Time.

Previous studies have analyzed the time taken to review code to investigate engagement with computer code [21, 22]. Thus, we conducted a second RM ANOVA to analyze differences in review time as a function of source of repair, type of repair, and trial. Mauchly’s test of sphericity, χ²(4) = 0.36, p = .025, indicated the sphericity assumption was violated, so a Greenhouse-Geisser correction was applied to the degrees of freedom when interpreting the model effects. Neither source of repair, F(1, 20) = 0.13, p = .719, nor type of repair, F(1, 20) = 0.01, p = .941, was significant. However, trial had a significant main effect on review time, F(2.80, 56.02) = 2.98, p = .042, ηp² = .13. A post-hoc analysis of estimated marginal means revealed linear, b = −100.80, t(80) = −2.45, p = .017, and cubic, b = 91.37, t(80) = 2.22, p = .029, trends in the average time taken to review code across the five trials, averaging across levels of source of repair and type of repair. The linear trend indicates participants took less time to review each successive piece of code, as illustrated in Fig. 4. The cubic trend suggests an initial increase in review time followed by a decrease.
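
The trend analysis can be sketched as polynomial contrasts on the trial means, for example with the emmeans package; object and variable names continue the hypothetical sketches above.

```r
# Sketch of the post-hoc polynomial trend analysis on review time across the five
# trials, continuing the hypothetical data frame d1 from the sketch above.
library(afex)
library(emmeans)

fit_time <- aov_ez(id = "participant_id", dv = "review_time", data = d1,
                   between = c("source", "repair_type"), within = "trial")

emm_trial <- emmeans(fit_time, ~ trial)   # trial means, averaged over the between factors
contrast(emm_trial, method = "poly")      # linear, quadratic, cubic, and quartic trends
```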

Fig. 4. Student programmers’ time to review code marginal means and standard errors.

Probability of Use and Trust Intentions.

A generalized linear mixed effects regression analysis was conducted to analyze the probability that participants would endorse each code repair for use as a function of source of repair and type of repair. The main effects of source of repair, b = −1.09, SE = .62, z = −1.78, p = .076, and type of repair, b = −0.39, SE = .58, z = −0.67, p = .502, and the two-way interaction between repair source and type of repair, b = 0.11, SE = .57, z = 0.19, p = .848, were all non-significant. Next, we examined intentions to trust the referent with the trust intentions scale using a RM ANOVA. The scale reliabilities were .48 and .72 for the first and second assessments, respectively. The main effect of assessment time, F(1, 20) = 8.71, p = .008, ηp² = .30, and the interaction between assessment time and source of repair, F(1, 20) = 12.59, p = .002, ηp² = .39, were both significant. Intentions to trust the referent declined between the initial (M = 3.04) and final (M = 2.49) assessments, as illustrated in Fig. 5. The decay in trust was moderated by the source of the repair: trust in Bill significantly declined between the initial (M = 3.31) and final (M = 2.10) assessments, t(20) = −4.60, p < .001, whereas trust in GenProg did not significantly change between the initial (M = 2.76) and final (M = 2.88) assessments, t(20) = 0.42, p = .997, and the change for Bill (M = −1.21) differed significantly from the change for GenProg (M = 0.11) across the two time points, t(20) = 2.95, p = .039. The main effects of repair source, F(1, 20) = 0.32, p = .576, and type of repair, F(1, 20) = 2.12, p = .161, and the interactions between repair source and type, F(1, 20) = 0.79, p = .385, repair type and assessment time, F(1, 20) = 0.80, p = .381, and repair source, type, and assessment time, F(1, 20) = 0.80, p = .381, were all non-significant.
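
One way to specify such a model is a logistic mixed model with a random intercept per participant, sketched below with lme4 and the hypothetical variable names used earlier; again, this is an illustration rather than the study’s actual analysis code.

```r
# Sketch of the generalized linear mixed model for the binary use-endorsement outcome,
# with a random intercept for each participant (hypothetical data frame d from above).
library(lme4)

fit_use <- glmer(endorse_use ~ source * repair_type + (1 | participant_id),
                 data   = d,
                 family = binomial)
summary(fit_use)   # fixed-effect estimates (b), standard errors, and z tests
```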

Fig. 5. Student programmers’ intentions to trust estimated marginal means and standard errors.

2.6 Study 1 Discussion

In Study 1, student programmers perceived the code repairs as more trustworthy when GenProg, rather than the human programmer Bill, was the reported source of the repair. This result may have occurred because of how the code was repaired. All code stimuli were originally repaired by GenProg, which often repairs code in unintuitive ways. This method of repair often violates conventional approaches and may therefore inhibit perceived trustworthiness. The effect may be diminished for participants in the GenProg condition because they do not have strong expectations for how automated repair should be accomplished.

Over the sequential trials, the student sample spent less time reviewing each of the code repairs. This finding contrasts with our a priori expectations. Although not a specific hypothesis, one may expect programmers would have spent more time reviewing the code repairs, as those repairs became more complicated over the course of the experiment. The cubic trend that emerged indicates students spent more time reviewing code in the beginning of the experiment, but eventually they decreased the time they spent reviewing code over the course of the study. It may be that as the repairs became more complex, students leveraged their previously established heuristics for code review, leading to less time spent scrutinizing the code. However, this supposition is merely speculation until more data can be gathered.

Intentions to trust the repairer, or referent, decayed between the initial and final assessments. However, the source of the repair must be taken into consideration. Intentions to trust Bill declined sharply while intentions to trust GenProg did not change. Similar to perceptions of repair trustworthiness, trust in GenProg was less affected by the unconventional repairs than trust in Bill. Participants may not hold strong beliefs about how GenProg should have conducted the repairs.

3 Study 2

The cognitive task analysis by Alarcon et al. [21] postulated that individual differences should play a role in the perception of code and the decision to trust it. Indeed, research on dual process models in psychology has focused on individual differences in information processing (e.g., [40]) and personality traits characterizing more or less enjoyment of thinking (e.g., need for cognition; [41]). One important individual difference that has been studied in both the psychology and computer science literatures is the experience of the perceiver. A majority of research in psychology, and to a degree computer science, has utilized university students in experiments. However, researchers have cautioned against using students, as they may not be representative of the target population [42]. Indeed, research and theory have supported this idea.

Experts and novices perform different types of cognitive processing when reviewing code, namely top-down and bottom-up processing [43]. Top-down processing is the development of pattern recognition through the use of contextual information [44]. Contextual information aids perception of the referent, as prior knowledge supports predictions about current stimuli. In the context of programming, top-down processing involves reconstructing information about the domain of the program and mapping this knowledge onto the referent code [45]. In contrast, bottom-up processing is driven by data that are first perceived and then interpreted [46]. Bottom-up theories of program comprehension posit that programmers first read code snippets and then cluster these snippets into higher-level abstractions [47].

Research has demonstrated that experts recall properly structured code more accurately than unstructured code [48, 49], indicating a top-down processing approach, as the structured code fits their schemas and is thus easily remembered. Ichinco and Kelleher [50] found experienced programmers recalled larger percentages of code snippets. Experienced programmers used schemas to chunk critical data, whereas novices did not perform the same processes. Additionally, novices have more trouble with static typing than more experienced programmers. In a study of lambda use in C++, experienced programmers completed projects more rapidly than novice programmers, with experience accounting for 45.7% of the variance in completion time [51]. The authors concluded that lambdas are harder to use in the first few years of working with C++, but with experience, programmers begin to comprehend them.

As the first sample utilized students, we attempted to replicate the findings of Study 1 with professional programmers. As described above, more experienced programmers should perform top-down processing and should be more experienced with reading code. In addition, there has been a call in the social sciences, and in all sciences to a degree, to replicate findings to ensure results are not due to chance. By using the same stimuli in the same order, we were able to test the replicability of our findings, as replication with code stimuli may depend on the actual code itself, not just the experimental manipulations to the code [21, 22].

3.1 Study 2 Method

Participants.

A total of 24 professional programmers (2 active-duty military personnel and 22 contractors on Department of Defense contracts) were recruited to participate in the study in exchange for $50 (USD) in financial remuneration. Participants were required to have a minimum of four years of experience with software development and knowledge of the C programming language. The sample was primarily male (83%) with a mean age of 40.75 (SD = 11.20) years and a mean of 16.12 (SD = 10.02) years of total programming experience; 50% stated they use C on a weekly basis. A set of t-tests indicated that the professional sample was significantly older, t(27.45) = −6.72, p < .001, and significantly more experienced, t(28.15) = −4.48, p < .001, than the student sample. The study design was counterbalanced, with 6 participants in each condition.

3.2 Stimuli and Procedure

The same stimuli and procedure were used as in Study 1.

3.3 Study 2 Results

No missing data were observed for Study 2. Means, standard deviations, and zero-order correlations for perceived trustworthiness and review time in Study 2 are available from the first author.

Trustworthiness.

A RM ANOVA was conducted to analyze differences in perceived trustworthiness as a function of source of repair, type of repair, and trial. Mauchly’s test of sphericity was non-significant, χ²(9) = 15.71, p = .07, indicating the sphericity assumption was not violated. The main effects of source of repair, F(1, 20) = 1.23, p = .281, type of repair, F(1, 20) = 1.70, p = .207, and trial, F(4, 80) = 1.44, p = .228, were all non-significant. The two-way interactions between source of repair and type of repair, F(1, 20) = 0.61, p = .443, source of repair and trial, F(4, 80) = 0.47, p = .757, and type of repair and trial, F(4, 80) = 1.33, p = .265, as well as the three-way interaction between source of repair, type of repair, and trial, F(4, 80) = 0.28, p = .893, were all non-significant.

Time.

Next, we conducted a RM ANOVA on review time as a function of source of repair, type of repair, and trial. Mauchly’s test of sphericity, χ²(9) = 44.31, p < .001, indicated the sphericity assumption was violated, so a Greenhouse-Geisser correction was applied to the degrees of freedom when interpreting the model effects. Neither source of repair, F(1, 20) = 0.23, p = .637, nor type of repair, F(1, 20) = 0.41, p = .532, nor the interaction between the two, F(1, 20) = 0.21, p = .649, was a significant predictor of review time. The within-subject effects revealed no significant effects for trial, F(2.06, 41.11) = 0.82, p = .449, trial by repair source, F(2.06, 41.11) = 1.11, p = .339, trial by repair type, F(2.06, 41.11) = 1.07, p = .356, or trial by repair source by repair type, F(2.06, 41.11) = 0.32, p = .736.

Probability of Use and Trust Intentions.

A generalized linear mixed effects model was used to analyze participants’ endorsement of code for use as a function of repair source and repair type. The main effects of source of repair, b = 0.79, z = 1.17, p = .243, and repair type, b = 0.34, z = 0.50, p = .617, were both non-significant. The interaction between source of repair and type of repair, b = −1.94, z = −1.92, p = .056, was also non-significant.

Next, we examined trust intentions toward the referent. The scale reliabilities for trust intentions at time points one and two were .62 and .66, respectively. The between-subjects effects of source, F(1, 20) = 0.00, p = .954, type, F(1, 20) = 0.41, p = .530, and their interaction, F(1, 20) = 3.24, p = .087, were all non-significant. The within-subject effects of assessment time, F(1, 20) = 29.92, p < .001, ηp² = .60, and assessment time × repair source, F(1, 20) = 14.71, p = .001, ηp² = .42, were both significant. Overall trust intentions were higher at the initial assessment (M = 3.05) than at the final assessment (M = 2.15). This effect was moderated by repair source, such that trust in Bill significantly decreased, t(20) = −6.58, p < .001, between the initial (M = 3.38) and final (M = 1.83) assessments, while trust in GenProg did not significantly differ, t(20) = −1.16, p = .597, between the initial (M = 2.73) and final (M = 2.46) assessments, as illustrated in Fig. 6. The difference in change across time points between Bill and GenProg was also significant, t(20) = −2.56, p = .043.

Fig. 6. Experienced programmers’ intentions to trust.

3.4 Study 2 Discussion

In Study 2, we assessed whether the source and type of code repair influenced the trust process in a sample of professional software developers. There were no significant effects for perceived trustworthiness, time spent on code review, or use endorsement. With regard to trust intentions, participants trusted Bill less after reviewing the code, while their intentions to trust GenProg did not change significantly. This coincides with the findings on trustworthiness and trust intentions in Study 1.

4 General Discussion

As society’s dependency on software increases, so too will the demand for functional code that is both safe and reliable. Given the shortage of available manpower, software demands will necessitate that humans leverage automated repair tools to help with vetting and repairing software. The current paper investigated the biases programmers have towards code that is repaired by either another human or an automated repair aid. All participants received code that was repaired by GenProg, but half of the participants were told that a human performed the repairs. By investigating these biases, the present research aimed to clarify differences in human trust towards human repairs versus automated repair tools when the repairs vary in transparency.

In Study 1, student programmers perceived the repaired code as more trustworthy when they believed the code was repaired by the automated repair software (GenProg) compared to a human programmer (Bill). Participants also trusted Bill far less after reviewing the code, while trust in GenProg was not significantly affected. Even though the repairs were identical for both Bill and GenProg conditions, student software developers may have different expectations for how humans versus automation should function [30, 36], as GenProg does not repair code in a way that a human would repair software [52].

Another reason that students may have perceived GenProg as more trustworthy in the student sample compared to the experienced programmer sample is because the students were younger (mean age difference of approximately 16 years). Although we did not measure propensity to trust in automation, previous researchers have found a negative relationship between propensity to trust automation and age [25], such that as age increases, propensity to trust automation decreases. The younger sample may have had a higher propensity to trust automation and therefore would be more likely to trust the changes GenProg made.

Similar to Study 1, Study 2 showed that intentions to trust Bill decreased, regardless of the type of repair, as the experiment progressed. This was not the case for GenProg, for which trust intentions remained stable across time points. Previous research has shown that trust in automated agents is largely determined by the performance of the automated system, while trust in humans incorporates other non-performance-based factors such as perceptions of integrity and benevolence [16]. The information presented by the test cases and the code diff allowed reviewers to verify that the patches were effective. This verification of sufficient performance may be the predominant factor in trusting GenProg. Conversely, assessments of trust in a human repairer would require other information and greater scrutiny of non-performance factors related to the patches, such as the readability or maintainability of the patch. This would explain why trust in Bill declined sharply while trust in GenProg did not change.

In contrast to our hypotheses, the student sample spent less time reviewing each successive code repair. As the repairs became more complex, students may have leveraged previously established heuristics for code review, which require less time spent scrutinizing code. As students became more familiar with the task over time, they spent less time looking over how the changes were made, possibly utilizing select-out heuristics more with each successive trial. In contrast, no significant differences in review time were observed for the professional sample. Experienced developers may be more studious in understanding the code they are reviewing, opting for a more systematic approach to processing information about the stimuli, regardless of their familiarity with the task or the method of repair.

4.1 Limitations

Studies 1 and 2 were limited in a few ways. The effect sizes for the manipulated factors were smaller than anticipated. This may, in part, be due to the novelty of the stimuli. Because the patches produced by GenProg are often unintuitive and unconventional, trust in the patch itself may be less influenced by manipulations of the source and type of repair. In the extreme case, participants examining changes attributed to a human may not have believed that a human made the changes, adversely affecting our results. Finally, the trust intentions scale displayed less than adequate reliability. This may be because of the brevity of the scale, or the items may require a more concrete context with which to assess the referent.

4.2 Implications

The present research has implications for human/automation teaming, specific to software development contexts. As shown in past research [30, 36], humans have different expectancies for human versus automated partners (see also [7]). Though our studies were underpowered, we have provided an impetus for future work to further investigate human biases towards human/automation teammates in software development contexts with larger sample sizes. The present work identified the biases that may shape intentions to reuse software, regardless of the software content and performance specifics. Identifying humans’ biases, which may fluctuate depending on supposed source of repair, provides another useful variable that shapes human trust in software development.

As more technology is introduced into our lives, our reliance on computer code will need to increase. Computer programmers who do not trust computer code from others, or computer code that is computer-generated, will either have to re-write the code from scratch or spend hours reading the code line by line to ensure there are no issues. Computer-generated code offers an efficient solution to programmers without sacrificing functionality. To the authors’ knowledge, this is the first study to examine how transparency influences programmers’ perceptions of the trustworthiness of code repair tools by manipulating factors such as the source and type of repair.

4.3 Conclusion

In conclusion, novice programmers were more accepting of unconventional software patches when they believed the patches were generated by an automated tool rather than a human. The time that novice programmers spent reviewing code decreased over trials, regardless of condition, suggesting they may have relied more on heuristics than the more experienced programmers did. Finally, the present study provides evidence that trust in an automated repair tool may be predominantly influenced by its performance, while trust in a human who makes similar repairs is influenced by other, non-performance factors.