1 Introduction
Collaborative writing has become an integral part of professional and academic work, as business, education, engineering, law, and other organizational sectors increasingly promote group work that involves writing reports, papers, and articles together with others [89, 93]. Decades of research in HCI and CSCW have focused on understanding collaborative writing practices [13, 15, 46, 61, 63, 89, 94] and on developing theoretical frameworks [39, 49, 66, 77] and experimental systems (e.g., [8, 33, 62]) to meet the needs of collaboration and coordination within teams. In parallel, a multitude of tools (e.g., Google Docs, Microsoft Office 365, Overleaf) have brought ideas from early research to fruition in the form of collaborative features such as comments, track changes, revision history, and real-time edit notifications. Researchers have also developed ways to visualize how co-authors use and interact with these new collaborative features and how their individual actions and contributions shape the production of a shared document over time [43, 45, 64, 82, 88, 89, 97].
Despite significant academic and commercial interest in collaborative writing systems, less is known about how these systems support teams of people with diverse physical, cognitive, or sensory disabilities. Our focus is specifically on ability-diverse teams that involve people with vision impairments working with sighted colleagues and the ways in which the design of collaborative tools and features can support collaborative writing activities that are distributed across time and space. Our work identifies key design challenges associated with using screen readers to perform collaborative writing and systematically evaluates new auditory representations of collaborative features to address these challenges. We focus on accessibility issues in asynchronous collaborative writing, where co-authors work on a shared document one at a time. While recent developments in collaborative writing tools offer many opportunities for synchronous collaboration (i.e., multiple authors working on a document simultaneously in real-time), many people still use asynchronous editing features, such as comments and suggested edits, to write together and exchange feedback [13, 89]. As such, improving screen reader access for asynchronous editing is an important first step toward ensuring accessible collaboration in ability-diverse teams.
We ground the design and study of novel auditory representations of collaborative writing features in interviews with 20 academics and professionals with vision impairments who regularly perform collaborative writing using screen readers. Our prior work reports findings from these interviews that highlight the ways in which visually impaired writers navigate through an ecosystem of tools consisting of multiple word processors and screen readers, negotiate accessibility needs with sighted collaborators, and face broader social, professional, and organizational challenges in ability-diverse collaboration [29]. In the current article, we report new aspects of the interview data that detail the complexities visually impaired writers encounter when using collaborative features (e.g., comments and edits) during asynchronous collaborative writing. Our current analysis of interview data reveals that screen reader users face four key challenges as part of developing and maintaining collaboration awareness [30] (i.e., understanding who did what and where) in a shared document: (1) distinguishing between document content, collaboration markup, and comments/edits from others, (2) understanding how document content evolves through underlying edits, (3) managing disruption in workflow created by verbose spoken announcements for collaboration markup, and (4) controlling the influx of collaboration information.
To address these challenges identified through our interview study, we designed and developed a variety of auditory representations that incorporate non-speech audio (e.g., earcons [35] and tone overlays), multiple text-to-speech voices, and contextual presentation techniques. The auditory representations were designed to help writers identify three key pieces of information that facilitate collaboration awareness in asynchronous editing: where the comments are, who commented what, and who edited what. We evaluated these techniques through a within-subjects experiment with 48 visually impaired writers who frequently perform collaborative writing activities using screen readers. Our results indicated that non-speech audio, changing voices, and contextual presentation techniques are promising approaches for improving collaboration awareness among screen reader users. We found that tone overlay is the least disruptive approach to understanding where comments are located while simultaneously comprehending the text content, specifically in complex passages with densely populated and overlapping comments. Similarly, reading collaborators’ edits or comments in different voices makes it easier to keep track of who edited or commented on a specific text segment and what they said in their comments, although this benefit diminishes as more collaborators contribute to a document. Additionally, presenting edits in the context of a sentence helps people figure out how the sentence evolved after multiple edits.
This article makes three key contributions to the fields of HCI, CSCW, and accessible computing. First, we contribute a deeper empirical understanding of the complexities of how screen reader users maintain collaboration awareness during asynchronous writing, which extends prior work on how blind and sighted people collaborate in professional [21, 87], educational [53, 55, 56, 76, 79], creative [16, 28, 71], and everyday living [20, 91, 95] contexts. Second, our systematic evaluation contributes new insights regarding how screen readers and word processors can better support collaborative writing through contextual markers and non-speech audio cues – techniques that have previously been used to improve non-visual access to graphical interfaces [52, 60, 74], diagrams [51, 54, 80], and navigation [34, 92] for people with vision impairments. Third, we synthesize our findings from across the two studies to highlight design tradeoffs and considerations for enhancing accessibility in future collaborative writing systems used by blind and sighted teams.
4 Formative Study: Findings
Collaborative writing is a complex process that requires co-authors to remain aware of each other’s actions, such as who is editing or commenting what, where, and when, and how the shared document is evolving through these actions [13, 30, 39, 77]. Existing collaborative writing applications offer a number of visual cues to help sighted writers remain cognizant of their co-authors’ actions in the shared document. For example, in applications like Microsoft Word or Google Docs, comments are juxtaposed in the sidebar beside the document body, text portions where comments are anchored are highlighted, insertions and deletions are represented through underlining and strikethrough, and comments and edits by different co-authors are color-coded. Additionally, various interactive features help sighted people navigate through and respond to collaborators’ actions. For instance, when a comment is selected in Word, the saturation of the color highlighting the corresponding text portion increases and the commented text becomes visually prominent; Word also reinforces the connection by converting the dashed line connecting the comment with the anchor text into a solid one (see Figure 1 for an example).
In comparison to these visual representations, screen readers’ auditory representations of collaborative features fall short in providing required information in easily understandable and effective ways. For instance, screen readers indicate the presence of a comment or revision by announcing markup phrases such as “has comment”, “revision, inserted”, “revision, deleted” that are spoken inline alongside the document text. Some screen readers also provide the name of the collaborator and the content of the comment or edit. While screen readers attempt to convey important collaboration information through spoken announcements, the way this information is actually presented makes it challenging for people to perform certain tasks that are essential for developing collaboration awareness. Below we detail these tasks along with the challenges associated with them and potential design changes to alleviate these challenges.
4.1 Distinguishing between Document Content, Collaboration Markup, and Collaborators’ Actions
To coordinate group efforts in a shared document, co-authors must learn who edited or commented what and where in the document. As we discussed earlier, screen readers present this collaboration information through serialized spoken announcements where markup phrases (e.g., “has comment”, “revision inserted”) and document content are interlaced. This makes it “cognitively overloading” for our participants to differentiate between document content and collaboration markup phrases as well as keep track of different collaborators’ actions (e.g., comments/edits). Emma said, “Track changes tends to muddy the waters very badly. For instance, if I have a document that someone else has changed, I might hear ‘the cat deleted rat ate 15 mice changed to’... Some of it is actual text, some of it is deleted text and [I’m] not having a great difference between the different ones.”
Furthermore, screen readers offer no straightforward way to differentiate between hierarchical comments (i.e., replies) and overlapping versus standalone comments. Mike, Isaac, Emma, and Maya explained that they “try to make sense [of replies and overlapping comments] based on the context of the discussion.” This, however, requires them to go through a “daunting process” that involves jumping back and forth between the list of comments and the document text, and performing a number of checking steps. For instance, in the list view of JAWS, each comment appears with a snippet of the text it is attached to. In cases where a comment does not have the snippet of the attached text alongside it, participants “assume that it’s the reply for the previous comment.” Alternatively, if multiple adjacent comments in the sequential list are attached to the same portion of text, those are considered to be overlapping with each other. To avoid this complex procedure, Mike asks his collaborators to reply in a separate comment and to preface their comment with “in reply to your previous comment” instead of leaving replies or overlapping comments.
Participants suggested that one possible way to easily process and distinguish between these intertwined pieces of collaboration information could be using multiple synthesized speech voices or manipulating speech parameters (e.g., pitch). Elena explained, “There could be more done with sound or pitch or inflection or even using multiple text-to-speech voices to present it... It would be cool if you could set a voice for each of the editors.” Addison even made an analogy between listening to comments/edits from co-authors in different voices and “see[ing] it in a different color, so you can still keep reading and it doesn’t break up your flow.”
While most participants suggested using different voices to elucidate different co-authors’ edits/comments, Emma proposed an alternative use-case, where specific voices could denote “different kinds of text (inserted or deleted text) in a document with track changes.” She emphasized that the characteristics of the voices should semantically align with the type of text they represent. For example, deleted text could be read out in a “deeper” (i.e., low-pitched) voice to delineate that “that’s not really relevant anymore,” while inserted text could have a “higher pitched voice so that I could tell, ‘Oh, hey, that’s new, that wasn’t part of the original documents.”’ Thus, audio characteristics (e.g., pitch and timbre) of text-to-speech voices could potentially convey the distinction between original and modified text content as well as which co-author acted on the content and how. However, participants emphasized the importance of attending to specific design choices that could facilitate (or even further complicate) the way screen reader users develop collaboration awareness. In particular, auditory representations that use multiple text-to-speech voices for collaborators’ edits and comments must be designed carefully such that they do not create further cognitive overload for screen reader users instead of reducing it.
4.2 Understanding the Evolution of Document Content
In addition to understanding who edited what, an important aspect of collaboration awareness involves perceiving how original content changes through the underlying edits. For screen reader users, however, understanding the context of the edits becomes immensely difficult, because it requires them to keep track of the original text, edited text, and collaboration markup, all of which are intertwined in the screen reader read-aloud. Bill explained, “What might take you 10 seconds to identify, may very well take me three minutes to disambiguate, because I’m going to read a complex paragraph with changes in complex sentences from three different authors, maybe even close to one another... I’ve forgotten the first half of the sentence by the time I get to the middle of the sentence.”
Instead of relying on screen reader announcements, sometimes Mike, Emma, and Henry keep multiple copies of a document—the original version and an edited version (without markup). They switch back and forth between these two copies and manually compare them sentence by sentence to detect “how someone’s changes would affect the document before and after.” However, this manual comparison process becomes quite challenging over time. As Mike described, “...after a while, you can imagine what I decided to do. I quit. Because it didn’t work.” Alternatively, sometimes participants ask their collaborators to summarize their edits using comments so that they can at least gather a high-level idea about “what was in the original paragraph and what he has changed.” However, this workaround does not provide details about collaborators’ actions that are visually available through features such as track changes or version history. “You can imagine, it’s not comprehensive enough compared to [what] you (sighted people) can see... very detailed changes that track changes can give you,” commented Mike.
In this vein, participants emphasized the importance of listening to edits in the context of the original sentence so that they can easily figure out how collaborators’ edits alter the form and meaning of the sentence. As Bill suggested, “Read[ing] the original version versus the modified version would be super helpful and super powerful... to freeze the changes as if they have been accepted, and then to iterate across the possibilities... so that you can get a sense of what the different versions are.” These excerpts highlight the importance of a contextual presentation of edits that could make it easier for screen reader users to understand the evolution of text content through the edits.
4.3 Managing Disruption in Workflow
Our participants shared that the way screen readers provide notifications for collaboration markup through a series of spoken announcements often creates a “verbal clutter.” Due to such continuous and copious collaboration notifications, participants find it extremely difficult to focus on their own work. Sofia shared her frustration: “I was just hearing so much information that I just feel like I had a big jumbled mess in my document... It (track changes) didn’t tell me enough of the information I needed, and it told me too much of the information I didn’t need... I just didn’t find it very effective for my workflow and my thought processing. It just made everything messier, not more efficient.”
Thus, in the process of making users aware of their collaborators’ actions in the document, screen readers end up conveying “too much information at once” and impede individual workflow. To reduce such interruption in their workflow, participants often turn off spoken announcements and discuss their edits through alternative communication mediums instead (e.g., e-mail, phone, or chat applications). Others often come up with workarounds to entirely avoid using default collaborative features. For instance, many participants prefer to “read through this document and put any of my comments right in brackets or parentheses” inline within the document text. Some also use special notations (e.g., @@@) that are “unique enough that it’s not going to be elsewhere in the document by chance” so that they can easily search through the text for locating inline comments when needed.
Despite being a common strategy among our participants, leaving inline comments does not always work as the perfect solution either. Bill explained, “It [inline comments] could be very helpful as you’re just reading through maybe the first time you’re getting a draft back from a colleague, but not as helpful if you’re working on stuff and know about the comments already, and now they’re all of a sudden getting in your way.” Here, we see that although inline comments are useful in certain scenarios, they can also be obtrusive to one’s flow of reading and understanding of the text content in a similar way to spoken announcements.
Participants suggested one possible approach to reduce such verbal clutter and the resulting disruption in workflow could be using non-speech audio cues (e.g., earcons) instead of spoken announcements to indicate the presence of comments or edits. Emma explained, “I would like something less obtrusive... whether that be an audio cue or a notation on my braille display... Because the words (spoken announcements) are going to stop me from actually being able to listen to what I’m working on, where hopefully the not-words will not.” Prior work focused on navigational tasks and perceiving auditory graphs has also found that non-speech audio cues are less disruptive and impose less cognitive load for processing information compared to speech [35, 70]. Importantly, screen reader users’ preferences for auditory representations would likely be subjective and dependent on specific use-cases, where they “might prefer speech in some instances, a tone [in others].” Thus, a key design consideration for collaborative features is to determine which auditory representation works better for the task an individual is trying to accomplish at a particular instance.
4.4 Controlling the Influx of Collaboration Information
In addition to how collaboration information should be represented, what information needs to be accessed when also depends on the context of use, that is, whether the person is reading or editing the document at a particular instance and for what purposes. As such, understanding people’s intent of use and customizing auditory representations accordingly are critical to “siphon through” the large amount of information that is generated in a collaborative writing scenario. Maya described, “Collaboration is a good example where being able to customize the way information is presented will be really important, because different things will be important at different times. Maybe if I was an instructor, it would be really important for me to know that everyone’s collaborating... But then maybe when I’m writing a paper, I really need to know the track changes that were added.”
In existing writing applications, screen reader users can control which collaboration information they want to know by toggling notifications for different collaborative features (e.g., comments, edits). While having separate controls for each feature can be useful in many scenarios, for screen reader users, “that’s just completely useless, because you have to remember to toggle all of those back, and they’re not in the same place, and it’s just an arduous task.” The way controls and settings options are designed in existing applications relies heavily on visual exploration and requires memorizing numerous keyboard shortcuts when accessed using screen readers. Bill instead suggested a mode or “scene-based” approach (e.g., editing or reading mode) that could allow people to easily consolidate and declare their desired collaboration information at a given instance. He said, “What I would recommend there is a simple toggle of preference or verbosity, but not based around any type of static setting, but instead based around the fact that—‘Okay, I’m interested in a lot of editing related stuff now. Tell me about the following three, four, five things.”’
Beyond mode-based controls, screen reader users also need to be able to control what information they hear at a particular instance by opportunistically navigating comments or edits as opposed to receiving continuous spoken notifications. Henry explained, “Maybe you don’t have to have all that information right away. If you just had an earcon or a beep sound, maybe that’s- ‘Hey, there’s a comment here,’ then you could press a hotkey to learn more about it.” As we see here, participants wanted to access collaboration information through a hierarchical approach that would combine “less cluttered” notifications (e.g., non-speech audio cues) for the presence of a comment/edit with the opportunity to go deeper to explore the content of comments/edits with keyboard shortcuts.
Overall, our formative work illustrates that screen reader users face multiple challenges in maintaining collaboration awareness, which stem from the ways in which screen readers present who did what and where in a document and from the need to understand how collaborators altered the document content without disrupting one’s workflow. While there are several ways to address such challenges (e.g., text summarization [5]), our findings suggest one viable but relatively unexplored approach is to redesign the auditory representations that screen readers and word processing tools use to present collaborative information.
5 Designing and Evaluating Accessible Auditory Representations
Building on insights from our formative interview study, we designed, developed, and evaluated auditory representations that aim at supporting collaboration awareness for screen reader users during asynchronous collaborative writing. The study investigates how different auditory representations can address issues of cognitive overload, verbal clutter, and lack of context associated with three key questions that are essential to developing collaboration awareness: (1) where the comments are, (2) who commented what, and (3) who edited what.
We conduct the study in three distinct modules, each centering around one of the three questions stated above. The auditory representations we developed focus on asynchronous collaboration information (e.g., comments and edits). In each module, we compare default techniques available in existing screen readers (i.e., direct spoken announcements) to one or two experimental auditory representations (i.e., non-speech audio and contextual presentation). Before conducting the study, we refined the auditory representations and our study design through in-person pilot testing sessions with three visually impaired writers. Pilot participants were expert users of screen readers (e.g., JAWS, NVDA) and familiar with collaborative features on Microsoft Word such as comments and track changes. The pilot sessions lasted for 60–80 minutes and participants received US$50 compensation.
Our formative study indicated that screen reader users’ preferences regarding different auditory representations may depend on the context of use and complexity of the collaborative document. For example, a spoken announcement may seem overwhelming when multiple overlapping comments are attached within a sentence, whereas an earcon may work better in such a scenario. In contrast, if a comment is attached to a long span of text, a background tone alongside the text may give clearer indication of the presence of the comment, compared to earcons played only at the beginning and the end of the commented text. Thus, understanding the ways comment and edit complexity in a document influences the utility of different auditory representations is essential for making a collaborative system robust to the nuances that are likely to appear in natural writing situations. To address this, we designed the study to examine how participants’ reactions to the default and experimental techniques are contingent upon collaboration complexity of the shared documents.
5.1 Generating Auditory Representations for Collaborative Writing
As the basis for our exploration of the various techniques, we developed a system that generates custom auditory representations using a Microsoft Word document as input. The system’s input parameters can be configured to generate various audio representations corresponding to the document body text, edits, and comments. For example, these include adding earcons or additional contextual markers before or after comments or edits, playing edits and comments from different collaborators in different synthesized voices, and adding a background tone to indicate the presence of comments as the document content is read aloud. The system extracts text content and collaboration metadata from Word documents using the Word Object Model and the pywin32 Python package. It then converts these metadata to JSON objects and applies text-to-speech conversion and other non-speech audio effects to create different auditory representations. We used the Amazon Polly service for text-to-speech conversion and the LibROSA Python package for audio processing. We used a voice identified as male (Matthew) on Amazon Polly as the default voice that reads the main document content and collaboration markup phrases. Among all the English (US accent) voices available on Amazon Polly, Matthew was chosen as the default because it most closely matched the default voices of the JAWS and NVDA screen readers. We verified this with one of our pilot participants who is a proficient screen reader user. Full details of the specific auditory representations are provided in the three study module sections below (Sections 6.1, 7.1, and 8.1).
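To make this pipeline concrete, the following minimal sketch shows one way the extraction and synthesis steps could look; it is not our full system, the file paths and output names are hypothetical, and it assumes a local Word installation (for the COM-based Word Object Model via pywin32) plus AWS credentials for Amazon Polly.

# Minimal sketch (not our full system): pull comments and tracked changes from a
# Word document via the Word Object Model (pywin32), serialize them to JSON, and
# synthesize one text segment with Amazon Polly. File names are hypothetical.
import json
import win32com.client  # pywin32
import boto3

word = win32com.client.Dispatch("Word.Application")
doc = word.Documents.Open(r"C:\docs\shared_draft.docx")  # hypothetical path

metadata = {"comments": [], "revisions": []}
for c in doc.Comments:
    metadata["comments"].append({
        "author": c.Author,
        "text": c.Range.Text,           # comment body
        "anchor_start": c.Scope.Start,  # character offsets of the commented text
        "anchor_end": c.Scope.End,
    })
for r in doc.Revisions:
    metadata["revisions"].append({
        "author": r.Author,
        "type": int(r.Type),            # e.g., 1 = insertion, 2 = deletion
        "text": r.Range.Text,
        "start": r.Range.Start,
        "end": r.Range.End,
    })
doc.Close(False)
word.Quit()

with open("collaboration_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)

# Text-to-speech for one markup phrase using the default voice ("Matthew").
polly = boto3.client("polly")
resp = polly.synthesize_speech(Text="start comment", VoiceId="Matthew",
                               OutputFormat="mp3")
with open("start_comment.mp3", "wb") as f:
    f.write(resp["AudioStream"].read())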
5.2 Experimental Design and Stimuli
We defined document complexity according to the key collaboration question that guides the design of each module, i.e., where the comments are in Module 1, who commented what in Module 2, and who edited what in Module 3. We employed a within-subjects design where each participant experiences all available techniques (default and experimental) in each module. For each technique, participants listen to two stimuli passages—one with high complexity and another with low complexity. Thus, each participant listens to 16 passages in total: 6 in the first module (two for each of the three techniques—one default and two experimental; see Section 6.2), 4 in the second module (two for each of the two techniques—one default and one experimental; see Section 7.2), and 6 in the third module (two for each of the three techniques—one default and two experimental; see Section 8.2). We prepared 16 different passages to ensure that participants do not experience a passage more than once. Each of these 16 passages had two variations accommodating the two levels of document complexity. To save time during the study, we generated and pre-recorded audio for all the stimuli passages beforehand (see Section 5.1). Within each module, we fully counterbalanced the stimuli to control the presentation order of techniques and document complexity across participants.
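For illustration, one simple counterbalancing scheme for a three-technique module could be enumerated as in the sketch below; the labels and assignment logic are hypothetical and simplified relative to our actual procedure.

# Illustrative counterbalancing for one three-technique module: every ordering of
# the techniques crossed with both complexity orders, cycled over 48 participants.
from itertools import cycle, permutations, product

techniques = ["announcement", "earcons", "tone_overlay"]   # Module 1 labels
complexity_orders = [("low", "high"), ("high", "low")]

# 3! technique orders x 2 complexity orders = 12 counterbalanced conditions.
conditions = list(product(permutations(techniques), complexity_orders))

assignments = {}
for pid, (tech_order, comp_order) in zip(range(1, 49), cycle(conditions)):
    assignments[f"P{pid:02d}"] = {"techniques": tech_order,
                                  "complexity": comp_order}

print(assignments["P01"])  # the order experienced by the first participant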
We standardized the stimuli used in the study. Each stimulus includes a single passage (5–6 sentences with 45–55 words in the first module and 4–5 sentences with 35–45 words in the last two modules). All stimuli have readability scores between 6 and 7 according to the Flesch–Kincaid Grade Level. While preparing the passages, we selected topics that were less likely to be public knowledge but did not require domain expertise to be understood. We collected passages from online resources (e.g., Wikipedia, blogs) about birds, animals, cities, and landmarks, and cross-checked them against multiple sources to ensure that the statements were factually correct. We chose commenter and editor names that were one or two syllables long (e.g., Lisa or Beth) and commonly used in the English language. During our pilot sessions, we noticed that participants sometimes remembered comments/edits in terms of the perceived gender identity of the voice (e.g., “the boy made the most comments”). While identifying a collaborator by the gender of their assigned voice could be useful in a natural writing scenario, it potentially introduced a confound in our study. To avoid this, we included only female-identifying names and voices for commenters and editors.
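As an illustration of this screening step, a passage’s grade level can be checked with an off-the-shelf readability library; the snippet below uses the third-party textstat package (not necessarily the tool we used) and an invented example passage.

# Small example: screen a candidate stimulus passage for a Flesch-Kincaid grade
# level between 6 and 7 (textstat is one readily available readability library).
import textstat

passage = ("The lyrebird is a ground-dwelling Australian bird. "
           "It is famous for copying the calls of other birds and "
           "even mechanical sounds it hears in the forest.")  # illustrative text

grade = textstat.flesch_kincaid_grade(passage)
print(f"Flesch-Kincaid grade level: {grade:.1f}")
if not 6 <= grade <= 7:
    print("Revise the passage before using it as a stimulus.")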
5.3 Participants
We conducted the evaluation study with 48 visually impaired writers who were randomly assigned to counterbalanced orders. Eleven of our participants had also participated in the formative interviews. We recruited participants through the National Federation of the Blind, our research network, and snowball sampling. Each participant was compensated with a US$60 gift card.
Participants’ visual abilities ranged from total blindness (60.4%) to legal blindness or low vision, with or without light perception, due to conditions such as Retinitis Pigmentosa, Retinopathy of Prematurity, and Glaucoma, with onset at birth (60.4%) or later in life, as well as vision loss acquired through accidents. A total of 47.9% of participants identified as female, 50% as male, and 2.1% as female/non-binary. Participants ranged in age from 19 to 60, with the largest group in the 25–34 range (37.5%). A total of 62.5% of participants identified as White, 16.7% as Hispanic, 10.4% as Asian, and 4.2% as Middle Eastern. Participant occupations included professor, assistive technology specialist, business analyst, finance advisor, attorney, and rehabilitation counselor, among others. Most participants (83.3%) lived in the U.S., and the rest came from seven different countries. A total of 42 participants self-reported as expert users of one or more screen readers such as JAWS (45.8%), NVDA (37.5%), and VoiceOver (66.7%), while the remaining 6 participants self-reported as advanced users of at least one screen reader. Participants mostly used Microsoft Word (97.9%), Google Docs (91.7%), and text-based editors such as Notes or Notepad (93.8%) for writing. Many participants frequently used comments (60.4%), track changes (47.9%), and real-time editing (29.2%), while others used these features occasionally.
5.4 Procedure
We conducted the study remotely using the conferencing tool Zoom. We tested the audio quality on different networks and selected the best setup. We asked participants to work from a quiet space, with a reliable internet connection, and using the speaker configuration (headphones or speakers) that they prefer for working with screen readers. During the session, we played the audio stimuli on our local computer and shared computer audio with the participants via Zoom. Each evaluation session lasted for 80–100 minutes, was audio-recorded and later transcribed for analysis. All of the sessions were conducted by the first author between February 2020 and April 2020.
The study session started by explaining the purpose of the study and collecting verbal consent from the participants. For participants residing in EEA countries, we collected consent prior to the session using a GDPR-compliant online form. Next, we asked a number of questions related to participants’ demographic information, usage of screen readers and writing tools, and their collaborative writing practices. After the demographic questionnaire, we played an example passage to adjust the volume level and check whether participants could hear the non-speech audio cues. We also requested that participants keep their screen readers muted (unless otherwise required), not change their volume levels, and not take any written notes during the session. Additionally, we explained to the participants that some questions may draw on their memory of the content presented and that they could respond with “I don’t recall the answer” if needed.
We started each module by explaining the key collaboration question addressed in the module (e.g., “where the comments are” in the first module) and the different techniques available to represent this information. At the beginning of each technique, we briefly explained how it works using an example passage and the kinds of questions participants will be asked after each passage. Participants had the option to listen to the example passage multiple times to understand the technique clearly. We used the same topic for the example passage throughout the study. Unlike the example passage, the stimuli passages were played only once during the main experiment.
After each passage, we asked a set of questions to assess participants’ perception of collaboration information presented in the passage. We also asked one multiple-choice question specifically about the passage to gauge participants’ comprehension of passage content. When participants finished listening to both passages for a technique, we asked them to rate their agreement with statements that captured their perception of use, i.e., perceived ease of understanding collaboration information, perceived ease of learning, perceived cognitive load, and perceived disruption in workflow, on a 5-point Likert-style rating scale (ranging from 1 “strongly disagree” to 5 “strongly agree”). See Table 1 for the detailed study measures in each module. We encouraged participants to rate these statements based on their overall experience with the techniques instead of whether or not they were able to answer the passage comprehension and collaboration content questions. We did this to reduce the extent to which participants’ performance influenced their ratings, since the questions and the statements aimed at capturing different facets of the techniques. Finally, at the end of the module, we asked participants open-ended questions regarding their preferences for different techniques (e.g., which technique(s) they liked the most and the least), the rationales behind their choices, and feedback for further improvement.
5.5 Analysis Method
We followed a mixed-method approach that involved quantitative analyses on performance measures and self-reported data as well as qualitative coding on open-ended feedback. Performance measures include responses to the questions we ask after each passage to capture participants’ perception of collaboration information and comprehension of passage content. Self-reported data focus on participants’ perception of use and are recorded as ratings regarding each technique as well as overall preferences for the techniques within each module.
For analyzing performance measures, two researchers independently reviewed and labeled participants’ responses to each question with a binary category (“correct” and “incorrect”). We assessed inter-rater reliability using Cohen’s Kappa and achieved \(\kappa = 0.83\) to \(\kappa = 0.95\), which indicates high agreement among the coders [44]. We then resolved any disagreements through discussion. The predictor variables for our models depend on the specific module of the study under investigation. These included the technique experienced (e.g., default announcement, earcons, and tone overlay in the first module), the complexity of the passage (low or high) and, when applicable, a technique × complexity interaction term. We also controlled for the order in which a participant experienced a technique (e.g., 1, 2, or 3) and a participant’s usage of the relevant collaboration feature (i.e., whether or not they had prior experience of using the feature frequently). We considered “commenting” as the relevant feature in the first two modules, which address where the comments are and who commented what. In the last module, which addresses who edited what, we consider “track changes” as the relevant collaboration feature. For analyzing the performance measures, we applied mixed effects logistic regression models to account for non-independence in the data (e.g., repeated measures collected from the same participants under different conditions). For ease of exposition, throughout the article we only report results from the final models, which were selected on the basis of Akaike Information Criterion (AIC) scores [25].
Similar to the models of performance measures, for self-reported ratings of perception of use, we included technique as a predictor and controlled for the order of experiencing the technique and a participant’s usage of the relevant collaboration feature. We applied linear mixed effects regression models to analyze the self-reported ratings. Finally, we categorized participants’ overall preferences on a scale of 1–3 (for the first and third modules) or 1–2 (for the second module), with a higher rank associated with the most preferred technique. For analyzing these preference rankings, we included technique as a predictor, controlled for the usage of the relevant collaboration feature, and again applied linear mixed effects regression models. We report unstandardized \(\beta\) coefficients for linear regression throughout the article, which permits interpretation of the predictor effects in original units.
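For the self-reported ratings, a model of the kind described above could be specified as in the sketch below; the column names and data file are hypothetical, and this illustrates the modeling approach rather than our exact analysis script.

# Sketch of the linear mixed effects model for self-reported ratings: technique,
# presentation order, and prior feature usage as fixed effects, with a random
# intercept per participant. Column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("module1_ratings.csv")  # hypothetical long-format data file
# Expected columns: participant, technique, order, feature_usage, rating

model = smf.mixedlm(
    "rating ~ C(technique, Treatment('announcement')) + order + feature_usage",
    data=df, groups=df["participant"])
result = model.fit()
print(result.summary())  # unstandardized beta coefficients, as reported here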
Additionally, we analyzed participants’ open-ended feedback using open coding and iterative comparison between the codes to identify salient themes [22]. These concepts detailed participants’ rationales behind preferring different techniques and how these techniques could improve and/or disrupt their perception of collaboration information and individual workflow in different contexts.
6 Module 1: Where Are the Comments?
As the first component of our evaluation, we examine how to best support the challenging task of comprehending a passage while simultaneously identifying where comments are located and whether they are overlapping.
6.1 Auditory Representations
In the first module of the study, we incorporated three auditory representations to denote where comments are attached in a document: announcement (default), earcons, and tone overlay. Following insights from our interview study, we designed the earcons and tone overlay representations to assess whether non-speech audio can reduce the “verbal clutter” created by the spoken announcement while indicating the location of comments in a document.
Announcement (default). Spoken announcement is the default technique that many screen readers use to indicate the presence of a comment attached to a text portion of the document content. Different screen readers use slightly different phrases to announce the starting and ending of a comment. We chose the phrases “start comment” and “end comment” following the way JAWS announces comments in Google Docs. In cases where two comments overlap each other, two “start comment” phrases appear one after another, indicating that a comment started before another comment ended (i.e., it was fully or partially overlapped; see Figure 2, top). Note that screen readers use a variety of speech-based configurations to represent comments and edits. For example, both JAWS and NVDA have list views where users can navigate through all the comments (or edits) sequentially. However, we chose the aforementioned technique as the default, since our focus in this study is on the way screen reader users consume collaboration information as they go through the document content—not on how they attend to the list of comments/edits separately. An audio example can be found here: http://bit.ly/bvi-cw-mod1-announcement.
Earcons. In this technique, two distinct audio tones work as earcons [35], i.e., abstract representations of the spoken phrases “start comment” and “end comment”. We used a two-part bell sound (DING-DONG), where the DING sound specifies the starting of a comment and the DONG sound specifies the ending (see Figure 2, bottom left). Similar to the announcement technique, two DING sounds appearing one after another indicate an overlap between two comments. Based on feedback received from pilot testing sessions, we adjusted the length and the loudness of the sounds to make them noticeable but subtle and comfortable for listening. For the same reason, we chose these short-lived DING-DONG sounds instead of complex earcons consisting of multiple rhythmic sequences [35]. An audio example can be found at this link: http://bit.ly/bvi-cw-mod1-earcons.
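Conceptually, the earcon representation splices short non-speech sounds into the synthesized speech stream at comment boundaries. The sketch below illustrates this with librosa and numpy, assuming the speech segments and bell sounds already exist as audio files (all file names are hypothetical).

# Sketch: build the earcon representation by concatenating speech segments with a
# DING at each comment start and a DONG at each comment end. File names are examples.
import numpy as np
import librosa
import soundfile as sf

SR = 22050
ding, _ = librosa.load("ding.wav", sr=SR)   # marks the start of a comment
dong, _ = librosa.load("dong.wav", sr=SR)   # marks the end of a comment

# Speech audio for the passage, pre-split at comment boundaries, in reading order.
# Each entry is (audio_file, marker), where marker is None, "start", or "end".
segments = [("before_comment.wav", None),
            ("commented_text.wav", "start"),
            ("after_comment.wav", "end")]

pieces = []
for path, marker in segments:
    if marker == "start":
        pieces.append(ding)   # DING right before the commented text
    if marker == "end":
        pieces.append(dong)   # DONG right after the commented text
    speech, _ = librosa.load(path, sr=SR)
    pieces.append(speech)

sf.write("earcons_passage.wav", np.concatenate(pieces), SR)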
Tone Overlay. In this technique, a tone is continuously played in the background as long as the text portion associated with a comment is read out. The frequency (i.e., pitch) of the background tone is increased when text portions have multiple comments overlapping with each other so that users can detect where standalone and overlapping comments are attached (see Figure 2, bottom right). We used 185 Hz (note G3) for the background tone associated with text having standalone comments and 220 Hz (note A3) for overlapping comments. We adjusted the amplitude of the background tone according to feedback from pilot participants to keep it at a discernible level but much lower than the level of the text read-aloud. We did this to ensure that users can distinguish the background tone from the text read-aloud without it impeding perception of the text content. An audio example can be found at this link: http://bit.ly/bvi-cw-mod1-tone-overlay.
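A tone overlay can be produced by mixing a low-amplitude sine tone under the speech for the commented span. The sketch below shows the idea, assuming the speech for the commented text has already been synthesized; the file names and the 0.05 gain are illustrative, and our actual levels were tuned with pilot participants.

# Sketch: overlay a quiet background tone (185 Hz for standalone comments, 220 Hz
# for overlapping ones) on the speech audio of the commented text span.
import numpy as np
import librosa
import soundfile as sf

SR = 22050
speech, _ = librosa.load("commented_text.wav", sr=SR)  # hypothetical speech segment

overlapping = False
freq = 220.0 if overlapping else 185.0          # A3 for overlaps, G3 otherwise
tone = librosa.tone(freq, sr=SR, length=len(speech))

# Keep the tone well below the speech level so it is noticeable but unobtrusive.
mixed = speech + 0.05 * tone
mixed = mixed / np.max(np.abs(mixed))           # simple peak normalization

sf.write("tone_overlay_segment.wav", mixed, SR)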
6.2 Stimuli and Measures
Given the focus on understanding where the comments are attached, we manipulate document complexity in terms of the number of comments, the length of the text where comments are attached, and whether there are any overlapping comments. For each of the three techniques, we prepared stimuli passages with two levels of complexity: low (2–3 comments in total, all attached to 2–4 words in the passage text, and no overlapping comments) and high (5–6 comments in total, two of them attached to a single word, one to a whole sentence and the rest to 2–4 words, and one pair of overlapping comments). Thus, in this module, participants listen to six passages in total, two for each technique. Table 1 includes the set of questions we asked to assess participants’ perception of collaboration information and comprehension of passage content and the self-report statements we administered to capture their perception of use and overall preference.
6.3 Results
We begin by investigating how different auditory representations (default announcement, earcons, or tone overlay) affect participants’ performance on the questions related to where the comments are attached. With regards to the question about the distribution of comments in the passage (i.e., whether the comments are attached mostly in the first half or last half of the passage or are almost evenly distributed throughout), we see differences in the way participants performed using these techniques in low and high complexity passages. Specifically, using earcons, participants were less likely to correctly identify the location of the comments in a low complexity passage relative to the default announcement, whereas they were more likely to correctly identify comment locations in a high complexity passage relative to the default announcement. In other words, the odds of correctly locating comments with earcons are 0.33 times the odds with the default announcement in a low complexity passage, whereas in a high complexity passage the odds with earcons are 6.4 times those with the default announcement (for the interaction, \(log(OR)\) = 2.95, p = 0.006; see Figure 3 and Table B.1). There was a similar statistical trend in participants’ performance using tone overlay in low and high complexity passages. In particular, in a low complexity passage the odds of correctly identifying the location of comments are 0.57 times those with the default announcement, whereas in a high complexity passage they are 4.1 times those with the default announcement (for the interaction, \(log(OR)\) = 1.96, p = 0.059; see Figure 3 and Table B.1). This possibly indicates that in a low complexity passage with only a few comments dispersed throughout, the spoken announcements may have provided a more straightforward way to understand where comments are located compared to non-speech audio cues. However, in a high complexity passage where comments were densely populated—in close proximity and overlapping with each other—spoken announcements may have become more confusing and verbose, while earcons and tone overlay performed relatively better for identifying the distribution of comments.
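As a quick check of internal consistency (allowing for rounding), each interaction term corresponds to the ratio of the complexity-specific odds ratios reported above: \(e^{2.95} \approx 19.1 \approx 6.4/0.33\) for earcons, and \(e^{1.96} \approx 7.1 \approx 4.1/0.57\) for tone overlay.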
Turning to the comprehension of passage content, we see that tone overlay improved comprehension relative to the default technique and earcons. With tone overlay, the odds of correctly answering the question about passage content are 1.9 times those with the default technique (\(log(OR)\) = 0.63, p = 0.055, Table B.1) and 2.7 times those with earcons (\(log(OR)\) = 1.01, p = 0.002).
In addition, participants reported several benefits of tone overlay and earcons on the self-report measures. They felt it was easier to understand overlapping comments using tone overlay compared to the default technique and earcons; in particular, the predicted rating for tone overlay is 0.58 units higher (on the five-point Likert scale) than for the default technique (\(\beta\) = 0.58, p < 0.001, Table B.2) and 0.31 units higher than for earcons (\(\beta\) = 0.31, p = 0.046). Additionally, they reported that their reading flow was less disrupted using both earcons (\(\beta\) = -0.50, p = 0.04, Table B.2) and tone overlay (\(\beta\) = -1.02, p < 0.001, Table B.2) compared to the default announcement. This finding that both earcons and tone overlay were considered less disruptive than spoken announcements supports our intuition behind using these non-speech audio representations to reduce verbal clutter and support fluent reading flow. Furthermore, tone overlay was considered to be even less disruptive (\(\beta\) = -0.52, p = 0.03) and to require less cognitive effort (\(\beta\) = -0.44, p = 0.045) than earcons.
Overall, these results illustrate that non-speech audio such as tone overlay better represented some aspects of collaboration information (e.g., overlapping comments) without creating much disruption in the reading flow and was not detectably better or worse than the default announcement in other aspects. This is also supported by participants’ overall preference for the techniques. Although there was no significant difference between earcons and the default announcement (\(\beta\) = 0.22, p = 0.17, Table B.2), participants preferred tone overlay more than the default announcement (\(\beta\) = 0.63, p < 0.001, Table B.2) and earcons (\(\beta\) = 0.41, p = 0.01).
Our qualitative analyses of participants’ open-ended feedback provided deeper insights into how these auditory techniques supported and impeded their understanding of the passage content and the presence of comments. One factor that considerably influenced participants’ preferences was to what extent a technique helped them disambiguate text content and collaboration information. Many participants (52.1%) preferred tone overlay, because it uses “verbal and non-verbal cues, so it is easier to distinguish the text” (P41) and they “could kind of visualize words being underlined or highlighted” (P37). Participants’ reactions also depended on the way non-speech audio cues shifted their attention from text content to collaboration information. For instance, some felt that with earcons and tone overlay, they “ended up paying attention more to the tone than the audio (speech)” (P13). This reaction may have stemmed from the fact that our experimental techniques such as tone overlay were novel to many participants and they thought “it would take a little bit longer to get used to it” (P23). Participants also added that non-speech audio cues need to be customizable according to one’s individual receptivity toward audio enhancements and hearing abilities.
Additionally, participants found tone overlay to be more helpful in perceiving the span of commented text: “tone [overlay] was continuous through the comments, so you are kind of aware that you’re still in the comment versus not in a comment” (P9). However, compared to earcons, tone overlay was “a little less precise in terms of where the comment starts and ends” (P8). To get the benefits of both earcons and tone overlay, participants recommended combining these two techniques or choosing one based on the span of the commented text: “maybe have a ding and a dong (earcons) for a single word but a tone [overlay] for a sentence” (P34).
The amount of time required to perceive collaboration information was another key consideration for our participants. They felt that earcons were “quicker to read” (P19) and “fleeting” (P44) compared to spoken announcements. Tone overlay was even better, “because you are getting two pieces of information at once... it will represent a huge productivity boost. You are reading the text and you are getting an immediate indication that that text is commented” (P25).
While a majority of the participants preferred some form of non-speech audio, those who preferred the default announcement mentioned familiarity as a key reason. P14, an IT professional, preferred the default announcement “probably because it’s similar to other materials that I’ve read that also have similar tags like HTML or different object notation things in programming that indicate the beginning and ending of particular blocks.” Participants who preferred non-speech audio mentioned additional concerns, such as memorizing “too many other sounds that were used to indicate the beginning and ending of things... that’s a little harder to keep track of which sounds are which things” (P8). Prior work has also highlighted that earcons require explicit learning [35]. To address this, P7 and P17 suggested using easily distinguishable earcons that can be meaningfully mapped to the notion of opening and closing comments, such as “the train tones... like opening and shutting doors.”

In summary, our quantitative and qualitative results suggest that (1) tone overlay best supports the challenging task of identifying where comments are located without causing disruption to reading flow; (2) earcons and tone overlay are most useful in understanding where comments are located in complex passages (i.e., densely populated with comments); and (3) these techniques may work best in combination depending on the document complexity (e.g., presence of overlapping comments and the span of commented texts).
8 Module 3: Who Edited What?
Not only must writers keep track of where comments are located, what information each comment contains, and who added various comments, but they must also understand which collaborators made certain edits and how those edits changed the document.
8.1 Auditory Representations
In the third module of our study, we incorporated three auditory representations for describing who edited what in a document: announcement (default), contextual presentation, and contextual presentation with voice coding.
Announcement (default). This technique announces the name of the editor and the type and content of an edit while reading the document text, using spoken phrases such as “Mary inserted” and “end insertion”. For example, consider the sentence in Figure 6, top, which has an insertion and a deletion from two editors. This sentence is read by a screen reader as follows:

“The statue of liberty was a gift <pause> Mary inserted <pause> of friendship from the people of France <pause> end insertion <pause> Beth deleted <pause> the people of <pause> end deletion <pause> to the United States.”

We consider this technique as the default (see Figure 6, bottom left), since it aligns with the way many screen readers describe edits made with the Track Changes feature on a Word document. An audio example can be found here: http://bit.ly/bvi-cw-mod3-announcement.
Contextual Presentation. This technique reads a document sentence by sentence, presenting edits in the context of the sentence. That is, to contextually present a suggested edit, it reads the corresponding sentence as it would have appeared after the edit was applied to it. To make the effect of the edit more salient, this technique presents both versions of the sentence—before and after the edit is applied. It first reads the original version before any edits are applied, then announces the number of edits in the sentence, and then reads the different versions of the sentence produced by applying the suggested edits sequentially, one after another. Returning to the example in Figure 6 (top), Mary’s edit (i.e., inserted text) occurred earlier than Beth’s edit (i.e., deleted text). These edits are presented sequentially (see Figure 6, bottom middle):

“The Statue of Liberty was a gift from France to the United States <pause> Two edits <pause> Insertion by Mary <pause> The Statue of Liberty was a gift of friendship from the people of France to the United States <pause> Deletion by Beth <pause> The Statue of Liberty was a gift of friendship from France to the United States.”

Building on the insights gathered from our formative findings, this technique highlights how edits alter the meaning of a sentence by presenting those edits within the context of the sentence. An audio example can be found at this link: http://bit.ly/bvi-cw-mod3-contextual.
As shown in the example above, while reading a version of the sentence corresponding to a specific edit, this technique retains the edits that were made earlier. We do this considering the possible interdependence between sequential edits (e.g., an editor may delete a word that was previously inserted by another editor, thus the deletion cannot be understood without the earlier insertion). Additionally, sequential presentation highlights the way a sentence evolves throughout the course of the suggested edits. However, it is also important to understand how an individual edit can alter the meaning of the sentence. Following the same technique we used, it is possible to iterate through all possible versions of a sentence applying the suggested edits individually as well as in combination with other edits.
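The sequential presentation can be generated by replaying a sentence’s tracked edits over the original text in order. The function below is a simplified sketch of that idea (it locates edits by matching text rather than by the character offsets our system extracts, and the example sentence and edits are invented).

# Simplified sketch: generate successive versions of a sentence by applying its
# tracked edits in order, retaining earlier edits in each later version (as the
# contextual presentation technique does).
def sentence_versions(original, edits):
    """edits: ordered list of dicts with keys 'author', 'type' ('insert' or
    'delete'), 'anchor' (text immediately before an insertion, or the deleted
    text itself), and 'text' (inserted text; unused for deletions)."""
    versions = [("Original", original)]
    current = original
    for edit in edits:
        if edit["type"] == "insert":
            pos = current.find(edit["anchor"]) + len(edit["anchor"])
            current = current[:pos] + edit["text"] + current[pos:]
            label = f"Insertion by {edit['author']}"
        else:  # delete: remove the first occurrence of the deleted text
            current = current.replace(edit["anchor"], "", 1)
            label = f"Deletion by {edit['author']}"
        versions.append((label, current))
    return versions

# Hypothetical example: Mary inserts a word, then Beth deletes a phrase.
sentence = "The museum opened a new exhibit last spring."
edits = [
    {"author": "Mary", "type": "insert", "anchor": "a new ", "text": "interactive "},
    {"author": "Beth", "type": "delete", "anchor": " last spring"},
]
for label, version in sentence_versions(sentence, edits):
    print(f"{label}: {version}")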
Contextual Presentation with Voice Coding. This technique is a variation of the contextual presentation technique, where text portions inserted by different editors are voice coded, i.e., read out in the editors’ respective synthesized voices. In the previous example, the text portion inserted by Mary (“of friendship from the people of France”) is read out in the synthesized voice assigned to them. Collaboration markup phrases (e.g., “Insertion by Mary”) and text portions written without Track Changes are read in the default voice (see Figure 6, bottom right). Similar to the contextual presentation technique without voice coding, this technique retains earlier edits in each iteration of a sentence that has multiple edits. To address the concern about cognitive overload in listening to several different voices within a sentence, we refined the technique to read the earlier edits in the default voice, while only the text portion inserted in the current iteration is read in the editor-specific voice. This limits the number of distinct voices in each iteration of a sentence to two: the default voice and the one associated with the editor who made the edit in the current iteration. In this way, it also highlights the content of the current edit by making it stand out amidst text read out in the default voice. An audio example can be found at this link: http://bit.ly/bvi-cw-mod3-contextual-voice-coding.
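Voice coding then requires synthesizing the current edit’s text in the editor’s assigned voice and everything else in the default voice. The snippet below sketches this step with Amazon Polly; the editor-to-voice mapping and the segment list are hypothetical.

# Sketch: synthesize one voice-coded iteration of a sentence. The text inserted in
# the current edit is spoken in the editor's assigned voice; all other text and the
# markup phrase use the default voice. Voice assignments here are examples only.
import boto3

polly = boto3.client("polly")
DEFAULT_VOICE = "Matthew"
EDITOR_VOICES = {"Mary": "Joanna", "Beth": "Kendra"}  # hypothetical mapping

def synthesize(text, voice):
    resp = polly.synthesize_speech(Text=text, VoiceId=voice, OutputFormat="pcm",
                                   SampleRate="16000")
    return resp["AudioStream"].read()

# One iteration: markup phrase, pre-edit text, current edit, post-edit text.
segments = [("Insertion by Mary.", DEFAULT_VOICE),
            ("The Statue of Liberty was a gift", DEFAULT_VOICE),
            ("of friendship from the people of France", EDITOR_VOICES["Mary"]),
            ("to the United States.", DEFAULT_VOICE)]

# Raw PCM segments at the same sample rate can simply be concatenated.
pcm = b"".join(synthesize(text, voice) for text, voice in segments)
with open("voice_coded_iteration.pcm", "wb") as f:
    f.write(pcm)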
8.2 Stimuli and Measures
While the previous two modules focused on the comments in a passage, this module examines tracked changes or edits to the passage content. We manipulate document complexity in terms of the number of edits, editors, and overlapping edits (i.e., one editor deleting a word from a text portion that another editor inserted). We prepared passages with two levels of complexity: low (two editors, four edits in total, no overlapping edits) and high (four editors, six edits in total with two pairs of overlapping edits). Thus, in this module, each participant experiences six passages in total, two for each technique. Similar to Section 7.2, we ensure that each passage has a salient contribution from one individual in that they make a higher number of edits (by at least two) than the rest of the editors. Table 1 includes the set of questions we asked to assess participants’ perception of collaboration information and the self-report statements we administered to capture their perception of use and overall preference. However, unlike the previous modules, we do not ask any questions about the passage content separately, since the question about the changes in the meaning of a sentence after suggested edits also captures comprehension of passage content.
8.3 Results
We start by analyzing whether contextual presentation and contextual voice coding affect participants’ perception of who edited what differently than the default announcement. Looking at the responses to the question about who edited a sentence, we see that participants were more likely to answer correctly using contextual presentation relative to the default announcement. Specifically, the odds of providing a correct answer are 3.1 times as high in contextual presentation as in the default technique (\(\log(OR)\) = 1.12, p = 0.007, Table B.5). We see an even larger effect in response to this question with contextual voice coding: the odds of providing a correct answer are 4.4 times as high in contextual voice coding as in the default announcement (\(\log(OR)\) = 1.48, p < 0.001, Table B.5). This indicates that adding voice coding to the contextual presentation may have made it even more helpful for recognizing the editor correctly.
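For readers who want to connect the reported odds ratios to the coefficients in Table B.5, the odds ratio is simply the exponentiated coefficient, \(OR = e^{\log(OR)}\); as a rough check with rounded values, \(e^{1.12} \approx 3.1\) and \(e^{1.48} \approx 4.4\), matching the figures above (small discrepancies can arise from rounding the coefficients).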
Similarly, for the question about how the edits altered the meaning of a sentence, participants were more likely to provide correct answers using both contextual presentation and contextual voice coding compared to the default announcement. The odds of providing a correct answer are 5.6 times as high in contextual presentation as in the default technique (\(\log(OR)\) = 1.73, p < 0.001, Table B.5). Contextual voice coding shows a similar pattern with an even larger effect: the odds of providing a correct answer are 15.4 times as high as in the default technique (\(\log(OR)\) = 2.73, p < 0.001, Table B.5). Furthermore, participants were more likely to provide a correct answer to this question with contextual voice coding than with contextual presentation: the odds of a correct answer are 2.7 times as high in contextual voice coding as in the contextual presentation technique (\(\log(OR)\) = 1.01, p = 0.006). This indicates that including voice coding in the contextual presentation may have helped participants identify the newly inserted text in the modified version of the sentence and thus provided an even better understanding of how the meaning of the original sentence changed.
Participants’ self-reported ratings bolstered the results reported above. In particular, understanding who edited what was perceived to be easier using contextual voice coding compared to the default technique: the predicted rating for contextual voice coding is 0.38 units higher (on the five-point Likert scale) than for the default technique (\(\beta\) = 0.38, p = 0.03, Table B.6). Similarly, participants found it easier to understand changes in the meaning of a sentence with contextual voice coding than with the default technique (\(\beta\) = 0.58, p = 0.002, Table B.6) and contextual presentation (\(\beta\) = 0.35, p = 0.057). This result aligns with our formative findings, which inspired us to present collaborators’ edits in context, using different voices to iteratively highlight how the edits alter the meaning of the original content.
Additionally, compared to the default announcement, participants found contextual voice coding easier to learn (\(\beta\) = 0.42, p = 0.02, Table B.6), lower in cognitive load (\(\beta\) = \(-\)0.73, p < 0.001, Table B.6), and less disruptive to reading flow (\(\beta\) = \(-\)0.83, p < 0.001, Table B.6). Furthermore, contextual voice coding was rated as less disruptive (\(\beta\) = \(-\)0.54, p = 0.003) and as requiring less cognitive effort (\(\beta\) = \(-\)0.42, p = 0.02) than contextual presentation. Looking at participants’ overall preferences, contextual presentation was preferred to the default technique (\(\beta\) = 0.32, p = 0.04, Table B.6), and contextual voice coding was preferred to both the default technique (\(\beta\) = 0.71, p < 0.001, Table B.6) and contextual presentation (\(\beta\) = 0.39, p = 0.01). Overall, these results illustrate that the contextual presentation technique improves perception of edits, and that integrating voice coding with this technique makes it even better by reducing cognitive load and disruption in workflow.
Participants’ open-ended responses further strengthened our findings from the quantitative analyses. The key reason guiding most participants’ preference for contextual voice coding was that this technique combined the benefits of presenting edits in the context of the original sentence with the “extra reinforcement” of different voices, which served as “an easier memory guide for who did what and also what was [an] edit and what wasn’t” (P22). In contrast, the default spoken announcement, which was the least preferred technique for most participants (60.4%), was considered “a complete waste of time” (P44), because participants felt that they “couldn’t really get a good grasp of how it changed the meaning by listening through it just a one time... I will have to go through it a few times” (P23).
Despite the important benefits of contextual voice coding, the different voices were a distraction for some participants, as we also found in Module 2 (Section 7.3). P18 further explained that the natural break that occurs in the synthesized speech when a new voice with different prosody appears in the middle of a sentence “actually destroys the intonation of the sentence. So, a screen-reader user who is expecting a sentence to come in a natural flow loses that track.” Instead of assigning distinct voices to individual editors, these participants suggested having a single voice read all the edits, as they did in the previous module (Section 7.3). Relatedly, those who preferred the default technique due to its simplicity still wanted other forms of non-speech audio, such as earcons, tone overlay, or changing the pitch or voice of spoken announcement phrases, to distinguish edits from text content. Interestingly, as we also found in Module 1 (Section 6.3), participants considered tone overlay to be better than earcons because of its “efficiency” in terms of time required and its clarity in depicting the span of the edited (or commented) text.
While contextual presentations (with or without voice coding) were generally preferred to the default technique for the reasons discussed above, many participants expressed concern about the repetition of a sentence in contextual presentation, which is why a passage takes longer to finish with this technique. P1 said, “As blind people, things generally take us longer and every time the sentence is read a second time, I’m like- ‘okay, I already heard that’, and if you’re talking [about] a long document, that’s gonna take an age to go through.” As such, some participants said that in a natural writing scenario, they would prefer listening to only “the focus or the area that was changed, either just the words that were added or deleted, or maybe the immediate context” (P4) instead of the original and modified versions of the entire sentence. Participants also emphasized that interactive collaborative features that allow them to consume information on an as-required basis might further reduce disruption in their workflow and improve collaboration awareness.
In summary, our quantitative and qualitative results show that (1) the combination of contextual presentation and voice coding provides the best support for understanding who edited a sentence and how the edits altered the meaning of a sentence; however, (2) presenting edits in the context of an entire sentence requires more time than the default technique, and (3) changing voices in the middle of a sentence to present edited text can break the continuity of reading.
9 Discussion
Maintaining collaboration awareness is a complex challenge for all writers. Yet, the serial nature of how screen readers present text-based content, combined with the lack of well-designed auditory representations for collaboration markup, makes achieving and maintaining collaboration awareness particularly difficult for blind writers. Prior research in HCI has discussed the problematic relationship between accessibility and usability [11, 47, 65, 78, 81], showing that many technological systems are accessible on the surface but not usable for practical purposes [11, 29]. Our analysis reveals a specific instance of this problem: screen reader users have difficulty not only developing collaboration awareness but also maintaining efficiency due to tools that are “supposedly accessible but very poorly implemented” [29]. As such, writing tools must be designed so that screen reader users can perceive collaboration information efficiently, without additional cognitive effort or significant disruption to their individual workflow (e.g., reading or writing on their own). The present article provides a foundation for creating more accessible collaborative writing tools through our empirically grounded design and evaluation of multiple auditory techniques for asynchronous collaborative writing. Below, we discuss the design tradeoffs and considerations for auditory representations that address issues of cognitive effort, disruption, and efficiency.
9.1 Managing Cognitive Effort in Understanding Collaboration Information
Our formative study revealed that screen reader users need to expend greater cognitive effort sifting through the “jumbled mess” of collaboration notifications, which arrive in the same format used to read text content, i.e., speech. In contrast, by presenting collaboration information in a distinct auditory format, non-speech audio cues and voice coding help people perceive the location, content, and author of comments and edits with less cognitive effort. Specifically, voice coding makes it easier to keep track of who commented on or edited what, while non-speech audio cues (e.g., tone overlay) are helpful in distinguishing between text content with overlapping comments, with standalone comments, or with no comments attached. Interestingly, some of these techniques helped our participants create a mental image of collaborators’ actions. For example, tone overlay worked as an auditory “underline or highlight,” while voice coding created an impression of people “having a conversation.” Thus, the auditory enhancement and expressiveness that non-speech audio and voice coding offer can minimize the cognitive effort [70] required to disambiguate complex and intertwined pieces of collaboration information and text content.
Despite this benefit, mapping non-speech audio cues to their corresponding meanings [35] or figuring out which voice refers to whom can place additional cognitive load on screen reader users, particularly when many pieces of information (e.g., the start and end of comments, insertions, deletions) are indicated by non-speech audio cues or when a large number of co-authors contribute to the shared document. In contrast, spoken announcements that provide a straightforward description of the collaboration markup (e.g., “start comment”, “end comment”) do not require explicit semantic mapping or memorization. As such, spoken announcements may be preferable for novice screen reader users who are just starting to use collaborative tools, whereas people may switch to non-speech audio and voice coding techniques once they have a better understanding of the syntax of collaborative features and the semantic mappings of audio cues. Another approach to this issue could involve using representative auditory icons [35] (e.g., the sound of a door opening or closing) instead of abstract earcons. Furthermore, collaborative tools and screen readers could allow users to create personalized voice profiles for their co-authors [1]. This could reduce the cognitive load of mapping different voices to co-authors, especially when working repeatedly with the same collaborators (e.g., a manager or advisor) whose voices become familiar over time.
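One way such personalized voice profiles could be stored and looked up is sketched below (purely illustrative Python; the file name, JSON format, and voice identifiers are assumptions rather than an existing screen reader API).

```python
# A speculative sketch of persistent, user-defined voice profiles for co-authors.
import json
from typing import Dict

def load_voice_profiles(path: str = "voice_profiles.json") -> Dict[str, str]:
    """Load a user-maintained mapping from co-author identity to TTS voice."""
    try:
        with open(path) as f:
            return json.load(f)          # e.g., {"advisor@uni.edu": "en-GB-VoiceB"}
    except FileNotFoundError:
        return {}

def voice_for(author: str, profiles: Dict[str, str],
              default: str = "en-US-Standard") -> str:
    """Fall back to the default voice for co-authors without a profile."""
    return profiles.get(author, default)

profiles = load_voice_profiles()
print(voice_for("advisor@uni.edu", profiles))
```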
Our analysis also illustrated that different auditory techniques can incur more or less cognitive load depending on the specific collaboration information they present and the level of collaboration complexity in the document. For example, earcons can point to the precise locations where a comment starts and ends, whereas tone overlay can provide a clear understanding of the span of a comment and where comments overlap with each other. As such, audio representations should be implemented in a way that aligns with the context in which they are used [36] (e.g., a complex document with a large number of edits or a paragraph with overlapping comments) and may work best in combination; that is, screen readers could dynamically render collaboration information based on the complexity and structure of the document.
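The sketch below illustrates one way such context-dependent rendering might be expressed (hypothetical Python; the Region fields and thresholds are our own illustration, not derived from the study data).

```python
# A speculative sketch: pick an auditory representation based on the local
# collaboration complexity of a region of the document.
from dataclasses import dataclass

@dataclass
class Region:
    n_comments: int
    n_edits: int
    has_overlaps: bool

def choose_representation(region: Region) -> str:
    """Heuristic mapping from complexity to representation; thresholds are illustrative."""
    if region.has_overlaps:
        return "tone_overlay"        # conveys spans and overlaps well
    if region.n_comments + region.n_edits <= 2:
        return "earcons"             # precise start/end markers suffice
    return "voice_coding"            # many contributions: voices help track who did what

print(choose_representation(Region(n_comments=3, n_edits=2, has_overlaps=True)))
```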
9.2 Reducing Disruption in Individual Workflow
Our analysis joins prior work in highlighting the ways screen reader representations pale in comparison to the mainstream collaborative features designed for sighted people [9, 67, 84]. One example is the way collaborative tools leverage glanceability [67] to present multiple layers of collaboration information in tandem with text content through color coding and comment sidebars, whereby sighted people can direct their attention to where they want to focus with a quick glance, without interrupting the task at hand. In contrast, screen readers push spoken alerts that describe collaboration information interlaced with text content, creating continuous disruption to one’s own reading flow. In this regard, non-speech audio cues (e.g., earcons and tone overlay) can offer a “less obtrusive” approach to making sense of collaboration information.
While non-speech audio cues are generally less disruptive than spoken announcements, our analysis showed that audio cues can also sometimes pose a distraction and shift people’s attention away from understanding text content. Similarly, changing the voice in the middle of a sentence can break the continuity of intonation and prosody in a way that may become “jarring” and “discordant.” To address this, participants wanted a single voice for all commenters (or editors) that is distinct from the default voice used for text content. Thus, a simpler version of voice coding (or a manipulation of pitch or timbre) could make it easier to differentiate between text content and comments or edits without breaking people’s reading flow or incurring the additional cognitive burden of voice-to-author mapping. Importantly, people’s reactions to audio cues also depend on their personal preferences and hearing abilities. Some people may want to lower the pitch, volume, or duration of audio cues because they find them disconcerting. Others, however, may prefer to increase the pitch, volume, or duration of audio cues to make them more distinctive relative to the screen reader speech, so that one does not subsume the other.
Allowing people to customize and personalize the parameters of non-speech audio cues and text-to-speech voices can be a key step toward addressing the issue described above. Another approach involves rethinking collaborative writing through an activity-centered lens [9, 58] to support the goals a person intends to accomplish and the tasks they are attempting to complete at a particular moment to achieve those goals. For example, are they skimming through the document to understand how other co-authors have contributed? Are they reading to perceive the final state and content of the document? Are they making edits on their own? An individual may not always need continuous awareness of their collaborators’ actions, particularly when they are focusing on their own reading or writing activities. Similar to the way visual collaborative interfaces allow users to control the amount of visible collaboration information (e.g., by switching between the “no markup”, “simple markup”, and “all markup” options for tracked changes in Microsoft Word), screen readers could present information relevant to particular tasks (e.g., understanding changes, reading and responding to comments) instead of continuously pushing auditory alerts for collaboration information. One such example may involve having separate “private writing” and “public editing” sessions, as suggested in prior work [62, 89]. Although Wang et al. proposed separate private and public sessions to support sighted collaborators who want to avoid exposing details of their writing practices [89], here we see that such an activity-centered approach may help screen reader users filter out collaboration notifications when they do not need them. Importantly, one’s goals and tasks are likely to evolve over time. As such, collaborative tools should determine people’s intended tasks either by tracking relevant contextual indicators or by allowing them to declare and switch their current tasks or “modes” fluidly.
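A minimal sketch of such task- or mode-based filtering, loosely analogous to Word's markup views, is shown below (illustrative Python; the mode names and event structure are assumptions, not features of any existing screen reader).

```python
# A speculative sketch of task-based filtering of collaboration notifications.
from enum import Enum, auto
from typing import Optional

class Mode(Enum):
    PRIVATE_WRITING = auto()   # suppress all collaboration announcements
    SIMPLE_MARKUP = auto()     # brief non-speech cue only
    ALL_MARKUP = auto()        # full spoken or voice-coded details

def render_notification(mode: Mode, event: dict) -> Optional[str]:
    """Decide what, if anything, to speak for a collaboration event in the current mode."""
    if mode is Mode.PRIVATE_WRITING:
        return None
    if mode is Mode.SIMPLE_MARKUP:
        return "<earcon>"                      # placeholder for a short audio cue
    return f"{event['kind']} by {event['author']}: {event['text']}"

print(render_notification(Mode.ALL_MARKUP,
                          {"kind": "Comment", "author": "Beth", "text": "Cite a source here."}))
```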
9.3 Improving Efficiency in Processing Collaboration Information
One important aspect of efficiency that repeatedly appeared across our study modules is the time required to consume collaboration information. Presenting information sequentially through verbose spoken announcements means it takes longer to listen to and make sense of that information [67]. In contrast, non-speech audio cues (e.g., earcons, tone overlay) and voice coding can help people quickly process who did what and where by conveying multiple threads of information at once. For example, the background tone in tone overlay indicates the presence of a comment (or overlapping comments) while the commented text is read aloud. Similarly, the voice coding technique reads the content of comments or edits while also indicating who made that comment or edit. In fact, some participants said that they could dispense with the markup phrases that announce co-authors’ names once they learned the corresponding voice mappings in the voice coding technique, or if they could use personalized voice profiles, reducing the time required even further.
While non-speech audio cues and voice coding were decidedly better than the default spoken announcements in terms of the time required to understand who edited or commented on what and where, the situation becomes more complicated when writers need to figure out how the document content has evolved through previous edits. Participants in our study shared that they forget the meaning of a sentence by the time they hear all the spoken announcements for suggested edits and often need to “go through it a few times” to piece together how it appears before and after the edits. The contextual presentation technique reduced cognitive effort in this regard, since participants could more readily perceive how edits altered the meaning of the sentences. This improvement, however, comes with a compromise in efficiency, as contextual presentation takes longer because it plays a sentence in both its original and modified versions. Balancing cognitive overload and efficiency in developing collaboration awareness may require a hierarchical approach with a combination of techniques, where people can opportunistically control what information they hear and in what format, depending on their tasks and goals at a particular moment [52]. For example, when someone skims through a document, the presence of edits could be indicated using non-speech audio cues at a higher level (e.g., the paragraph level). If the person wants to explore the edits to a specific sentence in more detail, they could use designated keystrokes to listen to the edited text separately or within the context of the sentence.
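The following sketch (hypothetical Python; the data model and the "<edit-earcon>" placeholder are illustrative) shows the skim-then-expand idea: a brief cue at the paragraph level while skimming, with full edit detail only when the user explicitly requests it for a sentence.

```python
# A speculative sketch of hierarchical, on-demand presentation of edits.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sentence:
    text: str
    edit_descriptions: List[str] = field(default_factory=list)  # precomputed spoken segments

@dataclass
class Paragraph:
    sentences: List[Sentence]

def skim(paragraph: Paragraph) -> str:
    """While skimming, play only a brief non-speech cue if any sentence has edits."""
    has_edits = any(s.edit_descriptions for s in paragraph.sentences)
    return "<edit-earcon>" if has_edits else ""

def expand(sentence: Sentence) -> List[str]:
    """On a designated keystroke, return the full contextual rendering for one sentence."""
    return sentence.edit_descriptions

para = Paragraph([Sentence("The Statue of Liberty was a gift from France to the United States.",
                           ["Insertion by Mary", "Deletion by Beth"])])
print(skim(para))
print(expand(para.sentences[0]))
```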
9.4 Limitations and Future Work
Grounded in findings from an interview study, this article presents results from a controlled experimental study that investigated the extent to which non-speech audio, voice coding, and contextual presentation support screen reader users’ collaboration awareness needs and efficiency relative to the default representations. Our results provide a foundation for future interactive systems that incorporate these techniques and allow for research on other facets of collaborative writing that we were not able to capture within the scope of this controlled study. For instance, with an interactive prototype, future studies may investigate how different representations facilitate (or impede) comprehension of collaboration information when users can pause and review, repeat certain comments or edits, and opportunistically query information as needed. Furthermore, a long-term deployment study with an interactive system could evaluate potential learning effects that influence how people perceive collaboration information once they get used to a particular audio representation and use it over time.
Future work could also explore whether non-speech audio cues and voice coding can support visually impaired writers in achieving collaboration awareness and efficiency when they perform real-time editing using screen readers. For instance, our prior work [29] revealed that the lack of awareness about where co-authors are editing in real time is a key accessibility issue in synchronous editing tools (e.g., Google Docs). Although screen reader users receive spoken notifications when a co-author enters or exits the paragraph they are working on in Google Docs, they do not know the exact location or proximity of the co-author’s cursor relative to their own. As such, screen reader users often avoid close co-editing to reduce the risk of typing over someone else’s edits. Furthermore, their own writing gets disrupted by the spoken announcements they hear when a collaborator joins or leaves the document or moves their cursor into or out of the paragraph they are working on. Non-speech audio cues may be useful in such cases to provide collaboration information in a less obtrusive manner.
Beyond collaborative writing, our findings are likely to have implications for other collaborative activities, such as collaborative programming. Potluri et al. highlighted the challenges with glanceability and alertability in Integrated Development Environments (IDEs), wherein screen reader users cannot process information as easily as a sighted person does with a quick glance at different windows and panes or through real-time visual alerts [67]. Future work could explore whether non-speech audio cues and voice coding techniques can address these issues in the context of collaborative programming.