1 Introduction

While most respondents answer online surveys on a computer, a growing share of respondents use a smartphone to complete them. This smaller device poses particular challenges for developers and survey designers because the display area on a smartphone is much smaller than that of a computer screen. The reduced screen “real estate” affects many areas of the survey, including the response option designs for questions. In this paper, we explore one particular type of response option design on mobile devices: the text input field. Text input fields allow users to enter characters or numbers using the keyboard or keypad. Typically, they are used for questions with too many possible answers for a radio button or drop-down design. Common examples of questions that use a text input field include name and address questions, ethnicity or ancestry questions, and food or medicine questions. Text input fields are also used to collect open-ended comments or a personal note for a gift purchased online, and they appear in forms for business purposes, such as collecting credit card numbers [1].

To analyze the data entered into these fields, the open-ended responses are often coded, either by a machine or an analyst, into categories or groups for further analysis. Coding answers that do not have enough context or content is challenging, and spelling and grammatical errors can interfere with correct coding as well. These limitations can lead to measurement error or to more labor-intensive, and therefore more expensive, manual coding. When there are several text input fields on a web page, survey respondents need to understand what to enter and where to enter it so that the correct answer lands in the correct field. They also need to understand how detailed an answer to give without exceeding the space limitation of the field. To investigate these issues, we conducted experiments examining field label placement, character countdown features, and predictive text (sometimes known as auto-correction or type-ahead). We describe these three elements in detail below.

When considering the design of text input fields, it is essential to label each field so that users know what to enter. Nielsen and Norman (2010) recommend placing a label close to the correct field (either above it or to its left) so that there is no confusion about which label goes with which field [2]. Labeling form fields has been researched using PC monitors [3, 4] with somewhat consistent results. For example, neither group found that label position affects the time spent on a web page. However, field labels placed above the field or to its left (aligned left or right) are preferred by users and generate fewer eye fixations on a PC than other label designs [5]. That same research found that the flow design, where the label is left aligned and the field immediately follows the label so that a web page with multiple fields has a jagged layout, is not preferred by users. Inline labels, or placeholder text, where the label disappears when focus is placed in the field (by clicking into, tabbing into, or touching the field), are sometimes used to save space on forms. However, this design has several problems: inline labels tax short-term memory because the user must remember what the field was for, and research has shown that users do not prefer this design [5, 6]. More recent variations on inline labels allow the label to jump to the top of the field (either just outside the field, or inside it but small and above the entered text) once focus is placed in the field. Google calls these designs outlined text fields or filled text fields, but we refer to them as inline labels that move [7]. To our knowledge, research on these types of labels has not been conducted on mobile phones. Inline labels that move may be advantageous for surveys on mobile phones, where space is limited, because they could reduce the length of the page without relying on short-term memory. Given the prior research on PCs, we do not expect time to complete the task (i.e., efficiency) to differ because of label placement, but we do expect user preference for labels to differ. We hypothesize that inline labels that move would be preferred over labels above or to the left of the field.

Another important decision for open text fields concerns how to communicate the amount of text input that is expected. On well-designed forms, the input field is the size of the expected entry [8, 9]. One design feature sometimes used to communicate the field size is a character countdown. With a character countdown, the maximum number of characters allowed in the field is displayed near the open-text field, usually below it. This number decreases as characters are typed, letting users know how many characters remain. When the maximum number of characters is reached, some designs stop accepting input and the countdown remains at zero, while other designs let the countdown go negative and change the number’s color, typically to red. This feature is often used in open-text fields with a limited number of characters, such as a note to the recipient of an online purchase; it is typically not used in fields asking for a name or address. While there is research suggesting that respondents can share information as rich and meaningful within a known 160-character text limit as within the larger fields of an email [10], little to no research has been conducted on character countdown features in web surveys. We hypothesize that character countdowns lead to a more efficient survey experience because respondents would know how much text is expected and would not exceed the limit.
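To make the countdown mechanics concrete, the minimal sketch below illustrates the variant described above that stops accepting input at the limit. It is a hypothetical Python illustration with invented names; the instruments used in this study were native iPhone apps, not Python.

```python
# Minimal sketch of the character-countdown behavior described above.
# Hypothetical names; the study's instruments were native iPhone apps, not Python.

class CountdownField:
    def __init__(self, max_chars: int):
        self.max_chars = max_chars
        self.text = ""

    @property
    def remaining(self) -> int:
        """Characters left, displayed near the field (e.g., below or above it)."""
        return self.max_chars - len(self.text)

    def type_char(self, ch: str) -> bool:
        """Accept a keystroke only while room remains; return False once the field is full."""
        if self.remaining <= 0:
            return False  # this variant stops accepting input at the limit
        self.text += ch
        return True

# Example: a 20-character field
field = CountdownField(max_chars=20)
for ch in "Norwegian and Irish ancestry":
    field.type_char(ch)
print(field.text)       # 'Norwegian and Irish '  (truncated at 20 characters)
print(field.remaining)  # 0
```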

Finally, predictive text is a technology that uses the context of the existing text and the first letters typed into a field to suggest a complete word to insert. The suggested words typically appear below the field and update on the fly as additional letters are typed. Instead of continuing to type, the user can touch a suggested word and it will appear in the field. The feature is frequently used when texting on a mobile phone, yet it is not often used in open-ended input fields in a survey. Early research used tactile vibrations in combination with predictive text to increase typing accuracy and speed on mobile devices [11]. One set of guidelines for forms suggests that predictive text should be used for fields with many predefined options [12]. Questions with a large number of possible but predictable answers, such as car models, vacation destinations, ethnicities, or medicines, could be candidates for an open-text response option with predictive text. Early research on mobile phones indicates that suggesting words could reduce the amount of time and the number of key presses needed to compose messages [13]. When predictive text is used, spelling errors are eliminated, and thus answers might be more likely to be machine coded. If the predictive text were limited to a finite list of words to “suggest” to the user, it might also make coding more efficient because researchers would not have to spend time attempting to code a word that was meaningless or that did not precisely “fit.” Couper and Zhang tested a predictive text prototype with a finite word list, primarily on PCs [14]. They found more codable answers using drop boxes and the predictive text dictionary, but response time was longer than when entering data in a plain open text field, the opposite of the finding from the mobile research [13]. Given the conflicting findings, we hypothesize that predictive text with a finite word list would lead to more codable answers, but we have no hypothesis about whether it would take longer than plain text.
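To illustrate the finite-word-list approach, the sketch below suggests completions by matching the typed prefix against a fixed dictionary. This is a hypothetical Python illustration, not the prototype from [14]; the word list and function names are ours.

```python
# Illustrative sketch of predictive text backed by a finite word list
# (hypothetical code, not the prototype evaluated in [14]).

MEDICINES = ["Atorvastatin", "Lisinopril", "Metformin", "Metoprolol", "Omeprazole"]

def suggest(prefix: str, dictionary: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` dictionary words that start with the typed prefix."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    return [word for word in dictionary if word.lower().startswith(prefix)][:limit]

print(suggest("met", MEDICINES))  # ['Metformin', 'Metoprolol']
print(suggest("x", MEDICINES))    # [] -- nothing to suggest, the user keeps typing
```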

The purpose of the present study was to conduct a series of systematic assessments on a mobile phone to determine how older adults use different user-interface designs to answer online survey questions and to identify preferred designs based on performance. The results of these assessments could be used as guidelines for developing mobile online surveys. We focused on older adults because they have reduced vision, mobility, and memory compared with younger adults [15, 16, 17]. Our rationale was that if we develop guidelines for a mobile web survey interface that older adults can successfully complete, then younger adults would do at least as well because of their superior perceptual and motor capabilities. In addition to conducting the field label experiment with older adults, we also conducted it with younger participants to help assess this assumption. This study is a continuation of our research on other mobile survey elements [18]. For more information on the entire research project, please see [19].

2 Methods

Below are highlights of the methods relevant to the three experiments described in this paper. In the analyses, we consider results statistically significant at p ≤ 0.05.

2.1 Participants

We aimed to recruit a sample of persons aged 60–75. We prescreened to include only participants who had at least 12 months of experience using a smartphone, under the assumption that these participants were more typical of respondents who choose to complete online surveys on mobile devices than those with less smartphone experience. Additionally, we prescreened participants to include only individuals who had an 8th-grade education or more, who were fluent in English, and who had normal or corrected-to-normal vision. The participants were a convenience sample recruited from senior and/or community centers in and around the Washington, DC metropolitan area between late 2016 and the summer of 2018.

The participant characteristics are provided in Table 1. Experiment 1 was conducted with a pool of 64 older adult participants, Experiment 2 with a pool of 40 participants, and Experiment 3 with 37 participants. With regard to familiarity with using a smartphone, on a 5-point scale where 1 was “Not at all familiar” and 5 was “Extremely familiar,” participants in each pool reported an average of 3. Experiment 1 was also conducted with 58 younger adults recruited from community colleges.

Table 1. Participant demographics for 3 experiments

2.2 Data Collection Methods

One-on-one sessions were conducted at senior centers and community centers. Participants were walk-ups that day or were scheduled by the center. At the appointment time, they were screened by Census Bureau staff and signed a consent form. Each participant then worked with a test administrator (TA) and completed between four and six experiments, only some of which are the subject of this paper. The experiments were implemented as apps and loaded onto Census-owned iPhone 5S or 6S devices. TAs provided participants with one of these devices for the test and gave them instructions, including not to talk aloud during the session and to complete the survey to the best of their ability, as though they were answering it at home without anyone’s assistance. The participants performed the tasks independently, taking 10–20 min for each experiment, depending upon the experimental design. At the end of the session, each participant was given a $40 honorarium.

3 Experiment 1: Text Input Field Labels

3.1 Designs Tested in the Experiment

In this experiment, we tested five different label locations for text input fields using a between-subjects experimental design with five conditions. Sixty-two older adults participated in this study and were randomly assigned to a condition. Condition 1 placed the label above the text box, left justified (Fig. 1). Condition 2 used inline labels, where the label was initially inside the box, as shown in Fig. 2; when focus was placed in the field (by touching it), the label moved above the text box and thus appeared similar to condition 1. In condition 3, the label was to the left of the text box and left aligned (Fig. 3). Condition 4 was similar to condition 3 in that the label was to the left of the text box, but it was right aligned so that it sat near the field (Fig. 4). In condition 5, the label was to the right of the text box (Fig. 5). Our hypothesis was that questions with inline labels that move or with labels above the text boxes would be most preferred, but that there would be no difference between conditions in the time needed to complete each question.

Fig. 1. Label above box

Fig. 2. Inline labels that move

Fig. 3. Label to left of box and left align

Fig. 4. Label to left and right align

Fig. 5. Label to right of box

Each condition had the same 14 open-ended questions on a range of topics. Each question, aside from Question 14, which required one long scroll, was presented in full on one screen. Some questions requested basic information, such as name and street address, that required little thought, while other questions were more complex, collecting information that might require more thought, such as the number of hours spent reading a book in the past week.

After the survey questions, satisfaction data were collected. The participant was asked to rate how easy or difficult it was to complete the task on a 5-point scale with the endpoints labeled 1 = Very Easy and 5 = Very Difficult. Then, the participant was shown the address question in all five label locations and asked which one(s) he or she preferred.

3.2 Analysis Methods

The app collected behavioral measures, including time on screen. For each condition, we measured respondent burden, operationalized as time to complete a screen (our efficiency measure), self-reported satisfaction, and self-reported label location preference. We then compared these measures between conditions.

We modeled the log of time to complete a screen at the question level using a mixed model, because the residuals of the model with untransformed time were slightly skewed. Modeling at the question level increases the number of observations and allows us to account for different question characteristics (see Table 2). In the model, we controlled for the condition, the type of data requested (basic or complex), and the interaction between condition and those characteristics. To control for any participant effect, because each participant would contribute up to 13 or 14 observations (one for each question), we included a random effect for the participant. As a check, we also modeled time controlling for the question number instead of the question characteristics.
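As a sketch of this model specification, the Python code below (using the statsmodels package) fits the log of screen time with fixed effects for condition, question type, and their interaction, plus a random intercept per participant. The paper does not report which software was used for the original analysis, and the data file and variable names here are hypothetical.

```python
# Sketch of the question-level mixed model (hypothetical file and variable names;
# the software used for the original analysis is not specified in the paper).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per participant x question, with columns
#   time_sec, condition (1-5), data_type ('basic'/'complex'), participant
df = pd.read_csv("experiment1_screen_times.csv")   # hypothetical file
df["log_time"] = np.log(df["time_sec"])

model = smf.mixedlm(
    "log_time ~ C(condition) * C(data_type)",  # condition, question type, interaction
    data=df,
    groups=df["participant"],                  # random intercept per participant
)
result = model.fit()
print(result.summary())
# A cohort indicator (older/younger) and its interaction with condition can be
# added to the formula to compare age groups, as described in Sect. 3.2.
```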

Table 2. Text input field label experiment: question characteristics

We tabulated satisfaction scores for each of the five conditions and conducted a Chi-square test of independence on them. Finally, we tabulated the preference data.

In this experiment we also collected data from 58 participants younger than 60 years old. To assess whether the younger adults performed similarly to the older adults or whether their performance differed, we added a cohort variable to the model indicating whether the data came from a younger or an older individual. We also included an interaction between the condition and the cohort indicator.

3.3 Results

Table 3 contains the results of the experiment for older adults, including how many participants were in each condition, the number of observations in each condition (i.e., the number of questions/screens), the average time to answer a question by condition, and the average satisfaction rating for each condition.

Table 3. Experiment 1 metrics by condition

Burden as Measured By Time to Complete (Efficiency).

Controlling for the type of data requested (whether it was basic information like name or address, or whether it was a complex question that required more thought), and using a mixed model predicting the log of time with a random effect for the participant, there was no difference in the amount of time spent on a screen by condition (F = 0.24; p = 0.9) for the older adults. When modeling with the question number instead of the type of data requested, the pattern of results was unchanged. When including the younger adults in the model, the pattern of results was unchanged and there was no interaction between the age cohort and the condition. Using a t-test, we found that the older adults took on average 12 seconds longer (p < 0.001) to complete each screen in the survey than the younger adults.

Satisfaction Scores By Condition.

Satisfaction was measured on a 5-point scale where 1 was very easy and 5 was very difficult. We found no differences in satisfaction scores by condition (χ2(16) = 11.4, p = 0.8) when combining the older and younger adults and there was no difference by condition for older adults (χ2(12) = 5.8, p = 0.9) or younger adults (χ2(16) = 20.1, p = 0.2).

Label Position Preference.

At the end of the experiment, participants were shown each of the designs, and we asked them which one(s) they preferred. Participants could choose one or more designs so the percentages below do not sum to 100. Results for older adults were:

  • 43% preferred the inline labels that move;

  • 42% preferred the label above the box;

  • 14% preferred the label to the left of the box, left aligned;

  • 11% preferred the label to the left of the box, right aligned; and

  • 6% preferred the label to the right of the field.

4 Experiment 2: Use of Character Countdowns

4.1 Designs Tested in the Experiment

In this experiment, we tested a character countdown feature that shows 0 characters left and stops accepting characters into the box once the field contains the maximum number of characters. The survey contained six open-text questions with different character limits, asked in this order: Ancestry (20 characters); Kind of work (200 characters); Employer’s main business (35 characters); Main reason left job (15 characters); How you search for work (30 characters); and What you did yesterday (250 characters). Each question was on its own screen, and the field size matched the number of characters allowed. Predictive text was offered for all questions in all conditions.

We used a between-subjects experimental design with three conditions and 40 participants. Condition 1 did not include a character countdown feature on the screens (Fig. 6); condition 2 included a character countdown feature left-justified below the field (Fig. 7); and condition 3 included a character countdown feature left-justified above the field (Fig. 8). Our hypothesis was that more information would be typed with the character countdown feature because the number gives a sense of how much text is wanted; that the countdown would reduce the number of people who tried to type more than the field allowed; and that it would minimize the number of changes made within the text field.

Fig. 6. No countdown

Fig. 7. Countdown below

Fig. 8. Countdown above field

After the six survey questions, satisfaction data were collected. The participant was asked to rate how easy or difficult it was to complete the survey and to enter the answers. Both questions were on a 5-point scale with the endpoints labeled 1 = Very Easy and 5 = Very Difficult. The final tasks were for the participant to select the type of input field design he or she had used (all three conditions were shown) and which one he or she preferred. We refer to the second-to-last task, in which the participant had to select which input field design he or she used, as the memory task. This task was included as another measure of whether participants were aware of the feature.

4.2 Analysis Methods

The app collected behavioral measures, including time on screen, the number of characters submitted for each question, whether the user tried to type when out of room in the field, how many times the user selected a predictive text word for the question, and the number of characters deleted while typing an answer. We used two criteria to measure respondent burden, both of which address efficiency. First, burden was operationalized as time to complete (both the amount of time on the screen and the amount of time typing in the field). Second, we considered erasing characters in the text input field (whether by backspacing or by selecting and deleting) to be an indicator of burden. To measure accuracy, we counted the final number of characters entered into the field and submitted. We tabulated satisfaction scores and preference data for each of the three conditions. We measured awareness of the character countdown feature by capturing whether the participant ran out of room in the field and kept trying to type and by calculating how many participants selected, in the memory task, the input field design they had actually seen during the session.

We modeled the log of the two time variables at the question level using a mixed model. Modeling at the question level increases the number of observations from 40 to 240 (40 participants × 6 questions) and allows us to account for the different maximum field lengths. We modeled the log of time because the residuals of the model with the untransformed time variables were slightly skewed. In the model, we controlled for the condition, the question number, and the number of characters submitted (because more characters might mean more time). To control for any participant effect, because each participant would contribute up to six observations (one for each question), we included a random effect for the participant. We tried several other models, including treating question number as an ordinal variable and models with interactions, and the results did not change.

We modeled the number of characters typed at the question level using the same type of mixed model, but we removed the last question’s data because a programming error allowed users to type more than the limit. Because we dropped a question, only 200 observations were available for that model. We controlled for the condition, the number of predictive text words used, and the question number, and added a random effect for the participant.

We used a Chi-square statistic for the remaining analyses: (1) did backspacing and/or deleting characters differ by condition; (2) did typing more than allowed in a field differ by condition; (3) did performance on the memory task differ by condition; and (4) did satisfaction differ by condition. We then followed up significant Chi-square results with logistic regressions to determine the direction of the effect. Finally, we tallied design preferences.
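A sketch of this two-step approach in Python is shown below: a Chi-square test of independence on a condition-by-outcome table, followed by a logistic regression with question number as a covariate to establish the direction of a significant association. The column names and data file are hypothetical, not the authors’ original analysis code.

```python
# Sketch of the Chi-square test with a logistic-regression follow-up
# (hypothetical column names and file; not the authors' original code).
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

# df: one row per participant x question, with columns
#   condition (1-3), deleted_any (0/1), question (1-6)
df = pd.read_csv("experiment2_behaviors.csv")   # hypothetical file

# (1) Does deleting characters depend on condition?
table = pd.crosstab(df["condition"], df["deleted_any"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3f}")

# (2) If significant, determine the direction with a logistic regression,
#     adding question number as a covariate.
logit = smf.logit("deleted_any ~ C(condition) + question", data=df).fit()
print(logit.summary())
```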

4.3 Results

Table 4 includes the results of the experiment, including how many participants received each condition, the average time to answer a question by condition, the percent of questions where at least one character was deleted while typing, and the average satisfaction rating for each condition.

Table 4. Experiment 2 metrics by condition

Burden as Measured By Time to Complete (Efficiency).

Modeling log of time to complete and the log of time to type at the question level, we did not find a difference by condition for either time variable (log of screen completion time: F = .32, p = .73; log of typing time: F = .54, p = .58) with question number and the number of characters submitted as covariates and with a random effect for the participant.

Burden as Measured By Deleting Characters (Efficiency).

Participants in condition 1 deleted at least one character while typing in 55% of the questions, while in both character countdown conditions, participants deleted at least one character in 37% of the questions. A Chi-square test showed that deleting characters depended on the condition (χ2(2) = 6.9, p = .03). Using a logistic regression model with question number as a covariate, we found that participants were more likely to delete a character when typing in the condition without the character countdown than in either of the two conditions with a character countdown (p < .01).

Accuracy (Effectiveness).

There was no difference (F = .39, p = 0.68) in the number of characters typed by condition using a mixed model controlling for question number, the number of predictive words used, and adding a random effect for the participant.

Satisfaction.

Chi-square results reveal no difference in satisfaction ratings between conditions (χ2(6) = 6.0, p = 0.5).

Preference for Character Countdowns.

At the end of the experiment, participants were shown each of the designs and asked which one they preferred. Results were:

  • 25% preferred the condition with no character countdown;

  • 25% preferred the condition with the character countdown below the field;

  • 20% preferred the condition with the character countdown above the field; and

  • 30% had no preference.

However, several participants who selected the countdown above the box commented that when the countdown was below the box, they would not be able to see it once the keyboard appeared, because the keyboard might cover the lower half of the screen, including the character countdown.

Awareness of Character Limits.

Based on how few participants selected the correct design used and how many tried to enter more information than the field would hold, we conclude that not everyone was aware of the countdown feature.

Identifying the Design Used: Only 8% of the participants who received the character countdown above the field correctly recalled their condition when presented with a picture of each of the designs. Forty-six percent of participants who received the character countdown below the field correctly selected that picture when asked which design they used. Fifty percent of the participants who did not receive a character countdown reported that they did when asked to select the design they used.

Trying to Type More than Allowed: There was no significant difference between the conditions in the number of times participants tried to type more than allowed (χ2(2) = 1.5, p = .47). Without a character countdown, participants attempted to type more than allowed in 24% of the questions; when the countdown was below the field, they attempted to type more than allowed in 21% of the questions and when it was above the field, they attempted to type more than allowed in 16% of the questions.

5 Experiment 3: Use of Predictive Text

5.1 Designs Tested in the Experiment

Experiment 3 focused on predictive text. The research question was whether predictive text within an open-ended survey question improved the respondent experience and led to higher data quality compared with the same question without predictive text.

We used a within-subjects experimental design to investigate this question. The survey had two questions: the person’s race and the medicine the person uses. Because the population was older adults, we expected most individuals to be taking at least one prescription. One question had predictive text and the other did not.

Participants were randomly assigned to one of four question presentations to counterbalance order effects. Thirty-seven participants completed the two-question survey: 6 had predictive text on the race question asked first; 8 had predictive text on the medicine question asked first; 14 had predictive text on the race question asked second; and 9 had predictive text on the medicine question asked second. In total there were 74 data points. See Figs. 9, 10, 11 and 12.

Fig. 9. Race with predictive text

Fig. 10. Race without predictive text

Fig. 11. Prescription with predictive text

Fig. 12. Prescription without predictive text

After the two survey questions, we measured satisfaction by asking the participant to rate how easy or difficult it was to complete the survey on a 5-point scale with the endpoints labeled 1 = Very Easy and 5 = Very Difficult. The final task was for the participant to indicate whether he or she preferred predictive text or not. The participants were shown a sheet of paper with screenshots of the race question with predictive text showing and without predictive text and asked which they preferred. Then, the participant was directed to the same preference question on the phone and asked to touch the option they preferred. We used this two-step approach because, for preference questions, participants found it easier to form a preference when they could see both choices at the same time, which was not possible on the phone because of scrolling.

5.2 Analysis Methods

The app collected behavioral measures, including time on screen, the number of touches per screen, and use of the backspace key. A touch was captured on any part of the screen except the QWERTY keyboard. The minimum number of touches for a question without predictive text was two: one to enter the field and one for the Next button. The minimum number of touches if the participant used the predictive text was three: one to enter the field, one to touch the predicted word, and one for the Next button. The app also captured the race and medicine answers that had been entered and the participant’s satisfaction score and design preference.

Burden was operationalized as time to complete the task (the amount of time on the screen) and the number of touches per screen. We determined whether there were extra touches by taking the minimum number of touches for each factor (predictive or not predictive) and creating, for each factor, an indicator of whether the number of touches equaled the minimum or exceeded it, as sketched below.
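A minimal sketch of this derived indicator follows, assuming the per-factor minimums described above (two touches for a plain field, three when a predicted word is selected); the names are hypothetical.

```python
# Sketch of the "extra touches" indicator (hypothetical names).
# Assumed minimums, per Sect. 5.2: 2 without predictive text (enter field, Next),
# 3 when a predictive-text word is selected (enter field, word, Next).
MIN_TOUCHES = {"no_predictive": 2, "predictive": 3}

def extra_touches(factor: str, touches: int) -> bool:
    """True if the participant touched the screen more than the minimum for this factor."""
    return touches > MIN_TOUCHES[factor]

print(extra_touches("no_predictive", 2))  # False: exactly the minimum
print(extra_touches("predictive", 5))     # True: more than the minimum
```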

We used two criteria to measure accuracy. First, accuracy was measured by whether the entered data matched a real word or not: the race and medicine answers were coded by a reviewer into two categories, matched and not matched, which we assumed meant accurate and not accurate. Second, accuracy was measured by the use of backspace, with more backspacing suggesting more error.

Satisfaction scores were not particularly meaningful because the design was within subjects and each participant saw both factors; even so, there were no differences in those scores by randomization order (p > 0.8).

We modeled the log of time at the question level using a mixed model; the model with the untransformed time variable produced slightly skewed residuals. A total of 72 observations, rather than 74, was used in the model: two observations were dropped because they were more than three standard deviations from the mean time. In the model, we controlled for the factor (predictive text present or absent) and the question content (race or medicine).
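The trimming rule and model specification can be sketched as follows (hypothetical file and variable names; the random intercept for participant is an assumption, mirroring the models in the earlier experiments).

```python
# Sketch of the 3-SD trim and the Experiment 3 time model (hypothetical names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per participant x question, with columns
#   time_sec, predictive ('yes'/'no'), content ('race'/'medicine'), participant
df = pd.read_csv("experiment3_screen_times.csv")   # hypothetical file

# Drop observations more than three standard deviations from the mean time.
mean, sd = df["time_sec"].mean(), df["time_sec"].std()
df = df[(df["time_sec"] - mean).abs() <= 3 * sd]

df["log_time"] = np.log(df["time_sec"])
result = smf.mixedlm(
    "log_time ~ C(predictive) + C(content)",  # factor and question content
    data=df,
    groups=df["participant"],                 # assumed random intercept per participant
).fit()
print(result.summary())
```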

We used a Chi-square statistic for the remaining analyses: (1) did the number of additional touches differ by factor, and (2) did backspacing differ by factor. Finally, we tallied design preference.

We measured awareness of the predictive text feature by examining how many participants used the feature both in this experiment and in Experiment 2, where the feature was present on all questions.

5.3 Results

Table 5 includes the results of the experiment, including how many participants received each factor, the average time to answer a question by factor, the percent of questions where at least one character was deleted while typing, and the average satisfaction rating for each factor.

Table 5. Experiment 3 metrics by factor

Burden Operationalized as Time to Complete (Efficiency).

Modeling the log of time to complete at the question level, we did not find a difference by factor (predictive text or not) with the question content as a covariate (F = 1.5, p = 0.2). It took significantly longer to answer the medicine question than the race question (F = 13.2, p < .001). A previous model demonstrated that there was no interaction between the question content and the availability of the predictive text (F = .09, p = 0.8).

Burden Operationalized as Number of Touches on Screen (Efficiency).

The number of extra touches on the screen was not significantly different by factor (χ2(1) = 1.0, p = 0.3).

Accuracy Operationalized as Codable/Uncodable Answers (Effectiveness).

Responses to the race and medicine questions could be matched to real words (i.e., coded) most of the time. Ninety-one percent of the answers to questions where predictive text was available matched a real word, and 81% of the answers to questions without predictive text matched a real word. Overall, 14% of the responses did not match a real word, but this did not differ by factor (predictive text available or not available) (χ2(1) = 1.6, p = 0.2).

Accuracy Operationalized as Use of Backspace.

Backspace was used in 24 out of 72 questions, or 33% of the time. Use of the backspace key also did not differ by whether the question had predictive text available or not (χ2(1) = .03, p = 0.9).

Overall Preference.

Almost 76% of the 37 participants preferred the predictive text compared with just over 24% that preferred the text input fields without the predictive text option.

Knowledge of Predictive Text Options.

Examining the Experiment 2 data, we conclude that most older adults were familiar with predictive text. While Experiment 2 did not evaluate predictive text, that feature was offered on all questions. In that experiment, only 3 of the 40 participants never used predictive text. However, predictive text was used in only 56% of the questions, so even when participants knew it was there, they did not necessarily use it.

In Experiment 3, not all participants who could have used the predictive text feature did so. The feature was used in only 22 of the 35 questions (63%). Use of the feature did not differ by question content (χ2(1) = .09, p = 0.8) or by whether the predictive feature was available on the first or the second question (χ2(1) = 0.36, p = 0.5). When modeling time for only the questions with predictive text available, time differed greatly by whether the predictive text was used: questions where it was used took only 11 s (SE = 2) (n = 22), while questions where it was not used took 26 s (SE = 7) (n = 13) (t = 2.06, p < .05). However, this difference could be due to participant characteristics that we could not measure. Based on the Experiment 2 data, we believe participants knew about the predictive text.

6 Discussion

The purpose of this research was to learn more about how to design text input fields in mobile web surveys for older adults. Results from the first experiment confirmed our hypothesis that label placement does not affect time on the screen and that labels located above the field (Fig. 1) and the newer inline labels that move (Fig. 2) are preferred. Other researchers have also found that label placement does not affect time on task on PCs but that labels above the field are preferable and lead to fewer eye movements [4, 5]. While we did not use eye tracking in these experiments and our participants were older adults, we nonetheless draw the same conclusion for field labels on mobile phones. Because we also gathered data from younger participants, we can conclude that the time-on-task finding does not differ by age of the user, although older adults were slower than younger adults during survey completion with text input fields regardless of label placement. These results contribute to the literature in two ways. First, we found respondent interaction with field labels to be the same on smartphones as on PCs, suggesting that device size does not matter for optimal label placement. Second, we found a preference for inline labels that move, which have the benefit of saving space without relying on short-term memory.

To our knowledge, the results from Experiment 2 are the first to show that while character countdowns are not explicitly noticed by older adults, they do appear to reduce the erasing of letters (or words), which is in keeping with our hypothesis. Surprisingly, however, character countdowns did not lead to a reduction in the amount of time spent within the field, nor did they increase the amount of information shared. While fewer participants were aware of the feature when it was above the field than when it was below it, we did hear a few participants comment, while selecting the design they preferred, that they would not be able to see the feature below the field once the keyboard appeared because the keyboard would cover the countdown. This suggests that for smartphones, it might be a good idea to keep the countdown feature above the field.

Our results from the third experiment show that there were no real time differences between questions with predictive text offered and those without it. This contrasts with the work done on PCs by [14], where the predictive text feature took longer than plain text. The difference may be due in part to the mechanism for using predictive text on a PC: using a keyboard and mouse may make selecting a choice more difficult than the single touch needed on a mobile phone’s virtual keyboard. Our timing finding also contrasts with the prediction by [13] that predictive text would save time on mobile. It is possible that this, too, was due to the mechanism used in the early 2000s, when mobile keypads associated at least three letters and one number with each button. Those researchers did not use a full virtual keyboard as was used in this research. Virtual keyboards on mobile devices are now the norm, and so our timing data did not follow the expectations of [13].

However, when predictive text was offered, participants spent less time per question when they used the predicted words than when they typed their whole response, which is perhaps what [13] was suggesting. This finding, combined with the overwhelming share of users who preferred the predictive text option, suggests that survey designers should offer predictive text on questions that would benefit from it (e.g., open-ended questions with a finite list of response options, such as a list of prescription drugs).

Contrary to our hypothesis, there were no differences in the number of errors made, or in the number of codable terms, with or without predictive text offered. This may be due to a ceiling effect, as the percentage of codable answers was high across the board. Future research should test additional open-ended questions with long finite lists of items to see if those coding results still hold.

7 Conclusions

This research set out to examine three features of open-text fields in surveys accessed by mobile phones and whether they lead to more efficient and accurate data entry and greater satisfaction among older adults. We found that placing field labels above the field or using inline labels that move was preferred by older and younger adults; that not all older adults were aware of a character countdown feature, but having that feature reduced the erasing of characters; and that including a predictive text option does not necessarily affect overall time or accuracy when offered, but it is a preferred option.

8 Limitations

Our research used a convenience sample of older adults who traveled to community centers, owned smartphones for a year or more, and lived in the Washington, D.C. metropolitan area. While we are unaware of any regional differences in smartphone use, it could be that older adults who cannot travel outside of their homes would behave differently.

Disclaimer.

This report is released to inform interested parties of research and to encourage discussion. The views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.