1 Introduction

According to the embodied cognition theory, people ground both concrete and abstract concepts in sensory-motor stimuli [2, 15, 16]. While this theory requires further research to be fully corroborated, considerable evidence supporting it comes from behavioral studies. Experiments demonstrate relationships between, inter alia, social relations and space, emotions and temperature, or love and a journey (Footnote 1) [16, 24]. In particular, they reveal that people rely on space in order to conceptualize time (Footnote 2) [4]. They also show that time organization along axes and people’s preferences toward temporal data presentation differ widely [5, 7, 8, 27]. Nevertheless, three main classes of time arrangements can be distinguished (ordered by their usage frequency): (1) along the horizontal axis from left to right, (2) along the horizontal axis from right to left, and (3) along the vertical axis from top to bottom.

The first, common among languages written from left to right (e.g. English, French, or Polish), positions the past on the left side and the future on the right side of the horizontal axis [4, 7, 8, 22, 24, 27]. The second, typical of languages written from right to left (e.g. Arabic, Urdu, or Hebrew), also locates time on the horizontal axis, but with the past on its right side and the future on its left side [7, 21, 22, 27]. The third, observed in languages traditionally organizing text in columns (e.g. Mandarin or Cantonese Chinese), places the past at the top and the future at the bottom of the vertical axis [3, 5, 8, 18].

At the same time, evidence from numerous studies shows significant gender differences in visuospatial abilities. Males are found to perform better than females across a variety of tasks investigating spatial perception, mental rotation, spatial visualization, generation and maintenance of a spatial image, and spatiotemporal abilities. On average, females slightly outperform males only in tasks requiring memory for object identity or for an object’s spatial location. Moreover, men and women tend to use different strategies when navigating through space or solving visuospatial problems: women are more likely to attend to landmarks, whereas men tend to rely on directional cues and distances [9].

The above-mentioned findings raise the question of whether gender differences in visuospatial abilities affect the productivity of interactions with temporal data. In this paper we argue that research on gender differences is crucial to foster progress in brain and cognitive science, as well as in fields drawing from them, particularly in human-computer interaction and education. In the experiment presented here we therefore aim to systematically assess the evidence for gender differences in the flexibility of interactions with time-oriented data visualizations.

The rest of the paper is organized as follows. Firstly, in Sect. 2, we describe the methodology of the experiment we conducted. Then, in Sect. 3, we detail the results. Section 4 discusses the findings. Finally, Sect. 5 concludes the paper.

2 Methodology

The experiment we conducted involved temporal reasoning over simple schedule visualizations. The mechanism through which people conceptualize time is not yet fully understood. However, in line with previous comparative linguistics and cognitive science findings, we expected to observe an interaction between time arrangement adaptation and participant’s gender [9, 20]. We hypothesized that, in terms of judgment reaction time in the non-adapted condition:

  • Males would outperform females if both sexes mainly rely on mental timeline rotation or on the generation and maintenance of a preferred timeline,

  • Females would outperform males if both sexes mainly rely on memorization of objects (here, events on a timeline) and their spatial locations.

We also speculated that the accuracy of such inferences would not be affected by time spatialization manipulations.

Participants. One hundred sixty-two individuals who reported not being multilingual participated in the experiment in exchange for payment. We excluded multilingual respondents from the study because, according to comparative linguistics findings, they can flexibly accommodate multiple time representations (i.e. different timelines) [3, 5, 18]. Consequently, they might not respond to the experimental stimuli, which could distort the analysis results. Moreover, in the course of the data cleaning process, we discarded from further analysis the responses of 12 individuals (four women and eight men), thus reducing the participant pool to 150.

Of the remaining 150 respondents, 56 were female and 94 male. They ranged in age from 15 to 60 (\(\bar{x}_F=30.50, s_F=9.41; \bar{x}_M=30.43, s_M=9.45\); Footnote 3). The female subjects held 18 distinct nationalities and were living in 13 different countries at the time of the study, whereas the male subjects were of 21 nationalities and resided in 17 countries. The women spoke 24 different languages and the men 33. Thus, we managed to gather a culturally and linguistically diverse sample. Participants from both groups shared a similar educational background. About 77 % of them had graduated from a university: 52 % earned a bachelor’s, 20 % a master’s, and 5 % a doctorate degree. All the participants reported using a computer and the Internet on a daily basis.

Materials. In this experiment we employed 56 items out of the original 104-item Space, Time, and Agents test developed by Kessel [11, 12]. In the test, the target stimuli consist of 2 sets of 4 schedules each, visualized using a matrix. On each trial, a respondent has to evaluate, based on the information in the visualization, a true-false statement displayed under the matrix (Fig. 1). For the purposes of this experiment we considered only utterances involving time-related judgments. Hence, each set of schedules corresponded to 28 such statements, half of which were true.

Fig. 1. Example experimental stimulus

For the sake of simplicity, the test treats time as a 4-valued ordinal variable (morning, noon, afternoon, evening), and space and agents as 4-valued categorical variables (dorm, library, bookstore, gym and Justin, Alex, Sammy, David, respectively). Columns and rows of the matrices represent either locations or times. Line-based encoding of agents on schedules can outperform a dot-based one in tasks requiring analysis of time sequences or identification of time trends [11, 12]. As the test task involves inferences over temporal statements (including those regarding sequences and trends), we initially used both line-based and dot-based visualizations. However, the ramp-up phase of the experiment revealed that using both can confuse the subjects. Thus, although the original test uses both line-based and dot-based visualizations to denote agents, we used only color-coded dots to avoid introducing potential learning effects.

Design. In order to evaluate the impact that time spatialization can exert on the productivity of interactions with time-oriented data visualizations depending on gender, we manipulated the time arrangement along the axes of the matrix-based schedule visualizations [11, 12] (Footnote 4). In this two-factor design (time spatialization manipulated within subjects, gender varying between subjects), we compared females’ differences in performance between the adapted condition (time arranged according to the given user’s preferences) and the not-adapted condition (time arranged against those preferences) with those of males.

To quantify the effectiveness of inferences, we measured response time and response accuracy. We gauged users’ preferences toward different time arrangements using a self-developed 2-item 7-point Likert scale, where 7 denoted the most preferable time spatialization and 1 the least preferable [20]. Table 1 presents an overview of the evaluation metrics used in the experiment.

Table 1. Summary of evaluation metrics used in the experiment

The experiment was run online. We recruited the participants via CrowdFlower and Facebook advertisements. Both the recruitment advertisements and the experiment instructions invited contributions from all but multilingual individuals. They revealed only a partial goal of the experiment: an analysis of the effectiveness of time-oriented data visualizations. Neither the advertisements nor the experiment instructions mentioned gender differences or the impact of time spatialization on performance, in order to avoid potential bias (for instance, bias introduced via stereotype activation) [10, 17, 19, 26]. We revealed the real purpose of the study at the debriefing stage.

The experiment design complies with the suggestions on conducting human-subjects experiments on online labor markets proposed by Komarov [14]. We recognized the potential impact of the input device on performance, yet we did not have enough data to control for it. Thus, we reduced the participant pool to only those who used a mouse as the pointing device. Further, we automatically excluded from the analysis observations collected from participants whose cognitive performance could have been negatively influenced. Namely, we discarded respondents who reported: (1) having a disability or technical problem potentially impairing their performance in the experiment; (2) sleeping less than 6 h or more than 8 h the night before [1, 6, 13]; (3) drinking a substantial amount of alcohol or taking drugs (e.g. strong painkillers) within 12 h before the experiment; (4) multitasking while completing the experimental task or engaging in any other activity conflicting with their productivity. We reduced the technical requirements of the study to having a modern browser installed. Since crowdsourcing platforms require such an installation, we did not need to verify it.

Procedure. The experiment consisted of four phases: (1) introduction; (2) usability test; (3) survey; and (4) closing. Each session began with a formal introduction presenting the study goals, explaining its routine, and providing links to additional information. After passing the screening questions, respondents had to accept the informed consent form in order to proceed to the actual test.

The group assignment questionnaire opened the second phase of the experiment. It established the participant’s preferences toward time spatializations, assessed his/her reading speed, and controlled for factors that could negatively influence performance. The first block of the usability test began after completion of a 4-trial training session which ensured comprehension of the task instructions. Following an optional pause, participants continued with the second block of the test. Instruction cues directed subjects to complete the task as quickly, yet as accurately, as possible. We counterbalanced the order of conditions (adapted vs. not-adapted) across participants to minimize learning effects, and we randomized the sequence of questions within each condition to avoid ordering effects (see the sketch below). Lastly, we conducted the experiment entirely in English to reduce the effects of proximal language context [8, 18, 22]. We chose English for experimentation purposes because a great deal of software and Web pages is still available exclusively in English. Furthermore, people often use it as a lingua franca.
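For illustration, one simple way to realize such counterbalancing and within-condition shuffling could be the following sketch. The variable names and the alternating assignment rule are assumptions for illustration only, not necessarily the exact scheme used in the experiment.

```python
import random

def build_session(participant_index, questions_adapted, questions_not_adapted):
    """Assign a condition order and shuffle the questions within each block."""
    # Alternate the order of conditions across consecutive participants.
    if participant_index % 2 == 0:
        order = ["adapted", "not-adapted"]
        blocks = [list(questions_adapted), list(questions_not_adapted)]
    else:
        order = ["not-adapted", "adapted"]
        blocks = [list(questions_not_adapted), list(questions_adapted)]
    # Randomize the question sequence independently within each condition.
    for block in blocks:
        random.shuffle(block)
    return list(zip(order, blocks))
```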

In the third phase, we administered a survey collecting information on respondents’ demographic, cultural, and linguistic background. Finally, we finished each session with a thank-you note and a debriefing. At this stage, we also encouraged participants to give us feedback and, if they had engaged in any activity that could potentially have influenced their productivity in the experiment (e.g. multitasking, answering randomly without analyzing the questions, or pausing the test for a moment, for instance to answer the phone), to report it by checking a dedicated box.

3 Analysis and Results

Data collected using Web-based experiments, especially crowdsourced ones, suffer from the extreme-outlier problem. Thus, we performed outlier detection based on two measures: overall task completion time and the geometric mean of task completion time. Outlier removal procedures proposed to clean crowdsourcing data (usually relying on the standard deviation or on the IQR, the inter-quartile range) rarely guard against nonsensical observations without reducing legitimate data diversity. Hence, we first flagged potential outliers using the standard IQR-based interval \([Q_1 - 3(Q_3-Q_1), Q_3+3(Q_3-Q_1)]\), where \(Q_1\) and \(Q_3\) refer to the first and third quartile respectively [14]. Then, we manually examined all the observations flagged this way, as well as observations whose value was smaller than the IQR, by comparing and contrasting them with data on participants’ reading speed and the amount of text presented during the test.
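A minimal sketch of this flagging step, assuming the completion times of one experimental condition are stored in a pandas Series named `times` (a hypothetical name), could look as follows:

```python
import pandas as pd

def flag_potential_outliers(times: pd.Series) -> pd.Series:
    """Flag observations outside [Q1 - 3*IQR, Q3 + 3*IQR] for manual review."""
    q1, q3 = times.quantile(0.25), times.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
    # Flagged observations are candidates for inspection, not automatic removal.
    return (times < lower) | (times > upper)
```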

The IQR-based method identified 4 potential outliers: 2 in the upper tail of the overall task completion time and 2 in the upper tail of the geometric mean of task completion time. They all turned out to correspond simply to slower-reading participants, so we decided to keep these observations. However, inspection of the overall task completion times substantially smaller than the IQR in the given experimental condition revealed 2 participants who completed the test in less than a minute. Although, theoretically, a person reading over 1000 words per minute could achieve such a result, according to our reading speed test those participants were only average readers (they read at a rate of 185 and 220 words per minute). Thus, we classified them as extreme outliers and removed them from further analysis.

Visual inspection of the data (using histograms, density plots, and Q-Q plots) showed that the overall task completion time, its geometric mean, and the number of correct answers are all strongly right-skewed. Quantitative analysis of those data using the skewness measure (\(g_1 > 1\) for all variables) and the Anderson-Darling test (\(p < 0.05\) for the geometric mean of task completion time; \(p < 0.001\) for the remaining variables) confirmed these observations. Consequently, we rejected the assumption of normally distributed data. Moreover, due to the relatively small sample size and its imbalance (56 females vs. 94 males), we decided to use non-parametric methods to analyze the data: (1) the one-sided Wilcoxon signed-rank test for matched pairs to test for the superiority of adapted time visualizations over not-adapted ones and (2) the Mann-Whitney U test to verify the hypotheses of the existence of gender differences in the flexibility of interactions with time-oriented data visualizations. We set the statistical significance level at \(\alpha = 0.05\). Where required, we adjusted p-values using the Benjamini-Hochberg correction.
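The following sketch illustrates these checks and the paired comparison using SciPy and statsmodels. The array names `adapted` and `not_adapted` are assumptions, and SciPy’s Anderson-Darling routine reports critical values rather than a p-value, so the decision rule differs slightly from the one reported above.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def distribution_checks(x: np.ndarray) -> None:
    """Print skewness and the Anderson-Darling normality statistic."""
    print("skewness g1:", stats.skew(x))
    ad = stats.anderson(x, dist="norm")
    print("A-D statistic:", ad.statistic, "critical values:", ad.critical_values)

def adapted_vs_not_adapted(adapted: np.ndarray, not_adapted: np.ndarray):
    """One-sided Wilcoxon signed-rank test: are adapted-condition times shorter?"""
    return stats.wilcoxon(adapted, not_adapted, alternative="less")

# Benjamini-Hochberg adjustment over the collected family of p-values, e.g.:
# rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```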

We observed a significant effect of time spatialization adaptation on response time (\(V = 3472.5\), \(p \approx 5.99\times 10^{-5}\) for the overall task completion time; \(V = 4092\), \(p \approx 0.002\) for its geometric mean). Accuracy (measured by the number of correct answers), however, remained unaffected by time arrangement changes (\(V = 3971.5\), \(p \approx 0.47\)).

To examine the flexibility of interactions for gender differences, we first transformed the original data so that it met the requirements of the Mann-Whitney U test. Specifically, we prepared the flexibility data by subtracting the results each participant obtained under the not-adapted condition from those obtained under the adapted condition. Since participants performed better under the adapted condition, the obtained flexibility data were generally left-skewed. Table 2 shows the main characteristics of these data.

Table 2. Main characteristics of flexibility data

The Mann-Whitney U test for differences in medians assumes homogeneity of variance between the groups. Furthermore, it requires the group distributions to have similar shapes. To verify these assumptions, we used the Fligner-Killeen test, which is robust against departures from normality, and the two-tailed Kolmogorov-Smirnov test, respectively. Since the non-directional Kolmogorov-Smirnov test for two independent samples is sensitive to any kind of difference between distributions (i.e. in location, dispersion, skewness, or kurtosis), we centered and rescaled the data before running the test. To normalize the data we used the following formula:

$$\begin{aligned} \frac{X-\mathrm{median}(X)}{\mathrm{MAD}(X)}, \end{aligned}$$

where X denotes a random variable and MAD denotes the median absolute deviation of that variable.
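A minimal sketch of these assumption checks and the subsequent Mann-Whitney U test, assuming the per-participant flexibility scores (adapted minus not-adapted) are split by gender into arrays `flex_f` and `flex_m` (hypothetical names):

```python
import numpy as np
from scipy import stats

def robust_scale(x: np.ndarray) -> np.ndarray:
    """Center on the median and rescale by the median absolute deviation."""
    return (x - np.median(x)) / stats.median_abs_deviation(x)

def gender_difference_tests(flex_f: np.ndarray, flex_m: np.ndarray):
    # Fligner-Killeen test of homogeneity of variance (robust to non-normality).
    _, fligner_p = stats.fligner(flex_f, flex_m)
    # Two-sided Kolmogorov-Smirnov test on centered and rescaled data,
    # so that it mainly reflects differences in shape.
    _, ks_p = stats.ks_2samp(robust_scale(flex_f), robust_scale(flex_m))
    # Mann-Whitney U test for a gender difference in flexibility.
    mw = stats.mannwhitneyu(flex_f, flex_m, alternative="two-sided")
    return fligner_p, ks_p, mw.pvalue
```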

We failed to reject the hypothesis of equality of variances for both the response time and the accuracy data (\(p > 0.05\) for all variables). Further, the data provided insufficient evidence (at \(\alpha = 0.05\)) to reject the hypotheses of similar shapes. Thus, we conducted the Mann-Whitney U test for differences in medians. We found no significant difference between females and males either in median response time or in median accuracy (all \(p > 0.05\); \(r \approx -0.05\)). We summarize the test results in Table 3.

Table 3. Results of tests for gender differences in the flexibility of interactions with time-oriented data visualizations

4 Discussion

We addressed and empirically investigated the problem of gender differences in the productivity of interactions with temporal data visualizations. We found insufficient evidence supporting the hypothesis that such differences exist. These results are rather unexpected. On the one hand, a substantial body of research supports the conceptual metaphor theory. On the other hand, gender differences in visuospatial abilities have been consistently reported for decades.

Our results are, however, inconclusive. First, based on one experiment, we cannot exclude a ceiling effect: the task we proposed could have been too easy for the hypothesis tests to yield significant results. Second, we assumed the magnitude and direction of gender differences in the flexibility of interactions with temporal data to be comparable with those of gender differences in visuospatial abilities. Specifically, in line with previous cognitive science findings, we expected to observe at least a medium effect size (\(d \ge 0.4\)). If the real effect size is substantially smaller than \(d = 0.4\), the power of our analysis may be insufficient (see the sketch below). Finally, to show that there are no gender differences we would have to run an equivalence test; our sample is, however, too small to reliably evaluate equivalence hypotheses for non-normally distributed data.
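To give a rough sense of this limitation, the following sketch approximates the power of the group comparison (56 vs. 94 participants) with the independent-samples t-test power function from statsmodels. This is only a proxy for the Mann-Whitney U test (whose asymptotic relative efficiency under normality is about 0.95), so the figures should be read as indicative rather than exact.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.4, 0.3, 0.2):
    # Approximate power for two groups of 56 and 94 at alpha = 0.05.
    power = analysis.solve_power(effect_size=d, nobs1=56, ratio=94 / 56, alpha=0.05)
    print(f"d = {d}: approximate power = {power:.2f}")
```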

If confirmed, the existence of gender differences in the flexibility of interactions with time-oriented data visualizations could affect many areas related to human-computer interaction. It could also improve our understanding of brain and mind processes. Thus, we recommend further research on this matter.

5 Conclusions and Future Work

In this paper, we introduced the problem of gender differences in temporal data analysis and presented the first experimental results examining them. We found no evidence of gender differences in the flexibility of interactions with such data. Thus, we recommend further research to definitively answer the question of whether systems requiring temporal data analysis should provide their users with gender-specific customization elements.