
Task estimation for software company employees based on computer interaction logs

Empirical Software Engineering

Abstract

Digital tools and services collect a growing amount of log data. In the software development industry, such data are integral and carry valuable information on user and system behaviors, with significant potential for revealing trends and patterns. In this study, we focus on one such aspect, namely task estimation. To that end, we perform a case study by analyzing the computer-recorded activities of employees of a software development company. Specifically, our purpose is to identify the task of each employee. We build a hierarchical framework with 2-stage recognition and devise a method relying on Bayesian estimation, which accounts for the temporal correlation of tasks. After pre-processing, we run the proposed hierarchical scheme to initially distinguish infrequent and frequent tasks. At the second stage, infrequent tasks are discriminated from one another such that the task is identified definitively. The higher performance rate of the proposed method makes it favorable over association rule-based methods and conventional classification algorithms. Moreover, our method has significant potential to be applied to similar software engineering problems. Our contributions include a comprehensive evaluation of a Bayesian estimation scheme on real-world data and remedies for several challenges in the data set (samples with different measurement scales, dependence characteristics, imbalance, and insignificant pieces of information).


Availability of data and material

We provide an excerpt of the data at (Yücel 2021) after concealing certain privacy-sensitive information. Since the data set analyzed during the current study contains private information and relates to the performance of the software company, the entire set is available from the corresponding author only on reasonable request.

Code availability

The code generated during the current study is publicly available at our repository (Yücel 2020a).

Notes

  1. Nevertheless, at this stage we consider two kinds of roles, namely software developer and team leader, to demonstrate its potential.

  2. The data collection campaign is carried out with the consent of the company. The subjects are informed in a clear manner about the nature and method of the research, and agreed to participate in the experiments.

  3. Here, deployment refers to active use of the application window, and not to running as a background process.

  4. Time is recorded in YYYY-MM-DD hh-mm-ss format but is illustrated in hh-mm-ss format in Table 1 for the sake of brevity. The name of the subject is replaced with a placeholder name (Zhang) in Table 1 for privacy reasons.

  5. The coder is a senior student at the department of computer science.

  6. Here, documentation refers to reading, writing or editing of project documentation.

  7. For brevity, we use the abbreviations Doc., Admin., Leis., and Prog. in the tables.

  8. These two subjects actively work for 26 days within the experimentation period, and we arbitrarily chose one day.

  9. We carry out this comparison for the two subjects (i.e. the developer and the leader) considered in the analysis.

  10. The bin size of the histograms is set to 1 and no sort of optimization is performed to enhance visualization.

  11. In this table, an entry of asterisk (∗) denotes “any value” (of application or window title), whereas a dash (-) denotes “no value” (i.e. no candidate).

  12. Both inconclusive and uncertain cases are considered to be not-estimated.

  13. Henceforth, we carry the index n of the set of descriptors Λ to the subscript of the probability density function.

  14. The number of actions with alien titles is found to be 1377 for the developer and 514 for the leader.

  15. Since the matrices in Table 5 are symmetric, only the upper triangular parts are presented.

  16. Using only the infrequent tasks implies an inherently low number of samples.

  17. Nevertheless, even when all variables (more relevant and less relevant) are considered, the proposed method still achieves comparable rates.

  18. The number of nearest neighbors is considered as K = 3.

  19. In the pseudo-codes of this section, N denotes the total number of lines in the log file.

  20. In our case, k and r are the number of possible tasks and number of descriptor values, respectively.

  21. In Tables 30–32, we denote the case where no quantile can be computed with ’-’.

References

  • ABB Inc (2017) ABB Dev Interaction Data. https://abb-iss.github.io/DeveloperInteractionLogs/

  • Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. ACM SIGMOD Record 22(2):207–216. https://doi.org/10.1145/170036.170072


  • Ahmed A (2016) Software project management: A process-driven approach. Auerbach Publications

  • Alemdar H, van Kasteren T, Ersoy C (2017) Active learning with uncertainty sampling for large scale activity recognition in smart homes. J Ambient Intell Smart Environ 9(2):209–223


  • Alpaydin E (2016) Machine learning: The new AI. MIT press

  • Amlekar R, Gamboa AFR, Gallaba K, McIntosh S (2018) Do software engineers use autocompletion features differently than other developers? In: International Conference on Mining Software Repositories. IEEE, pp 86–89

  • Anand K, Kumar J, Anand K (2017) Anomaly detection in online social network: A survey. In: Proceedings of International Conference on Inventive Communication and Computational Technologies. IEEE, pp 456–459

  • Bao L, Xing Z, Xia X, Lo D, Hassan AE (2018) Inference of development activities from interaction with uninstrumented applications. Empir Softw Eng 23(3):1313–1351


  • Beller M, Gousios G, Panichella A, Proksch S, Amann S, Zaidman A (2017) Developer testing in the IDE: patterns, beliefs, and behavior. IEEE Trans Softw Eng 45(3):261–284


  • Bernardi S, Domínguez JL, Gómez A, Joubert C, Merseguer J, Perez-Palacin D, Requeno JI, Romeu A (2018) A systematic approach for performance assessment using process mining. Empir Softw Eng 23(6):3394–3441


  • Bogarín A, Cerezo R, Romero C (2018) A survey on educational process mining. Wiley Interdiscip Rev Data Min Knowl Discov 8(1):e1230


  • Brdiczka O (2010) From documents to tasks: Deriving user tasks from document usage patterns. In: Proceedings of International Conference on Intelligent User Interfaces. ACM, pp 285–288

  • Caballé S, Xhafa F (2013) Distributed-based massive processing of activity logs for efficient user modeling in a virtual campus. Clust Comput 16(4):829–844


  • Caldeira J, e Abreu FB, Reis J, Cardoso J (2019) Assessing software development teams’ efficiency using process mining. In: Proceedings of International Conference on Process Mining. IEEE, pp 65–72

  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority over-sampling technique. J Artif Intell Res 16:321–357


  • Chen L, Nugent CD (2019) Sensor-based activity recognition review. In: Human Activity Recognition and Behaviour Analysis. Springer, pp 23–47

  • Chernov S (2008) Task detection for activity-based desktop search. In: Proceedings of International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp 894–894

  • Chernov S, Demartini G, Herder E, Kopycki M, Nejdl W (2008) Evaluating personal information management using an activity logs enriched desktop dataset. In: Proceedings of Personal Information Management Workshop, vol 155. Citeseer

  • Choi H, Lim J, Yu H, Lee E (2016) Task classification based energy-aware consolidation in clouds. Sci Program 2016

  • Coman ID (2007) An analysis of developers’ tasks using low-level, automatically collected data. In: Joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp 579–582

  • Damevski K, Shepherd DC, Schneider J, Pollock L (2016) Mining sequences of developer interactions in visual studio for usage smells. IEEE Trans Softw Eng 43(4):359–371


  • Deisenroth MP, Faisal AA, Ong CS (2020) Mathematics for machine learning. Cambridge University Press

  • Delias P, Doumpos M, Grigoroudis E, Manolitzas P, Matsatsinis N (2015) Supporting healthcare management decisions via robust clustering of event logs. Knowl-Based Syst 84:203–213


  • Devaurs D, Rath AS, Lindstaedt SN (2012) Exploiting the user interaction context for automatic task detection. Appl Artif Intell 26(1-2):58–80


  • Dingsøyr T, Fægri TE, Dybå T, Haugset B, Lindsjørn Y (2016) Team performance in software development: Research results versus agile principles. IEEE Softw 33(4):106–110


  • Dragunov AN, Dietterich TG, Johnsrude K, McLaughlin M, Li L, Herlocker JL (2005) TaskTracer: A desktop environment to support multi-tasking knowledge workers. In: Proceedings of International Conference on Intelligent User Interfaces. ACM, pp 75–82

  • Eclipse Foundation (2010) Filtered UDC Data. http://archive.eclipse.org/projects/usagedata/

  • Embrechts P, Hofert M (2013) A note on generalized inverses. Math Methods Oper Res 77(3):423–432


  • Fernández A, Garcia S, Herrera F, Chawla NV (2018) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905


  • Forsati R, Moayedikia A, Shamsfard M (2015) An effective web page recommender using binary data clustering. Inf Retriev J 18(3):167–214


  • Gatta R, Vallati M, Lenkowicz J, Casà C, Cellini F, Damiani A, Valentini V (2018) A framework for event log generation and knowledge representation for process mining in healthcare. In: Proceedings of International Conference on Tools with Artificial Intelligence. IEEE, pp 647–654

  • Hakim A, Hasibuan M, Andreswari R (2019) E-learning process analysis to determining student learning patterns using process mining approach 1193:1–8

  • Harris D, Harris S (2010) Digital design and computer architecture. Morgan Kaufmann

  • Hochstein L, Basili VR, Zelkowitz MV, Hollingsworth JK, Carver J (2005) Combining self-reported and automatic data to improve programming effort measurement. ACM SIGSOFT Softw Eng Notes 30(5):356–365


  • Jalali A (2016) Supporting social network analysis using chord diagram in process mining. In: Proceedings of International Conference on Business Informatics Research. Springer, pp 16–32

  • Jalote P, Kamma D (2019) Studying task processes for improving programmer productivity. IEEE Transactions on Software Engineering

  • Johnson PM (2007) Requirement and design trade-offs in Hackystat: An in-process software engineering measurement and analysis system. In: Proceedings of International Symposium on Empirical Software Engineering and Measurement. IEEE, pp 81–90

  • Johnson PM, Kou H, Agustin J, Chan C, Moore C, Miglani J, Zhen S, Doane WE (2003) Beyond the personal software process: Metrics collection and analysis for the differently disciplined. In: Proceedings of the International Conference on Software Engineering. IEEE, pp 641–646

  • Kalenkova AA, van der Aalst WM, Lomazova IA, Rubin VA (2017) Process mining using BPMN: relating event logs and process models. Softw Syst Model 16(4):1019–1048


  • Karahasanović A, Heim J (2015) Understanding the behaviour of online TV users. Pers Ubiquit Comput 19(5-6):839–852


  • KaVe Project (2018) Datasets. https://www.kave.cc/datasets

  • Ko AJ, DeLine R, Venolia G (2007) Information needs in collocated software development teams. In: Proceedings of International Conference on Software Engineering. IEEE, pp 344–353

  • Koldijk S, Van Staalduinen M, Neerincx M, Kraaij W (2012) Real-time task recognition based on knowledge workers’ computer activities. In: Proceedings of European Conference on Cognitive Ergonomics, pp 152–159

  • Langhnoja S, Barot M, Mehta D (2012) Pre-processing: Procedure on web log file for web usage mining. Int J Emerging Technol Adv Eng 2(12):419–423


  • Leemans M, van der Aalst WM, van den Brand MG (2018) The Statechart workbench: Enabling scalable software event log analysis using process mining. In: Proceedings of International Conference on Software Analysis, Evolution and Reengineering. IEEE, pp 502–506

  • Maalej W, Ellmann M, Robbes R (2017) Using contexts similarity to predict relationships between tasks. J Syst Softw 128:267–284


  • MacKay DJ (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press

  • Martin N, Solti A, Mendling J, Depaire B, Caris A (2019) Mining batch activation rules from event logs. IEEE Trans Serv Comput:1–1. https://doi.org/10.1109/TSC.2019.2912163

  • Mazza R, Bettoni M, Faré M, Mazzola L (2012) MOCLog - monitoring online courses with log data. In: Proceedings of the Moodle Research Conference, pp 132–139

  • McLeod L, MacDonell SG (2011) Factors that affect software systems development project outcomes: a survey of research. ACM Comput Surv (CSUR) 43(4):24


  • Meyer AN, Barton LE, Murphy GC, Zimmermann T, Fritz T (2017) The work life of developers: activities, switches and perceived productivity. IEEE Trans Softw Eng 43(12):1178–1193


  • Meyer AN, Satterfield C, Züger M, Kevic K, Murphy GC, Zimmermann T, Fritz T (2020) Detecting developers’ task switches and types. IEEE Trans Softw Eng:1–16

  • Mirza HT, Chen L, Hussain I, Majid A, Chen G (2015) A study on automatic classification of users’ desktop interactions. Cybern Syst 46(5):320–341


  • Monden A, Matsumura T, Barker M, Torii K, Basili VR (2012) Customizing GQM models for software project monitoring. IEICE Trans Inf Syst 95(9):2169–2182


  • Montgomery DC, Runger GC (2010) Applied statistics and probability for engineers. Wiley

  • Obregon J, Song M, Jung JY (2019) Infoflow: Mining information flow based on user community in social networking services. IEEE Access 7:48024–48036


  • Oram A, Wilson G (2010) Making software: What really works, and why we believe it. O’Reilly Media Inc

  • Parsons HM (1974) What Happened at Hawthorne?: New evidence suggests the Hawthorne effect resulted from operant reinforcement contingencies. Science 183(4128):922–932


  • Partington A, Wynn M, Suriadi S, Ouyang C, Karnon J (2015) Process mining for clinical processes: a comparative analysis of four australian hospitals. ACM Trans Manag Inf Syst 5(4):19


  • Perry DE, Staudenmayer NA, Votta LG (1995) Understanding and improving time usage in software development. Softw Process 5:111–135


  • Proksch S, Nadi S, Amann S, Mezini M (2017) Enriching in-ide process information with fine-grained source code history. In: Proceedings of International Conference on Software Analysis, Evolution and Reengineering. IEEE, pp 250–260

  • Ramachandran KM, Tsokos CP (2014) Mathematical Statistics with Applications in R. Elsevier

  • Rashid T, Agrafiotis I, Nurse J (2016) A new take on detecting insider threats: Exploring the use of hidden markov models. In: Proceedings of ACM CCS International Workshop on Managing Insider Security Threats, pp 47–56. https://doi.org/10.1145/2995959.2995964

  • Rojas E, Munoz-Gama J, Sepúlveda M, Capurro D (2016) Process mining in healthcare: a literature review. J Biomed Inform 61:224–236


  • Rovani M, Maggi FM, de Leoni M, van der Aalst WM (2015) Declarative process mining in healthcare. Expert Syst Appl 42(23):9236–9251


  • Rovetta S, Cabri A, Masulli F, Suchacka G (2017) Bot or not? A case study on bot recognition from Web session logs. In: Italian Workshop on Neural Nets. Springer, pp 197–206

  • Russo B, Succi G, Pedrycz W (2015) Mining system logs to learn error predictors: a case study of a telemetry system. Empir Softw Eng 20(4):879–927


  • Schönig S, Cabanillas C, Jablonski S, Mendling J (2015) Mining the organisational perspective in agile business processes. In: Enterprise, Business-Process and Information Systems Modeling. Springer, pp 37–52

  • Shen J, Li L, Dietterich TG, Herlocker JL (2006) A hybrid learning system for recognizing user tasks from desktop activities and email messages. In: Proceedings of International Conference on Intelligent User Interfaces. ACM, pp 86–92

  • Shen J, Li L, Dietterich TG (2007) Real-time detection of task switches of desktop users. In: Proceedings of International Joint Conferences on Artificial Intelligence, vol 7, pp 2868–2873

  • Shimizu R, Monden A, Yücel Z, Uwano H (2018) Automatic estimation of software development tasks. In: Proceedings of IPSJ/SIGSE Winter Workshop, vol 2018, pp 30–31

  • Singh V, Pollock LL, Snipes W, Kraft NA (2016) A case study of program comprehension effort and technical debt estimations. In: International Conference on Program Comprehension. IEEE, pp 1–9

  • Soto-Valero C, Bourcier J, Baudry B (2018) Detection and analysis of behavioral t-patterns in debugging activities. In: Proceedings of International Conference on Mining Software Repositories, pp 110–113

  • Suthipornopas P, Leelaprute P, Monden A, Uwano H, Kamei Y, Ubayashi N, Araki K, Yamada K, Matsumoto K (2017) Industry application of software development task measurement system: Taskpit. IEICE Transactions on Information and Systems (3):462–472

  • Tax N, Sidorova N, Haakma R, van der Aalst WM (2016) Event abstraction for process mining using supervised learning techniques. In: Proceedings of SAI Intelligent Systems Conference. Springer, pp 251–269

  • van der Aalst WM (2015) Extracting event data from databases to unleash process mining. In: BPM-Driving Innovation in a Digital World, Springer, pp 105–128

  • Vialardi C, Bravo Agapito J, Ortigosa A (2008) Improving AEH courses through log analysis. Journal of Universal Computer Science

  • Viertel FP, Karras O, Schneider K (2017) Vulnerability recognition by execution trace differentiation. Softwaretechnik-Trends 37(3), http://pi.informatik.uni-siegen.de/stt/37_3/01_Fachgruppenberichte/SSP2017_proceedings/01_Vulnerability_Recognition_by_Execution_Trace_Differentiation.pdf

  • Vijayasarathy LR, Butler CW (2015) Choice of software development methodologies: Do organizational, project, and team characteristics matter? IEEE Softw 33(5):86–94


  • Vuong T, Jacucci G, Ruotsalo T (2017) Watching inside the screen: Digital activity monitoring for task recognition and proactive information retrieval. Proc ACM Interact Mob Wearable Ubiquitous Technol 1(3):1–23


  • Wagner S, Ruhe M (2018) A systematic review of productivity factors in software development. arXiv:1801.06475

  • Wickramasinghe V, Nandula S (2015) Diversity in team composition, relationship conflict and team leader support on globally distributed virtual software development team performance. Strategic Outsourcing Int J 8(2/3):138–155


  • Yücel Z (2020a) Software applications and custom codes. https://github.com/yucelzeynep/Task-estimation-from-activity-logs, 2020-08-09

  • Yücel Z (2020b) Supplemental material on detailed results of alternative methods. https://yucelzeynep.github.io/pub/2020_supp_mat_std_clsf.pdf, 2020-07-09

  • Yücel Z (2020c) Supplemental material on detailed results of the proposed method. https://yucelzeynep.github.io/pub/2020_supp_mat_proposed.pdf, 2020-07-09

  • Yücel Z (2021) Interaction logs of sofware company employees for task estimation. https://doi.org/10.5281/zenodo.4500028

  • Zou L, Godfrey MW (2012) An industrial case study of Coman’s automated task detection algorithm: What worked, what didn’t, and why. In: Proceedings of IEEE International Conference on Software Maintenance. IEEE, pp 6–14


Acknowledgements

We would like to thank Mr. Ryosuke Shimizu for his help in compiling the data set and annotations. We would like to thank Mr. Christian Murphy and Dr. Samantha Stever for their help in proofreading.

Funding

This work was supported by JSPS KAKENHI Grant Numbers JP18K18168 and JP20H05706. The results of this research are funded by Okayama University Dispatch Project for Female Faculties.

Author information

Corresponding author

Correspondence to Zeynep Yücel.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Romain Robbes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendices

Appendix A: Details of Ground Truth Annotations

Table 14 presents details of the ground truth annotations for the two subjects. This table indicates that both the developer and the leader have a highly imbalanced distribution of tasks. Namely, the developer carries out Test 82% of the time, whereas the leader performs Documentation 78% of the time. In addition, Test and Administration are performed by both subjects, but at different rates, whereas Programming is carried out only by the developer, and Leisure and Documentation are carried out only by the leader.

Table 14 Distribution of the ground truth labels assigned by the coder

Appendix B: Details of Association Rules and Key Phrases for Window Titles

Table 15 presents the entire set of association rules defined by the expert. Please note that rule numbers simply refer to the row number in the table and do not have any semantic significance. Similarly, the order of candidate tasks does not have any significance. Note that in the main text we provide some sample rules after translating their window titles to English, for the sake of clarity and uniformity. Here, we provide them in their original form, i.e. including Chinese and Japanese characters.

Table 15 The set of association rules defined by the expert

It can be seen in Table 15 that out of the 20 rules, 13 are definite and 7 are indefinite. Moreover, only two rules involve both application and window title in their antecedents, while 12 rules involve only the application, and 6 rules involve only the window title. The indefinite rules with a single term in their antecedents (e.g. rule 15) can be considered to pose the most serious risk from the point of view of (un)certainty of estimations.

In addition, as mentioned in Section 5.3.1, since there is a virtually infinite number of possible window titles, we only consider a set of 50 key phrases, which commonly appear in window titles. In this study, the set of key phrases is defined by our expert. Nevertheless, this step can also be performed automatically, for instance, by combining a standard clustering method with natural language processing tools. Specifically, the expert provided a list of 50 key phrases, some of which are redundant. Namely, among the 50 key phrases defined by the expert, 10 are used by the developer and 25 are used by the leader. In addition, some key phrases appear in the same window title.

To solve this issue, we revised the set of key phrases by considering those that appear together as additional variable values. In that manner, we consider the developer to utilize a total of 11 distinct window titles and the leader to utilize 31 distinct window titles (aside from the alien titles).

Appendix C: Details of Estimation Performance Upon Direct Application of Association Rules

Table 16 shows the number of actions that receive 0 to 3 candidate tasks by direct application of the rules. These results indicate that the rules give a single candidate task (i.e. an estimation) for only 152 actions out of 1283 (i.e. 12% of the cases) regarding the leader, and 131 actions out of 1921 (i.e. 7% of the cases) regarding the developer. This means that the remaining 88% and 93% of the actions of the leader and the developer, respectively, need to be revised such that a single task is assigned to them.

Table 16 The number of actions estimated to have 0 to 3 candidate tasks by direct application of the rules
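
The counting behind Table 16 can be sketched as follows. This is only an illustrative sketch: the three rules are hypothetical and are not taken from Table 15, the asterisk stands for "any value" as in Note 11, and two further assumptions are made, namely that a key phrase matches when it is contained in the window title and that the candidates of all matching rules are merged by set union.

```python
from collections import Counter

# Hypothetical rules (NOT the expert's rules of Table 15):
# (application, key phrase in window title, candidate tasks); "*" means any value.
RULES = [
    ("Eclipse", "*", {"Prog.", "Test"}),
    ("*", "spec", {"Doc."}),
    ("Excel", "schedule", {"Admin."}),
]


def candidates_for(app, title, rules=RULES):
    """Collect the candidate tasks of every rule whose antecedent matches."""
    found = set()
    for rule_app, rule_phrase, tasks in rules:
        app_ok = rule_app == "*" or rule_app == app
        # Assumption: a key phrase matches if it occurs inside the title.
        title_ok = rule_phrase == "*" or rule_phrase in title
        if app_ok and title_ok:
            found |= tasks  # assumption: candidates are merged by union
    return found


def candidate_histogram(actions, rules=RULES):
    """Count how many actions receive 0, 1, 2, ... candidate tasks."""
    return Counter(len(candidates_for(app, title, rules)) for app, title in actions)


actions = [
    ("Eclipse", "Main.java"),      # two candidates (uncertain)
    ("Word", "design spec v2"),    # one candidate (estimated)
    ("Chrome", "news"),            # no candidate (inconclusive)
    ("Eclipse", "spec notes"),     # three candidates (uncertain)
]
histogram = candidate_histogram(actions)
```

Only the actions counted under exactly one candidate would be regarded as estimated at this point; the others are left to the post-processing steps.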

Appendix D: Details of Post-processing of Benchmark Method

Step-1 of post-processing:

If two subsequent actions are uncertain and have a single candidate in common, then their estimated tasks are determined as this candidate.

Algorithm 1 outlines Step-1 of post-processing as pseudo-code and Table 17 illustrates its execution on a hypothetical sample of data (see Note 19).

Table 17 Example for the application of Step-1 of post-processing
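
Step-1 can be sketched in Python as below. This is a minimal sketch and not a transcription of Algorithm 1: the list-of-sets representation, the function names and the task labels in the example are our assumptions. We also assume that an action already resolved by a previous pair is not overwritten by the next pair.

```python
from typing import List, Optional, Set


def initial_estimates(candidates: List[Set[str]]) -> List[Optional[str]]:
    """An action counts as estimated only if it has exactly one candidate."""
    return [next(iter(c)) if len(c) == 1 else None for c in candidates]


def step1(candidates: List[Set[str]],
          estimates: List[Optional[str]]) -> List[Optional[str]]:
    """Resolve two subsequent uncertain actions sharing a single candidate."""
    result = list(estimates)
    for i in range(len(candidates) - 1):
        first, second = candidates[i], candidates[i + 1]
        both_uncertain = len(first) > 1 and len(second) > 1
        # Assumption: actions resolved by an earlier pair are left untouched.
        both_open = result[i] is None and result[i + 1] is None
        if both_uncertain and both_open:
            common = first & second
            if len(common) == 1:
                task = next(iter(common))
                result[i] = task
                result[i + 1] = task
    return result


# Hypothetical log: two uncertain actions, one estimated, one inconclusive.
candidates = [{"Test", "Prog."}, {"Test", "Doc."}, {"Admin."}, set()]
estimates = step1(candidates, initial_estimates(candidates))
```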

Step-2 of post-processing:

For each estimated action, the task associated with it is propagated to its preceding (not-estimated) action, provided that that task appears among the candidates of the preceding action.

Algorithm 2 outlines Step-2 of post-processing as pseudo-code and Table 18 illustrates its execution on a hypothetical sample of data.

Table 18 Example for the application of Step-2 of post-processing
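
A possible reading of Step-2 is sketched below, again with a hypothetical data representation (one candidate set and one optional estimate per action). The tasks are read from the input list and written to a copy, so that each estimate moves back by at most one action, which is our interpretation of the step.

```python
from typing import List, Optional, Set


def step2(candidates: List[Set[str]],
          estimates: List[Optional[str]]) -> List[Optional[str]]:
    """Propagate each estimated task to the immediately preceding action."""
    result = list(estimates)
    for i in range(1, len(estimates)):
        task = estimates[i]  # read from the input to avoid cascading
        if task is None:
            continue
        previous_open = result[i - 1] is None
        if previous_open and task in candidates[i - 1]:
            result[i - 1] = task
    return result


# Hypothetical log: only the last action is estimated.
candidates = [{"Test", "Prog."}, {"Test", "Doc."}, {"Test"}]
estimates = step2(candidates, [None, None, "Test"])
```

In this example only the second action is resolved; the first one stays not-estimated until the propagation is repeated, which is the subject of Step-3.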

Step-3 of post-processing:

Step-2 is repeated beyond the immediately preceding action, i.e. the task keeps propagating backward until an estimated action is reached. Subsequently, the same is applied to the succeeding actions.

Algorithm 3 outlines Step-3 of post-processing as pseudo-code and Table 19 and Table 20 illustrate its execution on a hypothetical sample of data with backward and forward propagation, respectively.

Table 19 Example for the application of Step-3 of post-processing
Table 20 Example for the application of Step-3 of post-processing
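
The backward and forward passes of Step-3 may be sketched as follows. The representation and names are hypothetical. One point is an assumption on our side and may differ from Algorithm 3: here the propagation stops at the first action whose candidates do not contain the task, instead of skipping over it.

```python
from typing import List, Optional, Set


def propagate(candidates: List[Set[str]],
              estimates: List[Optional[str]],
              direction: int) -> List[Optional[str]]:
    """Spread estimated tasks in one direction (-1 backward, +1 forward)."""
    result = list(estimates)
    n = len(result)
    for i, task in enumerate(estimates):
        if task is None:
            continue
        j = i + direction
        # Walk until an estimated action or the end of the log is reached.
        while 0 <= j < n and result[j] is None:
            if task not in candidates[j]:
                break  # assumption: a non-matching action ends the propagation
            result[j] = task
            j += direction
    return result


def step3(candidates: List[Set[str]],
          estimates: List[Optional[str]]) -> List[Optional[str]]:
    """Backward propagation first, then forward propagation."""
    backward = propagate(candidates, estimates, -1)
    return propagate(candidates, backward, +1)


# Hypothetical log: only the fourth action is estimated initially.
candidates = [{"Test", "Doc."}, {"Test", "Prog."}, {"Test", "Admin."},
              {"Test"}, {"Test", "Prog."}, {"Doc.", "Admin."}]
estimates = step3(candidates, [None, None, None, "Test", None, None])
```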

Step-4 of post-processing:

For each estimated action, the task associated with it is propagated to each preceding (not-estimated) action irrespective of their candidates, until an estimated action is found. Subsequently, the same is applied for the succeeding actions.

Algorithm 4 outlines Step-4 of post-processing as pseudo-code and Table 21 illustrates its execution on a hypothetical sample of data (with backward propagation).

Table 21 Example for the application of Step-4 of post-processing
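
Since Step-4 ignores the candidates, a sketch of it only needs the list of current estimates. The function names and the example log below are hypothetical, and the order of the two passes (backward first, then forward) follows the description of the step.

```python
from typing import List, Optional


def fill(estimates: List[Optional[str]], direction: int) -> List[Optional[str]]:
    """Copy each estimated task onto the not-estimated actions next to it."""
    result = list(estimates)
    n = len(result)
    for i, task in enumerate(estimates):
        if task is None:
            continue
        j = i + direction
        # Candidates are ignored; only an estimated action stops the walk.
        while 0 <= j < n and result[j] is None:
            result[j] = task
            j += direction
    return result


def step4(estimates: List[Optional[str]]) -> List[Optional[str]]:
    """Backward propagation first, then forward propagation."""
    return fill(fill(estimates, -1), +1)


# Hypothetical log with two estimated actions.
log = [None, None, "Test", None, "Doc.", None]
estimates = step4(log)
```

After this step, every action receives a task as long as the log contains at least one estimated action.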

Appendix E: Details of Performance After Causal Post-processing

Table 22 presents the estimation accuracy of the benchmark method together with causal post-processing operations. The most frequently occurring task of the developer (i.e. Test) is estimated with a high accuracy (0.92), which raises the overall estimation accuracy for the developer to 0.86. On the contrary, for the leader, the most frequently occurring task (Documentation) is estimated with an accuracy of only 0.58, which limits the overall accuracy for the leader to 0.63. In addition, from Table 22, it is clear that there is a larger margin of improvement in estimation of the actions of the developer (see also Table 6).

Table 22 Performance of the benchmark method after causal post-processing for (a) developer, (b) leader and (c) both

Appendix F: Details of Benchmark Performance After Non-causal Post-processing

Table 23 presents the estimation accuracy of the benchmark method followed by non-causal post-processing. It can be observed that non-causal post-processing achieves a higher accuracy both for the developer and the leader. Namely, for the developer the accuracy increases from 0.86 to 0.92 and for the leader it increases from 0.63 to 0.85, making non-causal post-processing more beneficial for the leader.

Table 23 Performance of the benchmark method after non-causal post-processing for (a) developer, (b) leader and (c) both

Appendix G: Statistical Properties of Descriptor Values

Table 24 presents the minimum, maximum, mean and standard deviation values of the four ratio-scale descriptors, i.e. duration δ (in sec), and number of key strokes κ, left clicks cL and right clicks cR. Although the tasks of the developer and the leader are distributed in quite a different manner (see Table 14), no significant distinction is present between the descriptor statistics of the two subjects. In other words, for any ratio-scale descriptor, the difference between the mean values of the two subjects is within a single standard deviation of either subject.

Table 24 The minimum, maximum, mean and standard deviations for the four ratio scale descriptors

Appendix H: Details of Normalized Entropy Distances

Table 25 presents normalized entropy distance values computed separately for each user. This table confirms that there is a higher degree of dependence (i.e. lower distance) between application α and window title ω for both subjects.

Table 25 Normalized entropy distance between pairs of descriptors regarding (a) the developer (b) the leader at Stage-1 of hierarchical classification

Here, it is worth discussing the possible drawbacks of having a 2D variable space instead of two 1D spaces, due to the obvious dependence of α and ω depicted in Table 5 and in more detail in Tables 25 and 26.

Table 26 Normalized entropy distance between pairs of descriptors regarding (a) the developer (b) the leader at Stage-2 of hierarchical classification

Clearly, building a 2D space introduces a larger number of bins (i.e. cells) than two 1D spaces.

However, taking a closer look at the distribution of samples in the 2D variable space, we can confirm that this increase in the number of bins does not pose a serious issue for our particular data set. Namely, the developer uses a total of 11 applications and utilizes 10 distinct window titles (including the alien titles), which leads to a variable space with 110 bins. Among those 110 bins, only 14 have nonzero observations. Since the number of samples relating to the developer is 1921 (see Table 14 of Appendix A), we consider the sample size to be sufficient to populate the effectual 2D variable space.

On the other hand, regarding the leader, there are 16 applications and 31 window titles (including the alien titles), which lead to a 2D space with 540 bins, 36 of which have nonzero samples. Since the leader has 1283 samples (see Table 14 of Appendix A), similar to the case of the developer, we consider these to be sufficient to populate the effectual 2D variable space. Therefore, although the contribution expected from dimensionality reduction is not regarded to be high for the particular data set of interest, we still consider this attempt to understand the dependency, and our strategy based on normalized entropy distance, to be potentially beneficial for other data sets and to expand the application capabilities of the proposed method.
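
The bin counting in this appendix can be reproduced with a few lines of Python. The sketch below is illustrative only: the function name and the toy samples are hypothetical, and the average number of samples per occupied bin is our own rough indicator of how well the effectual space is populated, not a quantity reported in the paper.

```python
from typing import List, Tuple


def bin_occupancy(samples: List[Tuple[str, str]],
                  n_apps: int, n_titles: int) -> Tuple[int, int, float]:
    """Return (total bins, occupied bins, samples per occupied bin)."""
    total_bins = n_apps * n_titles
    occupied = len(set(samples))  # distinct (application, window title) pairs
    per_bin = len(samples) / occupied if occupied else 0.0
    return total_bins, occupied, per_bin


# Toy (application, window title) samples with hypothetical names;
# 11 applications and 10 window titles correspond to the developer's 110 bins.
samples = [("IDE", "build")] * 3 + [("Browser", "alien")]
total_bins, occupied, per_bin = bin_occupancy(samples, n_apps=11, n_titles=10)
```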

Appendix I: Pre-processing of Ratio Scale Variables

The pre-processing operation applied to ratio scale variables is explained in Section 5.3.1. Figure 3 illustrates an example of that operation by presenting its details for the δ descriptor of the developer.

Fig. 3 An example for the pre-processing of the δ variable concerning the developer

The quantiles Qi of that descriptor are determined by following the procedure summarized in (6) and are illustrated on the x-axis of Fig. 3 (and listed in column 2 of Table 27). Moreover, the number of observations in each of these clusters is given in column 3 of Table 27. As explained in Section 5.3.1, since the observations are substantially imbalanced, the number of observations in these quantiles does not approximate a uniform distribution.

Table 27 Cluster edges and number of observations in each cluster for the δ variable of the developer

Appendix J: Assessing Variables’ Relevance

Section 5.3.2 elaborated on the independence of pairs of variables, whereas this section investigates the correlation of variables and tasks. It is known that automated collection tools may register too much and too detailed information, requiring a serious amount of time for the analysis (Karahasanović and Heim 2015). Namely, certain descriptors may be highly correlated with particular tasks, while some others may not present a distinction with respect to the nature of the task. In order to assess the relevance of each descriptor in task estimation, we use Cramér’s V, which is a measure of association between two variables based on Pearson’s χ2 statistics (Ramachandran and Tsokos 2014) and attains a value in the range [0, 1]. In explicit terms, Cramér’s V is computed as,

$$ V=\sqrt{\frac{\chi^{2}/n}{\min(k-1,r-1)}}, \tag{7} $$

where χ2 comes from Pearson’s test, n is the number of observations, and k and r are the numbers of values that the two variables can attain.
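As an illustration of (7), the sketch below computes Cramér’s V for two categorical sequences using only the Python standard library. The function name `cramers_v` is our own choice, and returning 0.0 when either variable takes a single value (where V is undefined) is a convention assumed here, not something stated in the text.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row = Counter(xs)              # marginal counts of the first variable
    col = Counter(ys)              # marginal counts of the second variable
    k, r = len(row), len(col)
    if min(k, r) < 2:
        return 0.0                 # assumed convention: V is undefined here
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt((chi2 / n) / min(k - 1, r - 1))

# Toy data: a descriptor that determines the task, and one unrelated to it.
tasks = ["x", "y", "x", "y"]
v_perfect = cramers_v(["a", "b", "a", "b"], tasks)
v_none = cramers_v(["a", "a", "b", "b"], tasks)
print(v_perfect, v_none)
```

The two toy calls bracket the range of V: a descriptor that fully determines the task gives 1, and one that is independent of the task gives 0.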

As seen in Table 28, α and ω have by far the highest V values. Therefore, considering the significant difference between the values of V for these two descriptors and the remaining ones, we suggest using α and ω in task estimation and omitting the others. In this respect, in this work we prefer using V for measuring the degree of correlation and determine our set of relevant descriptors in a rather empirical manner.

Table 28 Assessing the relevance of descriptors with Cramér’s V

In addition to assessing the relevance, we also used Cramér’s V for determining the optimum number of quantiles, which yields the highest correlation between the relevant descriptors (chosen as above) and tasks. Namely, we compute V by distributing the ratio scale variables into 1 ≤ q ≤ 6 quantiles and pick the optimum value of q, which yields the highest V for each variable (see the Appendix). In this respect, Table 28 presents the highest values of V for each ratio scale variable.

We compute Cramér’s V for the set of tasks and each single descriptor, which has been pre-processed as described in Section 5.3.1.

J.1 Cramér’s V for Nominal Scale Variables

As mentioned in Section 3.1, the nominal scale variables are application name α and window title ω. Computing Cramér’s V in line with the considerations explained in Appendix J, we obtained the values presented in Table 29. This table shows that V for α is higher than V for ω for both subjects. In addition, the difference in V between α and ω is larger for the developer than for the leader. For instance, at Stage-1 the developer attains 0.55 and 0.26 (with a difference of 0.29) for the two descriptors, whereas the leader attains 0.83 and 0.65 (with a difference of 0.18). All tasks are correlated more strongly with α, and the effect is more pronounced for infrequent tasks (i.e. Stage-2). Moreover, V for ω is lower for infrequent tasks (i.e. Stage-2) for both subjects (i.e. 0.26 > 0.15 and 0.65 > 0.34).

Table 29 Cramér’s V regarding nominal scale descriptors

J.2 Cramér’s V for Ratio Scale Variables

In practice, several points require care when assessing the relevance of ratio scale variables. Specifically, unlike nominal scale variables, the relevance of ratio scale variables is assessed based on their empirical cdfs rather than the raw values. In that respect, we first need to decide carefully on the bin size of the empirical distributions, since this choice may affect Cramér’s V.

The essential challenge in choosing the bin size relates to the concentration of descriptor values in narrow ranges, as mentioned in Section 3.3. One important factor leading to this issue is that the number of key strokes κ and the numbers of left/right clicks cL, cR attain a value of 0 for most of the actions. Thus, in building their empirical distributions (i.e. histograms), it is plausible to treat the zero value on its own. Namely, we first separate the zero values into a dedicated bin and group all nonzero values in one cluster. This corresponds to the Q = 0 case in Tables 30, 31 and 32 below. Any number Q ≥ 1 indicates that the nonzero values are further divided into Q + 1 clusters.
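The zero-aware binning described above may be sketched as follows. This is only an illustration under stated assumptions: the procedure in (6) is not reproduced here, so an equal-frequency, nearest-rank rule stands in for it, and the names `lower_edges` and `discretize` are hypothetical. Label 0 is reserved for zero values, and labels 1 to Q + 1 index the clusters of nonzero values.

```python
def lower_edges(values, n_clusters):
    """Lower edge of each of n_clusters equal-frequency groups (assumed nearest-rank rule)."""
    s = sorted(values)
    return [s[(len(s) * i) // n_clusters] for i in range(n_clusters)]

def discretize(values, Q):
    """Map zeros to label 0 and nonzero values to one of Q + 1 clusters (labels 1..Q+1)."""
    nonzero = [v for v in values if v != 0]
    edges = lower_edges(nonzero, Q + 1) if nonzero else []
    # A nonzero value's label is the number of lower edges it reaches.
    return [0 if v == 0 else sum(v >= e for e in edges) for v in values]

# Toy key stroke counts (made up for illustration): mostly zeros, a few nonzero values.
kappa = [0, 0, 0, 1, 2, 3, 4]
labels_q0 = discretize(kappa, Q=0)   # zero bin plus a single nonzero cluster
labels_q1 = discretize(kappa, Q=1)   # zero bin plus two nonzero clusters
print(labels_q0, labels_q1)
```

Keeping the zero bin separate means the quantile edges are computed from nonzero values only, so the mass at zero cannot collapse several edges onto the same value.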

Table 30 Cramér’s V regarding ground truth tasks and ratio scale variables clustered into various quantiles for (a) the developer and (b) the leader and (c) both, when no hierarchy is considered.
Table 31 Cramér’s V regarding ground truth tasks and ratio scale variables clustered into various quantiles for (a) the developer and (b) the leader and (c) both at Stage-1 of hierarchical classification.
Table 32 Cramér’s V regarding ground truth tasks and ratio scale variables clustered into various quantiles for (a) the developer and (b) the leader and (c) both at Stage-2 of hierarchical classification.

Another impediment is that we cannot distribute the descriptor values into an arbitrary number of bins. Specifically, given a particular Q, it may not always be possible to have distinct edges for each cluster. Namely, some quantiles may coincide if the distribution is too imbalanced (i.e. too many values in the first few bins and too few values towards the tail). In that case, the method illustrated in Fig. 3 prohibits using that particular number of quantiles Q.
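A simple way to detect such coinciding edges is sketched below. It again assumes an equal-frequency, nearest-rank rule in place of the procedure in (6), and the function names and toy click counts are hypothetical. A cluster count is treated as usable only if the lower edges of all clusters are distinct.

```python
def lower_edges(values, n_clusters):
    """Lower edge of each of n_clusters equal-frequency groups (assumed nearest-rank rule).

    Assumes `values` is non-empty.
    """
    s = sorted(values)
    return [s[(len(s) * i) // n_clusters] for i in range(n_clusters)]

def usable_cluster_counts(values, max_clusters):
    """Cluster counts (1..max_clusters) for which no two cluster edges coincide."""
    usable = []
    for n in range(1, max_clusters + 1):
        edges = lower_edges(values, n)
        if len(set(edges)) == len(edges):
            usable.append(n)
    return usable

# Toy nonzero click counts: one heavily imbalanced sample and one evenly spread sample.
imbalanced = [1] * 8 + [2, 5]
balanced = [1, 2, 3, 4, 5, 6]
print(usable_cluster_counts(imbalanced, 4))
print(usable_cluster_counts(balanced, 4))
```

In the imbalanced toy sample, the value 1 occupies so many ranks that every split into two or more clusters repeats an edge, which mirrors the situation described in the text.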

The final issue relates to the number of clusters that we can possibly use, in particular for integer variables. Namely, the number of clusters depends on the number of distinct observations for such variables. This can be understood easily by inspecting, for instance, the cR values illustrated in Fig. 4g and h of Appendix H. The number of right clicks is either 1 or 2 for the developer, whereas it is some value between 1 and 7 for the leader, although values higher than 2 are observed quite rarely. From this figure, it is clear that we cannot categorize the values into more than 2 bins for the developer and more than 7 bins for the leader (though a large number of categories is not expected to introduce significant information).

Fig. 4 Histogram and cumulative histogram of duration δ for (a) the developer and (b) the leader. Similar pairs of figures for (c, d) number of key strokes κ, (e, f) number of left clicks cL, (g, h) number of right clicks cR.

When interpreting the information presented in Tables 30, 31 and 32, we should note that if a certain descriptor is correlated with a particular task for one subject, but with another task for the other subject, computing Cramér’s V for both subjects at once yields a lower value than computing it for each subject on its own. As an example, consider that, for the developer, firefox.exe is highly correlated with Test, while for the leader, it is highly correlated with Leisure. This means Cramér’s V can be relatively high for α when computed for one subject only (i.e. only the developer or only the leader). But if it is computed for both subjects at once, it decreases, since α is associated with a larger variety of tasks.

In the light of these observations, we decided to use Cramér’s V as a guideline to determine the number of clusters of empirical distributions in addition to deciding the relevance of variables.

To determine the number of clusters that maximizes the correlation between the descriptors and tasks, we compute Cramér’s V for the number of quantiles 0 ≤ Q ≤ 5, as presented in Tables 30, 31 and 32. Subsequently, we pick the Q that yields the highest V (marked in boldface in these tables).
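The selection step can be sketched end to end as follows. This is an illustrative sketch rather than the authors' code: the quantile rule (equal-frequency, nearest-rank) is an assumption standing in for (6), the candidate list `q_values` is supplied by the caller, and all function names and the toy data are hypothetical. Candidates whose cluster edges coincide are skipped, and the remaining candidate with the highest V is returned.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V as in (7); returns 0.0 if either variable has one value (assumed convention)."""
    n, joint, row, col = len(xs), Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    if min(len(row), len(col)) < 2:
        return 0.0
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    return sqrt(chi2 / n / min(len(row) - 1, len(col) - 1))

def labels_for(values, Q):
    """Zero bin plus Q + 1 equal-frequency clusters of nonzero values; None if edges coincide."""
    s = sorted(v for v in values if v != 0)
    edges = [s[(len(s) * i) // (Q + 1)] for i in range(Q + 1)] if s else []
    if len(set(edges)) < len(edges):
        return None                      # this Q is not usable for the data
    return [0 if v == 0 else sum(v >= e for e in edges) for v in values]

def best_q(values, tasks, q_values):
    """Return the usable Q with the highest V, together with all computed scores."""
    scores = {}
    for Q in q_values:
        labels = labels_for(values, Q)
        if labels is not None:
            scores[Q] = cramers_v(labels, tasks)
    return max(scores, key=scores.get), scores

# Toy descriptor values and ground truth tasks (made up for illustration).
values = [0, 0, 1, 1, 5, 5]
tasks = ["a", "b", "a", "a", "b", "b"]
best, scores = best_q(values, tasks, q_values=(0, 1))
print(best, scores)
```

In this toy case, lumping all nonzero values together (Q = 0) hides the association with the task, whereas splitting them into two clusters (Q = 1) reveals it, so Q = 1 is selected.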


Cite this article

Pellegrin, F., Yücel, Z., Monden, A. et al. Task estimation for software company employees based on computer interaction logs. Empir Software Eng 26, 98 (2021). https://doi.org/10.1007/s10664-021-10006-4


  • DOI: https://doi.org/10.1007/s10664-021-10006-4
