Abstract
This article examines experiences in evaluating a user-adaptive personal assistant agent designed to assist a busy knowledge worker in time management. We examine the managerial and technical challenges of designing an adequate evaluation and the tension of collecting adequate data without a fully functional, deployed system. The CALO project was a seminal multi-institution effort to develop a personalized cognitive assistant; it included a significant attempt to rigorously quantify learning capability, which this article discusses for the first time, and it ultimately led to multiple spin-outs, including Siri. Retrospection on negative and positive experiences over the 6 years of the project underscores best practice in evaluating user-adaptive systems. Lessons for knowledge system evaluation include: the interests of multiple stakeholders, early consideration of evaluation and deployment, layered evaluation at system and component levels, characteristics of technology and domains that determine the appropriateness of controlled evaluations, implications of ‘in-the-wild’ versus variations of ‘in-the-lab’ evaluation, and the impact of technology-enabled functionality upon existing tools and work practices. In the conclusion, we draw on the lessons illustrated by this case study of intelligent knowledge system evaluation to discuss how the development and infusion of innovative technology must be supported by adequate evaluation of its efficacy.




Notes
While PTIME can be seen as a type of recommender system, evaluating a task-oriented adaptive system such as PTIME differs significantly from evaluating a classical recommender system, due to the generative, incremental, and dynamic nature of the recommendation task.
References
Ackerman S (2011) The iPhone 4S’ talking assistant is a military veteran. Wired, 2011. www.wired.com/2011/10/siri-darpa-iphone/. Retrieved 26 Jan 2015
Ambite JL, Barish G, Knoblock CA, Muslea M, Oh J, Minton S (2002) Getting from here to there: Interactive planning and agent execution for optimizing travel. In: Proceedings of fourteenth conference on innovative applications of artificial intelligence (IAAI’02), pp 862–869
Ambite J-L, Chaudhri VK, Fikes R, Jenkins J, Mishra S, Muslea M, Uribe T, Yang G (2006) Design and implementation of the CALO Query Manager. In: Proceedings of eighteenth conference on innovative applications of artificial intelligence (IAAI’06), pp 1751–1758
Aylett R, Brazier F, Jennings N, Luck M, Nwana H, Preist C (1998) Agent systems and applications. Knowl Eng Rev 13(3):303–308
Azvine B, Djian D, Tsui KC, Wobcke W (2000) The intelligent assistant: an overview. In: Intelligent systems and soft computing: prospects, tools and applications. Lecture notes in computer science, vol 1804. Springer, New York, NY, pp 215–238
Bank J, Cain Z, Shoham Y, Suen C, Ariely D (2012) Turning personal calendars into scheduling assistants. In: Extended abstracts of twenty-fourth conference on human factors in computing systems (CHI’12)
Berry PM, Gervasio M, Peintner B, Yorke-Smith N (2007) Balancing the needs of personalization and reasoning in a user-centric scheduling assistant. Technical note 561, AI Center, SRI International
Berry PM, Donneau-Golencer T, Duong K, Gervasio MT, Peintner B, Yorke-Smith N (2009a) Evaluating user-adaptive systems: lessons from experiences with a personalized meeting scheduling assistant. In: Proceedings of twenty-first conf. on innovative applications of artificial intelligence (IAAI’09), pp 40–46
Berry PM, Donneau-Golencer T, Duong K, Gervasio MT, Peintner B, Yorke-Smith N (2009b) Mixed-initiative negotiation: facilitating useful interaction between agent/owner pairs. In: Proceedings of AAMAS’09 workshop on mixed-initiative multiagent systems, pp 8–18
Berry PM, Gervasio M, Peintner B, Yorke-Smith N (2011) PTIME: personalized assistance for calendaring. ACM Trans Intell Syst Technol 2(4):40:1–40:22
Bosker B (2013a) Tempo smart calendar app boasts Siri pedigree and a calendar that thinks for itself. The Huffington Post. www.huffingtonpost.com/2013/02/13/tempo-smart-calendar-app_n_2677927.html. Retrieved 30 June 2016
Bosker B (2013b) SIRI RISING: the inside story of Siri’s origins—and why she could overshadow the iPhone. The Huffington Post. www.huffingtonpost.com/2013/01/22/siri-do-engine-apple-iphone_n_2499165.html. Retrieved 10 June 2013
Bosse T, Memon ZA, Oorburg R, Treur J, Umair M, de Vos M (2011) A software environment for an adaptive human-aware software agent supporting attention-demanding tasks. Int J Artif Intell Tools 20(5):819–846
Brusilovsky P, Karagiannidis C, Sampson D (2004) Layered evaluation of adaptive learning systems. Int J Contin Eng Educ Lifelong Learn 14(4–5):402–421
Brusilovsky P (2001) Adaptive hypermedia. User Model User Adap Interact 11(1–2):87–110
Brzozowski M, Carattini K, Klemmer SR, Mihelich P, Hu J, Ng AY (2006) groupTime: preference-based group scheduling. In: Proceedings of eighteenth conference on human factors in computing systems (CHI’06), pp 1047–1056
Campbell M (2009) Talking paperclip inspires less irksome virtual assistant. New Scientist, 29 July 2009
Carroll JM, Rosson MB (1987) Interfacing thought: cognitive aspects of human-computer interaction. MIT Press, Cambridge
Chalupsky H, Gil Y, Knoblock CA, Lerman K, Oh J, Pynadath DV, Russ TA, Tambe M (2002) Electric elves: agent technology for supporting human organizations. AI Mag 23(2):11–24
Cheyer A, Park J, Giuli R (2005) IRIS: integrate, relate, infer, share. In: Proceedings of the workshop on the semantic desktop at the 4th international semantic web conference (ISWC’05), p 15
Christie CA, Fleischer DN (2010) Insight into evaluation practice: a content analysis of designs and methods used in evaluation studies published in North American evaluation-focused journals. Am J Eval 31(3):326–346
Cohen P (1995) Empirical methods for artificial intelligence. MIT Press, Cambridge
Cohen P, Howe AE (1989) Toward AI research methodology: three case studies in evaluation. IEEE Trans Syst Man Cybern 19(3):634–646
Cohen PR, Howe AE (1988) How evaluation guides AI research: the message still counts more than the medium. AI Mag 9(4):35–43
Cohen PR, Cheyer AJ, Wang M, Baeg SC (1994) An open agent architecture. In: Huhns MN, Singh MP (eds) Readings in agents. Morgan Kaufmann, San Francisco, pp 197–204
Cramer H, Evers V, Ramlal S, Someren M, Rutledge L, Stash N, Aroyo L, Wielinga B (2008) The effects of transparency on trust in and acceptance of a content-based art recommender. User Model User Adap Interact 18(5):455–496
Davis FD, Bagozzi RP, Warshaw PR (1989) User acceptance of computer technology: a comparison of two theoretical models. Manag Sci 35:982–1003
Deans B, Keifer K, Nitz K et al (2009) SKIPAL phase 2 final technical report. Technical report 1981, SPAWAR Systems Center Pacific, San Diego
Evers V, Cramer H, Someren M, Wielinga B (2010) Interacting with adaptive systems. In: Interactive collaborative information systems. Studies in computational intelligence, vol 281. Springer, Heidelberg
Freed M, Carbonell J, Gordon G, Hayes J, Myers B, Siewiorek D, Smith S, Steinfeld A, Tomasic A (2008) RADAR: a personal assistant that learns to reduce email overload. In: Proceedings of twenty-third AAAI conference on artificial intelligence (AAAI’08), pp 1287–1293
Gena C (2005) Methods and techniques for the evaluation of user-adaptive systems. Knowl Eng Rev 20(1):1–37
Grabisch M (1996) The application of fuzzy integrals in multicriteria decision making. Eur J Oper Res 89(3):445–456
Graebner ME, Eisenhardt KM, Roundy PT (2010) Success and failure in technology acquisitions: lessons for buyers and sellers. Acad Manag Perspect 24(3):73–92
Greenberg S, Buxton B (2008) Usability evaluation considered harmful (some of the time). In: Proceedings of twentieth conference on human factors in computing systems (CHI’08), pp 111–120
Greer J, Mark M (2016) Evaluation methods for intelligent tutoring systems revisited. Int J Artif Intell Educ 26(1):387–392
Grudin J, Palen L (1995) Why groupware succeeds: discretion or mandate? In: Proceedings of 4th European conference on computer-supported cooperative work (ECSCW’95), pp 263–278
Hall J, Zeleznikow J (2001) Acknowledging insufficiency in the evaluation of legal knowledge-based systems: Strategies towards a broad based evaluation model. In: Proceedings of 8th international conference on artificial intelligence and law (ICAIL’01), pp 147–156
Hitt LM, Wu DJ, Zhou X (2002) ERP investment: business impact and productivity measures. J Manag Inf Syst 19:71–98
Höök K (2000) Steps to take before intelligent user interfaces become real. Interact Comput 12(4):409–426
Horvitz E, Breese J, Heckerman D, Hovel D, Rommelse K (1998) The Lumière project: Bayesian user modeling for inferring the goals and needs of software users. In: Proceedings of 14th conference on uncertainty in artificial intelligence (UAI’98), pp 256–266
Jameson AD (2009) Understanding and dealing with usability side effects of intelligent processing. AI Mag 30(4):23–40
Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of 22nd ACM conference on knowledge discovery and data mining (KDD’02), pp 133–142
Kafali Ö, Yolum P (2016) PISAGOR: a proactive software agent for monitoring interactions. Knowl Inf Syst 47(1):215–239
Kahney L (2001) MS Office helper not dead yet. Wired, 19 April 2001. www.wired.com/science/discoveries/news/2001/04/43065?currentPage=all. Retrieved 8 Oct 2010
Kjeldskov J, Skov MB (2007) Studying usability in sitro: simulating real world phenomena in controlled environments. Int J Hum Comput Interact 22(1–2):7–36
Klimt B, Yang Y (2004) The Enron corpus: a new dataset for email classification research. In: Proceedings of 15th European conference on machine learning (ECML’04), number 3201 in lecture notes in computer science. Springer, pp 217–226
Knoblock CA (2006) Beyond the elves: making intelligent agents intelligent. In: Proceedings of AAAI 2006 spring symposium on what went wrong and why: lessons from AI research and applications, p 40
Kokalitcheva K (2015) Salesforce acquires “smart” calendar app Tempo, which is shutting down. Fortune. www.fortune.com/2015/05/29/salesforces-acquires-tempo/. Retrieved 30 June 2016
Kozierok R, Maes P (1993) A learning interface agent for scheduling meetings. In: Proceedings of international workshop on intelligent user interfaces (IUI’93), pp 81–88
Krzywicki A, Wobcke W (2008) Closed pattern mining for the discovery of user preferences in a calendar assistant. In: Nguyen NT, Katarzyniak R (eds) New challenges in applied intelligence technologies. Springer, New York, pp 67–76
Langley P (1999) User modeling in adaptive interfaces. In: Proceedings of 7th international conference on user modeling (UM’99), pp 357–370
Lazar J, Feng JH, Hochheiser H (2010) Research methods in human–computer interaction. Wiley, Chichester
Maes P (1994) Agents that reduce work and information overload. Commun ACM 37(7):30–40
McCorduck P, Feigenbaum EA (1983) The fifth generation: artificial intelligence and Japan’s computer challenge to the world. Addison Wesley, Boston
Mitchell T, Caruana R, Freitag D, McDermott J, Zabowski D (1994) Experience with a learning personal assistant. Commun ACM 37(7):80–91
Modi PJ, Veloso MM, Smith SF, Oh J (2004) CMRadar: a personal assistant agent for calendar management. In: Proceedings of agent-oriented information systems workshop (AOIS’04), pp 169–181
Moffitt MD, Peintner B, Yorke-Smith N (2006) Multi-criteria optimization of temporal preferences. In: Proceedings of CP’06 workshop on preferences and soft constraints, pp 79–93
Myers KL, Berry PM, Blythe J, Conley K, Gervasio M, McGuinness D, Morley D, Pfeffer A, Pollack M, Tambe M (2007) An intelligent personal assistant for task and time management. AI Mag 28(2):47–61
Nielsen J, Levy J (1994) Measuring usability: preference vs. performance. Commun ACM 37(4):66–75
Norman DA (1994) How might people interact with agents. Commun ACM 37(7):68–71
Oh J, Smith SF (2004) Learning user preferences in distributed calendar scheduling. In: Proceedings of 5th international conference on practice and theory of automated timetabling (PATAT’04), pp 3–16
Oppermann R (1994) Adaptively supported adaptivity. Int J Hum Comput Stud 40(3):455–472
Palen L (1999) Social, individual and technological issues for groupware calendar systems. In: Proceedings of eleventh conference on human factors in computing systems (CHI’99), pp 17–24
Paramythis A, Weibelzahl S, Masthoff J (2010) Layered evaluation of interactive adaptive systems: framework and formative methods. User Model User Adap Interact 20(5):383–453
Peintner B, Dinger J, Rodriguez A, Myers K (2009) Task assistant: personalized task management for military environments. In: Proceedings of twenty-first conference on innovative applications of artificial intelligence (IAAI’09), pp 128–134
Refanidis I, Alexiadis A (2011) Deployment and evaluation of Selfplanner, an automated individual task management system. Comput Intell 27(1):41–59
Refanidis I, Yorke-Smith N (2010) A constraint-based approach to scheduling an individual’s activities. ACM Trans Intell Syst Technol 1(2):12:1–12:32
Rychtyckyj N, Turski A (2008) Reasons for success (and failure) in the development and deployment of AI systems. In: Proceedings of AAAI’08 workshop on what went wrong and why: lessons from AI research and applications, pp 25–31
Schaub F, Könings B, Lang P, Wiedersheim B, Winkler C, Weber M (2014) PriCal: context-adaptive privacy in ambient calendar displays. In: Proceedings of sixteenth international conference on pervasive and ubiquitous computing (UbiComp’14), pp 499–510
Shakshuki EM, Hossain SM (2014) A personal meeting scheduling agent. Pers Ubiquit Comput 18(4):909–922
Shen J, Li L, Dietterich TG, Herlocker JL (2006) A hybrid learning system for recognizing user tasks from desktop activities and email messages. In: Proceedings of eighteenth international conference on intelligent user interfaces (IUI’06), pp 86–92
SRI International (2013) CALO: cognitive assistant that learns and organizes. https://pal.sri.com. Retrieved 10 June 2013
Steinfeld A, Bennett R, Cunningham K et al (2006) The RADAR test methodology: evaluating a multi-task machine learning system with humans in the loop. Report CMU-CS-06-125, Carnegie Mellon University
Steinfeld A, Bennett R, Cunningham K, et al. (2007a) Evaluation of an integrated multi-task machine learning system with humans in the loop. In: Proceedings of 7th NIST workshop on performance metrics for intelligent systems (PerMIS’07), pp 182–188
Steinfeld A, Quinones P-A, Zimmerman J, Bennett SR, Siewiorek D (2007b) Survey measures for evaluation of cognitive assistants. In: Proceedings of 7th NIST workshop on performance metrics for intelligent systems (PerMIS’07), pp 189–193
Stumpf S, Rajaram V, Li L, Wong W-K, Burnett M, Dietterich T, Sullivan E, Herlocker J (2009) Interacting meaningfully with machine learning systems: three experiments. Int J Hum Comput Stud 67(8):639–662
Tambe M, Bowring E, Pearce JP, Varakantham P, Scerri P, Pynadath DV (2006) Electric Elves: what went wrong and why. In: Proceedings of AAAI 2006 spring symposium on what went wrong and why: lessons from AI research and applications, pp 34–39
Van Velsen L, Van Der Geest T, Klaassen R, Steehouder M (2008) User-centered evaluation of adaptive and adaptable systems: a literature review. Knowl Eng Rev 23(3):261–281
Viappiani P, Faltings B, Pu P (2006) Preference-based search using example-critiquing with suggestions. J Artif Intell Res 27:465–503
Wahlster W (ed) (2006) SmartKom: foundations of multimodal dialogue systems. Cognitive technologies. Springer, New York
Weber J, Yorke-Smith N (2008) Time management with adaptive reminders: two studies and their design implications. In: Working Notes of CHI’08 workshop: usable artificial intelligence, pp 5–8
Wobcke W, Nguyen A, Ho VH, Krzywicki A (2007) The smart personal assistant: an overview. In: Proceedings of the AAAI spring symposium on interaction challenges for intelligent assistants, pp 135–136
Yorke-Smith N, Saadati S, Myers KL, Morley DN (2012) The design of a proactive personal agent for task management. Int J Artif Intell Tools 21(1):90–119
Acknowledgements
We thank the anonymous reviewers for suggestions that helped to refine this article. We thank Karen Myers and Daniel Shapiro for their constructive comments, and we thank Mark Plascencia, Aaron Spaulding, and Julie Weber for help with the user studies and evaluations. We thank other contributors to the PTIME project, including Cory Albright, Emma Bowring, Michael D. Moffitt, Kenneth Nitz, Jonathan P. Pearce, Martha E. Pollack, Shahin Saadati, Milind Tambe, Joseph M. Taylor, and Tomás Uribe. We also gratefully acknowledge the many participants in our various studies, and the larger CALO team. For their feedback we thank among others Reina Arakji, Bijan Azad, Jane Davies, Nitin Joglekar, and Alexander Komashie, and the reviewers at the IAAI’09 conference where preliminary presentation of part of this work was made [8]. NYS thanks the Operations group at the Cambridge Judge Business School, where the body of the article was written, the fellowship at St Edmund’s College, Cambridge, and the Engineering Design Centre at the University of Cambridge. This material is based in part upon work supported by the US Defense Advanced Research Projects Agency (DARPA) Contract No. FA8750-07-D-0185/0004. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, or the Air Force Research Laboratory.
Appendices
Appendix 1: CALO Test scoring method
The annual CALO Test proceeded as follows, for project Years 2–4 (Sects. 4.1, 4.2).
Methods. The CALO Test process consisted of five parts, as follows for PTIME.
1. The independent evaluator (IE) defined a set of parameterized questions (PQs). These templates were made known to the PTIME team, who worked to develop the system’s capabilities towards them. There were some 60 PQs relevant to time management. For example:
Rank the following times in terms of their suitability for [MEETING-ID], given the schedules and locations of anticipated participants.
Each PQ was supplemented by an agreed interpretation (i.e., what the PQ means) and an ablation process (i.e., how to remove any learning in LCALO, to yield BCALO), both approved by the IE.
2. Data was collected during the week-long critical learning period (CLP).
3. The IE selected a subset of the PQs for evaluation. In Year 2, nine PTIME-relevant PQs were selected. In Year 3, two additional questions were selected.
4. The IE created three instantiations of each selected PQ, relevant to the data set. For example, one instantiation of the above PQ is:
Rank the following times in terms of their suitability for MTG-CALO-0133, given the schedules and locations of anticipated participants: (1) 7 am, (2) 10 am, (3) 10:30 am, (4) 3 pm, (5) 8 pm.
5. The IE scored LCALO and BCALO on each such instantiated question (IQ) and produced the overall results. First, the IE determined the ‘gold-standard’ answer for each IQ. For each PQ, the process for determining the answer key was documented prior to the Test. For example, for the above PQ:
Since this PQ is not asked from a single user’s perspective but from a global perspective (what is best considering all invitees), the Test evaluators will select an arbitrator who will be given access to all calendars and user preferences. The arbitrator may also ask any user for any information that may help the arbitrator identify the best answer. For example, the arbitrator may ask how important the meeting is to a user. The arbitrator will come up with the correct answer.
While some PTIME IQs had objective answers, others (such as the above) had subjective answers. The IE followed the answer determination process to derive the answer key for each IQ. If necessary, the IE elicited information from the CLP participants, and if further warranted, made subjective final decisions.
Second, the IE scored LCALO and BCALO against the answer for each IQ. Scores were between 0 (worst) and 4 (best). Again, for each PQ the process for determining the score was documented prior to the Test. For our example PQ, the process was to compare the ordered list generated by PTIME with the ordered list of the answer key by (quoting verbatim):
Kendall rank correlation coefficient (also called Kendall Tau) with a shifted and scaled value: Kendall Tau generates numbers between \(-1.0\) and \(1.0\) that indicate the correlation between two different rankings of the same items. \(1.0\) indicates the rankings are identical. \(-1.0\) indicates that they are the reverse of each other. Kendall Tau accommodates ties in the ranking. To get values that range from 0 to 4 (rather than \(-1.0\) to \(1.0\)), we use the following adjustment: \(\text{Score} = (\text{Kendall Tau} + 1) \times 2\)
This scoring process was encoded programmatically so that scores could be computed automatically for LCALO and BCALO.
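To illustrate, the following Python sketch reproduces the shifted-and-scaled scoring under our own assumptions; it is not the CALO Test’s actual implementation, and the function name and example rankings are hypothetical. It relies on scipy’s Kendall Tau, which accommodates ties.

```python
# Illustrative sketch of the shifted-and-scaled Kendall Tau scoring described
# above; not the CALO Test's actual code. Function name and example data are
# hypothetical.
from scipy.stats import kendalltau

def kendall_score(gold_ranking, system_ranking):
    """Map Kendall Tau in [-1, 1] to a CALO Test score in [0, 4]."""
    tau, _p_value = kendalltau(gold_ranking, system_ranking)  # tau-b, handles ties
    return (tau + 1.0) * 2.0

# Ranks hypothetically assigned to the five candidate times of the example IQ
# (7 am, 10 am, 10:30 am, 3 pm, 8 pm) by the answer key and by the system.
gold_ranks = [5, 1, 2, 3, 4]    # answer key: 10 am best, 7 am worst
system_ranks = [5, 2, 1, 3, 4]  # system swaps the top two candidates
print(round(kendall_score(gold_ranks, system_ranks), 2))  # 3.6
```

Note that an uncorrelated answer (Kendall Tau of 0) maps to a score of 2.0, a property discussed in “Appendix 2”.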
Tasks. Participants used CALO as part of their regular work activities during the period of the CLP. The participants pursued realistic tasks in a semi-artificial scenario, i.e., in a dedicated physical location rather than their usual workplaces (Lessons 3 and 4). Participants were given guidance about activities to try to include in their work, for instance, to schedule and attend a certain number of meetings; the independent evaluator approved the guidance. Participants were informed that the CALO system, and not their work, was being evaluated, and that they might encounter bugs because the system was still under development.
Appendix 2: Specific critique of the CALO Test
Further to the discussion of Sect. 4.1, the CALO Test aimed for as much objectivity as could be attained in providing a quantitative measure of the effects of learning on CALO’s performance. However, the nature of the scoring process of the CALO Test introduced unintended artefacts.
First, the instantiated questions (IQs) were derived from the parameterized questions (PQs) with a range of ‘difficulty’, determined by what the Independent Evaluator (IE) considered easy or difficult for a human office assistant. What is easy or difficult for an intelligent assistant can differ from what is easy or difficult for a human.
Second, as described in “Appendix 1”, some IQs had subjective ‘gold-standard’ answers that required ex-post (i.e., after the activity) elicitation from subjects by the IE, and a partially subjective human decision on the answer key. More generally, a difficulty in evaluation is in defining successful completion of a task. It is worth noting how the PQs were defined by the IE to scope the information required to determine the answer key. For instance, it was not necessary to determine whether users had chosen the best schedules for their requirements out of all possible schedules, but only from the multiple choices of the IQ answers.
Third, for PQs asked as multiple-choice questions, a chance effect could unintentionally favour BCALO. For example, consider a multiple-choice PQ with two possible answers, A or B, and its three instantiations to IQs. Suppose BCALO has a naive strategy of always returning answer A. There is a \(\frac{3}{8}\) probability that for exactly two of the three instantiations, A is the correct answer. In this case, BCALO scores 67% (2.67 out of 4.0), which is higher than the LCALO target for the question!
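A minimal calculation, under the example’s assumption that each IQ’s correct answer is equally likely to be A or B, confirms these figures; this sketch is our own illustration, not part of the CALO Test.

```python
# Probability that a naive always-answer-A baseline gets exactly k of n
# two-choice IQs right, when each correct answer is A or B with equal
# probability (our own illustrative calculation).
from math import comb

def prob_exactly_k_correct(k, n=3, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(prob_exactly_k_correct(2))   # 0.375, i.e. the 3/8 chance in the text
print(round(2 / 3 * 4, 2))         # 2.67: score when 2 of the 3 IQs are correct
```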
Fourth, the scoring process for some PQs created artefacts. For example, consider the PQ of “Appendix 1”, which was scored using a shifted Kendall rank correlation coefficient. If CALO’s answer showed no correlation with the correct answer, it still received 2 out of 4 points; thus BCALO scored at least 2. Only failing to pick any answer from the given list of choices would score 0 points.
Fifth, a point we recognized with PTIME when, for instance, memory usage of other components slowed CALO’s responsiveness: even though the Test was intended to measure CALO’s learning ability, it could not do so unless the other parts required by the Test process were in good working order, so that learning data could be collected for LCALO. Architecture, documentation, pretesting, debugging, usability, and user behaviour were therefore as important to scoring well as the learning algorithms themselves (Lesson 6).
As a rule, it is difficult in any evaluation to eliminate effects such as selection bias, experimenter bias, learning effects, and the Hawthorne effect, although proper experimental design can minimize them or at least make them measurable. The CALO Test deliberately did not attempt to control such effects as tightly as it otherwise might have, since the CLP was more a data-gathering exercise on the system than a regular user study. For example, whether or not it affected the data collected, there was selection bias from using subjects from our own institution (which was required for legal reasons). The Test was overseen by the IE, and monitors from the project sponsor were present; both were satisfied with the validity of the Test results.
Cite this article
Berry, P.M., Donneau-Golencer, T., Duong, K. et al. Evaluating intelligent knowledge systems: experiences with a user-adaptive assistant agent. Knowl Inf Syst 52, 379–409 (2017). https://doi.org/10.1007/s10115-016-1011-3