research-article

Examining User Heterogeneity in Digital Experiments

Published: 22 March 2023

Abstract

Digital experiments are routinely used to test the value of a treatment relative to a status-quo control setting, for instance, a new search relevance algorithm for a website or a new results layout for a mobile app. As digital experiments have become pervasive in organizations and across a wide variety of research areas, their growth has created new challenges for experimentation platforms. One challenge is that experiments often focus on the average treatment effect (ATE) without explicitly considering differences across major sub-groups, that is, heterogeneous treatment effects (HTEs). This is especially problematic because ATEs have decreased in many organizations as the more obvious benefits have already been realized. However, questions abound regarding the pervasiveness of user HTEs and how best to detect them. We propose a framework for detecting and analyzing user HTEs in digital experiments. Our framework combines an array of user characteristics with double machine learning. Analysis of 27 real-world experiments spanning 1.76 billion sessions, together with simulated data, demonstrates the effectiveness of our detection method relative to existing techniques. We also find that transaction, demographic, engagement, satisfaction, and lifecycle characteristics exhibit statistically significant HTEs in 10% to 20% of our real-world experiments, underscoring the importance of considering user heterogeneity when analyzing experiment results; without it, opportunities to personalize features and experiences are missed, reducing effectiveness. In terms of the number of experiments and user sessions, we are not aware of any study that has examined user HTEs at this scale. Our findings have important implications for information retrieval, user modeling, platforms, and digital experience contexts, in which online experiments are often used to evaluate the effectiveness of design artifacts.
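The abstract names double machine learning but does not describe the estimator, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single hypothetical user characteristic `x`, a randomized binary treatment `t`, simple linear nuisance models, two-fold cross-fitting, and a treatment effect that is linear in `x`. The function names and simulated coefficients are illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares of ys on xs; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n, seed=0):
    """Hypothetical experiment: the true effect is 0.2 + 0.3 * x (assumed values)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # one user characteristic
        t = 1 if rng.random() < 0.5 else 0   # randomized assignment
        y = 1.0 + 0.5 * x + (0.2 + 0.3 * x) * t + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def dml_linear_hte(rows, n_folds=2):
    """Cross-fitted partialling-out estimate of (tau0, tau1) in tau(x) = tau0 + tau1 * x."""
    y_res, t_res, xs = [], [], []
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        test = [r for i, r in enumerate(rows) if i % n_folds == k]
        # Nuisance models E[y | x] and E[t | x], fitted on the other fold(s) only.
        a_y, b_y = fit_line([r[0] for r in train], [r[2] for r in train])
        a_t, b_t = fit_line([r[0] for r in train], [r[1] for r in train])
        for x, t, y in test:
            y_res.append(y - (a_y + b_y * x))
            t_res.append(t - (a_t + b_t * x))
            xs.append(x)
    # Final stage: regress y_res on t_res and t_res * x (no intercept),
    # solving the 2x2 normal equations directly.
    z1 = t_res
    z2 = [tr * x for tr, x in zip(t_res, xs)]
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    r1 = sum(a * y for a, y in zip(z1, y_res))
    r2 = sum(b * y for b, y in zip(z2, y_res))
    det = s11 * s22 - s12 * s12
    tau0 = (r1 * s22 - r2 * s12) / det
    tau1 = (r2 * s11 - r1 * s12) / det
    return tau0, tau1


if __name__ == "__main__":
    tau0, tau1 = dml_linear_hte(simulate(20000, seed=1))
    print(f"baseline effect ~ {tau0:.3f}, change per unit of x ~ {tau1:.3f}")
```

A nonzero `tau1` is what signals heterogeneity along `x`. In practice the linear nuisance fits would typically be replaced by flexible learners, and testing many characteristics across many experiments would call for a multiple-testing correction, neither of which this sketch attempts.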


    Published In

    ACM Transactions on Information Systems  Volume 41, Issue 4
    October 2023
    958 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3587261

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2023
    Online AM: 12 January 2023
    Accepted: 17 December 2022
    Revised: 20 October 2022
    Received: 08 April 2022
    Published in TOIS Volume 41, Issue 4

    Author Tags

    1. Heterogeneous treatment effects
    2. digital experiments
    3. user heterogeneity
    4. user modeling
    5. double machine learning

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. NSF
    • Machine Learning Methods for Causal Inference in Digital Experimentation Platforms
