Learning-to-refer is a challenge in expert referral networks, wherein Active Learning helps experts (agents) estimate the topic-conditioned skills of other connected experts for problems that the initial expert cannot solve and must therefore refer to experts with more appropriate expertise. Recent research has investigated different reinforcement action selection algorithms to assess the viability of the learning setting, both with uninformative priors and with partially available noisy priors, where experts are allowed to advertise a subset of their skills to their colleagues. Prior to this work, time-varying expertise drift (e.g., experts learning with experience) had not been considered, though it is likely to arise often in practice. This paper addresses the challenge of referral learning with time-varying expertise, proposing Hybrid, a novel combination of Thompson Sampling and Distributed Interval Estimation Learning (DIEL) with variance reset (DIEL\(_{reset}\)), the latter first proposed in this paper. In our extensive empirical evaluation, considering both biased and unbiased drift, the proposed algorithm outperforms the previous state-of-the-art (DIEL) and other competitive algorithms, e.g., Thompson Sampling and Optimistic Thompson Sampling. We further show that our method is robust to topic-dependent drifts and expertise-level-dependent drifts, and that the newly proposed DIEL\(_{reset}\) can be effectively combined with other Bayesian approaches, e.g., Optimistic Thompson Sampling, Dynamic Thompson Sampling, and Discounted Thompson Sampling, for improved performance.

Notes
The data set can be downloaded from https://www.cs.cmu.edu/~akhudabu/referral-networks.html.
[54] reports 0.8 as the optimal value of \(\gamma \) for slowly moving distributions. However, in our experiments, we obtained better performance for both Discounted TS and Hybrid\(_{\texttt {Discounted TS}}\) when \(\gamma \) was set to 0.95. We have not performed extensive parameter tuning for the new TS variants and chose values that seemed reasonable. We acknowledge that parameter tuning might yield further performance gains for Hybrid\(_{\texttt {Dynamic TS}}\), but our primary goal was to test the compatibility of Hybrid’s design with other TS variants.
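To make the role of \(\gamma \) in the note above concrete, the following is a minimal sketch of a discounted Beta-Bernoulli Thompson Sampling update. It assumes the commonly described form in which every arm's success and failure counts are multiplied by \(\gamma \) before the observed reward is added; it is not taken from this article or from [54], and the class name, method names, and uniform Beta(1, 1) prior are hypothetical choices made for illustration.

```python
import random


class DiscountedThompsonSampler:
    """Illustrative Beta-Bernoulli Thompson Sampling with exponential discounting.

    Assumption: all names here are hypothetical, and the update rule is a
    generic discounted-count scheme, not the exact procedure of the article.
    """

    def __init__(self, n_arms, gamma=0.95, prior_alpha=1.0, prior_beta=1.0, seed=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must lie in (0, 1]")
        self.gamma = gamma
        self.prior_alpha = prior_alpha
        self.prior_beta = prior_beta
        # Discounted success/failure counts, one pair per arm (e.g. per colleague).
        self.successes = [0.0] * n_arms
        self.failures = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Draw one posterior sample per arm and return the index of the largest."""
        samples = [
            self.rng.betavariate(self.prior_alpha + s, self.prior_beta + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Discount every arm's counts by gamma, then credit the chosen arm.

        `reward` is expected to be in [0, 1] (1 = task solved, 0 = failed).
        """
        self.successes = [self.gamma * s for s in self.successes]
        self.failures = [self.gamma * f for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1.0 - reward


if __name__ == "__main__":
    sampler = DiscountedThompsonSampler(n_arms=3, gamma=0.95, seed=0)
    for _ in range(100):
        arm = sampler.select()
        # Toy environment (assumption): arm 2 always succeeds, the others always fail.
        sampler.update(arm, 1.0 if arm == 2 else 0.0)
    print(sampler.successes, sampler.failures)
```

Under this scheme the total discounted count of an arm is bounded by \(1/(1-\gamma )\), so \(\gamma = 0.95\) corresponds to an effective memory of about 20 observations, whereas \(\gamma = 0.8\) corresponds to about 5. A larger \(\gamma \) therefore forgets old evidence more slowly.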
Abdallah, S., & Kaisers, M. (2016). Addressing environment non-stationarity by repeating Q-learning updates. The Journal of Machine Learning Research, 17(1), 1582–1612.
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In COLT (pp. 39–1).
Akakpo, N. (2008). Detecting change-points in a discrete distribution via model selection. arXiv preprint arXiv:0801.0970.
Allesiardo, R., & Féraud, R. (2015). Exp3 with drift detection for the switching bandit problem. In IEEE international conference on data science and advanced analytics (DSAA) (pp. 1–7). IEEE.
Audibert, J. Y., Munos, R., & Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In International conference on algorithmic learning theory (pp. 150–165). Springer.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48–77.
Babaioff, M., Sharma, Y., & Slivkins, A. (2014). Characterizing truthful multi-armed bandit mechanisms. SIAM Journal on Computing, 43(1), 194–230.
Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.
Baram, Y., Yaniv, R. E., & Luz, K. (2004). Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar), 255–291.
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments (monographs on statistics and applied probability) (Vol. 12). Berlin: Springer.
Bertsimas, D., & Niño-Mora, J. (2000). Restless bandits, linear programming relaxations, and a primal–dual index heuristic. Operations Research, 48(1), 80–90.
Bouneffouf, D., & Feraud, R. (2016). Multi-armed bandit problem with known trend. Neurocomputing, 205, 16–21.
Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In International joint conference on artificial intelligence (Vol. 17, pp. 1021–1026). Lawrence Erlbaum Associates Ltd.
Burtini, G., Loeppky, J., & Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 2249–2257).
Donmez, P., Carbonell, J., & Schneider, J. (2010). A probabilistic framework to learn from multiple annotators with time-varying accuracy. In Proceedings of the 2010 SIAM international conference on data mining (pp. 826–837). SIAM.
Donmez, P., Carbonell, J. G., & Bennett, P. N. (2007). Dual strategy active learning. In Machine learning ECML (pp. 116–127).
Donmez, P., Carbonell, J. G., & Schneider, J. (2009). Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM international conference on knowledge discovery and data mining (p. 259).
Dragoni, N. (2006). Fault tolerant knowledge level inter-agent communication in open multi-agent systems. AI Communications, 19(4), 385–387.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44.
Garivier, A., & Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415.
Gittins, J. C., & Jones, D. M. (1974). A dynamic allocation indices for the sequential design of experiments. In J. Gani (Ed.), Progress in statistics, European meeting of statisticians (Vol. 1, pp. 241–266).
Gorner, J. M. (2011). Advisor networks and referrals for improved trust modelling in multi-agent systems. Master’s thesis, University of Waterloo.
Graepel, T., Candela, J. Q., Borchert, T., & Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 13–20).
Granmo, O. C. (2010). Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2), 207–234.
Guha, S., Munagala, K., & Shi, P. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1), 3.
Gupta, N., Granmo, O. C., & Agrawala, A. (2011). Thompson sampling for dynamic multi-armed bandits. In 10th International conference on machine learning and applications and workshops (ICMLA) (Vol. 1, pp. 484–489). IEEE.
Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., & Sebag, M. (2006). Multi-armed bandit, dynamic environments and meta-bandits. In NIPS-2006 workshop, Online trading between exploration and exploitation. Whistler, Canada.
Hasselt, H. V. (2010). Double Q-learning. In Advances in neural information processing systems (pp. 2613–2621).
Holme, P., & Kim, B. J. (2002). Growing scale-free networks with tunable clustering. Physical Review E, 65(2), 026107.
Huang, L., Joseph, A. D., Nelson, B., Rubinstein, B. I., & Tygar, J. (2011). Adversarial machine learning. In Proceedings of the 4th ACM workshop on security and artificial intelligence (pp. 43–58). ACM.
Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge: MIT Press.
Kaisers, M., & Tuyls, K. (2010). Frequency adjusted multi-agent Q-learning. In Proceedings of the 9th international conference on autonomous agents and multiagent systems (Vol. 1, pp. 309–316). International Foundation for Autonomous Agents and Multiagent Systems.
Kaufmann, E., Korda, N., & Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory (pp. 199–213). Springer.
Kautz, H., Selman, B., & Milewski, A. (1996). Agent amplified communication (pp. 3–9).
KhudaBukhsh, A. R., & Carbonell, J. G. (2018). Expertise drift in referral networks. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (pp. 425–433). International Foundation for Autonomous Agents and Multiagent Systems.
KhudaBukhsh, A. R., Carbonell, J. G., & Jansen, P. J. (2016). Proactive-DIEL in evolving referral networks. In European conference on multi-agent systems (pp. 148–156). Springer.
KhudaBukhsh, A. R., Carbonell, J. G., & Jansen, P. J. (2016). Proactive skill posting in referral networks. In Australasian joint conference on artificial intelligence (pp. 585–596). Springer.
KhudaBukhsh, A. R., Carbonell, J. G., & Jansen, P. J. (2017). Incentive compatible proactive skill posting in referral networks. In European conference on multi-agent systems. Springer.
KhudaBukhsh, A. R., Carbonell, J. G., & Jansen, P. J. (2017). Robust learning in expert networks: A comparative analysis. In International symposium on methodologies for intelligent systems (ISMIS) (pp. 292–301). Springer.
KhudaBukhsh, A. R., Carbonell, J. G., & Jansen, P. J. (2018). Robust learning in expert networks: A comparative analysis. Journal of Intelligent Information Systems, 51(2), 207–234.
KhudaBukhsh, A. R., Jansen, P. J., & Carbonell, J. G. (2016). Distributed learning in expert referral networks. In European conference on artificial intelligence (ECAI) (pp. 1620–1621).
Lai, T. L. (2001). Sequential analysis. New York: Wiley Online Library.
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
Langford, J., Strehl, A., & Wortman, J. (2008). Exploration scavenging. In Proceedings of the 25th international conference on Machine learning (pp. 528–535). ACM.
Levine, N., Crammer, K., & Mannor, S. (2017). Rotting bandits. In Advances in neural information processing systems (pp. 3074–3083).
Liu, K., & Zhao, Q. (2010). Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56(11), 5547–5567.
Lu, X., Adams, N., & Kantas, N. (2017). On adaptive estimation for dynamic Bernoulli bandits. arXiv preprint arXiv:1712.03134.
May, B. C., Korda, N., Lee, A., & Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13(Jun), 2069–2106.
Noda, I. (2009). Recursive adaptation of stepsize parameter for non-stationary environments. In ALA (pp. 74–90). Springer.
Raj, V., & Kalyani, S. (2017). Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727.
Shivaswamy, P. K., & Joachims, T. (2012). Multi-armed bandit problems with history. In N. D. Lawrence & M. Girolami (Eds.), International Conference on Artificial Intelligence and Statistics (pp. 1046–1054).
Silva, B. C. D., Basso, E. W., Bazzan, A., & Engel, P. M. (2006). Dealing with non-stationary environments using context detection. In Proceedings of the 23rd international conference on machine learning (pp. 217–224). ACM.
Slivkins, A., & Upfal, E. (2008). Adapting to a changing environment: The Brownian restless bandits. In COLT (pp. 343–354).
Tekin, C., & Liu, M. (2012). Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8), 5588–5611.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Tsymbal, A. (2004). The problem of concept drift: Definitions and related work. Computer Science Department, Trinity College Dublin, 106(2), 58.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442.
Weber, R. R., & Weiss, G. (1990). On an index policy for restless bandits. Journal of Applied Probability, 27(3), 637–648.
Wei, W., Li, C. M., & Zhang, H. (2008). A switching criterion for intensification, and diversification in local search for SAT. Journal on Satisfiability, Boolean Modeling and Computation, 4, 219–237.
Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A), 287–298.
Wiering, M., & Schmidhuber, J. (1998). Efficient model-based exploration. In Proceedings of the fifth international conference on simulation of adaptive behavior (SAB’98) (pp. 223–228).
Yolum, P., & Singh, M. P. (2003). Dynamic communities in referral networks. Web Intelligence and Agent Systems, 1(2), 105–116.
Yolum, P., & Singh, M. P. (2003). Emergent properties of referral systems. In Proceedings of the second international joint conference on autonomous agents and multiagent systems (pp. 592–599). ACM.
Yu, B. (2002). Emergence and evolution of agent-based referral networks. Ph.D. thesis, North Carolina State University.
Yu, B., Venkatraman, M., & Singh, M. P. (2003). An adaptive social network for information access: Theoretical and experimental results. Applied Artificial Intelligence, 17, 21–38.
Yu, J. Y., & Mannor, S. (2009). Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th annual international conference on machine learning (pp. 1177–1184). ACM.
Additional information
A preliminary version of this work appeared in [39]. The previous version contained an experimental bug caused by an inadvertent error in our random sequence generation; we fixed the bug and re-designed Hybrid accordingly. Our new design of Hybrid is more elegant and produces results qualitatively similar to those previously published. Additionally, this version contains a thorough robustness analysis considering topic-dependent drifts, expertise-level-dependent drifts, and combined topic-and-expertise drift. The extension of our results to effective combination with other Thompson Sampling variants, such as Dynamic Thompson Sampling [28], Discounted Thompson Sampling [54], and Optimistic Thompson Sampling [52], is also new. We also provide an extensive design-component analysis of Hybrid, showing empirical evidence that no simpler design of Hybrid matches our current design’s performance.
Cite this article
KhudaBukhsh, A.R., Carbonell, J.G. Expertise drift in referral networks. Auton Agent Multi-Agent Syst 33, 645–671 (2019). https://doi.org/10.1007/s10458-019-09419-9
DOI: https://doi.org/10.1007/s10458-019-09419-9