
Learning potential functions and their representations for multi-task reinforcement learning


Abstract

In multi-task learning, there are roughly two approaches to discovering representations. The first is to discover task relevant representations, i.e., those that compactly represent solutions to particular tasks. The second is to discover domain relevant representations, i.e., those that compactly represent knowledge that remains invariant across many tasks. In this article, we propose a new approach to multi-task learning that captures domain-relevant knowledge by learning potential-based shaping functions, which augment a task’s reward function with artificial rewards. We address two key issues that arise when deriving potential functions. The first is what kind of target function the potential function should approximate; we propose three such targets and show empirically that which one is best depends critically on the domain and learning parameters. The second issue is the representation for the potential function. This article introduces the notion of \(k\)-relevance, the expected relevance of a representation on a sample sequence of \(k\) tasks, and argues that this is a unifying definition of relevance of which both task and domain relevance are special cases. We prove formally that, under certain assumptions, \(k\)-relevance converges monotonically to a fixed point as \(k\) increases, and use this property to derive Feature Selection Through Extrapolation of k-relevance (FS-TEK), a novel feature-selection algorithm. We demonstrate empirically the benefit of FS-TEK on artificial domains.


Notes

  1. The authors termed these potential-based advice; specifically, the formula introduced here is look-ahead advice. We use the term “shaping” for both methods, and let function arguments resolve any ambiguity.

  2. Relevance is not a measure in the strict mathematical sense; because of dependence between feature sets, \(\rho (\mathsf{F } \cup \mathsf{G }) \ne \rho (\mathsf{F }) + \rho (\mathsf{G })\) for some disjoint feature sets \(\mathsf{F }\) and \(\mathsf{G }\) and relevance \(\rho \).

  3. We employ a standard real-valued GA with population size 100, no crossover, and mutation with probability \(p=0.5\); mutation adds a random value \(\delta \in [-0.05, 0.05]\). Policies are constructed by a softmax distribution over the chromosome values.
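As a concrete illustration (not code from the article), this setup could be sketched as follows in Python; the chromosome length, fitness evaluation, and truncation selection are assumptions, and the mutation probability is applied per gene.

```python
import numpy as np

rng = np.random.default_rng(0)

POP_SIZE = 100    # population size from the footnote
P_MUT = 0.5       # mutation probability (assumed per gene)
MUT_RANGE = 0.05  # mutation adds delta in [-0.05, 0.05]

def mutate(chromosome):
    """With probability P_MUT per gene, add uniform noise in [-MUT_RANGE, MUT_RANGE]."""
    mask = rng.random(chromosome.shape) < P_MUT
    return chromosome + mask * rng.uniform(-MUT_RANGE, MUT_RANGE, chromosome.shape)

def softmax_policy(chromosome, n_actions):
    """Interpret chromosome values as per-state action preferences and softmax them."""
    prefs = chromosome.reshape(-1, n_actions)
    e = np.exp(prefs - prefs.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # rows: states, columns: action probabilities

def evolve(fitness_fn, n_states, n_actions, generations=50):
    """Mutation-only GA (no crossover); fitness_fn and truncation selection are assumptions."""
    pop = [rng.uniform(-1.0, 1.0, n_states * n_actions) for _ in range(POP_SIZE)]
    for _ in range(generations):
        scores = np.array([fitness_fn(softmax_policy(c, n_actions)) for c in pop])
        survivors = [pop[i] for i in np.argsort(scores)[::-1][:POP_SIZE // 2]]
        pop = survivors + [mutate(c) for c in survivors]
    return pop
```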

  4. Note that the addition of this sensor is not the same as the manual separation of state features for the value and potential function as done in [34, 63]—see related work (Sect. 6). In the experiments reported in this section, both functions use the exact same set of features.

  5. In the policy improvement step, the policy is made only \(\varepsilon \)-greedy w.r.t. the value function.
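For reference, \(\varepsilon \)-greedy policy improvement with respect to a tabular action-value function can be sketched as follows; this is a generic illustration under that assumption, not code from the article.

```python
import numpy as np

def epsilon_greedy_improvement(Q, epsilon):
    """Return a stochastic policy that is epsilon-greedy w.r.t. the action-value table Q.

    Q: (n_states, n_actions) array of action values.
    """
    n_states, n_actions = Q.shape
    policy = np.full((n_states, n_actions), epsilon / n_actions)  # exploration mass
    greedy = Q.argmax(axis=1)                                     # greedy action per state
    policy[np.arange(n_states), greedy] += 1.0 - epsilon          # remaining mass on greedy action
    return policy  # each row sums to 1
```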

References

  1. Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.

  2. Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3), 243–272.

  3. Asmuth, J., Littman, M., & Zinkov, R. (2008). Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (pp. 604–609). Cambridge: The AAAI Press.

  4. Babes, M., de Cote, E.M., & Littman, M. L. (2008). Social reward shaping in the prisoner’s dilemma. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008) (pp. 1389–1392).

  5. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12, 149–198.

  6. Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont: Athena.

  7. Boutilier, C., Dearden, R., & Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1–2), 49–107.

  8. Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.

  9. Caruana, R. (2005). Inductive transfer retrospective and review. In NIPS 2005 Workshop on Inductive Transfer: 10 Years Later.

  10. Devlin, S., Grzes, M., & Kudenko, D. (2011). Multi-agent reward shaping for RoboCup Keepaway. In AAMAS (pp. 1227–1228).

  11. Devlin, S., & Kudenko, D. (2011). Theoretical considerations of potential-based reward shaping for multi-agent systems. In AAMAS, AAMAS ’11 (pp. 225–232).

  12. Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. In AAMAS (pp. 433–440).

  13. Diuk, C., Li, L., & Leffler, B. R. (2009). The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML (p. 32).

  14. Dorigo, M., & Colombetti, M. (1994). Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71(2), 321–370.

  15. Elfwing, S., Uchibe, E., Doya, K., & Christensen, H. (2008). Co-evolution of shaping: Rewards and meta-parameters in reinforcement learning. Adaptive Behavior, 16(6), 400–412.

  16. Elfwing, S., Uchibe, E., Doya, K., & Christensen, H. I. (2011). Darwinian embodied evolution of the learning ability for survival. Adaptive Behavior, 19(2), 101–120.

  17. Erez, T., & Smart, W. (2008). What does shaping mean for computational reinforcement learning? In 7th IEEE International Conference on Development and Learning (ICDL 2008) (pp. 215–219).

  18. Ferguson, K., & Mahadevan, S. (2006). Proto-transfer learning in Markov decision processes using spectral methods. In ICML Workshop on Structural Knowledge Transfer for Machine Learning.

  19. Ferrante, E., Lazaric, A., & Restelli, M. (2008). Transfer of task representation in reinforcement learning using policy-based proto-value functions. In AAMAS (pp. 1329–1332).

  20. Foster, D. J., & Dayan, P. (2002). Structure in the space of value functions. Machine Learning, 49(2–3), 325–346.

  21. Frommberger, L. (2011). Task space tile coding: In-task and cross-task generalization in reinforcement learning. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL9).

  22. Frommberger, L., & Wolter, D. (2010). Structural knowledge transfer by spatial abstraction for reinforcement learning agents. Adaptive Behavior, 18(6), 507–525.

  23. Geramifard, A., Doshi, F., Redding, J., Roy, N., & How, J. P. (2011). Online discovery of feature dependencies. In ICML (pp. 881–888).

  24. Grześ, M., & Kudenko, D. (2009). Learning shaping rewards in model-based reinforcement learning. In Proceedings of AAMAS 2009 Workshop on Adaptive Learning Agents.

  25. Grzes, M., & Kudenko, D. (2009). Theoretical and empirical analysis of reward shaping in reinforcement learning. In ICMLA (pp. 337–344).

  26. Grześ, M., & Kudenko, D. (2010). Online learning of shaping rewards in reinforcement learning. Neural Networks, 23(4), 541–550.

  27. Gullapalli, V., & Barto, A. G. (1992). Shaping as a method for accelerating reinforcement learning. In Proceedings of IEEE International Symposium on Intelligent Control (pp. 554–559).

  28. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.

  29. Hachiya, H., & Sugiyama, M. (2010). Feature selection for reinforcement learning: Evaluating implicit state-reward dependency via conditional mutual information. In ECML/PKDD (pp. 474–489).

  30. Jong, N. K., & Stone, P. (2005). State abstraction discovery from irrelevant state variables. In IJCAI-05.

  31. Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph.D. Thesis, University College London, London.

  32. Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In ICML (pp. 284–292).

  33. Kolter, J. Z., & Ng, A. Y. (2009). Regularization and feature selection in least-squares temporal difference learning. In ICML (p. 66).

  34. Konidaris, G., & Barto, A. (2006). Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of 23rd International Conference on Machine Learning (pp. 489–496).

  35. Konidaris, G., Scheidwasser, I., & Barto, A. G. (2012). Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13, 1333–1371.

  36. Koren, Y., & Borenstein, J. (1991). Potential field methods and their inherent limitations for mobile robot navigation. In Proceedings of IEEE Conference on Robotics and Automation (pp. 1398–1404).

  37. Kroon, M., & Whiteson, S. (2009). Automatic feature selection for model-based reinforcement learning in factored MDPs. In ICMLA 2009: Proceedings of the Eighth International Conference on Machine Learning and Applications (pp. 324–330).

  38. Laud, A., & DeJong, G. (2002). Reinforcement learning and shaping: Encouraging intended behaviors. In Proceedings of 19th International Conference on Machine Learning (pp. 355–362).

  39. Laud, A., & DeJong, G. (2003). The influence of reward on the speed of reinforcement learning: An analysis of shaping. In ICML (pp. 440–447).

  40. Lazaric, A. (2008). Knowledge transfer in reinforcement learning. Ph.D. Thesis, Politecnico di Milano, Milan.

  41. Lazaric, A., & Ghavamzadeh, M. (2010). Bayesian multi-task reinforcement learning. In ICML (pp. 599–606).

  42. Lazaric, A., Restelli, M., & Bonarini, A. (2008). Transfer of samples in batch reinforcement learning. In ICML (pp. 544–551).

  43. Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In Artificial Intelligence and Mathematics.

  44. Lu, X., Schwartz, H. M., & Givigi, S. N. (2011). Policy invariance under reward transformations for general-sum stochastic games. Journal of Artificial Intelligence Research (JAIR), 41, 397–406.

  45. Maclin, R., & Shavlik, J. W. (1996). Creating advice-taking reinforcement learners. Machine Learning, 22(1–3), 251–281.

  46. Mahadevan, S. (2010). Representation discovery in sequential decision making. In AAAI.

  47. Manoonpong, P., Wörgötter, F., & Morimoto, J. (2010). Extraction of reward-related feature space using correlation-based and reward-based learning methods. In ICONIP (Vol. 1, pp. 414–421).

  48. Marquardt, D. (1963). An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal of Applied Mathematics, 11, 431–441.

  49. Marthi, B. (2007). Automatic shaping and decomposition of reward functions. In Proceedings of 24th International Conference on Machine Learning (pp. 601–608).

  50. Matarić, M. J. (1994). Reward functions for accelerated learning. In Proceedings of 11th International Conference on Machine Learning.

  51. Mehta, N., Natarajan, S., Tadepalli, P., & Fern, A. (2008). Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3), 289–312.

  52. Midtgaard, M., Vinther, L., Christiansen, J. R., Christensen, A. M., & Zeng, Y. (2010). Time-based reward shaping in real-time strategy games. In Proceedings of the 6th International Conference on Agents and Data Mining Interaction, ADMI’10 (pp. 115–125). Berlin, Heidelberg: Springer-Verlag.

  53. Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of 16th International Conference on Machine Learning.

  54. Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., & Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML (pp. 752–759).

  55. Petrik, M., Taylor, G., Parr, R., & Zilberstein, S. (2010). Feature selection using regularization in approximate linear programs for Markov decision processes. In ICML (pp. 871–878).

  56. Proper, S., & Tumer, K. (2012). Modeling difference rewards for multiagent learning (extended abstract). In AAMAS, Valencia, Spain.

  57. Randløv, J., & Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of 15th International Conference on Machine Learning.

  58. Rummery, G., & Niranjan, M. (1994). On-line q-learning using connectionist systems. Technical Report CUED/F-INFENG-RT 116, Engineering Department, Cambridge University, Cambridge.

  59. Saksida, L. M., Raymond, S. M., & Touretzky, D. S. (1997). Shaping robot behavior using principles from instrumental conditioning. Robotics and Autonomous Systems, 22(3–4), 231–249.

  60. van Seijen, H., Whiteson, S., & Kester, L. (2010). Switching between representations in reinforcement learning. In Interactive Collaborative Information Systems (pp. 65–84).

  61. Selfridge, O., Sutton, R. S., & Barto, A. G. (1985). Training and tracking in robotics. In Proceedings of Ninth International Joint Conference on Artificial Intelligence.

  62. Sherstov, A. A., & Stone, P. (2005). Improving action selection in MDP's via knowledge transfer. In Proceedings of the Twentieth National Conference on Artificial Intelligence.

  63. Singh, S., Lewis, R., & Barto, A. (2009). Where do rewards come from? In Proceedings of 31st Annual Conference of the Cognitive Science Society (pp. 2601–2606).

  64. Singh, S., & Sutton, R. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1), 123–158.

  65. Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), 323–339.

  66. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In ICML (pp. 284–292).

  67. Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. New York: Appleton-Century-Crofts.

  68. Snel, M., & Whiteson, S. (2010). Multi-task evolutionary shaping without pre-specified representations. In Genetic and Evolutionary Computation Conference (GECCO’10).

  69. Snel, M., & Whiteson, S. (2011). Multi-task reinforcement learning: Shaping and feature selection. In Proceedings of the European Workshop on Reinforcement Learning (EWRL).

  70. Sorg, J., & Singh, S. (2009). Transfer via soft homomorphisms. In Proceedings of 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009) (pp. 741–748).

  71. Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient structure learning in factored-state MDPs. In AAAI (pp. 645–650).

  72. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.

  73. Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: The MIT Press.

  74. Tanaka, F., & Yamamura, M. (2003). Multitask reinforcement learning on the distribution of MDPs. In Proceedings of 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2003) (pp. 1108–1113).

  75. Taylor, J., Precup, D., & Panangaden, P. (2009). Bounding performance loss in approximate MDP homomorphisms. In Koller, D., Schuurmans, D., Bengio, Y., & Bottou, L. (Eds.), Advances in Neural Information Processing Systems (Vol. 21, pp. 1649–1656).

  76. Taylor, M., & Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1), 1633–1685.

  77. Taylor, M., Stone, P., & Liu, Y. (2007). Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(1), 2125–2167.

  78. Taylor, M. E., Whiteson, S., & Stone, P. (2007). Transfer via inter-task mappings in policy search reinforcement learning. In AAMAS (p. 37).

  79. Thrun, S. (1995). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems (pp. 640–646).

  80. Torrey, L., Shavlik, J. W., Walker, T., & Maclin, R. (2010). Transfer learning via advice taking. In Advances in Machine Learning I (pp. 147–170). New York: Springer.

  81. Torrey, L., Walker, T., Shavlik, J. W., & Maclin, R. (2005). Using advice to transfer knowledge acquired in one reinforcement learning task to another. In Proceedings of the Sixteenth European Conference on Machine Learning (ECML 2005) (pp. 412–424).

  82. Vlassis, N., Littman, M. L., & Barber, D. (2011). On the computational complexity of stochastic controller optimization in POMDPs. CoRR abs/1107.3090.

  83. Walsh, T. J., Li, L., & Littman, M. L. (2006). Transferring state abstractions between MDPs. In ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning.

  84. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.

  85. Whitehead, S. D. (1991). A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings AAAI-91 (pp. 607–613).

  86. Whiteson, S., Tanner, B., Taylor, M. E., & Stone, P. (2011). Protecting against evaluation overfitting in empirical reinforcement learning. In ADPRL 2011: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (pp. 120–127).

  87. Wiewiora, E. (2003). Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19, 205–208.

  88. Wiewiora, E., Cottrell, G., & Elkan, C. (2003). Principled methods for advising reinforcement learning agents. In Proceedings of 20th International Conference on Machine Learning (pp. 792–799).

  89. Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007). Multi-task reinforcement learning: A hierarchical Bayesian approach. In ICML (pp. 1015–1022).


Acknowledgments

We thank George Konidaris, Hado van Hasselt, Eric Wiewiora, Lihong Li, Christos Dimitrakakis and Harm van Seijen for valuable discussions, and the anonymous reviewers for suggesting improvements to the original article.

Author information


Corresponding author

Correspondence to Matthijs Snel.

Appendices

Appendix 1: Glossary

Table 2 Glossary of terms

Appendix 2: Proof of Theorems in Sect. 4

As stated in Theorem 2, we are concerned with the case where the abstract Q-function is defined as in (18), i.e., the weighted average over state–action pairs in a given cluster. In this case, relevance equals the sum of weighted variances of the Q-values of ground state–action pairs corresponding to a given cluster \(\mathbf{y }\) (21). Before proving the theorems, we show how relevance can be rewritten as a sum of covariances between Q-functions.

The variance of a weighted sum of \(n\) correlated random variables equals the weighted sum of the pairwise covariances. We start by showing that any \(Q_c\) is a weighted sum of random variables (namely the Q-functions of each task in the sequence), and that therefore relevance can be written in terms of a weighted sum over covariances. Equation 17 is already a weighted sum, but we require a constant weight per random variable (task). Thus we rewrite (17) as

$$\begin{aligned} Q_c\left( \mathbf{x }^+\right) = \sum \limits _{i=1}^{k} \Pr (c_i|c) Q^c_{c_i}\left( \mathbf{x }^+\right) \end{aligned}$$
(26)
$$\begin{aligned} Q^c_{c_i}\left( \mathbf{x }^+\right) = \dfrac{\Pr \left( \mathbf{x }^+|c_i \right) }{\Pr (\mathbf{x }^+|c)} Q_{c_i}\left( \mathbf{x }^+\right) , \end{aligned}$$
(27)

where the last line is just a rescaling of \(Q_{c_i}\) depending on \(c\), and \(k = |c|\), the sequence length. Similarly, we define

$$\begin{aligned} Q^c_{\mathbf{y },c_i}\left( \mathbf{x }^+\right) = \dfrac{\Pr \left( \mathbf{x }^+|\mathbf{y },c_i\right) }{\Pr \left( \mathbf{x }^+|\mathbf{y },c \right) } Q_{c_i}\left( \mathbf{x }^+\right) , \end{aligned}$$
(28)

where \(\Pr (\mathbf{x }^+|\mathbf{y },c_i) = 0\) for all \(\mathbf{x }^+ \notin \mathbf{X }^\mathbf{y }_{c_i}\); that is, \(Q^c_{\mathbf{y },c_i}\) consists of the values of \(Q^c_{c_i}\) restricted to the domain \(\mathbf{X }^\mathbf{y }_{c_i}\).

For ease of notation, we write \({{\mathrm{{\mathrm{Var}}}}}(Q_c)\) for the variance \({{\mathrm{{\mathrm{Var}}}}}(Q_c(\mathbf{X }^+_c))\), leaving the domain implicit, and similarly for the covariance. Note that \(\Pr (c_i|c) = 1/k\). Then relevance (21) can be written as

$$\begin{aligned} \rho (\phi ,Q_c)&= \sum \limits _{ \mathbf{y } \in \mathbf{Y }_c } \Pr (\mathbf{y }|c) {{\mathrm{{\mathrm{Var}}}}}\left( \dfrac{1}{k} \sum \limits _{i=1}^{k} Q^c_{\mathbf{y },c_i} \right) \nonumber \\&= \dfrac{1}{k^2}\sum \limits _{ \mathbf{y } \in \mathbf{Y }_c } \Pr (\mathbf{y }|c) \sum \limits _{i=1}^{k} \sum \limits _{j=1}^{k} {{\mathrm{{\mathrm{Cov}}}}}\left( Q^c_{\mathbf{y },c_i},Q^c_{\mathbf{y },c_j} \right) \end{aligned}$$
(29)
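The step from the first to the second line of (29) uses the standard identity for the variance of a weighted sum of correlated random variables, here with equal weights \(w_i = 1/k\):

$$\begin{aligned} \mathrm{Var}\left( \sum \limits _{i=1}^{k} w_i X_i \right) = \sum \limits _{i=1}^{k} \sum \limits _{j=1}^{k} w_i w_j \, \mathrm{Cov}\left( X_i, X_j \right) , \end{aligned}$$

so that the common factor \(w_i w_j = 1/k^2\) can be moved outside the double sum.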

To see how relevance changes from one sequence to the next, we need to know how the covariance between two given tasks changes. For this purpose it is easier to write the covariance as \({{\mathrm{{\mathrm{Cov}}}}}(X_i,X_j) = {{\mathrm{{\mathrm{E}}}}}[X_i X_j] - {{\mathrm{{\mathrm{E}}}}}[X_i]{{\mathrm{{\mathrm{E}}}}}[X_j]\); then all we need to do is quantify both expectations. Let \((c,m)\) be the new sequence formed by appending a task \(m \in \mathsf{M }\) to a given sequence \(c\). In the following, for ease of notation, assume an abstraction that leaves the original Q-function intact, i.e. \(Q^c_{\mathbf{y },c_i} = Q^c_{c_i}\). The results can be extended to general abstractions by substituting \(\mathbf{X }_c = \mathbf{X }^\mathbf{y }_c\), \(\mathbf{X }_m = \mathbf{X }^\mathbf{y }_m\), \(\Pr (\mathbf{x }^+ | c) = \Pr (\mathbf{x }^+ | \mathbf{y }, c)\), and \(\Pr (\mathbf{x }^+|m) = \Pr (\mathbf{x }^+|\mathbf{y },m)\).

Lemma 1

The expected value of a given Q-function for a given task \(c_i\) in any sequence \(c\) is the same as the expected value of the original Q-function on \(c_i\). That is,

$$\begin{aligned} {{\mathrm{{\mathrm{E}}}}}\left[ Q^c_{c_i}\right] = {{\mathrm{{\mathrm{E}}}}}\left[ Q_{c_i}\right] . \end{aligned}$$
(30)

Proof

The expected value of any \(Q^c_{c_i}\) is

$$\begin{aligned} {{\mathrm{{\mathrm{E}}}}}[Q^c_{c_i}]&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_c} \Pr \left( \mathbf{x }^+ | c \right) Q^c_{c_i}\left( \mathbf{x }^+\right) \\&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_{c_i}} \Pr \left( \mathbf{x }^+ | c \right) Q^c_{c_i}\left( \mathbf{x }^+\right) + \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_c \setminus \mathbf{X }^+_{c_i}} \Pr \left( \mathbf{x }^+ | c\right) Q^c_{c_i}\left( \mathbf{x }^+\right) \\&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_{c_i}} \Pr \left( \mathbf{x }^+ | c\right) \dfrac{\Pr \left( \mathbf{x }^+|c_i \right) }{\Pr \left( \mathbf{x }^+|c\right) }Q_{c_i}\left( \mathbf{x }^+\right) \qquad \text {(Since } Q^c_{c_i} = 0 \; \forall \mathbf{x }^+ \notin \mathbf{X }^+_{c_i} \text {)} \\&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_{c_i}} \Pr \left( \mathbf{x }^+|c_i \right) Q_{c_i}\left( \mathbf{x }^+\right) = {{\mathrm{{\mathrm{E}}}}}[Q_{c_i}]. \end{aligned}$$

\(\square \)

To put bounds on the change in \({{\mathrm{{\mathrm{E}}}}}[Q^c_{c_i}Q^c_{c_j}]\), we have

Lemma 2

For a given sequence \(c\) and new sequence \((c,m)\) formed by appending a task \(m \in \mathsf{M }\) to \(c\),

$$\begin{aligned} 0 \le \left| {{\mathrm{{\mathrm{E}}}}}\left[ Q^{(c,m)}_{c_i}Q^{(c,m)}_{c_j}\right] \right| \le \dfrac{k+1}{k} \left| {{\mathrm{{\mathrm{E}}}}}\left[ Q^c_{c_i}Q^c_{c_j}\right] \right| , \end{aligned}$$

where \(|\cdot |\) denotes absolute value.

Proof

Let \(Q_{i\cdot j}\left( \mathbf{x }^+\right) = Q_{c_i}\left( \mathbf{x }^+\right) Q_{c_j}\left( \mathbf{x }^+\right) \). Then

$$\begin{aligned} {{\mathrm{{\mathrm{E}}}}}[Q^c_{c_i} Q^c_{c_j}]&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_c} \Pr \left( \mathbf{x }^+|c\right) Q^c_{c_i}\left( \mathbf{x }^+\right) Q^c_{c_j}\left( \mathbf{x }^+\right) \\&= \sum \limits _{\mathbf{x }^+ \in \mathbf{X }^+_c} \dfrac{\Pr \left( \mathbf{x }^+|c_i \right) \Pr \left( \mathbf{x }^+|c_j \right) }{\Pr \left( \mathbf{x }^+|c\right) } Q_{i\cdot j}\left( \mathbf{x }^+\right) . \end{aligned}$$

For a given task pair, the only quantity that changes from one sequence to the next is \(\Pr (\mathbf{x }^+|c)\). Let \(f^c_{i,j}(\mathbf{x }^+) = \Pr (\mathbf{x }^+|c_i) \Pr (\mathbf{x }^+|c_j) / \Pr (\mathbf{x }^+|c)\), and recall that, for a sequence of length \(k\), \(\Pr (\mathbf{x }^+|c) = 1/k \sum ^k_{i=1} \Pr (\mathbf{x }^+|c_i)\). Therefore, on a new sequence \((c,m)\):

$$\begin{aligned} f^{(c,m)}_{i,j}\left( \mathbf{x }^+\right)&= \dfrac{\Pr \left( \mathbf{x }^+|c_i \right) \Pr \left( \mathbf{x }^+|c_j \right) }{\left( \sum \nolimits _{i=1}^{k}\Pr \left( \mathbf{x }^+|c_i \right) + \Pr \left( \mathbf{x }^+|m\right) \right) \Big /(k+1)} \\&= \dfrac{(k+1)\Pr \left( \mathbf{x }^+|c_i \right) \Pr \left( \mathbf{x }^+|c_j\right) }{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|m\right) }. \end{aligned}$$

Taking the ratio of \(f^{(c,m)}_{i,j}\) and \(f^c_{i,j}\):

$$\begin{aligned} f^{(c,m)}_{i,j}\left( \mathbf{x }^+\right) \big / f^c_{i,j}\left( \mathbf{x }^+\right)&= \dfrac{(k+1)\Pr \left( \mathbf{x }^+|c_i\right) \Pr \left( \mathbf{x }^+|c_j\right) }{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|m\right) } \times \dfrac{\Pr \left( \mathbf{x }^+|c\right) }{\Pr \left( \mathbf{x }^+|c_i\right) \Pr (\mathbf{x }^+|c_j)} \nonumber \\&= \dfrac{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|c\right) }{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|m\right) }. \end{aligned}$$
(31)

If \(\Pr (\mathbf{x }^+|m)\) is larger (smaller) than \(\Pr (\mathbf{x }^+|c)\), this ratio is smaller (larger) than one. It is largest when \(\Pr (\mathbf{x }^+|m) = 0\), namely \((k+1)/k\), and at its smallest it is

$$\begin{aligned} \lim _{\Pr \left( \mathbf{x }^+|c\right) \downarrow 0} \dfrac{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|c\right) }{k\Pr \left( \mathbf{x }^+|c\right) + \Pr \left( \mathbf{x }^+|m\right) } = 0. \end{aligned}$$

Since \(Q_{i\cdot j}(\mathbf{x }^+)\) is constant from one sequence to the next, this leads to the bounds as stated in the lemma.\(\square \)

Note that the lower bound in particular is quite loose, since \(\Pr (\mathbf{x }^+|c)\) will usually not be that close to zero; however, it is sufficient for our present purposes. Given these facts, the proof of Theorem 2 readily follows.

Theorem 2

Let \(\phi \) be an abstraction with abstract Q-function as in Definition 2, and let \(\rho _k = \rho _k(\phi )\) for any \(k\). Let \(d(x,y) = |x - y|\) be a metric on \(\mathbb R \), and let \(f(\rho _k) = \rho _{k+1}\) map \(k\)-relevance to \(k+1\)-relevance. Then \(f\) is a contraction; that is, for \(k > 1\) there is a constant \(\kappa \in (0,1]\) such that

$$\begin{aligned} d\left( f(\rho _k),\,f(\rho _{k-1})\right) \le \kappa \, d\left( \rho _k,\rho _{k-1}\right) . \end{aligned}$$

Furthermore, if \(d(\rho _2,\rho _1) \ne 0\), then \(f\) is a strict contraction, i.e. there is a \(\kappa \in (0,1)\) such that the above holds.

Proof

We need to show that \(|\rho _{k+1} - \rho _k| \le |\rho _k - \rho _{k-1}|\) for any \(k > 1\), with strict inequality when \(|\rho _2 - \rho _1| \ne 0\). The relevance of a given sequence consists of the sum of the elements of the covariance matrix for that sequence, where each element has weight \(1/k^2\). As illustrated in Fig. 1, from one sequence \(c\) to a new sequence \((c,m)\), the ratio of additional covariances formed by the new task \(m\) with the tasks already present in \(c\) is \((2k - 1) / k^2\) and thus rapidly decreases with \(k\). The same figure also shows that the change in relevance is caused by two factors: the expansion of the covariance matrices as sequence length increases coupled with the change in sequence probability, and the change in the covariance between a given task pair from one sequence to the next. Suppose that the covariance of any task pair does not change from one sequence to the next. Then clearly, since the ratio of new covariance matrix elements changes with \(k\) as \((2k - 1) / k^2\) and in addition the probability of all new sequences \((c,m)\) formed from a given \(c\) sums up to the probability of \(c\), \(|\rho _{k+1} - \rho _k| \le |\rho _k - \rho _{k-1}|\) for any \(k > 1\).

Now suppose that covariances do change from one sequence to the next. As Lemmas 1 and 2 show, the maximum change in covariance from any sequence \(c\) of length \(k\) to the next is \((k+1){{\mathrm{{\mathrm{Cov}}}}}(Q^c_{c_i},\,Q^c_{c_j})/k\) for any \(i\) and \(j\). This change also decreases with \(k\), and therefore \(|\rho _{k+1} - \rho _k| \le |\rho _k - \rho _{k-1}|\) for any \(k > 1\). If \(|\rho _{2} - \rho _{1}| = 0\), then by this property the difference must stay 0 and \(|\rho _{k+1} - \rho _k| = 0\) for any \(k\). In all other cases, the change in relevance is a strict contraction, \(|\rho _{k+1} - \rho _k| < |\rho _k - \rho _{k-1}|\), by the above arguments.\(\square \)

In the following lemma, we distinguish between variances \({{\mathrm{{\mathrm{Cov}}}}}(Q^c_{c_i},Q^c_{c_j})\), \(c_i = c_j\) and covariances \({{\mathrm{{\mathrm{Cov}}}}}(Q^c_{c_i},Q^c_{c_j})\), \(c_i \ne c_j\). For \(k=1\), \(k\)-relevance consists solely of variances.

Lemma 3

The ratio of the number of variances to the number of covariances decreases with \(k\). For a given sequence \(c\) of length \(k-1\), the ratio of new variances in all new sequences of length \(k\) formed from \(c\) is

$$\begin{aligned} \dfrac{2(k-1) + N}{N(2k - 1)} \end{aligned}$$
(32)

where \(N = |\mathsf{M }|\).

Proof

Let \(c\) be any sequence on a domain with \(N = |\mathsf{M }|\) tasks. Assume task \(m \in \{1,2, \ldots , N\}\) occurs \(o_m \in \{0,1, \ldots , k \}\) times in \(c\). Then any \(c\) can be represented by an \(N\)-dimensional vector \(\mathbf{o } = (o_1, o_2, \ldots , o_N)\). Note that for a given sample size \(k\), \(\sum \nolimits _i o_i = k\) for any \(c\). Lastly, denote by \(\sigma \) the set of elements in the last column and row of the covariance matrix; as shown in Fig. 1, these are the elements added from one sequence \(c\) to the next, \((c,m)\).

Now take any sequence \(c\) of length \(k-1\), with task counts in vector \(\mathbf{o }\). Form \(N\) new sequences of length \(k\), where each sequence is formed by adding a task from \(\mathsf{M }\) to \(c\). To see how the ratio of covariances changes between \(k-1\) and \(k\), all that matters is the ratio in \(\sigma \). For any new sequence formed by adding task \(m\) to sequence \(c\), there will be \(2o_m + 1\) variances in \(\sigma \). Hence, in total, taken over the \(N\) new sequences formed from \(c\), there will be \(2(o_1 + o_2 + \ldots + o_N) + N = 2(k-1) + N\) new variances. In total, taken over the \(N\) new sequences, there are \(N(2k - 1)\) covariances. So the ratio of variances in \(\sigma \) for a given sample size \(k\) is

$$\begin{aligned} \dfrac{2(k-1) + N}{N(2k - 1)} \end{aligned}$$
(33)

This ratio decreases with \(k\). Therefore the ratio of covariances increases with \(k\).\(\square \)
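As an illustrative sanity check of (33), not part of the original proof, the counts behind the ratio can be verified by direct enumeration; the sketch below assumes tasks are labelled \(0,\ldots ,N-1\) and treats \(\sigma \) as the last row and column of the covariance matrix.

```python
from itertools import product

def new_entry_counts(c, N):
    """For a sequence c of length k-1, count the variance entries and total entries
    added to the covariance matrix over all N one-task extensions (c, m)."""
    k = len(c) + 1
    variances = total = 0
    for m in range(N):                  # append each task m in turn
        seq = list(c) + [m]
        # new entries: last row and last column of the k x k covariance matrix
        new_pairs = [(k - 1, j) for j in range(k)] + [(i, k - 1) for i in range(k - 1)]
        total += len(new_pairs)         # 2k - 1 entries per extension
        variances += sum(seq[i] == seq[j] for i, j in new_pairs)
    return variances, total

# Check against 2(k-1) + N new variances and N(2k - 1) new entries in total,
# for every sequence of length k-1.
N, k = 3, 4
for c in product(range(N), repeat=k - 1):
    assert new_entry_counts(c, N) == (2 * (k - 1) + N, N * (2 * k - 1))
print("counts behind (33) confirmed for N = 3, k = 4")
```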

We can now prove Theorem 3.

Theorem 3

If all tasks share the same distribution over state–action pairs \(\Pr (\mathbf{x }^+|m)\), then \(\rho _k\) is a monotonically increasing or decreasing function of \(k\), or is constant.

Proof

The assumption of a single distribution over state–action pairs implies that covariances do not change from one sequence to the next. This follows from Lemma 1 and Eq. 31: since \(\Pr (\mathbf{x }^+|m) = \Pr (\mathbf{x }^+|c)\), the ratio resolves to one and \({{\mathrm{{\mathrm{E}}}}}\left[ Q^{(c,m)}_{c_i}Q^{(c,m)}_{c_j}\right] = {{\mathrm{{\mathrm{E}}}}}\left[ Q^c_{c_i}Q^c_{c_j}\right] \).

The rest of the proof is by cases. Given \(k=1\), \(\rho _{k+1}\) can either be smaller than, greater than, or equal to \(\rho _k\).

  • Case 1: \(\rho _2 < \rho _1\). Since \(\rho _2 < \rho _1\), it follows that the expected value of a covariance is lower than that of a variance: \(\rho _1\) is made up of all possible variances in the domain, while \(\rho _2\) in addition consists of all possible covariances. Since covariances do not change from one \(k\) to the next, covariances must be lower on average. From Lemma 3, the ratio of covariances increases with \(k\). Within the covariances, the frequency of a given task pair does not change, and the same holds for the variances. Therefore, since covariances do not change with \(k\), \(\rho _k\) must get ever lower with \(k\), and \(\rho _k\) is monotonically decreasing with \(k\).

  • Case 2: \(\rho _2 > \rho _1\). By a similar argument to that for case 1, \(\rho _k\) is a monotonically increasing function of \(k\).

  • Case 3: \(\rho _2 = \rho _1\). Therefore \(|\rho _2 - \rho _1| = 0\), and \(|\rho _{k+1} - \rho _k|\) must stay 0 by Theorem 2, which shows that \(\rho _k\) is constant.\(\square \)

Appendix 3: Cross-task binary function approximator

This appendix derives an average cross-task linear function approximator from a set of linear function approximators per task, where approximators are assumed to have binary features. Let \(\mathbf{w }_m\) be the weight vector of the function approximator in task \(m\) and let \(Q_m(\mathbf{x }) = \mathbf{w }_m^{\text {T}}\mathbf{f }_\mathbf{x }\) be the Q-value of \(\mathbf{x }\). We wish to find

$$\begin{aligned} Q_d&= \mathop {{{\mathrm{{argmin}}}}}\limits _{Q_0} \left( \sum \limits _{m \in \mathsf{M }} \Pr (m) \sum \limits _{\mathbf{x }} \Pr (\mathbf{x }|m) \Big [ Q_m(\mathbf{x }) - Q_0(\mathbf{x }) \Big ]^2 \right) \end{aligned}$$
(34)
$$\begin{aligned}&= \mathop {{{\mathrm{{argmin}}}}}\limits _{\mathbf{w }_0} \left( \sum \limits _{m \in \mathsf{M }} \Pr (m) \sum \limits _{\mathbf{x }} \Pr (\mathbf{f }_\mathbf{x }|m) \Big [ (\mathbf{w }_m - \mathbf{w }_0)^{\text {T}}\mathbf{f }_\mathbf{x } \Big ]^2 \right) . \end{aligned}$$
(35)

Let

$$\begin{aligned} g(\mathbf{w }_0) = \left( \sum \limits _{m \in \mathsf{M }} \Pr (m) \sum \limits _{\mathbf{x }} \Pr (\mathbf{f }_\mathbf{x }|m) \Big [ (\mathbf{w }_m - \mathbf{w }_0)^{\text {T}}\mathbf{f }_\mathbf{x } \Big ]^2 \right) \end{aligned}$$

Then with \(\mathbf{w }^i\) the weight of feature \(i\) and \(N\) features,

$$\begin{aligned} g\big (\mathbf{w }^i_0\big )&= \sum \limits _{m \in \mathsf{M }} \Pr (m) \sum \limits _{\mathbf{x }} \Pr \!\big (\mathbf{f }_\mathbf{x }|m\big ) \Big [ \big (\mathbf{w }^i_m - \mathbf{w }^i_0\big )\mathbf{f }^i_\mathbf{x } \Big ]^2 \\&= \sum \limits _{m \in \mathsf{M }} \Pr (m) \big (\mathbf{w }^i_m - \mathbf{w }^i_0\big )^2 \sum \limits _{\{\mathbf{x }:\mathbf{f }^i_\mathbf{x }=1\}} \Pr \!\big (\mathbf{f }_\mathbf{x }|m\big ), \end{aligned}$$

which follows from the fact that \(\mathbf{f }^i_\mathbf{x }\) is either zero or one.

Let \(\Pr (\mathbf{f }^i_\mathbf{x })\) indicate \(\Pr (\mathbf{f }^i = \mathbf{f }^i_\mathbf{x })\). Furthermore, \(\Pr (\mathbf{f }_\mathbf{x }|m) = \Pr (\mathbf{f }^1_\mathbf{x }, \ldots , \mathbf{f }^N_\mathbf{x }|m)\), which equals \(\Pr (\mathbf{f }^1_\mathbf{x }|m)\Pr (\mathbf{f }^2_\mathbf{x }|\mathbf{f }^1_\mathbf{x },m)\ldots \Pr (\mathbf{f }^N_\mathbf{x }|\mathbf{f }^{N-1}_\mathbf{x },\ldots ,\mathbf{f }^1_\mathbf{x },m)\). So

$$\begin{aligned} \sum \limits _{\{\mathbf{x }:\mathbf{f }^1_\mathbf{x }=1\}} \Pr \!\big (\mathbf{f }_\mathbf{x }|m\big )&= \sum \limits _{\{\mathbf{x }:\mathbf{f }^1_\mathbf{x }=1\}} \Pr \!\big (\mathbf{f }^1_\mathbf{x }|m \big )\Pr \!\big (\mathbf{f }^2_\mathbf{x }|\mathbf{f }^1_\mathbf{x },m \big )\cdots \Pr \!\big (\mathbf{f }^N_\mathbf{x }|\mathbf{f }^{N-1}_\mathbf{x },\ldots ,\mathbf{f }^1_\mathbf{x },m \big ) \\&= \Pr \!\big (\mathbf{f }^1=1|m \big ) \sum \limits _{\mathbf{f }^2,\ldots ,\mathbf{f }^N} \Pr \!\big (\mathbf{f }^2|\mathbf{f }^1=1,m \big )\cdots \Pr \!\big (\mathbf{f }^N|\mathbf{f }^{N-1},\ldots ,\mathbf{f }^1=1,m \big ) \\&= \Pr \!\big (\mathbf{f }^1=1|m \big ). \end{aligned}$$

This holds for all features \(i\) and therefore, if we multiply \(g\) by \(1/2\) for convenience when taking the partial derivative,

$$\begin{aligned} g\big (\mathbf{w }^i_0\big )&= \dfrac{1}{2}\sum \limits _{m \in \mathsf{M }} \Pr \!\big (\mathbf{f }^i=1,m \big ) \big (\mathbf{w }^i_m - \mathbf{w }^i_0 \big )^2, \\ \frac{\partial g}{\partial \mathbf{w }^i_0}&= \sum \limits _{m \in \mathsf{M }} \Pr \!\big (\mathbf{f }^i=1,m \big ) \big (\mathbf{w }^i_0 - \mathbf{w }^i_m\big ). \end{aligned}$$

Setting this to zero gives

$$\begin{aligned} \sum \limits _{m \in \mathsf{M }} \Pr \!\big (\mathbf{f }^i=1,m \big ) \mathbf{w }^i_0&= \sum \limits _{m \in \mathsf{M }} \Pr \!\big (\mathbf{f }^i=1,m \big ) \mathbf{w }^i_m \\ \mathbf{w }^i_0 \Pr \!\big (\mathbf{f }^i=1\big )&= \sum \limits _{m \in \mathsf{M }} \Pr \!\big (\mathbf{f }^i=1,m \big ) \mathbf{w }^i_m, \\ \mathbf{w }^i_0&= \sum \limits _{m \in \mathsf{M }} \Pr \!\big (m|\mathbf{f }^i=1\big ) \mathbf{w }^i_m. \end{aligned}$$
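The closed form above says that the domain-level weight of a binary feature is the average of the per-task weights, weighted by \(\Pr (m|\mathbf{f }^i=1)\). Below is a minimal sketch of that computation; the task prior and the per-task feature-activation probabilities are assumed to be given (e.g., estimated from samples), which the derivation itself leaves open.

```python
import numpy as np

def cross_task_weights(task_weights, task_prior, p_feature_on):
    """Compute w_0 with w_0[i] = sum_m Pr(m | f^i = 1) * w_m[i].

    task_weights : (M, N) array, per-task weight vectors w_m over N binary features
    task_prior   : (M,)  array, Pr(m)
    p_feature_on : (M, N) array, Pr(f^i = 1 | m)
    """
    joint = task_prior[:, None] * p_feature_on            # Pr(f^i = 1, m)
    posterior = joint / joint.sum(axis=0, keepdims=True)  # Pr(m | f^i = 1)
    return (posterior * task_weights).sum(axis=0)         # weighted average per feature

# Toy usage with two tasks and three binary features (made-up numbers).
w = np.array([[1.0, 0.0, 2.0],
              [3.0, 4.0, 2.0]])
prior = np.array([0.5, 0.5])
p_on = np.array([[0.2, 0.5, 0.5],
                 [0.8, 0.5, 0.5]])
print(cross_task_weights(w, prior, p_on))  # feature 0 leans toward the second task's weight
```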


About this article

Cite this article

Snel, M., Whiteson, S. Learning potential functions and their representations for multi-task reinforcement learning. Auton Agent Multi-Agent Syst 28, 637–681 (2014). https://doi.org/10.1007/s10458-013-9235-z
