
WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance


Abstract

As an important branch of reinforcement learning, apprenticeship learning studies how an agent learns good behavioral decisions by observing an expert policy in the environment. It has achieved many encouraging breakthroughs in real-world applications. State abstraction is typically used to compress the state space of the environment and eliminate redundant information, thereby improving learning efficiency. However, excessive compression degrades decision performance, so it is important to balance the degree of compression and decision performance. Deterministic Information Bottleneck for State abstraction (DIBS) attempts to solve this problem. Specifically, DIBS first uses the information rate to represent the degree of compression. Decision performance after compression is then measured by the Kullback-Leibler (KL) divergence between the policy after state compression and the expert policy. However, if the two distributions do not have exactly overlapping support sets, the KL divergence is usually infinite, which leads to poor decision performance at low information rates. In this paper, we propose the Wasserstein DIBS (WDIBS) algorithm to optimize the trade-off between the degree of compression and decision performance. Specifically, we use the Wasserstein distance to measure the difference between the policy after state compression and the expert policy. Even if the two distributions do not have precisely overlapping support sets, the Wasserstein distance still reflects their actual difference, which ensures that WDIBS retains good decision performance at low information rates. Theoretical analyses and experiments demonstrate that our method provides a better trade-off between the degree of compression and decision performance than DIBS.


References

  1. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press, Cambridge

  2. Zhou W, Li W Safety-aware apprenticeship learning, Springer, Cham

  3. Lindenstrauss E, Tsukamoto M (2018) From rate distortion theory to metric mean dimension: variational principle. IEEE Trans Inf Theory 64(5):3590–3609

  4. Abel D (2020) A theory of abstraction in reinforcement learning. Brown University

  5. Lecarpentier E, Abel D, Asadi K, Jinnai Y, Rachelson E, Littman ML (2021) Lipschitz Lifelong Reinforcement Learning. In: Thirty-fifth AAAI conference on artificial intelligence, AAAI 2021, Virtual Event, February 2-9. AAAI Press, pp 8270–8278

  6. Jonsson A, Gómez V (2016) Hierarchical linearly-solvable Markov decision problems. In: Proceedings of the twenty-sixth international conference on automated planning and scheduling, ICAPS 2016, London, UK, June 12-17. AAAI Press, pp 193–201

  7. Menashe J, Stone P (2018) State abstraction synthesis for discrete models of continuous domains. In: AAAI spring symposia, Stanford University, Palo Alto, California, USA, March 26-28, 2018. AAAI Press

  8. Vezhnevets AS, Osindero S, Schaul T, Heess N, Jaderberg M, Silver D, Kavukcuoglu K (2017) Feudal networks for hierarchical reinforcement learning. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August. PMLR, pp 3540–3549

  9. Cai Z, Yang X, Huang T, Zhu W (2020) A new similarity combining reconstruction coefficient with pairwise distance for agglomerative clustering. Inform Sci 508:173–182

  10. Cai Z, Zhu W (2018) Multi-label feature selection via feature manifold learning and sparsity regularization. Int J Mach Learn Cybern 9(8):1321–1334

  11. Huang T, Wang S, Zhu W (2020) An adaptive kernelized rank-order distance for clustering non-spherical data with high noise. Int J Mach Learn Cybern 11:1735–1747

  12. Yang X, Cai Z, Li R, Zhu W (2020) GDPC: Generalized density peaks clustering algorithm based on order similarity. Int J Mach Learn Cybern

  13. Guo Z, Huang T, Cai Z, Zhu W (2018) A new local density for density peak clustering. In: 22nd Pacific-Asia conference on knowledge discovery and data mining, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018. Springer, pp 426–438

  14. Li R, Yang X, Qin X, Zhu W (2019) Local gap density for clustering high-dimensional data with varying densities. Knowl-Based Syst 184:104905

  15. Bai A, Srivastava S, Russell SJ (2016) Markovian state and action abstractions for MDPs via hierarchical MCTS. In: The 25th Morgan Kaufmann international joint conference on artificial intelligence, IJCAI 2016, New York, NY, USA, July 9-15, 2016, Morgan Kaufmann. pp 3029–3039

  16. Bellemare AG, Dabney W, Dadashi R, Taïga AA, Castro PS, Le Roux N, Schuurmans D, Lattimore T, Lyle C (2019a) A geometric perspective on optimal representations for reinforcement learning. In: Advances in neural information processing systems 32, NeurIPS 2019, December 8-14, Vancouver, BC, Canada, pp 4360–4371

  17. MacKay DJC, Mac Kay DJC (2003) Information theory, inference and learning algorithms. Cambridge university press, Cambridge

  18. Unal S, Wagner AB (2016) A rate-distortion approach to index coding. IEEE Trans Inf Theory 62(11):6359–6378

  19. Li Q, Chen Y (2020) Rate distortion via deep learning. IEEE Trans Commun 68(1):456–465

  20. Cheraghchi M, Ribeiro JL (2021) An Overview of capacity results for synchronization channels. IEEE Trans Inf Theory 67(6):3207–3232

  21. Goyal A, Islam R, Strouse D, Ahmed Z, Botvinick M, Larochelle H, Levine S, Bengio Y (2019) InfoBot: Transfer and exploration via the information bottleneck. In: 7th international conference on learning representations, New Orleans, USA, May 6-9

  22. Bacon P-L, Harb J, Precup D (2017) The option-critic architecture. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, February 4-9, 2017, San Francisco, California. AAAI Press, pp 1726–1734

  23. Puterman M (2014) Markov decision processes: Discrete stochastic dynamic programming. Wiley, Hoboken

  24. Sutton RS, Barto A (2018) Reinforcement learning: An introduction. MIT press, Cambridge

  25. Abel D, Arumugam D, Lehnert L, Littman ML (2018) State abstractions for lifelong reinforcement learning. In: The 35th ACM international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. ACM, pp 10–19

  26. Jothimurugan K, Bastani O, Alur R (2021) Abstract value iteration for hierarchical reinforcement learning. In: The 24th international conference on artificial intelligence and statistics, AISTATS 2021, April 13-15, Virtual Event. PMLR, pp 1162–1170

  27. Sutter D, Sutter T, Esfahani PM, Renner R (2015) Efficient approximation of channel capacities. IEEE Trans Inf Theory 61(4):1649–1666

  28. Strouse D, Schwab DJ (2017) The deterministic information bottleneck. Neural Comput 29 (6):1611–1630

  29. Nikolaidis S, Nath S, Procaccia AD, Srinivasa S (2017) Game-theoretic modeling of human adaptation in human-robot collaboration. In: ACM/IEEE international conference on human-robot interaction, HRI 2017, Vienna, Austria, March 6-9. ACM, pp 323–331

  30. Finn C, Levine S, Abbeel P (2016) Guided cost learning: Deep inverse optimal control via policy optimization. In: Proceedings of the 33rd international conference on machine learning, ICML 2016, New York City, NY, USA, June 19-24. JMLR, pp 49–58

  31. Kretzschmar H, Spies M, Sprunk C, Burgard W (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. Int J Robot Res 57(5):1289–1307

  32. Abel D, Arumugam D, Lehnert L, Littman ML (2016) Near optimal behavior via approximate state abstraction. In: The 33rd ACM international conference on machine learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. ACM, pp 10–19

  33. Abel D, Arumugam D, Lehnert L, Littman ML (2017) Toward good abstractions for lifelong learning. In: NeurIPS workshop on hierarchical reinforcement learning

  34. Peyré G, Cuturi M (2019) Computational optimal transport. Found Trends Mach Learn 11 (5-6):355–607

  35. Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: The 34th ACM international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. ACM, pp 214–223

  36. Lopez R, Regier J, Jordan MI, Yosef N (2018) Information constraints on auto-encoding variational Bayes. In: Advances in neural information processing systems 31, NeurIPS 2018, December 3-8, Montréal, Canada. MIT Press, pp 6117–6128

  37. Kim H, Mnih A (2018) Disentangling by factorising. In: The 35th ACM international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. ACM, pp 2654–2663

  38. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: The 33rd ACM International conference on machine learning, ICML 2016, New York City, NY, USA. ACM, pp 1928–1937

  39. Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: An evaluation platform for general agents. J Artif Intell Res 47:253–279

  40. Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: 3rd international conference on learning representations, San Diego, CA, USA, May 7-9

  41. Ozair S, Lynch C, Bengio Y, van den Oord A, Levine S, Sermanet P (2019) Wasserstein dependency measure for representation learning. In: Annual conference on neural information processing systems 32, NeurIPS 2019, Vancouver, BC, Canada, 8-14 December, 2019. MIT Press, pp 15604–15614

  42. Gelada C, Kumar S, Buckman J, Nachum O, Bellemare MG (2019) DeepMDP: Learning continuous latent space models for representation learning. In: The 36th ACM international conference on machine learning, ICML 2019, Long Beach, California, USA. ACM, pp 2170–2179

  43. Abel D, Arumugam D, Asadi K, Jinnai Y, Littman ML, Wong LLS (2019) State abstraction as compression in apprenticeship learning. In: The 33rd AAAI conference on artificial intelligence, AAAI 2019, Honolulu, Hawaii, USA. AAAI Press, pp 3134–3142

  44. Dupont E (2018) Learning disentangled joint continuous and discrete representations. In: Annual conference on neural information processing systems 31, NeurIPS 2018, Montréal, Canada, December 3-8, 2018, pages 708–718. MIT Press

  45. Devraj AM, Bušić A, Meyn S (2021) Fundamental design principles for reinforcement learning algorithms. In: Handbook of reinforcement learning and control. Springer, Cham, pp 75–137

  46. Sørensen RA, Nielsen M, Karstoft H (2020) Routing in congested baggage handling systems using deep reinforcement learning. IOS Press, vol 27, pp 139–152

  47. Rajeswaran A, Mordatch I, Kumar V (2020) A game theoretic framework for model based reinforcement learning. In: Proceedings of the 37th international conference on machine learning, ICML 2020, 13-18 July, Virtual Event. PMLR, pp 7953–7963

  48. Abel D, Umbanhowar N, Khetarpal K, Arumugam D, Precup D, Littman ML (2020) Value Preserving State-Action Abstractions. In: The 23rd international conference on artificial intelligence and statistics, AISTATS 2020, 26-28 August, Online [Palermo, Sicily, Italy]. PMLR, pp 1639–1650

  49. Hanten C, Katsuya F (2020) Reinforcement learning with convolutional reservoir computing. Appl Intell 50(8):2400–2410. Springer

  50. Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 50(8):2488–2502. Springer

  51. Pateria S, Subagdja B, Tan A, Quek C (2021) Hierarchical reinforcement learning: a comprehensive survey. ACM Comput Surv (CSUR) 54(5):1–35


Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120. We thank Maxine Garcia, PhD, from Liwen Bianji (Edanz) (www.liwenbianji.cn/) for editing the English text of a draft of this manuscript.

Author information

Corresponding author

Correspondence to William Zhu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

In this appendix, we prove the theorems and lemmas involved in the algorithms.

A.1 Proofs of Theorem 1

We first prove Theorem 1, for which we need two lemmas.

Usually, it is difficult to calculate the size of the abstract state space \(\left| {{S}_{\phi }} \right|\) in \(\mathcal {J}\). We therefore use the size \(\left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}\) appearing in \({{\hat {\mathcal {J}}}_{WDIB}}\) to approximate \(\left| {{S}_{\phi }} \right|\); it counts the non-negligible part of the alphabet under a PMF, as defined by the following formula.

Definition 3

PMF-used alphabet size: the PMF-used alphabet size of \(\mathcal {X}\) is the number of elements whose probability under p(x) exceeds some negligibility threshold \({{\delta }_{\min }}\in \left( 0,1 \right)\):

$$ \left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}:=\min \left\{ \left| \{x\in {\mathcal {X}}: p\left( x \right)>{{\delta }_{\min }}\} \right|,\left| \mathcal {X} \right| \right\}. $$
(27)
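
For concreteness, Definition 3 can be computed as in the following minimal Python sketch (the function name and the example PMF are illustrative choices, not part of the original work):

```python
import numpy as np

def pmf_used_alphabet_size(p, delta_min):
    """Number of symbols whose probability exceeds delta_min, capped at |X| (Eq. 27)."""
    p = np.asarray(p, dtype=float)
    return min(int(np.sum(p > delta_min)), p.size)

p = np.array([0.45, 0.30, 0.20, 0.03, 0.01, 0.01])   # PMF over a 6-symbol alphabet
print(pmf_used_alphabet_size(p, delta_min=0.05))      # -> 3: only 3 symbols are non-negligible
```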

The first lemma relates the entropy of a PMF to the maximum size of the alphabet used by that PMF [43]:

Lemma 1

Consider a discrete random variable X with alphabet \(\mathcal {X}\) and PMF p(x). For a given threshold \({{\delta }_{\min }}\in \left( 0,1 \right)\), the PMF-used alphabet size is bounded:

$$ \left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}\le \frac{H\left( X \right)}{{{\delta }_{\min }}{{\log }_{2}}\left( \frac{1}{{{\delta }_{\min }}} \right)}. $$
(28)

This bound is rather loose: since \(H\left( X \right)\le {{\log }_{2}}\left| \mathcal {X} \right|\) [3], in the worst case the bound can be \(\tilde {O}({1}/{{{\delta }_{\min }}})\) times larger than the true alphabet size. Nevertheless, it allows us to relate the entropy of a random variable to the size of the alphabet it uses. In addition, by definition, the entropy \(H\left( {{\rho }_{\phi }} \right)\) of the abstract state distribution lower-bounds the number of bits needed to represent the used part of \(S_{\phi}\). In this sense, entropy measures the degree of compression (a distribution concentrated on a single most probable state has entropy 0), so lower entropy means a higher degree of compression.
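
As an illustration, the bound in Lemma 1 can be checked numerically; the PMF and the threshold \({{\delta }_{\min }}\) below are illustrative choices:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p = np.array([0.25, 0.25, 0.25, 0.25])    # uniform PMF over 4 symbols, H(X) = 2 bits
delta_min = 0.05
used_size = int(np.sum(p > delta_min))    # PMF-used alphabet size, here 4
bound = entropy_bits(p) / (delta_min * np.log2(1.0 / delta_min))
print(used_size, bound)                   # 4 <= ~9.25: the bound holds but is loose
```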

Proof

The entropy of a random variable X is as follows:

$$ H\left( X \right):=-\sum\limits_{x\in \mathcal {X} }{p\left( x \right){{\log }_{2}}p\left( x \right)}. $$
(29)

Then, for a given maximum entropy \(H\left( X \right)\le N\), we seek to understand the largest possible \(\left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}\). That is:

$$ \underset{p\left( x \right):H\left( X \right)\le N}{\max } \left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}. $$
(30)

The used alphabet is largest when X is uniformly distributed with each element having probability \({{\delta }_{\min }}\), so that \(H\left( X \right)=N\) gives:

$$ \begin{array}{@{}rcl@{}} H\left( X \right)&=&-\sum\limits_{x\in \mathcal {X} }{p\left( x \right){{\log }_{2}}p\left( x \right)} \\ &=&-\sum\limits_{x\in \mathcal {X} }{{{\delta }_{\min }}{{\log }_{2}}{{\delta }_{\min}}} \\ &=&-\left| \mathcal {X} \right|{{\delta }_{\min }}{{\log }_{2}}{{\delta }_{\min }} \\ &=&\left| \mathcal {X} \right|{{\delta }_{\min }}{{\log }_{2}}\frac{1}{{{\delta }_{\min }}}. \end{array} $$
(31)

Therefore, for a given PMF p(x) with entropy \(H\left( X \right)\) and probability threshold \({{\delta }_{\min }}\), the size of the used alphabet \(\mathcal {X}\) is upper bounded:

$$ \left| \mathcal {X} \right|\le \frac{H\left( X \right)}{{{\delta }_{\min }}{{\log }_{2}}\frac{1}{{{\delta }_{\min }}}}. $$
(32)

Next, we introduce a second lemma that relates the expected Wasserstein distance between two policies to the difference in value realized by the policies, in expectation under some state distribution:

Lemma 2

We denote by \(\Pi\) the set of 1-Lipschitz policies. Let \({{\pi }_{1}}\) and \({{\pi }_{2}}\) be two stochastic policies in \(\Pi\) on state space S, and let p(s) be a fixed probability distribution over S. If, for some \(\varepsilon \in {{\mathbb {R}}_{\ge 0}}\) and every state s,

$$ \sum\limits_{a}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|}\le \varepsilon , $$
(33)

then:

$$ \begin{array}{ll} &\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma }, \\ &\underset{p\left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma }VMAX, \end{array} $$
(34)

where VMAX is an upper bound on the value function.

The above bound allows us to relate the distortion measure controlled by WDIB to the distortion measure of the CVA objective.

Proof

Recall that the Wasserstein distance between policies π1 and π2, in expectation over p(s), is given by:

$$ \begin{array}{ll} &\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] =\underset{s\in p\left( s \right)}{\sum}{\left| {{\rho }_{{{\pi }_{1}},{{s}_{0}}}}\left( s \right)-{{\rho }_{{{\pi }_{2}},{{s}_{0}}}}\left( s \right) \right|}, \end{array} $$
(35)

where \({{\rho }_{\pi ,{{s}_{0}}}}\) denotes the stationary distribution over states under policy π, starting in state \({{s}_{0}}\in S\).

We first bound the difference between the two state distributions after t steps:

$$ \begin{array}{@{}rcl@{}} &&\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right) \right|}\\ &=&\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t}}={s}^{\prime}\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t}}={s}^{\prime}\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &\le& \underset{s\in p\left( s \right)}{\sum}{p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)\underset{a}{\sum}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|}}\\ &&\times\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{p\left( {s}^{\prime}\left| s,a \right. \right)} \\ &&+\underset{s\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &&\times\underset{a}{\sum}{{{\pi }_{2}}\left( a\left| s \right. \right)\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{p\left( {s}^{\prime}\left| s,a \right. \right)}} \\ &\le& \varepsilon +\underset{s\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &=&\varepsilon +\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t-1}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t-1}\left( {{s}^{\prime}} \right) \right|}. \end{array} $$
(36)

From the above bound, we get the following inequality:

$$ \underset{{s}^{\prime}\in p\left( s \right)}{\sum} {\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right) \right|}\le t\varepsilon, $$
(37)

then:

$$ \begin{array}{@{}rcl@{}} &&\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] \\ &=&\underset{s\in p\left( s \right)}{\sum} {\left| {{\rho }_{{{\pi }_{1}},{{s}_{0}}}}\left( s \right)-{{\rho }_{{{\pi }_{2}},{{s}_{0}}}}\left( s \right) \right|} \\ &=&\underset{s\in p\left( s \right)}{\sum}\left| \left( 1-\gamma \right)\underset{t}{\sum}{{\gamma }^{t}}\rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( s \right)-\left( 1-\gamma \right)\right.\\ &&\times\left.\underset{t}{\sum}{{{\gamma }^{t}}\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( s \right)} \right| \\ &\le& \left( 1-\gamma \right)\underset{t}{\sum}{{{\gamma }^{t}}\underset{s}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( s \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( s \right) \right|}} \\ &\le& \left( 1-\gamma \right)\underset{t}{\sum}{{{\gamma }^{t}}t\varepsilon =\left( 1-\gamma \right)}\frac{\gamma \varepsilon }{{{\left( 1-\gamma \right)}^{2}}}=\frac{\gamma \varepsilon }{1-\gamma}. \end{array} $$
(38)
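
The last step of (38) uses the standard geometric-series identity, reproduced here for completeness:

$$ \sum\limits_{t\ge 0}{{{\gamma }^{t}}}=\frac{1}{1-\gamma }\ \Rightarrow \ \sum\limits_{t\ge 0}{t{{\gamma }^{t}}}=\frac{\gamma }{{{\left( 1-\gamma \right)}^{2}}}, $$

so that \(\left( 1-\gamma \right)\sum _{t}{{{\gamma }^{t}}t\varepsilon }=\frac{\gamma \varepsilon }{1-\gamma }\).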

The expectation of the value function loss is as follows:

$$ \begin{array}{@{}rcl@{}} &&\underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]} \\ &\le& \underset{s}{\sum}{p\left( s \right)}\left( \underset{a}{\sum}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|R\left( s,a \right)} \right) \\ &&+\underset{s}{\sum}{p\left( s \right)}\left( \gamma \underset{s^{\prime}}{\sum}{p\left( s,a,{s}^{\prime} \right)\left| {{V}^{{{\pi }_{1}}}}\left( {{s}^{\prime}} \right)-{{V}^{{{\pi }_{2}}}}\left( {{s}^{\prime}} \right) \right|} \right) \\ &=&\underset{s}{\sum}{\underset{a}{\sum}{p\left( s \right)}}\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right| \\ &&\left( R\left( s,a \right) + \gamma \underset{s^{\prime}}{\sum}{p\left( s,a,{s}^{\prime} \right)\left| {{V}^{{{\pi }_{1}}}}\left( {{s}^{\prime}} \right) - {{V}^{{{\pi }_{2}}}}\left( {{s}^{\prime}} \right) \right|} \right). \end{array} $$
(39)

Then, we apply the upper bound on the possible value \(VMAX=\frac {RMAX}{1-\gamma }\) to the above equation:

$$ \underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right) - {{V}^{{{\pi }_{2}}}}\left( s \right) \right]} \!\le\! VMAX \underset{p\left( s \right)}{\sum} {\left[ \left| {{\pi }_{1}}\left( a\left| s \right. \right) - {{\pi }_{2}}\left( a\left| s \right. \right) \right| \right]}. $$
(40)

Then, by Pinsker’s inequality, we conclude:

$$ \begin{array}{@{}rcl@{}} && \underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]}\\ &\le& VMAX \underset{p\left( s \right)}{\sum} {\left[ \left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right| \right]} \\ &\le& VMAX\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] \\ &\le& \frac{\gamma \varepsilon}{1-\gamma}VMAX. \end{array} $$
(41)

Now we use Lemma 1 and Lemma 2 to prove Theorem 1.

Theorem 1

The function f of the WDIB objective \({{\hat {\mathcal {J}}}_{WDIB}}\) gives an upper bound of the CVA objective \(\mathcal {J}\) that is tighter than the one given by \({{\hat {\mathcal {J}}}_{DIB}}\). That is:

$$ {{\forall }_{\phi }}: \mathcal{J}\left[ \phi \right]\le f\left( {{{\hat{\mathcal{J}}}}_{WDIB}}\left[ \phi \right] \right)\le f\left( {{{\hat{\mathcal{J}}}}_{DIB}}\left[ \phi \right] \right). $$
(42)

Proof

Consider the ϕ that minimizes \({{\hat {\mathcal {J}}}_{WDIB}}\), yielding a value of at most N + βε, where:

$$ \begin{array}{ll} &N:=H\left( {{S}_{\phi }} \right) \\ &\frac{\gamma \varepsilon }{1-\gamma }:=\underset{s\sim {{\rho }_{E}}}{\mathbb{E}} \left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \right]. \end{array} $$
(43)

Then, by Lemma 1, we know:

$$ \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}\le \frac{N}{{{\delta }_{\min }}{{\log }_{2}}\left( \frac{1}{{{\delta }_{\min }}} \right)}. $$
(44)

By Lemma 2, we know:

$$ \underset{\rho \left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{{\pi }_{\phi }}}}}\left( s \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma}VMAX. $$
(45)

Therefore, we conclude:

$$ \begin{array}{@{}rcl@{}} \mathcal{J}\left[ \phi \right]&=&\left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+\beta \underset{\rho \left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{{\pi }_{\phi }}}}}\left( s \right) \right] \\ &\le& \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+\beta VMAX\underset{\rho \left( s \right)}{\mathbb{E}} \\ &&\times\left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \right] \\ &\le& \frac{N}{{{\delta }_{\min }}{{\log }_{2}}\left( \frac{1}{{{\delta }_{\min }}} \right)}+\beta \frac{\varepsilon \gamma }{1-\gamma}VMAX. \end{array} $$
(46)

Thus, we obtain an upper bound of \(\mathcal {J}\) that can be seen as a function of \({{\hat {\mathcal {J}}}_{WDIB}}\). □

The DIBS algorithm uses the following function as the upper bound of \(\mathcal {J}\):

$$ \begin{array}{ll} &{{\hat{\mathcal{J}}}_{DIB}} \le \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+2\beta VMAX \\ &\qquad \qquad \underset{\rho \left( s \right)}{\mathbb{E}} \left[ \sqrt{\frac{1}{2}{{D}_{KL}}\!\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\!\left( a\left| \phi \left( s \right) \right. \!\right) \right. \right)} \right]. \end{array} $$
(47)

According to Pinsker’s inequality, the following inequality can be derived:

$$ \begin{array}{ll} &W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \!\le\! \sqrt{\frac{1}{2}{{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right. \right)}. \end{array} $$
(48)

Then,

$$ \mathcal{J}\left[ \phi \right]\le f\left( {{{\hat{\mathcal{J}}}}_{WDIB}}\left[ \phi \right] \right)\le f\left( {{{\hat{\mathcal{J}}}}_{DIB}}\left[ \phi \right] \right). $$
(49)

A.2 Extensions

Lastly, we briefly discuss the situation where the agent operates in a high-dimensional observation space. In this case, state abstractions are usually stochastic. We utilize a known result relating the KL divergence to mutual information via the variational autoencoder (VAE) [36]. When training a VAE, the KL divergence between \({{q}_{\psi }}\left (z\left | x \right . \right )\) and some prior distribution over latent codes, \(p\left (z \right )\), is minimized in expectation over the data distribution p(x); that is, \({\mathbb {E}_{p\left (x \right )}}\left [ {{D}_{KL}}\left ({{q}_{\psi }}\left (z\left | x \right . \right )\left \| p\left (z \right ) \right . \right ) \right ]\) is minimized. We treat x as the ground state representation S and z as the abstract state \(S_{\phi}\).

Proof

To prove that our objective function \({{{\hat {\mathcal {J}}}}_{WSIB}}\) is a tighter upper bound of \({{{\hat {\mathcal {J}}}}_{SIB}}\) than \({{{\hat {\mathcal {J}}}}_{SIB\_N}}\), we utilize a known result [44, 37], whose proof we replicate here for completeness, with \(q\left (z,x \right )=p\left (x \right ){{q}_{\psi }}\left (z\left | x \right . \right )\) and \(q\left (z \right )={\mathbb {E}_{p\left (x \right )}}\left [ {{q}_{\psi }}\left (z\left | x \right . \right ) \right ]\):

$$ \begin{array}{@{}rcl@{}} &&{\mathbb{E}_{p\left( x \right)}}\left[ {{D}_{KL}}\left( {{q}_{\psi }}\left( z\left| x \right. \right)||p\left( z \right) \right) \right] \\ &=&{\mathbb{E}_{p\left( x \right)}}\left[ {\mathbb{E}_{{{q}_{\psi }}\left( z\left| x \right. \right)}}\left[ \log \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{p\left( z \right)} \right] \right]\\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \left( \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{p\left( z \right)}\frac{q\left( z \right)}{q\left( z \right)} \right) \right]\\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{q\left( z \right)} \right]+{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{q\left( z \right)}{p\left( z \right)} \right] \\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{q\left( z,x \right)}{q\left( z \right)p\left( x \right)} \right]+{\mathbb{E}_{q\left( z \right)}}\left[ \log \frac{q\left( z \right)}{p\left( z \right)} \right] \\ &=&I\left( X;Z \right)+{{D}_{KL}}\left( q\left( z \right)||p\left( z \right) \right) \\ &\ge& I\left( X;Z \right). \end{array} $$
(50)
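
The decomposition in (50) can be checked numerically on a small discrete model; the distributions below are arbitrary illustrative choices, not quantities from the paper:

```python
import numpy as np

# Check: E_{p(x)}[KL(q(z|x) || p(z))] = I(X;Z) + KL(q(z) || p(z))
p_x = np.array([0.6, 0.4])                       # p(x)
q_z_given_x = np.array([[0.7, 0.2, 0.1],         # q(z|x=0)
                        [0.1, 0.3, 0.6]])        # q(z|x=1)
p_z = np.array([0.5, 0.25, 0.25])                # prior p(z)

q_z = p_x @ q_z_given_x                          # marginal q(z) = E_{p(x)}[q(z|x)]

lhs = np.sum(p_x[:, None] * q_z_given_x * np.log(q_z_given_x / p_z))
mutual_info = np.sum(p_x[:, None] * q_z_given_x * np.log(q_z_given_x / q_z))
kl_marginals = np.sum(q_z * np.log(q_z / p_z))

print(np.isclose(lhs, mutual_info + kl_marginals))   # True
```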

So the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} {{D}_{KL}}\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right)||p\left( {{S}_{\phi }} \right) \right)+ \\ &&\beta \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)||{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right] \end{array} $$
(51)
$$ \begin{array}{@{}rcl@{}} &\ge& \underset{\phi }{\min} I\left( S;{{S}_{\phi }} \right)+ \\ &&\beta \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right. \right) \right]. \end{array} $$
(52)

This formula is equivalent to the following form:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{\hat{\mathcal{J}}}_{SIB}}. $$
(53)

To use the Gaussian reparameterization technique, both \({{q}_{\psi }}\left (z\left | x \right . \right )\) and \(p\left (z \right )\) are usually assumed to be Gaussian [36]. Suppose that these distributions have dimension k, means \({{\mu }_{i}}\), and covariance matrices \({{\Sigma }_{i}}\), for i = 1, 2. The following two formulas can be obtained:

$$ \begin{array}{ll} \begin{aligned} &{{W}}{{\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)}}=\left\| {{\mu }_{1}}-{{\mu }_{2}} \right\|_{2}^{2}\\ &+tr\left( {{\Sigma }_{1}}+{{\Sigma }_{2}}-2{{\left( {\Sigma }_{2}^{{1}/{2} }{{\Sigma }_{1}}{\Sigma }_{2}^{{1}/{2} } \right)}^{{1}/{2} }} \right) \end{aligned} \end{array} $$
(54)

and

$$ \begin{array}{@{}rcl@{}} &&{{D}_{KL}}\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)=\frac{1}{2}tr\left( {\Sigma }_{1}^{-1}{{\Sigma }_{2}} \right) \\ &&+\frac{1}{2}\left( {{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}^{T}}{\Sigma }_{1}^{-1}\left( {{\mu }_{1}} - {{\mu }_{2}} \right)-k + \ln \left( \frac{\det {{\Sigma }_{1}}}{\det {{\Sigma }_{2}}} \right) \right). \end{array} $$
(55)

For simplicity, we consider \({{\Sigma }_{1}}={{\Sigma }_{2}}=w{{I}_{k}}\), where \(w\in \left[ 0,1 \right]\) is the weight and \({{\mu }_{1}}\ne {{\mu }_{2}}\). Under this choice, the trace term in the Wasserstein distance vanishes, and in the KL divergence the trace term, the \(-k\) term, and the log-determinant ratio cancel to 0. So these two equations become:

$$ {{W}}{{\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)}}=\left\| {{\mu }_{1}}-{{\mu }_{2}} \right\|_{2}^{2} $$
(56)

and

$$ {{D}_{KL}}\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right) = \frac{1}{2}{{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}^{T}}{\Sigma }_{1}^{-1}\left( {{\mu }_{1}} - {{\mu }_{2}} \right). $$
(57)
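
As a numerical check of these closed forms under \({{\Sigma }_{1}}={{\Sigma }_{2}}=w{{I}_{k}}\), a minimal sketch follows; the means, the values of w, and the helper names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Closed forms of Eqs. (54)-(55) specialised to Sigma_1 = Sigma_2 = w * I_k,
# where the matrix square roots are trivial.
def wasserstein_iso(mu1, mu2, w):
    k = mu1.size
    trace_term = w * k + w * k - 2.0 * w * k       # tr(S1 + S2 - 2(S2^{1/2} S1 S2^{1/2})^{1/2}) = 0
    return float(np.sum((mu1 - mu2) ** 2) + trace_term)

def kl_iso(mu1, mu2, w):
    k = mu1.size
    quad = np.sum((mu1 - mu2) ** 2) / w            # (mu1-mu2)^T Sigma_1^{-1} (mu1-mu2)
    return float(0.5 * k + 0.5 * (quad - k + 0.0)) # Eq. (55); log-det ratio is 0 here

mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for w in (0.1, 0.5, 1.0):
    print(w, wasserstein_iso(mu1, mu2, w), kl_iso(mu1, mu2, w))
# The Wasserstein term stays at ||mu1 - mu2||_2^2 = 2 for every w,
# while the KL term ||mu1 - mu2||_2^2 / (2w) changes with the variance w.
```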

Note that Wasserstein distance does not change if the variance changes whereas the KL divergence does. Therefore, the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} {{D}_{KL}}\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right)||p\left( {{S}_{\phi }} \right) \right) \\ &\ge& \underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} W\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right),p\left( {{S}_{\phi }} \right) \right). \end{array} $$
(58)

At the same time, according to Equation (46), the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)||{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right] \\ &\ge& \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right]. \end{array} $$
(59)

So the following inequality can be derived:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{{\hat{\mathcal{J}}}}_{WSIB}}. $$
(60)

Therefore, we conclude:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{{\hat{\mathcal{J}}}}_{WSIB}}\ge {{\hat{\mathcal{J}}}_{SIB}}. $$
(61)

About this article

Cite this article

Zhu, X., Huang, T., Zhang, R. et al. WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance. Appl Intell 52, 6316–6329 (2022). https://doi.org/10.1007/s10489-021-02787-4
