Deriving the Bellman Equation in Reinforcement Learning

32

I see the following equation in "Reinforcement Learning. An Introduction", but I don't quite follow the step I have highlighted in blue below. How exactly is this step derived?

expected-value reinforcement-learning

— Amelio Vazquez-Reina
source

7

This is the answer for everybody who wonders about the clean, structured math behind it (i.e. if you belong to the group of people that knows what a random variable is, and that you must show or assume that a random variable has a density, then this is the answer for you ;-)):

First of all we need the Markov Decision Process to have only a finite number of $L^1$-rewards, i.e. we need that there exists a finite set $E$ of densities, each belonging to an $L^1$ variable, i.e. $\int_{\mathbb{R}}x \cdot e(x) dx < \infty$ for all $e \in E$, and a map $F : A \times S \to E$ such that

$p(r_t|a_t, s_t) = F(a_t, s_t)(r_t)$

(i.e. in the automaton behind the MDP there may be infinitely many states, but there are only finitely many $L^1$-reward-distributions attached to the possibly infinite transitions between the states).

Theorem 1: Let $X \in L^1(\Omega)$ (i.e. an integrable real random variable) and let $Y$ be another random variable such that $X,Y$ have a common density. Then

$E[X|Y=y] = \int_\mathbb{R} x p(x|y) dx$

Proof: Essentially proven here by Stefan Hansen.

Theorem 2: Let $X \in L^1(\Omega)$ and let $Y,Z$ be further random variables such that $X,Y,Z$ have a common density. Then

$E[X|Y=y] = \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz$

where $\mathcal{Z}$ is the range of $Z$.

Proof:

$\begin{align*} E[X|Y=y] &= \int_{\mathbb{R}} x p(x|y) dx \\ &~~~~\text{(by Thm. 1)}\\ &= \int_{\mathbb{R}} x \frac{p(x,y)}{p(y)} dx \\ &= \int_{\mathbb{R}} x \frac{\int_{\mathcal{Z}} p(x,y,z) dz}{p(y)} dx \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x \frac{ p(x,y,z) }{p(y)} dx dz \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x p(x|y,z)p(z|y) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) \int_{\mathbb{R}} x p(x|y,z) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz \\ &~~~~\text{(by Thm. 1)} \end{align*}$
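As a sanity check on Theorem 2, here is a minimal sketch for the discrete case, where the integrals become finite sums. The joint pmf, its weights and the helper names (`prob`, `cond_exp_x`, `theorem2_rhs`) are made up for illustration. It says nothing about the existence of densities in the continuous case.

```python
from itertools import product

# Hypothetical joint pmf p(x, y, z) on small finite ranges (illustrative weights).
xs, ys, zs = [0.0, 1.0, 3.0], [0, 1], [0, 1, 2]
weights = {}
for i, (x, y, z) in enumerate(product(xs, ys, zs)):
    weights[(x, y, z)] = 1 + (7 * i) % 5  # arbitrary strictly positive weights
total = sum(weights.values())
p = {k: w / total for k, w in weights.items()}


def prob(y=None, z=None):
    """Probability of the event {Y=y, Z=z}; None means 'marginalised out'."""
    return sum(v for (x_, y_, z_), v in p.items()
               if (y is None or y_ == y) and (z is None or z_ == z))


def cond_exp_x(y, z=None):
    """E[X | Y=y] or E[X | Y=y, Z=z], computed directly from the joint pmf."""
    num = sum(x_ * v for (x_, y_, z_), v in p.items()
              if y_ == y and (z is None or z_ == z))
    return num / prob(y, z)


def theorem2_rhs(y):
    """sum_z p(z|y) * E[X | Y=y, Z=z], the right-hand side of Theorem 2."""
    return sum(prob(y, z) / prob(y) * cond_exp_x(y, z) for z in zs)


for y in ys:
    print(y, cond_exp_x(y), theorem2_rhs(y))
```

Both printed columns should agree up to floating-point error for every `y`.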

Put $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k}$ and put $G_t^{(K)} = \sum_{k=0}^K \gamma^k R_{t+k}$. Then one can show (using the fact that the MDP has only finitely many $L^1$-rewards) that $G_t^{(K)}$ converges, and since the function $\sum_{k=0}^\infty \gamma^k |R_{t+k}|$ is still in $L^1(\Omega)$ (i.e. integrable) one can also show (by using the usual combination of the theorems of monotone convergence and then dominated convergence on the defining equations for [the factorizations of] the conditional expectation) that

$\lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[G_t | S_t=s_t]$

Now one shows that

$E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1}^{(K-1)} | S_{t+1}=s_{t+1}] ds_{t+1}$

using $G_t^{(K)} = R_t + \gamma G_{t+1}^{(K-1)}$, Thm. 2 above, then Thm. 1 on $E[G_{t+1}^{(K-1)}|S_{t+1}=s', S_t=s_t]$, and then, using a straightforward marginalization war, one shows that $p(r_q|s_{t+1}, s_t) = p(r_q|s_{t+1})$ for all $q \geq t+1$. Now we need to apply the limit $K \to \infty$ to both sides of the equation. In order to pull the limit into the integral over the state space $S$ we need to make some additional assumptions:

Either the state space is finite (then $\int_S = \sum_S$ and the sum is finite), or all the rewards are positive (then we use monotone convergence), or all the rewards are negative (then we put a minus sign in front of the equation and use monotone convergence again), or all the rewards are bounded (then we use dominated convergence). Then (by applying $\lim_{K \to \infty}$ to both sides of the partial / finite Bellman equation above) we obtain

$E[G_t | S_t=s_t] = \lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1} | S_{t+1}=s_{t+1}] ds_{t+1}$

and then the rest is usual density manipulation.
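For the finite-state case (the first of the assumptions above), the limiting argument can be watched numerically. This is only a sketch. The three-state transition matrix `P`, the expected rewards `r`, the discount `gamma = 0.9` and the function names are assumptions made up for illustration, and the process is assumed time-homogeneous with the policy already folded into `P`.

```python
gamma = 0.9  # discount factor, assumed < 1
# P[s][k] = p(s_{t+1} = k | s_t = s); r[s] = E[R_t | S_t = s]. Illustrative numbers.
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
r = [1.0, 0.0, -2.0]


def backup(v):
    """One application of the partial Bellman equation: r + gamma * P v."""
    n = len(r)
    return [r[s] + gamma * sum(P[s][k] * v[k] for k in range(n)) for s in range(n)]


def truncated_value(K):
    """E[G_t^{(K)} | S_t = s] for every s, built up from K = 0 (where it equals r)."""
    v = list(r)
    for _ in range(K):
        v = backup(v)
    return v


v_limit = truncated_value(500)
# If the limit satisfies the Bellman equation, one more backup changes nothing.
residual = max(abs(a - b) for a, b in zip(v_limit, backup(v_limit)))
print(v_limit, residual)
```

With bounded rewards and `gamma < 1` the difference between successive truncations shrinks like `gamma ** K`, which is why the residual is at the level of rounding error here.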

REMARK: Even in very simple tasks the state space can be infinite! One example would be the 'balancing a pole' task. The state is essentially the angle of the pole (a value in $[0, 2\pi)$, an uncountably infinite set!)

REMARK: People may comment 'duh, this proof can be shortened much more if you just use the density of $G_t$ directly and show that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$' ... BUT ... my questions would be:

How come that you even know that $G_{t+1}$ has a density?
How come that you even know that $G_{t+1}$ has a common density together with $S_{t+1}, S_t$?
How do you infer that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? This is not only the Markov property: the Markov property only tells you something about the marginal distributions, but these do not necessarily determine the whole distribution, see e.g. multivariate Gaussians!

— Fabian Werner
source

10

Let the total sum of discounted rewards after time $t$ be:
$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...$

The utility value of starting in state $s$ at time $t$ is equivalent to the expected sum of discounted rewards $R$ of executing policy $\pi$ starting from state $s$ onwards.
$U_\pi(S_t=s) = E_\pi[G_t|S_t = s]$
$\\ = E_\pi[(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...)|S_t = s]$ By definition of $G_t$
$= E_\pi[(R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3}+...))|S_t = s]$
$= E_\pi[(R_{t+1}+\gamma (G_{t+1}))|S_t = s]$
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[ G_{t+1}|S_t = s]$ By law of linearity
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[E_\pi(G_{t+1}|S_{t+1} = s')|S_t = s]$ By law of Total Expectation
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[U_\pi(S_{t+1}= s')|S_t = s]$ By definition of $U_\pi$
$= E_\pi[R_{t+1} + \gamma U_\pi(S_{t+1}= s')|S_t = s]$ By law of linearity

Assuming that the process satisfies the Markov property:
Probability $Pr$ of ending up in state $s'$ having started from state $s$ and taken action $a$,
$Pr(s'|s,a) = Pr(S_{t+1} = s' | S_t=s,A_t = a)$
and reward $R$ of ending up in state $s'$ having started from state $s$ and taken action $a$,
$R(s,a,s') = E[R_{t+1}|S_t = s, A_t = a, S_{t+1}= s']$

Therefore we can re-write the above utility equation as
$U_\pi(S_t=s) = \sum_a \pi(a|s) \sum_{s'} Pr(s'|s,a)[R(s,a,s')+ \gamma U_\pi(S_{t+1}=s')]$

Where $\pi(a|s)$ is the probability of taking action $a$ when in state $s$ for a stochastic policy. For deterministic policy, $\sum_a \pi(a|s)= 1$
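To make the final utility equation concrete, here is a small sketch that applies it repeatedly as an update rule until it stops changing. The two-state MDP, its numbers and the names `Pr`, `Rw`, `backup` and `evaluate` are hypothetical choices for illustration only. The deterministic policy is written as a stochastic one that puts weight 1 on a single action.

```python
gamma = 0.5
states, actions = ["A", "B"], ["stay", "move"]
# Pr[(s, a)][s2] = Pr(s2 | s, a); Rw[(s, a, s2)] = R(s, a, s2). Illustrative numbers.
Pr = {("A", "stay"): {"A": 0.9, "B": 0.1}, ("A", "move"): {"A": 0.2, "B": 0.8},
      ("B", "stay"): {"A": 0.0, "B": 1.0}, ("B", "move"): {"A": 0.7, "B": 0.3}}
Rw = {(s, a, s2): (1.0 if s2 == "B" else 0.0) - (0.1 if a == "move" else 0.0)
      for s in states for a in actions for s2 in states}


def backup(U, pi):
    """Right-hand side of the utility equation, evaluated for every state."""
    return {s: sum(pi[s][a] * sum(Pr[(s, a)][s2] * (Rw[(s, a, s2)] + gamma * U[s2])
                                  for s2 in states)
                   for a in actions)
            for s in states}


def evaluate(pi, sweeps=200):
    """Iterate the backup from U = 0; with gamma < 1 this settles at a fixed point."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = backup(U, pi)
    return U


stochastic = {"A": {"stay": 0.5, "move": 0.5}, "B": {"stay": 0.5, "move": 0.5}}
deterministic = {"A": {"stay": 0.0, "move": 1.0}, "B": {"stay": 1.0, "move": 0.0}}
U_stoch = evaluate(stochastic)
U_det = evaluate(deterministic)
print(U_stoch, U_det)
```

For the deterministic policy only one term of the sum over actions survives in each state, so the fixed point can also be solved by hand: $U(B) = 1 + 0.5\,U(B)$ gives $U(B) = 2$.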

— Ntabgoba
source

Just a few notes: the sum over $\pi$ is equal to 1 even in a stochastic policy, but in a deterministic policy there is just one action that receives the full weight (i.e., $\pi(a|s) = 1$) and the rest receive 0 weight, so that term is removed from the equation. Also, in the line where you used the law of total expectation, the order of the conditionals is reversed.

— Gilad Peleg

1

I am pretty sure that this answer is incorrect: let us follow the equations just until the line involving the law of total expectation. Then the left-hand side does not depend on $s'$ while the right-hand side does... I.e. if the equations are correct, then for which $s'$ are they correct? You must have some kind of integral over $s'$ already at that stage. The reason is probably your misunderstanding of the difference between $E[X|Y]$ (a random variable) and its factorization $E[X|Y=y]$ (a deterministic function!)...

— Fabian Werner

@FabianWerner I agree this is not correct. The answer from Jie Shi is the right answer.

— teucer

@teucer This answer can be fixed because there is just some "symmetrization" missing, i.e. $E[A|C=c] = \int_{\text{range}(B)} p(b|c) E[A|B=b, C=c] dP_B(b)$, but still, the question is the same as in Jie Shi's answer: why is $E[G_{t+1}|S_{t+1}=s_{t+1}, S_t=s_t] = E[G_{t+1}|S_{t+1}=s_{t+1}]$? This is not only the Markov property, because $G_{t+1}$ is a really complicated RV: does it even converge? If so, where? What is the common density $p(g_{t+1}, s_{t+1}, s_t)$? We know this expression only for finite sums (complicated convolution), but for the infinite case?

— Fabian Werner

@FabianWerner, not sure if I can answer all the questions. Below some pointers. For the convergence of $G_{t+1}$, given that it is the sum of discounted rewards, it is reasonable to assume that the series converges (the discounting factor is $<1$, and where it converges to does not really matter). I am not concerned with the density (one can always define a joint density as long as we have random variables); it only matters whether it is well defined, and in that case it is.

— teucer

8

Here is my proof. It is based on the manipulation of conditional distributions, which makes it easier to follow. Hope this one helps you.

$\begin{align} v_{\pi}(s)&=E{\left[G_t|S_t=s\right]} \nonumber \\ &=E{\left[R_{t+1}+\gamma G_{t+1}|S_t=s\right]} \nonumber \\ &= \sum_{s'}\sum_{r}\sum_{g_{t+1}}\sum_{a}p(s',r,g_{t+1}, a|s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r,g_{t+1} |a, s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r|a, s)p(g_{t+1}|s', r, a, s)(r+\gamma g_{t+1}) \nonumber \\ &\text{Note that $p(g_{t+1}|s', r, a, s)=p(g_{t+1}|s')$ by assumption of MDP} \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\sum_{g_{t+1}}p(g_{t+1}|s')(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)(r+\gamma\sum_{g_{t+1}}p(g_{t+1}|s')g_{t+1}) \nonumber \\ &=\sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\left(r+\gamma v_{\pi}(s')\right) \label{eq2} \end{align}$ This is the famous Bellman equation.
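The step $p(g_{t+1}|s', r, a, s)=p(g_{t+1}|s')$ can at least be checked by brute force when the return is a finite sum over a finite MDP. The sketch below does that. The two-state MDP, the policy, the two-step truncation of $G_{t+1}$ and all function names are assumptions made up for illustration, and the check does not settle the infinite-horizon case.

```python
from collections import defaultdict

gamma = 0.9
# p[(s, a)] lists (next_state, reward, probability); pi[s][a] is the policy. Illustrative numbers.
p = {(0, 0): [(0, 0.0, 0.7), (1, 1.0, 0.3)], (0, 1): [(1, 0.0, 0.6), (1, 2.0, 0.4)],
     (1, 0): [(0, 1.0, 0.5), (1, 0.0, 0.5)], (1, 1): [(0, 2.0, 0.2), (1, 1.0, 0.8)]}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}}


def trajectories(s, steps):
    """Yield (probability, [(a, r, s_next), ...]) for every path of `steps` transitions from s."""
    if steps == 0:
        yield 1.0, []
        return
    for a, pa in pi[s].items():
        for s2, r, pr in p[(s, a)]:
            for q, tail in trajectories(s2, steps - 1):
                yield pa * pr * q, [(a, r, s2)] + tail


def discounted(path):
    """Discounted sum of the rewards along a path."""
    return sum(gamma ** k * r for k, (_, r, _) in enumerate(path))


def return_pmf_given_first_step(s, steps):
    """pmf of the truncated G_{t+1} given (S_t=s, A_t=a, R_{t+1}=r, S_{t+1}=s')."""
    joint = defaultdict(lambda: defaultdict(float))
    for q, path in trajectories(s, steps + 1):
        joint[path[0]][round(discounted(path[1:]), 9)] += q
    return {k: {g: w / sum(d.values()) for g, w in d.items()} for k, d in joint.items()}


def return_pmf_given_state(s2, steps):
    """pmf of the truncated G_{t+1} given S_{t+1}=s' only."""
    pmf = defaultdict(float)
    for q, path in trajectories(s2, steps):
        pmf[round(discounted(path), 9)] += q
    return dict(pmf)


for (a, r, s2), pmf in return_pmf_given_first_step(0, 2).items():
    print((a, r, s2), pmf == return_pmf_given_state(s2, 2) or "equal up to rounding")
```

Conditioning on the whole first step $(s, a, r, s')$ and conditioning on $s'$ alone give the same pmf here because the enumeration probabilities factor along the trajectory.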

— Jie Shi
source

Could you explain this comment 'Note that ...' a little more? Why do these random variables $G_{t+1}$ and the state and action variables even have a common density? If so, how do you know this property that you are using? I can see that it is true for a finite sum, but if the random variable is a limit... ???

— Fabian Werner

To Fabian: First let's recall what $G_{t+1}$ is. $G_{t+1}=R_{t+2}+\gamma R_{t+3}+\cdots$. Note that $R_{t+2}$ only directly depends on $S_{t+1}$ and $A_{t+1}$, since $p(s', r|s, a)$ captures all the transition information of an MDP (more precisely, $R_{t+2}$ is independent of all states, actions, and rewards before time $t+1$ given $S_{t+1}$ and $A_{t+1}$). Similarly, $R_{t+3}$ only depends on $S_{t+2}$ and $A_{t+2}$. As a result, $G_{t+1}$ is independent of $S_t$, $A_t$, and $R_t$ given $S_{t+1}$, which explains that line.

— Jie Shi

Sorry, that only 'motivates' it, it doesn't actually explain anything. For example: what is the density of $G_{t+1}$? Why are you sure that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? Why do these random variables have a common density? You know that a sum transforms into a convolution in densities, so what... $G_{t+1}$ should have an infinite amount of integrals in the density??? There is absolutely no candidate for the density!

— Fabian Werner

To Fabian: I don't get your question. 1. You want the exact form of the marginal distribution $p(g_{t+1})$? I don't know it and we don't need it in this proof. 2. Why $p(g_{t+1}|s_{t+1}, s_t)=p(g_{t+1}|s_{t+1})$? Because, as I mentioned earlier, $g_{t+1}$ and $s_t$ are independent given $s_{t+1}$. 3. What do you mean by "common density"? You mean joint distribution? You want to know why these random variables have a joint distribution? All random variables in this universe can have a joint distribution. If this is your question, I would suggest you find a probability theory book and read it.

— Jie Shi,

Let us continue this discussion in chat.

— Fabian Werner

2

What's with the following approach?

$\begin{align} v_\pi(s) & = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \cdot \,\\ & \qquad \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s, A_{t} = a, S_{t+1} = s', R_{t+1} = r\right] \\ & = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]. \end{align}$

The sums are introduced in order to retrieve $a$, $s'$ and $r$ from $s$. After all, there may be several possible actions and possible next states. With these extra conditions, the linearity of the expectation leads to the result almost directly.

I am not sure how rigorous my argument is mathematically, though. I am open to improvements.

— Bay Tsjolder
source

The last line only works because of the MDP property.

— teucer

2

This is just a comment/addition to the accepted answer.

I was confused at the line where the law of total expectation is being applied. I don't think the main form of the law of total expectation can help here. A variant of it is in fact needed.

If $X,Y,Z$ are random variables, and assuming all the expectations exist, then the following identity holds:

$E[X|Y] = E[E[X|Y,Z]|Y]$

In this case, $X= G_{t+1}$, $Y = S_t$ and $Z = S_{t+1}$. Then

$E[G_{t+1}|S_t=s] = E[E[G_{t+1}|S_t=s, S_{t+1}=s']|S_t=s]$, which by the Markov property equals $E[E[G_{t+1}|S_{t+1}=s']|S_t=s]$

From there, one can follow the rest of the proof from the answer.

— Mehdi Golari
source

1

Welcome to CV! Please use answers only to answer the question. Once you have enough reputation (50), you can add comments.

— Frans Rodenburg

Thank you. Yes, since I could not comment due to not having enough reputation, I thought it might be useful to add the explanation as an answer. But I will keep that in mind.

— Mehdi Golari

I upvoted, but still, this answer is missing details: even if $E[X|Y]$ satisfies this crazy relationship, nobody guarantees that it also holds for the factorizations of the conditional expectations! I.e., as with Ntabgoba's answer: the left-hand side does not depend on $s'$ while the right-hand side does. This equation cannot be correct!

— Fabian Werner

1

$\mathbb{E}_\pi(\cdot)$ usually denotes the expectation assuming the agent follows policy $\pi$. In this case $\pi(a|s)$ seems non-deterministic, i.e. it returns the probability that the agent takes action $a$ when in state $s$.

It looks like $r$, lower-case, is replacing $R_{t+1}$, a random variable. The second expectation replaces the infinite sum, to reflect the assumption that we continue to follow $\pi$ for all future $t$. $\sum_{s',r} r \cdot p(s',r|s,a)$ is then the expected immediate reward on the next time step; the second expectation (which becomes $v_\pi$) is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$.

Thus, the expectation accounts for the policy probability as well as the transition and reward functions, here expressed together as $p(s', r|s,a)$.

— Sean Easter
source

Thanks. Yes, what you mentioned about $\pi(a|s)$ is correct (it's the probability of the agent taking action $a$ when in state $s$).

— Amelio Vazquez-Reina

What I don't follow is which terms exactly get expanded into which terms in the second step (I'm familiar with probability factorization and marginalization, but not so much with RL). Is the $R_t$ term being expanded? I.e. what exactly in the previous step equals what exactly in the next step?

— Amelio Vazquez-Reina

1

It looks like $r$, lower-case, is replacing $R_{t+1}$, a random variable, and the second expectation replaces the infinite sum (probably to reflect the assumption that we continue to follow $\pi$ for all future $t$). $\Sigma p(s',r|s,a)r$ is then the expected immediate reward on the next time step, and the second expectation (which becomes $v_\pi$) is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$.

— Sean Easter,

1

I know the correct answer has already been given and some time has passed, but I thought the following step-by-step guide might be useful:
By linearity of the expected value you can split $E[R_{t+1} + \gamma G_{t+1}|S_{t}=s]$ into $E[R_{t+1}|S_t=s]$ and $\gamma E[G_{t+1}|S_{t}=s]$.
I will outline the steps only for the first part, as the second part follows by the same steps combined with the Law of Total Expectation.

$\begin{align} E[R_{t+1}|S_t=s]&=\sum_r{ r P[R_{t+1}=r|S_t =s]} \\ &= \sum_a{ \sum_r{ r P[R_{t+1}=r, A_t=a|S_t=s]}} \qquad \text{(III)} \\ &=\sum_a{ \sum_r{ r P[R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s]}} \\ &= \sum_{s^{'}}{ \sum_a{ \sum_r{ r P[S_{t+1}=s^{'}, R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s] }}} \\ &=\sum_a{ \pi(a|s) \sum_{s^{'},r}{p(s^{'},r|s,a)} } r \end{align}$

where (III) follows from:

$\begin{align} P[A,B|C]&=\frac{P[A,B,C]}{P[C]} \\ &= \frac{P[A,B,C]}{P[C]} \frac{P[B,C]}{P[B,C]}\\ &= \frac{P[A,B,C]}{P[B,C]} \frac{P[B,C]}{P[C]}\\ &= P[A|B,C] P[B|C] \end{align}$
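The identity used for (III) is easy to confirm numerically on any finite joint distribution. The following sketch uses a made-up joint pmf over three binary variables. The table `joint` and the helpers `P`, `lhs` and `rhs` are illustrative names, not part of the derivation.

```python
from itertools import product

# Hypothetical joint pmf P[A=a, B=b, C=c] on {0,1}^3 with arbitrary positive weights.
raw = {abc: 1 + i for i, abc in enumerate(product([0, 1], repeat=3))}
Z = sum(raw.values())
joint = {abc: w / Z for abc, w in raw.items()}


def P(a=None, b=None, c=None):
    """Probability of an event fixing any subset of A, B, C (None = marginalised out)."""
    return sum(v for (a_, b_, c_), v in joint.items()
               if a in (None, a_) and b in (None, b_) and c in (None, c_))


def lhs(a, b, c):
    """P[A, B | C]."""
    return P(a, b, c) / P(c=c)


def rhs(a, b, c):
    """P[A | B, C] * P[B | C]."""
    return (P(a, b, c) / P(b=b, c=c)) * (P(b=b, c=c) / P(c=c))


for abc in product([0, 1], repeat=3):
    print(abc, lhs(*abc), rhs(*abc))
```

In the derivation above, $A$ plays the role of $\{R_{t+1}=r\}$, $B$ of $\{A_t=a\}$ and $C$ of $\{S_t=s\}$.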

— Adsertor Justitia
source

1

I know there is already an accepted answer, but I wish to provide a probably more concrete derivation. I would also like to mention that although @Jie Shi's trick somewhat makes sense, it makes me feel very uncomfortable :(. We need to consider the time dimension to make this work. And it is important to note that the expectation is actually taken over the entire infinite horizon, rather than just over $s$ and $s'$. Let us assume we start from $t=0$ (in fact, the derivation is the same regardless of the starting time; I do not want to contaminate the equations with another subscript $k$)

$\begin{align} v_{\pi}(s_0)&=\mathbb{E}_{\pi}[G_{0}|s_0]\\ G_0&=\sum_{t=0}^{T-1}\gamma^tR_{t+1}\\ \mathbb{E}_{\pi}[G_{0}|s_0]&=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(\sum_{t=0}^{T-1}\gamma^tr_{t+1}\Big)\bigg)\\ &=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(r_1+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)\bigg) \end{align}$ Note that the above equation holds even if $T\rightarrow\infty$; in fact it will be true until the end of the universe (maybe a bit exaggerated :) )
At this stage, I believe most of us should already have in mind how the above leads to the final expression--we just need to apply the sum-product rule ($\sum_a\sum_b\sum_cabc\equiv\sum_aa\sum_bb\sum_cc$) painstakingly. Let us apply the law of linearity of expectation to each term inside the $\Big(r_{1}+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)$

Part 1

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\times r_1\bigg)$

Well this is rather trivial, all probabilities disappear (actually sum to 1) except those related to $r_1$ . Therefore, we have

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times r_1$

Part 2
Guess what, this part is even more trivial--it only involves rearranging the sequence of summations.

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\bigg)\\=\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\bigg(\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg)$

And Eureka!! We recover a recursive pattern inside the big parentheses. Let us combine it with $\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}$, and we obtain $v_{\pi}(s_1)=\mathbb{E}_{\pi}[G_1|s_1]$

$\gamma\mathbb{E}_{\pi}[G_1|s_1]=\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg(\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\bigg)$
and part 2 becomes

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \gamma v_{\pi}(s_1)$

Part 1 + Part 2

$v_{\pi}(s_0) =\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \Big(r_1+\gamma v_{\pi}(s_1)\Big)$

And now if we can tuck in the time dimension and recover the general recursive formulae

$v_{\pi}(s) =\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\times \Big(r+\gamma v_{\pi}(s')\Big)$
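For a finite horizon $T$ the claim can be tested directly: sum probability times return over every trajectory, and compare with the one-step recursion. The sketch below does this for a hypothetical two-state MDP. The tables `p` and `pi` and both function names are illustrative assumptions, and with a finite horizon the value also depends on the number of remaining steps, which the general formula hides.

```python
gamma = 0.9
# p[(s, a)] lists (next_state, reward, probability); pi[s][a] is the policy. Illustrative numbers.
p = {(0, 0): [(0, 0.0, 0.7), (1, 1.0, 0.3)], (0, 1): [(1, 0.0, 0.6), (1, 2.0, 0.4)],
     (1, 0): [(0, 1.0, 0.5), (1, 0.0, 0.5)], (1, 1): [(0, 2.0, 0.2), (1, 1.0, 0.8)]}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}}


def brute_force_value(s0, T):
    """E[G_0 | S_0 = s0]: sum of probability * return over every length-T trajectory."""
    def walk(s, t, prob, ret):
        if t == T:
            return prob * ret
        return sum(walk(s2, t + 1, prob * pa * pr, ret + gamma ** t * r)
                   for a, pa in pi[s].items() for s2, r, pr in p[(s, a)])
    return walk(s0, 0, 1.0, 0.0)


def recursive_value(s, T):
    """Finite-horizon recursion: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) (r + gamma * v_{T-1}(s'))."""
    if T == 0:
        return 0.0
    return sum(pa * pr * (r + gamma * recursive_value(s2, T - 1))
               for a, pa in pi[s].items() for s2, r, pr in p[(s, a)])


for T in range(6):
    print(T, brute_force_value(0, T), recursive_value(0, T))
```

The two columns should match for every `T`, which is the rearrangement of summations from Part 1 and Part 2 carried out mechanically.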

Final confession, I laughed when I saw people above mention the use of law of total expectation. So here I am

— Karlsson Yu
source

Erm... what is the symbol '$\sum_{a_0, ..., a_{\infty}}$' supposed to mean? There is no $a_\infty$...

— Fabian Werner

Another question: Why is the very first equation true? I know $E[f(X)|Y=y] = \int_{\mathcal{X}} f(x) p(x|y) dx$, but in our case $X$ would be an infinite sequence of random variables $(R_0, R_1, R_2, ........)$, so we would need to compute the density of this variable (consisting of an infinite amount of variables of which we know the density) together with something else (namely the state)... how exactly do you do that? I.e. what is $p(r_0, r_1, ....)$?

— Fabian Werner

@FabianWerner. Take a deep breath to calm your brain first :). Let me answer your first question. $\sum_{a_0,...,a_{\infty}} \equiv \sum_{a_0}\sum_{a_1},...,\sum_{a_{\infty}}$. If you recall the definition of the value function, it is actually a summation of discounted future rewards. If we consider an infinite horizon for our future rewards, we then need to sum an infinite number of times. A reward is the result of taking an action from a state; since there is an infinite number of rewards, there should be an infinite number of actions, hence $a_{\infty}$.

— Karlsson Yu

1

Let us assume that I agree that there is some weird $a_\infty$ (which I still doubt; usually, students in the very first semester in math tend to confuse the limit with some construction that actually involves an infinite element)... I still have one simple question: how is "$\sum_{a_1} ... \sum_{a_\infty}$" defined? I know what this expression is supposed to mean with a finite amount of sums... but infinitely many of them? What do you understand that this expression does?

— Fabian Werner

1

internet. Could you refer me to a page or any place that defines your expression? If not then you actually defined something new and there is no point in discussing that because it is just a symbol that you made up (but there is no meaning behind it)... you agree that we are only able to discuss about the symbol if we both know what it means, right? So, I do not know what it means, please explain...

— Fabian Werner

1

There are already a great many answers to this question, but most involve few words describing what is going on in the manipulations. I'm going to answer it using way more words, I think. To start,

$G_{t} \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k}$

is defined in equation 3.11 of Sutton and Barto, with a constant discount factor $0 \leq \gamma \leq 1$ and we can have $T = \infty$ or $\gamma = 1$ , but not both. Since the rewards, $R_{k}$ , are random variables, so is $G_{t}$ as it is merely a linear combination of random variables.

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] + \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] \end{align}$

That last line follows from the linearity of expectation values. $R_{t+1}$ is the reward the agent gains after taking action at time step $t$ . For simplicity, I assume that it can take on a finite number of values $r \in \mathcal{R}$ .

Work on the first term. In words, I need to compute the expectation values of $R_{t+1}$ given that we know that the current state is $s$ . The formula for this is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} r p(r|s). \end{align}$

In other words the probability of the appearance of reward $r$ is conditioned on the state $s$ ; different states may have different rewards. This $p(r|s)$ distribution is a marginal distribution of a distribution that also contained the variables $a$ and $s'$ , the action taken at time $t$ and the state at time $t+1$ after the action, respectively:

$\begin{align} p(r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',a,r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s) p(s',r | a,s). \end{align}$

Where I have used $\pi(a|s) \doteq p(a|s)$, following the book's convention. If that last equality is confusing, forget the sums, suppress the $s$ (the probability now looks like a joint probability), use the law of multiplication and finally reintroduce the condition on $s$ in all the new terms. It is now easy to see that the first term is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} r \pi(a|s) p(s',r | a,s), \end{align}$

as required. On to the second term, where I assume that $G_{t+1}$ is a random variable that takes on a finite number of values $g \in \Gamma$ . Just like the first term:

$\begin{align} \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] = \sum_{g \in \Gamma} g p(g|s). \qquad\qquad\qquad\qquad (*) \end{align}$

Once again, I "un-marginalize" the probability distribution by writing (law of multiplication again)

$\begin{align} p(g|s) & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',r,a,g|s) = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r, a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r | a, s) \pi(a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s') p(s', r | a, s) \pi(a | s) \qquad\qquad\qquad\qquad (**) \end{align}$

The last line follows from the Markov property. Remember that $G_{t+1}$ is the sum of all the future (discounted) rewards that the agent receives after state $s'$. The Markov property says that the process is memoryless with regard to previous states, actions and rewards. Future actions (and the rewards they reap) depend only on the state in which the action is taken, so $p(g | s', r, a, s) = p(g | s')$, by assumption. So the second term in the proof is now

$\begin{align} \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] & = \gamma \sum_{g \in \Gamma} \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} g p(g | s') p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathbb{E}_{\pi}\left[ G_{t+1} | S_{t+1} = s' \right] p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} v_{\pi}(s') p(s', r | a, s) \pi(a | s) \end{align}$

as required, once again. Combining the two terms completes the proof:

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \sum_{a \in \mathcal{A}} \pi(a | s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r | a, s) \left[ r + \gamma v_{\pi}(s') \right]. \end{align}$
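The final identity can be checked numerically. The sketch below, which assumes a small made-up MDP and an arbitrary discount factor, repeatedly applies the right-hand side of the Bellman equation until the values stop changing, then confirms that the result satisfies the equation in every state. Every name and number in it is hypothetical.

```python
GAMMA = 0.9  # discount factor, chosen arbitrarily for this sketch

# Hypothetical MDP: dynamics[(s, a)] lists (s_next, r, prob), i.e. p(s', r | a, s).
dynamics = {
    ("s0", "left"):  [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "right"): [("s1", 2.0, 1.0)],
    ("s1", "left"):  [("s0", -1.0, 1.0)],
    ("s1", "right"): [("s1", 0.0, 1.0)],
}
# policy[s][a] is pi(a | s); the numbers are assumptions.
policy = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 0.5, "right": 0.5},
}
states = list(policy)

def bellman_rhs(s, v):
    """sum_a pi(a|s) sum_{s', r} p(s', r | a, s) * (r + gamma * v(s'))."""
    return sum(
        pi_a * prob * (r + GAMMA * v[s_next])
        for a, pi_a in policy[s].items()
        for s_next, r, prob in dynamics[(s, a)]
    )

def evaluate_policy(tol=1e-12, max_sweeps=10_000):
    """Iterate v <- bellman_rhs(v) until the largest change falls below tol."""
    v = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        new_v = {s: bellman_rhs(s, v) for s in states}
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v
    return v

v_pi = evaluate_policy()
# Largest violation of v(s) = bellman_rhs(s, v) over all states.
residual = max(abs(v_pi[s] - bellman_rhs(s, v_pi)) for s in states)
print(v_pi, residual)
```

The iteration converges because, with $\gamma < 1$, the right-hand side is a contraction in $v$. The residual printed at the end should be at the level of the tolerance.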

UPDATE

I want to address what might look like a sleight of hand in the derivation of the second term. In the equation marked $(*)$, I use a term $p(g|s)$, and then later, in the equation marked $(**)$, I claim that $g$ doesn't depend on $s$ by invoking the Markov property. So you might say that if this is the case, then $p(g|s) = p(g)$. But this is not true. I can take $p(g | s', r, a, s) \rightarrow p(g | s')$ because the left side of that statement is the probability of $g$ conditioned on $s'$, $a$, $r$, and $s$. Because we either know or assume the state $s'$, none of the other conditioning variables matter, by the Markov property. If you do not know or assume the state $s'$, then the future rewards (the meaning of $g$) will depend on which state you begin at, because that will determine (based on the policy) which state $s'$ you start at when computing $g$.

If that argument doesn't convince you, try to compute what $p(g)$ is:

$\begin{align} p(g) & = \sum_{s' \in \mathcal{S}} p(g, s') = \sum_{s' \in \mathcal{S}} p(g | s') p(s') \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', a, r, s) \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', r | a, s) p(a, s) \\ & = \sum_{s \in \mathcal{S}} p(s) \sum_{s' \in \mathcal{S}} p(g | s') \sum_{a,r} p(s', r | a, s) \pi(a | s) \\ & \doteq \sum_{s \in \mathcal{S}} p(s) p(g|s) = \sum_{s \in \mathcal{S}} p(g,s) = p(g). \end{align}$

As the last line shows, $p(g)$ is the average of $p(g|s)$ weighted by $p(s)$, so in general it is not true that $p(g|s) = p(g)$. The expected value of $g$ depends on which state you start in (i.e. the identity of $s$), if you do not know or assume the state $s'$.
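To make this concrete, here is a small Python sketch that builds $p(g|s)$ from equation $(**)$ for two different starting states and compares both with the marginal $p(g)$. The states `A`, `B`, `C`, `D`, the return distributions $p(g|s')$, the policy and the state probabilities $p(s)$ are all invented for illustration.

```python
from collections import defaultdict

# All numbers below are assumptions made up for this example.
# return_dist[s_next][g] is p(g | s'): the distribution of G_{t+1} given S_{t+1} = s'.
return_dist = {
    "C": {1.0: 1.0},
    "D": {0.0: 0.5, 2.0: 0.5},
}
# dynamics[(s, a)] lists (s_next, r, prob) triples, i.e. p(s', r | a, s).
dynamics = {
    ("A", "go"):   [("C", 0.0, 1.0)],
    ("A", "stay"): [("D", 0.0, 1.0)],
    ("B", "go"):   [("D", 1.0, 1.0)],
    ("B", "stay"): [("C", 0.0, 0.5), ("D", 0.0, 0.5)],
}
policy = {"A": {"go": 0.75, "stay": 0.25}, "B": {"go": 0.5, "stay": 0.5}}
state_prob = {"A": 0.5, "B": 0.5}  # p(s)

def return_given_state(s):
    """p(g | s) = sum over a, s', r of p(g | s') p(s', r | a, s) pi(a | s)."""
    dist = defaultdict(float)
    for a, pi_a in policy[s].items():
        # The reward r is summed out: it does not affect p(g | s').
        for s_next, _r, prob in dynamics[(s, a)]:
            for g, p_g in return_dist[s_next].items():
                dist[g] += p_g * prob * pi_a
    return dict(dist)

p_g_A = return_given_state("A")
p_g_B = return_given_state("B")

# Marginal p(g) = sum_s p(s) p(g | s).
p_g = defaultdict(float)
for s, p_s in state_prob.items():
    for g, p in return_given_state(s).items():
        p_g[g] += p_s * p
p_g = dict(p_g)

print(p_g_A, p_g_B, p_g)
```

In this toy setup the two conditional distributions differ from each other and from the marginal, even though $p(g | s')$ alone is all that is needed once $s'$ is known.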

— Finncent Price
source