Applying Expectation Maximization to coin toss examples

I've been self-studying Expectation Maximization lately, and picked up some simple examples along the way:

From here: There are three coins $c_0$, $c_1$ and $c_2$, with $p_0$, $p_1$ and $p_2$ being the respective probabilities of landing on Head when tossed. Toss $c_0$. If the result is Head, toss $c_1$ three times; otherwise toss $c_2$ three times. The observed data produced by $c_1$ and $c_2$ looks like this: HHH, TTT, HHH, TTT, HHH. The hidden data is the result of $c_0$. Estimate $p_0$, $p_1$ and $p_2$.

And from here: There are two coins $c_A$ and $c_B$, with $p_A$ and $p_B$ being the respective probabilities of landing on Head when tossed. Each round, select one coin at random and toss it ten times; record the results. The observed data is the toss results produced by these two coins. However, we don't know which coin was selected for a particular round. Estimate $p_A$ and $p_B$.

While I can follow the calculations, I can't relate the way they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and, somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.

So how exactly do these examples work?

probability-theory statistics

— IcySnow

In the first example, how many samples of the same experiment do we get? In the second example, what is the law of "select one coin at random"? How many rounds do we observe?

— Raphael

The PDF files I've linked solve these two examples step by step. However, I don't really understand the EM algorithm used.

— IcySnow

@IcySnow, do you understand the concepts of expectation and conditional expectation of a random variable?

— Nicholas Mancuso

I understand the basic expectation of a random variable and conditional probability. However, I'm not familiar with conditional expectation, its derivative, and sufficient statistics.

— IcySnow

(This answer uses the second link you gave.)

$\newcommand{\Like}{\text{L}}\newcommand{\E}{\text{E}}$ Recall the definition of likelihood:

$$\Like[\theta | X] = \Pr[X| \theta] = \sum_Z \Pr[X, Z | \theta]$$

where, in our case, $\theta = (\theta_A, \theta_B)$ are the estimators for the probability that coins A and B respectively land heads, $X = (X_1, \dotsc, X_5)$ are the outcomes of our experiments, each $X_i$ consisting of 10 flips, and $Z = (Z_1, \dotsc, Z_5)$ are the coins used in each experiment.

We want to find the maximum likelihood estimator $\hat{\theta}$. The Expectation-Maximization (EM) algorithm is one such method to find (at least a local) $\hat{\theta}$. It works by finding the conditional expectation, which is then maximized over $\theta$. The idea is that by continually finding a more likely (i.e., more probable) $\theta$ in each iteration we continually increase $\Pr[X,Z|\theta]$, which in turn increases the likelihood function. There are three things that need to be done before designing an EM-based algorithm.

1. Construct the model
2. Compute the conditional expectation under the model (E-Step)
3. Maximize our likelihood by updating our current estimate of $\theta$ (M-Step)

Constructing the Model

Before we go further with EM we need to figure out what exactly it is we are computing. In the E-step we compute exactly the expected value of $\log \Pr[X,Z|\theta]$. So what is this value, really? Observe that, once the unknown coin in each experiment is summed out, the log-likelihood of the observed data satisfies

$$\begin{align*} \log \Pr[X|\theta] &= \sum_{i=1}^5 \log\sum_{C\in \{A,B\}}\Pr[X_i, Z_i=C| \theta]\\ &=\sum_{i=1}^5 \log\sum_{C\in \{A,B\}} \Pr[Z_i=C | X_i, \theta] \cdot \frac{\Pr[X_i, Z_i=C| \theta]}{\Pr[Z_i=C | X_i, \theta]}\\ &\geq \sum_{i=1}^5 \sum_{C\in \{A,B\}} \Pr[Z_i=C | X_i, \theta] \cdot \log\frac{\Pr[X_i, Z_i=C| \theta]}{\Pr[Z_i=C | X_i, \theta]}. \end{align*}$$

The sum over $i$ is there because we have 5 experiments to account for, and the sum over $C$ because we don't know which coin was used in each. The inequality is due to $\log$ being concave and applying Jensen's inequality. The reason we need that lower bound is that we cannot directly compute the arg max of the original expression. However, we can compute it for the final lower bound.
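As a quick numerical sanity check of the bound above, here is a minimal Python sketch for a single experiment with 5 heads out of 10 flips. The function names and the candidate value (0.7, 0.4) are illustrative assumptions, not taken from the paper. The weights are the posterior probabilities computed at a current guess, and the bound is then evaluated at a possibly different $\theta$.

```python
import math

N_FLIPS = 10  # flips per experiment, as in the example


def joint(h, theta_c):
    """Pr[X_i, Z_i = C | theta] for a sequence with h heads; prior 1/2 per coin."""
    return 0.5 * theta_c**h * (1 - theta_c) ** (N_FLIPS - h)


def log_marginal(h, theta):
    """log Pr[X_i | theta], with the unknown coin summed out."""
    return math.log(sum(joint(h, t) for t in theta))


def jensen_bound(h, theta, theta_current):
    """Lower bound on log Pr[X_i | theta], weighted by the posterior at theta_current."""
    joint_current = [joint(h, t) for t in theta_current]
    weights = [j / sum(joint_current) for j in joint_current]
    return sum(w * math.log(joint(h, t) / w) for w, t in zip(weights, theta))


current = (0.6, 0.5)    # starting guess used in the example
candidate = (0.7, 0.4)  # hypothetical alternative, chosen only for illustration

gap_at_current = log_marginal(5, current) - jensen_bound(5, current, current)
gap_at_candidate = log_marginal(5, candidate) - jensen_bound(5, candidate, current)
print(gap_at_current, gap_at_candidate)
```

The gap should be zero (up to rounding) at the current guess and non-negative elsewhere, which is what lets an improvement in the bound carry over to the likelihood itself.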

Now what is $\Pr[Z_i=C|X_i,\theta]$? It is the probability that coin $C$ was used, given experiment $X_i$ and $\theta$. Using conditional probabilities we have

$$\Pr[Z_i=C| X_i, \theta] = \frac{\Pr[X_i, Z_i = C|\theta]}{\Pr[X_i|\theta]}.$$

Letting $h_i = \#\text{heads in } X_i$, the joint probability of the sequence $X_i$ and the coin $C$ is

$$\Pr[X_i, Z_i = C| \theta] = \frac{1}{2} \cdot \theta_C^{h_i} (1 - \theta_C)^{10 - h_i},\ \text{ for } \ C \in \{A, B\}.$$

Now $\Pr[X_i|\theta]$ is clearly just the probability under both possibilities of $Z_i=A$ or $Z_i=B$. Since $\Pr[Z_i = A] = \Pr[Z_i = B] = 1/2$ we have

$$\Pr[X_i|\theta] = 1/2 \cdot (\Pr[X_i |Z_i = A, \theta] + \Pr[X_i |Z_i = B, \theta]).$$
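The three model equations translate almost line by line into code. The following is a minimal Python sketch; the function names `joint` and `posterior` are my own, and it assumes the 1/2 prior on each coin stated above.

```python
def joint(h, theta_c, n=10):
    """Pr[X_i, Z_i = C | theta] with a 1/2 prior on each coin."""
    return 0.5 * theta_c**h * (1 - theta_c) ** (n - h)


def posterior(h, theta_a, theta_b, n=10):
    """Return (Pr[Z_i = A | X_i, theta], Pr[Z_i = B | X_i, theta])."""
    joint_a = joint(h, theta_a, n)
    joint_b = joint(h, theta_b, n)
    evidence = joint_a + joint_b  # Pr[X_i | theta]
    return joint_a / evidence, joint_b / evidence


# Five heads in ten flips, evaluated at theta = (0.6, 0.5).
p_a, p_b = posterior(5, 0.6, 0.5)
print(round(p_a, 2), round(p_b, 2))  # 0.45 0.55
```

Only the number of heads enters the computation, not the order of the flips, so each experiment can be summarized by $h_i$.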

E-Step

Okay... that wasn't so fun but we can start doing some EM work now. The EM algorithm begins by making some random guess for $\theta$. In this example we have $\theta^0 = (0.6,0.5)$. We compute

$$\Pr[Z_1=A|X_1,\theta] = \frac{1/2 \cdot (0.6^5 \cdot 0.4^5)}{1/2 \cdot ((0.6^5 \cdot 0.4^5) + (0.5^5 \cdot 0.5^5))} \approx 0.45.$$

This value lines up with what is in the paper. Now we can compute the expected number of heads in $X_1 = (H,T,T,T,H,H,T,H,T,H)$ from coin $A$,

$$\E[\# \text{heads by coin }A | X_1, \theta] = h_1 \cdot \Pr[Z_1=A|X_1,\theta] = 5 \cdot 0.45 \approx 2.2.$$

Doing the same thing for coin $B$ we get

$$\E[\# \text{heads by coin }B | X_1, \theta] = h_1 \cdot \Pr[Z_1=B|X_1,\theta] = 5 \cdot 0.55 \approx 2.8.$$

We can compute the same for the number of tails by replacing $h_1$ with $10 - h_1$. This continues for all other values of $X_i$ and $h_i$, for $1 \leq i \leq 5$. Thanks to linearity of expectation we can figure out

$$\E[\#\text{heads by coin } A|X ,\theta] = \sum_{i=1}^5 \E[\# \text{heads by coin }A | X_i, \theta]$$
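The whole E-step can be sketched in a few lines of Python. Only $h_1 = 5$ is spelled out in this answer; the remaining head counts below are an assumption on my part, chosen because they reproduce the 21.3 expected heads for coin $A$ used in the M-step. The function name `e_step` is likewise just illustrative.

```python
HEADS = [5, 9, 8, 4, 7]  # ASSUMPTION: only h_1 = 5 appears in the answer
N_FLIPS = 10


def e_step(theta_a, theta_b, heads=HEADS, n=N_FLIPS):
    """Expected [heads, tails] attributed to each coin under the current theta."""
    counts = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
    for h in heads:
        joint_a = 0.5 * theta_a**h * (1 - theta_a) ** (n - h)
        joint_b = 0.5 * theta_b**h * (1 - theta_b) ** (n - h)
        p_a = joint_a / (joint_a + joint_b)  # Pr[Z_i = A | X_i, theta]
        p_b = 1.0 - p_a
        counts["A"][0] += h * p_a
        counts["A"][1] += (n - h) * p_a
        counts["B"][0] += h * p_b
        counts["B"][1] += (n - h) * p_b
    return counts


counts = e_step(0.6, 0.5)
print({coin: [round(v, 1) for v in ht] for coin, ht in counts.items()})
# {'A': [21.3, 8.6], 'B': [11.7, 8.4]}
```

Each experiment contributes its heads and tails to both coins, split in proportion to the posterior probability that the coin was used.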

M-Step

With our expected values in hand, now comes the M-step, where we want to choose the $\theta$ that maximizes the lower bound given our expected values. This is done by simple normalization!

$$\theta_A^1 = \frac{\E[\#\text{heads over } X \text{ by coin } A|X ,\theta]}{\E[\#\text{heads and tails over } X \text{ by coin } A|X ,\theta]} = \frac{21.3}{21.3 + 8.6} \approx 0.71.$$

Likewise for $B$. This process begins again with the E-Step and $\theta^1$ and continues until the values for $\theta$ converge (or change by less than some allowable threshold). In this example we have 10 iterations and $\hat{\theta} = \theta^{10} = (0.8, 0.52)$. In each iteration the value of $\Pr[X,Z|\theta]$ increases, due to the better estimate of $\theta$.
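One way to see that the normalization really is a maximization: write $H_A$ and $T_A$ (symbols introduced here, not in the paper) for the expected numbers of heads and tails attributed to coin $A$. Assuming the posterior weights are held fixed at their E-step values, the part of the lower bound that depends on $\theta_A$ is a weighted log-likelihood, and setting its derivative to zero gives the update:

$$\begin{align*}
Q(\theta_A) &= H_A \log \theta_A + T_A \log(1 - \theta_A) \\
\frac{dQ}{d\theta_A} &= \frac{H_A}{\theta_A} - \frac{T_A}{1 - \theta_A} = 0
\quad\Longrightarrow\quad
\theta_A = \frac{H_A}{H_A + T_A} \\
\frac{d^2 Q}{d\theta_A^2} &= -\frac{H_A}{\theta_A^2} - \frac{T_A}{(1 - \theta_A)^2} < 0
\end{align*}$$

The second derivative is negative, so the stationary point is the maximum. The same argument applies to $\theta_B$.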

Now in this case the model was fairly simplistic. Things can get much more complicated pretty quickly; however, the EM algorithm will always converge, and will always produce a maximum likelihood estimator $\hat{\theta}$. It may be only a local maximum, but to get around this we can just restart the EM process with a different initialization. We can do this a fixed number of times and retain the best results (i.e., those with the highest final likelihood).
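Putting the pieces together, here is a minimal Python sketch of the full loop with random restarts. As before, the head counts other than $h_1 = 5$ are an assumption, and the helper names, the number of restarts, the seed, and the sampling range for the initial guesses are arbitrary choices of mine.

```python
import math
import random

HEADS = [5, 9, 8, 4, 7]  # ASSUMPTION: heads per experiment; only h_1 = 5 is given
N_FLIPS = 10


def log_likelihood(theta, heads=HEADS, n=N_FLIPS):
    """log Pr[X | theta] with the coin identities summed out."""
    theta_a, theta_b = theta
    return sum(
        math.log(0.5 * theta_a**h * (1 - theta_a) ** (n - h)
                 + 0.5 * theta_b**h * (1 - theta_b) ** (n - h))
        for h in heads
    )


def em_step(theta, heads=HEADS, n=N_FLIPS):
    """One E-step followed by one M-step."""
    theta_a, theta_b = theta
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h in heads:
        joint_a = theta_a**h * (1 - theta_a) ** (n - h)
        joint_b = theta_b**h * (1 - theta_b) ** (n - h)
        p_a = joint_a / (joint_a + joint_b)  # the 1/2 priors cancel
        heads_a += h * p_a
        tails_a += (n - h) * p_a
        heads_b += h * (1 - p_a)
        tails_b += (n - h) * (1 - p_a)
    return heads_a / (heads_a + tails_a), heads_b / (heads_b + tails_b)


def run_em(theta, iterations=10):
    """Apply em_step a fixed number of times."""
    for _ in range(iterations):
        theta = em_step(theta)
    return theta


def best_of_restarts(restarts=5, seed=0):
    """Run EM from several random starting points and keep the most likely result."""
    rng = random.Random(seed)
    candidates = [
        run_em((rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)))
        for _ in range(restarts)
    ]
    return max(candidates, key=log_likelihood)


theta_10 = run_em((0.6, 0.5))
print(tuple(round(t, 2) for t in theta_10))
print(best_of_restarts())
```

Note that a restart may return the two coins with their labels swapped; the likelihood is the same either way, so comparing restarts by likelihood rather than by parameter values sidesteps the issue.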

— Nicholas Mancuso

If any parts aren't clear I can try to expand them also.

— Nicholas Mancuso

It gets much clearer now. What I don't really get is why the expected number of heads for coin A was calculated as $\E[\# \text{heads by coin }A | X_1, \theta] = h_1 \cdot \Pr[Z_1=A|X_1,\theta] = 5 \cdot 0.45 \approx 2.2$. The problem mentioned in the first PDF is more complicated. If you don't mind, can you do some illustrative calculations for it as well? Many thanks for your answer.

— IcySnow

@IcySnow, as far as the expectation calc goes: $\E[\# \text{ heads by coin }A|X_1,\theta] = \sum_{\#\text{ heads in }X_1} \Pr[Z_1 = A| X_1, \theta] = 5 \cdot \Pr[Z_1 = A| X_1, \theta]$. The reason is that you can think of each head as carrying an indicator random variable for whether A was used. The expectation of an indicator variable is simply the probability of that event.

— Nicholas Mancuso

Sorry for the slow reply. Thanks to you, I can now really understand the logic behind the two coin examples, after going through your answer many times. There's one last thing I want to ask regarding this question: the example starting from page 8 in this slide deck, cs.northwestern.edu/~ddowney/courses/395_Winter2010/em.ppt, shows that in the M-Step we first have to compute the derivative of the log-likelihood function and use it to maximize the expectation. Why isn't there something like that in the coin toss examples' M-Steps? These M-Steps don't look like they're maximizing anything.

— IcySnow

I'm confused by the first displayed equation after "Constructing the Model". Can you explain where that came from? It looks to me like $\Pr[Z_i=A|X_i,\theta]+\Pr[Z_i=B|X_i,\theta]=1$, so the inner sum is 1 for every $i$, so the entire right-hand side becomes zero. I'm sure I'm missing something -- can you spell out the reasoning about how you got to that equation?

— D.W.