To try to provide some intuition, let's consider the simplest case. Let $X_1, X_2, \ldots, X_n$ be an iid sample from a discrete distribution with $k$ outcomes. Let $\pi_1, \ldots, \pi_k$ be the probabilities of each outcome. We are interested in the (asymptotic) distribution of the chi-squared statistic
$$X^2 = \sum_{i=1}^k \frac{(S_i - n\pi_i)^2}{n\pi_i}\,.$$
Here $S_i$ is the observed number of counts of the $i$th outcome, so $n\pi_i$ is the expected number of counts of the $i$th outcome.
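To make this concrete, here is a minimal Python sketch (the values of $n$ and $\pi$ are made up for illustration) that draws one sample and evaluates the statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # assumed probabilities (k = 4)
n = 1000                              # assumed sample size

# S_i is the observed count of outcome i among the n draws.
S = rng.multinomial(n, pi)

# Pearson chi-squared statistic: sum_i (S_i - n*pi_i)^2 / (n*pi_i).
X2 = np.sum((S - n * pi) ** 2 / (n * pi))
print(S, X2)
```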
A suggestive heuristic
Define $U_i = (S_i - n\pi_i)/\sqrt{n\pi_i}$, so that $X^2 = \sum_i U_i^2 = \|U\|_2^2$ where $U = (U_1, \ldots, U_k)$.
Since $S_i$ is $\mathrm{Bin}(n, \pi_i)$, by the Central Limit Theorem,
$$T_i = \frac{U_i}{\sqrt{1-\pi_i}} = \frac{S_i - n\pi_i}{\sqrt{n\pi_i(1-\pi_i)}} \xrightarrow{d} \mathcal{N}(0,1)\,,$$
hence, we also have that
$$U_i \xrightarrow{d} \mathcal{N}(0, 1-\pi_i)\,.$$
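A quick simulation sketch (again with made-up $n$ and $\pi$) shows the sample variance of each $U_i$ settling near $1 - \pi_i$ rather than $1$:

```python
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # assumed probabilities
n, reps = 1000, 20_000

# Each row of S holds the counts (S_1, ..., S_k) for one sample of size n.
S = rng.multinomial(n, pi, size=reps)
U = (S - n * pi) / np.sqrt(n * pi)

# The sample variance of U_i is close to 1 - pi_i, not 1.
print(U.var(axis=0))                  # approximately [0.9, 0.8, 0.7, 0.6]
print(1 - pi)
```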
Now, if the $T_i$ were (asymptotically) independent (which they aren't), then we could argue that $\sum_i T_i^2$ was asymptotically $\chi^2_k$ distributed. But, since $\sum_i S_i = n$, $T_k$ is a deterministic function of $(T_1, \ldots, T_{k-1})$ and so the $T_i$ variables can't possibly be independent.
Hence, we must take into account the covariance between them somehow. It turns out that the "correct" way to do this is to use the $U_i$ instead, and the covariance between the components of $U$ also changes the asymptotic distribution from what we might have thought was $\chi^2_k$ to what is, in fact, a $\chi^2_{k-1}$.
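Here is a small simulation sketch illustrating this (made-up $n$ and $\pi$, with scipy providing the reference quantiles): the simulated $X^2$ values track the $\chi^2_{k-1}$ quantiles, not the $\chi^2_k$ ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # assumed probabilities, k = 4
n, reps = 1000, 20_000

S = rng.multinomial(n, pi, size=reps)
X2 = np.sum((S - n * pi) ** 2 / (n * pi), axis=1)

# Simulated quantiles track chi-square with k - 1 = 3 df, not k = 4 df.
print(X2.mean())                      # close to 3
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(X2, q),
          stats.chi2.ppf(q, df=3), stats.chi2.ppf(q, df=4))
```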
Some details on this follow.
A more rigorous treatment
It is not hard to check (using the multinomial covariance $\mathrm{Cov}(S_i, S_j) = -n\pi_i\pi_j$) that, in fact,
$$\mathrm{Cov}(U_i, U_j) = -\sqrt{\pi_i \pi_j} \quad \text{for } i \neq j\,.$$
So, the covariance of $U$ is
$$A = I - \sqrt{\pi}\sqrt{\pi}^T\,,$$
where $\sqrt{\pi} = (\sqrt{\pi_1}, \ldots, \sqrt{\pi_k})$. Note that $A$ is symmetric and idempotent, i.e., $A = A^2 = A^T$. So, in particular, if $Z = (Z_1, \ldots, Z_k)$ has iid standard normal components, then $AZ \sim \mathcal{N}(0, A)$. (NB: The multivariate normal distribution in this case is degenerate.)
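As a numerical sanity check (a sketch with the same made-up $\pi$), we can verify that $A$ is symmetric and idempotent and that both $AZ$ and $U$ have empirical covariance close to $A$:

```python
import numpy as np

rng = np.random.default_rng(3)

pi = np.array([0.1, 0.2, 0.3, 0.4])          # assumed probabilities (k = 4)
sqrt_pi = np.sqrt(pi)
A = np.eye(len(pi)) - np.outer(sqrt_pi, sqrt_pi)

# A is symmetric and idempotent.
print(np.allclose(A, A.T), np.allclose(A, A @ A))               # True True

# Empirical covariance of AZ, for Z with iid N(0,1) components, is close to A.
reps = 200_000
Z = rng.standard_normal((reps, len(pi)))
print(np.allclose(np.cov(Z @ A, rowvar=False), A, atol=0.02))   # True

# The empirical covariance of U (from multinomial counts) is also close to A,
# matching Cov(U_i, U_j) = -sqrt(pi_i * pi_j) off the diagonal.
n = 1000
S = rng.multinomial(n, pi, size=reps)
U = (S - n * pi) / np.sqrt(n * pi)
print(np.allclose(np.cov(U, rowvar=False), A, atol=0.02))       # True
```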
Now, by the Multivariate Central Limit Theorem, the vector $U$ has an asymptotic multivariate normal distribution with mean $0$ and covariance $A$.
So, $U$ has the same asymptotic distribution as $AZ$; hence, by the continuous mapping theorem, the asymptotic distribution of $X^2 = U^T U$ is the same as the distribution of $Z^T A^T A Z = Z^T A Z$.
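A short simulation sketch (same made-up $\pi$ and $n$) shows the two quadratic forms agreeing closely in distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # assumed probabilities
n, reps = 1000, 20_000

# Quadratic form U^T U computed from multinomial counts.
S = rng.multinomial(n, pi, size=reps)
U = (S - n * pi) / np.sqrt(n * pi)
UtU = np.sum(U ** 2, axis=1)

# Quadratic form Z^T A Z for Z with iid standard normal components.
A = np.eye(len(pi)) - np.outer(np.sqrt(pi), np.sqrt(pi))
Z = rng.standard_normal((reps, len(pi)))
ZtAZ = np.einsum('ij,jk,ik->i', Z, A, Z)

# The two quadratic forms have nearly identical quantiles.
qs = [0.5, 0.9, 0.99]
print(np.quantile(UtU, qs))
print(np.quantile(ZtAZ, qs))
```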
But, $A$ is symmetric and idempotent, so (a) it has orthogonal eigenvectors, (b) all of its eigenvalues are 0 or 1, and (c) the multiplicity of the eigenvalue 1 is $\mathrm{rank}(A)$. This means that $A$ can be decomposed as $A = QDQ^T$ where $Q$ is orthogonal and $D$ is a diagonal matrix with $\mathrm{rank}(A)$ ones on the diagonal and the remaining diagonal entries zero.
Thus, writing $W = Q^T Z$, which again has iid standard normal components since $Q$ is orthogonal, $Z^T A Z = W^T D W$ is a sum of $\mathrm{rank}(A)$ independent squared standard normals. Since $\sqrt{\pi}$ is a unit vector ($\sum_i \pi_i = 1$), $\sqrt{\pi}\sqrt{\pi}^T$ is a rank-one projection, so $A$ has rank $k-1$ in our case and $Z^T A Z$ must be $\chi^2_{k-1}$ distributed.
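The spectral claims can also be checked numerically; a sketch (same made-up $\pi$):

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])          # assumed probabilities (k = 4)
A = np.eye(len(pi)) - np.outer(np.sqrt(pi), np.sqrt(pi))

# Eigenvalues of the symmetric idempotent A are all 0 or 1, and the
# eigenvalue 1 has multiplicity rank(A) = k - 1.
eigvals, Q = np.linalg.eigh(A)
print(np.round(eigvals, 10))                 # approximately [0, 1, 1, 1]
print(np.linalg.matrix_rank(A))              # 3, i.e. k - 1

# Writing A = Q D Q^T, the rotated vector W = Q^T Z is again standard normal,
# so Z^T A Z = W^T D W is a sum of k - 1 independent squared standard normals.
```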
Other connections
The chi-square statistic is also closely related to likelihood ratio
statistics. Indeed, it is a Rao score statistic and can be viewed as a
Taylor-series approximation of the likelihood ratio statistic.
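For instance, for the simple multinomial null considered above, the likelihood ratio statistic is $G^2 = 2\sum_i S_i \log\big(S_i/(n\pi_i)\big)$, and for large $n$ the two statistics are typically close; a quick sketch (made-up $n$ and $\pi$):

```python
import numpy as np

rng = np.random.default_rng(5)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # assumed probabilities
n = 1000

S = rng.multinomial(n, pi)

# Pearson chi-square and the likelihood-ratio (G) statistic for the same
# simple multinomial null; for large n the two are typically close.
X2 = np.sum((S - n * pi) ** 2 / (n * pi))
G2 = 2 * np.sum(S * np.log(S / (n * pi)))
print(X2, G2)
```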
References
This is my own development based on experience, but obviously influenced by classical texts. Good places to look to learn more are
- G. A. F. Seber and A. J. Lee (2003), Linear Regression Analysis, 2nd ed., Wiley.
- E. Lehmann and J. Romano (2005), Testing Statistical Hypotheses, 3rd ed., Springer. Section 14.3 in particular.
- D. R. Cox and D. V. Hinkley (1979), Theoretical Statistics, Chapman and Hall.