Maximum value of the coefficient of variation for a bounded data set

17

In the discussion following a recent question about whether the standard deviation can exceed the mean, one question was raised briefly but never fully answered. So I am asking it here.

Consider a set of $n$ nonnegative numbers $x_i$ with $0 \leq x_i \leq c$ for $1 \leq i \leq n$. The $x_i$ are not required to be distinct, that is, the set may be a multiset. The mean and variance of the set are

$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, ~~ \sigma_x^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \left(\frac{1}{n}\sum_{i=1}^n x_i^2\right) - \bar{x}^2$

and the standard deviation is $\sigma_x$. Note that the set of numbers is not a sample from a population, and we are not estimating a population mean or a population variance. The question then is:

What is the maximum value of $\dfrac{\sigma_x}{\bar{x}}$, the coefficient of variation, over all choices of the $x_i$'s in the interval $[0,c]$?

The largest value of $\frac{\sigma_x}{\bar{x}}$ that I can find is $\sqrt{n-1}$, which is achieved when $n-1$ of the $x_i$ have value $0$ and the remaining (outlier) $x_i$ has value $c$, giving

$\bar{x} = \frac{c}{n},~~ \frac{1}{n}\sum x_i^2 = \frac{c^2}{n} \Rightarrow \sigma_x = \sqrt{\frac{c^2}{n} - \frac{c^2}{n^2}} = \frac{c}{n}\sqrt{n-1}.$

But this does not depend on $c$ at all, and I am wondering whether larger values, possibly dependent on both $n$ and $c$, can be achieved.
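The calculation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `coefficient_of_variation` and `extremal_data` are made up for illustration, and `statistics.pstdev` is used because it divides by $n$, matching the definition of $\sigma_x$ in the question.

```python
import math
import statistics

def coefficient_of_variation(xs):
    """Population CV: pstdev divides by n, as in the question's definition."""
    return statistics.pstdev(xs) / statistics.fmean(xs)

def extremal_data(n, c):
    """n - 1 zeros and a single value c (the configuration described above)."""
    return [0.0] * (n - 1) + [float(c)]

n = 5
# Same n, three different upper limits c.
cvs = [coefficient_of_variation(extremal_data(n, c)) for c in (0.5, 1.0, 40.0)]
target = math.sqrt(n - 1)
```

Under these assumptions every entry of `cvs` should agree with `target` up to rounding, whatever value of $c$ is used.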

Any ideas? I am sure that this question has been studied in the statistical literature before, so references, if not the actual results, would be much appreciated.

— Dilip Sarwate

I think you are right that this is the largest possible value, and I am also surprised that $c$ does not matter. Nice.

— Peter Flom - Reinstate Monica

7

$c$ should not affect the result, as $\frac{\sigma_x}{\bar{x}}$ does not change if all the values are multiplied by any positive constant $k$.

— Henry

15

Geometry provides insight and classical inequalities afford easy access to rigor.

Geometric solution

We know, from the geometry of least squares, that $\mathbf{\bar{x}} = (\bar{x}, \bar{x}, \ldots, \bar{x})$ is the orthogonal projection of the vector of data $\mathbf{x}=(x_1, x_2, \ldots, x_n)$ onto the linear subspace generated by the constant vector $(1,1,\ldots,1)$ and that $\sigma_x$ is directly proportional to the (Euclidean) distance between $\mathbf{x}$ and $\mathbf{\bar{x}}.$ The non-negativity constraints are linear and distance is a convex function, whence the extremes of distance must be attained at the edges of the cone determined by the constraints. This cone is the positive orthant in $\mathbb{R}^n$ and its edges are the coordinate axes, whence it immediately follows that all but one of the $x_i$ must be zero at the maximum distances. For such a set of data, a direct (simple) calculation shows $\sigma_x/\bar{x}=\sqrt{n}.$
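A quick numerical experiment is consistent with the geometric picture. This is only a sketch, not a proof: it samples random points in the positive orthant and compares them with a point on a coordinate axis. It assumes the $n-1$ denominator for $\sigma_x$ (which is what `statistics.stdev` computes), since that is the convention under which the bound $\sqrt{n}$ holds; the name `sample_cv` and the sample sizes are arbitrary choices for illustration.

```python
import math
import random
import statistics

def sample_cv(xs):
    """CV with the sample standard deviation (n - 1 denominator)."""
    return statistics.stdev(xs) / statistics.fmean(xs)

random.seed(0)
n = 6
bound = math.sqrt(n)

# Largest CV seen among random points of the positive orthant.
largest_random = max(
    sample_cv([random.random() for _ in range(n)]) for _ in range(2000)
)

# A point on a coordinate axis: all but one coordinate equal to zero.
axis_cv = sample_cv([0.0] * (n - 1) + [3.0])
```

In this sketch `largest_random` stays below `bound`, while `axis_cv` matches it up to rounding.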

Solution exploiting classical inequalities

$\sigma_x/\bar{x}$ is optimized simultaneously with any monotonic transformation thereof. In light of this, let's maximize

$\frac{x_1^2+x_2^2+\ldots+x_n^2}{(x_1+x_2+\ldots+x_n)^2} = \frac{1}{n}\left(\frac{n-1}{n}\left(\frac{\sigma_x}{\bar{x}}\right)^2+1\right) = f\left(\frac{\sigma_x}{\bar{x}}\right).$

(The formula for $f$ may look mysterious until you realize it just records the steps one would take in algebraically manipulating $\sigma_x/\bar{x}$ to get it into a simple looking form, which is the left hand side.)
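For readers who want the steps spelled out, here is one way to arrive at $f$. It assumes $\sigma_x$ is computed with the $n-1$ denominator, which the comments under this answer confirm is the convention in use here.

$\begin{aligned} \sum_{i=1}^n x_i^2 &= \sum_{i=1}^n (x_i-\bar{x})^2 + n\bar{x}^2 = (n-1)\sigma_x^2 + n\bar{x}^2, \\ \left(\sum_{i=1}^n x_i\right)^2 &= n^2\bar{x}^2, \\ \frac{\sum_{i=1}^n x_i^2}{\left(\sum_{i=1}^n x_i\right)^2} &= \frac{(n-1)\sigma_x^2 + n\bar{x}^2}{n^2\bar{x}^2} = \frac{1}{n}\left(\frac{n-1}{n}\left(\frac{\sigma_x}{\bar{x}}\right)^2 + 1\right). \end{aligned}$

Since $f$ is increasing in $\sigma_x/\bar{x} \ge 0$, maximizing the left hand side maximizes the coefficient of variation.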

An easy way begins with Hölder's Inequality,

$x_1^2+x_2^2+\ldots+x_n^2 \le \left(x_1+x_2+\ldots+x_n\right)\max(\{x_i\}).$

(This needs no special proof in this simple context: merely replace one factor of each term $x_i^2 = x_i \times x_i$ by the maximum component $\max(\{x_i\})$ : obviously the sum of squares will not decrease. Factoring out the common term $\max(\{x_i\})$ yields the right hand side of the inequality.)

Because the $x_i$ are not all $0$ (that would leave $\sigma_x/\bar{x}$ undefined), division by the square of their sum is valid and gives the equivalent inequality

$\frac{x_1^2+x_2^2+\ldots+x_n^2}{(x_1+x_2+\ldots+x_n)^2} \le \frac{\max(\{x_i\})}{x_1+x_2+\ldots+x_n}.$

Because the denominator cannot be less than the numerator (which itself is just one of the terms in the denominator), the right hand side is dominated by the value $1$ , which is achieved only when all but one of the $x_i$ equal $0$ . Whence

$\frac{\sigma_x}{\bar{x}} \le f^{-1}\left(1\right) = \sqrt{\left(1 \times (n - 1)\right)\frac{n}{n-1}}=\sqrt{n}.$
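The chain of inequalities can be spot-checked on random data. The sketch below is illustrative only: `sum_of_squares_ratio` and `cv_from_ratio` are hypothetical helper names, the small tolerance `1e-12` guards against floating-point rounding, and the inverse of $f$ assumes the $n-1$ denominator for $\sigma_x$.

```python
import math
import random

def sum_of_squares_ratio(xs):
    """Left hand side: sum of squares divided by the squared sum."""
    s = sum(xs)
    return sum(x * x for x in xs) / (s * s)

def cv_from_ratio(ratio, n):
    """Invert f, where ratio = (1/n) * ((n - 1)/n * cv**2 + 1)."""
    return math.sqrt((n * ratio - 1) * n / (n - 1))

random.seed(1)
n = 7
trials = [[random.uniform(0.0, 10.0) for _ in range(n)] for _ in range(1000)]

# Sum of squares is at most (sum) * (max) on every trial.
holder_ok = all(
    sum(x * x for x in xs) <= sum(xs) * max(xs) + 1e-12 for xs in trials
)
# After dividing by the squared sum, the ratio is at most max / sum.
ratio_ok = all(
    sum_of_squares_ratio(xs) <= max(xs) / sum(xs) + 1e-12 for xs in trials
)
# The largest possible ratio, 1, maps back to a CV of sqrt(n).
upper = cv_from_ratio(1.0, n)
```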

Alternative approach

Because the $x_i$ are nonnegative and cannot sum to $0$ , the values $p_i = x_i/(x_1+x_2+\ldots+x_n)$ determine a probability distribution $F$ on $\{1,2,\ldots,n\}$ . Writing $s$ for the sum of the $x_i$ , we recognize

$\begin{aligned} \frac{x_1^2+x_2^2+\ldots+x_n^2}{(x_1+x_2+\ldots+x_n)^2} &= \frac{x_1^2+x_2^2+\ldots+x_n^2}{s^2} \\ &= \left(\frac{x_1}{s}\right)\left(\frac{x_1}{s}\right)+\left(\frac{x_2}{s}\right)\left(\frac{x_2}{s}\right) + \ldots + \left(\frac{x_n}{s}\right)\left(\frac{x_n}{s}\right)\\ &= p_1 p_1 + p_2 p_2 + \ldots + p_n p_n\\ &= \mathbb{E}_F[p]. \end{aligned}$

The axiomatic fact that no probability can exceed $1$ implies this expectation cannot exceed $1$ , either, but it's easy to make it equal to $1$ by setting all but one of the $p_i$ equal to $0$ , so that exactly one of the $x_i$ is nonzero. Compute the coefficient of variation as in the last line of the geometric solution above.
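To make the expectation concrete, here is a small sketch that normalizes a data set into probabilities and evaluates $\mathbb{E}_F[p] = \sum p_i^2$. The function name `expected_p` and the three example data sets are chosen purely for illustration.

```python
import math

def expected_p(xs):
    """Normalize xs to probabilities p_i = x_i / s and return sum of p_i**2."""
    s = sum(xs)
    ps = [x / s for x in xs]
    return sum(p * p for p in ps)

spread = expected_p([1.0, 1.0, 1.0, 1.0])    # mass shared equally
mixed = expected_p([0.0, 2.0, 6.0])          # mass shared unequally
concentrated = expected_p([0.0, 0.0, 5.0])   # all mass on one index
```

The expectation grows as the mass concentrates and reaches $1$ only in the last case, where a single $x_i$ is nonzero.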

— whuber

Thanks for a detailed answer from which I have learned a lot! I assume that the difference between the $\sqrt{n}$ in your answer and the $\sqrt{n-1}$ that I obtained (and Henry confirmed) is due to the fact that you are using $\sigma_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2}$ as the definition of $\sigma_x$ while I used $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2}$?

— Dilip Sarwate

1

Yes Dilip, that's right. Sorry about the discrepancy with the question; I should have checked first and I should have defined $\sigma_x$ (which I intended to do but forgot).
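The two conventions are easy to compare side by side. A minimal sketch with the Python standard library, where `statistics.pstdev` divides by $n$ and `statistics.stdev` divides by $n-1$; the choice $n = 10$ and the variable names are arbitrary:

```python
import math
import statistics

n = 10
data = [0.0] * (n - 1) + [1.0]   # all but one value equal to zero
mean = statistics.fmean(data)

cv_population = statistics.pstdev(data) / mean   # divides by n
cv_sample = statistics.stdev(data) / mean        # divides by n - 1
```

On this data set `cv_population` comes out as $\sqrt{n-1} = 3$ and `cv_sample` as $\sqrt{n} \approx 3.162$, up to rounding.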

— whuber

10

Some references, as small candles on the cakes of others:

Katsnelson and Kotz (1957) proved that so long as all $x_i \ge 0$ , then the coefficient of variation cannot exceed $\sqrt{n-1}$ . This result was mentioned earlier by Longley (1952). Cramér (1946, p.357) proved a less sharp result, and Kirby (1974) proved a less general result.

Cramér, H. 1946. Mathematical methods of statistics. Princeton, NJ: Princeton University Press.

Katsnelson, J., and S. Kotz. 1957. On the upper limits of some measures of variability. Archiv für Meteorologie, Geophysik und Bioklimatologie, Series B 8: 103–107.

Kirby, W. 1974. Algebraic boundedness of sample statistics. Water Resources Research 10: 220–222.

Longley, R. W. 1952. Measures of the variability of precipitation. Monthly Weather Review 80: 111–117.

I came across these papers in working on

Cox, N.J. 2010. The limits of sample skewness and kurtosis. Stata Journal 10: 482-495.

which discusses broadly similar bounds on moment-based skewness and kurtosis.

— Nick Cox

8

With two numbers $x_i \ge x_j$ , some $\delta \gt 0$ and any $\mu$ :

$(x_i+\delta - \mu)^2 + (x_j - \delta - \mu)^2 - (x_i - \mu)^2 - (x_j - \mu)^2 = 2\delta(x_i - x_j +\delta) \gt 0.$

Applying this to $n$ non-negative datapoints, this means that unless all but one of the $n$ numbers are zero and so cannot be reduced further, it is possible to increase the variance and standard deviation by widening the gap between any pair of the data points while retaining the same mean, thus increasing the coefficient of variation. So the maximum coefficient of variation for the data set is as you suggest: $\sqrt{n-1}$ .
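The widening step can be illustrated on a small example. This is a sketch with made-up helper names (`widen`, `sum_sq_dev`) and an arbitrary four-point data set; it moves $\delta$ from the smaller value to the larger one and compares the change in the sum of squared deviations with the $2\delta(x_i - x_j + \delta)$ predicted by the identity above.

```python
import math
import statistics

def widen(xs, i, j, delta):
    """Move delta from xs[j] to xs[i]; the sum, and so the mean, is unchanged."""
    ys = list(xs)
    ys[i] += delta
    ys[j] -= delta
    return ys

def sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs)

before = [4.0, 3.0, 2.0, 1.0]
delta = 1.0
after = widen(before, 0, 3, delta)   # push the largest up, the smallest down

gain = sum_sq_dev(after) - sum_sq_dev(before)
predicted = 2 * delta * (before[0] - before[3] + delta)
```

Here the smallest value reaches $0$ after one step, so that particular point cannot be reduced further; repeating the step on other pairs keeps increasing the variance until only one value is nonzero.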

$c$ should not affect the result as $\frac{\sigma_x}{\bar{x}}$ does not change if all the values are multiplied by any positive constant $k$ (as I said in my comment).
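The scale invariance follows in one line from the definitions in the question (with the $\frac{1}{n}$ denominator; any other fixed denominator works the same way). For $k \gt 0$, writing $\overline{kx}$ and $\sigma_{kx}$ for the mean and standard deviation of the rescaled values $k x_i$:

$\overline{kx} = \frac{1}{n}\sum_{i=1}^n k x_i = k\bar{x}, \qquad \sigma_{kx} = \sqrt{\frac{1}{n}\sum_{i=1}^n (k x_i - k\bar{x})^2} = k\,\sigma_x, \qquad \frac{\sigma_{kx}}{\overline{kx}} = \frac{k\,\sigma_x}{k\,\bar{x}} = \frac{\sigma_x}{\bar{x}}.$

So any data set in $[0,c]$ can be rescaled into $[0,c']$ without changing its coefficient of variation, which is why the bound cannot depend on $c$.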

— Henry