Best practices for treating range data as continuous

I am looking at whether abundance is related to size. Size is (of course) continuous, but abundance is recorded on a scale such that:

A = 0-10
B = 11-25
C = 26-50
D = 51-100
E = 101-250
F = 251-500
G = 501-1000
H = 1001-2500
I = 2501-5000
J = 5001-10,000
etc...

A through Q ... 17 levels. I was thinking that one possible approach would be to assign each letter a number: either the minimum, the maximum, or the median (i.e. A = 5, B = 18, C = 38, D = 75.5 ...).

What are the potential pitfalls, and would it therefore be better to treat these data as categorical?

I have read this question, which provides some thoughts, but one of the keys of this data set is that the categories are not even: treating the data as categorical would assume that the difference between A and B is the same as the difference between B and C ... (which can be rectified by using the logarithm; thanks Anonymouse).

Ultimately, I would like to see whether size can be used as a predictor for abundance after taking other environmental factors into account. The prediction will also be in a range: given size X and factors A, B, and C, we predict that abundance Y will fall between Min and Max (which could span one or more scale points: more than Min D and less than Max F ... though the more precise the better).

— Trees4theForest

Answers:

Categorical solution

Treating the values as categorical loses the crucial information about relative sizes. A standard method to overcome this is ordered logistic regression. In effect, this method "knows" that $A\lt B\lt \cdots \lt J\lt \ldots$ and, using the observed relationships with regressors (such as size), fits (somewhat arbitrary) values to each category that respect the ordering.

As an illustration, consider 30 (size, abundance category) pairs generated as

size = (1/2, 3/2, 5/2, ..., 59/2)
e ~ normal(0, 1/6)
abundance = 1 + int(10^(4*size + e))

with abundance categorized into the intervals [0,10], [11,25], ..., [10001,25000].
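The recipe above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the answer: taken literally, sizes up to 59/2 would give abundances far beyond the top category, so the sketch assumes the sizes are rescaled to lie in (0, 1) by dividing by a hypothetical `scale` of 30, and it assumes that 1/6 is the standard deviation of the noise. The function names are made up for this example.

```python
import bisect
import random

# Upper bounds of the categories [0,10], [11,25], ..., [10001,25000].
UPPER = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000]

def categorize(abundance):
    """Return the 1-based index of the category containing `abundance`.

    Values above the last upper bound get index len(UPPER) + 1.
    """
    return bisect.bisect_left(UPPER, abundance) + 1

def simulate(n=30, scale=30.0, sd=1 / 6, seed=0):
    """Generate (size, abundance, category) triples.

    ASSUMPTIONS: sizes (1/2, 3/2, ..., 59/2) are divided by `scale`,
    and `sd` is a standard deviation, not a variance.
    """
    rng = random.Random(seed)
    triples = []
    for k in range(n):
        size = (k + 0.5) / scale
        e = rng.gauss(0.0, sd)
        abundance = 1 + int(10 ** (4 * size + e))
        triples.append((size, abundance, categorize(abundance)))
    return triples

for size, abundance, category in simulate()[:5]:
    print(f"size={size:.3f} abundance={abundance} category={category}")
```

Only the category would be available in real data of this kind; the raw abundance is kept here so that fits on categorized and uncategorized values can be compared.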

Scatterplot of abundance category against size

Ordered logistic regression produces a probability distribution for each category; the distribution depends on size. From such detailed information you can produce estimated values and intervals around them. Here is a plot of the 10 PDFs estimated from these data (an estimate for category 10 was not possible due to lack of data there):

Probability densities by category

Continuous solution

Why not select a numerical value to represent each category and view the uncertainty about the true abundance within the category as part of the error term?

We can analyze this as a discrete approximation to an idealized re-expression $f$ that converts abundance values $a$ into other values $f(a)$ for which the observational errors are, to a good approximation, symmetrically distributed and of roughly the same expected size regardless of $a$ (a variance-stabilizing transformation).

To simplify the analysis, suppose the categories have been chosen (based on theory or experience) to achieve such a transformation. We may then assume that $f$ re-expresses the category cutpoints $\alpha_i$ as their indexes $i$. The proposal amounts to selecting some "characteristic" value $\beta_i$ within each category $i$ and using $f(\beta_i)$ as the numerical value of the abundance whenever the abundance is observed to lie between $\alpha_i$ and $\alpha_{i+1}$. This would be a proxy for the correctly re-expressed value $f(a)$.

Suppose, then, that abundance is observed with error $\varepsilon$, so that the hypothetical datum is actually $a+\varepsilon$ instead of $a$. The error made in coding this as $f(\beta_i)$ is, by definition, the difference $f(\beta_i) - f(a)$, which we can express as a difference of two terms

$\text{error} = f(a + \varepsilon) - f(a) - \left(f(a + \varepsilon) - f(\beta_i)\right).$

The first term, $f(a + \varepsilon) - f(a)$, is controlled by $f$ (we can't do anything about $\varepsilon$) and would appear even if we did not categorize abundances. The second term is random (it depends on $\varepsilon$) and evidently is correlated with $\varepsilon$. But we can say something about it: it must lie between $i - f(\beta_i) \lt 0$ and $i+1 - f(\beta_i) \ge 0$. Moreover, if $f$ is doing a good job, the second term might be approximately uniformly distributed. Both considerations suggest choosing $\beta_i$ so that $f(\beta_i)$ lies halfway between $i$ and $i+1$; that is, $\beta_i \approx f^{-1}(i+1/2)$.

The categories in this question form an approximately geometric progression, indicating that $f$ is a slightly distorted version of a logarithm. Therefore, we should consider using the geometric means of the interval endpoints to represent the abundance data.
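To see why the geometric mean appears: if $f$ is affine in $\log a$ with $f(\alpha_i) = i$ and $f(\alpha_{i+1}) = i+1$, then $f^{-1}(i+1/2) = \sqrt{\alpha_i \alpha_{i+1}}$. A minimal sketch of the resulting representative values for categories A through J follows. It uses the endpoints exactly as listed in the question, and it assumes the lower endpoint 0 of category A is replaced by 1 so that the geometric mean is not zero; the names `CATEGORIES` and `representative` are hypothetical.

```python
import math

# (lower, upper) endpoints of categories A-J as listed in the question.
CATEGORIES = {
    "A": (0, 10), "B": (11, 25), "C": (26, 50), "D": (51, 100),
    "E": (101, 250), "F": (251, 500), "G": (501, 1000),
    "H": (1001, 2500), "I": (2501, 5000), "J": (5001, 10000),
}

def representative(lower, upper):
    """Geometric mean of the interval endpoints.

    ASSUMPTION: a lower endpoint of 0 is replaced by 1.
    """
    lower = max(lower, 1)
    return math.sqrt(lower * upper)

for letter, (lower, upper) in CATEGORIES.items():
    print(letter, round(representative(lower, upper), 1))
```

Compared with arithmetic midpoints such as A = 5 or D = 75.5, these values sit closer to the lower end of each interval, which is what a log-like $f$ implies.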

Ordinary least squares regression (OLS) with this procedure gives a slope of 7.70 (standard error 1.00) and an intercept of 0.70 (standard error 0.58), compared with a slope of 8.19 (se 0.97) and an intercept of 0.69 (se 0.56) when regressing log abundances against size. Both exhibit regression to the mean, because the theoretical slope should be close to $4 \log(10) \approx 9.21$. The categorical method exhibits a bit more regression to the mean (a smaller slope) due to the added discretization error, as expected.
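The fitting step can be sketched without any statistics library. The example below is an assumption-laden illustration: the sizes and categories are hypothetical values made up for the example, not the answer's data, so it will not reproduce the figures quoted above. It regresses the natural log of each category's geometric mean on size with a hand-written simple OLS.

```python
import math

def ols(x, y):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def log_geometric_mean(lower, upper):
    """Natural log of the geometric mean of the endpoints.

    ASSUMPTION: a lower endpoint of 0 is replaced by 1.
    """
    lower = max(lower, 1)
    return 0.5 * (math.log(lower) + math.log(upper))

# Hypothetical observations: a size and the abundance category it fell in.
sizes = [0.1, 0.3, 0.5, 0.7, 0.9]
bins = [(0, 10), (11, 25), (101, 250), (501, 1000), (2501, 5000)]

y = [log_geometric_mean(lower, upper) for lower, upper in bins]
slope, intercept = ols(sizes, y)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
```

Standard errors are omitted to keep the sketch short; a regression routine from a statistics package would report them.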

Regression results

This plot shows the uncategorized abundances along with a fit based on the categorized abundances (using geometric means of the category endpoints as recommended) and a fit based on the abundances themselves. The fits are remarkably close, indicating this method of replacing categories by suitably chosen numerical values works well in the example.

Some care usually is needed in choosing an appropriate "midpoint" $\beta_i$ for the two extreme categories, because often $f$ is not bounded there. (For this example I crudely took the left endpoint of the first category to be $1$ rather than $0$ and the right endpoint of the last category to be $25000$ .) One solution is to solve the problem first using data not in either of the extreme categories, then use the fit to estimate appropriate values for those extreme categories, then go back and fit all the data. The p-values will be slightly too good, but overall the fit should be more accurate and less biased.

— whuber

+1 excellent answer! I especially like how 2 different options are described along with their justifications. I also gather taking the log of abundance, not size, should be the emphasis, which was my thought as well. One question, in part 1, you state "you can produce estimated values and intervals around them". How does one do this?

— gung - Reinstate Monica

Good question, @gung. A crude way, which may be effective, is to treat the categories as interval-valued data and the ordered logit results are providing a (discrete) distribution over those intervals for any given value of the 'size'. The result is an interval-valued distribution, which will have an interval-valued mean and interval-valued confidence limits.
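A minimal sketch of the interval-valued mean described in this comment, assuming the ordered logit fit has already produced a probability for each category at a given size. The probabilities below are hypothetical and the function name is made up for the example.

```python
def interval_mean(probs, intervals):
    """Interval-valued mean of a discrete distribution over intervals.

    The lower limit averages the lower endpoints and the upper limit
    averages the upper endpoints, each weighted by category probability.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    lower = sum(p * lo for p, (lo, hi) in zip(probs, intervals))
    upper = sum(p * hi for p, (lo, hi) in zip(probs, intervals))
    return lower, upper

# Hypothetical predicted probabilities for categories C, D, E at one size.
probs = [0.2, 0.5, 0.3]
intervals = [(26, 50), (51, 100), (101, 250)]
print(interval_mean(probs, intervals))
```

The width of the resulting interval reflects both the spread across categories and the width of the categories themselves.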

— whuber

@whuber, it would be worth mentioning the software options. I am guessing that you used Stata (if I am trained well enough to recognize Stata graphs and tell them from R and SAS graphs), where this model is fitted with ologit. In R, you can do this with polr in the MASS package.

— StasK

You're correct, @Stask. Thanks for the reference to the R solution. (The graphs are all default graphs in Stata 11; only the legend and line styles in the last one were customized because the red-green distinction might otherwise not be apparent to about 3% of all readers.)

— whuber

@StasK rms::lrm and the ordinal (clm) package are also good options.

— chl

Consider using the logarithm of the size.

— Has QUIT--Anony-Mousse

Ha - That answer elicited a partial face palm. True that takes care of the scale issue - but still at hand: to categorize or not, and which number to peg the "value" to. If these questions are irrelevant, I can handle hearing that too.

— Trees4theForest

Well, you have been putting various issues into one. The data you have seems to make more sense on a logarithmic scale. Whether you want to do binning or not is a separate question, and there I only have another face palm reply for you: depends on your data and on what you want to achieve. Then there is another hidden question: how do I compute the difference between intervals - compute the difference of their means? or the minimal distance (then A to B would be 0, B to C would be 0, but A to C not). etc.

— Has QUIT--Anony-Mousse

Good points, I have updated my question with more information to address the goals. As for the difference in intervals, I think that is my question - what would be the relative advantages / disadvantages of computing the interval based on difference of means, minimal distance, maximal distance, distance between mins, distance between maxs, etc. Any advice on what sorts of things I need to consider to make this decision (or if it even needs to be considered) would be great.

— Trees4theForest

There are plenty of further options. For example, to eliminate all scale effects, you could try to predict the ranking position instead. Other than that, it is a question of measuring errors. By taking the logarithm, you usually also weight the errors this way. So when the true value is 10000 and the predicted value is 10100, this is much less of an error than when the predicted value is 1 and the true value is 101. By additionally doing binning and computing the mindist between the bins, you'd even weight small errors with 0.
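The two error measures in this comment can be compared on its own numbers. This is a rough sketch with assumptions: the bins are treated as contiguous intervals that touch at the cutpoints 10, 25, 50, ... (so adjacent bins are at distance 0, as in the earlier comment), values are assumed to lie between 0 and 25000, and the function names are hypothetical.

```python
import bisect
import math

# ASSUMPTION: contiguous bins (0,10], (10,25], ..., (10000,25000].
EDGES = [0, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000]

def log_error(true, pred):
    """Absolute error on the log10 scale (both values must be positive)."""
    return abs(math.log10(pred) - math.log10(true))

def bin_of(value):
    """Return the (lower, upper) edges of the bin containing `value`."""
    i = bisect.bisect_left(EDGES[1:], value)
    return EDGES[i], EDGES[i + 1]

def bin_mindist(true, pred):
    """Minimal distance between the bins containing the two values."""
    lo_a, hi_a = bin_of(true)
    lo_b, hi_b = bin_of(pred)
    return max(0, lo_b - hi_a, lo_a - hi_b)

print(log_error(10000, 10100), log_error(101, 1))
print(bin_mindist(10000, 10100), bin_mindist(5, 40))
```

On the log scale the two absolute errors of 100 are weighted very differently, while the minimal distance between bins is 0 whenever the two values fall in the same bin or in adjacent bins.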

— Has QUIT--Anony-Mousse