If we extend your example slightly by adding a third level to the race category (say Asian) and choosing White as the reference, you would have:
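- $\hat\beta_0 = \bar{x}_{\text{White}}$
- $\hat\beta_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat\beta_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$

Each $\hat\beta$ therefore has a direct interpretation, for example:
- $\bar{x}_{\text{Asian}} = \hat\beta_{\text{Asian}} + \hat\beta_0$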
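As a quick check of the single-variable case, here is a minimal sketch in R. The data frame `d1` is an assumption made for illustration: it reuses the six `y` values from the numerical example further down and keeps only the race variable.

```r
# Illustrative data (assumed): two observations per race level
d1 <- data.frame(
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),  # White is the reference
  y    = c(0, 3, 7, 8, 9, 10)
)

# Mean of y within each level
level_means <- tapply(d1$y, d1$Race, mean)

# Regression with the default treatment contrasts
fit1 <- lm(y ~ Race, data = d1)
coef(fit1)
# (Intercept) = 1.5, RaceBlack = 6, RaceAsian = 8

# The same numbers built from the level means
c(level_means[["White"]],
  level_means[["Black"]] - level_means[["White"]],
  level_means[["Asian"]] - level_means[["White"]])
```

The intercept is the White mean, and adding `RaceAsian` to it gives back the Asian mean of 9.5.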
Unfortunately, when there are multiple categorical variables, the correct interpretation of the intercept is no longer as clear (see the note at the end). When there are n categorical variables, each with multiple levels and one reference level (e.g. White and Male in your example), the general form of the intercept is:
$$\hat\beta_0 = \sum_{i=1}^{n} \bar{x}_{\text{reference},i} - (n-1)\,\bar{x},$$
where $\bar{x}_{\text{reference},i}$ is the mean of the reference level of the i-th categorical variable and $\bar{x}$ is the mean of the whole data set.
Going back to your example, we get:
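- $\hat\beta_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$
- $\hat\beta_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat\beta_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$
- $\hat\beta_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}_{\text{Male}}$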
You will notice that the means of the cross categories (e.g. White males) are not present in any of the $\hat\beta$. As a matter of fact, you cannot calculate these means precisely from the results of this type of regression.
The reason for this is that the number of estimated coefficients (i.e. the $\hat\beta$) is smaller than the number of cross categories (as long as you have more than one categorical variable), so a perfect fit is not always possible. Going back to your example, there are 4 coefficients ($\hat\beta_0$, $\hat\beta_{\text{Black}}$, $\hat\beta_{\text{Asian}}$ and $\hat\beta_{\text{Female}}$) while there are 6 cross categories.
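To make the counting argument concrete, the sketch below compares the additive model with a model that also includes the Sex-by-Race interaction. The interaction model is not part of the discussion above; it is brought in here only as an assumption-free way to get 6 coefficients for 6 cross categories. The data are the toy values used in the numerical example that follows, where each cross category holds a single observation.

```r
# Toy data: one observation per Sex x Race combination
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

additive  <- lm(y ~ Sex + Race, data = d)  # 4 coefficients
saturated <- lm(y ~ Sex * Race, data = d)  # 6 coefficients (adds the interaction)

length(coef(additive))
length(coef(saturated))

# Fitted values side by side with the observed cell values
cbind(d, additive = fitted(additive), saturated = fitted(saturated))
```

With only 4 coefficients the additive fit for White males is about 0.667 rather than the observed 0, whereas the 6-coefficient model reproduces every cell exactly.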
Numerical Example
Let me borrow from @Gung for a canned numerical example:
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black", "Asian"), each = 2), levels = c("White", "Black", "Asian")),
                y    = c(0, 3, 7, 8, 9, 10))
d
# Sex Race y
# 1 Male White 0
# 2 Female White 3
# 3 Male Black 7
# 4 Female Black 8
# 5 Male Asian 9
# 6 Female Asian 10
In this case, the various averages that go into the calculation of the $\hat\beta$ are:
aggregate(y~1, d, mean)
# y
# 1 6.166667
aggregate(y~Sex, d, mean)
# Sex y
# 1 Male 5.333333
# 2 Female 7.000000
aggregate(y~Race, d, mean)
# Race y
# 1 White 1.5
# 2 Black 7.5
# 3 Asian 9.5
We can compare these numbers with the results of the regression:
summary(lm(y~Sex+Race, d))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.6667 0.6667 1.000 0.4226
# SexFemale 1.6667 0.6667 2.500 0.1296
# RaceBlack 6.0000 0.8165 7.348 0.0180
# RaceAsian 8.0000 0.8165 9.798 0.0103
As you can see, the various $\hat\beta$ estimated from the regression all line up with the formulas given above. For example, $\hat\beta_0$ is given by:
$$\hat\beta_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$$
Which gives:
1.5 + 5.333333 - 6.166667
# 0.66666
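The same check can be done for all four coefficients at once. This is a sketch rather than a general recipe: the variable names `grand`, `sex_means`, `race_means` and `by_hand` are mine, and the match relies on the toy data being balanced (every Sex and Race combination appears equally often). With unbalanced data I would not expect the marginal means to reproduce the coefficients exactly.

```r
# Same toy data as above
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

grand      <- mean(d$y)                  # mean of the whole data set
sex_means  <- tapply(d$y, d$Sex, mean)   # marginal means by Sex
race_means <- tapply(d$y, d$Race, mean)  # marginal means by Race

# Coefficients rebuilt from the marginal means
by_hand <- c(
  "(Intercept)" = race_means[["White"]] + sex_means[["Male"]] - grand,
  SexFemale     = sex_means[["Female"]] - sex_means[["Male"]],
  RaceBlack     = race_means[["Black"]] - race_means[["White"]],
  RaceAsian     = race_means[["Asian"]] - race_means[["White"]]
)

fit <- lm(y ~ Sex + Race, data = d)
all.equal(by_hand, coef(fit))
```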
Note on the choice of contrast
A final note on this topic: all the results discussed above relate to categorical regressions using treatment contrasts (the default type of contrast in R). There are other types of contrast that could be used (notably Helmert and sum), and they would change the interpretation of the various $\hat\beta$. However, they would not change the final predictions from the regression (e.g. the prediction for White males is always the same no matter which type of contrast you use).
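Here is a small sketch of that invariance on the toy data from the numerical example. The helper `fit_with` is a name I made up; it just passes a contrast function name to the `contrasts` argument of `lm()` for both factors.

```r
# Same toy data as in the numerical example
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Hypothetical helper: refit the additive model under a given contrast coding
fit_with <- function(contr) {
  lm(y ~ Sex + Race, data = d,
     contrasts = list(Sex = contr, Race = contr))
}

fits <- lapply(c(treatment = "contr.treatment",
                 sum       = "contr.sum",
                 helmert   = "contr.helmert"),
               fit_with)

lapply(fits, coef)    # the coefficients differ from one coding to the next
sapply(fits, fitted)  # the fitted values (one column per coding) do not
```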
My personal favourite is the sum contrast, as I feel that the interpretation of the $\hat\beta^{\text{contr.sum}}$ generalises better when there are multiple categories. For this type of contrast there is no reference level, or rather the reference is the mean of the whole sample, and you have the following $\hat\beta^{\text{contr.sum}}$:
- $\hat\beta^{\text{contr.sum}}_0 = \bar{x}$
- $\hat\beta^{\text{contr.sum}}_i = \bar{x}_i - \bar{x}$
If we go back to the previous example, you would have:
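- $\hat\beta^{\text{contr.sum}}_0 = \bar{x}$
- $\hat\beta^{\text{contr.sum}}_{\text{White}} = \bar{x}_{\text{White}} - \bar{x}$
- $\hat\beta^{\text{contr.sum}}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}$
- $\hat\beta^{\text{contr.sum}}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}$
- $\hat\beta^{\text{contr.sum}}_{\text{Male}} = \bar{x}_{\text{Male}} - \bar{x}$
- $\hat\beta^{\text{contr.sum}}_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}$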
You will notice that because White and Male are no longer reference levels, their $\hat\beta^{\text{contr.sum}}$ are no longer 0. The fact that these are 0 is specific to treatment contrasts.
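For completeness, a sketch of the sum-contrast fit on the same toy data. Note that R only reports coefficients for all but the last level of each factor (labelled `Sex1`, `Race1`, `Race2` here); the coefficient for the last level is minus the sum of the others. As before, the exact match with the marginal means assumes a balanced data set.

```r
# Same toy data as in the numerical example
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
coef(fit_sum)
# (Intercept) ~  6.1667  (grand mean)
# Sex1        ~ -0.8333  (Male mean  - grand mean)
# Race1       ~ -4.6667  (White mean - grand mean)
# Race2       ~  1.3333  (Black mean - grand mean)

# Coefficients of the last levels, recovered from the sum-to-zero constraint
beta_female <- -unname(coef(fit_sum)["Sex1"])
beta_asian  <- -sum(coef(fit_sum)[c("Race1", "Race2")])

# Compare with the deviations of the marginal means from the grand mean
grand <- mean(d$y)
c(beta_female, mean(d$y[d$Sex == "Female"]) - grand)
c(beta_asian,  mean(d$y[d$Race == "Asian"]) - grand)
```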