Interpretation of betas when there are multiple categorical variables

I understand the concept that $\hat\beta_0$ is the mean when the categorical variable equals 0 (i.e., for the reference group), which gives the end interpretation that the regression coefficient is the difference in means between the two categories. Even with more than 2 categories, I would assume each $\hat\beta$ describes the difference between that category's mean and the reference.

But what if more variables are brought into a multivariable model? What does the intercept mean now, given that it doesn't make sense for it to be the mean for the reference of two categorical variables? An example would be if gender (M (ref) / F) and race (white (ref) / black) were both in a model. Is $\hat\beta_0$ the mean for white males only? How does one interpret the other possibilities?

As a separate note: do contrast statements serve as a way to investigate effect modification? Or just to see the effect ($\hat\beta$) at different levels?

— Renee

As a terminological note, "multivariate" means multiple response variables, not multiple predictor variables (see here). Also, I don't follow your last question.

— gung

Thanks for the clarification. Getting the language right is important to me! I guess I just can't figure out why contrast statements are used at all, since one can always set the reference level to whichever level the contrast is against?

— Renee

I suppose you could refit the model w/ different reference levels. I'm not sure that's more convenient. With contrasts, you can also specify a set of orthogonal contrasts, or theoretically motivated contrasts (A vs. the combination of B & C), to test.

— gung
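To make the two options in this comment concrete, here is a minimal sketch in R. The data frame `d`, the factor `g`, and the weight vector `w` are invented for illustration and are not from the thread; `lm()`, `relevel()`, `coef()` and `vcov()` are base R. It refits with a different reference level, then estimates one theory-driven contrast (C versus the average of A and B) from a single fit.

```r
# Hypothetical data: one factor with three levels (not from the thread).
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 2)),
                y = c(1, 2, 4, 5, 9, 10))

# Option 1: refit with a different reference level.
fit_A <- lm(y ~ g, d)                      # reference level is A
fit_C <- lm(y ~ relevel(g, ref = "C"), d)  # reference level is C

# Option 2: estimate a custom contrast from a single fit.
# Under treatment coding the level means are b0, b0 + bB, b0 + bC,
# so C - (A + B) / 2 equals bC - bB / 2.
w <- c(0, -0.5, 1)                         # weights on (Intercept), gB, gC
contrast_est <- sum(w * coef(fit_A))
contrast_se  <- sqrt(drop(t(w) %*% vcov(fit_A) %*% w))

contrast_est
contrast_se
```

The second option gives an estimate and standard error for a comparison that no single choice of reference level would report directly.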

Answers:

You are right about the interpretation of the betas when there is a single categorical variable with $k$ levels. If there are multiple categorical variables (and no interaction term), the intercept ($\hat\beta_0$) is the mean of the group that constitutes the reference level for both (all) categorical variables. Using your example scenario, consider the case where there is no interaction; then the betas are:

$\hat\beta_0$: the mean of white males
$\hat\beta_{\rm Female}$: the difference between the mean of females and the mean of males
$\hat\beta_{\rm Black}$: the difference between the mean of blacks and the mean of whites

We can also think of this in terms of how to calculate the various group means:

$$\begin{aligned}
\bar x_{\rm White\ Males} &= \hat\beta_0 \\
\bar x_{\rm White\ Females} &= \hat\beta_0 + \hat\beta_{\rm Female} \\
\bar x_{\rm Black\ Males} &= \hat\beta_0 + \hat\beta_{\rm Black} \\
\bar x_{\rm Black\ Females} &= \hat\beta_0 + \hat\beta_{\rm Female} + \hat\beta_{\rm Black}
\end{aligned}$$

If you had an interaction term, it would be added at the end of the equation for black females. (The interpretation of such an interaction term is quite convoluted, but I walk through it here: Interpretation of interaction term.)

Update: To clarify my points, let's consider a canned example, coded in R.

d = data.frame(Sex  =factor(rep(c("Male","Female"),times=2), levels=c("Male","Female")),
               Race =factor(rep(c("White","Black"),each=2),  levels=c("White","Black")),
               y    =c(1, 3, 5, 7))
d
#      Sex  Race y
# 1   Male White 1
# 2 Female White 3
# 3   Male Black 5
# 4 Female Black 7


The means of y for these categorical variables are:

aggregate(y~Sex,  d, mean)
#      Sex y
# 1   Male 3
# 2 Female 5
## i.e., the difference is 2
aggregate(y~Race, d, mean)
#    Race y
# 1 White 2
# 2 Black 6
## i.e., the difference is 4

We can compare the differences between these means to the coefficients from a fitted model:

summary(lm(y~Sex+Race, d))
# ...
# Coefficients:
#             Estimate Std. Error  t value Pr(>|t|)    
# (Intercept)        1   3.85e-16 2.60e+15  2.4e-16 ***
# SexFemale          2   4.44e-16 4.50e+15  < 2e-16 ***
# RaceBlack          4   4.44e-16 9.01e+15  < 2e-16 ***
# ...
# Warning message:
#   In summary.lm(lm(y ~ Sex + Race, d)) :
#   essentially perfect fit: summary may be unreliable

The thing to recognize about this situation is that, without an interaction term, we are assuming parallel lines. Thus, the Estimate for (Intercept) is the mean of white males. The Estimate for SexFemale is the difference between the mean of females and the mean of males. The Estimate for RaceBlack is the difference between the mean of blacks and the mean of whites. Again, because a model without an interaction term assumes that the effects are strictly additive (the lines are strictly parallel), the mean of black females is the mean of white males, plus the difference between the mean of females and the mean of males, plus the difference between the mean of blacks and the mean of whites.
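The additive reconstruction of the four group means can be checked directly. The sketch below rebuilds the same four-row data frame, adds up the coefficients by hand, and compares the result with the model's own fitted values; the name `by_hand` is illustrative.

```r
# Same canned data as in the example above.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 2),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black"), each = 2),
                              levels = c("White", "Black")),
                y    = c(1, 3, 5, 7))

fit <- lm(y ~ Sex + Race, d)
b   <- coef(fit)

# Group means rebuilt from the coefficients, in the row order of d.
by_hand <- c(WhiteMale   = b[["(Intercept)"]],
             WhiteFemale = b[["(Intercept)"]] + b[["SexFemale"]],
             BlackMale   = b[["(Intercept)"]] + b[["RaceBlack"]],
             BlackFemale = b[["(Intercept)"]] + b[["SexFemale"]] + b[["RaceBlack"]])

by_hand
all.equal(unname(by_hand), unname(fitted(fit)))
```

In this data set the effects happen to be exactly additive, so the rebuilt values also coincide with the observed y in each cell.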

— gung - Reinstate Monica

Thank you! Very clear and helpful. At the end you mention interaction terms. If one includes an interaction term, how does it change the betas (meaning the new betas from the model with the interaction term)? I know the p-value for the interaction term is important, but does the interaction term's beta have a meaningful interpretation? Thanks again for your help!

— Renee

In the case of an interaction, the 'main effect' betas refer only to differences within the reference level of the other factor. For example, $\hat\beta_{\rm Female}$ is only the difference between $\bar x_{\rm White\ Male}$ and $\bar x_{\rm White\ Female}$.

— gung
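A small sketch of this point, assuming invented data in which the female/male gap differs by race (the y values below are hypothetical and chosen only so that the interaction is non-zero). With `Sex * Race`, the SexFemale coefficient matches the gap among whites only, not the overall gap between females and males.

```r
# Hypothetical data: the sex gap is 2 among whites and 5 among blacks.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 2),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black"), each = 2),
                              levels = c("White", "Black")),
                y    = c(1, 3, 5, 10))

fit_int <- lm(y ~ Sex * Race, d)
b <- coef(fit_int)

# Gap within the reference level of Race (whites only).
gap_white <- d$y[d$Sex == "Female" & d$Race == "White"] -
             d$y[d$Sex == "Male"   & d$Race == "White"]

# Overall (marginal) gap between females and males.
gap_marginal <- mean(d$y[d$Sex == "Female"]) - mean(d$y[d$Sex == "Male"])

c(SexFemale = b[["SexFemale"]], gap_white = gap_white,
  gap_marginal = gap_marginal, interaction = b[["SexFemale:RaceBlack"]])
```

With one observation per cell this model is saturated, so only the coefficients are meaningful here; standard errors are not available.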

Makes sense. Thanks! And is that main effect changed from the model without the interaction term because of the interaction? Meaning, if there is no interaction, the main effect terms would theoretically be the same?

— Renee

If the interaction effect were exactly 0 (to infinite decimal places), not only in the population but in your sample, the main effect betas would be the same in a model w/ or w/o the interaction term.

— gung
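The canned data from the answer happen to have an interaction of exactly zero in the sample, so they can illustrate this comment. The following is a minimal sketch; the object names `fit_add` and `fit_int` are illustrative.

```r
# Canned data from the answer: effects are exactly additive in the sample.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 2),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black"), each = 2),
                              levels = c("White", "Black")),
                y    = c(1, 3, 5, 7))

fit_add <- lm(y ~ Sex + Race, d)  # without the interaction term
fit_int <- lm(y ~ Sex * Race, d)  # with the interaction term

# The interaction estimate is zero up to floating-point error ...
interaction_est <- coef(fit_int)[["SexFemale:RaceBlack"]]

# ... and the shared coefficients agree between the two models.
shared <- c("(Intercept)", "SexFemale", "RaceBlack")
same_main_effects <- all.equal(coef(fit_add)[shared], coef(fit_int)[shared])

interaction_est
same_main_effects
```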

@hans0l0, that would be better as a new question rather than information buried here in the comments; you can link to this one for context. Briefly, it is the mean of the reference levels when all continuous variables = 0.

— gung - Reinstate Monica

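With a single categorical variable, $\hat{\beta}_0$ is the mean of the reference level and each other $\hat\beta$ is the difference between the mean of its level and the mean of the reference.

If we extend your example slightly to include a third level in the race category (say Asian) and choose White as the reference, then you would have: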

$\hat{\beta}_0 = \bar{x}_{White}$
$\hat{\beta}_{Black} = \bar{x}_{Black} - \bar{x}_{White}$
$\hat{\beta}_{Asian} = \bar{x}_{Asian} - \bar{x}_{White}$

The mean of any level can then be recovered from the $\hat{\beta}$, for example:

$\bar{x}_{Asian} = \hat{\beta}_{Asian} + \hat{\beta}_0$
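As a quick check of the single-variable case, here is a minimal sketch with illustrative numbers (two observations per race level). It fits a model with Race only and recovers the Asian mean from the coefficients.

```r
# Illustrative data: one categorical predictor with three levels.
d <- data.frame(Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                              levels = c("White", "Black", "Asian")),
                y    = c(0, 3, 7, 8, 9, 10))

fit <- lm(y ~ Race, d)
b   <- coef(fit)

level_means <- tapply(d$y, d$Race, mean)

# Intercept is the mean of the reference level (White);
# the Asian mean is the Asian coefficient plus the intercept.
mean_asian_from_betas <- b[["RaceAsian"]] + b[["(Intercept)"]]

b
level_means
mean_asian_from_betas
```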

Unfortunately, in the case of multiple categorical variables, the correct interpretation of the intercept is no longer as clear (see the note at the end). When there are n categorical variables, each with several levels and one reference level (e.g. White and Male in your example), the general form of the intercept is:

$\hat{\beta}_0 = \sum_{i=1}^{n}\bar{x}_{reference,i} - (n-1)\bar{x},$

where

$\bar{x}_{reference,i}$ is the mean of the reference level of the i-th categorical variable,

$\bar{x}$ is the mean of the whole data set.

The other $\hat\beta$ are the same as with a single category: they are the difference between the mean of that level of the category and the mean of the reference level of the same category.

If we go back to your example, we would get:

$\hat{\beta}_0 = \bar{x}_{White} + \bar{x}_{Male} - \bar{x}$
$\hat{\beta}_{Black} = \bar{x}_{Black} - \bar{x}_{White}$
$\hat{\beta}_{Asian} = \bar{x}_{Asian} - \bar{x}_{White}$
$\hat{\beta}_{Female} = \bar{x}_{Female} - \bar{x}_{Male}$

You will notice that the means of the cross categories (e.g. White males) are not present in any of the $\hat\beta$. As a matter of fact, you cannot calculate these means precisely from the results of this type of regression.

The reason for this is that the number of estimated coefficients (i.e. the $\hat\beta$) is smaller than the number of cross categories (as long as you have more than one categorical variable), so a perfect fit is not always possible. If we go back to your example, the number of coefficients is 4 (i.e. $\hat{\beta}_0, ~\hat{\beta}_{Black}, ~\hat{\beta}_{Asian}$ and $\hat{\beta}_{Female}$) while the number of cross categories is 6.

Numerical Example

Let me borrow from @Gung for a canned numerical example:

d = data.frame(Sex=factor(rep(c("Male","Female"),times=3), levels=c("Male","Female")),
    Race =factor(rep(c("White","Black","Asian"),each=2),levels=c("White","Black","Asian")),
    y    =c(0, 3, 7, 8, 9, 10))
d

#      Sex  Race  y
# 1   Male White  0
# 2 Female White  3
# 3   Male Black  7
# 4 Female Black  8
# 5   Male Asian  9
# 6 Female Asian 10

In this case, the various averages that will go in the calculation of the $\hat\beta$ are:

aggregate(y~1,  d, mean)

#          y
# 1 6.166667

aggregate(y~Sex,  d, mean)

#      Sex        y
# 1   Male 5.333333
# 2 Female 7.000000

aggregate(y~Race, d, mean)

#    Race   y
# 1 White 1.5
# 2 Black 7.5
# 3 Asian 9.5

We can compare these numbers with the results of the regression:

summary(lm(y~Sex+Race, d))

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.6667     0.6667   1.000   0.4226
# SexFemale     1.6667     0.6667   2.500   0.1296
# RaceBlack     6.0000     0.8165   7.348   0.0180
# RaceAsian     8.0000     0.8165   9.798   0.0103

As you can see, the various $\hat\beta$ estimated from the regression all line up with the formulas given above. For example, $\hat\beta_0$ is given by:

$\hat{\beta}_0 = \bar{x}_{White} + \bar{x}_{Male} - \bar{x}$

which gives:

1.5 + 5.333333 - 6.166667
# 0.66666
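The same data can also illustrate the earlier point that cross-category means are not recovered exactly. The sketch below counts coefficients against cells and compares the fitted value for White males with the observed one; the names `n_coef`, `n_cells` and `b0_formula` are illustrative. Note that this example is balanced (one observation per cell), and the intercept identity relies on that.

```r
# Same data as in the numerical example above.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 3),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                              levels = c("White", "Black", "Asian")),
                y    = c(0, 3, 7, 8, 9, 10))

fit <- lm(y ~ Sex + Race, d)

n_coef  <- length(coef(fit))                  # estimated coefficients
n_cells <- nrow(unique(d[c("Sex", "Race")]))  # Sex x Race combinations

# Intercept from the marginal means: White + Male - overall mean.
b0_formula <- mean(d$y[d$Race == "White"]) +
              mean(d$y[d$Sex == "Male"]) - mean(d$y)

# Row 1 of d is the White male cell: fitted versus observed.
fitted_white_male   <- unname(fitted(fit)[1])
observed_white_male <- d$y[1]

c(n_coef = n_coef, n_cells = n_cells)
c(b0_formula = b0_formula, fitted = fitted_white_male,
  observed = observed_white_male)
```

The fitted value for White males equals the intercept, while the observed value in that cell is different, which is the gap the additive model cannot close with four coefficients.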

Note on the choice of contrast

A final note on this topic: all the results discussed above relate to categorical regressions using treatment contrasts (the default type of contrast in R). There are other types of contrast that could be used (notably Helmert and sum), and they would change the interpretation of the various $\hat\beta$. However, they would not change the final predictions from the regression (e.g. the prediction for White males is always the same no matter which type of contrast you use).
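The invariance of the predictions can be checked with a short sketch. It fits the same additive model under treatment and sum contrasts using the `contrasts` argument of `lm()`; the object names `fit_trt` and `fit_sum` are illustrative.

```r
# Same data as in the numerical example above.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 3),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                              levels = c("White", "Black", "Asian")),
                y    = c(0, 3, 7, 8, 9, 10))

# Default treatment contrasts versus sum contrasts.
fit_trt <- lm(y ~ Sex + Race, d)
fit_sum <- lm(y ~ Sex + Race, d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))

# The coefficients differ ...
coef(fit_trt)
coef(fit_sum)

# ... but the fitted values are the same.
same_predictions <- all.equal(fitted(fit_trt), fitted(fit_sum))
same_predictions
```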

My personal favourite is contrast sum as I feel that the interpretation of the $\hat\beta^{contr.sum}$ generalises better when there are multiple categories. For this type of contrast, there is no reference level, or rather the reference is the mean of the whole sample, and you have the following $\hat\beta^{contr.sum}$ :

$\hat\beta_0^{contr.sum}=\bar{x}$
$\hat\beta_i^{contr.sum}=\bar{x}_i-\bar{x}$

If we go back to the previous example, you would have:

$\hat{\beta}_0^{contr.sum} = \bar{x}$
$\hat{\beta}_{White}^{contr.sum} = \bar{x}_{White} - \bar{x}$
$\hat{\beta}_{Black}^{contr.sum} = \bar{x}_{Black} - \bar{x}$
$\hat{\beta}_{Asian}^{contr.sum} = \bar{x}_{Asian} - \bar{x}$
$\hat{\beta}_{Male}^{contr.sum} = \bar{x}_{Male} - \bar{x}$
$\hat{\beta}_{Female}^{contr.sum} = \bar{x}_{Female} - \bar{x}$

You will notice that because White and Male are no longer reference levels, their $\hat\beta^{contr.sum}$ are no longer 0. The fact that these are 0 is specific to contrast treatment.
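These identities can be checked on the same data. One practical detail worth noting: with `contr.sum`, R reports only k - 1 coefficients per factor (named Sex1, Race1, Race2 here), and the coefficient of the last level is minus the sum of the reported ones. The sketch below assumes the balanced data from the numerical example; the names `grand`, `beta_female` and `beta_asian` are illustrative.

```r
# Same data as in the numerical example above.
d <- data.frame(Sex  = factor(rep(c("Male", "Female"), times = 3),
                              levels = c("Male", "Female")),
                Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                              levels = c("White", "Black", "Asian")),
                y    = c(0, 3, 7, 8, 9, 10))

fit_sum <- lm(y ~ Sex + Race, d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
b     <- coef(fit_sum)
grand <- mean(d$y)

# Reported coefficients: level mean minus the overall mean.
c(intercept = b[["(Intercept)"]], grand_mean = grand)
c(Male  = b[["Sex1"]],  check = mean(d$y[d$Sex  == "Male"])  - grand)
c(White = b[["Race1"]], check = mean(d$y[d$Race == "White"]) - grand)
c(Black = b[["Race2"]], check = mean(d$y[d$Race == "Black"]) - grand)

# Implied coefficients for the last level of each factor.
beta_female <- -b[["Sex1"]]
beta_asian  <- -(b[["Race1"]] + b[["Race2"]])
c(Female = beta_female, Asian = beta_asian)
```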

— G.L.