Principal component analysis "backwards": how much variance of the data is explained by a given linear combination of the variables?

I have carried out a principal component analysis of six variables $A$, $B$, $C$, $D$, $E$ and $F$. If I understand correctly, unrotated PC1 tells me which linear combination of these variables describes/explains the most variance in the data, and PC2 tells me which linear combination of these variables describes the next most variance in the data, and so on.

I'm just curious -- is there any way of doing this "backwards"? Let's say I choose some linear combination of these variables -- e.g. $A+2B+5C$, could I work out how much variance in the data this describes?

— N26

Strictly, PC2 is the linear combination orthogonal to PC1 that describes the next most variance in the data.

— Henry

Are you trying to estimate $Var(A+2B+5C)$?

— vqv

All good answers (three +1s). I'm curious about people's opinions on whether the problem as formulated is solvable via a latent variable approach (SEM/LVM), if we consider one or more latent variables to be "a linear combination of the variables".

— Aleksandr Blekh

@Aleksandr, my answer is actually in direct contradiction with the other two. I have edited my answer to clarify the disagreement (and am planning to edit it further to spell out the math). Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? The two other solutions give $50\%$. I argue that the correct answer is $100\%$.

— amoeba says Reinstate Monica

@amoeba: Despite still struggling to fully comprehend the material, I do understand that your answer is different. When I said "all good answers", I meant that I like the level of the answers per se, not their correctness. I see educational value in that for people like me, who are on a quest for self-education in the rugged country called Statistics :-). Hope that makes sense.

— Aleksandr Blekh

Answers:

If we start with the premise that all variables have been centered (standard practice in PCA), then the total variance in the data is just the sum of squares:

$T=\sum_{i}(A_{i}^{2}+B_{i}^{2}+C_{i}^{2}+D_{i}^{2}+E_{i}^{2}+F_{i}^{2})$

This is equal to the trace of the covariance matrix of the variables (up to a constant factor that depends only on the number of observations), which equals the sum of the eigenvalues of the covariance matrix. This is the same quantity that PCA speaks of in terms of "explaining the data", i.e. you want your PCs to explain the greatest proportion of the diagonal elements of the covariance matrix. Now if we make this an objective function for a set of predicted values like so:

$S=\sum_{i}\left(\left[A_{i}-\hat{A}_{i}\right]^{2}+\dots+\left[F_{i}-\hat{F}_{i}\right]^{2}\right)$

Then the first principal component minimises $S$ among all rank 1 fitted values $(\hat{A}_{i},\dots,\hat{F}_{i})$. So it would seem that the appropriate quantity you are after is

$P=1-\frac{S}{T}$

To use your example $A+2B+5C$, we need to turn this equation into rank 1 predictions. First you need to normalise the weights so that their sum of squares is 1. So we replace $(1,2,5,0,0,0)$ (sum of squares $30$) with $\left(\frac{1}{\sqrt{30}},\frac{2}{\sqrt{30}},\frac{5}{\sqrt{30}},0,0,0\right)$. Next we "score" each observation according to the normalised weights:

$Z_{i}=\frac{1}{\sqrt{30}}A_{i}+\frac{2}{\sqrt{30}}B_{i}+\frac{5}{\sqrt{30}}C_{i}$

Then we multiply the scores by the weight vector to get our rank 1 prediction.

$\begin{pmatrix} \hat{A}_{i} \\ \hat{B}_{i} \\ \hat{C}_{i} \\ \hat{D}_{i} \\ \hat{E}_{i} \\ \hat{F}_{i}\end{pmatrix} =Z_{i}\times\begin{pmatrix} \frac{1}{\sqrt{30}} \\ \frac{2}{\sqrt{30}} \\ \frac{5}{\sqrt{30}} \\ 0 \\ 0 \\ 0\end{pmatrix}$

Then we plug these estimates into $S$ and calculate $P$. You can also put this into matrix norm notation, which may suggest a different generalisation. If we set $O$ as the $N\times q$ matrix of observed values of the variables ($q=6$ in your case), and $E$ as the corresponding matrix of predictions, we can define the proportion of variance explained as:

$\frac{||O||_{2}^{2}-||O-E||_{2}^{2}}{||O||_{2}^{2}}$

where $||.||_{2}$ is the Frobenius matrix norm. So you could "generalise" this to some other kind of matrix norm, and you would get a different measure of "variation explained", although it won't be "variance" per se unless it is a sum of squares.
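The recipe above (normalise the weights, score each observation, form the rank 1 prediction, compare $S$ with $T$) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `explained_proportion` and the random six-column data set are made up for the example and are not part of the question.

```python
import numpy as np

def explained_proportion(data, weights):
    """Return P = 1 - S/T for the rank-1 prediction built from `weights`."""
    O = data - data.mean(axis=0)         # centre every column
    w = np.asarray(weights, dtype=float)
    w = w / np.sqrt(np.sum(w ** 2))      # normalise so the squared weights sum to 1
    Z = O @ w                            # scores Z_i, one per observation
    E = np.outer(Z, w)                   # rank-1 predictions (N x q)
    T = np.sum(O ** 2)                   # total sum of squares
    S = np.sum((O - E) ** 2)             # residual sum of squares
    return 1.0 - S / T

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of six correlated variables A..F
data = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
P = explained_proportion(data, [1, 2, 5, 0, 0, 0])
print(P)
```

The result always lies between 0 and 1, because the residual sum of squares of an orthogonal projection can never exceed the total.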

— probabilityislogic

This is a reasonable approach, but your expression can be simplified a lot and shown to be simply equal to the sum of squares of $Z_i$ divided by the total sum of squares $T$. Also, I think this is not the best way to interpret the question; see my answer for an alternative approach that I argue makes more sense (in particular, see my example figure there).

— amoeba says Reinstate Monica

Think about it like this. Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? Your calculation gives $50\%$. I argue that the correct answer is $100\%$.

— amoeba says Reinstate Monica

@amoeba - if $X=Y$ then the first PC is $(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}})$ - this makes rank $1$ scores of $z_i=\frac{x_i+y_i}{\sqrt{2}}=x_i\sqrt{2}$ (assuming $x_i=y_i$). This gives rank $1$ predictions of $\hat{x}_i=x_i$, and similarly $\hat{y}_i=y_i$. Hence you get $O-E=0$ and $S=0$. Hence you get 100% as your intuition suggests.

— probabilityislogic

Hey, yes, sure, the 1st PC explains 100% variance, but that's not what I meant. What I meant is that $X=Y$, but the question is how much variance is described by $X$, i.e. by the $(1,0)$ vector? What does your formula say then?

— amoeba says Reinstate Monica

@amoeba - this says 50%, but note that the $(1,0)$ vector says that the best rank $1$ predictor for $(x_i, y_i)$ is given as $\hat{x}_i=x_i$ and $\hat{y}_i=0$ (noting that $z_i=x_i$ under your choice of vector). This is not an optimal prediction, which is why you don't get 100%. You need to predict both $X$ and $Y$ in this set-up.

— probabilityislogic

Let's say I choose some linear combination of these variables -- e.g. $A+2B+5C$ , could I work out how much variance in the data this describes?

This question can be understood in two different ways, leading to two different answers.

A linear combination corresponds to a vector, which in your example is $[1, 2, 5, 0, 0, 0]$ . This vector, in turn, defines an axis in the 6D space of the original variables. What you are asking is, how much variance does projection on this axis "describe"? The answer is given via the notion of "reconstruction" of original data from this projection, and measuring the reconstruction error (see Wikipedia on Fraction of variance unexplained). Turns out, this reconstruction can be reasonably done in two different ways, yielding two different answers.

Approach #1

Let $\newcommand{\S}{\boldsymbol \Sigma} \newcommand{\w}{\mathbf w} \newcommand{\v}{\mathbf v}\newcommand{\X}{\mathbf X} \X$ be the centered dataset ($n$ rows correspond to samples, $d$ columns correspond to variables), let $\S$ be its covariance matrix, and let $\w$ be a unit vector from $\mathbb R^d$. The total variance of the dataset is the sum of all $d$ variances, i.e. the trace of the covariance matrix: $T = \mathrm{tr}(\S)$. The question is: what proportion of $T$ does $\w$ describe? The two answers given by @todddeluca and @probabilityislogic are both equivalent to the following: compute the projection $\X \w$, compute its variance, and divide by $T$:

$R^2_\mathrm{first} = \frac{\mathrm{Var}(\X \w)}{T} = \frac{\w^\top \S \w}{\mathrm{tr}(\S)}.$

This might not be immediately obvious, because e.g. @probabilityislogic suggests to consider the reconstruction $\X \w \w^\top$ and then to compute

$\frac{\|\X\|^2 - \|\X-\X \w \w^\top\|^2}{\|\X\|^2},$

but with a little algebra this can be shown to be an equivalent expression.
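For readers who want the "little algebra" spelled out, here is one way to do it. It is a sketch that relies only on $\mathbf w^\top \mathbf w = 1$ and on the fact that, for centered data, $\mathbf X^\top \mathbf X$ is proportional to the covariance matrix (the proportionality constant cancels in the ratio).

```latex
\begin{align*}
\|\mathbf X - \mathbf X \mathbf w \mathbf w^\top\|^2
  &= \operatorname{tr}\!\left[(\mathbf X - \mathbf X \mathbf w \mathbf w^\top)^\top
                              (\mathbf X - \mathbf X \mathbf w \mathbf w^\top)\right] \\
  &= \operatorname{tr}(\mathbf X^\top \mathbf X)
     - 2\,\mathbf w^\top \mathbf X^\top \mathbf X \mathbf w
     + (\mathbf w^\top \mathbf w)\,\mathbf w^\top \mathbf X^\top \mathbf X \mathbf w \\
  &= \|\mathbf X\|^2 - \|\mathbf X \mathbf w\|^2 ,
\end{align*}
so that
\[
\frac{\|\mathbf X\|^2 - \|\mathbf X - \mathbf X \mathbf w \mathbf w^\top\|^2}{\|\mathbf X\|^2}
  = \frac{\|\mathbf X \mathbf w\|^2}{\|\mathbf X\|^2}
  = \frac{\mathbf w^\top \boldsymbol\Sigma \mathbf w}{\operatorname{tr}(\boldsymbol\Sigma)} .
\]
```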

Approach #2

Okay. Now consider the following example: $\X$ is a $d=2$ dataset with covariance matrix

$\S = \left(\begin{array}{cc}1&0.99\\0.99&1\end{array}\right)$

and $\w = (1, 0)^\top$ is simply the unit vector along $x$:

[Figure: variance explained. Blue dots are the data; red dots are their projections onto $\w$.]

The total variance is $T=2$ . The variance of the projection onto $\w$ (shown in red dots) is equal to $1$ . So according to the above logic, the explained variance is equal to $1/2$ . And in some sense it is: red dots ("reconstruction") are far away from the corresponding blue dots, so a lot of the variance is "lost".

On the other hand, the two variables have $0.99$ correlation and so are almost identical; saying that one of them describes only $50\%$ of the total variance is weird, because each of them contains "almost all the information" about the second one. We can formalize it as follows: given projection $\X\w$ , find a best possible reconstruction $\X\w\v^\top$ with $\v$ not necessarily the same as $\w$ , and then compute the reconstruction error and plug it into the expression for the proportion of explained variance:

$R^2_\mathrm{second}=\frac{\|\X\|^2 - \|\X-\X \w \v^\top\|^2}{\|\X\|^2},$

where $\v$ is chosen such that $\|\X-\X \w \v^\top\|^2$ is minimal (i.e. $R^2$ is maximal). This is exactly equivalent to computing $R^2$ of multivariate regression predicting original dataset $\X$ from the $1$-dimensional projection $\X\w$.

It is a matter of straightforward algebra to use regression solution for $\v$ to find that the whole expression simplifies to

$R^2_\mathrm{second}=\frac{\|\S \w\|^2}{\w^\top \S \w \cdot \mathrm{tr}(\S)}.$

In the example above this is equal to $0.9901$, which seems reasonable.
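Both proportions depend on the data only through the covariance matrix, so the two-variable example can be checked directly. The sketch below is illustrative; the helper names `r2_first` and `r2_second` are made up for this example.

```python
import numpy as np

def r2_first(Sigma, w):
    """Variance of the projection onto unit vector w, divided by the total variance."""
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)            # the formula assumes a unit vector
    return (w @ Sigma @ w) / np.trace(Sigma)

def r2_second(Sigma, w):
    """R^2 of regressing the whole dataset on its projection onto w."""
    w = np.asarray(w, dtype=float)
    Sw = Sigma @ w
    return (Sw @ Sw) / ((w @ Sw) * np.trace(Sigma))

Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])
w = np.array([1.0, 0.0])

first = r2_first(Sigma, w)
second = r2_second(Sigma, w)
print(first, second)
```

The first value should come out at one half and the second at about 0.99, in line with the numbers quoted in the text. Note that `r2_second` does not change if `w` is rescaled, whereas `r2_first` needs the normalisation.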

Note that if (and only if) $\w$ is one of the eigenvectors of $\S$, i.e. one of the principal axes, with eigenvalue $\lambda$ (so that $\S \w = \lambda \w$), then both approaches to compute $R^2$ coincide and reduce to the familiar PCA expression

$R^2_\mathrm{PCA} = R^2_\mathrm{first} = R^2_\mathrm{second} = \lambda/\mathrm{tr}(\S) = \lambda/\sum \lambda_i.$
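A quick numerical check of this coincidence, using the same $2\times 2$ covariance matrix as in the example; the variable names are chosen for illustration only.

```python
import numpy as np

Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])
total = np.trace(Sigma)

# eigh returns eigenvalues in ascending order; columns of eigvecs are unit eigenvectors
eigvals, eigvecs = np.linalg.eigh(Sigma)

rows = []
for lam, w in zip(eigvals, eigvecs.T):
    Sw = Sigma @ w
    first = (w @ Sw) / total                     # projection-variance formula
    second = (Sw @ Sw) / ((w @ Sw) * total)      # regression-based formula
    rows.append((lam / total, first, second))
    print(lam / total, first, second)
```

For each principal axis the three printed numbers agree up to floating-point error.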

PS. See my answer here for an application of the derived formula to the special case of $\w$ being one of the basis vectors: Variance of the data explained by a single variable.

Appendix. Derivation of the formula for $R^2_\mathrm{second}$

Finding $\v$ minimizing the reconstruction error $\|\X-\X \w \v^\top\|^2$ is a regression problem (with $\X \w$ as univariate predictor and $\X$ as multivariate response). Its solution is given by

$\v^\top = \left((\X \w)^\top (\X \w)\right)^{-1}(\X \w)^\top \X = (\w^\top \S \w)^{-1} \w^\top \S.$

Next, the $R^2$ formula can be simplified as

$R^2=\frac{\|\X\|^2 - \|\X-\X \w \v^\top\|^2}{\|\X\|^2} = \frac{\|\X \w \v^\top\|^2}{\|\X\|^2}$

due to the Pythagoras theorem, because the hat matrix in regression is an orthogonal projection (but it is also easy to show directly).

Plugging now the equation for $\v$, we obtain for the numerator:

$\|\X \w \v^\top\|^2 = \mathrm{tr}\left(\X \w \v^\top (\X \w \v^\top)^\top\right) = \mathrm{tr}(\X\w\w^\top\S\S\w\w^\top\X^\top)/(\w^\top\S\w)^2=\mathrm{tr}(\w^\top\S\S\w)/(\w^\top\S\w) = \|\S\w\|^2 / (\w^\top\S\w).$

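The denominator is equal to $\|\X\|^2 = \mathrm{tr}(\S)$, resulting in the formula given above. (Strictly, $\|\X\|^2$ equals $\mathrm{tr}(\S)$ times a constant, $n-1$ or $n$ depending on the covariance convention; the same constant was dropped from the numerator, so it cancels in the ratio.)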
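The derivation can be sanity-checked numerically by solving the regression explicitly and comparing with the closed form. This is a sketch on made-up random data; `np.linalg.lstsq` stands in for the regression solution for $\v$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical centred data: 500 samples of 4 correlated variables
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

w = np.array([1.0, 2.0, 5.0, 0.0])
w = w / np.linalg.norm(w)

# Regression route: best v^T for reconstructing X from the projection Xw
p = (X @ w)[:, None]                             # n x 1 predictor
v_t, *_ = np.linalg.lstsq(p, X, rcond=None)      # 1 x d row vector v^T
r2_regression = 1 - np.sum((X - p @ v_t) ** 2) / np.sum(X ** 2)

# Closed form from the covariance matrix
Sigma = np.cov(X, rowvar=False)
Sw = Sigma @ w
r2_closed_form = (Sw @ Sw) / ((w @ Sw) * np.trace(Sigma))

print(r2_regression, r2_closed_form)
```

The two numbers should match to floating-point precision; the normalisation used by `np.cov` does not matter because it cancels in the ratio.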

— amoeba says Reinstate Monica

I think this is an answer to a different question. For example, it is not the case that optimising your $R^2$ wrt $w$ will give the first PC as the unique answer (in those cases where it is unique). The fact that $(1,0)$ and $\frac{1}{\sqrt{2}}(1,1)$ both give 100% when $X=Y$ is evidence enough. Your proposed method seems to assume that the "normalised" objective function for PCA will always understate the variance explained (yours isn't a normalised PCA objective function as it normalises by the quantity being optimised in PCA).

— probabilityislogic

I agree that our answers are to different questions, but it's not clear to me which one OP had in mind. Also, note that my interpretation is not something very weird: it's a standard regression approach: when we say that $x$ explains so and so much variance in $y$, we compute reconstruction error of $\|y-xb\|$ with an optimal $b$, not just $\|y-x\|$. Here is another argument: if all $n$ variables are standardized, then in your approach each one explains $1/n$ amount of variance. This is not very informative: some variables can be much more predictive than others! My approach reflects that.

— amoeba says Reinstate Monica

@amoeba (+1) Great answer, it's really helpful! Would you know any reference that tackles this issue? Thanks!

— PierreE

@PierreE Thanks. No, I don't think I have any reference for that.

— amoeba says Reinstate Monica

Let the total variance, $T$, in a data set of vectors be the sum of squared errors (SSE) between the vectors in the data set and the mean vector of the data set,

$T = \sum_{i} (x_i-\bar{x}) \cdot (x_i-\bar{x})$

where $\bar{x}$ is the mean vector of the data set, $x_i$ is the $i$th vector in the data set, and $\cdot$ is the dot product of two vectors. Said another way, the total variance is the SSE between each $x_i$ and its predicted value, $f(x_i)$, when we set $f(x_i)=\bar{x}$.

Now let the predictor of $x_i$ , $f(x_i)$ , be the projection of vector $x_i$ onto a unit vector $c$ .

$f_c(x_i) = (c \cdot x_i)c$

Then the $SSE$ for a given $c$ is

$SSE_c = \sum_i (x_i - f_c(x_i)) \cdot (x_i - f_c(x_i))$

I think that if you choose $c$ to minimize $SSE_c$ , then $c$ is the first principal component.

If instead you choose $c$ to be the normalized version of the vector $(1, 2, 5, ...)$ , then $T-SSE_c$ is the variance in the data described by using $c$ as a predictor.
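The conjecture that the first principal component minimizes $SSE_c$ can be probed numerically. The sketch below assumes the data are centered first (so that $\bar{x}=0$ and the projection formula applies as written); the data set and the helper name `sse_c` are made up for illustration. A random search is only suggestive, not a proof.

```python
import numpy as np

def sse_c(Xc, c):
    """SSE between centred rows x_i and their projections (c . x_i) c."""
    c = np.asarray(c, dtype=float)
    c = c / np.linalg.norm(c)                    # make c a unit vector
    residual = Xc - np.outer(Xc @ c, c)
    return np.sum(residual ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # hypothetical data
Xc = X - X.mean(axis=0)                                   # assumption: centre first
T = np.sum(Xc ** 2)

# First principal axis: leading right singular vector of the centred data
pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]

explained_custom = T - sse_c(Xc, [1, 2, 5, 0, 0, 0])
explained_pc1 = T - sse_c(Xc, pc1)
best_random = max(T - sse_c(Xc, c) for c in rng.normal(size=(1000, 6)))

print(explained_custom / T, explained_pc1 / T, best_random / T)
```

In this setup neither the chosen combination nor any of the random directions should beat the first principal axis.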

— todddeluca

This is a reasonable approach, but I think this is not the best way to interpret the question; see my answer for an alternative approach that I argue makes more sense (in particular, see my example figure there).

— amoeba says Reinstate Monica

Think about it like this. Imagine a dataset with two standardized identical variables $X=Y$. How much variance is described by $X$? Your calculation gives $50\%$. I argue that the correct answer is $100\%$.

— amoeba says Reinstate Monica
