Maximum Likelihood Estimators - Multivariate Gaussian



Context

The multivariate Gaussian appears frequently in machine learning, and the following results are used in many ML books and courses without derivation.

Given the data as an $m \times p$ matrix $X$, if we assume that the data follow a $p$-variate Gaussian distribution with parameters mean $\mu$ ($p \times 1$) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by:

  • $\hat{\mu} = \frac{1}{m} \sum_{i=1}^m \mathbf{x}^{(i)} = \mathbf{\bar{x}}$
  • $\hat{\Sigma} = \frac{1}{m} \sum_{i=1}^m (\mathbf{x}^{(i)} - \hat{\mu})(\mathbf{x}^{(i)} - \hat{\mu})^T$
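
A minimal numerical sketch of these two estimators, assuming NumPy is available (the simulated data and all variable names are illustrative):

```python
import numpy as np

# Illustrative data: m = 500 observations of a p = 3 dimensional Gaussian
rng = np.random.default_rng(0)
m, p = 500, 3
true_mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
true_sigma = A @ A.T + p * np.eye(p)                 # a positive definite covariance
X = rng.multivariate_normal(true_mu, true_sigma, size=m)   # shape (m, p)

# MLE of the mean: the sample mean vector
mu_hat = X.mean(axis=0)

# MLE of the covariance: average of outer products of centered observations
# (note the 1/m factor, i.e. the biased estimator, not the 1/(m-1) sample covariance)
centered = X - mu_hat
sigma_hat = centered.T @ centered / m

# np.cov with bias=True uses the same 1/m normalisation
assert np.allclose(sigma_hat, np.cov(X.T, bias=True))
```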

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.


Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?


Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, or these ones make use of the results and assume previous knowledge.

There are also a few posts which are partly answered or closed:

Answers:



Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$: $\mathbf{X^{(1)}}, \mathbf{X^{(2)}}, \dotsc, \mathbf{X^{(m)}}$, where each random vector can be interpreted as an observation (data point) across $p$ variables. If the $\mathbf{X^{(i)}}$ are i.i.d. multivariate Gaussian vectors:

$$\mathbf{X^{(i)}} \sim \mathcal{N}_p(\mu, \Sigma)$$

where the parameters $\mu, \Sigma$ are unknown. To obtain their estimates we can use the method of maximum likelihood and maximize the log-likelihood function.

Note that by the independence of the random vectors, the joint density of the data $\{\mathbf{X^{(i)}}, i = 1, 2, \dotsc, m\}$ is the product of the individual densities, that is $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)}} ; \mu, \Sigma)$. Taking the logarithm gives the log-likelihood function

$$\begin{aligned}
l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)}} \mid \mu, \Sigma) \\
&= \log \prod_{i=1}^m \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)\right) \\
&= \sum_{i=1}^m \left(-\frac{p}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)\right)
\end{aligned}$$

$$l(\mu, \Sigma \mid \mathbf{x^{(i)}}) = -\frac{mp}{2} \log(2\pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)$$
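
As a sanity check of this expression, the following sketch (assuming NumPy and SciPy; the toy data are illustrative) compares it with the sum of per-observation log-densities from scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, sigma):
    """Log-likelihood of i.i.d. multivariate Gaussian data, written as in the expression above."""
    m, p = X.shape
    diff = X - mu                                                   # shape (m, p)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    return (-m * p / 2 * np.log(2 * np.pi)
            - m / 2 * np.log(np.linalg.det(sigma))
            - 0.5 * quad.sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                        # toy data
mu = np.zeros(3)
sigma = 2.0 * np.eye(3)

# Must agree with summing scipy's per-observation log-densities
reference = multivariate_normal.logpdf(X, mean=mu, cov=sigma).sum()
assert np.isclose(log_likelihood(X, mu, sigma), reference)
```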

Deriving $\hat{\mu}$

To take the derivative with respect to μ and equate to zero we will make use of the following matrix calculus identity:

$$\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}} = 2A\mathbf{w} \quad \text{if } \mathbf{w} \text{ does not depend on } A \text{ and } A \text{ is symmetric.}$$

$$\begin{aligned}
\frac{\partial}{\partial \mu} l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \sum_{i=1}^m \Sigma^{-1} (\mathbf{x^{(i)}} - \mu) = 0 \\
&\quad \text{Since } \Sigma \text{ is positive definite, } \Sigma^{-1} \text{ is invertible and can be cancelled} \\
0 &= \sum_{i=1}^m \mathbf{x^{(i)}} - m\mu \\
\hat{\mu} &= \frac{1}{m} \sum_{i=1}^m \mathbf{x^{(i)}} = \mathbf{\bar{x}}
\end{aligned}$$

This is often called the sample mean vector.
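
A small numerical sketch (assuming NumPy; the toy data are illustrative) confirming that the gradient with respect to $\mu$ vanishes at the sample mean:

```python
import numpy as np

def grad_mu(X, mu, sigma):
    """Analytic gradient of the log-likelihood w.r.t. mu: sum_i Sigma^{-1} (x^(i) - mu)."""
    return np.linalg.inv(sigma) @ (X - mu).sum(axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) + np.array([1.0, 2.0, 3.0])   # toy data with a non-zero mean
sigma = np.eye(3)

mu_hat = X.mean(axis=0)
# The gradient vanishes (up to floating point) at the sample mean, as derived above
assert np.allclose(grad_mu(X, mu_hat, sigma), 0.0, atol=1e-8)
```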

Deriving $\hat{\Sigma}$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

  • The trace is invariant under cyclic permutations of matrix products: $\operatorname{tr}[ABC] = \operatorname{tr}[BCA] = \operatorname{tr}[CAB]$
  • Since $\mathbf{x}^T A \mathbf{x}$ is a scalar, we can take its trace and obtain the same value: $\mathbf{x}^T A \mathbf{x} = \operatorname{tr}[\mathbf{x}^T A \mathbf{x}] = \operatorname{tr}[\mathbf{x} \mathbf{x}^T A]$
  • $\frac{\partial}{\partial A} \operatorname{tr}[AB] = B^T$
  • $\frac{\partial}{\partial A} \log |A| = (A^{-1})^T$

Combining these properties allows us to calculate

$$\frac{\partial}{\partial A} \mathbf{x}^T A \mathbf{x} = \frac{\partial}{\partial A} \operatorname{tr}[\mathbf{x} \mathbf{x}^T A] = [\mathbf{x} \mathbf{x}^T]^T = \mathbf{x} \mathbf{x}^T$$

This is the outer product of the vector $\mathbf{x}$ with itself.
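
The outer-product result can be checked entry by entry with finite differences; a minimal sketch assuming NumPy, with an arbitrary step size h:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
x = rng.normal(size=p)
A = rng.normal(size=(p, p))
h = 1e-6

# Finite-difference approximation of d(x^T A x)/dA, entry by entry
num_grad = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = h
        num_grad[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * h)

# Analytic result derived above: the outer product x x^T
assert np.allclose(num_grad, np.outer(x, x), atol=1e-6)
```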

We can now rewrite the log-likelihood function and compute the derivative with respect to $\Sigma^{-1}$ (note that $C$ is a constant):

$$\begin{aligned}
l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= C - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu) \\
&= C + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1}\right] \\
\frac{\partial}{\partial \Sigma^{-1}} l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \quad \text{since } \Sigma^T = \Sigma
\end{aligned}$$

Equating to zero and solving for $\Sigma$:

$$\begin{aligned}
0 &= m\Sigma - \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \\
\hat{\Sigma} &= \frac{1}{m} \sum_{i=1}^m (\mathbf{x^{(i)}} - \hat{\mu})(\mathbf{x^{(i)}} - \hat{\mu})^T
\end{aligned}$$
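
As a sanity check, the closed-form estimators can be compared against a direct numerical maximization of the log-likelihood; the sketch below assumes SciPy's generic optimizer and a Cholesky parameterisation of $\Sigma$, neither of which is part of the derivation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
m, p = 300, 2
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]], size=m)

def neg_log_likelihood(theta):
    """Negative log-likelihood; theta packs mu (p values) and a lower-triangular Cholesky factor of Sigma."""
    mu = theta[:p]
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = theta[p:]
    sigma = L @ L.T + 1e-8 * np.eye(p)          # keep Sigma positive definite
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(sigma), diff)
    return 0.5 * (m * p * np.log(2 * np.pi) + m * np.log(np.linalg.det(sigma)) + quad)

# Start away from the solution and let a generic optimizer do the work
theta0 = np.concatenate([X.mean(axis=0) + 0.5, np.eye(p)[np.tril_indices(p)]])
result = minimize(neg_log_likelihood, theta0, method='Nelder-Mead',
                  options={'maxiter': 20000, 'maxfev': 20000,
                           'xatol': 1e-10, 'fatol': 1e-10})

mu_opt = result.x[:p]
L_opt = np.zeros((p, p))
L_opt[np.tril_indices(p)] = result.x[p:]
sigma_opt = L_opt @ L_opt.T

# The numerical maximizer agrees with the closed-form MLE (note the 1/m, not 1/(m-1))
assert np.allclose(mu_opt, X.mean(axis=0), atol=1e-3)
assert np.allclose(sigma_opt, np.cov(X.T, bias=True), atol=1e-3)
```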



Alternative proofs, more compact forms, or intuitive interpretations are welcome!
Xavier Bourret Sicotte

In the derivation for $\mu$, why does $\Sigma$ need to be positive definite? Isn't it enough that $\Sigma$ is invertible? For an invertible matrix $A$, $Ax = 0$ only when $x = 0$?
Tom Bennett

To clarify, $\Sigma$ is an $m \times m$ matrix that may have finite diagonal and non-diagonal components indicating correlation between vectors, correct? If that is the case, in what sense are these vectors independent? Also, why is the joint probability function equal to the likelihood? Shouldn't the joint density, $f(x, y)$, be equal to the likelihood multiplied by the prior, i.e. $f(x \mid y) f(y)$?
Mathews24

@TomBennett the sigma matrix is positive definite by definition - see stats.stackexchange.com/questions/52976/… for the proof. The matrix calculus identity requires the matrix to be symmetric, not positive definite. But since positive definite matrices are always symmetric, that works.
Xavier Bourret Sicotte

Yes indeed - independence between observations allows us to get the likelihood - the wording may be unclear, fair enough - this is the multivariate version of the likelihood. The prior is still irrelevant regardless.
Xavier Bourret Sicotte


An alternative proof for $\hat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:

Picking up with the log-likelihood as above:

$$\begin{aligned}
\ell(\mu, \Sigma) &= C - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x}^{(i)} - \mu)^T \Sigma^{-1} (\mathbf{x}^{(i)} - \mu)\right] \\
&= C - \frac{1}{2} \left(m \log |\Sigma| + \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x}^{(i)} - \mu)(\mathbf{x}^{(i)} - \mu)^T \Sigma^{-1}\right]\right) \\
&= C - \frac{1}{2} \left(m \log |\Sigma| + \operatorname{tr}\left[S_\mu \Sigma^{-1}\right]\right)
\end{aligned}$$
where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)} - \mu)(\mathbf{x}^{(i)} - \mu)^T$ and we have used the cyclic and linear properties of $\operatorname{tr}$. To compute $\partial \ell / \partial \Sigma$ we first observe that
$$\frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T} = \Sigma^{-1}$$
by the fourth property above. To take the derivative of the second term we will need the property that
$$\frac{\partial}{\partial X} \operatorname{tr}\left(A X^{-1} B\right) = -\left(X^{-1} B A X^{-1}\right)^T.$$
(from The Matrix Cookbook, equation 63). Applying this with $B = I$ we obtain that
$$\frac{\partial}{\partial \Sigma} \operatorname{tr}\left[S_\mu \Sigma^{-1}\right] = -\left(\Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1}$$
because both $\Sigma$ and $S_\mu$ are symmetric. Then
$$\frac{\partial}{\partial \Sigma} \ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}.$$
Setting this to 0 and rearranging gives
$$\hat{\Sigma} = \frac{1}{m} S_\mu.$$

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial / \partial \Sigma^{-1}$ than $\partial / \partial \Sigma$.
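
A finite-difference check of the Matrix Cookbook identity as it is used here (a sketch assuming NumPy; $\Sigma$ and $S_\mu$ below are arbitrary symmetric positive definite matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
# Arbitrary symmetric positive definite Sigma and S_mu for the check
A = rng.normal(size=(p, p))
sigma = A @ A.T + p * np.eye(p)
B = rng.normal(size=(p, p))
s_mu = B @ B.T

def f(S):
    return np.trace(s_mu @ np.linalg.inv(S))

# Entry-wise finite differences of d tr[S_mu Sigma^{-1}] / d Sigma
h = 1e-6
num_grad = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = h
        num_grad[i, j] = (f(sigma + E) - f(sigma - E)) / (2 * h)

sigma_inv = np.linalg.inv(sigma)
analytic = -sigma_inv @ s_mu @ sigma_inv   # = -(Sigma^{-1} S_mu Sigma^{-1})^T by symmetry
assert np.allclose(num_grad, analytic, atol=1e-5)
```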

Licensed under cc by-sa 3.0 with attribution required.