Maximum Likelihood Estimators - Multivariate Gaussian



Context

The multivariate Gaussian appears frequently in machine learning, and the following results are used in many ML books and courses without derivation.

Given the data as an $m \times p$ matrix $X$, if we assume that the data follow a $p$-variate Gaussian distribution with parameters mean $\mu$ ($p \times 1$) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by:

  • $\hat{\mu} = \frac{1}{m} \sum_{i=1}^m \mathbf{x}^{(i)} = \mathbf{\bar{x}}$
  • $\hat{\Sigma} = \frac{1}{m} \sum_{i=1}^m (\mathbf{x}^{(i)} - \hat{\mu})(\mathbf{x}^{(i)} - \hat{\mu})^T$
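
A minimal numerical sketch of these two estimators, assuming NumPy is available (the simulated data and all variable names are illustrative):

```python
import numpy as np

# Illustrative data: m = 500 observations of a p = 3 dimensional Gaussian
rng = np.random.default_rng(0)
m, p = 500, 3
true_mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
true_sigma = A @ A.T + p * np.eye(p)                 # a positive definite covariance
X = rng.multivariate_normal(true_mu, true_sigma, size=m)   # shape (m, p)

# MLE of the mean: the sample mean vector
mu_hat = X.mean(axis=0)

# MLE of the covariance: average of outer products of centered observations
# (note the 1/m factor, i.e. the biased estimator, not the 1/(m-1) sample covariance)
centered = X - mu_hat
sigma_hat = centered.T @ centered / m

# np.cov with bias=True uses the same 1/m normalisation
assert np.allclose(sigma_hat, np.cov(X.T, bias=True))
```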

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.


Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?


Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, or these ones make use of the results and assume previous knowledge.

There are also a few posts which are partly answered or closed:

Answers:



Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$: $\mathbf{X^{(1)}}, \mathbf{X^{(2)}}, \dotsc, \mathbf{X^{(m)}}$, where each random vector can be interpreted as an observation (data point) across $p$ variables. If the $\mathbf{X^{(i)}}$ are i.i.d. multivariate Gaussian vectors:

$$\mathbf{X^{(i)}} \sim \mathcal{N}_p(\mu, \Sigma)$$

where the parameters $\mu, \Sigma$ are unknown. To obtain their estimates we can use the method of maximum likelihood and maximize the log-likelihood function.

Note that by the independence of the random vectors, the joint density of the data $\{\mathbf{X^{(i)}}, i = 1, 2, \dotsc, m\}$ is the product of the individual densities, that is $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)}} ; \mu, \Sigma)$. Taking the logarithm gives the log-likelihood function

$$\begin{aligned}
l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)}} \mid \mu, \Sigma) \\
&= \log \prod_{i=1}^m \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)\right) \\
&= \sum_{i=1}^m \left(-\frac{p}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)\right)
\end{aligned}$$

$$l(\mu, \Sigma \mid \mathbf{x^{(i)}}) = -\frac{mp}{2} \log(2\pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu)$$
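
As a sanity check of this expression, the following sketch (assuming NumPy and SciPy; the toy data are illustrative) compares it with the sum of per-observation log-densities from scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, sigma):
    """Log-likelihood of i.i.d. multivariate Gaussian data, written as in the expression above."""
    m, p = X.shape
    diff = X - mu                                                   # shape (m, p)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    return (-m * p / 2 * np.log(2 * np.pi)
            - m / 2 * np.log(np.linalg.det(sigma))
            - 0.5 * quad.sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                                        # toy data
mu = np.zeros(3)
sigma = 2.0 * np.eye(3)

# Must agree with summing scipy's per-observation log-densities
reference = multivariate_normal.logpdf(X, mean=mu, cov=sigma).sum()
assert np.isclose(log_likelihood(X, mu, sigma), reference)
```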

Deriving $\hat{\mu}$

To take the derivative with respect to μ and equate to zero we will make use of the following matrix calculus identity:

$$\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}} = 2A\mathbf{w} \quad \text{if } \mathbf{w} \text{ does not depend on } A \text{ and } A \text{ is symmetric.}$$

$$\begin{aligned}
\frac{\partial}{\partial \mu} l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \sum_{i=1}^m \Sigma^{-1} (\mathbf{x^{(i)}} - \mu) = 0 \\
&\quad \text{Since } \Sigma \text{ is positive definite, } \Sigma^{-1} \text{ is invertible and can be cancelled} \\
0 &= \sum_{i=1}^m \mathbf{x^{(i)}} - m\mu \\
\hat{\mu} &= \frac{1}{m} \sum_{i=1}^m \mathbf{x^{(i)}} = \mathbf{\bar{x}}
\end{aligned}$$

This is often called the sample mean vector.
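
A small numerical sketch (assuming NumPy; the toy data are illustrative) confirming that the gradient with respect to $\mu$ vanishes at the sample mean:

```python
import numpy as np

def grad_mu(X, mu, sigma):
    """Analytic gradient of the log-likelihood w.r.t. mu: sum_i Sigma^{-1} (x^(i) - mu)."""
    return np.linalg.inv(sigma) @ (X - mu).sum(axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) + np.array([1.0, 2.0, 3.0])   # toy data with a non-zero mean
sigma = np.eye(3)

mu_hat = X.mean(axis=0)
# The gradient vanishes (up to floating point) at the sample mean, as derived above
assert np.allclose(grad_mu(X, mu_hat, sigma), 0.0, atol=1e-8)
```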

Deriving $\hat{\Sigma}$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

  • The trace is invariant under cyclic permutations of matrix products: $\operatorname{tr}[ABC] = \operatorname{tr}[BCA] = \operatorname{tr}[CAB]$
  • Since $\mathbf{x}^T A \mathbf{x}$ is a scalar, we can take its trace and obtain the same value: $\mathbf{x}^T A \mathbf{x} = \operatorname{tr}[\mathbf{x}^T A \mathbf{x}] = \operatorname{tr}[\mathbf{x} \mathbf{x}^T A]$
  • $\frac{\partial}{\partial A} \operatorname{tr}[AB] = B^T$
  • $\frac{\partial}{\partial A} \log |A| = (A^{-1})^T$

Combining these properties allows us to calculate

$$\frac{\partial}{\partial A} \mathbf{x}^T A \mathbf{x} = \frac{\partial}{\partial A} \operatorname{tr}[\mathbf{x} \mathbf{x}^T A] = [\mathbf{x} \mathbf{x}^T]^T = \mathbf{x} \mathbf{x}^T$$

This is the outer product of the vector $\mathbf{x}$ with itself.
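
The outer-product result can be checked entry by entry with finite differences; a minimal sketch assuming NumPy, with an arbitrary step size h:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
x = rng.normal(size=p)
A = rng.normal(size=(p, p))
h = 1e-6

# Finite-difference approximation of d(x^T A x)/dA, entry by entry
num_grad = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = h
        num_grad[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * h)

# Analytic result derived above: the outer product x x^T
assert np.allclose(num_grad, np.outer(x, x), atol=1e-6)
```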

We can now rewrite the log-likelihood function and compute the derivative with respect to $\Sigma^{-1}$ (note that $C$ is a constant):

$$\begin{aligned}
l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= C - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1} (\mathbf{x^{(i)}} - \mu) \\
&= C + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \Sigma^{-1}\right] \\
\frac{\partial}{\partial \Sigma^{-1}} l(\mu, \Sigma \mid \mathbf{x^{(i)}}) &= \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \quad \text{since } \Sigma^T = \Sigma
\end{aligned}$$

Equating to zero and solving for $\Sigma$:

$$\begin{aligned}
0 &= m\Sigma - \sum_{i=1}^m (\mathbf{x^{(i)}} - \mu)(\mathbf{x^{(i)}} - \mu)^T \\
\hat{\Sigma} &= \frac{1}{m} \sum_{i=1}^m (\mathbf{x^{(i)}} - \hat{\mu})(\mathbf{x^{(i)}} - \hat{\mu})^T
\end{aligned}$$
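
As a sanity check, the closed-form estimators can be compared against a direct numerical maximization of the log-likelihood; the sketch below assumes SciPy's generic optimizer and a Cholesky parameterisation of $\Sigma$, neither of which is part of the derivation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
m, p = 300, 2
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]], size=m)

def neg_log_likelihood(theta):
    """Negative log-likelihood; theta packs mu (p values) and a lower-triangular Cholesky factor of Sigma."""
    mu = theta[:p]
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = theta[p:]
    sigma = L @ L.T + 1e-8 * np.eye(p)          # keep Sigma positive definite
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(sigma), diff)
    return 0.5 * (m * p * np.log(2 * np.pi) + m * np.log(np.linalg.det(sigma)) + quad)

# Start away from the solution and let a generic optimizer do the work
theta0 = np.concatenate([X.mean(axis=0) + 0.5, np.eye(p)[np.tril_indices(p)]])
result = minimize(neg_log_likelihood, theta0, method='Nelder-Mead',
                  options={'maxiter': 20000, 'maxfev': 20000,
                           'xatol': 1e-10, 'fatol': 1e-10})

mu_opt = result.x[:p]
L_opt = np.zeros((p, p))
L_opt[np.tril_indices(p)] = result.x[p:]
sigma_opt = L_opt @ L_opt.T

# The numerical maximizer agrees with the closed-form MLE (note the 1/m, not 1/(m-1))
assert np.allclose(mu_opt, X.mean(axis=0), atol=1e-3)
assert np.allclose(sigma_opt, np.cov(X.T, bias=True), atol=1e-3)
```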



Alternative proofs, more compact forms, or intuitive interpretations are welcome!
Xavier Bourret Sicotte

In the derivation for $\mu$, why does $\Sigma$ need to be positive definite? Isn't it enough that $\Sigma$ is invertible? For an invertible matrix $A$, $Ax = 0$ only when $x = 0$?
Tom Bennett

To clarify, $\Sigma$ is an $m \times m$ matrix that may have finite diagonal and non-diagonal components indicating correlation between vectors, correct? If that is the case, in what sense are these vectors independent? Also, why is the joint probability function equal to the likelihood? Shouldn't the joint density, $f(x, y)$, be equal to the likelihood multiplied by the prior, i.e. $f(x \mid y) f(y)$?
Mathews24

@TomBennett the sigma matrix is positive definite by definition - see stats.stackexchange.com/questions/52976/… for the proof. The matrix calculus identity requires the matrix to be symmetric, not positive definite. But since positive definite matrices are always symmetric, that works.
Xavier Bourret Sicotte

Yes indeed - independence between observations allows us to get the likelihood - the wording may be unclear, fair enough - this is the multivariate version of the likelihood. The prior is still irrelevant regardless.
Xavier Bourret Sicotte


An alternative proof for $\hat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:

Picking up with the log-likelihood as above:

$$\begin{aligned}
\ell(\mu, \Sigma) &= C - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x}^{(i)} - \mu)^T \Sigma^{-1} (\mathbf{x}^{(i)} - \mu)\right] \\
&= C - \frac{1}{2} \left(m \log |\Sigma| + \sum_{i=1}^m \operatorname{tr}\left[(\mathbf{x}^{(i)} - \mu)(\mathbf{x}^{(i)} - \mu)^T \Sigma^{-1}\right]\right) \\
&= C - \frac{1}{2} \left(m \log |\Sigma| + \operatorname{tr}\left[S_\mu \Sigma^{-1}\right]\right)
\end{aligned}$$
where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)} - \mu)(\mathbf{x}^{(i)} - \mu)^T$ and we have used the cyclic and linear properties of $\operatorname{tr}$. To compute $\partial \ell / \partial \Sigma$ we first observe that
$$\frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T} = \Sigma^{-1}$$
by the fourth property above. To take the derivative of the second term we will need the property that
$$\frac{\partial}{\partial X} \operatorname{tr}\left(A X^{-1} B\right) = -\left(X^{-1} B A X^{-1}\right)^T.$$
(from The Matrix Cookbook, equation 63). Applying this with $B = I$ we obtain that
$$\frac{\partial}{\partial \Sigma} \operatorname{tr}\left[S_\mu \Sigma^{-1}\right] = -\left(\Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1}$$
because both $\Sigma$ and $S_\mu$ are symmetric. Then
$$\frac{\partial}{\partial \Sigma} \ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}.$$
Setting this to 0 and rearranging gives
$$\hat{\Sigma} = \frac{1}{m} S_\mu.$$

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial / \partial \Sigma^{-1}$ than $\partial / \partial \Sigma$.
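
A finite-difference check of the Matrix Cookbook identity as it is used here (a sketch assuming NumPy; $\Sigma$ and $S_\mu$ below are arbitrary symmetric positive definite matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
# Arbitrary symmetric positive definite Sigma and S_mu for the check
A = rng.normal(size=(p, p))
sigma = A @ A.T + p * np.eye(p)
B = rng.normal(size=(p, p))
s_mu = B @ B.T

def f(S):
    return np.trace(s_mu @ np.linalg.inv(S))

# Entry-wise finite differences of d tr[S_mu Sigma^{-1}] / d Sigma
h = 1e-6
num_grad = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = h
        num_grad[i, j] = (f(sigma + E) - f(sigma - E)) / (2 * h)

sigma_inv = np.linalg.inv(sigma)
analytic = -sigma_inv @ s_mu @ sigma_inv   # = -(Sigma^{-1} S_mu Sigma^{-1})^T by symmetry
assert np.allclose(num_grad, analytic, atol=1e-5)
```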

Licensed under cc by-sa 3.0 with attribution required.