Motivation of the Expectation Maximization algorithm

20

In the EM algorithm approach we use Jensen's inequality to arrive at

$\log p(x|\theta) \geq \int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz - \int \log p(z|x,\theta) p(z|x,\theta^{(k)})dz$

and define $\theta^{(k+1)}$ by

$\theta^{(k+1)}=\arg \max_{\theta}\int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz$

Everything I read about EM just plops it down, and I have always felt uneasy about not having an explanation of why the EM algorithm arises naturally. I understand that the $\log$ likelihood is typically used so that one deals with addition instead of multiplication, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not other monotonic functions? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, it would be much more satisfying than just an abstract algorithm.

mixture expectation-maximization

— user782220

3

What is the expectation maximization algorithm?, Nature Biotechnology 26: 897–899 (2008) has a nice picture that illustrates how the algorithm works.

— CHL

@chl: I have seen that article. The point I am asking about is that nowhere does it explain why a non-log approach cannot work.

— user782220

10

The EM algorithm has different interpretations and can arise in different forms in different applications.

It all starts with the likelihood function $p(x \vert \theta)$, or equivalently, the log-likelihood function $\log p(x \vert \theta)$ that we would like to maximize. (We generally use the logarithm because it simplifies the calculation: it is strictly monotone, concave, and $\log(ab) = \log a + \log b$.) In an ideal world, the value of $p$ depends only on the model parameter $\theta$, so we can search through the space of $\theta$ and find the one that maximizes $p$.

However, in many interesting real-world applications things are more complicated, because not all the variables are observed. Yes, we might directly observe $x$, but some other variables $z$ are unobserved. Because of the missing variables $z$, we are in a kind of chicken-and-egg situation: without $z$ we cannot estimate the parameter $\theta$, and without $\theta$ we cannot infer what the value of $z$ may be.

This is where the EM algorithm comes into play. We start with an initial guess of the model parameters $\theta$ and derive the expected values of the missing variables $z$ (i.e., the E step). Once we have the values of $z$, we can maximize the likelihood with respect to the parameters $\theta$ (i.e., the M step, corresponding to the $\arg \max$ equation in the problem statement). With this $\theta$ we can derive new expected values of $z$ (another E step), and so on and so forth. In other words, in each step we assume that one of the two, $z$ and $\theta$, is known. We repeat this iterative process until the likelihood cannot be increased any further.

This is the EM algorithm in a nutshell. It is well known that the likelihood will never decrease during this iterative EM process. But keep in mind that the EM algorithm does not guarantee a global optimum. That is, it might end up at a local optimum of the likelihood function.

The appearance of $\log$ in the equation for $\theta^{(k+1)}$ is inevitable, because the function you would like to maximize here is written as a log-likelihood.
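
The E step / M step loop described above can be sketched on a small example. This is only an illustration under stated assumptions: two coins with unknown biases, each set of tosses produced by one coin chosen with probability 1/2, and the coin identity playing the role of the hidden $z$. The head counts and the function names `log_seq_lik` and `em_two_coins` are hypothetical, not taken from this thread.

```python
import math

def log_seq_lik(heads, n, p):
    """Log-probability of one particular sequence with `heads` heads in `n` tosses."""
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)

def em_two_coins(head_counts, n, theta_a, theta_b, iters=20):
    """EM for two coins with unknown biases; which coin was used is hidden.

    Assumes each coin is picked with probability 1/2 for every set of tosses.
    Returns the final estimates and the log-likelihood evaluated at the
    start of each iteration.
    """
    history = []
    for _ in range(iters):
        # E step: posterior probability that each set came from coin A.
        resp_a = []
        loglik = 0.0
        for h in head_counts:
            la = 0.5 * math.exp(log_seq_lik(h, n, theta_a))
            lb = 0.5 * math.exp(log_seq_lik(h, n, theta_b))
            resp_a.append(la / (la + lb))
            loglik += math.log(la + lb)
        history.append(loglik)
        # M step: responsibility-weighted fraction of heads for each coin.
        wa = sum(resp_a)
        wb = len(head_counts) - wa
        theta_a = sum(r * h for r, h in zip(resp_a, head_counts)) / (wa * n)
        theta_b = sum((1 - r) * h for r, h in zip(resp_a, head_counts)) / (wb * n)
    return theta_a, theta_b, history

head_counts = [5, 9, 8, 4, 7]   # hypothetical data: heads in five sets of 10 tosses
theta_a, theta_b, history = em_two_coins(head_counts, 10, 0.6, 0.5)
print(theta_a, theta_b)
print(history[0], history[-1])  # the log-likelihood should not have decreased
```

The `history` list makes the "likelihood never decreases" property easy to inspect; a different starting guess may lead to a different local optimum.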

— Weiwei

I don't see how this answers the question.

— broncoAbierto

9

Likelihood vs. log-likelihood

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we do not consider other monotonic functions is that the logarithm is the unique function with the property of turning products into sums.

Another way to motivate the logarithm is the following: instead of maximizing the probability of the data under our model, we could equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$, and the model distribution, $p(x \mid \theta)$,

$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = const - \int p_\text{data}(x)\log p(x \mid \theta) \, dx.$

The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,

$\int p_\text{data}(x)\log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$
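
To make the equivalence concrete, here is a minimal sketch for a Bernoulli model, assuming the empirical distribution of a small hypothetical sample stands in for $p_\text{data}$. The helper names `kl_bernoulli` and `avg_loglik` are illustrative only. On a grid of $\theta$ values, the minimizer of the KL divergence and the maximizer of the average log-likelihood should coincide, and their sum should not depend on $\theta$.

```python
import math

def kl_bernoulli(p_data, theta):
    """KL divergence D_KL[Bernoulli(p_data) || Bernoulli(theta)]."""
    return (p_data * math.log(p_data / theta)
            + (1 - p_data) * math.log((1 - p_data) / (1 - theta)))

def avg_loglik(samples, theta):
    """Average log-likelihood of 0/1 samples under Bernoulli(theta)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in samples) / len(samples)

samples = [1] * 7 + [0] * 3          # hypothetical data: 7 ones, 3 zeros
p_emp = sum(samples) / len(samples)  # empirical distribution used as p_data
grid = [i / 100 for i in range(1, 100)]

theta_ml = max(grid, key=lambda t: avg_loglik(samples, t))
theta_kl = min(grid, key=lambda t: kl_bernoulli(p_emp, t))
print(theta_ml, theta_kl)

# KL + average log-likelihood is the theta-independent constant in the formula.
print(kl_bernoulli(p_emp, 0.2) + avg_loglik(samples, 0.2))
print(kl_bernoulli(p_emp, 0.9) + avg_loglik(samples, 0.9))
```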

An alternative view of EM

I am not sure this is the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chris Bishop's PRML book, Chapter 9.3).

It is not difficult to show that

$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$

for any $q(z \mid x)$. If we call the first term on the right-hand side $F(q, \theta)$, this implies that

$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$

Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$. Now, EM can be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$. In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E-step, we minimize the KL divergence on the right-hand side and thus maximize $F$.
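
The decomposition can be checked numerically for a discrete hidden variable. The sketch below assumes $z$ takes three values and that the joint probabilities $p(x, z \mid \theta)$ for one fixed $x$ and $\theta$ are the hypothetical numbers in `joint`. The function names `free_energy` and `kl` are illustrative.

```python
import math

def free_energy(q, joint):
    """F(q, theta) = sum_z q(z) log(p(x, z | theta) / q(z)) for discrete z."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """D_KL[q || p] for discrete distributions given as lists."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

joint = [0.10, 0.05, 0.25]                    # hypothetical p(x, z | theta) for z = 0, 1, 2
evidence = sum(joint)                         # p(x | theta)
posterior = [pz / evidence for pz in joint]   # p(z | x, theta)

q = [0.5, 0.3, 0.2]                           # an arbitrary distribution over z
lhs = math.log(evidence)
rhs = free_energy(q, joint) + kl(q, posterior)
print(lhs, rhs)                               # the two sides agree up to rounding
print(free_energy(q, joint) <= lhs)           # F is a lower bound for any q
print(free_energy(posterior, joint))          # the E-step choice of q makes the bound tight
```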

— लुकास

Thanks for the post! However, the linked document does not say that the logarithm is the unique function turning products into sums. It says that the logarithm is the only function that fulfills all three listed properties at the same time.

— Weiwei

@Weiwei: Right, but the first condition mainly requires that the function is invertible. Of course, f(x) = 0 also implies f(x + y) = f(x)f(y), but this is an uninteresting case. The third condition asks that the derivative at 1 is 1, which is only true for the logarithm to base $e$. Drop this constraint and you get logarithms to different bases, but still logarithms.

— लुकास

4

The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.

Suppose we have a probabilistic model $p(x,z,\theta)$ with $x$ observations, $z$ hidden random variables, and a total of $\theta$ parameters. We are given a dataset $D$ and are forced (by higher powers) to establish $p(z,\theta|D)$.

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

$\theta \sim p(\theta|z,D) \\ z \sim p(z|\theta,D)$

2. Variational Bayes

Instead, we can try to establish distributions $q(\theta)$ and $q(z)$ and minimize the difference from the distribution we are after, $p(\theta,z|D)$. The difference between distributions has a convenient fancy name, the KL-divergence. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

$q(\theta) \propto \exp (E [\log p(\theta,z,D) ]_{q(z)} ) \\ q(z) \propto \exp (E [\log p(\theta,z,D) ]_{q(\theta)} )$

3. Expectation-Maximization

Coming up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don't we instead consider a point estimate for one of these and keep the other nice and nuanced? In EM the parameter $\theta$ is established as the one unworthy of a full distribution, and is set to its MAP (maximum a posteriori) value, $\theta^*$.

$\theta^* = \underset{\theta}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(z)} \\ q(z) = p(z|\theta^*,D)$

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes, you see that correcting for the $\log$ by $\exp$ does not change the result, so it is no longer necessary.

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution.

$z^* = \underset{z}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(\theta)} \\ q(\theta) = p(\theta|z^*,D)$

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is in other words: model selection (or automatic relevance detection or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ and the hidden variables $z$.

$\theta^* = \underset{\theta}{\operatorname{argmax}} p(\theta,z^*,D) \\ z^* = \underset{z}{\operatorname{argmax}} p(\theta^*,z,D)$
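
As a rough illustration of alternating point estimates, here is a sketch under strong assumptions that are mine, not the paper's: one-dimensional data, unit-variance Gaussian components, equal mixing weights and a flat prior on the means. Under these assumptions each argmax has a closed form and the procedure looks like $k$-means. The function name `icm_means` and the data are hypothetical.

```python
def icm_means(data, mu, iters=20):
    """Alternate z* = argmax_z p(theta*, z, D) and theta* = argmax_theta p(theta, z*, D).

    Assumes unit-variance Gaussian components, equal mixing weights and a
    flat prior on the means, so both steps can be written in closed form.
    """
    mu = list(mu)
    z = [0] * len(data)
    for _ in range(iters):
        # z step: assign each point to the component with the nearest mean.
        z = [min(range(len(mu)), key=lambda k: (x - mu[k]) ** 2) for x in data]
        # theta step: each mean becomes the average of its assigned points.
        for k in range(len(mu)):
            members = [x for x, zk in zip(data, z) if zk == k]
            if members:                      # keep the old mean if a cluster is empty
                mu[k] = sum(members) / len(members)
    return mu, z

data = [-2.1, -1.9, -2.4, 1.8, 2.2, 2.0]    # hypothetical observations
mu, z = icm_means(data, mu=[-1.0, 1.0])
print(mu, z)
```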

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$ -means alternative, but this lucid and concise exposition of approximation.

— Anne van Rossum

(+1) this is a beautiful summary of all methods.

— kedarps

4

There is a useful optimisation technique underlying the EM algorithm. However, it's usually expressed in the language of probability theory so it's hard to see that at the core is a method that has nothing to do with probability and expectation.

Consider the problem of maximising $g(x)=\sum_i\exp(f_i(x))$ (or equivalently $\log g(x)$) with respect to $x$. If you write down an expression for $g'(x)$ and set it equal to zero you will often end up with a transcendental equation to solve. These can be nasty.

Now suppose that the $f_i$ play well together in the sense that linear combinations of them give you something easy to optimise. For example, if all of the $f_i(x)$ are quadratic in $x$ then a linear combination of the $f_i(x)$ will also be quadratic, and hence easy to optimise.

Given this supposition, it'd be cool if, in order to optimise $\log g(x)=\log \sum_i\exp(f_i(x))$ we could somehow shuffle the $\log$ past the $\sum$ so it could meet the $\exp$ s and eliminate them. Then the $f_i$ could play together. But we can't do that.

Let's do the next best thing. We'll make another function $h$ that is similar to $g$ . And we'll make it out of linear combinations of the $f_i$ .

Let's say $x_0$ is a guess for an optimal value. We'd like to improve this. Let's find another function $h$ that matches $g$ and its derivative at $x_0$ , i.e. $g(x_0)=h(x_0)$ and $g'(x_0)=h'(x_0)$ . If you plot a graph of $h$ in a small neighbourhood of $x_0$ it's going to look similar to $g$ .

You can show that

$g'(x)=\sum_i f_i'(x)\exp(f_i(x)).$

We want something that matches this at $x_0$. There's a natural choice:

$h(x)=\mbox{constant}+\sum_i f_i(x)\exp(f_i(x_0)).$

You can see they match at $x=x_0$. We get

$h'(x)=\sum_i f_i'(x)\exp(f_i(x_0)).$

As $x_0$ is a constant we have a simple linear combination of the $f_i$ whose derivative matches $g$. We just have to choose the constant in $h$ to make $g(x_0)=h(x_0)$.

So starting with $x_0$ , we form $h(x)$ and optimise that. Because it's similar to $g(x)$ in the neighbourhood of $x_0$ we hope the optimum of $h$ is similar to the optimum of $g$ . Once you have a new estimate, construct the next $h$ and repeat.

I hope this has motivated the choice of $h$ . This is exactly the procedure that takes place in EM.

But there's one more important point. Using Jensen's inequality you can show that $h(x)\le g(x)$ , with equality at $x_0$ . This means that when you optimise $h(x)$ you always get an $x$ that makes $g$ at least as large as $g(x_0)$ . So even though $h$ was motivated by its local similarity to $g$ , it's safe to globally maximise $h$ at each iteration. The hope I mentioned above isn't required.

This also gives a clue to when to use EM: when linear combinations of the arguments to the $\exp$ function are easier to optimise. For example when they're quadratic - as happens when working with mixtures of Gaussians. This is particularly relevant to statistics where many of the standard distributions are from exponential families.
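
Here is a minimal sketch of that procedure, assuming each $f_i$ is a concave quadratic $f_i(x) = b_i - a_i (x - c_i)^2$ with $a_i > 0$. The coefficients in `TERMS` and the function names are hypothetical. With this choice the maximiser of $h$ has a closed form, and the recorded values of $g$ should never decrease.

```python
import math

# Hypothetical quadratics f_i(x) = b_i - a_i * (x - c_i)**2, stored as (a_i, b_i, c_i).
TERMS = [(1.0, 0.0, -1.0), (0.5, 0.3, 2.0), (2.0, -0.2, 0.5)]

def f(term, x):
    a, b, c = term
    return b - a * (x - c) ** 2

def g(x):
    """Objective g(x) = sum_i exp(f_i(x))."""
    return sum(math.exp(f(t, x)) for t in TERMS)

def maximise_surrogate(x0):
    """Maximise h(x) = constant + sum_i f_i(x) * exp(f_i(x0)) in closed form.

    Setting h'(x) = 0 for quadratic f_i gives a weighted average of the c_i.
    """
    weights = [math.exp(f(t, x0)) for t in TERMS]
    num = sum(w * a * c for w, (a, _, c) in zip(weights, TERMS))
    den = sum(w * a for w, (a, _, _) in zip(weights, TERMS))
    return num / den

x = 3.0                      # initial guess x_0
values = [g(x)]
for _ in range(30):
    x = maximise_surrogate(x)
    values.append(g(x))
print(x, values[-1])
```

As with EM in general, the point this converges to depends on the initial guess and may be only a local maximum of $g$.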

— Dan Piponi

3

As you said, I will not go into technical details. There are quite a few very nice tutorials. One of my favourites is Andrew Ng's lecture notes. Take a look also at the references here.

EM is naturally motivated in mixture models and, more generally, in models with hidden factors. Take for example the case of Gaussian mixture models (GMM). Here we model the density of the observations as a weighted sum of $K$ Gaussians:

$p(x) = \sum_{i=1}^{K}\pi_{i} \mathcal{N}(x|\mu_{i}, \Sigma_{i})$

where $\pi_{i}$ is the probability that the sample $x$ was generated by the $i$th component, $\mu_{i}$ is the mean of that component, and $\Sigma_{i}$ is its covariance matrix. The way to understand this expression is the following: each data sample has been generated by one component, but we do not know which one. The approach is then to express the uncertainty in terms of probability ( $\pi_{i}$ represents the chances that the $i$th component accounts for that sample) and take the weighted sum.

As a concrete example, imagine you want to cluster text documents. The idea is to assume that each document belongs to a topic (science, sports, ...) which you do not know beforehand. The possible topics are hidden variables. You are then given a bunch of documents and, by counting n-grams or whatever features you extract, you want to find those clusters and see which cluster each document belongs to. EM is a procedure which attacks this problem step-wise: the expectation step attempts to improve the assignments of the samples obtained so far, and the maximization step improves the parameters of the mixture, in other words, the form of the clusters.

The point is not the use of monotonic functions but of convex (or, in the case of $\log$, concave) functions. The reason is Jensen's inequality, which ensures that the likelihood does not decrease from one EM step to the next.
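
The two steps can be sketched for the mixture model above. This is a simplified illustration, assuming one-dimensional data so that each covariance matrix $\Sigma_i$ reduces to a scalar variance. The data, the starting values and the names `normal_pdf` and `em_gmm_1d` are hypothetical, and the small variance floor is a practical safeguard that is not part of the description above.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, pis, mus, variances, iters=50):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihood history."""
    K, N = len(pis), len(data)
    pis, mus, variances = list(pis), list(mus), list(variances)
    history = []
    for _ in range(iters):
        # E step: responsibility of component i for each sample x.
        resp, loglik = [], 0.0
        for x in data:
            joint = [pis[i] * normal_pdf(x, mus[i], variances[i]) for i in range(K)]
            total = sum(joint)
            resp.append([j / total for j in joint])
            loglik += math.log(total)
        history.append(loglik)
        # M step: re-estimate weights, means and variances from the soft assignments.
        for i in range(K):
            n_i = sum(r[i] for r in resp)
            pis[i] = n_i / N
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, 1e-6)   # floor guards against a collapsing component
    return pis, mus, variances, history

data = [-2.3, -1.7, -2.0, -1.5, 2.1, 1.6, 2.4, 1.9, 2.8]   # hypothetical samples
pis, mus, variances, history = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(pis, mus, variances)
print(history[0], history[-1])
```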

— jpmuc