Intuition behind Restricted Boltzmann Machine (RBM)

I went through Geoff Hinton's Neural Networks course on Coursera and also through the introduction to restricted Boltzmann machines, but I still didn't understand the intuition behind RBMs.

Why do we need to compute energy in this machine? And what is the use of probability in this machine? I also saw this video. In the video, he just wrote the probability and energy equations before the computation steps and didn't appear to use them anywhere.

Adding to the above, I am not sure what the likelihood function is.

unsupervised-learning rbm

— Born2Code

I have tried to clarify the question, but I think it needs more work. You need to explain what you do understand, and more specifically where you are stuck, otherwise the question is too broad.

— Neil Slater

The only thing that got into my head is that there are three steps: first the positive phase, then the negative phase, followed by the reconstruction of the weights. But what about the energy and the probability function? What is their use here? And how many times do we have to repeat this process (positive phase -> negative phase -> reconstruction of weights)?

— Born2Code

Answers:

RBMs are an interesting beast. To answer your question, and to jog my memory on them, I'll derive RBMs and talk through the derivation. You mentioned that you are confused about the likelihood, so my derivation will be from the perspective of trying to maximize the likelihood. So let's begin.

RBMs contain two different sets of neurons, visible and hidden, which I'll denote $v$ and $h$ respectively. Given a specific configuration of $v$ and $h$, we map it to the probability space.

$p(v,h) = \frac{e^{-E(v,h)}}{Z}$

There are a couple more things to define. The surrogate function that we use to map from a specific configuration to the probability space is called the energy function $E(v,h)$. The constant $Z$ is a normalization factor that ensures we actually map to the probability space. Now let's get to what we're really looking for: the probability of a set of visible neurons, in other words, the probability of our data.

$Z = \sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}$

$p(v)=\sum_{h \in H}p(v,h)=\frac{\sum_{h \in H}e^{-E(v,h)}}{\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}}$

Although there are a lot of terms in this equation, it simply comes down to writing the correct probability equations. Hopefully, so far, this has helped you see why we need the energy function to calculate the probability, or, as is more usually done, the unnormalized probability $p(v) \cdot Z$. The unnormalized probability is used because the partition function $Z$ is very expensive to compute: it sums over every joint configuration of $v$ and $h$.
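The definitions above can be checked by brute force on a toy model. The sketch below is only an illustration: it assumes binary units, borrows the energy function $E(v,h)=-a^Tv-b^Th-v^TWh$ that is quoted later in this thread, and uses hand-picked, hypothetical values for `a`, `b` and `W`. It enumerates every configuration, which is only possible because the model is tiny.

```python
import itertools
import math

# Hypothetical tiny RBM: 2 visible units, 2 hidden units, hand-picked parameters.
a = [0.1, -0.2]                 # visible biases
b = [0.3, 0.0]                  # hidden biases
W = [[0.5, -0.4],               # W[i][j] couples visible unit i with hidden unit j
     [0.2, 0.7]]


def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T W h for binary vectors v and h."""
    visible_term = sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b[j] * h[j] for j in range(len(h)))
    pair_term = sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + pair_term)


V = list(itertools.product([0, 1], repeat=2))   # all visible configurations
H = list(itertools.product([0, 1], repeat=2))   # all hidden configurations

# Partition function: sum of unnormalized probabilities over every (v, h) pair.
Z = sum(math.exp(-energy(v, h)) for v in V for h in H)


def p_joint(v, h):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h)) / Z


def p_visible(v):
    """p(v): marginalize the hidden units out of the joint distribution."""
    return sum(p_joint(v, h) for h in H)


for v in V:
    print(v, round(p_visible(v), 4))
```

With $n$ visible and $m$ hidden binary units the double sum in `Z` has $2^{n+m}$ terms, which is why realistic models avoid computing it directly.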

Now let's get to the actual learning phase of RBMs. To maximize the likelihood, for every data point we take a gradient step that increases $p(v)$, pushing it toward 1. Getting the gradient expressions takes some mathematical acrobatics. The first thing we do is take the log of $p(v)$. From now on we will be operating in log-probability space to make the math feasible.

$\log(p(v))=\log[\sum_{h \in H}e^{-E(v,h)}]-\log[\sum_{v \in V}\sum_{h \in H}e^{-E(v,h)}]$

Let us take the gradient with respect to the parameters $\theta$ in $p(v)$:

$\begin{aligned} \frac{\partial \log(p(v))}{\partial \theta}=& -\frac{1}{\sum_{h' \in H}e^{-E(v,h')}}\sum_{h' \in H}e^{-E(v,h')}\frac{\partial E(v,h')}{\partial \theta}\\ & + \frac{1}{\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}}\sum_{v' \in V}\sum_{h' \in H}e^{-E(v',h')}\frac{\partial E(v',h')}{\partial \theta} \end{aligned}$

Now, I did this on paper and wrote the semi-final equation down so as not to waste a lot of space on this site. I recommend you derive these equations yourself. I'll now write down some identities that help in continuing the derivation. Note that $Zp(v,h)=e^{-E(v,h)}$, $p(v)=\sum_{h \in H}p(v,h)$, and $p(h|v) = \frac{p(v,h)}{p(v)}$.

$\begin{aligned} \frac{\partial \log(p(v))}{\partial \theta}&= -\frac{1}{p(v)}\sum_{h' \in H}p(v,h')\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta}\\ \frac{\partial \log(p(v))}{\partial \theta}&= -\sum_{h' \in H}p(h'|v)\frac{\partial E(v,h')}{\partial \theta}+\sum_{v' \in V}\sum_{h' \in H}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \end{aligned}$

And there we go: we have derived the gradient needed for maximum likelihood estimation of RBMs. If you want, you can write the two terms as expectations under their respective distributions (the conditional and the joint probability).
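One way to gain confidence in the final gradient formula is to compare it against a finite-difference estimate on a model small enough to enumerate. The sketch below is only a sanity check under stated assumptions: binary units, the energy function $E(v,h)=-a^Tv-b^Th-v^TWh$ quoted later in this thread (so that $\partial E/\partial W_{ij} = -v_i h_j$), and hypothetical parameter values.

```python
import itertools
import math

# Hypothetical parameters of a 2-visible, 2-hidden binary RBM.
a = [0.1, -0.2]
b = [0.3, 0.0]
W = [[0.5, -0.4], [0.2, 0.7]]
STATES = list(itertools.product([0, 1], repeat=2))   # used for both v and h


def energy(v, h, W):
    """Assumed energy: E(v, h) = -a^T v - b^T h - v^T W h."""
    return -(sum(a[i] * v[i] for i in range(2))
             + sum(b[j] * h[j] for j in range(2))
             + sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2)))


def partition(W):
    return sum(math.exp(-energy(vp, hp, W)) for vp in STATES for hp in STATES)


def log_p_visible(v, W):
    numerator = sum(math.exp(-energy(v, h, W)) for h in STATES)
    return math.log(numerator) - math.log(partition(W))


def analytic_grad(v, W, i, j):
    """d log p(v) / d W[i][j] using the last line of the derivation."""
    Z = partition(W)
    p_v = sum(math.exp(-energy(v, h, W)) for h in STATES) / Z
    # First term: -sum_h p(h|v) dE/dW_ij = +sum_h p(h|v) v_i h_j
    data_term = sum(math.exp(-energy(v, h, W)) / Z / p_v * v[i] * h[j]
                    for h in STATES)
    # Second term: +sum_{v',h'} p(v',h') dE/dW_ij = -sum p(v',h') v'_i h'_j
    model_term = sum(math.exp(-energy(vp, hp, W)) / Z * vp[i] * hp[j]
                     for vp in STATES for hp in STATES)
    return data_term - model_term


def numeric_grad(v, W, i, j, eps=1e-6):
    """Central finite difference of log p(v) with respect to W[i][j]."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (log_p_visible(v, W_plus) - log_p_visible(v, W_minus)) / (2 * eps)


v = (1, 0)
print(analytic_grad(v, W, 0, 1), numeric_grad(v, W, 0, 1))
```

The two printed numbers should agree to several decimal places; a mismatch would indicate a sign or indexing error in the derivation or in the code.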

Notes on energy function and stochasticity of neurons.

As you can see above in my derivation, I left the definition of the energy function rather vague. The reason is that different versions of RBMs use different energy functions. The one that Hinton describes in the lecture linked above, and shown by @Laurens-Meeus, is:

$E(v,h)=-a^Tv-b^Th-v^TWh.$

It might be easier to reason about the gradient terms above via the expectation form.

$\frac{\partial \log(p(v))}{\partial \theta}= -\mathop{\mathbb{E}}_{p(h'|v)}\frac{\partial E(v,h')}{\partial \theta}+\mathop{\mathbb{E}}_{p(v',h')}\frac{\partial E(v',h')}{\partial \theta}$

The expectation in the first term is actually really easy to calculate, and that was the genius behind RBMs. By restricting the connections, the conditional expectation simply becomes a forward propagation of the RBM with the visible units clamped. This is the so-called wake phase in Boltzmann machines. Calculating the second term is much harder, and Monte Carlo methods are usually used to do so. Writing the gradient via an average over Monte Carlo runs:

$\frac{\partial \log(p(v))}{\partial \theta}\approx -\langle \frac{\partial E(v,h')}{\partial \theta}\rangle_{p(h'|v)}+\langle\frac{\partial E(v',h')}{\partial \theta}\rangle_{p(v',h')}$

Calculating the first term is not hard, as stated above, so Monte Carlo is used only for the second term. Monte Carlo methods use repeated random sampling from a distribution to estimate an expectation (a sum or integral). In classical RBMs, this random sampling means setting a unit to 0 or 1 stochastically, based on its probability: draw a uniform random number, and if it is less than the neuron's probability set the unit to 1, otherwise set it to 0.
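To make the sampling procedure concrete, here is a minimal sketch of one common way to approximate the second term: a single Gibbs step, usually called contrastive divergence (CD-1). The answer above does not name this technique, so treat it as an assumption. The sketch also assumes binary units, the energy function $E(v,h)=-a^Tv-b^Th-v^TWh$, and hypothetical names (`cd1_step`, `sample_bernoulli`), toy data, learning rate and epoch count.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_bernoulli(probs, rng):
    # One fresh uniform draw per unit: the unit turns on if the draw is below its probability.
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W, b):
    # p(h_j = 1 | v) under the assumed energy function
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    # p(v_i = 1 | h) under the same energy function
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update, applied in place, for a single binary training vector v0."""
    ph0 = hidden_probs(v0, W, b)                          # positive phase, visible units clamped
    h0 = sample_bernoulli(ph0, rng)
    v1 = sample_bernoulli(visible_probs(h0, W, a), rng)   # negative phase: one Gibbs step
    ph1 = hidden_probs(v1, W, b)
    for i in range(len(a)):
        for j in range(len(b)):
            # data statistics minus (approximate) model statistics
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(a)):
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
data = [[1, 1, 0, 0], [0, 0, 1, 1]]     # hypothetical toy training set

for epoch in range(200):                # the number of passes is a tuning choice, not a fixed rule
    for v in data:
        cd1_step(v, W, a, b, lr=0.1, rng=rng)
```

The positive phase, negative phase and weight update are repeated over the training set until the reconstructions stop improving. There is no fixed number of iterations.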

— Armen Aghajanyan

How do we make the hidden layer binary too? Because after the activation function operation, we get values in the range between 0 and 1.

— Born2Code

This is usually done by thresholding the activation. Anything above 0.5, would become 1, anything below would be zero.

— Armen Aghajanyan

But in this link , in section 3.1: Hinton has stated "the hidden unit turns on if this probability is greater than a random number uniformly distributed between 0 and 1". What does this actually mean? And also in this link, they say "Then the jth unit is on if upon choosing s uniformly distributed random number between 0 and 1 we find that its value is less than sig[j]. Otherwise it is off." I didn't get this.

— Born2Code

How can one tell whether that particular unit is turned on or off?

— Born2Code

I've added an edit. I suggest reading up on Monte Carlo methods because the stochasticity of this algorithm is derived from there.

— Armen Aghajanyan

In addition to the existing answers, I would like to talk about this energy function, and the intuition behind that a bit. Sorry if this is a bit long and physical.

The energy function describes a so-called Ising model, which is a model of ferromagnetism in terms of statistical mechanics / quantum mechanics. In statistical mechanics, we use a so-called Hamiltonian operator to describe the energy of a quantum-mechanical system. And a system always tries to be in the state with the lowest energy.

Now, the Ising model basically describes the interaction between electrons with a spin $\sigma_k$ of either +1 or -1, in the presence of an external magnetic field $h$. The interaction between two electrons $i$ and $j$ is described by a coefficient $J_{ij}$. This Hamiltonian (or energy function) is

$\hat{H} = \sum_{i,j} J_{ij} \sigma_i \sigma_j - \mu \sum_j h_j \sigma_j$

where $\hat{H}$ denotes the Hamiltonian. A standard procedure to get from an energy function to the probability that a system is in a given state (i.e. here: a configuration of spins, e.g. $\sigma_1 = {+1}, \sigma_2 = {-1}, ...$) is to use the Boltzmann distribution, which says that at a temperature $T$, the probability $p_i$ of the system being in a state $i$ with energy $E_i$ is given by

$p_i = \frac{\exp(-E_i/kT)}{\sum_{j}\exp(-E_j/kT)}$

At this point, you should recognize that these two equations are essentially the same equations as in the videos by Hinton and the answer by Armen Aghajanyan. This leads us to the question:

What does the RBM have to do with this quantum-mechanical model of ferromagnetism?

We need to use a final physical quantity: the entropy. As we know from thermodynamics, a system will settle in the state with the minimal energy, which also corresponds to the state with the maximal entropy.

As introduced by Shannon in 1948, in information theory the entropy $H$ can also be seen as a measure of the information content in $X$, given by the following sum over all possible states of $X$:

$H(X) = -\sum_i P(x_i) \log P(x_i)$

Now, the most efficient way to encode the information content in $X$ is to use a way that maximizes the entropy $H$.
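The Boltzmann distribution and the entropy formula above can be tried out numerically. This is a small illustrative sketch, not part of the original argument: the energies are hypothetical, and the product of the Boltzmann constant and the temperature is folded into a single assumed parameter `kT`.

```python
import math


def boltzmann(energies, kT=1.0):
    """Probability of each state: exp(-E_i / kT) divided by the sum over all states."""
    weights = [math.exp(-e / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy H = -sum_i P(x_i) log P(x_i) in nats; zero-probability states contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


energies = [0.0, 1.0, 2.0]          # hypothetical energies of three states
for kT in (0.1, 1.0, 10.0):
    probs = boltzmann(energies, kT)
    print(kT, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower-energy states always receive higher probability. As `kT` grows the distribution flattens and its entropy approaches the maximum value $\log 3$ for three states.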

Finally, this is where we get back to RBMs: basically, we want the RBM to encode as much information as possible. So we have to maximize the (information-theoretical) entropy in our RBM system. As proposed by Hopfield in 1982, we can maximize the information-theoretical entropy exactly like the physical entropy: by modelling the RBM like the Ising model above, and using the same methods to minimize the energy. And that is why we need this strange energy function in an RBM!

The nice mathematical derivation in Armen Aghajanyan's answer shows everything we need to do, to minimize the energy, thus maximizing entropy and storing / saving as much information as possible in our RBM.

PS: Please, dear physicists, forgive any inaccuracies in this engineer's derivation. Feel free to comment on or fix inaccuracies (or even mistakes).

— hbaderts

I saw this video; just watch it from that point. How do you get that sampled number? Do we just run rand() in MATLAB to obtain it? And then it would be different for each h(i). Oh no! I don't think the machine will learn properly.

— Born2Code

@Born2Code this is another question. Can you post it as a new question to this site? Please try to add the equations you are talking about to the new question, and explain what parts you don't understand.

— hbaderts

link

— Born2Code

The answer by @Armen has given me a lot of insight. One question hasn't been answered, however.

The goal is to maximize the probability (or likelihood) of $v$. This is related to minimizing the energy function of $v$ and $h$:

$E(v,h) = -a^{\mathrm{T}} v - b^{\mathrm{T}} h -v^{\mathrm{T}} W h$

Our variables are $a$, $b$ and $W$, which have to be trained. I'm quite sure that training them is the ultimate goal of the RBM.
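As a worked instance, this energy function can be combined with the log-likelihood gradient derived in Armen Aghajanyan's answer. Assuming only the energy function above, the partial derivatives of the energy with respect to the individual parameters are

$\frac{\partial E(v,h)}{\partial W_{ij}} = -v_i h_j, \quad \frac{\partial E(v,h)}{\partial a_i} = -v_i, \quad \frac{\partial E(v,h)}{\partial b_j} = -h_j$

and substituting them into the expectation form of the gradient gives

$\begin{aligned} \frac{\partial \log(p(v))}{\partial W_{ij}} &= \mathop{\mathbb{E}}_{p(h'|v)}[v_i h'_j] - \mathop{\mathbb{E}}_{p(v',h')}[v'_i h'_j] \\ \frac{\partial \log(p(v))}{\partial a_i} &= v_i - \mathop{\mathbb{E}}_{p(v',h')}[v'_i] \\ \frac{\partial \log(p(v))}{\partial b_j} &= \mathop{\mathbb{E}}_{p(h'|v)}[h'_j] - \mathop{\mathbb{E}}_{p(v',h')}[h'_j] \end{aligned}$

Each gradient is a statistic measured with the data clamped minus the same statistic under the model, so it vanishes when the model's statistics match those of the data.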

— Laurens Meeus

How do we make the hidden layer binary too? Because after the activation function operation, we would be getting values in the range between 0 and 1.

— Born2Code

@Born2Code: The activation function gives you the probability that a neuron has value 1. Therefore to "make it binary", you sample from the probabilities calculated for either $h$ or $v$ - in other words you literally do something like h_bin = (rand() < h_val) ? 1 : 0 - this has to be done for each neuron, and each time you want a sample.

— Neil Slater

@NeilSlater: But why a random number? Also, should the random number be generated anew for each iteration, or should the same number be used for all iterations? One more serious doubt: how many iterations have to be done? I have a training set V which has only one vector, i.e. v1. With v1, how many times do I have to iterate?

— Born2Code

@NeilSlater: One more doubt: is the same random number compared with all the values of the hidden layer? I know this is such an idiotic question, but still.

— Born2Code

It's a random number because that is how you resolve probabilities to binary values. It is a different number for each neuron inside h or v - you are sampling a vector of binary values for h or v, in order to generate an example that the network "believes" exists - i.e. an example that has a high statistical chance of being representative of the training set. During training, you determine how well it matches an existing training example and adjust weights accordingly.
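The difference between sampling and the fixed 0.5 threshold mentioned in an earlier comment can be seen in a short experiment. This is an illustrative sketch with hypothetical probabilities `h_val` and hypothetical helper names. It draws a separate random number for every unit on every sample and then compares the long-run frequencies with the thresholded result.

```python
import random


def sample_units(probs, rng):
    # A separate uniform random number is drawn for every unit.
    return [1 if rng.random() < p else 0 for p in probs]


def threshold_units(probs):
    # Deterministic alternative: a unit is on exactly when its probability exceeds 0.5.
    return [1 if p > 0.5 else 0 for p in probs]


rng = random.Random(42)
h_val = [0.9, 0.6, 0.2]          # hypothetical activation probabilities
n_samples = 10000
counts = [0] * len(h_val)
for _ in range(n_samples):
    for j, bit in enumerate(sample_units(h_val, rng)):
        counts[j] += bit

frequencies = [c / n_samples for c in counts]
print(frequencies)               # close to h_val
print(threshold_units(h_val))    # loses the difference between 0.9 and 0.6
```

Averaged over many samples, each sampled unit is on about as often as its probability says, which is the behaviour Monte Carlo estimation relies on. Thresholding always returns the same binary vector and discards that information.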

— Neil Slater