Why back propagate through time in an RNN?

In a recurrent neural network, you would usually forward propagate through several time steps, "unroll" the network, and then back propagate across the sequence of inputs.

Why would you not just update the weights after each individual step in the sequence? (This is the equivalent of using a truncation length of 1, so there is nothing to unroll.) This completely eliminates the vanishing gradient problem, greatly simplifies the algorithm, would probably reduce the chances of getting stuck in local minima, and most importantly seems to work fine. I trained a model this way to generate text and the results seemed comparable to the results I have seen from BPTT-trained models. I am only confused about this because every tutorial on RNNs I have seen says to use BPTT, almost as if it is required for proper learning, which is not the case.

Update: I added an answer

— Frobot

An interesting direction to take this research would be to compare the results achieved on your problem to benchmarks published in the literature on standard RNN problems. It would make a really good article.

— Sycorax says Reinstate Monica

Your "Update: I added an answer" replaced the previous edit with your architecture description and an illustration. Is that on purpose?

— amoeba says Reinstate Monica

Yes, I took it out because it didn't really seem relevant to the actual question and it took up a lot of space, but I can add it back if it helps

— Frobot

Well, people seem to have massive problems understanding your architecture, so I guess any additional explanation is useful. You can add it to your answer instead of your question, if you prefer.

— amoeba says Reinstate Monica

Answers:

Edit: I made a big mistake when comparing the two methods and have to change my answer. It turns out that the way I was doing it, back propagating only on the current time step, actually starts out learning faster. The quick updates learn the most basic patterns very quickly. But on a larger data set and with longer training time, BPTT does in fact come out on top. I was testing a small sample for just a few epochs and assumed that whoever starts out winning the race would be the winner. But this did lead me to an interesting find. If you start out your training back propagating just a single time step, then change to BPTT and slowly increase how far back you propagate, you get faster convergence.
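The warm-up idea in the edit above could be sketched as a schedule for the truncation length. This is only an illustration: the function name `truncation_length`, its parameters, and the linear growth rule are assumptions made here, since the answer does not say how fast the window was increased.

```python
def truncation_length(epoch, warmup_epochs=2, grow_every=1, max_length=32):
    """Hypothetical schedule: how many time steps to back propagate through.

    During the first `warmup_epochs` epochs only the current step is used
    (truncation length 1). After that the window grows by one step every
    `grow_every` epochs until it reaches `max_length`.
    """
    if epoch < warmup_epochs:
        return 1
    growth_steps = (epoch - warmup_epochs) // grow_every + 1
    return min(1 + growth_steps, max_length)


# Window used in each of the first six epochs with the default settings.
print([truncation_length(epoch) for epoch in range(6)])
```

A training loop would call this once per epoch and pass the result to whatever truncated-BPTT routine it uses; the default values are placeholders, not tuned settings.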

— Frobot

Thank you for your update. In the source of that last image he says this about the one-to-one setting: "Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification)." So that's what we were saying. If it's as you described, it has no state and it is not an RNN. "Forward propagating through a single input before back propagating": I'd call that an ANN. But these wouldn't perform as well with text, so something is up, and I have no idea what since I don't have the code

— ragulpr

I didn't read that part and you are correct. The model I'm using is actually the "many to many" on the far right. I assumed that in the "one to one" section these really were all connected and the drawing just left it out. But it actually is one of the options on the far right that I didn't notice (it's weird to have that one in a blog about RNNs, so I assumed they were all recurrent). I will edit that part of the answer to make more sense

— Frobot

I imagined that was the case, which is why I insisted on seeing your loss function. If it's many to many, your loss is akin to $error=\sum_t(y_t-\hat{y}_t)^2$ and it's verbatim an RNN, and you're propagating/inputting the whole sequence but then just truncating BPTT, i.e. you'd calculate the red part in my post but not recurse further.

— ragulpr

My loss function does not sum over time. I take one input, get one output, then calculate a loss and update the weights, then move on to t+1, so there is nothing to sum. I will add the exact loss function to the original post

— Frobot

Just post your code. I'm not making any more guesses, this is silly.

— 1

An RNN is a Deep Neural Network (DNN) where each layer may take new input but has the same parameters. BPTT is a fancy word for Back Propagation on such a network, which itself is a fancy word for Gradient Descent.

Say that the RNN outputs $\hat{y}_t$ in each step and

$\begin{equation} error_t=(y_t-\hat{y}_t)^2 \end{equation}$

In order to learn the weights we need gradients to answer the question "how much does a change in a parameter affect the loss function?" and then move the parameters in the direction that reduces the loss, which is opposite to the gradient:

$\begin{equation} \nabla error_t=-2(y_t-\hat{y}_t)\nabla \hat{y}_t \end{equation}$

I.e. we have a DNN where we get feedback on how good the prediction is at each layer. Since a change in a parameter will change every layer in the DNN (timestep), and every layer contributes to the forthcoming outputs, this needs to be accounted for.

Take a simple one-neuron, one-layer network to see this semi-explicitly:

$\begin{align*} \hat{y}_{t+1} =& f(a+bx_t+c\hat{y}_t)\\ \frac{\partial}{\partial a}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot \left(1+c\cdot \frac{\partial}{\partial a}\hat{y}_{t}\right) \\ \frac{\partial}{\partial b}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot \left(x_t+c\cdot\frac{\partial}{\partial b}\hat{y}_{t}\right)\\ \frac{\partial}{\partial c}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot \left(\hat{y}_t+c\cdot\frac{\partial}{\partial c}\hat{y}_{t}\right)\\ \iff\\ \nabla \hat{y}_{t+1} =& f'(a+bx_t+c\hat{y}_t)\cdot \left(\begin{bmatrix}1\\x_t\\\hat{y}_t \end{bmatrix} + c \mathbin{\color{red}{\nabla \hat{y}_{t}}} \right) \end{align*}$
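The recursion for $\nabla \hat{y}_{t+1}$ can be checked numerically. The following is a minimal sketch, assuming $f = \tanh$, a constant initial output $\hat{y}_0 = 0$, and arbitrary illustrative values for $a$, $b$, $c$ and the inputs; the names `rollout` and `numerical_grad` are made up here. The direct derivative of $a + b x_t + c\hat{y}_t$ with respect to $(a, b, c)$ is $(1, x_t, \hat{y}_t)$, and the `truncate` flag drops the recursive term $c\nabla\hat{y}_t$, which is the one-step variant discussed in the question.

```python
import math


def rollout(a, b, c, xs, y0=0.0, truncate=False):
    """Return the final output y_hat and its gradient w.r.t. (a, b, c)."""
    y_hat = y0
    grad = [0.0, 0.0, 0.0]  # y0 is a constant, so its gradient is zero
    for x in xs:
        z = a + b * x + c * y_hat
        slope = 1.0 - math.tanh(z) ** 2  # f'(z) for f = tanh
        direct = (1.0, x, y_hat)  # dz/da, dz/db, dz/dc with y_hat held fixed
        # The recursive term: carried along for the full gradient,
        # replaced by zeros when only the current step is differentiated.
        carried = [0.0, 0.0, 0.0] if truncate else grad
        grad = [slope * (d + c * g) for d, g in zip(direct, carried)]
        y_hat = math.tanh(z)
    return y_hat, grad


def numerical_grad(a, b, c, xs, eps=1e-6):
    """Central finite differences of the final output w.r.t. (a, b, c)."""
    params = [a, b, c]
    result = []
    for i in range(3):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        result.append((rollout(*up, xs)[0] - rollout(*down, xs)[0]) / (2 * eps))
    return result


xs = [1.0, -0.5, 0.3, 0.8]  # illustrative inputs
a, b, c = 0.1, 0.5, -0.8    # illustrative parameters
_, full = rollout(a, b, c, xs)
_, one_step = rollout(a, b, c, xs, truncate=True)
print("full recursion:", full)
print("one-step only: ", one_step)
print("finite diff:   ", numerical_grad(a, b, c, xs))
```

The full recursion should agree with the finite-difference estimate, while the one-step version generally does not, because it treats the previous output as a constant.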

With $\delta$ as the learning rate, one training step is then:

$\begin{equation} \begin{bmatrix}\tilde{a}\\\tilde{b}\\\tilde{c}\end{bmatrix} \leftarrow \begin{bmatrix}a\\b\\c\end{bmatrix} + \delta (y_{t}-\hat{y}_{t})\nabla \hat{y}_t \end{equation}$

What we see is that in order to calculate $\nabla \hat{y}_{t+1}$ you need to calculate, i.e. roll out, $\nabla \hat{y}_{t}$. What you propose is to ~~simply disregard the red part~~ calculate the red part for $t$ but not recurse further. I assume that your loss is something like

$\begin{equation} error=\sum_t(y_t-\hat{y}_t)^2 \end{equation}$

Maybe each step will then contribute a crude direction which is enough in aggregation? This could explain your results but I'd be really interested in hearing more about your method/loss function! Also would be interested in a comparison with a two timestep windowed ANN.
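One way to probe the "crude direction" idea is to compare the full gradient of the summed loss with the one-step gradient on a toy sequence. This is a sketch under assumptions, not the asker's setup: it uses the one-neuron network with $f = \tanh$, squared error rather than cross entropy, and made-up data; `loss_and_grads`, `numerical_grad`, and `cosine` are hypothetical helper names.

```python
import math


def loss_and_grads(a, b, c, xs, ys, y0=0.0, truncate=False):
    """Summed squared error over the sequence and its gradient w.r.t. (a, b, c).

    Each input x produces one output, which is compared with the target y
    at the same position in `ys`.
    """
    y_hat = y0
    grad_y = [0.0, 0.0, 0.0]
    loss = 0.0
    grad_loss = [0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        z = a + b * x + c * y_hat
        slope = 1.0 - math.tanh(z) ** 2
        carried = [0.0, 0.0, 0.0] if truncate else grad_y
        grad_y = [slope * (d + c * g) for d, g in zip((1.0, x, y_hat), carried)]
        y_hat = math.tanh(z)
        residual = y - y_hat
        loss += residual ** 2
        # d/dtheta (y - y_hat)^2 = -2 (y - y_hat) * grad(y_hat)
        grad_loss = [gl - 2.0 * residual * gy for gl, gy in zip(grad_loss, grad_y)]
    return loss, grad_loss


def numerical_grad(a, b, c, xs, ys, eps=1e-6):
    """Central finite differences of the summed loss w.r.t. (a, b, c)."""
    params = [a, b, c]
    result = []
    for i in range(3):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        result.append(
            (loss_and_grads(*up, xs, ys)[0] - loss_and_grads(*down, xs, ys)[0]) / (2 * eps)
        )
    return result


def cosine(u, v):
    """Cosine of the angle between two gradient vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)


xs = [1.0, -0.5, 0.3, 0.8, -1.2]  # made-up inputs
ys = [0.5, -0.2, 0.1, 0.4, -0.6]  # made-up targets
_, full = loss_and_grads(0.1, 0.5, -0.8, xs, ys)
_, one_step = loss_and_grads(0.1, 0.5, -0.8, xs, ys, truncate=True)
print("cosine(full, one-step) =", cosine(full, one_step))
```

A cosine close to 1 would mean the one-step updates point roughly the same way as the BPTT gradient on this sequence; a value near zero or negative would mean they do not. The result depends on the parameters and data, so a single toy run says nothing general.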

edit4: After reading comments it seems like your architecture is not an RNN.

RNN: Stateful - carries forward hidden state $h_t$ indefinitely. This is your model, but the training is different.

~~Your model: Stateless - hidden state rebuilt in each step~~

edit2: added more refs to DNNs

edit3: fixed gradstep and some notation

edit5: fixed the interpretation of your model after your answer/clarification.

— ragulpr

thank you for your answer. I think you may have misunderstood what I am doing though. In the forward propagation I only do one step, so that in the back propagation it is also only one step. I don't forward propagate across multiple inputs in the training sequence. I see what you mean about a crude direction that is enough in aggregation to allow learning, but I have checked my gradients with numerically calculated gradients and they match for 10+ decimal places. The back prop works fine. I am using cross entropy loss.

— Frobot

I am working on taking my same model and retraining it with BPTT so that we have a clear comparison. I have also trained a model using this "one step" algorithm to predict whether a stock price will rise or fall the next day, which is getting decent accuracy, so I will have two different models to compare BPTT vs. single-step back prop.

— Frobot

If you only forward propagate one step, isn't this a two-layered ANN with feature input of the last step to the first layer, feature input of the current step at the second layer, but with the same weights/parameters for both layers? I'd expect similar or better results with an ANN that takes input $\hat{y}_{t+1}=f(x_t,x_{t-1})$, i.e. that uses a fixed time window of size 2. If it only carries forward one step, can it learn long-term dependencies?

— ragulpr

I'm using a sliding window of size 1, but the results are vastly different from those of a sliding-window ANN of size 2 with inputs $(x_t, x_{t-1})$. I can purposely let it overfit when learning a huge body of text and it can reproduce the entire text with 0 errors, which requires knowing long-term dependencies that would be impossible to learn if you only had $(x_t, x_{t-1})$ as input. The only question I have left is whether using BPTT would allow the dependencies to become longer, but it honestly doesn't look like it would.

— Frobot

Look at my updated post. Your architecture is not an RNN; it's stateless, so long-term dependencies not explicitly baked into the features can't be learned. Previous predictions do not influence future predictions. You can see this as if $\frac{\partial}{\partial \hat{y}_{t-2}}\hat{y}_t =0$ for your architecture. BPTT is in theory identical to BP but performed on an RNN architecture, so you can't, but I see what you mean, and the answer is no. It would be really interesting to see experiments on a stateful RNN but with only one-step BPTT though ^^

— ragulpr

"Unfolding through time" is simply an application of the chain rule,

$\frac{dF(g(x), h(x), m(x))}{dx} = \frac{\partial F}{\partial g}\frac{dg}{dx} + \frac{\partial F}{\partial h}\frac{dh}{dx} + \frac{\partial F}{\partial m}\frac{dm}{dx}$

The output of an RNN at time step $t$ , $H_t$ is a function of the parameters $\theta$ , the input $x_t$ and the previous state, $H_{t-1}$ (note that instead $H_t$ may be transformed again at time step $t$ to obtain the output, that is not important here). Remember the goal of gradient descent: given some error function $L$ , let's look at our error for the current example (or examples), and then let's adjust $\theta$ in such a way, that given the same example again, our error would be reduced.

How exactly did $\theta$ contribute to our current error? We took a weighted sum with our current input, $x_t$, so we'll need to backpropagate through the input to find $\nabla_\theta a(x_t, \theta)$, to work out how to adjust $\theta$. But our error was also the result of some contribution from $H_{t-1}$, which was also a function of $\theta$, right? So we need to find out $\nabla_\theta H_{t-1}$, which was a function of $x_{t-1}$, $\theta$ and $H_{t-2}$. But $H_{t-2}$ was also a function of $\theta$. And so on.
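The "and so on" can be written out explicitly. As a sketch, with notation introduced here rather than taken from the answer: write $H_t = F_t(\theta, H_{t-1})$, where $F_t$ absorbs the input $x_t$, and assume the initial state $H_0$ does not depend on $\theta$. Applying the chain rule repeatedly gives

$\begin{aligned} \frac{dH_t}{d\theta} &= \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial H_{t-1}}\frac{dH_{t-1}}{d\theta} \\ &= \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial H_{t-1}}\frac{\partial F_{t-1}}{\partial \theta} + \frac{\partial F_t}{\partial H_{t-1}}\frac{\partial F_{t-1}}{\partial H_{t-2}}\frac{dH_{t-2}}{d\theta} \\ &= \sum_{k=0}^{t-1}\left(\prod_{j=0}^{k-1}\frac{\partial F_{t-j}}{\partial H_{t-j-1}}\right)\frac{\partial F_{t-k}}{\partial \theta} \end{aligned}$

For vector-valued states the partial derivatives are Jacobians multiplied in the order shown, and the empty product for $k=0$ is the identity. Keeping only the $k=0$ term corresponds to a truncation length of 1; full BPTT keeps every term.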

— Matthew Hampsey

I understand why you back propagate through time in a traditional RNN. I'm trying to find out why a traditional RNN uses multiple inputs at once for training, when using just one at a time is much simpler and also works

— Frobot

The only sense in which you can feed in multiple inputs at once into an RNN is feeding in multiple training examples, as part of a batch. The batch size is arbitrary, and convergence is guaranteed for any size, but higher batch sizes may lead to more accurate gradient estimations and faster convergence.

— Matthew Hampsey

That's not what I meant by "multiple inputs at once". I didn't word it very well. I meant you usually forward propagate through several inputs in the training sequence, then back propagate back through them all, then update the weights. So the question is, why propagate through a whole sequence when doing just one input at a time is much easier and still works

— Frobot

I think some clarification here is required. When you say "inputs", are you referring to multiple training examples, or are you referring to multiple time steps within a single training example?

— Matthew Hampsey

I will post an answer to this question by the end of today. I finished making a BPTT version, just have to train and compare. After that if you still want to see some code let me know what you want to see and I guess I could still post it

— Frobot