It's perhaps worth reading about Lagrangian duality and the broader relationship (at times equivalence) between:
- optimization subject to hard (i.e. inviolable) constraints
- optimization with penalties for violating constraints.
Quick intro to weak duality and strong duality
Assume we have some function $f(x, y)$ of two variables. For any $\hat{x}$ and $\hat{y}$, we have:
$$\min_x f(x, \hat{y}) \;\le\; f(\hat{x}, \hat{y}) \;\le\; \max_y f(\hat{x}, y)$$
Since that holds for any $\hat{x}$ and $\hat{y}$, it also holds that:
$$\max_y \min_x f(x, y) \;\le\; \min_x \max_y f(x, y)$$
This is known as weak duality. In some circumstances, you also have strong duality (also known as the saddle point property):
$$\max_y \min_x f(x, y) \;=\; \min_x \max_y f(x, y)$$
When strong duality holds, solving the dual problem also solves the primal problem. They are in effect the same problem!
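To see weak duality concretely, here is a minimal numeric sketch on a grid. The function $f(x, y) = (x - y)^2$ and the domain $[0, 1]^2$ are my own choices for illustration, picked so the inequality is strict:

```python
import numpy as np

# Grid over [0, 1]^2 for f(x, y) = (x - y)^2.
xs = np.linspace(0.0, 1.0, 201)
ys = np.linspace(0.0, 1.0, 201)
F = (xs[:, None] - ys[None, :]) ** 2   # F[i, j] = f(xs[i], ys[j])

max_min = F.min(axis=0).max()   # max over y of (min over x): here 0
min_max = F.max(axis=1).min()   # min over x of (max over y): here 1/4

print(max_min, min_max)         # 0.0 <= 0.25: weak duality, with a gap
assert max_min <= min_max + 1e-12
```

For this $f$ the two values differ (a duality gap), so weak duality holds strictly and there is no saddle point.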
The Lagrangian for constrained Ridge regression
Let me define the function $\mathcal{L}$ as:
$$\mathcal{L}(b, \lambda) = \sum_{i=1}^n \left( y_i - x_i \cdot b \right)^2 + \lambda \left( \sum_{j=1}^p b_j^2 - t \right)$$
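In code, this Lagrangian is only a few lines. A sketch, assuming a design matrix `X` with rows $x_i$ ($n$ observations, $p$ features), a response vector `y`, and a budget `t`; the name `ridge_lagrangian` is illustrative, not from any library:

```python
import numpy as np

def ridge_lagrangian(b, lam, X, y, t):
    sse = np.sum((y - X @ b) ** 2)           # sum_i (y_i - x_i . b)^2
    return sse + lam * (np.sum(b ** 2) - t)  # + lambda (sum_j b_j^2 - t)
```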
The min-max interpretation of the Lagrangian
The Ridge regression problem subject to hard constraints is:
$$\min_b \max_{\lambda \ge 0} \mathcal{L}(b, \lambda)$$
You pick $b$ to minimize the objective, cognizant that after $b$ is picked, your opponent will set $\lambda$ to infinity if you chose $b$ such that $\sum_{j=1}^p b_j^2 > t$.
If strong duality holds (which it does here because Slater's condition is satisfied for $t > 0$), then you achieve the same result by reversing the order:
$$\max_{\lambda \ge 0} \min_b \mathcal{L}(b, \lambda)$$
Here, your opponent chooses $\lambda$ first! You then choose $b$ to minimize the objective, already knowing their choice of $\lambda$. The $\min_b \mathcal{L}(b, \lambda)$ part (taking $\lambda$ as given) is equivalent to the second, penalized form of your Ridge regression problem.
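To make that inner minimization concrete: for a fixed $\lambda$, the $-\lambda t$ term does not depend on $b$, so the minimizer is the usual ridge closed form, and its squared norm tells you which budget $t$ makes the constraint exactly active. A sketch on assumed random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # assumed toy data: 50 obs, 3 features
y = rng.normal(size=50)

lam = 2.0                      # the opponent's fixed choice of lambda

# For fixed lambda, min_b L(b, lambda) is solved by the ridge
# closed form: b(lambda) = (X'X + lambda I)^{-1} X'y.
p = X.shape[1]
b_lam = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The hard-constrained problem with budget t = ||b(lambda)||^2
# shares this solution: the constraint is exactly active there.
t = np.sum(b_lam ** 2)
print(b_lam, t)
```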
As you can see, this isn't a result particular to Ridge regression. It is a broader concept.
References
(I started this post following an exposition I read in Rockafellar.)
Rockafellar, R. T., Convex Analysis, Princeton University Press, 1970.
You might also examine lectures 7 and 8 from Prof. Stephen Boyd's course on convex optimization.