Smallest $\lambda$ at which a lasso coefficient is zero

Define the lasso estimate

$\hat\beta^\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1,$

where the $i^{th}$ row $x_i \in \mathbb{R}^p$ of the design matrix $X \in \mathbb{R}^{n \times p}$ is a vector of covariates for explaining the stochastic response $y_i$ (for $i = 1, \dots, n$).

We know that for $\lambda \geq \frac{1}{n} \|X^T y\|_\infty$, the lasso estimate is $\hat\beta^\lambda = 0$. (See, for instance, "Lasso and Ridge tuning parameter scope".) In other notation, this says that $\lambda_\mathrm{max} = \frac{1}{n} \|X^T y\|_\infty$. Notice that $\lambda_\mathrm{max} = \sup_{\hat\beta^\lambda \ne 0} \lambda.$ We can see this visually with the following image displaying the lasso solution path:

Notice that on the far right-hand side of the plot, all of the coefficients are zero. This happens at the point $\lambda_\mathrm{max}$ described above.
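One way to see why $\lambda_\mathrm{max}$ has this form is the subgradient condition at $\beta = 0$: zero is optimal exactly when every entry of $\frac{1}{n} X^T y$ is at most $\lambda$ in absolute value. The sketch below checks that on simulated data. The data, the seed, and the helper name `zero_is_optimal` are illustrative assumptions, not part of the question.

```r
# Simulated data, purely for illustration (assumption: any X and y would do)
set.seed(1)
n <- 50
p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# Gradient of the smooth part (1/2n)||y - X beta||^2 evaluated at beta = 0
grad_at_zero <- -drop(crossprod(X, y)) / n

# lambda_max = (1/n) * ||X^T y||_inf
lambda_max <- max(abs(grad_at_zero))

# beta = 0 solves the lasso iff |grad_j| <= lambda for every j
zero_is_optimal <- function(lambda) all(abs(grad_at_zero) <= lambda)

zero_is_optimal(lambda_max)         # TRUE
zero_is_optimal(0.99 * lambda_max)  # FALSE
```

The second call fails the condition because the coordinate attaining the maximum then has a gradient larger than $\lambda$, so moving that coefficient away from zero lowers the objective.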

From this plot, we also notice that on the far left side, all of the coefficients are nonzero: what is the value of $\lambda$ at which any component of $\hat\beta^\lambda$ is initially zero? That is, what is

$\lambda_\textrm{min} = \min_{\exists j \, \mathrm{ s.t. } \, \hat\beta_j = 0} \lambda$

equal to, as a function of $X$ and $y$? I'm interested in a closed form solution. In particular, I'm not interested in an algorithmic solution, such as, for instance, suggesting that LARS could find the knot through computation.

Despite my interests, it seems like $\lambda_\mathrm{min}$ may not be available in closed form, since, otherwise, lasso computational packages would likely take advantage of it when determining the tuning parameter depth during cross validation. In light of this, I'm interested in anything that can be theoretically shown about $\lambda_\mathrm{min}$ and (still) am particularly interested in a closed form.

lasso regularization

— user795305

This is stated and proved in the glmnet paper: web.stanford.edu/~hastie/Papers/glmnet.pdf

— Matthew Drury

@MatthewDrury Thanks for sharing this! However, this paper doesn't seem to share what you suggest it does. In particular, notice that my $\lambda_\max$ is their $\lambda_\min$.

— user795305

Are you sure we need the [tuning-parameter] tag?

— amoeba says Reinstate Monica

You're right, a closed form for the lasso solution does not exist in general (see stats.stackexchange.com/questions/174003/…). However, LARS at least tells you what is going on, and under which exact conditions and at which point you can add or remove a variable. I think something like that is the best you can get.

— Chrrr

@chRrr I'm not sure that's completely fair to say: we know that $\hat\beta^\lambda = 0$ for $\lambda \geq \frac{1}{n} \|X^t y\|_\infty$. That is, in the extreme case of the solution being 0, we have a closed form. I'm asking whether something similar is true in the extreme case of the lasso estimate being dense (i.e. having no zeros). Indeed, I'm not even interested in the exact entries of $\hat\beta_\lambda$, just in whether they're zero or not.

— user795305

The lasso estimate described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t$

$\begin{aligned} f(\beta) &= \frac{1}{2n} \|y - X\beta\|_2^2 \\ g(\beta) &= \|\beta\|_1 \end{aligned}$

This optimization has a geometric representation: finding the point of contact between a multidimensional sphere and a polytope (spanned by the vectors of $X$). The surface of the polytope represents $g(\beta)$. The square of the radius of the sphere represents the function $f(\beta)$ and is minimized when the two surfaces touch.

The images below provide a graphical explanation. They use the following simple problem with vectors of length 3 (chosen for simplicity, so that it can be drawn):

$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1.4 \\ 1.84 \\ 0.32 \end{bmatrix} = \beta_1 \begin{bmatrix} 0.8 \\ 0.6 \\ 0 \end{bmatrix} + \beta_2 \begin{bmatrix} 0 \\ 0.6 \\ 0.8 \end{bmatrix} + \beta_3 \begin{bmatrix} 0.6 \\ 0.64 \\ -0.48 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}$

and we minimize $\epsilon_1^2+\epsilon_2^2+\epsilon_3^2$ subject to the constraint

$\vert\beta_1\vert + \vert\beta_2\vert + \vert\beta_3\vert \leq t$

The images show:

The red surface depicts the constraint, a polytope spanned by $X$.
The green surface depicts the surface being minimized, a sphere.
The blue line shows the lasso path, the solutions that we find as we change $t$ or $\lambda$.
The green vector shows the OLS solution $\hat{y}$ (which was chosen as $\beta_1=\beta_2=\beta_3=1$, or $\hat{y} = x_1 + x_2 + x_3$).
The three black vectors are $x_1 = (0.8,0.6,0)$, $x_2 = (0,0.6,0.8)$ and $x_3 = (0.6,0.64,-0.48)$.
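For anyone who wants to reproduce the toy problem, here is a minimal R sketch that builds the three vectors and confirms that the OLS fit recovers $\beta_1=\beta_2=\beta_3=1$. The variable names (`X`, `y`, `beta_ols`, `col_norms`) are my own choices for illustration.

```r
# Columns of X are the three black vectors x1, x2, x3 from the drawing
X <- cbind(x1 = c(0.8, 0.6, 0),
           x2 = c(0, 0.6, 0.8),
           x3 = c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)

# X is square and invertible here, so OLS reduces to solving X beta = y
beta_ols <- solve(X, y)

# Each column has unit Euclidean length
col_norms <- sqrt(colSums(X^2))

beta_ols   # all three coefficients equal 1 (up to rounding)
col_norms  # all three norms equal 1 (up to rounding)
```

Because the columns have unit length, the inner products $x_i \cdot y$ can be compared directly with each other when deciding which variable enters the path first.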

We show three images:

In the first image only a point of the polytope is touching the sphere. This image demonstrates very well why the lasso solution is not just a multiple of the OLS solution: moving in the direction of the OLS solution increases the sum $\vert \beta \vert_1$ more strongly. In this case only a single $\beta_i$ is non-zero.
In the second image a ridge of the polytope is touching the sphere (in higher dimensions we get higher dimensional analogues). In this case multiple $\beta_i$ are non-zero.
In the third image a facet of the polytope is touching the sphere. In this case all the $\beta_i$ are non-zero.

The range of $t$ or $\lambda$ for which we have the first and third cases can be calculated easily because of their simple geometric representation.

Case 1: Only a single $\beta_i$ non-zero

The non-zero $\beta_i$ is the one for which the associated vector $x_i$ has the highest absolute value of the covariance with $\hat{y}$ (this is the point of the parallelotope which is closest to the OLS solution). We can calculate the Lagrange multiplier $\lambda_{max}$ below which we have at least one non-zero $\beta$ by taking the derivative with respect to $\pm\beta_i$ (the sign depending on whether we increase $\beta_i$ in the negative or positive direction):

$\frac{\partial ( \frac{1}{2n} \|y - X\beta\|_2^2 - \lambda \|\beta\|_1 )}{\pm \partial \beta_i} = 0$

which leads to

$\lambda_{max} = \frac{ \left( \frac{1}{2n}\frac{\partial ( \|y - X\beta\|_2^2 )}{\pm \partial \beta_i} \right) }{ \left( \frac{\partial ( \|\beta\|_1 )}{\pm \partial \beta_i}\right)} = \pm \frac{\partial ( \frac{1}{2n} \|y - X\beta\|_2^2 )}{\partial \beta_i} = \pm \frac{1}{n} x_i \cdot y$

which equals the $\frac{1}{n}\|X^Ty\|_\infty$ mentioned in the comments,

where we should notice that this is only true for the special case in which the tip of the polytope is touching the sphere (so this is not a general solution, although generalization is straightforward).
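As a sanity check of Case 1 on the toy problem from the drawings, the sketch below computes the inner products $x_i \cdot y$, picks the variable that enters first, and evaluates $\lambda_{max} = \frac{1}{n}\max_i \vert x_i \cdot y \vert$. The absolute value and the maximum are written out explicitly here; the variable names are illustrative assumptions.

```r
# Toy problem from the drawings: columns x1, x2, x3 and y = x1 + x2 + x3
X <- cbind(x1 = c(0.8, 0.6, 0),
           x2 = c(0, 0.6, 0.8),
           x3 = c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

# Inner products x_i . y for each column
inner <- drop(crossprod(X, y))

# The variable with the largest |x_i . y| is the first to become non-zero
first_active <- names(which.max(abs(inner)))

# lambda_max = (1/n) * max_i |x_i . y|
lambda_max <- max(abs(inner)) / n

inner         # approximately 2.224, 1.360, 1.864
first_active  # "x1"
lambda_max    # approximately 0.7413
```

This matches the first image: only the tip of the polytope in the $x_1$ direction touches the sphere.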

Case 3: All $\beta_i$ are non-zero.

In this case a facet of the polytope is touching the sphere. Then the direction of change of the lasso path is normal to the surface of that particular facet.

The polytope has many facets, with positive and negative contributions of the $x_i$. In the case of the last lasso step, when the lasso solution is close to the OLS solution, the contributions of the $x_i$ must be defined by the sign of the OLS solution. The normal of the facet can be defined by taking the gradient of the function $\|\beta(r)\|_1$, the value of the sum of beta at the point $r$, which is:

$n = - \nabla_r ( \|\beta(r)\|_1 ) = -\nabla_r ( \text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^Tr ) = -\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T$

and the equivalent change of beta for this direction is:

$\vec{\beta}_{last} = (X^TX)^{-1}X^T n = -(X^TX)^{-1}X^T [\text{sign} (\hat{\beta}) \cdot (X^TX)^{-1}X^T]$

which after some algebraic tricks with shifting the transposes ( $A^TB^T = [BA]^T$ ) and distribution of brackets becomes

$\vec{\beta}_{last} = - (X^TX)^{-1} \text{sign} (\hat{\beta})$

We normalize this direction:

$\vec{\beta}_{last,normalized} = \frac{\vec{\beta}_{last}}{\sum \vec{\beta}_{last} \cdot \text{sign}(\hat{\beta})}$

To find the $\lambda_{min}$ below which all coefficients are non-zero, we only have to calculate back from the OLS solution to the point where one of the coefficients is zero,

$d = \min \left( \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} \right)\qquad \text{with the condition that } \frac{\hat{\beta}}{\vec{\beta}_{last,normalized}} >0$

and at this point we evaluate the derivative (as before when we calculated $\lambda_{max}$). We use that for a quadratic function $q(x) = q(1) x^2$ we have $q'(x) = 2 q(1) x$:

$\lambda_{min} = \frac{d}{n} \| X \vec{\beta}_{last,normalized} \|_2^2$
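The formula can be checked numerically on the toy problem from the drawings. The sketch below evaluates $\lambda_{min}$ as written above and compares it with a small coordinate-descent lasso solver. The solver `lasso_cd`, the helper `soft_threshold`, and the shortcut `lambda_min_short` (my own algebraic simplification, $\min_j \hat\beta_j / (n\,[(X^TX)^{-1}\text{sign}(\hat\beta)]_j)$ over positive ratios) are illustrative assumptions and not part of the `lars` package. The sketch also assumes $X$ has full column rank and all OLS coefficients are non-zero.

```r
# Toy problem from the drawings: y = x1 + x2 + x3
X <- cbind(c(0.8, 0.6, 0), c(0, 0.6, 0.8), c(0.6, 0.64, -0.48))
y <- c(1.4, 1.84, 0.32)
n <- nrow(X)

# Closed form for the last knot, following the steps above.
# The minus sign of beta_last cancels in the normalization, so it is dropped.
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))
s <- sign(beta_ols)
w <- drop(solve(crossprod(X), s))        # (X^T X)^{-1} sign(beta_ols)
v <- w / sum(w * s)                      # normalized direction
ratios <- beta_ols / v
d <- min(ratios[ratios > 0])             # distance back to the first zero
lambda_min <- d / n * sum((X %*% v)^2)

# Equivalent shortcut (assumption: derived by simplifying the line above)
short_ratios <- beta_ols / (n * w)
lambda_min_short <- min(short_ratios[short_ratios > 0])

# Minimal coordinate descent for (1/2n)||y - X b||^2 + lambda ||b||_1
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_cd <- function(X, y, lambda, n_iter = 2000) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  col_ss <- colSums(X^2)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * partial_resid)
      beta[j] <- soft_threshold(rho, n * lambda) / col_ss[j]
    }
  }
  beta
}

beta_below <- lasso_cd(X, y, 0.9 * lambda_min)  # all coefficients non-zero
beta_above <- lasso_cd(X, y, 1.1 * lambda_min)  # third coefficient is zero
lambda_min
```

On this example the third coefficient is the one that reaches zero, which agrees with the drawings: the path moves from the facet (all three non-zero) to the ridge spanned by $x_1$ and $x_2$.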

Images

a point of the polytope is touching the sphere, a single $\beta_i$ is non-zero:

a ridge (or its analogue in higher dimensions) of the polytope is touching the sphere, many $\beta_i$ are non-zero:

a facet of the polytope is touching the sphere, all $\beta_i$ are non-zero:

Code example:

library(lars)    
data(diabetes)
y <- diabetes$y - mean(diabetes$y)
x <- diabetes$x

# models
lmc <- coef(lm(y~0+x))
modl <- lars(diabetes$x, diabetes$y, type="lasso")

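# matrix equation: eliminate the first coefficient using the facet constraint
# (signs are taken relative to the sign of the first OLS coefficient)
d_x <- matrix(rep(x[,1], ncol(x)-1), length(x[,1])) %*% diag(sign(lmc[-c(1)]/lmc[1]))
x_c <- x[,-1]-d_x
y_c <- -x[,1]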

# solving equation
cof <- coefficients(lm(y_c~0+x_c))
cof <- c(1-sum(cof*sign(lmc[-c(1)]/lmc[1])),cof)

# alternatively the last direction of change in coefficients is found by:
solve(t(x) %*% x) %*% sign(lmc)

# solution by lars package
cof_m <-(coefficients(modl)[13,]-coefficients(modl)[12,])

# last step
dist <- x %*% (cof/sum(cof*sign(lmc[])))
#dist_m <- x %*% (cof_m/sum(cof_m*sign(lmc[]))) #for comparison

# calculate back to zero
shrinking_set <- which(-lmc[]/cof>0)  #only the positive values
step_last <- min((-lmc/cof)[shrinking_set])

d_err_d_beta <- step_last*sum(dist^2)

# compare
modl[4] #all computed lambda
d_err_d_beta  # lambda last change
max(abs(t(x) %*% y)) # lambda first change

note: those last three lines are the most important

> modl[4]            # all computed lambda by algorithm
$lambda
 [1] 949.435260 889.315991 452.900969 316.074053 130.130851  88.782430  68.965221  19.981255   5.477473   5.089179
[11]   2.182250   1.310435

> d_err_d_beta       # lambda last change by calculating only last step
    xhdl 
1.310435 
> max(abs(t(x) %*% y))    # lambda first change by max(|x^T y|)
[1] 949.4353
Written by StackExchangeStrike

— Sextus Empiricus

Thanks for including the edits! So far in my reading, I'm stuck just past the "case 1" subsection. The result for $\lambda_\max$ derived there is wrong since it doesn't include an absolute value or a maximum. We know further that there's a mistake since in the derivation, there's a sign mistake, a place where differentiability is wrongly assumed, an "arbitrary choice" of $i$ to differentiate with respect to, and an incorrectly evaluated derivative. To be frank, there isn't one "$=$" sign that's valid.

— user795305

I have corrected it with a plus-minus sign. The change of the beta can be positive or negative. Regarding the maximum and "arbitrary choice"... "for which the associated vector $x_i$ has the highest covariance with $\hat{y}$"

— Sextus Empiricus

Thanks for the update! However, there are still problems. For instance, $\frac{\partial}{\partial \beta_i} \|y - X \beta\|_2^2$ is evaluated incorrectly.

— user795305

If $\beta=0$ then

$\frac{\partial}{\partial\beta_i} \|y - X\beta\|_2^2 = \frac{\partial \|y - X\beta\|_2}{\partial\beta_i} 2 \|y - X\beta\|_2 = \frac{\partial \|y - s x_i\|_2}{\partial s} 2 \|y - X\beta\|_2 = 2 \, \text{cor}(x_i,y) \|x_i\|_2 \|y\|_2 = 2 x_i \cdot y$

This correlation enters the equation because, if $s=0$, then only the change of $s x_i$ tangent to $y$ is changing the length of the vector $y - s x_i$.

— Sextus Empiricus

Ah, okay, so there's a limit involved in your argument! (You're using both $\beta = 0$ and that a coefficient is nonzero.) Further, the second equality in the line with $\lambda_\max$ is misleading since the sign could change due to the differentiation of the absolute value.

— user795305
