For which models does the bias of the MLE fall faster than the variance?

Let $\hat\theta$ be a maximum likelihood estimate of a true parameter $\theta^*$ of some model. As the number of data points $n$ increases, the error $\lVert\hat\theta-\theta^*\rVert$ typically decreases as $O(1/\sqrt n)$. Using the triangle inequality and properties of the expectation, it is possible to show that this error rate implies that both the "bias" $\lVert \mathbb E\hat\theta - \theta^*\rVert$ and the "deviation" $\lVert \mathbb E\hat\theta - \hat\theta\rVert$ decrease at the same $O(1/\sqrt{n})$ rate. Of course, it is possible for models to have bias that shrinks at a faster rate. Many models (like ordinary least squares regression) have no bias.
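The triangle-inequality step mentioned above can be spelled out as follows. This is only a sketch, and it assumes the error rate is understood in expectation, i.e. $\mathbb E\lVert\hat\theta-\theta^*\rVert = O(1/\sqrt n)$ (an assumption, since the rate could also be read in probability). For the bias, convexity of the norm (Jensen's inequality) gives

$\lVert \mathbb E\hat\theta - \theta^*\rVert = \lVert \mathbb E(\hat\theta - \theta^*)\rVert \le \mathbb E\lVert\hat\theta-\theta^*\rVert = O(1/\sqrt n)$

and for the deviation, the triangle inequality gives

$\mathbb E\lVert \mathbb E\hat\theta - \hat\theta\rVert \le \lVert \mathbb E\hat\theta - \theta^*\rVert + \mathbb E\lVert \theta^* - \hat\theta\rVert \le 2\,\mathbb E\lVert\hat\theta-\theta^*\rVert = O(1/\sqrt n)$

Both lines are upper bounds, so either term may in fact shrink faster than $O(1/\sqrt n)$.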

I am interested in models whose bias shrinks faster than $O(1/\sqrt n)$, but where the error does not shrink at this faster rate because the deviation still shrinks as $O(1/\sqrt n)$. In particular, I would like to know sufficient conditions for a model's bias to shrink at the rate $O(1/n)$.

— Mike Izbicki

Is $\lVert\hat\theta-\theta^*\rVert = (\hat\theta-\theta^*)^2$? Or?

— Alecos Papadopoulos

I was asking specifically about the L2 norm, yes. But I would also be interested in other norms if that makes the question easier to answer.

— Mike Izbicki

$(\hat \theta -\theta^*)^2$ is $O_p(1/n)$.

— Alecos Papadopoulos

Sorry, I misread your comment. For the L2 norm in $d$ dimensions, $\Vert a-b\Vert = \sqrt{\sum_{i=1}^d (a_i-b_i)^2}$, and so convergence happens at the rate $O(1/\sqrt n)$. I agree that if we squared it, it would converge as $O(1/n)$.

— Mike Izbicki

Have you seen the ridge regression paper (Hoerl & Kennard 1970)? I believe it gives conditions on the design matrix + penalty under which this is expected to hold.

— DC

Answers:

In general, you need models where the MLE is not asymptotically normal but converges to some other distribution (and does so at a faster rate). This usually happens when the parameter being estimated lies on the boundary of the parameter space. Intuitively, this means that the MLE approaches the parameter "only from one side", so it "improves on convergence speed" because it is not "distracted" by going "back and forth" around the parameter.

A standard example is the MLE for $\theta$ in an i.i.d. sample of $U(0,\theta)$ uniform random variables. The MLE here is the maximum order statistic,

$\hat \theta_n = u_{(n)}$

Its finite-sample distribution is

$F_{\hat \theta_n} = \frac {(\hat \theta_n)^n}{\theta ^n},\;\;\; f_{\hat \theta_n}=n\frac {(\hat \theta_n)^{n-1}}{\theta ^n}$

$\mathbb E(\hat \theta_n) = \frac {n}{n+1}\theta \implies B(\hat \theta_n) = -\frac {1}{n+1}\theta$

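So $B(\hat \theta_n) = O(1/n)$. But the same faster rate also holds for the standard deviation: $\text{Var}(\hat \theta_n) = \frac{n\theta^2}{(n+1)^2(n+2)} = O(1/n^2)$, so $\sqrt{\text{Var}(\hat \theta_n)} = O(1/n)$.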
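As a rough numerical check of these rates, here is a minimal Monte Carlo sketch. The function name `uniform_mle_moments`, the value `theta = 2.0`, the sample sizes, and the number of replications are all illustrative assumptions, not part of the answer. It compares the simulated bias and standard deviation of the sample maximum with the closed-form values.

```python
import numpy as np

def uniform_mle_moments(theta, n, reps=4000, seed=0):
    """Monte Carlo bias and standard deviation of the MLE max(U_1..U_n), U_i ~ U(0, theta)."""
    rng = np.random.default_rng(seed)
    samples = rng.uniform(0.0, theta, size=(reps, n))
    theta_hat = samples.max(axis=1)  # the MLE is the maximum order statistic
    return theta_hat.mean() - theta, theta_hat.std(ddof=1)

theta = 2.0  # hypothetical true parameter
for n in (10, 100, 1000):
    bias, sd = uniform_mle_moments(theta, n)
    exact_bias = -theta / (n + 1)
    exact_sd = theta * np.sqrt(n / ((n + 1) ** 2 * (n + 2)))
    print(f"n={n:5d}  bias={bias:+.5f} (exact {exact_bias:+.5f})  "
          f"sd={sd:.5f} (exact {exact_sd:.5f})")
```

Both columns should shrink by roughly a factor of ten each time $n$ grows tenfold, up to Monte Carlo noise.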

One can also verify that, to obtain a limiting distribution, we need to look at the variable $n(\theta - \hat \theta_n)$ (i.e., we need to scale by $n$), since

$P[n(\theta - \hat \theta_n)\leq z] = 1-P[\hat \theta_n\leq \theta - (z/n)]$

$=1-\frac 1 {\theta^n}\cdot \left(\theta + \frac{-z}{n}\right)^n = 1-\frac {\theta^n} {\theta^n}\cdot \left(1 + \frac{-z/\theta}{n}\right)^n$

$\to 1- e^{-z/\theta}$

which is the CDF of the Exponential distribution.
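The limit above can be checked by simulation. The following is a sketch under assumed values (`theta = 2.0`, `n = 500`, and the helper name `scaled_error_sample` are hypothetical choices for illustration): it compares the empirical CDF of $n(\theta - \hat\theta_n)$ with $1 - e^{-z/\theta}$ at a few points.

```python
import numpy as np

def scaled_error_sample(theta, n, reps=4000, seed=1):
    """Draw reps realisations of n * (theta - max(U_1..U_n)) with U_i ~ U(0, theta)."""
    rng = np.random.default_rng(seed)
    theta_hat = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    return n * (theta - theta_hat)

theta, n = 2.0, 500  # hypothetical values for illustration
z = scaled_error_sample(theta, n)
for q in (0.5, 1.0, 2.0, 4.0):
    empirical = np.mean(z <= q)            # empirical CDF at q
    limit = 1.0 - np.exp(-q / theta)       # exponential CDF with mean theta
    print(f"z={q:3.1f}  empirical={empirical:.3f}  limit={limit:.3f}")
```

The two columns should agree closely for moderate $n$, which is consistent with the $n$-scaling (rather than $\sqrt n$-scaling) derived above.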

I hope this provides some direction.

— Alecos Papadopoulos

This is getting close, but I'm specifically interested in situations where the bias shrinks faster than the variance.

— Mike Izbicki

@MikeIzbicki Hmm... the bias convergence depends on the first moment of the distribution, and the (square root of the) variance is also a "first-order" magnitude. I am not sure, then, that this can happen, because it appears it would imply that the moments of the limiting distribution "arise" at convergence rates that are not compatible with each other... I'll think about it though.

— Alecos Papadopoulos

Following comments on my other answer (and looking again at the title of the OP's question!), here is a not very rigorous theoretical exploration of the issue.

We want to determine whether the bias $B(\hat \theta_n) = E(\hat \theta_n) - \theta$ may have a different convergence rate than the square root of the variance,

$B(\hat \theta_n) = O(1/n^{\delta}),\;\;\; \sqrt {\text{Var}(\hat \theta_n)} = O(1/n^{\gamma}), \;\;\gamma \neq \delta \;???$

We have

$B(\hat \theta_n) = O(1/n^{\delta}) \implies \lim n^{\delta}\mathbb E(\hat \theta_n) < K \implies \lim n^{2\delta}[\mathbb E(\hat \theta_n)]^2 < K'$

$\implies [\mathbb E(\hat \theta_n)]^2 = O(1/n^{2\delta}) \tag{1}$

while

$\sqrt {\text{Var}(\hat \theta_n)} = O(1/n^{\gamma}) \implies \lim n^{\gamma}\sqrt{\mathbb E (\hat \theta_n^2) - [\mathbb E(\hat \theta_n)]^2 }<M$

$\implies \lim \sqrt{n^{2\gamma}\mathbb E (\hat \theta_n^2) - n^{2\gamma}[\mathbb E(\hat \theta_n)]^2 }<M$

$\implies \lim n^{2\gamma}\mathbb E (\hat \theta_n^2) - \lim n^{2\gamma}[\mathbb E(\hat \theta_n)]^2 < M' \tag{2}$

We see that $(2)$ may hold if

A) both components are $O(1/n^{2\gamma})$ , in which case we can only have $\gamma = \delta$ .

B) But it may also hold if

$\lim n^{2\gamma}[\mathbb E(\hat \theta_n)]^2 \to 0 \implies [\mathbb E(\hat \theta_n)]^2 = o(1/n^{2\gamma}) \tag{3}$

For $(3)$ to be compatible with $(1)$ , we must have

$n^{2\gamma} < n^{2\delta} \implies \delta > \gamma\tag {4}$

So it appears that in principle it is possible to have the Bias converging at a faster rate than the square root of the variance. But we cannot have the square root of the variance converging at a faster rate than the Bias.
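One standard illustration of this case, not taken from the answer itself, is the MLE of the variance in a normal model, $\hat\sigma^2_n = \frac 1n \sum_i (x_i - \bar x)^2$. Its bias is $-\sigma^2/n = O(1/n)$ while its standard deviation is $\sigma^2\sqrt{2(n-1)}/n = O(1/\sqrt n)$. The sketch below checks this by simulation; the function name `variance_mle_moments`, the choice $\sigma^2 = 1$, and the sample sizes are assumptions made for illustration.

```python
import numpy as np

def variance_mle_moments(sigma2, n, reps=20000, seed=2):
    """Monte Carlo bias and standard deviation of the normal-variance MLE (divisor n)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2_mle = x.var(axis=1, ddof=0)  # ddof=0 gives the MLE, which divides by n
    return s2_mle.mean() - sigma2, s2_mle.std(ddof=1)

sigma2 = 1.0  # hypothetical true variance
for n in (10, 40, 160):
    bias, sd = variance_mle_moments(sigma2, n)
    exact_bias = -sigma2 / n                       # O(1/n)
    exact_sd = sigma2 * np.sqrt(2 * (n - 1)) / n   # O(1/sqrt(n))
    print(f"n={n:4d}  bias={bias:+.4f} (exact {exact_bias:+.4f})  "
          f"sd={sd:.4f} (exact {exact_sd:.4f})")
```

Quadrupling $n$ should cut the bias by about four but the standard deviation only by about two. Because the bias is much smaller than the standard deviation, its simulated value is noisy at larger $n$, so the exact columns are the more reliable guide there.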

— Alecos Papadopoulos

How would you reconcile this with the existence of unbiased estimators like ordinary least squares? In that case, $B(\hat\theta)=0$, but $\sqrt{Var(\hat\theta)} = O(1/\sqrt n)$.

— Mike Izbicki

@MikeIzbicki Is the concept of convergence/big-O applicable in this case? Because here $B(\hat \theta)$ is not "$O()$-anything" to begin with.

— Alecos Papadopoulos

In this case, $\mathbb E\hat\theta=\theta^*$, so $B(\hat\theta) = \lVert \mathbb E \hat\theta - \theta^*\rVert = 0 = O(1) = O(1/n^0)$.

— Mike Izbicki

@MikeIzbicki But also $B(\hat \theta) = O(n)$ or $B(\hat \theta) =O(1/\sqrt{n})$ or any other you care to write down. So which one is the rate of convergence here?

— Alecos Papadopoulos

@MikeIzbicki I have corrected my answer to show that it is possible in principle to have the Bias converging faster, although I still think the "zero-bias" example is problematic.

— Alecos Papadopoulos