Calculating the new standard deviation from the old standard deviation after a change in the dataset

I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element $x_i$ of the array is replaced by another element $x_j$, then the new mean will be

$\mu_{new}=\mu_{old}+\frac{x_j-x_i}{n}$

The advantage of this approach is that it requires a constant amount of computation regardless of the value of $n$. Is there any approach to calculate $\sigma_{new}$ from $\sigma_{old}$, in the same way that $\mu_{new}$ is calculated from $\mu_{old}$?
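As a minimal sketch of the constant-time mean update described in the question (the function name `replace_mean` and the sample values are made up for illustration):

```python
def replace_mean(mean_old, n, x_old, x_new):
    """Return the mean after replacing x_old with x_new in an array of n values."""
    return mean_old + (x_new - x_old) / n


values = [2.0, 4.0, 4.0, 5.0]
mean_old = sum(values) / len(values)
mean_new = replace_mean(mean_old, len(values), x_old=4.0, x_new=8.0)

# Apply the same replacement directly and compare with a full recomputation.
values[1] = 8.0
assert abs(mean_new - sum(values) / len(values)) < 1e-12
```

The update touches only the two affected values, so its cost does not depend on $n$.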

standard-deviation online

— user
source

Is this homework? A very similar task was set in our mathematical statistics course ...

— krlmlr

@user946850: No, this is not homework. I am writing my thesis on evolutionary algorithms, and I want to use the standard deviation as a measure of population diversity. I am just looking for a more efficient solution.

— user

The SD is the square root of the variance, which is just the mean squared value (adjusted by a multiple of the squared mean, which you already know how to update). Therefore, the same methods used to compute a running mean can be applied without any fundamental change to compute a running variance. In fact, much more sophisticated statistics can be computed on an online basis using the same ideas: see, for example, the threads at stats.stackexchange.com/questions/6920 and stats.stackexchange.com/questions/23481.
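One way to read this comment, sketched here with a hypothetical class `SumSquaresTracker` (not taken from the thread), is to keep the sum and the sum of squares and update both in constant time on each replacement:

```python
import math
import statistics


class SumSquaresTracker:
    """Keeps n, the sum and the sum of squares so a replacement costs O(1)."""

    def __init__(self, values):
        self.n = len(values)
        self.total = sum(values)
        self.total_sq = sum(v * v for v in values)

    def replace(self, x_old, x_new):
        self.total += x_new - x_old
        self.total_sq += x_new * x_new - x_old * x_old

    def std(self):
        # sample variance = (sum of squares - n * mean^2) / (n - 1)
        var = (self.total_sq - self.total * self.total / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


values = [2.0, 4.0, 4.0, 5.0]
tracker = SumSquaresTracker(values)
tracker.replace(4.0, 8.0)
values[1] = 8.0
assert abs(tracker.std() - statistics.stdev(values)) < 1e-9
```

This uncentered form is the simplest to update, but it can lose precision when the values are large relative to their spread.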

— whuber

@whuber: This is mentioned in the Wikipedia article on variance, but along with a note on catastrophic cancellation (or loss of significance) that may occur. Is this overrated, or is it a real problem for the running variance?

— krlmlr

That's a great question. If you accumulate the squares naively, without centering the values beforehand, you can indeed get into trouble. The problem occurs when the numbers are large but their variance is small. For instance, consider a series of accurate measurements of the speed of light in m/s, such as 299792458.145, 299792457.883, 299792457.998, ...: their variance, which is around 0.01, is so much smaller than their squares, which are around $10^{17}$, that a careless computation (even in double precision) would result in zero variance: all the significant digits would disappear.
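The effect can be reproduced with a short sketch. The two function names are illustrative, and the three speed-of-light values are the ones quoted in the comment:

```python
def naive_variance(xs):
    """Uncentered formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return (sq - s * s / n) / (n - 1)


def centered_variance(xs):
    """Two-pass formula: subtract the mean before squaring."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


speeds = [299792458.145, 299792457.883, 299792457.998]
print("naive:   ", naive_variance(speeds))
print("centered:", centered_variance(speeds))
```

In double precision the squares are near $10^{17}$, where adjacent representable numbers are 16 or more apart, so the naive difference cannot resolve a variance of about 0.017. The centered version works with deviations of order 0.1 and does not have this problem.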

— whuber

Answers:

The Wikipedia article has a section on "Algorithms for calculating variance" that shows how to compute the variance when elements are added to your observations. (Recall that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array, then

$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$

Edit: The above formula seems to be wrong, see the comments.

Now, replacing an element means adding one observation and removing another; both can be computed with the formula above. However, keep in mind that numerical stability problems may ensue; the cited article also proposes numerically stable variants.
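A minimal sketch of the add-then-remove idea follows. It assumes the state is kept as the count, the mean and the sum of squared deviations `m2` (so the sample variance is `m2 / (n - 1)`), in the style of the online algorithm in the Wikipedia article, rather than the single-step variance formula. All function names are illustrative.

```python
import statistics


def add_value(n, mean, m2, x):
    """Add x to a sample summarized by (n, mean, m2)."""
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return n, mean, m2


def remove_value(n, mean, m2, x):
    """Remove x from the sample; the inverse of add_value (requires n >= 2)."""
    mean_without = (n * mean - x) / (n - 1)
    m2 -= (x - mean) * (x - mean_without)
    return n - 1, mean_without, m2


def replace_value(n, mean, m2, x_old, x_new):
    """Replace x_old by x_new as a removal followed by an addition."""
    return add_value(*remove_value(n, mean, m2, x_old), x_new)


values = [2.0, 4.0, 4.0, 5.0]
n = len(values)
mean = sum(values) / n
m2 = sum((v - mean) ** 2 for v in values)

n, mean, m2 = replace_value(n, mean, m2, x_old=4.0, x_new=8.0)
values[1] = 8.0
assert abs((m2 / (n - 1)) ** 0.5 - statistics.stdev(values)) < 1e-9
```

Removal subtracts nearly equal quantities when the removed value dominates the spread, so repeated replacements may accumulate rounding error; recomputing from scratch occasionally is a simple safeguard.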

To derive the formula yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of the sample variance, and substitute $\mu_{new}$ by the formula you gave where appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$. In my notation, I assume you replace the element $x_n$ by $x_n'$:

$\begin{eqnarray*} \sigma^2 &=& (n-1)^{-1} \sum_k (x_k - \mu)^2 \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &=& \sum_{k=1}^{n-1} ((x_k - \mu_{new})^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2) \\ &=& \sum_{k=1}^{n-1} ((x_k - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_k - \mu_{old})^2) \\ &&+\ ((x_n' - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_n - \mu_{old})^2) \end{eqnarray*}$

The $x_k$ in the sum turn into something that depends on $\mu_{old}$, but you'll have to work on the equation a little more to derive a neat result. This should give you the general idea.
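One way to finish the calculation is sketched below. It is not part of the original answer; $\delta = x_n' - x_n$ is a shorthand introduced only for this sketch, and the steps use $\sum_{k=1}^{n-1}(x_k - \mu_{old}) = -(x_n - \mu_{old})$, which holds because the deviations of all $n$ old values sum to zero.

$\begin{eqnarray*} \sum_{k=1}^{n-1} ((x_k - \mu_{old} - n^{-1}\delta)^2 - (x_k - \mu_{old})^2) &=& 2 n^{-1} \delta (x_n - \mu_{old}) + (n-1) n^{-2} \delta^2 \\ (x_n' - \mu_{old} - n^{-1}\delta)^2 - (x_n - \mu_{old})^2 &=& 2 (n-1) n^{-1} \delta (x_n - \mu_{old}) + (n-1)^2 n^{-2} \delta^2 \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &=& 2 \delta (x_n - \mu_{old}) + (n-1) n^{-1} \delta^2 \\ &=& (x_n' - x_n)((x_n' - \mu_{new}) + (x_n - \mu_{old})) \end{eqnarray*}$

The last line uses $x_n' - \mu_{new} = (x_n - \mu_{old}) + (n-1) n^{-1} \delta$, so the update needs only the old and new element and the old and new mean.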

— krlmlr
source

The first formula you gave does not seem correct. It implies that if $x_{n+1}$ is smaller or larger than both the new and the old mean, the variance always increases, which does not make sense. It may increase or decrease depending on the distribution.

— Emmet B

@EmmetB: Yes, you're right, this should probably be

$\sigma_{new}^2 = \frac{n-1}{n} \sigma_{old}^2 + \frac{1}{n} (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$

Unfortunately, this renders void my whole discussion from there, but I'm leaving it for historic purposes. Feel free to edit, though.
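The corrected formula can be checked numerically. The sketch below assumes that $n$ is the number of values before appending and that $\sigma^2$ is the sample variance with denominator $n-1$; the function name `append_variance` is illustrative.

```python
import statistics


def append_variance(var_old, mean_old, n, x):
    """Sample variance and mean after appending x to n existing values."""
    mean_new = mean_old + (x - mean_old) / (n + 1)
    var_new = (n - 1) / n * var_old + (x - mean_new) * (x - mean_old) / n
    return var_new, mean_new


values = [2.0, 4.0, 4.0, 5.0]
var_new, mean_new = append_variance(
    statistics.variance(values), statistics.mean(values), len(values), 8.0
)

# Compare with a full recomputation on the extended list.
assert abs(var_new - statistics.variance(values + [8.0])) < 1e-9
```

Under these assumptions the formula agrees with a direct recomputation on this small example.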

— krlmlr

Based on what I think I'm reading in the linked Wikipedia article, you can maintain a "running" standard deviation:

real sum = 0;
int count = 0;
real S = 0;

// Adds x to the running totals and returns the sample standard deviation so far.
real GetRunningStandardDeviation(ref real sum, ref int count, ref real S, real x)
{
   if (count >= 1)
   {
       real oldMean = sum / count;
       sum = sum + x;
       count = count + 1;
       real newMean = sum / count;

       // S is the running sum of squared deviations from the mean
       S = S + (x-oldMean)*(x-newMean);
   }
   else
   {
       // first value: there is no spread yet
       sum = x;
       count = 1;
       S = 0;
   }

   //estimated Variance = (S / (count-1) )
   //estimated Standard Deviation = sqrt(variance)
   if (count > 1)
      return sqrt(S / (count-1) );
   else
      return 0;
}

The article does not maintain a separate running sum and count; it keeps only the running mean. Since in what I'm doing today I keep a count anyway (for statistical purposes), it is more useful to calculate the means each time.

— Ian Boyd
source

Given the original $\bar x$, $s$, and $n$, as well as the change of a given element $x_n$ to $x_n'$, I believe your new standard deviation $s'$ will be the square root of

$s^2 + \frac{1}{n-1}\left(2n\Delta \bar x(x_n-\bar x) +n(n-1)(\Delta \bar x)^2\right),$

where $\Delta \bar x = \bar x' - \bar x$, with $\bar x'$ denoting the new mean.

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.
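A small check of this kind can be sketched as follows. The function name `replace_std` and the sample values are illustrative and not from the original answer.

```python
import math
import statistics


def replace_std(s_old, mean_old, n, x_old, x_new):
    """Sample standard deviation after replacing x_old with x_new (n >= 2 values)."""
    d_mean = (x_new - x_old) / n  # change in the mean
    var_new = s_old ** 2 + (
        2 * n * d_mean * (x_old - mean_old) + n * (n - 1) * d_mean ** 2
    ) / (n - 1)
    return math.sqrt(max(var_new, 0.0))  # guard against tiny negative rounding


values = [2.0, 4.0, 4.0, 5.0]
s_new = replace_std(
    statistics.stdev(values), statistics.mean(values), len(values),
    x_old=4.0, x_new=8.0,
)

# Compare with a full recomputation after the replacement.
values[1] = 8.0
assert abs(s_new - statistics.stdev(values)) < 1e-9
```

The update uses only the old statistics and the two affected values, so it runs in constant time like the mean update in the question.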

— Whistling in the Dark
source

@john / Whistling in the Dark: I liked your answer; it seems to work properly on my small dataset. Is there any mathematical foundation or reference for it? Could you kindly help?

— Alok Chowdhury

The question was all @Whistling in the Dark, I just cleaned it up for the site. You should pose a new question referencing the question and answer here. And also you should upvote this answer if you feel that way.

— John