Data mining: How should I go about finding the functional form?


34

I am curious about repeatable processes that can be used to discover the functional form of the function y = f(A, B, C) + error_term, where my only input is a set of observations (y, A, B and C). Please note that the functional form of f is unknown.

Consider the following dataset:

AA BB CC DD EE    FF
== == == == == =====
98 11 66 84 67 10500
14  4  4 72 12   250
54 28 90 73 95  5463
34 95 15 45 75  2581
56 37  0 79 43  3221
68 79  1 65  9  4721
53  2 90 10 12  3095
38 75 41 97 40  4558
29 99 46 28 96  5336
22 63 27 43  4  2196
 4  5 89 78 39   492
10 28 39 59 64  1178
11 59 56 25  5  3413
10 49  7 99 24   431
86 36 84 14 67 10526
80 46 29 96  7  7793
67 71 12 43  3  5411
14  6  3  2 95   236
99 62 56 81 26 13334
56  4 72 65 33  3495
51 40 62 11 52  518?
29 77 80  2 54  7001
42 32  4 17 72  1926
44 45 30 25  5  3360
 6  3 65 16 87   288

In this example, assume that we know that FF = f(AA, BB, CC, DD, EE) + error term, but we are not sure about the functional form of f(...).

What procedure / what methods would you use to find the functional form f(...)?

(Bonus points: What is your best guess of the definition of f given the data above? :-) And yes, there is a "correct" answer that will produce an R^2 greater than 0.99.)


1
@OP: IrishStat's comments below remind me that, without some knowledge of how your independent variables relate to each other and/or to the dependent variable, you are, in principle, left "up a creek without a paddle". For example, if FF were "combustion yield", AA were the amount of fuel and BB were the amount of oxygen, then you would look for an interaction term between AA and BB.
पीट

@Pete: Interaction terms are entirely possible. I hope I have not ruled them out by stating my question poorly.
knorv

2
@knorv: That is no problem (and I would even call it realistic in a real-life setting), just see my answer below.
vonjd

3
Pete: An infinite number of functions will fit the data with R^2 >= 0.99. One would want to find the one with the best performance-to-complexity ratio (and, of course, out-of-sample fit). Sorry for not writing that requirement out; I thought it was obvious :-)
knorv

1
Also, now that this question has been answered so well, it would be good to know whether the data was indeed generated by one of the functions given below.
n

Answers:


29

To find the best-fitting functional form for the data (so-called free-form or symbolic regression), try this tool - to the best of my knowledge it is the best available (at least I am very excited about it) ... and it's free :-)

http://creativemachines.cornell.edu/eureqa

Edit: I gave it a shot with Eureqa and I would go for:

FF = AA + AA^2 + BB*CC

with R^2 = 0.99988

I would call this a perfect fit (Eureqa gives other, better-fitting solutions, but these are also a bit more complicated; Eureqa favours this one, so I chose it) - and Eureqa did all of this for me in a matter of seconds on a normal laptop ;-)
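(For readers who cannot use Eureqa: below is a minimal sketch of the same idea using the open-source rgp package mentioned in the comments underneath this answer. It assumes rgp's documented symbolicRegression() interface and that the data above is saved as "CV.csv"; the search is stochastic, so results will vary from run to run.)

# Symbolic regression via genetic programming with the rgp package -- a rough,
# free stand-in for Eureqa. Assumes the data above is saved as "CV.csv".
library(rgp)

dat <- read.csv("CV.csv", header = TRUE)

set.seed(42)  # the search is stochastic
run <- symbolicRegression(
  FF ~ AA + BB + CC + DD + EE,
  data          = dat,
  functionSet   = functionSet("+", "-", "*"),  # arithmetic building blocks only
  stopCondition = makeTimeStopCondition(60)    # search for 60 seconds
)

# Show the best expression found
best <- run$population[[which.min(run$fitnessValues)]]
print(best)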


6
Just for reference: Eureqa uses genetic programming to find solutions to the symbolic regression problem.
Thies Heidecke

10
+1 Impressive performance for a mindless, automated tool!
whuber

1
@vonjd, the link now says "free 30-day trial". Do you know of a free alternative?
denis

3
@denis: You could try this R package: cran.r-project.org/web/packages/rgp/index.html - but it is not as sophisticated as the software above (not yet?)
vonjd

3
Eureqa is still free for academic / non-profit organizations.
Inverse

25

R^2 alone is not a good measure of goodness of fit, but let us not dwell on that here, except to note that parsimony is valued in modeling.

To that end, note that standard techniques of exploratory data analysis (EDA) and regression (but not stepwise or other automated procedures) suggest using a linear model of the form

f = a + b*c + a*b*c + constant + error

Using OLS, this already achieves a high R^2, but examining the fit of f against a, b*c, and a*b*c suggests that a enters quadratically, leading to the more parsimonious model

f = a^2 + b*c + constant + error

with a root MSE of under 34 and an adjusted R^2 of 0.9999. The estimated coefficients of 1.0112 and 0.988 suggest the data may be artificially generated with the formula

f = a^2 + b*c + 50

plus a little normally distributed error of SD approximately equal to 50.
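(For concreteness, a minimal R sketch of this fit; none of this code is from the original answer. It assumes the data above is in "CV.csv" and renames the columns to the lower-case letters used in this answer.)

# OLS fit of the EDA-suggested model f = a^2 + b*c + constant.
dat <- read.csv("CV.csv", header = TRUE)       # hypothetical file with the data above
names(dat) <- c("a", "b", "c", "d", "e", "f")  # match this answer's notation

fit <- lm(f ~ I(a^2) + I(b * c), data = dat)
summary(fit)  # expect coefficients near 1.0112 and 0.988, adjusted R^2 ~ 0.9999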

Edit

In response to @knorv's hints, I continued the analysis. To do so I used the techniques that had been successful so far, beginning with inspecting scatterplot matrices of the residuals against the original variables. Sure enough, there was a clear indication of correlation between a and the residuals (even though OLS regression of f against a, a^2, and b*c did not indicate a was "significant"). Continuing in this vein I explored all correlations between the quadratic terms a^2, ..., e^2, a*b, a*c, ..., d*e and the new residuals and found a tiny but highly significant relationship with b^2. "Highly significant" means that all this snooping involved looking at some 20 different variables, so my criterion for significance on this fishing expedition was approximately 0.05/20 = 0.0025: anything less stringent could easily be an artifact of the probing for fits.

This has something of the flavor of a physical model in that we expect, and therefore search for, relationships with "interesting" and "simple" coefficients. So, for instance, seeing that the estimated coefficient of b^2 was -0.0092 (between -0.005 and -0.013 with 95% confidence), I elected to use -1/100 for it. If this were some other dataset, such as observations of a social or political system, I would make no such changes but just use the OLS estimates as-is.

Anyway, an improved fit is given by

f = a + a^2 + b*c - b^2/100 + 30.5 + error

with mean residual 0, standard deviation 26.8, all residuals between -50 and +43, and no evidence of non-normality (although with such a small dataset the errors could even be uniformly distributed and one couldn't really tell the difference). The reduction in residual standard deviation from around 50 to around 25 would often be expressed as "explaining 75% of the residual variance."
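(Continuing the sketch above, here is one hedged way to reproduce the snooping and the refit in R: screen every quadratic term against the current residuals at the 0.05/20 cutoff, then refit with the surviving terms. Variable names carry over from the previous sketch.)

# Screen all 15 quadratic/interaction terms against the residuals of `fit`.
res <- residuals(fit)
quad <- with(dat, data.frame(
  a2 = a^2, b2 = b^2, c2 = c^2, d2 = d^2, e2 = e^2,
  ab = a*b, ac = a*c, ad = a*d, ae = a*e,
  bc = b*c, bd = b*d, be = b*e, cd = c*d, ce = c*e, de = d*e))
pvals <- sapply(quad, function(x) cor.test(x, res)$p.value)
names(pvals)[pvals < 0.05 / 20]  # b2 should be among the few survivors

# Refit with the extra terms, as in the improved model above.
fit2 <- lm(f ~ a + I(a^2) + I(b * c) + I(b^2), data = dat)
summary(fit2)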


I make no claim that this is the formula used to generate the data. The residuals are large enough to allow some fairly large changes in a few of the coefficients. For instance, 95% CIs for the coefficients of a, b^2, and the constant are [-0.4, 2.7], [-0.013, -0.003], and [-7, 61] respectively. The point is that if any random error has actually been introduced in the data-generation procedure (and that is true of all real-world data), that would preclude definitive identification of the coefficients (and even of all the variables that might be involved). That's not a limitation of statistical methods: it's just a mathematical fact.

BTW, using robust regression I can fit the model

f = 1.0103*a^2 + 0.99493*b*c - 0.007*b^2 + 46.78 + error

with residual SD of 27.4 and all residuals between -51 and +47: essentially as good as the previous fit but with one less variable. It is more parsimonious in that sense, but less parsimonious in the sense that I haven't rounded the coefficients to "nice" values. Nevertheless, this is the form I would usually favor in a regression analysis absent any rigorous theories about what kinds of values the coefficients ought to have and which variables ought to be included.
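(The answer does not say which robust method was used; as one hedged possibility, an M-estimator via MASS::rlm gives a fit of the same form, continuing from the sketches above.)

# Robust regression of the same three-term model; rlm() is only one of several
# robust methods that could have produced the coefficients quoted above.
library(MASS)
rfit <- rlm(f ~ I(a^2) + I(b * c) + I(b^2), data = dat)
coef(rfit)           # compare with 1.0103, 0.99493, -0.007, 46.78
sd(residuals(rfit))  # compare with the quoted residual SD of 27.4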

It is likely that additional strong relationships are lurking here, but they would have to be fairly complicated. Incidentally, taking data whose original SD is 3410 and reducing their variation to residuals with an SD of 27 is a 99.99384% reduction in variance (the R^2 of this new fit). One would continue looking for additional effects only if the residual SD is too large for the intended purpose. In the absence of any purpose besides second-guessing the OP, it's time to stop.


1
Good work! So far this seems like the best answer.
Zach

@whuber: Nice work -- you're getting close! :-) It is true that the data was artificially generated with a formula plus an error term. But the formula is not exactly the one you've found - you're missing out on a couple of terms. But you're close and you're currently in the lead :-)
knorv

4
@whuber I already gave my +1, but I'd like to add that it is very instructive to read through your approach to such a problem. You're worth the bounty in any case.
chl

1
@bill I did try it, early on. I trust my explanation provides room for your proposal as well as the two I have included. There's more than one right answer. I continued the analysis and included those extra terms because it was clear there are patterns in the residuals and that accounting for them materially reduces the residual variance. (I will confess that I have spent very little time and attention on this, though: the total time for the initial analysis, including writing the answer, was 17 minutes. More time often translates to more insight...)
whuber

1
@naught It would be interesting to begin with such a long formula and apply an Elastic Net (or some similar variable-elimination algorithm). I suspect the success of any such approach would depend on keeping the number of functions relatively small and including the correct functions among them--which sounds more like a matter of good luck and good guessing than of any principled investigation. But if blindly throwing a huge number of functional forms at the problem results in success, that would be worth knowing.
whuber

5

Your question needs refining because the function f is almost certainly not uniquely defined by the sample data. There are many different functions which could generate the same data.

That being said, Analysis of Variance (ANOVA) or a "sensitivity study" can tell you a lot about how your inputs (AA..EE) affect your output (FF).

I just did a quick ANOVA and found a reasonably good model: FF = 101*A + 47*B + 49*C - 4484. The function does not seem to depend on DD or EE linearly. Of course, we could go further with the model and add quadratic and mixture terms. Eventually you will have a perfect model that over-fits the data and has no predictive value. :)
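(A minimal sketch of this kind of fit in R, assuming the data above is in "CV.csv" with its original column names; the exact call used for the quoted model is an assumption.)

# First-order linear model in all five inputs, as described above.
dat <- read.csv("CV.csv", header = TRUE)  # hypothetical file with the data above
lfit <- lm(FF ~ AA + BB + CC + DD + EE, data = dat)
summary(lfit)  # DD and EE should show no significant linear effect
anova(lfit)    # sequential ANOVA table for the same fit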


@Pete As you said, you could add quadratic, cubic, quartic ... and mixture terms, but that would be just nonsense. There is nonsense and there is non-sensical nonsense, but the most non-sensical nonsense is "statistical nonsense".
IrishStat

2
@IrishStat it is not generally nonsense to add mixture and higher order terms; only bad when it is done without restraint and without regard to theory
Pete

2
@Pete. Correct! The absence of a pre-existing theory makes it silly.
IrishStat

@Pete: What R^2 do you get for your model?
knorv

@knorv: I don't quite remember but it was > 0.90. When plotted about the regression line the points appeared to have a little bit of an "S"/cubic shape so I'm guessing the function "f" was a mathematical creation where someone typed 100A + 50(B+C) + higher order terms involving D & E.
Pete

3

Broadly speaking, there's no free lunch in machine learning:

In particular, if algorithm A outperforms algorithm B on some cost functions, then loosely speaking there must exist exactly as many other functions where B outperforms A

/edit: also, a radial SVM with C = 4 and sigma = 0.206 easily yields an R^2 of 0.99. Extracting the actual equation used to derive this dataset is left as an exercise to the class. Code is in R.

setwd("~/wherever")  # directory containing CV.csv
library(caret)       # provides train() and R2(); svmRadial also needs kernlab

Data <- read.csv("CV.csv", header = TRUE)
FL <- as.formula("FF ~ AA + BB + CC + DD + EE")

# Fit a radial-basis SVM with the tuning parameters fixed (no grid search)
model <- train(FL, data = Data, method = "svmRadial",
               tuneGrid = expand.grid(.C = 4, .sigma = 0.206))

# In-sample R^2 of the fitted model
R2(predict(model, Data), Data$FF)

-2

All models are wrong but some are useful: G.E.P. Box

Y(T) = -4709.7
       + 102.60*AA(T) - 17.0707*AA(T-1)
       + 62.4994*BB(T) + 41.7453*CC(T)
       + 965.70*ZZ(T)

where ZZ(T) = 0 for T = 1,...,10 and ZZ(T) = 1 otherwise.

There appears to be a "lagged relationship" between Y and AA, and a shift in the mean for observations 11-25.

Curious results if this is not chronological or spatial data.
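(For concreteness, a sketch of how such a lag-plus-level-shift model could be fit in R; the construction of AA(T-1) and ZZ(T) below is an assumption about how the quoted model was specified, and it is only meaningful if the row order matters, which the OP denies below.)

# Lagged regression with a level-shift dummy for rows 11 onward.
dat <- read.csv("CV.csv", header = TRUE)   # hypothetical file with the data above
n <- nrow(dat)
dat$AAlag1 <- c(NA, dat$AA[-n])            # AA(T-1); the first row has no lag
dat$ZZ     <- as.numeric(seq_len(n) > 10)  # 0 for T = 1..10, 1 otherwise

tfit <- lm(FF ~ AA + AAlag1 + BB + CC + ZZ, data = dat)
coef(tfit)  # compare with the coefficients quoted above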


@IrishStat What is "G.E.P. Box"?
knorv

IrishStat: The data is not chronological. So the ordering of the observations is not of importance. The shift in the mean for observations #11-25 is merely a side-effect on how I retrieved the dataset.
knorv

1
@IrishStat: I meant that I just happened to retrieve the records in a certain order (think ORDER BY). The rows have no inherent special order. So you can safely re-arrange them without losing any information. Sorry if I confused you :-)
knorv

1
IrishStat: The dataset is unordered. The AA(T-1) term in your equation makes no sense in this context.
naught101

2
@naught That is correct. It means that the finding of any "significant" coefficient for the lagged variable AA(T-1) or of any "mean shift" introduces spurious variables: overfitting. What is interesting in this example is that although I have tended to think of overfitting as producing optimistically (and incorrectly) high R^2 values, in this circumstance it has produced a huge increase in the residual variance, because it has not found several important variables, either.
whuber

-3

An R-square of 97.2%:

Estimation/Diagnostic Checking for Variable Y
  X1  AAS
  X2  BB
  X3  BBS
  X4  CC

Number of Residuals (R)       = n                  25
Number of Degrees of Freedom  = n-m                20
Residual Mean                 = Sum R / n          -.141873E-05
Sum of Squares                = Sum R^2            .775723E+07
Variance                      = SOS/(n)            310289.
Adjusted Variance             = SOS/(n-m)          387861.
Standard Deviation RMSE       = SQRT(Adj Var)      622.785
Standard Error of the Mean    = Standard Dev/(n-m) 139.259
Mean / its Standard Error     = Mean/SEM           -.101877E-07
Mean Absolute Deviation       = Sum(ABS(R))/n      455.684
AIC Value (uses var)          = n*ln(var)+2m       326.131
SBC Value (uses var)          = n*ln(var)+m*ln(n)  332.226
BIC Value (uses var)          = see Wei p.153      340.388
R Square                      =                    .972211
Durbin-Watson Statistic       = Sum[(A(T)-A(T-1))^2]/Sum[A(T)^2]  1.76580

MODEL COMPONENT                    LAG    COEFF      STANDARD   P       T
                                  (BOP)              ERROR      VALUE   VALUE

1  CONSTANT                              -.381E+04   466.       .0000   -8.18

   INPUT SERIES X1 AAS (AA SQUARED)
2  Omega (input) - Factor #1         0    .983       .410E-01   .0000   23.98

   INPUT SERIES X2 BB (BB AS GIVEN)
3  Omega (input) - Factor #2         0    108.       14.9       .0000    7.27

   INPUT SERIES X3 BBS (BB SQUARED)
4  Omega (input) - Factor #3         0   -.577       .147       .0008   -3.93

   INPUT SERIES X4 CC (CC AS GIVEN)
5  Omega (input) - Factor #4         0    49.9       4.67       .0000   10.67

[Residual plot omitted]
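(The output above comes from other software; as a hedged cross-check, the same four-term model can be fit with OLS in R. Coefficients should be comparable, not identical.)

# OLS version of the model reported above: FF on AA^2, BB, BB^2 and CC.
dat <- read.csv("CV.csv", header = TRUE)  # hypothetical file with the data above
fit4 <- lm(FF ~ I(AA^2) + BB + I(BB^2) + CC, data = dat)
summary(fit4)  # compare with the COEFF column and R Square = .972 above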

Licensed under cc by-sa 3.0 with attribution required.