Stratified train/test split in scikit-learn

Question 1

I need to split my data into a training set (75%) and a test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo)

However, I would like to stratify my training dataset. How do I do that? I have been looking into the StratifiedKFold method, but it does not let me specify the 75% / 25% split and stratify only the training dataset.

Question 2

[update for 0.17]

See the docs for sklearn.model_selection.train_test_split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.25)

[/update for 0.17]

There is a pull request here. But if you prefer, you can simply do train, test = next(iter(StratifiedKFold(...))) and use the resulting train and test indices.
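As a rough sketch of the StratifiedKFold trick in the newer sklearn.model_selection API: the splitter takes n_splits, and the data is passed to split(). With 4 folds each test fold holds a quarter of the data, so the first pair of index arrays is a stratified 75% / 25% split. The toy arrays X and y below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# 4 folds -> each test fold is 1/4 of the data (a 75% / 25% split).
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Take only the first (train, test) pair of index arrays.
train_idx, test_idx = next(skf.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Sizes of both parts and the share of class 1 in each.
print(len(train_idx), len(test_idx))
print(y_train.mean(), y_test.mean())
```

This only works for test fractions of the form 1/k; for any other fraction, train_test_split with stratify is the more direct route.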

Question 3

TL;DR: Use StratifiedShuffleSplit with test_size=0.25

Scikit-learn provides two modules for stratified splitting:

StratifiedKFold: This module is useful as a direct k-fold cross-validation operator: it sets up n_folds training/testing sets such that the class proportions are preserved in both.

Here is some code (taken directly from the documentation above; it uses the old sklearn.cross_validation API, which was removed in scikit-learn 0.20):

>>> skf = cross_validation.StratifiedKFold(y, n_folds=2) #2-fold cross validation
>>> len(skf)
2
>>> for train_index, test_index in skf:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    #fit and predict with X_train/test. Use accuracy metrics to check validation performance

StratifiedShuffleSplit: This module creates a single training/testing set with the class proportions preserved (stratified). Essentially, this is what you want with n_iter=1. You can specify the test size here in the same way as in train_test_split.

Code:

>>> sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
>>> len(sss)
1
>>> for train_index, test_index in sss:
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
>>> # fit and predict with your classifier using the above X/y train/test

Question 4

You can simply do it with the train_test_split() method available in scikit-learn:

from sklearn.model_selection import train_test_split 
train, test = train_test_split(X, test_size=0.25, stratify=X['YOUR_COLUMN_LABEL'])

I have also prepared a short GitHub Gist that shows how the stratify option works:

https://gist.github.com/SHi-ON/63839f3a3647051a180cb03af0f7d0d9
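To see the effect of stratify without leaving this page, here is a small self-contained sketch (not taken from the Gist). The DataFrame, its column names and the 60/40 class mix are all invented for illustration; the point is only that the label shares in both parts match the shares in the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame: one feature column and an imbalanced label column.
df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 60 + ["b"] * 40,
})

# Stratify on the label column so both parts keep the 60% / 40% mix.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["label"], random_state=0
)

# Relative class frequencies in each part.
print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```

Dropping the stratify argument from the call leaves the class mix of each part to chance, which matters most for small or heavily imbalanced datasets.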

Question 5

Here is an example for continuous / regression data (until this issue on GitHub is resolved).

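import numpy as np
from sklearn.model_selection import train_test_split

# y is the continuous target; avoid shadowing the built-ins min() and max().
y_min = np.amin(y)
y_max = np.amax(y)

# num=5 gives 5 edges (4 intervals); this may be too few for larger datasets.
bins     = np.linspace(start=y_min, stop=y_max, num=5)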
y_binned = np.digitize(y, bins, right=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    stratify=y_binned
)

Here start is the minimum and stop is the maximum of your continuous target.
If you do not set right=True, the maximum value ends up more or less in a bin of its own, and the split will always fail because that extra bin contains too few samples.
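One variation on the binning above, offered as a sketch rather than as part of the original answer: by NumPy's documented semantics, np.digitize with right=True assigns a value equal to the lowest edge to bin 0, so a unique minimum can end up alone in its bin in the same way the maximum does with right=False. Passing only the interior edges avoids both edge cases. The random data, n_bins and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # hypothetical features
y = rng.uniform(0.0, 10.0, size=200)    # hypothetical continuous target

n_bins = 4
edges = np.linspace(y.min(), y.max(), num=n_bins + 1)

# Use only the interior edges: everything below the first interior edge
# (including the minimum) goes to bin 0, everything from the last interior
# edge upwards (including the maximum) goes to bin n_bins - 1.
y_binned = np.digitize(y, edges[1:-1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y_binned, random_state=0
)

# Number of samples per bin; every bin needs at least 2 for stratify to work.
print(np.bincount(y_binned))
```

For strongly skewed targets, equal-width bins can still leave some bins nearly empty; quantile-based edges (for example from np.quantile) are a possible alternative.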

Question 6

In addition to the accepted answer by @Andreas Mueller, I just want to add, as @tangy mentioned above:

StratifiedShuffleSplit most closely resembles train_test_split(stratify=y), with these added features:

it stratifies by default
by specifying n_splits, it splits the data repeatedly
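The two points above can be sketched as follows. No stratify argument is passed, because the splitter always stratifies on the y given to split(), and n_splits=3 produces three independent shuffled splits. The toy arrays are invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(80).reshape(40, 2)      # hypothetical features
y = np.array([0] * 32 + [1] * 8)      # hypothetical labels, 80% / 20%

# Three repeated stratified 75% / 25% splits of the same data.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

test_sets = []
for train_idx, test_idx in sss.split(X, y):
    test_sets.append(test_idx)
    # Sizes of both parts and the share of class 1 in the test part.
    print(len(train_idx), len(test_idx), y[test_idx].mean())
```

Unlike the folds of StratifiedKFold, the test sets of different iterations may overlap, since each split is drawn independently.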

Question 7

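from sklearn.model_selection import train_test_split

# df is assumed to be a DataFrame whose target column is named 'y'.
# train_size is 1 - tst_size - vld_size
tst_size = 0.15
vld_size = 0.15

# First split off the validation set, stratifying on the target.
X_train_test, X_valid, y_train_test, y_valid = train_test_split(
    df.drop('y', axis=1), df.y,
    test_size=vld_size, stratify=df.y, random_state=13903)

# The second split sees only (1 - vld_size) of the data, so rescale tst_size
# to keep the test set at 15% of the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X_train_test, y_train_test,
    test_size=tst_size / (1 - vld_size), stratify=y_train_test,
    random_state=13903)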

Question 8

Updating @tangy's answer from above to the current version of scikit-learn, 0.23.2 (StratifiedShuffleSplit documentation).

from sklearn.model_selection import StratifiedShuffleSplit

n_splits = 1  # We only want a single split in this case
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.25, random_state=0)

for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]