How to split/partition a dataset into training and test datasets for, e.g., cross validation?

Question 1

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

Question 2

If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

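There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition (note that the same row can then appear more than once, and in both the training and the test set):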

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
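As a minimal sketch of the two sklearn approaches mentioned above, the following compares KFold with StratifiedKFold. The toy data (20 samples, two balanced classes) and the variable names are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (an assumption for illustration): 20 samples, 2 balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# Plain k-fold: every sample lands in exactly one test fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# Stratified k-fold: each test fold keeps the 50/50 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

print(kfold_test_sizes)     # [4, 4, 4, 4, 4]
print(positives_per_fold)   # [2, 2, 2, 2, 2]
```

With plain KFold the class counts inside a fold can vary from fold to fold; the stratified variant is usually the safer choice when classes are imbalanced.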

Question 3

There is another option that only requires scikit-learn. As the scikit-learn documentation describes, you can use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way the labels stay in sync with the data you are splitting into training and test sets.
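One way to convince yourself that the pairing survives the shuffle is the small check below. It is only a sketch: it relies on the toy data from this answer, where row i is [2*i, 2*i + 1] and has label i, and the names train_in_sync and test_in_sync are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), list(range(5))
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42)

# Row i of the original data is [2*i, 2*i + 1], so the first column of a row
# divided by 2 must equal its label if data and labels were shuffled together.
train_in_sync = all(int(row[0]) // 2 == label
                    for row, label in zip(data_train, labels_train))
test_in_sync = all(int(row[0]) // 2 == label
                   for row, label in zip(data_test, labels_test))
print(train_in_sync, test_in_sync)  # True True
```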

Question 4

Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters give 70% of the data to training and 15% each to the test and validation sets: the first call holds out 30%, and the second call splits that 30% in half. Hope this helps.
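The resulting sizes can be checked on synthetic data. This is a sketch under the assumption of 100 samples with 4 random features; the random arrays stand in for whatever get_my_X() and get_my_y() would return.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real data (assumption): 100 samples, 4 features
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# First split: 70% training, 30% held out
x_train, x_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
# Second split: the held-out 30% is divided equally into test and validation
x_test, x_val, y_test, y_val = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))  # 70 15 15
```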

Question 5

As the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Question 6

You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
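A comparable stratified split can also be sketched with scikit-learn's train_test_split and its stratify parameter. This is an alternative to the hand-written function above, not the author's method; the imbalanced toy labels (80 versus 20 samples) are an assumption chosen so that the proportions are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (assumption): 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Both parts keep the original 80/20 class ratio
print(np.bincount(y_train))  # [60 15]
print(np.bincount(y_test))   # [20  5]
```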

Question 7

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
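A short sketch of that idea: shuffle first, partition into folds, then use each fold once as the test set. The compact partition below is a rewrite with the same interleaving behaviour as the function above; the sample list and the fixed seed are assumptions for illustration.

```python
import random

def partition(seq, chunks):
    """Split seq into `chunks` interleaved lists of near-equal size."""
    return [list(seq[i::chunks]) for i in range(chunks)]

samples = list(range(10))
random.seed(0)            # fixed seed only to make the run repeatable
random.shuffle(samples)   # randomize before partitioning

folds = partition(samples, 5)
for k, test_fold in enumerate(folds):
    # Fold k is the test set; all other folds together form the training set
    train = [s for j, fold in enumerate(folds) if j != k for s in fold]
    print(k, sorted(test_fold), len(train))
```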

Question 8

Here is code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Question 9

Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

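n_train = int(np.round(X.shape[0] * 0.8))
# Either draw the training indices without replacement...
training_idx = np.random.choice(X.shape[0], n_train, replace=False)
# ...or, equivalently, take the first n_train entries of a random permutation:
# training_idx = np.random.permutation(np.arange(X.shape[0]))[:n_train]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)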

Question 10

After doing some reading and taking into account the (many...) different ways of splitting the data into train and test, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

1. shuffle the whole matrix arr and then split the data into train and test
2. shuffle the indices and then apply them to x and y to split the data
3. same as method 2, but done in a more efficient way
4. use a pandas dataframe to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.

4 अलग-अलग विधियों के लिए कोड जो मैंने समयबद्ध किया:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate rows
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

As for the times, the minimum time to execute out of 3 repetitions of 1000 loops is:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds
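A measurement of this kind (minimum over 3 repetitions of 1000 loops) can be sketched with the standard timeit module. The answer does not show its timing harness, so the wrapper function method_3 and the use of timeit.repeat are assumptions; absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]
sample = int(0.7 * len(arr))

def method_3():
    """Split X and Y using one random permutation of the row indices."""
    idx = np.random.permutation(arr.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# 3 repetitions of 1000 calls each; the minimum is the least noisy estimate
best = min(timeit.repeat(method_3, repeat=3, number=1000))
print(f"Method 3: {best} seconds")
```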

I hope that's helpful!

Question 11

You will likely need not only a train/test split but also a validation set, to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2]  ary[2:3]  ary[3:]

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
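The same cut-point idea can be sketched on a plain NumPy array, which avoids depending on how np.split interacts with a pandas DataFrame. The toy array of 100 rows and the seeded generator are assumptions for illustration.

```python
import numpy as np

data = np.arange(200).reshape(100, 2)    # toy array (assumption): 100 rows
rng = np.random.default_rng(1)
shuffled = rng.permutation(data)         # shuffles the rows, returns a copy

n = len(shuffled)
# Cut points at 70% and 90% of the rows give three consecutive pieces
train, valid, holdout = np.split(shuffled, [int(0.7 * n), int(0.9 * n)])
print(len(train), len(valid), len(holdout))  # 70 20 10
```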

Question 12

Split into train, test and validation sets:

import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)

indices = np.random.permutation(x.shape[0])
n_train, n_test = int(x.shape[0] * .9), int(x.shape[0] * .95)
# 90% training, 5% test, 5% validation; the three slices must not overlap
training_idx, test_idx, val_idx = indices[:n_train], indices[n_train:n_test], indices[n_test:]

training, test, val = x[training_idx, :], x[test_idx, :], x[val_idx, :]
print(training, test, val)