Model Selection and Evaluation

It's all about splitting the dataset into training, validation, and test parts (three parts), but it is usually better to use cross-validation (CV): make a k-fold split of the training part, so that each fold takes a turn as the validation set, and keep the test data for the final evaluation. A natural question is how large each fold should be: equal portions or not, and if not, based on what. The main topics (a plain train/validation/test split sketch follows right after this list):

  • score value of each estimator
  • scoring parameter
  • CV iteration methods
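Before going through the CV strategies, here is a minimal sketch of the plain three-way split mentioned above. It assumes scikit-learn's train_test_split and the iris data purely for illustration:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
# carve out the final test set first, then split the rest into train and validation parts
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly a 60% / 20% / 20% split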

score value of each estimator

each estimator (model) that you decide to use in the regression phase has a "score" method providing a default evaluation criterion for the problem it is designed to solve. Refer to the posts on regression and clustering for more detailed information.
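As a minimal sketch (assuming, just for illustration, an SVC classifier and the iris data), the default criterion returned by score is mean accuracy for classifiers and R² for regressors:

from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
clf = SVC(kernel='linear', C=1).fit(X, y)
print(clf.score(X, y))   # mean accuracy on the given data - the classifier's default criterion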

scoring parameter: cross_val_score (and GridSearchCV) are model-evaluation tools relying on an internal scoring strategy, which can be changed through the scoring parameter; this is covered in more detail on this page.
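A minimal sketch of GridSearchCV with an explicit scoring parameter (the tiny parameter grid and the iris data are assumptions for illustration only):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
param_grid = {'C': [0.1, 1, 10]}             # assumption: an illustrative grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, scoring='f1_macro', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best parameters by cross-validated f1_macro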

CV iteration

(splitting the dataset into k folds, then in each iteration using one fold for testing and the other folds for training)

other methods: StratifiedKFold, StratifiedShuffleSplit, GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
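A short sketch of two of these splitters (the toy labels and groups below are assumptions, only to show the resulting indices):

import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

X = np.ones(8)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # class labels: StratifiedKFold preserves class ratios per fold
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])    # group ids: GroupKFold never splits a group across train and test

for train, test in StratifiedKFold(n_splits=2).split(X, y):
    print("stratified:", train, test)

for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
    print("grouped:", train, test)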

LOO (Leave One Out)

taking all the samples except one, the test set being the single sample left out (for n samples, this yields n different training sets and n different test sets)

LPO (Leave P Out)

for n samples, this produces C(n, p) train-test pairs; unlike LeaveOneOut and KFold, the test sets will overlap for p > 1

shuffle and split (ShuffleSplit)

the whole dataset is shuffled first, then split into a train/test pair; this is repeated for the requested number of splits

for time series data (TimeSeriesSplit)

Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.

as an example:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)  # the iris dataset, as in the scikit-learn user guide
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
# fitting a model and computing the score 5 consecutive times (with different splits each time)
scores

output: array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

output: Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

from sklearn import metrics
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

# use other cross validation strategies by passing a cross validation iterator instead
from sklearn.model_selection import ShuffleSplit

n_samples = X.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, X, y, cv=cv)

# or use an iterable yielding (train, test) splits as arrays of indices
import numpy as np

def custom_cv_2folds(X):
    n = X.shape[0]
    i = 1
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
        yield idx, idx
        i += 1

custom_cv = custom_cv_2folds(X)
cross_val_score(clf, X, y, cv=custom_cv)

# or use the cross_validate function
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
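The returned scores object is a dict of arrays (one entry per fold); with the scoring list above and a recent scikit-learn (where return_train_score defaults to False) it holds the keys shown below:

print(sorted(scores.keys()))
# ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
print(scores['test_recall_macro'])   # one recall_macro value per CV fold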

# or CV Iteration
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[0 1] [2 3]

# or RepeatedKFold - repeats K-Fold n_repeats times, similar to the CV iteration above

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

# or LOO (Leave One Out)

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

output:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

# or LPO (Leave P Out)
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

# or shuffle and split
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

output:
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]

# for time series data
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):
    print("%s %s" % (train, test))

As an extra note: check this page for an example of model selection with Probabilistic PCA and Factor Analysis (FA).
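A hedged sketch of that idea (choosing n_components by cross-validated log-likelihood; the random toy data below is an assumption, just to make it runnable):

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 10)   # assumption: random toy data

# the score method of PCA / FactorAnalysis returns an average log-likelihood,
# so cross_val_score can compare candidate n_components directly
for n in [2, 5, 8]:
    pca_ll = cross_val_score(PCA(n_components=n), X).mean()
    fa_ll = cross_val_score(FactorAnalysis(n_components=n), X).mean()
    print(n, pca_ll, fa_ll)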

Have Fun!
