Model Selection and Evaluation

It's all about splitting the dataset into training, validation, and test parts (three parts), but it is usually better to use cross-validation (CV): make a k-fold split of the training part, so that each fold takes a turn as the validation set, and keep the test data for the final evaluation. A natural question is how large each fold should be: equal portions or not, and if not, based on what. The main topics (a plain train/validation/test split sketch follows right after this list):

  • score value of each estimator
  • scoring parameter
  • CV iteration methods
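Before going through the CV strategies, here is a minimal sketch of the plain three-way split mentioned above. It assumes scikit-learn's train_test_split and the iris data purely for illustration:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
# carve out the final test set first, then split the rest into train and validation parts
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly a 60% / 20% / 20% split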

score value of each estimator

each estimator (model) that you decide to use in the regression phase has a "score" method providing a default evaluation criterion for the problem it is designed to solve. Refer to the posts on regression and clustering for more detailed information.
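As a minimal sketch (assuming, just for illustration, an SVC classifier and the iris data), the default criterion returned by score is mean accuracy for classifiers and R² for regressors:

from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
clf = SVC(kernel='linear', C=1).fit(X, y)
print(clf.score(X, y))   # mean accuracy on the given data - the classifier's default criterion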

scoring parameter: cross_val_score (and GridSearchCV) are model-evaluation tools relying on an internal scoring strategy, which can be changed through the scoring parameter; this is covered in more detail on this page.
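A minimal sketch of GridSearchCV with an explicit scoring parameter (the tiny parameter grid and the iris data are assumptions for illustration only):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)   # assumption: iris as a toy dataset
param_grid = {'C': [0.1, 1, 10]}             # assumption: an illustrative grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, scoring='f1_macro', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best parameters by cross-validated f1_macro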

CV iteration

(splitting the dataset into k folds, then in each iteration using one fold for testing and the other folds for training)

other methods: StratifiedKFold, StratifiedShuffleSplit, GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
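A short sketch of two of these splitters (the toy labels and groups below are assumptions, only to show the resulting indices):

import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

X = np.ones(8)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # class labels: StratifiedKFold preserves class ratios per fold
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])    # group ids: GroupKFold never splits a group across train and test

for train, test in StratifiedKFold(n_splits=2).split(X, y):
    print("stratified:", train, test)

for train, test in GroupKFold(n_splits=2).split(X, y, groups=groups):
    print("grouped:", train, test)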

LOO (Leave One Out)

taking all the samples except one, the test set being the single sample left out (for n samples, this yields n different training sets and n different test sets)

LPO (Leave P Out)

for n samples, this produces C(n, p) train-test pairs; unlike LeaveOneOut and KFold, the test sets will overlap for p > 1

shuffle and split (ShuffleSplit)

the whole dataset is shuffled first, then split into a train/test pair; this is repeated for the requested number of splits

for time series data (TimeSeriesSplit)

Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.

as an example:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)  # the iris dataset, as in the scikit-learn user guide
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
# fitting a model and computing the score 5 consecutive times (with different splits each time)
scores

output: array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

output: Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

from sklearn import metrics
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

# use other cross validation strategies by passing a cross validation iterator instead
from sklearn.model_selection import ShuffleSplit

n_samples = X.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, X, y, cv=cv)

# or use an iterable yielding (train, test) splits as arrays of indices
import numpy as np

def custom_cv_2folds(X):
    n = X.shape[0]
    i = 1
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
        yield idx, idx
        i += 1

custom_cv = custom_cv_2folds(X)
cross_val_score(clf, X, y, cv=custom_cv)

# or use the cross_validate function
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
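The returned scores object is a dict of arrays (one entry per fold); with the scoring list above and a recent scikit-learn (where return_train_score defaults to False) it holds the keys shown below:

print(sorted(scores.keys()))
# ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
print(scores['test_recall_macro'])   # one recall_macro value per CV fold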

# or CV Iteration
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[0 1] [2 3]

# or RepeatedKFold - repeats K-Fold n_repeats times, similar to the CV iteration above

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]

# or LOO (Leave One Out)

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

output:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

# or LPO (Leave P Out)
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

output:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

# or shuffle and split
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

output:
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]

# for time series data
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):
    print("%s %s" % (train, test))

As an extra note: check this page for an example of model selection with Probabilistic PCA and Factor Analysis (FA).
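A hedged sketch of that idea (choosing n_components by cross-validated log-likelihood; the random toy data below is an assumption, just to make it runnable):

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 10)   # assumption: random toy data

# the score method of PCA / FactorAnalysis returns an average log-likelihood,
# so cross_val_score can compare candidate n_components directly
for n in [2, 5, 8]:
    pca_ll = cross_val_score(PCA(n_components=n), X).mean()
    fa_ll = cross_val_score(FactorAnalysis(n_components=n), X).mean()
    print(n, pca_ll, fa_ll)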

Have Fun!
