Model selection and evaluation
The basic approach is to split the dataset into training, validation and test parts (three parts), but it is usually better to use cross-validation (CV): make a k-fold split of the training data and use each fold in turn for validation, keeping a held-out test set for the final evaluation. This raises questions such as how many folds to use and whether the folds should be of equal size. The main tools covered here are:
- score value of each estimator
- scoring parameter
- CV iteration methods
score value of each estimator
Each estimator (model) you use in the regression phase has a "score" method providing a default evaluation criterion for the problem it is designed to solve. Refer to the posts on regression and clustering for more detailed information.
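As a minimal sketch of this (assuming the iris dataset, which is not specified above), the score method of a fitted classifier returns its default metric, mean accuracy:
from sklearn import datasets, svm
X, y = datasets.load_iris(return_X_y=True)   # assumed example data
clf = svm.SVC(kernel='linear', C=1).fit(X, y)
clf.score(X, y)   # default criterion for classifiers: mean accuracy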
scoring parameter: model-evaluation tools such as cross_val_score and GridSearchCV rely on an internal scoring strategy, which can be changed through the scoring parameter (more details on this page).
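For instance, a sketch of passing scoring to GridSearchCV (the iris data and the parameter grid here are illustrative assumptions, not from the text above):
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
X, y = datasets.load_iris(return_X_y=True)   # assumed example data
param_grid = {'C': [0.1, 1, 10]}              # hypothetical grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, scoring='f1_macro', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # best C and its mean CV f1_macro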
CV iteration
(splitting the dataset into k folds, then in each iteration using 1 fold for testing and the others for training)
other methods: StratifiedKFold, StratifiedShuffleSplit, GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit (see the sketch below)
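As a small sketch of two of these methods (the class labels and group assignments are assumed purely for illustration):
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold
X = np.ones(8)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # class labels: each fold preserves this class ratio
skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # samples from the same group never appear in both train and test
gkf = GroupKFold(n_splits=2)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))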
LOO (Leave One Out)
takes all the samples except one for training, the test set being the single sample left out (for n samples, this produces n different training sets and n different test sets)
LPO (Leave P Out)
for n samples, produces C(n,p) train-test pairs; unlike LeaveOneOut and KFold, the test sets overlap for p > 1
shuffle and split (ShuffleSplit)
the whole dataset is shuffled before each random train/test split is generated
for time series data (TimeSeriesSplit)
Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, TimeSeriesSplit adds all surplus data to the first training partition, which is always used to train the model.
as an example:
from sklearn import datasets, svm   # svm import was missing
from sklearn.model_selection import cross_val_score
X, y = datasets.load_iris(return_X_y=True)   # assuming the iris dataset, as in the scikit-learn user guide
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
#fitting a model and computing the score 5 consecutive times (with different splits each time)
scores
output: array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
output: Accuracy: 0.98 (+/- 0.03)
By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
from sklearn import metrics
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
#use other cross validation strategies by passing a cross validation iterator instead
from sklearn.model_selection import ShuffleSplit
n_samples = X.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, X, y, cv=cv)
#or use an iterable yielding (train, test) splits as arrays of indices
import numpy as np
def custom_cv_2folds(X):
    n = X.shape[0]
    i = 1
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
        yield idx, idx
        i += 1
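The custom iterator can then be passed directly as the cv argument, for example:
cross_val_score(clf, X, y, cv=custom_cv_2folds(X))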
#or use the cross_validate function
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
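cross_validate returns a dict of arrays; a quick sketch of inspecting it (the test_* key names follow the scoring list above):
sorted(scores.keys())   # ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
scores['test_recall_macro']   # one recall_macro score per CV fold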
#or CV Iteration
import numpy as np
from sklearn.model_selection import KFold
X=["a", "b" , "c" , "d" ]
kf=KFold(n_splits=2)
for train, test in kf.split(X):
print("%s %s" % (train, test))
output: [2 3] [0 1] [0 1] [2 3]
#or the RepeatedKFold function - similar to the KFold iteration above, repeated n_repeats times
import numpy as np
from sklearn.model_selection import RepeatedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(X):
    print("%s %s" % (train, test))
output:
[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
#or LOO (Leave One Out)
from sklearn.model_selection import LeaveOneOut
X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
output:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
#or LPO (Leave P Out)
from sklearn.model_selection import LeavePOut
X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
output:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
#or shuffle and split
from sklearn.model_selection import ShuffleSplit
X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))
output:
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]
#for time series data:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):   # reusing X from the example above
    print("%s %s" % (train, test))
As an extra note, check this page for an example of model selection with Probabilistic PCA and Factor Analysis (FA).
Have Fun!