Regression and Classification

First we present a short summary of this article and of regression in general. Then we go through each method with an example; each example has a presentation part so you can see the result. This article is based on the regression and classification documentation on the scikit-learn website.

Summary of the regression topic

General pattern:

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
# import the dataset and specify X, y, X_train, y_train, X_test, y_test
# then:
model = linear_model.SomeMethod()   # replace SomeMethod with one of the estimators described below
model.fit(X_train, y_train)
model.coef_
model.intercept_
model.decision_function([[x1, x2]])   # for classifiers: signed distance of a sample to the decision boundary
model.score(X, y)
# for testing the model / making predictions
y_pred = model.predict(X_test)
mean_squared_error(y_test, y_pred)
r2_score(y_test, y_pred)

To choose among the models, use the following guide (all methods are described later in this post):

If coefficient values are unstable (they change a lot with small changes in the data):

Ridge regression (with higher alpha), RidgeCV, Lasso

For high-dimensional datasets with many collinear features (these all yield sparse coefficients):

LassoCV (with alpha), LassoLarsCV (with alpha), Least Angle Regression (Lars, similar to Lasso, with n_nonzero_coefs)

Orthogonal Matching Pursuit (OMP) >> OrthogonalMatchingPursuit and orthogonal_mp methods

For outliers or nonlinear relationships: Robustness regression, PolynomialFeatures(degree=2) (also refer to the preprocessing topic)

For joint feature selection across several related targets (sparse coefficients shared across tasks):

MultiTaskLasso (with alpha)

For one-dimensional regression:

linear regression, BayesianRidge, Automatic Relevance Determination Regression (ARD)

For classification:

Logistic Regression, Stochastic Gradient Descent (SGDClassifier(loss, penalty, max_iter))

Perceptron(tol=1e-3, random_state=0), PassiveAggressiveClassifier(max_iter=1000, random_state=0, tol=1e-3)

Steps:

# First of all, load your dataset and extract the features (columns) you need, assigning them to X and y as below. In a real case you load a CSV file, set y (the target variable) and X (the data), then X_train, y_train, and so on...

import numpy as np
# Use only one feature for X
X = X[:, np.newaxis, 2]
# Split the data into training/testing sets; with 100 samples, indices 0-79 go to training and 80-99 to testing
X_train = X[:-20]
X_test = X[-20:]
# Split the targets into training/testing sets. You can refer to the post about "model selection and evaluation" to see which method you can choose for splitting the dataset
y_train = y[:-20]
y_test = y[-20:]
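As an alternative to slicing by position, here is a minimal sketch using train_test_split from sklearn.model_selection (the placeholder data below is made up for illustration; in practice pass your own X and y):

# a minimal sketch: random 80/20 split with a fixed seed instead of slicing by position
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(200).reshape(100, 2)   # placeholder data: 100 samples, 2 features
y = np.arange(100)                   # placeholder targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)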

In the future we will build a project, use all of these methods with both the presentation part and the coefficients, draw visual and numerical conclusions, and write a report. For this post we just focus on the basics of learning and coding.

1--- LinearRegression

LinearRegression takes arrays X and y in its fit method and stores the coefficients of the linear model in its coef_ member:

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])   # in a real case: reg.fit(X_train, y_train)
output: LinearRegression()
reg.coef_
output: array([0.5, 0.5])
# evaluate on the test split prepared earlier
y_pred = reg.predict(X_test)
mean_squared_error(y_test, y_pred)
r2_score(y_test, y_pred)

2--- Ridge regression and classification

The complexity parameter alpha >= 0, controls the amount of shrinkage: the larger the value of alpha, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.

The RidgeClassifier can be significantly faster than e.g. LogisticRegression when there is a high number of classes.

A- usage in regression
>>> from sklearn import linear_model
>>> reg = linear_model.Ridge(alpha=.5)
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
output: Ridge(alpha=0.5)
>>> reg.coef_
output: array([0.34545455, 0.34545455])
>>> reg.intercept_
output: 0.13636...
B- setting alpha with built-in cross-validation (RidgeCV)
>>> import numpy as np
>>> from sklearn import linear_model
>>> reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
>>> reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
output: RidgeCV(alphas=array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]))
>>> reg.alpha_
output: 0.01
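RidgeClassifier was mentioned above but not shown; here is a minimal sketch on a synthetic two-class problem (the make_classification data is chosen only for illustration):

# a minimal sketch: RidgeClassifier on a synthetic two-class problem
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = RidgeClassifier(alpha=1.0)
clf.fit(X, y)
print(clf.predict(X[:5]))   # predicted class labels for the first five samples
print(clf.score(X, y))      # mean accuracy on the training data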

3--- Lasso

including: LassoCV, LassoLarsCV, Multi-task Lasso, LassoLars, Elastic-Net

Lasso
>>> from sklearn import linear_model
>>> reg = linear_model.Lasso(alpha=0.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
output: Lasso(alpha=0.1)
>>> reg.predict([[1, 1]])
output: array([0.8])

More examples are available in the "Lasso and Elastic Net for Sparse Signals" and "Compressive sensing" examples.

LassoLarsCV

For high-dimensional datasets with many collinear features, LassoCV is most often preferable. However, LassoLarsCV has the advantage of exploring more relevant values of alpha parameter, and if the number of samples is very small compared to the number of features, it is often faster than LassoCV.

Use the Akaike information criterion (AIC), the Bayes information criterion (BIC) and cross-validation to select an optimal value of the regularization parameter alpha of the Lasso estimator.
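As a minimal sketch of these three selection strategies on the diabetes dataset (the dataset and cv value are chosen only for illustration):

# a minimal sketch: select alpha by cross-validation (LassoCV, LassoLarsCV) or by an information criterion (LassoLarsIC)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
X, y = load_diabetes(return_X_y=True)
print(LassoCV(cv=5).fit(X, y).alpha_)                  # alpha chosen by coordinate-descent CV
print(LassoLarsCV(cv=5).fit(X, y).alpha_)              # alpha chosen by LARS-based CV
print(LassoLarsIC(criterion='bic').fit(X, y).alpha_)   # alpha chosen by BIC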

Multi-task Lasso - Joint feature selection with multi-task Lasso
import numpy as np
from sklearn.linear_model import MultiTaskLasso, Lasso
rng = np.random.RandomState(42)
# Generate some 2D coefficients with sine waves with random frequency and phase
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2 * np.pi, n_tasks)
for k in range(n_relevant_features):
    coef[:, k] = np.sin((1. + rng.randn(1)) * times + 3 * rng.randn(1))
X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)
coef_lasso_ = np.array([Lasso(alpha=0.5).fit(X, y).coef_ for y in Y.T])
coef_multi_task_lasso_ = MultiTaskLasso(alpha=1.).fit(X, Y).coef_
Elastic-Net

ElasticNet is a linear regression model trained with both l1- and l2-norm regularization of the coefficients. This combination allows learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. We control the convex combination of l1 and l2 with the l1_ratio parameter. Refer to "Lasso and Elastic Net for Sparse Signals" above for a fuller example in addition to the sketches below:
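A minimal sketch of a plain ElasticNet fit (the toy data mirrors the Lasso example above; the l1_ratio value is just for illustration):

# a minimal sketch: ElasticNet mixes l1 and l2 penalties via l1_ratio (1.0 is pure Lasso, 0.0 is pure Ridge)
from sklearn.linear_model import ElasticNet
reg = ElasticNet(alpha=0.1, l1_ratio=0.7)
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
print(reg.coef_)                   # fitted coefficients; some can be exactly zero
print(reg.predict([[1.5, 1.5]]))   # prediction for a new point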

Multi-task Elastic-Net:

>>> from sklearn import linear_model
>>> clf = linear_model.MultiTaskElasticNet(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
output: MultiTaskElasticNet(alpha=0.1)
>>> print(clf.coef_)
output: [[0.45663524 0.45612256]
[0.45663524 0.45612256]]
>>> print(clf.intercept_)
output: [0.0872422 0.0872422]

Least Angle Regression (Lars)

Least Angle Regression (LARS) is a regression algorithm for high-dimensional data. It is similar to forward stepwise regression: at each step, it finds the feature most correlated with the target. When there are multiple features having equal correlation, instead of continuing along the same feature, it proceeds in a direction equiangular between the features.

>>> from sklearn import linear_model
>>> reg = linear_model.Lars(n_nonzero_coefs=1)
>>> reg.fit([[-1, 1], [0, 0], [1, 1]], [-1.1111, 0, -1.1111])
output: Lars(n_nonzero_coefs=1)
>>> print(reg.coef_)
output: [ 0. -1.11...]

Refer to the Lasso section above.

LassoLars

>>> from sklearn import linear_model
>>> reg = linear_model.LassoLars(alpha=.1)
>>> reg.fit([[0, 0], [1, 1]], [0, 1])
output: LassoLars(alpha=0.1)
>>> reg.coef_
output: array([0.717157..., 0. ])

As another example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets
X, y = datasets.load_diabetes(return_X_y=True)
print("Computing regularization path using the LARS ...")
_, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)
xx = np.sum(np.abs(coefs.T), axis=1)   # cumulative l1 norm of the coefficients along the path
xx /= xx[-1]                           # normalized to [0, 1]
# plot the regularization path (uses the matplotlib import above)
plt.plot(xx, coefs.T)
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.show()

Orthogonal Matching Pursuit (OMP)

OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients. Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements.

check this page for more detailed information.
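A minimal sketch of OrthogonalMatchingPursuit recovering a sparse coefficient vector (the synthetic data below is made up for illustration):

# a minimal sketch: OMP with a fixed number of non-zero coefficients on synthetic sparse data
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
true_coef = np.zeros(20)
true_coef[[2, 7, 11]] = [1.5, -2.0, 3.0]        # only three features are actually relevant
y = X @ true_coef + 0.01 * rng.randn(100)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3)
omp.fit(X, y)
print(np.nonzero(omp.coef_)[0])   # indices of the features OMP selected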

Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand. This can be done by introducing uninformative priors over the hyperparameters of the model. The l2 regularization used in Ridge regression and classification is equivalent to finding a maximum a posteriori estimation under a Gaussian prior over the coefficients w with precision lambda**-1. Instead of setting lambda manually, it is possible to treat it as a random variable to be estimated from the data.

A- BayesianRidge

Compared to the OLS (ordinary least squares) estimator, the coefficient weights are slightly shifted toward zero, which stabilises them. Predictions and uncertainties for Bayesian Ridge Regression can also be plotted for one-dimensional regression using polynomial feature expansion; check this link for such an example. Bayesian Ridge Regression is used for regression as follows:

>>> from sklearn import linear_model
>>> X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
>>> Y = [0., 1., 2., 3.]
>>> reg = linear_model.BayesianRidge()
>>> reg.fit(X, Y)
output: BayesianRidge()
>>> reg.predict([[1, 0.]])
output: array([0.50000013])
>>> reg.coef_
output: array([0.49999993, 0.49999993])
Curve Fitting with Bayesian Ridge Regression:

In general, when fitting a curve with a polynomial by Bayesian ridge regression, the selection of initial values of the regularization parameters (alpha, lambda) may be important. This is because the regularization parameters are determined by an iterative procedure that depends on initial values. check this link for an example

B- Automatic Relevance Determination Regression (ARD)

check this link for details and an example.
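Since no code is shown here, a minimal sketch of ARDRegression on the same toy data as the BayesianRidge example above:

# a minimal sketch: ARDRegression on the BayesianRidge toy data above
from sklearn.linear_model import ARDRegression
X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = ARDRegression()
reg.fit(X, Y)
print(reg.coef_)                 # per-feature weights; irrelevant features are driven toward zero
print(reg.predict([[1., 0.]]))   # prediction for a new point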

Logistic regression

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. The solvers implemented in the class LogisticRegression are:

“liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”; it is usually better to use “saga”.

Examples are available here.
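A minimal sketch on the iris dataset (the dataset, the scaling step and the solver choice are just for illustration):

# a minimal sketch: logistic regression with the "saga" solver; features are scaled first so the solver converges quickly
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(solver="saga", max_iter=1000))
clf.fit(X, y)
print(clf.predict(X[:3]))        # predicted classes for the first three samples
print(clf.predict_proba(X[:3]))  # class membership probabilities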

Stochastic Gradient Descent - SGD

Stochastic gradient descent is a simple yet very efficient approach to fitting linear models. It is particularly useful when the number of samples (and the number of features) is very large. For large-scale data, also see the Perceptron and Passive Aggressive Algorithms sections below.

>>> from sklearn.linear_model import SGDClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
>>> clf.fit(X, y)
output: SGDClassifier(max_iter=5)
>>> clf.predict([[2., 2.]])
output: array([1])
>>> clf.coef_
output: array([[9.9..., 9.9...]])
>>> clf.intercept_
output: array([-9.9...])
#To get the signed distance to the hyperplane use SGDClassifier.decision_function:
>>> clf.decision_function([[2., 2.]])
output: array([29.6...])
>>> clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
>>> clf.predict_proba([[1., 1.]])
output: array([[0.00..., 0.99...]])

check this link for another example

Perceptron

The Perceptron is another simple classification algorithm suitable for large scale learning. By default:

  • It does not require a learning rate.
  • It is not regularized (penalized).
  • It updates its model only on mistakes.

The last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.

>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import Perceptron
>>> X, y = load_digits(return_X_y=True)
>>> clf = Perceptron(tol=1e-3, random_state=0)
>>> clf.fit(X, y)
output: Perceptron()
>>> clf.score(X, y)
output: 0.939...
Passive Aggressive Algorithms

The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. However, contrary to the Perceptron, they include a regularization parameter C.

>>> from sklearn.linear_model import PassiveAggressiveClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_features=4, random_state=0)
>>> clf = PassiveAggressiveClassifier(max_iter=1000, random_state=0, tol=1e-3)
>>> clf.fit(X, y)
output: PassiveAggressiveClassifier(random_state=0)
>>> print(clf.coef_)
output: [[0.26642044 0.45070924 0.67251877 0.64185414]]
>>> print(clf.intercept_)
output: [1.84127814]
>>> print(clf.predict([[0, 0, 0, 0]]))
output: [1]

Robustness regression: outliers and modeling errors

Robust regression aims to fit a regression model in the presence of corrupt data: either outliers, or error in the model.

check this link and this link for examples, and this link for a newer method and a comparison to the ridge method.
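As a minimal sketch, HuberRegressor (one of scikit-learn's robust estimators) compared with ordinary LinearRegression on data with a few injected outliers (the data is made up for illustration):

# a minimal sketch: HuberRegressor down-weights large residuals, so injected outliers distort it less than OLS
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.randn(50)
y[::10] += 30                       # inject a few large outliers
huber = HuberRegressor().fit(X, y)
ols = LinearRegression().fit(X, y)
print(huber.coef_, ols.coef_)       # compare slopes: the Huber fit is less affected by the outliers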

Polynomial regression: extending linear models with basis functions

One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.arange(6).reshape(3, 2)
>>> X
output: array([[0, 1], [2, 3], [4, 5]])
>>> poly = PolynomialFeatures(degree=2)
>>> poly.fit_transform(X)
output: array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])

This sort of preprocessing can be streamlined with the Pipeline tools. A single object representing a simple polynomial regression can be created and used as follows:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))])
>>> # fit to an order-3 polynomial data
>>> x = np.arange(5)
>>> y = 3 - 2 * x + x ** 2 - x ** 3
>>> model = model.fit(x[:, np.newaxis], y)
>>> model.named_steps['linear'].coef_
output: array([ 3., -2., 1., -1.])

For example, when dealing with boolean features, the product x_i * x_j represents the conjunction of two booleans. This way, we can solve the XOR problem with a linear classifier:

>>> from sklearn.linear_model import Perceptron
>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> y = X[:, 0] ^ X[:, 1]
>>> y
output: array([0, 1, 1, 0])
>>> X = PolynomialFeatures(interaction_only=True).fit_transform(X).astype(int)
>>> X
output: array([[1, 0, 0, 0],[1, 0, 1, 0],[1, 1, 0, 0],[1, 1, 1, 1]])
>>> clf = Perceptron(fit_intercept=False, max_iter=10, tol=None, shuffle=False).fit(X, y)
And the classifier “predictions” are perfect:
>>> clf.predict(X)
output: array([0, 1, 1, 0])
>>> clf.score(X, y)
output: 1.0
