Preprocessing
Besides the techniques we discussed while collecting data (removing useless tags, characters and parts) and
ignoring (not removing) outliers by plotting the data and limiting the time period in the data frame, other methods are
discussed here.
First we look at more common preprocessing techniques like normalization, then we move on to more
advanced ones (the robust scalers and transformers in the sklearn library).
1: normalization methods:
linear transformations: adjust the variables so that statistical methods can be applied to them more easily, using approaches such as the log transform,
the Pareto distribution check and the z-score
lognormal (logarithms):
a way of normalizing the distribution and reducing its skewness; works for most skewed distributions
calculate the logarithmic value of each feature in a new column
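As a minimal sketch (assuming a pandas DataFrame df with a positive-valued column named price, a hypothetical name), the log-transformed feature can be stored in a new column:

import numpy as np
import pandas as pd

# hypothetical skewed, positive-valued feature
df = pd.DataFrame({"price": [12.0, 15.0, 14.0, 300.0, 18.0, 22.0]})

# add the logarithmic value of each row in a new column
df["price_log"] = np.log(df["price"])   # use np.log1p(df["price"]) instead if zeros are possible
print(df)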
pareto distribution test:
more useful for the remaining heavy-tailed (power-law) distributions
cdf(x) = 1 - (x/xm)^(-a)
xm is the minimum possible value = the location/scale of the distribution
a = the shape of the distribution
on a log-log scale, the CCDF looks like a straight line: y = (x/xm)^(-a)
taking the log of it: log(y) = -a*(log(x) - log(xm))
so it is a straight line with a slope of -a and an intercept of a*log(xm)
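A minimal sketch of that check (assuming a 1-D array of positive values; the variable names and the generated sample are purely illustrative): plot the empirical CCDF on log-log axes and see whether it is roughly a straight line.

import numpy as np
import matplotlib.pyplot as plt

# hypothetical positive-valued sample; a Pareto sample with xm = 3.0 and a = 2.5 is used for illustration
rng = np.random.default_rng(0)
x = (rng.pareto(a=2.5, size=1000) + 1) * 3.0

x_sorted = np.sort(x)
ccdf = 1.0 - np.arange(1, len(x_sorted) + 1) / len(x_sorted)   # empirical P(X > x)

plt.loglog(x_sorted[:-1], ccdf[:-1], marker=".", linestyle="none")
plt.xlabel("x (log scale)")
plt.ylabel("CCDF (log scale)")
plt.title("Pareto check: a straight line suggests a power-law tail")
plt.show()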
other methods and techniques:
checking the variability of the values - the most frequently used measures of variability are the range, the interquartile range
(IQR), the variance, the standard deviation (stdev) and the index of skewness
#calculating the range: the largest value minus the smallest value
r = df.max() - df.min()
#calculating the IQR - the 75th percentile is the upper hinge, the 25th the lower hinge; the IQR is the H-spread
# the IQR is a percentile-based measure of the spread of a distribution
# the rank of the pth percentile in the sorted data is (p/100)*(n+1); pandas interpolates this for us:
r25 = df.quantile(0.25)
r75 = df.quantile(0.75)
IQR = r75 - r25
#calculating the variance: the average squared difference of the scores from the mean
import numpy as np
import pandas as pd

def s2_func(df):
    # df is assumed to be a pandas Series of values
    mu = df.mean()
    s = 0.0
    for x in df:
        s += (x - mu) ** 2
    return s / len(df)          # population variance
#if it is a sample rather than the whole population, divide by len(df) - 1 instead
#calculating stdev: the square root of the variance (pandas .std() uses the sample formula, n-1, by default)
s = df.std()
the index of skewness of a distribution can be computed in several ways: Pearson's formula, the third moment about the mean, and (as a related shape measure) the kurtosis
1. Pearson's formula
iskew = 3 * (df.mean() - df.median()) / df.std()
2. the most common way
#the more commonly used formula for the index of skewness: the third moment about the mean divided by the cubed standard deviation
import numpy as np
import pandas as pd

def i_skew(df):
    # df is assumed to be a pandas Series of values
    mu = df.mean()
    s = 0.0
    for x in df:
        s += (x - mu) ** 3
    m3 = s / len(df)            # third moment about the mean
    return m3 / df.std() ** 3
3. kurtosis formula (a related shape measure based on the fourth moment)
import numpy as np
import pandas as pd

def kurtosis_skew(df):
    # df is assumed to be a pandas Series of values
    mu = df.mean()
    s = 0.0
    for x in df:
        s += (x - mu) ** 4
    m4 = s / len(df)                # fourth moment about the mean
    return m4 / df.std() ** 4 - 3   # minus 3 so that a normal distribution has kurtosis 0
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution. In Python the z-score is used to standardize a variable (and then to check whether it is normally distributed).
A value from any normal distribution can be transformed into its corresponding value on a standard normal distribution using the following formula:
z = (x - mu) / stdev
import scipy.stats as stats
stats.zscore(df)
#or in this way:
from scipy.stats import zscore
df.apply(zscore)
if all the values in the distribution are transformed to z-scores, the distribution is standardized (mean 0, standard deviation 1); note this rescales the data but does not change its shape, whereas taking the logarithm of each x (above) does
#shapiro test - a test for normality:
from scipy import stats
stat, p = stats.shapiro(df)
#if p is smaller than 0.05, the null hypothesis of normality is rejected, i.e. the data is not normal
using transformation techniques to normalize or test data: the z-score, logs, chi-square, Fisher's exact test, the F-ratio (for comparing several means, computed as a measure of how different the groups are). We talked about the z-score and logs above. For the chi-square
formula: make a crosstab table from the data, then:
chisq = x2 = sigma((o_i - e_i)^2 / e_i)
where o_i and e_i are the observed and expected counts in each cell of the crosstab.
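A minimal sketch of this in Python (the column names group and outcome are hypothetical): build the crosstab with pandas and run the test with scipy.

import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical categorical data
data = pd.DataFrame({
    "group":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "outcome": ["yes", "no", "yes", "yes", "no", "no", "yes", "yes"],
})
observed = pd.crosstab(data["group"], data["outcome"])   # crosstab of observed counts
chi2, p, dof, expected = chi2_contingency(observed)      # chi2 = sigma((o_i - e_i)^2 / e_i)
print(chi2, p, dof)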
Kruskal-Wallis test is a rank-randomization test that extends the Wilcoxon test to designs with more than two groups. It tests for differences in central tendency in designs with one between-subjects variable. The test is based on a statistic H that is approximately distributed as Chi Square. The first step is to convert the data to ranks (ignoring group membership) and then find the sum of the ranks for each group. Then, compute H:
H = -3*(N+1) + (12 / (N*(N+1))) * sigma(Ti^2 / ni), with the sum running from i=1 to k
where N is the total number of observations, Ti is the sum of ranks for the ith group, ni is the sample size for the ith group, and k is the number of groups.
Finally, the significance test is done using a Chi-Square distribution with k-1 degrees of freedom (DegF=k-1).
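The same test is available directly in scipy; a minimal sketch with three hypothetical groups:

from scipy.stats import kruskal

# hypothetical samples for three groups
g1 = [7.1, 8.3, 6.9, 7.7]
g2 = [5.2, 6.0, 5.8, 6.4]
g3 = [9.0, 8.8, 9.5, 8.1]
H, p = kruskal(g1, g2, g3)   # H is referred to a chi-square with k-1 = 2 degrees of freedom
print(H, p)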
Chi-square function:
Chisqnorm = sigma(z_i^2)
Chisqnorm, the Chi Square statistic, is the sum of squared standard normal deviates (i.e. sigma(z_i^2)); the Chi Square distribution is the distribution of this sum
Degf, the degrees of freedom of the distribution, is equal to the number of standard normal deviates being summed
import math

def probChi(Chisqnorm, Degf):
    # upper-tail probability P(chi-square > Chisqnorm) for Degf degrees of freedom
    # Chisqnorm = sigma(z_i^2), the sum of squared standard normal deviates
    AM = 1.7E-38                    # smallest term magnitude worth keeping in the series
    WU2P = 2.5066282746310005       # sqrt(2*pi)
    if Chisqnorm == 0:
        return 1.0
    EC = math.exp(-Chisqnorm / 2)
    if Degf == 1:
        z = math.sqrt(Chisqnorm)
        pr = 1 - zProb(z)
        return 2 * pr
    elif Degf == 2:
        return EC
    CX = math.sqrt(Chisqnorm)
    ZC = 1.0 / WU2P * EC
    if Degf % 2 == 0:
        IE = 1
        SU = ZC
        I1 = (Degf - 2) // 2
        RD = ZC
    else:
        IE = 0
        SU = 0.0
        I1 = (Degf - 1) // 2
        RD = ZC / CX
    if RD >= AM and Degf <= 500:
        # sum the series term by term
        z = CX
        pr = 1 - zProb(z)
        J = 2 if IE == 1 else 1
        for i in range(1, I1 + 1):
            RD = RD * Chisqnorm / J
            J += 2
            SU += RD
        if IE == 1:
            p = WU2P * SU
        else:
            p = 2 * (pr + SU)
    else:
        # normal approximation for large degrees of freedom
        z0 = (Chisqnorm / Degf) ** (1.0 / 3.0) - 1.0 + 2.0 / (9.0 * Degf)
        z = z0 * math.sqrt(9.0 * Degf / 2.0)
        p = 1 - zProb(z)
    return p
def zProb(z):
    # cumulative standard normal probability P(Z < z), via a short series approximation
    if z < -7:
        return 0.0
    if z > 7:
        return 1.0
    flag = z < 0.0
    z = abs(z)
    b = 0.0
    s = math.sqrt(2) / 3 * z
    HH = 0.5
    for i in range(12):
        a = math.exp(-HH * HH / 9) * math.sin(HH * s) / HH
        b = b + a
        HH = HH + 1.0
    p = 0.5 - b / math.pi
    if not flag:
        p = 1.0 - p
    return p
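As a quick sanity check (a sketch, not part of the original functions), these can be compared with scipy's chi-square survival function; the two values should be close:

from scipy.stats import chi2

print(probChi(7.5, 3))    # series/approximation above
print(chi2.sf(7.5, 3))    # scipy reference value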
2: using robust scalers
In addition to the normalization methods above, using robust scalers or transformers is often more appropriate. This part is based on the scikit-learn preprocessing article;
check this page for a comparison between these scalers. The techniques include: linear transformations, non-linear transformations, normalization, encoding, discretization, polynomial features and custom transforms.
A---- Linear transformation
1- scaling to the center
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X_train)
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
scaler = preprocessing.StandardScaler().fit(X_train)
scaler.mean_
array([1. ..., 0. ..., 0.33...])
scaler.scale_
array([0.81..., 0.81..., 1.24...])
scaler.transform(X_train)
output: array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
2- scaling to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
3- Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales. MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this
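A minimal sketch of MaxAbsScaler (the small sparse matrix here is purely illustrative); it divides each feature by its maximum absolute value, so zero entries stay zero:

import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X_sparse = sparse.csr_matrix(np.array([[ 1., -2.,  0.],
                                       [ 2.,  0.,  0.],
                                       [ 0.,  1., -1.]]))
max_abs_scaler = MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_sparse)   # each column now lies in [-1, 1], sparsity preserved
print(X_maxabs.toarray())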
4- Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead.
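A minimal sketch of RobustScaler (the data, with one extreme value, is illustrative); it centers on the median and scales by the IQR, so the outlier does not dominate:

import numpy as np
from sklearn.preprocessing import RobustScaler

X_out = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # one feature with an outlier
robust_scaler = RobustScaler()                            # uses the median and the IQR instead of the mean and variance
X_robust = robust_scaler.fit_transform(X_out)
print(X_robust.ravel())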
B---- non-Linear transformation
Two types of transformations are available:
1. quantile transforms - put all features into the same desired distribution based on the formula: Quantile(CDF(x))
if x has a continuous distribution, CDF(x) is uniformly distributed on [0, 1]
and if CDF(x) is uniformly distributed on [0, 1], applying the quantile function of the desired output distribution maps the data to that distribution
mapping to a uniform distribution:
QuantileTransformer and quantile_transform provide a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
output: array([ 4.3, 5.1, 5.8, 6.5, 7.9])
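As a follow-up check on the same example, the percentiles of the transformed feature should now sit close to 0, 0.25, 0.5, 0.75 and 1 (the exact decimals may differ slightly):
>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
output: values close to array([0., 0.25, 0.5, 0.75, 1.])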
2. power transforms - mapping to gaussian dist:
>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_lognormal
output: array([[1.28..., 1.18..., 0.84...],
[0.94..., 1.60..., 0.38...],
[1.35..., 0.21..., 1.09...]])
>>> pt.fit_transform(X_lognormal)
output: array([[ 0.49..., 0.17..., -0.15...],
[-0.05..., 0.58..., -0.57...],
[ 0.69..., -0.84..., 0.10...]])
It is also possible to map data to a normal distribution using QuantileTransformer by setting output_distribution='normal'.
Using the earlier example with the iris dataset:
>>> quantile_transformer = preprocessing.QuantileTransformer(output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
output: array([[4.3, 2. , 1. , 0.1],
[4.4, 2.2, 1.1, 0.1],
[4.4, 2.2, 1.2, 0.1],
...,
[7.7, 4.1, 6.7, 2.5],
[7.7, 4.2, 6.7, 2.5],
[7.9, 4.4, 6.9, 2.5]])
C---- normalization
Normalization here means rescaling each individual sample (row) so that it has unit norm:
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> normalizer = preprocessing.Normalizer().fit(X) # fit does nothing
>>> normalizer.transform(X)
D---- converting to integer
To convert categorical features to integer codes, we can use:
1. the pandas Series.map(dict) method (see the sketch after the OrdinalEncoder example below)
2. the OrdinalEncoder
>>> enc = preprocessing.OrdinalEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
>>> enc.transform([['female', 'from US', 'uses Safari']])
output: array([[0., 1., 1.]])
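For the first approach, a minimal sketch using pandas Series.map with a hand-written mapping dict (the column name and integer codes are hypothetical):

import pandas as pd

data = pd.DataFrame({"browser": ["uses Safari", "uses Firefox", "uses Safari"]})
code_map = {"uses Safari": 0, "uses Firefox": 1}       # hypothetical integer codes
data["browser_code"] = data["browser"].map(code_map)
print(data)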
E---- Discretization
Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values.
>>> X = np.array([[ -3., 5., 15 ],
... [ 0., 6., 14 ],
... [ 6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
>>> est.transform(X)
output: array([[ 0., 1., 1.],
[ 1., 1., 1.],
[ 2., 0., 0.]])
#--- Binarization
>>> X = [[ 1., -1., 2.], [ 2., 0., 0.], [ 0., 1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X) # fit does nothing
>>> binarizer
output: Binarizer()
>>> binarizer.transform(X)
output: array([[1., 0., 1.], [1., 0., 0.], [0., 1., 0.]])
F---- Generating polynomial features
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
output: array([[0, 1], [2, 3], [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
output: array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
The features of X have been transformed from (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2)
G---- custom transform
to build a transformer that applies a log transformation in a pipeline, do:
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p, validate=True)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
output: array([[0. , 0.69314718], [1.09861229, 1.38629436]])
dimension reduction (Decomposing signals in components)
common methods: removing (ignoring) some features, combining several features into one, feature engineering techniques, removing outliers, using a visual representation to spot cycles (loops) in a time series, cropping images in face recognition, ...
In summary:
- for face recognition - use LDA (Linear Discriminant Analysis), PCA using SVD, ICA, Dictionary Learning
- for topic modeling and document clustering - use NNMF, LDA (Latent Dirichlet Allocation), LSA and other PCA-like methods
For those interested in the mathematics behind all of this, consider the following explanations:
1. Principal component analysis (PCA)
PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.
The PCA object also provides a probabilistic interpretation of the PCA that can give a likelihood of data based on the amount of variance it explains. As such it implements a score method that can be used in cross-validation.
check this page for an example and comparison to LDA below
check this page for an example to Model selection with Probabilistic PCA and Factor Analysis (FA)
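A minimal sketch of that transformer usage (the iris data and the choice of 2 components are just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)              # learn 2 orthogonal components in fit
X_reduced = pca.fit_transform(X)       # project the data onto those components
print(pca.explained_variance_ratio_)   # share of the variance explained by each component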
--- The IncrementalPCA
The IncrementalPCA object uses a different form of processing and allows for partial computations which almost exactly match the results of PCA while processing the data in a minibatch fashion.
check this page for an example
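A minimal sketch of the minibatch usage (the dataset and the batch split are illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X, _ = load_iris(return_X_y=True)
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 3):     # feed the data in minibatches
    ipca.partial_fit(batch)
X_reduced = ipca.transform(X)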
--- PCA using singular value decomposition (SVD)
more useful for face recognition systems or a plant disease detection project
check this page for the Faces recognition example using eigenfaces and SVMs
check this page for another example
--- Kernel PCA
Kernel PCA is able to find a projection of the data that makes data linearly separable
check this page for an example
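A minimal sketch on data that is not linearly separable (the kernel and gamma values are illustrative choices):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)   # in this projection the two circles become linearly separable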
2. Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
check this page for an example and comparison to PCA above
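A minimal sketch of this supervised projection (iris is used as an illustrative labelled dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # uses the class labels y, unlike PCA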
3. Truncated singular value decomposition (SVD) and latent semantic analysis (LSA)
TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the largest k singular values, where k is a user-specified parameter.
check this page for an example to clustering text documents using k-means
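A minimal sketch of LSA, applying TruncatedSVD to a TF-IDF matrix (the tiny document collection here is made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock prices fell sharply",
        "the market rallied on strong earnings"]
X_tfidf = TfidfVectorizer().fit_transform(docs)      # sparse term-document matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # keep only the 2 largest singular values
X_lsa = svd.fit_transform(X_tfidf)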
4. Dictionary Learning
All variations of dictionary learning implement the following transform methods, controllable via the transform_method initialization parameter:
Orthogonal Matching Pursuit (OMP)
Least-angle regression (Least Angle Regression (LAR))
Lasso computed by least-angle regression
Lasso using coordinate descent (Lasso)
Thresholding
check this page for an example to Sparse coding with a precomputed dictionary
check this page for an example of Image denoising using dictionary learning
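A minimal sketch of learning a dictionary and encoding signals with OMP (note: in current scikit-learn the constructor parameter is named transform_algorithm; the random data here is purely illustrative):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                 # hypothetical signals, one per row
dico = MiniBatchDictionaryLearning(n_components=15,
                                   transform_algorithm="omp",      # Orthogonal Matching Pursuit
                                   transform_n_nonzero_coefs=3,
                                   random_state=0)
code = dico.fit(X).transform(X)                    # sparse codes in the learned dictionary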
5. Independent component analysis (ICA)
Independent component analysis separates a multivariate signal into additive subcomponents that are maximally independent. It is implemented in scikit-learn using the Fast ICA algorithm.
Similar to PCA using SVD for face recognition system
check this page for an example to Blind source separation using FastICA to estimate sources given noisy measurements.
check this page> for an example to FastICA on 2D point clouds, a comparison between ICA and PCA
ICA is an algorithm that finds directions in the feature space corresponding to projections with high non-Gaussianity.
These directions need not be orthogonal in the original feature space, but they are orthogonal in the whitened feature space, in which all directions correspond to the same variance.
PCA, on the other hand, finds orthogonal directions in the raw feature space that correspond to directions accounting for maximum variance.
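A minimal sketch of FastICA separating two mixed signals (the sources and the mixing matrix are made up for illustration):

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                           # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                  # source 2: square wave
S = np.c_[s1, s2] + 0.1 * rng.standard_normal((2000, 2))
A = np.array([[1.0, 0.5], [0.5, 2.0]])       # hypothetical mixing matrix
X = S @ A.T                                  # observed mixed signals
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                 # recovered (maximally independent) sources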
6. Non-negative matrix factorization (NMF or NNMF)
Similar to PCA (and PCA via SVD) for a face recognition system, but an alternative approach in which both the data and the components are constrained to be non-negative; it applies in cases where the data matrix does not contain negative values.
check this page> for Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
check the code below for the Beta-divergence loss functions; it compares the various Beta-divergence loss functions supported by the Multiplicative-Update ('mu') solver in sklearn.decomposition.NMF. As beta goes from 0 toward 2.0, the shape of the loss curve changes: the small-beta curves rise steeply near zero (an exponential/log-like chart), while beta = 2 gives the quadratic Frobenius loss.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition._nmf import _beta_divergence

x = np.linspace(0.001, 4, 1000)
y = np.zeros(x.shape)
colors = 'mbgyr'
for j, beta in enumerate((0., 0.5, 1., 1.5, 2.)):
    for i, xi in enumerate(x):
        y[i] = _beta_divergence(1, xi, 1, beta)
    name = "beta = %1.1f" % beta
    plt.plot(x, y, label=name, color=colors[j])

plt.xlabel("x")
plt.title("beta-divergence(1, x)")
plt.legend(loc=0)
plt.axis([0, 4, 0, 3])
plt.show()
7. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model that is used for discovering abstract topics from a collection of documents.
refer to NNMF (above) for an example