Data Science Process
The data science process comprises several steps:
- defining the problem, drawing diagrams, working through the theory, and so on - we're not going to dive into this here, but you may find more information in another post
- collecting data
- training a model on the data
- testing your model on more data and preparing it for prediction
But as a data scientist who is going to work with Python and its libraries, you should look at this process from another angle and go through some extra steps. In this post we're going to cover the technical side of these steps with examples and detailed explanations.
So, the process you're going to follow (after some theoretical things) includes:
importing libraries and packages into your .py file
Depending on the job you're going to do, you will use certain libraries; these will be explained later.
collecting data
Data can be collected from various sources, and it can be categorized by data type (numbers, text, images, audio). As mentioned before, we're only going to summarize these topics in this post and will cover them in more detail in other posts. So browse the blog, check out the other posts, and follow us!
-
based on sources
sources can be 1- a website (like scraping data from a website), for which you should refer to other posts, and 2- files from a server (no matter which server)
for files:
1- pdf
for working on a PDF file, use the PyPDF2 library:
import PyPDF2

myfile = open('pdffilename', mode='rb')  #PDFs must be opened in binary mode
pdf_reader = PyPDF2.PdfFileReader(myfile)
pdf_reader.numPages                      #number of pages in the file
p = pdf_reader.getPage(0)                #grab the first page
p.extractText()                          #extract its text as a string

pdf_writer = PyPDF2.PdfFileWriter()
pdf_writer.addPage(p)                    #adding a page from one pdf to another
pdf_output = open('filename', 'wb')
pdf_writer.write(pdf_output)
pdf_output.close()

#collect the text of every page into a list
pdf_text = []
pdf_reader = PyPDF2.PdfFileReader(myfile)
for p in range(pdf_reader.numPages):
    page = pdf_reader.getPage(p)
    pdf_text.append(page.extractText())
2- json
import json
# some JSON:
x = '{ "name":"John", "age":30, "city":"New York", "s": [{"a": "bann"}, {"b": "nab"}]}'
# parse x:
y = json.loads(x)
# the result is a Python dictionary:
print(y["s"][0])
output: {'a': 'bann'}
3- excel file (.xls or .xlsx)
import pandas as pnd

#read a sheet into a DataFrame, then write it back to a file
df = pnd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
df.to_excel('foo.xlsx', sheet_name='Sheet1')
#or this way
xlsx = pnd.ExcelFile(r'path\to\file.xls')   #raw string so the backslashes aren't treated as escapes
df = pnd.read_excel(xlsx, 'Sheet1')
4- html file
import numpy as np

url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
#the html must contain a "table" tag (optionally followed by a "tbody" tag)
dfs = pnd.read_html(url)    #read_html returns a list of DataFrames, one per table found in the page
df = dfs[0]                 #take the first table
print(df.to_html())         #raw html
dfs = pnd.read_html(url, index_col=0)
dfs = pnd.read_html(url, header=0)
dfs = pnd.read_html(url, skiprows=0)
dfs1 = pnd.read_html(url, attrs={'id': 'table'})
dfs2 = pnd.read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True
s = df.to_html(float_format='{0:.40g}'.format)
#converting the DataFrame to an array using the Numpy library
df = np.asarray(df)   #or, equivalently, df = np.array(df)
#if the conversion fails, check that the table is not nested (td/tr inside td/tr): it should be a flat grid, e.g. 30 rows x 10 cols
#instead, use scraping methods to get the data and save it as a csv file
5- csv
#extracting all rows of column 2 from the array above:
dff = df[:, 2]   #or two columns: dff = df[:, 2:4]
df = pnd.read_csv('foo.csv')
df.to_csv('foo.csv')
you can get more detailed information about the Pandas and Numpy libraries in other posts.
-
based on data type
if the data is numeric (like databases), use Pandas or Numpy, or even pure Python. I suggest using Pandas or Numpy. Databases, or technically data sets, can be stored in an SQL database or in an in-memory DataFrame in Python. Here are some pure Python examples:
var = "something"
print(f"text {var}") or print("text {}".format(var))
d = {'a': 123, 'b': 456}
mylist =[1,2,3]
print(f"text {d['a']}") #shows 123
print(f"text {mylist[0]}")
for b in mylist:
print(b) #or the specified index of list
print(f"{var:{10}}")
#length of 10, add > after : to align to right at the 10, add - or . before > to fill the gap with - or .
You can find better ways to do this with Pandas.
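For a taste of that, here's a minimal sketch (the column names and values are made up just for illustration) of keeping the same kind of data in a Pandas DataFrame instead of plain dictionaries and lists:
import pandas as pnd

#a hypothetical little data set, just for illustration
df = pnd.DataFrame({'a': [123, 456, 789], 'b': ['x', 'y', 'z']})
print(df['a'])          #one column as a Series
print(df.loc[0, 'a'])   #a single value: 123
print(df.describe())    #quick summary statistics of the numeric columns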
if the data is text, refer to a post on NLP (Natural Language Processing) issues.
for pictorial data (images), refer to another post on image processing in Python.
for manipulating audio files in Python, likewise refer to the related post.
Cleaning data
As a short and straightforward explanation, cleaning means getting rid of data that isn't useful (characters, tags, ...) from the data you've just collected! For example, when you scrape a website and get the html content, there may be tags and characters that you don't want in your dataset, so you clean the data to make it more useful. A more advanced example of cleaning data is getting rid of outliers in the data you collected; another is dimension reduction (e.g. cropping images or limiting the time series in some big data sets).
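As a minimal sketch of that idea (the columns, the tag-stripping regex and the outlier cut-off are illustrative assumptions, not a fixed recipe):
import re
import pandas as pnd

#hypothetical raw data: a text column with leftover html tags and a numeric column with an outlier
raw = pnd.DataFrame({'text': ['<p>good</p>', '<div>bad</div>', None, '<span>ok</span>'],
                     'value': [10, 12, 11, 9999]})
raw = raw.dropna()                                                   #drop rows with missing values
raw['text'] = raw['text'].str.replace(r'<[^>]+>', '', regex=True)   #strip html tags
raw = raw[raw['value'] < raw['value'].quantile(0.95)]                #a crude outlier cut-off
print(raw)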
In this section we'll briefly introduce some techniques for preprocessing and cleaning data, i.e. preparing it to be more useful.
Since real-world data is rarely normally distributed, normalization is a common preprocessing technique, and robust scalers or transformers are often more appropriate. Other techniques are transformations (linear, non-linear, normalization, encoding, discretization, polynomial features or custom transforms).
Preprocessing also helps when fitting a model (in regression and model building; we'll talk about this later in another post).
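As a minimal sketch of scaling with scikit-learn (the toy array and the choice of scalers are just for illustration; which one fits depends on your data):
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 10000.0]])   #toy data with an outlier in the second column
print(StandardScaler().fit_transform(X))   #zero mean, unit variance per column
print(RobustScaler().fit_transform(X))     #uses median and IQR, so less sensitive to the outlier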
Training a model on the data
This means splitting the dataset into a training part and a testing part. We use the training part to build and train our model, then check it with the test data. The question is which part of the data is best for training the model and which for testing.
There are quite a few algorithms and methods for splitting data. We discuss this under "Model selection and evaluation" in detail in another post, but for now, consider that it's usually better to use cross-validation (cv) to k-fold split the training part of the dataset, then use the test data for both testing and validation. What kind of k-folding, with equal portions or not, and if not, based on what, we will get into in detail later.
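Here's a minimal sketch of that idea with scikit-learn (the built-in diabetes toy dataset and the plain linear regression estimator are just placeholders for your own data and model):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)   #5-fold cross-validation on the training part
print(scores.mean())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   #final check on the held-out test data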
Regression and Clustering
The main part of the data science process is making a model for regression or clustering. A model ultimately has to be able to predict the future by examining past data (regression), or in some cases to categorize data in ways that are meaningful (clustering). We usually use the sklearn (scikit-learn) library for these two purposes, but keep in mind that there are definitely other libraries out there you can use as well. I would appreciate your comments if you think other libraries have benefits over sklearn.
summary of regression
general issues:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
#import your dataset and specify X, y, X_train, y_train, X_test, y_test
#then:
model = linear_model.LinearRegression()   #or any other estimator from linear_model
model.fit(X_train, y_train)
model.coef_                     #fitted coefficients
model.intercept_                #fitted intercept
model.score(X_train, y_train)   #R^2 on the training data; classifiers also expose decision_function()
#for testing the model/predictions
y_pred = model.predict(X_test)
mean_squared_error(y_test, y_pred)
r2_score(y_test, y_pred)
if the values change over time:
Ridge regression (with higher alpha), RidgeCV, Lasso
For high-dimensional datasets with many collinear features: (all discrete)
LassoCV (with alpha), LassoLarsCV (with alpha), Least Angle Regression - like Lasso (with n_nonzero_coefs)
Orthogonal Matching Pursuit (OMP) >> OrthogonalMatchingPursuit and orthogonal_mp methods
Robustness regression, PolynomialFeatures(degree=2)
Joint feature selection with multi-task Lasso: (all discrete)
MultiTaskLasso (with alpha)
one dimensional regression
linear regression, BayesianRidge, Automatic Relevance Determination Regression (ARD)
for classification:
Logistic Regression, Stochastic Gradient Descent (SGDClassifier(loss, penalty, max_iter))
Perceptron(tol=1e-3, random_state=0), PassiveAggressiveClassifier(max_iter=1000, random_state=0, tol=1e-3)
We'll get into regression in detail in another post.
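Before moving on, here's a minimal sketch fitting two of the estimators mentioned above, Ridge and Lasso, on a toy dataset (the alpha values are arbitrary examples, not recommendations):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso

X, y = load_diabetes(return_X_y=True)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print(ridge.coef_)   #all coefficients shrunk but typically non-zero
print(lasso.coef_)   #Lasso tends to drive some coefficients exactly to zero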
Clustering
Because we use the sklearn library, we explain its methods here. Clustering means categorizing data into groups. There is supervised and unsupervised clustering. Again, we'll dive into this later in another post. Here's just a short summary:
Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
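For example, a minimal sketch with KMeans (the toy points and the choice of 2 clusters are just for illustration):
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])   #toy 2-D points
km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)                       #cluster label assigned to each training point
print(km.predict([[0, 0], [12, 3]]))    #assign new points to the learned clusters
print(km.cluster_centers_)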
in summary, based on:
-
parameters
n_clusters as a parameter: K-Means, Spectral, ward hierarchical, agglomerative; OPTICS (minimum cluster membership); see the sketch after this list
neighborhood size: DBSCAN
many params: Gaussian Mixtures
distance threshold: ward hierarchical, agglomerative, birch
bandwidth: Mean-shift
sample preferences/damping: affinity propagation
-
scalability
large n_samples, medium n_clusters: K-Means, DBSCAN
medium n_samples, small n_clusters: spectral
large n_samples, large n_clusters: ward hierarchical, agglomerative, OPTICS, Birch
not scalable with n_samples: affinity propagation, Mean-shift, Gaussian Mixture
-
n_clusters
many clusters: affinity propagation, Mean-shift, ward hierarchical, agglomerative
medium number of clusters: K-Means, spectral
-
geometry
flat: K-Means, Gaussian Mixture
non-flat: affinity propagation, Mean-shift, spectral, DBSCAN, OPTICS
-
metric used
distances between points: K-Means, Mean-shift, ward hierarchical, DBSCAN, OPTICS, Birch, Gaussian Mixtures
graph distance: affinity propagation, Spectral
any pairwise distance: agglomerative
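To make the parameters above concrete, here's a minimal sketch of how they show up in the constructors (the values are arbitrary examples, not recommendations):
from sklearn.cluster import KMeans, DBSCAN, MeanShift, AgglomerativeClustering

KMeans(n_clusters=3)                                                 #number of clusters
DBSCAN(eps=0.5)                                                      #neighborhood size
MeanShift(bandwidth=2.0)                                             #bandwidth
AgglomerativeClustering(distance_threshold=1.0, n_clusters=None)    #distance threshold instead of n_clusters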
Presentation
Every product must be presented well to get attention. As said before, we're providing an introduction to presentation here; more detailed information can be found in other posts.
You can present your data and model in several ways:
by using the well-known Matplotlib library, or by exporting data to a csv, json or txt file and using third-party libraries like D3 or applications like SPSS or SAS to make graphical presentations
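As a minimal sketch with Matplotlib (the numbers are made up just to show the idea):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 9, 16, 25]          #made-up values, just for the demo
plt.plot(x, y, marker='o')
plt.xlabel('x')
plt.ylabel('y')
plt.title('a simple presentation of the data')
plt.savefig('plot.png')        #or plt.show() to display it interactively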
Most of the detailed explanations are provided under specific topics in other posts, so don't worry: check those posts and enjoy the world of data science.
Making a Report
This means that in addition to making graphs, result images and diagrams (in some cases), you can put all of these together with an explanation (as text) and make a report for senior managers or your clients.
We'll dive into this with a real project named "marketing toolkit" to show you how it's done.
As always, I appreciate your comments, and feel free to contact me via email (saman_shahin@yahoo.com).
Have Fun!