Data Science Process

The data science process comprises several steps:

  • defining the problem, drawing diagrams, working through the theoretical issues, and so on - we're not going to dive into this here, but you may find more information in another post
  • collecting data
  • training a model on the data
  • testing your model with more data and preparing it for prediction

But as a data scientist who is going to work with Python and its libraries, you should look at the process from another angle and go through a few practical steps. In this post we're going to cover the technical details of these steps, with examples and explanations.

So, the process you're going to follow (after the theoretical work) includes:

importing libraries and packages into your .py file

Depending on the job you're going to do, you will use certain libraries; these are explained later in this post.
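As a rough illustration, a minimal starting set of imports might look like the sketch below (the exact libraries are just typical choices for the tasks covered in this post, not a fixed requirement):

import numpy as np                  # numerical arrays
import pandas as pnd                # dataframes and file readers
import matplotlib.pyplot as plt     # plotting
from sklearn import linear_model    # regression / classification models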

collecting data

Data can be collected from various sources, and it can also be categorized by its type (numbers, text, images, audio). As mentioned before, this post only summarizes each topic; more detail is provided in other posts, so browse the blog and follow us for future posts!

  • based on sources

    sources can be 1- a website (scraping data from a website is covered in other posts, but see the short sketch below) and 2- files from a server (no matter which server)
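    as a rough sketch (not the full story - see the scraping posts), getting data out of a web page with the requests and BeautifulSoup libraries can look like this; the URL and the h2 tag are just placeholders:

    import requests
    from bs4 import BeautifulSoup

    page = requests.get('https://example.com')               # placeholder URL
    soup = BeautifulSoup(page.text, 'html.parser')
    titles = [h.get_text() for h in soup.find_all('h2')]     # e.g. collect all h2 headings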

    for files:

    1- pdf

    to work with a pdf file, use the PyPDF2 library:

    import PyPDF2

    myfile = open('pdffilename.pdf', mode='rb')  # pdf files must be opened in binary mode

    pdf_reader = PyPDF2.PdfFileReader(myfile)

    pdf_reader.numPages  # number of pages in the file

    p = pdf_reader.getPage(0)  # get the first page

    p.extractText()  # extract the text of that page

    pdf_writer = PyPDF2.PdfFileWriter()

    pdf_writer.addPage(p)  # adding a page from one pdf to another

    pdf_output = open('filename.pdf', 'wb')

    pdf_writer.write(pdf_output)

    # extracting the text of every page into a list:

    pdf_text = []

    pdf_reader = PyPDF2.PdfFileReader(myfile)

    for page_number in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_number)
        pdf_text.append(page.extractText())

    2- json

    import json

    # some JSON:

    x = '{ "name":"John", "age":30, "city":"New York", "s": [{"a": "bann"}, {"b": "nab"}]}'

    # parse x:

    y = json.loads(x)

    # the result is a Python dictionary:

    print(y["s"][0])

    output: {'a': 'bann'}
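    besides parsing a string, you will often load JSON directly from a file; a minimal sketch (the file name is just an example):

    with open('data.json') as f:    # example file name
        data = json.load(f)         # parse the whole file into Python objects

    json.dumps(y)                   # and the reverse: serialize a dict back to a JSON string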

    3- excel file (.xls or .xlsx)

    import pandas as pnd

    df = pnd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])  # read a sheet into a DataFrame

    df.to_excel('foo.xlsx', sheet_name='Sheet1')  # write a DataFrame back to an excel file

    #or this way

    xlsx = pnd.ExcelFile(r'path\to\file.xls')  # raw string so the backslashes are not treated as escapes

    df = pnd.read_excel(xlsx, 'Sheet1')

    4- html file

    url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'

    #the page must contain a "table" tag (usually followed by a "tbody" tag)

    dfs = pnd.read_html(url)  # returns a list of DataFrames, one for each table found on the page

    df = dfs[0]  # take the first table as a DataFrame

    print(df.to_html())  # raw html

    dfs = pnd.read_html(url, index_col=0)

    dfs = pnd.read_html(url, header=0)

    dfs = pnd.read_html(url, skiprows=0)

    dfs1 = pnd.read_html(url, attrs={'id': 'table'})  # match a table by its id attribute

    dfs2 = pnd.read_html(url, attrs={'class': 'sortable'})  # or by its class

    import numpy as np

    print(np.array_equal(dfs1[0], dfs2[0]))  # should be True if both selectors match the same table

    s = df.to_html(float_format='{0:.40g}'.format)

    #converting the DataFrame to an array using the NumPy library

    arr = np.asarray(df)  # or arr = np.array(df)

    #if the conversion fails, make sure the table is not nested (no tables inside td/tr cells):
    #it must be a flat table, e.g. 30 rows x 10 cols

    #otherwise use scraping methods to get the data and save it as a csv file

    5- csv

    #extracting all rows of column 2 from the array built above:

    dff = arr[:, 2]  # or two columns: dff = arr[:, 2:4]

    df = pnd.read_csv('foo.csv')  # read a csv file into a DataFrame

    df.to_csv('foo.csv')  # write a DataFrame back to a csv file

    you can find more detailed information about the Pandas and NumPy libraries in other posts.

  • based on data type

    if the data is numeric (like databases), use Pandas or NumPy, or even pure Python. I suggest using Pandas or NumPy. Databases, or technically data sets, can be stored in an SQL database or in an internal DataFrame in Python. Here are some pure Python examples:

    var = "something"

    print(f"text {var}") or print("text {}".format(var))

    d = {'a': 123, 'b': 456}

    mylist =[1,2,3]

    print(f"text {d['a']}") #shows 123

    print(f"text {mylist[0]}")

    for b in mylist:
        print(b)  # or print a specific index of the list

    print(f"{var:{10}}")

    #pads var to a length of 10; add > after the : to right-align within those 10 characters,
    #and put a fill character such as - or . before the > to fill the gap (e.g. f"{var:->10}")

    You can find better ways to do this in Pandas.
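    for example, a small sketch of the same kind of data kept in a Pandas DataFrame (the values are made up):

    import pandas as pnd

    df = pnd.DataFrame({'a': [123, 124, 125], 'b': [456, 457, 458]})  # columns built from lists
    print(df['a'][0])       # 123
    print(df.describe())    # quick summary statistics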

    if the data is text, refer to the post on NLP (Natural Language Processing).

    for pictorial data (images), refer to the post on image processing in Python.

    for manipulating audio files in Python, refer to the related post as well.

Cleaning data

As a short and straight explanation, cleaning means getting rid of useless data (characters, tags, ...) in the data you've just collected! For example, when you scrape a website and get the html content, there may be tags and characters that you don't want in your dataset, so you clean the data to make it more useful. A more advanced example of cleaning is removing outliers from the data you collected; another is dimensionality reduction (e.g. cropping images or limiting the time series in some big data sets).

In this section we briefly introduce some techniques for preprocessing and cleaning data, i.e. preparing it to be more useful.

Since real-world data is rarely normally distributed, normalization is a common preprocessing technique; when the data contains outliers, robust scalers or transformers are more appropriate. Other techniques are transformations (linear, non-linear, normalization, encoding, discretization, polynomial features or custom transforms).

Preprocessing also helps when fitting a model (in regression and model building, which we talk about later and in another post).
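As a small sketch of what such preprocessing can look like with sklearn (X here is just a made-up feature matrix containing an outlier):

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])   # made-up data with an outlier

X_std = StandardScaler().fit_transform(X)    # scale to zero mean and unit variance
X_rob = RobustScaler().fit_transform(X)      # uses median/IQR, less sensitive to the outlier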

Training data to make a model

This means splitting the dataset into a training part and a testing part. We use the training part to build and train our model, then check it with the test data. The question is which part of the data is best used for training the model and which part for testing.

There are a number of algorithms and methods for splitting data. We discuss this under "Model selection and evaluation" in detail in another post, but for now, it's usually better to use cross-validation (CV) to k-fold split the training part and then use the test data for both testing and validation. What kind of k-folding to use (equal portions or not, and if not, based on what) is something we will get into in detail later.
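A minimal sketch of splitting and cross-validation with sklearn (the X and y here are made up just to show the calls):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(20).reshape(-1, 1)     # made-up features
y = 3 * X.ravel() + 1                # made-up target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold cross-validation on the training part
model.fit(X_train, y_train)
print(scores.mean(), model.score(X_test, y_test))         # cv score vs. held-out test score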

Regression and Clustering

The main part of the data science process is building a model for regression or clustering. A model ultimately has to be able to predict the future by examining past data (regression), or in some cases to categorize data in ways that are meaningful (clustering). We usually use the sklearn (scikit-learn) library for these two purposes, but keep in mind that there are definitely other libraries out there you can use as well. I would appreciate your comments if you think other libraries have benefits over sklearn.

summary of regression

general issues:

from sklearn import linear_model

from sklearn.metrics import mean_squared_error, r2_score

#importing datasets and specifying X, y, X_train, y_train, X_test, y_test

#then:

model = linear_model.method()  # replace method() with the estimator you need, e.g. linear_model.LinearRegression()

model.fit(X_train, y_train)

model.coef_  # fitted coefficients

model.intercept_  # fitted intercept

model.decision_function([[x, y]])  # for classifiers only; [[x, y]] stands for one sample's feature values

model.score(X_train, y_train)  # R^2 for regressors (accuracy for classifiers)

#for testing the model/predictions

y_pred = model.predict(X_test)

mean_squared_error(y_test, y_pred)

r2_score(y_test, y_pred)

if the values drift over time:

Ridge regression (with higher alpha), RidgeCV, Lasso

For high-dimensional datasets with many collinear features: (all discrete)

LassoCV (with alpha), LassoLarsCV (with alpha), Least Angle Regression - like Lasso (with n_nonzero_coefs)

Orthogonal Matching Pursuit (OMP) >> OrthogonalMatchingPursuit and orthogonal_mp methods

Robustness regression, PolynomialFeatures(degree=2)

Joint feature selection with multi-task Lasso: (all discrete)

MultiTaskLasso (with alpha)

one-dimensional regression

linear regression, BayesianRidge, Automatic Relevance Determination Regression (ARD)

for classification:

Logistic Regression, Stochastic Gradient Descent (SGDClassifier(loss, penalty, max_iter))

Perceptron(tol=1e-3, random_state=0), PassiveAggressiveClassifier(max_iter=1000, random_state=0, tol=1e-3)
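A small sketch of two of these classifiers on made-up data, just to show the calls:

import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])   # made-up features
y = np.array([0, 0, 1, 1])                       # made-up labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5, 1.5]]))                 # predicted class for a new sample

sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=1000, tol=1e-3).fit(X, y)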

We get into regression in detail in another post.

Clustering

Because we use the sklearn library, we explain its methods here. Clustering means categorizing data into groups; clustering itself is unsupervised (the supervised counterpart is classification). Again, we dive into this later in another post. Here's just a short summary:

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
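For example, a minimal K-Means sketch on made-up points:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])   # two obvious groups

km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)                   # cluster label of each training point
print(km.predict([[0.9, 1.1]]))     # assign a new point to a cluster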

in summary, based on:

  • parameters

    n_clusters as parameter: K-Means, Spectral, ward hierarchical, agglomerative, OPTICS (minimum cluster membership)

    neighborhood size: DBSCAN

    many params: Gaussian Mixtures

    distance threshold: ward hierarchical, agglomerative, birch

    bandwidth: Mean-shift

    sample preferences/damping: affinity propagation

  • scalability

    large n_samples, medium n_clusters: K-Means, DBSCAN

    medium n_samples, small n_clusters: spectral

    large n_samples, large n_clusters: ward hierarchical, agglomerative, OPTICS, Birch

    not scalable with n_samples: affinity propagation, Mean-shift, Gaussian Mixture

  • n_clusters

    many clusters: affinity propagation, Mean-shift, ward hierarchical, agglomerative

    medium number of clusters: K-Means, spectral

  • geometry

    flat: K-Means, Gaussian Mixture

    non-flat: affinity propagation, Mean-shift, spectral, DBSCAN, OPTICS

  • metric used

    distance between points: K-Means, Mean-shift, ward hierarchical, DBSCAN, OPTICS, Birch, Gaussian Mixtures

    graph distance: affinity propagation, Spectral

    any pairwise dist: agglomerative

Presentation

Every product must be presented well to get attention. As said before, we're providing just an introduction to presentation here; more detailed information can be found in other posts.

You can present your data and model in several ways:

by using the well-known matplotlib library, or by exporting the data to a csv, json or txt file and using third-party libraries like D3 or applications like SPSS or SAS to make a graphical presentation
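For instance, a tiny matplotlib sketch (the values are made up):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 30]            # made-up values

plt.plot(x, y, marker='o')
plt.xlabel('x')
plt.ylabel('y')
plt.title('a simple presentation of the results')
plt.savefig('result.png')       # or plt.show() to display it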

Most of the detailed explanation is provided under specific topics in other posts. So don't worry, check those posts and enjoy the world of data science.

Making Report

This means that in addition to making graphs, result images and diagrams (in some cases), you can put all of these together with an explanation (as text) and make a report for senior managers or your clients.

We dive into this with a real project named "marketing toolkit" to show you how it is done.

As always, I appreciate your comments; feel free to contact me via email (saman_shahin@yahoo.com).

Have Fun!
