Misc Notes On Machine Learning

References

Overview

  • This contains miscellaneous notes on machine learning.
  • Needs to be organized.
  • The contents include:
    • Andrew Ng's Coursera course notes
    • Modeling workflow using pandas, seaborn, scikit-learn
    • Model Cross Validation Strategies
    • NLP and Machine Learning Intro

Resources

Learning resources for machine learning

Extra Reference resources

Books

Visualization

public datasets

Introduction

Installation Tips

  • Use anaconda with python 3.5. Synopsis:

    $ anaconda-navigator
    $ pip install jupyter_contrib_nbextensions
    

    Launch Jupyter Notebook, Spyder, or the Orange app from Anaconda Navigator.

  • Install and use rstudio

Udemy A-Z Machine Learning Course Notes

  • Install Anaconda, RStudio

  • Get data sets from http://www.superdatascience.com/machine-learning

  • Get the data (like csv files)

  • Pre-process the data. Python Synopsis:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    
    # Data = (country, age, salary, purchased). We need to predict last column.
    dataset = pd.read_csv('Data.csv')     # Use spyder variable explorer to examine dataset.
    
    X = dataset.iloc[:, :-1].values   # Features. All lines, all columns except last column.
    y = dataset.iloc[:, 3].values     # The target dependent column values as single array. (purchased column)
    
    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in scikit-learn 0.20
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    
    ....
    
    from sklearn.svm import SVR
    regressor = SVR(kernel = 'rbf')    # other types of kernel are: 'linear', 'poly', etc.
    regressor.fit(X, y)
    y_pred = regressor.predict([[6.5]])   # predict() expects a 2D array (rows = samples); valid when X has a single feature
    
  • Pre-process the data. R Synopsis:

    # Importing the dataset
    dataset = read.csv('Data.csv')
    View(dataset)   # Inside the RStudio interface

  • If your input data contains a categorical variable column (e.g. state = 'NY', 'CA', etc.), which cannot be translated to a numerical quantity, then you may have to introduce dummy variables: is_ny, is_ca, ..., etc. You should not include all dummy variables: if the values of (x1, x2) determine x3, there is redundancy in the dataset. i.e. include "all but one" of the dummy variables.
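
The "all but one dummy variable" rule can be sketched with pandas. This is a minimal illustration, not part of the course material; the toy data and column names are made up, and `drop_first=True` is the pandas option assumed here to drop one dummy per categorical column.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "state": ["NY", "CA", "FL", "CA"],
    "spend": [10.0, 20.0, 30.0, 40.0],
})

# drop_first=True keeps all but one dummy column per categorical variable,
# which avoids the redundancy described above (here the 'CA' column is dropped).
encoded = pd.get_dummies(df, columns=["state"], drop_first=True)
print(encoded.columns.tolist())
```

A row with both remaining dummies equal to 0 then stands for the dropped category.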

  • There are different approaches to remove insignificant features from input features:

    • Backward Elimination -- Concepts: choose a significance level (5% or so); fit the model with all predictors; remove the predictor with the highest P-value if it exceeds the significance level; refit and repeat. (A higher P-value means weaker evidence that the predictor matters.)
    • Forward Selection -- Start from empty, keep adding the feature with the lowest P-value as long as it is below the significance level (P < 0.05). Lot of work, inefficient. If two features are strongly dependent on each other, the second one is unlikely to be added once the first has been selected.
    • Bidirectional Elimination -- You need a significance level to enter and one to stay (say 5%).
  • In R programs, summary(regressor) gives a very good summary with coefficients, P-values, etc. R also automatically replaces enumerated values by dummy variables internally (e.g. State1, State2, etc. for enumerated values 'NY', 'CA', etc.).

  • Regression types:

    • Simple Linear regression - y = b + alpha * x1
    • Multiple Linear regression - y = b + alpha * x1 + beta * x2
    • Polynomial Linear regression - Uses x1, x1^2, e.g. parabolic: y = b + alpha * x1 + beta * x1^2. It is still "linear" because the model is linear in the coefficients: the input is pre-processed to include the x^2 terms, but the maths behind the fitting is the same linear regression.
    • Support Vector Regression - Can use a kernel such as 'rbf' to capture non-linear dependency.
    • Decision Tree Regression - Useful to map regions of (x1, x2) ranges to certain y values. i.e. predict y = 100 when 40 < x1 < 50 and 4 < x2 < 6.
    • Random Forest Regression - Pick K random data points from the training set and build a decision tree on this subset. Repeat to build, say, 500 trees (the forest), which gives 500 predictions. Take the mean of the 500 y values.
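
To illustrate why polynomial regression is still a linear fit, here is a small NumPy sketch. The data is a made-up noise-free parabola, and `np.linalg.lstsq` is used as a stand-in for whatever linear regression routine you prefer.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x**2   # noise-free parabola, for illustration only

# Pre-process the input: a column of ones (intercept), x, and x^2.
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary linear least squares on the expanded features.
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(coef, 6))   # coefficients for 1, x, x^2
```

The fit recovers the coefficients of the parabola even though only a linear solver was used; the non-linearity lives entirely in the pre-processed features.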

Machine Learning Notes - Udemy Jose Portilla

Misc Notes

Top Python Libraries for Data Science

scikit-learn

  • 1.2K committers; 32K stars
  • High Level API
  • Machine Learning: Based on SciPy stack of libraries.
  • Built-in datasets like iris are not pandas DataFrames. e.g. the iris dataset holds all features in iris.data and the species labels in iris.target. This features/target split is the format used with scikit-learn.
  • Run by volunteers
  • See Also:
    • sklearn-pandas - bridge library between scikit-learn and pandas
    • scikit-plot - convenient visualization lib for machine learning
    • joblib - scikit-learn parallelization library
  • See question: - https://stackoverflow.com/questions/31421413/how-to-compute-precision-recall-accuracy-and-f1-score-for-the-multiclass-case

scipy stack

The scipy stack includes the following libraries (See http://scipy.org) :

  • Numpy
  • Pandas
  • Scipy
  • Matplotlib
  • IPython
  • SymPy

numpy

  • 700+ committers; 9K stars
  • Efficient implementation for arrays and matrices

scipy

  • 700 committers; 5K stars
  • Built on top of numpy
  • Core scientific functions; includes modules (sub packages) such as linalg, cluster, stats etc.

Pandas

  • 1300+ committers; 17K stars
  • Provides data frame, data analysis and even plotting functions and more.

matplotlib

  • 700 committers; 8K stars
  • Visualization

Seaborn

  • 90 committers; 5.5K stars
  • Visualization - built on top of matplotlib - Higher level statistical API
  • Auto generates many pretty statistical plots given a data frame.
  • Can visualize data as a starting point to get a gut feeling even before building a precise model.
  • Not part of the scipy stack itself. Often used along with scikit-learn.

Cufflinks is a library for easy interactive pandas charting with Plotly. It binds Plotly directly to pandas dataframes. Charts created with cufflinks are synced with your online Plotly account. You can also go offline and create charts.

Bokeh

  • 330 committers; 8.5K stars
  • Visualization: Interactive Web Plotting for Python

Keras

One of the most prominent and convenient Python libraries in deep learning is Keras, which can function either on top of TensorFlow or Theano. Let’s discuss some details about all of them.

  • Deep Learning High Level API
  • Wrapper Lib on top of Theano / Tensorflow

Theano

  • Theano defines multi-dimensional arrays similar to NumPy, efficient implementation.
  • Originally developed by the Machine Learning group of Université de Montréal
  • Used for machine learning / Deep Learning applications.
  • Uses GPU/CPU and integrates nicely with numpy

TensorFlow

  • Machine learning Library From Google
  • Successor of DistBelief, a machine learning system based on neural networks.
  • Supports multi-layered nodes system - i.e. Deep learning.
  • Currently powers Google’s voice recognition and object identification from pictures.

Now, we will look at NLP libraries.

NLTK - Natural Language Toolkit

  • Foundation toolkit. Focused on teaching and on basic building blocks.
  • Allows operations such as text tagging, classification, tokenizing, stemming etc.
  • Allows building complex systems such as sentiment analysis.

Gensim - Unsupervised NLP Learning Library

  • Just give it a corpus (input) and let it do the job.
  • Uses word2vec, doc2vec, etc. algorithms to find recurring patterns of words.

Now, we will look at Data Mining and Statistics libraries ...

Scrapy

  • Data mining lib based on web crawling (scraping); can also gather data from various APIs.

Statsmodels

  • Commits: 8960, Contributors: 119
  • Built on top of scipy stack.
  • Complements scipy's stats module. Higher Level API.
  • Uses patsy to express statistical models similar to R formulas:
    • e.g. patsy.dmatrices("y ~ x + a + b + a:b", data)
    • i.e. y linearly depends on: x, a, b and the a:b interaction term (the product of a and b)
  • Supports using :
    • linear regression models,
    • generalized linear models,
    • discrete choice models,
    • robust linear models,
    • time series analysis models, and various estimators.
  • Also provides extensive plotting functions specifically for use in statistics.
  • Most results have been verified with at least one other statistical package: R, Stata or SAS.

Coursera Notes

  • Machine Learning == Science of getting computers to learn without being explicitly programmed.

  • Applications: anti-spam, recommendations, search ranking, click stream data, OCR, teaching a helicopter to fly itself, etc.

  • Anti spam task watches you classifying spam mails and learns what is spam mail from that.

  • Tip: Learn by example ... and relate your problem to other examples.

  • Machine learning algorithms types:

    • Supervised, unsupervised.
    • Others: reinforcement learning, recommender systems.
  • There are learning algorithms like SVM (support vector machines) that can deal with an infinite number of features. An SVM works by finding the maximum-margin hyperplane for classification. For 3D space, the hyperplane is a 2D plane. For 2D space, the hyperplane is a line. For N-dimensional space, the hyperplane is an (N-1)-dimensional plane. For a sparse matrix with a large set of features, it is easier to find a hyperplane compared to other algorithms.

  • Training Set => Learning Algorithm => Hypothesis Function H. H takes input features x and yields output result y.

    y = h(x) = t0 + t1 * x ; For linear dependency. Find t0, t1. 2 unknown variables, 2 data points will give you answer. If it is 3 data points, the answer may not exist. So, we need to find "closest" (t0, t1) that minimizes the error wrt all data points.

    Cost function J(t0, t1) = (1/2m) sum( (y1-h1)^2 + ...)

    Find t0, t1 to minimize the squared-error sum (y1 - h1)^2 + (y2 - h2)^2 + ... For a given set of 3 data points, (1, 2), (2, 4), (3, 6), this would be: squared-error sum = ( 2 - (t0 + t1 * 1) )^2 + (4 - (t0 + t1 * 2))^2 + (6 - (t0 + t1 * 3))^2. Note that the error sum is a function of the 2 unknown variables t0, t1. To find its minimum, find the slope and then the point where the slope is zero. Since there are 2 unknown variables instead of 1, we have to apply partial derivatives to the cost J(t0, t1): the partial derivative dJ/dt0 differentiates J considering only t0 as the variable while holding t1 fixed (and likewise for dJ/dt1).

    Setting dJ/dt0 = 0 and dJ/dt1 = 0 and solving for the unknown variables gives the minimum directly.

    Gradient descent instead approaches the minimum iteratively: at the current (t0, t1), compute dJ/dt0 and dJ/dt1, which point uphill, and step in the opposite direction. Both parameters are updated simultaneously:

    t0 := t0 - alpha * (dJ/dt0)
    t1 := t1 - alpha * (dJ/dt1)

    t0 := t0 - alpha * (1/m) * sum(all_distances) // sum(distances) = (h(x1) - y1) + (h(x2) - y2) + etc.
    // Note: negative distance influences the direction.

    t1 := t1 - alpha * (1/m) * sum(dist1 * x1[row1] + dist2 * x1[row2] + ... )
    // To find the weight for the nth feature, magnify each distance by that feature's value
    // in order to find the convergence direction.

    How to derive dJ/dt0 and dJ/dt1 ... See Lecture 2 notes pdf.

    Where alpha is learning rate. e.g. 0.1
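
The update rules above can be sketched in NumPy for the three data points (1, 2), (2, 4), (3, 6) used in these notes. The iteration count and the zero starting point are arbitrary choices for illustration; alpha = 0.1 is the example learning rate mentioned above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)

t0, t1 = 0.0, 0.0      # arbitrary starting point
alpha = 0.1            # learning rate

for _ in range(5000):
    dist = (t0 + t1 * x) - y            # h(x_i) - y_i for every row
    grad_t0 = dist.sum() / m            # dJ/dt0
    grad_t1 = (dist * x).sum() / m      # dJ/dt1: distances weighted by the feature
    # Simultaneous update of both parameters.
    t0, t1 = t0 - alpha * grad_t0, t1 - alpha * grad_t1

cost = ((t0 + t1 * x - y) ** 2).sum() / (2 * m)   # J(t0, t1)
print(t0, t1, cost)
```

Since the points lie exactly on y = 2x, the loop should settle near t0 = 0, t1 = 2 with a cost close to zero.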

  • Squared Error function works best for linear regression optimization.

  • A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. e.g. x, y graph may have concentric circles with each circle representing cost 10, 20, 30 etc. higher the cost, bigger the circle. i.e. if you allow more error, you can choose more pairs of linear parameters theta1, theta0.

  • Gradient descent may give you some local minimum which may not be the global optimal minimum. If you visualize the cost as a surface, this is the bottom of a nearby valley rather than the lowest point overall.

  • A convex function has only one global minimum, not multiple local minima. The squared error function is an example of a convex function.

  • For "multivariate linear regression" of n features, (x1, x2, ... xn), we need to find (t0, t1, .... tn), an n+1 tuple:

    i.e. h(x) = t0 + t1*x1 + t2*x2 + .... + tn*xn

    For convenience of matrix ops and notation, we define dummy feature x0 which is always a constant 1. So we have now n+1 features:

    h(x) = transpose([x0, x1, ... xn ]) * [t0, ... tn ]

  • Feature scaling -- Make sure features are on the same scale so that gradient descent converges quickly. e.g. x1 = size of house is 0 to 2000 sqft, and x2 = number of bedrooms is 1 to 5. Get every feature into approximately the -1 to +1 range, i.e. divide the value by the max value.

  • Mean normalization (a feature scaling technique) transforms a feature so that it has mean approximately 0. e.g. x = (#bedrooms - 2) / 5 where 2 is the mean number of bedrooms and 5 is the max.

    i.e. x = ( x1 - mean(x1) ) / range(x1), where range = max - min; if x1 varies from 30 to 50, the range is 20. Note: You can also choose to divide by the standard deviation instead of the range.

  • Another alternative to Mean Normalization is standardisation. It has means approx 0 and std-deviation 1 but the individual values are not limited to -1 to 1 range:

    X.stand =  (x - Mean(x)) / StdDev(x)   # See sklearn.preprocessing.StandardScaler for more info.
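
Both scaling variants can be compared side by side in NumPy. This is a sketch on a made-up feature that varies from 30 to 50, matching the range example above; `x.std()` is the population standard deviation.

```python
import numpy as np

x = np.array([30.0, 35.0, 40.0, 45.0, 50.0])   # hypothetical feature values

# Mean normalization: subtract the mean, divide by the range (max - min).
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

print(mean_normalized)
print(standardized.mean(), standardized.std())
```

Mean normalization keeps values within -1 to +1, while standardized values have mean 0 and standard deviation 1 but are not bounded.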
    
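  • A valid approach to choosing the learning rate is to try alpha = 0.001, 0.01, 0.1, 1, etc. and pick the largest value for which the cost still decreases on every iteration.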

  • Feature transformation -- If you suspect only (some) features may fit in a non-linear fashion, you can create your own dummy features only for them. e.g. House Area Feature = Length * Width. So you may have t1 * width + t2 * length + t3 * area. So you may choose to fit "important selective features" as non-linear and keep other simple features as "linear".

  • Normal equation (closed-form solution): theta_params = Inverse(Transpose(X) * X) * Transpose(X) * y

  • m training examples, n features -- Gradient descent vs matrix (normal equation) approach, pros and cons: Usually the matrix approach is best since there is no need to choose alpha and iterate. However, gradient descent works best when there is a large number of features: the matrix approach is too slow when n is large, since inverting the n x n matrix is O(n^3). Up to n = 1000, we may still employ the matrix approach; when n > 10000, gradient descent is definitely better.

  • Transpose(X) * X can be non-invertible (e.g. a zero matrix). pinv(A) is the pseudo-inverse routine in Octave, which gives a result even when the matrix is not invertible, yielding the closest approximation. The matrix can be non-invertible if m < n (i.e. more features than samples) or when there are redundant features (e.g. length in meters and in feet).
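
A NumPy sketch of the normal equation with a pseudo-inverse, using the redundant meters/feet example. The data is made up, and the explicit `rcond=1e-10` cutoff is an assumption chosen so that the numerically zero singular value is discarded reliably.

```python
import numpy as np

# Columns: x0 = 1, length in meters, the same length in feet (redundant feature).
meters = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), meters, meters * 3.28084])
y = 5.0 + 2.0 * meters                      # hypothetical target

XtX = X.T @ X                               # singular, because of the redundant column

# theta = pinv(X' * X) * X' * y, the normal equation with a pseudo-inverse.
theta = np.linalg.pinv(XtX, rcond=1e-10) @ X.T @ y
pred = X @ theta
print(pred)
```

The individual theta values are not unique here (the weight can be split between meters and feet in many ways), but the predictions still reproduce y.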

  • Octave tips and Synopsis:

    1 == 2
    ans = 0
    1 ~= 2        % Note that it is ~= not  !=
    ans = 1
    1 && 2        % also: 1 & 2 , 1 || 2 etc
    PS1('>> ')
    a = 5 ;       % semicolon suppresses output
    disp(sprintf('this is %0.3f', a));
    A = [1 2 ; 3 4 ; 5 6 ]     % No comma needed !
    v = [ 1  2 3 ]             % This is 1x3 matrix
    v = [ 1; 2; 3 ]            % This is 3x1 value vector
    v = 1:6
    v =  1 2 3 4 5 6        % This is 1x6 matrix
    v = 1:0.1:2                % The 0.1 is delta. This gives 1x11 matrix.
    ones(2, 3)   % 2x3 matrix of all ones.
    c = 2 * ones(2,3)          % 2x3 matrix of all 2's.
    zeros(1,3)
    rand(3, 3)
    I = eye(4)     % Identity matrix.
    help help
    help randn
    load features.dat
    load ('features.dat')   % Load csv or space separated data.
    who                     % List variables in current scope
    whos                    % list variables in detail
    clear featuresX         % unload variable
    v = priceY(1:10)        % extract subrange data 
    save hello.mat  v       % save variables into hello.mat. Saved in binary.
    A(3, 2)                 % get element at 3rd row, 2nd col
    A(:, 2)                 % all at 2nd col.
    A([1 3], :)             % Extract 1st and 3rd rows.
    A(:, 2) = [10; 11; 12]  % Assign 2nd col by given values.
    A = [ A, [100; 200; 300] ]  % append another column vector
    A(:)                    % Put all elements in single col vector
    C = [A B]                % Join 2 matrices adjacent. Same as C = [A, B]
    C = [A ; B]              % Join Matrix B below A.
    A * B
    C = A .* B               % Elementwise Multiply c(i,j) = a(i,j)*b(i,j).
    C = A .^ 2               % Each element is squared.
    abs(A)                   % elementwise abs operation applied.
    A'                       % This is A transpose
    [value, indexx ] = max(a)  % Get the max value in vector and the position
    find(a < 3)              % given vector a, yields another vector of indexes
    A = magic(3)              % NxN matrix where cols, rows add up to constant
    sum(a);  prod(a); floor(a); ceil(a); 
    rand(3)                  %  Random matrix.                          
    max(rand(3), rand(3))    % gives 3x3 matrix each with max of 2 rands.
    max(A, [], 1)            % Take columnwise max
    max(A, [], 2)            % Take rowwise max
    max(A(:))                % Take max of entire matrix
    sum(A, 1)                % sum it columnwise.
    sum(A, 2)                % sum it rowwise
    A .* eye(3)               % Zeroes all off-diagonal elements (A is 3x3 here, so use eye(3)).
    flipud(eye(9))           % Flips the matrix upside down
    
    t = [0:0.01:0.98] ; y1 = sin(2*pi*4*t);
    plot(t, y1);  y2 = cos(2*pi*4*t); plot(t, y2);
    % To plot one on top of another
    hold on;  plot(t, y1); plot(t, y2, 'r');
    xlabel('time'); ylabel('value')
    legend('sin', 'cos')               % displays color info
    title('my plot')
    cd 'c:/temp' ; print -dpng  'myplot.png'
    close     % close the plot
    figure(1); plot(...); figure(2); plot(....)
    subplot(1, 2, 1);  % Divide the canvas into a 1x2 grid. Set the current grid cell to the first one.
    plot(t, y1);                        % Plot in the first area of the canvas.
    subplot(1, 2, 2); plot(t, y2);      % Plot in the second area of the canvas.
    axis([0.5 1 -1 1])                  % Sets x, y ranges. ie. minx,maxx,miny,maxy
    clf; % clear canvas
    imagesc(A)     % Visualize matrix values by color.
    imagesc(A), colorbar, colormap gray
    
    prediction = 0.0;               % Iterative implementation.
    for j = 1:n+1,
      prediction = prediction + theta(j) * x(j);
    end;

    prediction = theta' * x;        % Vectorized implementation

  • Hypothesis representation.

    Let us consider a binary classification problem, with output y as 1 or 0. We make use of the sigmoid function: g(z) = 1 / (1 + e^-z) where z = theta' * x. This function gives a value between 0 and 1.

    z = 0     =>  e^-z = 1    =>  g(z) = 0.5
    z = +inf  =>  e^-z = 0    =>  g(z) = 1
    z = -inf  =>  e^-z = inf  =>  g(z) = 0

    Consider a training set of pairs (x1, y1) where each label y1 is 0 or 1. We find theta based on the existing set of points, but we can't allow output values < 0 or > 1, which is why the sigmoid is applied. The output h(x) can be any value in between and is read as the probability that y = 1. To find the decision boundary, set h(x) = 0.5 and solve the equation.

    If there is equal number of positive and negative samples, the decision boundary is exactly the fitting "curve". In case of logistic regression, this is where we get "0.5" as output.

    So, we can say:

    When theta' * x >= 0  =>  h(x) >= 0.5  =>  y = 1
         theta' * x <  0  =>  h(x) <  0.5  =>  y = 0
    

    The function z does not have to be linear in the raw features. With added features such as x1^2 and x2^2, the decision boundary could be a circle and such.
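
The hypothesis and decision rule can be sketched as follows. The theta values are hypothetical, chosen so that the decision boundary is the line x1 + x2 = 3.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^-z), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])     # hypothetical parameters (t0, t1, t2)
X = np.array([[1.0, 1.0, 1.0],         # first column is the dummy feature x0 = 1
              [1.0, 2.0, 2.0],
              [1.0, 1.5, 1.5]])

z = X @ theta                          # theta' * x for every row
h = sigmoid(z)                         # estimated probability that y = 1
y_pred = (h >= 0.5).astype(int)        # predict 1 exactly when z >= 0
print(z, h, y_pred)
```

The third row sits exactly on the boundary (z = 0, h = 0.5) and is assigned to class 1 by the >= convention.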

    In logistic regression, the cost function for our hypothesis outputting (predicting) hθ(x) on a training example that has label y with {0,1} is:

    point cost(hθ(x),y) = −log ( hθ(x) )     if y=1
                          −log(1−hθ(x))      if y=0
    
    cost J(theta) = (1/m) sum(costs)
    

    Converting the if logic into single equation:

    Cost(hθ(x),y)=−ylog(hθ(x))−(1−y)log(1−hθ(x))
    
    J(θ) = (−1/m)  Sum ( (y(i) * log(hθ(x(i))))  +  (1−y(i)) * log(1−hθ(x(i))) )
    
    Vectorized Implementation:
    h=g(Xθ)                                                                                        
    J(θ) = (1/m) * ( −y' * log(h)− (1−y)' * log(1−h))
    

    Gradient of the cost is a vector of the same length as θ where the jth element (for j = 0,1,...,n) is defined as follows:

    ∂J(θ) / ∂θj = (1/m) * Sum( (hθ(x(i))−y(i)) * xj(i) )
              in other words ...
    dJ/dtj = (1/m) * Sum ( distance * xj )   ;   // xj = jth feature in each input entry
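
A NumPy sketch of the vectorized cost and gradient above. The four training rows are made up for illustration; the first column of X is the dummy feature x0 = 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Return J(theta) and the gradient vector for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)                                   # h = g(X theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m       # vectorized cost
    grad = X.T @ (h - y) / m                                 # (1/m) * X' * (h - y)
    return J, grad

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

J0, grad0 = cost_and_gradient(np.zeros(2), X, y)
print(J0, grad0)
```

At theta = 0 every prediction is 0.5, so the cost equals log(2) regardless of the data, which makes a handy sanity check.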
    

Gradient descent for logistic regression

Remember that the general form of gradient descent is:

Repeat{ θj := θj − alpha * ∂J(θ)/∂θj }

We can work out the derivative part using calculus to get:

Repeat{ θj := θj − alpha * (1/m) * Sum( (hθ(x(i))−y(i)) * xj(i) ) }

Notice that this algorithm has the same form as the one we used in linear regression; only the hypothesis hθ(x) differs (here it is the sigmoid of θ' * x). We still have to simultaneously update all values in theta.

A vectorized implementation is:

θ := θ − (alpha/m) * X' * ( g(Xθ) − y )
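
The vectorized update can be run as a short loop. This is a sketch on made-up, linearly separable data; alpha = 0.1 and the iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: first column is x0 = 1, second is the single feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = len(y)

theta = np.zeros(2)
alpha = 0.1

for _ in range(10000):
    # theta := theta - (alpha/m) * X' * (g(X theta) - y); all thetas move together.
    theta = theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))

pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print(theta, pred)
```

On separable data like this the parameters keep growing slowly with more iterations, but the predicted labels stabilise early.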

Advanced Optimization

"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.

We first need to provide a function that evaluates the following two functions for a given input value θ:

Cost Function: J(θ)
Partial derivative of cost function for each theta parameter: dJ/dt0, dJ/dt1, etc.

In Octave, there is fminunc() optimization function which does this:

function [jVal, gradient] = costFunction(theta)
    jVal = [...code to compute J(theta)...];
    gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
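
For readers working in Python, a rough analogue of the fminunc() call is `scipy.optimize.minimize` with `jac=True`, which likewise expects one function returning both the cost and the gradient. The quadratic cost below is a stand-in chosen for illustration, not the logistic regression cost.

```python
import numpy as np
from scipy.optimize import minimize

def cost_function(theta):
    """Illustrative cost J(theta) = (t0 - 5)^2 + (t1 - 5)^2 and its gradient."""
    j_val = (theta[0] - 5.0) ** 2 + (theta[1] - 5.0) ** 2
    gradient = np.array([2.0 * (theta[0] - 5.0), 2.0 * (theta[1] - 5.0)])
    return j_val, gradient

initial_theta = np.zeros(2)

# jac=True tells SciPy that cost_function returns (cost, gradient),
# similar in spirit to optimset('GradObj', 'on') in Octave.
result = minimize(cost_function, initial_theta, jac=True, method="BFGS",
                  options={"maxiter": 100})
print(result.x, result.fun)
```

`result.x` plays the role of optTheta and `result.fun` the role of functionVal.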

Multiclass Classification

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction :

y∈ {0,1...n}
h0(x) =  P(y=0|x;θ)
h1(x) =  P(y=1|x;θ) 
....
prediction = the class i for which hi(x) is largest, i.e. argmax over i of hi(x)
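
The one-vs-all prediction step can be sketched like this, assuming the per-class parameters have already been trained. The Theta values below are hand-picked for illustration, one row per class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: row i holds theta for the "class i vs the rest" classifier.
Theta = np.array([[ 1.0, -1.0, -1.0],
                  [-1.0,  1.0, -1.0],
                  [-1.0, -1.0,  1.0]])

# Three samples; the first column is the dummy feature x0 = 1.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [1.0, 0.0, 3.0]])

probs = sigmoid(X @ Theta.T)            # probs[s, i] = h_i(x) for sample s
prediction = np.argmax(probs, axis=1)   # class with the highest hypothesis value
print(prediction)
```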

Problem of Overfitting

This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:

  • Reduce the number of features:

    • Manually select which features to keep.
    • Use a model selection algorithm (studied later in the course).
  • Regularization:

    • Keep all the features, but reduce the magnitude of parameters θj. Regularization works well when we have a lot of slightly useful features.

Regularized cost function:

J(θ) = (−1/m)  Sum ( (y(i) * log(hθ(x(i))))  +  (1−y(i)) * log(1−hθ(x(i))) ) + ( lambda / (2m) ) sum(square(theta_parameters))

Gradient: 

∂J(θ) / ∂θj = (1/m) * Sum( (hθ(x(i))−y(i)) * xj(i) )   +  (lambda/m) * θj   % Correction only for j > 0
          in other words ...
dJ/dtj = (1/m) * Sum ( distance * xj ) + lambda_correction  ;   // xj = jth feature in each input entry
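
A NumPy sketch of the regularized cost and gradient. It assumes, as the "only for j > 0" note indicates, that the intercept θ0 is left out of the penalty; the data and parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_gradient(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    J += lam / (2 * m) * np.sum(theta[1:] ** 2)     # penalty skips theta[0]
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]               # correction only for j > 0
    return J, grad

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.0, 1.0])

J_plain, g_plain = regularized_cost_and_gradient(theta, X, y, lam=0.0)
J_reg, g_reg = regularized_cost_and_gradient(theta, X, y, lam=1.0)
print(J_reg - J_plain, g_reg - g_plain)
```

The printed differences isolate the regularization terms: lambda/(2m) * θ1^2 for the cost and (lambda/m) * θ1 for the gradient entry of θ1.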

Neural Networks

Simple functions like AND and OR can be computed using a single compute layer, whereas a function like XOR is non-linear (and as such requires a new feature like x1 * x2) and can be computed effectively using an additional hidden layer.
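
A small sketch of XOR built from one hidden layer. The weights are hand-picked rather than learned, and a hard threshold is used instead of a sigmoid to keep the arithmetic easy to follow.

```python
import numpy as np

def step(z):
    """Hard threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: unit 1 computes OR(x1, x2), unit 2 computes AND(x1, x2).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
hidden = step(X @ W1.T + b1)

# Output layer: fires when OR is on and AND is off, which is exactly XOR.
W2 = np.array([1.0, -2.0])
b2 = -0.5
xor = step(hidden @ W2 + b2)
print(xor)
```

Each hidden unit alone is a single-layer function (OR, AND); only their combination in a second layer produces XOR.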

"Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.

For backpropagation to work, we need to initialize theta to small asymmetric random values (within ±epsilon). Initializing it as a zero matrix (or any uniform values) does not work, because all units in a layer would then compute the same thing and receive identical updates.

Unsupervised machine learning

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data (a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.

A central case of unsupervised learning is the problem of density estimation in statistics, though unsupervised learning encompasses many other problems (and solutions) involving summarizing and explaining key features of the data.

For example, author disambiguation is an unsupervised learning task which can automatically detect "correlated features". A set of author instances may be close together with respect to a few features. Unless we manually tag who the real authors are, we cannot measure the accuracy of the categorization/classification.

Observation: Similar words appear in similar contexts. That is how a machine learns that "cat" and "kitty" are similar/related. This is a typical example of unsupervised learning, and word2vec and similar techniques are very important examples of it. The context is a sliding window of surrounding words.

Error Analysis and Accuracy improvement design

The recommended approach to solving machine learning problems is to:

  • Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

Note: Stemming software can be used to recognize discount, discounts, discounting as same noun in NLP.

Precision/recall analysis is very useful when classifying "skewed classes". e.g. if people having cancer are 0.05%, a blind algorithm which says "nobody has cancer" is right 99.95% of the time! But it fails the precision/recall check (its recall is 0).

If you raise the threshold for h(x) from 0.5 to 0.9, to avoid false positives, you will get higher precision but lower recall. Sometimes you may want to reduce the threshold to get higher recall and lower precision.

Compare 2 algorithms using F1 Score = 2 * (P * R) / (P + R); higher is better. i.e. when P = R = 1, F1 Score = 1; it is < 1 otherwise.

Measure precision (P) and recall (R) on the cross validation set and choose the value of threshold which maximizes 2PR / (P + R).
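
The skewed-class example can be worked through in a few lines of Python. The population of 10,000 is an assumed figure that makes 0.05% come out to 5 patients, and treating an undefined precision (no positive predictions at all) as 0 is a convention adopted here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # convention: 0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 10,000 people, 5 of whom have cancer (0.05%).
# The blind classifier predicts "no cancer" for everyone.
tp, fp, fn, tn = 0, 0, 5, 9995
accuracy = (tp + tn) / (tp + fp + fn + tn)
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, p, r, f1)

# A classifier that finds 4 of the 5 patients with 1 false alarm.
p2, r2, f1_2 = precision_recall_f1(tp=4, fp=1, fn=1)
print(p2, r2, f1_2)
```

The blind classifier wins on accuracy but scores an F1 of 0, while the second classifier gets a high F1 despite making mistakes.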

Various learning algo exist to classify between confusable words: {to, two, too}, {then, than} :

  • Perceptron (logistic regression)
  • Winnow
  • Memory based
  • Naive Bayes

Most of these algorithms do about equally well as long as you supply enough data. It is not who has the best "algorithm" -- it is who has the best data.

Support Vector Machine

It is a "large margin" classifier, meaning it tries to find the decision boundary that is as far as possible from the nearest points of each class. Beware: an outlier may influence it too much. Computationally better compared to h(x) with sigmoid function calls.

What is Machine Learning?

Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition.

Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Example: playing checkers.

E = the experience of playing many games of checkers.
T = the task of playing checkers.
P = the probability that the program will win the next game.

Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

e.g. stock/house prices history.

It is also called "regression" problem - i.e. we're trying to predict a continuous value output. Namely the price. May fit using straight line or quadratic equation, etc.

e.g. tumor size, age, is_benign data. We need to predict if the tumor is malignant or benign. This is a classification problem.

Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Note: Classification problem is a subset of regression since we can always generate discrete values using a continuous function ??

Note: Unsuitable for author disambiguation unless we actively curate large set of authors and let the machine learn from that.

To visualize the examples in 2D graph, the regression problem may look like y = f(x) where y is what you want to predict where we consider only one feature x. (In Practice there is N features i.e. N dimensions.)

For classification, we may cluster set of points (with different colors) given features x, y. The color of the point classifies. We may draw lines or circles to separate one set of values with others.

Unsupervised Learning

We just give the dataset and ask to find structure and correlations within the data. This amounts to recognizing inter-dependencies among various features. Visually this is like recognizing various clusters and classifications.

e.g. Google news articles. The clusters of articles may be identified wrt various topics wrt location, time, language. In terms of visualizing it is like recognizing clusters -- bunch of data points in 2D graph given various features. e.g. News may recognize that the crime rates are higher during winter on specific dates. We do not tell what features are "inputs" and which ones are "outputs". The learning algorithms should find this automatically.

The google news automatically identifies all "Oil price hike" articles together, then "war conflicts" articles together etc.

The genome data automatically clusters humans. Social network analysis. Market segmentation. Astronomical data analysis.

Cocktail party problem -- separate overlapping audio channels recorded by microphones placed at different positions. Note that it is a non-clustering unsupervised learning problem: eliminating noise and finding structure in a chaotic environment.

It is done using single line of code:

[S, s, v] = svd((repmat(sum(x.*x, 1), size(x,1),1).*x)*x')

We use the Octave programming environment (or MATLAB), which will make life easier. Octave is an open-source alternative that is largely compatible with MATLAB.

svd == singular value decomposition (linear algebra function).

Also useful to prototype algorithm faster using octave env, then later do some micro-optimization using python or java or C.

Maths Background

  • Hard tanh(x) = x when -1 <= x <= +1; otherwise it is clipped to the hard limits -1 or +1 respectively.
  • The softmax activation function returns the probability distribution over mutually exclusive output classes.
  • sigmoid() function maps the (-inf, +inf) interval to (0, 1).
  • If we want to get multiple classifications per output (e.g. car + person), we use sigmoid, not softmax.
  • RELU - Rectified linear -- e.g. f(x) = max(0, x) ; The value never goes below zero.
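
The activation functions listed above can be written in a few lines of NumPy. This is a minimal sketch; the max-subtraction inside softmax is a common numerical-stability trick and not something these notes prescribe.

```python
import numpy as np

def hard_tanh(x):
    """Identity on [-1, 1], clipped to -1 / +1 outside."""
    return np.clip(x, -1.0, 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    """Probability distribution over mutually exclusive classes."""
    e = np.exp(x - np.max(x))     # subtracting the max avoids overflow
    return e / e.sum()

scores = np.array([-2.0, 0.3, 5.0])
print(hard_tanh(scores))
print(relu(scores))
print(sigmoid(scores))     # independent per-class values; need not sum to 1
print(softmax(scores))     # sums to 1 across classes
```

The last two lines show why sigmoid suits multi-label outputs (each score judged on its own) while softmax suits a single choice among classes.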

Neural Networks Deep Learning Coursera course by Andrew Ng

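RELU - Rectified Linear Unit -- It is a piecewise linear function: zero up to a threshold, then a straight line. A shifted and scaled example: y = 0 when x < 10; y = 4 (x - 10) when x >= 10. The plain form is y = max(0, x).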

Single neuron neural network can be imagined to implement a simple linear equation like t0 + t1 * x; (given input x).

Neurons can also be called "units". e.g. 3 units (3 neurons) and 10 input attributes are all densely connected. Each of the 3 units may compute a higher order relevant attribute that will be fed to the next layer.

Much of neural networks value lies in supervised learning applications like :

- House properties to predict price - Standard Neural network
- Online Advertisement to maximize clicks
- Image / Photo tagging / classification - CNN
- Audio automatic transcription - Sequence stream of data - RNN
- English to Chinese translation - more complex RNN
- Image radar info for autonomous driving -- Custom/Hybrid CNN/RNN network.

  • Neural networks are especially impressive in dealing with unstructured data.

  • Deep Learning takes off now because we can process large amount of data now.

  • The sigmoid function was used as activation function as approximation alternative for RELU function so that it is differentiable and easier to deal with mathematically for optimization, but it was computationally inefficient. Falling back to RELU implementation improved runtime performance for gradient descent.

  • Suppose you have 3 input features, a single hidden layer of 4 activation units, and an output layer of 1 activation unit which outputs the logistic result (1 or 0):

    z = W x + b     where W is the 4x3 weight matrix of the hidden layer.
    i.e. There are 12 weight parameters for 3 input features and 4 activation units.
    i.e. Each activation unit assigns independent weights to each feature.
    
  • For hidden layers, tanh (whose output range is -1 to +1) is almost always a better choice of activation function than sigmoid:

    tanh(z) = (e^z - e^-z)/(e^z + e^-z)   [ Range: -1 to +1 ]
    
  • For the output layer, sigmoid may be preferred (with 0 to 1 output). The problem with both tanh and sigmoid is that for inputs of large magnitude the slope is almost 0, hence the gradient is too small and convergence takes a long time. This is where RELU is better: its slope is 1 whenever z > 0, so convergence is fast. One caveat is that when z < 0 the slope is 0, so learning stalls for that unit; in practice enough units have z > 0. One way to deal with that is the "leaky RELU" function, which keeps a small positive slope when z < 0, e.g. a = max(0.01 * z, z). Convergence may still be slow in that region, but it is better than with sigmoid.

  • For a specific application, there are a lot of design choices, like which activation function to use.

  • Why use an activation function instead of g(z) = z ? A linear activation function can only fit a linear model: a composition of linear layers is itself linear, so such a neural network is no more expressive than standard linear or logistic regression. Only with a non-linear activation function (in the hidden layers) will you be able to capture non-linear dependencies.

    If your output is a real value (regression rather than yes/no classification), then it is better to use a linear function for the output layer, but the hidden layers must still have non-linear activation functions.
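
The point about g(z) = z can be checked numerically. The sketch below uses arbitrary random weights (an assumption for illustration only) to show that two layers with identity activation equal a single linear layer, while a tanh hidden layer does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 5))          # 5 examples with 3 features each

# Two layers with identity activation g(z) = z ...
two_layer = W2 @ (W1 @ x + b1) + b2

# ... equal one linear layer with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# With a non-linear hidden activation the collapse no longer holds.
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
```

No matter how many identity-activation layers are stacked, the same algebra folds them into one weight matrix and one bias.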

Gradient Descent Algorithm for Neural Networks

Say one input layer (layer 0), one hidden layer (layer 1), one output layer (layer 2) with only one activation unit (i.e. binary classification output). Note the following notation:

n0 = Number of input features (tuple size for each x). Let this be 3.
n1 = Hidden layer size. i.e. activation units.   Let this be 4.
n2 = Output layer size. i.e. activation unit. Let this be 1 as we consider binary classification.

W1 = Weight matrix for layer 1. This is 4x3. Every row in W matrix represents activation unit. Every column associated feature weight.
     W(i,j) represents the weight the activation unit i assigns for the j'th feature in x.

W2 = Weight matrix for layer 2. This is 1x4. i.e. Number of activation units here is just 1. 
     The input feature size of layer n is the units size of layer (n-1).

b1 = Constant bias introduced in Layer 1.  Size is  4x1. Each activation unit has its own bias.
b2 = Constant bias introduced in Layer 2.  Size is  1x1. 

Z1 = W1 * X + b1           // This is Linear output result for Layer 1. Result is 4x1 values.
A1 = g(Z1) = sigmoid(Z1)   // This is non-Linear output result for Layer 1. 4x1 size.
                           // If the relation is all linear, you can have g = Identity function.
                           // Otherwise this maps (-inf, +inf) to (0, 1).

Z2 = W2 * A1 + b2          // Linear output for output layer 2. Result size is: 1x1
A2 = g(Z2) = sigmoid(Z2)   // Logistic non-linear output. Result size is: 1x1

Note: There is no W0, b0.
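
The notation above can be traced as a forward pass in NumPy. This is a sketch with made-up random inputs and weights; note that b1 needs shape 4x1 (one bias per hidden unit) for Z1 = W1 * X + b1 to be 4x1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n0, n1, n2 = 3, 4, 1                       # layer sizes from the notes
rng = np.random.default_rng(0)
x = rng.standard_normal((n0, 1))           # one example as a column vector

W1 = rng.standard_normal((n1, n0)) * 0.01  # 4x3
b1 = np.zeros((n1, 1))                     # 4x1: one bias per hidden unit
W2 = rng.standard_normal((n2, n1)) * 0.01  # 1x4
b2 = np.zeros((n2, 1))                     # 1x1

Z1 = W1 @ x + b1                           # 4x1 linear output of layer 1
A1 = sigmoid(Z1)                           # 4x1 non-linear output of layer 1
Z2 = W2 @ A1 + b2                          # 1x1 linear output of layer 2
A2 = sigmoid(Z2)                           # 1x1 predicted probability of class 1
```

With m examples stacked as columns, x becomes 3xm and every Z and A gains m columns; the weight and bias shapes stay the same.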

Total Cost of this neural network:  J(W1, W2, b1, b2) = 1/m * Sum( Cost(predicted-y, expected-y) )

If we randomly initialize W and b, we can then move the W's and b's step by step in the direction that reduces the cost.

The cost function is differentiable with respect to each parameter, so its partial derivatives can be computed analytically, which is cheap. Otherwise,
we could still estimate each one numerically, keeping all parameters fixed except the one being varied.

Let us use the notation:

dW[i] =  dJ/dWi     : Partial derivative of Cost function wrt weights for layer i. 
                      Note: Actual partial derivative is calculated for each weight in that weight matrix.

db[i] =  dJ/dbi     : Partial derivative wrt bias parameters of layer i.

Algorithm:

Repeat until the cost converges:

    Compute predicted values: y1 ... ym
    This involves forward propagation and computing Z's and A's.

    For the given W, b:

    Compute values for dW[2] and db[2] i.e. for Layer 2.
    This is where backpropagation comes in.

    // Computing dW2 is quite involved; some high level details are given below ...
    dZ2 = A2 - Y                                     // Y is the expected values
    dW2 = 1/m * dZ2 * A1'                            // A1' = Transpose of previous layer output
    db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)

    Compute values for dW[1] and db[1] i.e. for Layer 1.

    // As before, the computation is involved. g'(Z1) is the derivative of the
    // hidden activation; for sigmoid it is A1 .* (1 - A1).
    dZ1 = (W2' * dZ2) .* g'(Z1)                      // elementwise product
    dW1 = 1/m * dZ1 * X'
    db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)

    // Update only after all gradients are computed, so that dZ1 uses the old W2.
    W2 = W2 - alpha * dW[2];   // Adjust the weights for Layer 2.
    b2 = b2 - alpha * db[2];   // Adjust biases for Layer 2.
    W1 = W1 - alpha * dW[1];   // Adjust the weights for Layer 1.
    b1 = b1 - alpha * db[1];   // Adjust biases for Layer 1.
  • Note: You have to cache the Z and A values computed for each layer during forward propagation, since they are reused in backward propagation.

  • Initializing the weight matrices to zero does not work for neural networks, because of the symmetry among activation units: every unit in a layer would compute the same output and receive the same update. It does work for plain logistic regression, which has no hidden layer.

  • High level:

    Forward Propagation:
       Input a[l-1]
       Output a[l], Cache(z[l]), Cache W[l], b[l]
    
    Backward Propagation:
       Input da[l]
       Output da[l-1], dW[l], db[l]
    
  • Parameters are the W's and b's. Hyperparameters: learning rate alpha, #iterations, #hidden layers, #hidden units, choice of activation functions e.g. sigmoid. Also: momentum, mini-batch size, regularization parameters.
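
The whole loop for the 3-4-1 network fits in a few lines of NumPy. This is a sketch under stated assumptions: the toy dataset, the learning rate 0.5, the 1000 iterations and the 0.1 weight scale are all made up for illustration, the hidden activation is sigmoid, and the cost is the cross-entropy loss that matches dZ2 = A2 - Y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 40                                         # number of training examples
X = rng.standard_normal((3, m))                # n0 = 3 features, one column per example
Y = (X.sum(axis=0, keepdims=True) > 0) * 1.0   # toy labels, shape (1, m)

# Small random weights break the symmetry between hidden units.
W1, b1 = rng.standard_normal((4, 3)) * 0.1, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.1, np.zeros((1, 1))
alpha = 0.5                                    # learning rate (hyperparameter)

costs = []
for _ in range(1000):
    # Forward propagation (A1 is kept for the backward pass).
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

    # Backward propagation.
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)         # sigmoid'(Z1) = A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Update all parameters only after every gradient has been computed.
    W2 -= alpha * dW2
    b2 -= alpha * db2
    W1 -= alpha * dW1
    b1 -= alpha * db1
```

Plotting `costs` against the iteration number is a quick way to check that the chosen learning rate makes the cost go down.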

Learning By Example

Differentiate Supervised vs Unsupervised Learning ?

  • Supervised learning problems are obvious ...:
    • stock/house prices history. (Goal is to predict a continuous variable - A regression problem)
    • Tumor size, age, is_benign. (Goal is to predict yes/no - A classification problem)
  • Unsupervised learning problems are relatively not obvious ...:
    • It usually involves finding structure and correlations within the data.
    • This amounts to recognizing inter-dependencies among various features.
    • Example: Group news articles and cluster similar articles together under the right "topics":
      • An algorithm groups all articles related to 'Oil price hike' together.
      • This is like placing balls of similar color close to each other in N-dimensional space.
      • If there are many articles referring to 'Oil price hike', the news system learns by itself that it is now a 'trending topic'. This amounts to recognizing dark spots in a heatmap.
      • It may also learn that 'Oil' and 'Price' are somewhat related words (even if word2vec does not already know that).
      • Two articles may be close to each other in 3-D, but very far apart in the 4th dimension. The similarity score may still favor them as long as there is sufficient overlap. The algorithm may even learn that, in reality, they are not far apart in the 4th dimension either, by redefining its understanding of the 4th dimension.
    • Example: Find the hot "trending" topics on Twitter:
      • Usually we do not teach the algorithm by manually identifying the 'hot topics' and tagging them.
      • The hot topics are identified by counting the frequency of the words (and semantically closer words).
      • If there are two separate trending topics, say 'cricket' and 'eclipse', the algorithm does not think they are related. i.e. The frequency of certain words occurring together right now does not mean they are related, so they are recognized as 2 different hot trending topics. Maybe if those two words always occurred together over a period of time (which is unlikely), the algorithm would learn that they are related. (Suppose everybody plays cricket to celebrate an eclipse, duh ...)
    • Not all unsupervised learning problems are 'clustering' based. e.g. Cocktail party problem :
      • separate overlapping audio sources using microphones placed at different positions.
      • Here we don't have N objects with M features ...
      • Instead, we have 1 object (streaming wav file) with a sequence of bits ... (billions of them)
      • We still recognize the patterns in the bits and separate them ... The problem is to decompose the mixed recording into N separate streams.
      • We do not even need an audio domain specific algorithm here. Just give raw uncompressed input streams of 'Pitch', 'Volume', etc.
    • We also usually refer to unsupervised learning data as unlabeled data: we want to assign (classification) 'labels' to it, but we do not know any specific set of predefined 'labels' beforehand.

Simple Linear Regression Examples

  • Suppose house price is just a linear function of total size (for a given locality):
    • Say Price(size) = min_price + (a_factor * size) OR h(x) = theta0 + theta1 * x ; where h = hypothesis.
    • This univariate linear regression has 2 coefficients: theta0 and theta1.
    • Given a training set, we fit a straight line that passes as close as possible to all the price points.
    • i.e. We calculate theta0 and theta1 from the training set. i.e. We get the hypothesis function.
    • theta0 and theta1 are found by minimizing the MSE (mean squared error), i.e. (1/m) * ( (h1-y1)^2 + (h2-y2)^2 + ... + (hm-ym)^2 )

        ^
        │
        │
        │
  Price │             ╱o
        │            ╱
        │         o ╱      h(Size) = ϴ0 + ϴ1 * Size
        │          ╱
        │         ╱o
        │      o ╱
        │       ╱
        └──────╱─────────────────────────────────────>
                         Size

(ϴ0 is the intercept on the Price axis and ϴ1 is the slope of the line.)
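
The fit described above can be sketched with NumPy's least-squares polynomial fit. The sizes and prices here are hypothetical and lie exactly on a line, so the recovered coefficients are easy to check by eye.

```python
import numpy as np

# Hypothetical training set: house sizes and prices lying exactly on a line.
size = np.array([500.0, 1000.0, 1500.0, 2000.0])
price = 50.0 + 0.1 * size

# np.polyfit with degree 1 returns [slope, intercept], i.e. [theta1, theta0].
theta1, theta0 = np.polyfit(size, price, 1)

def h(x):
    # Hypothesis function h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

mse = np.mean((h(size) - price) ** 2)   # mean squared error on the training set
```

With noisy real data the MSE would not be zero, but the line returned is still the one that minimizes it.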

FAQ

What is the difference between machine learning, deep learning and neural networks ?

Machine learning came directly from the minds of the early AI crowd, and the algorithmic approaches over the years included decision tree learning, inductive logic programming, clustering, reinforcement learning, and Bayesian networks among others. As we know, none achieved the ultimate goal of General AI.

Arthur Samuel coined the phrase "machine learning" in 1959, not long after the term AI appeared, defining it as the ability to learn without being explicitly programmed.

Using an algorithm to predict an outcome of an event is not machine learning. Using the outcome of your prediction to improve future predictions is.

A machine learning algorithm need not require a large amount of data. For example, a chess learning algorithm may just require hand coding of the "chess rules" (decision trees), but it may play against itself and find strategies to improve its game.

Benefits of neural networks:

* Extract meaning from complicated data
* Detect trends and identify patterns too complex for humans to notice
* Learn by example
* Speed advantages

A basic neural network may have two to three layers, while a deep learning network may have dozens or hundreds.

Simple machine learning is equivalent to having a single neuron: it fits the input features to a linear/non-linear model by finding relative weights. So there are N input features, perhaps N+1 parameters, and a model. What if you want to design a model with more and more non-linear terms, e.g. x1.x2 or x1.x2.x3, but you have no idea whether some product of a subset of features is really important for the output or not? This is where a probabilistic design comes into effect ... try some random combinations of features, drop the ones which do not really contribute to the output, and keep the ones which do ...

A breakthrough in machine learning came through computer vision.

Consider the task of identifying a STOP sign ... Look for something which has 8 sides, red color, distinct letters S, T, O, P one after another, etc.

CNN - Convolutional Neural Networks

  • Mainly an algorithm optimized for computer vision.
  • Recent variations include residual networks.
  • Can do neural style transfer to generate art.

iris prediction using sklearn

Build model and Predict

# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes

iris = load_iris()     #  type(iris)  is  <class 'sklearn.utils.Bunch'>

# store feature matrix in "X"
X = iris.data                    # X.shape  is (150, 4)

# store response vector in "y"
y = iris.target                  # y.shape  is (150,)

## scikit-learn 4-step modeling pattern
# **Step 1:** Import the class you plan to use

from sklearn.neighbors import KNeighborsClassifier

# **Step 2:** "Instantiate" the "estimator"
#  "Estimator" is scikit-learn's term for model

knn = KNeighborsClassifier(n_neighbors=1)

# Can specify tuning parameters (aka "hyperparameters") during this step
print(knn)

      KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=1, n_neighbors=1, p=2,
                 weights='uniform')

# **Step 3:** Fit the model with data (aka "model training")

knn.fit(X, y)

# **Step 4:** Predict the response for a new observation

knn.predict([[3, 5, 4, 2]])   # Can pass multiple rows/observations.

array([2])   # Single prediction for single input row. Numpy Array.

X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

array([2, 1])     # Multiple inputs / multiple predictions.

## Using a different value for K (i.e. change hyperparameters )
# knn = KNeighborsClassifier(n_neighbors=5)
# Then you can repeat the whole thing ...
...

array([1, 1])    # Different models (or tunings) may predict differently.

## Using a different classification model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X, y)
logreg.predict(X_new)

array([2, 0])

Evaluate the model

# Evaluation procedure #1: Train and test on the entire dataset

# compute classification accuracy for the logistic regression model

y_pred = logreg.predict(X)           # Pass entire training set on the trained model.
len(y_pred)                          # It is 150

from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))   # y vs y_pred

      0.96

# Known as training accuracy when you train and test the model on the same data

# For KNN (K=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

     0.96666              # Slightly better than logistic

# For KNN with K=1, the training accuracy is 1 !!!

     1.0                  # With K=1, every training point is its own nearest neighbor,
                          # so the model simply memorizes the training data.
                          # This does not imply KNN with K=1 is a good model.

# Evaluation procedure #2: Train and test set split.

from sklearn.model_selection import train_test_split

# Randomly pick 40% rows for testing. Remaining 60% is for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# compare actual response values (y_test) with predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred)) 

    0.95 

# Repeat above for KNN = 5  i.e. split train/test data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

  0.9666666666666667

# For K=1, it yields  0.95

# Can we locate an even better value for K?
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt

# allow plots to appear within the notebook

# %matplotlib inline # For notebook as magic command

# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')

# You can see that Accuracy is max between K=7 to k=15
# You may decide to pick K=11 for your model, for example.

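# Make a prediction for a single out-of-sample observation.
# Retrain the chosen model (K=11) on all available data first.

knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X, y)
knn.predict([[3, 5, 4, 2]])

    array([1])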
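
To see why K=1 always scores 100% on its own training data, it helps to write KNN out by hand. This is a from-scratch sketch in NumPy, not scikit-learn's implementation; `knn_predict` and the toy 2-D data are hypothetical names and values chosen for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k):
    """Predict a label for each row of X_new by majority vote of its k nearest training rows."""
    predictions = []
    for row in X_new:
        # Euclidean distance from this row to every training row.
        distances = np.sqrt(((X_train - row) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Hypothetical toy data: two well-separated classes in 2-D.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# With k=1 every training point is its own nearest neighbor (distance 0),
# so predictions on the training set just reproduce the training labels.
train_pred = knn_predict(X_train, y_train, X_train, k=1)

# Genuinely new points are classified by a vote among the 3 closest rows.
new_pred = knn_predict(X_train, y_train, np.array([[1.1, 1.0], [4.9, 5.0]]), k=3)
```

This is why training accuracy says little about KNN and a held-out test set is needed to compare values of K.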

Modeling workflow using pandas, seaborn, scikit-learn

Overview

When you just have the data and want to build model, this workflow is typically very useful. This applies for supervised learning, where you have labeled data:

  • You load data using pandas
  • Use seaborn for ad hoc visualization to understand the data better
  • Use scikit-learn to build models and fine tune them, then measure the accuracy of the model.

Using seaborn

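import seaborn as sns

# 'data' is a pandas DataFrame of the advertising dataset with columns TV, Radio, Newspaper, Sales.
# Note: seaborn >= 0.9 uses 'height'; older versions called this parameter 'size'.
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')

# 3 Scatter plots displayed i.e. TV vs Sales, Radio Vs Sales, Newspaper Vs Sales

# Visualize iris dataset correlation.
# Here 'iris' is a DataFrame with a 'species' column, not the scikit-learn Bunch used earlier.

iris = sns.load_dataset('iris')
sns.pairplot(iris,hue='species',palette='Dark2')

# Create a kde plot of sepal_length versus sepal width for setosa species of flower.
# Note: seaborn >= 0.11 uses x=, y=, fill= and thresh=; older versions used shade= and shade_lowest=.

setosa = iris[iris['species']=='setosa']
sns.kdeplot(x=setosa['sepal_width'], y=setosa['sepal_length'],
            cmap="plasma", fill=True, thresh=0.05)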

# A single-variable density plot is a histogram or a 1-D density curve.
# A two-variable density plot is a 2-D KDE (kernel density estimate),
# shown as density contours, 2D bins or hex bins.

Preparing input for scikit-learn

  • scikit-learn expects numpy matrix for feature input and numpy array for response
  • Since pandas Dataframe is built on top of numpy, they are compatible.
X = data[['TV', 'Radio', 'Newspaper']]    # Feature matrix - Data frame - numpy matrix compatible.
y = data['Sales']                         # Response Vector - Pandas Series - numpy array compatible. 

Using scikit-learn

  • The iris problem is a supervised learning classification problem where KNN and logistic regression models are applicable.
  • The sales vs advertising model is a supervised learning regression problem where we can apply LinearRegression model supported by scikit-learn.
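from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Default split: 75% training, 25% testing. random_state only makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linreg = LinearRegression()

linreg.fit(X_train, y_train)    # fit the model to the training data (learn the coefficients)

          LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)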

Evaluating Linear Regression Models

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)  

          #  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

print(linreg.intercept_)
print(linreg.coef_)

          # 2.8769666223179318
          # [0.04656457 0.17915812 0.00345046]

# pair the feature names with the coefficients
feature_cols = ['TV', 'Radio', 'Newspaper']    # same columns used to build X
list(zip(feature_cols, linreg.coef_))

   #  [('TV', 0.04656456787415029),
   #   ('Radio', 0.17915812245088839),
   #   ('Newspaper', 0.003450464711180378)]

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# We need an **evaluation metric** in order to compare our predictions with the actual values!

# Evaluation metrics for classification problems, such as **accuracy**, are not applicable here.

# **Three common evaluation metrics** for regression problems
# (y_test holds the true values for the test rows):

import numpy as np
from sklearn import metrics

#  1. **Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

print(metrics.mean_absolute_error(y_test, y_pred))

#  2. **Mean Squared Error** (MSE) is the mean of the squared errors:

print(metrics.mean_squared_error(y_test, y_pred))

#  3. **Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# **MAE** is the easiest to understand, because it's the average absolute error.
# **MSE** is more popular than MAE, because MSE "punishes" larger errors.
# **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

# Example :
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

          #  1.404651423032895

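## Feature selection

# Does **Newspaper** "belong" in our model? In other words, does it improve the quality of our predictions?
# Let's **remove it** from the model, refit, and check the RMSE!

feature_cols = ['TV', 'Radio']
X = data[feature_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)   # use the same split settings as for the full model
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

          # 1.3879034699382886

# You may decide to remove this feature since the error decreases by removing it.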
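
The three regression metrics are simple enough to compute by hand, which makes the difference between them concrete. The true and predicted values below are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical true and predicted values.
y_true = np.array([100.0, 50.0, 30.0, 20.0])
y_hat = np.array([90.0, 50.0, 50.0, 30.0])

errors = y_true - y_hat                 # [10, 0, -20, -10]
mae = np.mean(np.abs(errors))           # (10 + 0 + 20 + 10) / 4 = 10.0
mse = np.mean(errors ** 2)              # (100 + 0 + 400 + 100) / 4 = 150.0
rmse = np.sqrt(mse)                     # sqrt(150), about 12.25
```

The single error of 20 contributes 400 of the 600 total squared error, which is how MSE and RMSE "punish" large errors more than MAE does.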

Model Cross Validation Strategies

You have standard methodologies for cross validation of your classification model or regression model.

  • We aim to maximize the test accuracy score for a classification model.
  • We aim to minimize the RMSE (root mean square error) for a regression model.
  • The default split of 75% training and 25% testing is good in general, but you may test with different ratios for testing e.g. 5%, 10%, 15%, 20%, 25% and see how the accuracy changes. This is my own intuition, not a standard procedure. Let us assume the split percentage is not a problem.
  • Let us say 10% testing data is good, but how do we pick the 10% ? The 10% must represent the response classes in roughly the same proportions as the full dataset. e.g. If it excluded certain classes altogether, we could never validate whether the model is able to predict those classes properly.
  • Another approach is to randomize and pick 10% for test data (still ensuring it has proper representation of the response classes). The idea of K-fold cross validation is based on this approach. We split the data into 10 equal parts (folds) and use each fold in turn as the test data while training on the other 9. This is the 10-fold cross validation procedure. The test accuracy is calculated as the "average" accuracy of these 10 iterations.
  • The K-fold cross validation process runs about K times slower than a single train/test split.
  • scikit-learn's cross_val_score(model, ... cv=10,...) function does the splitting, fits the given model on each of the K splits, and gathers the K scores in an array. The average of these scores is more reliable than the score from any one specific split.
  • For feature selection, you may re-calculate the K-fold test accuracy with/without a feature.
  • To find the optimal number of neighbors for the KNN algorithm (this is model parameter tuning), you may use the K-fold method to measure the cross-validated accuracy for different values of n_neighbors (say 5 to 25) and plot the n_neighbors vs accuracy graph. You may choose the number of neighbors for which the accuracy is maximum.
  • To summarize, k-fold cross validation is useful for 1) tuning parameters, 2) choosing models and 3) selecting features.
  • For automating cross validation for "tuning parameters" of a model, the scikit-learn class sklearn.model_selection.GridSearchCV is very useful.
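
The fold bookkeeping behind K-fold cross validation can be sketched without scikit-learn. The helper `k_fold_indices` below is a hypothetical name, and it does not shuffle or stratify; it only shows that every sample lands in the test set exactly once.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is in the test set exactly once."""
    indices = np.arange(n_samples)
    # np.array_split tolerates n_samples not being divisible by n_folds.
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx

folds = list(k_fold_indices(25, 5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(i, len(train_idx), test_idx)
```

A model would be fitted on `train_idx` and scored on `test_idx` once per fold, and the K scores averaged.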

Synopsis :

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))    # For classification problem.

# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False).split(range(25))

# Above we split 25 integers, but in practice it would be 25 rows of data.

#
# Enumerating the folds ...
#
for iteration, data in enumerate(kf, start=1):
    print( iteration, data[0], str(data[1]) )   
    # Prints:  1  [ 20-elements-training-set ] [ 5-test-set ]

## Cross-validation recommendations

1. K can be any number, but **K=10** is generally recommended
2. For classification problems, **stratified sampling** is recommended for creating the folds
    - Each response class should be represented with equal proportions in each of the K folds
    - scikit-learn's cross_val_score function does this by default


# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)

    [1.         0.93333333 1.         1.         0.86666667 0.93333333
     0.93333333 1.         1.         1.        ]

# Note: cross_val_score() makes intelligent splits 10 times, each time applying the
# given model and computing the accuracy for that split.

# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())

    0.9666666666666668

#
# Typically you use cross_val_score() to choose between knn or logistic regression.

# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

    0.9533333333333334

#
# But you can also use this to find specific K for KNN during initial stages.
# Note that this runs slowest since you vary K on outer loop for neighbors
# and vary folds in inner loop for test/train splits.
# Use it judiciously.
#

# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)
# Choose k for which you have highest accuracy score.

# Example scoring calculation for Linear Regression ...
# (here X, y are the advertising features and Sales, not the iris data)
# 10-fold cross-validation with all three features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)

    [-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754  -1.74163618
     -8.17338214 -2.11409746 -3.04273109 -2.45281793]

mse_scores = -scores
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)

    [1.88689808 1.81595022 1.44548731 1.68069713 1.14139187 1.31971064
     2.85891276 1.45399362 1.7443426  1.56614748]

# calculate the average RMSE
print(rmse_scores.mean())

    1.6913531708051797

# You may compare this value with the RMSE of some other regression model ...
# For example, you can use it for feature selection by recomputing the RMSE with the same model
# on a new feature set (excluding/including certain features) and then drop/add features.
#

# You can also automate the model parameter tuning (like n_neighbors for KNN) using GridSearchCV.

from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
print(param_grid)

   {'n_neighbors': [1, 2, 3, ..., 28, 29, 30]}

grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)

# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

    mean_test_score std_test_score  params

0   0.960000        0.053333        {'n_neighbors': 1}
1   0.953333        0.052068        {'n_neighbors': 2}
2   .....

# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

0.98
  {'n_neighbors': 13}
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=13, p=2,
             weights='uniform')

# if you want to optimize based on multiple parameters like n_neighbors as well as weights ...
weight_options = ['uniform', 'distance']
param_grid = dict(n_neighbors=k_range, weights=weight_options)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)

# examine the best model, as before
print(grid.best_score_)
print(grid.best_params_)

    0.98
    {'n_neighbors': 13, 'weights': 'uniform'}

# End of Grid Search for Cross Validation.

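# RandomizedSearchCV searches a random subset of the parameter combinations, and you control the computational "budget"

from sklearn.model_selection import RandomizedSearchCV
# parameter "distributions" to sample from (assumed here to mirror the grid above)
param_dist = dict(n_neighbors=k_range, weights=['uniform', 'distance'])
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)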
...
# examine the best model
print(rand.best_score_)
print(rand.best_params_)

    0.98
    {'weights': 'uniform', 'n_neighbors': 18}
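
The cost difference between the two search strategies is easy to count with the standard library. This sketch only enumerates candidates and does not fit any model; the seed and the use of sampling without replacement are assumptions made for illustration.

```python
import itertools
import random

k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']

# A grid search tries every combination: 30 * 2 = 60 candidates.
grid_candidates = [
    {'n_neighbors': k, 'weights': w}
    for k, w in itertools.product(k_range, weight_options)
]

# A randomized search tries only n_iter of them (sampled without replacement here).
rng = random.Random(5)
random_candidates = rng.sample(grid_candidates, 10)

# With cv=10, each candidate costs 10 model fits.
grid_fits = len(grid_candidates) * 10        # 600
random_fits = len(random_candidates) * 10    # 100
```

The randomized search gives up the guarantee of finding the best grid point in exchange for a fixed, much smaller number of fits.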

NLP and Machine Learning Intro

Using CountVectorizer to transform text into feature matrix

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# learn the 'vocabulary' of the training data (occurs in-place)
#
# Note: We are not associating any Y values here. This is just to learn Universal vocabulary.
#
vect.fit(simple_train)

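vect.get_feature_names_out()    # on scikit-learn < 1.0 use vect.get_feature_names()

    array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

#
# We got 6 distinct words out of the 3 input sentences.
# (The one-letter word 'a' is dropped by the default tokenizer.)
# Now transform the training data into a 'document-term matrix'.
#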
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Row format>

# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])

# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

   cab   call    me  please  tonight   you
0    0   1       0   0       1         1
1    1   1       1   0       0         0
2    0   1       1   2       0         0

# This is called the **Bag of Words** or "Bag of n-grams" representation.
# We completely ignore the relative position of the words in the document.

# check the type of the document-term matrix
type(simple_train_dtm)

     scipy.sparse.csr.csr_matrix

# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)    1    
  (0, 4)    1
  (0, 5)    1
  (1, 0)    1
  (1, 1)    1
  (1, 2)    1
  (2, 1)    1
  (2, 2)    1
  (2, 3)    2   # Third input has the 4th word "Please" 2 times.

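# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]    # example new message; words not in the learned vocabulary are ignored
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()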

 array([[0, 1, 1, 1, 0, 0]], dtype=int64)
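
The fit/transform steps above can be imitated with the standard library to show what a document-term matrix contains. This is a simplified sketch, not CountVectorizer's actual code; `tokenize`, `fit_vocabulary` and `transform` are hypothetical helper names, and the test sentence is an example input.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep words of two or more characters,
    # which mirrors CountVectorizer's default behavior.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # Sorted list of the distinct tokens seen in the training documents.
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    # One row per document, one column per vocabulary word, entries are counts.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocab = fit_vocabulary(simple_train)
train_dtm = transform(simple_train, vocab)

# Words outside the learned vocabulary (here "don") are silently dropped.
test_dtm = transform(["please don't call me"], vocab)
```

The vocabulary is learned only from the training documents, so the test row has the same six columns as the training matrix.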

NLP Example: Train model to identify SMS spam

  • We have SMS input data labeled as spam/ham.
  • We train a model on this input and use it to predict whether new messages are spam.
# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
# sms = pd.read_table(url, header=None, names=['label', 'message'])

# examine the shape
sms.shape

      (5572, 2)

# examine the first 10 rows
sms.head(10)

   label     message
0    ham     Go until jurong point, crazy.. Available only ...
1    ham     Ok lar... Joking wif u oni...
2    spam    Free entry in 2 a wkly comp to win FA Cup fina...
3    ham     U dun say so early hor... U c already then say...
4    ham     Nah I don't think he goes to usf, he lives aro...
5    spam    FreeMsg Hey there darling it's been 3 week's n...
6    ham     Even my brother is not like to speak with me. ...
7    ham     As per your request 'Melle Melle (Oru Minnamin...
8    spam    WINNER!! As a valued network customer you have...
9    spam    Had your mobile 11 months or more? U R entitle...

# examine the class distribution
sms.label.value_counts()

      ham     4825
      spam     747
      Name: label, dtype: int64

# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

# check that the conversion worked
sms.head(10)

          label  message                                             label_num
      0  ham     Go until jurong point, crazy.. Available only ...   0
      1  ham     Ok lar... Joking wif u oni...                       0

# how to define X and y (from the iris data) for use with a MODEL
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)

        (150, 4)
        (150,)

# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

        (5572,)
        (5572,)

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# ## Part 4: Vectorizing our dataset

# instantiate the vectorizer
vect = CountVectorizer()

# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

# examine the document-term matrix
X_train_dtm

      <4179x7456 sparse matrix of type '<type 'numpy.int64'>'       # 4179 inputs; 7456 words.
        with 55209 stored elements in Compressed Sparse Row format>

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

      <1393x7456 sparse matrix of type '<type 'numpy.int64'>'      # 1393 Inputs; 7456 words;
          with 17604 stored elements in Compressed Sparse Row format>

# ## Part 5: Building and evaluating a model
# 
# We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):
# 
# > The multinomial Naive Bayes classifier is suitable for classification with **discrete features** 
# (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. 
# However, in practice, fractional counts such as tf-idf may also work.

# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

    Wall time: 3 ms

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

   0.98851399856424982

# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

   array([[1203,    5],
          [  11,  174]])


# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

      574               Waiting for your call.
      3375             Also andros ice etc etc
      45      No calls..messages..missed calls
      3415             No pic. Please re-send.
      1988    No calls..messages..missed calls
      Name: message, dtype: object

# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

    3132    LookAtMe!: Thanks for your purchase of a video...
    5       FreeMsg Hey there darling it's been 3 week's n...
    3530    Xmas & New Years Eve tickets are now on sale f...
    684     Hi I'm sue. I am 20 years old and work as a la...
    1875    Would you like to see my XXX pics they are so ...
    1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
    4298    thesmszone.com lets you send free anonymous an...
    4949    Hi this is Amy, we will be sending you a free ...
    2821    INTERFLORA - It's not too late to order Inter...
    2247    Hi ya babe x u 4goten bout me?' scammers getti...
    4514    Money i have won wining number 946 wot do i do...
    Name: message, dtype: object

# example false negative
X_test[3132]

  "LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, 
  you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

    array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
             1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

  0.98664310005369604

# ## Part 6: Comparing models
# 
# We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):
# 
# > Logistic regression, despite its name, is a **linear model for classification** rather than regression. 
# Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) 
# or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a 
# single trial are modeled using a logistic function.

# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

# train the model using X_train_dtm
logreg.fit(X_train_dtm, y_train)

    Wall time: 39 ms

# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

  array([ 0.01269556,  0.00347183,  0.00616517, ...,  0.03354907,
        0.99725053,  0.00157706])

# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

   0.9877961234745154

# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

   0.99368176123143015
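The vectorize-then-classify steps above can be sketched end to end on a tiny corpus. This is only an illustration under stated assumptions: the messages and labels below are invented (they are not taken from sms.tsv), and `make_pipeline` is used merely as a convenient way to chain the same `CountVectorizer` and `MultinomialNB` steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration (not taken from sms.tsv).
train_msgs = [
    "win a free prize now",
    "free entry win cash",
    "are you coming home tonight",
    "call me when you reach home",
]
train_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

# The pipeline learns the vocabulary from the training text only,
# then trains the classifier on the resulting document-term matrix.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_msgs, train_labels)

# New messages are transformed with the fitted vocabulary before prediction.
test_msgs = ["free cash prize", "see you at home tonight"]
pred = pipe.predict(test_msgs)
proba = pipe.predict_proba(test_msgs)[:, 1]   # estimated P(spam) per message
print(pred, proba)
```

Keeping both steps in one pipeline object makes it harder to accidentally fit the vocabulary on test data.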

Tuning the vectorizer

# 
# show default parameters for CountVectorizer
vect

    CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1), preprocessor=None, stop_words=None,
            strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
            tokenizer=None, vocabulary=None)


# However, the vectorizer is worth tuning.
# 
# - **stop_words:** string {'english'}, list, or None (default)
#     - If 'english', a built-in stop word list for English is used.
#     - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
#     - If None, no stop words will be used.

# remove English stop words
vect = CountVectorizer(stop_words='english')

# - **ngram_range:** tuple (min_n, max_n), default=(1, 1)
#     - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
#     - All values of n such that min_n <= n <= max_n will be used.
#
#  ngram is used to recognize words appearing together as single entity/word.
#
#      >>> v = CountVectorizer(ngram_range=(1, 2))
#      >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
#      {u'an': 0,
#       u'an apple': 1,
#       u'apple': 2,
#       u'apple day': 3,
#       u'away': 4,
#       u'day': 5,
#       u'day keeps': 6,
#       u'doctor': 7,
#       u'doctor away': 8,
#       u'keeps': 9,
#       u'keeps the': 10,
#       u'the': 11,
#       u'the doctor': 12}

# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))

# - **max_df:** float in range [0.0, 1.0] or int, default=1.0
#     - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
#     - If float, the parameter represents a proportion of documents.
#     - If integer, the parameter represents an absolute count.

# ignore terms that appear in more than 50% of the documents # Max document frequency
vect = CountVectorizer(max_df=0.5)


# - **min_df:** float in range [0.0, 1.0] or int, default=1
#     - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
#     - If float, the parameter represents a proportion of documents.
#     - If integer, the parameter represents an absolute count.

# only keep terms that appear in at least 2 documents   # Min document frequency.
vect = CountVectorizer(min_df=2)
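To see what these parameters do, the following sketch fits several vectorizers on the same three documents and compares vocabulary sizes. The documents and the helper `vocab_size` are invented for illustration; only the `CountVectorizer` parameters come from the notes above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents invented for illustration.
docs = [
    "please call me tonight",
    "please call a cab",
    "please send the report tonight",
]

def vocab_size(**params):
    """Fit a CountVectorizer with the given parameters; return its vocabulary size."""
    return len(CountVectorizer(**params).fit(docs).vocabulary_)

sizes = {
    "default": vocab_size(),
    "stop_words": vocab_size(stop_words="english"),
    "bigrams": vocab_size(ngram_range=(1, 2)),
    "max_df=0.9": vocab_size(max_df=0.9),   # drops terms found in more than 90% of documents
    "min_df=2": vocab_size(min_df=2),       # keeps terms found in at least 2 documents
}
print(sizes)
```

Here "please" occurs in every document, so `max_df=0.9` removes it, while `min_df=2` keeps only the terms shared by at least two documents.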

Cross Validation Metrics

Overview

  • For regression problems, root mean square error (RMSE) is a good metric.
  • For classification problems, accuracy is the basic metric.
  • Metrics such as the F1 score are more informative than accuracy when the classes are skewed.
  • The confusion matrix is simply TP (True Positive), TN, FP and FN arranged as a 2x2 matrix.
  • sklearn.metrics.confusion_matrix(y_test, y_pred_class) gives you the matrix.
  • Null accuracy: accuracy that could be achieved by always predicting the most frequent class.
  • You can also change the classification threshold probability to redefine the classification results in order to optimize one metric over another (e.g. Fire Alarm vs Spam Filter).

Terminology              Computation        sklearn function    Comments

Accuracy                 (TP+TN)/Total      accuracy_score()    Not good for skewed classes.
                                                                1 - Accuracy = Classification Error Rate.

Sensitivity/Recall/TPR   TP/(TP+FN)         recall_score()      Given that it is positive, how accurate is your prediction?
                                                                Should be high when you can't afford to miss a positive.
                                                                E.g. Cancer screening. FP is OK, but FN is not OK.

Specificity (TNR)        TN/(TN+FP)         --not-available--   Given that it is negative, how accurate is your prediction?
                                                                E.g. Defense interview. FN is OK, but FP is not OK.

False Positive Rate      FP/(FP+TN)         ----------------    False Alarm Rate; 1 - Specificity.

Precision                TP/(TP+FP)         precision_score()   Probability that an alarm is genuine.
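As a worked example of the formulas in the table, the sketch below recomputes each metric from the SMS confusion matrix shown earlier in these notes. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), and the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix from the SMS example above, in sklearn's layout:
# rows = actual class, columns = predicted class, i.e. [[TN, FP], [FN, TP]].
cm = np.array([[1203, 5],
               [11, 174]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tp + tn) / total
recall = tp / (tp + fn)              # sensitivity, TPR
specificity = tn / (tn + fp)         # TNR, computed by hand
fpr = fp / (fp + tn)                 # equals 1 - specificity
precision = tp / (tp + fp)
null_accuracy = max(tn + fp, fn + tp) / total   # always predict the majority class

print(accuracy, recall, specificity, fpr, precision, null_accuracy)
```

The accuracy matches the value reported by `metrics.accuracy_score` above, and comparing it with the null accuracy shows how much the model adds over always predicting ham.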

F1 Score

F1 Score is the harmonic mean of Precision and Recall:

F1 Score =   2 * ( Precision * Recall / (Precision+Recall) )
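A short sketch of the formula above, using hypothetical precision and recall values picked only to show how the harmonic mean behaves when the two numbers are far apart.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical values chosen for illustration.
balanced = f1(0.8, 0.8)     # equal inputs: F1 equals that common value
lopsided = f1(0.98, 0.10)   # the arithmetic mean would be 0.54; F1 is far lower
print(balanced, lopsided)
```

Because the harmonic mean is dominated by the smaller value, a model cannot reach a high F1 score by excelling at only one of the two metrics.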

Note

Spam filter optimizes for Precision (High probability that Alarm is genuine) and Specificity (True Negative Rate).

Tune Classification Threshold

You can also change the classification threshold probability to redefine the classification results in order to optimize one metric over another (e.g. Fire Alarm vs Spam Filter). You may want high precision over recall, or the reverse.

# print the first 10 predicted responses
logreg.predict(X_test)[0:10]

  array([0, 0, ..., 1, 0, 1])  # For multi-class of 4 classes it may contain 0,1,2,3 

# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]

    array([[0.63247571, 0.36752429],    # Only 2 classes: [P(class_0), P(class_1)]
           [0.71643656, 0.28356344],    # For multi-class of 4, this will have 4 probabilities.
           [0.71104114, 0.28895886],
           ....
          ])

# Note: For multi-class classification of N classes, You can peek into closest m classes !!
#       You can use logreg.predict_proba() to predict the closest top m classifications.

# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1] 

       array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,
             0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])

# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Visualize the probability range using histogram...

import matplotlib.pyplot as plt
plt.hist(y_pred_prob, bins=8)  # 8 bars with in actual range of values.
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability (0 to 1) ')
plt.ylabel('Frequency')


# **Decrease the threshold** to **increase the sensitivity** of the classifier

# Raise the fire alarm whenever the predicted probability is greater than 0.3

from sklearn.preprocessing import binarize
# threshold must be passed by keyword in recent scikit-learn releases
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]

# print the first 10 predicted probabilities
y_pred_prob[0:10]

# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]

# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))

# Now you can re-calculate sensitivity, precision etc.
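The effect of lowering the threshold can be sketched on a small self-contained example. The labels and probabilities below are invented for illustration (they are not the diabetes or SMS data), and the comparison `y_prob > threshold` applies the same rule as `binarize`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predicted probabilities, invented for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.10, 0.55, 0.80, 0.60, 0.32])

results = {}
for threshold in (0.5, 0.3):
    y_hat = (y_prob > threshold).astype(int)   # 1 where probability exceeds the threshold
    results[threshold] = {
        "confusion": confusion_matrix(y_true, y_hat).tolist(),
        "recall": recall_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat),
    }
    print(threshold, results[threshold])
```

On this toy data, moving the threshold from 0.5 to 0.3 raises recall (sensitivity) while precision drops, which is the trade-off described above.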

ROC Curve as metrics to evaluate model

Overview

A receiver operating characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

Specifically, the plot shows the True Positive Rate (y-axis) against the False Positive Rate (x-axis). We can increase the True Positive Rate at the cost of increasing the False Positive Rate.

e.g. We can be conservative and raise fire alarms even at the slightest hint, we will get high TPR at the cost of raising many false alarms.

e.g. If we can raise TPR to 0.8 using FPR of 0.3, the model is fairly good for fire alarms. But you may need different thresholds for different problems.

Note

  • Given the threshold, you can find the TPR and FPR for that threshold.
  • You can fix a target FPR and work backwards to the probability threshold for classification.

AUC (Area Under the Curve) is a single-number indicator; the higher the value, the better. A higher value implies a steep increase in TPR for even a small increase in FPR.

Synopsis:

# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))

   0.7245657568238213  # Single number summary metric for your classifier model. Higher, better.

# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
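The note above about fixing an FPR and working backwards to a probability threshold can be sketched as follows. The data and the false-alarm budget `max_fpr` are assumptions made for illustration; only `roc_curve` and its three return values come from the synopsis.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores, invented for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.10, 0.55, 0.80, 0.60, 0.32])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

max_fpr = 0.2                          # assumed false-alarm budget
ok = np.where(fpr <= max_fpr)[0]       # ROC points that stay within the budget
best = ok[np.argmax(tpr[ok])]          # the one with the highest TPR
print(thresholds[best], fpr[best], tpr[best])
```

Predicting the positive class whenever the probability is at or above `thresholds[best]` then gives the best sensitivity available without exceeding the chosen false positive rate on this data.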

Confusion Matrix Vs ROC/AUC

  • Confusion matrix gives good insights and numbers to compute various metrics. It is also useful for multi-class problems (unlike ROC/AUC).
  • The ROC/AUC method does not require you to set a classification threshold (advantage). It is applicable only to binary classification (limitation).

Word Embeddings

How does word embedding work ?

A word embedding maps a word to an N-dimensional vector. The embedding W is initialized with a random vector for each word, and it learns meaningful vectors in order to perform some task.

WordEmbedding(word) => vector (r1, r2, ... rn)

For example, one task we might train a network for is predicting whether a 5-gram (sequence of five words) is valid.

In the process of 'training' the model, we get the word embeddings as a side effect of the 'hidden layer' learning its weights.

Nobody knows exactly what each individual dimension signifies, since it is determined by the hidden layer.

Let us say, we are told that we can only have 4 dimensional properties and each property can assume only binary values.
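A minimal sketch of the lookup described in this section, assuming W is stored as a matrix with one row per word. The toy vocabulary, the random initialization with NumPy, and the idea of concatenating the five vectors as network input are assumptions made for illustration; training (which would update W) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "sat", "mat", "the"]        # toy vocabulary (assumed)
word_to_idx = {w: i for i, w in enumerate(vocab)}

n_dims = 4                                          # N, the embedding size
W = rng.normal(size=(len(vocab), n_dims))           # random initial vector per word

def embed(word):
    """Return the row of W that belongs to `word`."""
    return W[word_to_idx[word]]

# One possible network input for a 5-gram: the concatenated word vectors.
five_gram = ["the", "cat", "sat", "the", "mat"]
x = np.concatenate([embed(w) for w in five_gram])
print(embed("cat").shape, x.shape)
```

Every occurrence of a word maps to the same row of W, so whatever the training task teaches the network about "the" is shared across all positions.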

DataScience Approach

Understand and Prepare your Data

  • Understand your data at high level:
    • Look at summary statistics
    • Box plot can identify your outliers
    • Density plots and histograms to see the spread of data
    • Scatter plots to spot bivariate relationships
  • Deal with missing data.
    • Should you fill with nearest neighbor or average value ?
    • Should you choose model which is more tolerant to missing value ? ** To Do **
    • Should you drop the entire observation if a feature is missing ?
  • Decide what to do with outliers.
    • Could it be due to bad data collection or legitimate extreme values ?
    • Would dropping extreme 5% values help or hurt prediction ?
  • Augment/Refactor your data
    • Normalize/Standardize your data by rescaling
    • Reduce dimensionality (PCA)
    • Capture more complex relationships (e.g. x*y as additional feature)
    • Transform model to be easier to interpret.
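The checklist above can be sketched with pandas on a small table. The data is hypothetical, and the 1.5 * IQR rule used to flag outliers is one common convention (the one box plots usually draw), not something these notes prescribe.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset, invented for illustration.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, 29],
    "salary": [48000, 54000, 61000, np.nan, 58000, 250000],
})

print(df.describe())           # summary statistics
print(df.isnull().sum())       # missing values per column

filled = df.fillna(df.mean())  # option 1: fill with the column mean
dropped = df.dropna()          # option 2: drop incomplete observations

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = filled["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (filled["salary"] < q1 - 1.5 * iqr) | (filled["salary"] > q3 + 1.5 * iqr)

# Standardize each column, then add an interaction feature.
scaled = (filled - filled.mean()) / filled.std()
scaled["age_x_salary"] = scaled["age"] * scaled["salary"]
print(is_outlier.tolist())
```

Note that the single extreme salary also inflates the mean used for filling, which is one reason to decide how to treat outliers before imputing.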

Problem Definition

  • If you have labelled data, then it is likely to be a supervised learning problem.
  • Want to find structure within unlabelled data? It is an unsupervised learning problem. The desired output is usually a set of clusters.
  • If you want to optimize your objective through continuous experiments or by interacting with an environment, it is a reinforcement learning problem.
  • If the desired output is a number, it is a regression problem. If the desired output is a yes/no answer, it is a binary classification problem (logistic regression is a common choice). If the output identifies one class from a finite set of groups, it is a general classification problem.
  • If the goal is to detect anomalies, it is an anomaly detection problem. [ SVM Classifier probably works better for this compared to other continuous equation classifier ?? ]
  • Define your limitations in terms of storage and computation power. Is realtime response a requirement? Should the learning be fast? Should the prediction be fast? Autonomous driving requires the prediction response time to be fast, though training the model could take a long time.

Identify potential Algorithms

  • Even though Linear and Logistic regression differ in their goal (predicting a number vs yes/no classification), the algorithms are somewhat similar. Logistic regression passes the linear score through the non-linear sigmoid function, and the resulting number is converted to 1/0 using a threshold.
  • Logistic regression is more stable -- can take more input features, less sensitive about having correlated features (unlike Naive Bayes, Decision Trees and SVM) ?
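The relationship between the linear score, the sigmoid and the threshold can be sketched in a few lines of NumPy. The weights, bias and inputs are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs, invented for illustration.
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [0.5, 3.0]])

z = X @ w + b                     # the same linear score linear regression computes
p = sigmoid(z)                    # squashed into a probability
labels = (p >= 0.5).astype(int)   # threshold converts the probability to 1/0
print(z, p, labels)
```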

Note

Why are Naive Bayes, Decision Trees and SVM more sensitive to correlation and redundancy among input features?

  • Decision trees are typically used in conjunction with ensemble techniques like Random Forest or Gradient Tree Boosting. They easily handle feature interactions, have a strong theoretical foundation and are easy to comprehend. A lot of game theory uses tree pruning. The disadvantages of decision trees are:

    1) They do not support on-the-fly learning; you need to rebuild your trees when you add more training data or features ??

    2) They take a lot of memory (more features, larger the tree?)

    3) They easily overfit, but random forests mitigate this issue.

  • K-Means clustering of input set. If you want to make clusters of input groups based on all features without even understanding the original structure, this is the way to go. The disadvantage is you have to guess the best K or you do some trial and error.

  • Principal component analysis (PCA) can help to discard unnecessary and redundant features, to keep the model simpler, faster and more stable.

  • SVM can handle high dimensions well. Provides high accuracy. The cons are ... it is difficult to tune and memory intensive. Example domains: Character recognition, Stock market price, text categorization.
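The trial-and-error search for K mentioned in the K-Means bullet is often done by fitting several values of K and watching where the inertia stops dropping sharply (the "elbow" heuristic, which is a common convention rather than something these notes specify). The synthetic data and its three centers below are assumptions made for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (assumed for the demo).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.6, random_state=0)

inertia = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_   # sum of squared distances to the closest centroid

for k, value in inertia.items():
    print(k, round(value, 1))
```

On data like this the inertia falls steeply up to K=3 and only slowly afterwards, which suggests K=3; on real data the elbow is usually less clear-cut.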

Note

What are discriminative vs generative models?

Supervised learning models are categorized as Discriminative or Generative.

Logistic regression, SVM, etc. are discriminative: they model the conditional probability P(y|x). They don't care about the probability distribution of the input.

Typical generative models include Naive Bayes, Gaussian Mixture Models, etc. They model the joint probability P(x,y), i.e. the probability distribution of the inputs as well as the outputs, so it is easy to 'generate' potential inputs from this distribution. Generative models need more training data since they represent the "universe".

A combination of generative/discriminative model is highly recommended and found to be useful.

  • Naive Bayes focuses on the joint probability of inputs and outputs. It cannot learn interactions between features. For problems where the independence assumption holds and all inputs are fairly orthogonal, it performs really well. It can be used as a 'Generative' model, where it can even generate possible inputs. This can be applied to:
    • sentiment analysis and text classification
    • Recommendation systems like Netflix, Amazon
    • To mark an email as spam or not spam
    • Face recognition
  • Random Forest is an ensemble of decision trees. It can solve both regression and classification problems with large data sets. It also helps identify the most significant variables among thousands of input variables. Random Forest scales to a large number of dimensions and generally gives quite acceptable performance. However, learning may be slow (depending on the parameterization) and it is not possible to iteratively improve the generated models. Finally, there are genetic algorithms, which scale well to any dimension and any data with minimal knowledge of the data itself; the simplest implementation is the microbial genetic algorithm.
  • Neural Networks learn the weights of the connections between neurons. Once all weights are trained, the network can be used to predict a class or a quantity. E.g. object recognition has recently been enormously improved using Deep Neural Networks. Applied to unsupervised learning tasks such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention. (Here the outputs may be labeled but the input features may not be labeled at all.) On the other hand, neural networks are very hard to interpret and their parameterization is extremely complex. They are also very resource and memory intensive.
  • Scikit Cheat Sheet to pick up the right algorithm : http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
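The claim that Random Forest helps identify the most significant variables can be illustrated with `feature_importances_`. The dataset is synthetic and built so that only the first three columns carry signal; that construction is an assumption made for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only the first 3 are informative
# (shuffle=False keeps the informative columns first). Invented for illustration.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

rfc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by impurity-based importance (the values sum to 1).
ranking = sorted(enumerate(rfc.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for idx, score in ranking[:5]:
    print(f"feature {idx}: {score:.3f}")
```

Impurity-based importances are a quick first look; they can be biased towards high-cardinality features, so treat the ranking as a hint rather than a verdict.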

Choosing right ML algorithm

Classifiers

Synopsis:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# K Neighbors Classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Random forest: an ensemble of decision trees.
rfc = RandomForestClassifier(n_estimators=200)

# Train the Support Vector Classifier
model = SVC()
# Since it is difficult to guess the parameters, it is often used with GridSearch.
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)

# Deep neural network classifier (requires `import tensorflow as tf`; feat_cols is defined elsewhere)
classifier = tf.estimator.DNNClassifier(hidden_units=[10, 20, 10],
        n_classes=2, feature_columns=feat_cols)

# Good when input is discrete. Like classifying spam sms based on spam words count.
classifier = MultinomialNB()

dtree = DecisionTreeClassifier()
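The grid search over SVC parameters from the synopsis can be run end to end as sketched below. The iris dataset and the reduced grid are choices made here to keep the example quick; they are not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A reduced grid keeps the search quick; the values are illustrative.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}

# refit=True retrains the best parameter combination on the whole training set.
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)   # accuracy of the refit best model
print(grid.best_params_, test_accuracy)
```

After fitting, `grid` behaves like the best estimator, so `grid.predict(...)` can be used directly on new data.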

Pros

  • Firstly it has a regularisation parameter, which makes the user think about avoiding over-fitting.
  • Secondly it uses the kernel trick, so you can build in expert knowledge about the problem via engineering the kernel. (e.g. pass a loss function)
  • Thirdly an SVM is defined by a convex optimisation problem (no local minima) for which there are efficient methods (e.g. SMO).
  • This is good for data with a large number of features and with missing data ???

Cons

  • In a way the SVM moves the problem of over-fitting from optimising the parameters to model selection. ??
  • Often the optimal choice of kernel and regularisation parameters means you end up with all data being support vectors. If you really want a sparse kernel machine, use something that was designed to be sparse from the outset (rather than being a useful byproduct), such as the Informative Vector Machine ???

Kernel Logistic Regression

Visualization

Choice of Colors

  • ColorBrewer project colors include both fairly light and fairly dark colors (Brewer 2017).
  • Okabe-Ito colors
  • ggplot colors.

Heatmap and colors

  • You can use ColorBrewer blues -- light blue to dark blue
  • Green/yellow to dark brown
  • Yellow to dark purple

Code Recipes

Preprocessing Template

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Data.csv')     # This is a dataframe of N columns.

X = dataset.iloc[:, :-1].values       # Everything except last column
y = dataset.iloc[:, 3].values         # Last column (index 3 in this 4-column dataset)

# Taking care of missing data. We want to fill the missing values by mean of other values.
# Alternative way: Use the value from the nearest neighbor vs mean.
#

from sklearn.impute import SimpleImputer
# Imputer was removed from sklearn.preprocessing; SimpleImputer always imputes column-wise.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# from sklearn.preprocessing import LabelEncoder, OneHotEncoder, etc.

Feature Scaling

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0 0]
 [0 0]
 [1 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]

scaled_data.mean(axis = 0)
array([0., 0.])

scaled_data.std(axis = 0)
array([1., 1.])

Heatmap

You can plot heatmap using seaborn as below :

# Visualize correlation ..
import seaborn as sns
sns.heatmap(df.corr())
sns.heatmap(df.corr(), annot=True)

#
# Visualize missing data ...
# Missing values appear as yellow bars.

sns.heatmap(train.isnull(), yticklabels=False, cbar=False,
     cmap='viridis')


# Also see:
sns.pairplot(df)             # Plot for each pair
sns.histplot(df['Price'], kde=True)    # Plot distribution of Price (distplot is deprecated in recent seaborn).

Setting seaborn styling

sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
sns.countplot(x='Survived', hue='Pclass', data=train, palette='rainbow')
sns.histplot(train['Age'].dropna(), kde=False, color='darkred', bins=30)   # histplot replaces the deprecated distplot

Dataframe Creation

Different ways of creating Dataframe quickly:

import numpy as np
import pandas as pd
mx3 = np.arange(6).reshape(3,2)

       array([[0, 1],
           [2, 3],
           [4, 5]])

df = pd.DataFrame(mx3,index='R1 R2 R3'.split(), columns='C1 C2'.split())
df
              C1  C2
          R1   0   1
          R2   2   3
          R3   4   5

# If you want to fill the data column wise. ....
mx3 = np.arange(6).reshape(3,2).transpose()

     array([[0, 2, 4],
       [1, 3, 5]])

df = pd.DataFrame(mx3,index='R1 R2'.split(), columns='C1 C2 C3'.split())
df
         C1  C2  C3
    R1   0   2   4
    R2   1   3   5

Notes From Deep Learning Book by Josh Patterson

  • A feed-forward multilayer neural network can represent any function, given enough artificial neuron units.
  • It is generally trained by a learning algorithm called backpropagation learning.
  • Backpropagation uses gradient descent on the weights. It can get stuck in local minima, but mostly it performs well.
  • Historically, backpropagation has been considered slow. But with powerful GPUs and CPUs, it is now more effective.
  • An activation function transforms a neuron's weighted input into its output. e.g. sigmoid function is an activation function.
  • For classification, though the sigmoid function was historically the gold standard, it has fallen out of favor in the modern era.
  • In DL4J (Deep Learning for Java), all neurons in a layer have the same activation function.
  • RELU (Rectified Linear Unit) function is 0 when x is negative; when x > 0, y = x or y = k.x, i.e. the output rises linearly.
  • The mechanics of trial-and-error and delayed reward are key features of reinforcement learning.
  • We define the four major architectures of deep networks:
    • Unsupervised Pretrained Networks
    • Convolutional Neural Networks
    • Recurrent Neural Networks
    • Recursive Neural Networks
  • Automatic feature extraction is the focus of research.
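The two activation functions mentioned in the list above can be written in a few lines of NumPy. This is a sketch only: the function names and the slope parameter `k` (standing in for the "y = k.x" form) are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x, k=1.0):
    """0 for negative inputs; k * x for positive inputs."""
    return k * np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(relu(x))
print(relu(x, k=2.0))
```

Unlike the sigmoid, the ReLU output does not flatten out for large positive inputs.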