Misc Notes On Machine Learning
Contents
local
Use anaconda with python 3.5. Synopsis:
$ anaconda-navigator
$ pip install jupyter_contrib_nbextensions
Invoke jupyter notebook, spyder, orange app from anaconda navigator.
Install and use rstudio
Install Anaconda, RStudio
Get data sets from http://www.superdatascience.com/machine-learning
Get the data (like csv files)
Pre-process the data. Python Synopsis:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Data = (country, age, salary, purchased). We need to predict last column.
dataset = pd.read_csv('Data.csv') # Use spyder variable explorer to examine dataset.
X = dataset.iloc[:, :-1].values # Features. All lines, all columns except last column.
y = dataset.iloc[:, 3].values # Target (dependent) variable as a 1-D array (the 'purchased' column).
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split # was sklearn.cross_validation in older scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
....
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf') # other types of kernel are: 'linear', 'poly', etc.
regressor.fit(X, y)
y_pred = regressor.predict([[6.5]]) # predict() expects a 2-D array of samples
Pre-process the data. R Synopsis:
# Importing the dataset
dataset = read.csv('Data.csv')
View(dataset) # Inside the RStudio interface
If your input data contains a categorical variable column (e.g. state = 'NY', 'CA', etc.) that cannot be translated into a numerical quantity, you may have to introduce dummy variables: is_ny, is_ca, ..., etc. You should not include all of the dummy variables: since the last one is determined by the others (if x1 and x2 are known, x3 is implied), keeping them all adds redundancy to the dataset. I.e. include "all but one" of the dummy variables.
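A minimal pandas sketch of this "all but one" dummy encoding (the column names and values below are hypothetical):

import pandas as pd

# Hypothetical data frame with a categorical 'state' column.
df = pd.DataFrame({'state': ['NY', 'CA', 'NY', 'FL'], 'salary': [70, 80, 65, 72]})

# drop_first=True keeps "all but one" dummy column, avoiding the redundancy
# (dummy variable trap) described above.
encoded = pd.get_dummies(df, columns=['state'], drop_first=True)
print(encoded)    # columns: salary, state_FL, state_NY (state_CA is the dropped baseline)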
There are different approaches to remove insignificant features from input features:
In R programs, summary(regressor) gives a very good summary with coefficients, P-values, etc. R also automatically replaces enumerated values with dummy variables internally (e.g. State1, State2, etc. for the enumerated values 'NY', 'CA', etc.).
Regression types:
The scipy stack includes the following libraries (see http://scipy.org):
- Numpy
- Pandas
- Scipy
- Matplotlib
- IPython
- SymPy
Cufflinks is a library for easy interactive Pandas charting with Plotly; it binds Plotly directly to pandas dataframes. Charts created with cufflinks are synced with your online Plotly account, and you can also go offline and create charts locally.
One of the most prominent and convenient Python libraries in deep learning is Keras, which can function either on top of TensorFlow or Theano. Let’s discuss some details about all of them.
Now we will look at NLP libraries.
Now we will look at Data Mining and Statistics libraries ...
Machine Learning == Science of getting computers to learn without being explicitly programmed.
Examples: anti-spam, recommendations, search ranking, click-stream data, OCR, teaching a helicopter to fly itself, etc.
An anti-spam system watches you classifying spam mails and learns from that what a spam mail looks like.
Tip: Learn by example ... and relate your problem to other examples.
Machine learning algorithm types:
- Supervised, unsupervised.
- Others: reinforcement learning, recommender systems.
There are learning algorithms like SVM (support vector machines) that can deal with a very large (effectively infinite) number of features. SVM works by finding the maximum-margin hyperplane for classification. In 3D space a hyperplane is a 2D plane; in 2D space it is a line; in N-dimensional space it is an (N-1)-dimensional plane. For a sparse matrix over a large set of features, it is easier to find a hyperplane compared to other algorithms.
Training Set => Learning Algorithm => Hypothesis Function h. h takes input features x and yields the output result y.
y = h(x) = t0 + t1 * x for a linear dependency; find t0, t1. With 2 unknown variables, 2 data points give you the answer. With 3 data points an exact answer may not exist, so we need to find the "closest" (t0, t1) that minimizes the error w.r.t. all data points.
Cost function: J(t0, t1) = (1/(2m)) * sum_i ( h(x_i) - y_i )^2
Find t0, t1 to minimize the error sum( (y1 - h1)^2 + (y2 - h2)^2 + ... ). For the 3 data points (1, 2), (2, 4), (3, 6), this would be: error square sum = (2 - (t0 + t1*1))^2 + (4 - (t0 + t1*2))^2 + (6 - (t0 + t1*3))^2. Note that the error sum is a function of the 2 unknown variables t0, t1. Gradient descent finds the slope and then moves toward lower values of it; since there are 2 unknown variables instead of 1, we have to use partial derivatives. MSE = J(t0, t1). The d/dx operator differentiates treating "x" as the variable.
At the minimum the slope is 0; analytically you could set the derivatives to 0 and solve for the unknowns.
Gradient descent instead takes dJ/dt0 and dJ/dt1 (each computed holding the other parameter fixed) to find the downhill direction for t0 and t1. Once you find the direction to traverse toward the minimum, the iteration follows this process:
t0 := t0 - alpha * (dJ/dt0) t1 := t1 - alpha * (dJ/dt1)
t0 := t0 - alpha * (1/m) * sum(all_distances) // sum(distances) = (h(x1) - y1) + (h(x2) - y2) + etc.
// Note: negative distance influences the direction.
t1 := t1 - alpha * (1/m) * sum(dist1 * x1[row1] + dist2 * x1[row2] + ... )
// To detect the weightage for the nth feature, magnify the distance by that feature's value, in order to find the convergence direction.
How to derive dJ/dt0 and dJ/dt1 ... See Lecture 2 notes pdf.
Where alpha is the learning rate, e.g. 0.1.
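A minimal NumPy sketch of this batch gradient descent update, using the (1, 2), (2, 4), (3, 6) toy points from the cost-function example above:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)

t0, t1 = 0.0, 0.0
alpha = 0.1                        # learning rate

for _ in range(2000):
    h = t0 + t1 * x                # predictions
    dist = h - y                   # signed "distances"
    t0 -= alpha * (1.0 / m) * dist.sum()
    t1 -= alpha * (1.0 / m) * (dist * x).sum()

print(t0, t1)                      # approaches (0, 2) for this data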
Squared Error function works best for linear regression optimization.
A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points on the same line. E.g. an (x, y) graph may have concentric circles, each circle representing cost 10, 20, 30, etc.: the higher the cost, the bigger the circle. I.e. if you allow more error, more pairs of linear parameters (theta0, theta1) become acceptable.
Gradient descent may give you some local minimum which is not the global optimal minimum. If you visualize the cost surface, this shows up as a valley that is only locally the lowest point.
A convex function has only one global minimum and no other local minima. The squared error function is an example of a convex function.
For "multivariate linear regression" of n features, (x1, x2, ... xn), we need to find (t0, t1, .... tn), an n+1 tuple:
i.e. h(x) = t0 + t1*x1 + t2*x2 + .... + tn*xn
For convenience of matrix ops and notation, we define dummy feature x0 which is always a constant 1. So we have now n+1 features:
h(x) = transpose([x0, x1, ... xn ]) * [t0, ... tn ]
Feature scaling -- make sure features are on the same scale so that gradient descent converges quickly. E.g. x1 = house size ranges from 0 to 2000 sqft while the number of bedrooms ranges from 1 to 5. Get every feature into roughly the -1 to +1 range, e.g. divide the value by the max value.
Mean normalization (a feature scaling technique) transforms a feature so that it has mean approximately 0. E.g. x = (#bedrooms - 2) / 5 where 2 is the mean number of bedrooms and 5 is the max.
I.e. x = (x1 - mean(x)) / Range(x); if x varies from 30 to 50, the range is 20. Note: you can also choose to divide by the standard deviation instead of the range.
Another alternative to mean normalization is standardisation. It gives mean approximately 0 and std-deviation 1, but the individual values are not limited to the -1 to 1 range:
X.stand = (x - Mean(x)) / StdDev(x) # See sklearn's StandardScaler() for more info.
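A short scikit-learn sketch of both scalings (the feature matrix below is hypothetical):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: columns are [size_sqft, bedrooms].
X = np.array([[2100.0, 3], [1600.0, 2], [2400.0, 4], [1400.0, 1]])

# Min-max scaling squeezes each feature into a fixed range such as [-1, 1].
X_minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardisation gives mean ~0 and std-deviation 1 (values not limited to [-1, 1]).
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)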
A valid approach is to try alpha = 0.001, 0.01, 0.1, 1, etc.
Feature transformation -- if you suspect that only some features fit in a non-linear fashion, you can create your own derived features just for them. E.g. house area feature = length * width, so you may have t1 * width + t2 * length + t3 * area. You may choose to fit a few "important, selected features" non-linearly and keep the other simple features linear.
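A small sketch of adding such product/interaction features with scikit-learn's PolynomialFeatures (the data is illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical features: columns are [length, width].
X = np.array([[10.0, 5.0], [12.0, 6.0], [8.0, 4.0]])

# interaction_only=True adds just the product terms (length * width ~ area),
# keeping the original features as well.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)     # columns: length, width, length*width
print(X_poly)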
Normal equation (closed-form alternative to gradient descent): theta_params = Inverse(Transpose(X) * X) * Transpose(X) * y
m training examples, n features -- gradient descent vs. the matrix (normal equation) approach, pros and cons: usually the matrix approach is best since there is no need to choose alpha and iterate. However, gradient descent works best when there is a large number of features; the matrix approach is too slow when n is large, because Inverse(Matrix) has O(n^3) complexity. Up to n = 1000 we may still employ the matrix approach; when n > 10000, gradient descent is definitely better.
A matrix can be non-invertible (e.g. the zero matrix). pinv(A) is the pseudo-inverse routine in Octave, which gives you an "inverse" even when the matrix is not invertible -- yielding the closest approximation. X'X can be non-invertible if m < n (i.e. more features than samples) or when there are redundant features (e.g. length in meters and in feet).
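The same normal equation in NumPy, using pinv for robustness (toy data):

import numpy as np

# Normal equation: theta = pinv(X' X) X' y, with the dummy x0 = 1 column included.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # rows are [x0, x1]
y = np.array([2.0, 4.0, 6.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)                        # -> approximately [0., 2.]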
Octave tips and Synopsis:
1 == 2
ans = 0
1 ~= 2 % Note that it is ~= not !=
ans = 1
1 && 2 % also: 1 & 2 , 1 || 2 etc
PS1('>> ')
a = 5 ; % semicolon suppresses output
disp(sprintf('this is %0.3f', a));
A = [1 2 ; 3 4 ; 5 6 ] % No comma needed !
v = [ 1 2 3 ] % This is 1x3 matrix
v = [ 1; 2; 3 ] % This is 3x1 value vector
v = 1:6
v = 1 2 3 4 5 6 % This is 1x6 matrix
v = 1:0.1:2 % The 0.1 is delta. This gives 1x11 matrix.
ones(2, 3) % 2x3 matrix of all ones.
c = 2 * ones(2,3) % 2x3 matrix of all 2's.
zeros(1,3)
rand(3, 3)
I = eye(4) % Identity matrix.
help help
help randn
load features.dat
load ('features.dat') % Load csv or space separated data.
who % List variables in current scope
whos % list variables in detail
clear featuresX % unload variable
v = priceY(1:10) % extract subrange data
save hello.mat v % save variables into hello.mat. Saved in binary.
A(3, 2) % get element at 3rd row, 2nd col
A(:, 2) % all at 2nd col.
A([1 3], :) % Extract 1st and 3rd rows.
A(:, 2) = [10; 11; 12] % Assign 2nd col by given values.
A = [ A, [100; 200; 300] ] % append another column vector
A(:) % Put all elements in single col vector
C = [A B] % Join 2 matrices adjacent. Same as C = [A, B]
C = [A ; B] % Join Matrix B below A.
A * B
C = A .* B % Elementwise Multiply c(i,j) = a(i,j)*b(i,j).
C = A .^ 2 % Each element is squared.
abs(A) % elementwise abs operation applied.
A' % This is A transpose
[value, index] = max(a) % Get the max value in the vector and its position
find(a < 3) % given vector a, yields another vector of indexes
A = magic(3) % NxN matrix where cols, rows add up to constant
sum(a); prod(a); floor(a); ceil(a);
rand(3) % Random matrix.
max(rand(3), rand(3)) % gives 3x3 matrix each with max of 2 rands.
max(A, [], 1) % Take columnwise max
max(A, [], 2) % Take rowwise max
max(A(:)) % Take max of entire matrix
sum(A, 1) % sum it columnwise.
sum(A, 2) % sum it rowwise
A .* eye(9) % Initializes all non-diagonal elements to zero.
flipud(eye(9)) % Flips the matrix upside down (vertically)
t = [0:0.01:0.98] ; y1 = sin(2*pi*4*t)
plot(t, y1); y2 = cos(2*pi*4*t); plot(t, y2);
% To plot one on top of another
hold on; plot(t, y1); plot(t, y2, 'r');
xlabel('time'); ylabel('value')
legend('sin', 'cos') % displays color info
title('my plot')
cd 'c:/temp' ; print -dpng 'myplot.png'
close % close the plot
figure(1); plot(...); figure(2); plot(....)
subplot(1, 2, 1); % Divide the canvas into a 1x2 grid. Set the current pane to the first one.
plot(t, y1); % Plot in the first area of the canvas.
subplot(1, 2, 2); plot(t, y2); % Plot in the second area of the canvas.
axis([0.5 1 -1 1]) % Sets x, y ranges. ie. minx,maxx,miny,maxy
clf; % clear canvas
imagesc(A) % Visualize matrix values by color.
imagesc(A), colorbar, colormap gray
prediction = 0.0; % Iterative implementation.
for j = 1:n+1,
  prediction = prediction + theta(j) * x(j);
end;
prediction = theta' * x; % Vectorized implementation
Hypothesis representation.
Let us consider a binary classification problem, with output y as 1 or 0. We make use of the sigmoid function: g(z) = 1 / (1 + e^-z), where z = theta' * X. This function gives a value between 0 and 1.
z = 0 => e^-z = 1 => g(z) = 0.5; z = +inf => e^-z = 0 => g(z) = 1; z = -inf => e^-z = inf => g(z) = 0
Consider the training set X: a set of (x_i, y_i) where y_i is 0 or 1. We find theta based on the existing set of points, but we cannot allow output values < 0 or > 1, which is why the sigmoid is used. If we want to find the decision boundary, set h(x) = 0.5 and solve the equation.
If there is equal number of positive and negative samples, the decision boundary is exactly the fitting "curve". In case of logistic regression, this is where we get "0.5" as output.
So, we can say:
When theta' * X >= 0 => y = 1
When theta' * X < 0 => y = 0
The function z does not have to be linear. It could be circle and such.
In logistic regression, the cost for our hypothesis outputting (predicting) hθ(x) on a training example that has label y in {0,1} is:
point cost(hθ(x), y) = −log(hθ(x))      if y = 1
point cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0
cost J(theta) = (1/m) sum(costs)
Converting the if logic into single equation:
Cost(hθ(x),y)=−ylog(hθ(x))−(1−y)log(1−hθ(x))
J(θ) = (−1/m) Sum ( (y(i) * log(hθ(x(i)))) + (1−y(i)) * log(1−hθ(x(i))) )
Vectorized Implementation:
h=g(Xθ)
J(θ) = (1/m) * ( −y' * log(h)− (1−y)' * log(1−h))
Gradient of the cost is a vector of the same length as θ where the jth element (for j = 0,1,...,n) is defined as follows:
∂J(θ) / ∂θj = (1/m) * Sum_i( (hθ(x(i)) − y(i)) * xj(i) )
in other words ...
dJ/dtj = (1/m) * Sum ( distance * xj ) ; // xj = jth feature in each input entry
Remember that the general form of gradient descent is:
Repeat { θj := θj − alpha * ∂J/∂θj }
We can work out the derivative part using calculus to get:
Repeat { θj := θj − alpha * (1/m) * Sum_i( (hθ(x(i)) − y(i)) * xj(i) ) }
Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.
A vectorized implementation is:
θ := θ − (alpha/m) * X' * ( g(Xθ) − y )
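A minimal NumPy sketch of this vectorized cost and gradient (the toy data below is made up and already includes the x0 = 1 column):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    # Vectorized J(theta) and dJ/dtheta; X is m x (n+1) with the x0 = 1 column, y is 0/1.
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-1.0 / m) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))
    grad = (1.0 / m) * (X.T @ (h - y))
    return J, grad

# Toy, non-separable data and a few gradient-descent steps:
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):
    J, grad = cost_and_gradient(theta, X, y)
    theta -= alpha * grad
print(theta, J)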
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.
We first need to provide a function that evaluates the following two quantities for a given input value θ:
Cost Function: J(θ)
Partial derivative of cost function for each theta parameter: dJ/dt0, dJ/dt1, etc.
In Octave, there is fminunc() optimization function which does this:
function [jVal, gradient] = costFunction(theta)
jVal = [...code to compute J(theta)...];
gradient = [...code to compute derivative of J(theta)...];
end
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
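A rough Python counterpart of the fminunc() pattern, using scipy.optimize.minimize with L-BFGS (assuming SciPy is available; the cost/gradient functions and toy data are just illustrations):

import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return (-1.0 / len(y)) * (y @ np.log(h) + (1 - y) @ np.log(1 - h))

def gradient(theta, X, y):
    return (1.0 / len(y)) * (X.T @ (sigmoid(X @ theta) - y))

# Toy data with the x0 = 1 column included.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])

result = minimize(cost, x0=np.zeros(2), args=(X, y), jac=gradient, method='L-BFGS-B')
print(result.x, result.fun)    # optimized theta and final cost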
We basically choose one class and lump all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction:
y∈ {0,1...n}
h0(x) = P(y=0|x;θ)
h1(x) = P(y=1|x;θ)
....
prediction = max(hθ(x))
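A short scikit-learn sketch of this one-vs-all idea, using OneVsRestClassifier on the iris data (any multi-class-capable classifier could be substituted):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # classes 0, 1, 2

# One binary logistic regression per class; predict() picks the class whose
# classifier returns the highest score/probability.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
print(ovr.predict_proba(X[:5]))            # per-class probabilities h_k(x)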
The overfitting/underfitting terminology applies to both linear and logistic regression. There are two main options to address the issue of overfitting:
Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm (studied later in the course).
Regularization:
- Keep all the features, but reduce the magnitude of parameters θj. Regularization works well when we have a lot of slightly useful features.
Regularized cost function:
J(θ) = (−1/m) Sum ( (y(i) * log(hθ(x(i)))) + (1−y(i)) * log(1−hθ(x(i))) ) + ( lambda / (2m) ) sum(square(theta_parameters))
Gradient:
∂J(θ) / ∂θj = ( 1/m) * Sum( (hθ(x(i))−y(i)) * x(i) ) + (lambda/m) * θj % Correction only for j > 0
in other words ...
dJ/dtj = (1/m) * Sum ( distance * xj ) + lambda_correction ; // xj = jth feature in each input entry
Simple functions like AND and OR can be computed using a single compute layer, whereas a function like XOR is non-linear (and as such requires a new feature like x1*x2) and can be effectively computed using an additional hidden layer.
Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.
For backpropagation to work, we need to initialize theta to small random (non-symmetric) values -- initializing it as a zero matrix (or any uniform values) does not work.
Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data (a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.
A central case of unsupervised learning is the problem of density estimation in statistics,[1] though unsupervised learning encompasses many other problems (and solutions) involving summarizing and explaining key features of the data.
For example, author disambiguation is unsupervised learning which can auto-detect the correlated features. A set of author instances may be close together with respect to a few features. Unless we manually tag who the real "authors" are, we cannot measure the accuracy of the categorization/classification.
Observation: similar words appear in similar contexts. That is how a machine learns that "cat" and "kitty" are similar/related. This is a typical example of unsupervised learning. word2vec etc. are very important examples of unsupervised learning technologies. The context uses a sliding window of surrounding words.
The recommended approach to solving machine learning problems is to:
Note: Stemming software can be used to recognize discount, discounts, discounting as same noun in NLP.
Precision/Recall analysis is very useful in the presence of "skewed classes". E.g. if only 0.05% of people have cancer, a blind algorithm which says "nobody has cancer" is right 99.95% of the time, but it will fail the precision/recall check.
If you raise the threshold for h(x) from 0.5 to 0.9 to avoid false positives, you will get higher precision but lower recall. Sometimes you may instead want to lower the threshold to get higher recall at the cost of lower precision.
Compare 2 algorithms using the F1 Score = 2 * P * R / (P + R); the higher the value, the better. I.e. when P = R = 1, F1 Score = 1; it is < 1 otherwise.
Measure precision (P) and recall (R) on the test set and choose the value of the threshold which maximizes 2PR / (P + R).
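A small sketch of these metrics with sklearn.metrics (the labels and predictions below are made up just to show the calls):

from sklearn import metrics

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(metrics.precision_score(y_true, y_pred))    # TP/(TP+FP) = 1/2
print(metrics.recall_score(y_true, y_pred))       # TP/(TP+FN) = 1/2
print(metrics.f1_score(y_true, y_pred))           # 2PR/(P+R) = 0.5
print(metrics.confusion_matrix(y_true, y_pred))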
Various learning algorithms exist to classify between confusable words: {to, two, too}, {then, than}:
Most of these algorithms do about equally well as long as you supply enough data. It is not who has the best "algorithm" -- it is who has the best data.
It is a "large marigin" classifier ... meaning try to find threshold so that it is max distant from most of the points. Beware ... An outlier may influence too much ... Computationally better compared to h(x) with sigmoid function calls.
Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition.
Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers T = the task of playing checkers. P = the probability that the program will win the next game.
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
e.g. stock/house prices history.
It is also called "regression" problem - i.e. we're trying to predict a continuous value output. Namely the price. May fit using straight line or quadratic equation, etc.
e.g. tumor size, age, is_benign data. We need to predict if the tumor is malignant or benign. This is a classification problem.
Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
Note: A classification problem can be seen as a special case of regression, since we can always generate discrete values by thresholding a continuous function (??)
Note: Unsuitable for author disambiguation unless we actively curate large set of authors and let the machine learn from that.
To visualize examples in a 2D graph, a regression problem may look like y = f(x), where y is what you want to predict and we consider only one feature x. (In practice there are N features, i.e. N dimensions.)
For classification, we may plot a set of points (with different colors) given features x, y; the color of a point gives its class. We may draw lines or circles to separate one set of values from the others.
We just give the dataset and ask the algorithm to find structure and correlations within the data. This amounts to recognizing inter-dependencies among various features. Visually this is like recognizing clusters and classes.
e.g. Google News articles. Clusters of articles may be identified with respect to various topics, locations, times, languages. In terms of visualization it is like recognizing clusters -- bunches of data points in a 2D graph given various features. E.g. the system may recognize that crime rates are higher during winter on specific dates. We do not tell it which features are "inputs" and which are "outputs"; the learning algorithm should find this automatically.
Google News automatically groups all "oil price hike" articles together, then all "war conflict" articles together, etc.
The genome data automatically clusters humans. Social network analysis. Market segmentation. Astronomical data analysis.
Cocktail party problem -- separate overlapping audio channels recorded by microphones placed at different locations. Note that it is a non-clustering unsupervised learning problem: eliminating noise and finding structure in a chaotic environment.
It is done using single line of code:
[S, s, v] = svd((repmat(sum(x.*x, 1), size(x,1),1).*x)*x')
We use the Octave programming environment (or Matlab), which makes life easier. Octave is an open-source implementation of the Matlab language.
svd == singular value decomposition (linear algebra function).
It is also useful to prototype an algorithm faster in the Octave environment, then later do micro-optimization in Python, Java, or C.
ReLU - Rectified Linear Unit -- a piecewise-linear function, ReLU(z) = max(0, z); a shifted/scaled example looks like: y = 0 when x < 10; y = 4 * (x - 10) when x >= 10.
Single neuron neural network can be imagined to implement a simple linear equation like t0 + t1 * x; (given input x).
Neurons can also be called "units". E.g. 3 units (neurons) and 10 input attributes, all densely connected; each of the 3 units may compute a higher-order derived attribute that is fed to the next layer.
Much of neural networks value lies in supervised learning applications like :
- House properties to predict price - Standard Neural network
- Online Advertisement to maximize clicks
- Image / Photo tagging/ classification - CNN
- Audio automatic transcription - Sequence stream of data - RNN
- English to Chinese translation - more complex RNN
- Image/radar info for autonomous driving -- custom/hybrid CNN/RNN network.
Neural networks are especially impressive in dealing with unstructured data.
Deep learning is taking off now because we can process large amounts of data.
The sigmoid activation function is differentiable and easier to deal with mathematically for optimization, but it is computationally inefficient; switching to the ReLU activation improved runtime performance for gradient descent.
Suppose, you have 3 input features, 1 single hidden layer of 4 activation units , and 1 output layer of 1 activation unit which outputs logistic expression 1 or 0 :
y = W x + b where W is 4x3 matrix at middle hidden layer.
i.e. There are 12 weight parameters for 3 input features and 4 activation units.
i.e. Each activation unit assigns independent weights for each feature.
For hidden layers, the tanh activation (whose output range is -1 to +1) is almost always superior to sigmoid:
tanh(z) = (e^z - e^-z) / (e^z + e^-z)   [range: -1 to +1]
For the output layer, sigmoid may be preferred (with 0 to 1 output). The problem with both tanh and sigmoid is that for large input values the slope is almost 0, hence the gradient is too small and convergence takes a long time. This is where ReLU is better: most of the time z > 0, so convergence is fast. One caveat is that when z < 0 the slope is 0, so the unit may "think" it is at an optimal point; however, mostly z > 0. One way to deal with that is the "leaky ReLU" function, which keeps a small positive slope when z < 0, i.e. a = max(0.01 * z, z). Convergence may still be slow in that region, but it is better than sigmoid.
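For quick reference, a minimal NumPy sketch of the activation functions discussed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # range (0, 1); common for output layers

def tanh(z):
    return np.tanh(z)                      # range (-1, 1); good default for hidden layers

def relu(z):
    return np.maximum(0.0, z)              # cheap; does not saturate for z > 0

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # keeps a small slope when z < 0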
For specific application, there are lot of design choices like what activation function to use, etc.
Why use a non-linear activation function instead of g(z) = z? A linear activation function fits only a linear model; such a neural network is no more intelligent than the regular machine learning technique of standard logistic regression. Only if you have a non-linear activation function (in the hidden layers) will you be able to capture non-linear dependencies.
If your output is a linear real value (not a yes/no logistic classification), then it is better to use a linear function for the output layer, but the hidden layers must still have a non-linear activation function.
Say one input layer (layer 0), one hidden layer (layer 1), one output layer (layer 2) with only one activation unit (i.e. binary classification output). Note the following notation:
n0 = Number of input features (tuple size for each x). Let this be 3.
n1 = Hidden layer size. i.e. activation units. Let this be 4.
n2 = Output layer size. i.e. activation unit. Let this be 1 as we consider binary classification.
W1 = Weight matrix for layer 1. This is 4x3. Every row in W matrix represents activation unit. Every column associated feature weight.
W(i,j) represents the weight the activation unit i assigns for the j'th feature in x.
W2 = Weight matrix for layer 2. This is 1x4. i.e. Number of activation units here is just 1.
The input feature size of layer n is the units size of layer (n-1).
b1 = Constant bias introduced in Layer 1. Size is 4x1 (one bias per activation unit). Each activation unit introduces a different bias.
b2 = Constant bias introduced in Layer 2. Size is 1x1.
Z1 = W1 * X + b1 // This is Linear output result for Layer 1. Result is 4x1 values.
A1 = g(Z1) = sigmoid(Z1) // This is non-Linear output result for Layer 1. 4x1 size.
// If the relation is all linear, you can have g = Identity function.
// Otherwise this maps (-inf, +inf) to (0, 1).
Z2 = W2 * A1 + b2 // Linear output for output layer 2. Result size is: 1x1
A2 = g(Z2) = sigmoid(Z2) // Logistic non-linear output. Result size is: 1x1
Note: There is no W0, b0.
Total Cost of this neural network: J(W1, W2, b1, b2) = Sum( Cost(y, expected-y) )
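A small NumPy shape check for this forward pass on the 3-4-1 network, with m = 5 examples stacked as columns (so each result is units x m instead of units x 1; the random data is purely illustrative):

import numpy as np

np.random.seed(0)
m, n0, n1, n2 = 5, 3, 4, 1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X  = np.random.randn(n0, m)             # 3 x 5 inputs
W1 = np.random.randn(n1, n0) * 0.01     # 4 x 3
b1 = np.zeros((n1, 1))                  # 4 x 1
W2 = np.random.randn(n2, n1) * 0.01     # 1 x 4
b2 = np.zeros((n2, 1))                  # 1 x 1

Z1 = W1 @ X + b1                        # 4 x 5
A1 = sigmoid(Z1)                        # 4 x 5 (the notes use sigmoid here; tanh is also common)
Z2 = W2 @ A1 + b2                       # 1 x 5
A2 = sigmoid(Z2)                        # 1 x 5 predicted probabilities

print(Z1.shape, A1.shape, Z2.shape, A2.shape)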
If we randomly initialize W and b, then we can iteratively move the W's and b's in the optimal direction.
The cost is partially differentiable and computing the partial derivatives is cheap; otherwise, we could still estimate them numerically, varying one factor at a time while keeping all the others fixed.
Let us use the notation:
dW[i] = dJ/dWi : Partial derivative of Cost function wrt weights for layer i.
Note: Actual partial derivative is calculated for each weight in that weight matrix.
db[i] = dJ/dbi : Partial derivative wrt bias parameters of layer i.
Algorithm:
Repeat until the cost difference converges:
  Compute predicted values y1 ... ym.
    This involves forward propagation, computing the Z's and A's.
  For the given W, b:
    Compute values for dW[2] and db[2], i.e. for Layer 2.
      This is where backpropagation comes in.
      W2 = W2 - alpha * dW[2]; // Adjust the weights for Layer 2.
      b2 = b2 - alpha * db[2]; // Adjust the biases for Layer 2.
      // Computing dW2 is quite involved; some high-level details are given below ...
      dZ2 = A2 - Y // Y is the expected values
      dW2 = (1/m) * dZ2 * A1' // A1' = transpose of the previous layer's output
      db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    Compute values for dW[1] and db[1], i.e. for Layer 1.
      W1 = W1 - alpha * dW[1]; // Adjust the weights for Layer 1.
      b1 = b1 - alpha * db[1]; // Adjust the biases for Layer 1.
      // Like before, the computation is involved.
      dZ1 = (W2' * dZ2) .* g'(Z1) // elementwise product; g'(Z1) is the derivative of the
                                  // hidden activation (A1 .* (1 - A1) for sigmoid)
      dW1 = (1/m) * dZ1 * X'
      db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
Note: You have to cache the Z's and A's (and the W's, b's used) from the forward pass, since they are reused in backward propagation.
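A compact NumPy training loop following the dZ/dW/db formulas above (sigmoid activations as in the forward-pass notes; the data and sizes are made up for illustration):

import numpy as np

np.random.seed(1)
m, n0, n1, n2 = 5, 3, 4, 1
alpha = 0.5

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(n0, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)   # hypothetical 0/1 labels

W1 = np.random.randn(n1, n0) * 0.01; b1 = np.zeros((n1, 1))
W2 = np.random.randn(n2, n1) * 0.01; b2 = np.zeros((n2, 1))

for _ in range(2000):
    # Forward propagation (cache Z1, A1, A2 for the backward pass)
    Z1 = W1 @ X + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

    # Backward propagation
    dZ2 = A2 - Y
    dW2 = (1.0 / m) * dZ2 @ A1.T
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)             # g'(Z1) for sigmoid
    dW1 = (1.0 / m) * dZ1 @ X.T
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)

    # Gradient-descent updates
    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1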
Initializing the weight matrix to zero does not work for neural networks, because of the symmetry among activation units. (It may work for plain linear/logistic regression.)
High level:
Forward Propagation:
Input a[l-1]
Output a[l], cache Z[l], W[l], b[l]
Backward Propagation:
Input da[l]
Output da[l-1], dW[l], db[l]
Parameters are the W's and b's. Hyperparameters: learning rate alpha, #iterations, #hidden layers, #hidden units, choice of activation functions (e.g. sigmoid). Also: momentum, mini-batch size, regularization parameters.
[ASCII plot: house Price (y-axis) vs. Size (x-axis) with a fitted straight line h(Size) = θ0 + θ1 * Size.]
Machine learning came directly from the minds of the early AI crowd, and the algorithmic approaches over the years included decision tree learning, inductive logic programming, clustering, reinforcement learning, and Bayesian networks, among others. As we know, none achieved the ultimate goal of General AI.
Arthur Samuel coined the phrase not too long after AI, in 1959, defining it as -- the ability to learn without being explicitly programmed.
Using an algorithm to predict an outcome of an event is not machine learning. Using the outcome of your prediction to improve future predictions is.
A machine learning algorithm does not necessarily require a big amount of data -- for example, a chess learning algorithm may just require hand coding of the "chess rules" (decision trees), but it can play against itself and find strategies to improve its game.
Benefits of neural networks:
* Extract meaning from complicated data
* Detect trends and identify patterns too complex for humans to notice
* Learn by example
* Speed advantages
A basic neural network may have two to three layers, while a deep learning network may have dozens or hundreds.
Simple machine learning is equivalent to having a single neuron: it fits the input features into a linear/non-linear model by finding relative weights. So there are N input features, maybe N+1 parameters, and a model. What if you want to design a model with more and more non-linear terms, e.g. x1*x2 or x1*x2*x3, but you have no idea whether some product of a subset of features is really important for the output or not? This is where a probabilistic design comes into effect: try some random combinations of features, drop the ones which do not really contribute to the output, and keep the ones which do.
Breakthrough in machine learning came through computer vision.
Consider the task of identifying a STOP sign ... Look for something which has 8 sides, red color, distinct letters S, T, O, P one after another, etc.
# import load_iris function from datasets module
from sklearn.datasets import load_iris
# save "bunch" object containing iris dataset and its attributes
iris = load_iris() # type(iris) is <class 'sklearn.utils.Bunch'>
# store feature matrix in "X"
X = iris.data # X.shape is (150, 4)
# store response vector in "y"
y = iris.target # y.shape is (150,)
## scikit-learn 4-step modeling pattern
# **Step 1:** Import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
# **Step 2:** "Instantiate" the "estimator"
# "Estimator" is scikit-learn's term for model
knn = KNeighborsClassifier(n_neighbors=1)
# Can specify tuning parameters (aka "hyperparameters") during this step
print(knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
# **Step 3:** Fit the model with data (aka "model training")
knn.fit(X, y)
# **Step 4:** Predict the response for a new observation
knn.predict([[3, 5, 4, 2]]) # Can pass multiple rows/observations.
array([2]) // Single prediction for single input row. Numpy Array.
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
array([2, 1]) // Multiple inputs / multiple predictions.
## Using a different value for K (i.e. change hyperparameters )
# knn = KNeighborsClassifier(n_neighbors=5)
# Then you can repeat the whole thing ...
...
array([1, 1]) // Different models (or tunings) may predict differently.
## Using a different classification model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
logreg.predict(X_new)
array([2, 0])
# Evaluation procedure #1: Train and test on the entire dataset
# compute classification accuracy for the logistic regression model
y_pred = logreg.predict(X) # Pass entire training set on the trained model.
len(y_pred) # It is 150
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred)) # y vs y_pred
0.96
# Known as training accuracy when you train and test the model on the same data
# For KNN (K=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))
0.96666 # Slightly better than logistic
# For KNN = 1, the training accuracy is 1 !!!
1.0 # KNN=1 model defines 'anything at that (x1,y1) is z1 by definition.
# This does not imply KNN=1 is a good model.
# Evaluation procedure #2: Train and test set split.
from sklearn.model_selection import train_test_split
# Randomly pick 40% rows for testing. Remaining 60% is for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# compare actual response values (y_test) with predicted response values (y_pred)
print(metrics.accuracy_score(y_test, y_pred))
0.95
# Repeat above for KNN = 5 i.e. split train/test data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
0.9666666666666667
# For K=1, it yields 0.95
# Can we locate an even better value for K?
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
import matplotlib.pyplot as plt
# allow plots to appear within the notebook
# %matplotlib inline # For notebook as magic command
# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
# You can see that Accuracy is max between K=7 to k=15
# You may decide to pick K=11 for your model, for example.
# Make prediction for single out-of-sample data ...
# It is obvious ....
knn.predict([[3, 5, 4, 2]])
array([1])
When you just have the data and want to build a model, this workflow is typically very useful. It applies to supervised learning, where you have labeled data:
import seaborn as sns
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size=7, aspect=0.7, kind='reg')
# 3 Scatter plots displayed i.e. TV vs Sales, Radio Vs Sales, Newspaper Vs Sales
# Visualize iris dataset correlation.
sns.pairplot(iris,hue='species',palette='Dark2')
# Create a kde plot of sepal_length versus sepal width for setosa species of flower.
setosa = iris[iris['species']=='setosa']
sns.kdeplot( setosa['sepal_width'], setosa['sepal_length'],
cmap="plasma", shade=True, shade_lowest=False)
# Single value density plot is histogram or density plot.
# 2 Variables density plot is KDE. Kernel density estimate
# or density contours or 2D bins or Hex bins.
#
X = data[['TV', 'Radio', 'Newspaper']] # Feature matrix - Data frame - numpy matrix compatible.
y = data['Sales'] # Response Vector - Pandas Series - numpy array compatible.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train) # fit the model to the training data (learn the coefficients)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(linreg.intercept_)
print(linreg.coef_)
# 2.8769666223179318
# [0.04656457 0.17915812 0.00345046]
# pair the feature names with the coefficients
feature_cols = ['TV', 'Radio', 'Newspaper']   # the same columns used to build X above
list(zip(feature_cols, linreg.coef_))
# [('TV', 0.04656456787415029),
# ('Radio', 0.17915812245088839),
# ('Newspaper', 0.003450464711180378)]
# make predictions on the testing set
y_pred = linreg.predict(X_test)
# We need an **evaluation metric** in order to compare our predictions with the actual values!
# Evaluation metrics for classification problems, such as **accuracy** not applicable here.
# Three common evaluation metrics** for regression problems:
# 1. **Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:
print(metrics.mean_absolute_error(true_values, y_pred))
# 2. **Mean Squared Error** (MSE) is the mean of the squared errors:
print(metrics.mean_squared_error(true_values, y_pred))
# 3. **Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:
print(np.sqrt(metrics.mean_squared_error(true_values, y_pred)))
# **MAE** is the easiest to understand, because it's the average error.
# **MSE** is more popular than MAE, because MSE "punishes" larger errors.
# **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.
# Example :
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# 1.404651423032895
## Feature selection
# Does **Newspaper** "belong" in our model? In other words, does it improve the quality of our predictions?
# Let's **remove it** from the model and check the RMSE!
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# 1.3879034699382886
# You may decide to remove this feature since the error decreases by removing it.
You have standard methodologies for cross-validation of your classification or regression model.
Synopsis :
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred)) # For classification problem.
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False).split(range(25))
# Above it is 25 integers, but in real, it would be 25 rows.
#
# Enumerating the folds ...
#
for iteration, data in enumerate(kf, start=1):
    print(iteration, data[0], str(data[1]))
# Prints: 1 [ 20-elements-training-set ] [ 5-test-set ]
## Cross-validation recommendations
1. K can be any number, but **K=10** is generally recommended
2. For classification problems, **stratified sampling** is recommended for creating the folds
- Each response class should be represented with equal proportions in each of the K folds
- scikit-learn's cross_val_score function does this by default
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
[1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
# Note: cross_val_score() makes intelligent splits 10 times, each time applying the
# given model and computing the accuracy for that split.
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
0.9666666666666668
#
# Typically you use cross_val_score() to choose between knn or logistic regression.
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
0.9533333333333334
#
# But you can also use this to find specific K for KNN during initial stages.
# Note that this runs slowest since you vary K on outer loop for neighbors
# and vary folds in inner loop for test/train splits.
# Use it judiciously.
#
# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)
# Choose k for which you have highest accuracy score.
# Example scoring calculation for Linear Regression ...
# 10-fold cross-validation with all three features
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print(scores)
[-3.56038438 -3.29767522 -2.08943356 -2.82474283 -1.3027754 -1.74163618
-8.17338214 -2.11409746 -3.04273109 -2.45281793]
mse_scores = -scores
# convert from MSE to RMSE
rmse_scores = np.sqrt(mse_scores)
print(rmse_scores)
[1.88689808 1.81595022 1.44548731 1.68069713 1.14139187 1.31971064
2.85891276 1.45399362 1.7443426 1.56614748]
# calculate the average RMSE
print(rmse_scores.mean())
1.6913531708051797
# You may compare the other value with some other Regression Model RMSE value ...
# For example, you can use it for feature selection by recomputing the RMSE with same model
# with new feature set (by excluding/including certain features) and drop/add features.
#
# You can also automate the model parameter tuning (like k for KNN's n_neighbors) using GridSearchCV().
from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
print(param_grid)
{'n_neighbors': [1, 2, 3, ..., 28, 29, 30]}
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)
# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1}
1 0.953333 0.052068 {'n_neighbors': 2}
2 .....
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
0.98
{'n_neighbors': 13}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=13, p=2,
weights='uniform')
# if you want to optimize based on multiple parameters like n_neighbors as well as weight_options ...
param_grid = dict(n_neighbors=list(range(1, 31)), weights=['uniform', 'distance'])
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
#
# Now repeat all above to find the best parameters ...
#
print(grid.best_score_)
print(grid.best_params_)
0.98
{'n_neighbors': 13, 'weights': 'uniform'}
# End of Grid Search for Cross Validation.
# RandomizedSearchCV searches a subset of the parameters, and you control the computational "budget"
from sklearn.model_selection import RandomizedSearchCV
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=list(range(1, 31)), weights=['uniform', 'distance'])
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
...
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
0.98
{'weights': 'uniform', 'n_neighbors': 18}
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# learn the 'vocabulary' of the training data (occurs in-place)
#
# Note: We are not associating any Y values here. This is just to learn Universal vocabulary.
#
vect.fit(simple_train)
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']
#
# We got 6 words out of 3 sentences inputs that we have given !!!
# transform training data into a 'document-term matrix'
#
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
[1, 1, 1, 0, 0, 0],
[0, 1, 1, 2, 0, 0]])
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
cab call me please tonight you
0 0 1 0 0 1 1
1 1 1 1 0 0 0
2 0 1 1 2 0 0
# This is called **Bag of Words** or "Bag of n-grams" representation.
# we completely ignore the relative position information of the words in the document.
# check the type of the document-term matrix
type(simple_train_dtm)
scipy.sparse.csr.csr_matrix
# examine the sparse matrix contents
print(simple_train_dtm)
(0, 1) 1
(0, 4) 1
(0, 5) 1
(1, 0) 1
(1, 1) 1
(1, 2) 1
(2, 1) 1
(2, 2) 1
(2, 3) 2 # Third input has the 4th word "Please" 2 times.
# transform testing data into a document-term matrix (using existing vocabulary)
# (simple_test is assumed to be a list like ["please don't call me"], matching the output below)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]], dtype=int64)
# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
# sms = pd.read_table(url, header=None, names=['label', 'message'])
# examine the shape
sms.shape
(5572, 2)
# examine the first 10 rows
sms.head(10)
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...
# examine the class distribution
sms.label.value_counts()
ham 4825
spam 747
Name: label, dtype: int64
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
# check that the conversion worked
sms.head(10)
label message label_num
0 ham Go until jurong point, crazy.. Available only ... 0
1 ham Ok lar... Joking wif u oni... 0
# how to define X and y (from the iris data) for use with a MODEL
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
(150L, 4L)
(150L,)
# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
(5572L,)
(5572L,)
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# ## Part 4: Vectorizing our dataset
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
# examine the document-term matrix
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>' # 4179 inputs; 7456 words.
with 55209 stored elements in Compressed Sparse Row format>
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<type 'numpy.int64'>' # 1393 Inputs; 7456 words;
with 17604 stored elements in Compressed Sparse Row format>
# ## Part 5: Building and evaluating a model
#
# We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):
#
# > The multinomial Naive Bayes classifier is suitable for classification with **discrete features**
# (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts.
# However, in practice, fractional counts such as tf-idf may also work.
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)
Wall time: 3 ms
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
0.98851399856424982
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
array([[1203, 5],
[ 11, 174]])
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]
574 Waiting for your call.
3375 Also andros ice etc etc
45 No calls..messages..missed calls
3415 No pic. Please re-send.
1988 No calls..messages..missed calls
Name: message, dtype: object
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]
3132 LookAtMe!: Thanks for your purchase of a video...
5 FreeMsg Hey there darling it's been 3 week's n...
3530 Xmas & New Years Eve tickets are now on sale f...
684 Hi I'm sue. I am 20 years old and work as a la...
1875 Would you like to see my XXX pics they are so ...
1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298 thesmszone.com lets you send free anonymous an...
4949 Hi this is Amy, we will be sending you a free ...
2821 INTERFLORA - It's not too late to order Inter...
2247 Hi ya babe x u 4goten bout me?' scammers getti...
4514 Money i have won wining number 946 wot do i do...
Name: message, dtype: object
# example false negative
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!,
you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
1.09026171e-06, 1.00000000e+00, 3.98279868e-09])
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.98664310005369604
# ## Part 6: Comparing models
#
# We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):
#
# > Logistic regression, despite its name, is a **linear model for classification** rather than regression.
# Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt)
# or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a
# single trial are modeled using a logistic function.
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
logreg.fit(X_train_dtm, y_train)
Wall time: 39 ms
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,
0.99725053, 0.00157706])
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.9877961234745154
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
0.99368176123143015
#
# show default parameters for CountVectorizer
vect
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
# However, the vectorizer is worth tuning.
#
# - **stop_words:** string {'english'}, list, or None (default)
# - If 'english', a built-in stop word list for English is used.
# - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
# - If None, no stop words will be used.
# remove English stop words
vect = CountVectorizer(stop_words='english')
# - **ngram_range:** tuple (min_n, max_n), default=(1, 1)
# - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
# - All values of n such that min_n <= n <= max_n will be used.
#
# ngram is used to recognize words appearing together as single entity/word.
#
# >>> v = CountVectorizer(ngram_range=(1, 2))
# >>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
# {u'an': 0,
# u'an apple': 1,
# u'apple': 2,
# u'apple day': 3,
# u'away': 4,
# u'day': 5,
# u'day keeps': 6,
# u'doctor': 7,
# u'doctor away': 8,
# u'keeps': 9,
# u'keeps the': 10,
# u'the': 11,
# u'the doctor': 12}
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
# - **max_df:** float in range [0.0, 1.0] or int, default=1.0
# - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
# - If float, the parameter represents a proportion of documents.
# - If integer, the parameter represents an absolute count.
# ignore terms that appear in more than 50% of the documents # Max document frequency
vect = CountVectorizer(max_df=0.5)
# - **min_df:** float in range [0.0, 1.0] or int, default=1
# - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
# - If float, the parameter represents a proportion of documents.
# - If integer, the parameter represents an absolute count.
# only keep terms that appear in at least 2 documents # Min document frequency.
vect = CountVectorizer(min_df=2)
Terminology               Computation      sklearn function     Comments
Accuracy                  (TP+TN)/Total    accuracy_score()     Not good for skewed classes.
                                                                Classification Error Rate = 1 - Accuracy.
Sensitivity/Recall/TPR    TP/(TP+FN)       recall_score()       Given that it is positive, how accurate is your prediction?
                                                                Should be high when you can't afford to miss a positive.
                                                                E.g. cancer screening: FP is OK, but FN is not OK.
Specificity (TNR)         TN/(TN+FP)       --not available--    Given that it is negative, how accurate is your prediction?
                                                                E.g. defense interview: FN is OK, but FP is not OK.
False Positive Rate       FP/(FP+TN)       -----------------    False Alarm Rate; 1 - Specificity.
Precision                 TP/(TP+FP)       precision_score()    Probability that the alarm is genuine.
F1 Score is the harmonic mean of Precision and Recall:
F1 Score = 2 * Precision * Recall / (Precision + Recall)
A spam filter optimizes for Precision (high probability that the alarm is genuine) and Specificity (True Negative Rate).
You can also change the classification threshold probability to redefine the classification results in order to optimize one metric over another (e.g. for a fire alarm vs. a spam filter); you may want high precision over recall, etc.
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]
array([0, 0, ..., 1, 0, 1]) # For multi-class of 4 classes it may contain 0,1,2,3
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]
array([[0.63247571, 0.36752429], # Only 2 classes. [P(class_0), P(class_1)) ]
[0.71643656, 0.28356344], # For multi-class of 4, this will have 4 probabilities.
[0.71104114, 0.28895886],
....
)
# Note: For multi-class classification of N classes, You can peek into closest m classes !!
# You can use logreg.predict_proba() to predict the closest top m classifications.
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]
array([0.36752429, 0.28356344, 0.28895886, 0.4141062 , 0.15896027,
0.17065156, 0.49889026, 0.51341541, 0.27678612, 0.67189438])
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
# Visualize the probability range using histogram...
import matplotlib.pyplot as plt
plt.hist(y_pred_prob, bins=8) # 8 bars with in actual range of values.
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability (0 to 1) ')
plt.ylabel('Frequency')
# **Decrease the threshold** to **increase the sensitivity** of the classifier
# Generate fire alarm even if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
y_pred_class = binarize([y_pred_prob], 0.3)[0]
# print the first 10 predicted probabilities
y_pred_prob[0:10]
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))
# Now you can re-calculate sensitivity, precision etc.
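# A sketch of that re-calculation, re-using the confusion-matrix breakdown shown earlier:
TN, FP, FN, TP = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print('Sensitivity:', TP / (TP + FN))  # goes up with the lower threshold
print('Specificity:', TN / (TN + FP))  # goes down; the price paid for higher sensitivity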
A receiver operating characteristic (ROC) curve is a plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
Specifically, it plots True Positive Rate against False Positive Rate: we can increase the True Positive Rate at the cost of increasing the False Positive Rate.
e.g. If we are conservative and raise fire alarms at the slightest hint, we get a high TPR at the cost of raising many false alarms.
e.g. If we can reach a TPR of 0.8 with an FPR of 0.3, the model is fairly good for fire alarms. But you may need different thresholds for different problems.
Note
AUC (Area Under the Curve) is a single-number indicator: the higher the value, the better. A higher value implies a steep increase in TPR for only a small increase in FPR.
Synopsis:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])
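Example calls (the threshold values are just illustrative):
evaluate_threshold(0.5)
evaluate_threshold(0.3)  # lower threshold: sensitivity goes up, specificity goes down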
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))
0.7245657568238213 # Single number summary metric for your classifier model. Higher, better.
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
A word embedding maps each word to an N-dimensional vector. The embedding matrix W is initialized with a random vector for each word, and it learns meaningful vectors in order to perform some task.
WordEmbedding(word) => vector (r1, r2, ... rn)
For example, one task we might train a network for is predicting whether a 5-gram (sequence of five words) is valid.
In the process of 'training' the model we get the word embeddings as a side effect: they are the weights learned by the hidden layer.
Nobody knows exactly what the 'significance' of each individual dimension is, since it is determined by the hidden layer.
For intuition, suppose we were told we could only have 4-dimensional properties and each property could take only binary values; then we could distinguish at most 2^4 = 16 word "types", which is why real embeddings use many continuous-valued dimensions.
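A minimal sketch of the mapping itself, using an (untrained) Keras Embedding layer; the vocabulary size, dimension, and word ids below are made up for illustration:
import numpy as np
from tensorflow.keras.layers import Embedding
vocab_size = 1000   # hypothetical vocabulary size
embedding_dim = 8   # N dimensions per word
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
word_ids = np.array([[3, 17, 42, 7, 99]])  # a 5-gram encoded as word indices
vectors = embedding(word_ids)              # shape (1, 5, 8): one vector per word (random until trained on a task)
print(vectors.shape)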
Why are Naive Bayes, Decision Trees, and SVM more sensitive to correlation and redundancy in the input features?
Decision trees are typically used in conjunction with ensemble techniques like Random Forest or Gradient Tree Boosting. They easily handle feature interactions. The disadvantages of decision trees are: 1) they do not support on-the-fly (incremental) learning, so you need to rebuild the tree when you add more training data or features; 2) they easily overfit, though random forests mitigate this issue. A lot of game-playing theory uses tree pruning. Decision trees have a strong theoretical foundation and are easy to comprehend.
K-Means clustering of the input set: if you want to form clusters of the inputs based on all features, without even understanding the original structure, this is the way to go. The disadvantage is that you have to guess the best K or find it by trial and error, as sketched below.
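A minimal sketch of that trial-and-error ("elbow") approach, assuming a feature matrix X already exists:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(X)                     # X is your feature matrix
    inertias.append(km.inertia_)  # within-cluster sum of squares
plt.plot(k_values, inertias, marker='o')  # pick the K where the curve bends (the "elbow")
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')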
Principal component analysis (PCA) can help to discard unnecessary and redundant features, to keep the model simpler, faster and more stable.
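A minimal sketch, assuming a feature matrix X that has already been scaled (PCA is sensitive to feature scale):
from sklearn.decomposition import PCA
pca = PCA(n_components=2)             # keep the 2 strongest directions of variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component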
SVM can handle high dimensions well and provides high accuracy. The cons: it is difficult to tune and memory-intensive. Example domains: character recognition, stock market price prediction, text categorization.
What is discriminative vs generative models ?
The Supervised learning models are categorized as Discriminative and Generative.
Logistic regression, SVM, etc. are discriminative (they study the conditional probability P(y|x)). They don't care about the probability distribution of the inputs.
Typical generative models include Naive Bayes, Gaussian Mixture Models, etc. (they study the joint probability P(x, y)). They also model the probability distribution of the inputs, so it is easy to 'generate' plausible inputs from that distribution. A generative model needs more training data since it represents the whole "universe".
A combination of generative and discriminative models is often recommended and found to be useful.
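A small sketch contrasting the two on the same (hypothetical) train/test split: LogisticRegression models P(y|x) directly, while GaussianNB models the class-conditional distribution of the inputs.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
disc = LogisticRegression().fit(X_train, y_train)  # discriminative
gen = GaussianNB().fit(X_train, y_train)           # generative
print(disc.score(X_test, y_test), gen.score(X_test, y_test))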
Synopsis:
# K Neighbors Classifier;
knn = KNeighborsClassifier(n_neighbors=5)
# Ensemble method: it uses a collection of decision trees.
rfc = RandomForestClassifier(n_estimators=200)
# Train the Support Vector Classifier
from sklearn.svm import SVC
model = SVC()
# param_grid = {'C': [0.1,1, 10, 100, 1000],
# 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']}
# Since it is difficult to guess the parameters, it is often used with GridSearch.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
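# A sketch of the typical follow-up (assumes X_train/y_train/X_test from an earlier train/test split):
grid.fit(X_train, y_train)               # cross-validates every (C, gamma) combination
print(grid.best_params_)                 # the best combination found
grid_predictions = grid.predict(X_test)  # refit=True => grid predicts with the best estimator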
classifier = tf.estimator.DNNClassifier(hidden_units=[10, 20, 10],
                                        n_classes=2, feature_columns=feat_cols)
# Good when the input is discrete, e.g. classifying spam SMS based on spam-word counts.
classifier = MultinomialNB()
dtree = DecisionTreeClassifier()
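All of these estimators share the same fit/predict interface; a minimal sketch (assuming X_train, y_train, X_test, y_test from an earlier train/test split):
from sklearn import metrics
model = dtree                        # any of the classifiers above
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(metrics.confusion_matrix(y_test, predictions))
print(metrics.classification_report(y_test, predictions))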
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('Data.csv')  # This is a dataframe of N columns.
X = dataset.iloc[:, :-1].values    # Everything except the last column
y = dataset.iloc[:, 3].values      # The last column (index 3)
# Taking care of missing data. We want to fill the missing values by mean of other values.
# Alternative way: Use the value from the nearest neighbor vs mean.
#
from sklearn.preprocessing import Imputer  # Newer scikit-learn: from sklearn.impute import SimpleImputer
# axis=0 uses the mean of the column for the missing value; axis=1 uses the mean of the row.
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical data
# from sklearn.preprocessing import LabelEncoder, OneHotEncoder, etc.
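A hedged sketch of the encoding step (the 'Country' column name is an assumption; OneHotEncoder's exact API differs between scikit-learn versions, so pandas get_dummies is shown as a version-stable alternative):
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)  # e.g. 'No'/'Yes' -> 0/1
X_df = pd.get_dummies(dataset.iloc[:, :-1], columns=['Country'], drop_first=True)
# drop_first=True keeps "all but one" dummy variable, avoiding redundancy.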
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[0 0]
[0 0]
[1 1]
[1 1]]
print(scaled_data)
[[-1. -1.]
[-1. -1.]
[ 1. 1.]
[ 1. 1.]]
scaled_data.mean(axis = 0)
array([0., 0.])
scaled_data.std(axis = 0)
array([1., 1.])
You can plot a heatmap using seaborn as below:
# Visualize correlation ..
import seaborn as sns
sns.heatmap(df.corr())
sns.heatmap(df.corr(), annot=True)
#
# Visualize missing data ...
# Missing values appear as yellow bars.
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
# Also see:
sns.pairplot(df) # Plot for each pair
sns.distplot(df['Price']) # Plot distribution of Price.
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=train, palette='RdBu_r')
sns.countplot(x='Survived', hue='Pclass', data=train, palette='rainbow')
sns.distplot(train['Age'].dropna(), kde=False, color='darkred', bins=30)
Different ways of creating Dataframe quickly:
import numpy as np
import pandas as pd
mx3 = np.arange(6).reshape(3, 2)
array([[0, 1],
[2, 3],
[4, 5]])
df = pd.DataFrame(mx3,index='R1 R2 R3'.split(), columns='C1 C2'.split())
df
C1 C2
R1 0 1
R2 2 3
R3 4 5
# If you want to fill the data column-wise ...
mx3 = np.arange(6).reshape(3,2).transpose()
array([[0, 2, 4],
[1, 3, 5]])
df = pd.DataFrame(mx3,index='R1 R2'.split(), columns='C1 C2 C3'.split())
df
C1 C2 C3
R1 0 2 4
R2 1 3 5
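Another quick way (with hypothetical values) is to pass a dict of columns:
df2 = pd.DataFrame({'C1': [0, 2, 4], 'C2': [1, 3, 5]}, index='R1 R2 R3'.split())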