cssclasses:
- wideTable
- mytable
Notes On Elements of Statistical Learning with R Book
Brief notes on the Introduction to Statistical Learning and Elements of Statistical Learning books by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
Statistical learning involves assuming a hypothesis, building a model (e.g. a linear regression model), and then evaluating whether the model holds good. If not, you reject it and build another model.
In machine learning, this whole process of fitting a model to data is largely automated, without manual intervention. The algorithm itself may try different models and choose the best one, given enough data. Over a period of time, the machine learning model becomes more intelligent, whereas a statistical model relies on a large enough dataset upfront so that we can design a decent model.
Our motivation for statistical learning comes from various goals:
- We want to understand a specific domain, so that we can plan accordingly. The domain could be:
- We want to detect inherent clusters in a given population across N different variables. (This is not prediction but identification, aka clustering.)
The top domains of applications are:
Prediction vs Probability:
- A prediction is nothing but the most probable value.
- If the predicted value follows a normal distribution, we will always predict the mean value, because that is the peak!
- It makes more sense to talk about confidence intervals for a prediction and their associated probability, e.g. mean +/- 3*sigma covers the major interval of potential values for a normal distribution.
- If our desired prediction is binary (i.e. classification), then we can only come up with the probability of the outcome being 1 or 0. Such a probability function is usually called the logistic regression function; it is neither a purely linear nor a purely non-linear model: the sigmoid function is non-linear, but the decision boundary is linear, so it is usually treated as a generalized linear model.
- Probability is always relevant, whether you predict a linear value, a non-linear value, or answer a classification yes/no question.
Linear Model:
Generalized Linear Model:
Nonlinear Model:
| Statistics | Computer Science | Comments |
|---|---|---|
| classification | Supervised Learning | |
| clustering / Density Estimation | Unsupervised Learning | |
| data | training sample | |
| covariates | Features | |
| classifier | Hypothesis | p < 0.05: reject Null Hypothesis |
| DAG | Bayes Net - Multivariate Dist | Conditional Probability |
| Bayes Inference | Prior Belief/Distribution | |
| Frequentist Inference | Large Sample Approximation | |
See Also:
Regression is a set of techniques for estimating relationships.
Regression also has another, older independent meaning ("regression toward the mean"), which has nothing to do with relationship estimation.
| Term | Name | Formula | Comments |
|---|---|---|---|
| Y | Response Y value - Actual | Y or Yi | Actual Yi |
| Y^ or Ye | Expected Y or Y hat | Y^ or Ye | Expected Y |
| Yavg | Average Y or Y mean | Ym or Yavg | Y Mean |
| MSE | Mean Squared Error | Sum(Y - Ye)^2 / N | Error Variance |
| RMSE | Root Mean Squared Error | Sqrt(MSE) | Unit: Y |
| RMSD | RMS Deviation | --ditto-- | Alias for RMSE |
| NRMSD | Normalized RMSD | Sqrt(MSE) / Range(Y) | A fraction < 1 |
| RSS | Residual Sum of Squares | MSE * N | Unit: Y^2 |
| SSR | Sum of Squared Residuals | --ditto-- | |
| SSE | Sum of Squared Errors | --ditto-- | |
| RSE | Residual Std Error | Sqrt(RSS / (N-2)) | Unit: Y. Predictors = 1. Approximately RMSE |
| RSE (m > 1) | RSE for Multiple LR | Sqrt(RSS / (N-m-1)) | Total predictors = m |
| TSS | Total Sum of Squares | Sum(Y - Yavg)^2 | Actual Variance |
| ESS | Explained Sum of Squares | Sum(Ye - Yavg)^2 | Explained Variance |
| r | Pearson Corr Coeff | Cov(X,Y) / (sd(X) * sd(Y)) | Linear Bivariate Corr |
| R^2 | r-squared or R-squared | r * r = Corr(Actual, Predicted)^2 | Coeff of Determination, <= 1. Higher value, better model |
| ANOVA | Analysis of Variance | | Analyzes means of 3 or more groups |
| F | F-Statistic (Threshold) | | Are joint predictors p1, p2 significant? Yes if higher |
| T | T-Statistic | B1 / StdErr(B1), B1 = X1 Coeff | Higher value: predictor is significant. B1 varies by sampling distribution |
| p value | P-Value of variable | Prob(> |T-Statistic|) | Lower value: predictor is significant |
See What is ANOVA?
The reason we divide RSS by (N-2), not N-1 or N, for simple linear regression (with 1 predictor) is that the error itself follows a chi-square distribution with N-2 (or N-m-1 for m predictors) degrees of freedom. For large N, RSE approximates RMSE, i.e. Root Mean Squared Error, so you can treat RSE as RMSE for intuitive understanding. It is also helpful to look at RMSE/Avg(Y) as a fractional measure of error.
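A quick sanity check comparing RSE and RMSE in R, using the built-in mtcars data as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$sigma           # RSE = sqrt(RSS / (n - 2))
sqrt(mean(resid(fit)^2))     # RMSE = sqrt(RSS / n); close to RSE when n is large.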
Generalization of Linear Models :
- Classification Problems: Logistic Regression, Support Vector Machines. Even though the logistic regression function is non-linear, the decision boundary is linear.
- Non-Linearity: Splines, Generalized Additive Models, nearest-neighbor methods. Remember that a dummy auxiliary predictor p1*p2 captures non-linear dependency using linear methods. Linear does not mean "continuous"; the fit can have corner edges. It just means the methodology for finding coefficients is linear, using a cost function such as least MSE.
- Interactions: Tree-based methods (which can have different linear equations for different ranges in subtrees), bagging, random forests and boosting. Note that interactions are captured more easily by these methods than by simple linear methods.
- Regularized Fitting: Ridge and Lasso. More appropriate when p is large relative to n.
Box plot: the middle line is the median. The box rectangle marks the 25% and 75% boundaries (Q1 and Q3). The lowermost/uppermost ends of the vertical lines (whiskers) indicate the hinges. A point more than 1.5*IQR above Q3 or below Q1 is considered an outlier.
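A minimal base-R sketch of these boxplot parts, using made-up data :
set.seed(1)
x <- c(rnorm(100), 5)           # include one extreme value
bp <- boxplot(x, main = "Boxplot anatomy")
median(x)                       # middle line of the box
quantile(x, c(0.25, 0.75))      # lower/upper edges of the box (Q1, Q3)
IQR(x)                          # Q3 - Q1
bp$out                          # points beyond Q3 + 1.5*IQR or Q1 - 1.5*IQR (outliers)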
A regression function f(X) is nothing but the average of Y values given X = x1, for all possible X values. If no Y exists for a given X = x1, then use the neighborhood average of all known Y values around x1. This is a good technique for small p (predictors or features) and large n. There are also alternatives like spline smoothing and kernels. When p is large, nearest neighbors are typically far away (since each point really has only a subset of possible feature values -- sparse features).
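A minimal sketch of this neighborhood-average idea, assuming the FNN package and made-up single-predictor data :
library(FNN)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
fit <- knn.reg(train = data.frame(x = x), test = data.frame(x = 0.3), y = y, k = 10)
fit$pred    # average of the 10 Y values whose X is closest to 0.3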
(Interpretability vs Flexibility)
- Bayes Classifier, KNN
- Most flexible: SVM, Bagging, Boosting
Consider optimizing a given advertising budget across TV, Newspaper and Radio.
Input variables == budget == Predictors, Variables, Features. Output variable == Response == Sales.
Y = f(X) + Epsilon. Epsilon is a random error term with mean zero. The task is to determine f.
Inference Problem: Sometimes the goal is not just prediction in the same setting, but to get deeper insights into individual feature effects that could be useful in other settings as well; i.e. f cannot be treated as a black box (as it is with deep learning algorithms). Example: Is this house overpriced? That is a prediction problem. How much extra would the house cost if it had a river view? That is an inference problem.
Parametric vs Non-Parametric approaches to find f: a parametric approach assumes, for example, that the function is linear and then finds the parameter coefficients. A thin-plate spline is an example of a non-parametric approach. Non-parametric approaches need more data and a chosen smoothness level to reduce rough edges.
Inflexible or Restrictive model: A linear model is said to be inflexible since it cannot represent complex shapes for the function f. But it is more interpretable since it gives more insight about the features.
As flexibility increases, interpretability decreases.
The least flexible models are the Lasso and least squares linear models. The most flexible models are SVM, bagging and boosting.
The Lasso sets a number of coefficients to exactly zero, reducing the total number of features -- so it is more inflexible but easier to interpret or relate to a small number of features (at the cost of accuracy).
Flexible does not necessarily mean more accurate, due to overfitting -- so accuracy depends on the problem. Many times linear models are more accurate because they avoid overfitting.
Some problems can be a hybrid of supervised and unsupervised learning, e.g. we may have response values only for a subset of the data points while the goal is clustering, and so on.
Quantitative Variables == Numerical =~ Regression Problems. Qualitative Variables == Categorical =~ Classification Problems.
A classification problem can be thought of as a special case of a regression problem, since the output value can simply be mapped to a multi-class label. The reverse is also true: if a system performs well for multi-class classification without the response values being coded as numerical integers, then it is as good as solving the regression problem. Hence most methods perform equally well for both regression and classification problems.
A regression method can very well be used for a classification problem, since it can estimate a probability (a regression value) which is then mapped to a class. Example: a least squares regression fit used for logistic-regression-style classification.
K nearest neighbours and boosting methods can be used for both regression and classification problems.
The true f may be:
Plot a graph of Flexibility vs Mean Squared Error for test data. Select the flexibility level (smoothness level for smoothing splines) that corresponds to the least test MSE.
Expected test MSE is the expected error in prediction. For the selected model, this is the best we can do to minimize MSE.
Bias - Variance Trade off :
Expected Test MSE = Var(f^(x)) + [Bias(f^(x))]^2 + Var(Epsilon)
Var(f^(x)) measures how the fitted f^ would vary depending on the training set selected. Highly flexible methods overfit the particular training set, resulting in high variance of the possible fitted functions.
Bias(f^(x)) is the inherent error from the chosen model class vs the true f, i.e. the error due to selecting, say, a linear model when the true f is non-linear.
Var(Epsilon) is the irreducible error that will be present even if you fit the training data perfectly and the variance is zero.
Bayes Classifier: For each observation X0, it assigns the Y class for which the conditional probability P(Y|X=X0) is highest.
In practice, it is impossible to compute this conditional probability for real data, so it is a gold-standard ideal. Even the Bayes classifier has an error rate, which corresponds to the irreducible error. Example: for a given instance of the same features, there may be 3 persons with blue eyes and 8 persons with black eyes -- the prediction engine will always predict black eyes for those features. If X has 2 features, we can imagine a "Bayes decision boundary" by which it predicts classes depending on the region.
The KNN classifier attempts to approximate the Bayes classifier by estimating the conditional probability from the K closest neighbors. The choice of K needs to be tuned to minimize test error.
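A minimal sketch of tuning K by test error, assuming the class package and the built-in iris data as a stand-in :
library(class)
set.seed(1)
train.idx <- sample(nrow(iris), 100)
errs <- sapply(1:15, function(k) {
  pred <- knn(iris[train.idx, 1:4], iris[-train.idx, 1:4], iris$Species[train.idx], k = k)
  mean(pred != iris$Species[-train.idx])   # test error rate for this K
})
which.min(errs)   # the K with the lowest test error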
Linear-Model
F-Statistic
T-Statistic
Simple Linear Regression: e.g. Y = B0 + B1 * X (single predictor only). Multiple Linear Regression: e.g. Y = B0 + B1 * X1 + B2 * X2 + ...
For the most optimal value of (B0, B1), the RSS (Residual Sum of Squares, i.e. MSE * n) is lowest. We can plot "isobar" contour circles in the (B0, B1) 2D space, where bigger circles correspond to higher RSS values.
A lower p-value implies a stronger association between X and Y. We compute the p-value by assuming the Null Hypothesis, i.e. no relation between X and Y. If B1 is numerically high and B1/StdErr(B1) is also high, we can assume correlation. B1/StdErr(B1) is called the T-statistic. The p-value is the probability of encountering a value more extreme than the T-statistic under the Null Hypothesis. The higher the T-statistic, the lower the p-value, and the stronger the association.
The R^2 statistic is a measure of correlation similar to r = Corr(X, Y). For simple linear regression, R^2 = r^2. For multiple linear regression, it is Corr(Y, Y^)^2 where Y^ is the predicted Y. The closer it is to 1, the stronger the correlation. It is a measure of model fit.
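A minimal sketch of reading the T-statistic, p-value and R^2 off a simple linear fit, using the built-in mtcars data as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                   # Coefficients table: Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$r.squared         # R^2
cor(mtcars$mpg, mtcars$wt)^2   # equals R^2 for simple linear regression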
RSE - Residual Standard Error -
Higher correlation is also a measure of better model fit. So a lower p-value, a higher T-statistic (above threshold) and a higher R-squared statistic are all treated as measures of better model fit. Usually R^2 (between 0 and 1) is used as the model-fit indicator, read as a fraction of variance explained.
Some predictors in a linear model could be qualitative (e.g. gender, marital status, nationality). If a predictor is multi-class, break it into multiple boolean dummy predictors, e.g. is_indian, is_chinese, is_american, etc.
Multiple Linear Regression: It is when there is more than one predictor. E.g. Y = B0 + B1 * X1 + B2 * X2 + B3 * X3
Since we fit a model (a new plane) using the least sum of squared residual errors, the result may suggest that (Y, X3) has no correlation (since B3 is almost 0) even when there really is a correlation. This can happen if there is correlation between predictors, which makes X3 redundant.
Even though ice cream sales may show a correlation with higher shark attacks, in reality the temperature increase may be the true factor: as temperature increases, there are more ice cream sales and more people at the beach. Identifying the correlation between temperature increase and ice cream sales helps eliminate false predictors.
First ask: is there at least one predictor that is useful for predicting the response? We use the F-statistic (threshold > 1) as a measure to answer this question. For each individual predictor, F-statistic = (T-statistic)^2; for the overall model, across all predictors, the F-statistic measures whether at least one predictor is useful.
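A minimal sketch of reading the overall F-statistic from a multiple linear fit, again using mtcars as a stand-in :
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)$fstatistic   # F value plus numerator/denominator degrees of freedom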
How to do variable selection? The number of subsets of p predictors is 2^p. Computing a p-value for each predictor is the obvious choice, but if all individual p-values are large while the combined F-statistic indicates correlation, then some subset of the predictors is optimal. A brute-force search won't work since the number of combinations is exponential.
Feature Selection strategies:
Potential Problems:
Non-linearity can be identified by inspecting a plot of residuals against a single predictor, or residuals against the predicted Y (for multiple predictors). Outliers are also clearly visible in this plot.
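A minimal residual-plot sketch, using mtcars as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
plot(fitted(fit), resid(fit), xlab = "Predicted Y", ylab = "Residuals")
abline(h = 0, lty = 2)   # a curved pattern suggests non-linearity; far-off points are outliers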
If you want to include interaction terms in a linear model, you can use this :
library(MASS)   # The Boston housing data lives in the MASS package.
lm.fit = lm(formula = medv ~ lstat * age, data = Boston)
# The above includes lstat, age and lstat*age as predictors.
# It is a shortcut for:
lm.fit = lm(formula = medv ~ lstat + age + lstat:age, data = Boston)
To introduce a non-linear predictor into a linear model fit, do this :
lm.fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
# Additional predictor = lstat * lstat
# The ^ has a special meaning inside a formula.
# The I() function forces the standard R meaning of raising to a power.
# It is a hack to fit a non-linear relation using linear model primitives.
# Can be handy to explore specific directions on specific predictors.
To include all polynomial terms from X^2 to X^5, we can use poly() :
lm.fit5 = lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
# Similarly, to use log() ...
summary(lm(medv ~ log(rm), data = Boston))
Logistic-Regression (Better than LDA?)
LDA
KNN
ANOVA
ROC Plot (True vs False Positives)
Confusion Matrix
Naive Bayes (Large number of features)
Response is qualitative instead of quantitative.
Most widely-used classifiers:
- logistic regression,
- linear discriminant analysis,
- K-nearest neighbors.
We discuss more computer-intensive methods in later chapters:
- generalized additive models (Chapter 7),
- trees, random forests, and boosting (Chapter 8),
- support vector machines (Chapter 9).
Multiple Logistic Regression means number of predictors > 1. Response class can be binary or multiple classes.
Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function (logit) for binary response classes.
For LDA, we model the distribution of the predictors X separately in each of the response classes (i.e. given Y ), and then use Bayes theorem to flip these around into estimates for Pr(Y = k|X = x).
LDA is based on Joint Probability P(X, Y), so it is said to be "generative learning".
A fundamental assumption of the LDA method is that the continuous independent variables are normally distributed within each class. If this assumption is wrong, the model yields a poor fit.
ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis uses continuous independent variables and a categorical dependent variable. E.g. for ANOVA: analyze the weight distribution across a few given breeds. For LDA: analyze the weight distribution to classify the breeds.
ANOVA is useful for exploratory analysis of data and feature identification, whereas LDA is useful for classification.
The decision rule for LDA is a somewhat involved mathematical function, but it is linear. To predict class y (among a few pre-defined classes), given x :
w . x > c    # then x belongs to class y, for some constant c.
# The computation of c is mathematically involved.
# For each response class y, the (w, c) pair is different.
The observation belongs to y if the corresponding x vector is located on a certain side of a hyperplane perpendicular to the vector w. The location of the hyperplane is defined by the threshold c.
LDA is superior in these cases:
- When the classes are well-separated, LDA is more stable.
- If n is small and the predictors X are approximately normally distributed, LDA is more stable.
- LDA is more popular (and arguably more appropriate) when we have more than two response classes.
A confusion matrix is a good way to visualize false positives, true positives, etc.
For LDA, we can tune the classifier to reduce a particular risk at the cost of an increased overall error rate. For example, by accepting a higher false positive rate (flagging genuine credit card applicants as defaulters) we can capture the majority of true positives (most of the true defaulters), e.g. by using the threshold Pr(default = Yes|X = x) > 0.2.
A ROC plot is a False Positive rate vs True Positive rate plot; the AUC (area under the curve) should be as large as possible.
QDA - Quadratic Discriminant Analysis is another variation of LDA. Unlike LDA, QDA assumes that each class has its own covariance matrix. QDA is more flexible, but also vulnerable to over-fitting. If the data set really has multiple groups with clearly different co-variance matrices, then QDA tends to be better fit.
The six examples illustrate that no one method will dominate the others in every situation. When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
The generalized linear model function glm() in R has an API similar to lm(), but can model a binary response by passing the family=binomial option :
library(ISLR)   # Smarket (stock market) data set.
glm.fits=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
To select only a couple of important predictors and use data up to 2004 to predict 2005, do :
glm.fits=glm(Direction~Lag1+Lag2, data=Smarket, family=binomial, subset=train)
glm.probs=predict(glm.fits, Smarket.2005, type="response")   # Smarket.2005 holds the held-out 2005 rows.
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred, Direction.2005)   # Cross-tabulated counts of factors: this is the confusion matrix.
#          Direction.2005
# glm.pred Down  Up
#     Down   35  35
#     Up     76 106
mean(glm.pred==Direction.2005)
# [1] 0.56      # 56% prediction accuracy.
106/(106+76)    # Accuracy when "Up" was predicted.
# [1] 0.582     # It is right 58% of the time when it predicted "Market Up"!
lda() and qda() (from the MASS package) follow a similar fit-then-predict approach, but qda() yields 60% accuracy for this example!
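A minimal sketch of the analogous LDA fit, assuming the same Smarket/train/Smarket.2005/Direction.2005 objects as above; qda() is called the same way :
library(MASS)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket.2005)
mean(lda.pred$class == Direction.2005)   # classification accuracy on the 2005 data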
knn() prediction is different from the other models -- there is no need to first "fit" a model and then predict; knn() forms predictions in a single command. This is because the training set itself serves as the "fitted model" when predicting for new data :
library(class)   # knn() lives in the class package.
knn.pred=knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2005)
#          Direction.2005
# knn.pred Down Up
#     Down   48 54
#     Up     63 87
mean(knn.pred==Direction.2005)
# [1] 0.536
When the number of features p is large, local approaches like KNN do poorly. This is because no interpolation happens as in other methods, and if the samples are not dense enough, you tend to pick "far away" neighbors and incorrectly "predict" the response class.
Naive Bayes is a useful classifier when the total number of features p is large. It is also known as Simple Bayes or Independence Bayes.
Naive Bayes is a "conditional probability" model.
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
The classical Naive Bayes use case is spam identification, where word frequencies are the features (large p). An alternative, more advanced approach in this domain is the SVM.
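A minimal Naive Bayes sketch, assuming the e1071 package and the built-in iris data as a stand-in :
library(e1071)
nb.fit <- naiveBayes(Species ~ ., data = iris)
nb.pred <- predict(nb.fit, iris)
table(nb.pred, iris$Species)   # confusion matrix on the training data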
Resampling is used to incrementally fit and adjust the model by repeatedly refitting on subsets of the training data.
Resampling is computationally expensive, but not prohibitively so.
This technique is often very useful for tuning the model for best accuracy (as compared to training once on all of the data).
Two of the most commonly used resampling methods are cross-validation and the bootstrap.
Cross-validation can be used to measure a learning method's performance.
The bootstrap can be used as a measure of accuracy (e.g. of a parameter estimate).
The validation set approach is easy to understand and implement, but it has the following drawbacks:
- The sample may be skewed and may not represent data points of all classes.
- Not all of the training set is used to train the model, which is a problem for smaller training sets.
Cross-validation is an improved version of the validation set approach.
Leave-one-out cross-validation (LOOCV) trains the model excluding a single point at a time, in succession; i.e. CVn = Avg(MSE_i) is the average error over the models that each exclude one point. Though LOOCV looks expensive, for some models there is a shortcut which makes computing CVn really quick: for linear model fitting using least squares, the cost of computing the LOOCV error rate over all n points is the same as for a single fit!
K-fold cross-validation makes k partitions and uses each partition as the validation set in succession, with the remaining (k-1) partitions as the training set. For learning methods that are complex and costly to fit, the k-fold approach needs only k fits instead of n, so it is much faster than LOOCV. K-fold estimates are also smoother and useful for preventing overfitting.
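A minimal sketch of LOOCV and 10-fold CV error estimates via boot::cv.glm, using mtcars as a stand-in :
library(boot)
glm.fit <- glm(mpg ~ wt, data = mtcars)    # glm() with the default family behaves like lm()
cv.glm(mtcars, glm.fit)$delta[1]           # LOOCV estimate of test MSE
cv.glm(mtcars, glm.fit, K = 10)$delta[1]   # 10-fold CV estimate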
Dimension-Reduction
The least squares fitting procedure performs poorly when p > n. There is not even a unique least squares coefficient estimate, and the variance is infinite.
By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
The least squares fitting model is not "intelligent enough" to recognize irrelevant features (and assign them zero coefficients).
Some strategies to deal with these least squares fitting problems are:
Understanding high-dimensional data: consider gene expression data for 100 individuals, where each expression profile contains half a million features, so p >> n. It is still quite possible to locate a subset of the p features that correlates with the response Y, so such cases are common in practice. It is important to include only the relevant features -- and to exclude coincidentally correlated (false) features.
Ridge Regression: the modified least squares objective includes an additional term, lambda * sum(beta^2), which encourages the coefficients towards zero. Lambda is a tuning parameter: the higher its value, the more aggressive the shrinkage. The additional term is called the "shrinkage penalty" (really a penalty that encourages shrinkage). This relates to the bias-variance tradeoff -- as lambda increases, we lose flexibility, accepting a slight increase in bias for a reduction in variance, often delivering better accuracy. Model fitting is computationally very efficient -- just fix lambda, then do a linear fit.
Lasso regression is similar to ridge regression but can set some coefficients to exactly zero instead of a "very small value". This leads to variable selection. The penalty term is lambda * sum(abs(beta)), i.e. it applies an l1 (absolute value) penalty instead of the l2 (squared) penalty.
Best subset selection using R :
library(leaps)
library(ISLR)   # Hitters data: 19 variables/features.
regfit.full=regsubsets(Salary~., data=Hitters)
summary(regfit.full)
# Displays a matrix showing which features are selected if we must keep only 1..8 of them,
# i.e. the best 1-, 2-, 3-, ... 8-feature subsets found by exhaustive search.
coef(regfit.full, 6)   # Coefficients of the best 6-feature model.
Using the glmnet package for Ridge or Lasso regression :
library(glmnet)
# glmnet() takes a numeric predictor matrix x and response vector y,
# e.g. x=model.matrix(Salary~., Hitters)[,-1]; y=Hitters$Salary (after removing NA rows).
grid=10^seq(10, -2, length=100)
ridge.mod=glmnet(x, y, alpha=0, lambda=grid)
# alpha = 0 means Ridge; alpha = 1 means Lasso.
# Lambda range: 10^10 to 10^-2, i.e. a wide range.
ridge.pred=predict(ridge.mod, s=4, newx=x[test,])   # s = 4 supplies another lambda value.
Using the pls package for PCR (Principal Components Regression) :
library(pls)
set.seed(2)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV")
summary(pcr.fit)
# Data:  X dimension: 263 19
#        Y dimension: 263 1
# Fit method: svdpc   # Fits the PCR model using singular value decomposition.
Using the pls package for PLS (Partial Least Squares) :
library(pls)
pls.fit=plsr(Salary~., data=Hitters, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
# Data:  X dimension: 131 19
#        Y dimension: 131 1
# Fit method: kernelpls
We examine very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
Polynomial regression extends the linear model by adding extra predictors, for example X, X^2 and X^3.
Step functions cut the range of a variable into K distinct regions. This has the effect of fitting a piecewise constant function.
Regression splines: an extension of both polynomials and step functions, and more flexible than either. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.
Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.
Generalized additive models allow us to extend the methods above to deal with multiple predictors.
Degrees of freedom: Number of free parameters. e.g. All coefficients and intercept.
There is a shortcut efficient formula to compute LOOCV (Leave one out cross validation) error function for smoothing splines (similar to least squares linear regression).
Local regression is sometimes referred to as a memory-based procedure, because like nearest-neighbors, we need all the training data each time we wish to compute a prediction.
GAM : Generalized Additive Models: GAMs can be applied with both quantitative and qualitative responses.
A limitation of GAMs is that interactions between features are missed, e.g. X1*X2. For fully general models, we should look at random forests and boosting.
In order to fit a step function, we use the cut() function :
library(ISLR)   # The Wage data is from the ISLR package.
fit=lm(wage~cut(age,4), data=Wage)
coef(summary(fit))
#                        Estimate Std. Error t value Pr(>|t|)
# (Intercept)               94.16       1.48   63.79 0.00e+00
# cut(age, 4)(33.5,49]      24.05       1.83   13.15 1.98e-38
# cut(age, 4)(49,64.5]      23.66       2.07   11.44 1.04e-29
# cut(age, 4)(64.5,80.1]     7.64       4.99    1.53 1.26e-01
The function cut() returns an ordered categorical variable; the lm() function then creates a set of dummy variables for use in the regression.
Fitting wage to age using a regression spline is simple :
library(splines)
attach(Wage)                                   # make age and wage directly available
agelims=range(age)
age.grid=seq(from=agelims[1], to=agelims[2])   # grid of age values for prediction
fit=lm(wage~bs(age, knots=c(25,40,60)), data=Wage)   # bs() generates basis functions. Default is cubic.
pred=predict(fit, newdata=list(age=age.grid), se=T)
plot(age, wage, col="gray")
lines(age.grid, pred$fit, lwd=2)                      # Plot prediction.
lines(age.grid, pred$fit+2*pred$se, lty="dashed")     # Plot +/- 2 * stderr bands.
lines(age.grid, pred$fit-2*pred$se, lty="dashed")
Here we have prespecified knots at ages 25, 40, and 60. This produces a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.) We could also use the df option to produce a spline with knots at uniform quantiles of the data. :
> dim(bs(age, knots=c(25,40,60)))
[1] 3000 6
> dim(bs(age, df=6))
[1] 3000 6
> attr(bs(age, df=6), "knots")
  25%  50%  75%
 33.8 42.0 51.0
# R chooses knots at ages 33.8, 42.0, and 51.0, which correspond
# to the 25th, 50th, and 75th percentiles of age.
# The bs() function also has a degree argument, so we can fit splines of
# any degree, rather than the default degree of 3 (which yields a cubic spline).
# To fit a natural spline, we use the ns() function.
fit2=lm(wage~ns(age, df=4), data=Wage)
pred2=predict(fit2, newdata=list(age=age.grid), se=T)
lines(age.grid, pred2$fit, col="red", lwd=2)
# In order to fit a smoothing spline, we use the smooth.spline() function.
plot(age, wage, xlim=agelims, cex=.5, col="darkgrey")
title("Smoothing Spline")
fit=smooth.spline(age, wage, df=16)
fit2=smooth.spline(age, wage, cv=TRUE)
fit2$df
# [1] 6.8
lines(fit, col="red", lwd=2)
# In order to perform local regression, we use the loess() function.
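# A minimal loess() sketch continuing the same Wage example; span controls the neighborhood size.
fit=loess(wage~age, span=.5, data=Wage)
pred=predict(fit, data.frame(age=age.grid))
lines(age.grid, pred, col="blue", lwd=2)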
Segment the predictor space into simple regions. Single tree methods do not provide enough accuracy.
So we examine bagging, random forests, and boosting.
A classification tree is very similar to a regression tree, except that the tree is used to predict a qualitative response. The predicted response for an observation is the most commonly occurring class among the training observations in that region (as opposed to the mean of the values for a quantitative response). If f(x) is qualitative, take the most commonly occurring value as the result.
Bagging: Decision trees suffer from high variance. If we calculate f(x) over B different training sets and average the results, we will have low variance. We use bootstrap (sampling) method to generate B different training sets.
Random Forests: an improvement over bagging, obtained by decorrelating the individual trees. At each split, a random sample of m predictors is considered, typically m = sqrt(p); at each split not even a majority of the predictors is considered. This results in more diversity among the resulting trees, which in turn results in lower variance (though that is not an obvious fact).
The total number of trees (B) and the predictor subset size (m) are the main tuning factors for a Random Forest model. For example, with gene expressions, p = 4K, B = 400 and m = sqrt(p) works well.
As with bagging, random forests will not overfit if we increase B.
Boosting: it involves incrementally improving the tree ensemble based on previous results, successively reducing the residual errors. Both boosting and random forests use tree ensembles; random forests are more suitable for parallelization of the algorithm, whereas boosting is mostly sequential. However, there are opportunities to parallelize operations at the single-split level for boosting, which is done most efficiently by the xgboost library. The gbm library in R is another alternative.
Boosting has three tuning parameters:
- The number of trees B. Unlike bagging, boosting can overfit if B is large. We use cross-validation to select B.
- The shrinkage parameter λ to control learning rate. e.g. 0.01 or 0.001 Very small λ can require using large value of B.
- The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, consisting of a single split. In this case, the boosted stump ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls interaction depth.
Using a single decision tree :
library(MASS)    # Boston data.
library(tree)    # tree() lives in the tree package.
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~., Boston, subset=train)
summary(tree.boston)
...
yhat = predict(tree.boston, newdata=Boston[-train,])
Using bagging and random forests (bagging is the special case of a random forest where m = p) :
library(randomForest)
set.seed(1)
bag.boston=randomForest(medv~., data=Boston, subset=train, mtry=13, importance=TRUE)   # Total predictors = 13; m = 13, so this is bagging.
boston.test=Boston[-train, "medv"]                          # Held-out responses.
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag, boston.test)
# Random forest: m = p/3 by default for regression. We try m = 6.
rf.boston=randomForest(medv~., data=Boston, subset=train, mtry=6, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train,])
Using boosting in R :
library(gbm)
boost.boston=gbm(medv~., data=Boston[train,], distribution="gaussian",
                 n.trees=5000, interaction.depth=4)
# Use distribution="bernoulli" for a classification problem.
summary(boost.boston)
#      var rel.inf   # Relative importance in % for all vars.
# 1  lstat   45.96
# 2     rm   31.22
# 3    dis    6.81
# ...
# We see that lstat and rm are by far the most important variables. We can
# also produce partial dependence plots for these two variables.
yhat.boost=predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)^2)
# [1] 11.8
# The test MSE obtained is 11.8; similar to random forests and superior to bagging.
# If we want, we can perform boosting with a different lambda (shrinkage) parameter.
boost.boston=gbm(medv~., data=Boston[train,], distribution="gaussian", n.trees=5000,
                 interaction.depth=4, shrinkage=0.2, verbose=F)
yhat.boost=predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)^2)
# [1] 11.5
# In this case, using lambda = 0.2 leads to a slightly lower test MSE than the default lambda = 0.001.
SVM classifier is a generalization of maximal margin classifier.
The Support Vector Machine is an extension of the support vector classifier which handles non-linear boundaries.
The maximal margin classifier, though elegant and simple, cannot be applied to most datasets, since it requires the class boundary to be linear and the classes to be separable.
When points are non-separable, we generalize the maximal margin classifier to find the hyperplane which "almost" separates the classes using a "soft margin". Such a classifier is called a support vector classifier.
SVM is intended for binary classification but can be extended to more classes.
A Support Vector Machine is a support vector classifier that uses an enlarged feature space, e.g. predictors X, X*X, etc., so that the decision boundary can be non-linear. Even interaction terms, e.g. X1*X2, can be included.
The enlarging of feature space is done using "kernels".
The example for "kernel" is a "polynomial kernel" of degree d :
K(o1, o2) = (1 + sum(o1.xi*o2.xi) where i = 1 to p features )^d
# Another form is a "Radial Kernel"
SVM for multi-class classification has a few possible approaches:
- One-versus-one classification: build all C(K,2) pairwise classifiers and apply them; choose the class picked most frequently.
- One-versus-all: create K SVM models, each representing the i-th class vs the rest; choose the one with the highest level of confidence (the largest distance on the correct side of its boundary).
SVM is not unique in using kernels to accommodate non-linear boundaries. We can very well use this technique with logistic regression as well; for historical reasons, the use of kernels is most common with SVMs.
Extension of SVM for regression is called: Support Vector Regression
Using svm() in R :
library(e1071)
# The response variable must be a factor to force classification instead of regression.
dat=data.frame(x=x, y=as.factor(y))
svmfit=svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
# Note: kernel could be "polynomial" or "radial" as well.
summary(svmfit)
ypred=predict(svmfit, testdat)
In this section, we describe bottom-up or agglomerative clustering.
A dendrogram is built starting from the leaves and combining clusters up to the trunk.
Also, sometimes the clustering intention is ambiguous, e.g. split by gender vs nationality.
The clustering algorithm involves measuring the dissimilarity between each pair. It starts with every observation in its own cluster at the leaves and keeps fusing the most similar clusters in succession.
Dissimilarity between groups is computed using the concept of "linkage":
- Complete: compute all pairwise dissimilarities between clusters A and B; the maximal dissimilarity is the "complete" linkage.
- Centroid: dissimilarity between the centroid of cluster A (a mean vector of length p) and that of B. (Undesirable inversions may occur with centroid linkage, so it is often avoided.)
Clustering depends on type of linkage used and the choice of dissimilarity measure.
Clustering methods are not stable if some observations were removed.
Principal Components Analysis in R :
pr.out=prcomp(USArrests, scale=TRUE)
# By default, prcomp() centers the variables to have mean zero.
# scale=TRUE scales the variables to have standard deviation one.
names(pr.out)
# [1] "sdev"  "rotation"  "center"  "scale"  "x"
K-means clustering in R :
km.out=kmeans(x, 2, nstart=20)
km.out$cluster
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
# [30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# nstart=20 reruns the algorithm 20 times with different initial random
# assignments and keeps the grouping with the least within-cluster variance.
Hierarchical clustering in R :
# The hclust() function implements hierarchical clustering in R.
hc.complete=hclust(dist(x), method="complete")
hc.average=hclust(dist(x), method="average")
hc.single=hclust(dist(x), method="single")
par(mfrow=c(1,3))
plot(hc.complete, main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
hc.out = cutree(hc.complete, 2)   # Display the grouping if we cut the tree into 2 clusters.
hc.out
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
# [30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# Draw a line at the height (139 here) that produces the desired number of clusters (found via cutree).
abline(h=139, col="red")
Key concepts of ANOVA are as below.
Hypothesis Testing:
- Null Hypothesis H0: all group means are equal.
- Alternative Hypothesis Ha: at least one group mean is different.
ANOVA calculates the F-statistic :
F = Variance Between Groups / Variance Within Groups
# The higher the F, the more the group means differ.
Types of ANOVA:
Post-hoc Tests:
Example:
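A minimal one-way ANOVA sketch in R, using made-up breed/weight data (as in the ANOVA-vs-LDA comparison above) :
set.seed(1)
weights <- data.frame(
  breed  = factor(rep(c("A", "B", "C"), each = 20)),
  weight = c(rnorm(20, 30), rnorm(20, 32), rnorm(20, 31))
)
fit <- aov(weight ~ breed, data = weights)
summary(fit)    # F value and Pr(>F): reject H0 if the p-value is small.
TukeyHSD(fit)   # Post-hoc test: which pairs of groups differ.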
In classification there are 2 main concepts -- Activation function and decision boundary. Another important concept is feature space.
If all of these are non-linear, the classifier is obviously non-linear. If only some of them are non-linear, it is not obvious whether the classifier is linear or non-linear.
For example, a linear classifier with transformed non-linear features creates a non-linear decision boundary.
A logistic regression classifier has a non-linear activation function with a linear decision boundary, i.e. the predictor function z is linear. It can be claimed to be a "generalized linear model".
A truly non-linear classifier example is the K-nearest-neighbor algorithm; its decision boundary is non-linear.
We can say that a linear decision boundary implies a linear model.
However, a non-linear decision boundary does not necessarily imply a non-linear model, since the model could have a linear decision boundary in a transformed feature space.
If there is no linear decision boundary at all (even with a transformed feature space), then it is a non-linear classifier.
By definition, a model is linear when the output variable can be expressed as a linear expression of the input predictors. The predictors themselves can be non-linear, but the parameters must enter linearly.
We can use log(y) instead of y as the output variable to make the expression linear.
Such transformations fall under "generalized linear models".
Note that you can use feature transformations like x1*x2 and x1*x1 and apply a linear classification algorithm. The decision boundary is linear in the transformed feature space, though non-linear in the original data space.
The sigmoid activation function pi(x1, x2, ...) :
pi = 1 / (1 + e^-z)      # when z = 0, the result is 0.5
z = w1*x1 + w2*x2 + ...
# There is no interaction between features,
# hence the decision boundary is linear.
# You can make the decision boundary non-linear by introducing
# transformed features like x1*x2.
Note: Similarly, you can have linear and non-linear SVM classifiers.
Model should satisfy these components :
Consider two big groups of balls with different predictors, one predictor being weight. Each predictor has a variance and standard deviation based on the observed values.
If the real group is big, we consider samplings of this group and compute the variance/deviation; in that case we call these the sample variance of each predictor. If the total group size is N and the sample size is n, we have C(N, n) possible samples of size n. For a given sample size n, the sample mean of predictor p1 will follow a certain distribution, e.g. normal, uniform, etc. By observing this sampling distribution of the mean/median, we can obtain a lot of information. A common goal is to find out whether the sample mean is an unbiased estimator of the real population mean/median. Higher variance may reflect the inherent spread of the predictor, may indicate that the sample size is too small, etc.
For linear regression with predictor X1, we compute the coefficient B1 so that y = B0 + B1 * X1. The computed B1 varies depending on the subsample. This variance of B1 indicates the significance of predictor X1 for the linear model: the less the variance, the more significant the predictor. That is how the T-statistic is computed, as B1/StdErr(B1).
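A minimal sketch of the sampling variability of B1 via the bootstrap, assuming the boot package and mtcars as a stand-in :
library(boot)
boot.fn <- function(data, index) coef(lm(mpg ~ wt, data = data, subset = index))[2]   # slope B1
boot(mtcars, boot.fn, R = 1000)   # the std. error of the bootstrap replicates approximates StdErr(B1)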