cssclasses:
- wideTable
- mytable
Notes On Elements of Statistical Learning with R Book
Brief notes on the Introduction to Statistical Learning and Elements of Statistical Learning books by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
Statistical learning involves assuming a hypothesis, building a model (e.g. a linear regression model), and then evaluating whether the model holds good. If not, you reject it and build another model.
In machine learning, this whole process of fitting a model to data is largely automated, without manual intervention. The algorithm itself may try different models and choose the best one, given enough data. Over a period of time, the machine learning model becomes more intelligent, whereas a statistical model relies on a large enough dataset upfront so that we can design a decent model.
Our motivation for statistical learning comes from various goals:
- We want to understand a specific domain, so that we can plan accordingly. The domain could be:
- We want to detect inherent clusters in a given population across N different variables. (This is not prediction but identification, aka clustering.)
The top domains of applications are:
Prediction vs Probability:
- A prediction is nothing but the most probable value.
- If the predicted value follows a normal distribution, we will always predict the mean value, because that is the peak!
- It makes more sense to talk about confidence intervals for a prediction and their associated probability, e.g. mean +/- 3*sigma covers the major interval of potential values for a normal distribution.
- If our desired prediction is binary (i.e. classification), then we can only come up with the probability of the outcome being 1 or 0. Such a probability function is usually called the logistic regression function; it is neither a purely linear nor a purely non-linear model: the sigmoid function is non-linear, but the decision boundary is linear, so it is usually treated as a generalized linear model.
- Probability is always relevant, whether you predict a linear value, a non-linear value, or answer a classification yes/no question.
Linear Model:
Generalized Linear Model:
Nonlinear Model:
| Statistics | Computer Science | Comments |
|---|---|---|
| classification | Supervised Learning | |
| clustering / Density Estimation | Unsupervised Learning | |
| data | training sample | |
| covariates | Features | |
| classifier | Hypothesis | p < 0.05: reject Null Hypothesis |
| DAG | Bayes Net - Multivariate Dist | Conditional Probability |
| Bayes Inference | Prior Belief/Distribution | |
| Frequentist Inference | Large Sample Approximation | |
See Also:
Regression is a set of techniques for estimating relationships.
Regression also has another, older independent meaning ("regression toward the mean"), which has nothing to do with relationship estimation.
| Term | Name | Formula | Comments |
|---|---|---|---|
| Y | Response Y value - Actual | Y or Yi | Actual Yi |
| Y^ or Ye | Expected Y or Y hat | Y^ or Ye | Expected Y |
| Yavg | Average Y or Y mean | Ym or Yavg | Y Mean |
| MSE | Mean Squared Error | Sum(Y - Ye)^2 / N | Error Variance |
| RMSE | Root Mean Squared Error | Sqrt(MSE) | Unit: Y |
| RMSD | RMS Deviation | --ditto-- | Alias for RMSE |
| NRMSD | Normalized RMSD | Sqrt(MSE) / Range(Y) | A fraction < 1 |
| RSS | Residual Sum of Squares | MSE * N | Unit: Y^2 |
| SSR | Sum of Squared Residuals | --ditto-- | |
| SSE | Sum of Squared Errors | --ditto-- | |
| RSE | Residual Std Error | Sqrt(RSS / (N-2)) | Unit: Y. Predictors = 1. Approximately RMSE |
| RSE (m > 1) | RSE for Multiple LR | Sqrt(RSS / (N-m-1)) | Total predictors = m |
| TSS | Total Sum of Squares | Sum(Y - Yavg)^2 | Actual Variance |
| ESS | Explained Sum of Squares | Sum(Ye - Yavg)^2 | Explained Variance |
| r | Pearson Corr Coeff | Cov(X,Y) / (sd(X) * sd(Y)) | Linear Bivariate Corr |
| R^2 | r-squared or R-squared | r * r = Corr(Actual, Predicted)^2 | Coeff of Determination, <= 1. Higher value, better model |
| ANOVA | Analysis of Variance | | Analyzes means of 3 or more groups |
| F | F-Statistic (Threshold) | | Are joint predictors p1, p2 significant? Yes if higher |
| T | T-Statistic | B1 / StdErr(B1), B1 = X1 Coeff | Higher value: predictor is significant. B1 varies by sampling distribution |
| p value | P-Value of variable | Prob(> |T-Statistic|) | Lower value: predictor is significant |
See What is ANOVA?
The reason we divide RSS by (N-2), not N-1 or N, for simple linear regression (with 1 predictor) is that the error itself follows a chi-square distribution with N-2 (or N-m-1 for m predictors) degrees of freedom. For large N, RSE approximates RMSE, i.e. Root Mean Squared Error, so you can treat RSE as RMSE for intuitive understanding. It is also helpful to look at RMSE/Avg(Y) as a fractional measure of error.
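A quick sanity check comparing RSE and RMSE in R, using the built-in mtcars data as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$sigma           # RSE = sqrt(RSS / (n - 2))
sqrt(mean(resid(fit)^2))     # RMSE = sqrt(RSS / n); close to RSE when n is large.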
Generalization of Linear Models :
- Classification Problems: Logistic Regression, Support Vector Machines. Even though the logistic regression function is non-linear, the decision boundary is linear.
- Non-Linearity: Splines, Generalized Additive Models, nearest-neighbor methods. Remember that a dummy auxiliary predictor p1*p2 captures non-linear dependency using linear methods. Linear does not mean "continuous"; the fit can have corner edges. It just means the methodology for finding coefficients is linear, using a cost function such as least MSE.
- Interactions: Tree-based methods (which can have different linear equations for different ranges in subtrees), bagging, random forests and boosting. Note that interactions are captured more easily by these methods than by simple linear methods.
- Regularized Fitting: Ridge and Lasso. More appropriate when p is large relative to n.
Box plot: the middle line is the median. The box rectangle marks the 25% and 75% boundaries (Q1 and Q3). The lowermost/uppermost ends of the vertical lines (whiskers) indicate the hinges. A point more than 1.5*IQR above Q3 or below Q1 is considered an outlier.
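A minimal base-R sketch of these boxplot parts, using made-up data :
set.seed(1)
x <- c(rnorm(100), 5)           # include one extreme value
bp <- boxplot(x, main = "Boxplot anatomy")
median(x)                       # middle line of the box
quantile(x, c(0.25, 0.75))      # lower/upper edges of the box (Q1, Q3)
IQR(x)                          # Q3 - Q1
bp$out                          # points beyond Q3 + 1.5*IQR or Q1 - 1.5*IQR (outliers)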
A regression function f(X) is nothing but the average of Y values given X = x1, for all possible X values. If no Y exists for a given X = x1, then use the neighborhood average of all known Y values around x1. This is a good technique for small p (predictors or features) and large n. There are also alternatives like spline smoothing and kernels. When p is large, nearest neighbors are typically far away (since each point really has only a subset of possible feature values -- sparse features).
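A minimal sketch of this neighborhood-average idea, assuming the FNN package and made-up single-predictor data :
library(FNN)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
fit <- knn.reg(train = data.frame(x = x), test = data.frame(x = 0.3), y = y, k = 10)
fit$pred    # average of the 10 Y values whose X is closest to 0.3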
(Interpretability vs Flexibility)
- Bayes Classifier, KNN
- Most flexible: SVM, Bagging, Boosting
Consider optimizing a given advertising budget across TV, Newspaper and Radio.
Input variables == budget == Predictors, Variables, Features. Output variable == Response == Sales.
Y = f(X) + Epsilon. Epsilon is a random error term with mean zero. The task is to determine f.
Inference Problem: Sometimes the goal is not just prediction in the same setting, but to get deeper insights into individual feature effects that could be useful in other settings as well; i.e. f cannot be treated as a black box (as it is with deep learning algorithms). Example: Is this house overpriced? That is a prediction problem. How much extra would the house cost if it had a river view? That is an inference problem.
Parametric vs Non-Parametric approaches to find f: a parametric approach assumes, for example, that the function is linear and then finds the parameter coefficients. A thin-plate spline is an example of a non-parametric approach. Non-parametric approaches need more data and a chosen smoothness level to reduce rough edges.
Inflexible or Restrictive model: A linear model is said to be inflexible since it cannot represent complex shapes for the function f. But it is more interpretable since it gives more insight about the features.
As flexibility increases, interpretability decreases.
The least flexible models are the Lasso and least squares linear models. The most flexible models are SVM, bagging and boosting.
The Lasso sets a number of coefficients to exactly zero, reducing the total number of features -- so it is more inflexible but easier to interpret or relate to a small number of features (at the cost of accuracy).
Flexible does not necessarily mean more accurate, due to overfitting -- so accuracy depends on the problem. Many times linear models are more accurate because they avoid overfitting.
Some problems can be a hybrid of supervised and unsupervised learning, e.g. we may have response values only for a subset of the data points while the goal is clustering, and so on.
Quantitative Variables == Numerical =~ Regression Problems. Qualitative Variables == Categorical =~ Classification Problems.
A classification problem can be thought of as a special case of a regression problem, since the output value can simply be mapped to a multi-class label. The reverse is also true: if a system performs well for multi-class classification without the response values being coded as numerical integers, then it is as good as solving the regression problem. Hence most methods perform equally well for both regression and classification problems.
A regression method can very well be used for a classification problem, since it can estimate a probability (a regression value) which is then mapped to a class. Example: a least squares regression fit used for logistic-regression-style classification.
K nearest neighbours and boosting methods can be used for both regression and classification problems.
The true f may be:
Plot a graph of Flexibility vs Mean Squared Error for test data. Select the flexibility level (smoothness level for smoothing splines) that corresponds to the least test MSE.
Expected test MSE is the expected error in prediction. For the selected model, this is the best we can do to minimize MSE.
Bias - Variance Trade off :
Expected Test MSE = Var(f^(x)) + [Bias(f^(x))]^2 + Var(Epsilon)
Var(f^(x)) measures how the fitted f^ would vary depending on the training set selected. Highly flexible methods overfit the particular training set, resulting in high variance of the possible fitted functions.
Bias(f^(x)) is the inherent error from the chosen model class vs the true f, i.e. the error due to selecting, say, a linear model when the true f is non-linear.
Var(Epsilon) is the irreducible error that will be present even if you fit the training data perfectly and the variance is zero.
Bayes Classifier: For each observation X0, it assigns the Y class for which the conditional probability P(Y|X=X0) is highest.
In practice, it is impossible to compute this conditional probability for real data, so it is a gold-standard ideal. Even the Bayes classifier has an error rate, which corresponds to the irreducible error. Example: for a given instance of the same features, there may be 3 persons with blue eyes and 8 persons with black eyes -- the prediction engine will always predict black eyes for those features. If X has 2 features, we can imagine a "Bayes decision boundary" by which it predicts classes depending on the region.
The KNN classifier attempts to approximate the Bayes classifier by estimating the conditional probability from the K closest neighbors. The choice of K needs to be tuned to minimize test error.
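A minimal sketch of tuning K by test error, assuming the class package and the built-in iris data as a stand-in :
library(class)
set.seed(1)
train.idx <- sample(nrow(iris), 100)
errs <- sapply(1:15, function(k) {
  pred <- knn(iris[train.idx, 1:4], iris[-train.idx, 1:4], iris$Species[train.idx], k = k)
  mean(pred != iris$Species[-train.idx])   # test error rate for this K
})
which.min(errs)   # the K with the lowest test error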
Linear-Model
F-Statistic
T-Statistic
Simple Linear Regression: e.g. Y = B0 + B1 * X (single predictor only). Multiple Linear Regression: e.g. Y = B0 + B1 * X1 + B2 * X2 + ...
For the most optimal value of (B0, B1), the RSS (Residual Sum of Squares, i.e. MSE * n) is lowest. We can plot "isobar" contour circles in the (B0, B1) 2D space, where bigger circles correspond to higher RSS values.
A lower p-value implies a stronger association between X and Y. We compute the p-value by assuming the Null Hypothesis, i.e. no relation between X and Y. If B1 is numerically high and B1/StdErr(B1) is also high, we can assume correlation. B1/StdErr(B1) is called the T-statistic. The p-value is the probability of encountering a value more extreme than the T-statistic under the Null Hypothesis. The higher the T-statistic, the lower the p-value, and the stronger the association.
The R^2 statistic is a measure of correlation similar to r = Corr(X, Y). For simple linear regression, R^2 = r^2. For multiple linear regression, it is Corr(Y, Y^)^2 where Y^ is the predicted Y. The closer it is to 1, the stronger the correlation. It is a measure of model fit.
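A minimal sketch of reading the T-statistic, p-value and R^2 off a simple linear fit, using the built-in mtcars data as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                   # Coefficients table: Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$r.squared         # R^2
cor(mtcars$mpg, mtcars$wt)^2   # equals R^2 for simple linear regression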
RSE - Residual Standard Error -
Higher correlation is also a measure of better model fit. So a lower p-value, a higher T-statistic (above threshold) and a higher R-squared statistic are all treated as measures of better model fit. Usually R^2 (between 0 and 1) is used as the model-fit indicator, read as a fraction of variance explained.
Some predictors in a linear model could be qualitative (e.g. gender, marital status, nationality). If a predictor is multi-class, break it into multiple boolean dummy predictors, e.g. is_indian, is_chinese, is_american, etc.
Multiple Linear Regression: It is when there is more than one predictor. E.g. Y = B0 + B1 * X1 + B2 * X2 + B3 * X3
Since we fit a model (a new plane) using the least sum of squared residual errors, the result may suggest that (Y, X3) has no correlation (since B3 is almost 0) even when there really is a correlation. This can happen if there is correlation between predictors, which makes X3 redundant.
Even though ice cream sales may show a correlation with higher shark attacks, in reality the temperature increase may be the true factor: as temperature increases, there are more ice cream sales and more people at the beach. Identifying the correlation between temperature increase and ice cream sales helps eliminate false predictors.
First ask: is there at least one predictor that is useful for predicting the response? We use the F-statistic (threshold > 1) as a measure to answer this question. For each individual predictor, F-statistic = (T-statistic)^2; for the overall model, across all predictors, the F-statistic measures whether at least one predictor is useful.
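A minimal sketch of reading the overall F-statistic from a multiple linear fit, again using mtcars as a stand-in :
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
summary(fit)$fstatistic   # F value plus numerator/denominator degrees of freedom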
How to do variable selection? The number of subsets of p predictors is 2^p. Computing a p-value for each predictor is the obvious choice, but if all individual p-values are large while the combined F-statistic indicates correlation, then some subset of the predictors is optimal. A brute-force search won't work since the number of combinations is exponential.
Feature Selection strategies:
Potential Problems:
Non-linearity can be identified by inspecting a plot of residuals against a single predictor, or residuals against the predicted Y (for multiple predictors). Outliers are also clearly visible in this plot.
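A minimal residual-plot sketch, using mtcars as a stand-in :
fit <- lm(mpg ~ wt, data = mtcars)
plot(fitted(fit), resid(fit), xlab = "Predicted Y", ylab = "Residuals")
abline(h = 0, lty = 2)   # a curved pattern suggests non-linearity; far-off points are outliers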
If you want to include interaction terms in a linear model, you can use this :
library(MASS)   # The Boston housing data lives in the MASS package.
lm.fit = lm(formula = medv ~ lstat * age, data = Boston)
# The above includes lstat, age and lstat*age as predictors.
# It is a shortcut for:
lm.fit = lm(formula = medv ~ lstat + age + lstat:age, data = Boston)
To introduce a non-linear predictor into a linear model fit, do this :
lm.fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
# Additional predictor = lstat * lstat
# The ^ has a special meaning inside a formula.
# The I() function forces the standard R meaning of raising to a power.
# It is a hack to fit a non-linear relation using linear model primitives.
# Can be handy to explore specific directions on specific predictors.
To include all polynomial terms from X^2 to X^5, we can use poly() :
lm.fit5 = lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
# Similarly, to use log() ...
summary(lm(medv ~ log(rm), data = Boston))
Logistic-Regression (Better than LDA?)
LDA
KNN
ANOVA
ROC Plot (True vs False Positives)
Confusion Matrix
Naive Bayes (Large number of features)
Response is qualitative instead of quantitative.
Most widely-used classifiers:
- logistic regression,
- linear discriminant analysis,
- K-nearest neighbors.
We discuss more computer-intensive methods in later chapters:
- generalized additive models (Chapter 7),
- trees, random forests, and boosting (Chapter 8),
- support vector machines (Chapter 9).
Multiple Logistic Regression means number of predictors > 1. Response class can be binary or multiple classes.
Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function (logit) for binary response classes.
For LDA, we model the distribution of the predictors X separately in each of the response classes (i.e. given Y ), and then use Bayes theorem to flip these around into estimates for Pr(Y = k|X = x).
LDA is based on Joint Probability P(X, Y), so it is said to be "generative learning".
A fundamental assumption of the LDA method is that the continuous independent variables are normally distributed within each class. If this assumption is wrong, the model yields a poor fit.
ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis uses continuous independent variables and a categorical dependent variable. E.g. for ANOVA: analyze the weight distribution across a few given breeds. For LDA: analyze the weight distribution to classify the breeds.
ANOVA is useful for exploratory analysis of data and feature identification, whereas LDA is useful for classification.
The decision rule for LDA is a somewhat involved mathematical function, but it is linear. To predict class y (among a few pre-defined classes), given x :
w . x > c    # then x belongs to class y, for some constant c.
# The computation of c is mathematically involved.
# For each response class y, the (w, c) pair is different.
The observation belongs to y if the corresponding x vector is located on a certain side of a hyperplane perpendicular to the vector w. The location of the hyperplane is defined by the threshold c.
LDA is superior in these cases:
- When the classes are well-separated, LDA is more stable.
- If n is small and the predictors X are approximately normally distributed, LDA is more stable.
- LDA is more popular (and arguably more appropriate) when we have more than two response classes.
A confusion matrix is a good way to visualize false positives, true positives, etc.
For LDA, we can tune the classifier to reduce a particular risk at the cost of an increased overall error rate. For example, by accepting a higher false positive rate (flagging genuine credit card applicants as defaulters) we can capture the majority of true positives (most of the true defaulters), e.g. by using the threshold Pr(default = Yes|X = x) > 0.2.
A ROC plot is a False Positive rate vs True Positive rate plot; the AUC (area under the curve) should be as large as possible.
QDA - Quadratic Discriminant Analysis is another variation of LDA. Unlike LDA, QDA assumes that each class has its own covariance matrix. QDA is more flexible, but also vulnerable to over-fitting. If the data set really has multiple groups with clearly different co-variance matrices, then QDA tends to be better fit.
The six examples illustrate that no one method will dominate the others in every situation. When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
The generalized linear model function glm() in R has an API similar to lm(), but can model a binary response by passing the family=binomial option :
library(ISLR)   # Smarket (stock market) data set.
glm.fits=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
To select only a couple of important predictors and use data up to 2004 to predict 2005, do :
glm.fits=glm(Direction~Lag1+Lag2, data=Smarket, family=binomial, subset=train)
glm.probs=predict(glm.fits, Smarket.2005, type="response")   # Smarket.2005 holds the held-out 2005 rows.
glm.pred=rep("Down",252)
glm.pred[glm.probs>.5]="Up"
table(glm.pred, Direction.2005)   # Cross-tabulated counts of factors: this is the confusion matrix.
#          Direction.2005
# glm.pred Down  Up
#     Down   35  35
#     Up     76 106
mean(glm.pred==Direction.2005)
# [1] 0.56      # 56% prediction accuracy.
106/(106+76)    # Accuracy when "Up" was predicted.
# [1] 0.582     # It is right 58% of the time when it predicted "Market Up"!
lda() and qda() (from the MASS package) follow a similar fit-then-predict approach, but qda() yields 60% accuracy for this example!
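A minimal sketch of the analogous LDA fit, assuming the same Smarket/train/Smarket.2005/Direction.2005 objects as above; qda() is called the same way :
library(MASS)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket.2005)
mean(lda.pred$class == Direction.2005)   # classification accuracy on the 2005 data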
knn() prediction is different from the other models -- there is no need to first "fit" a model and then predict; knn() forms predictions in a single command. This is because the training set itself serves as the "fitted model" when predicting for new data :
library(class)   # knn() lives in the class package.
knn.pred=knn(train.X, test.X, train.Direction, k=3)
table(knn.pred, Direction.2005)
#          Direction.2005
# knn.pred Down Up
#     Down   48 54
#     Up     63 87
mean(knn.pred==Direction.2005)
# [1] 0.536
When the number of features p is large, local approaches like KNN do poorly. This is because no interpolation happens as in other methods, and if the samples are not dense enough, you tend to pick "far away" neighbors and incorrectly "predict" the response class.
Naive Bayes is a useful classifier when the total number of features p is large. It is also known as Simple Bayes or Independence Bayes.
Naive Bayes is a "conditional probability" model.
Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
The classical Naive Bayes use case is spam identification, where word frequencies are the features (large p). An alternative, more advanced approach in this domain is the SVM.
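A minimal Naive Bayes sketch, assuming the e1071 package and the built-in iris data as a stand-in :
library(e1071)
nb.fit <- naiveBayes(Species ~ ., data = iris)
nb.pred <- predict(nb.fit, iris)
table(nb.pred, iris$Species)   # confusion matrix on the training data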
Resampling is used to incrementally fit and adjust the model by repeatedly refitting on subsets of the training data.
Resampling is computationally expensive, but not prohibitively so.
This technique is often very useful for tuning the model for best accuracy (as compared to training once on all of the data).
Two of the most commonly used resampling methods are cross-validation and the bootstrap.
Cross-validation can be used to measure a learning method's performance.
The bootstrap can be used as a measure of accuracy (e.g. of a parameter estimate).
The validation set approach is easy to understand and implement, but it has the following drawbacks:
- The sample may be skewed and may not represent data points of all classes.
- Not all of the training set is used to train the model, which is a problem for smaller training sets.
Cross-validation is an improved version of the validation set approach.
Leave-one-out cross-validation (LOOCV) trains the model excluding a single point at a time, in succession; i.e. CVn = Avg(MSE_i) is the average error over the models that each exclude one point. Though LOOCV looks expensive, for some models there is a shortcut which makes computing CVn really quick: for linear model fitting using least squares, the cost of computing the LOOCV error rate over all n points is the same as for a single fit!
K-fold cross-validation makes k partitions and uses each partition as the validation set in succession, with the remaining (k-1) partitions as the training set. For learning methods that are complex and costly to fit, the k-fold approach needs only k fits instead of n, so it is much faster than LOOCV. K-fold estimates are also smoother and useful for preventing overfitting.
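A minimal sketch of LOOCV and 10-fold CV error estimates via boot::cv.glm, using mtcars as a stand-in :
library(boot)
glm.fit <- glm(mpg ~ wt, data = mtcars)    # glm() with the default family behaves like lm()
cv.glm(mtcars, glm.fit)$delta[1]           # LOOCV estimate of test MSE
cv.glm(mtcars, glm.fit, K = 10)$delta[1]   # 10-fold CV estimate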
Dimension-Reduction
The least squares fitting procedure performs poorly when p > n. There is not even a unique least squares coefficient estimate, and the variance is infinite.
By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.
The least squares fitting model is not "intelligent enough" to recognize irrelevant features (and assign them zero coefficients).
Some strategies to deal with these least squares fitting problems are:
Understanding high-dimensional data: consider gene expression data for 100 individuals, where each expression profile contains half a million features, so p >> n. It is still quite possible to locate a subset of the p features that correlates with the response Y, so such cases are common in practice. It is important to include only the relevant features -- and to exclude coincidentally correlated (false) features.
Ridge Regression: the modified least squares objective includes an additional term, lambda * sum(beta^2), which encourages the coefficients towards zero. Lambda is a tuning parameter: the higher its value, the more aggressive the shrinkage. The additional term is called the "shrinkage penalty" (really a penalty that encourages shrinkage). This relates to the bias-variance tradeoff -- as lambda increases, we lose flexibility, accepting a slight increase in bias for a reduction in variance, often delivering better accuracy. Model fitting is computationally very efficient -- just fix lambda, then do a linear fit.
Lasso regression is similar to ridge regression but can set some coefficients to exactly zero instead of a "very small value". This leads to variable selection. The penalty term is lambda * sum(abs(beta)), i.e. it applies an l1 (absolute value) penalty instead of the l2 (squared) penalty.
Best subset selection using R :
library(leaps)
library(ISLR)   # Hitters data: 19 variables/features.
regfit.full=regsubsets(Salary~., data=Hitters)
summary(regfit.full)
# Displays a matrix showing which features are selected if we must keep only 1..8 of them,
# i.e. the best 1-, 2-, 3-, ... 8-feature subsets found by exhaustive search.
coef(regfit.full, 6)   # Coefficients of the best 6-feature model.
Using the glmnet package for Ridge or Lasso regression :
library(glmnet)
# glmnet() takes a numeric predictor matrix x and response vector y,
# e.g. x=model.matrix(Salary~., Hitters)[,-1]; y=Hitters$Salary (after removing NA rows).
grid=10^seq(10, -2, length=100)
ridge.mod=glmnet(x, y, alpha=0, lambda=grid)
# alpha = 0 means Ridge; alpha = 1 means Lasso.
# Lambda range: 10^10 to 10^-2, i.e. a wide range.
ridge.pred=predict(ridge.mod, s=4, newx=x[test,])   # s = 4 supplies another lambda value.
Using the pls package for PCR (Principal Components Regression) :
library(pls)
set.seed(2)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV")
summary(pcr.fit)
# Data:  X dimension: 263 19
#        Y dimension: 263 1
# Fit method: svdpc   # Fits the PCR model using singular value decomposition.
Using the pls package for PLS (Partial Least Squares) :
library(pls)
pls.fit=plsr(Salary~., data=Hitters, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
# Data:  X dimension: 131 19
#        Y dimension: 131 1
# Fit method: kernelpls
We examine very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.
Polynomial regression extends the linear model by adding extra predictors, for example X, X^2 and X^3.
Step functions cut the range of a variable into K distinct regions. This has the effect of fitting a piecewise constant function.
Regression splines: an extension of both polynomials and step functions, and more flexible than either. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.
Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.
Generalized additive models allow us to extend the methods above to deal with multiple predictors.
Degrees of freedom: Number of free parameters. e.g. All coefficients and intercept.
There is a shortcut efficient formula to compute LOOCV (Leave one out cross validation) error function for smoothing splines (similar to least squares linear regression).
Local regression is sometimes referred to as a memory-based procedure, because like nearest-neighbors, we need all the training data each time we wish to compute a prediction.
GAM : Generalized Additive Models: GAMs can be applied with both quantitative and qualitative responses.
A limitation of GAMs is that interactions between features are missed, e.g. X1*X2. For fully general models, we should look at random forests and boosting.
In order to fit a step function, we use the cut() function :
library(ISLR)   # The Wage data is from the ISLR package.
fit=lm(wage~cut(age,4), data=Wage)
coef(summary(fit))
#                        Estimate Std. Error t value Pr(>|t|)
# (Intercept)               94.16       1.48   63.79 0.00e+00
# cut(age, 4)(33.5,49]      24.05       1.83   13.15 1.98e-38
# cut(age, 4)(49,64.5]      23.66       2.07   11.44 1.04e-29
# cut(age, 4)(64.5,80.1]     7.64       4.99    1.53 1.26e-01
The function cut() returns an ordered categorical variable; the lm() function then creates a set of dummy variables for use in the regression.
Fitting wage to age using a regression spline is simple :
library(splines)
attach(Wage)                                   # make age and wage directly available
agelims=range(age)
age.grid=seq(from=agelims[1], to=agelims[2])   # grid of age values for prediction
fit=lm(wage~bs(age, knots=c(25,40,60)), data=Wage)   # bs() generates basis functions. Default is cubic.
pred=predict(fit, newdata=list(age=age.grid), se=T)
plot(age, wage, col="gray")
lines(age.grid, pred$fit, lwd=2)                      # Plot prediction.
lines(age.grid, pred$fit+2*pred$se, lty="dashed")     # Plot +/- 2 * stderr bands.
lines(age.grid, pred$fit-2*pred$se, lty="dashed")
Here we have prespecified knots at ages 25, 40, and 60. This produces a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.) We could also use the df option to produce a spline with knots at uniform quantiles of the data. :
> dim(bs(age, knots=c(25,40,60)))
[1] 3000 6
> dim(bs(age, df=6))
[1] 3000 6
> attr(bs(age, df=6), "knots")
  25%  50%  75%
 33.8 42.0 51.0
# R chooses knots at ages 33.8, 42.0, and 51.0, which correspond
# to the 25th, 50th, and 75th percentiles of age.
# The bs() function also has a degree argument, so we can fit splines of
# any degree, rather than the default degree of 3 (which yields a cubic spline).
# To fit a natural spline, we use the ns() function.
fit2=lm(wage~ns(age, df=4), data=Wage)
pred2=predict(fit2, newdata=list(age=age.grid), se=T)
lines(age.grid, pred2$fit, col="red", lwd=2)
# In order to fit a smoothing spline, we use the smooth.spline() function.
plot(age, wage, xlim=agelims, cex=.5, col="darkgrey")
title("Smoothing Spline")
fit=smooth.spline(age, wage, df=16)
fit2=smooth.spline(age, wage, cv=TRUE)
fit2$df
# [1] 6.8
lines(fit, col="red", lwd=2)
# In order to perform local regression, we use the loess() function.
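# A minimal loess() sketch continuing the same Wage example; span controls the neighborhood size.
fit=loess(wage~age, span=.5, data=Wage)
pred=predict(fit, data.frame(age=age.grid))
lines(age.grid, pred, col="blue", lwd=2)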
Segment the predictor space into simple regions. Single tree methods do not provide enough accuracy.
So we examine bagging, random forests, and boosting.
A classification tree is very similar to a regression tree, except that the tree is used to predict a qualitative response. The predicted response for an observation is the most commonly occurring class among the training observations in that region (as opposed to the mean of the values for a quantitative response). If f(x) is qualitative, take the most commonly occurring value as the result.
Bagging: Decision trees suffer from high variance. If we calculate f(x) over B different training sets and average the results, we will have low variance. We use bootstrap (sampling) method to generate B different training sets.
Random Forests: an improvement over bagging, obtained by decorrelating the individual trees. At each split, a random sample of m predictors is considered, typically m = sqrt(p); at each split not even a majority of the predictors is considered. This results in more diversity among the resulting trees, which in turn results in lower variance (though that is not an obvious fact).
The total number of trees (B) and the predictor subset size (m) are the main tuning factors for a Random Forest model. For example, with gene expressions, p = 4K, B = 400 and m = sqrt(p) works well.
As with bagging, random forests will not overfit if we increase B.
Boosting: it involves incrementally improving the tree ensemble based on previous results, successively reducing the residual errors. Both boosting and random forests use tree ensembles; random forests are more suitable for parallelization of the algorithm, whereas boosting is mostly sequential. However, there are opportunities to parallelize operations at the single-split level for boosting, which is done most efficiently by the xgboost library. The gbm library in R is another alternative.
Boosting has three tuning parameters:
- The number of trees B. Unlike bagging, boosting can overfit if B is large. We use cross-validation to select B.
- The shrinkage parameter λ to control learning rate. e.g. 0.01 or 0.001 Very small λ can require using large value of B.
- The number d of splits in each tree, which controls the complexity of the boosted ensemble. Often d = 1 works well, consisting of a single split. In this case, the boosted stump ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls interaction depth.
Using a single decision tree :
library(MASS)    # Boston data.
library(tree)    # tree() lives in the tree package.
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
tree.boston=tree(medv~., Boston, subset=train)
summary(tree.boston)
...
yhat = predict(tree.boston, newdata=Boston[-train,])
Using bagging and random forests (bagging is the special case of a random forest where m = p) :
library(randomForest)
set.seed(1)
bag.boston=randomForest(medv~., data=Boston, subset=train, mtry=13, importance=TRUE)   # Total predictors = 13; m = 13, so this is bagging.
boston.test=Boston[-train, "medv"]                          # Held-out responses.
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
plot(yhat.bag, boston.test)
# Random forest: m = p/3 by default for regression. We try m = 6.
rf.boston=randomForest(medv~., data=Boston, subset=train, mtry=6, importance=TRUE)
yhat.rf = predict(rf.boston, newdata=Boston[-train,])
Using boosting in R :
library(gbm)
boost.boston=gbm(medv~., data=Boston[train,], distribution="gaussian",
                 n.trees=5000, interaction.depth=4)
# Use distribution="bernoulli" for a classification problem.
summary(boost.boston)
#      var rel.inf   # Relative importance in % for all vars.
# 1  lstat   45.96
# 2     rm   31.22
# 3    dis    6.81
# ...
# We see that lstat and rm are by far the most important variables. We can
# also produce partial dependence plots for these two variables.
yhat.boost=predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)^2)
# [1] 11.8
# The test MSE obtained is 11.8; similar to random forests and superior to bagging.
# If we want, we can perform boosting with a different lambda (shrinkage) parameter.
boost.boston=gbm(medv~., data=Boston[train,], distribution="gaussian", n.trees=5000,
                 interaction.depth=4, shrinkage=0.2, verbose=F)
yhat.boost=predict(boost.boston, newdata=Boston[-train,], n.trees=5000)
mean((yhat.boost-boston.test)^2)
# [1] 11.5
# In this case, using lambda = 0.2 leads to a slightly lower test MSE than the default lambda = 0.001.
SVM classifier is a generalization of maximal margin classifier.
The Support Vector Machine is an extension of the support vector classifier which handles non-linear boundaries.
The maximal margin classifier, though elegant and simple, cannot be applied to most datasets, since it requires the class boundary to be linear and the classes to be separable.
When points are non-separable, we generalize the maximal margin classifier to find the hyperplane which "almost" separates the classes using a "soft margin". Such a classifier is called a support vector classifier.
SVM is intended for binary classification but can be extended to more classes.
A Support Vector Machine is a support vector classifier that uses an enlarged feature space, e.g. predictors X, X*X, etc., so that the decision boundary can be non-linear. Even interaction terms, e.g. X1*X2, can be included.
The enlarging of feature space is done using "kernels".
The example for "kernel" is a "polynomial kernel" of degree d :
K(o1, o2) = (1 + sum(o1.xi*o2.xi) where i = 1 to p features )^d
# Another form is a "Radial Kernel"
SVM for multi-class classification has a few possible approaches:
- One-versus-one classification: build all C(K,2) pairwise classifiers and apply them; choose the class picked most frequently.
- One-versus-all: create K SVM models, each representing the i-th class vs the rest; choose the one with the highest level of confidence (the largest distance on the correct side of its boundary).
SVM is not unique in using kernels to accommodate non-linear boundaries. We can very well use this technique with logistic regression as well; for historical reasons, the use of kernels is most common with SVMs.
Extension of SVM for regression is called: Support Vector Regression
Using svm() in R :
library(e1071)
# The response variable must be a factor to force classification instead of regression.
dat=data.frame(x=x, y=as.factor(y))
svmfit=svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)
# Note: kernel could be "polynomial" or "radial" as well.
summary(svmfit)
ypred=predict(svmfit, testdat)
In this section, we describe bottom-up or agglomerative clustering.
A dendrogram is built starting from the leaves and combining clusters up to the trunk.
Also, sometimes the clustering intention is ambiguous, e.g. split by gender vs nationality.
The clustering algorithm involves measuring the dissimilarity between each pair. It starts with every observation in its own cluster at the leaves and keeps fusing the most similar clusters in succession.
Dissimilarity between groups is computed using the concept of "linkage":
- Complete: compute all pairwise dissimilarities between clusters A and B; the maximal dissimilarity is the "complete" linkage.
- Centroid: dissimilarity between the centroid of cluster A (a mean vector of length p) and that of B. (Undesirable inversions may occur with centroid linkage, so it is often avoided.)
Clustering depends on type of linkage used and the choice of dissimilarity measure.
Clustering methods are not stable if some observations were removed.
Principal Components Analysis in R :
pr.out=prcomp(USArrests, scale=TRUE)
# By default, prcomp() centers the variables to have mean zero.
# scale=TRUE scales the variables to have standard deviation one.
names(pr.out)
# [1] "sdev"  "rotation"  "center"  "scale"  "x"
K-means clustering in R :
km.out=kmeans(x, 2, nstart=20)
km.out$cluster
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
# [30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# nstart=20 reruns the algorithm 20 times with different initial random
# assignments and keeps the grouping with the least within-cluster variance.
Hierarchical clustering in R :
# The hclust() function implements hierarchical clustering in R.
hc.complete=hclust(dist(x), method="complete")
hc.average=hclust(dist(x), method="average")
hc.single=hclust(dist(x), method="single")
par(mfrow=c(1,3))
plot(hc.complete, main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc.average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc.single, main="Single Linkage", xlab="", sub="", cex=.9)
hc.out = cutree(hc.complete, 2)   # Display the grouping if we cut the tree into 2 clusters.
hc.out
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
# [30] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# Draw a line at the height (139 here) that produces the desired number of clusters (found via cutree).
abline(h=139, col="red")
Key concepts of ANOVA are as below.
Hypothesis Testing:
- Null Hypothesis H0: all group means are equal.
- Alternative Hypothesis Ha: at least one group mean is different.
ANOVA calculates the F-statistic :
F = Variance Between Groups / Variance Within Groups
# The higher the F, the more the group means differ.
Types of ANOVA:
Post-hoc Tests:
Example:
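A minimal one-way ANOVA sketch in R, using made-up breed/weight data (as in the ANOVA-vs-LDA comparison above) :
set.seed(1)
weights <- data.frame(
  breed  = factor(rep(c("A", "B", "C"), each = 20)),
  weight = c(rnorm(20, 30), rnorm(20, 32), rnorm(20, 31))
)
fit <- aov(weight ~ breed, data = weights)
summary(fit)    # F value and Pr(>F): reject H0 if the p-value is small.
TukeyHSD(fit)   # Post-hoc test: which pairs of groups differ.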
In classification there are 2 main concepts -- Activation function and decision boundary. Another important concept is feature space.
If all of these are non-linear, the classifier is obviously non-linear. If only some of them are non-linear, it is not obvious whether the classifier is linear or non-linear.
For example, a linear classifier with transformed non-linear features creates a non-linear decision boundary.
A logistic regression classifier has a non-linear activation function with a linear decision boundary, i.e. the predictor function z is linear. It can be claimed to be a "generalized linear model".
A truly non-linear classifier example is the K-nearest-neighbor algorithm; its decision boundary is non-linear.
We can say that a linear decision boundary implies a linear model.
However, a non-linear decision boundary does not necessarily imply a non-linear model, since the model could have a linear decision boundary in a transformed feature space.
If there is no linear decision boundary at all (even with a transformed feature space), then it is a non-linear classifier.
By definition, a model is linear when the output variable can be expressed as a linear expression of the input predictors. The predictors themselves can be non-linear, but the parameters must enter linearly.
We can use log(y) instead of y as the output variable to make the expression linear.
Such transformations fall under "generalized linear models".
Note that you can use feature transformations like x1*x2 and x1*x1 and apply a linear classification algorithm. The decision boundary is linear in the transformed feature space, though non-linear in the original data space.
The sigmoid activation function pi(x1, x2, ...) :
pi = 1 / (1 + e^-z)      # when z = 0, the result is 0.5
z = w1*x1 + w2*x2 + ...
# There is no interaction between features,
# hence the decision boundary is linear.
# You can make the decision boundary non-linear by introducing
# transformed features like x1*x2.
Note: Similarly, you can have linear and non-linear SVM classifiers.
Model should satisfy these components :
Consider two big groups of balls with different predictors, one predictor being weight. Each predictor has a variance and standard deviation based on the observed values.
If the real group is big, we consider samplings of this group and compute the variance/deviation; in that case we call these the sample variance of each predictor. If the total group size is N and the sample size is n, we have C(N, n) possible samples of size n. For a given sample size n, the sample mean of predictor p1 will follow a certain distribution, e.g. normal, uniform, etc. By observing this sampling distribution of the mean/median, we can obtain a lot of information. A common goal is to find out whether the sample mean is an unbiased estimator of the real population mean/median. Higher variance may reflect the inherent spread of the predictor, may indicate that the sample size is too small, etc.
For linear regression with predictor X1, we compute the coefficient B1 so that y = B0 + B1 * X1. The computed B1 varies depending on the subsample. This variance of B1 indicates the significance of predictor X1 for the linear model: the less the variance, the more significant the predictor. That is how the T-statistic is computed, as B1/StdErr(B1).
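A minimal sketch of the sampling variability of B1 via the bootstrap, assuming the boot package and mtcars as a stand-in :
library(boot)
boot.fn <- function(data, index) coef(lm(mpg ~ wt, data = data, subset = index))[2]   # slope B1
boot(mtcars, boot.fn, R = 1000)   # the std. error of the bootstrap replicates approximates StdErr(B1)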