ML Summary

Overview

This document contains summary notes about various core concepts.

  • Core concepts employed by data scientists:

    1. Supervised Learning
    2. Unsupervised Learning
    3. Semi-Supervised Learning
    4. Reinforcement Learning
    5. Statistical Analysis and Methods
    6. Basic Machine Learning Approaches
    7. Natural Language Processing (NLP)
    8. Deep Learning
    9. Visualization and Communication
    10. Business Intelligence Integration
  • Various ML Libraries

Core ML Concepts

Supervised Learning

Definition: A machine learning paradigm where the model is trained on labeled data (input-output pairs).

Applications:

  • Classification: Predicting categorical outcomes (e.g., customer churn prediction, fraud detection).
  • Regression: Predicting continuous outcomes (e.g., sales forecasting, pricing models).

Key Algorithms:

  • Linear regression, logistic regression
  • Decision trees, support vector machines (SVM)
  • Neural networks (used in both supervised and unsupervised learning); see the sketch below
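
A minimal sketch of this labeled-data workflow, assuming scikit-learn (covered later in these notes) and a synthetic dataset:

```python
# Supervised learning sketch: logistic regression on toy labeled data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)                     # learn from input-output pairs
print("test accuracy:", model.score(X_test, y_test))
```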

Unsupervised Learning

Definition: A paradigm where the model identifies patterns in unlabeled data.

Applications:

  • Clustering: Grouping similar data points (e.g., customer segmentation, market research).
  • Dimensionality Reduction: Simplifying data while retaining essential features (e.g., Principal Component Analysis, t-SNE).

Key Algorithms:

  • K-means clustering
  • Hierarchical clustering
  • DBSCAN
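
A minimal sketch of clustering unlabeled points with K-means, assuming scikit-learn and synthetic blob data; the choice of k=3 is arbitrary:

```python
# Unsupervised learning sketch: K-means on unlabeled 2D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data: three blobs around different centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # discovered group centers
print(kmeans.labels_[:10])       # cluster assignment per point
```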

Semi-Supervised Learning

Definition: Combines a small amount of labeled data with a large amount of unlabeled data.

Applications: Useful when labeling data is expensive or time-consuming (e.g., text classification in NLP).
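
A hedged sketch of this setup using scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1; the split of 50 labeled and 250 unlabeled samples is arbitrary:

```python
# Semi-supervised learning sketch: self-training with a small labeled subset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1          # pretend only the first 50 labels are known

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)      # learns from 50 labeled + 250 unlabeled samples
print("accuracy on true labels:", model.score(X, y))
```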

Reinforcement Learning

Definition: A learning paradigm where an agent learns by interacting with an environment to maximize rewards.

Applications: Optimizing supply chains, dynamic pricing, and recommendation systems.
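
As a toy illustration of the agent/environment/reward loop (plain NumPy, not a production RL library), an epsilon-greedy bandit; the three arm payouts are made-up numbers:

```python
# Reinforcement learning toy: epsilon-greedy multi-armed bandit.
# The "environment" is three slot machines with hidden mean rewards;
# the agent learns which arm pays best by trial and error.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden reward structure
estimates = np.zeros(3)                   # agent's value estimates
counts = np.zeros(3)
epsilon = 0.1

for _ in range(2000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    arm = rng.integers(3) if rng.random() < epsilon else int(np.argmax(estimates))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print("learned estimates:", np.round(estimates, 2))  # should approach true_means
```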

Statistical Analysis and Methods

  • Exploratory Data Analysis (EDA):
    • Using statistical techniques to understand the structure of data (e.g., summary statistics, correlation analysis).
  • Hypothesis Testing: Validating assumptions or relationships in data using tests like t-tests, chi-square tests, and ANOVA.
  • Time Series Analysis: Analyzing sequential data points for trends, seasonality, and forecasting (e.g., ARIMA models).
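
A small sketch of hypothesis testing with SciPy's two-sample t-test on synthetic groups; the means and sample sizes are illustrative only:

```python
# Hypothesis testing sketch: two-sample t-test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=105, scale=10, size=50)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests different means
```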

Basic Machine Learning Approaches

  • Ensemble Methods: Combining multiple models to improve performance:
    • Random Forests,
    • Gradient Boosting Machines like XGBoost or LightGBM.
  • Feature Engineering: Enhancing model input through domain knowledge (e.g., creating ratios, aggregating data).
  • Model Evaluation: Validating model performance using metrics like accuracy, precision, recall, F1 score, or ROC-AUC.
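
A brief sketch combining an ensemble model with standard evaluation metrics, assuming scikit-learn and synthetic data:

```python
# Ensemble + evaluation sketch: random forest scored with precision/recall/F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # metrics per class
```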

Natural Language Processing (NLP)

Applications:

  • Text analytics for sentiment analysis.
  • Topic modeling: identifying underlying themes in a large corpus of text (an unsupervised task); useful for indexing, semantic analysis, and visualization.
  • Information extraction from unstructured data.

Deep Learning

Applications: Leveraging neural networks for tasks like:

  • Image recognition, speech analysis, and complex pattern identification.

Visualization and Communication

  • For visual insight, tools such as Tableau, Power BI, and Python libraries (e.g., Matplotlib, Seaborn) are used.
  • Storytelling with data is crucial for translating analytical findings into actionable business strategies.

Business Intelligence Integration

Combining predictive analytics, dashboards, and automated reporting systems to facilitate data-driven decision-making.

ML Libraries/Tools

The ML libraries below are listed in roughly increasing order of complexity, from least to most complex.

Foundational Libraries

NumPy and Pandas are foundational libraries that provide basic utilities for numerical operations and data manipulation.

NumPy

  • Scope: Fundamental library for numerical computations.
  • Key Goals: Efficient handling of large arrays and matrices.
  • Key Features: Mathematical functions, linear algebra, random number generation.
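
A few representative NumPy operations (array math, linear algebra, random numbers):

```python
# NumPy basics: arrays, reductions, matrix products, random numbers.
import numpy as np

a = np.arange(6).reshape(2, 3)                   # 2x3 array of 0..5
print(a.mean(axis=0))                            # column means
print(a @ a.T)                                   # matrix product (2x2)
print(np.random.default_rng(0).normal(size=3))   # random draws
```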

Pandas

  • Scope: Data manipulation and analysis.
  • Key Goals: Handling tabular data with DataFrames, cleaning and preprocessing data.
  • Key Features: Data filtering, grouping, merging, reshaping.
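
A short Pandas sketch of filtering, grouping, and aggregation on a small in-memory DataFrame; the column names are made up for illustration:

```python
# Pandas basics: build a DataFrame, filter rows, group and aggregate.
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 80, 120, 90],
})
print(df[df["sales"] > 90])                    # filtering rows
print(df.groupby("region")["sales"].mean())    # grouping and aggregation
```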

Visualization Libraries

Matplotlib

  • Scope: Low-level static visualizations.
  • Key Goals: Create custom plots, charts, and visualizations.
  • Key Features: Plotting primitives, customization of visual elements.
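
A minimal Matplotlib example producing a labeled line chart:

```python
# Matplotlib sketch: a labeled sine curve saved to disk.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("sine.png")   # or plt.show() in an interactive session
```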

Seaborn

  • Scope: Statistical data visualization built on Matplotlib.
  • Key Goals: Simplify the creation of attractive and informative graphics.
  • Key Features: Heatmaps, pair plots, violin plots, categorical plots.
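
A short Seaborn sketch using one of its bundled example datasets (fetched over the network on first use):

```python
# Seaborn sketch: a violin plot of a categorical variable vs. a numeric one.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                     # built-in example data
sns.violinplot(data=tips, x="day", y="total_bill")
plt.savefig("tips_violin.png")
```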

Plotly

  • Scope: Interactive visualizations for the web.
  • Key Goals: Interactive dashboards and 3D plots.
  • Key Features: Zooming, hover labels, interactive sliders.
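
A minimal Plotly Express sketch using the bundled iris sample dataset; the resulting HTML file supports zooming and hover labels in a browser:

```python
# Plotly sketch: an interactive scatter plot written to an HTML file.
import plotly.express as px

df = px.data.iris()                                    # bundled sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.write_html("iris_scatter.html")                    # open in a browser
```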

Statistical Analysis Libraries

Statistical Analysis libraries focus on exploratory data analysis and statistical hypothesis testing.

SciPy

  • Scope: Advanced statistical computations and optimization.
  • Key Goals: Hypothesis testing, curve fitting, integration, and signal processing.
  • Key Features: Statistical distributions, t-tests, ANOVA, optimization solvers.
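
A small SciPy sketch fitting an exponential decay with curve_fit; the true parameters (2.0, 1.3) are chosen arbitrarily to generate the synthetic data:

```python
# SciPy sketch: curve fitting with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    return a * np.exp(-k * t)

t = np.linspace(0, 5, 50)
y = decay(t, 2.0, 1.3) + np.random.default_rng(0).normal(scale=0.05, size=t.size)

params, _ = curve_fit(decay, t, y)
print("fitted a, k:", params)   # should be close to (2.0, 1.3)
```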

Statsmodels

  • Scope: Statistical modeling and testing.
  • Key Goals: Perform linear and generalized linear regression, time-series analysis.
  • Key Features: Regression models, ARIMA, hypothesis testing.
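
A minimal statsmodels sketch: ordinary least squares with an explicit intercept on synthetic data:

```python
# Statsmodels sketch: OLS regression with a constant term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)          # adds the intercept column
result = sm.OLS(y, X).fit()
print(result.params)            # approximately [2.0, 3.0]
print(result.summary())         # full regression report
```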

Machine Learning Libraries

Core libraries for classical and advanced machine learning algorithms.

Scikit-learn

  • Scope: Comprehensive toolkit for classical machine learning.
  • Key Goals: Train and evaluate machine learning models.
  • Key Algorithms:
    • Classification: SVMs, Random Forests, kNN, Logistic Regression.
    • Regression: Linear Regression, Ridge, Lasso.
    • Clustering: K-Means, DBSCAN, Hierarchical Clustering.
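
A typical scikit-learn workflow sketch: preprocessing and a model combined in a Pipeline, evaluated with cross-validation on the built-in iris dataset:

```python
# Scikit-learn sketch: scaling + SVM in a Pipeline with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(pipe, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```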

XGBoost, LightGBM, CatBoost

  • Scope: Gradient boosting frameworks for structured data.
  • Key Goals: High-performance ensemble models for tabular data.

Key Features:

  • XGBoost: Handles sparsity, regularization support.
  • LightGBM: Faster training, optimized for large datasets.
  • CatBoost: Native handling of categorical variables.
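
A hedged sketch using XGBoost's scikit-learn-compatible estimator; LightGBM (LGBMClassifier) and CatBoost (CatBoostClassifier) expose a very similar fit/predict interface:

```python
# Gradient boosting sketch: XGBoost classifier on synthetic tabular data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```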

Deep Learning Libraries

Specialized frameworks for designing and training neural networks.

TensorFlow

  • Scope: Full-stack deep learning framework.
  • Key Goals: Develop scalable and production-ready models.
  • Key Algorithms: Neural networks (CNNs, RNNs, Transformers), reinforcement learning.
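
A minimal Keras sketch in TensorFlow: a small fully connected binary classifier trained on synthetic data:

```python
# TensorFlow/Keras sketch: define, compile, train, and evaluate a tiny model.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")        # toy binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))          # [loss, accuracy]
```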

PyTorch

  • Scope: Flexible framework for research-oriented deep learning.
  • Key Goals: Enable dynamic graph computation and experimentation.
  • Key Algorithms: Similar to TensorFlow, with more flexibility in prototyping.
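
A minimal PyTorch sketch showing the dynamic-graph style: a plain Python training loop over a small fully connected network:

```python
# PyTorch sketch: manual training loop for a tiny binary classifier.
import torch
from torch import nn

X = torch.randn(500, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)     # toy binary target

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):                        # dynamic graph: plain Python loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print("final loss:", loss.item())
```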

NLP Libraries

Libraries dedicated to text processing and linguistic analysis.

NLTK (Natural Language Toolkit)

  • Scope: Foundational NLP tasks.
  • Key Goals: Text tokenization, stemming, and parsing.
  • Key Features: POS tagging, N-gram models.
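
A small NLTK sketch of tokenization and part-of-speech tagging; note that the exact resource names passed to nltk.download can vary slightly across NLTK versions:

```python
# NLTK sketch: tokenize a sentence and tag parts of speech.
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer data
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger data

tokens = nltk.word_tokenize("Data scientists build predictive models.")
print(nltk.pos_tag(tokens))   # e.g. [('Data', 'NNP'), ('scientists', 'NNS'), ...]
```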

Gensim

  • Scope: Topic modeling and vector representation.
  • Key Goals: Perform LDA, Word2Vec, and document similarity analysis.
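
A hedged Gensim sketch training a tiny Word2Vec model on a made-up toy corpus (gensim 4.x API, where the dimensionality argument is vector_size):

```python
# Gensim sketch: Word2Vec embeddings from a toy corpus of tokenized sentences.
from gensim.models import Word2Vec

sentences = [
    ["data", "science", "uses", "statistics"],
    ["machine", "learning", "uses", "data"],
    ["statistics", "and", "machine", "learning"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("data", topn=3))   # nearest words in embedding space
```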

spaCy

  • Scope: Industrial-strength NLP pipelines.
  • Key Goals: Fast and accurate linguistic analysis.
  • Key Features: Named Entity Recognition (NER), Dependency Parsing.
  • Complexity: Moderate to High.
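
A spaCy sketch of named entity recognition, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

```python
# spaCy sketch: run the NLP pipeline and print named entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a London startup for $1 billion in 2023.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, London GPE, $1 billion MONEY
```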

Hugging Face Transformers

  • Scope: State-of-the-art transformer-based NLP.
  • Key Goals: Enable tasks like text classification, summarization, and translation.
  • Key Algorithms: BERT, GPT, T5, etc.
  • Complexity: High.
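
A minimal Hugging Face sketch using the high-level pipeline API, which downloads a default pretrained model on first use:

```python
# Transformers sketch: sentiment analysis with the pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The quarterly results exceeded expectations."))
# e.g. [{'label': 'POSITIVE', 'score': ...}]
```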

Specialized Tools

Libraries for domain-specific tasks or scaling.

OpenCV

  • Scope: Computer vision tasks.
  • Key Goals: Image processing, object detection.
  • Complexity: High.
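
A small OpenCV sketch: grayscale conversion and edge detection; "input.jpg" is a placeholder path, not a file referenced by these notes:

```python
# OpenCV sketch: load an image, convert to grayscale, detect edges.
import cv2

img = cv2.imread("input.jpg")                    # BGR image as a NumPy array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)
cv2.imwrite("edges.jpg", edges)
```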

Dask

  • Scope: Parallel computing for large datasets.
  • Key Goals: Scale Pandas and NumPy workflows.
  • Complexity: High.
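
A hedged Dask sketch of a Pandas-like groupby over many CSV files; the "data-*.csv" glob and column names are placeholders, and nothing runs until .compute() is called:

```python
# Dask sketch: lazy, out-of-core groupby with a Pandas-like API.
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")                       # many CSVs as one frame
result = df.groupby("region")["sales"].mean().compute()
print(result)
```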

SHAP / LIME

  • Scope: Model interpretability tools.
  • Key Goals: Explain predictions and understand feature importance.
  • Complexity: High.
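
A hedged SHAP sketch explaining a tree model with TreeExplainer; the shape of the returned SHAP values differs between SHAP versions, so the print handles both cases:

```python
# SHAP sketch: per-feature contributions for a random forest's predictions.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # contribution of each feature
print(shap_values[0].shape if isinstance(shap_values, list) else shap_values.shape)
```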

Libraries Summary Table

| Library/Tool        | Scope              | Key Goals                      |
|---------------------|--------------------|--------------------------------|
| NumPy, Pandas       | Data manipulation  | Data cleaning, numerical ops   |
| Matplotlib, Seaborn | Data visualization | Static and statistical plots   |
| Scikit-learn        | Classical ML       | Train ML models (SVM, RF)      |
| TensorFlow, PyTorch | Deep learning      | Neural networks, DL models     |
| Gensim, spaCy       | NLP                | Topic modeling, linguistic ops |
| XGBoost, LightGBM   | Boosting models    | High-performance ML            |
| OpenCV, SHAP        | Specialized tools  | Vision tasks, model explainers |