ML Summary

Overview

This document contains summary notes about various core concepts.

  • Core concepts employed by data scientists:

    1. Supervised Learning
    2. Unsupervised Learning
    3. Semi-Supervised Learning
    4. Reinforcement Learning
    5. Statistical Analysis and Methods
    6. Basic Machine Learning Approaches
    7. Natural Language Processing (NLP)
    8. Deep Learning
    9. Visualization and Communication
    10. Business Intelligence Integration
  • Various ML Libraries

Core ML Concepts

Supervised Learning

Definition: A machine learning paradigm where the model is trained on labeled data (input-output pairs).

Applications:

  • Classification: Predicting categorical outcomes (e.g., customer churn prediction, fraud detection).
  • Regression: Predicting continuous outcomes (e.g., sales forecasting, pricing models).

Key Algorithms:

  • Linear regression, logistic regression
  • Decision trees, support vector machines (SVM)
  • Neural networks (used in both supervised and unsupervised learning); see the sketch below
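
A minimal sketch of this labeled-data workflow, assuming scikit-learn (covered later in these notes) and a synthetic dataset:

```python
# Supervised learning sketch: logistic regression on toy labeled data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)                     # learn from input-output pairs
print("test accuracy:", model.score(X_test, y_test))
```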

Unsupervised Learning

Definition: A paradigm where the model identifies patterns in unlabeled data.

Applications:

  • Clustering: Grouping similar data points (e.g., customer segmentation, market research).
  • Dimensionality Reduction: Simplifying data while retaining essential features (e.g., Principal Component Analysis, t-SNE).

Key Algorithms:

  • K-means clustering
  • Hierarchical clustering
  • DBSCAN
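
A minimal sketch of clustering unlabeled points with K-means, assuming scikit-learn and synthetic blob data; the choice of k=3 is arbitrary:

```python
# Unsupervised learning sketch: K-means on unlabeled 2D points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled data: three blobs around different centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # discovered group centers
print(kmeans.labels_[:10])       # cluster assignment per point
```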

Semi-Supervised Learning

Definition: Combines a small amount of labeled data with a large amount of unlabeled data.

Applications: Useful when labeling data is expensive or time-consuming (e.g., text classification in NLP).
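
A hedged sketch of this setup using scikit-learn's SelfTrainingClassifier, where unlabeled samples are marked with -1; the split of 50 labeled and 250 unlabeled samples is arbitrary:

```python
# Semi-supervised learning sketch: self-training with a small labeled subset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1          # pretend only the first 50 labels are known

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)      # learns from 50 labeled + 250 unlabeled samples
print("accuracy on true labels:", model.score(X, y))
```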

Reinforcement Learning

Definition: A learning paradigm where an agent learns by interacting with an environment to maximize rewards.

Applications: Optimizing supply chains, dynamic pricing, and recommendation systems.
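
As a toy illustration of the agent/environment/reward loop (plain NumPy, not a production RL library), an epsilon-greedy bandit; the three arm payouts are made-up numbers:

```python
# Reinforcement learning toy: epsilon-greedy multi-armed bandit.
# The "environment" is three slot machines with hidden mean rewards;
# the agent learns which arm pays best by trial and error.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden reward structure
estimates = np.zeros(3)                   # agent's value estimates
counts = np.zeros(3)
epsilon = 0.1

for _ in range(2000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    arm = rng.integers(3) if rng.random() < epsilon else int(np.argmax(estimates))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print("learned estimates:", np.round(estimates, 2))  # should approach true_means
```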

Statistical Analysis and Methods

  • Exploratory Data Analysis (EDA):
    • Using statistical techniques to understand the structure of data (e.g., summary statistics, correlation analysis).
  • Hypothesis Testing: Validating assumptions or relationships in data using tests like t-tests, chi-square tests, and ANOVA.
  • Time Series Analysis: Analyzing sequential data points for trends, seasonality, and forecasting (e.g., ARIMA models).
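
A small sketch of hypothesis testing with SciPy's two-sample t-test on synthetic groups; the means and sample sizes are illustrative only:

```python
# Hypothesis testing sketch: two-sample t-test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=10, size=50)
group_b = rng.normal(loc=105, scale=10, size=50)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests different means
```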

Basic Machine Learning Approaches

  • Ensemble Methods: Combining multiple models to improve performance:
    • Random Forests,
    • Gradient Boosting Machines like XGBoost or LightGBM.
  • Feature Engineering: Enhancing model input through domain knowledge (e.g., creating ratios, aggregating data).
  • Model Evaluation: Validating model performance using metrics like accuracy, precision, recall, F1 score, or ROC-AUC.
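
A brief sketch combining an ensemble model with standard evaluation metrics, assuming scikit-learn and synthetic data:

```python
# Ensemble + evaluation sketch: random forest scored with precision/recall/F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # metrics per class
```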

Natural Language Processing (NLP)

Applications:

  • Text analytics for sentiment analysis.
  • Topic modeling: identifying underlying themes in a large corpus of text (an unsupervised task); useful for indexing, semantic analysis, and visualization.
  • Information extraction from unstructured data.

Deep Learning

Applications: Leveraging neural networks for tasks like:

  • Image recognition, speech analysis, and complex pattern identification.

Visualization and Communication

  • For visual insight, tools such as Tableau, Power BI, and Python libraries (e.g., Matplotlib, Seaborn) are used.
  • Storytelling with data is crucial for translating analytical findings into actionable business strategies.

Business Intelligence Integration

Combining predictive analytics, dashboards, and automated reporting systems to facilitate data-driven decision-making.

ML Libraries/Tools

The ML libraries below are listed in roughly increasing order of complexity, from least to most complex.

Foundational Libraries

NumPy and Pandas are foundational libraries that provide basic utilities for numerical operations and data manipulation.

NumPy

  • Scope: Fundamental library for numerical computations.
  • Key Goals: Efficient handling of large arrays and matrices.
  • Key Features: Mathematical functions, linear algebra, random number generation.
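
A few representative NumPy operations (array math, linear algebra, random numbers):

```python
# NumPy basics: arrays, reductions, matrix products, random numbers.
import numpy as np

a = np.arange(6).reshape(2, 3)                   # 2x3 array of 0..5
print(a.mean(axis=0))                            # column means
print(a @ a.T)                                   # matrix product (2x2)
print(np.random.default_rng(0).normal(size=3))   # random draws
```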

Pandas

  • Scope: Data manipulation and analysis.
  • Key Goals: Handling tabular data with DataFrames, cleaning and preprocessing data.
  • Key Features: Data filtering, grouping, merging, reshaping.
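
A short Pandas sketch of filtering, grouping, and aggregation on a small in-memory DataFrame; the column names are made up for illustration:

```python
# Pandas basics: build a DataFrame, filter rows, group and aggregate.
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 80, 120, 90],
})
print(df[df["sales"] > 90])                    # filtering rows
print(df.groupby("region")["sales"].mean())    # grouping and aggregation
```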

Visualization Libraries

Matplotlib

  • Scope: Low-level static visualizations.
  • Key Goals: Create custom plots, charts, and visualizations.
  • Key Features: Plotting primitives, customization of visual elements.
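
A minimal Matplotlib example producing a labeled line chart:

```python
# Matplotlib sketch: a labeled sine curve saved to disk.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("sine.png")   # or plt.show() in an interactive session
```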

Seaborn

  • Scope: Statistical data visualization built on Matplotlib.
  • Key Goals: Simplify the creation of attractive and informative graphics.
  • Key Features: Heatmaps, pair plots, violin plots, categorical plots.
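
A short Seaborn sketch using one of its bundled example datasets (fetched over the network on first use):

```python
# Seaborn sketch: a violin plot of a categorical variable vs. a numeric one.
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")                     # built-in example data
sns.violinplot(data=tips, x="day", y="total_bill")
plt.savefig("tips_violin.png")
```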

Plotly

  • Scope: Interactive visualizations for the web.
  • Key Goals: Interactive dashboards and 3D plots.
  • Key Features: Zooming, hover labels, interactive sliders.
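
A minimal Plotly Express sketch using the bundled iris sample dataset; the resulting HTML file supports zooming and hover labels in a browser:

```python
# Plotly sketch: an interactive scatter plot written to an HTML file.
import plotly.express as px

df = px.data.iris()                                    # bundled sample dataset
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.write_html("iris_scatter.html")                    # open in a browser
```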

Statistical Analysis Libraries

Statistical Analysis libraries focus on exploratory data analysis and statistical hypothesis testing.

SciPy

  • Scope: Advanced statistical computations and optimization.
  • Key Goals: Hypothesis testing, curve fitting, integration, and signal processing.
  • Key Features: Statistical distributions, t-tests, ANOVA, optimization solvers.
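
A small SciPy sketch fitting an exponential decay with curve_fit; the true parameters (2.0, 1.3) are chosen arbitrarily to generate the synthetic data:

```python
# SciPy sketch: curve fitting with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    return a * np.exp(-k * t)

t = np.linspace(0, 5, 50)
y = decay(t, 2.0, 1.3) + np.random.default_rng(0).normal(scale=0.05, size=t.size)

params, _ = curve_fit(decay, t, y)
print("fitted a, k:", params)   # should be close to (2.0, 1.3)
```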

Statsmodels

  • Scope: Statistical modeling and testing.
  • Key Goals: Perform linear and generalized linear regression, time-series analysis.
  • Key Features: Regression models, ARIMA, hypothesis testing.
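
A minimal statsmodels sketch: ordinary least squares with an explicit intercept on synthetic data:

```python
# Statsmodels sketch: OLS regression with a constant term.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)          # adds the intercept column
result = sm.OLS(y, X).fit()
print(result.params)            # approximately [2.0, 3.0]
print(result.summary())         # full regression report
```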

Machine Learning Libraries

Core libraries for classical and advanced machine learning algorithms.

Scikit-learn

  • Scope: Comprehensive toolkit for classical machine learning.
  • Key Goals: Train and evaluate machine learning models.
  • Key Algorithms:
    • Classification: SVMs, Random Forests, kNN, Logistic Regression.
    • Regression: Linear Regression, Ridge, Lasso.
    • Clustering: K-Means, DBSCAN, Hierarchical Clustering.
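
A typical scikit-learn workflow sketch: preprocessing and a model combined in a Pipeline, evaluated with cross-validation on the built-in iris dataset:

```python
# Scikit-learn sketch: scaling + SVM in a Pipeline with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(pipe, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```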

XGBoost, LightGBM, CatBoost

  • Scope: Gradient boosting frameworks for structured data.
  • Key Goals: High-performance ensemble models for tabular data.

Key Features:

  • XGBoost: Handles sparsity, regularization support.
  • LightGBM: Faster training, optimized for large datasets.
  • CatBoost: Native handling of categorical variables.
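
A hedged sketch using XGBoost's scikit-learn-compatible estimator; LightGBM (LGBMClassifier) and CatBoost (CatBoostClassifier) expose a very similar fit/predict interface:

```python
# Gradient boosting sketch: XGBoost classifier on synthetic tabular data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```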

Deep Learning Libraries

Specialized frameworks for designing and training neural networks.

TensorFlow

  • Scope: Full-stack deep learning framework.
  • Key Goals: Develop scalable and production-ready models.
  • Key Algorithms: Neural networks (CNNs, RNNs, Transformers), reinforcement learning.
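
A minimal Keras sketch in TensorFlow: a small fully connected binary classifier trained on synthetic data:

```python
# TensorFlow/Keras sketch: define, compile, train, and evaluate a tiny model.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")        # toy binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))          # [loss, accuracy]
```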

PyTorch

  • Scope: Flexible framework for research-oriented deep learning.
  • Key Goals: Enable dynamic graph computation and experimentation.
  • Key Algorithms: Similar to TensorFlow, with more flexibility in prototyping.
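
A minimal PyTorch sketch showing the dynamic-graph style: a plain Python training loop over a small fully connected network:

```python
# PyTorch sketch: manual training loop for a tiny binary classifier.
import torch
from torch import nn

X = torch.randn(500, 20)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)     # toy binary target

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):                        # dynamic graph: plain Python loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print("final loss:", loss.item())
```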

NLP Libraries

Libraries dedicated to text processing and linguistic analysis.

NLTK (Natural Language Toolkit)

  • Scope: Foundational NLP tasks.
  • Key Goals: Text tokenization, stemming, and parsing.
  • Key Features: POS tagging, N-gram models.
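
A small NLTK sketch of tokenization and part-of-speech tagging; note that the exact resource names passed to nltk.download can vary slightly across NLTK versions:

```python
# NLTK sketch: tokenize a sentence and tag parts of speech.
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer data
nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger data

tokens = nltk.word_tokenize("Data scientists build predictive models.")
print(nltk.pos_tag(tokens))   # e.g. [('Data', 'NNP'), ('scientists', 'NNS'), ...]
```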

Gensim

  • Scope: Topic modeling and vector representation.
  • Key Goals: Perform LDA, Word2Vec, and document similarity analysis.
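
A hedged Gensim sketch training a tiny Word2Vec model on a made-up toy corpus (gensim 4.x API, where the dimensionality argument is vector_size):

```python
# Gensim sketch: Word2Vec embeddings from a toy corpus of tokenized sentences.
from gensim.models import Word2Vec

sentences = [
    ["data", "science", "uses", "statistics"],
    ["machine", "learning", "uses", "data"],
    ["statistics", "and", "machine", "learning"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv.most_similar("data", topn=3))   # nearest words in embedding space
```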

spaCy

  • Scope: Industrial-strength NLP pipelines.
  • Key Goals: Fast and accurate linguistic analysis.
  • Key Features: Named Entity Recognition (NER), Dependency Parsing.
  • Complexity: Moderate to High.
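
A spaCy sketch of named entity recognition, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

```python
# spaCy sketch: run the NLP pipeline and print named entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a London startup for $1 billion in 2023.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, London GPE, $1 billion MONEY
```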

Hugging Face Transformers

  • Scope: State-of-the-art transformer-based NLP.
  • Key Goals: Enable tasks like text classification, summarization, and translation.
  • Key Algorithms: BERT, GPT, T5, etc.
  • Complexity: High.
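
A minimal Hugging Face sketch using the high-level pipeline API, which downloads a default pretrained model on first use:

```python
# Transformers sketch: sentiment analysis with the pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The quarterly results exceeded expectations."))
# e.g. [{'label': 'POSITIVE', 'score': ...}]
```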

Specialized Tools

Libraries for domain-specific tasks or scaling.

OpenCV

  • Scope: Computer vision tasks.
  • Key Goals: Image processing, object detection.
  • Complexity: High.
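
A small OpenCV sketch: grayscale conversion and edge detection; "input.jpg" is a placeholder path, not a file referenced by these notes:

```python
# OpenCV sketch: load an image, convert to grayscale, detect edges.
import cv2

img = cv2.imread("input.jpg")                    # BGR image as a NumPy array
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)
cv2.imwrite("edges.jpg", edges)
```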

Dask

  • Scope: Parallel computing for large datasets.
  • Key Goals: Scale Pandas and NumPy workflows.
  • Complexity: High.
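
A hedged Dask sketch of a Pandas-like groupby over many CSV files; the "data-*.csv" glob and column names are placeholders, and nothing runs until .compute() is called:

```python
# Dask sketch: lazy, out-of-core groupby with a Pandas-like API.
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")                       # many CSVs as one frame
result = df.groupby("region")["sales"].mean().compute()
print(result)
```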

SHAP / LIME

  • Scope: Model interpretability tools.
  • Key Goals: Explain predictions and understand feature importance.
  • Complexity: High.
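
A hedged SHAP sketch explaining a tree model with TreeExplainer; the shape of the returned SHAP values differs between SHAP versions, so the print handles both cases:

```python
# SHAP sketch: per-feature contributions for a random forest's predictions.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # contribution of each feature
print(shap_values[0].shape if isinstance(shap_values, list) else shap_values.shape)
```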

Libraries Summary Table

| Library/Tool        | Scope              | Key Goals                      |
|---------------------|--------------------|--------------------------------|
| NumPy, Pandas       | Data manipulation  | Data cleaning, numerical ops   |
| Matplotlib, Seaborn | Data visualization | Static and statistical plots   |
| Scikit-learn        | Classical ML       | Train ML models (SVM, RF)      |
| TensorFlow, PyTorch | Deep learning      | Neural networks, DL models     |
| Gensim, spaCy       | NLP                | Topic modeling, linguistic ops |
| XGBoost, LightGBM   | Boosting models    | High-performance ML            |
| OpenCV, SHAP        | Specialized tools  | Vision tasks, model explainers |