Word Embeddings

Most Popular Methods Used for Classification, Historically

  • Naive Bayes probabilistic model ("naive" because the features are assumed to be independent)

  • Logistic regression model

  • 1989 - The backpropagation algorithm was used to recognize handwritten digits with neural networks.

  • 1990s - Kernel methods (most notably the SVM) gained popularity, along with decision trees and random forests, and neural networks took a back seat.

  • Kernel methods do not work well with "perceptual" data (such as images), but they are very effective when the problem is not too complex.

  • The SVM was the most popular method in the 1990s; decision trees were most popular from 2000 to 2010. A random forest is an ensemble of many decision trees. When kaggle.com started in 2010, random forests were the number-1 preferred method for (shallow) machine learning.

  • Around 2014, "gradient boosting machines" became the most popular method. Like random forests they are ensembles of decision trees, but the trees are added iteratively by the gradient boosting algorithm, each new tree correcting the errors of the current ensemble. It is still one of the best methods when the problem does not call for deep neural networks.

  • In 2012, the ImageNet image-classification challenge (1.4 million images, 1,000 classes) was tackled with a CNN, raising top-5 accuracy to 83.6% (from the earlier 74.3%). By 2015 the accuracy had reached 96.4%.

  • By 2015, deep learning had found applications in NLP and many other areas, and it largely replaced SVMs and decision trees in most applications.

  • Deep learning became popular because it required little feature engineering, which made most problems easier to solve.

  • With deep learning, the features and the classifier are learned jointly in one pass: backpropagation trains all layers against the same loss, so the early layers learn increasingly abstract features while the final layer learns the decision (see the CNN sketch at the end of these notes).

  • By 2018, gradient boosting (via the XGBoost library) was used for structured data, and deep learning (via the Keras library) for perceptual, unstructured data, in most Kaggle competitions.

  • Four branches of machine learning:

    • Supervised
    • Unsupervised
    • Self-supervised: labels are generated automatically from the data itself (e.g. autoencoders)
    • Reinforcement
  • See https://neptune.ai/blog/word-embeddings-guide
  • Language models such as RNNs, BERT, and ChatGPT use word embeddings internally.
  • Words are indexed and then represented by an N-dimensional feature vector; similar words end up closer together in that vector space. The number of features N (e.g. 30, 60, 100) is much smaller than the vocabulary size (see the toy example below).
  • This is an improvement over the n-gram model, which only learns from counts of N consecutive words.
  • Word embeddings let a model generalize to new sentences, as long as it has seen similar sentences built from similar words.
  • The joint distribution of many discrete random variables is difficult to model because of the curse of dimensionality, so instead each discrete word is mapped to a continuous vector (see the factorization below).
  • A neural network model is used to create the word embeddings, i.e. to model the joint probability distribution of word sequences via the conditional probability of the next word given the previous words (a minimal sketch follows below).
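
About the earlier question on learning features "in one pass": a minimal Keras sketch, assuming TensorFlow/Keras is installed and using made-up image sizes and random data purely to show the shapes. The convolutional layers learn the features and the final dense layer learns the classifier, all from the same backpropagation pass over a single loss, with no hand-crafted feature engineering.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),           # e.g. grayscale digit images
    layers.Conv2D(32, 3, activation="relu"),  # learns low-level features (edges, strokes)
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),  # learns higher-level features (parts of shapes)
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # classifier on top of the learned features
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data, only to show that features and classifier are fit together in one call.
x = np.random.rand(64, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(64,))
model.fit(x, y, epochs=1, verbose=0)
```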
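
A toy example of the "similar words are closer" idea, using a tiny hand-made vocabulary and invented 4-dimensional vectors (real embeddings are learned from data; these numbers are purely illustrative):

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "apple": 2}   # word -> integer index
embeddings = np.array([                       # one N-dimensional vector per word (N = 4 here)
    [0.90, 0.80, 0.10, 0.00],   # king
    [0.85, 0.82, 0.12, 0.05],   # queen
    [0.10, 0.00, 0.90, 0.80],   # apple
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

king, queen, apple = (embeddings[vocab[w]] for w in ("king", "queen", "apple"))
print(cosine(king, queen))   # high: related words sit close together in the vector space
print(cosine(king, apple))   # low: unrelated words sit far apart
```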
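
The factorization behind this, written out (here |V| is the vocabulary size, n the context length, and d the embedding dimension; these symbols are generic notation, not from the notes above):

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
                   \;\approx\; \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```

A count-based n-gram table for the right-hand conditional needs on the order of |V|^n entries (the curse of dimensionality), whereas a neural model with d-dimensional embeddings needs roughly |V| x d embedding parameters plus the network weights, and it shares those parameters across all contexts.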
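
And a minimal sketch of such a model, in the spirit of a feed-forward neural language model, again assuming TensorFlow/Keras and using made-up sizes and random data (a rough illustration, not a tuned implementation):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, context = 5000, 50, 3   # illustrative sizes only

model = keras.Sequential([
    keras.Input(shape=(context,), dtype="int32"),           # indices of the previous 3 words
    layers.Embedding(vocab_size, embed_dim, name="embed"),  # the learned word-embedding table
    layers.Flatten(),                                       # concatenate the context vectors
    layers.Dense(128, activation="tanh"),
    layers.Dense(vocab_size, activation="softmax"),         # P(next word | context)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy (context, next-word) pairs just to show the shapes; real pairs come from a corpus.
x = np.random.randint(0, vocab_size, size=(256, context))
y = np.random.randint(0, vocab_size, size=(256,))
model.fit(x, y, epochs=1, verbose=0)

# After training, this (vocab_size x embed_dim) matrix holds the word vectors:
# the embeddings are a by-product of learning to predict the next word.
word_vectors = model.get_layer("embed").get_weights()[0]
```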