DL MIT Lecture Notes

* MIT 6.S191 (2019): Visualization for Machine Learning (Google Brain)
https://www.youtube.com/watch?v=ulLx2iPTIcs&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=48&pp=iAQB

* MIT 6.S191: Deep Learning New Frontiers
https://www.youtube.com/watch?v=FHeCmnNe0P8

Notes

Talks happen every Friday at 10 AM PT, between Fri Mar 10 and Fri May 12, 2023.

L1 - Intro to Deep Learning

L2 - Deep Sequence Modeling

  • yt = f(Xt, h(t-1)) : Output depends on the current input and the past hidden state.

  • Each input step updates hidden state using weight matrices.

  • Hidden state is an array of numbers: the size of that array is the number of RNN units (rnn_units).

  • Three different weight matrices and one hidden state are involved:

    • Wxh : Matrix which projects the input X into the hidden space. [ input_dim x rnn_units ]
    • Whh : Hidden-state transition matrix. [ rnn_units x rnn_units ]
    • Why : Matrix which generates the output Y from the hidden state. [ rnn_units x output_dim ]
    • h : Array of rnn_units numbers. [ 1 x rnn_units ]
  • The higher the rnn_units, the higher the capacity of the RNN: it can recognize more features.

  • yt = ht * Why, where ht is built from the projected input (Xt * Wxh) and the previous hidden state (h(t-1) * Whh); see the update rule below.

  • The hidden state is updated with a tanh activation over a weighted sum of the previous hidden state and the projected input (a minimal NumPy sketch of one step appears at the end of this list):

    ht = tanh( Whh * h(t-1) + Wxh * Xt )
    # i.e. transform the previous hidden state with Whh, add the projected input Wxh * Xt, then apply tanh.
    
  • Can be used for 1-1, 1-N (Image captioning, Text generation from image), N-1 (Sentiment classification), N-N (Language translation), etc.

  • Sequence model should be able to:

    • Handle variable length sequences
    • Track long time dependencies
    • Recognize the importance of ordering
  • Predict Next Word is an example of an N-1 sequence modeling application:

    • The corpus of words (vocabulary) is indexed so that every word has an index number (word-to-index).
    • Each word is transformed from its index into a dense vector using a process called embedding (NLP), e.g. cat = [3, 4, 5, 6]; each dimension may represent some attribute of "cat".
    • These embeddings can be fed into a neural network to predict the next word (see the Keras next-word sketch at the end of this list).
  • Not clear how to do mini-batch training
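
  • A minimal NumPy sketch of one recurrent step using the matrices above (shapes follow the notes; row-vector convention, so the products are written as h @ Whh and x @ Wxh; all values are random/illustrative):

    import numpy as np

    input_dim, rnn_units, output_dim = 3, 4, 2

    Wxh = np.random.randn(input_dim, rnn_units)   # projects input into hidden space
    Whh = np.random.randn(rnn_units, rnn_units)   # hidden-to-hidden transition
    Why = np.random.randn(rnn_units, output_dim)  # hidden state -> output

    h = np.zeros((1, rnn_units))                  # initial hidden state
    x_t = np.random.randn(1, input_dim)           # input at time step t

    # ht = tanh( Whh * h(t-1) + Wxh * Xt )  (row-vector form)
    h = np.tanh(h @ Whh + x_t @ Wxh)
    y_t = h @ Why                                 # yt = ht * Why
    print(y_t.shape)                              # (1, output_dim)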
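
  • A minimal Keras sketch of the N-1 "predict next word" setup above; vocab_size, embed_dim, and the layer sizes are made-up illustrative values:

    import tensorflow as tf

    vocab_size, embed_dim, rnn_units = 10000, 64, 32

    next_word_model = tf.keras.Sequential([
        # word indices -> dense embedding vectors
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        # read the whole sequence, keep only the final hidden state (N-1)
        tf.keras.layers.SimpleRNN(rnn_units),
        # probability distribution over the vocabulary for the next word
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])
    # integer word indices as labels -> sparse categorical crossentropy
    next_word_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')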

NLP

* 2003: NNLM - Neural Network Language Model, aka Bengio's neural language model (the recurrent variant, RNNLM, came in 2010):

  • Vanilla RNN (not the GRU - Gated Recurrent Units - nor Long Short-Term Memory, LSTM).
  • Google has open-sourced pre-trained embeddings for English (and many other languages) trained on a ~200B-word corpus, with 128-dimensional embeddings!
  • Bengio's NNLM is trained to predict the next word from the previous n words; CBOW (continuous bag of words, where the n words before and after predict the middle word) came later with word2vec.
  • Long-term dependencies are not properly accounted for.

* 2013: Word2Vec - Word to Vector - released by Google:

  • Based on skip-gram and negative sampling. Trained using only one hidden layer.
  • Easy to train.
  • Yields vectors with good interpretability (e.g. King - Man + Woman ≈ Queen, equivalently King - Queen ≈ Man - Woman).
  • A pretrained model is readily available online via the gensim library (see the sketch below), though plain word2vec is not widely used anymore.
  • Training is done at the word level, so the different context-dependent meanings of a word are not modeled (a limitation shared by GloVe and other static embeddings).
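  • A minimal gensim sketch, assuming the pretrained 'word2vec-google-news-300' model from gensim's downloader is the kind of model these notes refer to (large download on first use):

    import gensim.downloader as api

    wv = api.load('word2vec-google-news-300')   # pretrained Google News vectors

    # King - Man + Woman ~= Queen
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
    print(wv.similarity('cat', 'dog'))          # cosine similarity of two words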

* 2014: GloVe - released by Stanford:

  • GloVe improves on Word2Vec by using global word co-occurrence counts.
  • Does not use neural networks!
  • The algorithm minimizes a weighted least-squares loss that pushes word-vector dot products towards the log of the co-occurrence counts (objective shown below).
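  • For reference, the weighted least-squares objective from the GloVe paper, written in LaTeX (X_ij is the co-occurrence count of words i and j, and f is a weighting function that caps very frequent pairs):

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2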

* 2016: fastText - released by Facebook:

  • Extends Word2Vec.

  • Each word is treated as a bag of character n-grams!

  • Can generate embeddings for unknown (out-of-vocabulary) words by composing their character n-gram vectors (see the sketch below).

  • Context information is still not properly accounted for, and it is not clear why it is considered superior to GloVe.
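
  • A minimal gensim FastText sketch of the subword behaviour (gensim 4.x API; the toy corpus and hyperparameters are made up):

    from gensim.models import FastText

    sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['cats', 'and', 'dogs']]

    # min_n / max_n set the character n-gram lengths used as subwords
    model = FastText(sentences, vector_size=16, window=2, min_count=1,
                     min_n=3, max_n=5, epochs=50)

    # 'catdog' was never seen, but still gets a vector from its character n-grams
    print(model.wv['catdog'][:4])
    print(model.wv.similarity('cat', 'cats'))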

  • https://medium.com/mlearning-ai/recurrent-neural-networks-learn-without-forgetting-598ba9daafe3

  • RNNs are neural networks with memory, which lets them exploit the ordering and temporal structure of the data in a way a plain fully connected network cannot.

  • The RNN memory cell could be as simple as the past N inputs, or a more complex combination of short-term and long-term memory cells.

  • LSTM and GRU are improved RNNs that account for long-term memory and are more resistant to gradient vanishing.

  • Simple RNN:

    import tensorflow as tf
    
    rnn = tf.keras.Sequential([
        # 32 recurrent units; input is a variable-length univariate sequence
        tf.keras.layers.SimpleRNN(32, input_shape=[None, 1]),
        # single output computed from the final hidden state
        tf.keras.layers.Dense(1)
    ])
    
  • Deep RNN:

    deep_rnn = tf.keras.Sequential([
        # intermediate recurrent layers must return the full sequence
        # so the next RNN layer receives one input per time step
        tf.keras.layers.SimpleRNN(32, return_sequences=True, input_shape=[None, 1]),
        tf.keras.layers.SimpleRNN(32, return_sequences=True),
        # last recurrent layer returns only the final hidden state
        tf.keras.layers.SimpleRNN(32),
        tf.keras.layers.Dense(1)
    ])
    
  • LSTM :

    # RNN with LSTM cell: 5 input features per step, 14 outputs per step
    lstm_model = tf.keras.models.Sequential([
        tf.keras.layers.LSTM(32, return_sequences=True, input_shape=[None, 5]),
        # Dense is applied independently at every time step (sequence-to-sequence)
        tf.keras.layers.Dense(14)
    ])

    # same architecture with a GRU cell (fewer parameters than LSTM)
    gru_model = tf.keras.Sequential([
        tf.keras.layers.GRU(32, return_sequences=True, input_shape=[None, 5]),
        tf.keras.layers.Dense(14)
    ])
    
  • LSTM models can be stateless or stateful: if you don't need to remember anything older than the current (at most N-step) input sequence, and only want predictions within that sequence, choose stateless (a minimal sketch of both appears at the end of this list).

  • LSTM example. See https://stackoverflow.com/questions/54009661/what-is-the-timestep-in-keras-lstm?rq=1:

    My training set is structured as follow:
        - number of sequences: 5358
        - the length of each sequence is 300
        - each element of the sequence is a vector of 54 features
    
    Should we use a sliding window_size ???
    Should we give one entire sequence to a single input layer so that it learns even the intra dependencies
    within that sequence ?
    
    Is the input 54 x 1 or 300 x 54 x 1 features ?
    
    Use simple stateless LSTM if all sequences are independent from each other.
    
  • See Dhaval's tutorial on LSTM: https://www.youtube.com/watch?v=LfnrRPFhkuY

  • codebasics github link: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/1_digits_recognition/digits_recognition_neural_network.ipynb

  • Polynomial features are features created by raising existing features to an exponent (and multiplying features together); see the scikit-learn sketch after this list and https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/

  • If y_train values range from 1 to 10 (i.e. 10 classes), you can either one-hot encode each value into a vector of size 10 (e.g. class 3 becomes [0, 0, 1, 0, 0, ...], the 3rd entry being 1) and use loss = 'categorical_crossentropy', or keep the integer labels as they are and use loss = 'sparse_categorical_crossentropy' (see the sketch after this list).
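
  • A minimal sketch of the stateless vs. stateful choice above (shapes and batch size are made up; stateful needs a fixed batch size and manual state resets):

    import tensorflow as tf

    # stateless (default): the hidden state is reset after every batch
    stateless = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=[None, 5]),
        tf.keras.layers.Dense(1)
    ])

    # stateful: the hidden state carries over between consecutive batches,
    # so one long sequence can be fed in fixed-size chunks of 10 steps
    stateful = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, stateful=True, batch_input_shape=[16, 10, 5]),
        tf.keras.layers.Dense(1)
    ])
    # call stateful.reset_states() after each full pass over the long sequence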
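
  • A minimal scikit-learn sketch of polynomial features (degree 2, two made-up input features):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0]])
    # columns: 1, x1, x2, x1^2, x1*x2, x2^2
    print(PolynomialFeatures(degree=2).fit_transform(X))
    # [[1. 2. 3. 4. 6. 9.]]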
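
  • A minimal sketch of the two loss choices above (a made-up 10-class problem with labels 0-9):

    import numpy as np
    import tensorflow as tf

    num_classes = 10
    y_int = np.array([2, 0, 9])                                    # integer labels
    y_onehot = tf.keras.utils.to_categorical(y_int, num_classes)   # one-hot labels
    x = np.random.randn(3, 4)

    def make_model(loss):
        m = tf.keras.Sequential([
            tf.keras.layers.Dense(num_classes, activation='softmax', input_shape=[4])
        ])
        m.compile(optimizer='adam', loss=loss)
        return m

    make_model('sparse_categorical_crossentropy').fit(x, y_int, epochs=1, verbose=0)
    make_model('categorical_crossentropy').fit(x, y_onehot, epochs=1, verbose=0)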

Notes From François Chollet's Deep Learning Book

* Prior to neural networks, the first probabilistic modeling approach was the Naive Bayes algorithm, used for classification. The "Naive" assumption is that all the features are independent.

  • A closely related model is logistic regression (a misnomer), which is a classification algorithm (it uses the sigmoid function, which maps inputs to values between 0 and 1).
  • In 1989, backpropagation was applied to neural networks (LeCun at Bell Labs) to classify handwritten digits.
  • In the 1990s, "kernel methods" such as the SVM gained popularity and neural networks took a back seat.
  • Kernel methods do not work well with "perceptual data" (like images). Otherwise they are very good when the problem is not too complicated.
  • SVM was most popular in the 1990s. Decision trees were most popular from 2000-2010. A random forest is an ensemble of specialized decision trees. When kaggle.com started in 2010, random forests were the number-one preferred method for (shallow) machine learning.
  • In 2014, "gradient boosting machines" became most popular: like random forests they ensemble decision trees, but the trees are built iteratively with the gradient boosting algorithm. It is still one of the best methods when the problem does not require deep neural networks.
  • In 2012, classification of ImageNet images (1.4 million images into 1000 classes) was cracked using a CNN, reaching 83.6% top-5 accuracy (up from the earlier 74.3%). By 2015, the accuracy reached 96.4%.
  • By 2015, deep learning had found applications in NLP and many other areas, and it mostly replaced SVMs and decision trees in those applications.
  • Deep learning became popular because not much feature engineering was required, making most problems easier to solve.
  • With deep learning you learn all layers of features jointly in a single end-to-end training pass, instead of hand-engineering features and then training a separate shallow model.
  • By 2018, gradient boosting (via the XGBoost library) was used for structured data and deep learning (via the Keras library) for perceptual, unstructured data in most Kaggle competitions.
  • Four branches of machine learning:
    • Supervised
    • Unsupervised
    • Self Supervised: Auto-generated labels; Autoencoders; etc
    • Reinforcement

Word Embeddings

  • See https://neptune.ai/blog/word-embeddings-guide
  • Language models like RNN-based LMs, BERT, and ChatGPT all use word embeddings (learned as part of the model).
  • Words are indexed and then represented by an N-dimensional feature vector; similar words are close in that vector space. N (e.g. 30, 60, 100) is much smaller than the vocabulary size.
  • This is an improvement over the n-gram model, which only learns from counts of N consecutive words.
  • Word embeddings let the model generalize to new sentences as long as it has seen similar sentences with "similar" words.
  • The joint distribution over many discrete random variables is hard to model due to the curse of dimensionality, so each discrete variable (word) is mapped to a continuous vector.
  • A neural network model is used to create the word embeddings, i.e. it learns the joint probability distribution of word sequences via the conditional probability of the next word (see the Keras Embedding sketch below).
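
  • A minimal Keras sketch of an embedding lookup (vocab_size and embed_dim are made-up values):

    import numpy as np
    import tensorflow as tf

    vocab_size, embed_dim = 10000, 64     # 64 features << vocabulary size

    embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

    # a "sentence" as word indices (word-to-index already applied)
    word_ids = np.array([[12, 7, 942, 3]])
    vectors = embedding(word_ids)
    print(vectors.shape)                  # (1, 4, 64)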