Kaggle Notes

Contents


Notes

CNN

  • Convolution applies a small matrix (a kernel) across the image to detect a feature. For example :

    |  1.5    1.5 |    This kernel detects a horizontal edge in an image
    | -1.5   -1.5 |    (it responds strongly where a bright row sits above a dark row).
    
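    A minimal sketch of applying such a kernel with scipy; the toy image values are made up for illustration :

    import numpy as np
    from scipy.signal import correlate2d

    # Toy image: bright rows on top, dark rows below -> a horizontal edge in the middle.
    image = np.array([
        [1.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ])

    kernel = np.array([
        [ 1.5,  1.5],
        [-1.5, -1.5],
    ])

    # Cross-correlation (what CNN layers actually compute); the large values mark the edge row.
    response = correlate2d(image, kernel, mode='valid')
    print(response)
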

Object Detection

Embeddings

  • When you have high-dimensional, sparse categorical features (and one response variable), it is appropriate to learn an embedding vector for each feature value so that you can use them for prediction.

  • Example: (user, movie, rating). Number of feature categories = 2; response variable = 1; user x movie = N x N. The user and movie have numerical "ids", which are not a "qualitative" feature. The real "features" of a user are indirectly defined by its (movie, rating) associations, i.e. users who gave the same ratings to the same movies are "closer" to each other. If there were a separate feature called "genre", then we could at least associate a user with a genre. Even in the absence of genre, given the user it is possible to guess the movie rating, using the following logic:

    • The given movie already has high ratings from N1 users with whom this user has very low overlap.
    • The given movie has low ratings from N2 users with whom this user has more overlap.
    • So this user will probably rate this movie low.
  • So it is possible to create an "embedding" for each user as an m-dimensional vector. Using this vector we can compare two users. Similarly we can create an "embedding" for each movie as a k-dimensional vector to be able to compare two movies. It is preferable to have m = k; then, if user1 = (u1, u2, ..., um) and movie1 = (v1, v2, ..., vk), component v1 of the movie corresponds to component u1 of the user, and the dot product of the two vectors measures how well they match.
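
    A tiny numpy sketch of this idea, with made-up 3-dimensional embeddings (m = k = 3) :

    import numpy as np

    user1  = np.array([ 0.9, -0.2, 0.1])   # likes "dimension 1", dislikes "dimension 2"
    movie1 = np.array([ 0.8, -0.1, 0.0])   # strong on dimension 1
    movie2 = np.array([-0.7,  0.9, 0.2])   # strong on dimension 2

    # A higher dot product means a better match, i.e. (roughly) a higher predicted rating.
    print(user1 @ movie1)   # relatively large
    print(user1 @ movie2)   # relatively small (negative)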

  • In general, embeddings are created as part of "solving/training" a model, and they are useful within that same problem domain to measure the "closeness" of one entity to another.

  • Given (user, movie) it is possible to predict rating, if we already have embedding vectors calculated for both user and movie.

  • Another example: (sms-id, list-of-words, is_spam). If each word, word pair or 3-word sequence is associated with a probability of being spam, we may have an algorithm to guess whether an sms is spam or not. (How do we calculate the cumulative probability?? In a long message, if you have just a single spam-like word, e.g. Prize!, does that mean it is spam???)
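
    One standard way to combine per-word probabilities is the Naive Bayes assumption (treat words as independent given the class). A minimal sketch with made-up word probabilities and priors :

    import numpy as np

    # Made-up P(word | spam) and P(word | ham) for the words of one message.
    p_word_given_spam = np.array([0.20, 0.05, 0.01])
    p_word_given_ham  = np.array([0.01, 0.02, 0.05])
    p_spam, p_ham = 0.3, 0.7                            # made-up class priors

    # Naive Bayes: multiply the per-word likelihoods (sum logs for numerical stability).
    log_spam = np.log(p_spam) + np.log(p_word_given_spam).sum()
    log_ham  = np.log(p_ham)  + np.log(p_word_given_ham).sum()

    # P(spam | words): normalize the two scores.
    p_spam_given_words = 1.0 / (1.0 + np.exp(log_ham - log_spam))
    print(p_spam_given_words)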

  • We can think of an embedding as a kind of PCA: it projects the sparse features into a dense m-dimensional vector.

  • We initialize an embedding for each user and movie using random noise, then we train them as part of the process of training the overall rating-prediction model.

  • An object's embedding, if it's any good, should capture some useful latent properties of that object. But the key word here is latent AKA hidden. It's up to the model to discover whatever properties of the entities are useful for the prediction task, and encode them in the embedding space.

  • An example to build this embedding using Keras :

    # See https://www.kaggle.com/colinmorris/embedding-layers
    import tensorflow as tf
    from tensorflow import keras

    hidden_units = (32, 4)
    movie_embedding_size = 8
    user_embedding_size = 8
    
    # Each instance will consist of two inputs: a single user id, and a single movie id
    user_id_input = keras.Input(shape=(1,), name='user_id')
    movie_id_input = keras.Input(shape=(1,), name='movie_id')
    
    #
    # Give a name to this layer, since you will need it later to retrieve the associated weights.
    # It takes N discrete integer values and maps each to a smaller dense vector.
    #
    # keras.layers.Embedding(input_dim, output_dim,  ..., input_length=None)
    #
    # Turns positive integers (e.g. word indexes) into dense vectors of fixed size. 
    #    eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
    #
    # input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
    #
    # This layer can only be used as the first layer in a model.
    #
    user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                           input_length=1, name='user_embedding')(user_id_input)
    
    movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size,
                                            input_length=1, name='movie_embedding')(movie_id_input)
    # Concatenate the embeddings (and remove the useless extra dimension)
    concatenated = keras.layers.Concatenate()([user_embedded, movie_embedded])
    out = keras.layers.Flatten()(concatenated)
    
    # Add one or more hidden layers
    for n_hidden in hidden_units:
        out = keras.layers.Dense(n_hidden, activation='relu')(out)
    
    # A single output: our predicted rating
    out = keras.layers.Dense(1, activation='linear', name='prediction')(out)
    
    model = keras.Model(
        inputs = [user_id_input, movie_id_input],
        outputs = out,
    )
    model.summary(line_length=88)
    ________________________________________________________________________________________
    Layer (type)                 Output Shape       Param #   Connected to
    ========================================================================================
    user_id (InputLayer)         (None, 1)          0
    ________________________________________________________________________________________
    movie_id (InputLayer)        (None, 1)          0
    ________________________________________________________________________________________
    user_embedding (Embedding)   (None, 1, 8)       1107952   user_id[0][0]
    ________________________________________________________________________________________
    movie_embedding (Embedding)  (None, 1, 8)       213952    movie_id[0][0]
    ________________________________________________________________________________________
    concatenate (Concatenate)    (None, 1, 16)      0         user_embedding[0][0]
                                                              movie_embedding[0][0]
    ________________________________________________________________________________________
    flatten (Flatten)            (None, 16)         0         concatenate[0][0]
    ________________________________________________________________________________________
    dense_5 (Dense)              (None, 32)         544       flatten[0][0]
    ________________________________________________________________________________________
    dense_6 (Dense)              (None, 4)          132       dense_5[0][0]
    ________________________________________________________________________________________
    prediction (Dense)           (None, 1)          5         dense_6[0][0]
    ========================================================================================
    Total params: 1,322,585
    Trainable params: 1,322,585
    Non-trainable params: 0
    ________________________________________________________________________________________
    
    # Technical note: when using embedding layers, I highly recommend using one of the optimizers
    # found in tf.train: https://www.tensorflow.org/api_guides/python/train#Optimizers
    # Passing in a string like 'adam' or 'SGD' will load one of keras's optimizers (found under
    # tf.keras.optimizers). They seem to be much slower on problems like this, because they
    # don't efficiently handle sparse gradient updates.
    model.compile(
        optimizer=tf.train.AdamOptimizer(0.005),
        loss='MSE',
        metrics=['MAE'],
    )
    
    # Let's train the model.
    
    history = model.fit(
        [df.userId, df.movieId],
        df.y,
        batch_size=5000,
        epochs=20,
        verbose=0,
        validation_split=.05,
    );
    
    candidate_movies = movies[
        movies.title.str.contains('Naked Gun')
        | (movies.title == 'The Sisterhood of the Traveling Pants')
        | (movies.title == 'Lilo & Stitch')
    ].copy()
    
    preds = model.predict([
        [uid] * len(candidate_movies), # User ids
        candidate_movies.index, # Movie ids
    ])
    
    # preds now contains the predicted ratings for this user across the candidate movies.
    
  • You can get the weights from the embedding layer :

    emb_layer = model.get_layer('movie_embedding')
    (w,) = emb_layer.get_weights()
    w.shape
    # (26744, 32)    # There are 32 weights for each movie, and ~26K movies in total.
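
    # An individual movie's vector is just a row of w. The row indices below are
    # hypothetical; in the source notebook they come from the movies dataframe.
    toy_story_vec = w[0]
    shrek_vec = w[1]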
    
  • You can examine vector similarity using distance :

    from scipy.spatial import distance

    distance.euclidean(toy_story_vec, shrek_vec)
    # 1.4916094541549683

    distance.cosine(toy_story_vec, shrek_vec)
    # 0.3593705892562866
    
  • You can use the Gensim library to retrieve the "closest" entities, etc. It is a fairly simple library for retrieving relevant information, given a dictionary of entity names and the model's weight vectors :

    from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
    kv = WordEmbeddingsKeyedVectors(movie_embedding_size)
    kv.add( ... )
    kv.most_similar('Toy Story')
    
  • To view the embeddings (i.e. weights), which have a higher dimension (e.g. 8), we can reduce them to 2D using t-SNE :

    from sklearn.manifold import TSNE
    
    tsne = TSNE(random_state=1, metric="cosine")
    
    embs = tsne.fit_transform(w)
    # Add to dataframe for convenience
    df['x'] = embs[:, 0]
    df['y'] = embs[:, 1]
    # Now you can visualize df
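    # e.g. a quick scatter plot of the 2-D embedding (matplotlib assumed to be available):
    import matplotlib.pyplot as plt
    plt.scatter(df['x'], df['y'], s=1)
    plt.show()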
    

RNN

Consider the following RNN model ... The model takes care of "remembering" the past: an LSTM uses learned gates to decide how much of the past state to keep and how much to let decay, so the importance of past events is reduced over time.

The "hidden" LSTM layers learn to encode the relevant past into a smaller-dimensional state, so that it is useful for predicting the next step.

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

siraj_model = Sequential()

siraj_model.add(LSTM(
    input_shape=(None, 1),
    units=50,
    return_sequences=True))
siraj_model.add(Dropout(0.2))

siraj_model.add(LSTM(100, return_sequences=False))
siraj_model.add(Dropout(0.2))

siraj_model.add(Dense(1))
siraj_model.add(Activation('linear'))

siraj_model.compile(loss='mse', optimizer='rmsprop')

siraj_model.fit(
    X_train,
    y_train,
    batch_size=512,
    epochs=1,
    validation_split=0.05)
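
X_train above must have shape (samples, timesteps, 1) to match input_shape=(None, 1). A minimal sketch of building such sliding windows from a univariate series (the series and window length here are made up) :

import numpy as np

series = np.sin(np.linspace(0, 20, 500))    # made-up univariate time series
window = 50                                 # made-up window length

X, y = [], []
for i in range(len(series) - window):
    X.append(series[i:i + window])          # the past `window` values
    y.append(series[i + window])            # the next value to predict

X_train = np.array(X).reshape(-1, window, 1)    # (samples, timesteps, 1)
y_train = np.array(y)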

Concepts

  • Time series often have seasonality, e.g. a pattern that repeats every 7 days or every month.

  • Time series can also have a "trend", i.e. a long-term increase or decrease.

  • The ARIMA model combines autoregression (AR), differencing (I) and a moving-average (MA) term.
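
    A minimal sketch of fitting one with statsmodels (the order (p, d, q) = (2, 1, 2) is just an illustrative choice, and the import path may differ slightly between statsmodels versions) :

    from statsmodels.tsa.arima.model import ARIMA

    arima_fit = ARIMA(series, order=(2, 1, 2)).fit()   # series: the time series to model
    forecast = arima_fit.forecast(steps=7)             # predict the next 7 points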

  • You can extract trend, seasonal components using appropriate API :

    from statsmodels.tsa.seasonal import seasonal_decompose
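    
    # Usage sketch: assumes `series` is a pandas Series with a DatetimeIndex and, e.g.,
    # weekly seasonality (older statsmodels versions use freq= instead of period=).
    result = seasonal_decompose(series, model='additive', period=7)
    trend, seasonal, residual = result.trend, result.seasonal, result.resid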
    
  • You can apply smoothing to the time series model to smooth rough edges :

    from statsmodels.tsa.api import ExponentialSmoothing,SimpleExpSmoothing, Holt
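    
    # Usage sketch: assumes `series` is a pandas Series; the smoothing_level value is made up.
    simple_fit = SimpleExpSmoothing(series).fit(smoothing_level=0.2)
    smoothed = simple_fit.fittedvalues       # the smoothed series
    holt_fit = Holt(series).fit()            # Holt also models a linear trend
    forecast = holt_fit.forecast(7)          # forecast the next 7 points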
    

Interpret Deep Learning Models

NLP

  • The order of words is important. It is useful for predicting the next word, for automatic translation, for disambiguating meaning, etc.
  • To capture all n-grams in a document, we can use nltk's ngrams function. Since n-grams are contiguous sequences (not permutations), the number of them grows only linearly with n, so capturing 2-grams (word pairs), 3-grams, ..., up to 5-grams does not make much of a performance difference. For example :
from nltk.util import ngrams   # function for making ngrams
import collections

tokenized = text.split()

# get a list of all the bi-grams
esBigrams = ngrams(tokenized, 2)

# get the frequency of each bigram in our corpus
esBigramFreq = collections.Counter(esBigrams)

# what are the ten most popular ngrams in this Spanish corpus?
esBigramFreq.most_common(10)

Interesting Articles