Kaggle Notes

Contents


Notes

CNN

  • Convolution applies a small matrix (a kernel) across the image to detect a feature. For example :

    |  1.5    1.5 |    This kernel detects a horizontal edge in an image
    | -1.5   -1.5 |    (it responds strongly where a bright row sits above a dark row).
    
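    A minimal sketch of applying such a kernel with scipy; the toy image values are made up for illustration :

    import numpy as np
    from scipy.signal import correlate2d

    # Toy image: bright rows on top, dark rows below -> a horizontal edge in the middle.
    image = np.array([
        [1.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ])

    kernel = np.array([
        [ 1.5,  1.5],
        [-1.5, -1.5],
    ])

    # Cross-correlation (what CNN layers actually compute); the large values mark the edge row.
    response = correlate2d(image, kernel, mode='valid')
    print(response)
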

Object Detection

Embeddings

  • When you have high-dimensional, sparse categorical features (and one response variable), it is appropriate to learn an embedding vector for each feature value so that you can use them for prediction.

  • Example: (user, movie, rating). Number of feature categories = 2; response variable = 1; user x movie = N x N. The user and movie have numerical "ids", which are not a "qualitative" feature. The real "features" of a user are indirectly defined by its (movie, rating) associations, i.e. users who gave the same ratings to the same movies are "closer" to each other. If there were a separate feature called "genre", then we could at least associate a user with a genre. Even in the absence of genre, given the user it is possible to guess the movie rating, using the following logic:

    • The given movie already has high ratings from N1 users with whom this user has very low overlap.
    • The given movie has low ratings from N2 users with whom this user has more overlap.
    • So this user will probably rate this movie low.
  • So it is possible to create an "embedding" for each user as an m-dimensional vector. Using this vector we can compare two users. Similarly we can create an "embedding" for each movie as a k-dimensional vector to be able to compare two movies. It is preferable to have m = k; then, if user1 = (u1, u2, ..., um) and movie1 = (v1, v2, ..., vk), component v1 of the movie corresponds to component u1 of the user, and the dot product of the two vectors measures how well they match.
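
    A tiny numpy sketch of this idea, with made-up 3-dimensional embeddings (m = k = 3) :

    import numpy as np

    user1  = np.array([ 0.9, -0.2, 0.1])   # likes "dimension 1", dislikes "dimension 2"
    movie1 = np.array([ 0.8, -0.1, 0.0])   # strong on dimension 1
    movie2 = np.array([-0.7,  0.9, 0.2])   # strong on dimension 2

    # A higher dot product means a better match, i.e. (roughly) a higher predicted rating.
    print(user1 @ movie1)   # relatively large
    print(user1 @ movie2)   # relatively small (negative)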

  • In general, embeddings are created as part of "solving/training" a model, and they are useful within that same problem domain to measure the "closeness" of one entity to another.

  • Given (user, movie) it is possible to predict rating, if we already have embedding vectors calculated for both user and movie.

  • Another example: (sms-id, list-of-words, is_spam). If each word, word pair or 3-word sequence is associated with a probability of being spam, we may have an algorithm to guess whether an sms is spam or not. (How do we calculate the cumulative probability?? In a long message, if you have just a single spam-like word, e.g. Prize!, does that mean it is spam???)
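
    One standard way to combine per-word probabilities is the Naive Bayes assumption (treat words as independent given the class). A minimal sketch with made-up word probabilities and priors :

    import numpy as np

    # Made-up P(word | spam) and P(word | ham) for the words of one message.
    p_word_given_spam = np.array([0.20, 0.05, 0.01])
    p_word_given_ham  = np.array([0.01, 0.02, 0.05])
    p_spam, p_ham = 0.3, 0.7                            # made-up class priors

    # Naive Bayes: multiply the per-word likelihoods (sum logs for numerical stability).
    log_spam = np.log(p_spam) + np.log(p_word_given_spam).sum()
    log_ham  = np.log(p_ham)  + np.log(p_word_given_ham).sum()

    # P(spam | words): normalize the two scores.
    p_spam_given_words = 1.0 / (1.0 + np.exp(log_ham - log_spam))
    print(p_spam_given_words)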

  • We can think of an embedding as a kind of PCA: it projects the sparse features into a dense m-dimensional vector.

  • We initialize an embedding for each user and movie using random noise, then we train them as part of the process of training the overall rating-prediction model.

  • An object's embedding, if it's any good, should capture some useful latent properties of that object. But the key word here is latent AKA hidden. It's up to the model to discover whatever properties of the entities are useful for the prediction task, and encode them in the embedding space.

  • An example to build this embedding using Keras :

    # See https://www.kaggle.com/colinmorris/embedding-layers
    import tensorflow as tf
    from tensorflow import keras

    hidden_units = (32, 4)
    movie_embedding_size = 8
    user_embedding_size = 8
    
    # Each instance will consist of two inputs: a single user id, and a single movie id
    user_id_input = keras.Input(shape=(1,), name='user_id')
    movie_id_input = keras.Input(shape=(1,), name='movie_id')
    
    #
    # Give a name to this layer, since you will need it later to retrieve the associated weights.
    # It takes N discrete integer values and maps each to a smaller dense vector.
    #
    # keras.layers.Embedding(input_dim, output_dim,  ..., input_length=None)
    #
    # Turns positive integers (e.g. word indexes) into dense vectors of fixed size. 
    #    eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
    #
    # input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
    #
    # This layer can only be used as the first layer in a model.
    #
    user_embedded = keras.layers.Embedding(df.userId.max()+1, user_embedding_size,
                                           input_length=1, name='user_embedding')(user_id_input)
    
    movie_embedded = keras.layers.Embedding(df.movieId.max()+1, movie_embedding_size,
                                            input_length=1, name='movie_embedding')(movie_id_input)
    # Concatenate the embeddings (and remove the useless extra dimension)
    concatenated = keras.layers.Concatenate()([user_embedded, movie_embedded])
    out = keras.layers.Flatten()(concatenated)
    
    # Add one or more hidden layers
    for n_hidden in hidden_units:
        out = keras.layers.Dense(n_hidden, activation='relu')(out)
    
    # A single output: our predicted rating
    out = keras.layers.Dense(1, activation='linear', name='prediction')(out)
    
    model = keras.Model(
        inputs = [user_id_input, movie_id_input],
        outputs = out,
    )
    model.summary(line_length=88)
    ________________________________________________________________________________________
    Layer (type)                 Output Shape       Param #   Connected to
    ========================================================================================
    user_id (InputLayer)         (None, 1)          0
    ________________________________________________________________________________________
    movie_id (InputLayer)        (None, 1)          0
    ________________________________________________________________________________________
    user_embedding (Embedding)   (None, 1, 8)       1107952   user_id[0][0]
    ________________________________________________________________________________________
    movie_embedding (Embedding)  (None, 1, 8)       213952    movie_id[0][0]
    ________________________________________________________________________________________
    concatenate (Concatenate)    (None, 1, 16)      0         user_embedding[0][0]
                                                              movie_embedding[0][0]
    ________________________________________________________________________________________
    flatten (Flatten)            (None, 16)         0         concatenate[0][0]
    ________________________________________________________________________________________
    dense_5 (Dense)              (None, 32)         544       flatten[0][0]
    ________________________________________________________________________________________
    dense_6 (Dense)              (None, 4)          132       dense_5[0][0]
    ________________________________________________________________________________________
    prediction (Dense)           (None, 1)          5         dense_6[0][0]
    ========================================================================================
    Total params: 1,322,585
    Trainable params: 1,322,585
    Non-trainable params: 0
    ________________________________________________________________________________________
    
    # Technical note: when using embedding layers, I highly recommend using one of the optimizers
    # found in tf.train: https://www.tensorflow.org/api_guides/python/train#Optimizers
    # Passing in a string like 'adam' or 'SGD' will load one of keras's optimizers (found under
    # tf.keras.optimizers). They seem to be much slower on problems like this, because they
    # don't efficiently handle sparse gradient updates.
    model.compile(
        optimizer=tf.train.AdamOptimizer(0.005),
        loss='MSE',
        metrics=['MAE'],
    )
    
    # Let's train the model.
    
    history = model.fit(
        [df.userId, df.movieId],
        df.y,
        batch_size=5000,
        epochs=20,
        verbose=0,
        validation_split=.05,
    );
    
    candidate_movies = movies[
        movies.title.str.contains('Naked Gun')
        | (movies.title == 'The Sisterhood of the Traveling Pants')
        | (movies.title == 'Lilo & Stitch')
    ].copy()
    
    preds = model.predict([
        [uid] * len(candidate_movies), # User ids
        candidate_movies.index, # Movie ids
    ])
    
    # preds now contains the predicted ratings for this user across the candidate movies.
    
  • You can get the weights from the embedding layer :

    emb_layer = model.get_layer('movie_embedding')
    (w,) = emb_layer.get_weights()
    w.shape
    # (26744, 32)    # There are 32 weights for each movie, and ~26K movies in total.
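
    # An individual movie's vector is just a row of w. The row indices below are
    # hypothetical; in the source notebook they come from the movies dataframe.
    toy_story_vec = w[0]
    shrek_vec = w[1]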
    
  • You can examine vector similarity using distance :

    from scipy.spatial import distance

    distance.euclidean(toy_story_vec, shrek_vec)
    # 1.4916094541549683

    distance.cosine(toy_story_vec, shrek_vec)
    # 0.3593705892562866
    
  • You can use the Gensim library to retrieve the "closest" entities, etc. It is a fairly simple library for retrieving relevant information, given a dictionary of entity names and the model's weight vectors :

    from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
    kv = WordEmbeddingsKeyedVectors(movie_embedding_size)
    kv.add( ... )
    kv.most_similar('Toy Story')
    
  • To view the embeddings (i.e. weights), which have a higher dimension (e.g. 8), we can reduce them to 2D using t-SNE :

    from sklearn.manifold import TSNE
    
    tsne = TSNE(random_state=1, metric="cosine")
    
    embs = tsne.fit_transform(w)
    # Add to dataframe for convenience
    df['x'] = embs[:, 0]
    df['y'] = embs[:, 1]
    # Now you can visualize df
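    # e.g. a quick scatter plot of the 2-D embedding (matplotlib assumed to be available):
    import matplotlib.pyplot as plt
    plt.scatter(df['x'], df['y'], s=1)
    plt.show()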
    

RNN

Consider the following RNN model ... The model takes care of "remembering" the past: an LSTM uses learned gates to decide how much of the past state to keep and how much to let decay, so the importance of past events is reduced over time.

The "hidden" LSTM layers learn to encode the relevant past into a smaller-dimensional state, so that it is useful for predicting the next step.

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

siraj_model = Sequential()

siraj_model.add(LSTM(
    input_shape=(None, 1),
    units=50,
    return_sequences=True))
siraj_model.add(Dropout(0.2))

siraj_model.add(LSTM(100, return_sequences=False))
siraj_model.add(Dropout(0.2))

siraj_model.add(Dense(1))
siraj_model.add(Activation('linear'))

siraj_model.compile(loss='mse', optimizer='rmsprop')

siraj_model.fit(
    X_train,
    y_train,
    batch_size=512,
    epochs=1,
    validation_split=0.05)
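
X_train above must have shape (samples, timesteps, 1) to match input_shape=(None, 1). A minimal sketch of building such sliding windows from a univariate series (the series and window length here are made up) :

import numpy as np

series = np.sin(np.linspace(0, 20, 500))    # made-up univariate time series
window = 50                                 # made-up window length

X, y = [], []
for i in range(len(series) - window):
    X.append(series[i:i + window])          # the past `window` values
    y.append(series[i + window])            # the next value to predict

X_train = np.array(X).reshape(-1, window, 1)    # (samples, timesteps, 1)
y_train = np.array(y)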

Concepts

  • Time series often have seasonality, e.g. a pattern that repeats every 7 days or every month.

  • Time series can also have a "trend", i.e. a long-term increase or decrease.

  • The ARIMA model combines autoregression (AR), differencing (I) and a moving-average (MA) term.
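
    A minimal sketch of fitting one with statsmodels (the order (p, d, q) = (2, 1, 2) is just an illustrative choice, and the import path may differ slightly between statsmodels versions) :

    from statsmodels.tsa.arima.model import ARIMA

    arima_fit = ARIMA(series, order=(2, 1, 2)).fit()   # series: the time series to model
    forecast = arima_fit.forecast(steps=7)             # predict the next 7 points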

  • You can extract trend, seasonal components using appropriate API :

    from statsmodels.tsa.seasonal import seasonal_decompose
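    
    # Usage sketch: assumes `series` is a pandas Series with a DatetimeIndex and, e.g.,
    # weekly seasonality (older statsmodels versions use freq= instead of period=).
    result = seasonal_decompose(series, model='additive', period=7)
    trend, seasonal, residual = result.trend, result.seasonal, result.resid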
    
  • You can apply smoothing to the time series model to smooth rough edges :

    from statsmodels.tsa.api import ExponentialSmoothing,SimpleExpSmoothing, Holt
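    
    # Usage sketch: assumes `series` is a pandas Series; the smoothing_level value is made up.
    simple_fit = SimpleExpSmoothing(series).fit(smoothing_level=0.2)
    smoothed = simple_fit.fittedvalues       # the smoothed series
    holt_fit = Holt(series).fit()            # Holt also models a linear trend
    forecast = holt_fit.forecast(7)          # forecast the next 7 points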
    

Interpret Deep Learning Models

NLP

  • The order of words is important. It is useful for predicting the next word, for automatic translation, for disambiguating meaning, etc.
  • To capture all n-grams in a document, we can use nltk's ngrams function. Since n-grams are contiguous sequences (not permutations), the number of them grows only linearly with n, so capturing 2-grams (word pairs), 3-grams, ..., up to 5-grams does not make much of a performance difference. For example :
from nltk.util import ngrams   # function for making ngrams
import collections

tokenized = text.split()

# get a list of all the bi-grams
esBigrams = ngrams(tokenized, 2)

# get the frequency of each bigram in our corpus
esBigramFreq = collections.Counter(esBigrams)

# what are the ten most popular ngrams in this Spanish corpus?
esBigramFreq.most_common(10)

Interesting Articles