Data Science Notes

Overview

These are high-level notes based on DataCamp courses, covering:

  • Python notes
  • Numpy
  • Pandas
  • Notes on Statistics
  • Basic matplotlib plotting
  • Seaborn plotting

Python Notes

  • There are different scopes for variables:

    • Local
    • Enclosing Function
    • Global
    • Built-in (the four together are called the LEGB rule)
  • Example for Enclosing function declaration:

    def outer_func(x):
        n = 10                # n lives in the enclosing (outer_func) scope
        def inner_func(y):
            nonlocal n        # rebind n from the enclosing scope
            n = 20
            return n + y
        inner_func(x)
        return n              # 20 -- modified by inner_func
    

Numpy

  • NumPy is the fundamental package for scientific computing.
  • Provides n-dimensional array and statistical methods.
  • The ndarray is a class.
  • All elements must be homogeneous (except for Object dtype).
  • Enables vectorized operations.
  • Much faster than Python built-in sequence types (Lists, tuples)
  • Tight integration with matplotlib

Examples:

pip install numpy

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# array([1,  2,  3,  4,  5,  6])

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

a.ndim
# 2

a.shape
# (3, 4)

len(a.shape) == a.ndim
# True

# size == Fixed total number of elements in array
a.size
# 12

import math
a.size == math.prod(a.shape)
# True

# Arrays are typically “homogeneous”;
# The data type is recorded in the dtype attribute.
a.dtype
# dtype('int64')  # "int" for integer, "64" for 64-bit

np.zeros(2)
# array([0., 0.])
# See Also:  np.zeros((3,4))

np.ones(2)
# array([1., 1.])

# Create an uninitialized array; contents are arbitrary (whatever is in memory).
np.empty(2) 
# array([3.14, 42.  ])  # may vary

np.arange(4)
# array([0, 1, 2, 3])

np.arange(2, 9, 2)
# array([2, 4, 6, 8])

# np.linspace() : Divide given interval to yield given number of points.
np.linspace(0, 10, num=5)
#  array([ 0. ,  2.5,  5. ,  7.5, 10. ])

# Default data type is np.float64; you can override it.
x = np.ones(2, dtype=np.int64)
x
# array([1, 1])

# Adding, removing, and sorting elements

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
np.sort(arr)                                # Creates new sorted copy.
# array([1, 2, 3, 4, 5, 6, 7, 8])

# argsort, which is an indirect sort along a specified axis,

# lexsort, which is an indirect stable sort on multiple keys,

# searchsorted, which will find elements in a sorted array, and

# partition, which is a partial sort.
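
A brief sketch of argsort and searchsorted from the list above:

arr = np.array([2, 1, 5, 3])
idx = np.argsort(arr)              # Indices that would sort arr.
# array([1, 0, 3, 2])
arr[idx]
# array([1, 2, 3, 5])

np.searchsorted([1, 2, 3, 5], 4)   # Insertion point that keeps the array sorted.
# 3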

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

np.concatenate((a, b))              # Unlike Python lists, NOT the same as a + b
                                    # (a + b adds element-wise).
# array([1, 2, 3, 4, 5, 6, 7, 8])

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

np.concatenate((x, y), axis=0)      # Concatenate is Not same as x + y
# array([[1, 2],
#        [3, 4],
#        [5, 6]])
#
# Above is same as: np.vstack((x, y))
#

x + y                               # y is added to every row of x.
# array([[ 6,  8],
#        [ 8, 10]])

np.concatenate((x, y), axis=1)      # axis 1 concatenate also Not same as x + y
# Error: Dimensions do not match to concatenate!
#        y must have two rows.

# Find out the shape and size of an array ...
# Use ndarray.ndim, ndarray.size, ndarray.shape

array_example = np.array([[[0, 1, 2, 3],        # 3D Array
                           [4, 5, 6, 7]],

                          [[0, 1, 2, 3],
                           [4, 5, 6, 7]],

                          [[0 ,1 ,2, 3],
                           [4, 5, 6, 7]]])


array_example.ndim
# 3

array_example.size
# 24

array_example.shape
# (3, 2, 4)

a = np.arange(6)
# [0 1 2 3 4 5]

b = a.reshape(3, 2)   # same as np.reshape(a, (3, 2)). Returns a new array object
                      # (a view on the same data when possible).
print(b)

# [[0 1]
#  [2 3]
#  [4 5]]

a = np.array([1, 2, 3, 4, 5, 6])
a.shape
# (6,)

# You can use np.newaxis to add a new axis:
a2 = a[np.newaxis, :]
a2.shape
# (1, 6)
a2
# array([[1, 2, 3, 4, 5, 6]])   # Note: Array of Array.
# 

b = np.expand_dims(a, axis=1)
b.shape
# (6, 1)
b
# array([[1],
#    [2],
#    [3],
#    [4],
#    [5],
#    [6]])


# Slices and reshapes share data with the parent; use .copy() for an independent array.
v_copy = v.copy()

a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(a[a < 5])
# [1 2 3 4]

# Also See: a[a%2==0];   a[(a > 2) & (a < 11)]

#  See Also: 
#   np.arange(1, 25).reshape(2, 12)
#   v.min(), v.max(), v.sum()
# 

What is row and what is column ?

  • For 2D array, first dimension is row and second dimension is column.
  • For an N-dimensional array, the last index is the column and the last-but-one index is the row!
  • Not uncommon to represent each column in different units.
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

#-------------------------------------------
#       c1  c2  c3  c4
#-------------------------------------------
array([[ 1,  2,  3,  4],        # Row One
       [ 5,  6,  7,  8],        # Row Two
       [ 9, 10, 11, 12]])       # Row Three
#-------------------------------------------

#
# a = Array_x(Array_y(integers))
#     X-Axis === Axis 0 (the row index)
#     Y-Axis === Axis 1 (the column index)
#
#  Every row represents a 4-dimensional point.
#  However, the array itself is a 2-D array.
#
#  Remember every axis represents linear infinite points.
#

Total Dimensions vs Array Dimensions

  • In a philosophical sense, every column is implicitly an independent dimension.

  • In theory, you can represent a single data point of N dimensions in 1D array:

    [ c1, c2, ..., cn ]
    
  • Hence 2-D array is sufficient to represent any N-dimensional data:

    [ c11, c12, ..., c1n ]
    [ c21, c22, ..., c2n ]
    [ c31, c32, ..., c3n ]
    ....
    
  • Imagine sales data having N-dimensional attributes like following:

    • Time
    • Country
    • City
    • Product
    • Cost
    • Price
    • Color
  • We may collect data from different sources, say one per country; sourcing then yields a list of 2-D tables. This is actually 3-D data, the third dimension being the source (e.g. Country).

  • We often convert 3-D data to a 2-D array by introducing a new column (see the sketch after this list).

  • Discrete data is difficult to understand and digest, so we often want to view consolidated data aggregated over a specific dimension.

  • Usually we keep a flat 2-D information and aggregate on-demand to visualize information over any specific dimension.
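
A tiny pandas sketch of this idea, using made-up per-country tables (all names hypothetical):

import pandas as pd

# One 2-D table per source (country) -- together, effectively 3-D data.
us = pd.DataFrame({'Product': ['A', 'B'], 'Price': [10, 12]})
uk = pd.DataFrame({'Product': ['A', 'B'], 'Price': [11, 13]})

# Flatten to 2-D by introducing the source as a new column ...
flat = pd.concat([us.assign(Country='US'), uk.assign(Country='UK')])

# ... and aggregate on demand over any specific dimension.
flat.groupby('Country')['Price'].mean()
# Country
# UK    12.0
# US    11.0
# Name: Price, dtype: float64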

Numpy terminologies

  • A NumPy 1-D array may be called a vector.
  • A 2-D array may be called a matrix.
  • A 3-D array may be called a tensor.
  • These names are not accurate, since e.g. matrix multiplication has different semantics than element-wise array multiplication.
  • The N-D array is not exactly like a tensor either.
  • A NumPy n-d array is just a plain n-dimensional container of scalar data.
  • It could be the basis for a vector, matrix, tensor or whatever; but it does not claim to be anything itself.
v * 10   # Multiplies every element by given scalar.
v1 * v2  # Not a matrix multiplication. Same positional values get multiplied.
1/v1     # Every element gets inverted.

Various Operations

np.unique(arr)            # Yields unique values. Will flatten array.
v.min(), v.max()          # Yields scalar value
v.ndim                    # Number of dimensions
v.reshape()               # Reshape and create a new view over same base.
v.shape = (2, 5)          # You can reshape, in-place by assigning!
v.base                    # The base array for the (reshaped) array.
v < 5                     # Create boolean array
np.unique(v_2d)           # Returns 1-D array of unique elements.
np.unique(v_2d, axis=0)   # Returns list of unique rows.
v.transpose() == v.T      # Returns transpose
np.flip(v)                # Reverse Array. Both columns, rows reversed.
np.flip(v, axis=0)        # Reverse only rows. Column order intact.
v.flatten()               # New copy of flattened array.
v.ravel()                 # Flattened view over parent array. (Pointer)
np.sum()                  # Prefer this over python built-in sum.
np.square()
np.save('filename', v)    # Stores v in filename.npy file.
np.load('filename.npy')   # Load variable from file.
np.savetxt('file.csv', v) # Stores numpy variable in csv file.
np.loadtxt('file.csv')    # Load numpy variable from csv file.
x[..., 0] == x[:, :, 0]   # x is 3d array.
x[i, j] == x[i][j]        # Prefer x[i, j] over x[i][j]. Faster.
np.intp == np.int64       # np.intp is platform specific int type.
                          # Used with dtype=np.intp option.
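
A quick sketch of the flatten-vs-ravel distinction from the list above:

v = np.arange(6).reshape(2, 3)
f = v.flatten()       # Independent copy.
r = v.ravel()         # View over the same data (when possible).

r[0] = 99
v[0, 0]               # 99 -- ravel shares data with v.
f[0]                  # 0  -- flatten made a copy.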

numpy sum operation

data = np.array([[1, 2], [3, 4], [5, 6]])
data

# array([[1, 2],
#      [3, 4],
#      [5, 6]])
# 

data.sum()    # By default sum applies to all elements. Like reduce.
# 21

#          +------ Operation done over iterating along the row (last but one axis).
#          |       This axis is lost after operation. This is Reduce operation.
#          |       More common operation.
#          |
data.sum(axis=0)
# array([9, 12])

#          +------ Operation done over iterating along the column (last axis).
#          |       This axis is lost after operation. This is Reduce operation.
#          |       Less common operation. 
#          |       Unless columns are homogeneous, there is no point summing across columns.
#          |
data.sum(axis=1)
# array([3, 7, 11])

slicing and broadcasting

  • The slicing operator does not make a new copy.
  • Broadcast assignment over the slice modifies the original array!
  • If you want to make a copy, use v.copy()
v = np.arange(12)
v_slice = v[0:5]       # By default slice is not a copy for numpy!
v_slice[:] = 100       # This is called broadcasting.
print(v)
# [100 100 100 100 100   5   6   7   8   9  10  11]
  • Broadcasting works even when you add two arrays. If one Array contains singleton element, then it gets applied over other:

    v.shape = (3, 4)
    v + 1              # 1 added to each element. Result same shape.
    v + [1]            # 1 added to each element. Result same shape.
    v + [[1]]          # 1 added to each element. Result same shape.
    v + [[[1]]]        # Note:  Result shape: (1, 3, 4)
    
    v + [1, 2]         # Error: shape(3,4) + shape(2,)(1d-array)
    v + [[1, 2]]       # Error: shape(3,4) + shape(1, 2): No Broadcast.
    
  • Result of Slicing retains total number of dimensions. Indexing result loses dim:

    x = np.array([[[1],[2],[3]], [[4],[5],[6]]])
    
    x.shape
    # (2, 3, 1)
    
    x[1:2]             # Row 1 to 2, 2 not included. So only Row 1.
    # array([[[4],
    #         [5],
    #         [6]]])
    
  • Using ellipsis, you can stand in for "all missing dimensions" :

    y = x[..., 0]                 # Same  as x[:, :, 0]
    print(y)
    # array([[1, 2, 3],           # Shape: (2,3,1) => (2,3)
    #        [4, 5, 6]])
    
    print(x.shape, y.shape)
    # (2, 3, 1)   (2, 3)                # column dimension is lost.
    
  • You can introduce new dimension using np.newaxis (alias for None) :

    y[:, :, np.newaxis]       # This is same as x.
    y[:, :, None]             # This is same as x.
    
  • You can add (5, 1) array and (1, 5) array to create (5, 5) array! Basically result(i, j) = a(i) + b(j) :

    x = np.arange(5)                      #  [0, 1, 2, 3, 4]
    x[:, np.newaxis]                      # [[0], [1], [2], [3], [4]]
    x[np.newaxis, :]                      # [[0, 1, 2, 3, 4]]
    
    x[:, np.newaxis] + x[np.newaxis, :]   # Shape (5,1) + Shape (1, 5)
    
    #                              #     0  1  2  3  4
    # array([[0, 1, 2, 3, 4],      #  0
    #       [1, 2, 3, 4, 5],       #  1
    #       [2, 3, 4, 5, 6],       #  2        +
    #       [3, 4, 5, 6, 7],       #  3
    #       [4, 5, 6, 7, 8]])      #  4
    #                              #   
    
  • For array addition, each dimension must match, or at least one of the two sizes must be 1 (verified in the sketch after this table) :

    Shape1            Shape2          Result
    
    (2, 2)            (2, 1)          (2, 2)
    (2, 2)            (1, 2)          (2, 2)
    (4, 2)            (4, 1)          (4, 2)
    (4, 2)            (1, 2)          (4, 2)
    (5, 1)            (1, 5)          (5, 5)
    (4, 1, 5, 1)      (2, 1, 3)       (4, 2, 5, 3)
    
    (2, 2)            (2, 3)          Not Compatible
    (4, 2)            (2, 2)          Not Compatible
    (4, 2)            (2, 1)          Not Compatible
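
A quick check of the last compatible row above:

a = np.ones((4, 1, 5, 1))
b = np.ones((2, 1, 3))
(a + b).shape
# (4, 2, 5, 3)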
    

random operations

# Default random generator

# np.random.seed(1234)  # For predictable sequence of random numbers.
# np.random.rand(4)     # Generate 4 random numbers between 0 and 1

rng = np.random.default_rng(seed = 1234)
rng.random(3)
# array([0.63696169, 0.26978671, 0.04097352])
rng.random((3, 3))            # Generate random 3x3 2-D array.
rng.standard_normal((3, 3))   # Use normal distribution.

rng.integers(5, size=(2, 4))

Apply custom function on Numpy array

  • A NumPy array is iterable, so a 1-D array can easily be manipulated using the Python built-in map function:

    arr = np.array([1, 2, 3, 4, 5])
    result = np.array(list(map(lambda x: 2*x, arr)))
    
  • You can also use np.vectorize for convenience. It is implemented as a Python-level for loop (no real speedup):

    arr = np.array([ [1, 2], [3, 4]])
    vfunc = np.vectorize(lambda x: x*x)
    result = vfunc(arr)
    # array([[ 1,  4],
    #        [ 9, 16]])
    
  • You can also create a ufunc (universal function) from a Python function :

    #  np.frompyfunc() : Create a ufunc (universal function)
    
    arr = np.array([ [1, 2], [3, 4]])
    ufunc = np.frompyfunc(lambda x: 2*x, 1, 1)
    result = ufunc(arr)
    print(result)  # [[2 4], [6 8]]
    
  • If you want to apply an aggregation function over the N-D array (effectively losing 1 dimension in that process), you can use np.apply_along_axis :

    arr = np.array([[1, 2, 3], [4, 5, 6]])
    result = np.apply_along_axis(lambda x: sum(x), axis=1, arr=arr)
    print(result)  # [ 6 15]
    

Numpy universal Functions

  • Universal functions (ufuncs) operate on ndarrays element-wise.

  • They support array broadcasting and typecasting.

  • They can take n arguments: unary, binary, etc.

  • More than 60 built-in numpy ufunc functions available. Example Math operations:

    • add(x1, x2, ...), subtract, multiply, divide, ...
    • matmul(x1, x2, ...) # Matrix multiplication.
    • negative(x, ...) # numerical negative
    • power(x1, x2, ...)
    • bitwise_and, logical_and, greater, less, equal,
    • minimum, maximum (faster than python built-in min, max)
    • isinf, isnan, isnat (not a time), isfinite (not nan and not inf)
  • If you supply your own ufunc, you should either implement these component functions or use np.frompyfunc() to generate them for you :

    - ufunc.reduce(array, axis, dtype, out, ...)  - Reduces one dimension.
    - ufunc.accumulate(array, axis, dtype, out, ...)
    - ufunc.reduceat - Reduce using indices
    - ufunc.outer - (A, B) : apply ufunc for all (a,b) for a in A, b in B
    - ufunc.at(a, indices, ...) : apply ufunc given indices.
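
A brief sketch of these ufunc methods in action, using built-in ufuncs:

a = np.arange(1, 5)                 # [1 2 3 4]

np.add.reduce(a)                    # 10 -- same as a.sum()
np.add.accumulate(a)                # [ 1  3  6 10] -- cumulative sum
np.add.reduceat(a, [0, 2])          # [3 7] -- sums of a[0:2] and a[2:]
np.multiply.outer([1, 2], [3, 4])
# array([[3, 4],
#        [6, 8]])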
    

Pandas

import pandas as pd

s  = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df
#      A  B
#   0  1  3
#   1  2  4

df['A'] > 2

# 0    False
# 1    False
# Name: A, dtype: bool

df[df['A'] > 2]
#   Empty DataFrame
#   Columns: [A, B]
#   Index: []

df['A'] > 1

#   0    False
#   1     True

df[df['A'] > 1]

#      A  B
#   1  2  4

df.sort_values('A')
df.sort_values(by='A', ascending=False)

#      A  B
#   1  2  4         # Rows sorted using column A.
#   0  1  3

Sales df

import pandas as pd

data = {
    'Order ID': [101, 102, 103, 104, 105, 106, 107, 108, 109],
    'Product': ['Laptop', 'Desktop', 'Tablet', 'Laptop', 'Desktop', 'Tablet', 'Laptop', 'Desktop', 'Tablet'],
    'Customer': ['John', 'Jane', 'Bob', 'John', 'Jane', 'Bob', 'John', 'Jane', 'Bob'],
    'Date': ['2022-01-01', '2022-01-05', '2022-01-10', '2022-01-15', '2022-01-20', '2022-01-25', '2022-02-01', '2022-02-05', '2022-02-10'],
    'Qty': [2, 1, 3, 4, 2, 1, 3, 4, 2],
    'Price': [1000, 800, 600, 1000, 800, 600, 1000, 800, 600],
    'Total': [2000, 800, 1800, 4000, 1600, 600, 3000, 3200, 1200]
}

# You can either use Dictionary of Lists or List of Dictionaries to create df.

df = pd.DataFrame(data)
print(df)

     Order ID  Product Customer        Date  Qty  Price  Total
  0       101   Laptop     John  2022-01-01    2   1000   2000
  1       102  Desktop     Jane  2022-01-05    1    800    800
  2       103   Tablet      Bob  2022-01-10    3    600   1800
  3       104   Laptop     John  2022-01-15    4   1000   4000
  4       105  Desktop     Jane  2022-01-20    2    800   1600
  5       106   Tablet      Bob  2022-01-25    1    600    600
  6       107   Laptop     John  2022-02-01    3   1000   3000
  7       108  Desktop     Jane  2022-02-05    4    800   3200
  8       109   Tablet      Bob  2022-02-10    2    600   1200

Avocados df

nRowsRead = 1000 # Specify None to read the whole file.
# avocado.csv has 18250 rows, but we load only the first 1000 rows
adf = pd.read_csv('./data/avocado.csv', delimiter=',', nrows = nRowsRead)

adf.info()

  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 1000 entries, 0 to 999
  Data columns (total 14 columns):
   #   Column        Non-Null Count  Dtype
  ---  ------        --------------  -----
   0   Unnamed: 0    1000 non-null   int64
   1   Date          1000 non-null   object
   2   AveragePrice  1000 non-null   float64
   3   Total Volume  1000 non-null   float64
   4   4046          1000 non-null   float64
   5   4225          1000 non-null   float64
   6   4770          1000 non-null   float64
   7   Total Bags    1000 non-null   float64
   8   Small Bags    1000 non-null   float64
   9   Large Bags    1000 non-null   float64
   10  XLarge Bags   1000 non-null   float64
   11  type          1000 non-null   object
   12  year          1000 non-null   int64
   13  region        1000 non-null   object
  dtypes: float64(9), int64(2), object(3)

adf.iloc[0]

  Unnamed: 0                 0
  Date              2015-12-27
  AveragePrice            1.33
  Total Volume        64236.62
  4046                 1036.74
  4225                54454.85
  4770                   48.16
  Total Bags           8696.87
  Small Bags           8603.62
  Large Bags             93.25
  XLarge Bags              0.0
  type            conventional
  year                    2015
  region                Albany
  Name: 0, dtype: object

Convert DataFrame to Numpy Array

  • df.values is the backing numpy array
  • df.to_numpy(copy=True) -- Returns a copy of backing numpy array.
  • To convert array to DataFrame, use pd.DataFrame(arr, columns=['A','B'])
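
A small sketch of the round trip (column names from the sales df above):

arr = df[['Qty', 'Price']].to_numpy()             # numpy array, shape (9, 2)
df2 = pd.DataFrame(arr, columns=['Qty', 'Price']) # back to a DataFrame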

Basic Statistics and Inspection

df.head()

     Order ID  Product Customer        Date  Qty  Price  Total
  0       101   Laptop     John  2022-01-01    2   1000   2000
  1       102  Desktop     Jane  2022-01-05    1    800    800
  2       103   Tablet      Bob  2022-01-10    3    600   1800
  3       104   Laptop     John  2022-01-15    4   1000   4000
  4       105  Desktop     Jane  2022-01-20    2    800   1600

df.describe()       # Basic Statistics.

           Order ID       Qty        Price        Total
  count    9.000000  9.000000     9.000000     9.000000
  mean   105.000000  2.444444   800.000000  2022.222222
  std      2.738613  1.130388   173.205081  1155.181563
  min    101.000000  1.000000   600.000000   600.000000
  25%    103.000000  2.000000   600.000000  1200.000000
  50%    105.000000  2.000000   800.000000  1800.000000
  75%    107.000000  3.000000  1000.000000  3000.000000
  max    109.000000  4.000000  1000.000000  4000.000000

df.info()

  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 9 entries, 0 to 8
  Data columns (total 7 columns):
   #   Column    Non-Null Count  Dtype
  ---  ------    --------------  -----
   0   Order ID  9 non-null      int64
   1   Product   9 non-null      object
   2   Customer  9 non-null      object
   3   Date      9 non-null      object
   4   Qty       9 non-null      int64
   5   Price     9 non-null      int64
   6   Total     9 non-null      int64
  dtypes: int64(4), object(3)
  memory usage: 632.0+ bytes

set_index

  • set_index() applies only to Pandas, not NumPy.
  • Numpy has basic raw arrays and the ability to aggregate along different axes. Pandas builds upon it and provides higher-order concepts like index and groupby.
  • Use sort_index() to sort by index.
  • When you set_index a column, the column name becomes the index name, aka index label. Similarly, there exists a columns header label which applies to all columns. When you transpose a dataframe, the index label becomes the columns label (see the sketch after the output below).
dfc = df.set_index('Customer')    # Returns new dataframe.
print(dfc)

            Order ID  Product        Date  Qty  Price  Total
  Customer
  John           101   Laptop  2022-01-01    2   1000   2000
  Jane           102  Desktop  2022-01-05    1    800    800
  Bob            103   Tablet  2022-01-10    3    600   1800
  John           104   Laptop  2022-01-15    4   1000   4000
  Jane           105  Desktop  2022-01-20    2    800   1600
  Bob            106   Tablet  2022-01-25    1    600    600
  John           107   Laptop  2022-02-01    3   1000   3000
  Jane           108  Desktop  2022-02-05    4    800   3200
  Bob            109   Tablet  2022-02-10    2    600   1200
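
A small sketch of the index/columns label behavior mentioned above:

dfc.index.name          # 'Customer'
dfc.columns.name        # None
dfc.T.columns.name      # 'Customer' -- index label becomes columns label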

df2 = df.set_index(['Customer', 'Product'])    # Returns new dataframe.

                    Order ID        Date  Qty  Price  Total
  Customer Product
  John     Laptop        101  2022-01-01    2   1000   2000
  Jane     Desktop       102  2022-01-05    1    800    800
  Bob      Tablet        103  2022-01-10    3    600   1800
  John     Laptop        104  2022-01-15    4   1000   4000
  Jane     Desktop       105  2022-01-20    2    800   1600
  Bob      Tablet        106  2022-01-25    1    600    600
  John     Laptop        107  2022-02-01    3   1000   3000
  Jane     Desktop       108  2022-02-05    4    800   3200
  Bob      Tablet        109  2022-02-10    2    600   1200

df2.loc[('John', 'Laptop')]

                    Order ID        Date  Qty  Price  Total
  Customer Product
  John     Laptop        101  2022-01-01    2   1000   2000
           Laptop        104  2022-01-15    4   1000   4000
           Laptop        107  2022-02-01    3   1000   3000

dfc.sort_index()    # Sort by index.

              Product        Date  Qty  Price  Total
    Customer
    Alice         NaN  2022-02-10    2    600   1200
    Bob        Tablet  2022-01-10    3    600   1800
    Bob        Tablet  2022-01-25    1    600    600
    Bob        Tablet  2022-02-10    2    600   1200
    Jane      Desktop  2022-01-05    1    800    800
    Jane      Desktop  2022-01-20    2    800   1600
    Jane      Desktop  2022-02-05    4    800   3200
    John       Laptop  2022-01-01    2   1000   2000
    John       Laptop  2022-01-15    4   1000   4000
    John       Laptop  2022-02-01    3   1000   3000

Access Rows in df

  • The most obvious choice df[n] does not work. You must use df.iloc[n]
  • However df[5:6] returns row slice, not column slice.
  • df.loc[] is used to access using label based index.
  • df.iloc[] is used to access using integer offset.
  • df.loc[[n]] should be used to return dataframe instead of pd.Series.
#  df[0]                # Error. Does not work.
#  df[colname]          # Reserved for accessing column.

df[0:1]                 # Returns First Row.
df[5:6]                 # Returns single row 5. Slice interval is half-open (6 excluded).
df.head(1)              # Returns first row.
df.head(10).tail(1)     # Returns tenth row.

df.iloc[0]              # Returns single Row as Series, not DataFrame!
                        # You don't want Series. You want single row DataFrame.

df.iloc[0:1]            # Returns first row. It is DataFrame.
                        # Tip: Always use Slice Range even to get 1 row.
df.iloc[[0]]            # Returns first row as DataFrame.


df.loc[0]               # Works with default index. Returns pd.Series.
df.loc[0:1]             # Works with default index. Returns DataFrame
df.loc[[0]]             # Works with default index. Returns DataFrame

df.values[0]            # Returns first row as numpy array!

dfc.loc['John']         # If name was set as index, this works.
dfc.loc[dfc.index.get_loc('index_value')]  # If there is a non-default index.
                        # With duplicate labels, e.g. dfc.index.get_loc('John')
                        # returns a boolean array, which works with .loc.

# If you get pd.Series, you can  convert series to dataframe
# Following kind of works. But dtype becomes 'object' for all columns!
df.iloc[0].to_frame().transpose()

# Moral of the story: Always use slice notation to fetch even single row
# from dataframe to ensure the return type is DataFrame.

Access column in df

df['Product']       # Returns  pd.Series
df[['Product']]     # Returns  DataFrame

  0     Laptop
  1    Desktop
  2     Tablet
  ....
  8     Tablet
  Name: Product, dtype: object

df[df.columns[0]]   # Access first column.

df[[col1, col2]]    # Subset the columns

DataFrame Subsetting/Slicing

  • Simple row slicing done by using syntax df[n:m]
  • Simple column slicing done df[[col1, col2]]
  • For anything else, you must use df.loc or df.iloc
  • For both row and col slicing together use df.iloc[4:8, 2:4]
# Rows slicing.
df[4:8]                 # Returns rows 4 to 7. Note 8 is not included.
df.iloc[4:8]            # Works even with custom Index.

# Rows slicing with Custom index 
dfc.loc['John':'Bob']   # Range syntax is applicable for a custom index as well,
                        # but requires unique (or sorted) index labels.
                        # With df[a:b] syntax, ranges select rows only;
                        # for column ranges use df.loc[:, col1:col2].

df2.loc[('John', 'Laptop')]                     # If df has multi-index.
df2.loc[('John', 'Laptop'):('Bob', 'Tablet')] 

# Columns Slicing
df[['Customer', 'Qty', 'Price']]

# Both row and column slicing.
df.loc[4:8, ['Customer', 'Qty', 'Price']]

        Customer  Qty  Price
  4         Jane    2    800
  5          Bob    1    600
  6         John    3   1000
  7         Jane    5    800
  8          Bob    2    600

df.iloc[4:8, 2:4]  # Rows: 4 to 7; Columns: 2 to 3

      Customer        Date
  4       Jane  2022-01-20
  5        Bob  2022-01-25
  6       John  2022-02-01
  7       Jane  2022-02-05

See Also: df.iloc[:, 2:4]  # Rows: All; Columns: 2 to 3

Pandas groupby

  • Unlike numpy, pandas columns can be of different types, so numerical aggregations do not apply across columns.
  • groupby can be based on a column, an unrelated array, or values derived from columns, e.g. df[col1] + df[col2]; only the length of the array/series must match (see the sketch at the end of this section).
df.groupby('Customer').mean()

          Order ID       Qty   Price        Total
Customer
Bob          106.0  2.000000   600.0  1200.000000
Jane         105.0  2.666667   800.0  1866.666667
John         104.0  3.000000  1000.0  3000.000000

df.groupby('Customer').head(1)    # Group by, retain at most one row per group.

         Order ID  Product Customer    Date  Qty  Price  Total
  0       101   Laptop     John  2022-01-01    2   1000   2000
  1       102  Desktop     Jane  2022-01-05    1    800    800
  2       103   Tablet      Bob  2022-01-10    3    600   1800


df[['Customer', 'Qty', 'Price']].groupby('Customer').mean()

                 Qty   Price
  Customer
  Alice     2.000000   600.0
  Bob       2.000000   600.0
  Jane      2.666667   800.0


df[['Customer', 'Qty', 'Price']].groupby('Customer').agg([np.mean, 
                                        np.median, np.min, np.max])

                 Qty                    Price
                mean median amin amax    mean  median  amin  amax
  Customer
  Bob       2.000000    2.0    1    3   600.0   600.0   600   600
  Jane      2.666667    2.0    1    5   800.0   800.0   800   800
  John      3.000000    3.0    2    4  1000.0  1000.0  1000  1000

# Apply different operation to different columns after groupby ...
df[['Customer', 'Qty', 'Price']].groupby('Customer', as_index=False).agg(
                         Avg_Qty      =('Qty', np.mean), 
                         Median_Price =('Price', np.median),
                ) 
# Note: as_index=False is a groupby() parameter, not an agg() one.


# To round the results to 2 decimal places for all numeric columns ...
df = df.round(2)  

#
# Inspecting group by object ...
#
grouped = df[['Customer', 'Qty', 'Price']].groupby('Customer')
grouped.groups

# Out: {'Bob': [2, 5, 8], 'Jane': [1, 4, 7], 'John': [0, 3, 6]}

grouped.size()

# Customer
# Bob     3
# Jane    3
# John    3
# dtype: int64

grouped = df[['Customer', 'Qty', 'Price']].groupby(['one', 'two', ..., 'nine', 'nine'])
grouped.size()

  nine     2
  one      1
  two      1
  three    1
  four     1
  five     1
  six      1
  seven    1
  dtype: int64

Note: groupby operation just remembers the distinct values of the groupby column
and corresponding row numbers. The column name is unimportant and may be missing.
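
A sketch of grouping by a derived value (monthly totals from the sales df); the key series just has to match the length of df:

df.groupby(df['Date'].str[:7])['Total'].sum()   # Key: 'YYYY-MM' derived from Date

# Date
# 2022-01    10800
# 2022-02     7400
# Name: Total, dtype: int64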

Pivot Tables

  • A pivot table is a mechanism to re-organize a table so that the rows and columns represent discrete dimension values, with metrics as the cell values.
  • In a typical table each column represents a dimension (or type). In a pivot table, apart from the id column, the other columns represent (discrete) values.
  • In a sense, pivot tables are a visualization technique, like a scatter plot.
  • The process of changing a "normal" table into the "pivot" view is called pivoting (pd.pivot_table). The reverse process is called melting (pd.melt).
  • If you have one target numeric value, you can view the mean or sum of that value wrt 2 other variables over row and column.
Example: Total/Average/Count of Sales Revenue generated by 2 Sales Reps

--------------------------
        2020   2021  2022
--------------------------
John    Avg    Avg   Avg
Mary    Avg    Avg   Avg

SQL:  select Name, Year, avg(Revenue) From sales Group By Name, Year;
  • With pivot tables, the columns are discrete values instead of types. For example, a single column "Date" is expanded to multiple columns of discrete values or ranges (e.g. 2001-2010, 2011-2020, etc).

  • In Pivot operation done on Dataframes, the values (or ranges) become columns resulting in wider table. (pd.pivot_table operation).

  • The pd.melt operation is the reverse operation, where multiple columns are reduced to one. For example, (2020, 2021, ...) can be reduced to a single Year column (representing the variable) with an additional AvgSales column (representing the value).

  • You can view even 3D data by making either row or column as composite:

    --------------------------------
                   Avg   Count  Max
    --------------------------------
    John  2020     avg   count  max
    Mary  2020     avg   count  max
    
    John  2021     avg   count  max
    Mary  2021     avg   count  max
    
    John  2022     avg   count  max
    Mary  2022     avg   count  max
    
  • Pivot tables are typically used to group by 2 different variables and to view corresponding mean values.

  • The equivalent of df.groupby('Customer').mean() using Pivot tables would be:

    df.pivot_table(values='Qty', index='Customer')
    df.pivot_table(values='Qty', index='Customer', aggfunc=np.mean)
    
                     Qty
      Customer
      Alice     2.000000
      Bob       2.000000
      Jane      2.666667
      John      3.000000
    
    df.pivot_table(values='Qty', index='Customer', aggfunc=np.median)
    
                Qty
      Customer
      Alice       2
      Bob         2
      Jane        2
      John        3
    
    Note: df.pivot(values='Qty', index='Customer') raises an error if the
          resulting index is not unique. Use df.pivot_table() instead, which
          aggregates the duplicates via aggfunc.
    
  • Mostly we pivot on 2 variables, called index and columns:

    df.pivot_table(values='Price', index='Customer', columns='Qty',
                                        fill_value=0, margins = True)
    
    # We are viewing mean (default) value of Price against (Customer, Qty)
    
        Qty         1     2     3     4    5     All
        Customer
        Alice       0   600     0     0    0   600.0
        Bob       600   600   600     0    0   600.0
        Jane      800   800     0     0  800   800.0
        John        0  1000  1000  1000    0  1000.0
        All       700   750   800  1000  800   780.0
    
  • To get multi-index columns, you can create a multi-index-rows df and transpose it with df.T

  • You can access values of dataframe with multi-index using df.xs method or slicing method. The index need not be strict prefix:

    df.xs(('val1', 'val3'), level=[0,2])
    # Uses col1, col3 of 3 column index.
    

pd.melt (Reverse Pivoting)

  • In the process of melting, multiple homogeneous columns are merged into a single "variable" column whose values are the original column names. This results in a longer table. The unique index (if any) can become non-unique.
  • For some visualizations, melting is better if there are multiple homogeneous columns and visualization over that dimension is desirable.

Here is an example of pivoting and melting back:

import pandas as pd

# Melted data
data = {
    'Student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 
                'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 
                'Science', 'English', 'English', 'English'],
    'Score': [85, 78, 92, 89, 94, 88, 91, 82, 85]
}

df_melted = pd.DataFrame(data)
print("Original (Melted long format) data:\n", df_melted)

Output:

   Student   Subject  Score
0    Alice      Math     85
1      Bob      Math     78
2  Charlie      Math     92
3    Alice   Science     89
4      Bob   Science     94
5  Charlie   Science     88
6    Alice   English     91
7      Bob   English     82
8  Charlie   English     85

Pivoting back to the original wide format:

# Pivot back to wide format
df_pivoted = df_melted.pivot(index='Student',
                              columns='Subject', 
                              values='Score').reset_index()

print("\nPivoted (wide format) data:\n", df_original)

Output:

Subject Student  English  Math  Science
0         Alice       91    85       89
1           Bob       82    78       94
2       Charlie       85    92       88

# df.columns.name = None   # Get rid of the useless columns index name/label.
# In the example above, the columns label "Subject" may be retained, if desired.
# df.index.name   = None   # Get rid of the useless row index name/label.

Explanation

index='Student':      This sets "Student" as the row index.
columns='Subject':    Each unique value in "Subject" becomes a new column.
values='Score':       Fills the cells with the "Score" values from each "Subject".

The result is a DataFrame in the original wide format. 
Note that reset_index() is used to convert the "Student" index back into a regular column.

Now here is how the melt (reverse pivot) operation works:

df_melted = pd.melt(df_pivoted, 
      id_vars=['Student'],  # Student column to remain as id for each row
      var_name='Subject',   # Name for the new column in place of all non-id columns
                            # i.e. ("Math", "Science", "English").
      value_name='Score'   # New column name that represents the numeric values.
      # value_vars=['Math', 'Science', 'English'] # Optional. Defaults to all non-id columns.
)

print("\nLong format data:\n", df_melted)

Output:

   Student   Subject  Score
0    Alice      Math     85
1      Bob      Math     78
2  Charlie      Math     92
3    Alice   Science     89
4      Bob   Science     94
5  Charlie   Science     88
6    Alice   English     91
7      Bob   English     82
8  Charlie   English     85
  • This long form is more suitable for visualization using Seaborn or Matplotlib.
  • There are 3 dimensions in total: two can go along the x and y axes, and the third could be encoded as the size of the circle at each (x, y) point.
  • Or we could draw bar plots of Student vs Score, grouped by Subject, as sketched below.
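
A minimal Seaborn sketch of that grouped bar plot (assumes df_melted from above):

import seaborn as sns
import matplotlib.pyplot as plt

# One bar per (Student, Subject) pair; Subject is encoded as hue.
sns.barplot(data=df_melted, x='Student', y='Score', hue='Subject')
plt.show()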

drop columns

# Note: By default, df.drop(8, inplace=True) drops the row with label 8.
#       To drop a column, you must specify columns explicitly.

dfc.drop(columns=['Order ID'], inplace=True)
print(dfc)

# Alternative: pop() drops the specified column in place and returns it as pd.Series.
item = dfc.pop('Order ID')   # (Would raise KeyError here: already dropped above.)

sort_values

dfc.sort_values('Qty', ascending=False)

            Product        Date  Qty  Price  Total
  Customer
  Jane      Desktop  2022-01-05    1    800    800
  Bob        Tablet  2022-01-25    1    600    600
  John       Laptop  2022-01-01    2   1000   2000
  Jane      Desktop  2022-01-20    2    800   1600
  ....

# See Also:  dfc.sort_values([col1,col2], ascending=(False,True))

Adding a new Row

# Method 1:  Add new row by duplicating last row

dfc.loc['Alice'] = dfc.iloc[-1].copy()

# Modify 'Product' to NaN
dfc.loc[dfc.index[-1], 'Product'] = np.nan
print(dfc)

            Product        Date  Qty  Price  Total
  Customer
  John       Laptop  2022-01-01    2   1000   2000
  Jane      Desktop  2022-01-05    1    800    800
  Bob        Tablet  2022-01-10    3    600   1800
  John       Laptop  2022-01-15    4   1000   4000
  Jane      Desktop  2022-01-20    2    800   1600
  Bob        Tablet  2022-01-25    1    600    600
  John       Laptop  2022-02-01    3   1000   3000
  Jane      Desktop  2022-02-05    4    800   3200
  Bob        Tablet  2022-02-10    2    600   1200
  Alice         NaN  2022-02-10    2    600   1200

# Delete the row added.
dfc = dfc.drop('Alice')          # Deletes ALL rows with this index label.
dfc = dfc.drop(dfc.index[[9]])   # Delete rows at specific positions.
dfc = dfc[~dfc.index.isin(['Alice'])]

# Note: To get rows of only Alice, do this: df = df.loc[['Alice']]

# Method 2:
# Add another new row by creating and adding dataframe.

new_df = pd.DataFrame( {
    'Product': [np.nan],
    'Date': ['2022-02-15'],
    'Qty': [3],
    'Price': [1000],
    'Total': [3000]
}, index = ['Alice'] )

dfc = pd.concat([dfc, new_df])
print(dfc)

pd.concat Rows

pd.concat([df1, df2])       # Default axis=0; Rows added.
pd.concat([df1, df2], 
           join = 'inner',  # Only common columns retained.
                            # Default 'outer': all columns included.

     # verify_integrity = True # Check if duplicates found in index.
                               # Checks only index, not columns.

  )

df.apply()

  • You typically use the df.apply() function for aggregation. If you want to transform only a single column, you will use something like:

    df['newcol'] = df['col'] * 2
    new_df = my_numpy_ufunc(df)
    new_df = my_numpy_vectorized_func(df)
    
  • Apply np.sum down the rows (axis=0), i.e. once for each column :

    result = df[['Qty', 'Price']].apply(np.sum, axis=0)
    print(result)
    
      Qty        22     # This is pd.Series
      Price    7200
      dtype: int64
    
      # result_df = result.to_frame().transpose()
    
      #  Each column is summed up. Row is lost. Returns pd.Series.
    
  • df[['Qty', 'Price']].apply(lambda x: np.int32(x), axis=0) returns DataFrame:

    • The function is not an aggregation function.
    • Result is a transformed DataFrame.
    • Less common usage, you need not use .apply() unless you want to aggregate.
  • Aggregating across columns is uncommon. For this to work, all columns must have the same units, e.g. you may want to sum 'StateTax' and 'CentralTax' :

    result = df[['StateTax', 'CentralTax']].apply(np.sum, axis=1)
    df['TotalTax'] = result
    print(result)
    
    
      0    1002       # result type: pd.Series
      1     801
      2     603
      3    1004
    

Iterating over Pandas Rows

  • In most cases, you want to do vector parallel operations and should not iterate over rows. In some cases, you may still want to do this.
for index, row in dfc.iterrows():
  print("Index: ", index)
  print(row)               # row is of type pd.Series
  # row.index is list of column names.
  # You can access row['col1'], etc.

# Prefer itertuples over iterrows as it is much faster.
for row in df.itertuples(index=True):   # Return Index as first element in tuple.
  print(row.c1, row.c2)
  • In most cases, you should prefer to use vector methods to avoid iterating:

    • Vectorization: e.g. df['Qty'] = df['Qty']**2; # np.square(df['Qty'])

    • Iterate over the underlying NumPy array (faster than iterrows):

      for row in df[['Qty', 'Price']].to_numpy():
         # Do stuff with row[0], row[1], etc.
      
    • use df.apply() for aggregation.

Subtract one Dataframe from Another

  • Compute df1 - df2 to remove common rows.

  • Use concat with drop duplicates. A bit inefficient :

    # Appends df2 twice, then drops every row that appears more than once,
    # which removes df1's rows that also exist in df2.
    df_result = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    print(df_result)
    
  • Create new column as tuple of the row and use it to remove duplicates :

    df1['hash'] = df1.apply(tuple, axis=1)
    df2['hash'] = df2.apply(tuple, axis=1)
    
    result_df = df1[ ~df1['hash'].isin(df2['hash']) ]  # Return only unique rows.
    common_df = df1[  df1['hash'].isin(df2['hash']) ]  # Return common rows.
    
    df1.pop('hash'); df2.pop('hash'); result_df.pop('hash')
    
    # Without using temp variables, you can also do it as ...
    df_result = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
    
    # Another way ...
    
    df2_tuples = [tuple(row) for row in df2.values]
    result_df = df1[~df1.apply(lambda row: tuple(row) in df2_tuples, axis=1)]
    
  • Merge two DataFrames with indicator and drop using the indicator :

    result_df = pd.merge(df1, df2, indicator=True, how='outer')
    result_df = result_df[result_df['_merge'] == 'left_only'].drop('_merge', axis=1)
    

Get Percentile value

  • To get the value for a column which is at 30 percentile point:

    df['Qty'].quantile(0.30)   # Default = 0.5, i.e. the 50th percentile
    # 2.5
    df['Qty'].quantile(0)      # Min value for the column. 0th Percentile.
    df['Qty'].quantile(1)      # Max value for the column. 100th Percentile.
    
  • Given the value, how to find the percentile ? :

    percentile = ((df['Qty'].sort_values().values.searchsorted(1)+1)/ len(df)) * 100
    df['Qty'].rank(method='min')
    
  • Another way to find percentile given value:

    from scipy.stats import percentileofscore
    percentile = percentileofscore(df['column_name'], value)
    print(f"The value {value} is at {percentile:.2f}th percentile.")
    

Using aggregate vs apply

  • df[['Qty', 'Price']].agg(np.mean) is used to apply statistics functions. The result always has reduced dimensions. The aggregate function is almost always applied to columns only.
  • df.apply(func) can be used with both aggregate function and normal function. The result may or may not have reduced dimensions. Also it may be applied across rows or columns.
  • The .agg() is an alias for .aggregate() function.
df[['Qty', 'Price']].agg(np.mean) 

 Qty        2.555556               # Returns pd.Series
 Price    800.000000
 dtype: float64

df[['Qty', 'Price']].agg([np.mean, np.sum]) 

               Qty   Price         # Returns DataFrame
   mean   2.555556   800.0
   sum   23.000000  7200.0

Dropping Duplicates

df.drop_duplicates(subset=['Customer'])

# First value kept. Later duplicates are dropped.

         Order ID  Product Customer        Date  Qty  Price  Total
    0       101   Laptop     John  2022-01-01    2   1000   2000
    1       102  Desktop     Jane  2022-01-05    1    800    800
    2       103   Tablet      Bob  2022-01-10    3    600   1800

Count values

df['Customer'].value_counts()

 John    3
 Jane    3
 Bob     3
 Name: Customer, dtype: int64

df['Customer'].value_counts(normalize=True)

 John     0.3
 Jane     0.3
 Bob      0.3
 Alice    0.1
 Name: Customer, dtype: float64 

Visualization: Histograms

import matplotlib.pyplot as plt

df['Qty'].hist()
plt.show()

df['Price'].hist()
plt.show()

Visualization: Bar, Line, Scatter

avg_qty_per_customer = df.groupby('Customer')['Qty'].mean()
print(avg_qty_per_customer)

  Customer              # pd.Series
  Alice    2.000000
  Bob      2.000000
  Jane     2.666667
  John     3.000000
  Name: Qty, dtype: float64

avg_qty_per_customer.plot(kind='bar')

df.plot(x='Date', y='Total', kind='line', rot=45)
df.plot(x='Date', y='Total', kind='scatter', rot=45, xscale='log')

# Overlapping Histograms with alpha (transparency)
df[df['Price'] > 700 ]['Qty'].hist(alpha=0.7)
df[df['Price'] <= 700 ]['Qty'].hist(alpha=0.7)
plt.legend(['gt700', 'le700'])
plt.show()

Correlation Matrix

df.corr()    # In pandas >= 2.0, pass numeric_only=True to skip non-numeric columns.

          Order ID       Qty     Price     Total
Order ID  1.000000  0.213122 -0.423702  0.001381
Qty       0.213122  1.000000  0.349913  0.883411
Price    -0.423702  0.349913  1.000000  0.695845
Total     0.001381  0.883411  0.695845  1.000000

Funding Startups df

fd = pd.read_csv("./data/startup_funding.csv")
fd.info()

  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 3044 entries, 0 to 3043
  Data columns (total 10 columns):
   #   Column             Non-Null Count  Dtype
  ---  ------             --------------  -----
   0   Sr No              3044 non-null   int64
   1   Date dd/mm/yyyy    3044 non-null   object
   2   Startup Name       3044 non-null   object
   3   Industry Vertical  2873 non-null   object
   4   SubVertical        2108 non-null   object
   5   City  Location     2864 non-null   object
   6   Investors Name     3020 non-null   object
   7   InvestmentnType    3040 non-null   object
   8   Amount in USD      2084 non-null   object
   9   Remarks            419 non-null    object
  dtypes: int64(1), object(9)
  memory usage: 237.9+ KB

fd.loc[0]

  Sr No                                      1
  Date dd/mm/yyyy                   09/01/2020
  Startup Name                          BYJU’S
  Industry Vertical                     E-Tech
  SubVertical                       E-learning
  City  Location                     Bengaluru
  Investors Name       Tiger Global Management
  InvestmentnType         Private Equity Round
  Amount in USD                   20,00,00,000
  Remarks                                  NaN
  Name: 0, dtype: object

  fd.isnull()   # 2D array of booleans.

  fd.isnull().sum()  # Returns missing sum for each column. (pd.Series)

  Sr No                   0
  Date dd/mm/yyyy         0
  Startup Name            0
  Industry Vertical     171
  SubVertical           936
  City  Location        180
  Investors Name         24
  InvestmentnType         4
  Amount in USD         960
  Remarks              2625

  fd.isnull().sum().sort_values() # Sorts above array using values.

  fd.isnull().count()  # Returns constant count for each column. (pd.Series)

  Sr No                3044
  Date dd/mm/yyyy      3044
  ...
  Remarks              3044
  dtype: int64

  # To clean up a cell value, do something like this
  # (assumes the column was renamed, e.g. 'Amount in USD' -> 'AmountInUSD'):
  clean_col = np.vectorize(lambda x: x.replace(',', '').strip())
  fd['AmountInUSD'] = clean_col(fd['AmountInUSD'])

  # Find missing data percentage.
  # Find total missing data (null) for each column ...
  total = fd.isnull().sum().sort_values(ascending = False)
  # Divide by constant total entries (fd.isnull().count() == 3044)
  percent = ((fd.isnull().sum()/fd.isnull().count())*100)
  percent = percent.sort_values(ascending = False)

  missing_df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent %'])
  missing_df.head()
  # Concatenate along axis=1, uses new given column names.

                       Total  Percent %
    Remarks             2625  86.235217
    Amount in USD        960  31.537451
    SubVertical          936  30.749014
    City  Location       180   5.913272
    Industry Vertical    171   5.617608

  # If you want to reverse columns and rows, just do:  df = df.T

  # To concatenate as DF, you should do ...

  mdf = pd.concat([total.to_frame().T, percent.to_frame().T]) 
  mdf.index = ['MissedCount', 'Percent']
  # Note: It is easier to concat series along axis=1

Cleaning Dataframe column

#
# E.g. Values:  10,000   undisclosed  unknown etc.
# 
def clean_amount(x):
    x = str(x).replace(',', '').strip()
    x = x.replace('+', ' ').strip()  # Remove trailing +
    if x.upper().isupper():    # Contains alphabets
        x = 0
    if (not x):
        return 0
    return np.int64(float(x))

clean_amount = np.vectorize(clean_amount)

fd['AmountInUSD'] = clean_amount(fd['AmountInUSD'])

# To drop a column ...

del fd['Remarks']
# fd.drop('Remarks', axis=1, inplace=True)  # Same as above to del column.

# Deletes rows if any one column is NA ...  using dropna() method ...
fd.dropna(inplace=True)

# Deletes rows where AmountInUSD column is na or zero.
fd = fd[ fd['AmountInUSD'] > 0 ]

Removing Outliers

There are various methods available for removing outliers.

  • IQR Method - Interquartile Range Method:
    • IQR = Percentile(0.75) - Percentile(0.25)
    • Lower bound = Q1 - 1.5 * IQR (Could be negative in some cases!)
    • Upper bound = Q3 + 1.5 * IQR
  • Z score Method:
    • Lower Bound = Mean - 3 * (std_deviation)
    • Upper Bound = Mean + 3 * (std_deviation)
  • Modified Z-Score method:
    • Instead of std_deviation, use median of all deviations.
    • Lower Bound = Median - 3 * (median_abs_deviation)
    • Upper Bound = Median + 3 * (median_abs_deviation)
    • mad = np.median(np.abs(series - median))
  • Winsorizing:
    • Preserve outliers set at nearest acceptable value
    • Low values set at 5th Percentile (for example)
    • High values set at 95th percentile. (for example)
  • Percentile Capping:
    • Similar to Winsorizing but drop lower and higher outliers.
    • Drop less than 5th percentile and higher than 95th Percentile values.
  • Helper function to display outliers information
def outlier_info(low, high, method='', new_fd=None, column='AmountInUSD'):

  print(f"Outlier Results: (Using Method: {method})")
  # Filter out outliers
  fd_amt  = fd[column]

  if new_fd is None:
    print(f"Lower Bound: {low}; Higher Bound: {high}")
    new_fd = fd[(fd_amt >= low) & (fd_amt <= high)]

  new_count = new_fd[column].count()
  total_count = fd_amt.count()
  dropped = total_count - new_count
  percent_dropped = (dropped/total_count) * 100.0
  print(f"Total: {total_count:,}")
  print(f"Dropped: {dropped:,}") 
  print(f"Percent-Dropped: {percent_dropped:.1f}")

  print(new_fd.describe())

IQR Method

# We want to remove outliers in AmountInUSD column.
fda = fd['AmountInUSD']
fda.describe()               # Display Percentile and other statistics.

   count    2.066000e+03               2066
   mean     1.842792e+07         18,427,920      18 Million
   std      1.213441e+08        121,344,100     121 Million
   min      1.600000e+04             16,000
   25%      4.712500e+05            471,250     0.47 Million
   50%      1.725000e+06          1,725,000     1.72 Million (median)
   75%      8.000000e+06          8,000,000     8.00 Million
   max      3.900000e+09      3,900,000,000     3.90 Billion
   Name: AmountInUSD, dtype: float64

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
column = 'AmountInUSD'
Q1 = fd[column].quantile(0.25)   # 0.47 Million
Q3 = fd[column].quantile(0.75)   # 8.00 Million

# Calculate IQR
IQR = Q3 - Q1

   7528750.0   # 7.52 Million

# Define the outlier bounds
lower_bound = Q1 - 1.5 * IQR    # negative
upper_bound = Q3 + 1.5 * IQR    # 19 Million

# Filter out outliers
fd_filtered = fd[(fd[column] >= lower_bound) & (fd[column] <= upper_bound)]

fd_filtered[column].count()
1783     #  2066 - 1783 = 283;  283/2066 = 13.7% dropped.

fd_filtered.describe()

          AmountInUSD
   count  1.783000e+03            1783
   mean   3.165783e+06       3,165,783   # 3 Million
   std    4.240982e+06                   # 4 Million
   min    1.600000e+04
   25%    3.750000e+05
   50%    1.000000e+06
   75%    4.500000e+06
   max    1.900000e+07

Z-score Method

The Z-score method keeps values within 3 standard deviations of the mean. For our funding data it did not work very well, since the standard deviation is very high.


from scipy import stats

# Calculate the Z-scores for each data point
fd['z_score'] = stats.zscore(fd['AmountInUSD'])

# Filter out outliers based on Z-score
threshold = 3  # Adjust threshold as needed
new_fd = fd[fd['z_score'].abs() < threshold].drop(columns='z_score')
outlier_info(0, 0, method='z_score', new_fd=new_fd)


  Outlier Results: (Using Method: z_score)
  Total: 2,066
  Dropped: 13
  Percent-Dropped: 0.6
               Sr No   AmountInUSD
  count  2053.000000  2.053000e+03
  mean   1475.387725  1.142576e+07
  std     926.287657  3.076596e+07
  min       1.000000  1.600000e+04
  25%     622.000000  4.640000e+05
  50%    1442.000000  1.600000e+06
  75%    2310.000000  7.800000e+06
  max    3044.000000  3.300000e+08

Modified Z Score

The Modified Z-score method retains all points within 3.5 times the MAD (median absolute deviation) of the median.

def modified_z_score(series):
  median = series.median()
  mad = np.median(np.abs(series - median))
  return 0.6745 * (series - median) / mad

# Apply modified Z-score method
fd['modified_z_score'] = modified_z_score(fd['AmountInUSD'])

# Filter out outliers based on modified Z-score
threshold = 3.5
new_fd = fd[fd['modified_z_score'].abs()<threshold].drop(columns='modified_z_score')

outlier_info(0, 0, method='Modified-Z-score', new_fd=new_fd)

Result:

Outlier Results: (Using Method: Modified-Z-score)
Total: 2,066
Dropped: 489
Percent-Dropped: 23.7
             Sr No   AmountInUSD      z_score
count  1577.000000  1.577000e+03  1577.000000
mean   1517.958782  1.864793e+06    -0.136530
std     887.148232  2.168736e+06     0.017877
min       2.000000  1.600000e+04    -0.151770
25%     743.000000  3.000000e+05    -0.149429
50%    1507.000000  1.000000e+06    -0.143659
75%    2296.000000  3.000000e+06    -0.127173
max    3044.000000  9.700000e+06    -0.071944

Winsorizing Outlier Method

Winsorizing fixes lower 5% values to 5th Percentile and higher 5% values to 95th Percentile and does not drop any values. (you can change these limits)

from scipy.stats.mstats import winsorize

# Winsorize the data (limits=[lower, upper] for percentile capping)
winsorized_values = winsorize(fd['AmountInUSD'], limits=[0.05, 0.05])

# Create a new DataFrame with Winsorized data
fd_winsorized = pd.DataFrame(winsorized_values, columns=['AmountInUSD'])

outlier_info(0, 0, method='Winsorize', new_fd=fd_winsorized)

Result:

Outlier Results: (Using Method: Winsorize)

Total: 2,066
Dropped: 0
Percent-Dropped: 0.0

        AmountInUSD
count  2.066000e+03
mean   8.655042e+06     8.6 Million
std    1.546028e+07      15 Million
min    1.000000e+05     0.1 Million
25%    4.712500e+05     0.5 Million
50%    1.725000e+06     1.7 Million
75%    8.000000e+06       8 Million
max    6.000000e+07      60 Million

Percentile Cap

Cap the lower/upper bounds to be 5% and 95%, for example.

low = fd['AmountInUSD'].quantile(0.05)
high = fd['AmountInUSD'].quantile(0.95)

outlier_info(low, high, method='Percentile-Cap')

Result :

Outlier Results: (Using Method: Percentile-Cap)
Lower Bound: 100000.0; Higher Bound: 60000000.0
Total: 2,066
Dropped: 162
Percent-Dropped: 7.8

        AmountInUSD
count  1.904000e+03
mean   6.331311e+06     6.3 Million
std    1.082068e+07    10.8 Million
min    1.000000e+05     0.1 Million
25%    5.000000e+05     0.5 Million
50%    1.500000e+06     1.5 Million
75%    6.500000e+06     6.5 Million
max    6.000000e+07    60.0 Million

Counting Missing Values

df.isnull().sum()       # Total missing values. 
                        # Result: Series; Col as index and sum as value.
df.isnull().count()     # Total constant count as Series.

df[col].isnull().sum()  # Scalar sum value. Total missing count.
df[col].isnull().count()# Total constant as total rows.

Merging Dataframes

df_merged = df1.merge(df2, on='common_col', 
                           suffixes=('_one', '_two'), 
                           how='inner'    # Default
      #   left_on = 'col1'
      #   right_on = 'col2'
      #   left_index = True  ; Joining key is index.
      #   right_index = True ; 
      #   validate = 'one_to_one' # Check no duplicates due to merge.
   )
# Otherwise, common columns are auto-renamed with _x, _y suffixes by default.

See Also:  pd.merge_ordered(df1, df2, on='col1', # or left_on, right_on
                    # left_by = 'col', right_by='col',
                    how='inner', # Default: 'outer'!!!
                    fill_method='ffill', # Forward Filling missing values
           )
           pd.merge_asof(df1, df2, 
                 direction = 'nearest' | 'forward' | 'backward' # Row to match.
                 ...) 
           # Like a left join, except rows are matched on the nearest key rather than equal keys.

Semi Join

  • Also known as Filtering Join.
  • Like inner Join, but only columns of A retained.
  • Semi_Join(A, B) => Rows of A which have at least one match in B.
  • No duplicate rows in result.
df2_ids = df2['id']
result = df1[ df1['id'].isin(df2_ids) ]

Anti Join

  • Just the opposite of Semi Join. A filter join on negative match with B.
  • Anti_Join(A, B) => Rows Of A without any match in B.
merged = df1.merge(df2, on='id', how='left', indicator=True)

# merged['_merge'] is one of: 'left_only' | 'both' | 'right_only'

result = merged[ merged['_merge'] == 'left_only' ]
# Drop B's columns.

df.query()

df = df.query('Qty > 1 and Qty < 3')
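
query() can also reference Python variables with @; a small sketch, using a
hypothetical qty_max variable:

qty_max = 3
df = df.query('Qty > 1 and Qty < @qty_max')   # @ references a local variable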

Pandas Interval Type

  • The Interval type is useful for dividing a numerical column into ranges and grouping by each range as a category.
  • Pandas defines two important data types:
    • pandas._libs.interval.Interval or Interval
    • pandas.core.arrays.categorical.Categorical or simply dtype='category'
interval = pd.Interval(1, 5, closed='both')    # [1, 5]
interval = pd.Interval(1, 5, closed='right')   # (1, 5]

# Interval from array
intervals = pd.IntervalIndex.from_arrays([1, 2, 3], [5, 6, 7], closed='both')
#  list of [1,5], [2,6], [3,7] intervals


# Interval Operations:
# Membership: value in interval        # e.g. 3 in pd.Interval(1, 5) -> True
# Overlaps:   interval.overlaps(interval2)
# Attributes: interval.left, interval.right, interval.mid, interval.length
# (pd.Interval does not support set ops like & or |; use membership/overlaps.)

# IntervalIndex: A pandas Index composed of Intervals.

intervals = pd.IntervalIndex.from_tuples([(1, 5), (6, 10), (11, 15)], closed='both')

# Range-based filtering: keep rows whose value falls inside any interval.
# (Series.isin() tests equality, not containment, so use get_indexer instead.)
df[intervals.get_indexer(df['values']) != -1]   # -1 means no interval matched

Pandas Category Type

#
# Use pd.cut() to return series of categories given series/array of values.
#
pd.cut(x,                    # Input array or Series
       bins,                 # bins=3 means divide into 3 equal-width intervals.
                             # bins=array means use the array values as bin edges.
       right=True|False,     # right=True (default): intervals are (a, b],
                             #   i.e. right edge included, left edge excluded.
                             # right=False: intervals are [a, b).
       labels,               # Category labels. By default, the interval itself
                             #   becomes the category name.
                             # labels=False returns integer bin codes 0, 1, 2, ...
       retbins=True|False,   # Also return the computed bin edges.
       ordered=True|False,   # Usually True for numerical categories.
       include_lowest=True|False   # Whether the first interval should include
                                   # the lowest value (left edge).
  )

pd.cut(range(7), 3, labels=['low', 'medium', 'high'])
# Divides input array into 3 equal bins and then maps input to categories.

# Output:
   ['low', 'low', 'low', 'medium', 'medium', 'high', 'high']  # Input Size = Output Size
   Categories (3, object): ['low' < 'medium' < 'high']  # Labels becomes categories.

pd.cut(range(7), 3)

# Output:

 [(-0.006, 2.0], (-0.006, 2.0], ... , (4.0, 6.0], (4.0, 6.0]] # 7 values, each mapped to an interval
 Categories (3, interval[float64, right]): [(-0.006, 2.0] < (2.0, 4.0] < (4.0, 6.0]]
 # The interval becomes the category label by default!
  • Another way to create custom categories for non-numeric data:

    conditions = [
        df['values'] == 'A',
        df['values'] == 'B',
        df['values'] == 'C'
    ]
    
    choices = ['Category A', 'Category B', 'Category C']
    
    df['category'] = np.select(conditions, choices, default='Other')
    
  • Group by category after using pd.cut to bin a numerical range:

    # Define bins and labels
    bins = [0, 10, 20, 30, 40, 50, float('inf')]  # Include an open-ended upper bin
    labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50+']
    
    # Assign bins to Sales column
    df['Sales_Bin'] = pd.cut(df['Sales'], bins=bins, labels=labels, right=False)
    # right=False means for each interval right side value is not included.
    # left value is always included.
    
    # Group by bins
    grouped = df.groupby('Sales_Bin')['Sales'].sum().reset_index()
    print(grouped)
    

Misc Pandas Notes

  • df['Qty'].cumsum() returns new column of cumulative sum.
  • See Also: cummax(), cummin(), cumprod()
  • df['Qty'].mean() # See Also: median(),
  • pd_series.reset_index() converts a Series into a DataFrame, with the old index as one column and the values as another.
  • Index need not be unique. Duplicate values in index allowed.
  • Merging: pd.merge(df1, df2, on='A')
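
A minimal sketch of the cum* helpers and reset_index above, using a tiny made-up frame:

import pandas as pd

df = pd.DataFrame({'Qty': [2, 5, 1, 7]})

df['Qty'].cumsum()        # 2, 7, 8, 15 -- running total as a new Series
df['Qty'].cummax()        # 2, 5, 5, 7

s = df['Qty'].value_counts()
s.reset_index()           # DataFrame: old index as a column, counts as another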

Data Analysis

Summary stats: df.describe()
Correlation: df.corr()
Data cleaning: df.dropna(), df.fillna()

Data Input/Output

CSV: pd.read_csv('data.csv'), df.to_csv('data.csv')
Excel: pd.read_excel('data.xlsx'), df.to_excel('data.xlsx')

Data Transformation

Melting: pd.melt(df, id_vars='A', value_vars='B')
Pivoting: df.pivot_table(index='A', columns='B', values='C')

Key Functions:

head(), tail(): View first/last few rows
info(), describe(): View data summary
shape, size: View data dimensions
columns, index: View column/index labels

Notes On Statistics

Measures of Center

  • Histogram involves bins and counts.

  • Mean, Median, Mode

  • Mode is the most frequent value in the set. Useful for non-numerical / categorical data.

  • To find mode:

    df[col].value_counts().head(1)  # Gives the (value, count) pair with the
                                    # highest count.
    
    import statistics
    statistics.mode(df[col])  # Gives you the (value) for the most frequent item.
    

Measures of Spread

  • np.var (variance), np.std (std deviation)

  • Mean absolute deviation: np.mean(np.abs(distance_from_mean)); similar to np.std, but np.std penalizes larger deviations more.

  • Quantiles -- the 0.50 quantile is the median:

    np.quantile(df[col], 0.5)
    
  • Quartiles -- split the data into 4 equal parts using the quantile points [0, .25, .5, .75, 1]; the 5 points define 4 intervals:

    np.quantile(df[col], [0, 0.25, 0.5, 0.75, 1.0])  # 5 points are quartiles. Q0 ... Q4
    
  • The boxplot draws the Q1-Q3 box with the median (Q2) line inside:

    plt.boxplot(df[col]) ; plt.show()
    
  • Get quantiles using custom linear space intervals :

    np.quantile(df[col], np.linspace(0, 1, 5)) # linspace gives: [0, 0.25, 0.5, 0.75, 1]
    # Output: 5 quantile values.
    
  • Interquartile Range (IQR): distance between the 0.75 and 0.25 quantiles, e.g. scipy.stats.iqr(df[col]); the height of the boxplot box.

  • Outliers. Rule of thumb:

    data < Q1 - (1.5 x IQR) or data > Q3 + (1.5 x IQR)
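
A minimal sketch of this rule, assuming df[col] is a numeric column:

    import numpy as np

    q1, q3 = np.quantile(df[col], [0.25, 0.75])
    iqr = q3 - q1                        # same value as scipy.stats.iqr(df[col])
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = df[(df[col] < lower) | (df[col] > upper)]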
    

Measuring Chance

  • Probability of an event = favorable outcomes / total possible outcomes.
  • Sample 2 rows from a data frame: df.sample(2, replace=False)
  • You can sample with or without replacing each pick back into the bin (see the sketch below).
  • When the variable involved is continuous, the probability is computed using area under the density curve.
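
A small sketch of sampling, assuming an existing DataFrame df:

df.sample(2)                    # Without replacement (default).
df.sample(5, replace=True)      # With replacement; the same row may repeat.
df.sample(2, random_state=42)   # Seeded for reproducible samples.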

Generating Uniform and Normal Random numbers

  • To generate uniform random numbers, you can either use scipy.stats.uniform or np.random
  • scipy.stats functions make use of np.random under the hood, so call np.random.seed(123) for reproducible results.
from scipy.stats import uniform

# Suppose You are waiting for bus that comes every 12 mins. 
#     prob(wait_time <= 12) = 1
#     prob(wait_time <= 6) = 0.5

# cumulative distribution function. (x, min, interval_size)
# loc=0 is minimum. scale=interval_size
prob = uniform.cdf(6, loc=0, scale=12)  
# prob is 0.5

# RVS = Random Variates of Given Size; Generate random variables of given size.
uniform.rvs(loc=0, scale=1, size=1)
uniform.rvs()
# output: 0.3862

uniform.rvs(size=5)
# output: array([x, ..., y]) 
uniform.rvs(size=(2,3))  # Generates 2x3 matrix of random numbers in 0<x<1

np.random.randn()   # One sample from the standard normal distribution
                    #   (mean 0, std 1; unbounded, not limited to -1..1).
np.random.randn(5)  # generates 5 numbers as array.
np.random.randn(3, 3)  # generates a (3,3) matrix of standard normal samples.

# default_rng() is recommended over the legacy np.random.* functions
rng = np.random.default_rng(seed=42)
rng.random()  # generates a random number 0 < x < 1; Uniform distribution.

# 0.6823518632481435

rng.standard_normal(size=(2,3))  # Generates normally distributed numbers.
                                 # Same as np.random.randn(2, 3)

array([[-0.36778665,  1.28792526,  0.19397442],  # mean=0; stddev=1 is fixed.
       [ 0.9202309 ,  0.57710379, -0.63646365]])

rng.uniform(0, 5, (2,3))  # between 0 to 5, generate 2x3 matrix. Uniform distribution.
array([[1.38287199, 4.09877281, 4.44946347],
       [2.56485228, 1.22482301, 4.12120798]])

# See Also: rng.exponential(), rng.normal(), rng.logistic(), rng.uniform()

Generating Random 1s and 0s

  • Each outcome is a success (1) or fail (0).
  • We are interested in total number of successes in n trials.
  • It is a discrete distribution.
from scipy.stats import binom
binom.rvs(no_of_coins, prob_of_success, size=None)

binom.rvs(10, 0.5, size=1)
# 5 or 4 or 6 etc.           # Meaning: 10 coins thrown in 1 trial -> e.g. 5 successes.
                             # Like rolling several binary dice at once.

binom.rvs(10, 0.5, size=(2,3))

# Output:
array([[5, 4, 6],      # Each element is centered around 10*0.5 = 5 successes.
       [5, 3, 5]])

binom.rvs(1, 0.5, size=10)
array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]) 

Probability for specific total successes

  • pmf - Probability Mass Function is used to find the probability of specific (total) result out of series of binary events.
# Define parameters
n = 10  # Number of trials
p = 0.5  # Probability of success
k = 5    # Number of successes

# Out of 10 trials, a total of 5 successes has the highest probability.
# Out of 1 trial, P(total = 0) = P(total = 1) = 0.5
# Out of 2 trials, P(total = 1) = 0.5 (the outcomes are 0, 1, 1, 2)
# Out of 10 trials, P(total = 5) is well below 0.5,
#             since totals 4 and 6 also have high probabilities.

# Probability Mass Function (PMF)

for k in range(11):
    pk = binom.pmf(k, n, p)
    print(f'P(X = {k}) = {pk:.4f}')

# Output for different k's ...
P(X = 0) = 0.0010
P(X = 1) = 0.0098
P(X = 2) = 0.0439
P(X = 3) = 0.1172
P(X = 4) = 0.2051

P(X = 5) = 0.2461    # Exactly 5 successes out of 10 trials is less than 25%!

P(X = 6) = 0.2051    # Same as P(X = 4)
P(X = 7) = 0.1172
P(X = 8) = 0.0439
P(X = 9) = 0.0098
P(X = 10) = 0.0010

binom cdf

  • To find P(Total_Heads <= k out of 10 trials), we use binom.cdf(k, n_trials, success_p)
for k in range(11):
    cdk = binom.cdf(k, n, p)
    print(f'P(X <= {k}) = {cdk:.4f}')

# Output for k in 0 to 10:

P(X <= 0) = 0.0010
P(X <= 1) = 0.0107
P(X <= 2) = 0.0547
P(X <= 3) = 0.1719
P(X <= 4) = 0.3770      # 20% increase! == P(X = 4)
P(X <= 5) = 0.6230      # 25% increase! == P(X = 5)
P(X <= 6) = 0.8281
P(X <= 7) = 0.9453
P(X <= 8) = 0.9893
P(X <= 9) = 0.9990
P(X <= 10) = 1.0000

Normal Distribution

  • Standard normal distribution: mean=0; stddev=1; Prob(-1 < x < 1) = 68%
  • Prob(-2 < x < 2) = 95%
  • Prob(-3 < x < 3) = 99.7%
from scipy.stats import norm
norm.cdf(k, mean, std)

# Define parameters
mu = 0  # Mean
sigma = 1  # Standard Deviation

# Generate x values
x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 100)

# Calculate PDF (probability density function)
pdx = norm.pdf(x, loc=mu, scale=sigma)

# Cumulative distribution function: P(X <= x)
cdx = norm.cdf(x, loc=mu, scale=sigma)

norm.cdf(154, 161, 7)  # Avg height = 161; stddev=7; P(height<154)
# 0.1586 

# P(154 < height < 157) == cdf(157) - cdf(154)

# Plot PDF and CDF
plt.plot(x, pdx, label='PDF')
plt.plot(x, cdx, label='CDF')
plt.legend()
plt.show()

# Calculate the point for the given Percentile (or cdf)

q = 0.95
mu = 161   # Average Height
sigma = 7  # Std Dev
x = norm.ppf(q, loc=mu, scale=sigma)           # Percent Point Function
print(f'X corresponding to P = {q}: {x:.4f}')

# X corresponding to P = 0.95: 172.5140

# Generate random variates
rv = norm.rvs(loc=mu, scale=sigma, size=3)

# array([147.63477247, 148.09195502, 163.61683838])

Central Limit Theorem

  • The sampling distribution of a statistic (e.g. the mean) approaches a normal distribution as the number of samples grows.
  • This holds regardless of the distribution the samples are drawn from (uniform, binomial, etc.), as long as the samples are independent.
  • i.e. take k observations, compute their mean, and repeat n times: as n becomes large, the n means are approximately normally distributed. (A sketch follows below.)
  • This is useful for surveying smaller groups and extrapolating to a bigger population, assuming the sample means are normal.
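
A minimal sketch, using die rolls (uniform, not normal) to show the effect:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

# 1000 sample means, each computed over 20 die rolls.
means = [rng.integers(1, 7, size=20).mean() for _ in range(1000)]

plt.hist(means, bins=30)   # Approximately bell-shaped, centered near 3.5.
plt.show()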

Poisson Distribution

The Poisson distribution is a discrete probability distribution modeling the number of events occurring in a fixed interval.

  • lambda: Average rate of events per fixed interval (the distribution's parameter)
  • Discrete distribution and Non-negative integer values
  • Mean = Variance = lambda
  • poisson.pmf(k, mu): Probability Mass Function (PMF); Returns the probability of k events.
  • poisson.cdf(k, mu): Cumulative Distribution Function (CDF); Returns the probability of k or fewer events.
  • poisson.rvs(mu, size): Random Variates; Generates random numbers from the Poisson distribution.
  • poisson.sf(k, mu): Survival Function (SF); Returns the probability of more than k events. (1 - cdf(k))
  • poisson.isf(q, mu): Inverse Survival Function (ISF) ; Returns the number of events corresponding to a given survival probability q.
from scipy.stats import poisson
import numpy as np
import matplotlib.pyplot as plt

# Define parameter
mu = 5  # Average rate of events

# Generate k values
k = np.arange(0, 15)

# Calculate PMF and CDF
pmf = poisson.pmf(k, mu)
cdf = poisson.cdf(k, mu)

# Plot PMF and CDF
plt.plot(k, pmf, label='PMF')
plt.plot(k, cdf, label='CDF')
plt.legend()
plt.show()

# Generate random variates
rv = poisson.rvs(mu, size=1000)

# Calculate probability
k_val = 8
prob = poisson.pmf(k_val, mu)
print(f'P(X = {k_val}) = {prob:.4f}')

# P(X = 8) = 0.0653   # Note: lambda = mu = 5

Exponential Distribution

  • Modeling the time between events in a Poisson process, where:
    • λ (lambda): Rate parameter (average number of events per unit time)
    • θ (theta): Scale parameter (1/λ)
  • Continuous probability distribution
  • Non-negative values; memoryless property
  • expon.pdf(x, scale): Probability Density Function (PDF)
  • expon.cdf(x, scale): Cumulative Distribution Function (CDF); P(X <= x)
  • expon.rvs(scale, size): Random Variates
  • expon.sf(x, scale): Survival Function (SF); P(X > x) = 1 - cdf(x)
  • expon.isf(q, scale): Inverse Survival Function (ISF); expon.isf(0.95, scale) finds the wait time t such that P(X > t) = 0.95
from scipy.stats import expon
import numpy as np
import matplotlib.pyplot as plt

# Define parameter
scale = 1/2  # Scale parameter (1/λ)

# Generate x values
x = np.linspace(0, 10, 100)

# Calculate PDF and CDF
pdf = expon.pdf(x, scale=scale)
cdf = expon.cdf(x, scale=scale)

# Plot PDF and CDF
plt.plot(x, pdf, label='PDF')
plt.plot(x, cdf, label='CDF')
plt.legend()
plt.show()

# Generate random variates
rv = expon.rvs(scale=scale, size=1000)

# Calculate probability
x_val = 5
prob = expon.cdf(x_val, scale=scale)
print(f'P(X ≤ {x_val}) = {prob:.4f}')

Use Cases:

  • Modeling time-to-failure or survival analysis
  • Analyzing queueing systems and waiting times
  • Simulating reliability and maintenance models

T Distribution

  • Like the normal distribution but with a larger spread (heavier tails).
  • More stretched across the x-axis.
  • Parameterized by (small) degrees of freedom.
  • The higher the degrees of freedom, the closer it gets to the normal distribution, with a taller bell. (See the sketch below.)
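
A minimal sketch comparing the t and normal PDFs, assuming scipy and matplotlib:

from scipy.stats import norm, t
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x), label='normal')
plt.plot(x, t.pdf(x, df=3), label='t, df=3')    # fatter tails, shorter peak
plt.plot(x, t.pdf(x, df=30), label='t, df=30')  # nearly normal
plt.legend()
plt.show()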

Log Normal Distribution

  • Means log(x) follows a normal distribution.
  • The frequent values are narrowly concentrated on smaller intervals, and the distribution is skewed.
  • Its shape is very different from the exponential distribution, which falls sharply from a peak at 0 and is all positive.
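
A small sketch, assuming numpy: taking the log of lognormal samples recovers a normal shape.

import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.lognormal(mean=0, sigma=1, size=1000)  # skewed, all positive

log_samples = np.log(samples)                 # approximately normal
print(log_samples.mean(), log_samples.std())  # close to 0 and 1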

Seaborn Plotting

  • Seaborn is a Python visualization library built on top of Matplotlib.

  • Aesthetic Defaults

  • High-Level Abstractions

  • Distribution Plots:

    • sns.distplot(): Histograms and kernel density estimates. (Deprecated; newer seaborn uses sns.histplot()/sns.displot().)
    • sns.kdeplot(): Kernel density estimates.
  • Categorical Plots:

    • Categorical plots group by a category and draw bars, counts, or boxes.
    • sns.catplot(): Generic plot by category. e.g. sns.catplot(x='col1', data=df, kind='count')
    • sns.barplot(): Bar plots for categorical data. Mean for each category.
    • sns.countplot(): Count plots for categorical data. e.g. sns.countplot(x='col1', data=df)
    • sns.boxplot(): Box plots for categorical data.
    • sns.pointplot(): Shows the mean of a quantitative variable with a 95% confidence interval. Looks like a line plot, but one axis is a category. e.g. sns.catplot(x='smoker', y='total_bill', data=tips, kind='point')

:

sns.catplot(x='country', data=df,
                kind='count',         # same as sns.countplot()
              # kind='bar',           # Displays the mean for each category;
              #                       #   also shows a confidence interval on top.
              # kind='box',           # Shows median, spread, skewness, outliers
              # kind='point',         # Shows mean, ci (95%)
                order=['USA', 'INDIA', 'CHINA'])
  • Regression Plots:

    • sns.regplot(): Simple linear regression plots.
    • sns.lmplot(): More complex linear regression plots.
  • Scatter Plots:

    • sns.scatterplot(): Scatter plots for relationships between two variables.
    • sns.jointplot(): Joint plots for relationships between two variables.
  • Relational Plot:

    • sns.relplot(): Draws one plot or an array of subplots sharing the y-axis. Offers a lot of flexibility.
  • Heatmaps:

    • sns.heatmap(): Heatmaps for 2D data.
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
tips = sns.load_dataset("tips")
flights = sns.load_dataset("flights")

# Distribution Plot
sns.distplot(tips["total_bill"])  # Deprecated; use sns.histplot() in newer seaborn.
plt.show()

# Categorical Plot
sns.barplot(x="sex", y="total_bill", data=tips)
plt.show()

# Regression Plot
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()

# Scatter Plot
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()

# Joint Plot
sns.jointplot(x="total_bill", y="tip", data=tips)
plt.show()

# Heatmap
flights_pivot = flights.pivot_table(index="month", columns="year", values="passengers")
sns.heatmap(flights_pivot)
plt.show()
  • Customization:

    • Colors: Use sns.set_palette() or sns.color_palette().
    • Styles: Use sns.set_style() or sns.axes_style().
    • Fonts: Use sns.set_theme(font_scale=...) / sns.set_context(font_scale=...) or plt.rcParams["font.family"].
  • Use Seaborn's built-in datasets for testing:

    sns.get_dataset_names()
    
    [
     'car_crashes', 'exercise', 'flights', 'iris',      'planets',
     'seaice',      'taxis',    'titanic', 'dowjones',
    ]
    
  • Tips:

    • Explore Seaborn's example gallery for inspiration.
    • Customize plots using Matplotlib functions.
  • Scatterplot Options:

    • hue: Color points by category.
    • style: Change point style.
    • size: Change point size.
    • palette: Specify color palette.
# different categories will be colored differently using hue option.
sns.scatterplot(x="total_bill", y="tip", data=tips, hue="sex", style="sex")
plt.show()

Seaborn relplot vs scatterplot

  • You can easily create multiple scatter plots using seaborn without having to manage matplotlib's multi-dimensional arrays of axes.
  • Use relplot() (relational plot) instead of scatterplot()
sns.relplot(x='col1', y='col2', data=df, kind='scatter',
           col='col3',    # Another dimension: subplots stacked as columns.
           row='col4',    # Another dimension: subplots on the next row.
           col_wrap=2,    # Wrap when too many subplots don't fit in one line.
           col_order=['col3_val1', 'col3_val2'],  # Subplot order by value.
           ci='sd',       # (kind='line') confidence interval: std dev or None.
                          #   Default is a 95% confidence interval.
           hue='col5',    # Vary color according to col5.
           style='col6',  # Vary line style (dashes etc.) as per col6.
           markers=True,  # Put a marker like +, x, * at each data point.
           dashes=False,  # Do not use dashed lines.
)
  • If you set ci='sd' for a line plot, it draws a shaded band along the line whose thickness reflects the spread; the default band is a 95% confidence interval.
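
A minimal sketch of this shaded band, assuming the built-in fmri example dataset
(newer seaborn versions use errorbar='sd' instead of ci='sd'):

import seaborn as sns
import matplotlib.pyplot as plt

fmri = sns.load_dataset('fmri')

# kind='line' aggregates repeated x values into a mean line;
# ci='sd' shades one standard deviation around it.
sns.relplot(x='timepoint', y='signal', data=fmri, kind='line', ci='sd')
plt.show()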

seaborn style customize

  • Figure style includes background and axes.
  • Preset options: white, dark, whitegrid, darkgrid, ticks ('darkgrid' is the default seaborn theme).
  • sns.set_style('whitegrid')
  • sns.set_palette('RdBu') # PRGn (Purple to green), RdBu_r (Red-Blue:Reverse) 'Greys', 'Blues', 'PuRd' (Purple to Red)
  • sns.set_palette(['#FBB4AE', '#B3CDEE', ....])
  • sns.set_context('paper' | notebook | talk | poster)
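
A minimal sketch applying the style hooks above, assuming the built-in tips dataset:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')   # background/axes style
sns.set_palette('RdBu')      # color palette
sns.set_context('talk')      # larger fonts, e.g. for presentations

tips = sns.load_dataset('tips')
sns.countplot(x='day', data=tips)
plt.show()
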
g = sns.relplot(...)  # or catplot()
type(g)
<class 'seaborn.axisgrid.FacetGrid'>

g.fig.suptitle('... Title ... ', y=1.03) # Place title slightly higher

g.set(xlabel='My x label', ylabel='My y label')

# Set x ticks rotation.
plt.xticks(rotation=90)

g = sns.boxplot(...)
type(g)   
<class 'matplotlib.axes._subplots.AxesSubplot'>

g.set_title('...title...')

# scatterplot(), boxplot(), etc. return an axes subplot, not a FacetGrid

correlation

  • df['col1'].corr(df['col2']) finds a numeric correlation.

  • Correlation only detects linear relationships between numeric columns.

  • Non-linear relationships are best discovered through visualization.

  • Sometimes log(x) or 1/x is more linearly correlated with y:

    pds1.corr(pds2)           # 0.1
    np.log(pds1).corr(pds2)   # 0.5    <=== Note this.
    
  • Correlation does not imply causation. e.g. coffee drinking appears correlated with cancer because people who smoke also drink more coffee, and smoking causes cancer.

  • Here the third variable (smoking) is called a confounder or lurking variable.

Matplotlib Plotting

Basic Plots and Shared x Axis

fig, ax = plt.subplots()
ax.plot(df[col1], df[col2], 
          marker="o",        # v etc
          linestyle="--",
          color="r")

ax.set_xlabel("Time in years")
ax.set_ylabel(" Amount ")
ax.set_title(" Time vs Amount ")

# Multiple subplots

fig, ax = plt.subplots(2, 1,     # 2 x 1  plots
               sharey = True     # Prevents y autoscaling differently
          )
ax[0].plot(...)
ax[1].plot(...)

# Using twin x axis we can share x axis, use different y axis.
ax2 = ax.twinx()
ax2.tick_params('y', colors='red') 

Annotation

ax.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1))

ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1), 
     xytext=(pd.Timestamp('2008-10-06'), -0.2),      # Position text elsewhere
     arrowprops={'arrowstyle':'->', 'color':'gray'}) # Add arrow to text

Stacked Bar Chart

# Gold + Silver + Bronze medals stacked on top of each other.
ax.bar(medals.index, medals['Gold'], label='Gold')
ax.set_xticklabels(medals.index, rotation=90)
ax.set_ylabel("Number of medals")

ax.bar(medals.index, medals["Silver"], bottom=medals["Gold"], label="Silver")
ax.bar(medals.index, medals["Bronze"], bottom=medals["Gold"]+medals["Silver"], 
                                                              label="Bronze")
ax.legend()

Multiple Histograms

mr = mens_rowing
mg = mens_gymnastics

ax.hist(mr['Weight'], label='Rowing',     histtype='step', bins=5) # Use hollow bars
ax.hist(mg['Weight'], label='Gymnastics', histtype='step', bins=5)

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")

ax.legend() # Add the legend and show the Figure
plt.show() 

Error Bars

  • Can be plotted using ax.bar() or ax.errorbar() methods.

ax.bar() method to represent errors/std-dev:

  • Just the mean and std-deviation of df[col] are represented by the bar!
  • Multiple error bars are drawn next to each other to compare the means and std-deviations.
.
.        +---|---+  Middle line represents std-dev
.        |       |                                        +---|---+ 
.        |       |                                        |       | 
.        |       |   Height Represents Mean.              |       | 
.        |       |                                        |       | 
.        |       |                                        |       | 
.                                                      

ax.bar("Rowing", mens_rowing['Height'].mean(), yerr=mens_rowing['Height'].std()), 

ax.errorbar() Method:

  • Every (x, y) point has an associated std-deviation, i.e. (x, y, y_std).
  • Each y is itself a micro-average over a small interval, so it also has a micro-std.
  • The result is a line plot connecting all the points, plus a small vertical bar at each observation representing its std-deviation!
.
.
.         |-----|
.        /       \
.       |         |-------|
.                          \--------
.

 ax.errorbar(df[col1], df[col2_avg], yerr=df[col2_std])
 ax.set_ylabel('Y label')

Boxplots

  • Shows a box spanning Q1 to Q3 (height = IQR) with the median line inside.
  • Whiskers above and below the box mark the 1.5*IQR thresholds for outliers.
ax.boxplot([mr['Height'], mg['Height']])
ax.set_xticklabels(['Rowing', 'Gymnastics'])
ax.set_ylabel('Height (cm)')

Scatter Plot

ax.scatter(climate_change['co2'], climate_change['relative_temp'])
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (C)")

Matplotlib Style

plt.style.use('ggplot')         # or 'default', 'bmh', 'seaborn-colorblind',  

Saving Figure

# plt.show()
fig.set_size_inches([5, 3])            # width, height in inches
fig.savefig('myimage.png', dpi=300)
fig.savefig('myimage.jpg', pil_kwargs={'quality': 50})  # 'quality=' kwarg was
                                                        # removed in newer matplotlib