Data Science Notes
These are high-level notes based on DataCamp courses.
Python variables have different scopes (local, enclosing, global, built-in):
Example for the enclosing scope (nonlocal):
def outer_func(x):
    n = 10
    def inner_func(y):
        nonlocal n   # n refers to the variable in the enclosing outer_func scope.
        n = 20       # Rebinds the enclosing n.
        return 30
    inner_func(x)
    return n         # Returns 20, because inner_func modified the enclosing n.
pip install numpy
a = np.array([1, 2, 3, 4, 5, 6])
# array([1, 2, 3, 4, 5, 6])
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a.ndim
# 2
a.shape
# (3, 4)
len(a.shape) == a.ndim
# True
# size == Fixed total number of elements in array
a.size
# 12
import math
a.size == math.prod(a.shape)
# True
# Arrays are typically “homogeneous”;
# The data type is recorded in the dtype attribute.
a.dtype
# dtype('int64') # "int" for integer, "64" for 64-bit
np.zeros(2)
# array([0., 0.])
# See Also: np.zeros((3,4))
np.ones(2)
# array([1., 1.])
# Create empty array! Contents are random.
np.empty(2)
# array([3.14, 42. ]) # may vary
np.arange(4)
# array([0, 1, 2, 3])
np.arange(2, 9, 2)
# array([2, 4, 6, 8])
# np.linspace() : Divide given interval to yield given number of points.
np.linspace(0, 10, num=5)
# array([ 0. , 2.5, 5. , 7.5, 10. ])
# Default data type is: np.float64, you can override data type.
x = np.ones(2, dtype=np.int64)
x
# array([1, 1])
# Adding, removing, and sorting elements
arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
np.sort(arr) # Creates new sorted copy.
# array([1, 2, 3, 4, 5, 6, 7, 8])
# argsort, which is an indirect sort along a specified axis,
# lexsort, which is an indirect stable sort on multiple keys,
# searchsorted, which will find elements in a sorted array, and
# partition, which is a partial sort.
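# A quick sketch (illustrative values) of the indirect-sort helpers listed above:
import numpy as np
arr = np.array([2, 1, 5, 3])
idx = np.argsort(arr)             # array([1, 0, 3, 2]) : indices that would sort arr
arr[idx]                          # array([1, 2, 3, 5])
np.searchsorted([1, 2, 3, 5], 4)  # 3 : insertion position that keeps the array sorted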
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
np.concatenate((a, b)) # Note: NOT the same as a + b (which adds element-wise).
# array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
np.concatenate((x, y), axis=0) # Concatenate is Not same as x + y
# array([[1, 2],
# [3, 4],
# [5, 6]])
#
# Above is same as: np.vstack((x, y))
#
x + y # y is added to every row of x.
# array([[ 6, 8],
# [ 8, 10]])
np.concatenate((x, y), axis=1) # axis 1 concatenate also Not same as x + y
# Error: Dimensions do not match to concatenate!
# y must have two rows.
# Find out the shape and size of an array ...
# Use ndarray.ndim, ndarray.size, ndarray.shape
array_example = np.array([[[0, 1, 2, 3], # 3D Array
[4, 5, 6, 7]],
[[0, 1, 2, 3],
[4, 5, 6, 7]],
[[0 ,1 ,2, 3],
[4, 5, 6, 7]]])
array_example.ndim
# 3
array_example.size
# 24
array_example.shape
# (3, 2, 4)
a = np.arange(6)
# [0 1 2 3 4 5]
b = a.reshape(3, 2) # same as np.reshape(a, (3, 2)). Returns a new array object, usually a view over the same data.
print(b)
# [[0 1]
# [2 3]
# [4 5]]
a = np.array([1, 2, 3, 4, 5, 6])
a.shape
# (6,)
# You can use np.newaxis to add a new axis:
a2 = a[np.newaxis, :]
a2.shape
# (1, 6)
a2
# array([[1, 2, 3, 4, 5, 6]]) # Note: Array of Array.
#
b = np.expand_dims(a, axis=1)
b.shape
# (6, 1)
b
# array([[1],
# [2],
# [3],
# [4],
# [5],
# [6]])
v_copy = v.copy() # .copy() returns a real copy; plain slices/views share data with the original.
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(a[a < 5])
# [1 2 3 4]
# Also See: a[a%2==0]; a[(a > 2) & (a < 11)]
# See Also:
# np.arange(1, 25).reshape(2, 12)
# v.min(), v.max(), v.sum()
#
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
#-------------------------------------------
# c1 c2 c3 c4
#-------------------------------------------
# array([[ 1,  2,  3,  4],   # Row One
#        [ 5,  6,  7,  8],   # Row Two
#        [ 9, 10, 11, 12]])  # Row Three
#-------------------------------------------
#
# a = Array_x(Array_y(integers))
# X-Axis === Axis 0
# Y-Axis === Axis 1
#
# Every row (i.e. along the Y-Axis) represents a 4-dimensional point.
# However, the array itself is a 2-D array.
#
# Remember every axis represents linear infinite points.
#
In a philosophical sense, every column is implicitly an independent dimension.
In theory, you can represent a single N-dimensional data point as a 1-D array:
[ c1, c2, ..., cn ]
Hence 2-D array is sufficient to represent any N-dimensional data:
[ c11, c12, ..., c1n ]
[ c21, c22, ..., c2n ]
[ c31, c32, ..., c3n ]
....
Imagine sales data having N-dimensional attributes (e.g. product, customer, date, qty, price).
While sourcing, we may collect 2-D data from several sources, say one table per country; the collection is then effectively 3-D data, the third dimension being the source (e.g. Country).
We often convert such 3-D data to a 2-D array by introducing a new column for the source.
Raw discrete data is difficult to understand and digest, so we usually want a consolidated view aggregated over a specific dimension.
Typically we keep the information flat in 2-D and aggregate on demand to visualize it over any specific dimension.
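A minimal sketch of this flattening idea (hypothetical sales numbers, hypothetical helper tag()): each source contributes a 2-D block, and a source-id column turns the 3-D collection into one flat 2-D array:
import numpy as np
india = np.array([[2, 1000], [1, 800]])     # columns: qty, price
usa   = np.array([[3, 1200], [4,  900]])
def tag(block, source_id):
    # Prepend a constant source-id column to every row of the block.
    ids = np.full((block.shape[0], 1), source_id)
    return np.hstack([ids, block])
flat = np.vstack([tag(india, 0), tag(usa, 1)])
# array([[   0,    2, 1000],
#        [   0,    1,  800],
#        [   1,    3, 1200],
#        [   1,    4,  900]])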
v * 10 # Multiplies every element by given scalar.
v1 * v2 # Not a matrix multiplication. Same positional values get multiplied.
1/v1 # Every element gets inverted.
np.unique(arr) # Yields unique values. Will flatten array.
v.min(), v.max() # Yields scalar value
v.ndim # Number of dimensions
v.reshape() # Reshape and create a new view over same base.
v.shape = (2, 5) # You can reshape, in-place by assigning!
v.base # The base array for the (reshaped) array.
v < 5 # Create boolean array
np.unique(v_2d) # Returns 1-D array of unique elements.
np.unique(v_2d, axis=0) # Returns list of unique rows.
v.transpose() == v.T # Returns transpose
np.flip(v) # Reverse Array. Both columns, rows reversed.
np.flip(v, axis=0) # Reverse only rows. Column order intact.
v.flatten() # New copy of flattened array.
v.ravel() # Flattened view over parent array (pointer; see the sketch after this list).
np.sum() # Prefer this over python built-in sum.
np.square()
np.save('filename', v) # Stores v in filename.npy file.
np.load('filename.npy') # Load variable from file.
np.savetxt('file.csv', v) # Stores numpy variable in csv file.
np.loadtxt('file.csv') # Load numpy variable from csv file.
x[..., 0] == x[:, :, 0] # x is 3d array.
x[i, j] == x[i][j] # Prefer x[i, j] over x[i][j]. Faster.
np.intp == np.int64 # np.intp is platform specific int type.
# Used with dtype=np.intp option.
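# A small sketch of the view-vs-copy behaviour of reshape, ravel, flatten, and .base
# mentioned in the list above:
import numpy as np
v = np.arange(10)
w = v.reshape(2, 5)        # view: w.base is v
w.ravel()[0] = 99          # ravel() is a view here, so v[0] becomes 99
w.flatten()[1] = -1        # flatten() is always a copy, so v[1] is unchanged
v[:2]                      # array([99, 1])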
data = np.array([[1, 2], [3, 4], [5, 6]])
data
# array([[1, 2],
# [3, 4],
# [5, 6]])
#
data.sum() # By default sum applies to all elements. Like reduce.
# 21
# +------ Reduces along axis 0 (the row axis, i.e. second-to-last axis):
# |       sums down each column, leaving one value per column.
# |       The reduced axis is lost. This is a Reduce operation. More common.
# |
data.sum(axis=0)
# array([9, 12])
# +------ Reduces along axis 1 (the column axis, i.e. last axis):
# |       sums across each row, leaving one value per row.
# |       The reduced axis is lost. This is a Reduce operation. Less common:
# |       unless the columns are homogeneous (same units), summing across
# |       columns is meaningless.
# |
data.sum(axis=1)
# array([3, 7, 11])
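A related sketch (not in the original notes): keepdims=True keeps the reduced axis with length 1, so the result still broadcasts against data:
import numpy as np
data = np.array([[1, 2], [3, 4], [5, 6]])
data.sum(axis=0, keepdims=True)         # array([[ 9, 12]])  shape (1, 2)
data / data.sum(axis=0, keepdims=True)  # column-wise normalized values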
v = np.arange(12)
v_slice = v[0:5] # By default slice is not a copy for numpy!
v_slice[:] = 100 # This is called broadcasting.
print(v)
# [100 100 100 100 100 5 6 7 8 9 10 11]
Broadcasting also works when you add two arrays: if one array has a singleton dimension (size 1), it is stretched over the other:
v.shape = (3, 4)
v + 1 # 1 added to each element. Result same shape.
v + [1] # 1 added to each element. Result same shape.
v + [[1]] # 1 added to each element. Result same shape.
v + [[[1]]] # Note: Result shape: (1, 3, 4)
v + [1, 2] # Error: shape(3,4) + shape(2,)(1d-array)
v + [[1, 2]] # Error: shape(3,4) + shape(1, 2): No Broadcast.
Result of Slicing retains total number of dimensions. Indexing result loses dim:
x = np.array([[[1],[2],[3]], [[4],[5],[6]]])
x.shape
# (2, 3, 1)
x[1:2] # Row 1 to 2, 2 not included. So only Row 1.
# array([[[4],
# [5],
# [6]]])
Using ellipsis (...), you can denote "all missing dimensions" :
y = x[..., 0] # Same as x[:, :, 0]
print(y)
# array([[1, 2, 3], # Shape: (2,3,1) => (2,3)
# [4, 5, 6]])
print(x.shape, y.shape)
# (2, 3, 1) (2, 3) # column dimension is lost.
You can introduce new dimension using np.newaxis (alias for None) :
y[:, :, np.newaxis] # This is same as x.
y[:, :, None] # This is same as x.
You can add (5, 1) array and (1, 5) array to create (5, 5) array! Basically result(i, j) = a(i) + b(j) :
x = np.arange(5) # [0, 1, 2, 3, 4]
x[:, np.newaxis] # [[0], [1], [2], [3], [4]]
x[np.newaxis, :] # [[0, 1, 2, 3, 4]]
x[:, np.newaxis] + x[np.newaxis, :] # Shape (5,1) + Shape (1, 5)
# # 0 1 2 3 4
# array([[0, 1, 2, 3, 4], # 0
# [1, 2, 3, 4, 5], # 1
# [2, 3, 4, 5, 6], # 2 +
# [3, 4, 5, 6, 7], # 3
# [4, 5, 6, 7, 8]]) # 4
# #
For array addition (broadcasting), each pair of dimensions must either match or at least one of them must be 1:
Shape1 Shape2 Result
(2, 2) (2, 1) (2, 2)
(2, 2) (1, 2) (2, 2)
(4, 2) (4, 1) (4, 2)
(4, 2) (1, 2) (4, 2)
(5, 1) (1, 5) (5, 5)
(4, 1, 5, 1) (2, 1, 3) (4, 2, 5, 3)
(2, 2) (2, 3) Not Compatible
(4, 2) (2, 2) Not Compatible
(4, 2) (2, 1) Not Compatible
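You can check these rules programmatically with np.broadcast_shapes (available in NumPy >= 1.20):
import numpy as np
np.broadcast_shapes((4, 1, 5, 1), (2, 1, 3))   # (4, 2, 5, 3)
# np.broadcast_shapes((2, 2), (2, 3))          # raises ValueError: shapes not compatible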
# Default random generator
# np.random.seed(1234) # For predictable sequence of random numbers.
# np.random.rand(4) # Generate 4 random numbers between 0 and 1
rng = np.random.default_rng(seed = 1234)
rng.random(3)
# array([0.63696169, 0.26978671, 0.04097352])
rng.random((3, 3)) # Generate random 3x3 2-D array.
rng.standard_normal((3, 3)) # Use normal distribution.
rng.integers(5, size=(2, 4))
A NumPy array is iterable, so a 1-D array can be easily manipulated with Python's built-in map function:
arr = np.array([1, 2, 3, 4, 5])
result = np.array(list(map(lambda x: 2*x, arr)))
You can also use np.vectorize for convenience; internally it is essentially a Python-level loop (not faster):
arr = np.array([ [1, 2], [3, 4]])
vfunc = np.vectorize(lambda x: x*x)
result = vfunc(arr)
# array([[ 1, 4],
# [ 9, 16]])
You can also create a ufunc from a Python function with np.frompyfunc(); the result supports broadcasting, but it still calls the Python function per element and returns an object-dtype array:
# np.frompyfunc(func, nin, nout) : Create a ufunc (universal function)
arr = np.array([ [1, 2], [3, 4]])
ufunc = np.frompyfunc(lambda x: 2*x, 1, 1)
result = ufunc(arr)
print(result) # [[2 4], [6 8]]
If you want to apply an aggregation function over the N-D array (effectively losing 1 dimension in that process), you can use np.apply_along_axis :
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = np.apply_along_axis(lambda x: sum(x), axis=1, arr=arr)
print(result) # [ 6 15]
Universal functions ufunc operate on ndarrays element-wise.
Supports array broadcasting and typecasting.
Can take n arguments. unary or binary, etc.
More than 60 built-in NumPy ufuncs are available (e.g. np.add, np.multiply, np.exp, np.sqrt).
Every ufunc (including ones generated with np.frompyfunc()) exposes these helper methods; reduce/accumulate need a binary (2-input) ufunc (see the sketch after this list):
- ufunc.reduce(array, axis, dtype, out, ...) - Reduces one dimension.
- ufunc.accumulate(array, axis, dtype, out, ...)
- ufunc.reduceat - Reduce using indices
- ufunc.outer - (A, B) : apply ufunc for all (a,b) for a in A, b in B
- ufunc.at(a, indices, ...) : apply ufunc given indices.
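A short sketch of these methods using the built-in binary ufunc np.add:
import numpy as np
a = np.array([1, 2, 3, 4])
np.add.reduce(a)                 # 10  (same as a.sum())
np.add.accumulate(a)             # array([ 1,  3,  6, 10])  (cumulative sum)
np.add.reduceat(a, [0, 2])       # array([3, 7])  (sums of a[0:2] and a[2:])
np.add.outer([1, 2], [10, 20])   # array([[11, 21], [12, 22]])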
import pandas as pd
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df
# A B
# 0 1 3
# 1 2 4
df['A'] > 2
# 0 False
# 1 False
# Name: A, dtype: bool
df[df['A'] > 2]
# Empty DataFrame
# Columns: [A, B]
# Index: []
df['A'] > 1
# 0 False
# 1 True
df[df['A'] > 1]
# A B
# 1 2 4
df.sort_values('A')
df.sort_values(by='A', ascending=False)
# A B
# 1 2 4 # Rows sorted using column A.
# 0 1 3
import pandas as pd
data = {
'Order ID': [101, 102, 103, 104, 105, 106, 107, 108, 109],
'Product': ['Laptop', 'Desktop', 'Tablet', 'Laptop', 'Desktop', 'Tablet', 'Laptop', 'Desktop', 'Tablet'],
'Customer': ['John', 'Jane', 'Bob', 'John', 'Jane', 'Bob', 'John', 'Jane', 'Bob'],
'Date': ['2022-01-01', '2022-01-05', '2022-01-10', '2022-01-15', '2022-01-20', '2022-01-25', '2022-02-01', '2022-02-05', '2022-02-10'],
'Qty': [2, 1, 3, 4, 2, 1, 3, 4, 2],
'Price': [1000, 800, 600, 1000, 800, 600, 1000, 800, 600],
'Total': [2000, 800, 1800, 4000, 1600, 600, 3000, 3200, 1200]
}
# You can either use Dictionary of Lists or List of Dictionaries to create df.
df = pd.DataFrame(data)
print(df)
Order ID Product Customer Date Qty Price Total
0 101 Laptop John 2022-01-01 2 1000 2000
1 102 Desktop Jane 2022-01-05 1 800 800
2 103 Tablet Bob 2022-01-10 3 600 1800
3 104 Laptop John 2022-01-15 4 1000 4000
4 105 Desktop Jane 2022-01-20 2 800 1600
5 106 Tablet Bob 2022-01-25 1 600 600
6 107 Laptop John 2022-02-01 3 1000 3000
7 108 Desktop Jane 2022-02-05 4 800 3200
8 109 Tablet Bob 2022-02-10 2 600 1200
nRowsRead = 1000 # specify 'None' if want to read whole file
# avocado.csv has 18250 rows, but we load only the first 1000 rows
adf = pd.read_csv('./data/avocado.csv', delimiter=',', nrows = nRowsRead)
adf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1000 non-null int64
1 Date 1000 non-null object
2 AveragePrice 1000 non-null float64
3 Total Volume 1000 non-null float64
4 4046 1000 non-null float64
5 4225 1000 non-null float64
6 4770 1000 non-null float64
7 Total Bags 1000 non-null float64
8 Small Bags 1000 non-null float64
9 Large Bags 1000 non-null float64
10 XLarge Bags 1000 non-null float64
11 type 1000 non-null object
12 year 1000 non-null int64
13 region 1000 non-null object
dtypes: float64(9), int64(2), object(3)
adf.iloc[0]
Unnamed: 0 0
Date 2015-12-27
AveragePrice 1.33
Total Volume 64236.62
4046 1036.74
4225 54454.85
4770 48.16
Total Bags 8696.87
Small Bags 8603.62
Large Bags 93.25
XLarge Bags 0.0
type conventional
year 2015
region Albany
Name: 0, dtype: object
df.head()
Order ID Product Customer Date Qty Price Total
0 101 Laptop John 2022-01-01 2 1000 2000
1 102 Desktop Jane 2022-01-05 1 800 800
2 103 Tablet Bob 2022-01-10 3 600 1800
3 104 Laptop John 2022-01-15 4 1000 4000
4 105 Desktop Jane 2022-01-20 2 800 1600
df.describe() # Basic Statistics.
Order ID Qty Price Total
count 9.000000 9.000000 9.000000 9.000000
mean 105.000000 2.444444 800.000000 2022.222222
std 2.738613 1.130388 173.205081 1155.181563
min 101.000000 1.000000 600.000000 600.000000
25% 103.000000 2.000000 600.000000 1200.000000
50% 105.000000 2.000000 800.000000 1800.000000
75% 107.000000 3.000000 1000.000000 3000.000000
max 109.000000 4.000000 1000.000000 4000.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 9 non-null int64
1 Product 9 non-null object
2 Customer 9 non-null object
3 Date 9 non-null object
4 Qty 9 non-null int64
5 Price 9 non-null int64
6 Total 9 non-null int64
dtypes: int64(4), object(3)
memory usage: 632.0+ bytes
dfc = df.set_index('Customer') # Returns new dataframe.
print(dfc)
Order ID Product Date Qty Price Total
Customer
John 101 Laptop 2022-01-01 2 1000 2000
Jane 102 Desktop 2022-01-05 1 800 800
Bob 103 Tablet 2022-01-10 3 600 1800
John 104 Laptop 2022-01-15 4 1000 4000
Jane 105 Desktop 2022-01-20 2 800 1600
Bob 106 Tablet 2022-01-25 1 600 600
John 107 Laptop 2022-02-01 3 1000 3000
Jane 108 Desktop 2022-02-05 4 800 3200
Bob 109 Tablet 2022-02-10 2 600 1200
df2 = df.set_index(['Customer', 'Product']) # Returns new dataframe.
Order ID Date Qty Price Total
Customer Product
John Laptop 101 2022-01-01 2 1000 2000
Jane Desktop 102 2022-01-05 1 800 800
Bob Tablet 103 2022-01-10 3 600 1800
John Laptop 104 2022-01-15 4 1000 4000
Jane Desktop 105 2022-01-20 2 800 1600
Bob Tablet 106 2022-01-25 1 600 600
John Laptop 107 2022-02-01 3 1000 3000
Jane Desktop 108 2022-02-05 4 800 3200
Bob Tablet 109 2022-02-10 2 600 1200
df2.loc[('John', 'Laptop')]
Order ID Date Qty Price Total
Customer Product
John Laptop 101 2022-01-01 2 1000 2000
Laptop 104 2022-01-15 4 1000 4000
Laptop 107 2022-02-01 3 1000 3000
dfc.sort_index() # Sort by index.
Product Date Qty Price Total
Customer
Alice NaN 2022-02-10 2 600 1200
Bob Tablet 2022-01-10 3 600 1800
Bob Tablet 2022-01-25 1 600 600
Bob Tablet 2022-02-10 2 600 1200
Jane Desktop 2022-01-05 1 800 800
Jane Desktop 2022-01-20 2 800 1600
Jane Desktop 2022-02-05 4 800 3200
John Laptop 2022-01-01 2 1000 2000
John Laptop 2022-01-15 4 1000 4000
John Laptop 2022-02-01 3 1000 3000
# df[0] # Error. Does not work.
# df[colname] # Reserved for accessing column.
df[0:1] # Returns First Row.
df[5:6] # Returns the single row at position 5. Slice interval is half-open (end excluded).
df.head(1) # Returns first row.
df.head(10).tail(1) # Returns tenth row.
df.iloc[0] # Returns single Row as Series, not DataFrame!
# You don't want Series. You want single row DataFrame.
df.iloc[0:1] # Returns first row. It is DataFrame.
# Tip: Always use Slice Range even to get 1 row.
df.iloc[[0]] # Returns first row as DataFrame.
df.loc[0] # Works with default index. Returns pd.Series.
df.loc[0:1] # Works with default index. Returns DataFrame
df.loc[[0]] # Works with default index. Returns DataFrame
df.values[0] # Returns first row as numpy array!
dfc.loc['John'] # If name was set as index, this works.
dfc.loc[dfc.index.get_loc('index_value')] # If there is a non-default index.
# dfc.index.get_loc('John') returns a boolean array when the index is non-unique.
# If you get pd.Series, you can convert series to dataframe
# Following kind of works. But dtype becomes 'object' for all columns!
df.iloc[0].to_frame().transpose()
# Moral of the story: Always use slice notation to fetch even single row
# from dataframe to ensure the return type is DataFrame.
df['Product'] # Returns pd.Series
df[['Product']] # Returns DataFrame
0 Laptop
1 Desktop
2 Tablet
....
8 Tablet
Name: Product, dtype: object
df[df.columns[0]] # Access first column.
df[[col1, col2]] # Subset the columns
# Rows slicing.
df[4:8] # Returns rows 4 to 7. Note 8 is not included.
df.iloc[4:8] # Works even with custom Index.
# Rows slicing with Custom index
dfc.loc['John':'Bob'] # Range syntax applicable for custom index as well.
# Works only for unique Index labels.
# Range Syntax can't be used for columns.
df2.loc[('John', 'Laptop')] # If df has multi-index.
df2.loc[('John', 'Laptop'):('Bob', 'Tablet')]
# Columns Slicing
df[['Customer', 'Qty', 'Price']]
# Both row and column slicing.
df.loc[4:8, ['Customer', 'Qty', 'Price']]
Customer Qty Price
4 Jane 2 800
5 Bob 1 600
6 John 3 1000
7 Jane 5 800
8 Bob 2 600
df.iloc[4:8, 2:4] # Rows: 4 to 7; Columns: 2 to 3
Customer Date
4 Jane 2022-01-20
5 Bob 2022-01-25
6 John 2022-02-01
7 Jane 2022-02-05
See Also: df.iloc[:, 2:4] # Rows: All; Columns: 2 to 3
df.groupby('Customer').mean() # Newer pandas may need mean(numeric_only=True).
Order ID Qty Price Total
Customer
Bob 106.0 2.000000 600.0 1200.000000
Jane 105.0 2.666667 800.0 1866.666667
John 104.0 3.000000 1000.0 3000.000000
df.groupby('Customer').head(1) # Group by, retain max one per group.
Order ID Product Customer Date Qty Price Total
0 101 Laptop John 2022-01-01 2 1000 2000
1 102 Desktop Jane 2022-01-05 1 800 800
2 103 Tablet Bob 2022-01-10 3 600 1800
df[['Customer', 'Qty', 'Price']].groupby('Customer').mean()
Qty Price
Customer
Alice 2.000000 600.0
Bob 2.000000 600.0
Jane 2.666667 800.0
df[['Customer', 'Qty', 'Price']].groupby('Customer').agg([np.mean,
np.median, np.min, np.max])
Qty Price
mean median amin amax mean median amin amax
Customer
Bob 2.000000 2.0 1 3 600.0 600.0 600 600
Jane 2.666667 2.0 1 5 800.0 800.0 800 800
John 3.000000 3.0 2 4 1000.0 1000.0 1000 1000
# Apply different operations to different columns after groupby ...
# Note: as_index is a groupby() parameter, not an agg() parameter.
df[['Customer', 'Qty', 'Price']].groupby('Customer', as_index=False).agg(
    Avg_Qty      = ('Qty', np.mean),
    Median_Price = ('Price', np.median),
)
# To round the results to 2 decimal places for all numeric columns ...
df = df.round(2)
#
# Inspecting group by object ...
#
grouped = df[['Customer', 'Qty', 'Price']].groupby('Customer')
grouped.groups
# Out: {'Bob': [2, 5, 8], 'Jane': [1, 4, 7], 'John': [0, 3, 6]}
grouped.size()
# Customer
# Bob 3
# Jane 3
# John 3
# dtype: int64
grouped = df[['Customer', 'Qty', 'Price']].groupby(['one', 'two', ..., 'nine', 'nine'])
grouped.size()
nine 2
one 1
two 1
three 1
four 1
five 1
six 1
seven 1
dtype: int64
Note: groupby operation just remembers the distinct values of the groupby column
and corresponding row numbers. The column name is unimportant and may be missing.
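A small sketch (using the 9-row sales df above and a made-up label list) showing that the key can be any array of the same length, not necessarily a column:
keys = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a']   # one label per row; not a df column
df.groupby(keys)['Total'].sum()
# a    9600     (rows 0, 2, 4, 6, 8)
# b    8600     (rows 1, 3, 5, 7)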
Example: Total/Average/Count of Sales Revenue generated by 2 Sales Reps
--------------------------
2020 2021 2022
--------------------------
John Avg Avg Avg
Mary Avg Avg Avg
SQL: select Name, Year, avg(Revenue) From sales Group By Name, Year;
With pivot tables, the columns are the discrete values of a variable rather than the variable itself. For example, a single "Date" column is expanded into multiple columns of discrete values or ranges (e.g. 2001-2010, 2011-2020, etc.).
In a pivot operation on DataFrames, those values (or ranges) become columns, resulting in a wider table (the pd.pivot_table operation).
The pd.melt operation is the reverse: multiple columns are collapsed into one. For example, the columns (2020, 2021, ...) can be reduced to a single Year column (representing the variable) plus an additional AvgSales column (representing the value).
You can view even 3D data by making either row or column as composite:
--------------------------------
Avg Count Max
--------------------------------
John 2020 avg count max
Mary 2020 avg count max
John 2021 avg count max
Mary 2021 avg count max
John 2022 avg count max
Mary 2022 avg count max
Pivot tables are typically used to group by 2 different variables and to view corresponding mean values.
The equivalent of df.groupby('Customer')['Qty'].mean() using pivot tables would be:
df.pivot_table(values='Qty', index='Customer')
df.pivot_table(values='Qty', index='Customer', aggfunc=np.mean)
Qty
Customer
Alice 2.000000
Bob 2.000000
Jane 2.666667
John 3.000000
df.pivot_table(values='Qty', index='Customer', aggfunc=np.median)
Qty
Customer
Alice 2
Bob 2
Jane 2
John 3
Note: df.pivot(values='Qty', index='Customer') may raise an error if the
resulting index is not unique. Use df.pivot_table(), which aggregates duplicates via aggfunc.
Mostly we pivot on 2 variables. We call them index and columns:
df.pivot_table(values='Price', index='Customer', columns='Qty',
fill_value=0, margins = True)
# We are viewing mean (default) value of Price against (Customer, Qty)
Qty 1 2 3 4 5 All
Customer
Alice 0 600 0 0 0 600.0
Bob 600 600 600 0 0 600.0
Jane 800 800 0 0 800 800.0
John 0 1000 1000 1000 0 1000.0
All 700 750 800 1000 800 780.0
To get multi-index columns, you can create a multi-index-rows DataFrame and transpose it with df.T.
You can access values of a DataFrame with a multi-index using the df.xs method or slicing. The selected levels need not be a strict prefix of the index:
df.xs(('val1', 'val3'), level=[0,2])
# Uses col1, col3 of 3 column index.
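A concrete (illustrative) use with the df2 (Customer, Product) multi-index built earlier:
df2.xs('Laptop', level='Product')                          # Select on the 2nd level only (not a prefix).
df2.xs(('John', 'Laptop'), level=['Customer', 'Product'])  # Select on both named levels.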
Here is an example of pivoting and melting back:
import pandas as pd
# Melted data
data = {
'Student': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob',
'Charlie', 'Alice', 'Bob', 'Charlie'],
'Subject': ['Math', 'Math', 'Math', 'Science', 'Science',
'Science', 'English', 'English', 'English'],
'Score': [85, 78, 92, 89, 94, 88, 91, 82, 85]
}
df_melted = pd.DataFrame(data)
print("Original (Melted long format) data:\n", df_melted)
Output:
Student Subject Score
0 Alice Math 85
1 Bob Math 78
2 Charlie Math 92
3 Alice Science 89
4 Bob Science 94
5 Charlie Science 88
6 Alice English 91
7 Bob English 82
8 Charlie English 85
Pivoting back to the original wide format:
# Pivot back to wide format
df_pivoted = df_melted.pivot(index='Student',
columns='Subject',
values='Score').reset_index()
print("\nPivoted (wide format) data:\n", df_pivoted)
Output:
Subject Student English Math Science
0 Alice 91 85 89
1 Bob 82 78 94
2 Charlie 85 92 88
# df.columns.name = None # Get rid off the useless column index name/label.
# In above example, the column index label "Subject" may be retained, if desired.
# df.index.name = None # Get rid off the useless row index name/label.
Explanation
index='Student': This sets "Student" as the row index.
columns='Subject': Each unique value in "Subject" becomes a new column.
values='Score': Fills the cells with the "Score" values from each "Subject".
The result is a DataFrame in the original wide format.
Note that reset_index() is used to convert the "Student" index back into a regular column.
Now here is how the melt (reverse pivoting) operation works:
df_melted = pd.melt(df_pivoted,
id_vars=['Student'], # Student column to remain as id for each row
var_name='Subject', # Name for the new column in place of all non-id columns
# i.e. ("Math", "Science", "English").
value_name='Score' # New column name that represents the numeric values.
# value_vars=['Math', 'Science', 'English'] # Optional. By default, it is same.
)
print("\nLong format data:\n", df_melted)
Output:
Student Subject Score
0 Alice Math 85
1 Bob Math 78
2 Charlie Math 92
3 Alice Science 89
4 Bob Science 94
5 Charlie Science 88
6 Alice English 91
7 Bob English 82
8 Charlie English 85
# Note: By default, df.drop(8, inplace=True) drops the row.
# To drop column, you must specify columns explicitly.
dfc.drop(columns=['Order ID'], inplace=True)
print(dfc)
item = dfc.pop('Order ID') # Alternative to drop(): removes the column in place
# and returns the popped column as a pd.Series. (Only one of the two will work.)
dfc.sort_values('Qty', ascending=False)
Product Date Qty Price Total
Customer
Jane Desktop 2022-01-05 1 800 800
Bob Tablet 2022-01-25 1 600 600
John Laptop 2022-01-01 2 1000 2000
Jane Desktop 2022-01-20 2 800 1600
....
# See Also: dfc.sort_values([col1,col2], ascending=(False,True))
# Method 1: Add new row by duplicating last row
dfc.loc['Alice'] = dfc.iloc[-1].copy()
# Modify 'Product' to NaN
dfc.loc[dfc.index[-1], 'Product'] = np.nan
print(dfc)
Product Date Qty Price Total
Customer
John Laptop 2022-01-01 2 1000 2000
Jane Desktop 2022-01-05 1 800 800
Bob Tablet 2022-01-10 3 600 1800
John Laptop 2022-01-15 4 1000 4000
Jane Desktop 2022-01-20 2 800 1600
Bob Tablet 2022-01-25 1 600 600
John Laptop 2022-02-01 3 1000 3000
Jane Desktop 2022-02-05 4 800 3200
Bob Tablet 2022-02-10 2 600 1200
Alice NaN 2022-02-10 2 600 1200
# Delete the row added.
dfc = dfc.drop('Alice') # Deletes ALL rows with this index label.
dfc = dfc.drop(dfc.index[[9]]) # Delete rows at specific positions (here position 9).
dfc = dfc[~dfc.index.isin(['Alice'])]
# Note: To get rows of only Alice, do this: df = df.loc[['Alice']]
# Method 2:
# Add another new row by creating and adding dataframe.
new_df = pd.DataFrame( {
'Product': [np.nan],
'Date': ['2022-02-15'],
'Qty': [3],   # Column name must match df ('Qty', not 'Quantity').
'Price': [1000],
'Total': [3000]
}, index = ['Alice'] )
dfc = pd.concat([dfc, new_df])
print(dfc)
pd.concat([df1, df2]) # Default axis=0; Rows added.
pd.concat([df1, df2],
join = 'inner', # Only common columns retained.
# Else both columns included.
# verify_integrity = True # Check if duplicates found in index.
# Checks only index, not columns.
)
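A tiny sketch (made-up frames) contrasting join='inner' with the default join='outer':
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5], 'C': [6]})
pd.concat([df1, df2])                # Columns A, B, C; missing cells become NaN.
pd.concat([df1, df2], join='inner')  # Only the common column B is retained.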
You typically use the df.apply() function for aggregation. If you want to transform only a single column, you would use something like:
df['newcol'] = df['col'] * 2
new_df = my_numpy_ufunc(df)
new_df = my_numpy_vectorized_func(df)
Apply np.sum down the rows (axis=0), i.e. once per column:
result = df[['Qty', 'Price']].apply(np.sum, axis=0)
print(result)
Qty 22 # This is pd.Series
Price 7200
dtype: int64
# result_df = result.to_frame().transpose()
# Each column is summed up. Row is lost. Returns pd.Series.
df[['Qty', 'Price']].apply(lambda x: np.int32(x), axis=0) returns a DataFrame (an element-wise transform keeps the shape).
Aggregating across columns (axis=1) is uncommon. For it to make sense, all columns must have the same units, e.g. you may want to sum 'StateTax' and 'CentralTax':
result = df[['StateTax', 'CentralTax']].apply(np.sum, axis=1)
df['TotalTax'] = result
print(result)
0 1002 # result type: pd.Series
1 801
2 603
3 1004
for index, row in dfc.iterrows():
    print("Index: ", index)
    print(row)   # row is of type pd.Series
                 # row.index is the list of column names.
                 # You can access row['col1'], etc.
# Prefer itertuples over iterrows as it is much faster.
for row in df.itertuples(index=True):   # Returns the Index as the first element in the tuple.
    print(row.c1, row.c2)
In most cases, you should prefer to use vector methods to avoid iterating:
Vectorization: e.g. df['Qty'] = df['Qty']**2; # np.square(df['Qty'])
List Comprehension:
for row in df[['Qty', 'Price']].to_numpy():
    # Do stuff with row[0], row[1], etc.
use df.apply() for aggregation.
Compute df1 - df2 to remove common rows.
Use concat with drop_duplicates (a bit inefficient):
# df2 is appended twice, so every df2 row (and every df1 row also present in df2)
# becomes a duplicate; keep=False then drops all rows that appear more than once.
df_result = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(df_result)
Create new column as tuple of the row and use it to remove duplicates :
df1['hash'] = df1.apply(tuple, axis=1)
df2['hash'] = df2.apply(tuple, axis=1)
result_df = df1[ ~df1['hash'].isin(df2['hash']) ] # Return only unique rows.
common_df = df1[ df1['hash'].isin(df2['hash']) ] # Return common rows.
df1.pop('hash'); df2.pop('hash'); result_df.pop('hash')
# Without using temp variables, you can also do it as ...
df_result = df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
# Another way ...
df2_tuples = [tuple(row) for row in df2.values]
result_df = df1[~df1.apply(lambda row: tuple(row) in df2_tuples, axis=1)]
Merge two DataFrames with indicator and drop using the indicator :
result_df = pd.merge(df1, df2, indicator=True, how='outer')
result_df = result_df[result_df['_merge'] == 'left_only'].drop('_merge', axis=1)
To get the value for a column which is at 30 percentile point:
df['Qty'].quantile(0.30) # Default = 0.5 i.e. 50% percentile
# 2.5
df['Qty'].quantile(0) # Min value for the column. 0th Percentile.
df['Qty'].quantile(1) # Max value for the column. 100th Percentile.
Given a value, how to find its percentile? :
percentile = ((df['Qty'].sort_values().values.searchsorted(1)+1)/ len(df)) * 100  # percentile of the value 1
df['Qty'].rank(method='min') # Rank of each row; divide by len(df) for a percentile-like measure.
Another way to find percentile given value:
from scipy.stats import percentileofscore
percentile = percentileofscore(df['column_name'], value)
print(f"The value {value} is at {percentile:.2f}th percentile.")
df[['Qty', 'Price']].agg(np.mean)
Qty 2.555556 # Returns pd.Series
Price 800.000000
dtype: float64
df[['Qty', 'Price']].agg([np.mean, np.sum])
Qty Price # Returns DataFrame
mean 2.555556 800.0
sum 23.000000 7200.0
df.drop_duplicates(subset=['Customer'])
# First value kept. Later duplicates are dropped.
Order ID Product Customer Date Qty Price Total
0 101 Laptop John 2022-01-01 2 1000 2000
1 102 Desktop Jane 2022-01-05 1 800 800
2 103 Tablet Bob 2022-01-10 3 600 1800
df['Customer'].value_counts()
John 3
Jane 3
Bob 3
Name: Customer, dtype: int64
df['Customer'].value_counts(normalize=True)
John 0.3
Jane 0.3
Bob 0.3
Alice 0.1
Name: Customer, dtype: float64
import matplotlib.pyplot as plt
df['Qty'].hist()
plt.show()
df['Price'].hist()
plt.show()
avg_qty_per_customer = df.groupby('Customer')['Qty'].mean()
print(avg_qty_per_customer)
Customer                  # pd.Series
Alice    2.000000
Bob      2.000000
Jane     2.666667
John     3.000000
Name: Qty, dtype: float64
avg_qty_per_customer.plot(kind='bar')
df.plot(x='Date', y='Total', kind='line', rot=45)
df.plot(x='Date', y='Total', kind='scatter', rot=45, xscale='log')
#
# Overlapping Histograms with alpha (transparent)
#
df[df['Price'] > 700]['Qty'].hist(alpha=0.7)
df[df['Price'] <= 700]['Qty'].hist(alpha=0.7)
plt.legend(['gt700', 'le700'])
plt.show()
df.corr() # Newer pandas may need df.corr(numeric_only=True).
Order ID Qty Price Total
Order ID 1.000000 0.213122 -0.423702 0.001381
Qty 0.213122 1.000000 0.349913 0.883411
Price -0.423702 0.349913 1.000000 0.695845
Total 0.001381 0.883411 0.695845 1.000000
fd = pd.read_csv("./data/startup_funding.csv")
fd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sr No 3044 non-null int64
1 Date dd/mm/yyyy 3044 non-null object
2 Startup Name 3044 non-null object
3 Industry Vertical 2873 non-null object
4 SubVertical 2108 non-null object
5 City Location 2864 non-null object
6 Investors Name 3020 non-null object
7 InvestmentnType 3040 non-null object
8 Amount in USD 2084 non-null object
9 Remarks 419 non-null object
dtypes: int64(1), object(9)
memory usage: 237.9+ KB
fd.loc[0]
Sr No 1
Date dd/mm/yyyy 09/01/2020
Startup Name BYJU’S
Industry Vertical E-Tech
SubVertical E-learning
City Location Bengaluru
Investors Name Tiger Global Management
InvestmentnType Private Equity Round
Amount in USD 20,00,00,000
Remarks NaN
Name: 0, dtype: object
fd.isnull() # 2D array of booleans.
fd.isnull().sum() # Returns missing sum for each column. (pd.Series)
Sr No 0
Date dd/mm/yyyy 0
Startup Name 0
Industry Vertical 171
SubVertical 936
City Location 180
Investors Name 24
InvestmentnType 4
Amount in USD 960
Remarks 2625
fd.isnull().sum().sort_values() # Sorts above array using values.
fd.isnull().count() # Returns constant count for each column. (pd.Series)
Sr No 3044
Date dd/mm/yyyy 3044
...
Remarks 3044
dtype: int64
# To clean up a cell value, do something like this
# (assumes the 'Amount in USD' column has been renamed to 'AmountInUSD'):
clean_col = np.vectorize(lambda x: x.replace(',', '').strip())
fd['AmountInUSD'] = clean_col(fd['AmountInUSD'])
# Find missing data percentage.
# Find total missing data (null) for each column ...
total = fd.isnull().sum().sort_values(ascending = False)
# Divide by constant total entries (fd.isnull().count() == 3044)
percent = ((fd.isnull().sum()/fd.isnull().count())*100)
percent = percent.sort_values(ascending = False)
missing_df = pd.concat([total, percent], axis=1, keys=['Total', 'Percent %'])
missing_df.head()
# Concatenate along axis=1, uses new given column names.
Total Percent %
Remarks 2625 86.235217
Amount in USD 960 31.537451
SubVertical 936 30.749014
City Location 180 5.913272
Industry Vertical 171 5.617608
# If you want to reverse columns and rows, just do: df = df.T
# To concatenate as DF, you should do ...
mdf = pd.concat([total.to_frame().T, percent.to_frame().T])
mdf.index = ['MissedCount', 'Percent']
# Note: It is easier to concat series along axis=1
#
# E.g. Values: 10,000 undisclosed unknown etc.
#
def clean_amount(x):
    x = str(x).replace(',', '').strip()
    x = x.replace('+', ' ').strip()  # Remove trailing +
    if x.upper().isupper():          # Contains alphabets
        x = 0
    if (not x):
        return 0
    return np.int64(float(x))
clean_amount = np.vectorize(clean_amount)
fd['AmountInUSD'] = clean_amount(fd['AmountInUSD'])
# To drop a column ...
del fd['Remarks']
# fd.drop('Remarks', axis=1, inplace=True) # Same as above to del column.
# Deletes rows if any one column is NA ... using dropna() method ...
fd.dropna(inplace=True)
# Deletes rows where AmountInUSD column is na or zero.
fd = fd[ fd['AmountInUSD'] > 0 ]
There are various methods available for removing outliers.
def outlier_info(low, high, method='', new_fd=None):
    print(f"Outlier Results: (Using Method: {method})")
    # Filter out outliers
    fd_amt = fd['AmountInUSD']
    if new_fd is None:
        print(f"Lower Bound: {low}; Higher Bound: {high}")
        new_fd = fd[(fd_amt >= low) & (fd_amt <= high)]
    new_count = new_fd[column].count()   # 'column' is defined later (= 'AmountInUSD')
    total_count = fd_amt.count()
    dropped = total_count - new_count
    percent_dropped = (dropped/total_count) * 100.0
    print(f"Total: {total_count:,}")
    print(f"Dropped: {dropped:,}")
    print(f"Percent-Dropped: {percent_dropped:.1f}")
    print(new_fd.describe())
# We want to remove outliers in AmountInUSD column.
fda = fd['AmountInUSD']
fda.describe() # Display Percentile and other statistics.
count 2.066000e+03 2066
mean 1.842792e+07 18,427,920 18 Million
std 1.213441e+08 121,344,100 121 Million
min 1.600000e+04 16,000
25% 4.712500e+05 471,250 0.47 Million
50% 1.725000e+06 1,725,000 1.72 Million Median ?
75% 8.000000e+06 8,000,000 8.00 Million
max 3.900000e+09 3,900,000,000 3.90 Billion
Name: AmountInUSD, dtype: float64
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
column = 'AmountInUSD'
Q1 = fd[column].quantile(0.25) # 0.47 Million
Q3 = fd[column].quantile(0.75) # 8.00 Million
# Calculate IQR
IQR = Q3 - Q1
7528750.0 # 7.52 Million
# Define the outlier bounds
lower_bound = Q1 - 1.5 * IQR # negative
upper_bound = Q3 + 1.5 * IQR # 19 Million
# Filter out outliers
fd_filtered = fd[(fd[column] >= lower_bound) & (fd[column] <= upper_bound)]
fd_filtered[column].count()
1783 # 2066 - 1783 = 283; 283/2066 = 13.7% dropped.
fd_filtered.describe()
AmountInUSD
count 1.783000e+03 1783
mean 3.165783e+06 3,165,783 # 3 Million
std 4.240982e+06 # 4 Million
min 1.600000e+04
25% 3.750000e+05
50% 1.000000e+06
75% 4.500000e+06
max 1.900000e+07
The Z-score method keeps values within 3 standard deviations of the mean. For our funding data it did not work very well since the standard deviation is extremely high:
from scipy import stats
# Calculate the Z-scores for each data point
fd['z_score'] = stats.zscore(fd['AmountInUSD'])
# Filter out outliers based on Z-score
threshold = 3 # Adjust threshold as needed
new_fd = fd[fd['z_score'].abs() < threshold].drop(columns='z_score')
outlier_info(0, 0, method='z_score', new_fd=new_fd)
Outlier Results: (Using Method: z_score)
Total: 2,066
Dropped: 13
Percent-Dropped: 0.6
Sr No AmountInUSD
count 2053.000000 2.053000e+03
mean 1475.387725 1.142576e+07
std 926.287657 3.076596e+07
min 1.000000 1.600000e+04
25% 622.000000 4.640000e+05
50% 1442.000000 1.600000e+06
75% 2310.000000 7.800000e+06
max 3044.000000 3.300000e+08
The modified Z-score method keeps points whose distance from the median is within about 3.5 times the MAD (the median of absolute deviations from the median):
def modified_z_score(series):
    median = series.median()
    mad = np.median(np.abs(series - median))
    return 0.6745 * (series - median) / mad
# Apply modified Z-score method
fd['modified_z_score'] = modified_z_score(fd['AmountInUSD'])
# Filter out outliers based on modified Z-score
threshold = 3.5
new_fd = fd[fd['modified_z_score'].abs()<threshold].drop(columns='modified_z_score')
outlier_info(0, 0, method='Modified-Z-score', new_fd=new_fd)
Result:
Outlier Results: (Using Method: Modified-Z-score)
Total: 2,066
Dropped: 489
Percent-Dropped: 23.7
Sr No AmountInUSD z_score
count 1577.000000 1.577000e+03 1577.000000
mean 1517.958782 1.864793e+06 -0.136530
std 887.148232 2.168736e+06 0.017877
min 2.000000 1.600000e+04 -0.151770
25% 743.000000 3.000000e+05 -0.149429
50% 1507.000000 1.000000e+06 -0.143659
75% 2296.000000 3.000000e+06 -0.127173
max 3044.000000 9.700000e+06 -0.071944
Winsorizing caps the lowest 5% of values at the 5th percentile and the highest 5% at the 95th percentile (the limits are configurable); it does not drop any values.
from scipy.stats.mstats import winsorize
# Winsorize the data (limits=[lower, upper] for percentile capping)
winsorized_values = winsorize(fd['AmountInUSD'], limits=[0.05, 0.05])
# Create a new DataFrame with Winsorized data
fd_winsorized = pd.DataFrame(winsorized_values, columns=['AmountInUSD'])
outlier_info(0, 0, method='Winsorize', new_fd=fd_winsorized)
Result:
Outlier Results: (Using Method: Winsorize)
Total: 2,066
Dropped: 0
Percent-Dropped: 0.0
AmountInUSD
count 2.066000e+03
mean 8.655042e+06 8.6 Million
std 1.546028e+07 15 Million
min 1.000000e+05 0.1 Million
25% 4.712500e+05 0.5 Million
50% 1.725000e+06 1.7 Million
75% 8.000000e+06 8 Million
max 6.000000e+07 60 Million
Alternatively, drop values outside the 5th and 95th percentiles, for example:
low = fd['AmountInUSD'].quantile(0.05)
high = fd['AmountInUSD'].quantile(0.95)
outlier_info(low, high, method='Percentile-Cap')
Result :
Outlier Results: (Using Method: Percentile-Cap)
Lower Bound: 100000.0; Higher Bound: 60000000.0
Total: 2,066
Dropped: 162
Percent-Dropped: 7.8
AmountInUSD
count 1.904000e+03
mean 6.331311e+06 6.3 Million
std 1.082068e+07 10.8 Million
min 1.000000e+05 0.1 Million
25% 5.000000e+05 0.5 Million
50% 1.500000e+06 1.5 Million
75% 6.500000e+06 6.5 Million
max 6.000000e+07 60.0 Million
df.isnull().sum() # Total missing values.
# Result: Series; Col as index and sum as value.
df.isnull().count() # Total constant count as Series.
df[col].isnull().sum() # Scalar sum value. Total missing count.
df[col].isnull().count()# Total constant as total rows.
df_merged = df1.merge(df2, on='common_col',
suffixes=('_one', '_two'),
how='inner' # Default
# left_on = 'col1'
# right_on = 'col2'
# left_index = True ; Joining key is index.
# right_index = True ;
# validate = 'one_to_one' # Check no duplicates due to merge.
)
# common columns are auto renamed as _x, _y suffixes by default otherwise.
See Also: pd.merge_ordered(df1, df2, on='col1', # or left_on, right_on
# left_by = 'col', right_by='col',
how='inner', # Default: 'outer'!!!
fill_method='ffill', # Forward Filling missing values
)
pd.merge_asof(df1, df2,
direction = 'nearest' | 'forward' | 'backward' # Row to match.
...)
# Kind of outer join where nearest rows are chosen to join.
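A minimal merge_asof sketch with made-up, already-sorted key columns; each left row picks the nearest earlier key on the right:
import pandas as pd
quotes = pd.DataFrame({'t': [1, 3, 6], 'bid': [100.0, 100.5, 101.0]})
trades = pd.DataFrame({'t': [2, 5], 'qty': [10, 20]})
pd.merge_asof(trades, quotes, on='t', direction='backward')
#    t  qty    bid
# 0  2   10  100.0    <- matched with the quote at t=1
# 1  5   20  100.5    <- matched with the quote at t=3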
df2_ids = df2['id']
result = df1[ df1['id'].isin(df2_ids) ]
merged = df1.merge(df2, on='id', how='left', indicator=True)
# merged['_merge'] is one of 'left_only' | 'both' | 'right_only'
result = merged[ merged['_merge'] == 'left_only' ]
# Drop B's columns.
df = df.query('Qty >1 and Qty < 3')
interval = pd.Interval(1, 5, closed='both') # [1, 5]
interval = pd.Interval(1, 5, closed='right') # (1, 5]
# Interval from array
intervals = pd.IntervalIndex.from_arrays([1, 2, 3], [5, 6, 7], closed='both')
# list of [1,5], [2,6], [3,7] intervals
# Interval Operations:
# Intersection: interval1 & interval2
# Union: interval1 | interval2
# Difference: interval1 - interval2
# Contains: interval.contains(value)
# Overlaps: interval.overlaps(interval2)
# IntervalIndex: A pandas Index composed of Intervals.
intervals = pd.IntervalIndex.from_tuples([(1, 5), (6, 10), (11, 15)], closed='both')
# Range-based filtering: keep rows whose value falls inside any of the intervals.
# (Note: isin() checks element equality, so use interval containment instead.)
df[df['values'].apply(lambda v: any(v in iv for iv in intervals))]
#
# Use pd.cut() to return series of categories given series/array of values.
#
pd.cut(x, # Input Array or Series
bins, # bins=3 means divide by equal 3 Intervals.
# bins=array to create categories by this criteria.
right=True|False, # e.g. [1,5] interval to include 5 or not.
# Note: left is always included.
      labels,             # Category labels. By default the interval itself
                          # becomes the category label.
      retbins=True|False, # Also return the computed bin edges.
ordered=True|False # Usually True for numerical categories.
include_lowest=True|False # To include lowest value or not.
)
pd.cut(range(7), 3, labels=['low', 'medium', 'high'])
# Divides input array into 3 equal bins and then maps input to categories.
# Output:
['low', 'low', 'low', 'medium', 'medium', 'high', 'high'] # Input Size = Output Size
Categories (3, object): ['low' < 'medium' < 'high'] # Labels becomes categories.
pd.cut(range(7), 3)
# Output:
[(-0.006, 2.0], (-0.006, 2.0], ... , (4.0, 6.0], (4.0, 6.0]] # 7 elements in output, each mapped to one of 3 categories
Categories (3, interval[float64, right]): [(-0.006, 2.0] < (2.0, 4.0] < (4.0, 6.0]]
# The interval becomes category label by default!
Another way to create custom category for non-numeric data:
conditions = [
df['values'] == 'A',
df['values'] == 'B',
df['values'] == 'C'
]
choices = ['Category A', 'Category B', 'Category C']
df['category'] = np.select(conditions, choices, default='Other')
Group by a category created with pd.cut by binning a numerical range:
# Define bins and labels
bins = [0, 10, 20, 30, 40, 50, float('inf')] # Include an open-ended upper bin
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50+']
# Assign bins to Sales column
df['Sales_Bin'] = pd.cut(df['Sales'], bins=bins, labels=labels, right=False)
# right=False means for each interval right side value is not included.
# left value is always included.
# Group by bins
grouped = df.groupby('Sales_Bin')['Sales'].sum().reset_index()
print(grouped)
Merging: pd.merge(df1, df2, on='A')
3. Data Analysis
Summary stats: df.describe()
Correlation: df.corr()
Data cleaning: df.dropna(), df.fillna()
4. Data Input/Output
CSV: pd.read_csv('data.csv'), df.to_csv('data.csv')
Excel: pd.read_excel('data.xlsx'), df.to_excel('data.xlsx')
5. Data Transformation
Melting: pd.melt(df, id_vars='A', value_vars='B')
Pivoting: df.pivot_table(index='A', columns='B', values='C')
Key Functions:
head(), tail(): View first/last few rows
info(), describe(): View data summary
shape, size: View data dimensions
columns, index: View column/index labels
Histogram involves bins and counts.
Mean, Median, Mode
Mode is the most frequent value in the set. Useful for non-numerical categorical set.
To find mode:
df[col].value_counts().head(1) # Gives the biggest value count.
# Displays the (value, count) pair for the highest count.
import statistics
statistics.mode(df[col]) # Gives you the (value) for the most frequent item.
np.var (variance), np.std (std deviation)
Mean absolute deviation: np.mean(np.abs(distance_from_mean)); Similar to np.std, but np.std penalizes longer deviations more.
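A quick sketch comparing the two on made-up data with one outlier:
import numpy as np
x = np.array([1, 2, 3, 4, 100])
np.mean(np.abs(x - np.mean(x)))  # 31.2  (mean absolute deviation)
np.std(x)                        # ~39.0 (std penalizes the large deviation more)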
Quantiles -- 0.50 Quantile is median. :
np.quantile(df[col], 0.5)
Quartiles -- divide the data into 4 equal parts using the quantile points [0, .25, .5, .75, 1] -- there are 4 intervals:
np.quantile(df[col], [0, 0.25, 0.5, 0.75, 1.0]) # 5 points are quartiles. Q0 ... Q4
The boxplot uses the Q1-Q2-Q3 box:
plt.boxplot(df[col]) ; plt.show()
Get quantiles using custom linear space intervals :
np.quantile(df[col], np.linspace(0, 1, 5)) # linspace gives: [0, 0.25, 0.5, 0.75, 1]
# Output: 5 quantile values.
Interquartile Range (IQR): distance between the 0.75 quantile and the 0.25 quantile, e.g. scipy.stats.iqr(df[col]); it is the height of the boxplot box.
Outliers. Thumb rule:
data < Q1 - (1.5 x IQR) or data > Q3 + (1.5 x IQR)
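A short sketch of this thumb rule on made-up data:
import numpy as np
from scipy.stats import iqr
x = np.array([1, 2, 3, 4, 5, 100])
q1, q3 = np.quantile(x, [0.25, 0.75])
spread = iqr(x)                            # same as q3 - q1
low, high = q1 - 1.5*spread, q3 + 1.5*spread
x[(x < low) | (x > high)]                  # array([100])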
from scipy.stats import uniform
# Suppose You are waiting for bus that comes every 12 mins.
# prob(wait_time <= 12) = 1
# prob(wait_time <= 6) = 0.5
# cumulative distribution function. (x, min, interval_size)
# loc=0 is minimum. scale=interval_size
prob = uniform.cdf(6, loc=0, scale=12)
# prob is 0.5
# RVS = Random Variates of Given Size; Generate random variables of given size.
uniform.rvs(loc=0, scale=1, size=1)
uniform.rvs()
# output: 0.3862
uniform.rvs(size=5)
# output: array([x, ..., y])
uniform.rvs(size=(2,3)) # Generates 2x3 matrix of random numbers in 0<x<1
np.random.randn() # generates a standard-normal random number (mean 0, std 1; not bounded).
np.random.randn(5) # generates 5 numbers as an array.
np.random.randn(3, 3) # generates a (3,3) matrix of standard-normal random numbers.
# default_rng() is recommended over np.random()
rng = np.random.default_rng(seed=42)
rng.random() # generates a random number 0 < x < 1; Uniform distribution.
# 0.6823518632481435
rng.standard_normal(size=(2,3)) # Generates normal distributed number.
# Same as np.random.randn(2, 3)
array([[-0.36778665, 1.28792526, 0.19397442], # mean=0; stddev=1 is fixed.
[ 0.9202309 , 0.57710379, -0.63646365]])
rng.uniform(0, 5, (2,3)) # between 0 to 5, generate 2x3 matrix. Uniform distribution.
array([[1.38287199, 4.09877281, 4.44946347],
[2.56485228, 1.22482301, 4.12120798]])
# See Also: rng.exponential(), rng.normal(), rng.logistic(), rng.uniform()
from scipy.stats import binom
binom.rvs(no_of_coins, prob_of_success, size=None)
binom.rvs(10, 0.5, size=1)
# 5 or 4 or 6 etc. # Meaning: 10 coins thrown in 1 trial: 5 successes.
# It is like flipping 10 binary coins at once and counting the successes.
binom.rvs(10, 0.5, size=(2,3))
# Output:
array([[5, 4, 6], # Each element is centered around 10*0.5 = 5 successes.
[5, 3, 5]])
binom.rvs(1, 0.5, size=10)
array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
# Define parameters
n = 10 # Number of trials
p = 0.5 # Probability of success
k = 5 # Number of successes
# Out of 10 trials, getting total 5 successes is the max probability.
# Out of 1 trial getting total 0 or 1 successes probability = 0.5
# Out of 2 trials, getting exactly 1 success (out of 0, 1, 2) has probability 0.5
# Out of 10 trials getting total 5 successes is << 0.5
# since total 4, 6 also have high probabilities.
# Probability Mass Function (PMF)
# for k in range(11):
pk = binom.pmf(k, n, p)
print(f'P(X = {k}) = {pk:.4f}')
# Output for different k's ...
P(X = 0) = 0.0010
P(X = 1) = 0.0098
P(X = 2) = 0.0439
P(X = 3) = 0.1172
P(X = 4) = 0.2051
P(X = 5) = 0.2461 # Exactly 5 success out of 10 trials is less than 25% !
P(X = 6) = 0.2051 # Same as P(X = 4)
P(X = 7) = 0.1172
P(X = 8) = 0.0439
P(X = 9) = 0.0098
P(X = 10) = 0.0010
k = 5 # Max number of successes.
cdk = binom.cdf(k, n, p)
print(f'P(X <= {k}) = {cdk:.4f}')
# Output for k in 0 to 10:
P(X <= 0) = 0.0010
P(X <= 1) = 0.0107
P(X <= 2) = 0.0547
P(X <= 3) = 0.1719
P(X <= 4) = 0.3770 # 20% increase! == P(X = 4)
P(X <= 5) = 0.6230 # 25% increase! == P(X = 5)
P(X <= 6) = 0.8281
P(X <= 7) = 0.9453
P(X <= 8) = 0.9893
P(X <= 9) = 0.9990
P(X <= 10) = 1.0000
from scipy.stats import norm
norm.cdf(k, mean, std)
# Define parameters
mu = 0 # Mean
sigma = 1 # Standard Deviation
# Generate x values
x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 100)
# Calculate PDF (probablity density function)
pdx = norm.pdf(x, loc=mu, scale=sigma)
# Cumulative distribution function: P(result <= x)
cdx = norm.cdf(x, loc=mu, scale=sigma)
norm.cdf(154, 161, 7) # Avg height = 161; stddev=7; P(height<154)
# 0.1586
# P(154 < height < 157) == cdf(157) - cdf(154)
# Plot PDF and CDF
plt.plot(x, pdx, label='PDF')
plt.plot(x, cdx, label='CDF')
plt.legend()
plt.show()
# Calculate the point for the given Percentile (or cdf)
q = 0.95
mu = 161 # Average Height
sigma = 7 # Std Dev
x = norm.ppf(q, loc=mu, scale=sigma) # Percent Point Function
print(f'X corresponding to P = {q}: {x:.4f}')
# X corresponding to P = 0.95: 172.5140
# Generate random variates
rv = norm.rvs(loc=mu, scale=sigma, size=3)
# array([147.63477247, 148.09195502, 163.61683838])
The Poisson distribution is a discrete probability distribution modeling the number of events occurring in a fixed interval.
from scipy.stats import poisson
import numpy as np
import matplotlib.pyplot as plt
# Define parameter
mu = 5 # Average rate of events
# Generate k values
k = np.arange(0, 15)
# Calculate PMF and CDF
pmf = poisson.pmf(k, mu)
cdf = poisson.cdf(k, mu)
# Plot PMF and CDF
plt.plot(k, pmf, label='PMF')
plt.plot(k, cdf, label='CDF')
plt.legend()
plt.show()
# Generate random variates
rv = poisson.rvs(mu, size=1000)
# Calculate probability
k_val = 8
prob = poisson.pmf(k_val, mu)
print(f'P(X = {k_val}) = {prob:.4f}')
# P(X = 8) = 0.0653 # Note: lambda = mu = 5
from scipy.stats import expon
import numpy as np
import matplotlib.pyplot as plt
# Define parameter
scale = 1/2 # Scale parameter (1/λ)
# Generate x values
x = np.linspace(0, 10, 100)
# Calculate PDF and CDF
pdf = expon.pdf(x, scale=scale)
cdf = expon.cdf(x, scale=scale)
# Plot PDF and CDF
plt.plot(x, pdf, label='PDF')
plt.plot(x, cdf, label='CDF')
plt.legend()
plt.show()
# Generate random variates
rv = expon.rvs(scale=scale, size=1000)
# Calculate probability
x_val = 5
prob = expon.cdf(x_val, scale=scale)
print(f'P(X ≤ {x_val}) = {prob:.4f}')
Use Cases: waiting time between events of a Poisson process (e.g. time between arrivals), time-to-failure modeling.
Seaborn is a Python visualization library built on top of Matplotlib.
Aesthetic Defaults
High-Level Abstractions
Distribution Plots:
- sns.distplot(): Histograms and kernel density estimates. (Deprecated in newer Seaborn; use sns.histplot() or sns.displot().)
- sns.kdeplot(): Kernel density estimates.
Categorical Plots:
- Categorical plots group by a category and draw bars, counts, or boxes.
- sns.catplot(): Generic plot by category. e.g. sns.catplot(x='col1', data=df, kind='count'|'bar'|'box'|'point')
- sns.barplot(): Bar plots for categorical data. Mean for each category.
- sns.countplot(): Count plots for categorical data. e.g. sns.countplot(x='col1', data=df)
- sns.boxplot(): Box plots for categorical data.
- sns.pointplot(): Shows mean of quantitative variable. show 95% confidence interval. Looks like line plot, but here one axis is category. e.g. sns.catplot(x='smoker', y='total_bill', data=tips, kind='point')
sns.catplot(x='country', data=df,
kind = 'count', # same as sns.countplot()
kind = 'bar', # Display mean for each category.
# Also shows confidence interval at top.
kind = 'box', # Shows median, spread, skewness, outliers
kind = 'point' # Shows mean, ci(95%)
            order=['USA', 'INDIA', 'CHINA'])
Regression Plots:
- sns.regplot(): Simple linear regression plots.
- sns.lmplot(): More complex linear regression plots.
Scatter Plots:
- sns.scatterplot(): Scatter plots for relationships between two variables.
- sns.jointplot(): Joint plots for relationships between two variables.
Relational Plot:
- sns.relplot(): Relational plots in a single figure or in a grid of subplots sharing axes. Offers a lot of flexibility.
Heatmaps:
- sns.heatmap(): Heatmaps for 2D data.
import seaborn as sns
import matplotlib.pyplot as plt
# Load example dataset
tips = sns.load_dataset("tips")
flights = sns.load_dataset("flights")
# Distribution Plot
sns.distplot(tips["total_bill"])
plt.show()
# Categorical Plot
sns.barplot(x="sex", y="total_bill", data=tips)
plt.show()
# Regression Plot
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()
# Scatter Plot
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()
# Joint Plot
sns.jointplot(x="total_bill", y="tip", data=tips)
plt.show()
# Heatmap
flights_pivot = flights.pivot_table(index="month", columns="year", values="passengers")
sns.heatmap(flights_pivot)
plt.show()
Customization:
- Colors: Use sns.set_palette() or sns.color_palette().
- Styles: Use sns.set_style() or sns.axes_style().
- Fonts: Use sns.set_theme(font_scale=...) or plt.rcParams["font.family"].
Use Seaborn's built-in datasets for testing:
sns.get_dataset_names()
[
'car_crashes', 'exercise', 'flights', 'iris', 'planets',
'seaice', 'taxis', 'titanic', 'dowjones',
]
Tips:
Scatterplot Options:
# different categories will be colored differently using hue option.
sns.scatterplot(x="total_bill", y="tip", data=tips, hue="sex", style="sex")
plt.show()
sns.relplot(x='col1', y='col2', data=df, kind='scatter',
    col='col3',       # Another dimension: create subplots stacked as columns.
    row='col4',       # Another dimension: create subplots in the next row.
    col_wrap=2,       # Cannot accommodate too many subplots in a single line.
    col_order=['col3_val1', 'col3_val2'],  # Subplot order as per value.
    ci='sd',          # Confidence interval: std deviation or None.
                      # Default is 95%.
    hue='col5',       # Vary color according to col5.
    style='col6',     # Vary line style (dashes etc.) as per col6.
    markers=True,     # Put a marker like +, x, * at each data point.
    dashes=False      # Do not use dashed lines.
)
g = sns.relplot(...) # or catplot()
type(g)
<class 'seaborn.axisgrid.FacetGrid'>
g.fig.suptitle('... Title ... ', y=1.03) # Place title slightly higher
g.set(xlabel='My x label', ylabel='My y label')
# Set x ticks rotation.
plt.xticks(rotation=90)
g = sns.boxplot(...)
type(g)
<class 'matplotlib.axes._subplots.AxesSubplot'>
g.set_title('...title...')
# scatterplot(), etc returns axes subplot, not FaceGrid
df['col1'].corr(df['col2']) finds a numeric correlation.
.corr() only detects linear correlation.
If there is a non-linear relationship, often only visualization can reveal it.
Sometimes log(x) or 1/x is more linearly correlated to y:
pds1.corr(pds2) = 0.1
log(pds1).corr(pds2) = 0.5 <=== Note this.
Correlation does not mean causation. e.g. Coffee Drinking seems to be correlated to cancer because people who smoke consume more coffee. And smoking causes cancer.
Here the third variable, i.e. Smoking is called Confounder or Lurking Variable.
fig, ax = plt.subplots()
ax.plot(df[col1], df[col2],
marker="o", # v etc
linestyle="--",
color="r")
ax.set_xlabel("Time in years")
ax.set_ylabel(" Amount ")
ax.set_title(" Time vs Amount ")
# Multiple subplots
fig, ax = plt.subplots(2, 1,         # 2 x 1 plots
                       sharey=True)  # Prevents y axes autoscaling differently
ax[0].plot(...)
ax[1].plot(...)
# Using twin x axis we can share x axis, use different y axis.
ax2 = ax.twinx()
ax2.tick_params('y', colors='red')
ax.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1))
ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1),
xytext=(pd.Timestamp('2008-10-06'), -0.2), # Position text elsewhere
arrowprops={'arrowstyle':'->', 'color':'gray'}) # Add arrow to text
# ====== Stacked bar chart. ==================
# Gold + Silver + Bronze medals stacked on top
#
ax.bar(medals.index, medals['Gold'], label='Gold')
ax.set_xticklabels(medals.index, rotation=90)
ax.set_ylabel("Number of medals")
ax.bar(medals.index, medals["Silver"], bottom=medals["Gold"], label="Silver")
ax.bar(medals.index, medals["Bronze"], bottom=medals["Gold"]+medals["Silver"],
label="Bronze")
ax.legend()
mr = mens_rowing
mg = mens_gymnastics
ax.hist(mr['Weight'], label='Rowing', histtype='step', bins=5) # Use hollow bars
ax.hist(mg['Weight'], label='Gymnastics', histtype='step', bins=5)
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
ax.legend() # Add the legend and show the Figure
plt.show()
ax.bar() method to represent errors/std-dev:
.
. +---|---+ Middle line represents std-dev
. | | +---|---+
. | | | |
. | | Height Represents Mean. | |
. | | | |
. | | | |
.
ax.bar("Rowing", mens_rowing['Height'].mean(), yerr=mens_rowing['Height'].std())
ax.errorbar() Method:
.
.
. |-----|
. / \
. | |-------|
. \--------
.
ax.errorbar(df[col1], df[col2_avg], yerr=df[col2_std])
ax.set_ylabel('Y label')
ax.boxplot([mr['Height'], mg['Height']])
ax.set_xticklabels(['Rowing', 'Gymnastics'])
ax.set_ylabel('Height (cm)')
ax.scatter(climate_change['co2'], climate_change['relative_temp'])
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (C)")
plt.style.use('ggplot') # or 'default', 'bmh', 'seaborn-colorblind',
# plt.show()
fig.set_size_inches([5, 3]) # width, height in inches
fig.savefig('myimage.png', dpi=300)
fig.savefig('myimage.jpg', pil_kwargs={'quality': 50}) # The old quality= kwarg is removed in newer Matplotlib.