tags:
- "#data-analysis"
- "#r-programming"
Notes On R Programming
Install R from the CRAN website, then install RStudio.
> install.packages('ggplot2', dependencies = TRUE)
# Variable Inspection
> a = 3
> ls() # list global objects
[1] "a"
> typeof(a)
[1] "double"
> stopifnot(a < 1)
Error: a < 1 is not TRUE
> class(a) # class finds more generic type
[1] "numeric"
> b = list(1, 10)
numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
# Complete the code for boolean_vector
boolean_vector <- c(TRUE, FALSE, TRUE)
# Vector is preferred over list usually.
some_vector <- c("John Doe", "poker player") # Single dimensional vector
names(some_vector) <- c("Name", "Profession") # Set names for vector elements.
# Slicing vector
my_vector[c(1, 5)] # Selects the 1st and 5th elements of the vector
my_vector[1:5] # Selects elements 1, 2, 3, 4, 5
my_vector["name"] # Selects the element named "name", if the vector has names.
my_vector[c("name", "age")] # Selects the named elements, if the vector has names.
# sum, mean, etc
v = c(10, 20, 30)
sum(v)
mean(v)
summary(my_var) # Summary of a variable: min, quartiles, median, mean, max
v = 5:8 # These 2 statements are equivalent. 5,8 both inclusive.
v = c(5,6,7,8)
# Selection vector
> selection_vector = c(4, 5, 6) > 5
[1] FALSE FALSE TRUE
> my_vector[selection_vector] # selects vector elements
> length(1:10)
[1] 10
> nchar("Software Carpentry")
[1] 18
# Append something to vector
v = character() # Empty vector of strings. Note: c() alone is NULL !!!
u = character()
v = c(v, 'one', 'two') # Works, but inefficient when repeated in a loop
v = append(v, 'one') #
v = c(v, u) # Merging two vectors. Best way
v[1] = 'one' # OK.
v[2] = 'two'
# Vector transformation
output <- sapply(values, function(v) return(2*v)) # sapply simplifies the result to a vector when it can
output <- lapply(values, function(v) return(2*v)) # lapply always returns a list
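The difference between the two is easiest to see on a small input. A minimal sketch, assuming a hypothetical `values` vector; `vapply` is a stricter base R alternative that these notes do not otherwise cover:

```r
values <- c(1, 2, 3)   # hypothetical input

doubled_vec  <- sapply(values, function(v) 2 * v)   # simplified to a numeric vector
doubled_list <- lapply(values, function(v) 2 * v)   # always a list

# vapply works like sapply but checks every result against a declared template
doubled_safe <- vapply(values, function(v) 2 * v, numeric(1))

str(doubled_vec)
str(doubled_list)
```

Use `lapply` when the results have different shapes, and `sapply` or `vapply` when each call returns a single value.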
R provides many functions to examine features of vectors and other objects, for example:
The str() is the one you want to use most of the time.
Examples:
> x <- "dataset"
> mode(x) # mode is the intrinsic attribute of object.
[1] "character"
> typeof(x) #
[1] "character" # x is a character vector of length 1.
> class(x) #
[1] "character"
> attributes(x)
NULL
> str(x) # Best command to see "structure" and summary of variable.
chr "dataset"
> y <- 1:10 # Assign a generated sequence as vector.
> y
[1] 1 2 3 4 5 6 7 8 9 10 # y is integer vector of size 10
> typeof(y)
[1] "integer"
> length(y)
[1] 10 # Note: a single integer is the same as an integer vector of length 1 !!!!
> z <- as.numeric(y) # Coerces to type double vector from integer vector
> z
[1] 1 2 3 4 5 6 7 8 9 10
> typeof(z)
[1] "double"
> vector() # an empty 'logical' (the default) vector
logical(0)
> vector("character", length = 5) # a vector of mode 'character' with 5 elements
[1] "" "" "" "" ""
> character(5) # the same thing, but using the constructor directly # Note: It is not string of length 5 !!!!
[1] "" "" "" "" ""
> numeric(5) # a numeric vector with 5 elements
[1] 0 0 0 0 0
> logical(5) # a logical vector with 5 elements
[1] FALSE FALSE FALSE FALSE FALSE
# You can also create vectors by directly specifying their content.
> x <- c(1, 2, 3) # These are vector of double precision numbers !!!
> x <- c(1L, 2L, 3L) # It is now integer vector!!! All elements must be suffixed with L !!!!
> x <- as.integer(x) # If you want to do type conversion/coerce.
> y <- c(TRUE, TRUE, FALSE, FALSE)
# The functions typeof(), length(), class() and str() provide useful information about your vectors and R objects in general.
> z <- c("Sarah", "Tracy", "Jon")
> typeof(z)
[1] "character"
> length(z)
[1] 3
> class(z)
[1] "character"
> str(z) # Great command to use !!!!
chr [1:3] "Sarah" "Tracy" "Jon"
> z <- c(z, "Annette") # Add new element at end.
> z
[1] "Sarah" "Tracy" "Jon" "Annette"
> z <- c("Greg", z) # Prefix new element.
> z
[1] "Greg" "Sarah" "Tracy" "Jon" "Annette"
> series <- 1:10
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from = 1, to = 10, by = 0.1)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
[15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
[29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
[43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
[57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
[71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
[85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
# R supports missing data NA in vectors.
> x <- c(0.5, NA, 0.7)
> is.na(x)
[1] FALSE TRUE FALSE
> anyNA(x)
[1] TRUE
# Inf is infinity. You can have either positive or negative infinity.
> 1/0
[1] Inf
# NaN means Not a Number. It's an undefined value.
> 0/0
[1] NaN
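Missing and undefined values propagate through arithmetic, so they usually have to be handled explicitly. A small sketch with made-up vectors, using only base R functions:

```r
x <- c(0.5, NA, 0.7)

mean(x)                               # NA: the missing value propagates
mean_no_na <- mean(x, na.rm = TRUE)   # drop NA before averaging

y <- c(1, NaN, NA, Inf)
na_flags     <- is.na(y)      # NaN also counts as missing here
nan_flags    <- is.nan(y)     # only the true NaN
finite_flags <- is.finite(y)  # excludes NA, NaN and Inf
```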
Named list elements are accessed as listname$varname ; Otherwise by index only.
x <- 1:10
x <- as.list(x)
length(x)
[1] 10
What is the class of x[1]?
What about x[[1]]?
Elements of a list can be named (i.e. lists can have the names attribute)
xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
$a
[1] "Karthik Ram"
$b
[1] 1 2 3 4 5 6 7 8 9 10
$data
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
names(xlist)
[1] "a" "b" "data"
What is the length of this object? What about its structure?
Lists are extremely useful inside functions. Because a function in R can return only a single object, return a named list when you need to hand back several values.
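As an illustration of returning several values at once, here is a minimal sketch. The helper name `describe` and its fields are made up for this example:

```r
# Hypothetical helper: bundles several results into one named list
describe <- function(v) {
  list(n = length(v), mean = mean(v), range = range(v))
}

res <- describe(c(10, 20, 30))
res$n            # access by name with $
res[["range"]]   # or with double brackets
```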
Important Note:
> a1 = list('one', 'two' , 'three')
> a2 = list(1, 2, 3)
> v = c(5, 6, 7)
> v
[1] 5 6 7 # Entire vector is printed [1] specifies starting index. Compact output.
> v[1] # Any single number is a vector of size 1.
[1] 5
> a1
[[1]] # Each element printed in separate line. Very verbose output for list !!!
[1] "one"
[[2]]
[1] "two"
[[3]]
[1] "three"
> str(a1)
List of 3
$ : chr "one"
$ : chr "two"
$ : chr "three"
> a2[1] # Single bracket for list acts like a 'sublist' operation. It returns list !!!
[[1]]
[1] 1
> a1
[[1]]
[1] "one"
[[2]]
[1] "two"
[[3]]
[1] "three"
> a2[[1]] # Double index returns first element in list.
[1] 1
> ll = list(list(5, 6), list(7, 8))
> ll
[[1]] # This refers to list(5, 6). Note double index !!!
[[1]][[1]] # This refers to 5
[1] 5
[[1]][[2]]
[1] 6
[[2]]
[[2]][[1]]
[1] 7
[[2]][[2]]
[1] 8
> str(ll, max.level = 1) # Max nesting is 1 level.
List of 2
$ :List of 2
$ :List of 2
A custom function to print list :
printList <- function(list) {
  for (item in seq_along(list)) { # seq_along is safe for empty lists; 1:length(list) is not
    print(head(list[[item]]))
  }
}
# Convert named list to vector. Recursively flattens lists of lists as well.
my_vector = unlist(myList, use.names=FALSE) # Best performance
my_vector = rapply(myList, c) # Recursive apply of functions to return vector. Slow.
my_vector = sapply(myList, function(x) x) # Works and does the job. Slowest
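To see what the recursive flattening does to names, consider this small sketch. `myList` here is a made-up nested list, not data from these notes:

```r
myList <- list(a = 1, b = list(c = 2, d = 3))   # hypothetical nested list

named <- unlist(myList)                     # names are joined with dots: a, b.c, b.d
flat  <- unlist(myList, use.names = FALSE)  # plain values, names dropped
```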
TODO: Cleanup
Objects can have attributes. Attributes are part of the object. Every object reports a class() and a typeof(), even when no attributes have been set explicitly.
There are additional optional special attributes that are supported by some objects, such as names :
> a = list(3, 4)
> attributes(a)
NULL # No attributes defined by default.
> names(a) = c('first', 'second') # Can assign names attribute like this!!!
> attributes(a)
$names
[1] "first" "second"
User defined attributes can be set like below:
> attr(baskets.team,'season') <- '2010-2011'
> attr(baskets.team,'season')
[1] "2010-2011"
# You can delete attributes again by setting their value to NULL, like this:
> attr(baskets.team,'season') <- NULL
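The snippet above does not define `baskets.team`. A self-contained version, assuming a made-up numeric vector under that name, might look like this:

```r
baskets.team <- c(12, 4, 5, 6, 9, 3)   # hypothetical data

attr(baskets.team, "season") <- "2010-2011"   # set a user-defined attribute
season <- attr(baskets.team, "season")        # read it back
attributes(baskets.team)                      # lists $season

attr(baskets.team, "season") <- NULL          # delete the attribute again
```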
In R matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions; the number of rows and columns.
The vector elements are filled column-wise to create the matrix. length(my_matrix) is the total number of elements (rows x columns), not the number of columns; use ncol() for that.
> m = matrix(1:6, nrow = 2, ncol = 3)
> m
[,1] [,2] [,3] # Note values filled columnwise !!!
[1,] 1 3 5
[2,] 2 4 6
> as.vector(m) # Convert matrix back to vector !!!
[1] 1 2 3 4 5 6
> as.vector(t(m)) # If you want vector along rows ...
[1] 1 3 5 2 4 6
> attributes(m) # Matrix is a vector with dim attribute !!!
$dim
[1] 2 3
> class(m) # class function reveals higher level type.
[1] "matrix" # (R 4.0.0 and later print "matrix" "array".) It does not mean it has 'class' attribute like data.frame.
> dim(m)
[1] 2 3
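Because a matrix is only a vector with a dim attribute, the usual vector functions still apply to it. A short sketch to check this:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

n_elements <- length(m)   # counts all elements, i.e. nrow(m) * ncol(m)

# Building the same matrix by attaching dimensions to a plain vector
v <- 1:6
dim(v) <- c(2, 3)
same <- identical(v, m)
```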
# The dimensions can have names. On reading csv file, the row names are 'line numbers',
# the column names are the header columns.
> x <- read.csv("matrix.csv",header=T,sep="\t")
> dimnames(x)
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" # These are line numbers used as 'row names'
[[2]] # These are column names.
[1] "Subtype" "Expression" "Quality" "Height"
# Now you can assign your own names for dimensions.
> m = matrix(1:6, nrow = 2, ncol = 3)
> dimnames(m) = list( c('row1', 'row2'), c('col1', 'col2', 'col3') )
> m
col1 col2 col3
row1 1 3 5
row2 2 4 6
m <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.
Another way is to bind columns or rows using cbind() and rbind().
x <- 1:3
y <- 10:12
cbind(x, y)
x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
[,1] [,2] [,3]
x 1 2 3
y 10 11 12
You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:
mdat <- matrix(c(1, 2, 3, 11, 12, 13),
nrow = 2,
ncol = 3,
byrow = TRUE)
mdat
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 11 12 13
mdat[2, 3] # Unlike list, you Can access matrix using single brackets !!!!
[1] 13
> m = matrix(1:9, byrow = TRUE, nrow = 3)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> m2 = replicate(2, m)
> m2
, , 1
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
, , 2
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> class(m2) # 3 dimensional matrix is an array !!!!
[1] "array"
> typeof(m2)
[1] "integer"
> str(m2)
int [1:3, 1:3, 1:2] 1 4 7 2 5 8 3 6 9 1 ...
> m2[2, 2,]
[1] 5 5
A data frame is the de facto data structure for most tabular data. It is basically a named list with a special attribute class = 'data.frame'. It is a list of column vectors, so each element must have the same length.
df=data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4))
attributes(df)
my_list = unclass(df) # This removes the 'class' attribute and gives you underlying list.
A data frame is a special type of list where every element of the list has same length
Data frames can have additional attributes such as rownames() but it is optional. Used for annotating data, like sample_id. But most of the time they are not used.
Usually created by read.csv() and read.table(), etc.
Assuming all columns in a data frame are of the same type, the data frame can be converted to a matrix with data.matrix() or as.matrix(). Otherwise type coercion is applied and the results may not always be what you expect.
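The coercion is easiest to see on a data frame with mixed column types. A minimal sketch with a made-up data frame:

```r
df <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5), g = c("a", "b", "a"))

num_m <- as.matrix(df[, c("x", "y")])   # numeric columns only: numeric matrix
chr_m <- as.matrix(df)                  # one character column turns everything into text
dm    <- data.matrix(df)                # character column replaced by integer codes

typeof(num_m)
typeof(chr_m)
dm[, "g"]
```

`as.matrix` falls back to the most general type, while `data.matrix` always yields numbers, so the codes in `g` are only labels.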
Can also create a new data frame with data.frame() function.
Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.
To create data frames by hand :
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
Useful Data Frame Functions :
head() - shows first 6 rows
tail() - shows last 6 rows
dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
nrow() - number of rows
ncol() - number of columns
str() - structure of data frame - name, type and preview of data in each column
names() - shows the names attribute for a data frame, which gives the column names.
sapply(dataframe, class) - shows the class of each column in the data frame
See that it is actually a special list:
> is.list(dat)
[1] TRUE
> class(dat)
[1] "data.frame"
Indexing an element in a data frame is as simple as in a matrix. Note that indexing a list involves double brackets, which is uglier. Using the dat data frame created above :
> dat[1, 3]
[1] 11
As data frames are also lists, it is possible to refer to columns using the list notation, i.e. either double square brackets or a $ :
> dat[["y"]]
[1] 11 12 13 14 15 16 17 18 19 20
> dat$y
[1] 11 12 13 14 15 16 17 18 19 20
The following table summarizes the one-dimensional, two-dimensional and N-dimensional data structures :
Dimensions   Homogeneous      Heterogeneous
1-D          atomic vector    list
2-D          matrix           data frame
N-D          array            list
Lists can contain multi-dimensional elements like matrix.
What type of structure do you expect to see when you explore the structure of the iris data frame? Hint: Use str().
> str(iris) # iris is a built-in dataframe
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
For a slightly more complex problem, use which() to tell sum() which rows to add up. If DF is the data frame:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 97 267 6.3 92 7 8
3 97 272 5.7 92 7 9
Example: sum the values of Solar.R (column 2) where Ozone (column 1) > 30 AND Temp (column 4) > 90
sum(DF[which(DF[,1]>30 & DF[,4]>90),2]) # Sum column 2 (across rows) where col1 > 30 and col4 > 90
sum( df$columnA < NUMBER ) # Total instances where val < NUMBER
dat <- as.data.frame(matrix(1:36,6,6))
colnames(dat) <- paste0("Col", LETTERS[1:6])
dat$ColA
# [1] 1 2 3 4 5 6
dat$ColA < 3
# [1] TRUE TRUE FALSE FALSE FALSE FALSE
sum(dat$ColA < 3)
# [1] 2
Let us start with something very simple :
> a = 1:5
[1] 1 2 3 4 5
> a > 3
[1] FALSE FALSE FALSE TRUE TRUE
> a [ a %% 2 == 0 ]
[1] 2 4
> a[ a > 3 ]
[1] 4 5
Let us move on to 2 dimensional vector, i.e. matrix:
> m = matrix(1:9, nrow=3) # By default, we fill by column.
> m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Column-wise filling is inherent in data frames as well. You might expect row-wise order to be more natural, but it is not the case. When you convert the matrix back to a vector, the same column-wise order is restored :
> as.vector(m) # Always operates column-wise.
[1] 1 2 3 4 5 6 7 8 9
> as.vector(t(m)) # Use this trick, if you want row wise conversion.
[1] 1 4 7 2 5 8 3 6 9
We can create matrix row-wise :
> m = matrix(1:9, nrow=3, byrow = TRUE) # You can also fill by row.
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
What happens when your sequence size does not exactly match the matrix size? You might expect the sequence to be truncated to fit the matrix. It is not: the sequence is recycled so that the matrix accommodates every element of the sequence at least once !!! :
> m = matrix(1:9, ncol=2)
Warning message:
In matrix(1:9, ncol = 2) :
data length [9] is not a sub-multiple or multiple of the number of rows [5]
> m
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 1
Note that the above matrix is 5x2, not 4x2. Creating a logical matrix is as simple as what we did for a vector. Going back to the 3x3 column-filled matrix :
> m = matrix(1:9, nrow=3)
> m > 7
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE TRUE
Transforming a matrix to a vector is easy using this logical matrix ... :
> m[ m > 7 ]
[1] 8 9
Again, the result is ordered as if it was applied on consolidated single column-wise vector. Selecting single rows and columns behave as you would expect:
> m[1,] # [1] 1 4 7
> m[2,] # [1] 2 5 8
> m[3,] # [1] 3 6 9
When selecting a single column you may expect a 3 x 1 matrix, but you actually get a simple vector :
> m[,1] # [1] 1 2 3
> m[,2] # [1] 4 5 6
> m[,3] # [1] 7 8 9
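If you do want the result to stay a matrix, base R's `[` accepts a `drop = FALSE` argument. A short sketch on the same 3x3 column-filled matrix:

```r
m <- matrix(1:9, nrow = 3)

col_vec <- m[, 1]                 # dimensions dropped: plain vector
col_mat <- m[, 1, drop = FALSE]   # dimensions kept: 3 x 1 matrix

dim(col_vec)
dim(col_mat)
```

This matters in functions that index with a variable number of columns, since code written for matrices can break when it suddenly receives a vector.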
Sub-select of matrix would behave as you would expect when both nrow > 1 and ncol > 1. It will maintain the matrix dimensions as you would expect. The backing internal vector for the matrix is still stored column-wise behind the scenes :
> m[c(1,2),c(1,2)]
[,1] [,2]
[1,] 1 4
[2,] 2 5
> m[c(1,3),c(2,3)]
[,1] [,2]
[1,] 4 7
[2,] 6 9
> as.vector(m[c(1,3),c(2,3)]) # 4 6 7 9
Let us move on to array which can be any dimensional ... let us choose 3-D :
> a = array(1:8, dim = c(2, 2, 2))
> a
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
The above data illustrates how the sequence is filled with the first dimension varying fastest and the last dimension varying slowest :
c(1,1,1), c(2,1,1), x varies as (y, z) is fixed.
c(1,2,1), c(2,2,1), x varies as (y+1, z) is fixed.
c(1,1,2), c(2,1,2), x varies as (y, z+1) is fixed.
c(1,2,2), c(2,2,2), x varies as (y+1, z+1) is fixed.
You can imagine the index as LSB appearing first and MSB appearing last. The logical operation yields results as you would expect :
a > 4
, , 1
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
, , 2
[,1] [,2]
[1,] TRUE TRUE
[2,] TRUE TRUE
> a[ a > 4 ] # Filtering an N-dimensional array with a logical array always yields a plain vector.
[1] 5 6 7 8
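The storage order of the 2x2x2 array can be verified by computing the position of an element in the underlying vector by hand. This is a sketch; the helper `linear_index` is made up for illustration:

```r
a <- array(1:8, dim = c(2, 2, 2))
d <- dim(a)

# Position of element (i, j, k) in the backing vector: the first index moves fastest
linear_index <- function(i, j, k) {
  i + (j - 1) * d[1] + (k - 1) * d[1] * d[2]
}

by_subscript <- a[2, 1, 2]
by_position  <- as.vector(a)[linear_index(2, 1, 2)]
```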
Subsetting keeps a dim structure as long as at least two dimensions of the result have size > 1 :
> a[1,,]
[,1] [,2]
[1,] 1 5
[2,] 3 7
> a[2,,]
[,1] [,2]
[1,] 2 6
[2,] 4 8
> a[,1,]
[,1] [,2]
[1,] 1 5
[2,] 2 6
Note: If the result is one dimensional, the dim structure is not preserved :
> a[1,,1]
[1] 1 3 # Simple vector, no dimension attached !!!
You can conditionally assign values like below :
> a [ a > 4 ] = 0 # Array structure stays intact; only values > 4 are set to 0.
Let us take a look at data frame now ... :
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Selecting rows based on known row numbers is straight forward :
> iris[1:5,]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Selecting columns is also straight forward :
> iris[1:5, c(1,2,5)]
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
Note: Usually you select rows of a data frame based on certain conditions. An expression like iris[ iris$Sepal.Length > 5 ] does not work, because a single index without a comma selects columns; write iris[ iris$Sepal.Length > 5, ] instead. The dplyr filter command is a convenient alternative for such things :
> filter(iris, Sepal.Length > 7.5 )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.7 3.8 6.7 2.2 virginica
3 7.7 2.6 6.9 2.3 virginica
4 7.7 2.8 6.7 2.0 virginica
5 7.9 3.8 6.4 2.0 virginica
6 7.7 3.0 6.1 2.3 virginica
You can sub-select few columns based on multiple conditions :
filter(iris[,c(1,2,5)], Sepal.Length > 7.5, Sepal.Width < 3.0,
(Species == "versicolor" | Species == "virginica") )
Sepal.Length Sepal.Width Species
1 7.7 2.6 virginica
2 7.7 2.8 virginica
Note: dplyr::filter(df, cond1, cond2, cond4 | cond5); by default, comma-separated conditions are joined by 'AND'.
The data frame passed to filter must contain every column that a condition refers to, e.g. the Species column must be among the selected columns above.
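The same row selections can be written in base R without dplyr. A sketch on the built-in iris data; note the comma after the row condition:

```r
# Rows where Sepal.Length > 7.5, all columns
big <- iris[iris$Sepal.Length > 7.5, ]

# subset() evaluates the condition inside the data frame
big2 <- subset(iris, Sepal.Length > 7.5)

# Several conditions joined with &, plus a column selection
picked <- iris[iris$Sepal.Length > 7.5 & iris$Sepal.Width < 3.0, c(1, 2, 5)]
```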
Let us take a look at lists now ... :
> l = list(1:5) # [[1]] [1] 1 2 3 4 5
> length(l) # [1] 1
Didn't you expect a list of 5 elements ? If you need that, then you should use as.list() :
> l = as.list(1:5) ; length(l) # [1] 5
To produce logical vector from list, you can use expression like l > 3 :
> l > 3
[1] FALSE FALSE FALSE TRUE TRUE
> l [ l > 3 ] # produces list(c(4), c(5))
Since list can contain non-homogeneous types, such operations can yield NAs :
> l = as.list(1:5)
> l[4] = 'hello' # list(1, 2, 3, 'hello', 5)
> l > 3 # list(FALSE, FALSE, FALSE, NA, TRUE)
# Warning message: NAs introduced by coercion
> l [ l > 3 ] # list(NULL, 5)
How do you subset a list of lists ? Let us say select all lists of length > 3. :
# Most popular method.
v = sapply(l, function(x) length(x) > 3)
my_subset = l[v]
If the only purpose is to subset the list, better method is to use Filter directly :
my_subset = Filter(function(x) length(x) > 3, l)
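Both approaches give the same result, which a small made-up list of vectors shows:

```r
l <- list(1:2, 1:5, letters[1:4], 7)   # hypothetical list of vectors

keep      <- sapply(l, function(x) length(x) > 3)   # logical mask, one entry per element
by_mask   <- l[keep]
by_filter <- Filter(function(x) length(x) > 3, l)

identical(by_mask, by_filter)
```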
The list with named index can also be referred using position :
> l = list()
> l["one"] = 1 ; l["two"] = 2 ; l[3] = 3
> l[1] # 1 ; l["one"] is also same.
> str(l)
List of 3
$ one: num 1 # Index named as 'one' for position 1.
$ two: num 2 # Index named as 'two' for position 2.
$ : num 3 # No Index name for position 3.
data.table is an improved version of data frame in terms of usability, with built-in support for aggregation and group-by operations.
data.table doesn’t set or use row names, ever.
Never coerces string to factors by default.
The syntax follows this structure :
DT[i, j, by]
## R: i j by
## SQL: where | order by select | update group by
You can subset a data table down to one column using this syntax: flights[, list(arr_delay)]. If you drop the list() and write flights[, arr_delay], you get a plain vector, not a data table. Note that arr_delay is a column in the flights data table.
The . is an alias for list. .(2, 3) and list(2, 3) are same.
Compute or do in j :
# How many trips have had total delay < 0?
ans <- flights[, sum( (arr_delay + dep_delay) < 0 )] # ans [1] 141814
Subset in i and do in j :
ans <- flights[origin == "JFK" & month == 6L,
.(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
ans
# m_arr m_dep
# 1: 5.839349 9.807884
The .N is the number of rows in the current group (or in the current subset, when there is no grouping) :
ans <- flights[origin == "JFK" & month == 6L, .N]
ans
# [1] 8422
Grouping using by :
– How can we get the number of trips corresponding to each origin airport?
ans <- flights[, .(.N), by = .(origin)]
ans
# origin N
# 1: JFK 81483
# 2: LGA 84433
# 3: EWR 87400
## or equivalently using a character vector in 'by'
# ans <- flights[, .(.N), by = "origin"]
m = matrix(1:9, byrow = TRUE, nrow = 3)
# 1:9 which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
rownames(my_matrix) <- row_names_vector
colnames(my_matrix) <- col_names_vector
> my_matrix
age experience rating
Raja 1 2 3
Rani 4 5 6
manthiri 7 8 9
rownames(my_matrix) <- c("Raja", "Rani", "manthiri")
colnames(my_matrix) <- c("age", "experience", "rating")
> rowSums(my_matrix)
Raja Rani manthiri
6 15 24
# Extend matrixes by adding columns:
big_matrix <- cbind(matrix1, matrix2, vector1 ...)
a_var = my_matrix[1,2]
sub_matrix = my_matrix[1:3,2:4] # with rows 1, 2, 3 and columns 2, 3, 4.
first_column = my_matrix[,1]
first_row = my_matrix[1,]
sex_vector <- c("Male","Female","Female","Male","Male")
factor_sex_vector <- factor(sex_vector) # Efficiently encodes categorical values
factor_sex_vector
[1] Male Female Female Male Male
Levels: Female Male
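Under the hood a factor stores integer codes plus a vector of level labels. A short sketch that inspects both; the explicit `levels` ordering at the end is an added illustration:

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
f <- factor(sex_vector)

lv     <- levels(f)       # labels, sorted alphabetically by default
codes  <- as.integer(f)   # position of each value within the levels
counts <- table(f)        # frequency per level

# Levels can be given explicitly to control their order
f2 <- factor(sex_vector, levels = c("Male", "Female"))
```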
# Display built-in data frame example.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
...
> typeof(my_matrix)
[1] "integer"
> class(my_matrix)
[1] "matrix" # "matrix" "array" in R 4.0.0 and later
> typeof(mtcars) # mtcars is built-in example
[1] "list"
> class(mtcars)
[1] "data.frame"
> head(mtcars) # or tail()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> str(mtcars) # Show structure of your dataset
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
"Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
> summary(planets_df)
name type diameter rotation
Earth :1 Gas giant :4 Min. : 0.3820 Min. :-243.0200
Jupiter:1 Terrestrial planet:4 1st Qu.: 0.8448 1st Qu.: 0.1275
Mars :1 Median : 2.4415 Median : 0.5500
Mercury:1 Mean : 3.9264 Mean : -22.6950
Neptune:1 3rd Qu.: 5.3675 3rd Qu.: 1.0075
Saturn :1 Max. :11.2090 Max. : 58.6400
(Other):2
rings
Mode :logical
FALSE:4
TRUE :4
# Select first 5 values of diameter column
> planets_df[1:5, "diameter"]
[1] 0.382 0.949 1.000 0.532 11.209
# These are all same
planets_df[,3]
planets_df[,"diameter"]
planets_df$diameter
> planets_df[planets_df$rings, ] # Selects all columns for planets with rings.
> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3
> a[order(a)]
[1] 10 100 1000
# Named lists are more powerful than lists.
my_list <- list(name1 = your_comp1, name2 = your_comp2)
my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")
# Accessing elements from vector use single bracket: e.g. [1], for list use: [[1]]
shining_list[[1]] # All these are same !!!
shining_list[["reviews"]]
shining_list$reviews
See:
For list, the single bracket [ is for subsetting and [[ is for element access.
For vector, [ is for element access, [[ is for "pure" element in the sense, drop names and dimnames attributes.
R has three basic indexing operators, single [, double [[ and dollar prefix $ :
x[i] - For vector, gets ith element. For list, gets a sublist of length 1, containing ith element.
- !!! For matrix and multi-dimensional arrays, it does not give you sub-matrix !!!
- !!! e.g. For matrix of 3x3 filled with 1:9, m[i] == i i.e. single element value !!!
- x[a_vector] extracts (only) vector from vector or matrix or array.
- Note that result vector has same length as index vector even when source is matrix or array.
- a_list[a_vector] extracts a sub list from list.
- For data frame, df[1] extracts a sub data frame with only first column.
- For data frame, df[c(1,3)] extracts a sub data frame with first and 3rd columns.
x[i, j] - You can only use this notation with Matrix or 2-D array or data frame.
- For matrix this gives you (i, j)th element.
- For vector, list, or 3+ dimensional array, you get error: incorrect number of dimensions.
- For data frame, df[1, 2] selects element at 1st row and second column.
- Note: df[1,2] != df[1][2] (df[1] is single column, [2] attempts to get non-existent second column)
- Note: df[1,2] != df[c(1,2)] (rhs refers to sub-data frame with first and second columns)
x[[i]] - For vector, matrix and arrays, v[i] is "almost" same as v[[i]] and always returns single element.
- If vector indices are named, then v[[i]] drops names and dimnames if present.
- a_vector[[another_vector]] is not allowed.
- For list it accesses ith element instead of a sublist containing single element.
- a_list[[another_vector]] is allowed and used to find element in a deeply nested list.
- e.g. a_list[[ c(2,3) ]] means: pick second element of a_list, then 3rd element of that.
- The df[[1]] returns first column values as plain vector.
- Note that we saw df[1] returns single column data frame.
x[[i, j]] - It behaves exactly same like x[i, j] for 2-d arrays and data frame.
x$a - Lists support this syntax using named indices.
- Vectors and matrices support named indices, but do not support this syntax.
- The data frame supports a_dataframe$col_name syntax.
x$"a" - Same behaviour as x$a
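The rules above can be checked on small objects. A sketch with made-up data covering a named vector, a list, a data frame and a nested list:

```r
v  <- c(a = 1, b = 2)
l  <- list(a = 1, b = "two")
df <- data.frame(x = 1:3, y = c("p", "q", "r"))

v["a"]       # keeps the name
v[["a"]]     # bare value, name dropped

l[1]         # sublist of length 1
l[[1]]       # the element itself
l$b          # element by name

df[1]        # one-column data frame
df[[1]]      # the column as a plain vector
df[2, 1]     # single cell: row 2, column 1

nested <- list(1, list("p", "q", "r"))
deep <- nested[[c(2, 3)]]   # second element, then its third element
```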
write.csv(MyData_Frame, file = "MyData.csv")
write.csv(MyData, file = "MyData.csv",row.names=FALSE, na="")
write.table(MyData, file = "MyData.csv",row.names=FALSE, na="",col.names=FALSE, sep=",")
d = read.csv( file = filename,
stringsAsFactors = FALSE,
header = TRUE, # the argument is 'header', not 'headers'
strip.white = TRUE, sep = ',')
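A quick way to check write and read options together is a round trip through a temporary file. A minimal sketch with a made-up data frame:

```r
df <- data.frame(id = 1:3, weight = c(20, NA, 24))   # hypothetical data

path <- tempfile(fileext = ".csv")
write.csv(df, file = path, row.names = FALSE, na = "")   # NA written as an empty field

back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)
unlink(path)   # remove the temporary file

str(back)
```

Empty fields in a numeric column are read back as NA, so the missing weight survives the round trip.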
# A sample data frame
data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')
# Reorder by column number
data[c(1,3,2)]
#> id size weight
#> 1 1 small 20
#> 2 2 large 27
#> 3 3 medium 24
# To actually change `data`, you need to save it back into `data`:
# data <- data[c(1,3,2)]
# Reorder by column name
data[c("size", "id", "weight")]
#> size id weight
#> 1 small 1 20
#> 2 large 2 27
#> 3 medium 3 24
> library("foreign")
The following table lists the functions to import data from SPSS, Stata, and SAS.
Function     What It Does              Example
read.spss    Reads SPSS data file      read.spss("myfile")
read.dta     Reads Stata binary file   read.dta("myfile")
read.xport   Reads SAS export file     read.xport("myfile")
NULL represents NULL object
NA represents missing values.
A vector cannot contain NULL. If you put one in, it is silently dropped.
Assigning NULL to a list element deletes that element from the list (note this idiom).
NA is a logical constant of length 1, which contains a missing value indicator.
There are also constants :
NA_integer_, NA_real_, NA_complex_ and NA_character_ etc.
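These differences between NULL and NA can be seen directly. A short sketch with made-up values:

```r
v <- c(1, NULL, 3)   # NULL is dropped: length 2
w <- c(1, NA, 3)     # NA keeps its slot: length 3

l <- list(a = 1, b = 2, c = 3)
l$b <- NULL            # deletes element b
l["c"] <- list(NULL)   # keeps element c, now holding NULL

typeof(NA)             # plain NA is logical
typeof(NA_character_)  # typed variants exist for other vector types
```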
Insert calls to browser() in your source code to debug the program in rstudio.
# Debugging
# Allow to enter into debug mode after error
> options(error=recover)
> options(error=browser)
> debug(func_name) # Breaks you later inside the function.
> undebug(func_name) # Unset break point
> n # next
> c # continue
> trace(func_name, edit = TRUE) # Allows you to edit in rstudio. without changing original code
> browser() # Gives you stack frame to jump around
> stopifnot(a < 1) # Insert this in code to break into debugging mode.
> trace() #
> rm(list=ls()) # Clean up all global variables. Start fresh.
> where # show call stack.
> traceback() # show call stack with detailed traces of args.
> sink('/tmp/out.log') # Redirect output to a file.
> # rstudio specific commands
> debugSource("file2.R")
> body(print.Date) # Displays function definition.
> func_defn_string = capture.output(print(body(print.Date)))
> getAnywhere(t.ts)
A single object matching ‘t.ts’ was found
It was found in the following places
registered S3 method for t from namespace stats
namespace:stats
with value
function (x)
{
cl <- oldClass(x)
other <- !(cl %in% c("ts", "mts"))
class(x) <- if (any(other))
cl[other]
attr(x, "tsp") <- NULL
t(x)
}
<bytecode: 0x55d33b68a800>
<environment: namespace:stats>
> methods(myvar)
> showMethods(funcname)
> require(raster)
> showMethods(extract)
Function: extract (package raster)
x="Raster", y="data.frame"
x="Raster", y="Extent"
x="Raster", y="matrix"
x="Raster", y="SpatialLines"
x="Raster", y="SpatialPoints"
x="Raster", y="SpatialPolygons"
x="Raster", y="vector"
# To see the source code for one of these methods the entire signature must be supplied, e.g.
> getMethod("extract" , signature = c( x = "Raster" , y = "SpatialPolygons"))
# Cheat Sheet See https://www.dummies.com/programming/r/r-for-dummies-cheat-sheet/
To search through the Help files, you’ll use one of the following functions :
?data.frame # Displays the Help file for a specific function.
??regression # Searches for a word (or pattern) in the function or Help files.
RSiteSearch('linear models') # Performs an online search of R help pages and documentation.
install.packages("sos")
library("sos")
findFn("regression") # Type this into your R console to get a web page with the details.
Best way to extract function declaration into another file :
# if we want the source for randomForest::rfcv():
# To view/edit it in a pop-up window:
> edit(getAnywhere('rfcv'), file='source_rfcv.r')
# To redirect to a separate file:
> capture.output(getAnywhere('rfcv'), file='source_rfcv.r')
> new_optim <- edit(optim)
# It will open the source code of optim edit it and assign result to new_optim.
# if you do not want the annoying long source code printed on your console, you can use
> invisible(edit(optim))
> edit(my_matrix) # you can also edit data !!!
> View(my_func) # Works in R Studio
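Use <<- to assign to a variable outside the current function. It searches the enclosing environments for an existing variable of that name and modifies the first one it finds; only if none is found does it create the variable in the global environment. A plain <- always assigns a function-local variable. In the example below, baz() modifies the bar of foo(), so the global bar is unchanged :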
bar <- "global"
foo <- function(){
bar <- "in foo"
baz <- function(){
bar <- "in baz - before <<-"
bar <<- "in baz - after <<-"
print(bar)
}
print(bar)
baz()
print(bar)
}
> bar
[1] "global"
> foo()
[1] "in foo"
[1] "in baz - before <<-"
[1] "in baz - after <<-"
> bar
[1] "global"
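A common use of <<- is a closure that keeps private state between calls. This is a minimal sketch; make_counter() is a hypothetical name used only for illustration:

```r
# make_counter() is a hypothetical helper, not part of any package.
make_counter <- function() {
  count <- 0                 # lives in the environment of this call
  function() {
    count <<- count + 1      # updates 'count' in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()   # 1
counter()   # 2
```

Each call to make_counter() gets its own count, so two counters do not interfere with each other.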
There are three ways to save objects from your R session:
Saving all objects in your R session:
The save.image() function will save all objects currently in your R session:
save.image(file="1.RData")
These objects can then be loaded back into a new R session using the load() function:
load(file="1.RData")
Saving some objects in your R session:
If you want to save some, but not all objects, you can use the save() function:
save(city, country, file="1.RData")
Again, these can be reloaded into another R session using the load() function:
load(file="1.RData")
Saving a single object:
If you want to save a single object you can use the saveRDS() function:
saveRDS(city, file="city.rds")
saveRDS(country, file="country.rds")
You can load these into your R session using the readRDS() function, but you will need to assign the result to the desired variable:
city <- readRDS("city.rds")
country <- readRDS("country.rds")
But this also means you can give these objects new variable names if needed (i.e. if those variables already exist in your new R session but contain different objects):
city_list <- readRDS("city.rds")
country_vector <- readRDS("country.rds")
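The round trip can be checked with a temporary file. A small sketch, where the contents of city are made-up example data:

```r
# 'city' holds made-up example data for illustration.
city <- c("Sydney", "Perth")
rds_path <- tempfile(fileext = ".rds")

saveRDS(city, file = rds_path)
city_copy <- readRDS(rds_path)   # the name used on reload is your choice

identical(city, city_copy)       # TRUE
unlink(rds_path)                 # clean up the temporary file
```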
install.packages("rio")
library("rio")
install_formats()
export(iris, 'iris.csv') # Data frame to csv
iris_df = import('iris.csv') # csv to Data frame
# export to sheets of an Excel workbook
export(list(mtcars = mtcars, iris = iris), "multi.xlsx")
multi_df = import_list('multi.xlsx') # Named list of data frames, one per sheet. import() alone reads a single sheet.
If you want finer control over reading Excel files, use readxl :
library(readxl)
read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max))
read_xls(...)
read_xlsx(...)
library(writexl)
x = list(mtcars = mtcars, iris = iris)
write_xlsx(x, path = "writexlsx-cars-iris.xlsx", col_names = TRUE)
# Reading a single sheet (from the workbook written above)
carsdf = read_excel("writexlsx-cars-iris.xlsx", sheet = 1)
irisdf = read_excel("writexlsx-cars-iris.xlsx", sheet = 2)
# Reading multiple sheets
sheets = ( excel_sheets( filename ) ) # vector of names.
total_sheets = length(sheets)
# Limit the number of data rows read
read_excel(datasets, n_max = 3)
# Read from an Excel range using A1 or R1C1 notation
read_excel(datasets, range = "C1:E7")
read_excel(datasets, range = "R1C2:R2C5")
# Specify the sheet as part of the range
read_excel(datasets, range = "mtcars!B1:D5")
# Read only specific rows or columns
read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)
read_excel(datasets, range = cell_cols("B:D"))
# Get a preview of column names
this_col_names = names(read_excel(readxl_example("datasets.xlsx"), n_max = 0))
# Read all sheets from an Excel workbook.
path <- readxl_example("datasets.xls")
sheet_names = excel_sheets(path)
#> [1] "iris" "mtcars" "chickwts" "quakes"
df_list = lapply(sheet_names, read_excel, path = path)
names(df_list) = sheet_names # So that df_list$iris etc. work.
Option 1 :
install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
# str_trim or built-in trimws will work on character vector, but not sub df.
# df$V2 returns character vector.
df$clean2<-str_trim(df$V2)
# To change only one column in-place.
myData[,1] = trimws(myData[,1]) ## trimws is R built-in
myData[,1] = str_trim(myData[,1]) ## str_trim is from stringr
# Trim and To replace multiple spaces by single space inside string.
myData[,1] = stringr::str_squish(myData[,1])
Option 2 :
df2 = data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors=FALSE)
Option 3 :
# Bit verbose but it is easy to read and to the point ...
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
Option 4 :
# Use grepl -- grep logical
col4 = df[,4]
truth = grepl("[[:space:]]+$", col4)
# e.g. [ TRUE, FALSE, FALSE, ... ]
df = df[truth,] # Get all columns for selected rows by truth.
Option 5:
install.packages('sqldf')
library(sqldf)
# Select non-empty rows.
mydf = sqldf("select * from mydf where mycol != '' ")
long_iris = sqldf("select * from iris where `Sepal.Length` > 5.3")
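The trimming options above can also be written in base R without a loop. A minimal sketch on a made-up data frame (df and its columns are example data, not from these notes):

```r
# Made-up example data frame with padded strings.
df <- data.frame(id = 1:3,
                 name = c("  ann", "bob  ", " cy  d "),
                 stringsAsFactors = FALSE)

# Find the character columns, then trim only those, in place.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

df$name   # "ann" "bob" "cy  d"
```

trimws() removes only leading and trailing whitespace, so the inner spaces of the last value stay; stringr::str_squish() would collapse those too.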
A tibble is an enhanced data frame with class 'tbl_df'. Among other things, it prints better than a traditional data frame.
You can convert between the two using as_tibble() and as.data.frame(). (as_data_frame() is a deprecated alias of as_tibble().)
> t # t is matrix transpose function.
function (x)
UseMethod("t") # Means it is S3 generic method
<bytecode: 0x2332948>
<environment: namespace:base>
> with
standardGeneric for "with" defined from package "base" # Another type of method.
> methods(print) # Show Various print methods implemented by different objects.
> getMethod('print')
Error in getMethod("print") : no generic function found for 'print'
> methods(t)
[1] t.data.frame t.default t.ts* # Non-visible functions are asterisked
> getAnywhere(t.ts) # Get the method definition from pkgs
Try :
> methods(residuals)
# which lists, among others, "residuals.lm" and "residuals.glm".
# This means when you have fitted a linear model, m, and type residuals(m), residuals.lm will be called.
# When you have fitted a generalized linear model, residuals.glm will be called.
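The same dispatch mechanism can be tried on a toy generic. This is a sketch only; area() and the class "circle" are hypothetical names invented for the example:

```r
# area() is a hypothetical S3 generic; "circle" is a made-up class.
area <- function(shape, ...) UseMethod("area")
area.circle  <- function(shape, ...) pi * shape$r^2
area.default <- function(shape, ...) stop("don't know how to compute area")

s <- structure(list(r = 2), class = "circle")
area(s)          # dispatches to area.circle
methods(area)    # lists area.circle and area.default
```

UseMethod() looks at class(s) and calls the first matching area.<class> function, falling back to area.default.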
Most Important:
Usually you need a function definition to implement a 'delayed expression evaluation'. But you can use this and call your function like "f" instead of "f()" :
delayedAssign("x", {
for(i in 1:3)
cat("yippee!\n")
10
})
x^2 #- yippee is printed only 3 times and returns value 100.
The pryr operator %<d-% stands for delayed assignment. b %<d-% { Sys.sleep(1); 1 } is equivalent to :
delayedAssign("b", { Sys.sleep(1); 1 })
Note that the assignment is instantaneous but reading it executes code :
system.time( b %<d-% Sys.sleep(1) ) # instantaneous.
system.time( b ) # Takes 1 second
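The laziness can be observed in base R alone, without pryr. A small sketch, where lazy_value and evaluated are names made up for the example:

```r
# 'evaluated' records whether the delayed expression has run yet.
evaluated <- FALSE
delayedAssign("lazy_value", { evaluated <- TRUE; 42 })

evaluated     # FALSE: nothing has run yet
lazy_value    # 42: the first read forces the expression
evaluated     # TRUE
```

The expression runs once, on first access; later reads of lazy_value reuse the stored result.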
You can use substitute(), expression() etc. Understand the following example :
# NOT RUN {
require(graphics)
(s.e <- substitute(expression(a + b), list(a = 1))) #> expression(1 + b)
(s.s <- substitute( a + b, list(a = 1))) #> 1 + b
c(mode(s.e), typeof(s.e)) # "call", "language"
c(mode(s.s), typeof(s.s)) # (the same)
# but:
(e.s.e <- eval(s.e)) #> expression(1 + b)
c(mode(e.s.e), typeof(e.s.e)) # "expression", "expression"
substitute(x <- x + 1, list(x = 1)) # nonsense
myplot <- function(x, y)
plot(x, y, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)))
## Simple examples about lazy evaluation, etc:
f1 <- function(x, y = x) { x <- x + 1; y }
s1 <- function(x, y = substitute(x)) { x <- x + 1; y }
s2 <- function(x, y) { if(missing(y)) y <- substitute(x); x <- x + 1; y }
a <- 10
f1(a) # 11
s1(a) # 11
s2(a) # a
typeof(s2(a)) # "symbol"
# }
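Two everyday uses of this machinery are turning an argument into a label and evaluating a quoted expression later. A brief sketch; label_of() is a hypothetical helper:

```r
# label_of() is a hypothetical helper: returns its argument as typed by the caller.
label_of <- function(x) deparse(substitute(x))
label_of(mtcars$mpg)            # "mtcars$mpg"

# quote() captures an expression without evaluating it;
# eval() runs it later, here with values supplied in a list.
e <- quote(a + b)
eval(e, list(a = 1, b = 2))     # 3
```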
library(help=datasets) # Gives you help info of all built-in datasets.
Important datasets to note:
pressure          Vapor Pressure of Mercury as a Function of Temperature
USArrests         Violent Crime Rates by US State
rivers            Lengths of Major North American Rivers
AirPassengers     Monthly Airline Passenger Numbers 1949-1960
WWWusage          Internet Usage per Minute
WorldPhones       The World's Telephones
iris              Edgar Anderson's Iris Data
ability.cov       Ability and Intelligence Tests
beavers           Body Temperature Series of Two Beavers
cars              Speed and Stopping Distances of Cars
datasets-package  The R Datasets Package
mtcars            Motor Trend Car Road Tests
quakes            Locations of Earthquakes off Fiji
randu             Random Numbers from Congruential Generator RANDU
sleep             Student's Sleep Data
state             US State Facts and Figures
women             Average Heights and Weights for American Women
Built-in datasets do not need to be loaded, since the datasets package is attached by default, but the explicit syntax is :
data(iris) # Load dataset iris. Not required to load built-in datasets.
str(iris)
head(iris)
Another package, dslabs, is not built in but contains more realistic data and is useful for teaching :
install.packages("dslabs")
library("dslabs")
data(package="dslabs")
Description: Predict iris flower species from flower measurements.
Type: Multi-class classification
Dimensions: 150 instances, 5 attributes
Inputs: Numeric
Output: Categorical, 3 class labels
UCI Machine Learning Repository: Description
iris flowers datasets:
data(iris)
dim(iris)
levels(iris$Species)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Visualizing using ggplot2 :
library(ggplot2)
# Visualize 2 attributes ....
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) + geom_point()
# Note: ggplot() creates canvas, then geom_point() draws on top of that. !! Note the brackets !!
# Visualize 3 attributes to differentiate the species ...
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species)) + geom_point() + geom_line()
# Visualize 4 attributes to see how Petal.Width correlates. Use size of the dots.
# Note: the '+' must end a line; a line that starts with '+' is not a continuation.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species, size = Petal.Width)) +
geom_point() + geom_line(size=1)
# ggplot() alone draws no layer; add geom_point() for dots, geom_line() to connect the dots, etc.
# Add title.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species, size = Petal.Width)) +
geom_point() +
ggtitle('Iris Species by Petal and Sepal Length')
See Also: https://www.mailman.columbia.edu/sites/default/files/media/fdawg_ggplot2.html
How do you do that ? See https://stackoverflow.com/questions/23199416/5-dimensional-plot-in-r
cr <- cor(mtcars)
# This is to remove redundancy as upper correlation matrix == lower
cr[upper.tri(cr, diag=TRUE)] <- NA
reshape2::melt(cr, na.rm=TRUE, value.name="cor")
# get pairwise combination of variable names
vars <- t(combn(colnames(myMat), 2))
# build data.frame with matrix subsetting
data.frame(vars, myMat[vars])
X1 X2 myMat.vars.
1 V1 V2 0.8500071
2 V1 V3 -0.2828288
3 V1 V4 -0.2867921
4 V2 V3 -0.2698210
5 V2 V4 -0.2273411
6 V3 V4 0.9962044
You can add column names in one line as well using setNames.
setNames(data.frame(vars, myMat[vars]), c("var1", "var2", "corr"))
# Now sort by the absolute value of the 3rd column.
df = setNames(data.frame(vars, myMat[vars]), c("var1", "var2", "corr"))
cr_order = order(abs(df$corr))
df = df[cr_order, ]
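The steps above can be combined into one base-R pipeline that needs no extra packages. A sketch under the assumption that the strongest correlations should come first; pairs_df is a made-up name:

```r
cr <- cor(mtcars)
cr[upper.tri(cr, diag = TRUE)] <- NA      # keep each pair only once

# as.table() + as.data.frame() turns the matrix into long form.
pairs_df <- as.data.frame(as.table(cr), stringsAsFactors = FALSE)
names(pairs_df) <- c("var1", "var2", "corr")
pairs_df <- pairs_df[!is.na(pairs_df$corr), ]

# Strongest correlations (by absolute value) first.
pairs_df <- pairs_df[order(-abs(pairs_df$corr)), ]
head(pairs_df, 3)
```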
# How to generate data
set.seed(1234)
myMat <- cor(matrix(rnorm(16), 4, dimnames=list(paste0("V", 1:4), paste0("V", 1:4))))
myMat
V1 V2 V3 V4
V1 1.0000000 0.8500071 -0.2828288 -0.2867921
V2 0.8500071 1.0000000 -0.2698210 -0.2273411
V3 -0.2828288 -0.2698210 1.0000000 0.9962044
V4 -0.2867921 -0.2273411 0.9962044 1.0000000
Note: A more obvious, though somewhat manual, way to generate data:
> d <- data.frame(x1=rnorm(10),
+ x2=rnorm(10),
+ x3=rnorm(10))
> x <- cor(d) # get correlations (returns matrix)
Above we sorted a data frame on one column. To sort on multiple columns :
# sorting examples using the mtcars dataset
attach(mtcars)
# sort by mpg
newdata <- mtcars[order(mpg),]
# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]
#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]
detach(mtcars)
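The same sort can be done without attach()/detach(), which avoids leaving mtcars on the search path if the script stops halfway. A minimal sketch:

```r
# Sort by mpg (ascending), then cyl (descending), without attach().
by_mpg_cyl  <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

# with() gives the same short column names as attach(), but only inside the call.
by_mpg_cyl2 <- mtcars[with(mtcars, order(mpg, -cyl)), ]

identical(by_mpg_cyl, by_mpg_cyl2)   # TRUE
```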
# https://www.sharpsightlabs.com/blog/heatmap-ggplot2-simple/
library(ggplot2)
#------------------
# CREATE DATA FRAME
#------------------
df.team_data <- expand.grid(teams = c("Team A", "Team B", "Team C", "Team D")
,metrics = c("Metric 1", "Metric 2", "Metric 3", "Metric 4", "Metric 5")
)
# add variable: performance
set.seed(41)
df.team_data$performance <- rnorm(nrow(df.team_data))
#inspect
head(df.team_data)
#---------------------------
# PLOT: heatmap
# - here, we use geom_tile()
#---------------------------
ggplot(data = df.team_data, aes(x = metrics, y = teams)) +
geom_tile(aes(fill = performance))
# #############################################
# Load ggplot2 package for graphics/plotting
library(ggplot2)
# Create dummy dataset
df.test_data <- data.frame(x_var = 1:50 + rnorm(50,sd=15),
y_var = 1:50 + rnorm(50,sd=2)
)
# Plot data using ggplot2 ... scatter plot..
ggplot(data=df.test_data, aes(x=x_var, y=y_var)) +
geom_point()
Great for writing reproducible reports with inline graphics.
Follows the 'Literate Programming' approach of Knuth: treat a program as literature understandable to human beings.
RStudio has built-in support.
Cut & paste is easier with R Markdown; in that respect it is better than an R notebook.
knitr, combined with R Markdown, makes writing reproducible reports much easier.
In line R expressions :
# See https://bookdown.org/yihui/rmarkdown/r-code.html
```{r}
x = 5 # radius of a circle
```
For a circle with the radius `r x`, its area is `r pi * x^2`.
Note the inline use of r code using single back-tick in the line above.
Passing chunk options :
# You can pass chunk options ....
```{r, chunk-label, results='hide', fig.height=4}
```
Just display the code, but don't run it :
# Just have code verbatim here, and do not evaluate the code ...
```{r, eval=FALSE}
x = rnorm(100)
```
Setup some global options for all R chunks :
# Set up some global chunk options using knitr::opts_chunk$set()
# The include=FALSE will hide this specific block of code and its result in output doc.
# The option(s) fig.width = 8 is set for all r code chunks.
# The 'setup' just a label for this chunk.
```{r, setup, include=FALSE}
knitr::opts_chunk$set(fig.width = 8, collapse = TRUE)
```
Nicely display result in a table :
# To nicely display result in a table in output document ... use knitr::kable() ...
```{r my-table-demo}
knitr::kable(iris[1:5, ], caption = 'A caption')
```
Quick Options reference ... :
Option      Value         Effect
echo        FALSE         Hide the code in the final generated document.
results     FALSE         Hide results output in the document (same as 'hide').
include     FALSE         Hide code, results, warnings, msgs from this chunk.
eval        FALSE         Do not evaluate the code chunk.
cache       FALSE         Re-evaluate the code chunk on each run.
fig.width   8             Width of the figure (graphical device) in inches.
fig.height  4             Output figure height is 4 inches.
fig.dim     c(8, 4)       Width is 8; height is 4.
fig.cap     'My Caption'  Figure caption.
fig.align   'center'      Align figure 'left', 'center' or 'right'.
out.width   '80%'         Output size of R plots in the document; 80% of page width.
dev         'png'         Graphical device. Default for html is png;
                          for latex output, the device is usually pdf;
                          could also be svg or jpeg.
Note: output format i.e. html/pdf is not part of source doc.
Extra:
collapse    TRUE          Display text output in same block as code for compact output (default is FALSE).
warning     FALSE         Do not show warning messages in the output.
message     FALSE         Do not show info messages in the output.
error       FALSE         Halt on error; TRUE shows the error in the output and continues.
child       './doc2.rmd'  Include another markdown file here.
To include graphics from external image file ... :
```{r, out.width='25%', fig.align='center', fig.cap='...'}
knitr::include_graphics('images/hex-rmarkdown.png')
```
This code uses the dplyr package from the tidyverse set of packages to take the monthly mean of ozone in the airquality dataset in R. To understand it, you need to understand the concept of a data frame and of tidy data. :
library(dplyr)
group_by(airquality, Month) %>% # This pipe operator feeds lhs as the first arg of rhs.
summarize(o3 = mean(Ozone, na.rm = TRUE))
Using just the functionality that came with the base R system, you’d have to do something like :
aggregate(airquality[, "Ozone"],
list(Month = airquality[, "Month"]),
mean, na.rm = TRUE)
Note: na.rm is an argument to mean, not to aggregate; aggregate passes it through. The dplyr version is much more readable.
You can also use the tapply function to aggregate by group.
Suppose, for example, we have a sample of 30 tax accountants of Australian states:
> state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
"qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
"sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
"sa", "act", "nsw", "vic", "vic", "act")
A factor is created from this vector using the factor() function. Its levels are the sorted unique values; in the case of a character vector, “sorted” means sorted in alphabetical order.
> statef <- factor(state)
The print() function handles factors slightly differently from other objects:
> statef
[1] tas sa qld nsw nsw nt wa wa qld vic nsw vic qld qld sa
[16] tas sa nt wa vic qld nsw nsw wa sa act nsw vic vic act
Levels: act nsw nt qld sa tas vic wa
To find out the levels of a factor the function levels() can be used.
> levels(statef)
[1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"
Suppose we have the incomes of the same tax accountants as 'incomes' :
> incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
59, 46, 58, 43)
To calculate the sample mean income for each state we can now use the special function tapply():
> incmeans <- tapply(incomes, statef, mean)
> incmeans
act nsw nt qld sa tas vic wa
44.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250
The function tapply() is used to apply a function, here mean(), to each group of components
of the first argument, here incomes, defined by the levels of the second argument, here statef.
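tapply() can reproduce the monthly ozone means computed earlier with dplyr and aggregate(). A sketch using the built-in airquality dataset; the variable names are made up:

```r
# Mean ozone per month; na.rm = TRUE is passed through to mean().
ozone_by_month <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ozone_by_month

# Formula interface of aggregate(); rows with missing Ozone are dropped by default.
agg <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)

all.equal(as.numeric(ozone_by_month), agg$Ozone)   # TRUE
```

tapply() returns a named vector (one element per group), while aggregate() returns a data frame.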
Examine environment using search() and ls() commands :
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
> ls(1)
[1] "a" "a1" "b" "d" "df" # Show list of global variables.
> ls(2) # List exported symbols from package:stats
> detach("package:graphics") # You can detach from specific package.
> attach(.env) # Lets you direct access. e.g. s() vs .env$s().
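The attach/detach cycle can be tried with a scratch environment. A minimal sketch; env, s and the search-path name "my_env" are made up for the example:

```r
# Build a scratch environment holding one function.
env <- new.env()
assign("s", function() "hello from env", envir = env)

env$s()                           # explicit access through the environment

attach(env, name = "my_env")      # adds a copy of env to the search path
"my_env" %in% search()            # TRUE
s()                               # now reachable without the env$ prefix

detach("my_env")                  # remove it from the search path again
```

attach() copies the environment's contents, so objects added to env afterwards are not visible through the search path.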