Notes On R Programming

Installation

Install R from the CRAN website, then install RStudio (a separate download).

> install.packages('ggplot2', dependencies = TRUE)

Synopsis

# Variable Inspection

> a = 3
> ls()        # list global objects
[1]  "a"
> typeof(a)
[1] "double"
> stopifnot(a < 1)
Error: a < 1 is not TRUE
> class(a)                  # class gives the higher-level type
[1] "numeric"
> b = list(1, 10)


# Debugging
# Drop into an interactive frame browser whenever an error occurs
> options(error=recover)
> debug(func_name)          # Enter the debugger on the next call to func_name.
> undebug(func_name)        # Turn debugging off for func_name.

# Commands typed at the debugger prompt:
#   n    next statement
#   c    continue

> trace(func_name, edit = TRUE)  # Edit a temporary copy of the function in RStudio, without changing the original code
> browser()                       # Call inside a function to pause there and inspect the current frame

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")
# Logical (boolean) vector
boolean_vector <- c(TRUE, FALSE, TRUE)


# Prefer a vector over a list when all elements have the same type.

some_vector <- c("John Doe", "poker player")          # Single dimensional vector
names(some_vector) <- c("Name", "Profession")         # Set names for vector elements.

# Slicing vector
my_vector[c(1, 5)]                # Selects the first and fifth elements of the vector
my_vector[1:5]                    # Selects elements 1, 2, 3, 4, 5
my_vector["name"]                 # Selects the element named "name" (requires a named vector)
my_vector[c("name", "age")]       # Selects the elements named "name" and "age"
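
The slicing forms above can be tried on a small named vector. This is only a sketch; the contents of my_vector are made up for illustration.

```r
# Hypothetical named character vector, used only to exercise the slicing forms.
my_vector <- c(name = "John Doe", age = "42", city = "Pune",
               job = "poker player", hobby = "chess")

first_and_fifth <- my_vector[c(1, 5)]          # by position
first_three     <- my_vector[1:3]              # by a range of positions
by_name         <- my_vector[c("name", "age")] # by element names
all_but_first   <- my_vector[-1]               # negative index drops elements

print(first_and_fifth)
print(by_name)
```

A negative index removes elements instead of selecting them, which is often handier than listing everything you want to keep.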

# sum, mean, etc

v = c(10, 20, 30)
sum(v)
mean(v)

summary(my_var)           # Summary of a variable: min, quartiles, median, mean, max

v = 5:8                    # These 2 statements are equivalent. 5,8 both inclusive.
v = c(5,6,7,8)

# Selection vector
> selection_vector =  c(4, 5, 6) > 5
> selection_vector
[1] FALSE FALSE TRUE
> my_vector[selection_vector]       # selects the elements where selection_vector is TRUE

> length(1:10)
   [1] 10

> nchar("Software Carpentry")
   [1] 18

# Append something to vector
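v  = character()           # Empty vector of strings. Note: c() is NULL, not an empty character vector !!!
u  = character()

v  = c(v, 'one', 'two')    # Works, but copies the vector each time; inefficient inside loops
v  = append(v, 'one')      # Same effect as c(v, 'one')
v  = c(v, u)               # Merging two vectors. Best way
v[1] = 'one'               # OK. Assigning by index also works.
v[2] = 'two'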

# Vector transformation
output <- sapply(values, function(v) return(2*v))           # sapply simplifies the result to a vector when possible
output <- lapply(values, function(v) return(2*v))           # lapply always returns a list
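
A small sketch of the difference in return types. The input vector named values is an assumption, and vapply() is shown as a stricter alternative that the notes above do not mention.

```r
values <- c(1, 2, 3)   # assumed example input

doubled_vec    <- sapply(values, function(v) 2 * v)              # simplified to a numeric vector
doubled_list   <- lapply(values, function(v) 2 * v)              # always a list
doubled_strict <- vapply(values, function(v) 2 * v, numeric(1))  # result type is declared up front
doubled_direct <- 2 * values                                     # plain vectorised arithmetic

str(doubled_vec)
str(doubled_list)
```

For simple arithmetic like this, the vectorised form 2 * values is the idiomatic choice; the apply family earns its keep when the function is not already vectorised.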

Data Structures

Key Points

  • R basic data structures include the vector, list, matrix, data frame, and factor.
  • A vector is atomic, i.e. it holds values of a single type only.
  • Vector types are character, numeric (double), integer, complex, and logical. The mode of a numeric vector is 'numeric' and that of a string vector is 'character'.
  • There is no separate scalar type. a = 1 means a is a numeric (double) vector of length 1.
  • By default a value is a vector. The other fundamental types are list and function.
  • Every object has just 2 intrinsic attributes: mode and length.
  • Objects may have other attributes, such as names, dim, and class.
  • All other higher level objects are backed by one of the fundamental types, i.e. vector, list or function.
  • You create higher level objects just by assigning certain additional attributes. E.g. take a vector of length 9 and assign dim(v) = c(3, 3); now v is a matrix.
  • The data frame object is a list with 'class' attribute value 'data.frame'. Certain generic functions treat these lists differently depending on the 'class' value, but there is nothing fundamentally different between a list and a data frame.
  • Everything in R is an object.
  • Indexes start from 1.
  • Supports NA (missing value), NULL, NaN, Inf
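
The points about attributes can be checked directly. The following is a minimal sketch with made-up values; it shows a vector becoming a matrix through dim, and a data frame being a list underneath.

```r
v <- 1:9
class_before <- class(v)    # "integer"
dim(v) <- c(3, 3)           # adding a dim attribute turns the vector into a matrix
is_mat <- is.matrix(v)      # TRUE

df <- data.frame(id = 1:3, label = c("a", "b", "c"))
df_type   <- typeof(df)             # "list": the storage type is a plain list
df_attrs  <- names(attributes(df))  # names, class and row.names
plain     <- unclass(df)            # the same data without the 'class' attribute

single <- 1
single_len <- length(single)        # a "scalar" is a vector of length 1

print(df_attrs)
str(plain)
```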

Vector Types

  • character: "a", "swc"
  • numeric: 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

Type inspection

R provides many functions to examine features of vectors and other objects, for example:

  • class() - what kind of object is it (high-level)?
  • typeof() - what is the object’s data type (low-level)?
  • length() - how long is it? What about two dimensional objects?
  • attributes() - does it have any metadata?
  • str() - Best command to see the "structure" of the variable and summary.

str() is the one you want to use most of the time.

Examples:

> x <- "dataset"
> mode(x)                         # mode is the intrinsic attribute of object.
     [1] "character"
> typeof(x)                       # 
     [1] "character"              # x is a character vector of length 1.
> class(x)                        #
     [1] "character"
> attributes(x)
     NULL
> str(x)                          # Best command to see "structure" and summary of variable.
    chr "dataset"

> y <- 1:10                       # Assign a generated sequence as vector.
> y
    [1]  1  2  3  4  5  6  7  8  9 10        # y is integer vector of size 10
> typeof(y)
    [1] "integer"
> length(y)
    [1] 10                        # Note: a single integer is just an integer vector of length 1 !!!!

> z <- as.numeric(y)              # Coerces to type double vector from integer vector
> z
 [1]  1  2  3  4  5  6  7  8  9 10
> typeof(z)
 [1] "double"

list

  • A list is considered a special case of a vector, but non-atomic, i.e. it can contain elements of different types.
  • A vector is a collection of elements that are most commonly of mode character, logical, integer or numeric.
  • You can create an empty vector with vector().
  • By default the mode is logical. It is more common to use direct constructors such as character(), numeric(), etc.
>  vector() # an empty 'logical' (the default) vector
      logical(0)

>  vector("character", length = 5) # a vector of mode 'character' with 5 elements
      [1] "" "" "" "" ""

>  character(5) # the same thing, but using the constructor directly     # Note: this is 5 empty strings, not one string of length 5 !!!!
      [1] "" "" "" "" ""

> numeric(5)   # a numeric vector with 5 elements
      [1] 0 0 0 0 0

> logical(5)   # a logical vector with 5 elements
      [1] FALSE FALSE FALSE FALSE FALSE

# You can also create vectors by directly specifying their content.

> x <- c(1,  2, 3)                   # This is a vector of double precision numbers !!!
> x <- c(1L, 2L, 3L)                 # It is now an integer vector !!! Every element needs the L suffix, otherwise the whole vector is coerced to double.
> x <- as.integer(x)                 # Use this for explicit type conversion (coercion).

> y <- c(TRUE, TRUE, FALSE, FALSE)

# The functions typeof(), length(), class() and str() provide useful information about your vectors and R objects in general.

> z <- c("Sarah", "Tracy", "Jon")
> typeof(z)
      [1] "character"
> length(z)
      [1] 3
> class(z)
      [1] "character"
> str(z)                                     # Great command to use !!!!
      chr [1:3] "Sarah" "Tracy" "Jon"

> z <- c(z, "Annette")                         # Add new element at end.
> z
    [1] "Sarah"   "Tracy"   "Jon"     "Annette"
> z <- c("Greg", z)                     # Prefix new element.
> z
      [1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

> series <- 1:10
> seq(10)
     [1]  1  2  3  4  5  6  7  8  9 10
> seq(from = 1, to = 10, by = 0.1)
     [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
    [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
    [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
    [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
    [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
    [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
    [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

# R supports missing data NA in vectors.

> x <- c(0.5, NA, 0.7)
> is.na(x)
      [1] FALSE  TRUE FALSE
> anyNA(x)
      [1] TRUE
# Inf is infinity. You can have either positive or negative infinity.

> 1/0
      [1] Inf
# NaN means Not a Number. It’s an undefined value.
> 0/0
      [1] NaN
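
In practice the special values matter most when aggregating. A short sketch using the same x as above; the na.rm argument and the is.* helpers are standard base R.

```r
x <- c(0.5, NA, 0.7)

mean_with_na    <- mean(x)                 # NA: a missing value poisons the result
mean_without_na <- mean(x, na.rm = TRUE)   # drop NAs before averaging
complete_values <- x[!is.na(x)]            # keep only the non-missing elements

nan_is_na    <- is.na(0 / 0)      # TRUE: NaN also counts as missing
inf_is_finite <- is.finite(1 / 0) # FALSE
neg_inf      <- -1 / 0            # -Inf

print(mean_without_na)
```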

Named list elements are accessed as listname$varname; unnamed elements are accessed by index only.

x <- 1:10
x <- as.list(x)
length(x)
[1] 10
What is the class of x[1]?
What about x[[1]]?

Elements of a list can be named (i.e. lists can have the names attribute)

xlist <- list(a = "Karthik Ram", b = 1:10, data = head(iris))
xlist
$a
[1] "Karthik Ram"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
names(xlist)
[1] "a"    "b"    "data"
What is the length of this object? What about its structure?

Lists can be extremely useful inside functions. Because a function in R can return only a single object, return a named list when you need to hand back several values.

  • Elements are indexed by double brackets.
  • Single brackets will still return a(nother) list.
  • If the elements of a list are named, they can be referenced by the $ notation (i.e. xlist$data).
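
As a sketch of both ideas (returning several values as a named list, and the three ways of indexing it), consider the hypothetical helper describe() below. Its name and fields are invented for this example.

```r
# Hypothetical helper: returns several results bundled in one named list.
describe <- function(v) {
  list(n = length(v), mean = mean(v), range = range(v))
}

res <- describe(c(10, 20, 30))

res_mean   <- res$mean         # $ notation: the element itself
res_range  <- res[["range"]]   # double brackets: the element itself
res_sub    <- res["n"]         # single brackets: still a list (of length 1)

print(res_mean)
print(class(res_sub))
print(class(res[["n"]]))
```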

Important Note:

  • A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
  • To display list, use str(mylist) or str(head(mylist))
  • To display data frame, you can: View(df).
> a1 = list('one', 'two' , 'three')
> a2 = list(1, 2, 3)
> v = c(5, 6, 7)

> v
[1] 5 6 7             # Entire vector is printed [1] specifies starting index. Compact output.

> v[1]                # Any single number is a vector of size 1.
[1] 5

> a1
[[1]]                 # Each element printed in separate line. Very verbose output for list !!!
[1] "one"

[[2]]
[1] "two"

[[3]]
[1] "three"

> str(a1)
  List of 3
   $ : chr "one"
   $ : chr "two"
   $ : chr "three"

> a2[1]               # Single bracket for list acts like a 'sublist' operation. It returns list !!!
[[1]]
[1] 1

> a2[[1]]             # Double brackets return the first element itself.
[1] 1

> ll  = list(list(5, 6), list(7, 8))
> ll
[[1]]          #  This refers to list(5, 6). Note double index !!!
[[1]][[1]]     #  This refers to 5
[1] 5

[[1]][[2]]
[1] 6


[[2]]
[[2]][[1]]
[1] 7

[[2]][[2]]
[1] 8

> str(ll, max.level = 1)     # Show only 1 level of nesting.
List of 2
 $ :List of 2
 $ :List of 2

A custom function to print list :

printList <- function(lst) {

  for (i in seq_along(lst)) {     # seq_along() is safe for empty lists, unlike 1:length()

    print(head(lst[[i]]))         # head() keeps long elements short

  }
}

Converting list to vector

# Convert named list to vector.  Recursively flattens lists of lists as well.
my_vector = unlist(myList, use.names=FALSE)  # Best performance

my_vector = rapply(myList, c)                     # Recursive apply of functions to return vector. Slow.
my_vector = sapply(myList, function(x) x)         # Works for flat lists of single values. Slowest
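
A quick sketch of what unlist() does to names and to mixed types. The list myList here is a made-up nested example.

```r
myList <- list(a = 1, b = list(c = 2, d = 3))   # assumed nested example

flat_named   <- unlist(myList)                     # names are joined with dots
flat_unnamed <- unlist(myList, use.names = FALSE)  # values only

mixed <- unlist(list(1, "a", TRUE))   # a vector is atomic, so everything becomes character

print(flat_named)
print(mixed)
```

Because the result must be atomic, flattening a list with mixed types silently coerces to the most general type, so check typeof() afterwards if that matters.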

Objects Attributes

TODO: Cleanup

Objects can have attributes. Attributes are part of the object. Every object reports a type (typeof()) and a class (class()), whether or not a 'class' attribute is explicitly set.

There are additional optional special attributes that are supported by some objects :

  • names # Named vectors and lists have this.
  • dimnames # Matrices and arrays whose row or column names are defined have this.
  • dim # Every matrix and array has this attribute.
> a = list(3, 4)
> attributes(a)
  NULL                                  # No attributes defined by default.
> names(a) = c('first', 'second')       # Can assign names attribute like this!!!
> attributes(a)
      $names
      [1] "first" "second"

User defined attributes can be set like below:

> attr(baskets.team,'season') <- '2010-2011'
> attr(baskets.team,'season')
    [1] "2010-2011"

# You can delete attributes again by setting their value to NULL, like this:
> attr(baskets.team,'season') <- NULL
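
The same steps in a self-contained form. The vector baskets and its values are made up here, since baskets.team is not defined in these notes.

```r
baskets <- c(12, 4, 5)                        # hypothetical data

attrs_before <- attributes(baskets)           # NULL: no attributes yet
attr(baskets, "season") <- "2010-2011"        # set a user-defined attribute
season <- attr(baskets, "season")
attr(baskets, "season") <- NULL               # delete it again
attrs_after <- attributes(baskets)            # back to NULL

print(season)
```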

Matrix

In R, matrices are an extension of atomic vectors. They are not a separate type of object but simply an atomic vector with a dim attribute giving the number of rows and columns.

The vector elements are filled column-wise to create the matrix. length(my_matrix) is the total number of elements (nrow * ncol), not the number of columns.

> m = matrix(1:6, nrow = 2, ncol = 3)
> m                             
     [,1] [,2] [,3]             # Note values filled columnwise !!!
[1,]    1    3    5
[2,]    2    4    6

>  as.vector(m)                 # Convert matrix back to vector !!!
[1] 1 2 3 4 5 6

> as.vector(t(m))               # If you want vector along rows ...
[1] 1 3 5 2 4 6

> attributes(m)                 # Matrix is a vector with dim attribute !!!
$dim
[1] 2 3

> class(m)                      # class function reveals higher level type.
[1] "matrix" "array"            # R >= 4.0 (older versions print only "matrix"). This class is implicit; there is no 'class' attribute as with data.frame.

> dim(m)
[1] 2 3

# The dimensions can have names. On reading csv file, the row names are 'line numbers',
# the column names are the header columns.

> x <- read.csv("matrix.csv",header=T,sep="\t")

> dimnames(x)
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"              # These are line numbers used as 'row names'

[[2]]                                                # These are column names.
[1] "Subtype"    "Expression" "Quality"    "Height"

# Now you can assign your own names for dimensions.

> m = matrix(1:6, nrow = 2, ncol = 3)
> dimnames(m) = list( c('row1', 'row2'), c('col1', 'col2', 'col3') )

> m
         col1 col2 col3
    row1    1    3    5
    row2    2    4    6

m      <- 1:10
dim(m) <- c(2, 5)
This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using cbind() and rbind().

x <- 1:3
y <- 10:12
cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
You can also use the byrow argument to specify how the matrix is filled. From R’s own documentation:

mdat <- matrix(c(1, 2, 3, 11, 12, 13),
               nrow = 2,
               ncol = 3,
               byrow = TRUE)
mdat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13

mdat[2, 3]                        # Unlike a list, a matrix element is accessed with single brackets: [row, column]
[1] 13

> m = matrix(1:9, byrow = TRUE, nrow = 3)

> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> m2 = replicate(2, m)
> m2
, , 1

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

, , 2

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

> class(m2)                           # 3 dimensional matrix is an array !!!!
[1] "array"

> typeof(m2)
[1] "integer"

> str(m2)
 int [1:3, 1:3, 1:2] 1 4 7 2 5 8 3 6 9 1 ...

> m2[2, 2,]
[1] 5 5
>

Data Frame

A data frame is the de facto data structure for most tabular data. It is basically a named list with a special attribute class = 'data.frame'. It is a list of column vectors, so each element must have the same length.

df=data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4))
attributes(df)

my_list = unclass(df)        # This removes the 'class' attribute and gives you underlying list.
  • A data frame is a special type of list where every element of the list has same length

  • Data frames can have additional attributes such as row names (see rownames()), but they are optional. They are used for annotating data, e.g. with a sample_id, but most of the time they are not used.

  • Usually created by read.csv() and read.table(), etc.

  • If all columns in a data frame are of the same type, the data frame can be converted to a matrix with data.matrix() or as.matrix(). Otherwise type coercion is applied and the results may not always be what you expect.

  • Can also create a new data frame with data.frame() function.

  • Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.

  • Rownames are often automatically generated and look like 1, 2, …, n. Consistency in numbering of rownames may not be honored when rows are reshuffled or subset.

  • To create data frames by hand:
    dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

  • Useful Data Frame Functions :

    head() - shows first 6 rows
    tail() - shows last 6 rows
    dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
    nrow() - number of rows
    ncol() - number of columns
    str() - structure of data frame - name, type and preview of data in each column
    names() - shows the names attribute for a data frame, which gives the column names.
    sapply(dataframe, class) - shows the class of each column in the data frame
    
  • See that it is actually a special list:

    > is.list(dat)
      [1] TRUE
    > class(dat)
      [1] "data.frame"
    
  • Indexing an element in a data frame works like a matrix, with [row, column]. Indexing a list needs double brackets, which is clumsier :

    > dat[1, 3]
          [1] 11
    
  • As data frames are also lists, it is possible to refer to columns using the list notation, i.e. either double square brackets or a $ :

    > dat[["y"]]
      [1] 11 12 13 14 15 16 17 18 19 20
    
    > dat$y
      [1] 11 12 13 14 15 16 17 18 19 20
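
The coercion caveat for matrix conversion is easy to see on a small mixed-type data frame. This is a sketch with made-up columns.

```r
dat <- data.frame(id = letters[1:3], x = 1:3, y = c(1.5, 2.5, 3.5))

m_all <- as.matrix(dat)                   # mixed columns: everything becomes character
m_num <- as.matrix(dat[, c("x", "y")])    # numeric columns only: a numeric matrix
m_dm  <- data.matrix(dat)                 # non-numeric columns are turned into numeric codes

type_all <- typeof(m_all)   # "character"
type_num <- typeof(m_num)   # "double"

print(m_all)
print(m_dm)
```

Selecting the numeric columns first, as in m_num, is usually the safest route when the matrix is meant for arithmetic.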
    

The following table summarizes the data structures by number of dimensions and by whether the elements share one type :

Dimensions    Homogeneous      Heterogeneous
1-D           atomic vector    list
2-D           matrix           data frame
N-D           array            list          

Lists can contain multi-dimensional elements like matrix.

What type of structure do you expect to see when you explore the structure of the iris data frame? Hint: Use str().

> str(iris)                 # iris is a built-in dataframe

 'data.frame':     150 obs. of  5 variables:
   $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
   $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
   $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
   $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
   $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


  For a slightly more complex problem, use which() to tell sum() which rows to add up. If DF is the data frame:

        Ozone Solar.R  Wind Temp Month Day
   1      41     190  7.4   67     5   1
   2      97     267  6.3   92     7   8
   3      97     272  5.7   92     7   9

  Example: sum the values of Solar.R (column 2) for the rows where Ozone (column 1) > 30 AND Temp (column 4) > 90

  sum(DF[which(DF[,1]>30 & DF[,4]>90),2])        # Sum column 2 (across rows) where col1 > 30 and col4 > 90
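
  The example can be run as-is once DF is built from the three rows shown above. The second form, which selects by column name, is an equivalent alternative added here for readability.

```r
# The three rows shown above, typed in by hand.
DF <- data.frame(Ozone = c(41, 97, 97), Solar.R = c(190, 267, 272),
                 Wind = c(7.4, 6.3, 5.7), Temp = c(67, 92, 92),
                 Month = c(5, 7, 7), Day = c(1, 8, 9))

rows_wanted <- which(DF[, 1] > 30 & DF[, 4] > 90)       # row numbers that satisfy both conditions
total_by_position <- sum(DF[rows_wanted, 2])            # 267 + 272
total_by_name <- sum(DF$Solar.R[DF$Ozone > 30 & DF$Temp > 90])

print(rows_wanted)
print(total_by_position)
```

  One reason to keep which() is that it drops rows where the condition is NA, whereas a bare logical index would carry the NA into the sum.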

  sum( df$columnA < NUMBER )            # Counts the rows where columnA < NUMBER (each TRUE counts as 1)

  dat <- as.data.frame(matrix(1:36,6,6))
  colnames(dat) <- paste0("Col", LETTERS[1:6])
  dat$ColA
  # [1] 1 2 3 4 5 6
  dat$ColA < 3
  # [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
  sum(dat$ColA < 3)
  # [1] 2

Everything you wanted to know about indexing vector, matrix, array and data frame

vector

Let us start with something very simple :

> a = 1:5
> a
  [1] 1 2 3 4 5

> a  > 3
  [1] FALSE FALSE FALSE  TRUE  TRUE

> a [ a %% 2 == 0 ]
  [1] 2 4

> a[ a > 3 ]
  [1] 4 5

matrix

Let us move on to 2 dimensional vector, i.e. matrix:

> m = matrix(1:9, nrow=3)   # By default, we fill by column.
> m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Column-wise filling is inherent in data frames as well. You might expect row-wise to be more natural, but that is not the case. When you convert the matrix back to a vector, the same column-wise order is restored :

> as.vector(m)                # Always operates column-wise.
[1] 1 2 3 4 5 6 7 8 9

> as.vector(t(m))             # Use this trick, if you want row wise conversion.
[1] 1 4 7 2 5 8 3 6 9 

We can create matrix row-wise :

> m = matrix(1:9, nrow=3, byrow = TRUE)  # You can also fill by row.
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

What happens when your sequence size does not exactly match the matrix size? Do you expect the sequence to be truncated to fit the matrix? No. R picks enough rows to hold every element at least once and recycles the sequence from the start to fill the remaining cells, with a warning :

> m = matrix(1:9, ncol=2)
Warning message:
In matrix(1:9, ncol = 2) :
  data length [9] is not a sub-multiple or multiple of the number of rows [5]
> m
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5    1
> 

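Note that the above matrix is 5x2, not 4x2. The remaining examples go back to the 3x3 column-filled matrix. Creating a logical matrix is as simple as what we did for the vector :

> m = matrix(1:9, nrow=3)
> m > 7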

      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE  TRUE
[3,] FALSE FALSE  TRUE

Extracting the matching elements into a vector is easy using this logical matrix ... :

> m[ m > 7 ]
  [1] 8 9

Again, the result is ordered as if the filter was applied to the flattened column-wise vector. Selecting single rows behaves as you would expect:

> m[1,]       # [1]  1 4 7
> m[2,]       # [1]  2 5 8
> m[3,]       # [1]  3 6 9

Selecting a single column, you may expect a 3 x 1 matrix, but you actually get a plain vector :

> m[,1]       # [1]  1 2 3
> m[,2]       # [1]  4 5 6
> m[,3]       # [1]  7 8 9
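
If you do want the 3 x 1 matrix, base R's drop = FALSE argument to [ keeps the dimensions. A minimal sketch on the same 3x3 column-filled matrix:

```r
m <- matrix(1:9, nrow = 3)

col_vec <- m[, 1]                 # plain vector, dimensions dropped
col_mat <- m[, 1, drop = FALSE]   # 3 x 1 matrix, dimensions kept
row_mat <- m[2, , drop = FALSE]   # 1 x 3 matrix

print(dim(col_vec))   # NULL
print(dim(col_mat))
```

This matters in functions that index with a computed set of columns: without drop = FALSE the result changes shape whenever exactly one column is selected.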

A sub-selection keeps the matrix structure when the result has more than one row and more than one column. The backing internal vector for the matrix is still stored column-wise behind the scenes :

> m[c(1,2),c(1,2)]
     [,1] [,2]
[1,]    1    4
[2,]    2    5

> m[c(1,3),c(2,3)]
     [,1] [,2]
[1,]    4    7
[2,]    6    9

> as.vector(m[c(1,3),c(2,3)])    # 4 6 7 9

array

Let us move on to array which can be any dimensional ... let us choose 3-D :

> a = array(1:8, dim = c(2, 2, 2))
> a
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

The above output illustrates how the sequence is filled with the first dimension varying fastest and the last dimension slowest :

c(1,1,1), c(2,1,1),    x varies as (y, z)   is fixed.
c(1,2,1), c(2,2,1),    x varies as (y+1, z) is fixed.

c(1,1,2), c(2,1,2),    x varies as (y, z+1) is fixed.
c(1,2,2), c(2,2,2),    x varies as (y+1, z+1) is fixed.

You can imagine the first index as the LSB and the last index as the MSB. The logical operation yields results as you would expect :

a > 4

, , 1

      [,1]  [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE

, , 2

     [,1] [,2]
[1,] TRUE TRUE
[2,] TRUE TRUE

> a[ a > 4 ]     # Filtering an N-dimensional array always yields a plain vector.

[1] 5 6 7 8

A subset operation keeps the array structure for every dimension whose resulting size is > 1 :

> a[1,,]
     [,1] [,2]
[1,]    1    5
[2,]    3    7

> a[2,,]
     [,1] [,2]
[1,]    2    6
[2,]    4    8

> a[,1,]
     [,1] [,2]
[1,]    1    5
[2,]    2    6

Note: If the result is one dimensional, the dim structure is not preserved :

> a[1,,1]   

[1] 1 3      # Simple vector, no dimension attached !!!

You can conditionally assign values like below :

> a [ a > 4 ] = 0     # Array structure is intact and only values > 4 are set to 0.
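
A runnable sketch of the conditional assignment, confirming that the dimensions survive and only the matching cells change:

```r
a <- array(1:8, dim = c(2, 2, 2))

a[a > 4] <- 0L          # 0L keeps the array integer-typed

dims_after   <- dim(a)            # still 2 2 2
first_slice  <- a[, , 1]          # untouched: 1..4
second_slice <- a[, , 2]          # all values were > 4, so all are now 0

print(a)
```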

Data Frame

Let us take a look at data frame now ... :

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Selecting rows based on known row numbers is straight forward :

> iris[1:5,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

Selecting columns is also straight forward :

> iris[1:5, c(1,2,5)]

  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa

Note: Usually you select rows based on certain conditions in a data frame. An expression like iris[ iris$Sepal.Length > 5 ] does not work because the comma is missing, so R tries to select columns; iris[ iris$Sepal.Length > 5, ] does work. The dplyr filter command is often more readable for such things :

> library(dplyr)
> filter(iris, Sepal.Length > 7.5 )

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.6         3.0          6.6         2.1 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica
4          7.7         2.8          6.7         2.0 virginica
5          7.9         3.8          6.4         2.0 virginica
6          7.7         3.0          6.1         2.3 virginica
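
For comparison, the same row selection can be written in base R with a logical index (note the trailing comma) or with subset(). This sketch uses only the built-in iris data and needs no extra packages.

```r
# Logical row index; the empty slot after the comma means "all columns".
long_sepals <- iris[iris$Sepal.Length > 7.5, ]

# subset() evaluates the condition inside the data frame and can pick columns too.
long_narrow <- subset(iris,
                      Sepal.Length > 7.5 & Sepal.Width < 3.0,
                      select = c(Sepal.Length, Sepal.Width, Species))

print(nrow(long_sepals))
print(long_narrow)
```

Unlike dplyr's filter, the logical-index form keeps the original row names, which can be useful for tracing rows back to the source data.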

You can sub-select a few columns and filter on multiple conditions :

filter(iris[,c(1,2,5)], Sepal.Length > 7.5, Sepal.Width < 3.0,
                        (Species == "versicolor" | Species == "virginica") )

  Sepal.Length Sepal.Width   Species
1          7.7         2.6 virginica
2          7.7         2.8 virginica

Note: dplyr::filter(df, cond1, cond2, cond4 | cond5): conditions separated by commas are joined by 'AND'.
      Every column that a condition refers to must be present in the data you pass in, e.g. the Species column must be selected above.

List

Let us take a look at lists now ... :

> l = list(1:5)          #  [[1]]   [1] 1 2 3 4 5
> length(l)              #  [1]  1

Didn't you expect a list of 5 elements ? If you need that, then you should use as.list() :

> l = as.list(1:5) ; length(l)       # [1]   5

To produce a logical vector from a list whose elements are all single values, you can use an expression like l > 3 :

> l > 3 
  [1] FALSE FALSE FALSE  TRUE  TRUE

> l [ l > 3 ]                #  produces  list(4, 5)

Since list can contain non-homogeneous types, such operations can yield NAs :

> l = as.list(1:5)
> l[4] = 'hello'             # list(1, 2, 3, 'hello', 5)
> l > 3                      # FALSE FALSE FALSE NA TRUE (a logical vector)
                             # Warning message: NAs introduced by coercion 
> l [ l > 3 ]                # list(NULL, 5)

How do you subset a list of lists? Let us say you want all elements of length > 3 :

# Most popular method.
v = sapply(l, function(x) length(x) > 3)
my_subset = l[v]

If the only purpose is to subset the list, a better method is to use Filter directly :

my_subset = Filter(function(x) length(x) > 3, l)
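
Both approaches side by side on a small made-up list, as a sketch. The element values are arbitrary; only their lengths matter.

```r
l <- list(1:2, 1:5, letters[1:4], 10)   # assumed example: lengths 2, 5, 4, 1

keep <- sapply(l, function(x) length(x) > 3)      # logical vector, one entry per element
subset_by_index  <- l[keep]                       # single brackets keep it a list
subset_by_filter <- Filter(function(x) length(x) > 3, l)

print(keep)
str(subset_by_filter)
```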

A list element with a name can also be referred to by position :

> l = list()

> l["one"] = 1 ; l["two"] = 2 ; l[3] = 3

> l[1]                                     #  list(one = 1) ; l["one"] gives the same.

> str(l)

 List of 3
  $ one: num 1                         # Index named as 'one' for position 1.
  $ two: num 2                         # Index named as 'two' for position 2.
  $    : num 3                         # No Index name for position 3.

Data Table

Data table (the data.table package) is an improved version of data frame in terms of usability, with built-in support for aggregation and group-by operations.

  • data.table doesn’t set or use row names, ever.

  • Never coerces strings to factors by default.

  • The syntax follows this structure :

    DT[i, j, by]
    
    ##   R:                 i                 j        by
    ## SQL:  where | order by   select | update  group by
    
  • You can select a single column as a data table using this syntax: flights[, list(arr_delay)]. If you drop the list(), i.e. flights[, arr_delay], you get a plain vector, not a data table. Note that arr_delay is a column in the flights data table.

  • The . is an alias for list inside DT[...]. .(2, 3) and list(2, 3) are the same.

  • Compute or do in j :

    #  How many trips have had total delay < 0?
    ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]     # ans  [1] 141814
    
  • Subset in i and do in j :

    ans <- flights[origin == "JFK" & month == 6L,
                   .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
    ans
    #       m_arr    m_dep
    # 1: 5.839349 9.807884
    
  • The .N is the number of rows in the current group :

    ans <- flights[origin == "JFK" & month == 6L, .N]
    ans
    # [1] 8422
    
  • Grouping using by :

    – How can we get the number of trips corresponding to each origin airport?
    ans <- flights[, .(.N), by = .(origin)]
    ans
    #    origin     N
    # 1:    JFK 81483
    # 2:    LGA 84433
    # 3:    EWR 87400
    
    ## or equivalently using a character vector in 'by'
    # ans <- flights[, .(.N), by = "origin"]
    

Matrix

  • In R, a matrix is a collection of elements of the same data type (numeric, character, or logical)
  • A matrix has exactly 2 dimensions; use an array for more.
my_matrix = matrix(1:9, byrow = TRUE, nrow = 3)
# 1:9 is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).
> my_matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

# General form: rownames(my_matrix) <- row_names_vector ; colnames(my_matrix) <- col_names_vector
rownames(my_matrix) <- c("Raja", "Rani", "manthiri")
colnames(my_matrix) <- c("age", "experience", "rating")
> my_matrix
            age experience rating
   Raja       1          2      3
   Rani       4          5      6
   manthiri   7          8      9

> rowSums(my_matrix)
  Raja     Rani manthiri
     6       15       24

# Extend matrixes by adding columns:
big_matrix <- cbind(matrix1, matrix2, vector1 ...)

a_var = my_matrix[1,2]
sub_matrix = my_matrix[1:2,2:3]    # with rows 1, 2 and columns 2, 3.
first_column =  my_matrix[,1]
first_row    =  my_matrix[1,]

sex_vector <- c("Male","Female","Female","Male","Male")

factor_sex_vector <- factor(sex_vector)            # Efficiently encodes categorical values
factor_sex_vector
    [1] Male   Female Female Male   Male
    Levels: Female Male
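
A short sketch of what the factor stores: the distinct levels (sorted alphabetically by default) plus one integer code per element.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

lv     <- levels(factor_sex_vector)       # "Female" "Male"
codes  <- as.integer(factor_sex_vector)   # position of each value within the levels
counts <- table(factor_sex_vector)        # frequency per level

print(codes)
print(counts)
```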

# Display built-in data frame example.
 > mtcars
                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 ...

> typeof(my_matrix)
[1] "integer"

> class(my_matrix)
[1] "matrix" "array"     # R >= 4.0; older versions print only "matrix"

Data Frames

> typeof(mtcars)               # mtcars is built-in example
[1] "list"

> class(mtcars)
[1] "data.frame"

> head(mtcars)       # or tail()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

> str(mtcars)    # Show structure of your dataset
'data.frame':    32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

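# Create a data frame from the vectors
# stringsAsFactors = TRUE (the default before R 4.0) gives the factor summary shown below
planets_df <- data.frame(name, type, diameter, rotation, rings, stringsAsFactors = TRUE)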

> summary(planets_df)

     name                   type      diameter          rotation
   Earth  :1   Gas giant         :4   Min.   : 0.3820   Min.   :-243.0200
   Jupiter:1   Terrestrial planet:4   1st Qu.: 0.8448   1st Qu.:   0.1275
   Mars   :1                          Median : 2.4415   Median :   0.5500
   Mercury:1                          Mean   : 3.9264   Mean   : -22.6950
   Neptune:1                          3rd Qu.: 5.3675   3rd Qu.:   1.0075
   Saturn :1                          Max.   :11.2090   Max.   :  58.6400
   (Other):2
     rings
   Mode :logical
   FALSE:4
   TRUE :4

# Select first 5 values of diameter column
> planets_df[1:5, "diameter"]
[1]  0.382  0.949  1.000  0.532 11.209

# These three expressions are equivalent; each returns the diameter column as a vector

planets_df[,3]

planets_df[,"diameter"]

planets_df$diameter

> rings_vector <- planets_df$rings   # logical vector: TRUE for planets with rings
> planets_df[rings_vector, ]         # Selects all columns for planets with rings.

> a <- c(100, 10, 1000)

> order(a)
[1] 2 1 3

> a[order(a)]
[1]   10  100 1000

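# A named list lets you access components by name as well as by position.

my_list <- list(name1 = your_comp1, name2 = your_comp2)   # name the components at creation

my_list <- list(your_comp1, your_comp2)                   # or create the list first ...
names(my_list) <- c("name1", "name2")                     # ... and name the components afterwards

# To access elements of a vector use single brackets, e.g. [1]; for a list use double brackets, e.g. [[1]]

shining_list[[1]]                          # These are equivalent when "reviews" is the first component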
shining_list[["reviews"]]
shining_list$reviews

rbind and do.call

  • rbind can be applied for Data frame and matrix
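
A minimal sketch of why rbind and do.call are usually mentioned together: do.call(rbind, a_list) calls rbind with every list element as a separate argument, so a whole list of data frames (or row vectors) is stacked in one step. The names parts, stacked and m are made up for this illustration.

```r
# A list of small data frames with identical columns (illustrative data).
parts <- list(
  data.frame(id = 1:2, label = c("a", "b")),
  data.frame(id = 3:4, label = c("c", "d")),
  data.frame(id = 5L,  label = "e")
)

# Same as rbind(parts[[1]], parts[[2]], parts[[3]]), whatever the list length.
stacked <- do.call(rbind, parts)
print(stacked)

# rbind also works on plain vectors: each one becomes a matrix row.
m <- do.call(rbind, list(1:3, 4:6))
print(dim(m))
```

This is handy when the pieces come out of lapply() and their number is not known in advance.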

Matrix vs Dataframe

  • A data frame can contain columns of different types; a matrix holds a single type.
  • Constructed like: planets_df <- data.frame(name, type, diameter, rotation, rings) where type supports user defined type names. (See example above)
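
A short sketch of the difference, assuming nothing beyond base R. The variable names m and df are illustrative only.

```r
# A matrix holds one type: mixing numbers and strings coerces everything to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
print(typeof(m))

# A data frame keeps each column's own type.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
print(sapply(df, class))
```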

Indexing conventions [] vs [[]]

See:

  • https://stackoverflow.com/questions/1169456/the-difference-between-bracket-and-double-bracket-for-accessing-the-elem/1169973#1169973

  • http://cran.r-project.org/doc/manuals/R-lang.html#Indexing

  • For a list, the single bracket [ is for subsetting and [[ is for element access.

  • For a vector, [ is for element access; [[ returns the "pure" element, i.e. it drops the names and dimnames attributes.

  • R has three basic indexing operators, single [, double [[ and dollar prefix $ :

    x[i]                    - For vector, gets ith element. For list, gets a sublist of length 1, containing ith element.
                            - !!! For matrix and multi-dimensional arrays, it does not give you sub-matrix  !!!
                            - !!! e.g. For matrix of 3x3 filled with 1:9, m[i] == i i.e. single element value  !!!
                            - x[a_vector] extracts (only) vector from vector or matrix or array. 
                            - Note that result vector has same length as index vector even when source is matrix or array.
                            - a_list[a_vector] extracts a sub list from list.
                            - For data frame, df[1] extracts a sub data frame with only first column.
                            - For data frame, df[c(1,3)] extracts a sub data frame with first and 3rd columns.
    
    x[i, j]                 - You can only use this notation with a matrix, a 2-D array or a data frame.
                            - For a matrix this gives you the (i, j)th element.
                            - For a vector, list, or 3+ dimensional array, you get an error: incorrect number of dimensions (or subscripts).
                            - For data frame, df[1, 2] selects element at 1st row and second column.
                            - Note:  df[1,2] != df[1][2]   (df[1] is single column, [2] attempts to get non-existent second column)
                            - Note:  df[1,2] != df[c(1,2)] (rhs refers to sub-data frame with first and second columns)
    
    x[[i]]                  - For vectors, matrices and arrays, v[[i]] is "almost" the same as v[i], but it always returns a single element.
                            - If the vector elements are named, then v[[i]] drops names and dimnames if present.
                            - a_vector[[another_vector]] is not allowed when another_vector has more than one element.
                            - For list it accesses ith element instead of a sublist containing single element.
                            - a_list[[another_vector]] is allowed and used to find element in a deeply nested list.
                            - e.g. a_list[[ c(2,3) ]] means: pick second element of a_list, then 3rd element of that.
                            - df[[1]] returns the first column's values as a plain vector.
                            - Note that we saw df[1] returns single column data frame.
    
    x[[i, j]]               - Behaves exactly like x[i, j] for 2-D arrays and data frames when i and j each select a single element.
    
    x$a                     - Lists support this syntax using named indices. 
                            - Atomic vectors and matrices support named indices, but do not support this syntax.
                            - A data frame supports the a_dataframe$col_name syntax.
    
    x$"a"                   - Same behaviour as x$a
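
The rules above can be checked on four tiny objects. This is only a sketch; the objects v, l, m and df are invented for the purpose.

```r
v  <- c(a = 10, b = 20, c = 30)                        # named numeric vector
l  <- list(a = 1, b = list(x = "p", y = "q", z = "r")) # nested list
m  <- matrix(1:9, nrow = 3)                            # filled column by column
df <- data.frame(n = 1:3, s = c("x", "y", "z"), stringsAsFactors = FALSE)

print(v["b"])           # single bracket keeps the name
print(v[["b"]])         # double bracket drops it

print(class(l[1]))      # a sublist of length 1
print(class(l[[1]]))    # the element itself
print(l[[c(2, 3)]])     # recursive indexing: 3rd element of the 2nd element

print(m[5])             # linear index into the matrix, not a sub-matrix
print(m[2, 3])          # row 2, column 3

print(class(df[1]))     # one-column data frame
print(class(df[[1]]))   # the column as a plain vector
print(df[2, "s"])       # a single cell
```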
    

Reading/Writing csv files

  • read.table() The separator can be a comma, a tab, etc.; the default (sep = "") is any white space. read.table(file="myfile", sep="\t", header=TRUE)
  • read.csv(file="myfile") Default separator is comma.
  • read.csv2() Uses the comma as the decimal point and a semicolon as the separator.
  • read.delim() Default separator is tab.
  • scan() Allows you finer control. scan("myfile", skip = 1, nmax = 100)
  • readLines() Reads lines of text from a file into a character vector. readLines("myfile")
  • read.fwf() Reads a file with data in fixed-width format. read.fwf("myfile", widths=c(1,2,3))
  • Package foreign helps to read other formats ...

write.csv(MyData_Frame, file = "MyData.csv")
write.csv(MyData, file = "MyData.csv",row.names=FALSE, na="")
write.table(MyData, file = "MyData.csv",row.names=FALSE, na="",col.names=FALSE, sep=",")

d = read.csv( file =  filename,
            stringsAsFactors = FALSE,
            header = TRUE,              # the argument is 'header', not 'headers'
            strip.white = TRUE, sep = ',')
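
A small round trip shows how the write and read calls pair up. It is a sketch that writes to a temporary file; original, restored and csv_path are illustrative names.

```r
original <- data.frame(id = 1:3,
                       size = c("small", "large", "medium"),
                       stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")                   # throwaway file
write.csv(original, file = csv_path, row.names = FALSE)  # no row-name column

# read.csv assumes a header row and a comma separator.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
str(restored)

# read.table needs both spelled out; sep = "\t" would mean tab.
same <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

unlink(csv_path)                                         # clean up
```

Without row.names = FALSE the file gets an extra unnamed first column, which comes back as a column called X.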

# A sample data frame
data <- read.table(header=TRUE, text='
    id weight   size
     1     20  small
     2     27  large
     3     24 medium
')

# Reorder by column number
data[c(1,3,2)]
#>   id   size weight
#> 1  1  small     20
#> 2  2  large     27
#> 3  3 medium     24

# To actually change `data`, you need to save it back into `data`:
# data <- data[c(1,3,2)]


# Reorder by column name
data[c("size", "id", "weight")]
#>     size id weight
#> 1  small  1     20
#> 2  large  2     27
#> 3 medium  3     24

> library("foreign")

The following table lists the functions to import data from SPSS, Stata, and SAS.

Function      What It Does               Example
read.spss     Reads SPSS data file       read.spss("myfile")
read.dta      Reads Stata binary file    read.dta("myfile")
read.xport    Reads SAS export file      read.xport("myfile")

NULL vs NA

  • NULL represents the null object, i.e. "nothing", with length 0.

  • NA represents missing values.

  • An atomic vector cannot contain NULL. If you include one, e.g. c(1, NULL, 2), it is silently dropped.

  • Assigning NULL to a list element deletes that element from the list (note this idiom).

  • NA is a logical constant of length 1, which contains a missing value indicator.

  • There are also constants :

    NA_integer_, NA_real_, NA_complex_ and NA_character_ etc.
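
The bullets above in runnable form; a sketch with made-up variable names.

```r
v <- c(1, NULL, 3)        # NULL vanishes: only two elements remain
print(length(v))

w <- c(1, NA, 3)          # NA occupies a slot
print(length(w))
print(is.na(w))
print(mean(w))            # NA propagates through arithmetic
print(mean(w, na.rm = TRUE))

l <- list(a = 1, b = 2, c = 3)
l$b <- NULL               # deletes element b
print(names(l))

l["a"] <- list(NULL)      # keeps the slot and stores NULL in it
print(length(l))

print(typeof(NA))             # the plain NA is logical
print(typeof(NA_character_))  # typed variants exist for other vector types
```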
    

Debug Strategy

Insert calls to browser() in your source code to debug the program in rstudio.

# Debugging
# Allow to enter into debug mode after error

> options(error=recover)
> options(error=browser)

> debug(func_name)          # Breaks you later inside the function.
> undebug(func_name)        # Unset break point
> n    # next
> c    # continue

> trace(func_name, edit = TRUE)  # Lets you edit a temporary copy of the function; the original source is unchanged.
> browser()                      # Gives you stack frame to jump around

> stopifnot(a < 1)               # Raises an error when the condition is FALSE; with options(error=recover) that drops you into the debugger.
> untrace(func_name)             # Removes the tracing added by trace().

> rm(list=ls())                  # Clean up all global variables. Start fresh.

> where                          # (at the browser prompt) show call stack.
> traceback()                    # show the call stack of the last uncaught error.
> sink('/tmp/out.log')           # Redirect output to a file.

> # rstudio specific commands
> debugSource("file2.R")

> body(print.Date)    # Displays function definition.

> func_defn_string = capture.output(print(body(print.Date)))

> getAnywhere(t.ts)
A single object matching ‘t.ts’ was found
It was found in the following places
  registered S3 method for t from namespace stats
  namespace:stats
with value

function (x)
{
    cl <- oldClass(x)
    other <- !(cl %in% c("ts", "mts"))
    class(x) <- if (any(other))
        cl[other]
    attr(x, "tsp") <- NULL
    t(x)
}
<bytecode: 0x55d33b68a800>
<environment: namespace:stats>

> methods(class = class(myvar)[1])   # S3 methods available for the class of myvar

> showMethods(funcname)              # S4 methods defined for a generic

> require(raster)
> showMethods(extract)
      Function: extract (package raster)
      x="Raster", y="data.frame"
      x="Raster", y="Extent"
      x="Raster", y="matrix"
      x="Raster", y="SpatialLines"
      x="Raster", y="SpatialPoints"
      x="Raster", y="SpatialPolygons"
      x="Raster", y="vector"

# To see the source code for one of these methods the entire signature must be supplied, e.g.
> getMethod("extract", signature = c(x = "Raster", y = "SpatialPolygons"))


# Cheat Sheet See https://www.dummies.com/programming/r/r-for-dummies-cheat-sheet/

To search through the Help files, use one of the following functions :

?data.frame                   # Displays the Help file for a specific function.
??regression                  # Searches for a word (or pattern) in the Help files.
RSiteSearch('linear models')  # Performs an online search of the R site.

install.packages("sos")
library("sos")
findFn("regression")          # Opens a web page listing the matching functions.

Best way to extract function declaration into another file :

# if we want the source for randomForest::rfcv():

# To view/edit it in a pop-up window:

>  edit(getAnywhere('rfcv'), file='source_rfcv.r')

# To redirect to a separate file:

> capture.output(getAnywhere('rfcv'), file='source_rfcv.r')

edit function for debugging

> new_optim <- edit(optim)
  #  It will open the source code of optim edit it and assign result to new_optim.

# if you do not want the annoying long source code printed on your console, you can use

> invisible(edit(optim))

> edit(my_matrix)                        # you can also edit data !!!

> View(my_func)                # Works in R Studio

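Scoping, Global and Local Variables

Use <<- to assign to a variable outside the current function. R walks up the enclosing environments and modifies the first existing variable with that name; it creates a global variable only if none is found. A plain <- always creates or changes a function-local variable. In the example below, <<- inside baz changes the bar that belongs to foo, so the global bar is left untouched.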

bar <- "global"
foo <- function(){
    bar <- "in foo"
    baz <- function(){
        bar <- "in baz - before <<-"
        bar <<- "in baz - after <<-"
        print(bar)
    }
    print(bar)
    baz()
    print(bar)
}
> bar
[1] "global"
> foo()
[1] "in foo"
[1] "in baz - before <<-"
[1] "in baz - after <<-"
> bar
[1] "global"
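
A common use of this behaviour is a function that keeps private state between calls. The sketch below uses hypothetical names (make_counter, counter, count).

```r
make_counter <- function() {
  count <- 0                 # lives in make_counter's environment
  function() {
    count <<- count + 1      # updates the enclosing 'count', not a global
    count
  }
}

counter <- make_counter()
print(counter())
print(counter())

# The state sits in the closure's environment, not in the global one.
print(environment(counter)$count)
```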

Saving Loading Variables

There are three ways to save objects from your R session:

Saving all objects in your R session:
The save.image() function will save all objects currently in your R session:

save.image(file="1.RData")
These objects can then be loaded back into a new R session using the load() function:

load(file="1.RData")
Saving some objects in your R session:
If you want to save some, but not all objects, you can use the save() function:

save(city, country, file="1.RData")
Again, these can be reloaded into another R session using the load() function:

load(file="1.RData")
Saving a single object
If you want to save a single object you can use the saveRDS() function:

saveRDS(city, file="city.rds")
saveRDS(country, file="country.rds")
You can load these into your R session using the readRDS() function, but you will need to assign the result to the desired variable:

city <- readRDS("city.rds")
country <- readRDS("country.rds")
But this also means you can give these objects new variable names if needed (i.e. if those variables already exist in your new R session but contain different objects):

city_list <- readRDS("city.rds")
country_vector <- readRDS("country.rds")
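
A self-contained round trip through both mechanisms, as a sketch: the objects and the temporary file paths are invented for the example.

```r
city    <- c("Pune", "Oslo")             # example objects
country <- c("India", "Norway")

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(city, country, file = rdata_path)   # several objects, stored with their names
saveRDS(city, file = rds_path)           # one object, stored without a name

original_city <- city
rm(city, country)

loaded_names <- load(rdata_path)         # recreates 'city' and 'country', returns their names
city_copy    <- readRDS(rds_path)        # you pick the variable name yourself

print(loaded_names)
unlink(c(rdata_path, rds_path))          # clean up
```

load() overwrites any existing variables with the same names without warning, which is the main reason to prefer saveRDS()/readRDS() for single objects.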

Import Export various formats

rio package

install.packages("rio")
library("rio")
install_formats()

export(iris, 'iris.csv')       # Data frame to csv

iris_df = import('iris.csv')   # csv to Data frame

# export to sheets of an Excel workbook
export(list(mtcars = mtcars, iris = iris), "multi.xlsx")

multi_df = import_list('multi.xlsx')   # Named list of data frames, one per sheet. (import() reads a single sheet.)

readxl and writexl packages

If you want finer control over reading Excel files, use readxl :

library(readxl)

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
      col_types = NULL, na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
      guess_max = min(1000, n_max))

read_xls(...)
read_xlsx(...)

library(writexl)

x = list(mtcars = mtcars, iris = iris)
write_xlsx(x, path = "writexlsx-cars-iris.xlsx", col_names = TRUE)

# Reading a single sheet (sheets follow the order of the list written above)
carsdf = read_excel("writexlsx-cars-iris.xlsx", sheet = 1)
irisdf = read_excel("writexlsx-cars-iris.xlsx", sheet = 2)

# Reading multiple sheets
sheets = ( excel_sheets( filename ) )  # vector of names.
total_sheets = length(sheets)

# Limit the number of data rows read
datasets = readxl_example("datasets.xlsx")   # path to an example workbook shipped with readxl
read_excel(datasets, n_max = 3)

# Read from an Excel range using A1 or R1C1 notation
read_excel(datasets, range = "C1:E7")
read_excel(datasets, range = "R1C2:R2C5")

# Specify the sheet as part of the range
read_excel(datasets, range = "mtcars!B1:D5")

# Read only specific rows or columns
read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)
read_excel(datasets, range = cell_cols("B:D"))


# Get a preview of column names
this_col_names = names(read_excel(readxl_example("datasets.xlsx"), n_max = 0))

# Read all sheets from an Excel workbook.
path <- readxl_example("datasets.xls")
sheet_names = excel_sheets(path)
#> [1] "iris"     "mtcars"   "chickwts" "quakes"  

df_list = lapply(sheet_names, read_excel, path = path)

Trimming white space in data frame

Option 1 :

install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)

# str_trim or built-in trimws will work on character vector, but not sub df. 
# df$V2 returns character vector.
df$clean2<-str_trim(df$V2)

# To change only one column in-place.
myData[,1] = trimws(myData[,1])      ## trimws is built into base R

myData[,1] = str_trim(myData[,1])    ## str_trim comes from stringr
# Trim, and also replace runs of spaces inside the string by a single space.
myData[,1] = stringr::str_squish(myData[,1])

Option 2 :

df2 = data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)

Option 3 :

# Bit verbose but it is easy to read and to the point ...
for (i in names(mydata)) {
     if(class(mydata[, i]) %in% c("factor", "character")){
         mydata[, i] <- trimws(mydata[, i])
    }
 }
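
Options 2 and 3 can also be combined into an in-place update that touches only the character columns. A sketch in base R; messy, clean and is_text are illustrative names.

```r
messy <- data.frame(name = c("  Ann ", "Bob  "),
                    age  = c(31, 45),
                    city = c(" Oslo", "Pune "),
                    stringsAsFactors = FALSE)

is_text <- vapply(messy, is.character, logical(1))   # which columns are character

clean <- messy
clean[is_text] <- lapply(clean[is_text], trimws)     # trim only those columns
print(clean)
```

Assigning into clean[is_text] keeps the object a data frame and leaves the numeric columns untouched.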

Option 4 :

# Use grepl (grep returning a logical vector) to find the rows that have trailing white space
col4  = df[,4]
truth = grepl("[[:space:]]+$", col4)
# e.g. c(TRUE, FALSE, FALSE, ...)

df = df[truth,]    # Keep all columns, but only the rows where truth is TRUE.

Option 5:

install.packages('sqldf')
library(sqldf)

# Select non-empty rows.
mydf = sqldf("select * from mydf where mycol != '' ")

long_iris = sqldf("select * from iris where `Sepal.Length`  > 5.3")

Tibble Vs Data Frame

A tibble is an enhanced data frame. The tibble package provides the 'tbl_df' class, which offers the following capability :

  • Better printing than traditional data frames.

You can easily convert between them using as_tibble() and as.data.frame(). (The older as_data_frame() is deprecated.)

Polymorphism in R - S3, S4 methods

> t                                      # t is matrix transpose function.
function (x)
UseMethod("t")                           # Means it is S3 generic method
<bytecode: 0x2332948>
<environment: namespace:base>

> with
standardGeneric for "with" defined from package "base"   # Another type of method.

> methods(print)                # Show Various print methods implemented by different objects.

> getMethod('print')
  Error in getMethod("print") : no generic function found for 'print'


> methods(t)
[1] t.data.frame t.default    t.ts*       # Non-visible functions are asterisked

> getAnywhere(t.ts)             # Get the method definition from pkgs

Try :

> methods(residuals)
    # which lists, among others, "residuals.lm" and "residuals.glm".
    # This means when you have fitted a linear model, m, and type residuals(m), residuals.lm will be called.
    # When you have fitted a generalized linear model, residuals.glm will be called.
  • It's kind of the C++ object model turned upside down.
  • In C++, you define a base class having virtual functions, which are overridden by derived classes.
  • In R you define a virtual (aka generic) function and then you decide which classes will override this function.
  • Note that the classes doing this do not need to be derived from one common super class.
  • S4 has more formalism (more typing) and this may be too much for some applications.
  • S4 classes, however, can be defined like a class or struct in C++.
  • You can specify that an object of a certain class is made up of a string and two numbers for example: setClass("myClass", representation(label = "character", x = "numeric", y = "numeric"))
  • Methods that are called with an object of that class can rely on the object having those members. That's very different from S3 classes, which are just a list of a bunch of elements.

Most Important:

  • With S3 and S4, you call a member function by fun(object, args) and not by object$fun(args). If you are looking for something like the latter, have a look at the proto package. The object$fun implies that the fun must be attribute of object which may not be the case.
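
A compact sketch of both systems side by side. All names here (area, circle, square, Point, norm2) are hypothetical and chosen only for the illustration; the setClass call follows the representation() form used above.

```r
library(methods)

# S3: a generic function plus methods picked by the object's class attribute.
area <- function(shape, ...) UseMethod("area")
area.circle  <- function(shape, ...) pi * shape$r^2
area.square  <- function(shape, ...) shape$side^2
area.default <- function(shape, ...) stop("don't know how to compute the area")

# The two classes share no common super class; each is just a tagged list.
circle <- structure(list(r = 1),    class = "circle")
square <- structure(list(side = 2), class = "square")
print(area(circle))
print(area(square))

# S4: a formal class with typed slots, then a generic and a method for it.
setClass("Point", representation(label = "character", x = "numeric", y = "numeric"))
setGeneric("norm2", function(obj) standardGeneric("norm2"))
setMethod("norm2", "Point", function(obj) sqrt(obj@x^2 + obj@y^2))

p <- new("Point", label = "p", x = 3, y = 4)
print(norm2(p))      # called as fun(object), not object$fun()
```

Creating new("Point", x = "three") fails because the slot type is checked, which is the extra formalism S4 buys you.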

Notes

  • For arithmetic on a collection of numbers use vectors like c(4, 5, 5), not lists like list(4, 3, 2). Vector elements can be named with names(v) as well. Operations such as sum(a_vector) work on vectors; sum(a_list) is an error.
  • identical() is similar to equals() in Java.
  • x <<- 0 assigns zero to x in an enclosing environment (the global environment if no enclosing function defines x), whereas x = 0 assigns to a local variable.
  • The expression { a = b ; c = d; 8 } evaluates to value 8.

Delayed binding

  • Usually you need a function definition to implement a 'delayed expression evaluation'. With delayedAssign() you get the same effect from a plain variable: the code runs when you read "f" rather than when you call "f()" :

    delayedAssign("x", {
        for(i in 1:3)
            cat("yippee!\n")
        10
    })
    
    x^2 #- yippee is printed only 3 times and returns value 100.
    
  • The pryr operator %<d-% stands for delayed assignment. b %<d-% { Sys.sleep(1); 1 } is equivalent to :

    delayedAssign("b", { Sys.sleep(1); 1 })
    
  • Note that the assignment is instantaneous but reading it executes code :

    system.time( b %<d-% Sys.sleep(1) )    # instantaneous.
    system.time( b )                       # Takes 1 second
    
  • You can use substitute(), expression() etc. Understand the following example :

    # NOT RUN {
    require(graphics)
    (s.e <- substitute(expression(a + b), list(a = 1)))  #> expression(1 + b)
    (s.s <- substitute( a + b,            list(a = 1)))  #> 1 + b
    c(mode(s.e), typeof(s.e)) #  "call", "language"
    c(mode(s.s), typeof(s.s)) #   (the same)
    # but:
    (e.s.e <- eval(s.e))          #>  expression(1 + b)
    c(mode(e.s.e), typeof(e.s.e)) #  "expression", "expression"
    
    substitute(x <- x + 1, list(x = 1)) # nonsense
    
    myplot <- function(x, y)
        plot(x, y, xlab = deparse(substitute(x)),
             ylab = deparse(substitute(y)))
    
    ## Simple examples about lazy evaluation, etc:
    
    f1 <- function(x, y = x)             { x <- x + 1; y }
    s1 <- function(x, y = substitute(x)) { x <- x + 1; y }
    s2 <- function(x, y) { if(missing(y)) y <- substitute(x); x <- x + 1; y }
    a <- 10
    f1(a)  # 11
    s1(a)  # 11
    s2(a)  # a
    typeof(s2(a))  # "symbol"
    # }
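
To see that a delayed binding is evaluated once and then cached, a counter can be bumped inside the delayed expression. This is a sketch in base R; evaluations and lazy_value are made-up names.

```r
evaluations <- 0

delayedAssign("lazy_value", {
  evaluations <- evaluations + 1   # side effect so we can count the runs
  42
})

print(evaluations)      # nothing has run yet
print(lazy_value + 1)   # first read forces the expression
print(lazy_value * 2)   # second read reuses the stored value
print(evaluations)      # the expression ran exactly once
```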
    

Built-in Data Sets

  • library(help=datasets) # Gives you help info of all built-in datasets.

  • Important datasets to note:

    pressure                Vapor Pressure of Mercury as a Function of Temperature
    USArrests               Violent Crime Rates by US State
    rivers                  Lengths of Major North American Rivers
    AirPassengers           Monthly Airline Passenger Numbers 1949-1960
    WWWusage                Internet Usage per Minute
    WorldPhones             The World's Telephones
    iris                    Edgar Anderson's Iris Data
    ability.cov             Ability and Intelligence Tests
    beavers                 Body Temperature Series of Two Beavers
    cars                    Speed and Stopping Distances of Cars
    datasets-package        The R Datasets Package
    mtcars                  Motor Trend Car Road Tests
    quakes                  Locations of Earthquakes off Fiji
    randu                   Random Numbers from Congruential Generator RANDU
    sleep                   Student's Sleep Data
    state                   US State Facts and Figures
    women                   Average Heights and Weights for American Women
    
  • Built-in datasets are available without loading them. If you want to load one explicitly, the syntax is :

    data(iris)             # Load dataset iris. Not required to load built-in datasets.
    str(iris)
    head(iris)
    
  • Another one is not built-in but contains more realistic data and useful for teaching :

    install.packages("dslabs")
    
    library("dslabs")
    data(package="dslabs")
    

Exploring Iris dataset For Prediction

  • Description: Predict iris flower species from flower measurements.

  • Type: Multi-class classification

  • Dimensions: 150 instances, 5 attributes

  • Inputs: Numeric

  • Output: Categorical, 3 class labels

  • UCI Machine Learning Repository: Description

  • iris flowers datasets:

    data(iris)
    dim(iris)
    levels(iris$Species)
    head(iris)
    
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
    
  • Visualizing using ggplot2 :

    library(ggplot2)
    
    # Visualize 2 attributes ....
    ggplot(iris, aes(x = Petal.Length, y = Sepal.Length))  + geom_point() 
    # Note: ggplot() creates canvas, then geom_point() draws on top of that. !! Note the brackets !!
    
    # Visualize 3 attributes to differentiate the species ...
    ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species)) + geom_point() + geom_line()
    
    # Visualize 4 attributes to see how Petal.Width correlates. Use size of the dots.
    ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species, size = Petal.Width)) +
                 geom_point() + geom_line(size=1)    # the + must end a line, not start the next one
    
    # ggplot() alone draws no data; add geom_point() for dots, geom_line() to connect them, etc.
    # Add title.
    ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, colour = Species, size = Petal.Width)) +
                 geom_point() +
                 ggtitle('Iris Species by Petal and Sepal Length')
    
  • See Also: https://www.mailman.columbia.edu/sites/default/files/media/fdawg_ggplot2.html

Visualizing 5 dimensions

  • Say there are 5 variables: var1 ... var5;
  • Use x, y axis for var1 and var2.
  • Divide var3 range into 5 categories. Do same for var4; Now plot 5x5 = 25 plots.
  • Within each panel the var3/var4 category is fixed, but we still want visual feedback on the exact values: use var3 as the circle fill colour (by intensity) and var4 as the circle outline colour (by intensity).
  • Use var5 as the size of the circle.
  • The technique is called quantile categorization with facet_grid or facet_wrap
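
The binning step can be sketched in base R with quantile() and cut(). The data here is random and the names var3, breaks and var3_bin are placeholders; in ggplot2 the resulting factors would then be passed to facet_grid() or facet_wrap().

```r
set.seed(1)
var3 <- rnorm(200)                                   # stand-in for a continuous variable

# Break points at the 0%, 20%, ..., 100% quantiles give 5 roughly equally filled bins.
breaks   <- quantile(var3, probs = seq(0, 1, by = 0.2))
var3_bin <- cut(var3, breaks = breaks, include.lowest = TRUE,
                labels = paste0("Q", 1:5))

print(table(var3_bin))
```

include.lowest = TRUE matters: without it the minimum value falls outside the first interval and becomes NA.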

Generate distributions

  • runif() - Generates n random deviates from the uniform distribution between min and max.
  • dunif() - Gives the density at a vector of quantiles x. Inside [min, max] the density is 1 / (max - min).
  • punif() - Gives distribution function. Needs vector of quantiles q.
  • qunif() - Gives quantile function. Needs vector of probabilities p.
  • By default min = 0 and max = 1
  • rnorm(), dnorm(), pnorm(), qnorm() functions are for corresponding Normal distribution.
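
A quick check of the four prefixes on a uniform distribution over [2, 6]; a sketch with arbitrarily chosen numbers.

```r
set.seed(42)
x <- runif(5, min = 2, max = 6)          # r: 5 random draws
print(x)

print(dunif(3, min = 2, max = 6))        # d: density, 1 / (6 - 2)
print(punif(3, min = 2, max = 6))        # p: P(X <= 3) = (3 - 2) / (6 - 2)
print(qunif(0.25, min = 2, max = 6))     # q: inverse of punif

# Same naming scheme for the normal distribution.
print(pnorm(0))                          # half of the mass lies below the mean
print(qnorm(0.975))                      # upper bound of the central 95% interval
```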

Solution 1

cr <- cor(mtcars)
# This is to remove redundancy as upper correlation matrix == lower 
cr[upper.tri(cr, diag=TRUE)] <- NA
reshape2::melt(cr, na.rm=TRUE, value.name="cor")

Solution 2

# get pairwise combination of variable names
vars <- t(combn(colnames(myMat), 2))

# build data.frame with matrix subsetting
data.frame(vars, myMat[vars])
  X1 X2 myMat.vars.
1 V1 V2   0.8500071
2 V1 V3  -0.2828288
3 V1 V4  -0.2867921
4 V2 V3  -0.2698210
5 V2 V4  -0.2273411
6 V3 V4   0.9962044

You can add column names in one line as well using setNames.

df <- setNames(data.frame(vars, myMat[vars]), c("var1", "var2", "corr"))

# Now sort by the absolute value of the correlation column.
cr_order = order(abs(df$corr))
df = df[cr_order, ]

# How to generate data

set.seed(1234)
myMat <- cor(matrix(rnorm(16), 4, dimnames=list(paste0("V", 1:4), paste0("V", 1:4))))
myMat
           V1         V2         V3         V4
V1  1.0000000  0.8500071 -0.2828288 -0.2867921
V2  0.8500071  1.0000000 -0.2698210 -0.2273411
V3 -0.2828288 -0.2698210  1.0000000  0.9962044
V4 -0.2867921 -0.2273411  0.9962044  1.0000000

Note: More obvious way to generate data is here though it is bit manual process:

> d <- data.frame(x1=rnorm(10),
+                 x2=rnorm(10),
+                 x3=rnorm(10))
> x <- cor(d) # get correlations (returns matrix)
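
For comparison, the same list of correlation pairs can be built without reshape2. This is a sketch, not the method used above: it takes the upper triangle instead of the lower one and introduces the names idx and pairs_df.

```r
cr  <- cor(mtcars)
idx <- which(upper.tri(cr), arr.ind = TRUE)     # (row, col) positions above the diagonal

pairs_df <- data.frame(var1 = rownames(cr)[idx[, "row"]],
                       var2 = colnames(cr)[idx[, "col"]],
                       corr = cr[idx],          # matrix indexing with a 2-column matrix
                       stringsAsFactors = FALSE)

# Strongest relationships first, regardless of sign.
pairs_df <- pairs_df[order(abs(pairs_df$corr), decreasing = TRUE), ]
print(head(pairs_df, 3))
```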

Note about sorting

We had to sort data frame on one parameter. What if we have to sort on multiple parameters ? :

# sorting examples using the mtcars dataset
attach(mtcars)

# sort by mpg
newdata <- mtcars[order(mpg),] 

# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]

#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),] 

detach(mtcars)
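
The same sorts can be written without attach()/detach(), which avoids name clashes with other variables in the session. A sketch; the result names are made up.

```r
# Refer to the columns explicitly ...
by_mpg_cyl <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

# ... or let with() look the names up inside the data frame.
same <- mtcars[with(mtcars, order(mpg, -cyl)), ]

# The minus trick only works for numbers; the radix method takes one
# 'decreasing' flag per key and so also covers character columns.
by_radix <- mtcars[order(mtcars$mpg, mtcars$cyl,
                         decreasing = c(FALSE, TRUE), method = "radix"), ]

print(head(by_mpg_cyl[, c("mpg", "cyl")]))
```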

Heat maps and scatter plots

# https://www.sharpsightlabs.com/blog/heatmap-ggplot2-simple/

library(ggplot2)

#------------------
# CREATE DATA FRAME
#------------------
df.team_data <- expand.grid(teams = c("Team A", "Team B", "Team C", "Team D")
                           ,metrics = c("Metric 1", "Metric 2", "Metric 3", "Metric 4", "Metric 5")
                           )

# add variable: performance
set.seed(41)
df.team_data$performance <- rnorm(nrow(df.team_data))

#inspect
head(df.team_data)



#---------------------------
# PLOT: heatmap
# - here, we use geom_tile()
#---------------------------

ggplot(data = df.team_data, aes(x = metrics, y = teams)) +
  geom_tile(aes(fill = performance))

# #############################################


# Load ggplot2 package for graphics/plotting
library(ggplot2)

# Create dummy dataset
df.test_data <- data.frame(x_var = 1:50 + rnorm(50,sd=15),
                           y_var = 1:50 + rnorm(50,sd=2)
                          )

# Plot data using ggplot2 ... scatter plot..

ggplot(data=df.test_data, aes(x=x_var, y=y_var)) +
  geom_point()

R Markdown

  • See https://rmarkdown.rstudio.com/r_notebooks

  • Great to write reproducible report with inline graphics.

  • Follows 'Literate Programming' approach by Knuth; Treat program as a literature understandable to human beings.

  • RStudio has built-in support.

  • Cut & Paste is easier with R Markdown, in that aspect better than R notebook.

  • With knitr and its combination with R Markdown, writing reproducible reports became much easier.

  • In line R expressions :

    # See https://bookdown.org/yihui/rmarkdown/r-code.html
    ```{r}
           x = 5  # radius of a circle
    ```
    
    For a circle with the radius `r x`, its area is `r pi * x^2`.
    Note the inline use of r code using single back-tick in the line above.
    
  • Passing chunk options :

    # You can pass chunk options ....
    ```{r, chunk-label, results='hide', fig.height=4}
    
    ```
    
  • Just display the code, but don't run it :

    # Just have code verbatim here, and do not evaluate the code ...
    ```{r, eval=FALSE}
     x = rnorm(100)
    ```
    
  • Setup some global options for all R chunks :

    # Set up some global chunk options using knitr::opts_chunk$set()
    # The include=FALSE will hide this specific block of code and its result in output doc.
    # The option fig.width = 8 is set for all r code chunks. 
    # The 'setup' is just a label for this chunk.
    ```{r, setup, include=FALSE}
         knitr::opts_chunk$set(fig.width = 8, collapse = TRUE)
    ```
    
  • Nicely display result in a table :

    # To nicely display result in a table in output document ... use knitr::kable() ...
    ```{r my-table-demo}
          knitr::kable(iris[1:5, ], caption = 'A caption')
    ```
    
  • Quick Options reference ... :

    echo                FALSE               Hide the code in the final generated document.
    results             FALSE               Hide results output in the document.
    include             FALSE               Hide code, results, warnings, msgs from this chunk.
    eval                FALSE               Don't evaluate the code chunk.
    cache               FALSE               Re-Evaluate code chunk on each run.
    
    fig.width           8                   The (graphical device) size of width in inches.
    fig.height          4                   Output figure height is 4 inches.
    fig.dim             c(8, 4)             width is 8; Height is 4.
    fig.cap             My Caption          For Figure title
    fig.align           center              Align figure left, center or right
    out.width           80%                 The output size of R plots in document. 80% page width.
    
    dev                 png                 Default Graphical device for html is png;
                                            For latex output, device is usually pdf;
                                            Could also be: svg or jpeg.
                                            Note: output format i.e. html/pdf is not part of source doc.
    Extra: 
    collapse            TRUE                Display text output in same block as code for compact output (default is FALSE).
    warning             FALSE               Do not show warning messages in the output document.
    message             FALSE               Do not show info messages in the output document.
    error               FALSE               Halt on error instead of showing the error in the output.
    child               ./doc2.rmd          Include another markdown file here.
    
  • To include graphics from external image file ... :

    ```{r, out.width='25%', fig.align='center', fig.cap='...'}
           knitr::include_graphics('images/hex-rmarkdown.png')
    ```
    

Tidyverse Tools

  • An opinionated collection of tools.
  • Built as ordinary R packages (library extensions), but it introduces changes that feel like new syntax.
  • The syntax is non-standard but more intuitive to new users.
  • You may lose some flexibility due to added abstraction, but the code is shorter and more readable.
  • In addition to this, the ggplot2 package exploded in popularity for quick and dirty way of getting good graphics.

Recipes

Simple group by aggregation

This code uses the dplyr package from the tidyverse set of packages to take the monthly mean of ozone in the airquality dataset in R. To understand it, you need to understand the concept of a data frame and of tidy data.

library(dplyr)
group_by(airquality, Month) %>%                  # The pipe operator %>% feeds the lhs as the first argument of the rhs.
summarize(o3 = mean(Ozone, na.rm = TRUE))

Using just the functionality that came with the base R system, you’d have to do something like :

aggregate(airquality[, "Ozone"],
          list(Month = airquality[, "Month"]),
          mean, na.rm = TRUE)

Note: The na.rm arg is the argument to mean, not argument to aggregate. The dplyr package use is much more readable.

You can also use the tapply() function to aggregate values by group.

  • Suppose, for example, we have a sample of 30 tax accountants from the Australian states and territories, with each accountant's state given by a character vector:

    > state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
    "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
    "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
    "sa", "act", "nsw", "vic", "vic", "act")
    
    A factor is created using the factor() function. Its levels are the sorted unique values;
    in the case of a character vector, "sorted" means sorted in alphabetical order.
    
    > statef <- factor(state)
    
    The print() function handles factors slightly differently from other objects:
    > statef
    [1] tas sa qld nsw nsw nt wa wa qld vic nsw vic qld qld sa
    [16] tas sa nt wa vic qld nsw nsw wa sa act nsw vic vic act
    Levels: act nsw nt qld sa tas vic wa
    
    To find out the levels of a factor the function levels() can be used.
    > levels(statef)
    [1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"
    
    Suppose we have the incomes of the same tax accountants in a vector 'incomes':
    
    > incomes <-  c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
                    61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
                    59, 46, 58, 43)
    
    To calculate the sample mean income for each state we can now use the special function tapply():
    
    > incmeans <- tapply(incomes, statef, mean)

    giving a vector of means labelled by the levels (shown to 3 decimal places):

       act    nsw     nt    qld     sa    tas    vic     wa
    44.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250
    
    The function tapply() is used to apply a function, here mean(), to each group of components
    of the first argument, here incomes, defined by the levels of the second argument, here statef.
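
tapply() is not limited to mean(): any function of a vector can be applied per group. The sketch below repeats the two data vectors so that it runs on its own. The helper std_error and the result names group_sizes and income_se are chosen here for illustration:

```r
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)

# Group sizes: how many incomes fall under each level of statef.
group_sizes <- tapply(incomes, statef, length)

# Standard error of the mean per group, using a user-defined function
# (std_error is a hypothetical helper for this example).
std_error <- function(x) sqrt(var(x) / length(x))
income_se <- tapply(incomes, statef, std_error)

print(group_sizes)
print(round(income_se, 3))
```

In both cases the result has one entry per level of the factor, in the order given by levels(statef).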
    

Examine Environment

Examine the environment using the search() and ls() functions:

> search()

 [1] ".GlobalEnv"        "package:stats"     "package:graphics"
 [4] "package:grDevices" "package:utils"     "package:datasets"
 [7] "package:methods"   "Autoloads"         "package:base"

> ls(1)                   # Show list of global variables.
  [1] "a"   "a1"  "b"   "d"   "df"

> ls(2)                   # List objects at search-path position 2 (here package:stats).

> detach("package:graphics")           # Detach a specific package from the search path.

> attach(.env)            # Put .env on the search path for direct access, e.g. s() instead of .env$s().
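
The attach/detach cycle can be tried end to end with a throwaway environment. This is a minimal sketch in base R; my_env and say_hi are hypothetical names used only for this example:

```r
# Create an environment holding one function.
my_env <- new.env()
assign("say_hi", function() "hello from my_env", envir = my_env)

# attach() copies the environment's objects into a new search-path entry
# (position 2 by default), so say_hi() resolves without the my_env$ prefix.
attach(my_env, name = "my_env")
attached <- "my_env" %in% search()
greeting <- say_hi()

# detach() removes the entry again; the function is then reachable
# only as my_env$say_hi().
detach("my_env")
still_attached <- "my_env" %in% search()

print(c(attached = attached, still_attached = still_attached))
```

Because attach() works on a copy, later changes to my_env are not visible through the attached entry.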

History

  • Successor of the S language designed at Bell Labs; S-PLUS is a commercial implementation of S.
  • Alternatives at the time were SAS, Stata, SPSS, Minitab, and Microsoft Excel, but they lacked comparable programming capability.
  • knitr, combined with R Markdown, made writing reproducible reports much easier.
  • RStudio has significantly simplified the development of R packages via devtools and roxygen2.