Note: We covered some, but not all, of this content in the course (in the Introduction).

R is a versatile, open source programming language that was specifically designed for data analysis and was inspired by the programming language S. As such, R is extremely useful for both statistics and data science.

Introduction to R and RStudio

Let’s start by learning about our tool.

Point out the different windows in RStudio.

You can get output from R simply by typing in math

3 + 5
## [1] 8
12 / 7
## [1] 1.714286

or by typing words

"Hello World"
## [1] "Hello World"

Calculations

R can be used just like a calculator (as shown before).

3 + 5
## [1] 8
12 / 7
## [1] 1.714286
6 * 5
## [1] 30
3 ^ 4
## [1] 81

R respects the standard order of operations in math:

3 * 4 + 5
## [1] 17
3 * (4 + 5)
## [1] 27

Assignment

R is an object-oriented programming language. Everything we do in R, we do with objects. We can save our results to an object, if we give it a name.

<- is the assignment operator. The result of the operation on the right hand side of <- is assigned to an object with the name specified on the left hand side of <-. Put spaces around <- (and all other operators).

hrs_per_day <- 24
days_per_week <- 7
hrs_per_week <- hrs_per_day * days_per_week
hrs_per_week
## [1] 168
  1. Set x to be 7. What is the value of x ^ x? Save the value in an object called i.
  2. If you assign the value 20 to the object x, does the value of i change? What does this tell you about how R assigns values to objects?

A note about naming things:

Name things so that you can understand them later, and try to balance brevity with clarity. x is easy to type, but may not mean much.

  • Only begin names with letters.
  • Separate words with a dot (my.var) or an underscore (my_var), or use camelCase.
  • Try to be consistent.
  • Avoid giving things names that already exist (like mean, sum, log)

Functions

Most of R’s power and flexibility comes from functions.

A function is a saved object that takes inputs, performs a task, and returns a value. Depending on the function, it takes zero, one, or many arguments (also called parameters). To call a function, type its name followed by parentheses (). Arguments go inside the parentheses and are separated by commas.

name_of_function(arg1,arg2,arg3)

log(10)
## [1] 2.302585
sqrt(4)
## [1] 2
max(4,8,2,6,9)
## [1] 9
date() # Not all functions require arguments
## [1] "Wed Feb 10 17:12:26 2016"

Specifying arguments

Arguments can be specified using positional and/or named matching. Some arguments have default values.

# Positional matching: 5 is the first (and only) argument, and base is not
# specified, so the default (base e, i.e., the natural logarithm) is used
log(5)
## [1] 1.609438
# If you want to do base 10 log:
log(5, 10)          # OR
## [1] 0.69897
log(x=5, base=10)   # OR even
## [1] 0.69897
log(base=10, x=5)   # Works, but reordering arguments is bad form and invites confusion
## [1] 0.69897
log(10,5)           # Without names this is matched by position: log of 10 in base 5
## [1] 1.430677
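
Default values work the same way in other functions. As a small additional sketch (round() is a base R function that is not otherwise used in this lesson), its second argument, digits, defaults to 0 and can be supplied by position or by name:

```r
# round() takes x and digits; digits defaults to 0
r_default    <- round(3.14159)              # only x given, default digits used
r_positional <- round(3.14159, 2)           # both arguments matched by position
r_named      <- round(3.14159, digits = 2)  # second argument matched by name

r_default
## [1] 3
r_positional
## [1] 3.14
r_named
## [1] 3.14
```

Naming the argument (digits = 2) makes the call easier to read than relying on position alone.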

Help

All functions come with a help screen. To get help on a function, type ? followed by the function name.

?log

If you don’t know the name of a function, you can find functions associated with a topic by typing ??topic_name (one word) or ??"topic phrase" (multiple words). Sometimes Google is just as (or more) effective.

It is important that you learn to read the help screens, since they explain what the function does and how it works, and they usually include examples at the very bottom.

Workspace

List objects in your workspace with the ls() function.

x <- 5
ls()
## [1] "days_per_week" "hrs_per_day"   "hrs_per_week"
## [4] "x"

Remove objects from your workspace with the rm() function.

rm(x)

Remove all objects from your workspace.

rm(list = ls())

Notice that we have nested one function inside of another. Calling ls() generates a character vector with the names of the objects in the workspace. This vector is then passed to the list argument of rm(), so all the objects named in it are removed.

Data types and structures

Understanding basic data types in R

To make the best of the R language, you’ll need a strong understanding of the basic data types and data structures and how to operate on those.

This is very important to understand because these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object.

R has 6 basic data types (we will not discuss the raw type in this workshop). The other five are:

  • character: "a", "swc"
  • numeric: 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

Use class() to ask what type an object is:

# Example
x <- "Hello World"
class(x)
## [1] "character"
y <- 10
class(y)
## [1] "numeric"
z <- 5L
class(z)
## [1] "integer"
q <- TRUE
class(q)
## [1] "logical"
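
The complex type from the list above is not shown in the example, so here is a short additional sketch. It also uses typeof() and the is.*() family, base R functions that are not covered elsewhere in this lesson, to show how a value is stored and how to test for a type:

```r
w <- 1 + 4i
class(w)
## [1] "complex"

y <- 10
typeof(y)          # numeric values are stored as doubles unless you add L
## [1] "double"

is.numeric(5L)     # integers count as numeric
## [1] TRUE
is.character("5")  # quotes make it a character, even if it looks like a number
## [1] TRUE
```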

Logicals (TRUE, FALSE) merit a bit more exploration. They usually result from comparisons made with >, <, >=, <=, ==, and !=:

5 > 3
## [1] TRUE
3 >= 3
## [1] TRUE
a <- 9
b <- 10
a == b # This is called a "logical equals"
## [1] FALSE
a != b
## [1] TRUE

Note: You can encode logicals with T and F but don’t do it! Use TRUE and FALSE:

T
## [1] TRUE
T <- FALSE
T
## [1] FALSE
# Chaos ensues

Basic data structures in R

R has many data structures. The main ones are:

Vectors

A vector is the most common and basic data structure in R and is pretty much the workhorse of R. It is a one-dimensional object with a series of values, all of the same type.

A vector is a collection of elements that are most commonly character, logical, integer or numeric.

You can create an empty vector of various types by using their corresponding functions, such as character(), numeric(), etc.

character(5) ## empty character vector of length 5
## [1] "" "" "" "" ""
numeric(5)
## [1] 0 0 0 0 0
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE

Usually, empty vectors aren’t that useful. Use the function c() to manually construct vectors:

x <- c(1, 2, 3)
x
## [1] 1 2 3
length(x)
## [1] 3

x is a numeric vector. These are the most common kind. They are numeric objects and are treated as double precision real (decimal) numbers.

To explicitly create integers, add an L to each (or coerce to the integer type using as.integer()).

x1 <- c(1L, 2L, 3L)

You can also have logical vectors.

y <- c(TRUE, TRUE, FALSE, FALSE)

Finally you can have character vectors:

first.names <- c("Sarah", "Tracy", "John")

Examine your vector

length(first.names)
## [1] 3
class(first.names)
## [1] "character"
str(first.names)
##  chr [1:3] "Sarah" "Tracy" "John"
  1. Do you see a property that’s common to all these vectors above?

Add elements to a vector

c(first.names, "Annette")
## [1] "Sarah"   "Tracy"   "John"    "Annette"

Note that the above doesn’t actually change the first.names vector itself. Rather, it prints out a new (unassigned) vector made up of first.names with “Annette” added to the end. If you want to actually update the first.names vector, you need to reassign it:

first.names <- c(first.names, "Annette")
first.names
## [1] "Sarah"   "Tracy"   "John"    "Annette"

You can also create vectors as a sequence of numbers using : or seq()

series <- 1:10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, by = 0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
##  [ reached getOption("max.print") -- omitted 61 entries ]

NaN means Not a Number. It’s an undefined value.

0/0
## [1] NaN
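
A few related special values are worth telling apart. The sketch below uses is.nan() and is.na(), base R functions not introduced above, to compare NaN with Inf and with the missing value NA:

```r
pos_inf <- 1 / 0
pos_inf
## [1] Inf

nan_value <- 0 / 0
is.nan(nan_value)
## [1] TRUE
is.na(nan_value)   # NaN is also treated as missing
## [1] TRUE
is.nan(NA)         # but a missing value is not NaN
## [1] FALSE
```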

What happens when you mix types?

R will create a resulting vector that is the least common denominator; this is called coercion. The coercion will move towards the type that is easiest to coerce to:

logical < integer < double < complex < character

Guess what the following do without running them first

xx <- c(1.7, "a") 
xx <- c(TRUE, 2) 
xx <- c("a", TRUE) 

This is called implicit coercion. You can also coerce vectors explicitly using the as.<class_name>() functions. For example:

as.numeric("1")
## [1] 1
as.character(1:2)
## [1] "1" "2"
  1. What happens when you try to coerce the following vector to numeric?
x <- c("txt", "one", "1", "1.9")
y <- as.numeric(x)
## Warning: NAs introduced by coercion
  1. Calculate the mean of y. What happens?
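
One possible way to work through the exercise above is sketched here. The na.rm argument of mean() and the use of suppressWarnings() are additions for illustration and are not part of the original exercise:

```r
x <- c("txt", "one", "1", "1.9")
y <- suppressWarnings(as.numeric(x))  # silence the coercion warning for this demo

is.na(y)               # strings that are not numbers become NA
## [1]  TRUE  TRUE FALSE FALSE

mean_with_na <- mean(y)
mean_with_na           # a single NA makes the whole result NA
## [1] NA

mean_without_na <- mean(y, na.rm = TRUE)
mean_without_na        # na.rm = TRUE drops the missing values first
## [1] 1.45
```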

Indexing

Each element in a vector has a numbered position, and these numbers can be specified to subset the vector using vector_name[index(es)].

first.names[1]
## [1] "Sarah"

Note that this doesn’t actually change first.names, it just extracts and prints to the screen the elements you told it to (in this case the first name).

You can also put a vector inside the square brackets with the positions you want to extract:

first.names[c(1,2,4)]
## [1] "Sarah"   "Tracy"   "Annette"
p <- c(1:3)
first.names[p]
## [1] "Sarah" "Tracy" "John"

You can use negative numbers to exclude elements:

first.names[-3] # omit the third element
## [1] "Sarah"   "Tracy"   "Annette"
  1. Remove Tracy and John from the first.names vector. How did you do this?
first.names[-c(2,3)]
## [1] "Sarah"   "Annette"
first.names[-2:-3]
## [1] "Sarah"   "Annette"
first.names[c(1,4)]
## [1] "Sarah"   "Annette"
first.names[!first.names %in% c("Tracy", "John")]
## [1] "Sarah"   "Annette"
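
The last solution above relies on logical indexing: a logical vector inside the square brackets keeps the elements where it is TRUE. A minimal sketch with a made-up vector (ages is hypothetical data, not part of the lesson):

```r
ages <- c(23, 35, 41, 18)   # hypothetical data

over_30 <- ages > 30        # a comparison returns one logical per element
over_30
## [1] FALSE  TRUE  TRUE FALSE

ages[over_30]               # keep only the elements where the test is TRUE
## [1] 35 41

young_adults <- ages[ages >= 18 & ages < 40]   # combine tests with &
young_adults
## [1] 23 35 18
```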

Matrix

Matrices are a special vector in R. They are not a separate type of object but simply a vector with dimensions added on to it. Matrices have rows and columns.

m <- matrix(nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
dim(m)
## [1] 2 2

Matrices are filled column-wise.

m <- matrix(1:6, nrow = 2, ncol = 3)
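
To see what column-wise filling means, it helps to print the result and compare it with a row-wise fill. The sketch below assumes the byrow argument of matrix() and the [row, column] indexing syntax, neither of which is introduced in the text above:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

m[2, 3]   # element in row 2, column 3
## [1] 6
```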

Other ways to construct a matrix

m      <- 1:10
dim(m) <- c(2, 5)

This takes a vector and transforms it into a matrix with 2 rows and 5 columns.

Another way is to bind columns or rows using cbind() and rbind().

x <- 1:3
y <- 10:12
cbind(x, y)
##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind(x, y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

Lists

In R, lists are a lot like vectors. Unlike the vectors described above, however, the contents of a list are not restricted to a single data type and can encompass any mixture of data types (even other lists!). This makes them much more flexible than ordinary vectors.

Create lists using list() or coerce other objects using as.list()

x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
x <- 1:10
x <- as.list(x)
length(x)
## [1] 10

Lists, like vectors, can be indexed, though slightly differently. Use double square brackets list[[index]] to get the contents of a list element. Using single square brackets will still return a list.

  1. What is the class of x[1]?
  2. How about x[[1]]?
Andy <- list(name = "Andy", fav_nums = 1:10, fav_data = head(iris))
Andy
## $name
## [1] "Andy"
## 
## $fav_nums
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $fav_data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  1. What is the length of this object? What about its structure?

Lists can be extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.
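
As a rough sketch of that idea, the hypothetical function describe() below (its name and contents are made up for illustration, and writing your own functions is not covered in this lesson) returns three different results bundled into one list:

```r
# Hypothetical helper: summarise a numeric vector and return several results at once
describe <- function(values) {
  list(
    n     = length(values),
    mean  = mean(values),
    range = range(values)
  )
}

result <- describe(c(2, 4, 6, 8))
result$n            # access elements by name with $
## [1] 4
result$mean
## [1] 5
result[["range"]]   # or with double square brackets
## [1] 2 8
```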

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

  • A data frame is a special type of list where every element of the list is a vector of the same length.
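
A short sketch of what that means in practice, using a small made-up data frame (df is a hypothetical example): the list tools shown above also work on it, with each column acting as one list element.

```r
df <- data.frame(x = 1:3, y = c(10, 20, 30))   # hypothetical example data

is.list(df)
## [1] TRUE
length(df)    # number of list elements, i.e. columns (not rows)
## [1] 2
df[["x"]]     # extract a column just like a list element
## [1] 1 2 3
df$y
## [1] 10 20 30
```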

Factors

Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modelling functions such as lm() and glm() and also in plot() methods. Almost any other time they’re a huge pain.

Once created, factors can only contain a pre-defined set of values, known as levels.

Factors are stored as integers that have labels associated with the unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. Some string methods will coerce factors to strings, while others will throw an error.

Sometimes factors can be left unordered. Example: male, female.

Other times you might want factors to be ordered (or ranked). Example: low, medium, high.

Under the hood, these levels are represented by the numbers 1, 2, 3.

Factors are better than simple integer labels because they are self-describing: male and female are more descriptive than 1s and 2s. This is helpful when there is no additional metadata.

Which is male? 1 or 2? You wouldn’t be able to tell with just integer data. Factors have this information built in.
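
An ordered factor for the low, medium, high example might be built as in the sketch below. The levels and ordered arguments of factor() are real, but the data in sizes is made up for illustration:

```r
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),  # set the ranking explicitly
                ordered = TRUE)

as.integer(sizes)   # the integer codes follow the order of the levels
## [1] 1 3 2 1

sizes < "high"      # ordered factors support comparisons
## [1]  TRUE FALSE  TRUE  TRUE
```

Without the levels argument, the levels would be sorted alphabetically (high, low, medium), which is rarely the ranking you want.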

Factors can be created with factor(). Input is often a character vector.

x <- factor(c("yes", "no", "no", "yes", "yes"))
x
## [1] yes no  no  yes yes
## Levels: no yes
str(x)
##  Factor w/ 2 levels "no","yes": 2 1 1 2 2

table(x) will return a frequency table counting the number of elements in each level.
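
For the factor created above, the frequency table looks like this (a small sketch; the object name counts is made up for illustration):

```r
x <- factor(c("yes", "no", "no", "yes", "yes"))
counts <- table(x)
counts
## x
##  no yes
##   2   3

counts[["yes"]]   # pull out a single count by level name
## [1] 3
```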

If you need to convert a factor to a character vector, simply use

as.character(x)
## [1] "yes" "no"  "no"  "yes" "yes"

To convert a factor to a numeric vector, go via a character. Compare:

f <- factor(c(1,5,10,2))
as.numeric(f) ## wrong!
## [1] 1 3 4 2
as.numeric(as.character(f))
## [1]  1  5 10  2

In modeling functions, it is important to know what the baseline level is. This is the first level of the factor, and by default the levels are ordered alphanumerically. You can change this by specifying the levels (another option is to use the function relevel()).

x <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
x
## [1] yes no  yes
## Levels: yes no
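
The relevel() alternative mentioned above could look like the following sketch, which moves one level to the front with the ref argument and leaves the others in place (x2 is a made-up name for the result):

```r
x <- factor(c("yes", "no", "yes"))
levels(x)                       # alphabetical by default, so "no" is the baseline
## [1] "no"  "yes"

x2 <- relevel(x, ref = "yes")   # make "yes" the baseline instead
levels(x2)
## [1] "yes" "no"
```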

Data frames

A data frame is a very important data structure in R. It’s pretty much the de facto data structure for most tabular data and what we use for statistics.

Some additional information on data frames:

  • Usually created by read.csv() and read.table().
  • Can also be created with the data.frame() function.
  • Find the number of rows and columns with nrow(dat) and ncol(dat), respectively.
  • Rownames are usually 1, 2, …, n.

Creating data frames

dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
##    id  x  y
## 1   a  1 11
## 2   b  2 12
## 3   c  3 13
## 4   d  4 14
## 5   e  5 15
## 6   f  6 16
## 7   g  7 17
## 8   h  8 18
## 9   i  9 19
## 10  j 10 20

Useful functions

  • head() - show first 6 rows
  • tail() - show last 6 rows
  • dim() - returns the dimensions
  • nrow() - number of rows
  • ncol() - number of columns
  • str() - structure of each column
  • names() - shows the names attribute for a data frame, which gives the column names.
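
Applied to the dat data frame created above, a few of these functions give the results below. The n argument of head() and column access with $ are small additions not listed in the bullets:

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

dim(dat)            # rows, then columns
## [1] 10  3
names(dat)
## [1] "id" "x"  "y"
head(dat, n = 3)    # n controls how many rows are shown
##   id x  y
## 1  a 1 11
## 2  b 2 12
## 3  c 3 13
dat$x               # extract a single column with $
##  [1]  1  2  3  4  5  6  7  8  9 10
mean(dat$y)
## [1] 15.5
```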

Summary of Data Structures

Dimensions  Homogeneous  Heterogeneous
1-D         vector       list
2-D         matrix       data frame

Get new functions: Packages

To install any package use install.packages()

install.packages("ggplot2")  ## install the ggplot2 package
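
Installing a package is a one-time step, but the package has to be loaded with library() in every new R session before its functions can be used. A cautious sketch, assuming ggplot2 may or may not be installed on your machine (has_ggplot2 is a made-up name):

```r
# requireNamespace() returns TRUE if the package is installed, without attaching it
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  library(ggplot2)   # attach the package so its functions are available
  message("ggplot2 is loaded")
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```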

You can’t ever learn all of R, but you can learn how to build a program and how to find help to do the things that you want to do.