Learning Objectives

One of R’s most powerful features is its ability to deal with tabular data - like what you might already have in a spreadsheet or a CSV. Let’s start by making a toy dataset in your data/ directory, called feline-data.csv:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

We can load this into R via the following:

cats <- read.csv(file="data/feline-data.csv")
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
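
If you would rather not create the file by hand, the same table can be built from text inside R. This is only a sketch: it assumes you are happy to pass the CSV contents through the text argument of read.csv instead of a file path, and the names csv_text and cats_inline are made up for illustration.

```r
# The same three rows as feline-data.csv, supplied as text rather than a file.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1"

# stringsAsFactors = TRUE reproduces the factor output shown in this lesson;
# R 4.0.0 and later default to FALSE, which keeps text columns as character.
cats_inline <- read.csv(text = csv_text, stringsAsFactors = TRUE)
cats_inline
```

Setting stringsAsFactors explicitly means the result does not depend on which version of R you are running.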

We can begin exploring our dataset right away, pulling out columns via the following:

cats$weight
## [1] 2.1 5.0 3.2
cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

We can do other operations on the columns:

## We discovered that the scale weighs two kg light:
cats$weight + 2
## [1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
## [1] "My cat is calico" "My cat is black"  "My cat is tabby"

But what about

cats$weight + cats$coat
## Warning in Ops.factor(cats$weight, cats$coat): '+' not meaningful for
## factors
## [1] NA NA NA

Understanding what happened here is key to successfully analyzing data in R.

Data Types

If you guessed that the last command would not work because 2.1 plus black is nonsense, you’re right (R gives a warning and NA values rather than a result) - and you already have some intuition for an important concept in programming called data types. We can ask what type of data something is:

class(cats$weight)
## [1] "numeric"
class(cats$coat)
## [1] "factor"

There are 5 main classes: numeric (double), integer, complex, logical and character. Factor is a special class that we’ll get into later.

class(1.25)
## [1] "numeric"
class(1L)
## [1] "integer"
class(TRUE)
## [1] "logical"
class('banana')
## [1] "character"

Note the L suffix to insist that a number is an integer. Character values (strings) are always enclosed in quotation marks.
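
The relationship between numeric and double, and the complex class that the examples above skip, can be sketched with class and its lower-level companion typeof. These calls are only illustrative; none of the values come from the cats data.

```r
# class() reports how R treats a value; typeof() reports how it is stored.
class(1.25)      # "numeric"
typeof(1.25)     # "double"
class(1 + 4i)    # "complex"
class(1)         # "numeric": without the L suffix, whole numbers are doubles
class(1L)        # "integer"
is.numeric(1L)   # TRUE: integers also count as numeric
```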

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences. Try adding another row to your cat data like this:

tabby,2.3 or 2.4,1

Reload your cats data like before, and check what type of data we find in the weight column:

cats <- read.csv(file="data/feline-data.csv")
class(cats$weight)
## [1] "factor"

Oh no, our weights aren’t numeric anymore! If we try to do the same math we did on them before, we run into trouble:

cats$weight + 1
## Warning in Ops.factor(cats$weight, 1): '+' not meaningful for factors
## [1] NA NA NA NA

What happened? When R reads a CSV into one of these tables, it insists that everything in a column be the same basic type; if it can’t understand everything in the column as a double, then nobody in the column gets to be a double. The table that R loaded our cats data into is called a data.frame, and it is our first example of a data structure - something that R knows how to build out of the basic data types. In order to successfully use our data in R, we need to understand what these basic data structures are, and how they behave. For now, let’s remove that extra line from our cats data and reload it, while we investigate this behavior further:

feline-data.csv:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

And back in RStudio:

cats <- read.csv(file="data/feline-data.csv")
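
When a column unexpectedly stops being numeric in a larger file, it helps to find the entries responsible. The following is one possible approach, not the only one; it feeds the four-row version of the file in as text so it runs on its own, and the names csv_text, messy, as_number and bad_rows are invented for this sketch.

```r
# The four-row file from above, including the problematic weight.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

# Keep text columns as plain character vectors so they are easy to inspect.
messy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Entries that cannot be read as numbers become NA. R normally warns about
# this, so the warning is silenced here on purpose.
as_number <- suppressWarnings(as.numeric(messy$weight))

# Rows whose weight could not be interpreted as a number.
bad_rows <- which(is.na(as_number))
messy[bad_rows, ]
```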

Vectors & Type Coercion

To better understand the behavior we just saw, let’s meet another of the data structures: the vector. Every vector holds values of one of the basic types we met above. We can create a vector of a given type and length by calling the function named after that type:

x <- numeric(5)
x
## [1] 0 0 0 0 0
y <- character(3)
y
## [1] "" "" ""

As with vectors you may have met elsewhere, a vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type.

You can check if something is a vector:

str(x)
##  num [1:5] 0 0 0 0 0

The somewhat cryptic output from this command indicates the basic data type found in this vector; the number of things in the vector; and a few examples of what’s actually in the vector. If we similarly do

str(cats$weight)
##  num [1:3] 2.1 5 3.2

we see that that’s a vector, too - the columns of data we load into R data.frames are all vectors, and that’s the root of why R forces everything in a column to be the same basic data type.

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

You can also make vectors with explicit contents with the c (combine) function:

x <- c(2,6,3)
x
## [1] 2 6 3
y <- c("Hello", "Goodbye", "I love data")
y
## [1] "Hello"       "Goodbye"     "I love data"

Given what we’ve learned so far, what do you think the following will produce?

x <- c(2,6,'3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. Consider:

x <- c('a', TRUE)
x
## [1] "a"    "TRUE"
x <- c(0, TRUE)
x
## [1] 0 1

The coercion rules go: logical -> integer -> numeric -> complex -> character. You can try to force coercion against this flow using the as. functions:

x <- c('0','2','4')
x
## [1] "0" "2" "4"
y <- as.numeric(x)
y
## [1] 0 2 4
z <- as.logical(y)
z
## [1] FALSE  TRUE  TRUE
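
The coercion chain can be checked one step at a time with typeof. This is a small sketch with made-up values; the variable converted is hypothetical and only there to show what happens when a conversion against the flow has no sensible answer.

```r
# Mixing types in c() converts everything to the most general type present.
typeof(c(TRUE, 1L))      # "integer"
typeof(c(1L, 2.5))       # "double"
typeof(c(2.5, 1 + 4i))   # "complex"
typeof(c(2, 6, '3'))     # "character"

# Going against the flow only works when the values make sense in the new
# type; anything else becomes NA (the warning is silenced on purpose here).
converted <- suppressWarnings(as.numeric(c('1', 'two', '3')))
converted                # 1 NA 3
```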

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!

But coercion isn’t a bad thing. For example, likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). R has a special kind of data type called logical, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

cats$likes_string
## [1] 1 0 1
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
## [1]  TRUE FALSE  TRUE

You can also append things to an existing vector using the c (combine) function:

x <- c('a', 'b', 'c')
x
## [1] "a" "b" "c"
x <- c(x, 'd')
x
## [1] "a" "b" "c" "d"

You can also make series of numbers:

mySeries <- 1:10
mySeries
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10, by=0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
##  [ reached getOption("max.print") -- omitted 61 entries ]

We can ask a few other questions about vectors:

x <- seq(10)
head(x, n=2)
## [1] 1 2
tail(x, n=4)
## [1]  7  8  9 10
length(x)
## [1] 10

Finally, you can give names to elements in your vector, and ask for them that way:

x <- 5:8
names(x) <- c("a", "b", "c", "d")
x
## a b c d 
## 5 6 7 8
x['b']
## b 
## 6

Missing values

Missing values are represented by NA. Functions such as min, max and mean that require knowledge of all the input values return an NA if one or more values are missing. This behaviour can be altered by setting the na.rm argument to be TRUE.

x <- c(1, 2, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 2
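
Before dropping missing values it is usually worth knowing how many there are and where. A brief sketch using is.na on the same vector; the names n_missing and x_complete are illustrative only.

```r
x <- c(1, 2, 3, NA)

is.na(x)                     # FALSE FALSE FALSE  TRUE
n_missing <- sum(is.na(x))   # TRUE counts as 1, so this counts the NAs
n_missing                    # 1

max(x)                       # NA
max(x, na.rm = TRUE)         # 3

# Keep only the values that are not missing.
x_complete <- x[!is.na(x)]
x_complete                   # 1 2 3
```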

Factors

str(cats$coat)
##  Factor w/ 3 levels "black","calico",..: 2 1 3

Another important data structure is called a factor. Factors usually look like character data, but are typically used to represent categorical information. For example, let’s make a vector of strings labeling cat colorations for all the cats in our study:

coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
## [1] "tabby"         "tortoiseshell" "tortoiseshell" "black"        
## [5] "tabby"
str(coats)
##  chr [1:5] "tabby" "tortoiseshell" "tortoiseshell" "black" ...

We can turn a vector into a factor like so:

CATegories <- as.factor(coats)
str(CATegories)
##  Factor w/ 3 levels "black","tabby",..: 2 3 3 1 2

Now R has noticed that there are three possible categories in our data - but it also did something surprising: instead of printing out the strings we gave it, it printed a bunch of numbers. R has replaced our human-readable categories with numbered indices under the hood:

class(coats)
## [1] "character"
typeof(coats)
## [1] "character"
class(CATegories)
## [1] "factor"
typeof(CATegories)
## [1] "integer"
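
The labels and the integer codes can be inspected separately, which makes the ‘numbered indices’ idea concrete. This sketch rebuilds the same factor so it runs on its own; recovered is a hypothetical name.

```r
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
CATegories <- as.factor(coats)

levels(CATegories)       # the labels: "black" "tabby" "tortoiseshell"
as.integer(CATegories)   # the code stored for each element: 2 3 3 1 2

# Looking up each code in the levels recovers the original strings.
recovered <- levels(CATegories)[as.integer(CATegories)]
identical(recovered, coats)   # TRUE

# Count how many times each category occurs.
table(CATegories)
```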

Challenge 2

When we loaded our cats data, the coat column was interpreted as a factor; try using the help for read.csv to figure out how to keep text columns as character vectors instead of factors; then write a command or two to show that the cats$coat column actually is a character vector when loaded in this way.

In modeling functions, it’s important to know what the baseline level is. This is assumed to be the first level, but by default levels are ordered alphabetically. You can change this by specifying the levels:

mydata <- c("case", "control", "control", "case")
x <- factor(mydata, levels = c("control", "case"))
str(x)
##  Factor w/ 2 levels "control","case": 2 1 1 2

In this case, we’ve explicitly told R that “control” should be represented by 1, and “case” by 2. This designation can be very important for interpreting the results of statistical models!
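
To see the difference the levels argument makes, compare the default ordering with the explicit one. The sketch also shows relevel, a base R function that moves one level to the front of an existing factor; the variable names are invented for illustration.

```r
mydata <- c("case", "control", "control", "case")

# Default: levels are sorted alphabetically, so "case" comes first.
default_order <- factor(mydata)
levels(default_order)      # "case" "control"

# Explicit: we decide the order, so "control" is the baseline.
explicit_order <- factor(mydata, levels = c("control", "case"))
levels(explicit_order)     # "control" "case"

# relevel() changes the baseline of a factor that already exists.
releveled <- relevel(default_order, ref = "control")
levels(releveled)          # "control" "case"
```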

Lists

Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it:

x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
x[2]
## [[1]]
## [1] "a"
x <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
x
## $title
## [1] "Research Bazaar"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE
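
Single and double square brackets behave differently on lists, which is easy to miss in the printed output above. A short sketch using the same named list:

```r
x <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE)

class(x[2])       # "list": single brackets return a smaller list
class(x[[2]])     # "integer": double brackets return the element itself

x$numbers         # the same element, looked up by name
sum(x$numbers)    # 55
```

Use double brackets or $ when you want to work with the contents, and single brackets when you want a shorter list.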

We can now understand something a bit surprising in our data.frame; what happens if we run:

typeof(cats)
## [1] "list"

We see that data.frames look like lists ‘under the hood’. This is because a data.frame is really a list of vectors and factors: to hold columns of different types side by side, the data.frame needs something more flexible than a single vector, and a list of equal-length columns is what gives us the familiar table.
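
Because a data.frame is a list of columns, the usual list tools work on it. This sketch builds the cats table by hand with data.frame (rather than reading the file) so that it is self-contained; column_classes is an illustrative name.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1L, 0L, 1L),
                   stringsAsFactors = TRUE)

is.list(cats)       # TRUE
length(cats)        # 3: one list element per column

# Apply class() to every column in turn.
column_classes <- sapply(cats, class)
column_classes

cats[["weight"]]    # the same vector as cats$weight
```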

Matrices

Last but not least is the matrix. We can declare a matrix full of zeros:

x <- matrix(0, ncol=6, nrow=3)
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0

and we can ask for and put values in the elements of our matrix with a couple of different notations:

x[1,1] <- 1
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
x[1][1]
## [1] 1
x[1][1] <- 2
x[1,1]
## [1] 2
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
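
The single-index notation works because R stores a matrix as one long vector, filled column by column. A small sketch with distinct values makes the ordering visible; the matrix m is made up for this example.

```r
m <- matrix(1:6, nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m[3]      # 3: the third element, counting down the columns
m[1, 2]   # 3: the same element, addressed by row and column
m[2, ]    # 2 4 6: leaving an index blank selects the whole row
dim(m)    # 2 3: rows, then columns
```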

Challenge 3

What do you think will be the result of length(x)? Try it. Were you right? Why / why not?

Challenge 4

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

Challenge 5

Create a list of length two containing a character vector for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

Challenge solutions

Discussion 1

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency, like consistently using the same separator in our data files, is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

Solution to Challenge 1

x <- 11:20
subset <- x[3:5]
names(subset) <- c('S', 'W', 'C')

Solution to Challenge 2

cats <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)
str(cats$coat)
##  chr [1:3] "calico" "black" "tabby"

Note: new students find the help files difficult to understand; make sure to let them know that this is typical, and encourage them to take their best guess based on semantic meaning, even if they aren’t sure.

Solution to Challenge 3

What do you think will be the result of length(x)?

x <- matrix(0, ncol=6, nrow=3)
length(x)
## [1] 18

Because a matrix is really just a vector with added dimension attributes, length gives you the total number of elements in the matrix.

Solution to Challenge 4

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row

Solution to Challenge 5

dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
dataStructures <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped to the wall listing all of these types and structures - leave it up for the rest of the workshop to remind people of the importance of these basics.

Exploring Data Frames

Learning Objectives

  • To learn how to manipulate a data.frame in memory
  • To tour some best practices of exploring and understanding a data.frame when it is first loaded.

At this point, you’ve seen it all - in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But a whole lot of the time, the star of the show is going to be the data.frame - the table we started with, which holds the information from a CSV once we load it. In this lesson, we’ll learn a few more things about working with data.frames.

We learned last time that the columns in a data.frame are vectors, so that our data are consistent in type throughout the column. As such, we can perform operations on them just as we did with vectors:

# Calculate weight of cats in g
cats$weight * 1000
## [1] 2100 5000 3200

We can also assign this result to a new column in the data frame:

cats$weight_g <- cats$weight * 1000
cats
##     coat weight likes_string weight_g
## 1 calico    2.1            1     2100
## 2  black    5.0            0     5000
## 3  tabby    3.2            1     3200

Our new column has appeared!

Discussion 1

What do you think

cats$weight[4]

will print at this point?

So far, you’ve seen the basics of manipulating data.frames with our cat data; now, let’s use those skills to digest a more realistic dataset.

Reading in data

Remember earlier we obtained the gapminder dataset, which contains GDP, population, and life expectancy for many countries around the world.

If you’re curious about where this data comes from you might like to look at the Gapminder website.

Let’s first open up the data in Excel, an environment we’re familiar with, to have a quick look.

Now we want to load the gapminder data into R.

As its file extension would suggest, the file contains comma-separated values, and seems to contain a header row.

We can use read.csv to read this into R

gapminder <- read.csv(file="data/gapminder-FiveYearData.csv")
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
##  [ reached getOption("max.print") -- omitted 1 row ]

Miscellaneous Tips

  1. Another type of file you might encounter is the tab-separated format. You can use read.delim to read in tab-separated files.

  2. If your file uses a different separator, the more generic read.table will let you specify it with the sep argument.

  3. You can also read in files from the Internet by replacing the file paths with a web address.

  4. You can read directly from Excel spreadsheets without converting them to plain text first by using the xlsx package.
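
Tips 1 and 2 can be tried without any extra files by passing the contents as text. This is a sketch under that assumption; the two tiny tables and the names tab_text, semi_text, from_tabs and from_semicolons are made up for illustration.

```r
# Tab-separated text: read.delim expects tabs and a header row by default.
tab_text <- "coat\tweight\ncalico\t2.1\nblack\t5.0"
from_tabs <- read.delim(text = tab_text)

# Semicolon-separated text: read.table needs to be told the separator
# and that the first row holds the column names.
semi_text <- "coat;weight\ncalico;2.1\nblack;5.0"
from_semicolons <- read.table(text = semi_text, sep = ";", header = TRUE)

# Both routes give the same numeric column.
identical(from_tabs$weight, from_semicolons$weight)
```

With a real file, replace the text argument with the file path as the first argument.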

To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it later.

Challenge 3

Go to File -> New File -> R Script, and write an R script to load in the gapminder dataset.

Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).

Using data frames: the gapminder dataset

To recap what we’ve just learned, let’s have a look at our example data (life expectancy in various countries for various years).

Remember, there are a few functions we can use to interrogate data structures in R:

class() # what is the data structure?
length() # how long is it? What about two dimensional objects?
attributes() # does it have any metadata?
str() # A full summary of the entire object
dim() # Dimensions of the object - also try nrow(), ncol()
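
To see what these functions report on a two-dimensional object, here they are applied to a small, made-up data frame called toy (it is not part of the lesson data):

```r
toy <- data.frame(id = 1:4, score = c(2.5, 3.0, 4.5, 1.0))

class(toy)     # "data.frame"
length(toy)    # 2: for a data frame, length() counts the columns
dim(toy)       # 4 2: rows, then columns
nrow(toy)      # 4
ncol(toy)      # 2
attributes(toy)   # column names, class and row names
str(toy)          # compact summary of every column
```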

Let’s use them to explore the gapminder dataset.

class(gapminder)
## [1] "data.frame"

The gapminder data is stored in a “data.frame”. This is the default data structure when you read in data, and (as we’ve heard) is useful for storing data with mixed types of columns.

Let’s look at some of the columns.

Challenge 4: Data types in a real dataset

Look at the first 6 rows of the gapminder data frame we loaded before:

head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
##  [ reached getOption("max.print") -- omitted 1 row ]

Write down what data type you think is in each column

class(gapminder$year)
## [1] "integer"
class(gapminder$lifeExp)
## [1] "numeric"

Can anyone guess what we should expect the type of the continent column to be?

class(gapminder$continent)
## [1] "factor"

If you were expecting the answer to be “character”, you would rightly be surprised.

In versions of R before 4.0.0, the default behaviour is to treat any text columns as “factors” when reading in data (from R 4.0.0 onwards, text columns are kept as character vectors by default, so you may see “character” above instead). The reason for the older default is that text columns often represent categorical data, which need to be factors to be handled appropriately by the statistical modeling functions in R.

However, it’s not obvious behaviour, and something that trips many people up. We can disable this behaviour when we read in the data.

gapminder <- read.csv(file="data/gapminder-FiveYearData.csv", 
                      stringsAsFactors = FALSE)

Tip

I highly recommend burning this pattern into your memory, or getting it tattooed onto your arm.
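
The effect of stringsAsFactors can be compared side by side on a few hand-typed lines in the gapminder layout. The text below is only a stand-in for the real file, and the names csv_text, as_factors and as_strings are hypothetical.

```r
# A tiny stand-in with the same kind of columns as the gapminder file.
csv_text <- "country,continent,year
Afghanistan,Asia,1952
Albania,Europe,1952"

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$continent)   # "factor"
class(as_strings$continent)   # "character"
class(as_strings$year)        # "integer": numeric columns are unaffected
```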

The first thing you should do when reading data in is check that it matches what you expect, even if the command ran without warnings or errors. The str function, short for “structure”, is really useful for this:

str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can see that the object is a data.frame with 1,704 observations (rows), and 6 variables (columns). Below that, we see the name of each column, followed by a “:”, followed by the type of variable in that column, along with the first few entries.

As discussed above, we can retrieve or modify the column or row names of the data.frame:

colnames(gapminder)  
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
copy <- gapminder
colnames(copy) <- letters[1:6]
head(copy, n=3)
##             a    b        c    d      e        f
## 1 Afghanistan 1952  8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957  9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007

Challenge 5

Recall that we also used the names function (above) to modify column names. Does it matter which you use? You can check help with ?names and ?colnames to see whether it should matter.

rownames(gapminder)[1:20]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"

See those numbers in the square brackets on the left? That tells you the number of the first entry in that row of output. So we see that for the 5th row, the rowname is “5”. In this case, the rownames are simply the row numbers.

Challenge Solutions

Solutions to challenges 2 & 5.

Solution to Challenge 2

Create a data frame that holds the following information for yourself:

  • First name
  • Last name
  • Age

Then use rbind to add the same information for the people sitting near you.

Now use cbind to add a column of logicals answering the question, “Is there anything in this workshop you’re finding confusing?”

my_df <- data.frame(first_name = "Andy", last_name = "Teucher", age = 36)
my_df <- rbind(my_df, data.frame(first_name = "Jane", last_name = "Smith", age = 29))
my_df <- rbind(my_df, data.frame(first_name = c("Jo", "John"), last_name = c("White", "Lee"), age = c(23, 41)))
my_df <- cbind(my_df, confused = c(FALSE, FALSE, TRUE, FALSE))

Solution to Challenge 5

?colnames tells you that the colnames function is the same as names for a data frame. For other structures, they may not be the same. In particular, names does not work for matrices, but colnames does. You can verify this with

m <- matrix(1:9, nrow=3)
colnames(m) <- letters[1:3] # works as you would expect
names(m) <- letters[1:3]  # does not rename the columns; it labels individual elements instead