Data analysis II: Visualisation, practice
Laurent Bergé, University of Bordeaux, BxSE, 09/12/2021

Outline

Intro
A bit of graphical code in R
A functional approach to graphs
More graphical code in R

A tale of two paths...

Pros and cons: Base R

Pros

the principle is easy to grasp: we simply overlay successive forms on top of each other
low level operations $=$ everything is possible

Cons

because everything is overlaid on something already in place, the first plot is critical: you need to plan everything in advance! Doing simple things can be surprisingly difficult.
many commands are not really intuitive. After 10 years, I still look up ?par regularly.

Messing up the first plot in Base R

You miss most of the show!
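
A minimal sketch of how this happens, using two made-up vectors (y1 and y2 are assumptions for the demo, not data from the slides). The axes are fixed by the first call to plot(), so whatever is added later outside of them is silently clipped. Computing the limits from everything that will be drawn is one way to plan ahead.

```r
# Hypothetical data: the second series lives on a larger scale
y1 = 1:5
y2 = (1:5) * 3

# The first plot fixes the axes using y1 only
plot(y1, type = "b")
lines(y2, col = "firebrick")   # only the very beginning of y2 is visible

# Planning ahead: compute the limits from everything that will be drawn
ylim_all = range(y1, y2)
plot(y1, type = "b", ylim = ylim_all)
lines(y2, col = "firebrick")   # now fully visible
```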

Disclaimer

I'm a ggplot2 noob, base R is my home. I have a lot of sympathy for it.

Pros and cons: ggplot2

Pros

much more user friendly: it stacks all your layers before creating the graph and does all the computations for you. You don't need to overthink how to set the stage any more.
a very large number of contributed extension packages

Cons

since there is pre-processing, it is sometimes difficult to do exactly what you want (e.g. for a very precise publication graph). To be confirmed, it's only second-hand experience!

Wait! There are other paths!

Ready made solutions

there are numerous packages out there which make good graphs with a user-friendly interface (i.e. with minimal user input)
examples: ggpubr or fplot for distributions; ggcorrplot for correlations; highcharter for many things, etc.

Pros and cons: Ready made solutions

Pros

in a single line of code you get a graph of (usually) very decent quality
very good for exploratory graphs

Cons

the level of customization is limited; this is especially problematic for presentations/publications, for which you want a very high level of customization

Direct editing

I strongly encourage you to learn how to work with Inkscape
it's just crazy how fast you can edit/create images
that's indispensable in your skill set, you'll save so much time!

Pros and cons: Direct editing

Pros

you can do exactly what you want, as precisely as you want
with practice, you can edit very rapidly

Cons

direct editing only comes after the graph has first been created: hence you have to move between software
it cannot be automated: all the work you do on one graph has to be done again for a graph with new data

Short introduction to Base R

Data content

R has multiple functions to display data:

plot(): CORE graphical function
points(), lines(), abline(), text(): used to plot additional data
density(), hist(), boxplot(), etc...

Plot

The function plot() is the main graphical function of R (more precisely, it's a generic function whose behavior depends on the class of its argument).
By default it draws a scatterplot between two variables, but it can be used to do much more than that.
Some functions preprocess the data, like density(), and completely change the behavior of plot() when you apply it to the preprocessed data. More on that later.
When you apply plot(), it creates a new graphic and the previous one is lost (of course there are exceptions...). To add several pieces of information, you'll need to use other functions.
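
A short sketch of the point about preprocessing, on simulated data (the vector x and the seed are assumptions for the demo). The same call to plot() draws a scatterplot for a numeric vector and a curve for the object returned by density().

```r
set.seed(1)            # arbitrary seed, only for reproducibility
x = rnorm(200)         # simulated data, not from the slides

plot(x)                # numeric vector: scatterplot of x against its index

d = density(x)         # preprocessing: kernel density estimate
class(d)               # the class that plot() dispatches on
plot(d)                # draws the estimated density curve, replacing the previous plot
```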

Main `plot` arguments

Main plot arguments relating to data:

x, y: the data
xlim, ylim: the limits of the plotting region
col, pch, lty, lwd: color, symbol, line type and line width
type: the type of plot
log: which axes to draw on a logarithmic scale ("x", "y" or "xy")
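
To make the list concrete, here is one possible call that uses each of these arguments at once. The data (x and its square) are made up for the illustration.

```r
# Hypothetical data
x = 1:10
y = x^2

plot(x, y,
     xlim = c(0, 12), ylim = c(1, 200),  # limits of the plotting region
     col = "steelblue", pch = 16,        # color and symbol of the points
     lty = 2, lwd = 2,                   # dashed, thick connecting line
     type = "b",                         # "b": both points and lines
     log = "y")                          # y axis on a logarithmic scale
```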

Plot: type

Plot: type = "n"

Using type = "n" hides the data, but EVERYTHING else is there (axes, labels, box). This is useful when constructing complex graphics, i.e. when setting the stage.

plot(1:5, type = "n")
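
A sketch of what "setting the stage" can look like in practice, with made-up data: the empty plot fixes the axes, then each layer is added in the order it should appear.

```r
# Hypothetical data, only for illustration
x = 1:5
y = c(2, 4, 3, 5, 1)

plot(x, y, type = "n")                    # stage only: axes, labels, box
abline(h = 1:5, col = "gray", lty = 3)    # background guides first
lines(x, y, lwd = 2)                      # then the line
points(x, y, pch = 16, col = "firebrick", cex = 2)  # and the points on top
```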

Plot: limits

Plot: pch

plot(1:20, pch = 1:20)
grid()

Plot: cex

plot(1:5, pch = 16, cex = 1:5, main = "cex: modify point size")

Plot: lty and lwd

Plot: col I

Plot: col II

Lots of color possibilities:

Custom colors: rgb(), hsv(), etc.
"Nice" colors: package RColorBrewer
Color interpolation:
- rainbow(n), heat.colors(n), etc, create vectors of n colors.
- colorRampPalette(c("white", "blue"))(5): create a vector of 5 colors between the colors white and blue.
Nice introduction to R colors in the R-stats UBC course
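
A small sketch of these options using only base R functions (RColorBrewer is left out so that the example runs without extra packages). The variable names are chosen for the illustration.

```r
# A custom color from its red/green/blue components (values in [0, 1])
red = rgb(1, 0, 0)

# Interpolation: 5 colors going from white to blue
blues = colorRampPalette(c("white", "blue"))(5)

# A built-in palette with n colors
n = 5
rb = rainbow(n)

# One row of points per palette, to look at the colors
plot(1:n, rep(1, n), pch = 16, cex = 4, col = blues, ylim = c(0, 3))
points(1:n, rep(2, n), pch = 16, cex = 4, col = rb)
```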

Exercise: Plot

Generate a 100-period Brownian motion $x_{t+1} = x_t + \epsilon_t$, $\epsilon_t \sim N(0, 1)$.

Plot its evolution with both a solid line and filled points (in the same graph).
This time display only the points and use the function rainbow() to set the color of each point.
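
As a hint for the generation step only (the plots are left as the exercise), the recursion can be simulated with a cumulative sum. This sketch assumes a starting value of 0 and an arbitrary seed.

```r
set.seed(1)          # arbitrary seed, only so the draw is reproducible
n = 100              # number of periods
# rnorm() is parameterized by the standard deviation (sd), not the variance
eps = rnorm(n, mean = 0, sd = 1)
x = cumsum(eps)      # x_{t+1} = x_t + eps_t, starting from 0
```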

Adding data points

To add points/lines onto an existing plot:

lines()
points()

They behave like the function plot() and accept the same graphical arguments (col, lty, cex, lwd, pch).

Lines & points

plot(1:5, ylim = c(-2, 5))
lines(1:5 - 1)
points(1:5 - 2)

Exercise: Plot & line

In the following graph, the functions plot(), lines() and points() have been called. Can you tell which command produced each graphical element, and in what order they were called?

Exercise: Plot & lines

Re-generate the previous Brownian motion.

Plot it with both line and dots.
Generate another Brownian motion with $\epsilon_t \sim N(0, 4)$.
Plot the two motions on a single graph; the second one should be in "firebrick" color, with a thick dashed line and triangle symbols.

abline I

The function abline() draws lines. Its arguments are:

h: coordinate of horizontal line
v: coordinate of vertical line
a, b: intercept (a) and slope (b) of a straight line. A shorthand exists: you can pass the result of an OLS regression (function lm()) instead.

abline II

plot(iris$Sepal.Length, iris$Petal.Width)
abline(lm(Petal.Width ~ Sepal.Length, iris))
abline(h = c(1, 2), v = c(5, 7), col = "gray", lty = 3)

Exercise: abline

You want to illustrate the relation between the variables "Sepal.Length" and "Petal.Width" for each species of the iris data.

Plot the scatterplot between the two variables with one color per species.
Draw the regression lines for each group with the appropriate color.

Text I

You can add text to an existing plot with the function text(). The most important arguments are:

x, y: coordinates of the text
labels: the text to be displayed
pos: the position of the text relative to the coordinates. By default (pos = NULL) the text is centered on them; pos = 1: below, 2: left, 3: above, 4: right.
As usual, other graphical parameters apply: cex (size), col, etc.

Text II

plot(5:1, col = "firebrick", pch = 18, xlim = c(0, 6))
text(1, 5, "pos = default")
text(2:5, 4:1, paste0("pos = ", 1:4), pos = 1:4)

Exercise: text

As in the previous exercise, plot the scatterplot between the two variables with one color per species for the variables "Sepal.Length" and "Petal.Width" of the iris data.
Add the Species names in the middle of the points for each species in the right color and with large font.

A functional approach to graphs

From pain we learn

Base R can be surprisingly painful for doing seemingly simple stuff.

Q: What does a programmer do when facing a tedious task?

A: S/he automates it!

Base R is so painful, that if you stick to it, it will make you a good programmer (or a masochist!).

Remember though: it's not just painful, it's also extremely powerful!

Hard coding

It's very easy to write code that is specific to your current data! In fact, it's usually the first thing we do, and it works well.

plot(iris$Petal.Length, iris$Sepal.Length, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = 2, cex = 4)
text(4, 6, "Versicolor", 
     font = 2, cex = 4, col = 2)
text(6, 7, "Virginica", 
     font = 2, cex = 4, col = 3)

Hard coding: The problem

If your data changes, even slightly, your code is messed up.

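Changing Sepal.Length into Sepal.Width loses the labels: Sepal.Width ranges from 2 to 4.4, so the text placed at y = 5, 6 and 7 falls outside the plotting region: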

plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = 2, cex = 4)
text(4, 6, "Versicolor", 
     font = 2, cex = 4, col = 2)
text(6, 7, "Virginica", 
     font = 2, cex = 4, col = 3)

To remember

The data always changes!

Hard coding: The routine

If you want to replicate a hard coded graph to a new data set you:

copy paste the code
change the data
make the adjustments so the graph looks as you wish with the new data

Needless to say, each of these three steps is highly error-prone and can cost you dearly.

Hard coding: The solution

very simple: don't hard code!
OK, here come some tips

Tip 1: Define global variables!

whenever a value is repeated, store it in a global variable defined at the beginning of the piece of code (uppercase names work well to identify such parameters)

BAD

plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = 2, cex = 4)
text(4, 6, "Versicolor", 
     font = 2, cex = 4, col = 2)
text(6, 7, "Virginica", 
     font = 2, cex = 4, col = 3)

GOOD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = FONT, cex = CEX)
text(4, 6, "Versicolor", 
     font = FONT, cex = CEX, col = 2)
text(6, 7, "Virginica", 
     font = FONT, cex = CEX, col = 3)

Tip 2: Lay bare how you think!

when you decide to place some text here, or a legend there, how do you make the decision?
you decide based on heuristics (although you may not even notice there was a decision process!)
the game is to extract the (often implicit) rules that led you to that decision
if you manage to make the heuristic explicit, you win: now you can automate it!

Tip 2: Lay bare how you think!

Remember when I asked to put the names in the middle of the points?

Q: What does "in the middle" mean mathematically?

A: The barycenter!

BAD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = FONT, cex = CEX)
text(4, 6, "Versicolor", 
     font = FONT, cex = CEX, col = 2)
text(6, 7, "Virginica", 
     font = FONT, cex = CEX, col = 3)

GOOD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
bary = aggregate(cbind(Petal.Length, Sepal.Width) ~ Species, 
                 iris, mean)
text(bary[1, 2], bary[1, 3], "Setosa", 
     font = FONT, cex = CEX)
text(bary[2, 2], bary[2, 3], "Versicolor", 
     font = FONT, cex = CEX, col = 2)
text(bary[3, 2], bary[3, 3], "Virginica", 
     font = FONT, cex = CEX, col = 3)
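
To see why bary[1, 2] and bary[1, 3] are the coordinates of the first label, it can help to look at what aggregate() returns. A small sketch using only base R (x_mid and y_mid are names introduced here for illustration):

```r
bary = aggregate(cbind(Petal.Length, Sepal.Width) ~ Species,
                 iris, mean)
print(bary)   # one row per species: Species, Petal.Length, Sepal.Width

# Column 2 holds the mean x and column 3 the mean y;
# using the column names avoids relying on their positions
x_mid = bary$Petal.Length
y_mid = bary$Sepal.Width
```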

Tip 3: Loop whenever possible!

whenever you repeat two statements: use a loop instead!

BAD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
bary = aggregate(cbind(Petal.Length, Sepal.Width) ~ Species, 
                 iris, mean)
text(bary[1, 2], bary[1, 3], "Setosa", 
     font = FONT, cex = CEX)
text(bary[2, 2], bary[2, 3], "Versicolor", 
     font = FONT, cex = CEX, col = 2)
text(bary[3, 2], bary[3, 3], "Virginica", 
     font = FONT, cex = CEX, col = 3)

GOOD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
categ_val = levels(iris$Species)
for(i in seq_along(categ_val)){
    data = iris[iris$Species == categ_val[i], ]
    text(mean(data$Petal.Length), 
         mean(data$Sepal.Width), categ_val[i],
         font = FONT, cex = CEX, col = i)
}

Tip 4: Loop over the tips!

Apply recursively Tip 1, Tip 2 and Tip 3 until you can't any more.

BAD

FONT = 2
CEX = 4
plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
categ = levels(iris$Species)
for(i in seq_along(categ)){
    data = iris[iris$Species == categ[i], ]
    text(mean(data$Petal.Length), 
         mean(data$Sepal.Width), categ[i],
         font = FONT, cex = CEX, col = i)
}

GOOD

FONT = 2
CEX = 4
x = iris$Petal.Length
y = iris$Sepal.Width
categ = iris$Species
plot(x, y, col = categ, pch = 20, cex = 2)
categ_val = levels(categ)
for(i in seq_along(categ_val)){
    who = categ == categ_val[i]
    text(mean(x[who]), mean(y[who]), categ_val[i],
         font = FONT, cex = CEX, col = i)
}

Are the tips useful?

Can those tips be concretely helpful?

To know that, let's summon the copy-paste demon.

Code without tips

plot(iris$Petal.Length, iris$Sepal.Width, 
     col = iris$Species, pch = 20, cex = 2)
text(1.5, 5, "Setosa", 
     font = 2, cex = 4)
text(4, 6, "Versicolor", 
     font = 2, cex = 4, col = 2)
text(6, 7, "Virginica", 
     font = 2, cex = 4, col = 3)

Code without tips: Summoning

Code without tips: Outcome

The demon has immense powers

Code with tips

FONT = 2
CEX = 4
x = iris$Petal.Length
y = iris$Sepal.Width
categ = iris$Species
plot(x, y, col = categ, pch = 20, cex = 2)
categ_val = levels(categ)
for(i in seq_along(categ_val)){
    who = categ == categ_val[i]
    text(mean(x[who]), mean(y[who]), categ_val[i],
         font = FONT, cex = CEX, col = i)
}

Code with tips: Summoning

Code with tips: Outcome

The demon is weak

Tips: Side benefits

If you've followed the tips, guess what:

You can create a function for your graph for free!

Before

FONT = 2
CEX = 4
x = iris$Petal.Length
y = iris$Sepal.Width
categ = iris$Species
plot(x, y, col = categ, pch = 20, cex = 2)
categ_val = levels(categ)
for(i in seq_along(categ_val)){
    who = categ == categ_val[i]
    text(mean(x[who]), mean(y[who]), 
         categ_val[i],
         font = FONT, cex = CEX, col = i)
}

After

scatter_name = function(x, y, categ, font = 2, cex = 4){
    plot(x, y, col = categ, pch = 20, cex = 2)
    categ_val = levels(categ)
    for(i in seq_along(categ_val)){
        who = categ == categ_val[i]
        text(mean(x[who]), mean(y[who]), categ_val[i],
             font = font, cex = cex, col = i)
    }
}
scatter_name(iris$Petal.Length, 
             iris$Sepal.Width,
             iris$Species)

Why create functions to make graphs?

guards you against, or limits, copy-paste problems
facilitates graph replications
you don't have to think about implementation details when running the function (reduces mental load)
it's very easy to add new features to the function, and all the calls benefit from them
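
A sketch of the last three points, reusing scatter_name() as defined earlier. The extra `...` argument forwarded to plot() is an addition made here for illustration, not part of the original function, and the mtcars call is just one example of replication on other data.

```r
scatter_name = function(x, y, categ, font = 2, cex = 4, ...){
    # `...` is forwarded to plot(): every call can now set axis labels, titles, etc.
    plot(x, y, col = categ, pch = 20, cex = 2, ...)
    categ_val = levels(categ)
    for(i in seq_along(categ_val)){
        who = categ == categ_val[i]
        text(mean(x[who]), mean(y[who]), categ_val[i],
             font = font, cex = cex, col = i)
    }
}

# Same call as before, now with axis labels
scatter_name(iris$Petal.Length, iris$Sepal.Width, iris$Species,
             xlab = "Petal length", ylab = "Sepal width")

# Replication on another data set: categ must be a factor
scatter_name(mtcars$wt, mtcars$mpg, factor(mtcars$cyl),
             xlab = "Weight", ylab = "Miles per gallon")
```

Neither call needs to know how the labels are placed: the new feature was added once, inside the function.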

Mental load

Code telling what you do and not how you do it increases productivity tremendously.

FONT = 2
CEX = 4
x = iris$Petal.Length
y = iris$Sepal.Width
categ = iris$Species
plot(x, y, col = categ, 
     pch = 20, cex = 2)
categ_val = levels(categ)
for(i in seq_along(categ_val)){
    who = categ == categ_val[i]
    text(mean(x[who]), mean(y[who]), 
         categ_val[i], col = i,
         font = FONT, cex = CEX)
}

scatter_name(iris$Petal.Length, 
             iris$Sepal.Width,
             iris$Species)

The second piece of code (the call to scatter_name()) will always be easier to understand than the first one.

Why not create functions?

you have a presentation in 30 minutes and have to finish that graph
you're making a graph that you think you will never replicate
the graph is really simple (in terms of lines of code!)

Functions: Summary

thinking in functions will change the way you code
it will clarify your code: it will be easier to understand and share, and less error-prone
functions have a high fixed cost but zero marginal cost, so you'll gain a lot of productivity

Functional programming: Application