Data visualization is the art of summarizing information from a data source into a pleasant, non-distorted, informative visual representation. Pleasant and informative: those are the keys to deliver a high-impact message. Miss one of them and nobody will listen. The objective of this course is to give the keys to understand what a good visualization is, and provide some tools to make such visualizations.
The course will cover some theory: we will talk about things such as color, placement, font and how the brain perceives shapes. We will also work our way through the powerful R graphics engine and the ggplot2 library. Throughout this course there will be many small assignments.
opening your eyes to what makes a good statistical graph
learning the trade-offs in graph making
Sort these countries by GDP per capita:
Country | GDP per capita (US$, ppp) |
---|---|
Belgium | 50442.05 |
France | 45149.10 |
Germany | 53752.03 |
Spain | 39907.56 |
United Kingdom | 45504.84 |
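Sorting these five numbers in your head takes effort; a sorted bar chart does the work for the reader. A minimal base R sketch using the values from the table above (the object names `gdp` and `gdp_sorted` are ours, not part of the slides):

```r
# GDP per capita (US$, PPP), copied from the table above
gdp <- c(
  "Belgium"        = 50442.05,
  "France"         = 45149.10,
  "Germany"        = 53752.03,
  "Spain"          = 39907.56,
  "United Kingdom" = 45504.84
)

# Sort once, so that the order of the bars carries the ranking
gdp_sorted <- sort(gdp, decreasing = TRUE)

# Horizontal bars keep the country names readable;
# rev() puts the largest value at the top of the plot
op <- par(mar = c(4, 8, 1, 1))
barplot(rev(gdp_sorted), horiz = TRUE, las = 1,
        xlab = "GDP per capita (US$, PPP)")
par(op)
```

The ranking is now read off the bar order, without comparing digits.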
What can you say about these numbers?
x | y | mean_X | sd_X | mean_Y | sd_Y | cor_XY |
---|---|---|---|---|---|---|
55.4 | 97.2 | 54.3 | 16.8 | 47.8 | 26.9 | -0.0645 |
51.5 | 96 | | | | | |
46.2 | 94.5 | | | | | |
42.8 | 91.4 | | | | | |
40.8 | 88.3 | | | | | |
38.7 | 84.9 | | | | | |
35.6 | 79.9 | | | | | |
33.1 | 77.6 | | | | | |
... | ... | | | | | |
Same measures; but different data ⇒ different stories.
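The same point can be checked with data that ships with R. The sketch below does not use the data set of the slide; it swaps in Anscombe's quartet (`datasets::anscombe`) as a stand-in, and the object name `stats` is ours. It computes the same five summaries for four data sets, then plots them.

```r
# Anscombe's quartet ships with base R (datasets::anscombe):
# columns x1..x4 and y1..y4 hold four small (x, y) data sets
stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), sd_x = sd(x),
             mean_y = mean(y), sd_y = sd(y),
             cor_xy = cor(x, y))
}))
print(round(stats, 2))

# The summaries are nearly identical; the scatter plots are not
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       pch = 16, xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```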
Do you see a pattern?
remember that we're very limited human beings!
we can only grasp and understand the world with our 5 senses
abstract concepts (e.g. numbers) are tied to these physical senses
to compare numbers we need to visualize them in our head
when we graphically represent numbers, we cut out the middleman
Experiment: I give the audience the numbers 1, 10, 100, 1000 and ask them to close their eyes
powerful way to understand the data
powerful way to send a message (not the same as the previous point!)
most graphs suck (Excel do you hear me?): it's an easy way to stand out
Event study graphs next year: many examples of how to make your message impactful based on a set of data. The problem is that the students don't see the value of it if they haven't tried hard to make a nice graph, so I should give them an assignment in class that they try hard to complete. Only then can I come with theory and advice.
Powerful: imagination is the limit, you can graph anything you have in your mind (really)
Versatile: there are so many packages...
Communication: smooth integration in HTML documents, create a website in minutes (Rmarkdown)
# David Kidd, Kingston University London (2019)
# https://storymaps.arcgis.com/stories/79eeffa9f54d429687c17fa8267d3ba2
# Thinkquest historical boundaries 1815
# https://web.archive.org/web/20080328104539/http://library.thinkquest.org:80/C006628/download.html
library(tidyverse)
library(sf)
library(ggpomological)
library(ggimage)
library(MapColoring)
library(ggspatial)
extrafont::loadfonts(device = "win")

humboldt = read_sf("data/kingstonUniLondon/humboldt_route.geojson") |>
  filter(Journey == "America") |>
  mutate(
    angle = case_when(
      Date == "June-July 1799" ~ 0,
      Date == "Dec. 1800" ~ 340,
      Date == "Mar 1804" ~ 25,
      Date == "May 1804" ~ 50,
      Date == "July 1804" ~ 10,
      is.na(Date) ~ NA_real_
    ),
    hjust = case_when(
      Date == "June-July 1799" ~ 1.1,
      Date == "Dec. 1800" ~ -0.3,
      Date == "Mar 1804" ~ 0.5,
      Date == "May 1804" ~ 0.5,
      Date == "July 1804" ~ 0.5,
      is.na(Date) ~ NA_real_
    ),
    vjust = case_when(
      Date == "June-July 1799" ~ 3.5,
      Date == "Dec. 1800" ~ 0.5,
      Date == "Mar 1804" ~ 0.6,
      Date == "May 1804" ~ 0.5,
      Date == "July 1804" ~ 0.5,
      is.na(Date) ~ NA_real_
    )
  )

# read the boundaries first: `countries` must exist before it is
# passed to getColoring() in the next step
countries = read_sf("data/thinkquest/1815/cntry1815.shp") |>
  st_set_crs(4326)
countries = countries |>
  mutate(fillcol = as.factor(getColoring(as_Spatial(countries)))) |>
  st_transform(st_crs(humboldt))

bbox = humboldt |> st_bbox()

g = ggplot() +
  geom_sf(
    data = countries, aes(fill = fillcol),
    col = NA, alpha = 0.6, show.legend = FALSE
  ) +
  geom_sf(
    data = humboldt, linetype = "longdash",
    color = "#2b323f", size = 0.5
  ) +
  geom_sf_text(
    data = humboldt,
    aes(label = Date, angle = angle, hjust = hjust, vjust = vjust),
    family = "Homemade Apple", nudge_y = 250000
  ) +
  annotation_north_arrow(
    style = north_arrow_nautical(
      line_col = "#a89985",
      text_family = "Homemade Apple",
      text_col = "#a89985",
      fill = c("#a89985", "white"),
    )
  ) +
  annotate(
    "text", x = -1500000, y = -900000,
    color = "#6b452b", family = "Homemade Apple", size = 5,
    label = "Alexander von Humboldt\ntravels to the Americas\n1799-1804"
  ) +
  scale_fill_pomological() +
  labs(
    caption = glue::glue(
      "#30DayMapChallenge | Day 24: Historical | ",
      "Data: David Kidd, Kingston University London (2019), ThinkQuest | Created by @loreabad6",
    )
  ) +
  coord_sf(
    xlim = c(bbox["xmin"], bbox["xmax"]),
    ylim = c(bbox["ymin"], bbox["ymax"])
  ) +
  labs(x = NULL, y = NULL) +
  theme_pomological("Homemade Apple", 16) +
  theme(
    plot.caption = element_text(
      family = "mono", size = 9, hjust = 0.5,
      color = "#6b452b", face = "bold"
    )
  )

ggbackground(g, ggpomological:::pomological_images("background"))
ggsave(filename = "maps/day24.png", height = 20, width = 28, units = "cm")
# ultra_running.R
# Jamie Hudson
# Created: 27 October 2021
# Edited: 27 October 2021
# Data: Benjamin Nowak by way of International Trail Running Association (ITRA)

# load libraries ------------------------------------------------------------
library(tidytuesdayR)
library(tidyverse)
library(janitor)
library(ggmap)
library(XML)
library(gganimate)
library(showtext)
library(colorspace)
library(ggtext)
library(shadowtext)

font_add_google("Cabin")
font_add_google("Shrikhand")
showtext_opts(dpi = 320)
showtext_auto(enable = TRUE)

# load dataset ------------------------------------------------------------
ultra_rankings <- readr::read_csv('https://t.co/JNJTpFTKqI?amp=1') %>%
  clean_names()
race <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/race.csv')

# read in GTX data for UTMB course
# code from Sascha Wolfer https://www.r-bloggers.com/2014/09/stay-on-track-plotting-gps-tracks-with-r/
# data from https://www.plotaroute.com/search?keyword=utmb
options(digits=10)

# Parse the GPX file
pfile <- htmlTreeParse(file = "data/UTMB.gpx", error = function(...) {}, useInternalNodes = T)
elevations <- as.numeric(xpathSApply(pfile, path = "//trkpt/ele", xmlValue))
times <- xpathSApply(pfile, path = "//trkpt/time", xmlValue)
coords <- xpathSApply(pfile, path = "//trkpt", xmlAttrs)
lats <- as.numeric(coords["lat",])
lons <- as.numeric(coords["lon",])

# wrangle data ------------------------------------------------------------
race_df <- full_join(ultra_rankings, race)
utmb_21 <- race_df %>%
  filter(race_year_id == 72496)

geodf <- data.frame(lat = lats, lon = lons, ele = elevations, time = times)
lat <- c(min(geodf$lat) -0.06, max(geodf$lat) + 0.1)
lon <- c(min(geodf$lon) - 0.1, max(geodf$lon) + 0.1)
bbox <- make_bbox(lon,lat)

geodf <- geodf %>%
  slice(which(row_number() %% 5 == 1))

utmb_times <- utmb_21 %>%
  dplyr::select(time_in_seconds) %>%
  mutate(pos = row_number()) %>%
  pivot_wider(names_from = pos, values_from = time_in_seconds) %>%
  slice(rep(1, each = nrow(geodf)))

# function to repeat times n times
rep.x <- function(x, na.rm=FALSE) (x / nrow(geodf) * row_number())

geodf_times <- cbind(geodf, utmb_times) %>%
  dplyr::select(-c("time")) %>%
  mutate_at(.vars = vars("1":"1526"), rep.x) %>%
  pivot_longer(-c(lon, lat, ele), names_to = "id", values_to = "time") %>%
  mutate(time = as.numeric(time))

# D'Haene and Dauwalter
two_runners <- geodf_times %>%
  filter(id %in% c(1,7))

# dataframe of route
route <- geodf %>%
  dplyr::select(lon, lat, ele)

# download stamenmap for background
map_background <- get_stamenmap(bbox, zoom = 12, source="stamen", maptype = "terrain-background", color="bw")

# plot ------------------------------------------------------------
plot <- ggmap(map_background) +
  geom_path(data = route, mapping = aes(x = lon, y = lat, color = ele, group = 1),
            size = 4, lineend = "round") +
  scale_color_viridis_c(option = "magma") +
  geom_rect(xmin = 6.5, xmax = 7.3, ymin = 46.09, ymax = 46.2,
            fill = "grey", alpha = 0.4) +
  geom_jitter(geodf_times, mapping = aes(x = lon, y = lat, group = id),
              colour = "white", fill = "white", pch = 25, size = 0.4,
              width = 0.002, height = 0.002) +
  geom_jitter(two_runners, mapping = aes(x = lon, y = lat, group = id),
              colour = "white", fill = "black", pch = rep(c(21, 22), 2858), size = 2.5,
              width = 0.002, height = 0.002) +
  geom_segment(aes(x = route$lon[1] - 0.01, y = route$lat[1] + 0.01,
                   xend = route$lon[1] + 0.01, yend = route$lat[1] - 0.01),
               colour = "yellow") +
  geom_shadowtext(label = "Start/End", x = route$lon[1] + 0.014, y = route$lat[1] - 0.014,
                  size = 3, hjust = 0, family = "Cabin", check_overlap = TRUE,
                  colour = "black", bg.colour = "white", bg.r = 0.2) +
  geom_shadowtext(label = "Elevation (m)", x = 6.917, y = 45.623,
                  size = 3, hjust = 0.5, family = "Cabin", check_overlap = TRUE,
                  colour = "black", bg.colour = "white", bg.r = 0.2) +
  geom_shadowtext(label = "1000", x = 6.785, y = 45.635,
                  size = 2.5, hjust = 0.5, family = "Cabin", check_overlap = TRUE,
                  colour = "black", bg.colour = "white", bg.r = 0.2) +
  geom_shadowtext(label = "2500", x = 7.07, y = 45.635,
                  size = 2.5, hjust = 0.5, family = "Cabin", check_overlap = TRUE,
                  colour = "black", bg.colour = "white", bg.r = 0.2) +
  guides(colour = guide_colorbar(title.position = 'bottom', title.hjust = 0.5,
                                 barwidth = unit(15, 'lines'), barheight = unit(0.8, 'lines'))) +
  annotate(geom = "text", label = "Ultra-Trail du Mont-Blanc 2021",
           x = 6.917, y = 46.162, size = 8, hjust = 0.5, family = "Shrikhand") +
  annotate(geom = "richtext",
           label = "A total of 1526 runners completed the 2021 edition of the UTMB® race in Chamonix, France. Starting at 17:00 the course \nundulates over 170km, and was eventually won by Francois D'Haene in a time of 20 hours 45 minutes and 59 seconds. \nCourtney Dauwalter was the first female runner to finish in 22 hours 30 minutes and 54 seconds.",
           x = 6.917, y = 46.125, size = 2.8, hjust = 0.5, family = "Cabin",
           fill = NA, label.color = NA) +
  annotate(geom = "richtext",
           label = "*The map below shows the average speed of each finisher. D'Haene is the black circle, and Dauwalter the black square*",
           x = 6.917, y = 46.1, size = 2.8, hjust = 0.5, family = "Cabin",
           fill = NA, label.color = NA) +
  labs(colour = "Elevation (m)",
       caption = "@jamie_bio | source: International Trail Running Association (ITRA)") +
  theme_minimal() +
  theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    plot.caption = element_text(family = "Cabin", size = 8),
    legend.position=c(0.5, 0.025),
    legend.justification = "bottom",
    legend.direction = "horizontal",
    legend.text = element_blank(),
    legend.title = element_blank())

anim <- plot + transition_reveal(time)
animate(anim, nframes = 200, height = 8, width = 6.5, units = "in", res = 150)
anim_save(paste0("ultra_running_", format(Sys.time(), "%d%m%Y"), ".gif"))
Good graphs have a main purpose.
Main purposes:
looking nice
send a message ( = we're here!)
exploratory
informative: get the information quickly and unambiguously
pleasant: looks nice -- hard to tell
stacking information is good but can reduce clarity. There are ways to stack information without reducing clarity too much: colors / shapes
content
clarity
attractiveness
the information you want to share/observe
typically: relationships, time series, consequence of events, conditional distributions
you can stack several pieces of content in the same graph
how easy it is to understand the content
defined by its measure: the time it takes to extract a piece of information ⇒ the lower the time, the higher the clarity
is the graph pleasant to look at?
do you stare at it for its own sake?
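$$\text{graph}^* = \arg\max_g \; \text{clarity}(g) \quad \text{s.t.} \quad \text{content}(g) \ge \tau, \;\; \text{attractiveness}(g) \ge \eta$$

$$\text{graph}^* = \arg\max_g \; \text{attractiveness}(g) \quad \text{s.t.} \quad \text{content}(g) \ge \tau, \;\; \text{clarity}(g) \ge \gamma$$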
adding content necessarily reduces clarity (there are solutions to limit that)
the relationship between attractiveness and content/clarity is not clear. Graphs which are visually too attractive usually reduce clarity or weaken the message.
the contrast is so-so due to the grey background: ↘ clarity
the colors are OK to identify the species
the grid helps to make comparisons between distant points
⇒ clarity is OK overall but the attractiveness is so-so. We get that these variables are good to discriminate the species: content is OK.
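To try these remarks yourself, here is a minimal ggplot2 sketch. The figure is not reproduced in these notes, so we assume it is a default ggplot2 scatter plot of the `iris` data; the choice of variables and the names `p_default` and `p_clear` are ours.

```r
library(ggplot2)

# Default look: grey panel background with a white grid
p_default <- ggplot(iris, aes(Petal.Length, Sepal.Length, colour = Species)) +
  geom_point()

# Same data on a white background with a light grey grid:
# more contrast for the points, and the grid still helps comparisons
p_clear <- p_default +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

print(p_default)
print(p_clear)
```

Only the theme changes between the two plots, so any difference in clarity comes from the background and the grid.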
font hard to read ↘ clarity
no grid ↘ clarity
it is visually attractive (at least to me!):
⇒ attractiveness +++ (to me!). However, because there are so many nice things to look at, we lose track of the content. The design choices, although great, reduce clarity greatly.
This is not a graph that you should put in your report!
Thou shalt not expect the reader to be interested in what you do
Thou shalt not expect the reader to spend more than 5 seconds on your graph
Clarity is cardinal
what is this graph about?
what does it represent?
what's the value of that point?
a graph should be self-explanatory
Attractiveness is important, but second tier
hooks the reader
makes the reader stay longer and maybe decide to put in some effort to understand your graph
clarity: think of it this way: there is a threshold of effort above which the reader will just stop
attractiveness: that's why infographics in newspapers are so nice: to hook the uninterested reader
Clarity = value for money
you can extract more content with lower effort
if you don't get much for your money, you just switch to another product
Attractiveness: increases the amount of money you wanna spend. It's the same effect as advertising:
You still have the same optimization problem to solve but the thresholds are different!
what is considered nice largely depends on a consensus which can evolve over time
it depends on preferences that vary between persons and even within persons (just have a look at your haircuts of 10 years ago!)
we won't cover attractiveness since we can't please everyone
there are guiding principles of proportions and colors
good news: clear graphs usually look ok
tailor your content, ask yourself:
add elements of context if needed (do they strengthen your point?)
is the graph faithful?
hierarchy of information! (the main message should be emphasized vis-à-vis the other messages and the elements of context)
do I get the takeaway just from looking at the graph? Or do I need to read the text / get an oral explanation to get the central message?
Next year => add examples for each of those cases. It is very important to add concrete examples: make an in-class exercise of having to graph an idea, and come with these concepts afterwards (so that the students can see what it really means).
There are many tips to make graphs clear!
discerning colors
discerning shapes
height comparisons
Let's leverage these three properties to make good graphs!
graph with and without a horizontal grid: narrow graph => OK; wide graph => hard to compare
Moral of the story: Abuse colors!
two main uses of color:
two different usages = two very different color picks!
Colors can be decomposed into:
Hue
Saturation
Lightness
They're a bit of an eyesore. They're very bright, making them hard to read. That's why I had to use a heavy font.
⋆ Note that the brightness depends on the hue: hues are not equal light-wise!
These are the same hues. I've just reduced saturation. The text is easier to read. I can even remove the heavy font!
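A small base R sketch of the two observations above: desaturating a set of hues, and hues not being equally bright. It uses `hsv()` (hue, saturation, value), a close relative of the hue/saturation/lightness decomposition; that substitution, the three hues, and the helper `luminance()` are our assumptions, not the code behind the slides.

```r
# Three hues at full saturation, then the same hues desaturated.
# hsv() takes hue, saturation and value, each in [0, 1]
hues   <- c(0, 1/3, 2/3)          # red, green, blue
bright <- hsv(h = hues, s = 1.0, v = 1)
muted  <- hsv(h = hues, s = 0.4, v = 1)

# Rough brightness proxy: Rec. 709 weights applied to the RGB channels
# (no gamma correction, so treat the numbers as indicative only)
luminance <- function(cols) {
  rgb <- col2rgb(cols) / 255
  drop(c(0.2126, 0.7152, 0.0722) %*% rgb)
}
lum <- luminance(bright)
print(round(lum, 2))

# Swatches: top row saturated, bottom row muted
op <- par(mar = c(0, 0, 0, 0))
image(1:3, 1:2, matrix(1:6, nrow = 3), col = c(muted, bright),
      axes = FALSE, xlab = "", ylab = "")
par(op)
```

At equal saturation and value, the green comes out much brighter than the red, and the blue is the darkest of the three.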
use colors that are "different" but have some harmony
how to find harmonious color sets?
Adobe color website: to create palettes or to find existing ones
use colors that are "clear-cut" (we keep colors in mind using their names, ex: using blue + mid-blue-mid-green + green makes it hard to remember)
use different hues, not only different shades (shade variations are harder to remember and discern than hue variations; having both is even better)
don't use colors to distinguish too many categories:
1) People remember colors by their names, so a color that is half blue, half green is harder to remember than a plain blue. IF TIME => show two palettes with these differences
2) Same comment: don't use shades of the same hue (light blue/dark blue), they are hard to remember. IF TIME => show two palettes with these differences
IF TIME: 4 panes: one with the names and associated colors, then the graph; same set of two but with clear-cut colors
3) What to do if I have 5+ categories to display? Cut content! In general, if you have a graph with 5+ categories to display, ask yourself whether there is a problem in terms of content
OK, but what if I really have to display 6+ categories?
Two main types of things to represent:
A) things with positive values only: unemployment, earnings, whatever
B) correlation, deviations from the mean, etc
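One way to read cases A and B is as a choice between a sequential and a diverging color scale; that reading is our assumption, as are the palette names picked below. A minimal base R sketch with `hcl.colors()` (available since R 3.6.0):

```r
# A) positive values only -> sequential palette
#    (a single hue, varying lightness)
seq_pal <- hcl.colors(7, palette = "Blues 3")

# B) deviations around a reference (correlations, gaps to the mean)
#    -> diverging palette: two hues meeting at a neutral midpoint
div_pal <- hcl.colors(7, palette = "Blue-Red 3")

# Show both as colour strips
op <- par(mfrow = c(2, 1), mar = c(1, 1, 2, 1))
image(matrix(1:7), col = seq_pal, axes = FALSE, main = "Sequential: levels")
image(matrix(1:7), col = div_pal, axes = FALSE, main = "Diverging: deviations")
par(op)
```

With a diverging palette the neutral middle color should sit on the reference value (0 for a correlation, the mean for deviations), otherwise the two hues mislead.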
Source for the numbers: https://www.colorblindguide.com/post/colorblind-people-population-live-counter
mostly men: 8%; women: 1/200
The main consequence is that there is no silver bullet to represent discrete color points
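Since no palette works for everyone, one hedge is to combine a palette designed with color-vision deficiency in mind with a second channel such as shape. The sketch below uses the Okabe-Ito palette shipped with base R (`palette.colors()`, R 4.0.0 or later); the choice of palette, data and point shapes is ours.

```r
# Okabe-Ito colours from base R; the first entry is black, which we drop
cols   <- unname(palette.colors(n = 4, palette = "Okabe-Ito")[-1])
shapes <- c(16, 17, 15)            # circle, triangle, square

# Encode the species twice: once by colour, once by shape
sp <- as.integer(iris$Species)
plot(iris$Petal.Length, iris$Sepal.Length,
     col = cols[sp], pch = shapes[sp], bty = "L",
     xlab = "Petal length (cm)", ylab = "Sepal length (cm)")
legend("topleft", legend = levels(iris$Species),
       col = cols, pch = shapes, bty = "n")
```

A reader who cannot tell two of the colors apart can still rely on the shapes.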
⇒ use bars to compare numbers, use a grid to help make comparisons
Can you see a difference?
Exactly the same data... incomparably easier to read.
Pie charts can be good to show big discrepancies--but that's illustration then. For precise statistical graphs, forget about it.
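The comparison is easy to reproduce. A minimal base R sketch drawing the same counts as a pie chart and as a bar chart with a light horizontal grid; the data (number of cars per carburetor count in the built-in `mtcars`) is our choice, not necessarily the data of the slide.

```r
# Number of cars for each carburetor count in mtcars
counts <- table(mtcars$carb)

op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Pie: the reader has to compare angles
pie(counts, main = "Pie chart")

# Bars: lengths on a common baseline
barplot(counts, main = "Bar chart", las = 1, border = NA,
        xlab = "Carburetors", ylab = "Number of cars")
# light horizontal lines at the y-axis ticks (drawn over the bars)
grid(nx = NA, ny = NULL, col = "grey80", lty = 1)

par(op)
```

In the pie, the categories with 2 and 4 carburetors are hard to tell apart from the smaller ones at a glance; in the bar chart, the ranking and the values are read directly.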
If you have many categorical values to display, vertical bar graphs can be good.
Although too much content will always be hard to read. The solution is to cut out the cars that are not important in our study, or to make a single group of them. => In the future I should also explain how to curate the content to sharpen the message.
Without description, a graph is worthless!
You must absolutely ALWAYS label your axes, or else you'll endure divine wrath!
Text takes space, and space is limited! The text should be as short as possible while remaining as informative as possible.
the text must be read easily: readable font + min. font size
hierarchy: the explanations should not take more space than the data. More important information should be more emphasized.
repeat information
add a grid when relevant
minimize the distance between the legend and the data: especially with 4+ categories
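The checklist above can be sketched in ggplot2 as follows. The data, the wording of the labels and the font size are illustrative choices of ours; only the ggplot2 functions themselves are standard.

```r
library(ggplot2)

p <- ggplot(iris, aes(Petal.Length, Sepal.Length, colour = Species)) +
  geom_point() +
  labs(
    title  = "Three varieties of iris flowers",
    x      = "Petal length (cm)",          # short, informative axis labels
    y      = "Sepal length (cm)",
    colour = NULL                          # the species names need no title
  ) +
  theme_minimal(base_size = 14) +          # one knob for a minimum font size
  theme(
    legend.position  = "top",              # legend close to the data
    plot.title       = element_text(face = "bold"),  # hierarchy: title stands out
    panel.grid.minor = element_blank()     # keep only a light major grid
  )

print(p)
```

`base_size` scales all text at once, which makes it easy to check readability at the size the graph will have in the final document.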
There are three main font families: sans serif, serif, and monospace.
In R, you can change the font in graphs with showtext:

pacman::p_load(showtext)
font_add_google("Fira Code", "fira")
font_add_google("Merriweather", "merri")
showtext_auto()

plot(iris$Petal.Length, iris$Sepal.Length, col = iris$Species,
     pch = 16, bty = "L", ann = FALSE)
title(xlab = "Petal length (default sans serif)", cex.lab = 1.2)
title(ylab = "Sepal length (serif: Merriweather)", family = "merri", cex.lab = 1.2)
mtext(text = "Three varieties of iris flowers (mono: Fira Code)", side = 3,
      line = 1, font = 2, adj = 0, cex = 1.7, family = "fira")
ggsave it and it will be fine? Think twice!
Some functions may help you deal with that: pdf_fit/png_fit from fplot.
Use setFplot_page to define the size of the final document if needed (by default it's an A4 page with some usual margins).

# starts recording.
# - pt = 11 will save the graph with 11pt font size
# - w2h = 1.75 means that the width to height ratio is 1.75 (wide graph)
pdf_fit("path.pdf", pt = 11, w2h = 1.75)

# your graph

# ends recording.
# The final look of your graph is displayed in the viewer pane
fit_off()
I will show you a graph coming from a top (top) publication.
The research was careful and the results are of great relevance: there is no doubt about the quality and the importance of the research done.
Despite the stellar work, the graphs could be improved (i.e. ↗ clarity), at no cost.
Among others...
The legend takes as much space as the graph!!!!!!!! (it burned my eyes!)
Adding a light grid would facilitate the reading, it's almost impossible to compare points.