#install.packages("ggstats")
suppressPackageStartupMessages({
library(ggstats)
library(tidyverse)
})
Geoms & EDA w/Titanic Data
Statistical Transformations
These notes are based on the ggstats vignette.
stat_prop() is part of ggstats, an extension to ggplot2. It is a variation of stat_count() that computes custom proportions, with the by aesthetic defining the denominator (i.e. all proportions for the same value of by sum to 1). stat_prop() therefore expects a by aesthetic, which should be a factor.
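To make the role of the denominator concrete, the sketch below reproduces by hand what "proportions that sum to 1 within each value of by" means, using Class as the grouping variable. It is only an illustration with dplyr, not how stat_prop() is implemented; the object name props is an assumption introduced here.

```r
library(dplyr)

# Tidy the built-in Titanic array: one row per Class/Sex/Age/Survived cell
d <- as.data.frame(Titanic)

# Denominator = Class, i.e. the hand-computed analogue of by = Class
props <- d |>
  count(Class, Survived, wt = Freq) |>  # people per Class x Survived
  group_by(Class) |>
  mutate(prop = n / sum(n)) |>          # share within each Class
  ungroup()

props
```

Within each Class, the two prop values (Survived = No and Yes) add up to 1, which is the property the by aesthetic is meant to guarantee.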
libraries & ggplot extensions
The Titanic dataset:
The Titanic dataset (in R’s datasets) is a 4-D array. Access it like this:
# Examine all the entries
Titanic[,,,]
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
# The first-class data
Titanic[1,,,]
, , Survived = No

        Age
Sex      Child Adult
  Male       0   118
  Female     0     4

, , Survived = Yes

        Age
Sex      Child Adult
  Male       5    57
  Female     1   140
# The 3rd-class data
Titanic[3,,,]
, , Survived = No

        Age
Sex      Child Adult
  Male      35   387
  Female    17    89

, , Survived = Yes

        Age
Sex      Child Adult
  Male      13    75
  Female    14    76
# The female first-class data (Class index 1, Sex index 2)
Titanic[1,2,,]
       Survived
Age     No Yes
  Child  0   1
  Adult  4 140
To tidy this data, we can use as.data.frame():
d <- as.data.frame(Titanic)
glimpse(d)
Rows: 32
Columns: 5
$ Class <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew…
$ Sex <fct> Male, Male, Male, Male, Female, Female, Female, Female, Male,…
$ Age <fct> Child, Child, Child, Child, Child, Child, Child, Child, Adult…
$ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ Freq <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, 89, 3, 5…
Count people on board by class and sex.
group_by(d,Class,Sex) |> summarize(n = sum(Freq))
# A tibble: 8 × 3
# Groups: Class [4]
Class Sex n
<fct> <fct> <dbl>
1 1st Male 180
2 1st Female 145
3 2nd Male 179
4 2nd Female 106
5 3rd Male 510
6 3rd Female 196
7 Crew Male 862
8 Crew Female 23
In the following example, we:
- use stat = "prop" to tell geom_text() to use stat_prop()
- use position_stack(.5) to place each label halfway up its segment of the stack
p <- ggplot(d) +
# what happens when you remove "weight = Freq"?
aes(x = Class, fill = Survived, weight = Freq) +
geom_bar() +
# halfway up the stack
geom_text(stat = "prop", position = position_stack(.5)) +
labs(title = "Both Counts & Percentages")
print(p)
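The question in the comment above can be explored without plotting. The sketch below, assuming dplyr is available, compares counting rows with and without a weight; the names unweighted and weighted are introduced here for illustration only.

```r
library(dplyr)

# Same tidy data frame as in the notes: 32 rows, one per cell of the array
d <- as.data.frame(Titanic)

# Without a weight, every row counts once, whatever its Freq
unweighted <- count(d, Class, Survived)

# With wt = Freq, each row counts for the number of people it represents
weighted <- count(d, Class, Survived, wt = Freq)

unweighted
weighted
```

Each Class and Survived combination covers four rows (2 sexes x 2 age groups), so the unweighted counts are all 4, while the weighted counts add up to the 2201 people in the data set. This is why the bars need weight = Freq to show people rather than table cells.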
In the following example, we:
- use stat = "prop" to tell geom_text() to use stat_prop()
- define the by aesthetic (here we want to compute the proportions separately for each value of x)
- use position_fill() when calling geom_text() to match position = "fill" in geom_bar()
p <- ggplot(d) +
aes(x = Class, fill = Survived, weight = Freq, by = Class) +
geom_bar(position = "fill") +
geom_text(stat = "prop", position = position_fill(.5)) +
labs(title = "Proportions & Percentages")
print(p)
Facet by Sex.
p + facet_grid(~Sex)
Displaying proportions of the total
If you want to display proportions of the total, simply map the by aesthetic to 1. Here is an example using a stacked bar chart.
ggplot(d) +
aes(x = Class, fill = Survived, weight = Freq, by = 1) +
geom_bar() +
geom_text(
aes(label = scales::percent(after_stat(prop), accuracy = 1)),
stat = "prop",
position = position_stack(.5)) +
labs(title = "Percentages of total")
A dodged bar plot to compare two distributions
A dodged bar plot could be used to compare two distributions.
ggplot(d) +
aes(x = Class, fill = Sex, weight = Freq, by = Sex) +
geom_bar(position = "dodge")
In the previous graph, it is difficult to see whether first class is over- or under-represented among women, because there were many more men than women on board. stat_prop() can adjust the graph by displaying the proportion within each category instead (here, the proportion by sex).
ggplot(d) +
aes(x = Class, fill = Sex, weight = Freq, by = Sex, y = after_stat(prop)) +
geom_bar(stat = "prop", position = "dodge") +
scale_y_continuous(labels = scales::percent)
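As a cross-check on what the dodged proportion plot displays, the sketch below computes the within-sex shares directly with dplyr. It is a hand calculation for comparison, not the internals of stat_prop(); the name by_sex is an assumption introduced here.

```r
library(dplyr)

d <- as.data.frame(Titanic)

# Denominator = Sex, i.e. the hand-computed analogue of by = Sex
by_sex <- d |>
  count(Sex, Class, wt = Freq) |>  # people per Sex x Class
  group_by(Sex) |>
  mutate(prop = n / sum(n)) |>     # share of each Class within a Sex
  ungroup()

# Compare the first-class share for men and women
filter(by_sex, Class == "1st")
```

With the counts shown earlier, first class accounts for 180 of 1731 men (about 10.4%) but 145 of 470 women (about 30.9%), so first class is clearly over-represented among women even though the raw counts are similar.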
Finally, the same plot with labels:
ggplot(d) +
aes(x = Class, fill = Sex, weight = Freq, by = Sex, y = after_stat(prop)) +
geom_bar(stat = "prop", position = "dodge") +
scale_y_continuous(labels = scales::percent) +
geom_text(
mapping = aes(
label = scales::percent(after_stat(prop), accuracy = .1),
y = after_stat(0.01)
),
vjust = "bottom",
# what does the .9 do?
position = position_dodge(.9),
stat = "prop"
)