Geoms & EDA w/Titanic Data

Statistical Transformations

Published

October 1, 2024

These notes are based on the ggstats vignette.

stat_prop() is part of ggstats, an extension to ggplot2. It is a variation of stat_count() that computes custom proportions, with the by aesthetic defining the denominator (i.e. all proportions for the same value of by sum to 1). The by aesthetic should be a factor.

libraries & ggplot extensions

#install.packages("ggstats")
suppressPackageStartupMessages({
library(ggstats)
library(tidyverse)
})

The Titanic dataset:

The Titanic dataset (in R’s datasets) is a 4-D array. Access it like this:

# Examine all the entries
Titanic[,,,]
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
# The first-class data
Titanic[1,,,]
, , Survived = No

        Age
Sex      Child Adult
  Male       0   118
  Female     0     4

, , Survived = Yes

        Age
Sex      Child Adult
  Male       5    57
  Female     1   140
# The 3rd-class data
Titanic[3,,,]
, , Survived = No

        Age
Sex      Child Adult
  Male      35   387
  Female    17    89

, , Survived = Yes

        Age
Sex      Child Adult
  Male      13    75
  Female    14    76
# The Female 1st-class data (Class index 1, Sex index 2)
Titanic[1,2,,]
       Survived
Age      No Yes
  Child   0   1
  Adult   4 140
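
Positional indices are easy to misread, so it can help to index by dimension name instead. The following is a minimal base R sketch; the variable names first_female and second_female are introduced here for illustration only.

```r
# Dimension names, in order: Class, Sex, Age, Survived
names(dimnames(Titanic))

# Named indexing says which slice is meant without counting positions
first_female  <- Titanic["1st", "Female", , ]
second_female <- Titanic["2nd", "Female", , ]

# The named slice is the same object as the positional one
identical(first_female, Titanic[1, 2, , ])

# Total number of women in 2nd class
sum(second_female)
```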

To tidy this data we can use as.data.frame(), which returns one row per cell of the array with the count in a Freq column.

d <- as.data.frame(Titanic)
glimpse(d)
Rows: 32
Columns: 5
$ Class    <fct> 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew, 1st, 2nd, 3rd, Crew…
$ Sex      <fct> Male, Male, Male, Male, Female, Female, Female, Female, Male,…
$ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Child, Adult…
$ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ Freq     <dbl> 0, 0, 35, 0, 0, 0, 17, 0, 118, 154, 387, 670, 4, 13, 89, 3, 5…
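
As a quick sanity check on the long format, the sketch below (base R only) confirms that the data frame has one row per array cell, that Freq still adds up to everyone on board, and that xtabs() can rebuild the original table. The name tab is just an illustrative choice.

```r
d <- as.data.frame(Titanic)

# One row per cell: 4 classes x 2 sexes x 2 ages x 2 outcomes
nrow(d)

# Freq holds the cell counts, so its sum is the number of people on board
sum(d$Freq)

# xtabs() rebuilds the contingency table from the long format
tab <- xtabs(Freq ~ Class + Sex + Age + Survived, data = d)
all.equal(as.vector(tab), as.vector(Titanic))
```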

Count people on board by class and sex.

group_by(d,Class,Sex) |> summarize(n = sum(Freq))
# A tibble: 8 × 3
# Groups:   Class [4]
  Class Sex        n
  <fct> <fct>  <dbl>
1 1st   Male     180
2 1st   Female   145
3 2nd   Male     179
4 2nd   Female   106
5 3rd   Male     510
6 3rd   Female   196
7 Crew  Male     862
8 Crew  Female    23

In the following example, we:

  1. use stat = "prop" to tell geom_text() to use stat_prop()
  2. use position_stack(.5) to place each label halfway up its segment of the stack
p <- ggplot(d) +
  # what happens when you remove "weight = Freq"?
  aes(x = Class, fill = Survived, weight = Freq) +
  geom_bar() +
  # halfway up the stack
  geom_text(stat = "prop", position = position_stack(.5)) +
  labs(title = "Both Counts & Percentages")

print(p)
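
To see why the weight aesthetic matters here, it may help to compare unweighted and weighted counts outside of ggplot2. This is a base R sketch of the same idea, not what geom_bar() runs internally: without a weight every row of d counts once, and with Freq as the weight every row counts Freq times.

```r
d <- as.data.frame(Titanic)

# Unweighted: each class appears in 8 rows (2 sexes x 2 ages x 2 outcomes),
# so every bar would have the same height
table(d$Class)

# Weighted by Freq: the number of people in each class
xtabs(Freq ~ Class, data = d)
```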

In the following example, we:

  1. use stat = "prop" to tell geom_text() to use stat_prop()

  2. define the by aesthetic (here we want to compute the proportions separately for each value of x)

  3. use position_fill() when calling geom_text() to match the position = "fill" in geom_bar()

p <- ggplot(d) +
  aes(x = Class, fill = Survived, weight = Freq, by = Class) +
  geom_bar(position = "fill") +
  geom_text(stat = "prop", position = position_fill(.5)) + 
  labs(title = "Proportions & Percentages")

print(p)
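
The proportions on this plot can be cross-checked by hand. The sketch below uses base R's prop.table() with margin = 1, which divides each row (each Class) by its own total; that plays the same role as by = Class. The names counts and props are illustrative.

```r
d <- as.data.frame(Titanic)

# Weighted counts of Survived within each Class
counts <- xtabs(Freq ~ Class + Survived, data = d)

# margin = 1: each Class row is divided by its own total
props <- prop.table(counts, margin = 1)
round(props, 3)

# Every Class sums to 1, as the by aesthetic promises
rowSums(props)
```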

Facet by Sex.

p + facet_grid(~Sex)

Displaying proportions of the total

If you want to display proportions of the total, simply map the by aesthetic to 1. Here is an example using a stacked bar chart.

ggplot(d) +
  aes(x = Class, fill = Survived, weight = Freq, by = 1) +
  geom_bar() +
  geom_text(
    aes(label = scales::percent(after_stat(prop), accuracy = 1)),
    stat = "prop",
    position = position_stack(.5)) +  
  labs(title = "Percentages of total")
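
Mapping by to a constant puts every bar segment in the same group, so there is a single denominator. A base R sketch of that idea follows; the by and prop columns are added here for illustration, and ave() stands in for the grouping step rather than reproducing ggstats internals.

```r
d <- as.data.frame(Titanic)

# One row per bar segment (Class x Survived), with weighted counts
counts <- aggregate(Freq ~ Class + Survived, data = d, FUN = sum)

# A constant grouping value: all segments share one denominator
counts$by <- 1
counts$prop <- counts$Freq / ave(counts$Freq, counts$by, FUN = sum)
counts

# The proportions sum to 1 over the whole plot, not per bar
sum(counts$prop)
```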

A dodged bar plot to compare two distributions

A dodged bar plot could be used to compare two distributions.

ggplot(d) +
  aes(x = Class, fill = Sex, weight = Freq, by = Sex) +
  geom_bar(position = "dodge")

On the previous graph, it is difficult to see whether first class is over- or under-represented among women, because there were many more men than women on board. stat_prop() can adjust the graph to display the proportion within each category instead (here, the proportion by sex).

ggplot(d) +
  aes(x = Class, fill = Sex, weight = Freq, by = Sex, y = after_stat(prop)) +
  geom_bar(stat = "prop", position = "dodge") +
  scale_y_continuous(labels = scales::percent)
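
The bar heights above can be checked with a small base R calculation. In this sketch, margin = 1 divides each Sex row by its own total, mirroring by = Sex. From the counts earlier in these notes, first class works out to 145 of 470 women (about 31%) against 180 of 1731 men (about 10%).

```r
d <- as.data.frame(Titanic)

# Weighted counts of Class within each Sex
counts <- xtabs(Freq ~ Sex + Class, data = d)

# margin = 1: each Sex row is divided by its own total
by_sex <- prop.table(counts, margin = 1)

# Percentages of each class within men and within women
round(100 * by_sex, 1)
```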

Finally, the same plot with labels:

ggplot(d) +
  aes(x = Class, fill = Sex, weight = Freq, by = Sex, y = after_stat(prop)) +
  geom_bar(stat = "prop", position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(
    mapping = aes(
      label = scales::percent(after_stat(prop), accuracy = .1),
      y = after_stat(0.01)
    ),
    vjust = "bottom",
    # what does the .9 do?
    position = position_dodge(.9),
    stat = "prop"
  )
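
On the question in the code comment: bars drawn by geom_bar() and geom_col() are 0.9 units wide by default, and position = "dodge" splits that width among the groups. Text has no width of its own, so position_dodge() needs to be told the width to split. The sketch below uses a small made-up data frame (df, with hypothetical columns x, g and n) and ggplot2 only, and compares the x positions of the bars and the labels.

```r
library(ggplot2)

# Toy data for illustration: one x category, two groups
df <- data.frame(x = c("a", "a"), g = c("m", "f"), n = c(3, 5))

p <- ggplot(df, aes(x = x, y = n, fill = g)) +
  geom_col(position = "dodge") +
  # width = 0.9 matches the default bar width, so labels sit over their bars
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.3)

# Computed x positions for each layer: bars first, labels second
bars   <- layer_data(p, 1)
labels <- layer_data(p, 2)
all.equal(as.numeric(bars$x), as.numeric(labels$x))
```

Changing the width to a value other than 0.9 moves the labels toward or away from the centre of the category, out of line with the bars.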