Operations on data tables

Creating a summary is one example of an operation on a table; joining two tables together is another. This section also covers creating summaries of groups of data.

Table Operations

bind_rows

Stack two tables on top of each other using bind_rows(). Columns are matched by name; experiment to learn how variables that appear in only one of the tables are handled.
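As a starting point for that experiment, here is a minimal sketch. The tibbles a and b are made up for illustration and are not part of the original material.

```r
library(dplyr)

# Two small, made-up tables whose columns only partially overlap
a <- tibble(id = 1:2, x = c("a", "b"))
b <- tibble(id = 3L, y = 10)

# Columns are matched by name; a cell with no source value is filled with NA
stacked <- bind_rows(a, b)
stacked
```

The result has three rows and the columns id, x and y: x is NA in the row that came from b, and y is NA in the rows that came from a.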

group_by()

Split a dataset into groups using group_by(), so that subsequent analysis is done “by group”. The rows are not reordered or changed; group_by() only records which group each row belongs to.
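A minimal sketch of what grouping does, using a made-up tibble df (the names df, g and x are illustrative only):

```r
library(dplyr)

# Made-up data: three rows falling into two groups
df <- tibble(g = c("b", "a", "b"), x = c(1, 2, 3))

grouped <- df |> group_by(g)

# The data are untouched; only grouping metadata has been added
group_vars(grouped)   # name of the grouping column
n_groups(grouped)     # number of distinct groups

# Verbs applied afterwards work group by group
totals <- grouped |> summarize(total = sum(x))
totals
```

Here totals has one row per group, with the sum of x computed separately for "a" and "b".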

summarize()

summarize() collapses your data table into one row per group (or a single row if the table is not grouped), where each row is a summary of the corresponding data.

  1. Find the max of a single variable.
Code
library(tidyverse)
library(palmerpenguins)
penguins |> summarize(mx_bd = max(bill_depth_mm)) 
# A tibble: 1 × 1
  mx_bd
  <dbl>
1    NA
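  2. Find the max of a single variable within each group. The NA above comes from missing values in bill_depth_mm; na.rm = TRUE drops them before the max is taken.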
Code
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie     21.5
2 Chinstrap  20.8
3 Gentoo     17.3
  3. Improve your summary by adding the group size with n = n().
Code
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE), n = n())
# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     21.5   152
2 Chinstrap  20.8    68
3 Gentoo     17.3   124
  4. Exercise: What’s wrong with this code?
Code
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE, n = n()))
# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
  5. Find the number of observations (rows) within each group.
Code
penguins |> group_by(species) |> summarize(n = n())
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
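
The effect of missing values on the summaries above can be reproduced on a tiny table. This is a sketch on made-up data (df, g and x are illustrative names), not the penguins dataset:

```r
library(dplyr)

# Made-up data with one missing value in group "a"
df <- tibble(g = c("a", "a", "b"), x = c(1, NA, 3))

# Without na.rm, a single NA in a group makes that group's max NA
with_na <- df |>
  group_by(g) |>
  summarize(mx = max(x), n = n())

# With na.rm = TRUE the missing value is dropped before taking the max;
# n() still counts every row, including the one where x is NA
without_na <- df |>
  group_by(g) |>
  summarize(mx = max(x, na.rm = TRUE), n = n())

with_na
without_na
```

Note that n() counts rows, not non-missing values, so it is the same in both summaries.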

The slice_ functions

  1. df |> slice_head(n = 1) takes the first row from each group.
  2. df |> slice_tail(n = 1) takes the last row in each group.
  3. df |> slice_min(x, n = 1) takes the row with the smallest value of column x in each group (tied rows are all kept by default).
  4. df |> slice_max(x, n = 1) takes the row with the largest value of column x in each group (tied rows are all kept by default).
  5. df |> slice_sample(n = 1) takes one random row from each group.
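
A minimal sketch of the slice_ functions on grouped data, using a made-up tibble (df, g and x are illustrative names). If the table is not grouped, each function works on the whole table instead.

```r
library(dplyr)

# Made-up data: two groups of two rows each
df <- tibble(g = c("a", "a", "b", "b"), x = c(4, 1, 3, 2))
by_g <- df |> group_by(g)

first_rows  <- by_g |> slice_head(n = 1)    # first row of each group
last_rows   <- by_g |> slice_tail(n = 1)    # last row of each group
min_rows    <- by_g |> slice_min(x, n = 1)  # smallest x in each group
max_rows    <- by_g |> slice_max(x, n = 1)  # largest x in each group
random_rows <- by_g |> slice_sample(n = 1)  # one random row per group

first_rows
min_rows
```

In this table the first row of each group happens to hold the largest x, so first_rows and max_rows contain the same rows.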

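Ungrouping & .by

To remove grouping, use ungroup(). To group “in-line” for a single operation, use the .by argument (available in summarize() and other dplyr verbs since dplyr 1.1.0). Unlike group_by(), .by returns an ungrouped result and keeps the groups in order of first appearance rather than sorting them.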

Code
penguins |> summarize(
  mx_bd = max(bill_depth_mm, na.rm = TRUE), 
  n = n(), 
  .by = species)
# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     21.5   152
2 Gentoo     17.3   124
3 Chinstrap  20.8    68
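
To compare the two styles side by side, here is a minimal sketch on made-up data (df, g, x and share are illustrative names). It assumes dplyr 1.1.0 or later, which is where the .by argument was introduced.

```r
library(dplyr)

df <- tibble(g = c("b", "a", "b"), x = c(1, 2, 3))

# group_by() persists: the result of mutate() is still grouped
grouped <- df |> group_by(g) |> mutate(share = x / sum(x))
is_grouped_df(grouped)

# ungroup() removes the grouping; the data are unchanged
cleared <- grouped |> ungroup()
is_grouped_df(cleared)

# .by groups for this one call only; the result comes back ungrouped
inline <- df |> mutate(share = x / sum(x), .by = g)
is_grouped_df(inline)

# Group order differs: group_by() sorts, .by keeps first appearance
sorted_groups <- df |> group_by(g) |> summarize(total = sum(x))
seen_order    <- df |> summarize(total = sum(x), .by = g)
```

Both routes compute the same share column. The .by form saves the explicit ungroup() when the grouping is needed for one step only.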