Operations on data tables

Creating a summary is one example of an operation on a table; joining two tables together is another. This section also covers creating summaries of groups of data.

Table Operations

bind_rows

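Stack two tables on top of each other using bind_rows(). Columns are matched by name, and a column that is missing from one table is filled with NA for that table's rows. Experiment to see how columns with differing names or types are handled.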
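As a minimal sketch of how columns are matched, the example below binds two small tables that share only one column. The tables `a` and `b` and their columns are hypothetical and exist only for illustration.

```r
library(dplyr)

# Two small hypothetical tables that share only the `id` column
a <- tibble(id = 1:2, x = c("a", "b"))
b <- tibble(id = 3L, y = 10)

# Columns are matched by name; a column missing from one table
# is filled with NA for the rows that came from that table
combined <- bind_rows(a, b)
combined
```

The result has three rows and the columns id, x, and y, with NA wherever the source table did not have that column.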

group_by()

Assign the rows of a dataset to groups using group_by(), so that subsequent analysis is done “by group”.
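A short sketch of what grouping does to a table, using the penguins data that appears throughout this section. The variable name `by_species` is an illustrative choice.

```r
library(dplyr)
library(palmerpenguins)

by_species <- penguins |> group_by(species)

# The rows themselves are unchanged; only grouping metadata is added
nrow(by_species) == nrow(penguins)

# Inspect the grouping
group_vars(by_species)  # names of the grouping columns
n_groups(by_species)    # number of distinct groups
```

Nothing visible happens until a later verb such as summarize() is applied to the grouped table.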

summarize()

summarize() collapses your data table to one row per group (or a single row if the table is not grouped), where each row is a summary of the corresponding data.

Find the max of a single variable.

library(tidyverse)
library(palmerpenguins)
penguins |> summarize(mx_bd = max(bill_depth_mm))

# A tibble: 1 × 1
  mx_bd
  <dbl>
1    NA
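The result above is NA because bill_depth_mm contains missing values, and max() returns NA whenever any input is NA. A minimal sketch of the usual fix, with `overall` and `missing` as illustrative variable names:

```r
library(dplyr)
library(palmerpenguins)

# na.rm = TRUE drops missing values before taking the maximum
overall <- penguins |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE))
overall

# Count the missing values to see how many rows were ignored
missing <- penguins |>
  summarize(n_missing = sum(is.na(bill_depth_mm)))
missing
```

Counting the missing values alongside the summary makes it clear how much data the statistic is actually based on.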

Find the max of a single variable within each group.

penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE))

# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie     21.5
2 Chinstrap  20.8
3 Gentoo     17.3

Improve your summary by adding the number of rows in each group with n = n().

penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE), n = n())

# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     21.5   152
2 Chinstrap  20.8    68
3 Gentoo     17.3   124

Exercise: What’s wrong with this code?

penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE, n = n()))

# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Find the number of observations (rows) within each group.

penguins |> group_by(species) |> summarize(n = n())

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

The slice_ functions

On a grouped table, each slice_ function operates within each group; on an ungrouped table, it operates on the whole table.

df |> slice_head(n = 1) takes the first row.
df |> slice_tail(n = 1) takes the last row.
df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
df |> slice_max(x, n = 1) takes the row with the largest value of column x.
df |> slice_sample(n = 1) takes one random row.
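A brief sketch of the slice_ functions on the penguins data, combined with group_by(). The variable names `deepest` and `first_rows` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Row(s) with the deepest bill within each species.
# Ties are kept by default, so a group can return more than one row.
deepest <- penguins |>
  group_by(species) |>
  slice_max(bill_depth_mm, n = 1)
deepest

# First row of each species, in the table's existing row order
first_rows <- penguins |>
  group_by(species) |>
  slice_head(n = 1)
first_rows
```

Unlike summarize(), the slice_ functions keep every column of the selected rows, which is useful when you want to know which observation holds the extreme value.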

Ungrouping & .by()

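To remove grouping, use ungroup(). To group “in-line” for a single operation, use the .by argument instead of group_by(). Note that .by keeps groups in order of first appearance rather than sorting them, and the result is not grouped.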

penguins |> summarize(
  mx_bd = max(bill_depth_mm), 
  n = n(), 
  .by = species)

# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     NA     152
2 Gentoo     NA     124
3 Chinstrap  20.8    68
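The difference between persistent and per-operation grouping can be sketched as follows. This assumes a version of dplyr that supports the .by argument (1.1.0 or later); the variable names `grouped` and `by_result` are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Grouping from group_by() persists until it is removed
grouped <- penguins |> group_by(species)
group_vars(grouped)

# ungroup() drops the grouping, so later verbs see the whole table again
group_vars(ungroup(grouped))

# .by groups for this one call only; the result is an ordinary ungrouped tibble
by_result <- penguins |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE), .by = species)
is_grouped_df(by_result)
```

Adding na.rm = TRUE here also removes the NA values that appear in the summary above.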
