How to summarize data

and other operations on data tables

Creating summaries of data is a very common practice in data science. One way to get a summary is to use the base R summary()

mtcars |> summary()
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

A more useful tool is in the tidyverse. We begin with

group_by()

The main way to summarize data begins with group_by().

It shuffles a dataset into groups, so that subsequent analysis is done “by group”.

# important: group_by does not change how the data looks
library(tidyverse)
library(nycflights13)
flights |> 
  group_by(year,month,day) |> 
  summarize(mx = max(arr_delay,na.rm=TRUE)) |> 
  arrange((mx)) |> 
  print(n=30) 
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day    mx
   <int> <int> <int> <dbl>
 1  2013     5    26   102
 2  2013     6    20   130
 3  2013     9     7   135
 4  2013    11    19   140
 5  2013     9    24   142
 6  2013    11    30   144
 7  2013     2    28   145
 8  2013     9     4   148
 9  2013     9     9   153
10  2013     9    15   157
11  2013    11    15   159
12  2013     9    14   161
13  2013     1    19   162
14  2013     3    26   166
15  2013     2     4   169
16  2013    10    31   169
17  2013     9    23   170
18  2013    10     4   171
19  2013    10    21   173
20  2013     4     3   174
21  2013     1     6   175
22  2013     3    11   180
23  2013    11    13   180
24  2013    11    16   182
25  2013     1     8   184
26  2013     1    15   187
27  2013     8    26   187
28  2013     3    30   189
29  2013    10     5   189
30  2013     5     5   190
# ℹ 335 more rows

group_by() does not remove or add data, it just adds a layer of grouping that is used by subsequent functions

# important: group_by does not change how the data looks
diamonds |> group_by(cut)
# A tibble: 53,940 × 10
# Groups:   cut [5]
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

group_by is often used in tandem with summarize

summarize() transforms your data table into one or more rows, where each row constitutes a summary of the corresponding data.

  1. Find the max of a single variable.
library(tidyverse)
library(palmerpenguins)
penguins |> summarize(mx_bd = max(bill_depth_mm)) 
# A tibble: 1 × 1
  mx_bd
  <dbl>
1    NA
  1. Find the max of single variable within each group.
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie     21.5
2 Chinstrap  20.8
3 Gentoo     17.3
  1. Improve your summary with n = n(). Every summary should include this.
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE), n = n())
# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     21.5   152
2 Chinstrap  20.8    68
3 Gentoo     17.3   124
  1. Exercise: What’s wrong with this code?
penguins |> group_by(species) |>
  summarize(mx_bd = max(bill_depth_mm, na.rm = TRUE, n = n()))
# A tibble: 3 × 2
  species   mx_bd
  <fct>     <dbl>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
  1. Find the number of observations (rows) within each group.
penguins |> group_by(species) |> summarize(n = n())
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

grouping & cut on the diamonds data

  1. Explore the diamonds dataset by summarizing additional values.
  2. Arrange the output by cut.
  3. Interpret the output. Does anything surprise you?
  diamonds |> group_by(carat) |>
    summarize(
      mean_price = mean(price),
    )
# A tibble: 273 × 2
   carat mean_price
   <dbl>      <dbl>
 1  0.2        365.
 2  0.21       380.
 3  0.22       391.
 4  0.23       486.
 5  0.24       505.
 6  0.25       551.
 7  0.26       551.
 8  0.27       575.
 9  0.28       580.
10  0.29       601.
# ℹ 263 more rows

Grouping numerical values & summarizing

  1. Create a new variable in the diamonds data using cut() that gives the relative size, eg., “small”, “medium”, etc. The quantile() function may be helpful:
quantile(diamonds$carat)
  0%  25%  50%  75% 100% 
0.20 0.40 0.70 1.04 5.01 
  1. Group the data on this newly created variable and create a summary of the price (mean, max, min, etc.) for each group.

across()

Apply a function to several variables in a data frame.

# change character variables to factors
mpg |> mutate(across(where(is.character),
                     as.factor))
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <fct>        <fct>      <dbl> <int> <int> <fct> <fct> <int> <int> <fct> <fct>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

You can also use across inside of summarise.

# compute the grouped-mean of all numeric variables
diamonds |> group_by(cut) |>
  summarise(across(where(is.numeric),
            mean))
# A tibble: 5 × 8
  cut       carat depth table price     x     y     z
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98
2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64
3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56
4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65
5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40
# compute the mean of several variable
mpg |> group_by(manufacturer) |>
  summarize(
    across(where(is.numeric) & contains("y") & !contains("year"),
           mean))
# A tibble: 15 × 4
   manufacturer   cyl   cty   hwy
   <chr>        <dbl> <dbl> <dbl>
 1 audi          5.22  17.6  26.4
 2 chevrolet     7.26  15    21.9
 3 dodge         7.08  13.1  17.9
 4 ford          7.2   14    19.4
 5 honda         4     24.4  32.6
 6 hyundai       4.86  18.6  26.9
 7 jeep          7.25  13.5  17.6
 8 land rover    8     11.5  16.5
 9 lincoln       8     11.3  17  
10 mercury       7     13.2  18  
11 nissan        5.54  18.1  24.6
12 pontiac       6.4   17    26.4
13 subaru        4     19.3  25.6
14 toyota        5.12  18.5  24.9
15 volkswagen    4.59  20.9  29.2

Here’s another use of across

mtcars |> mutate(across(c(1,5), round))
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4            21   6 160.0 110    4 2.620 16.46  0  1    4    4
Mazda RX4 Wag        21   6 160.0 110    4 2.875 17.02  0  1    4    4
Datsun 710           23   4 108.0  93    4 2.320 18.61  1  1    4    1
Hornet 4 Drive       21   6 258.0 110    3 3.215 19.44  1  0    3    1
Hornet Sportabout    19   8 360.0 175    3 3.440 17.02  0  0    3    2
Valiant              18   6 225.0 105    3 3.460 20.22  1  0    3    1
Duster 360           14   8 360.0 245    3 3.570 15.84  0  0    3    4
Merc 240D            24   4 146.7  62    4 3.190 20.00  1  0    4    2
Merc 230             23   4 140.8  95    4 3.150 22.90  1  0    4    2
Merc 280             19   6 167.6 123    4 3.440 18.30  1  0    4    4
Merc 280C            18   6 167.6 123    4 3.440 18.90  1  0    4    4
Merc 450SE           16   8 275.8 180    3 4.070 17.40  0  0    3    3
Merc 450SL           17   8 275.8 180    3 3.730 17.60  0  0    3    3
Merc 450SLC          15   8 275.8 180    3 3.780 18.00  0  0    3    3
Cadillac Fleetwood   10   8 472.0 205    3 5.250 17.98  0  0    3    4
Lincoln Continental  10   8 460.0 215    3 5.424 17.82  0  0    3    4
Chrysler Imperial    15   8 440.0 230    3 5.345 17.42  0  0    3    4
Fiat 128             32   4  78.7  66    4 2.200 19.47  1  1    4    1
Honda Civic          30   4  75.7  52    5 1.615 18.52  1  1    4    2
Toyota Corolla       34   4  71.1  65    4 1.835 19.90  1  1    4    1
Toyota Corona        22   4 120.1  97    4 2.465 20.01  1  0    3    1
Dodge Challenger     16   8 318.0 150    3 3.520 16.87  0  0    3    2
AMC Javelin          15   8 304.0 150    3 3.435 17.30  0  0    3    2
Camaro Z28           13   8 350.0 245    4 3.840 15.41  0  0    3    4
Pontiac Firebird     19   8 400.0 175    3 3.845 17.05  0  0    3    2
Fiat X1-9            27   4  79.0  66    4 1.935 18.90  1  1    4    1
Porsche 914-2        26   4 120.3  91    4 2.140 16.70  0  1    5    2
Lotus Europa         30   4  95.1 113    4 1.513 16.90  1  1    5    2
Ford Pantera L       16   8 351.0 264    4 3.170 14.50  0  1    5    4
Ferrari Dino         20   6 145.0 175    4 2.770 15.50  0  1    5    6
Maserati Bora        15   8 301.0 335    4 3.570 14.60  0  1    5    8
Volvo 142E           21   4 121.0 109    4 2.780 18.60  1  1    4    2

More operations on tables

The slice_ functions

  1. df |> slice_head(n = 1) takes the first row from each group.
  2. df |> slice_tail(n = 1) takes the last row in each group.
  3. df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
  4. df |> slice_max(x, n = 1) takes the row with the largest value of column x.
  5. df |> slice_sample(n = 1) takes one random row.

Ungrouping & .by()

To remove grouping use ungroup(), and to do “in-line” grouping on a per-operation basis you can use .by()

penguins |> summarize(
  mx_bd = max(bill_depth_mm), 
  n = n(), 
  .by = species)
# A tibble: 3 × 3
  species   mx_bd     n
  <fct>     <dbl> <int>
1 Adelie     NA     152
2 Gentoo     NA     124
3 Chinstrap  20.8    68

bind_rows

This has nothing to do with grouping, but it is an operation on a table.

Glue two tables together using bind_rows. Experiment to learn how different variables are handled.