Summarizing data is a very common task in data science. One way to get a summary is to use the base R summary() function:
mtcars |> summary()
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
       am              gear            carb
 Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000   Max.   :8.000
The tidyverse provides more flexible tools for summarizing data. We begin with group_by().
group_by()
The main way to summarize data begins with group_by().
It does not reorder or alter the rows. Instead, it records which group each row belongs to, so that subsequent analysis is done “by group”.
# important: group_by() does not change how the data looks
library(tidyverse)
library(nycflights13)

# largest arrival delay on each day, sorted from smallest to largest
flights |>
  group_by(year, month, day) |>
  summarize(mx = max(arr_delay, na.rm = TRUE)) |>
  arrange(mx) |>
  print(n = 30)
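To make the point about group_by() concrete, here is a minimal sketch using the built-in mtcars data rather than flights. The object names by_cyl and cyl_summary are illustrative and not part of the original notes.

```r
library(dplyr)

# Group the cars by number of cylinders
by_cyl <- mtcars |>
  group_by(cyl)

# The grouped data has the same rows and columns as the original
identical(dim(by_cyl), dim(mtcars))

# The grouping is stored as metadata on the data frame
group_vars(by_cyl)
n_groups(by_cyl)

# Summaries are then computed once per group
cyl_summary <- by_cyl |>
  summarize(mean_mpg = mean(mpg), n = n())
cyl_summary
```

The summary has one row per group, and the n column adds up to the number of rows in mtcars.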
Create a new variable in the diamonds data using cut() that gives the relative size, e.g., “small”, “medium”, etc. The quantile() function may be helpful:
quantile(diamonds$carat)
0% 25% 50% 75% 100%
0.20 0.40 0.70 1.04 5.01
Group the data on this newly created variable and create a summary of the price (mean, max, min, etc.) for each group.
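As a hint for how cut() and quantile() fit together, the sketch below bins a different variable (hp in mtcars) so the exercise itself stays open. The three-way split, the labels, and the names breaks, sized, and power_summary are assumptions made for illustration.

```r
library(dplyr)

# Break points at the 0%, 33%, 67% and 100% quantiles of horsepower
breaks <- quantile(mtcars$hp, probs = c(0, 1/3, 2/3, 1))

# cut() turns the numeric column into an ordered set of labelled bins;
# include.lowest = TRUE keeps the minimum value inside the first bin
sized <- mtcars |>
  mutate(power = cut(hp,
                     breaks = breaks,
                     labels = c("low", "medium", "high"),
                     include.lowest = TRUE))

# Group on the new variable and summarize another column
power_summary <- sized |>
  group_by(power) |>
  summarize(n = n(),
            mean_mpg = mean(mpg),
            min_mpg = min(mpg),
            max_mpg = max(mpg))
power_summary
```

Without include.lowest = TRUE, the row holding the minimum value would fall outside every bin and receive NA.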
across()
across() applies a function to several variables in a data frame at once.
# change character variables to factors
mpg |>
  mutate(across(where(is.character), as.factor))
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<fct> <fct> <dbl> <int> <int> <fct> <fct> <int> <int> <fct> <fct>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
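The function passed to across() does not have to be a named one. As a sketch, the example below standardizes two columns of mtcars with an anonymous function. The choice of columns and the name scaled are only for illustration.

```r
library(dplyr)

# Replace mpg and wt by their z-scores; all other columns are left untouched
scaled <- mtcars |>
  mutate(across(c(mpg, wt), \(x) (x - mean(x)) / sd(x)))

# Check the result: each standardized column should have mean 0 and sd 1
scaled |>
  summarize(across(c(mpg, wt), list(mean = mean, sd = sd)))
```

The `\(x)` shorthand for anonymous functions requires R 4.1 or later, the same version that introduced the `|>` pipe used throughout.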
You can also use across() inside summarise().
# compute the grouped mean of all numeric variables
diamonds |>
  group_by(cut) |>
  summarise(across(where(is.numeric), mean))
# A tibble: 5 × 8
cut carat depth table price x y z
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98
2 Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64
3 Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56
4 Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65
5 Ideal 0.703 61.7 56.0 3458. 5.51 5.52 3.40
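across() can also apply several functions in one call when given a named list. The sketch below uses mtcars, and the name multi is illustrative. With the default naming scheme, each output column is named after the input column and the function.

```r
library(dplyr)

# Mean and standard deviation of mpg and hp within each cylinder group
multi <- mtcars |>
  group_by(cyl) |>
  summarise(across(c(mpg, hp), list(mean = mean, sd = sd)))

# Output columns are named <column>_<function>
names(multi)
multi
```

The .names argument of across() changes this naming pattern if a different scheme is preferred.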
# compute the mean of several variables: numeric columns whose names contain "y", except year
mpg |>
  group_by(manufacturer) |>
  summarize(across(where(is.numeric) & contains("y") & !contains("year"), mean))
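When selection helpers are combined with & and !, it can be hard to see which columns end up selected. One way to check, sketched below, is to pass the same expression to select() first. The name picked is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Preview the columns matched by the combined selection
picked <- mpg |>
  select(where(is.numeric) & contains("y") & !contains("year")) |>
  names()
picked
```

Here the selection keeps cyl, cty, and hwy: they are numeric and contain the letter "y", while year is excluded explicitly and displ contains no "y".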