Chapters 2.1 - 2.2

Data 309

Published

August 29, 2024

Make a script for each class

Shortcut: Ctrl/Cmd + Shift + N, or point-and-click as shown:

File -> New File -> R Script

Use comments to organize your work!
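
As a minimal sketch of a commented script (the section names and the toy vector are placeholders, not part of the course materials), comments can mark what each part of the file does. In RStudio, a comment line ending in four or more dashes is also treated as a foldable section.

```r
# Setup ----
# A small numeric vector to work with (placeholder data)
x <- c(2, 4, 6)

# Summaries ----
# Average of the values in x
x_mean <- mean(x)
x_mean
```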

Data Science Workflow

What are the main tools of data science? One answer to this question is the following diagram.

Figure: The data science flow

Dataset: Gapminder

Let’s explore a new dataset called gapminder. It concerns world development and comes from the Gapminder project at https://www.gapminder.org/.

Gapminder: Fight devastating ignorance with a fact-based worldview everyone can understand.

install.packages("gapminder")
library(gapminder)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
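
The Conflicts message above says that, once the tidyverse is loaded, a bare filter() or lag() call refers to the dplyr version rather than the one in stats. A small sketch of being explicit about which one is meant, using the package::function form (the year 2007 and the toy series 1:5 are chosen only for illustration):

```r
library(gapminder)
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition.
rows_2007 <- dplyr::filter(gapminder, year == 2007)
nrow(rows_2007)

# stats::filter() is unrelated: it applies a linear filter to a numeric series.
# Here, a centered 3-point moving average; the two end values come out as NA.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)
```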

Below we look at a transposed version of the data with glimpse(), which lives in the dplyr package, included in the tidyverse. Alternatively, use View(gapminder) or just type gapminder.

glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

We see that this dataset has six variables: country, continent, year, life expectancy, population and GDP per capita.
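
A few other quick ways to inspect the data, alongside glimpse(), can be sketched as follows. These functions all ship with R, and none of them changes the data.

```r
library(gapminder)

dim(gapminder)              # number of rows, then number of columns
head(gapminder, 3)          # first three rows, printed as a table
str(gapminder)              # compact structure display, similar to glimpse()
summary(gapminder$lifeExp)  # min, quartiles, median, mean and max of one variable
# View(gapminder)           # spreadsheet-style viewer; interactive sessions only
```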

Useful exploratory tips

  1. Access variables directly with gapminder$country, or attach the data to our workspace with attach(gapminder) to avoid the harsh $ syntax.
  2. Once the data is attached, we can see all the countries listed with unique(country).
  3. Recall, ?gapminder is helpful to learn about the dataset.
  4. names(gapminder) lists all the variable names of gapminder.
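
The tips above, as a short sketch. The with() call is an addition not mentioned in the tips: it gives the same bare-name convenience as attach() for a single expression, without leaving the data attached afterwards.

```r
library(gapminder)

# Tip 1: pull one column out with $
countries <- gapminder$country

# Tip 2: list each distinct value once
unique(countries)
length(unique(countries))   # how many distinct countries there are

# Alternative to attach(): evaluate one expression inside the data frame
n_continents <- with(gapminder, length(unique(continent)))
n_continents

# Tip 4: variable names
names(gapminder)
```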

Grammar of Graphics / ggplot syntax

One of the main ways to visualize data in R is ggplot2, which utilizes the conceptual framework of the grammar of graphics. In English, grammar dictates that each sentence must have a subject and verb. In the grammar of graphics, each plotting element must have data, aesthetics and a geometry. Good sentences often have prepositions, adverbs, etc., and good graphics have more layers as well.

Figure: ggplot2 layers

Our first input to ggplot is always a data table. Notice below, there is nothing to see, but also no error. It’s just a blank plot.

ggplot(data = gapminder)

The second input is the mapping argument, built with aes() (short for aesthetics). (Another plot with no data drawn, although the x-axis now shows year.)

ggplot(
  data = gapminder,
  aes(x = year)
)
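
To see why no data are drawn yet, it can help to look at what aes() produces on its own. A minimal sketch: aes() only records which variable plays which role, and nothing is drawn until a geom layer is added. The object name m is arbitrary.

```r
library(ggplot2)

# aes() just pairs aesthetic names (x, y, ...) with variable names
m <- aes(x = year, y = lifeExp)
names(m)   # the aesthetics that were mapped
m          # printing shows each aesthetic and the variable it points to
```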

Only with the third layer do we get something interesting.

ggplot(
  data = gapminder,
  aes(x = year, y = lifeExp)) + 
  geom_point() 

This is a pretty bad plot. We learn something, but we could learn the same with less: year takes only 12 distinct values, so the points pile up in vertical strips. Scatterplots are best used to compare numerical data, not categorical. GDP per capita (gdpPercap) is numerical and varies continuously, so let’s see if life expectancy is possibly related to it.

ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() 

In the above plot, can you determine a relationship between GDP and Life Expectancy? Let’s see if time plays a role.

ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp, color = year)) + 
  geom_point() 

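In the plot above, we included a third variable as color. Now we apply a fourth layer, a statistic: in this case, a curve of best fit. Note that the color mapping moves into geom_point(), so it applies only to the points and not to the curve.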

ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(color = year)) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
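
The message printed above reports the smoother that geom_smooth() chose by default. As a sketch of making that choice explicit instead (the straight-line fit and the log-scaled x-axis are illustrative assumptions, not the lecture's method), method and formula can be passed directly:

```r
library(gapminder)
library(ggplot2)

p <- ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = year)) +
  # method = "lm" fits a straight line (to log10 GDP here, because the scale
  # transformation is applied first); stating the formula silences the message
  geom_smooth(method = "lm", formula = y ~ x) +
  # a log10 x-axis spreads out the many low-GDP points
  scale_x_log10()
p
```

Whether a straight line on the log scale is a reasonable summary is itself something to judge from the plot.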

We improve the graphic by adding labels such as a title, subtitle and axis labels with labs(). Run ?labs to explore the possibilities. How could the plot below be improved even further?

ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(color = year)) +
  labs(title = "GDP vs. Life Expectancy", 
       x = "GDP per capita",    # x is gdpPercap
       y = "Life expectancy",   # y is lifeExp
       subtitle = "From 1952 to 2007")  # the years covered by the data

Assignment 2

Edit the file lecture_.qmd in the Week1 folder to contain responses to the following:

  1. Recreate the final plot above but with a different mapping, such as GDP as a function of year and color by continent.
  2. What happens when you try to color by country?
  3. What makes some mappings more useful than others?