install.packages("gapminder")
Chapters 2.1 - 2.2
Data 309
Make a script for each class
Shortcut: Ctrl/Cmd + Shift + N
or point-and-click as shown
Use comments to organize your work!
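As a minimal sketch of what a commented class script might look like (the section names and the variable n_rows are illustrative assumptions, not part of the course materials): in RStudio, a comment line that ends in four or more dashes is treated as a foldable section heading.

```r
# ---- Setup ----
# Assumes install.packages("gapminder") has already been run once
library(gapminder)

# ---- Explore ----
# Count the observations in the dataset
n_rows <- nrow(gapminder)
n_rows

# ---- Notes ----
# Questions to bring up in class go here
```

Keeping one script per class with the same section layout makes it easy to find earlier work later.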
Data Science Workflow
What are the main tools of data science? One answer to this question is the following diagram.
Dataset: Gapminder
Let’s explore a new dataset called gapminder. It concerns world development and comes from the Gapminder project (https://www.gapminder.org/).
Gapminder: Fight devastating ignorance with a fact-based worldview everyone can understand.
library(gapminder)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Below we look at a transposed version of the data with glimpse(), which lives in the dplyr package, included in the tidyverse. Alternatively, use View(gapminder) or just type gapminder.
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
We see that this dataset has six variables: country, continent, year, life expectancy, population and GDP per capita.
Useful exploratory tips
- Access variables directly with gapminder$country, or attach the data to our workspace with attach(gapminder) to avoid the harsh syntax.
- Then we can see all the countries listed with unique(country).
- Recall, ?gapminder is helpful to learn about the dataset.
- names(gapminder) lists all the variable names of gapminder.
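The tips above can be tried in one short sketch. It swaps in with() for attach(), which is an alternative not used in the original notes: with() looks up column names inside the data frame for a single expression, so nothing is left attached to the workspace afterwards. The names vars and countries are illustrative.

```r
library(gapminder)

# Column names, matching the six variables shown by glimpse()
vars <- names(gapminder)
vars

# with() evaluates the expression inside the data frame,
# so `country` can be named directly without attach()
countries <- with(gapminder, unique(country))

# How many distinct countries there are, and the first few of them
length(countries)
head(countries)
```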
Grammar of Graphics / ggplot syntax
One of the main ways to visualize data in R is ggplot2, which utilizes the conceptual framework of the grammar of graphics. In English, grammar dictates that each sentence must have a subject and verb. In the grammar of graphics, each plotting element must have data, aesthetics and a geometry. Good sentences often have prepositions, adverbs, etc., and good graphics have more layers as well.
Our first input to ggplot is always a data table. Notice below, there is nothing to see, but also no error. It’s just a blank plot.
ggplot(data = gapminder)
The second input is a mapping argument built with aes(), short for aesthetic. (Another plot with no data drawn; only the x-axis for year appears.)
ggplot(
  data = gapminder,
  aes(x = year)
)
Only with the third layer do we get something interesting.
ggplot(
  data = gapminder,
  aes(x = year, y = lifeExp)) +
  geom_point()
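The same year versus life expectancy scatterplot can also be built step by step, which may make the layering idea more concrete. This is only a sketch; the object names p and p_points are illustrative assumptions.

```r
library(gapminder)
library(ggplot2)

# Inputs 1 and 2: data plus aesthetic mapping (printing p shows axes only)
p <- ggplot(data = gapminder, aes(x = year, y = lifeExp))

# Third layer: a geometry; `+` returns a new plot object and leaves p unchanged
p_points <- p + geom_point()

# Printing the object draws the scatterplot
p_points
```

Storing the base plot in an object makes it easy to try several geometries on the same data and mapping.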
This is a pretty bad plot. We learn something, but we could learn the same with less. Scatterplots are best used to compare two continuous numerical variables; year is stored as a number but takes only 12 distinct values here, so it behaves more like a category and the points stack into vertical strips. GDP (gross domestic product) per capita is continuous, so let’s see if life expectancy is possibly related to it.
ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
In the above plot, can you determine a relationship between GDP per capita and life expectancy? Let’s see if time plays a role.
ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp, color = year)) +
  geom_point()
In the above, we included a third variable as color. Now we apply a fourth layer, a statistic: in this case, a curve of best fit.
ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = year)) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
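The message printed above reports the smoothing method that geom_smooth() picked by default. As a hedged sketch, the method can also be chosen explicitly; the straight-line fit below is only an illustration of the arguments, not a recommendation from the lecture, and the name p_lm is an assumption.

```r
library(gapminder)
library(ggplot2)

p_lm <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = year)) +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```

Supplying method and formula yourself also silences the informational message, since nothing is left for ggplot2 to choose.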
We improve the graphic by adding labels such as a title, subtitle and more with labs(). Run ?labs to explore possibilities. How could the plot below be improved even further?
ggplot(
  data = gapminder,
  aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = year)) +
  labs(title = "GDP vs. Life Expectancy",
       x = "GDP per capita",
       y = "Life expectancy",
       subtitle = "From 1952 to 2007")
Assignment 2
- Edit the file lecture_.qmd in the Week1 folder to contain responses to the following:
  - Recreate the final plot above but with a different mapping, such as GDP as a function of year and color by continent.
  - What happens when you try to color by country?
  - What makes some mappings more useful than others?