Sections 1.4-1.5

Visualizing distributions & relationships

ATUS Data

Learn more and extract different data below: American Time Use Survey

Our ATUS data concerns sex, age, race, stress & work

Download

Prepare to download the data into your Dat309 folder (not Samba!), or a suitable sub-directory.

# get the working directory
getwd()
# set the working directory
# note: Windows users, use forward slashes (/), not backslashes (\)
setwd("enter your directory here")
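One way to avoid slash mistakes is to let R assemble the path. The sketch below is only an illustration: the drive letter and the user name "me" are hypothetical placeholders, not paths from the course.

```r
# Build a path piece by piece; file.path() joins the parts with forward slashes.
# "C:" and "me" are hypothetical placeholders, replace them with your own.
dat_dir <- file.path("C:", "Users", "me", "Dat309")
print(dat_dir)

# Only change the working directory if the folder actually exists
if (dir.exists(dat_dir)) {
  setwd(dat_dir)
}
```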

Download the codebook, the DDI file and the zip file: ddi, codebook, .zip

Use the codebook to look up what each variable means. You need the ipumsr package to load the data into R.

# load ATUS data
# reading the data requires the ipumsr package
library(haven)
library(ipumsr)
ddi <- read_ipums_ddi("ATUS2/atus_00002.xml")
data <- read_ipums_micro(ddi)

Exploratory Questions

  1. What is the size of the data? How many rows & columns? (Use dim(), for dimension.)
  2. What do the numbers mean in the rows?
  3. What are the variable names and what do they mean?
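The questions above can be approached with a handful of base R functions. The sketch below runs on a small made-up data frame, since the real extract is not loaded here; the column names and values are assumptions for illustration only.

```r
# Toy stand-in for the ATUS extract (made-up values, assumed column names)
toy <- data.frame(
  CASEID   = c(1, 2, 3, 4),
  AGE      = c(23, 41, 35, 67),
  SCSTRESS = c(0, 3, 6, 99)
)

dim(toy)      # number of rows, then number of columns
names(toy)    # variable names
str(toy)      # type of each column plus the first few values
head(toy, 2)  # first two rows, to see what the numbers look like

# With the real data, ipumsr::ipums_var_info(ddi) lists the variable
# descriptions stored in the DDI file (not run here, the file is needed).
```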

The (very useful) clean_names() function makes the variable names a bit easier to read.

# filter to get stress data
#| eval: TRUE
library(tidyverse)
library(janitor)
ds <- clean_names(data)
ds <- filter(ds,scstress < 10)
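To see what these two steps do, here is a minimal sketch on a made-up data frame. The values are invented, and treating codes of 10 or more as non-answers is an assumption; check the codebook for what those codes actually mean.

```r
library(dplyr)
library(janitor)

# Toy data with upper-case names, as they might arrive from the extract
toy <- data.frame(
  SCSTRESS  = c(0, 3, 6, 99, 98),
  OCC2_CPS8 = c(120, 122, 132, 120, 122)
)

# clean_names() converts the names to lower-case snake_case
toy_clean <- clean_names(toy)
names(toy_clean)

# Keep only rows whose stress score is below 10
toy_small <- filter(toy_clean, scstress < 10)
nrow(toy_small)
```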

The variable pertaining to “kind of job” was poorly named, so we change it.

# rename the work variable
#| eval: FALSE
ds <- rename(ds,"job" = occ2_cps8)
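Note that rename() takes the new name on the left and the old name on the right. A small sketch on made-up data (the values are invented):

```r
library(dplyr)

toy <- data.frame(scstress = c(0, 3, 6), occ2_cps8 = c(120, 122, 132))

# new_name = old_name
toy <- rename(toy, job = occ2_cps8)
names(toy)
```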

Filter

Choose a few jobs so the data isn’t so big. Use the codebook to learn how the numbers relate to jobs.

dmsf <- filter(ds,job == 120 | job == 122 |job == 132)
ggplot(dmsf,aes(x=scstress)) + geom_bar()
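When the list of jobs grows, chaining | conditions gets long. A possible alternative is %in%, sketched here on made-up data with plain numeric job codes (an assumption; the real column may carry haven labels).

```r
library(dplyr)

# Toy data: job codes and stress scores are invented
toy <- data.frame(
  job      = c(120, 122, 132, 140, 122, NA),
  scstress = c(1, 4, 2, 5, 0, 3)
)

# Same selection written two ways
by_or <- filter(toy, job == 120 | job == 122 | job == 132)
by_in <- filter(toy, job %in% c(120, 122, 132))

all.equal(by_or, by_in)  # both approaches keep the same rows
count(by_in, job)        # how many rows per job remain
```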

Visualizing Distributions & Relationships

See the examples in the text.

(i) Categorical distribution: bar plot

Exercise

  1. Improve the plot below with fill = as_factor(job) from the haven package.
  2. Add position = "fill" to geom_bar(). Does it help?

dmsf <- filter(ds,job == 120 | job == 122 |job == 132)
ggplot(dmsf,aes(x=scstress, fill = job)) + geom_bar() +
  labs(fill = "Job")

  1. Numerical distribution: histogram, density plot

  2. Categorical / Numerical relationship: box plot, density boxplot

  3. Two categorical variables: barplot filled with color or the same with position = "fill" in the geom.

  4. Two or three numerical variables: scatter plot with colors mapped to a variable. (see text)

  5. Just a few categorical variables? Try faceting:
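As a rough sketch of a few of the plot types listed above, the code below uses the mpg data set that ships with ggplot2 as a stand-in for the ATUS data. The choice of variables is only for illustration.

```r
library(ggplot2)

# Numerical distribution: histogram of highway mileage
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Categorical / numerical relationship: one box plot per drive type
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# Faceting: one bar-plot panel per drive type
p_facet <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  facet_wrap(~drv)

p_facet
```

With the ATUS data, the same pattern would apply with scstress and job in place of the mpg variables.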

Exercise:

  1. Are the happy and stress variables correlated?
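One possible way to approach this is a rank correlation plus a count plot, since both scores take only a few whole-number values. The sketch below uses invented numbers, and the column name schappy is an assumption; check the actual name in your data.

```r
library(ggplot2)

# Made-up scores, for illustration only
toy <- data.frame(
  scstress = c(0, 1, 2, 3, 4, 5, 6),
  schappy  = c(6, 5, 5, 4, 2, 2, 0)
)

# Spearman correlation works on ranks, which suits ordinal scores
rho <- cor(toy$scstress, toy$schappy, method = "spearman")
rho

# geom_count() sizes each point by how many rows share that value pair
ggplot(toy, aes(x = scstress, y = schappy)) +
  geom_count()
```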

Factors

  1. Remove factor() from the fill = factor(job) and observe the result.
  2. Replace it with fill = haven::as_factor(job) and observe. The haven:: prefix is a way of calling the as_factor() function without loading the haven library, which caused data-typing problems earlier.
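The difference between the two can be sketched on a small labelled vector. The job labels "Job A", "Job B" and "Job C" are hypothetical placeholders; the real labels come from the codebook.

```r
# Plain numeric job codes
codes <- c(120, 122, 132, 120)

# The same codes with value labels attached (labels are hypothetical)
job <- haven::labelled(
  codes,
  labels = c("Job A" = 120, "Job B" = 122, "Job C" = 132)
)

# factor() on the plain codes uses the numbers as levels
f_plain <- factor(codes)
levels(f_plain)

# haven::as_factor() uses the value labels as levels
f_lab <- haven::as_factor(job)
levels(f_lab)
```

In a plot, the second version gives a legend with readable job names instead of numeric codes.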

Exercises:

  1. Textbook: 1.5.5 (Due: Sept. 9)
  2. Apply what you learned in 1.4-1.5 to the ATUS data.