πŸ“Š Stages 2 & 3 Β· Exploratory Data Analysis + Tidymodels

~25 minutes Β· The heart of the toolkit

Instructor Β· Stages 2–3 (~25 min combined). Render and open the HTML. Walk through EDA first (Stage 2, ~10 min) then the tidymodels section (Stage 3, ~10 min). Key narration cues are marked πŸ’¬ throughout.

πŸ” Stages 2 & 3 Β· Read the Evidence Β· Model the Pattern

A reproducible analysis keeps code and conclusions in the same file.
Every number, every chart, every model update is one render away.


0.1 Setup

Code
library(tidyverse)
library(tidymodels)
library(patchwork)
library(GGally)

treasure <- read_csv("data/treasure_hunt.csv", show_col_types = FALSE) |>
  mutate(major = fct_reorder(major, final_treasure, .fun = mean))

glimpse(treasure)
Rows: 25
Columns: 9
$ team                <chr> "A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B…
$ major               <fct> Finance, Marketing, Accounting, Management, OMIS, …
$ clues_solved        <dbl> 8, 6, 7, 5, 9, 4, 7, 6, 8, 9, 5, 7, 6, 7, 9, 4, 6,…
$ time_minutes        <dbl> 47, 58, 51, 60, 42, 68, 53, 57, 49, 44, 63, 55, 54…
$ visual_design       <dbl> 7, 9, 6, 8, 7, 6, 10, 5, 7, 8, 6, 9, 5, 8, 8, 5, 8…
$ data_wrangle        <dbl> 9, 7, 8, 6, 10, 5, 7, 7, 8, 9, 6, 8, 8, 7, 10, 5, …
$ model_score         <dbl> 8, 6, 7, 5, 9, 4, 6, 7, 7, 9, 5, 6, 7, 6, 8, 4, 6,…
$ presentation_energy <dbl> 8, 9, 6, 8, 7, 6, 10, 5, 8, 9, 7, 9, 6, 8, 8, 5, 8…
$ final_treasure      <dbl> 87, 81, 79, 72, 94, 61, 84, 74, 85, 95, 69, 83, 76…

πŸ’¬ β€œglimpse() is the first thing I run on any dataset. It tells me variable types, dimensions, and a data preview in a compact format β€” faster than scrolling through Excel.”

RStudio interface


1 Stage 2 Β· Exploratory Data Analysis

1.1 Summary statistics

Code
treasure |>
  summarise(
    across(
      where(is.numeric),
      list(
        n      = \(x) sum(!is.na(x)),
        min    = \(x) min(x, na.rm = TRUE),
        mean   = \(x) round(mean(x, na.rm = TRUE), 2),
        median = \(x) median(x, na.rm = TRUE),
        max    = \(x) max(x, na.rm = TRUE),
        sd     = \(x) round(sd(x, na.rm = TRUE), 2)
      ),
      .names = "{.col}__{.fn}"
    )
  ) |>
  pivot_longer(
    everything(),
    names_to  = c("variable", "stat"),
    names_sep = "__"
  ) |>
  pivot_wider(
    names_from  = stat,
    values_from = value
  )
# A tibble: 7 Γ— 7
  variable                n   min  mean median   max    sd
  <chr>               <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1 clues_solved           25     4  6.8       7    10  1.58
2 time_minutes           25    40 53.2      53    70  7.79
3 visual_design          25     5  7.24      7    10  1.39
4 data_wrangle           25     5  7.6       8    10  1.41
5 model_score            25     4  6.76      7    10  1.54
6 presentation_energy    25     5  7.48      8    10  1.36
7 final_treasure         25    58 80.7      81    99  9.97

πŸ’¬ β€œSummary statistics are the first conversation with your data. Look for min/max outliers, and notice whether the mean and median are close β€” a big gap hints at skew.”

1.2 Per-major summary

Code
major_summary <- treasure |>
  group_by(major) |>
  summarise(
    n                    = n(),
    # final_treasure
    min_treasure         = min(final_treasure, na.rm = TRUE),
    mean_treasure        = round(mean(final_treasure, na.rm = TRUE), 1),
    median_treasure      = median(final_treasure, na.rm = TRUE),
    max_treasure         = max(final_treasure, na.rm = TRUE),
    sd_treasure          = round(sd(final_treasure, na.rm = TRUE), 1),
    # clues_solved
    min_clues            = min(clues_solved, na.rm = TRUE),
    mean_clues           = round(mean(clues_solved, na.rm = TRUE), 1),
    median_clues         = median(clues_solved, na.rm = TRUE),
    max_clues            = max(clues_solved, na.rm = TRUE),
    sd_clues             = round(sd(clues_solved, na.rm = TRUE), 1),
    # time_minutes
    min_time_min         = min(time_minutes, na.rm = TRUE),
    mean_time_min        = round(mean(time_minutes, na.rm = TRUE), 1),
    median_time_min      = median(time_minutes, na.rm = TRUE),
    max_time_min         = max(time_minutes, na.rm = TRUE),
    sd_time_min          = round(sd(time_minutes, na.rm = TRUE), 1),
    .groups = "drop"
  ) |>
  arrange(desc(mean_treasure))

major_summary
# A tibble: 5 Γ— 17
  major          n min_treasure mean_treasure median_treasure max_treasure
  <fct>      <int>        <dbl>         <dbl>           <dbl>        <dbl>
1 OMIS           5           89          94                94           99
2 Marketing      5           80          83.2              83           88
3 Management     5           71          77.6              78           85
4 Accounting     5           74          76.8              77           79
5 Finance        5           58          71.8              69           87
# β„Ή 11 more variables: sd_treasure <dbl>, min_clues <dbl>, mean_clues <dbl>,
#   median_clues <dbl>, max_clues <dbl>, sd_clues <dbl>, min_time_min <dbl>,
#   mean_time_min <dbl>, median_time_min <dbl>, max_time_min <dbl>,
#   sd_time_min <dbl>

1.3 Distributions

Code
p1 <- treasure |>
  ggplot(aes(x = final_treasure)) +
  geom_histogram(fill = "#0b3954", colour = "white", bins = 12) +
  geom_vline(xintercept = mean(treasure$final_treasure),
             colour = "#d4a017", linewidth = 1, linetype = "dashed") +
  annotate("text", x = mean(treasure$final_treasure) + 2,
           y = 4.5, label = "mean", colour = "#d4a017", size = 3.5) +
  labs(title = "Final treasure score", x = "Score", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- treasure |>
  ggplot(aes(x = clues_solved)) +
  geom_histogram(fill = "#e05b3a", colour = "white", bins = 8) +
  labs(title = "Clues solved", x = "Clues", y = "Count") +
  theme_minimal(base_size = 12)

p1 + p2

Distribution of final treasure scores (left) and clues solved (right)

πŸ’¬ β€œThe dashed line is the mean. Is the distribution roughly symmetric or skewed? Skew affects which summary statistic is most meaningful.”

1.4 Score distribution by major

Code
treasure |>
  ggplot(aes(x = major, y = final_treasure, fill = major)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA, width = 0.5) +
  geom_jitter(aes(colour = major), width = 0.12, size = 2.5, alpha = 0.85) +
  scale_fill_viridis_d(option = "plasma", end = .82) +
  scale_colour_viridis_d(option = "plasma", end = .82) +
  labs(
    title  = "Treasure score distribution by major",
    x      = NULL, y = "Final treasure score"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Treasure score by major β€” dots show individual teams

πŸ’¬ β€œBox plots show spread and median. The jittered dots show every individual team. Ask: which major has the highest median? Which has the most variability?”

1.5 Ranked bar chart

Code
major_summary |>
  ggplot(aes(x = mean_treasure, y = major, fill = mean_treasure)) +
  geom_col(show.legend = FALSE, width = .65) +
  geom_text(aes(label = mean_treasure), hjust = -.2, size = 3.8, colour = "#2c2c2c") +
  geom_errorbarh(
    aes(xmin = mean_treasure - sd_treasure, xmax = mean_treasure + sd_treasure),
    height = .25, colour = "#0b3954", linewidth = .7
  ) +
  scale_fill_gradient(low = "#8ecae6", high = "#023047") +
  scale_x_continuous(limits = c(0, 108)) +
  labs(title = "Average treasure score Β± 1 SD", x = "Score", y = NULL) +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.y = element_blank())

Average final treasure score by major

1.6 Pair plot β€” all numeric variables

Code
treasure |>
  select(clues_solved, time_minutes, data_wrangle,
         model_score, presentation_energy, final_treasure) |>
  ggpairs(
    upper = list(continuous = wrap("cor", size = 3.5)),
    lower = list(continuous = wrap("points", alpha = 0.5, size = 1.5,
                                   colour = "#0b3954")),
    diag  = list(continuous = wrap("densityDiag", fill = "#d4a017", alpha = .4))
  ) +
  theme_minimal(base_size = 10)

Pair plot: correlations and scatterplots for all numeric features

πŸ’¬ β€œThe pair plot shows everything at once: histograms on the diagonal, scatter plots below, correlation coefficients above. The strongest predictors of final_treasure are immediately visible.”

πŸ΄β€β˜ οΈ Fragment #2: The practice of keeping code and conclusions together so neither ever drifts from the other is called reproducible analysis.


2 Stage 3 Β· Tidymodels β€” Fit + Diagnose

2.1 The tidymodels workflow

Every tidymodels pipeline follows the same chain:

split β†’ recipe β†’ model spec β†’ workflow β†’ fit β†’ predict β†’ evaluate β†’ diagnose

What makes it powerful is that every step is explicit and interchangeable.
Swap the model spec (linear β†’ random forest), and the rest of the pipeline stays identical.

Golden key unlocking treasure

2.2 Split

Code
set.seed(482)
split_obj  <- initial_split(treasure, prop = 0.75, strata = major)
train_data <- training(split_obj)
test_data  <- testing(split_obj)

cat("Training rows:", nrow(train_data), "| Test rows:", nrow(test_data))
Training rows: 15 | Test rows: 10

2.3 Recipe

Code
treasure_recipe <- recipe(
  final_treasure ~ clues_solved + time_minutes +
                   data_wrangle + model_score + presentation_energy,
  data = train_data
) |>
  step_normalize(all_numeric_predictors())

treasure_recipe

πŸ’¬ β€œThe recipe is a preprocessing blueprint. step_normalize ensures all predictors are on the same scale before fitting. The recipe has not β€˜done’ anything yet β€” it’s a plan.”

2.4 Model specification

Code
lm_spec <- linear_reg() |> set_engine("lm") |> set_mode("regression")
lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 

2.5 Workflow β€” assemble + fit

Code
lm_workflow_fitted <- workflow() |>
  add_recipe(treasure_recipe) |>
  add_model(lm_spec)|> 
  fit(train_data)

lm_workflow_fitted |> tidy() |> arrange(desc(abs(estimate)))
# A tibble: 6 Γ— 5
  term                estimate std.error statistic  p.value
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)           80.4       0.313   256.    1.06e-18
2 time_minutes          -4.49      1.51     -2.97  1.56e- 2
3 presentation_energy    2.80      0.614     4.55  1.38e- 3
4 clues_solved           2.20      1.86      1.18  2.68e- 1
5 model_score            1.93      1.04      1.87  9.49e- 2
6 data_wrangle           0.921     1.34      0.686 5.10e- 1

2.6 Test-set evaluation

Code
preds <- predict(lm_workflow_fitted, new_data= test_data) |>
  bind_cols(test_data) |>
  mutate(.resid = final_treasure - .pred)

metrics(preds, truth = final_treasure, estimate = .pred)
# A tibble: 3 Γ— 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.71 
2 rsq     standard       0.958
3 mae     standard       1.33 

πŸ’¬ β€œRMSE is in the same units as the target variable. RSQ tells us proportion of variance explained. Ask a student: is this a good model for a real hiring decision?”


2.7 Diagnostic charts

Good analysts do not stop at metrics β€” they look at the residuals.

Code
# 1 Β· Predicted vs Actual (Covered in Class)
p1 <- preds |>
  ggplot(aes(x = final_treasure, y = .pred)) +
  geom_point(aes(colour = major), size = 3, alpha = .85) +
  geom_abline(slope = 1, intercept = 0,
              colour = "#d4a017", linewidth = 1, linetype = "dashed") +
  scale_colour_viridis_d(option = "plasma", end = .82) +
  labs(title = "Predicted vs. Actual",
       x = "Actual score", y = "Predicted score",
       colour = "Major") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8))

# 2 Β· Residuals vs Fitted (Covered in Class)
p2 <- preds |>
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point(colour = "#0b3954", size = 2.5, alpha = .8) +
  geom_hline(yintercept = 0, colour = "#d4a017",
             linewidth = 1, linetype = "dashed") +
  geom_smooth(se = FALSE, colour = "#e05b3a",
              linewidth = .8, method = "loess", formula = y ~ x) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted values", y = "Residuals") +
  theme_minimal(base_size = 11)

# 3 Β· Q-Q plot of residuals (Not Covered in Class but Useful)
p3 <- preds |>
  ggplot(aes(sample = .resid)) +
  stat_qq(colour = "#0b3954", size = 2.5, alpha = .8) +
  stat_qq_line(colour = "#d4a017", linewidth = 1, linetype = "dashed") +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal(base_size = 11)

# 4 Β· Residuals distribution (Not Covered in Class but Useful)
p4 <- preds |>
  ggplot(aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)),
                 fill = "#0b3954", colour = "white", bins = 10, alpha = .8) +
  geom_density(colour = "#d4a017", linewidth = 1) +
  geom_vline(xintercept = 0, colour = "#e05b3a",
             linewidth = 1, linetype = "dashed") +
  labs(title = "Residual Distribution",
       x = "Residuals", y = "Density") +
  theme_minimal(base_size = 11)

(p1 + p2) / (p3 + p4)

Four diagnostic charts for the fitted linear model
How to read these four charts
Chart What a good model looks like Red flag
Predicted vs Actual Dots close to the dashed diagonal Systematic curve above/below diagonal
Residuals vs Fitted Dots scattered randomly around 0 U-shape or funnel (heteroscedasticity)
Q-Q Plot Dots follow the dashed line Heavy tails or S-curve pattern
Residual Distribution Bell-shaped, centred on 0 Skew or multimodality

2.8 Coefficient plot

Code
lm_workflow_fitted |> 
  tidy() |> 
  filter(term != "(Intercept)") |>
  ggplot(aes(x = estimate,
             y = fct_reorder(term, estimate))) +
  geom_col(aes(fill = estimate > 0),
           show.legend = FALSE, width = .65) +
  geom_errorbarh(
    aes(xmin = estimate - std.error * 1.96,
        xmax = estimate + std.error * 1.96),
    height = .25, colour = "#0b3954"
  ) +
  geom_vline(xintercept = 0, linewidth = .5, colour = "grey40") +
  scale_fill_manual(values = c("#e05b3a", "#0b3954")) +
  labs(
    title = "Coefficient estimates (standardised predictors)",
    subtitle = "Error bars = 95% CI",
    x = "Estimate", y = NULL
  ) +
  theme_minimal(base_size = 12)

Which skills are most predictive of treasure recovery (standardised)?

πŸ’¬ β€œAsk: which skill has the widest confidence interval? What does that tell us about certainty? A coefficient whose error bar crosses 0 is not reliably different from β€˜no effect’. This was not covered in class but it can be useful beyond the class.”


2.9 Reflection prompts

  1. Which variable appears most predictive of treasure score?
  2. Do the residual plots suggest the linear reg model is appropriate for this data?
  3. How would you explain RMSE or MAE to a manager who does not know statistics?
  4. What would you change before showing this analysis to a client?

Continue to πŸ“½οΈ Stage 4: Slides

Pirate coding gif