Understanding Data Distributions Recitation Solutions ☕

Week 7

Author

Jessica Cooperstone

Introduction

Today you will be investigating some data from the Coffee Quality Database on coffee evaluation by the Coffee Quality Institute. You will look at the distribution of coffee evaluation scores across different characteristics, and include various metadata.

More information can be found on the Tidy Tuesday github repo on coffee ratings.

Read in data

tuesdata <- tidytuesdayR::tt_load('2020-07-07')

--- Compiling #TidyTuesday Information for 2020-07-07 ----

--- There is 1 file available ---

--- Starting Download ---


    Downloading file 1 of 1: `coffee_ratings.csv`

--- Download complete ---

coffee_ratings <- tuesdata$coffee_ratings

Load libraries

library(tidyverse) # for everything
library(ggridges) # for ridgeline plots
library(ggdist) # for nice dotplots

Total cupping score in Arabica and Robusta

Make 3 different visualizations that shows the distribution of total cupping score (i.e. total_cup_points) across Arabica and Robusta beans. Make the plots so you think they look good.

A histogram

Since there are so few robusta observations, I decided to make the y-axes on different scales

coffee_ratings %>%
  ggplot(aes(x = total_cup_points, fill = species)) +
  geom_histogram(bins = 200) +
  geom_vline(aes(xintercept = mean(total_cup_points)), color = "black") +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  coord_cartesian(xlim = c(60,100)) + # change x-axis limits
  facet_wrap(vars(species),
             nrow = 2, # make two rows so can align histograms top to bottom
             scales = "free_y",
             strip.position = "top") +
  theme_classic() +
  theme(legend.position = "none") +
  labs(x = "Total cupping score (out of 100)",
       y = "",
       fill = "Species",
       title = "Distribution of cupping scores across 14,000 coffee samples",
       subtitle = "Note the y-axes are different between plots",
       caption = "Vertical line represents the median total cupping score across all samples")

Density plot

I might like this a bit better than a histogram.

coffee_ratings %>%
  ggplot(aes(x = total_cup_points, fill = species)) +
  geom_density() +
  geom_vline(aes(xintercept = mean(total_cup_points)), color = "black") +
  coord_cartesian(xlim = c(65,92)) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  facet_wrap(vars(species),
             nrow = 2, # make two rows so can align histograms top to bottom
             scales = "free_y") +
  theme_classic() +
  theme(legend.position = "none") +
  labs(x = "Total cupping score (out of 100)",
       y = "",
       fill = "Species",
       title = "Distribution of cupping scores across 14,000 coffee samples",
       caption = "Vertical line represents the median total cupping score across all samples")

Dot plot

You can also see here how many fewer robusta observations there are.

# dot plot
coffee_ratings %>%
  ggplot(aes(x = total_cup_points, color = species, fill = species)) +
  geom_dots() +
  scale_color_manual(values = c("#C19A6B", "#5C4033")) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  theme_ggdist() +
  xlim(50,100) +
  theme(legend.position = c(.18, .99),
        legend.justification = c("right", "top"),
        legend.box.just = "right",
        legend.box.background = element_rect(size = 0.5),
        legend.box.margin = margin(5, 5, 5, 5)) +
  labs(x = "Total cupping score (out of 100)",
       y = "",
       color = "Species",
       fill = "Species")

Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

Warning: Removed 1 rows containing missing values (`geom_dotsinterval()`).

Ridgeline plot

# ridgeline plot
coffee_ratings %>%
  ggplot(aes(x = total_cup_points, y = species, fill = species)) +
  stat_density_ridges(quantile_lines = TRUE,
                      quantiles = 2,
                      alpha = 0.5) +
  xlim(55, 100) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  theme_ggdist() +
  theme(legend.position = "none") +
  labs(x = "Total cupping score (out of 100)",
       y = "",
       title = "Distribution of cupping scores across 14,000 coffee samples",
       subtitle = "The brown color theme is very coffee-esque")

Picking joint bandwidth of 0.605

Warning: Removed 1 rows containing non-finite values (`stat_density_ridges()`).

Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.

Individual characteristic cupping scores in Arabica and Robusta

Make 3 different visualizations that show the distribution of all the individual contributors (i.e., aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean_cup, sweetness, cupper_points) to total cupping score across Arabica and Robusta in one plot.

Wrangling:

# go from wide (each coffee attribute in a separate column)
# to long data (1 column w/ all characteristics, 1 column w/ all ratings)
coffee_ratings_tidy <- coffee_ratings %>%
  pivot_longer(cols = aroma:cupper_points, # columns from aroma to cupper_points
               names_to = "characteristic",
               values_to = "rating")

Prepare to clean up facet strip text using the function labeller().

# getting labels ready for plotting
# what are the coffee characteristics again?
(coffee_characteristics <- unique(coffee_ratings_tidy$characteristic) %>%
  sort()) # sort alphabetically, arrange won't work here bc not numeric

 [1] "acidity"       "aftertaste"    "aroma"         "balance"      
 [5] "body"          "clean_cup"     "cupper_points" "flavor"       
 [9] "sweetness"     "uniformity"

# create a vector of the coffee characteristic names as i want them to appear on the plot
coffee_labels <- c("Acidity",
                   "Aftertaste",
                   "Aroma",
                   "Balance",
                   "Body",
                   "Clean Cup",
                   "Cupper Points",
                   "Flavor",
                   "Sweetness",
                   "Uniformity")

# tell coffee_labels which original label to refer to
# these need to be in the same order (which is why i used sort())
names(coffee_labels) <- coffee_characteristics

Managing fonts:

# get fonts not default available in R
library(sysfonts) # aux packagew here fonts live
library(showtext) # package that helps use non-standard fonts

Loading required package: showtextdb

library(ragg)

# add the font Atkison Hyperlegible bc i like it
font_add_google("Atkinson Hyperlegible")

# what fonts do i have to choose from?
# remove head() to see them all
head(font_info_google())

         family   category num_variants                    variants num_subsets
1       ABeeZee sans-serif            2             regular, italic           1
2          Abel sans-serif            1                     regular           1
3  Abhaya Libre      serif            5 regular, 500, 600, 700, 800           3
4 Abril Fatface    display            1                     regular           2
5      Aclonica sans-serif            1                     regular           1
6          Acme sans-serif            1                     regular           1
                    subsets version lastModified
1                     latin     v20   2022-01-27
2                     latin     v12   2020-09-10
3 latin, latin-ext, sinhala     v11   2022-01-25
4          latin, latin-ext     v18   2022-01-27
5                     latin     v16   2022-01-25
6                     latin     v17   2022-01-27

# use to indicate that showtext is needed 
showtext_auto()

Boxplots

This is just ok.

coffee_ratings_tidy %>%
  ggplot(aes(x = species, y = rating, fill = species)) +
  geom_boxplot(color = "black", alpha = 0.8) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  coord_cartesian(ylim = c(5,10)) +
  facet_wrap(vars(characteristic),
             labeller = labeller(characteristic = coffee_labels), # all that work we did earlier
             nrow = 2) + # 2 rows in the faceted plot
  theme_classic() + 
  theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible")) + # changing font
  labs(x = "Coffee Species",
       y = "Cupper rating (out of 10)",
       title = "Distribution of cupper scores for individual coffee attributes",
       caption = "Line represents the median rating per species")

Violin plot

# violin plot
coffee_ratings_tidy %>%
  ggplot(aes(x = species, y = rating, fill = species)) +
  geom_violin(draw_quantiles = 0.5, color = "black", alpha = 0.8) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  coord_cartesian(ylim = c(5,10)) +
  facet_wrap(vars(characteristic),
             labeller = labeller(characteristic = coffee_labels), # all that work we did earlier
             nrow = 2) + # 2 rows in the faceted plot
  theme_classic() + 
  theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible")) + # changing font
  labs(x = "Coffee Species",
       y = "Cupper rating (out of 10)",
       title = "Distribution of cupper scores for individual coffee attributes",
       caption = "Line represents the median rating per species")

Dot plots

This is just ok.

coffee_ratings_tidy %>%
  ggplot(aes(x = species, y = rating, color = species)) +
  geom_dots(side = "both", layout = "swarm") +
  scale_color_manual(values = c("#C19A6B", "#5C4033")) +
  coord_cartesian(ylim = c(6,10)) +
  facet_wrap(vars(characteristic),
             labeller = labeller(characteristic = coffee_labels),
             nrow = 2) +
  theme_classic() + 
  theme(legend.position = "none",
        text = element_text(family = "Atkinson Hyperlegible")) + # changing font
  labs(x = "Cupper rating (out of 10)",
       y = "Coffee Species",
       title = "Distribution of cupper scores for individual coffee attributes")

Density plot

This one I think is my favorite.

coffee_ratings_tidy %>%
  ggplot(aes(x = rating, y = species, fill = species)) +
  geom_density_ridges(alpha = 0.8,
                      quantile_lines = TRUE,
                      quantiles = 2) +
  scale_fill_manual(values = c("#C19A6B", "#5C4033")) +
  coord_cartesian(xlim = c(6,10)) +
  facet_wrap(vars(characteristic),
             labeller = labeller(characteristic = coffee_labels),
             nrow = 2) +
  theme_classic() + 
  theme(legend.position = "none") +
  labs(x = "Cupper rating (out of 10)",
       y = "Coffee Species",
       title = "Distribution of cupper scores for individual coffee attributes")