Visualizing Correlations Recitation Solutions

Week 8

Author

Jessica Cooperstone

Introduction

We will be using some data collection from the National Health and Nutrition Examination Survey which collects data to assess the health and nutritional status of people in the United States. The data from 2009-2012 has been compiled in an R package called NHANES.

# install.packages("NHANES")
library(NHANES)

# functionality and correlation packages
library(tidyverse)
library(corrplot)
library(ggcorrplot)
library(GGally)
library(Hmisc)
library(reshape2)
library(scales)

knitr::kable(head(NHANES))
ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work Weight Length HeadCirc Height BMI BMICatUnder20yrs BMI_WHO Pulse BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3 Testosterone DirectChol TotChol UrineVol1 UrineFlow1 UrineVol2 UrineFlow2 Diabetes DiabetesAge HealthGen DaysPhysHlthBad DaysMentHlthBad LittleInterest Depressed nPregnancies nBabies Age1stBaby SleepHrsNight SleepTrouble PhysActive PhysActiveDays TVHrsDay CompHrsDay TVHrsDayChild CompHrsDayChild Alcohol12PlusYr AlcoholDay AlcoholYear SmokeNow Smoke100 Smoke100n SmokeAge Marijuana AgeFirstMarij RegularMarij AgeRegMarij HardDrugs SexEver SexAge SexNumPartnLife SexNumPartYear SameSex SexOrientation PregnantNow
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51624 2009_10 male 34 30-39 409 White NA High School Married 25000-34999 30000 1.36 6 Own NotWorking 87.4 NA NA 164.7 32.22 NA 30.0_plus 70 113 85 114 88 114 88 112 82 NA 1.29 3.49 352 NA NA NA No NA Good 0 15 Most Several NA NA NA 4 Yes No NA NA NA NA NA Yes NA 0 No Yes Smoker 18 Yes 17 No NA Yes Yes 16 8 1 No Heterosexual NA
51625 2009_10 male 4 0-9 49 Other NA NA NA 20000-24999 22500 1.07 9 Own NA 17.0 NA NA 105.4 15.30 NA 12.0_18.5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
51630 2009_10 female 49 40-49 596 White NA Some College LivePartner 35000-44999 40000 1.91 5 Rent NotWorking 86.7 NA NA 168.4 30.57 NA 30.0_plus 86 112 75 118 82 108 74 116 76 NA 1.16 6.70 77 0.094 NA NA No NA Good 0 10 Several Several 2 2 27 8 Yes No NA NA NA NA NA Yes 2 20 Yes Yes Smoker 38 Yes 18 No NA Yes Yes 12 10 1 Yes Heterosexual NA
51638 2009_10 male 9 0-9 115 White NA NA NA 75000-99999 87500 1.84 6 Rent NA 29.8 NA NA 133.1 16.82 NA 12.0_18.5 82 86 47 84 50 84 50 88 44 NA 1.34 4.86 123 1.538 NA NA No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

If you wanted to make a correlation plot for all variables below.

NHANES_trimmed <- NHANES %>%
  select(Age, BMI, Pulse, starts_with("BP"), TotChol) %>%
  drop_na()

NHANES_cor <- cor(NHANES_trimmed)

1. How correlated are different measures of blood pressure?

In the NHANES dataset, there are 3 measurements for each systolic (the first/top number) and diastolic blood (the second/bottom number) pressure. How reproducible is each type of blood pressure measurement over the 3 samplings? Make visualizations to convey your findings.

Wrangling, creating two dataframes

  1. Includes the 4 measures for systolic BP BPSysAve, BPSys1, BPSys2, BPSys3
  2. Includes the 4 measures for diastolic BP BPDiaAve, BPDia1, BPDia2, BPDia3
# create df with all of the BP measurements
# remove missing values
NHANES_BP <- NHANES %>%
  select(starts_with("BP")) %>%
  drop_na()

# create df with all systolic data
NHANES_systolic <- NHANES_BP %>%
  select(contains("Sys"))

# create df with all diastolic data
NHANES_diastolic <- NHANES_BP %>%
  select(contains("Dia"))

Looking at relationships using scatteplots

We can look quickly at the relationship betwen all the diastolic BP measurements, and all of the systolic BP measurements using ggpairs().

NHANES_diastolic %>%
  ggpairs(title = "Diastolic Blood Pressure Relationships")

From the diastolic data, we can see some values that are zero. Those are biologically implausible so I am going to elect to remove those observations.

NHANES_diastolic_no0 <- NHANES_diastolic %>%
  filter(BPDiaAve > 0 & BPDia1 > 0 & BPDia2 > 0 & BPDia3 > 0)

# how many observations are there?
nrow(NHANES_diastolic)
[1] 7971
# how many obesrvations after removing zero diastolic
nrow(NHANES_diastolic_no0)
[1] 7803
# how many observations did we remove?
nrow(NHANES_diastolic) - nrow(NHANES_diastolic_no0)
[1] 168

Try again now that we’ve removed diastolic BP values that are zero.

NHANES_diastolic_no0 %>%
  ggpairs(title = "Diastolic Blood Pressure Relationships")

This looks better.

NHANES_systolic %>%
  ggpairs(title = "Systolic Blood Pressure Relationships")

Run correlation analysis with cor() and rcorr()

# run systolic correlation analysis
NHANES_sys_cor <- cor(NHANES_systolic)

# could also use rcorr()
NHANES_sys_rcorr <- rcorr(as.matrix(NHANES_systolic))

# run diastolic correlation analysis
NHANES_dia_cor <- cor(NHANES_diastolic_no0)

# could also use rcorr()
NHANES_dia_rcorr <- rcorr(as.matrix(NHANES_diastolic_no0))

Prepare to plot with corrplot()

# create a vector of the systolic names for labeling
sys_labels <- c("Systolic BP, Average",
                "Systolic BP 1",
                "Systolic BP 2",
                "Systolic BP 3")

dia_labels <- c("Diastolic BP, Average",
                "Diastolic BP 1",
                "Diastolic BP 2",
                "Diastolic BP 3")

# change row and column names of the correlation matrix
# so they are how we want them to be plotted
colnames(NHANES_sys_rcorr$r) <- sys_labels
rownames(NHANES_sys_rcorr$r) <- sys_labels
colnames(NHANES_dia_rcorr$r) <- dia_labels
rownames(NHANES_dia_rcorr$r) <- dia_labels

# change row and column names of the pvalue matrix
# so they are how we want them to be plotted
colnames(NHANES_sys_rcorr$P) <- sys_labels
rownames(NHANES_sys_rcorr$P) <- sys_labels
colnames(NHANES_dia_rcorr$P) <- dia_labels
rownames(NHANES_dia_rcorr$P) <- dia_labels

Plot with corrplot().

corrplot(NHANES_sys_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat  = NHANES_sys_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "white", # display correlations in black
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 1.0, # size of correlation font
         col = colorRampPalette(c("#d8b365", "#f5f5f5", "#5ab4ac"))(100)) # change colors to be colorblind friendly

corrplot(NHANES_dia_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat  = NHANES_dia_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "white", # display correlations in black
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 1.0, # size of correlation font
         col = colorRampPalette(c("#d8b365", "#f5f5f5", "#5ab4ac"))(100)) # change colors to be colorblind friendly

Plot with ggcorrplot()

ggcorrplot(NHANES_sys_cor)

ggcorrplot(NHANES_dia_cor)

All are so highly correlated just looks red.

Can try adjusting the scale.

ggcorrplot(NHANES_sys_cor) +
  scale_fill_gradient2(limit = c(0.8,1), # set limits for corr range
                       low = "#e9a3c9", mid = "#f7f7f7", high =  "#a1d76a", # pick colors
                       midpoint = 0.9) + # set midpoint
  scale_x_discrete(labels = sys_labels) + # change x-axis labels
  scale_y_discrete(labels = sys_labels) + # change y-axis labels
  labs(fill = "Correlation \ncoefficient",
       title = "Correlations between measurements of systolic \nblood pressure in NHANES data")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

ggcorrplot(NHANES_dia_cor) +
  scale_fill_gradient2(limit = c(0.8,1), # set limits for corr range
                       low = "#e9a3c9", mid = "#f7f7f7", high =  "#a1d76a", # pick colors
                       midpoint = 0.9) + # set midpoint
  scale_x_discrete(labels = dia_labels) + # change x-axis labels
  scale_y_discrete(labels = dia_labels) + # change y-axis labels
  labs(fill = "Correlation \ncoefficient",
       title = "Correlations between measurements of diastolic \nblood pressure in NHANES data")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

Prepare to plot with melt() and ggplot()

Create a lower triangle object to plot.

# "save as"
sys_lower <- NHANES_sys_cor
dia_lower <- NHANES_dia_cor

# use function upper.tri() and set the upper triangle all to NA
# then we can keep only the lower triangle
sys_lower[upper.tri(sys_lower)] <- NA
dia_lower[upper.tri(dia_lower)] <- NA

# melt to go back to long format
melted_sys_lower <- melt(sys_lower, na.rm = TRUE)
melted_dia_lower <- melt(dia_lower, na.rm = TRUE)

# did it work?
head(melted_sys_lower) 
      Var1     Var2     value
1 BPSysAve BPSysAve 1.0000000
2   BPSys1 BPSysAve 0.9526899
3   BPSys2 BPSysAve 0.9881045
4   BPSys3 BPSysAve 0.9876269
6   BPSys1   BPSys1 1.0000000
7   BPSys2   BPSys1 0.9468706
head(melted_dia_lower) 
      Var1     Var2     value
1 BPDiaAve BPDiaAve 1.0000000
2   BPDia1 BPDiaAve 0.9091983
3   BPDia2 BPDiaAve 0.9769738
4   BPDia3 BPDiaAve 0.9770299
6   BPDia1   BPDia1 1.0000000
7   BPDia2   BPDia1 0.8979354

Plot systolic

# create a vector of the systolic names for labeling
sys_labels <- c("Systolic BP, Average",
                "Systolic BP 1",
                "Systolic BP 2",
                "Systolic BP 3")

melted_sys_lower %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "#ef8a62",
                       mid = "#f7f7f7",
                       high = "#67a9cf",
                       limits = c(-1, 1)) +
  scale_x_discrete(labels = sys_labels) +
  scale_y_discrete(labels = sys_labels) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.justification = c(1, 0),
        legend.position = c(0.5, 0.7),
        legend.direction = "horizontal") +
  labs(fill = "Correlation \ncoefficient",
       x = "",
       y = "",
       title = "Correlation measures of systolic blood pressure at 3 times",
       subtitle = "Data collected from NHANES 2009-2012",
       caption = "Number presents correlation coefficient \nAll correlations are statistically significant (p < 0.05)")

Plot diastolic

# create a vector of the systolic names for labeling
dia_labels <- c("Diastolic BP, Average",
                "Diastolic BP 1",
                "Diastolic BP 2",
                "Diastolic BP 3")

melted_dia_lower %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "#f1a340",
                       mid = "#f7f7f7",
                       high = "#998ec3",
                       limits = c(-1, 1)) +
  scale_x_discrete(labels = dia_labels) +
  scale_y_discrete(labels = dia_labels) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.justification = c(1, 0),
        legend.position = c(0.5, 0.7),
        legend.direction = "horizontal") +
  labs(fill = "Correlation \ncoefficient",
       x = "",
       y ="",
       title = "Correlation measures of diastolic blood pressure at 3 times",
       subtitle = "Data collected from NHANES 2009-2012",
       caption = "Number presents correlation coefficient \nAll correlations are statistically significant (p < 0.05)")

2. How correlated are different physical measurements, health, and lifestyle variables?

In the NHANES dataset, there are data for subject BMI, Pulse, BPSysAve, BPDiaAve, TotalChol.

Create a series of plots to show the relationship between these variables with each other.

Wrangle

Create a dataframe that includes only the variables we want to correlate, and drop the observations with missing values.

nhanes_trimmed <- NHANES %>%
  select(BMI, Pulse, BPSysAve, BPDiaAve, TotChol) %>%
  drop_na()

Visualize with ggpairs()

Here, we don’t have to specify columns since we’re using them all.

nhanes_trimmed %>%
    ggpairs(aes(alpha = 0.01), # note alpha inside aes which is weird idk why
            lower = list(continuous = "smooth"),
            columnLabels = c("BMI", "Pulse", "Systolic BP", "Diastolic BP", "Total Cholesterol"))

Create a correlation plot with corrplot()

First we will make our trimmed df a matrix.

# convert into a matrix as this is what corrplot takes
nhanes_trimmed_matrix <- nhanes_trimmed %>%
  as.matrix() 

nhanes_rcorr <- rcorr(nhanes_trimmed_matrix, type = "pearson")

# correlation matrix
nhanes_rcorr$r
                BMI         Pulse    BPSysAve   BPDiaAve       TotChol
BMI      1.00000000  1.514764e-02  0.24526330 0.23705522  1.213842e-01
Pulse    0.01514764  1.000000e+00 -0.09670785 0.01217557 -9.486773e-05
BPSysAve 0.24526330 -9.670785e-02  1.00000000 0.40476431  2.174280e-01
BPDiaAve 0.23705522  1.217557e-02  0.40476431 1.00000000  2.620035e-01
TotChol  0.12138419 -9.486773e-05  0.21742801 0.26200353  1.000000e+00
# pvalue matrix
nhanes_rcorr$P
               BMI     Pulse BPSysAve  BPDiaAve  TotChol
BMI             NA 0.1772467        0 0.0000000 0.000000
Pulse    0.1772467        NA        0 0.2781342 0.993258
BPSysAve 0.0000000 0.0000000       NA 0.0000000 0.000000
BPDiaAve 0.0000000 0.2781342        0        NA 0.000000
TotChol  0.0000000 0.9932580        0 0.0000000       NA

Wrangle labels

# create a vector of how i want the labels to look
nhanes_labels <- c("BMI",
                   "Pulse",
                   "Systolic \nBlood Pressure",
                   "Diastolic \nBlood Pressure",
                   "Total Cholesterol")

# change row and column names of the correlation matrix
# so they are how we want them to be plotted
colnames(nhanes_rcorr$r) <- nhanes_labels
rownames(nhanes_rcorr$r) <- nhanes_labels

# change row and column names of the pvalue matrix
# so they are how we want them to be plotted
colnames(nhanes_rcorr$P) <- nhanes_labels
rownames(nhanes_rcorr$P) <- nhanes_labels

Make the correlation plot. The numbers are the correlation coefficients for relationships that are significant based on our criteria.

corrplot(nhanes_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat  = nhanes_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "black", # display correlations in black
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 0.6) # size of correlation font

Back to top