We will be using data from the National Health and Nutrition Examination Survey (NHANES), which collects data to assess the health and nutritional status of people in the United States. The data from 2009-2012 has been compiled in an R package called NHANES.
# install.packages("NHANES")
library(NHANES)

# functionality and correlation packages
library(tidyverse)
library(corrplot)
library(ggcorrplot)
library(GGally)
library(Hmisc)
library(reshape2)
library(scales)

knitr::kable(head(NHANES))
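First 6 rows of NHANES (transposed: one line per variable, one column per row of data)

| Variable | Row 1 | Row 2 | Row 3 | Row 4 | Row 5 | Row 6 |
|---|---|---|---|---|---|---|
| ID | 51624 | 51624 | 51624 | 51625 | 51630 | 51638 |
| SurveyYr | 2009_10 | 2009_10 | 2009_10 | 2009_10 | 2009_10 | 2009_10 |
| Gender | male | male | male | male | female | male |
| Age | 34 | 34 | 34 | 4 | 49 | 9 |
| AgeDecade | 30-39 | 30-39 | 30-39 | 0-9 | 40-49 | 0-9 |
| AgeMonths | 409 | 409 | 409 | 49 | 596 | 115 |
| Race1 | White | White | White | Other | White | White |
| Race3 | NA | NA | NA | NA | NA | NA |
| Education | High School | High School | High School | NA | Some College | NA |
| MaritalStatus | Married | Married | Married | NA | LivePartner | NA |
| HHIncome | 25000-34999 | 25000-34999 | 25000-34999 | 20000-24999 | 35000-44999 | 75000-99999 |
| HHIncomeMid | 30000 | 30000 | 30000 | 22500 | 40000 | 87500 |
| Poverty | 1.36 | 1.36 | 1.36 | 1.07 | 1.91 | 1.84 |
| HomeRooms | 6 | 6 | 6 | 9 | 5 | 6 |
| HomeOwn | Own | Own | Own | Own | Rent | Rent |
| Work | NotWorking | NotWorking | NotWorking | NA | NotWorking | NA |
| Weight | 87.4 | 87.4 | 87.4 | 17.0 | 86.7 | 29.8 |
| Length | NA | NA | NA | NA | NA | NA |
| HeadCirc | NA | NA | NA | NA | NA | NA |
| Height | 164.7 | 164.7 | 164.7 | 105.4 | 168.4 | 133.1 |
| BMI | 32.22 | 32.22 | 32.22 | 15.30 | 30.57 | 16.82 |
| BMICatUnder20yrs | NA | NA | NA | NA | NA | NA |
| BMI_WHO | 30.0_plus | 30.0_plus | 30.0_plus | 12.0_18.5 | 30.0_plus | 12.0_18.5 |
| Pulse | 70 | 70 | 70 | NA | 86 | 82 |
| BPSysAve | 113 | 113 | 113 | NA | 112 | 86 |
| BPDiaAve | 85 | 85 | 85 | NA | 75 | 47 |
| BPSys1 | 114 | 114 | 114 | NA | 118 | 84 |
| BPDia1 | 88 | 88 | 88 | NA | 82 | 50 |
| BPSys2 | 114 | 114 | 114 | NA | 108 | 84 |
| BPDia2 | 88 | 88 | 88 | NA | 74 | 50 |
| BPSys3 | 112 | 112 | 112 | NA | 116 | 88 |
| BPDia3 | 82 | 82 | 82 | NA | 76 | 44 |
| Testosterone | NA | NA | NA | NA | NA | NA |
| DirectChol | 1.29 | 1.29 | 1.29 | NA | 1.16 | 1.34 |
| TotChol | 3.49 | 3.49 | 3.49 | NA | 6.70 | 4.86 |
| UrineVol1 | 352 | 352 | 352 | NA | 77 | 123 |
| UrineFlow1 | NA | NA | NA | NA | 0.094 | 1.538 |
| UrineVol2 | NA | NA | NA | NA | NA | NA |
| UrineFlow2 | NA | NA | NA | NA | NA | NA |
| Diabetes | No | No | No | No | No | No |
| DiabetesAge | NA | NA | NA | NA | NA | NA |
| HealthGen | Good | Good | Good | NA | Good | NA |
| DaysPhysHlthBad | 0 | 0 | 0 | NA | 0 | NA |
| DaysMentHlthBad | 15 | 15 | 15 | NA | 10 | NA |
| LittleInterest | Most | Most | Most | NA | Several | NA |
| Depressed | Several | Several | Several | NA | Several | NA |
| nPregnancies | NA | NA | NA | NA | 2 | NA |
| nBabies | NA | NA | NA | NA | 2 | NA |
| Age1stBaby | NA | NA | NA | NA | 27 | NA |
| SleepHrsNight | 4 | 4 | 4 | NA | 8 | NA |
| SleepTrouble | Yes | Yes | Yes | NA | Yes | NA |
| PhysActive | No | No | No | NA | No | NA |
| PhysActiveDays | NA | NA | NA | NA | NA | NA |
| TVHrsDay | NA | NA | NA | NA | NA | NA |
| CompHrsDay | NA | NA | NA | NA | NA | NA |
| TVHrsDayChild | NA | NA | NA | 4 | NA | 5 |
| CompHrsDayChild | NA | NA | NA | 1 | NA | 0 |
| Alcohol12PlusYr | Yes | Yes | Yes | NA | Yes | NA |
| AlcoholDay | NA | NA | NA | NA | 2 | NA |
| AlcoholYear | 0 | 0 | 0 | NA | 20 | NA |
| SmokeNow | No | No | No | NA | Yes | NA |
| Smoke100 | Yes | Yes | Yes | NA | Yes | NA |
| Smoke100n | Smoker | Smoker | Smoker | NA | Smoker | NA |
| SmokeAge | 18 | 18 | 18 | NA | 38 | NA |
| Marijuana | Yes | Yes | Yes | NA | Yes | NA |
| AgeFirstMarij | 17 | 17 | 17 | NA | 18 | NA |
| RegularMarij | No | No | No | NA | No | NA |
| AgeRegMarij | NA | NA | NA | NA | NA | NA |
| HardDrugs | Yes | Yes | Yes | NA | Yes | NA |
| SexEver | Yes | Yes | Yes | NA | Yes | NA |
| SexAge | 16 | 16 | 16 | NA | 12 | NA |
| SexNumPartnLife | 8 | 8 | 8 | NA | 10 | NA |
| SexNumPartYear | 1 | 1 | 1 | NA | 1 | NA |
| SameSex | No | No | No | NA | Yes | NA |
| SexOrientation | Heterosexual | Heterosexual | Heterosexual | NA | Heterosexual | NA |
| PregnantNow | NA | NA | NA | NA | NA | NA |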
If you want to make a correlation plot across a set of variables, the examples below show how.
1. How correlated are different measures of blood pressure?
In the NHANES dataset, there are 3 measurements each of systolic (the first/top number) and diastolic (the second/bottom number) blood pressure. How reproducible is each type of blood pressure measurement over the 3 samplings? Make visualizations to convey your findings.
Wrangling, creating two dataframes
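The first includes the 4 measures for systolic BP (the average and the 3 individual readings): BPSysAve, BPSys1, BPSys2, BPSys3
The second includes the 4 measures for diastolic BP (the average and the 3 individual readings): BPDiaAve, BPDia1, BPDia2, BPDia3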
# create df with all of the BP measurements
# remove missing values
NHANES_BP <- NHANES %>%
  select(starts_with("BP")) %>%
  drop_na()

# create df with all systolic data
NHANES_systolic <- NHANES_BP %>%
  select(contains("Sys"))

# create df with all diastolic data
NHANES_diastolic <- NHANES_BP %>%
  select(contains("Dia"))
Looking at relationships using scatterplots
We can look quickly at the relationships among all of the diastolic BP measurements, and among all of the systolic BP measurements, using ggpairs().
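As a rough illustration of what ggpairs() produces, here is a minimal sketch on simulated readings rather than the NHANES data. The data frame sim_dia and its values are made up for this example. The later code refers to NHANES_diastolic_no0, which is not created in the code shown here; its name suggests that rows with a diastolic reading of 0 were dropped after looking at the scatterplots, and the filter below is one assumed way to do that, not necessarily what was done.

```r
library(GGally)

set.seed(1)

# simulated stand-in for three repeated diastolic readings (mmHg)
true_dia <- rnorm(200, mean = 70, sd = 10)
sim_dia <- data.frame(
  BPDia1 = true_dia + rnorm(200, sd = 4),
  BPDia2 = true_dia + rnorm(200, sd = 4),
  BPDia3 = true_dia + rnorm(200, sd = 4)
)

# plant one reading of 0 so that it stands out in the scatterplots
sim_dia$BPDia2[1] <- 0

# scatterplot matrix: one panel for every pair of columns
ggpairs(sim_dia)

# assumption: drop any row in which at least one reading is exactly 0
sim_dia_no0 <- sim_dia[rowSums(sim_dia == 0) == 0, ]

ggpairs(sim_dia_no0)
```

With the data frames built above, the equivalent calls would be ggpairs(NHANES_systolic) and ggpairs(NHANES_diastolic).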
# run systolic correlation analysis
NHANES_sys_cor <- cor(NHANES_systolic)

# could also use rcorr()
NHANES_sys_rcorr <- rcorr(as.matrix(NHANES_systolic))

# run diastolic correlation analysis
# note: NHANES_diastolic_no0 is not created in the code shown above;
# the name suggests NHANES_diastolic with readings of 0 removed
NHANES_dia_cor <- cor(NHANES_diastolic_no0)

# could also use rcorr()
NHANES_dia_rcorr <- rcorr(as.matrix(NHANES_diastolic_no0))
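To see how the two functions differ, here is a small sketch on simulated data (the matrix sim and its columns a, b and c are made up for illustration). cor() returns only the matrix of coefficients, while rcorr() returns a list that also holds the number of observations and the p-values, which is why the plotting code later pulls out the r and P elements.

```r
library(Hmisc)

set.seed(42)

# a and b are strongly related; c is independent noise
x <- rnorm(100)
sim <- cbind(a = x,
             b = x + rnorm(100, sd = 0.5),
             c = rnorm(100))

# base R: a plain matrix of Pearson correlation coefficients
sim_cor <- cor(sim)

# Hmisc: a list whose elements include r (coefficients),
# n (observations per pair) and P (p-values)
sim_rcorr <- rcorr(sim)

sim_cor
sim_rcorr$r   # the same coefficients as cor()
sim_rcorr$n
sim_rcorr$P   # the diagonal is NA: no test of a variable against itself
```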
Prepare to plot with corrplot()
# create a vector of the systolic names for labeling
sys_labels <- c("Systolic BP, Average",
                "Systolic BP 1",
                "Systolic BP 2",
                "Systolic BP 3")

dia_labels <- c("Diastolic BP, Average",
                "Diastolic BP 1",
                "Diastolic BP 2",
                "Diastolic BP 3")

# change row and column names of the correlation matrix
# so they are how we want them to be plotted
colnames(NHANES_sys_rcorr$r) <- sys_labels
rownames(NHANES_sys_rcorr$r) <- sys_labels
colnames(NHANES_dia_rcorr$r) <- dia_labels
rownames(NHANES_dia_rcorr$r) <- dia_labels

# change row and column names of the pvalue matrix
# so they are how we want them to be plotted
colnames(NHANES_sys_rcorr$P) <- sys_labels
rownames(NHANES_sys_rcorr$P) <- sys_labels
colnames(NHANES_dia_rcorr$P) <- dia_labels
rownames(NHANES_dia_rcorr$P) <- dia_labels
Plot with corrplot().
corrplot(NHANES_sys_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat = NHANES_sys_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "white", # display correlations in white
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 1.0, # size of correlation font
         col = colorRampPalette(c("#d8b365", "#f5f5f5", "#5ab4ac"))(100)) # change colors to be colorblind friendly

corrplot(NHANES_dia_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat = NHANES_dia_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "white", # display correlations in white
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 1.0, # size of correlation font
         col = colorRampPalette(c("#d8b365", "#f5f5f5", "#5ab4ac"))(100)) # change colors to be colorblind friendly
Plot with ggcorrplot()
ggcorrplot(NHANES_sys_cor)
ggcorrplot(NHANES_dia_cor)
All of the measurements are so highly correlated that the plot just looks red.
We can try adjusting the scale.
ggcorrplot(NHANES_sys_cor) +
  scale_fill_gradient2(limits = c(0.8, 1), # set limits for corr range
                       low = "#e9a3c9", mid = "#f7f7f7", high = "#a1d76a", # pick colors
                       midpoint = 0.9) + # set midpoint
  scale_x_discrete(labels = sys_labels) + # change x-axis labels
  scale_y_discrete(labels = sys_labels) + # change y-axis labels
  labs(fill = "Correlation \ncoefficient",
       title = "Correlations between measurements of systolic \nblood pressure in NHANES data")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ggcorrplot(NHANES_dia_cor) +
  scale_fill_gradient2(limits = c(0.8, 1), # set limits for corr range
                       low = "#e9a3c9", mid = "#f7f7f7", high = "#a1d76a", # pick colors
                       midpoint = 0.9) + # set midpoint
  scale_x_discrete(labels = dia_labels) + # change x-axis labels
  scale_y_discrete(labels = dia_labels) + # change y-axis labels
  labs(fill = "Correlation \ncoefficient",
       title = "Correlations between measurements of diastolic \nblood pressure in NHANES data")
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Prepare to plot with melt() and ggplot()
Create a lower triangle object to plot.
# "save as"
sys_lower <- NHANES_sys_cor
dia_lower <- NHANES_dia_cor

# use function upper.tri() and set the upper triangle all to NA
# then we can keep only the lower triangle
sys_lower[upper.tri(sys_lower)] <- NA
dia_lower[upper.tri(dia_lower)] <- NA

# melt to go back to long format
melted_sys_lower <- melt(sys_lower, na.rm = TRUE)
melted_dia_lower <- melt(dia_lower, na.rm = TRUE)

# did it work?
head(melted_sys_lower)
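The steps above can be tried on a small made-up matrix to see what each one does. In this sketch toy_cor and its values are invented for illustration, and the geom_tile() plot at the end is one possible way to draw the melted lower triangle, not necessarily the plot made with the NHANES data.

```r
library(reshape2)
library(ggplot2)

# small made-up correlation matrix standing in for NHANES_sys_cor
toy_cor <- matrix(c(1.00, 0.95, 0.90,
                    0.95, 1.00, 0.92,
                    0.90, 0.92, 1.00),
                  nrow = 3,
                  dimnames = list(c("m1", "m2", "m3"),
                                  c("m1", "m2", "m3")))

# set the upper triangle to NA, keeping the diagonal and lower triangle
toy_lower <- toy_cor
toy_lower[upper.tri(toy_lower)] <- NA

# long format: one row per remaining cell
# Var1 is the matrix row, Var2 is the matrix column
toy_melted <- melt(toy_lower, na.rm = TRUE)
toy_melted

# heatmap of the lower triangle with the coefficient printed in each tile
ggplot(toy_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2))) +
  scale_fill_gradient2(limits = c(0.8, 1), midpoint = 0.9,
                       low = "#e9a3c9", mid = "#f7f7f7", high = "#a1d76a") +
  labs(x = NULL, y = NULL, fill = "Correlation \ncoefficient") +
  theme_minimal()
```

A 3 x 3 matrix leaves 6 cells after the upper triangle is removed, so toy_melted has 6 rows.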
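# convert into a matrix as this is what rcorr() takes
# note: nhanes_trimmed is not created in the code shown here; judging by the
# output below it holds BMI, Pulse, BPSysAve, BPDiaAve and TotChol
nhanes_trimmed_matrix <- nhanes_trimmed %>%
  as.matrix()

nhanes_rcorr <- rcorr(nhanes_trimmed_matrix, type = "pearson")

# correlation matrix
nhanes_rcorr$r

# pvalue matrix (this is the output shown below; note the NA diagonal)
nhanes_rcorr$P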
BMI Pulse BPSysAve BPDiaAve TotChol
BMI NA 0.1772467 0 0.0000000 0.000000
Pulse 0.1772467 NA 0 0.2781342 0.993258
BPSysAve 0.0000000 0.0000000 NA 0.0000000 0.000000
BPDiaAve 0.0000000 0.2781342 0 NA 0.000000
TotChol 0.0000000 0.9932580 0 0.0000000 NA
Wrangle labels
# create a vector of how I want the labels to look
nhanes_labels <- c("BMI",
                   "Pulse",
                   "Systolic \nBlood Pressure",
                   "Diastolic \nBlood Pressure",
                   "Total Cholesterol")

# change row and column names of the correlation matrix
# so they are how we want them to be plotted
colnames(nhanes_rcorr$r) <- nhanes_labels
rownames(nhanes_rcorr$r) <- nhanes_labels

# change row and column names of the pvalue matrix
# so they are how we want them to be plotted
colnames(nhanes_rcorr$P) <- nhanes_labels
rownames(nhanes_rcorr$P) <- nhanes_labels
Make the correlation plot. The numbers are the correlation coefficients for relationships that are significant based on our criteria.
corrplot(nhanes_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat = nhanes_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "black", # display correlations in black
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 0.6) # size of correlation font
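To make the "significant based on our criteria" rule concrete, here is a minimal sketch on simulated data (sim, sig_level, shown and r_shown are names made up for this example). It works out by hand roughly which coefficients a lower-triangle plot with sig.level = 0.05 and insig = "blank" would display: those whose p-value is at or below the cutoff.

```r
library(Hmisc)

set.seed(7)

# a and b are strongly related; c is independent noise
x <- rnorm(100)
sim <- cbind(a = x,
             b = x + rnorm(100, sd = 0.5),
             c = rnorm(100))

sim_rcorr <- rcorr(sim, type = "pearson")

sig_level <- 0.05

# TRUE where the p-value meets the cutoff (the diagonal is NA)
shown <- sim_rcorr$P <= sig_level

# keep only coefficients that are significant and in the lower triangle
r_shown <- sim_rcorr$r
r_shown[is.na(shown) | !shown] <- NA
r_shown[!lower.tri(r_shown)] <- NA

round(r_shown, 2)
```

Cells that come out as NA here correspond to the blank cells in the plot.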