library(tidyverse) # everything
library(readxl) # reading in excel sheets
library(factoextra) # easy PCA plotting
library(glue) # easy pasting
Principal Components Analysis
Week 10
Introduction
Today we are going to start Module 4 where we put together a lot of the material we’ve learned in the first 3 modules of this course. Today’s material is on conducting principal components analysis (PCA) using R, and visualizing the results with some tools we’ve already learned to use, and some new wrangling and viz tips along the way.
PCA is a data reduction approach, and useful if you have many variables, for example, thousands of genes or metabolites. PCA creates summary variables (the principal components) which maximize the variation in the dataset. It can be categorized as an unsupervised approach, as PCA doesn’t know which samples belong to your different groups. When you look at a scores plot, points that are closer together are more similar based on your input data, and those further apart are more different. The location of the loadings helps you understand what is driving those differences in your scores plot.
If you are unfamiliar with PCA, I’d recommend these two Youtube videos by Josh Starmer of StatQuest which explain PCA in 5 mins, or with more detail in 20 min. Bam 💥!
Read in data
Today we are going to continue to use the same real research data from my group from the lessons on distributions and correlations. We will be reading in the supplementary data from a paper from my group written by Michael Dzakovich, and published in The Plant Genome. The data is present in a Excel worksheet, so we will use the function read_excel()
from the tidyverse (but not core tidyverse) package readxl
. We want to import Supplemental Table 3. You can indicate which sheet you want to import in the arguments to read_excel()
.
<- read_excel("data/tpg220192-sup-0002-supmat.xlsx",
alkaloid_blups sheet = "S3 BLUP Diversity Panel")
Let’s take a look at this new data sheet.
::kable(head(alkaloid_blups)) knitr
Genotype | Plot_Source | Species | Class | Origin | Provence | Blanca_Cluster1 | Blanca_Cluster2 | Passport_Species | Passport_Classification | Sim_Grouping | Dehydrotomatidine | Tomatidine | DehydrotomatineA1 | Dehydrotomatine2 | TotalDehydrotomatine | Tomatine | Hydroxytomatine1 | Hydroxytomatine2 | Hydroxytomatine3 | Hydroxytomatine4 | TotalHydroxytomatine | Acetoxytomatine1 | Acetoxytomatine2 | Acetoxytomatine3 | TotalAcetoxytomatine | DehydrolycoperosideFGdehydroesculeosideA | LycoperosideFGEsculeosideA1 | LycoperosideFGEsculeosideA2 | TotalLycoperosideFGEsculeosideA | EsculeosideB1 | EsculeosideB2 | EsculeosideB3 | TotalEsculeosideB | Total | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CULBPT_05_11 | 2K9-8584 | Processing | Cultivated Processing | USA | NY | SLL_processing_2 | SLL_processing_2 | SLL | SLL_processing_NY | Arid | 0.0001504 | 0.0043495 | 0.0072241 | -0.0010738 | 0.0061503 | 0.1929106 | 0.0102521 | 0.1596566 | 0.0352666 | 0.0864042 | 0.2917962 | -0.0680508 | -0.1242448 | 0.0117481 | -0.1805475 | -0.0369856 | 0.0056573 | -0.1693034 | -0.1636462 | -0.0340685 | -0.0061713 | -0.0371543 | -0.0773942 | 0.0365648 | 40.712800000000001 | -74.006 |
CULBPT_05_15 | 2K9-8622 | Processing | Cultivated Processing | USA | NY | SLL_processing_2 | SLL_processing_2 | SLL | SLL_processing_NY | Arid | -0.0000272 | -0.0001716 | 0.0086547 | 0.0000000 | 0.0086547 | 0.2824926 | 0.0006910 | 0.1118818 | 0.0243950 | 0.0189345 | 0.1559021 | 0.0038232 | 0.0467380 | 0.0024737 | 0.0530349 | 0.0066251 | 0.0125334 | 0.3417664 | 0.3542998 | 0.0138429 | 0.0021833 | 0.0139309 | 0.0299571 | 0.8907677 | 40.712800000000001 | -74.006 |
CULBPT_05_22 | 2K17-7708-1 | Processing | Cultivated Processing | USA | NY | SLL_processing_2 | SLL_processing_2 | SLL | SLL_processing_NY | Arid | -0.0000649 | -0.0002516 | 0.0085205 | 0.0000000 | 0.0085205 | 0.1755250 | -0.0001804 | 0.0505547 | 0.0040900 | 0.0023229 | 0.0576202 | 0.0078240 | 0.0069610 | -0.0001145 | 0.0146705 | 0.0031468 | 0.0150674 | 0.3840689 | 0.3991363 | 0.0007128 | 0.0008227 | 0.0028133 | 0.0043489 | 0.6618186 | 40.712800000000001 | -74.006 |
CULBPT04_1 | 2K9-8566 | Processing | Cultivated Processing | USA | NY | SLL_processing_2 | SLL_processing_2 | SLL | SLL_processing_NY | Arid | -0.0000272 | 0.0001259 | -0.0038737 | 0.0000000 | -0.0038737 | -0.0061446 | 0.0015930 | 0.1059454 | 0.0195829 | 0.0083566 | 0.1354779 | 0.0098585 | 0.0095344 | 0.0007941 | 0.0201870 | 0.0049377 | 0.0100416 | 0.3466282 | 0.3566699 | 0.0100830 | 0.0006432 | 0.0089015 | 0.0196277 | 0.5269806 | 40.712800000000001 | -74.006 |
E6203 | 2K9-8600 | Processing | Cultivated Processing | USA | CA | SLL_processing_1 | SLL_processing_1_1 | SLL | SLL_processing_CA | Arid | -0.0000272 | 0.0000159 | 0.0099538 | 0.0000000 | 0.0099538 | 0.2929606 | 0.0000000 | 0.0227075 | 0.0039160 | 0.0088279 | 0.0354514 | 0.0002365 | 0.0395374 | 0.0041794 | 0.0439532 | 0.0027439 | 0.0133225 | 0.3018732 | 0.3151958 | 0.0080894 | -0.0003574 | 0.0100045 | 0.0177365 | 0.7179839 | 36.778300000000002 | -119.4179 |
F06-2041 | 2K16-9843 | Processing | Cultivated Processing | USA | OH | SLL_processing_1 | SLL_processing_1_3 | SLL | SLL_processing_OH | Humid | -0.0000272 | -0.0001648 | -0.0011236 | 0.0000000 | -0.0011236 | 0.0169681 | 0.0011237 | 0.0195573 | -0.0016308 | -0.0016069 | 0.0188454 | -0.0004237 | 0.0271524 | -0.0005286 | 0.0262002 | 0.0053084 | 0.0049033 | 0.1647694 | 0.1696728 | -0.0006298 | -0.0003074 | 0.0019315 | 0.0009943 | 0.2352713 | 40.417299999999997 | -82.9071 |
What are the dimensions of this dataframe?
dim(alkaloid_blups)
[1] 107 37
Light wrangling
Here we have the best linear unbiased predictors (BLUPs) representing the alkaloid content of 107 genotypes of tomatoes. There is extra meta-data here we won’t use, so like we did in correlations, we are going to create a vector to indicate which column name reprents the alkaloids we want to include in our principal components analysis. Then we can create a new trimmed dataframe.
<- c("Dehydrotomatidine",
alkaloid_total_names "Tomatidine",
"TotalDehydrotomatine",
"Tomatine",
"TotalHydroxytomatine",
"TotalAcetoxytomatine",
"DehydrolycoperosideFGdehydroesculeosideA",
"TotalLycoperosideFGEsculeosideA",
"TotalEsculeosideB",
"Total")
<- alkaloid_blups |>
alkaloid_blups_trim select(Genotype, Species, Class, all_of(alkaloid_total_names))
# did it work?
colnames(alkaloid_blups_trim) # yes
[1] "Genotype"
[2] "Species"
[3] "Class"
[4] "Dehydrotomatidine"
[5] "Tomatidine"
[6] "TotalDehydrotomatine"
[7] "Tomatine"
[8] "TotalHydroxytomatine"
[9] "TotalAcetoxytomatine"
[10] "DehydrolycoperosideFGdehydroesculeosideA"
[11] "TotalLycoperosideFGEsculeosideA"
[12] "TotalEsculeosideB"
[13] "Total"
Run PCA
There are many packages that have functions that run PCA (including ) but I think the most common function used is a part of base R, and is called prcomp()
.
Note, PCA will allow zeroes, but will throw an error if you feed it NAs.
<- prcomp(alkaloid_blups_trim |> select(all_of(alkaloid_total_names)),
alkaloids_pca scale = TRUE, # default is false
center = TRUE) # default is true, just being explicit
Let’s investigate alkaloids_pca
.
glimpse(alkaloids_pca)
List of 5
$ sdev : num [1:10] 1.794 1.732 1.215 0.99 0.776 ...
$ rotation: num [1:10, 1:10] 0.162 0.309 0.422 0.326 0.311 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "Dehydrotomatidine" "Tomatidine" "TotalDehydrotomatine" "Tomatine" ...
.. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:10] 0.000169 0.00721 0.142798 1.865975 1.323755 ...
..- attr(*, "names")= chr [1:10] "Dehydrotomatidine" "Tomatidine" "TotalDehydrotomatine" "Tomatine" ...
$ scale : Named num [1:10] 0.000505 0.024096 0.339379 5.889986 2.91824 ...
..- attr(*, "names")= chr [1:10] "Dehydrotomatidine" "Tomatidine" "TotalDehydrotomatine" "Tomatine" ...
$ x : num [1:107, 1:10] -1.51 -1.5 -1.55 -1.55 -1.52 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
print(alkaloids_pca)
Standard deviations (1, .., p=10):
[1] 1.7944141686 1.7318715307 1.2151400892 0.9904494488 0.7763914595
[6] 0.6695313752 0.4394142511 0.2041300753 0.1932182462 0.0001503975
Rotation (n x k) = (10 x 10):
PC1 PC2 PC3
Dehydrotomatidine 0.1621938 0.05363402 -0.06746652
Tomatidine 0.3088504 -0.13094437 -0.44243776
TotalDehydrotomatine 0.4222731 -0.30216776 -0.22116400
Tomatine 0.3263804 -0.28919346 -0.43424991
TotalHydroxytomatine 0.3111090 -0.07515005 0.42780515
TotalAcetoxytomatine 0.3190534 -0.17396967 0.54287286
DehydrolycoperosideFGdehydroesculeosideA 0.2125680 0.49447329 -0.07346836
TotalLycoperosideFGEsculeosideA 0.2130280 0.52056383 -0.08463631
TotalEsculeosideB 0.1864604 0.50165801 -0.10122926
Total 0.5191805 0.04443498 0.24834278
PC4 PC5 PC6
Dehydrotomatidine 0.92897283 -0.275272461 -0.09845544
Tomatidine 0.13150651 0.482582095 0.65879800
TotalDehydrotomatine -0.16555038 -0.170452131 -0.15777601
Tomatine -0.13638091 -0.163532836 -0.41927007
TotalHydroxytomatine 0.15203796 0.706002906 -0.41591986
TotalAcetoxytomatine -0.05462895 -0.326082234 0.41704857
DehydrolycoperosideFGdehydroesculeosideA -0.16278360 0.013643330 -0.06919125
TotalLycoperosideFGEsculeosideA -0.09078436 -0.009476836 -0.02310538
TotalEsculeosideB 0.02495982 -0.037371129 0.01529154
Total -0.11066018 -0.170588385 0.05600981
PC7 PC8 PC9
Dehydrotomatidine 0.13097525 -0.02201625 -0.012909093
Tomatidine 0.03505366 0.05778857 0.054315300
TotalDehydrotomatine 0.03480557 -0.61023340 -0.475709677
Tomatine -0.09303946 0.40353582 0.403874861
TotalHydroxytomatine -0.04702814 -0.04023463 0.009165434
TotalAcetoxytomatine -0.01377758 -0.02258828 0.182475550
DehydrolycoperosideFGdehydroesculeosideA 0.61727735 -0.31823996 0.437114636
TotalLycoperosideFGEsculeosideA 0.10012665 0.42335087 -0.583833541
TotalEsculeosideB -0.76029219 -0.28461575 0.203734859
Total -0.01570365 0.31194871 -0.025467058
PC10
Dehydrotomatidine -0.0001604947
Tomatidine 0.0010655927
TotalDehydrotomatine 0.0147128823
Tomatine 0.2559833151
TotalHydroxytomatine 0.1268345182
TotalAcetoxytomatine 0.5059522400
DehydrolycoperosideFGdehydroesculeosideA 0.0080094872
TotalLycoperosideFGEsculeosideA 0.3707973486
TotalEsculeosideB 0.0226436777
Total -0.7239562737
class(alkaloids_pca)
[1] "prcomp"
We can see that the resulting PCA object is a prcomp object, and is a list of 5 lists and vectors.
This includes:
sdev
: the standard deviations (square roots of the eigenvalues of the covariance matrix) of the principal componentsrotation
: the PCs for the variables (i.e., the variable loadings)x
: the PCs for samples (i.e., the scores)center
: the centering usedscale
: the scaling used
We can also look at the output of our PCA in a different way using the function summary()
.
summary(alkaloids_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.794 1.7319 1.2151 0.9904 0.77639 0.66953 0.43941
Proportion of Variance 0.322 0.2999 0.1477 0.0981 0.06028 0.04483 0.01931
Cumulative Proportion 0.322 0.6219 0.7696 0.8677 0.92796 0.97279 0.99210
PC8 PC9 PC10
Standard deviation 0.20413 0.19322 0.0001504
Proportion of Variance 0.00417 0.00373 0.0000000
Cumulative Proportion 0.99627 1.00000 1.0000000
We can convert this summary into something later usable by extraction the element importance
from summary(alkaloids_pca)
and converting it to a dataframe.
<- summary(alkaloids_pca)$importance |>
importance as.data.frame()
::kable(head(importance)) knitr
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
---|---|---|---|---|---|---|---|---|---|---|
Standard deviation | 1.794414 | 1.731872 | 1.21514 | 0.9904494 | 0.7763915 | 0.6695314 | 0.4394143 | 0.2041301 | 0.1932182 | 0.0001504 |
Proportion of Variance | 0.321990 | 0.299940 | 0.14766 | 0.0981000 | 0.0602800 | 0.0448300 | 0.0193100 | 0.0041700 | 0.0037300 | 0.0000000 |
Cumulative Proportion | 0.321990 | 0.621930 | 0.76959 | 0.8676900 | 0.9279600 | 0.9727900 | 0.9921000 | 0.9962700 | 1.0000000 | 1.0000000 |
By looking at the summary we can see, for example, that the first two PCs explain 62.19% of variance.
We are going to go over making scree, scores and loadings plots using helper functions (here, they start fviz_()
and come from the package factoextra
, and manually via ggplot. The helper functions allow you look at each plot type simply. This is an important step because when you make your plots with ggplot, you want to be sure they look how they should.
Scree plot
A scree plot shows what percentage of total variance is explained by each principal component.
Using fviz_eig()
We can do this quickly using fviz_eig()
.
fviz_eig(alkaloids_pca)
You can actually do this very easily with base R plotting as well. If you weren’t planning to publish this type of plot, it might not be important it look beautiful, and then both of these options would be great and quick. Note though that the base R plot is plotting at a different scale.
plot(alkaloids_pca)
Manually
If you wanted to make a scree plot manually, you could by plotting using a wrangled version of the importance
dataframe we made earlier.
<- importance |>
importance_tidy rownames_to_column(var = "measure") |>
pivot_longer(cols = PC1:PC10,
names_to = "PC",
values_to = "value")
|>
importance_tidy filter(measure == "Proportion of Variance") |>
ggplot(aes(x = PC, y = value)) +
geom_col()
Almost! PC10 is displaying right after PC1 because alphabetically, this is the order. Let’s fix it.
# create a vector with the order we want
<- colnames(importance)
my_order
# relevel according to my_order
$PC <- factor(importance_tidy$PC, levels = my_order)
importance_tidy
# check to see if it worked
levels(importance_tidy$PC)
[1] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7" "PC8" "PC9" "PC10"
Let’s plot again.
|>
importance_tidy filter(measure == "Proportion of Variance") |>
ggplot(aes(x = PC, y = value)) +
geom_col()
Success!
If we want to tighten up this plot we can.
|>
importance_tidy filter(measure == "Proportion of Variance") |>
ggplot(aes(x = PC, y = value)) +
geom_col(alpha = 0.1, color = "black") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
labs(x = "Principal component",
y = "Percent variance explained",
title = "Scree plot of 10 alkaloids analyzed across 107 tomato accessions")
This is a perfectly ready scree plot for the supplementary materials of a publication.
Scores plot
When people talk about PCA plots, what they most often mean is PCA scores plots. Here, each point represents a sample, and we are plotting their coordinates typically for the first 2 PCs. Sometimes people make 3D PCA plots with the first 3 PCs but I think these are not easy to look in 2D and I wouldn’t recommend you to put them in your papers.
Using fviz_pca_ind()
We can also look at a scores plot using fviz_pca_ind()
where ind means individuals. Here, each point is a sample.
fviz_pca_ind(alkaloids_pca)
Because our alkaloids_pca
doesn’t have any meta-data, this is a hard to interpret plot, where each number indicates the rownumber of that sample. Making the scores plot this way is useful because it shows us the shape of the plot which we can use to confirm that we have made a ggplot that looks like its been created correctly.
Manually
We want to plot the scores, which are in provided in alkaloids_pca$x
.
We can convert the list into a dataframe of scores values by using as.data.frame()
. Then we can bind back our relevant metadata so they’re all together. Note, to use bind_cols()
both datasets need to be in the same order. In this case they are so we are good.
# create a df of alkaloids_pca$x
<- as.data.frame(alkaloids_pca$x)
scores_raw
# bind meta-data
<- bind_cols(alkaloid_blups |> select(Genotype, Plot_Source, Species), # metadata
scores
scores_raw)
# how does our new df look?
1:6, 1:6] scores[
# A tibble: 6 × 6
Genotype Plot_Source Species PC1 PC2 PC3
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 CULBPT_05_11 2K9-8584 Processing -1.51 -1.17 0.00632
2 CULBPT_05_15 2K9-8622 Processing -1.50 -0.915 0.0649
3 CULBPT_05_22 2K17-7708-1 Processing -1.55 -0.942 0.0658
4 CULBPT04_1 2K9-8566 Processing -1.55 -0.906 0.0817
5 E6203 2K9-8600 Processing -1.52 -0.940 0.0436
6 F06-2041 2K16-9843 Processing -1.58 -0.934 0.0677
Now we can plot.
|>
scores ggplot(aes(x = PC1, y = PC2, color = Species)) +
geom_point()
Our shapes are looking the same, this is good. Let’s pretty up our plot.
First let’s wrangle.
# create objects indicating percent variance explained by PC1 and PC2
<- round((importance[2,1])*100, # index 2nd row, 1st column, times 100
PC1_percent 1) # round to 1 decimal
<- round((importance[2,2])*100, 1)
PC2_percent
# make Species a factor and set levels
# from least wild to most wild
# we did this back in data distributions.
$Species <- factor(scores$Species,
scoreslevels = c("Processing",
"Cultivated Cherry",
"Hybrid",
"Wild Cherry",
"Wild"))
Then we can plot
# plot
<- scores |>
(scores_plot ggplot(aes(x = PC1, y = PC2, fill = Species)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(shape = 21, color = "black", size = 2.5, alpha = 0.7) +
# sequential color selecting hex codes from YlOrRd color by wildness
scale_fill_manual(values = c("#ffffcc", "#fed976", "#fd8d3c", "#e31a1c", "#800026")) +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of 10 Alkaloids Present in 107 Tomato Accessions"))
This looks nice.
Loadings plot
Using fviz_pca_var()
We can also look at a loadings plot using fviz_pca_var()
where var means variables. Here, each point is a variable.
fviz_pca_var(alkaloids_pca)
Manually
We can also make a more customized loadings plot manually using ggplot and using the dataframe alkaloids_pca$rotation
.
# grab raw loadings, without any metadata
<- as.data.frame(alkaloids_pca$rotation)
loadings_raw
<- loadings_raw |>
loadings rownames_to_column(var = "alkaloid")
We can then plot with ggplot like normal.
|>
loadings ggplot(aes(x = PC1, y = PC2, label = alkaloid)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_text() +
scale_fill_brewer() +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot")
We have two problems with this plot.
- The names are abbreviated and not how we want them to appear
- The label names are on top of the points/each other
We can fix both of these problems.
We can create a vector of the labels as we want them to appear, as we have done previously.
<- c("Dehydrotomatidine",
alkaloid_labels "Tomatidine",
"Dehydrotomatine",
"Alpha-Tomatine",
"Hydroxytomatine",
"Acetoxytomatine",
"Dehydrlycoperoside F, G, \nor Dehydroescueloside A",
"Lycoperoside F, G, \nor Escueloside A",
"Escueloside B",
"Total Steroidal Alkaloids")
Then we can re-plot with these labels.
|>
loadings ggplot(aes(x = PC1, y = PC2, label = alkaloid_labels)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_text() +
scale_fill_brewer() +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot")
Ok the label names are better but they’re still smushed. The package ggrepel
has some good functions to help us. You can try using geom_text_repel()
and geom_label_repel()
.
With geom_text_repel()
library(ggrepel)
<- loadings |>
(loadings_plot ggplot(aes(x = PC1, y = PC2, label = alkaloid_labels)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_text_repel() +
scale_fill_brewer() +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot using geom_text_repel()"))
With geom_label_repel()
|>
loadings ggplot(aes(x = PC1, y = PC2, label = alkaloid_labels)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_label_repel() +
scale_fill_brewer() +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot using geom_label_repel()")
patchwork
You can pop these two plots side by side easing using the package patchwork
.
library(patchwork)
+ loadings_plot scores_plot
Biplot
Using fviz_pca()
.
You can make a biplot quickly with fviz_pca()
. Note, fviz_pca_biplot()
and fviz_pca()
are the same.
fviz_pca(alkaloids_pca)
Instead of making this plot manually, let’s go through how to alter the existing plot made with fviz_pca()
. We can do this because factoextra
creates ggplot objects. To start off, we need to be using a dataframe that includes our metadata.
fviz_pca(alkaloids_pca, # pca object
label = "var",
repel = TRUE,
geom.var = "text") +
geom_point(aes(fill = alkaloid_blups$Species), shape = 21) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Biplot Plot of 10 Alkaloids Present in 107 Tomato Accessions",
fill = "Species")
This is almost what we want - except we have only the abbreviated names for the alkaloids. Since in a biplot, we are really plotting two different sets of data (the scores and the loadings)there isn’t the ability to use labeller
or similar with fviz_pca
for the loadings only. There is a workaround though, we can go into our PCA object, change the rownames of alkaloids_pca$rotation
to be our longer labels, and that should inherit to our new plot.
# save as a new df
<- alkaloids_pca
alkaloids_pca_labelled
# assign alkaloid_labels to rownames
rownames(alkaloids_pca_labelled$rotation) <- alkaloid_labels
# plot
fviz_pca(alkaloids_pca_labelled, # pca object
label = "var",
repel = TRUE,
geom.var = c("text", "point"),
col.var = "black") +
geom_point(aes(fill = alkaloid_blups$Species), shape = 21) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Biplot Plot of 10 Alkaloids Present in 107 Tomato Accessions",
fill = "Species")
Manually
We can’t add the loadings right on top of the scores because they are on different scales.
# plot
|>
scores ggplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(aes(x = PC1, y = PC2, fill = Species), # from piped data
shape = 21, color = "black", size = 2.5, alpha = 0.7) +
geom_point(data = loadings,
aes(x = PC1, y = PC2)) + # set locally
# sequential color selecting hex codes from YlOrRd color by wildness
scale_fill_manual(values = c("#ffffcc", "#fed976", "#fd8d3c", "#e31a1c", "#800026")) +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of 10 Alkaloids Present in 107 Tomato Accessions")
The issue here is that the scores and loadings are on different scales - let’s make them on the same scale.
I can write a quick function to allow normalization.
<- function(x) return((x - min(x))/(max(x) - min(x))) normalize
Then I can nornalize the scores using the scale function, since the loadings are already normalized.
<- scores |>
scores_normalized mutate(PC1_norm = scale(normalize(PC1), center = TRUE, scale = FALSE)) |>
mutate(PC2_norm = scale(normalize(PC2), center = TRUE, scale = FALSE)) |>
select(Genotype, Plot_Source, Species,
everything()) # pick metadata and reorder PC1_norm, PC2_norm,
How did it go? PC1_norm and PC2_norm should all now be between -1 and 1
# look at first six rows
head(scores_normalized)
# A tibble: 6 × 15
Genotype Plot_Source Species PC1_norm[,1] PC2_norm[,1] PC1 PC2 PC3
<chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CULBPT_05_… 2K9-8584 Proces… -0.201 -0.114 -1.51 -1.17 0.00632
2 CULBPT_05_… 2K9-8622 Proces… -0.200 -0.0892 -1.50 -0.915 0.0649
3 CULBPT_05_… 2K17-7708-1 Proces… -0.206 -0.0918 -1.55 -0.942 0.0658
4 CULBPT04_1 2K9-8566 Proces… -0.206 -0.0883 -1.55 -0.906 0.0817
5 E6203 2K9-8600 Proces… -0.203 -0.0916 -1.52 -0.940 0.0436
6 F06-2041 2K16-9843 Proces… -0.210 -0.0910 -1.58 -0.934 0.0677
# ℹ 7 more variables: PC4 <dbl>, PC5 <dbl>, PC6 <dbl>, PC7 <dbl>, PC8 <dbl>,
# PC9 <dbl>, PC10 <dbl>
# range of PC1
range(scores_normalized$PC1_norm)
[1] -0.2128768 0.7871232
# range of PC2
range(scores_normalized$PC2_norm)
[1] -0.4608813 0.5391187
We can set the scores_normalized
data for the overall plot and add the loadings
within another geom_point()
.
|>
scores_normalized ggplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(aes(x = PC1_norm, y = PC2_norm, fill = Species), # scores
shape = 21, color = "black", size = 2.5, alpha = 0.7) +
geom_point(data = loadings, # loadings points
aes(x = PC1, y = PC2), color = "gray") +
geom_label_repel(data = loadings, # loadings labels
aes(x = PC1, y = PC2, label = alkaloid_labels),
size = 2.5, # make labels smaller
fill = alpha(c("white"),0.5)) + # make label background transparent
scale_fill_manual(values = c("#ffffcc", "#fed976", "#fd8d3c", "#e31a1c", "#800026")) +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Biplot of 10 Alkaloids Present in 107 Tomato Accessions",
subtitle = "Samples are colored points and loadings are grey with labels")
Voila.