--- title: "Principal Components Analysis" output: markdown::html_format: options: toc: true number_sections: true bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{Principal Components Analysis} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 7, out.width = NULL ) ``` ```{r setup} library(dimensio) ``` # Do PCA ```{r pca} ## Load data data(iris) head(iris) ## Compute PCA X <- pca(iris, center = TRUE, scale = TRUE, sup_quali = "Species") ``` # Explore the results **dimensio** provides several methods to extract (`get_*()`) the results: * `get_data()` returns the original data. * `get_contributions()` returns the contributions to the definition of the principal dimensions. * `get_coordinates()` returns the principal or standard coordinates. * `get_correlations()` returns the correlations between variables and dimensions. * `get_cos2()` returns the cos^2^ values (i.e. the quality of the representation of the points on the factor map). * `get_eigenvalues()` returns the eigenvalues, the percentages of variance and the cumulative percentages of variance. The package also allows to quickly visualize (`viz_*()`) the results: * `biplot()` produces a biplot. * `screeplot()` produces a scree plot. * `viz_rows()`/`viz_individuals()` displays row/individual principal coordinates. * `viz_columns()`/`viz_variables()` displays columns/variable principal coordinates. * `viz_contributions()` displays (joint) contributions. * `viz_cos2()` displays (joint) cos^2^. ```{r eigenvalues, fig.show='hold', out.width='50%'} ## Get eigenvalues get_eigenvalues(X) ## Scree plot screeplot(X, cumulative = TRUE) ## Plot variable contributions to the definition of the first two axes viz_contributions(X, margin = 2, axes = c(1, 2)) ``` # PCA biplot A biplot is the simultaneous representation of rows and columns of a rectangular dataset. It is the generalization of a scatterplot to the case of mutlivariate data: it allows to visualize as much information as possible in a single graph (Greenacre, 2010). **dimensio** allows to display two types of biplots: a *form biplot* (*row-metric-preserving* biplot) or a *covariance biplot* (*column-metric-preserving* biplot). See Greenacre (2010) for more details about biplots. The form biplot favors the representation of the individuals: the distance between the individuals approximates the Euclidean distance between rows. In the form biplot the length of a vector approximates the quality of the representation of the variable. ```{r biplot-rows} biplot(X, type = "form", labels = "variables") ``` The covariance biplot favors the representation of the variables: the length of a vector approximates the standard deviation of the variable and the cosine of the angle formed by two vectors approximates the correlation between the two variables (Greenacre, 2010). In the covariance biplot the distance between the individuals approximates the Mahalanobis distance between rows. ```{r biplot-columns} biplot(X, type = "covariance", labels = "variables") ``` Biplots have the drawbacks of their advantages: they can quickly become difficult to read as they display a lot of information at once. It may then be preferable to visualize the results for individuals and variables separately. # Plot PCA loadings `viz_variables()` depicts the variables by rays emanating from the origin (both their lengths and directions are important to the interpretation). ```{r plot-var} ## Plot variables factor map viz_variables(X) ``` `viz_variables()` allows to highlight additional information by varying different graphical elements (color, transparency, shape and size of symbols...). ```{r plot-var-contribution} ## Highlight contribution viz_variables( x = X, extra_quanti = "contribution", color = c("#FB9A29", "#E1640E", "#AA3C03", "#662506"), legend = list(x = "bottomleft") ) ``` # Plot PCA scores `viz_individuals()` allows to display individuals and to highlight additional information. ```{r plot-ind-species} ## Plot individuals and color by species viz_individuals( x = X, extra_quali = iris$Species, color = c("#4477AA", "#EE6677", "#228833"), # Custom color scheme symbol = c(15, 16, 17), # Custom symbols legend = list(x = "bottomright") ) ``` ```{r plot-ind-highligh} ## Highlight one species viz_individuals( x = X, extra_quali = iris$Species, color = c(versicolor = "black"), # Named vector symbol = c(15, 16, 17), # Custom symbols legend = list(x = "bottomright") ) ``` ```{r plot-ind-lab} ## Label the 10 individuals with highest cos2 viz_individuals( x = X, labels = list(filter = "cos2", n = 10), extra_quali = iris$Species, color = c("#4477AA", "#EE6677", "#228833"), symbol = c(15, 16, 17), legend = list(x = "bottomright") ) ``` ```{r plot-wrap, fig.show='hold', out.width='50%'} ## Add ellipses viz_individuals(x = X, extra_quali = iris$Species, color = c("#004488", "#DDAA33", "#BB5566")) viz_tolerance(x = X, group = iris$Species, level = 0.95, color = c("#004488", "#DDAA33", "#BB5566")) ## Add convex hull viz_individuals(x = X, extra_quali = iris$Species, color = c("#004488", "#DDAA33", "#BB5566")) viz_hull(x = X, group = iris$Species, level = 0.95, color = c("#004488", "#DDAA33", "#BB5566")) ``` ```{r plot-ind-petal} ## Highlight petal length viz_individuals( x = X, extra_quanti = iris$Petal.Length, color = color("YlOrBr")(12), # Custom color scale size = c(1, 2), # Custom size scale legend = list(x = "bottomleft") ) ``` ```{r plot-ind-cos2} ## Highlight contributions viz_individuals( x = X, extra_quanti = "cos2", color = color("iridescent")(12), # Custom color scale size = c(1, 2), # Custom size scale legend = list(x = "bottomleft") ) ``` # Custom plot If you need more flexibility, the `get_*()` family and the `tidy()` and `augment()` functions allow you to extract the results as data frames and thus build custom graphs with base **graphics** or **ggplot2**. ```{r tidy} iris_tidy <- tidy(X, margin = 2) head(iris_tidy) iris_augment <- augment(X, margin = 1) head(iris_augment) ``` ```{r ggplot2, eval=FALSE} ## Custom plot with ggplot2 ggplot2::ggplot(data = iris_augment) + ggplot2::aes(x = F1, y = F2, colour = contribution) + ggplot2::geom_vline(xintercept = 0, linewidth = 0.5, linetype = "dashed") + ggplot2::geom_hline(yintercept = 0, linewidth = 0.5, linetype = "dashed") + ggplot2::geom_point() + ggplot2::coord_fixed() + # /!\ ggplot2::theme_bw() + khroma::scale_color_iridescent() ``` # References Greenacre, M. J. (2010). *Biplots in Practice*. Bilbao: FundaciĆ³n BBVA.