--- title: "TDAvec vignette" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{TDAvec-vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Persistence diagrams are a fundamental tool in topological data analysis (TDA) that summarize the multi-scale topological features (such as connected components, loops, and voids) of a dataset. They represent these features as points in a 2D plane, where each point corresponds to the birth and death of a feature across different scales. Since persistence diagrams are multisets of points in a 2D plane, their unordered structure makes them challenging to use directly in machine learning models. By vectorizing them (i.e., transforming into fixed-length vectors), conventional statistical and machine learning techniques can be applied while preserving topological information. The `TDAvec` R package is specifically designed for this task, offering 13 vectorization methods commonly used in TDA. These methods are divided into three broad categories: - Functional vector summaries - based on summary functions: - Betti curve - Euler characteristic curve - Normalized life curve - Persistence block - Persistence surface - Persistence landscape function - Persistence silhouette function - Persistent entropy summary function - Template function - Algebraic vector summaries - based on polynomial maps: - Algebraic functions - Complex polynomial coefficients - Tropical coordinate function - Statistical vector summaries - based on descriptive statistics: - Basic descriptive statistics For improved computational efficiency, all code behind the vector summaries of `TDAvec` is written in `C++` using the `Rcpp` and `RcppArmadillo` packages. In this vignette, we illustrate the basic usage of the package using simple examples. Let's first load the required libraries. ```{r setup} library(TDAvec) library(TDAstats) # to compute persistence diagrams ``` Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise: ```{r} # the number of points to sample N <- 100 # set a random seed for reproducibility set.seed(123) # sample N points uniformly from the unit circle and add Gaussian noise theta <- runif(N, min = 0, max = 2 * pi) X <- cbind(cos(theta), sin(theta)) + rnorm(2 * N, mean = 0, sd = 0.2) # plot the point cloud plot(X,pch = 20,asp = 1,xlab = 'x',ylab = 'y') ``` Next, we use the `TDAstats` package to compute the persistence diagram (PD) of a Vietoris-Rips filtration built on top of the point cloud $X$. ```{r} D <- calculate_homology(X,dim=1,threshold=2) sum(D[,1]==0) # number of connected components sum(D[,1]==1) # number of loops sum(D[,1]==2) # number of voids ``` In the `calculate_homology()` function, `dim` is the maximum homological dimension of the topological features to be computed (connected components if `dim=0`; connected components and loops if `dim=1`; connected components, loops and voids if `dim=2`, etc.). `threshold` is the maximum value of the scale parameter of the filtration (which we set equal to 2 since the points are sampled from a circle with diameter 2). The persistence diagram $D$ has `r sum(D[,1]==0)` connected components (the point cloud size - 1; `TDAstats` drops the connected component with infinite death value), `r sum(D[,1]==1)` loops (one with long life-span, the rest are short-lived) and `r sum(D[,1]==2)` voids along with their `birth` and `death` values. To plot the diagram, we can use the `plot_persisit()` function. ```{r} plot_persist(D) ``` In the plot, the solid dots and triangles represent connected components and loops respectively. Let's compute a vector summary of one of the simplest summary functions, the Betti curve, for homological dimension $H_1$. ```{r} # sequence of scale values to be used for vectorization scaleSeq = seq(0,1.5,length.out=16) # vectorize the Betti curve for homological dimension H_1 computeBettiCurve(D,homDim=1,scaleSeq) ``` By default, vectorization in this case is performed by computing the average values of the Betti curve over consecutive intervals defined by an increasing sequence of scale points (`evaluate`="intervals"). This vectorization method is adopted as the default for all univariate summary functions that are easy to integrate. More specifically, if $f$ is a (univariate) summary function and $t_1,t_2,\ldots,t_n$ are increasing scale values, we discretize $f$ as: \begin{equation} \Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1}, \end{equation} where $\Delta t_k=t_{k+1}-t_k$, $k=1,\ldots,n-1$. Alternatively, by setting the `evaluate` argument to "points", one can vectorize the Betti curve and all the other univariate summary functions by evaluating them at each scale point and arranging the values into a vector: \begin{equation} (f(t_1),f(t_2),\ldots,f(t_n))\in\mathbb{R}^{n}, \end{equation} As a note, the `computePersistenceLandscape()` function supports both classic and generalized persistence landscape functions. For example: ```{r} # classic computePersistenceLandscape(D,homDim=1,scaleSeq,k=3) # k = the number of persistence landscape functions to consider (default is 1) # generalized computePersistenceLandscape(D,homDim=1,scaleSeq,k=3,generalized=TRUE,kernel = "epanechnikov",h=0.2) # h = bandwidth for the kernel function ``` Computing algebraic and statistical vector summaries follows a similar approach. For example: ```{r} computeAlgebraicFunctions(D,homDim=0) ``` returns four algebraic functions based on the birth and death values (refer to the help documentation for more details). _Persistence surface_ and _persistence block_ are bivariate summary functions and to vectorize them, we first need to switch from the birth-death to the birth-persistence coordinates: ```{r} D[,3] <- D[,3] - D[,2] colnames(D)[3] <- "Persistence" ``` Below, we compute the _persistence image_, which is a vector summary of the _persistence surface_: ```{r} # Persistence Image (PI) resB <- 5 # resolution (or grid size) along the birth axis resP <- 5 # resolution (or grid size) along the persistence axis # find min and max persistence values for dimension H_0 minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3]) # construct one-dimensional grid of scale values ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1) # default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram sigma <- 0.5*(maxPH0-minPH0)/resP # compute PI for homological dimension H_0 computePersistenceImage(D,homDim=0,xSeq=NA,ySeqH0,sigma) ``` Since the $H_0$ features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization. For homological dimension $H_1$, the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. ```{r} # Persistence Image (PI) # find min & max birth and persistence values for dimension H_1 minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2]) minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3]) xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1) ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1) sigma <- 0.5*(maxPH1-minPH1)/resP # compute PI for homological dimension H_1 computePersistenceImage(D,homDim=1,xSeqH1,ySeqH1,sigma) ``` The code for `computePersistenceImage()` is adopted from the `pers.image()` function (available in the `kernelTDA` package) with some modifications. For example, `pers.image()` uses a one-dimensional grid for homological dimension $H_0$ regardless of the filtration type. In contrast, `computePersistenceImage()` uses a one-dimensional grid only if additionally the birth values are the same (which may not be true for some filtration types). Moreover, `pers.image()` uses a square grid (e.g., 10 by 10) for vectorization, whereas `computePersistenceImage()` is not limited to such a grid and can compute vector summaries using a rectangular grid (e.g., 10 by 20). #### References 1. Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. _Frontiers in Artificial Intelligence_, 108. 2. Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. _Advances in Computational Mathematics_, 48(1), 1-42. 3. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. _Journal of Machine Learning Research_, 16(1), 77-102. 4. Berry, E., Chen, Y. C., Cisewski-Kehe, J., & Fasy, B. T. (2020). Functional summaries of persistence diagrams. _Journal of Applied and Computational Topology_, 4(2):211–262. 5. Adcock, A., Carlsson, E. and Carlsson, G., 2013. The ring of algebraic functions on persistence bar codes. Homology, Homotopy Appl., 18:381–402, 2016. 6. Ali, D., Asaad, A., Jimenez, M.J., Nanda, V., Paluzo-Hidalgo, E. and Soriano-Trigueros, M., (2023). A survey of vectorization methods in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 7. Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., ... & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. _Journal of Machine Learning Research_, 18.