\documentclass[12pt]{article} \usepackage{fullpage} \usepackage{latexsym,amsmath,amssymb,epsfig} \usepackage{amsthm} \usepackage{subfigure} \usepackage{marvosym} %\usepackage{hyperref} \usepackage{pstricks} \usepackage{pst-node} \usepackage[round]{natbib} \usepackage{hyperref} \usepackage{multirow} \newtheorem{definition}{Definition}[section] \graphicspath{{figures/}} \title{An Introduction to \texttt{PottsUtils}} \author{Dai Feng \\ dai\_feng@merck.com} \date{} \begin{document} %\VignetteIndexEntry{Using the PottsUtils Package} %\VignetteKeywords{Potts, Simulation of the Potts models} %\VignettePackage{PottsUtils} \maketitle %<<echo=false,results=hide>>= %library(PottsUtils) %@ Package \texttt{PottsUtils} comprises several functions related to the Potts models defined on undirected graphs. The main purpose of the package is to make available several functions that generate samples from the models. To facilitate that, there are other utilities. Furthermore, there is a by-product of simulation functions. Altogether, there are three sets of functions. The first produces basic properties of a graph to facilitate the simulation functions (they maybe used for other purposes as well). The second provides various simulation functions. The third currently includes only one function which computes the normalizing constant based on simulation results. This introduction was intended to help users to understand better the functions, the documentation (Rd files), and the source code. For more technical details (mathematical proof among others), we refer users to references herein. Hereafter, first we introduce some basic concepts and definitions related to the Potts models. Second, algorithms used in simulation functions are presented in a concise way. Third, a function to obtain normalizing constant is introduced. Forth, we discuss some computational issues. Finally, some future work is outlined. \section{Notation and Terminology} In this section we introduce concepts and definitions involved in the discussions of the Potts models. The notations used are similar to those given in \cite{Winkler:2003}. Based on that, several related functions in the package are introduced. We consider the Potts model defined on a finite undirected graph. A graph describes a set of connections between objects. Each object is called a node or vertex. There are observations to characterize the properties of vertices. Let $\mathcal{V}$ be a finite set, the set of vertices; $\mathcal{V}=\{1,2,,\ldots,N\}$, where $N$ is the total number of vertices. For every vertex $i \in \mathcal{V}$, let $z_{i}$ take values in a finite set of categories $\mathcal{Z} = \{1,2,\ldots,k\}$, where $k$ is the number of categories. In the package we use different colors to represent different categories and vertices from the same category are of the same color. The product ${\bf{Z}} = \mathcal{Z}^N$ is the space of configurations ${\bf{z}}=(z_{i}; i \in \mathcal{V})$. A strictly positive probability measure $P$ on ${\bf{Z}}$ for every ${\bf{z}} \in {\bf{Z}}$ is called a \textit{stochastic} or \textit{random field}. Note that $P$ has to be strictly positive on ${\bf{Z}}$ to satisfy the assumptions of the Hammersley-Clifford theorem; see \cite{Besag:1974} and \cite{Winkler:2003} for details. A collection $\partial = (\partial(v): v \in \mathcal{V})$ of subsets of $\mathcal{V}$ is called a \textit{neighborhood system}, if (i) $i \notin \partial(i)$ and (ii) $i \in \partial(j)$ if and only if $j \in \partial(i)$. The sites $j \in \partial(i)$ are called \textit{neighbors} of $i$. We use $i \sim j$ to denote that $i$ and $j$ are neighbors of each other. There is an \textit{edge} between $i$ and $j$ if and only if they are neighbors. Define a \textit{graph} $\mathcal{G}=\{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{E}$ is the set of edges. For a \textit{finite undirected graph}, the number of vertices are finite and edge $(i, j)$ is equivalent to edge $(j, i)$. The function \texttt{getEdges()} can be used to get edges of a graph. When neighbors $i$ and $j$ are from the same category (of the same color), then there is a \textit{bond} between them. The random field $P$ is a \textit{Markov random field} (MRF) w.r.t. the neighborhood system $\partial$ if for all ${\bf{z}} \in {\bf{Z}}$, \[ P(z_{i}|z_{j}, j \neq i) = P(z_{i}|z_{j}, j \in \partial(i)) \] Probability measures of the form \[ P({\bf z})=\frac{\exp\{-H({\bf z})\}}{\sum_{{\bf x} \in {\bf{Z}}} \exp\{-H({\bf x})\}} \] are called \textit{Gibbs fields (or measures)}. $H$ is called the \textit{energy function} or \textit{Hamiltonian}, and $\sum_{x \in {\bf{Z}}} \exp\{-H(x)\}$ is called the \textit{partition function} or \textit{normalizing constant}. For detailed account on MRF, Gibbs measures, and related issues, we refer to \cite{Winkler:2003}. When using Markov random field models, the first question is how to define neighbors of all vertices. For a 1D lattice, the simplest way to define neighbors is that every vertex (except those on the boundaries) has the two adjacent vertices as its neighbors, see Figure \ref{fig:neighbors1D} for illustration. Of course, a vertex could have more than two neighbors. \begin{figure} \begin{center} \includegraphics[scale=0.8]{neighbors1D.png} \caption{2 neighbors in 1D} \label{fig:neighbors1D} \end{center} \end{figure} For a 2D lattice, there are two common ways to define neighbors. One is that neighbors of a vertex comprise its available N, S, E, and W adjacencies. Another is that, besides those four, there are four diagonal adjacencies on its north-west, north-east, south-west, and south-east. See Figure \ref{fig:neighbors2D} for illustrations. Probability measures defined on the former are called the first-order Markov random fields and the latter the second-order Markov random fields. \begin{figure} \begin{center} \includegraphics[scale=0.8]{neighbors2D.png} \caption{Four and eight neighbors in 2D} \label{fig:neighbors2D} \end{center} \end{figure} For a 3D lattice, besides defining six neighbors in the $x$, $y$, and $z$ directions, one can add twelve diagonal neighbors in the $x-y$, $x-z$, and $y-z$ planes, and another eight on the 3D diagonals. This leads to a six neighbor structure, an eighteen neighbor structure, and a twenty-six neighbor structure. For illustration, see Figure \ref{fig:neighbors3D}. % where we use a right-handed coordinate system with the $z$ axis pointing up. \begin{figure}[htp] \centering \subfigure[six neighbors]{ \includegraphics[width=0.4\textwidth, totalheight=0.25\textheight]{6neighbors.png}} \hspace{0.2in} \subfigure[eighteen neighbors]{ \includegraphics[width=0.4\textwidth, totalheight=0.25\textheight]{18neighbors.png}} \hspace{0.2in} \subfigure[twenty-six neighbors]{ \includegraphics[width=0.4\textwidth, totalheight=0.25\textheight]{26neighbors.png}} \caption{Illustration of neighbor structures in 3D} \label{fig:neighbors3D} \end{figure} The package provides a function called \texttt{getNeighbors()} to generate all neighbors of a graph. After defining neighbors, the second question is how to model the spatial relationship among neighbors. One choice is to use a model from the Potts model family, a set of MRF models with the Gibbs measure defined as follows. \begin{equation} p({\bf z}) = C(\beta)^{-1} \exp\left\{\sum_{i=1}^{N}\alpha_{i}(z_{i})+ \beta \sum_{i \sim j}w_{ij}f(z_{i}, z_{j})\right\} \label{eqn:genPotts} \end{equation} where $C(\beta)$ is a normalizing constant and $i \sim j$ indicates neighboring vertices. We need to define neighborhood structure and then assign relationships among neighboring vertices. The parameter $\beta$, called the \textit{inverse temperature}, determines the level of spatial homogeneity between neighboring vertices in the graph. A zero $\beta$ would imply that neighboring vertices are independent. We use positive $\beta$ values. The $w_{ij}$ are \textit{weights} and we assume $w_{ij} > 0$. The term $\sum_{i=1}^{N}\alpha_{i}(z_{i})$ is called the \textit{external field}. The $\alpha_{i}(z_{i})$ are functions of $z_{i}$. When $\beta=0$, the external field completely characterizes the distribution of the independent $z_i, i=1,2,\dots, N$. When $f(z_{i}, z_{j})=\textrm{I}(z_{i}=z_{j})$ model (\ref{eqn:genPotts}) becomes \begin{equation} p({\bf z})=C(\beta)^{-1}\exp\left\{\sum_{i=1}^{N}\alpha_{i}(z_{i})+ \beta\sum_{i \sim j}\textrm{I}(z_{i}=z_{j})\right\} \label{eqn:Potts} \end{equation} For $k = 2$, this model is called the Ising model \citep{Ising:1925}; for $k > 2$ it is the \cite{Potts:1953} model. The Ising model was originally proposed to describe the physical properties of magnets. Due to its flexibility and simplicity, the Ising model and its various versions have been widely used in other fields, such as brain models in cognitive science \citep{Feng:2008}, information and machine learning theory (\cite{MacKay:2003} and references therein), economics (\cite{Bourgine/Nadal:2004} and references therein), sociology \citep{Kohring:1996} and game theory \citep{Hauert/Szabo:2005}. The most commonly used Potts model is the one without an external field and with $w_{ij} \equiv 1$, \begin{equation} p({\bf z})=C(\beta)^{-1}\exp\left\{\beta\sum_{i \sim j}\textrm{I}(z_{i}=z_{j})\right\} \label{eqn:simPotts} \end{equation} We refer to this as the \emph{simple Potts model}. Let $\alpha_{i}(z_{i}) \equiv 0$ and $f(z_{i},z_{j}) = w_{ij}\textrm{I}(z_{i}=z_{j})$. Then (\ref{eqn:genPotts}) reduces to \begin{equation} p({\bf{z}}) = C(\beta)^{-1} \exp\left\{ \beta\sum_{i \sim j}w_{ij}\textrm{I}(z_{i}=z_{j})\right\} \label{eqn:comPotts} \end{equation} where $w_{ij}$ is the weight between vertex $i$ and $j$. For example we might take $w_{ij} = \frac{1}{d(z_{i}, z_{j})}$ where $d(z_{i}, z_{j})$ is a distance function, say Euclidean distance, between two vertices. This model is referred to as the \emph{compound Potts model}. In model (\ref{eqn:genPotts}), let $\alpha_{i}(z_{i})=0$ and define $f(z_{i},z_{j})$ as \begin{equation} f(z_{i},z_{j})= \begin{cases} a_{1} & \textrm{if} \hspace{0.2cm} z_{i} = z_{j}\\ a_{2} & \textrm{if} \hspace{0.2cm} |z_{i}-z_{j}|=1\\ a_{3} & \textrm{otherwise} \end{cases} \label{eqn:rep1} \end{equation} where $a_{1} \geq a_{2} \geq a_{3}$. We call this model the \emph{repulsion Potts model}. This model assumes an ordering of the colors and that neighboring vertices are most likely of the same color, and if they are different then it is more likely that they are similar than totally different. See \cite{Feng:2008} for more details. \section{Simulation of the Potts Models} It is very hard to find algorithms (such as inversion of CDF, rejection sampling, adaptive rejection sampling, or ratio-of-uniforms sampling) to generate i.i.d. samples from the Potts models, and Markov chain methods have to be used for the simulation. To generate samples from model (\ref{eqn:genPotts}), single site updating, for example Gibbs sampling, is easy but may mix slowly. The \cite{Swendsen/Wang} algorithm (SW) is widely used to generate random samples from the simple Potts model. Wolff's algorithm \citep{Wolff:1989} has been advocated as an alternative to the SW. A Gibbs sampler that takes advantage of the conditional independence structure to update variables $z_{i}, i=1,2,\ldots, N$, could make the simulation much faster than a single site updating scheme \citep{Feng:2008}. When there is external field, the partial decoupling method might outperform the Gibbs sampling. \subsection{Swendsen-Wang Algorithm} \label{sec:231} The SW algorithm was originally proposed for the simulation of the simple Potts model. Drawing auxiliary variables $u_{ij}|{\bf{z}}$ for neighboring vertices $(i, j)$ from independent and uniform distributions on $[0,\exp\{\beta\textrm{I}(z_{i}=z_{j})\}]$ makes the joint density \begin{equation} p({\bf{z}}, {\bf{u}}) \propto \prod_{i \sim j} \textrm{I}_{[0,\exp\{\beta\textrm{I}(z_{i}=z_{j})\}]}(u_{ij}) \end{equation} The conditional distribution of ${\bf{z}}$ given ${\bf{u}}$ is also uniform on possible configurations. If $u_{ij} \geq 1$, then there is a bond between vertices i and j (when $u_{ij} \geq 1$ definitely $z_{i}=z_{j}$); otherwise there is no further constraint. Therefore, all vertices can be divided into patches (clusters). A patch is a collection of vertices, in which any two vertices are connected by a bond or a sequence of bonds. There is no bond between vertices from different patches. See Figure \ref{fig:patches} for illustrations. The vertices within each patch should belong to the same category and the categories of different patches are i.i.d. from the discrete uniform distribution among all possible categories. \begin{figure}% [htp] \centering \includegraphics[width=0.6\textwidth, totalheight=0.2\textheight]{Graph-patches.png} \caption{Illustration of the concept of patches.} \label{fig:patches} \end{figure} The SW algorithm is the seminal work on algorithms using auxiliary variables (in this case ${\bf u}$) in this area. A slightly generalized version of the original SW algorithm can be used to generate samples from a compound Potts model \citep{Feng:2008}. Given all vertices, there are three steps in each iteration as follows. \begin{enumerate} \item Build bonds among vertices in neighbor with probability \[1-exp(-\beta w_{ij}\textrm{I}(z_{i}=z_{j}))\] \item Build patches for vertices bonded together \item Flip the patches randomly to any color \end{enumerate} The SW algorithm may outperform single-site updating samplers, especially when there is no external field and there is patchiness among vertices, in which simultaneous switching of clusters is necessary. For more details see \cite{Feng:2008}. \subsection{Wolff's Algorithm} \label{sec:232} A modification of the Swendsen-Wang algorithm was proposed by \cite{Wolff:1989}. The difference between the Wolff and the SW is that instead of flipping all patches randomly, one patch is chosen and all vertices in that patch are flipped to their opposites in the simple Potts model. To be more specific, for Wolff's algorithm, a vertex, say $v$, is selected randomly among all vertices, and bonds between $v$ and its neighbors are set the same way as in the SW. If there are bonds between $v$ and its neighbors, say $C_{v}$, then the bonds between vertices in $C_{v}$ and their neighbors are set. Follow the same procedure recursively until no new bonds are created. Now there is a patch around $v$ and all vertices in this patch are flipped to their opposites. Note, there is no randomness involved when flipping vertices for the Ising Model. Although Wolff's algorithm is similar to the SW, its proof is formalized from another perspective where detailed balance and irreducibility are verified (see \cite{Wolff:1989} for details). To generate samples from a compound Potts model, a slighted generalized version of the original Wolff algorithm as follows can be adopted. There are four steps in each iteration. \begin{enumerate} \item Randomly select a vertex \item Build a patch around it with probability \[1-exp(-\beta w_{ij}\textrm{I}(z_{i}=z_{j}))\] \item Continue building the patch till no additional vertex can be bonded together \item Flip the patch randomly to another color \end{enumerate} See \cite{Feng:2008} for more details. \subsection{A Gibbs Sampler Using Conditional Independence} \label{subsec:Gibbs} Various multiple-site sampling methods might outperform single-site updating, but sometimes they might not. Multi-site sampling methods like the SW algorithm could tackle the critical slowing down problem. When there is no external field, from the results pointed out in \cite{Higdon:1998}, the SW algorithm outperforms single-site Metropolis updating when $\beta$ is at the critical value. When there is an external field (a likelihood function in Bayesian inference for example), the SW slows down since it does not make good use of the data. \cite{Hurn:1997} and \cite{Smith/Smith} suggested that when $\beta$ is large, Gibbs samplers might be more effective. The choice of the best sampling method is likely to be problem-specific and there is no clear-cut winner as pointed out in \cite{Hurn:1997} and \cite{Higdon:1998}. Furthermore, the previous studies just focus on relatively small grids in two dimensions with the number of categories equal to 2. What the convergence properties of various algorithms are on a three dimensional large grid with more than 2 categories, the case in MRI analysis for example \citep{Feng:2008}, needs further study. Given that, a Gibbs sampler might be a good choice under certain circumstances. The package provides functions that uses the Gibbs sampler taking advantage of the conditional independence structure to update colors of vertices. It could make the simulation much faster than a one-site-after-another sampling scheme by using the vectorization functions in $\mathbf{R}$. The idea is that if we can divide variables that need to be updated into different blocks and given the variables in other blocks, all the variables within the same block are conditionally independent, then we can update all blocks iteratively with the variables within the same block being updated simultaneously. In Figure (a) of \ref{fig:checkerboard}, under the four neighbor structure in 2D, given the black vertices, the whites are independent and vice versa. By this kind of independence, updating can be done in two steps: one for the blacks, one for the whites. The idea of taking advantage of this kind of independence can be traced back at least to the ``Coding Methods'' in \cite{Besag:1974}. It was described in \cite{Wilkinson:2005} and detailed discussion can be found in \cite{Winkler:2003}. This conditional independence can be generalized to 3D lattices with a six neighbor structure, see (b) of Figure \ref{fig:checkerboard} for illustration. Under six neighbor structure, given the blacks, the whites are independent and vice versa. The minimum number of blocks to make the vertices within each block independent given the other blocks is called the \textit{chromatic number} in \cite{Winkler:2003}. So the chromatic numbers for four neighbor configuration in 2D and six neighbor in 3D are both 2. \begin{figure}[htp] \centering \subfigure[2D]{ \includegraphics[width=0.38\textwidth, totalheight=0.3\textheight] {chessboard.png}} \vspace{.3in} \subfigure[3D]{ \includegraphics[width=0.38\textwidth,totalheight=0.3\textheight] {chessboard-3d-2.png}} \caption{Illustration of Conditional Independence} \label{fig:checkerboard} \end{figure} The extension to the eight neighbor configuration in 2D and eighteen and twenty-six neighbor in 3D is as follows. Under the eight neighbor structure in 2D, the chromatic number is four; under the eighteen neighbor configuration in 3D, the chromatic number is seven; under the twenty-six neighbor configuration in 3D, the chromatic number is eight. For more details, see \cite{Feng:2008}. The function to split vertices into conditional independent blocks is \texttt{getBlocks()}. Right now, for a 2D graph, the vertices can be divided into either 2 or 4 blocks, and for a 3D graph either 2 or 8. The functions using the Gibbs sampler that takes advantage of conditional independence are \texttt{BlocksGibbs()} and \texttt{rPotts1()}. The difference between the two is that the first one can only generate samples from a Potts model without the external field while the second is good for all Potts models. The relationship among neighboring vertices can be defined by specifying the parameter \texttt{spatialMat}. \subsection{The Partial Decoupling Method} Besides the Gibbs sampling, another way to obtain samples from a Potts model with external field is to use the partial decoupling method through the function \texttt{rPotts1()}. The partial decoupling method was originally developed by \cite{Higdon:1993} for an Ising model with an external field. Given the external field, the symmetry property is violated. Instead of drawing $u_{ij}|{\bf z}$ for neighbor points $(i, j)$ from independent uniform distributions on $[0,\exp\{\beta\textrm{I}(z_{i}=z_{j})\}]$, draw $u_{ij}|{\bf z}$ from the uniform distribution $[0,\exp\{\delta_{ij}\beta\textrm{I}(z_{i}=z_{j})\}]$. Now, the bonding probability is not only controlled by $\beta$ alone, but $\delta_{ij}$ as well. When $\delta_{ij} = 0$, it reduces to single site updating; when $\delta_{ij} = 1$, it corresponds to the SW algorithm. The smaller the $\delta_{ij}$, the less likely bonds are formed. After setting the bonds, clusters are usually not independent anymore and Gibbs, Metropolis-Hastings, or even partial decoupling could be used to update the coarser model. The Gibbs sampling is used in the package. The goal of choosing $\delta_{ij}$ is to improve the mixing when sampling from ${\bf z}|{\bf u}$. Strategies for the determination of $\{\delta_{ij}\}$ are based on the nature of the likelihood function of the data. For example, choosing $\delta_{ij}=0$ for vertices on the boundaries of adjacent sub-graphs and $\delta_{ij}=1$ for others can prevent the clusters from growing beyond certain limits. Another choice is $\delta_{ij} = a\textrm{I}\{y_{i}=y_{j}\}$ with $0 < a <1$, where $y_i$ and $y_j$ are observations for vertex $i$ and $j$. The details of the partial decoupling method can be found in \cite{Higdon:1998} and \cite{Hurn:1997}. \section{Calculation of the Normalizing Constant} The normalizing constant is not critical and can be ignored if $\beta$ is known. However, if $\beta$ is treated as unknown, for example in studies using the Bayesian methods \citep{Green/Richardson:2002,Smith/Smith}, then the normalizing constant is necessary when drawing samples from the conditional distribution of $\beta$ by a Metropolis-Hastings algorithm. The package provide a function \texttt{getNC()} to obtain the normalizing constant of a simple Potts model. The method adopted is similar to that in \cite{Green/Richardson:2002} and \cite{Smith/Smith}, in which the thermodynamic integration approach was used. The thermodynamic integration approach comes from the differential equations for describing thermodynamic relationships in physics. Basically, the thermodynamic integration method uses the fact that the normalizing constant can be obtained by solving the differential equation \[ \frac{\partial}{\partial \beta}\log(C(\beta))=E(U({\bf z})|\beta, k) \] where $U({\bf z}) = \sum_{i \sim j}\textrm{I}(z_{i}=z_{j})$ and $k$ is the number of categories. Since \[ \log{C(0)}=N\log{k} \] it follows \begin{equation} \log{C(\beta)} = N\log{k} + \int_{0}^{\beta}E(U({\bf z})|\beta^{'}, k)d\beta^{'} \label{eqn:NC} \end{equation} In order to compute $\log{C(\beta)}$ we use the following steps: \begin{enumerate} \item Take $k$ grid values $\{0 < \beta_{1} < \beta_{2}<\ldots<\beta_{k}\leq \beta\}$. For each $\beta_{i}, i=1,2,\ldots, k$, obtain $n$ simulations of the graph and estimate $E(U({\bf z})|\beta_{i},k)$ by the average of the functions $U({\bf z})$ from $n$ simulations. Possible methods for simulation include Gibbs sampling with single site updating, or the SW, or the Wolff algorithms discussed previously. \item Compute the integral by numerical integration. \end{enumerate} \section{Computational Issues} As mentioned in sub-section \ref{subsec:Gibbs}, it is beneficial to use vectorization functions in $\mathbf{R}$ to fulfill the idea of conditional independence. Furthermore, for the SW algorithm the major computational work is the identification and labeling of the patches of connected vertices. This is an instance of a connected component labeling problem for an undirected graph. We use Rem's algorithm \citep{Dijkstra:1976} to fulfill the task in the function \texttt{getPatches()}. Besides facilitating the SW algorithm, it can be used in other clustering methods as well. See \cite{Heller:2006} for example. \section{Future Work} In the future, the package might be upgraded in several directions. First we might incorporate more available sampling methods for the Potts models. Second, the speed of current functions might be enhanced by using embedded $\mathbf{C}$ functions and parallel computations. Third, changes based on feedback of users. \bibliographystyle{plainnat} \bibliography{Intro} \end{document}