%\VignetteIndexEntry{Ecological inference with R: the ecoreg package} \documentclass{article} %% Need to modify Sweave.sty to pass pdftex option to graphicx. \usepackage{Sweave} \usepackage{times} \addtolength{\textwidth}{2cm} \newcommand{\etal}{{\textit{et al.}}} \def\logit{\mathop{\rm logit}\nolimits} \def\expit{\mathop{\rm expit}\nolimits} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} <>= options(width = 60) version <- gsub("Version: +", "", help(package="ecoreg")$info[[1]][3]) @ \title{Ecological inference with R: the \Rpackage{ecoreg} package \vskip 0.2in \large{Version % <>= cat(version) @ }} \author{Christopher Jackson\\MRC Biostatistics Unit, Cambridge\\ \texttt{chris.jackson@mrc-bsu.cam.ac.uk} \footnote{Written while a member of the BIAS project at the Department of Epidemiology and Public Health, Imperial College London} } \date{March 2006} \bibliographystyle{unsrt} \begin{document} \maketitle \begin{abstract} In typical small-area studies of health and environment we wish to make inference on the relationship between individual-level quantities using aggregate, or ecological, data. Such ecological inference is often subject to bias and imprecision, due to the lack of individual-level information in the data. Simple regressions of area-level mean outcomes on area-level mean exposures are usually biased. To alleviate bias, the within-area distribution of exposures should be accounted for. The \Rpackage{ecoreg} package can be used to fit this class of models for ecological inference from aggregate data. In addition, full outcome and covariate information from a survey of individuals within the areas can be used to improve bias and precision. \Rpackage{ecoreg} can be used in this way to analyse ecological and individual data simultaneously, using \emph{hierarchical related regression}. \end{abstract} \section{Ecological inference} \label{sec:ecological} Ecological studies analyse data defined at a group level, but aim to make inferences about the individuals within the groups. To make reliable individual-level inferences from these studies, a number of problems must be overcome. One crucial difficulty is that the group-level exposure-response relationship may not reflect the individual-level relationship, a problem known as {\em ecological bias}, or the {\em ecological fallacy}. See, for example, \cite{richardson87,GreenlandM,GreenlandR,richardson:monfort} for discussion of these issues. Denote the outcome count in area $i$, with population $N_i$, by $y_{i}$. To model $y_{i}$ in terms of exposures measured as aggregate-level summaries, the usual model is a simple binomial or Poisson regression on the area-level covariate means $\overline z_{i}$. With binary covariates, $\overline z_{i}$ is the proportion exposed over the area. However, this only models the relationship between the \emph{aggregate} exposures and outcomes. It is only justified as a method of estimating individual-level relationships if all individuals in the area have the same covariate value, or there is the same exposure-response relationship at the individual and aggregate levels, which is generally only true for linear models. With non-linear models, such as Binomial or Poisson models, using the same model form at both levels will lead to ecological bias \cite{richardson87}. Despite these problems, aggregate data \emph{can} provide information about individual-level relationships. As the ratio of the between-area to the within-area variability of the exposure increases, the aggregate data summarise the true distribution of the exposure more accurately, and contain more information about the true individual-level exposure-outcome relationship. With sufficient exposure information, and with \emph{correctly-specified} models for the mean outcome \cite{richardson87,wakefield:salway,PrenticeS}, ecological bias can be reduced to negligible levels. In general, successful ecological inference requires samples of individual-level data within areas. Individual exposure data are usually required to reduce ecological bias by accounting for the within-area variability of exposures. But further improvements can also be made by using samples of exposures and \emph{outcomes} for selected individuals, as discussed by Wakefield~\cite{wakefield2by2} for 2 $\times$ 2 tables, and more generally by Jackson \etal \cite{improving:eco}. \section{Models for aggregate and individual data} \label{sec:models} The models described here are also described in the papers by Jackson \etal \cite{improving:eco} \cite{my:hrr}. \subsection{Individual data} \label{sec:indiv} We begin by specifying the form of the relationship between the individual-level risk of the binary outcome and the covariates. If individual-level exposure and outcome data are available, this is used to model them. It will also be used as the basis for an equivalent model for the aggregate data, as described in the next section. The risk $p_{ij}$ of the individual-level outcome $y_{ij}$ for the $j$th individual in area $i$ is assumed to be a logit-linear function of the covariates. The most general model we consider is \begin{equation} \label{eq:indiv} \logit(p_{ij}) = \mu_i + \sum_r \alpha_r x_{ir} + \sum_r \beta_r z_{ijr} + \gamma_{s_{ij}} \end{equation} where $x_{ir}$ are group-level covariates, and $z_{ijr}$ are individual-level covariates. The group-level covariates may include descriptions of the socio-economic status of the area, or the health service provision in the area. Individual-level covariates might comprise individual behaviours such as smoking, demographics such as ethnicity, or individual indicators of wealth and social class. Individuals may be influenced by the overall average exposure in the area, in addition to their own, so that the group-level variables may include the means of certain individual-level variables. $\gamma_s$ represents an additional contribution to the baseline risk for an individual occupying one of several strata $s$, usually defined by age and sex. The baseline risk $\mu_i$ may be fixed at $\mu$ or considered as a random effect with some distribution across areas. This can account for any remaining overdispersion and heterogeneity between areas, after adjusting for observed area-level variables. A random effect also allows the borrowing of information across areas and can stabilise estimation from areas with small populations \cite{richardson:best}. \subsection{Aggregate data} \label{sec:agg} \subsubsection{Marginal model} Suppose the area-level exposures have been estimated from a survey. For example, in the UK census, aggregate data on social class and education are calculated using a 10\% sample to maintain confidentiality. The proportion of smokers in the area might also be estimated from sales figures instead of a census. Then, individual outcomes can be assumed to be independent and identically distributed, with risk equal to some marginal ``group-level risk'' $p_{i}$. We assume \begin{equation} \label{eq:bin} y_{i} \sim Bin(N_{i}, p_{i}), \end{equation} where $p_{i}$ is determined by integrating the individual-level model over the joint within-area distribution of covariates \cite{richardson87,wakefield:salway}. Thus, $p_i$ is the average risk for an individual in group $i$. \begin{equation} \label{eq:integral} p_i = \int p_{ij}(\mathbf x) f_i(\mathbf x) d \mathbf x = E_{\mathbf x} (p_{ij}(\mathbf x) | i) \end{equation} For a \emph{single binary covariate}, observing the proportion exposed gives us enough information to estimate a binomial within-area distribution. For \emph{continuous covariates} the mean is not sufficient to estimate the within-area distribution, and we generally need samples of individual covariate data to be able to estimate the within-area variability. For \emph{multiple binary covariates} the joint distribution is estimated by the cross-classification of individuals between covariates. Typical census data do not usually have this cross-classification, and we need individual data to estimate it. \subsubsection{Conditional model} An alternative to the marginal model was proposed by Wakefield \cite{wakefield2by2}. The binomial model~(\ref{eq:bin}) is based on the assumption that each individual in group $i$ has an identical marginal probability $p_i$ of outcome, integrating over their unknown exposures. If the binary exposure is known for \emph{all} the individuals in the area from a full population census, but not necessarily coupled with exposures for the same individuals, then these should be conditioned on, leading to a likelihood based on a convolution of binomial distributions. The exact likelihood is related to the extended (or non-central) hypergeometric distribution, however, Wakefield describes a more computationally convenient normal approximation. \subsubsection{Binary covariates} For clarity we demonstrate the marginal model for 3 binary covariates, however, the framework extends immediately to any number of binary covariates. The integral to obtain the group-level risk is equivalent to a sum. Each individual falls into one of $S \times 2^3$ categories, defined by the distinct combinations of the 3 covariates and the $S$ age-sex strata, and indexed by $k$. Let $\phi_{ik}$ be the probability that an individual occupies category $k$. Let $q_{ik}$ be the probability of the individual outcome conditionally on occupying category $k$. Rewriting the index $k$ as $\{k_1, k_2, k_3, s\}$, where $k_r$ (0 or 1) indicates the presence or absence of covariate $r = 1, \ldots, 3$. \begin{equation} \label{eq:bin3} p_{i} = \sum_{k} \phi_{ik} q_{ik} = \sum_{k_1, k_2, k_3, s} \phi_{\{i, k_1, k_2, k_3, s\}} q_{\{i, k_1, k_2, k_3, s\}}. \end{equation} The outcome model conditionally on the unobserved category is \begin{equation} \label{eq:q} \logit ( q_{\{i, k_1, k_2, k_3, s\}} ) = \mu_i + \logit(e_s) + \sum_r \alpha_r x_{ir} + \sum_r k_r \beta_{r} \end{equation} The effect $\beta_r$ of the binary individual-level covariate $r$ only enters into this equation when the covariate $r$ is present. This is a generalisation of the model presented for two covariates by Lasserre \etal \cite{lasserre:binary}, except that the outcome is assumed to be non-rare and binomial instead of Poisson. $\logit(e_s)$ is a fixed offset, where $e_s$ is the risk of the outcome in stratum $s$, estimated from national population data. This is similar in spirit to ``indirect standardisation'', see, for example, \cite{clayton:hills}. If population and outcome totals are available for strata within areas, then we could generalise this to a separate binomial model for each stratum within area, with coefficients for the other covariates shared between the models, assuming no stratum-covariate interactions. It is also important to check whether \emph{exposures} are correlated with strata. If not accounted for, this can lead to ``mutual standardisation bias'' \cite{rosenbaum:rubin:stand}. We normally replace $\phi_{ik}$ in (\ref{eq:bin3}) by an estimate $\hat \phi_{ik}$. Ideally this \emph{probability} would be estimated by the \emph{proportion} of individuals in area $i$ occupying category $k$. But typical census data are not sufficient to know the complete cross-classification of individuals between all of these covariate and strata categories. Usually, our only information is the marginal proportions of single covariates, for example, the proportion $\phi_i^{(1)}$ of individuals who are economically inactive. If we assume that the covariates are independent, then $\phi_{ik}$ can be estimated by the product of the proportions of individuals occupying each marginal category defining $k$. But generally, socioeconomic indicators, such as unemployment and social class, are highly correlated. \cite{lasserre:binary} demonstrated that, in a typical case, bias is negligible when the joint covariate distribution is estimated by the product of the two marginal distributions, even when the covariates are correlated. However, we may wish to study more than two covariates. To estimate $\phi_{ik}$ we can often use a combination of marginal proportions $\overline z_{ir}$ and individual covariate data $z_{ijr}$. In the context of the UK census, for example, these correspond to district-level aggregate data and the Samples of Anonymised Records. Let $C_{ik}$ be the number of individuals in area $i$ in this individual-level dataset occupying category $k$, computed from the $z_{ijr}$. We use the following two principles. Firstly, the estimated $\hat\phi_{ik}$ corresponding to categories $k$ in which binary covariate $r$ is 1 must sum to $z_{ir}$. Secondly, the ratio of estimates $\hat\phi_{i_k} / \hat\phi_{i_l}$ must be the same as the ratio $C_{ik} / C_{il}$. This gives, for example, where $R$ is the set of categories in which covariate $r$ is 1, \begin{equation} \label{eq:phiik} \hat \phi_{ik} = C_{ik} \overline z_{ir} / \sum_{l \in R} {C_{il}} \end{equation} \subsubsection{Continuous covariates} Suppose now the individual-level model~(\ref{eq:indiv}) depends only on an intercept and one continuous individual-level covariate $x_{ij}$: $\logit(p_{ij}) = \mu_i + \beta x_{ij}$. The ecological data consist of the within-area mean $m_i$ of $x_{ij}$. In some cases, as well as the within-area mean, we may also have an estimate $s_i^2$ of the within-area variance of $x_{ij}$, for example, from geographical modelling of an environmental exposure surface (e.g. Best \etal \cite{best:leuk}). Then we suppose that these exposures are normally distributed, with $x_{ij} \sim N(m_i, s_i^2)$. If an exposure is not naturally normally distributed, it can often be transformed to normality. We can then calculate the area-specific risks~(\ref{eq:integral}) by integrating over $f_i$, here the density function of the normal distribution. \begin{eqnarray} \label{eq:qint} p_i & = & \int \expit(\mu_i + \beta x) f_i(x) dx \end{eqnarray} Our assumed underlying model for $p_{ij}(x)$ is logit-linear. In this case, the integral is not available in closed form. However, if we approximate the logit by a probit link function, then (\ref{eq:qint}) evaluates to \begin{eqnarray} \label{eq:normal:binomial} p_i & = & \expit \left \{ (1 + c^2 \beta^2 s_i^2) ^ {-1/2} (\mu_i + \beta m_i) \right \} \end{eqnarray} where $c = 16\sqrt{3} / (15 \pi)$ (Salway and Wakefield \cite{nonrare}). If, instead, we were using a Poisson model for a rare outcome, $y_i \sim \mbox{Pois}(N_i p_i)$, and a log-linear individual-level model $\log(p_{ij}) = \mu_{i} + \beta x_{ij}$, then the integrals can be evaluated explicitly without an approximation. Instead of~(\ref{eq:normal:binomial}), we would have (Richardson \etal \cite{richardson87}) \begin{eqnarray} \label{eq:normal:poisson} p_i = \exp \left \{ \mu_i + \beta m_i + \frac{\beta^2 s_i^2}{2} \right \} \qquad \end{eqnarray} These methods generalise to multiple jointly normally-distributed covariates. \subsubsection{Semi-parametric approach} We have described a fully parametric model for ecological inference. Prentice and Sheppard \cite{PrenticeS} described an alternative semi-parametric approach based on estimating functions. This is often simply called the \emph{aggregate data} method. Suppose a sample of covariates (binary or continuous) are available from a subset of $n_i$ individuals, but not the corresponding outcomes on the same individuals. Broadly, the mean and variance of the total disease count $y_i$ are calculated in terms of an \emph{aggregate} risk $\frac{1}{n_i}\sum_j p_{ij}$. $p_{ij}$ is the risk for individual $j$ in area $i$, conditionally on their covariate values. This method is not implemented in the \Rpackage{ecoreg} R package. This approach does not require a within-area distribution to be specified for the covariates. It requires samples of covariate data as an explicit part of the model. On the other hand, the parametric approaches described above, and implemented in \Rpackage{ecoreg}, require individual covariate data implicitly to estimate an appropriate within-area distribution. \subsection{Combining aggregate and individual data} \label{sec:aggindiv} To summarise the model for the ecological data, we have a binomial model~(\ref{eq:bin}) for the area-level outcome $y_i$. The corresponding area-level risk $p_i$ is calculated explicitly in terms of the transformed group baseline risk $\mu_i$, the individual-level covariate effects, and the within-area distributions of the covariates. It is easy to extend this model to include information from a sample of individuals in each area whose outcomes and exposures are known. We simply model the risk of the individual level outcome with the logistic regression~(\ref{eq:indiv}). Then the covariate effects $\alpha$ and $\beta$ and the intercept $\mu_i$ are shared by the models for both the aggregate and individual-level data. Thus, we can fit a joint model which combines the information from the two sources of data. This is termed \emph{hierarchical related regression}. Note that it is not necessary to have individual data within all of the areas $i$. In practice, sample survey data will be available from varying numbers of individuals between areas. \section{Using \Rpackage{ecoreg}} \label{sec:ecoreg} The \Rpackage{ecoreg} package implements a fairly general case of the models described in Section~\ref{sec:models}. We assume you have already downloaded and installed the \Rpackage{ecoreg} package. To apply these methods, you should have one or both of \begin{itemize} \item An aggregate dataset with one record for each aggregate group, for example a geographical area, or a stratum within area, for example from a population census. This contains aggregate outcomes and exposures, and optionally some indication of the within-area distribution of the exposures. \item An individual-level dataset, for example from a sample survey study. There need not be the same number of individuals per area, and there may be some areas in the aggregate dataset with no individuals. This contains full individual-level covariate and outcome data. \end{itemize} First we load the \Rpackage{ecoreg} package into the working R session. <<>>= library(ecoreg) @ The main R function in \Rpackage{ecoreg} is called \Rfunction{eco}. This is used to fit a model to one of these datasets, or a combination of the two. The R help page for \emph{eco} fully describes each of the function's arguments. \subsection{Example} The use of \Rfunction{eco} is illustrated here using a simulated dataset. We simulate aggregate data consisting of 50 groups of 100 individuals. Two contextual covariates (labelled deprivation and mean income) are generated as standard normal variables. Two covariates which are binary at the individual level, and available at the aggregate level as the proportion of non-white individuals and smokers in each area, are generated from uniform distributions. The data frame \Robject{sim.df} contains the ecological covariate data. <<>>= ng <- 50 N <- rep(100, ng) set.seed(31412) ctx <- cbind(deprivation = rnorm(ng), mean.income = rnorm(ng)) phi <- cbind(nonwhite = runif(ng), smoke = runif(ng)) sim.df <- as.data.frame(cbind(ctx, phi)) sim.df[1:5,] @ A disease outcome with approximate 5\% baseline prevalence, and odds ratios of 1.01, 1.02, 1.5 and 2 respectively for the four covariates, is now simulated. The function \Rfunction{sim.eco} is provided to simulate ecological outcome data and individual sample data, in terms of known covariates, baseline risks and odds ratios. <<>>= mu <- qlogis(0.05) alpha.c <- log(c(1.01, 1.02)) alpha <- log(c(1.5, 2)) sim1 <- sim.eco(N, ctx=~deprivation+mean.income, binary=~nonwhite+smoke, data = sim.df, mu=mu, alpha.c=alpha.c, alpha=alpha, isam=10) sim1$y[1:5] aggdata <- as.data.frame(cbind(y=sim1$y, N=N, sim.df)) @ \paragraph{Format of aggregate dataset} We have now got an aggregate dataset \Robject{aggdata} suitable for use with the \Rfunction{eco} function. Generally, the aggregate dataset should be a data frame with a row corresponding to an area or group. It should have at mininum an outcome variable. This is available either as a proportion of individuals with the outcome, or the number of events and the population at risk in the area. In addition, any number of aggregate covariates can be specified. Often these will be binary covariates, expressed as proportions over the area. For covariates that are continuous at the individual level, these should be specified as within-area means, and if possible, within-area variances, after transformation to an approximate normal distribution. For example, we print the first ten rows of the \Robject{aggdata} data. <<>>= aggdata[1:5,] @ The number of individuals with the disease, the population of the area, the deprivation index, the mean income, the proportion of non-white individuals, the proportion of smokers, and the mean and standard deviation of the pollution exposure are labelled \Robject{y}, \Robject{N}, \Robject{deprivation}, \Robject{mean.income}, \Robject{nonwhite} and \Robject{smoke}, respectively. The return value of \Rfunction{sim.eco} has a component \Robject{y} containing the ecological outcome data (the number of individuals in each area with the outcome), and a component \Robject{idata} containing the individual sample data. Here we have specified \Rfunarg{isam=10} in the call to \Rfunction{sim.eco}, producing an individual sample dataset with 10 individuals for each of the 50 areas. <<>>= indivdata <- sim1$idata @ \paragraph{Format of individual dataset} We have now got an individual dataset \Robject{indivdata} suitable for use with the \Rpackage{ecoreg} package. The individual dataset should be a data frame with each row corresponding to an individual. Variables may include a binary outcome and any number of covariates. For example, the first 15 rows of \Robject{indivdata} are illustrated. <<>>= indivdata[1:15,] @ The area indicator, the disease status of the individual, the deprivation index and the mean income of the area in which the individual lives, indicators for non-white ethnicity and whether the individual smoked, the pollution exposure and the area indicator are labelled \Robject{group}, \Robject{y}, \Robject{deprivation}, \Robject{mean.income}, \Robject{nonwhite} and \Robject{smoke} respectively. Binary indicators are 0 or 1 corresponding to no and yes respectively. The area indicator is only necessary when using models with random area effects. \subsection{Calling \Rfunction{eco}} Now we give examples of calling the \emph{eco} function to fit models to the aggregate and individual datasets. \subsubsection{Aggregate data alone} Firstly, we fit the correct model to the simulated data, with two contextual covariates and two individual binary covariates, using the aggregate data alone. <<>>= agg.eco <- eco(cbind(y, N) ~ deprivation + mean.income, binary = ~ nonwhite + smoke, data = aggdata) @ \begin{itemize} \item The first argument of \Rfunction{eco} is a formula, as used in most statistical modelling functions in R such as \Rfunction{lm} and \Rfunction{glm}. It specifies the \emph{aggregate} component of the model, that is, the names of any covariates included in $x_{ir}$ (equation~\ref{eq:indiv}). \item The argument \Rfunarg{data} specifies a data frame which should contain all aggregate variables specified in the call to \Rfunction{eco}. \item The \Rfunarg{binary} argument to \Rfunction{eco} is a formula whose right hand side should contain the names of any aggregate covariates considered as \emph{individual-level} rather than contextual effects, here, non-white ethnicity and smoking. These should be binary at the individual level. \Rfunction{eco} will fit the marginal model (\ref{eq:bin}--\ref{eq:integral}) by default for binary covariates. To fit the convolution normal-approximation model of Wakefield~\cite{wakefield2by2} specify \Rfunarg{model=conditional} in the call to \Rfunction{eco}. \end{itemize} The \Rfunction{eco} function returns objects of class \Rfunction{ecoreg}. Printing an object of this class displays the estimated odds ratios $\exp(\alpha)$ associated with aggregate-level covariates, and odds ratios $\exp(\beta)$ associated with individual covariates (equation~\ref{eq:indiv}), along with their 95\% confidence intervals, and $-2 \times$ the maximised log-likelihood. In this example, the estimates are close to the true values used for simulating the data. <<>>= agg.eco @ \subsubsection{Combining aggregate and individual data} Next we combine the aggregate data with the information from samples of individuals, as described in Section~\ref{sec:aggindiv}. Firstly, we form a reduced aggregate dataset, removing the individual outcomes and population totals which appear in the individual data. We assume that the sampled individuals did not contribute to the estimation of the aggregate covariates. <<>>= aggdata.sub <- aggdata aggdata.sub$y <- aggdata$y - tapply(indivdata$y, indivdata$group, sum) aggdata.sub$N <- aggdata.sub$N - 10 @ Again, we fit the correct model to the simulated data, with two contextual covariates and two individual binary covariates. The individual-level regression model is given in the \Rfunarg{iformula} argument. The name of the individual-level dataset, in which the variables in the individual-level model should appear, is given in the \Rfunarg{idata} argument. <<>>= agg.indiv.eco <- eco(cbind(y, N) ~ deprivation + mean.income, binary = ~ nonwhite + smoke, iformula = y ~ deprivation + mean.income + nonwhite + smoke, data = aggdata.sub, idata=sim1$idata) agg.indiv.eco @ In this example, combining with the individual sample data does not noticeably improve the precision of the estimates. \subsubsection{Importance of the between-area exposure contrasts} Suppose now that we simulate data with a much lower between-area exposure variance, such as a uniform(0, 0.2) distribution for proportion of non-white ethnicity and uniform(0.1, 0.3) for proportion of smokers. The aggregate data now contain less information about the individual-level effects. The amount of individual-level information in ecological data decreases as the between-area to the within-area variability of the exposure decreases. When there are low exposure contrasts between areas, inference may be improved by combining the ecological data with individual-level data~\cite{improving:eco}. When fitting the true individual model to the aggregate data alone, we obtain highly imprecise estimates for the individual-level effects. <<>>= phi <- cbind(nonwhite = runif(ng, 0, 0.2), smoke = runif(ng, 0.1, 0.3)) sim.df <- as.data.frame(cbind(ctx, phi)) sim1 <- sim.eco(N, ctx=~deprivation+mean.income, binary=~nonwhite+smoke, data = sim.df, mu=mu, alpha.c=alpha.c, alpha=alpha, isam=10) aggdata <- as.data.frame(cbind(y=sim1$y, N=N, sim.df)) indivdata <- sim1$idata agg.eco <- eco(cbind(y, N) ~ deprivation + mean.income, binary = ~ nonwhite + smoke, data = aggdata) agg.eco @ To be able to estimate the individual-level effects more accurately, we combine the aggregate data with the individual-level sample data. <<>>= aggdata.sub <- aggdata aggdata.sub$y <- aggdata$y - tapply(indivdata$y, indivdata$group, sum) aggdata.sub$N <- aggdata.sub$N - 10 agg.indiv.eco <- eco(cbind(y, N) ~ deprivation + mean.income, binary = ~ nonwhite + smoke, iformula = y ~ deprivation + mean.income + nonwhite + smoke, data = aggdata.sub, idata=indivdata) agg.indiv.eco @ This raises the question of whether the aggregate data do contribute any information in this case, and whether we can do just as well by analysing the individual data alone. But we find that precision is lower when only the individual data are included. <<>>= indiv.eco <- eco(iformula = y ~ deprivation + mean.income + nonwhite + smoke, idata=indivdata) indiv.eco @ \subsubsection{Normally-distributed covariates} We now add a continuous covariate to the simulated data, representing estimated exposure to air pollution. This will have a constant within-area standard deviation of 2, and within-area means varying around 1.24 with between-area standard deviation 0.1. A disease outcome is simulated with an odds ratio of 1.2 for one unit of pollution exposured, and the same odds ratios as before for the other covariates. <<>>= phi <- cbind(nonwhite = runif(ng), smoke = runif(ng)) sim.df <- as.data.frame(cbind(ctx, phi)) sim.df$poll <- rnorm(ng, 1.24, 0.1) sim.df$poll.sd <- rep(0.2, ng) sim1 <- sim.eco(N, ctx=~deprivation+mean.income, binary=~nonwhite+smoke, m=sim.df["poll"], S=sim.df["poll.sd"], beta=log(2), data = sim.df, mu=mu, alpha.c=alpha.c, alpha=alpha, isam=10) aggdata <- as.data.frame(cbind(y=sim1$y, N=N, sim.df)) aggdata[1:5,] @ The area mean of the covariate is called \Robject{poll} in the aggregate dataset, and the area standard deviation is called \Robject{poll.sd}. We now model these data as before, including pollution as another individual-level covariate in the model. <<>>= agg.eco <- eco(cbind(y, N) ~ deprivation + mean.income, normal= ~ poll, norm.var=poll.sd, binary = ~ nonwhite + smoke, data = aggdata) agg.eco @ The \Rfunarg{normal} argument to \Rfunction{eco} is a formula whose right-hand side should contain variables denoting the group-level means of the normally-distributed covariates. These covariates will then be fitted as individual-level effects, using a model of the form of equation~(\ref{eq:normal:binomial}) by default, or~(\ref{eq:normal:poisson}) if the outcome is rare and \Rfunarg{model = ``poisson''} is specified. The \Rfunarg{norm.var} is used to supply the corresponding group-level variances. The true odds ratio of 2 for pollution is fairly well estimated. Now suppose the pollution data had a lower ratio of between-area to within-area standard deviation. <<>>= sim.df$poll <- rnorm(ng, 1.24, 0.1) sim.df$poll.sd <- rep(0.2, ng) sim1 <- sim.eco(N, ctx=~deprivation+mean.income, binary=~nonwhite+smoke, m=sim.df["poll"], S=sim.df["poll.sd"], beta=log(2), data = sim.df, mu=mu, alpha.c=alpha.c, alpha=alpha, isam=10) aggdata <- as.data.frame(cbind(y=sim1$y, N=N, sim.df)) agg.eco <- eco(cbind(y, N) ~ deprivation + mean.income, normal= ~ poll, norm.var=poll.sd, binary = ~ nonwhite + smoke, data = aggdata) agg.eco @ Now the confidence interval for the pollution odds ratio estimate is much wider, as there is less information in the aggregate data. However, the estimate is not biased. Again, to improve precision we can incorporate the individual-level data. Remember to include pollution in the individual level model \Rfunarg{iformula}. <<>>= indivdata <- sim1$idata indivdata[1:5,] aggdata.sub <- aggdata aggdata.sub$y <- aggdata$y - tapply(indivdata$y, indivdata$group, sum) aggdata.sub$N <- aggdata.sub$N - 10 agg.indiv.eco <- eco(cbind(y, N) ~ deprivation + mean.income, normal= ~ poll, norm.var=poll.sd, binary = ~ nonwhite + smoke, data = aggdata.sub, iformula = y ~ deprivation + mean.income + nonwhite + smoke + poll, idata=indivdata ) agg.indiv.eco @ The precision of the estimate for pollution is improved. \subsubsection{Within-area distribution of binary covariates} To build the model~(\ref{eq:integral}) with more than one binary covariate, by default \Rfunction{eco} assumes that the covariates are independent within areas. Often this assumption is not appropriate, especially when considering, for example, socio-economically related factors. To account for the joint within-area distribution of a set of binary or categorical covariates, use the \Rfunarg{cross} argument to \Rfunction{eco}. This should be a matrix containing the same number of rows as the aggregate data, and number of columns equal to the distinct number of covariate categories into which an individual can belong. For full details of how to specify \Rfunarg{cross}, refer to the help page for the \Rfunction{eco} function. The \Rfunarg{cross} needs to be calculated by the user before calling \Rfunction{eco}. Individual data may be required to estimate \Rfunarg{cross}, as typical census data do not give detailed cross-classification tables. This is now illustrated with a hypothetical example. We introduce another covariate called \Robject{soclass} into the \Robject{aggdata} data representing the proportion of individuals in a lower social class. This is likely to be correlated with smoking at both aggregate and individual level. Suppose that we have an information from an individual-level survey, suggesting that an individual is twice as likely to smoke if in a lower social class. We wish to use this information to construct a matrix with 100 rows and four columns, representing estimates of the proportion of individuals who are in each of four categories: \begin{enumerate} \item neither smoke nor are in the lower social class ($p_{00}$) \item smoke but are not in the lower social class ($p_{10}$) \item do not smoke but are in the lower social class ($p_{01}$) \item smoke and are in the lower social class. ($p_{11}$) \end{enumerate} Let $p_A$ be the proportion of smokers, and $p_B$ be the proportion of individuals in the lower social class, in an area. We know that $p_{10} + p_{11} = p_A$, $p_{01} + p_{11} = p_B$ and, from the survey, $(p_{11}/p_B) / (p_{10}/(1 - p_B)) = 2$. This leads, for example, to \[ p_{11} = 2 p_A p_B / (1 + p_B) \] This is enough information to construct all the estimated cross-classification probabilities, as illustrated by the following R code. <<>>= aggdata$soclass <- plogis(qlogis(aggdata$smoke) + runif(ng, -1, 1)) pa <- aggdata$smoke pb <- aggdata$soclass p11 <- pa*pb* 2 /(1 + pb) p10 <- pa - p11 p01 <- pb - p11 p00 <- 1 - (p01 + p10 + p11) cross <- cbind(p00, p10, p01, p11) cross[1:5,] @ This is the cross-classification matrix that we can supply to \Rfunction{eco} if we wanted to construct an ecological model including both smoking and social class, for example <>= eco(cbind(y, N) ~ 1, binary = ~ smoke + soclass, cross=cross, data=aggdata) @ To account for the joint within-area distribution of a set of continuous covariates, use the \Rfunarg{norm.var} argument to \Rfunction{eco} to specify the joint covariance matrix of the covariates, assumed normally distributed, for each area. For further details of how to specify \Rfunarg{norm.var} in this way, refer to the help page for the \Rfunction{eco} function. \Rpackage{ecoreg} does not support specifying the within-area covariance between binary and continuous covariates. These are assumed to be independent in the model. \subsubsection{Stratification} Suppose that outcome data and covariate data are available by age and To account for differing disease risks in strata defined by age and sex, using fixed offsets determined from the whole population (as in equations \ref{eq:indiv} and \ref{eq:q}), use the \Rfunarg{strata}, \Rfunarg{pstrata} and \Rfunarg{istrata} arguments to \Rfunction{eco}. \begin{itemize} \item \Rfunarg{pstrata} should be a vector with one element for each stratum, giving the assumed baseline outcome probabilities for the strata. \item \Rfunarg{strata} should be a matrix with the same number of rows as the aggregate data. Rows represent areas, and columns represent the strata occupancy \emph{proportions} for those areas (which are used as estimates of the observed occupancy \emph{probabilities}). Alternatively, to account for within-area correlation between strata membership and binary covariate status, the cross-classification between strata and covariates can be specified in the \Rfunarg{cross} argument. See the help page to \Rfunction{eco}. \item If individual data are modelled, \Rfunarg{istrata} should be a variable containing the individual-level variable indicating the stratum an individual occupies. This should be a factor, whose levels correspond to the columns of the matrix \Rfunarg{strata}. \end{itemize} \subsubsection{Categorical covariates} As well as binary covariates, categorical covariates can also be fitted as individual-level predictors. The aggregate data for categorical covariates must be supplied separately from the main aggregate dataset, in the \Rfunarg{categorical} argument to \Rfunction{eco}. See the help page for \Rfunction{eco}. In practice, there is not likely to be enough information in ecological data for successful ecological inference on categorical variables with large numbers of categories. \subsubsection{Random effects models} By default, \Rfunction{eco} assumes the baseline risk $\mu_i$ (equation~\ref{eq:indiv}) is constant $\mu$ between areas $i$. Optionally, \Rfunction{eco} can also fit $\mu_i$ as a normally-distributed random effect. If \Rfunarg{random=TRUE} is specified in the call to \Rfunction{eco}, an area-level random intercept is included in the model. In this case the data should indicate which area each row of the data corresponds to. In the individual data, \Rfunarg{igroups} should give the name of a variable containing the group identifiers of the individual-level data. In the aggregate data, by default, the groups are the row numbers of the dataset. Alternatively, \Rfunarg{groups} specifies a group-level variable containing the group identifiers to be matched with the groups given in \Rfunarg{igroups}. \Rfunction{eco} uses adaptive Gauss-Hermite integration \cite{liu:pierce} to fit random effects models. The Gauss-Hermite integration can be controlled by the arguments \Rfunarg{gh.points} and \Rfunarg{iter.adapt} to \Rfunction{eco}. \Rfunarg{gh.points} gives the number of points to use for quadrature, while \Rfunarg{iter.adapt} gives the number of iterations to use for the adaptive phase of the algorithm. Random effects model fitting is relatively slow, and it may be useful to view the progress of the model fitting by specifying a \Rfunction{control} argument, such as \Rfunction{control=list(trace=1, REPORT=1)}. This is passed from \Rfunction{eco} to \Rfunction{optim}, the R function which performs optimisation of the likelihood. See \Rfunction{help(optim)} for further options to control optimisation. \section{Warnings and limitations} \begin{itemize} \item It is easy to over-fit models, especially with several covariates. Often there is not enough information available in aggregate data. \item When fitting many covariates, it is essential to account for their within-area distribution. \item Continuous covariates must be normally-distributed or able to be transformed to normality. \item Only limited error-checking is performed at the moment. \Rfunction{eco} may fail with an incomprehensible error message if the model or data are specified wrongly or inconsistently. \end{itemize} \section{\Rpackage{eco} reference guide} The R help page for \Rfunction{eco} gives details of all the allowed arguments and options to the \Rfunction{eco} function. To view this online in R, type: <>= help(eco) @ Similarly, all other functions in the package have help pages, which should always be consulted in case of doubt about how to call them. The web-browser based help interface may be convenient - type <>= help.start() @ and navigate to \textsf{Packages} $\ldots$ \textsf{ecoreg}, which brings up a list of all the functions in the package with links to their documentation, and a link to this manual in PDF format. \section{Similar software} \begin{itemize} \item WinBUGS (\texttt{http://www.mrc-bsu.cam.ac.uk/bugs}) can be used to fit these models from a Bayesian perspective using Markov Chain Monte Carlo simulation. While computationally slower, this approach is amenable to extension to account for complexities such as random intercepts, random coefficients, spatial correlation and measurement error. WinBUGS files containing worked examples of a range of these models, using methods described by Jackson \etal~\cite{improving:eco}~\cite{my:hrr} are provided at \texttt{http://www.bias-project.org.uk/software}. \item The R package \Rpackage{eiPack} implements ecological inference for R $\times$ C contingency tables. \item The R package \Rpackage{MCMCpack}, available from CRAN, implements ecological inference for $2 \times 2$ tables using a Bayesian hierarchical model described by Wakefield \cite{wakefield2by2}. \item The R package \Rpackage{eco}, available from CRAN, implements ecological inference for $2 \times 2$ tables, using methods described by Imai and Lu~\cite{imai:lu}. \item \emph{EI} and \emph{EzI} \cite{king:ei} by Kenneth Benoit and Gary King, implementing methods from King~\cite{king:solution}. \end{itemize} \bibliography{abbrev,ecoreg} \end{document}