---
title: "Using Survey Weights with nhanesA"
author: "Robert Gentleman and Teresa Filshtein Sonmez"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using Survey Weights with nhanesA}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

### NHANES and Survey Weights

NHANES aims to produce national estimates on a range of health, nutrition, and other factors that accurately represent the non-institutionalized civilian U.S. population. The data is gathered using a complex, four-stage sampling design. When analyzing this data, it's crucial to use specialized survey analysis techniques that incorporate specific weights. These weights account for the intricate sampling strategy employed during data collection.

You can find the NHANES sample weights as part of the downloadable data. For a detailed understanding of these weights and their correct usage, the CDC provides extensive documentation: [CDC NHANES Tutorials](https://wwwn.cdc.gov/nchs/nhanes/tutorials/default.aspx).

It's worth noting that the methodologies and strategies for NHANES have evolved over time. Reasons for these changes range from methodological enhancements, external events like the COVID epidemic, to a recent shift towards a more longitudinal perspective. Such changes can significantly influence any analysis. Thus, analysts should approach this data with diligence and care.

There are numerous online resources that offer worked examples for further understanding:

1. The official resource for the R `survey` package: [R-Survey](http://r-survey.r-forge.r-project.org/survey/).
2. An up-to-date resource by the Advanced Research Computing group at UCLA: [Survey Data Analysis with R](https://stats.oarc.ucla.edu/r/seminars/survey-data-analysis-with-r/).
3. A growing collection of presentations on platforms like YouTube.

This vignette's intent is not to serve as an exhaustive guide. Instead, we aim to showcase some fundamental functions and point you towards more in-depth online resources necessary for specialized analyses. Here we will cover the basics of setting up the survey design and performing basic tasks. For those interested in diving deeper, we encourage you to develop and share more comprehensive documents.

### Example

Using the `nhanesA` and `survey` packages, we will explore the average blood pressure of individuals over 40 years of age, segmented by their reported ethnicity, during the 2017-2018 cycle. To achieve this, we'll merge the demographic (`DEMO_J`) and blood pressure (`BXP_J`) datasets. Our examination will cover average blood pressure distributions across different sexes and ethnicities, and we'll conduct a regression analysis to understand the relationship between blood pressure and age. This is a basic demonstration; comprehensive analyses would need to consider various risk factors and align with specific epidemiological queries.


#### Pull in the relevant data 

```{r loadlibs, eval = FALSE, message = FALSE, warning = FALSE}
library("nhanesA")

demoj = nhanes("DEMO_J")
dim(demoj)
```

```{r ll_print, echo = FALSE}
cat(sprintf("[1] 9254   46"))
```

```{r merge_example, eval = FALSE}
## merge DEMO_J and BXP_J using SEQN.  
bpxj = nhanes("BPX_J")
data = merge(demoj, bpxj, by="SEQN")
dim(data)
```

```{r merge_print, echo = FALSE}
cat(sprintf("[1] 8704   66"))
```

#### Create survey design object

To generate accurate estimates, it's essential to establish a survey design object that integrates the weights into our analysis. This design object should be created *before* any subsetting or manipulation of the data. By doing this, we ensure that intricate survey design elements, like stratification and clustering, are fully captured and applied across the dataset. 

It is also crucial to choose the appropriate weighting scheme for your analysis. NHANES offers various weights, such as interview, examination, fasting, and MEC laboratory weights, to name a few. The selection hinges on the specific dataset components and analyses you're focusing on. Moreover, when analyzing data across multiple cycles, it will be necessary to modify these weights to ensure accurate representation and inference.

For a comprehensive understanding, the CDC provides in-depth guidance on its website: [CDC NHANES Weighting Tutorial](https://wwwn.cdc.gov/nchs/nhanes/tutorials/weighting.aspx).


```{r survey, eval = FALSE, warning = FALSE, message = FALSE}
library("survey")
nhanesDesign <- svydesign(id = ~SDMVPSU,  # Primary Sampling Units (PSU)
                          strata  = ~SDMVSTRA, # Stratification used in the survey
                          weights = ~WTMEC2YR,   # Survey weights
                          nest    = TRUE,      # Whether PSUs are nested within strata
                          data    = data)

```


#### Subset the survey design object


Next, we subset the data to focus on individuals over 40 years of age. At this stage, it's imperative to use the tools in the `survey` package for any data manipulations. This procedure guarantees that weight adjustments are correctly applied, ensuring the validity of subsequent analyses. For comparison purposes, we also create an additional subset without the survey design framework, which allows for easy examination of unadjusted values.



```{r surveydesign, eval = FALSE, message = FALSE, warning = FALSE}

# subset survey design object
dfsub = subset(nhanesDesign, data$RIDAGEYR>=40)

# subset the original dataset
datasub = data[data$RIDAGEYR>=40,]

```


#### Basic data exploration 


Next, we'll explore the variable RIDRETH1 from the DEMO_J table, which represents reported ethnicity. We'll focus on diastolic blood pressure for our analysis. For simplicity in our presentation, we'll utilize only the first measurement, represented by the variable BPXDI1 in the BPX_J table.

##### Calculating means

First we split by ethnicity and calculate mean blood pressure for each group on the unweighted data. 

```{r ethtables, eval = FALSE, message = FALSE, warning=FALSE}

## mean on total data set

mean(datasub$BPXDI1, na.rm = TRUE)
```

```{r ethtables_print, echo = FALSE}
cat(sprintf("[1] 73.04455"))
```

```{r eth_means, eval = FALSE}
##split the data by ethnicity and calculate mean of the unweighted data

unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
unweighted_means 

```

```{r eth_means_print, echo = FALSE}
df <- data.frame(matrix(1,nrow=1,ncol=5))
names(df) <- c('   Mexican American','   Other Hispanic','   Non-Hispanic White',
               '   Non-Hispanic Black','   Other Race - Including Multi-Racial')
df[1,] <- list('72.41000','72.97611','70.84130','75.71466','74.41311')

print(df,row.names=FALSE)
```

Now, we perform the same analysis using the survey weights.

```{r svyby, eval = FALSE, message = FALSE, warning = FALSE}
adjmns = svymean(~BPXDI1, dfsub, na.rm=TRUE)
adjmns
```

```{r svyby_print, echo = FALSE}
df <- data.frame(matrix(1,nrow=1,ncol=3))
names(df) <- c('','mean', 'SE')
df[1,] <- list('BPXDI1', '73.237', '0.5597')

print(df,row.names=FALSE)
```

```{r eth_combined, eval = FALSE}
# By ethnicity
adjmnsbyEth = svyby(~BPXDI1, ~RIDRETH1, dfsub, svymean, na.rm=TRUE)

weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)

combined_data <- cbind(unweighted_means, weighted_means)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data

```


```{r eth_combined_print, echo = FALSE}
df_uw <- data.frame(matrix(1,nrow=5,ncol=3))
names(df_uw) <- list('','Unweighted', 'Weighted')
df_uw[1,] <- list('Mexican American','72.41000','74.03194')
df_uw[2,] <- list('Other Hispanic','72.97611','74.73756')
df_uw[3,] <- list('Non-Hispanic White','70.84130','72.45422')
df_uw[4,] <- list('Non-Hispanic Black','75.71466','75.71874')
df_uw[5,] <- list('Other Race - Including Multi-Racial','74.41311','74.60215')

print(df_uw,row.names=FALSE)
```


We can do this with gender and any other categorical variable: 

```{r bygender, eval = FALSE, message = FALSE, warning = FALSE}
# By Gender 
mns = sapply(split(datasub$BPXDI1, datasub$RIAGENDR), mean, na.rm=TRUE)

adjmnsbyGen = svyby(~BPXDI1, ~RIAGENDR, dfsub, svymean, na.rm=TRUE)

combined_data <- cbind(mns, adjmnsbyGen$BPXDI1)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data

```

```{r bygender_print, echo = FALSE}
df <- data.frame(matrix(1,nrow=2,ncol=3))
names(df) <- c('','Unweighted', 'Weighted')
df[1,] <- list('Male', '74.67330', '75.31134')
df[2,] <- list('Female', '71.43385', '71.31453')

print(df,row.names=FALSE)
```


In addition to computing the mean, various other summary statistics, such as quantiles and variance, can be readily calculated using the survey design object. Importantly, each of these summary statistics will come with a corresponding standard error that accounts for the weights, providing insights into the precision of the estimates.


##### Calculating quantiles

```{r svq1, eval = FALSE, message = FALSE, warning = FALSE}

# For the unweighted data

quantile(datasub$BPXDI1, c(0.25,0.5,.75), na.rm = TRUE)
```

```{r, echo = FALSE}
df <- data.frame(matrix(1,nrow=1,ncol=3))
names(df) <- c('25%', '50%', '75%')
df[1,] <- list('66', '74', '82')

print(df,row.names=FALSE)
```

```{r svq2, eval = FALSE}
# For the survey weighted data
svyquantile(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE)
```

```{r svq2_print, message = FALSE, echo = FALSE}
df <- data.frame(matrix(1,nrow=9,ncol=1))
names(df) <- '$BPXDI1                                '
df[1,] <- '    quantile ci.2.5 ci.97.5        se'
df[2,] <- '0.25       66     66      68 0.4691643'
df[3,] <- '0.5        74     74      76 0.4691643'
df[4,] <- '0.75       80     80      82 0.4691643'
df[5,] <- ''
df[6,] <- 'attr(,"hasci")                       '
df[7,] <- '[1] TRUE                             '
df[8,] <- 'attr(,"class")                       '
df[9,] <- '[1] "newsvyquantile"                 '

print(df, row.names=FALSE)
```

```{r svq3, eval = FALSE}
# By Gender 
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyquantile, quantiles = c(0.5), na.rm=TRUE)

```

```{r svq3_print, echo = FALSE}
df <- data.frame(matrix(1,nrow=2,ncol=3))
names(df) <- c('RIAGENDR','BPXDI1','se.BPXDI1')
df[1,] <- list('Male','76','0.9383286')
df[2,] <- list('Female','72','0.4691643')

print(df, row.names=FALSE)
```

**Note:** The columns ci.2.5 and ci.97.5 in the output represent percentile intervals, not the traditional confidence intervals. Specifically, for a given quantile (like the 25th percentile), the values in these columns denote the range between the 2.5th percentile and the 97.5th percentile of that particular quantile estimate, when considering the survey weights.


##### Calculating variances

```{r, eval = FALSE, message = FALSE, warning = FALSE}

# For the entire dataset
svyvar(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE)
```

```{r, echo=FALSE}
df <- data.frame(matrix(1,nrow=1,ncol=3))
names(df) <- c('', 'variance', 'SE')
df[1,] <- list('BPXDI1', '173.68', '12.279')

print(df,row.names=FALSE)
```

```{r, eval = FALSE, message = FALSE, warning = FALSE}
  # By ethnicity
svyby(~BPXDI1, ~RIDRETH1, dfsub, svyvar, na.rm=TRUE)
```

```{r, echo=FALSE}
df <- data.frame(matrix(1,nrow=5,ncol=3))
names(df) <- c('RIDRETH1','BPXDI1','se')
df[1,] <- list('Mexican American','200.8781','45.43074')
df[2,] <- list('Other Hispanic','160.0157','23.79938')
df[3,] <- list('Non-Hispanic White','168.3504','18.30355')
df[4,] <- list('Non-Hispanic Black','202.3565','37.12593')
df[5,] <- list('Other Race - Including Multi-Racial','156.9968','17.26037')

print(df,row.names=FALSE)
```

```{r, eval = FALSE, message = FALSE, warning = FALSE}
  # By Gender 
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyvar, na.rm=TRUE)

```

```{r, echo=FALSE}
df <- data.frame(matrix(1,nrow=2,ncol=3))
names(df) <- c('RIAGENDR', 'BPXDI1', 'se')
df[1,] <- list('Male', '141.2704', '10.52796')
df[2,] <- list('Female', '196.1199', '20.65706')

print(df,row.names=FALSE)
```

#### Conducting Simple Regression Analysis

The `svyglm` function allows us to perform regression analyses while accounting for the survey's design. In this section, we will demonstrate how to model diastolic blood pressure levels based on age and ethnicity.

##### Syntax for `svyglm`

Using `svyglm()` is akin to utilizing the `glm()` function from base R. The key difference is that `svyglm()` is specifically tailored to work with survey objects, considering the intricate sampling design and weights.

To model diastolic blood pressure (noted as `BPXDI1` in our dataset) as a function of age (`RIDAGEYR`) and ethnicity (`RIDRETH3`), we can construct our model as outlined below. The resulting output will furnish regression coefficients, standard errors, and significance tests that have been adjusted for the complex survey design. When interpreting these results, it's crucial to remember that they resemble outputs from a generalized linear model (`glm`). However, the estimates provided by `svyglm()` have been weighted to produce nationally representative conclusions.

```{r, eval = FALSE, message = FALSE, warning = FALSE}
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())
summary(weighted_model)
```

```{r svyglm, echo = FALSE}
cat(sprintf(paste('Call:',
            '\n',
            '\nsvyglm(formula = BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub,',
            '\n    family = gaussian())',
            '\n',
            '\nSurvey design:',
            '\nsubset(nhanesDesign, data$RIDAGEYR >= 40)',
            '\n',
            '\nCoefficients:',
            '\n                                            Estimate Std. Error t value',
            '\n(Intercept)                                 90.96516    1.24720  72.935',
            '\nRIDAGEYR                                    -0.31522    0.01853 -17.012',
            '\nRIDRETH1Other Hispanic                       1.43894    1.25518   1.146',
            '\nRIDRETH1Non-Hispanic White                   0.49810    0.73132   0.681',
            '\nRIDRETH1Non-Hispanic Black                   2.92332    1.09594   2.667',
            '\nRIDRETH1Other Race - Including Multi-Racial  1.38532    0.63780   2.172',
            '\n                                            Pr(>|t|)',
            '\n(Intercept)                                 5.73e-15 ***',
            '\nRIDAGEYR                                    1.04e-08 ***',
            '\nRIDRETH1Other Hispanic                        0.2783',
            '\nRIDRETH1Non-Hispanic White                    0.5113',
            '\nRIDRETH1Non-Hispanic Black                    0.0236 *',
            '\nRIDRETH1Other Race - Including Multi-Racial   0.0550 .',
            '\n---',
            "\nSignif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1",
            '\n',
            '\n(Dispersion parameter for gaussian family taken to be 162.6653)',
            '\n',
            '\nNumber of Fisher Scoring iterations: 2'
            )))
```

#### Visualizing differences in weighted and unweighted survey data

Visualizing the differences between weighted and unweighted analyses can offer a clear, intuitive understanding of the impact of weights. We can use bar plots to compare mean diastolic blood pressures across different ethnicities. This plot could be extended to other metrics as well.


```{r, fig.width=7, eval = FALSE, message = FALSE, warning = FALSE}
library(ggplot2)

## recalculating means (same as above) 

unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)


plot_data <- data.frame(
  Ethnicity = factor(rep(names(unweighted_means), 2),
                     levels = names(unweighted_means)),
  Means = c(unweighted_means, weighted_means),
  Type = factor(c(rep("Unweighted", length(unweighted_means)),
                  rep("Weighted", length(weighted_means))),
               levels = c("Unweighted", "Weighted"))
)

# Creating the plot
ggplot(plot_data, aes(x = Ethnicity, y = Means, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  labs(title = "Comparison of Diastolic Blood Pressure by Ethnicity",
       y = "Mean Diastolic Blood Pressure") +
  scale_x_discrete(labels = c('Mex-Amer','Other Hisp','White', 'Black', 'Other')) +
  scale_fill_manual(values = c("blue", "red")) +
  theme_minimal() +
  theme(legend.title = element_blank())


```

```{r bp_by_eth_plot, echo = FALSE, fig.width=7, message = FALSE, warning = FALSE}
unweighted_means <- as.numeric(df_uw$Unweighted)
names(unweighted_means) <- df_uw[[1]]
weighted_means <- as.numeric(df_uw$Weighted)
plot_data <- data.frame(
  Ethnicity = factor(rep(names(unweighted_means), 2),
                     levels = names(unweighted_means)),
  Means = c(unweighted_means, weighted_means),
  Type = factor(c(rep("Unweighted", length(unweighted_means)),
                  rep("Weighted", length(weighted_means))),
               levels = c("Unweighted", "Weighted"))
)

# Creating the plot
library(ggplot2)
ggplot(plot_data, aes(x = Ethnicity, y = Means, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  labs(title = "Comparison of Diastolic Blood Pressure by Ethnicity",
       y = "Mean Diastolic Blood Pressure") +
  scale_x_discrete(labels = c('Mex-Amer','Other Hisp','White', 'Black', 'Other')) +
  scale_fill_manual(values = c("blue", "red")) +
  theme_minimal() +
  theme(legend.title = element_blank())
```

#### Visualizing Regression Results

Visualizing regression coefficients can provide a clear understanding of how different variables impact the outcome, especially when using weighted and unweighted data.

```{r, fig.width=7, eval = FALSE, message = FALSE, warning = FALSE}

unweighted_model <- glm(BPXDI1 ~ RIDAGEYR + RIDRETH1, data = datasub, family = gaussian())
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())

# For the unweighted model
unweighted_summary <- summary(unweighted_model)
unweighted_coefs <- unweighted_summary$coefficients[, "Estimate"]
unweighted_se <- unweighted_summary$coefficients[, "Std. Error"]

# For the weighted model
weighted_summary <- summary(weighted_model)
weighted_coefs <- as.numeric(weighted_summary$coefficients[, "Estimate"])
weighted_se <- as.numeric(weighted_summary$coefficients[, "Std. Error"])

comparison <- data.frame(
  Variable = names(unweighted_coefs),
  Unweighted = unweighted_coefs,
  Weighted = as.numeric(weighted_coefs)
)

print(comparison)
```

```{r visregcompare, echo = FALSE}
unwt_coef_names <- c("(Intercept)", "RIDAGEYR", "RIDRETH1Other Hispanic", "RIDRETH1Non-Hispanic White", "RIDRETH1Non-Hispanic Black", "RIDRETH1Other Race - Including Multi-Racial")
unweighted_coefs <- c(92.2848149, -0.3460552,  1.2325637,  0.7553912,  4.2996735,  2.0667520)
weighted_coefs <- c(90.9651590, -0.3152158,  1.4389441,  0.4981020,  2.9233165,  1.3853204)

comparison <- data.frame(
  Variable = unwt_coef_names,
  Unweighted = unweighted_coefs,
  Weighted = as.numeric(weighted_coefs)
)

print(comparison)
```

```{r visregress, eval = FALSE}
unweighted_df <- data.frame(Variable = names(unweighted_coefs), 
                            Estimate = unweighted_coefs, 
                            SE = unweighted_se, 
                            Type = "Unweighted")

weighted_df <- data.frame(Variable = names(unweighted_coefs), 
                          Estimate = weighted_coefs, 
                          SE = weighted_se, 
                          Type = "Weighted")

plot_data <- rbind(unweighted_df, weighted_df)

ggplot(subset(plot_data, Variable!='(Intercept)'), aes(x = Estimate, y = reorder(Variable, Estimate), color = Type)) +
  geom_point(position = position_dodge(0.5), size = 2.5) +
  geom_errorbarh(aes(xmin = Estimate - SE, xmax = Estimate + SE),
                 height = 0.2, position = position_dodge(0.5)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  labs(title = "Comparison of Regression Coefficients",
       x = "Coefficient Value", y = "Predictors") +
  theme_minimal() +
  scale_color_manual(values = c("Unweighted" = "blue", "Weighted" = "red")) +
  theme(legend.title = element_blank())

```

```{r regression_coeff, fig.width=7, echo = FALSE, message = FALSE, warning = FALSE}


unweighted_coefs <- c(92.2848149, -0.3460552,  1.2325637,  0.7553912,  4.2996735,  2.0667520)
weighted_coefs <- c(90.9651590, -0.3152158,  1.4389441,  0.4981020,  2.9233165,  1.3853204)
names(unweighted_coefs) <- c("(Intercept)", "RIDAGEYR", "RIDRETH1Other Hispanic", "RIDRETH1Non-Hispanic White", "RIDRETH1Non-Hispanic Black", "RIDRETH1Other Race - Including Multi-Racial")

unweighted_se <- c(1.36340677, 0.02061825, 1.04001732, 0.79478016, 0.83753015, 0.86954266)
weighted_se <- c(1.24720135, 0.01852892, 1.25517931, 0.73131800, 1.09593669, 0.63780221)


unweighted_df <- data.frame(Variable = names(unweighted_coefs), 
                            Estimate = unweighted_coefs, 
                            SE = unweighted_se, 
                            Type = "Unweighted")

weighted_df <- data.frame(Variable = names(unweighted_coefs), 
                          Estimate = weighted_coefs, 
                          SE = weighted_se, 
                          Type = "Weighted")

plot_data <- rbind(unweighted_df, weighted_df)

ggplot(subset(plot_data, Variable!='(Intercept)'), aes(x = Estimate, y = reorder(Variable, Estimate), color = Type)) +
  geom_point(position = position_dodge(0.5), size = 2.5) +
  geom_errorbarh(aes(xmin = Estimate - SE, xmax = Estimate + SE),
                 height = 0.2, position = position_dodge(0.5)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
  labs(title = "Comparison of Regression Coefficients",
       x = "Coefficient Value", y = "Predictors") +
  theme_minimal() +
  scale_color_manual(values = c("Unweighted" = "blue", "Weighted" = "red")) +
  theme(legend.title = element_blank())
```