NHANES uses a complex, multistage probability sampling design to
select participants who represent the non-institutionalized U.S.
population. Without proper survey weights, analyses will produce biased
estimates. The create_design() function automates the
calculation of appropriate weights when combining multiple NHANES
cycles, following CDC
weighting guidelines.
NHANES provides three categories of sampling weights, each reflecting different levels of participation:
Interview weights (wtint2yr, wtint4yr): Used when all variables come from the household interview (demographics, questionnaires).
MEC weights (wtmec2yr, wtmec4yr): Used when any variable requires a physical exam (laboratory tests, body measurements, DEXA scans).
Fasting weights (wtsaf2yr): Used when any variable requires fasting laboratory tests (glucose, insulin, lipids).
The probability of being sampled decreases from interview to MEC to fasting subsamples. When combining variables across categories, always use the weight for the most restrictive category, that is, the one with the lowest probability of selection. For example, if your analysis includes both demographics (interview) and body measurements (MEC), use MEC weights.
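The selection rule above can be sketched in a few lines of base R. This is an illustration only: `pick_wt_type()` and its `components` argument are hypothetical names and not part of the package. The returned labels are assumed to match the `wt_type` values used in the examples in this document.

```r
# Hypothetical helper: choose the weight category for an analysis,
# given the survey component each variable comes from.
pick_wt_type <- function(components) {
  # Ordered from least to most restrictive subsample
  ordering <- c("interview", "mec", "fasting")
  unknown <- setdiff(components, ordering)
  if (length(unknown) > 0) {
    stop("Unknown component(s): ", paste(unknown, collapse = ", "))
  }
  # The most restrictive component present determines the weight
  ordering[max(match(components, ordering))]
}

pick_wt_type(c("interview", "mec"))             # demographics + body measures
pick_wt_type(c("interview", "mec", "fasting"))  # adds a fasting lab variable
```

The first call returns "mec" and the second returns "fasting", because one variable from a more restrictive component is enough to change the weight.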
CDC recommendations for combining cycles are based on the number of cycles present in your data, not the timespan covered. This distinction matters when you have gaps in your data.
NHANES provides 4-year weights (wtint4yr,
wtmec4yr) for 1999-2000 and 2001-2002 cycles, while all
subsequent cycles provide only 2-year weights. When combining multiple
cycles:
Cycles 1999 or 2001: Use 4-year weight × (2/n). The numerator is 2 because the 4-year weight represents two 2-year cycles.
Cycles 2003+: Use 2-year weight × (1/n)
Denominator n: Total number of cycles in your analysis
Combining 4 cycles (1999, 2001, 2003, 2005) with MEC weights:
Cycles 1999 and 2001: wtmec4yr * 2/4 = wtmec4yr * 0.5
Cycles 2003 and 2005: wtmec2yr * 1/4 = wtmec2yr * 0.25
If you excluded the 2003 cycle, you would have 3 cycles total, so:
Cycles 1999 and 2001: wtmec4yr * 2/3
Cycle 2005: wtmec2yr * 1/3
The key principle: n is the number of cycles present, not the timespan.
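The arithmetic above can be checked with a small base R sketch. The helper `cycle_multiplier()` and the toy weights are hypothetical; this is not the code used inside create_design(), only the same rule written out under the assumption that `year` holds each row's cycle start year.

```r
# Hypothetical helper: per-row multiplier for combining NHANES cycles.
cycle_multiplier <- function(year) {
  n <- length(unique(year))  # cycles present, not the timespan
  # 1999 and 2001 use the 4-year weight, so the numerator is 2
  ifelse(year %in% c(1999, 2001), 2 / n, 1 / n)
}

# Toy data: one row per cycle, made-up weights
toy <- data.frame(
  year     = c(1999, 2001, 2003, 2005),
  wtmec4yr = c(1000, 1200, NA, NA),
  wtmec2yr = c(2000, 2400, 900, 1100)
)

# Pick the 4-year weight for 1999/2001 and the 2-year weight otherwise
base_wt <- ifelse(toy$year %in% c(1999, 2001), toy$wtmec4yr, toy$wtmec2yr)
toy$wt_combined <- base_wt * cycle_multiplier(toy$year)

cycle_multiplier(toy$year)            # 0.5 0.5 0.25 0.25
cycle_multiplier(c(1999, 2001, 2005)) # 2/3 2/3 1/3 (2003 excluded)
```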
When analyzing demographics and questionnaire data only:
# Packages used by the examples in this section
library(nhanesdata)
library(dplyr)
library(srvyr)

# Load demographics data
demo <- read_nhanes("demo")

# Create design with interview weights
design_int <- create_design(
  dsn = demo,
  start_yr = 1999,
  end_yr = 2011,
  wt_type = "interview"
)

# Calculate weighted means
design_int |>
  summarize(
    mean_age = survey_mean(ridageyr, na.rm = TRUE),
    # The mean of a logical indicator is a proportion (0 to 1)
    pct_female = survey_mean(riagendr == 2, na.rm = TRUE)
  )

When including any examination or laboratory data:
# Load demographics and body measures
demo <- read_nhanes("demo")
bmx <- read_nhanes("bmx")
combined <- demo |>
  left_join(bmx, by = c("seqn", "year"))

# Use MEC weights because body measures require exam participation
design_mec <- create_design(
  dsn = combined,
  start_yr = 2007,
  end_yr = 2017,
  wt_type = "mec"
)

# Weighted BMI analysis
design_mec |>
  filter(!is.na(bmxbmi)) |>
  summarize(
    mean_bmi = survey_mean(bmxbmi, na.rm = TRUE),
    pct_obese = survey_mean(bmxbmi >= 30, na.rm = TRUE)
  )

When including fasting laboratory measurements:
# Load demographics and fasting lab data
demo <- read_nhanes("demo")
glu <- read_nhanes("glu")
combined <- demo |>
  left_join(glu, by = c("seqn", "year"))

# Use fasting weights for glucose analysis
design_fast <- create_design(
  dsn = combined,
  start_yr = 2005,
  end_yr = 2015,
  wt_type = "fasting"
)

# Analyze fasting glucose
design_fast |>
  filter(!is.na(lbxglu)) |>
  summarize(
    mean_glucose = survey_mean(lbxglu, na.rm = TRUE)
  )

You can specify a wide year range even if some cycles are missing from your data. The function calculates weights based only on cycles actually present.
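As a rough illustration of counting only the cycles present, the base R sketch below uses a hypothetical helper, `cycles_present()`, on made-up data. How create_design() does this internally is not shown in this document, so treat this as an assumption about the logic rather than the actual implementation.

```r
# Hypothetical helper: cycles that actually occur in the data
# within the requested range of start years.
cycles_present <- function(year, start_yr, end_yr) {
  sort(unique(year[year >= start_yr & year <= end_yr]))
}

# Toy data in which the 2003 and 2007 cycles are absent
years_in_data <- c(1999, 1999, 2001, 2005, 2005, 2009, 2011)

present <- cycles_present(years_in_data, start_yr = 1999, end_yr = 2011)
present               # 1999 2001 2005 2009 2011
n <- length(present)  # 5, although the range spans 7 possible cycles
```

With these toy data the denominator n is 5 rather than 7, so the 2-year weights would be scaled by 1/5 and the 4-year weights by 2/5.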
When creating a survey design, some participants may lack the weight variable needed for your analysis. This happens naturally in NHANES because not everyone completes every component.
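create_design() handles this by filtering out participants who lack a valid weight for the selected category and reporting how many were removed.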
Example message you might see:
Filtered out 150 participants without valid mec weights.
These participants were not in the subsample for this weight category.
Learn more:
+ CDC weighting guidance:
https://wwwn.cdc.gov/nchs/nhanes/tutorials/Weighting.aspx
+ Survey design vignette: vignette('survey-design', package = 'nhanesdata')
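The filtering step can be pictured with the base R sketch below. `drop_missing_weights()` is a hypothetical function written for this example, with a simplified message; it is not the package's code and the weights are made up.

```r
# Hypothetical sketch: drop rows with a missing weight and report the count.
drop_missing_weights <- function(data, wt_var) {
  missing_wt <- is.na(data[[wt_var]])
  if (any(missing_wt)) {
    message("Filtered out ", sum(missing_wt),
            " participants without a value for ", wt_var, ".")
  }
  data[!missing_wt, , drop = FALSE]
}

toy <- data.frame(
  seqn     = 1:5,
  wtmec2yr = c(15000, NA, 22000, NA, 18000)  # made-up weights
)

kept <- drop_missing_weights(toy, "wtmec2yr")
nrow(kept)  # 3
```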
Zero weights are different from missing weights:
NHANES uses a stratified, multistage sampling design with Primary Sampling Units (PSUs) nested within strata. Variance estimation requires at least 2 PSUs per stratum. When subsetting data (e.g., filtering to diabetes patients only), you may create strata with only one PSU.
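To see how subsetting can leave a stratum with a single PSU, the base R sketch below counts distinct PSUs per stratum before and after dropping rows. `lonely_strata()` and the toy data (including the `diabetes` column) are hypothetical; only the sdmvstra and sdmvpsu names come from NHANES.

```r
# Hypothetical helper: strata with fewer than 2 distinct PSUs
lonely_strata <- function(data) {
  psu_per_stratum <- tapply(
    data$sdmvpsu, data$sdmvstra,
    function(x) length(unique(x))
  )
  names(psu_per_stratum)[psu_per_stratum < 2]
}

toy <- data.frame(
  sdmvstra = c(1, 1, 1, 2, 2, 2),
  sdmvpsu  = c(1, 1, 2, 1, 2, 2),
  diabetes = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

lonely_strata(toy)                  # character(0): every stratum has 2 PSUs
lonely_strata(toy[toy$diabetes, ])  # "2": only PSU 1 remains in stratum 2
```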
The create_design() function sets
options(survey.lonely.psu = "adjust"), which handles this
conservatively by centering single-PSU strata at the sample grand mean
rather than the stratum mean. This approach:
For more details on lonely PSU handling, see Thomas Lumley’s {survey} package documentation.
The function validates that your dataset contains:
year: NHANES cycle start year (odd years: 1999, 2001, 2003, …, 2021)
sdmvpsu: Primary sampling units
sdmvstra: Sampling strata
Weight variables for the chosen wt_type:
  interview: wtint2yr (and wtint4yr if 1999/2001 cycles present)
  mec: wtmec2yr (and wtmec4yr if 1999/2001 cycles present)
  fasting: wtsaf2yr
These variables are automatically included in datasets loaded via read_nhanes().
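If you build a dataset by hand, a check along these lines can confirm the required columns before calling create_design(). The functions `required_columns()` and `missing_columns()` are hypothetical and only mirror the list above; the package's own validation may differ.

```r
# Hypothetical sketch of a column check; not the package's actual code.
required_columns <- function(wt_type, years) {
  has_4yr_cycle <- any(years %in% c(1999, 2001))
  weights <- switch(
    wt_type,
    interview = c("wtint2yr", if (has_4yr_cycle) "wtint4yr"),
    mec       = c("wtmec2yr", if (has_4yr_cycle) "wtmec4yr"),
    fasting   = "wtsaf2yr",
    stop("Unknown wt_type: ", wt_type)
  )
  c("year", "sdmvpsu", "sdmvstra", weights)
}

missing_columns <- function(data, wt_type) {
  setdiff(required_columns(wt_type, data$year), names(data))
}

# Toy data containing a 1999 cycle but no 4-year MEC weight
toy <- data.frame(
  year = c(1999, 2003), sdmvpsu = 1:2, sdmvstra = 1:2,
  wtmec2yr = c(100, 200)
)
missing_columns(toy, "mec")  # "wtmec4yr"
```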
A typical workflow loads and joins data with read_nhanes() and {dplyr} joins, then builds the design with create_design().
Preprocessing before design creation is strongly recommended. Once the design object is created, filtering and recoding become more complex due to the survey structure.
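As a small illustration of recoding before design creation, the sketch below derives analysis variables on a plain data frame. The toy values and the derived column names (`sex`, `obese`, `age_group`) are made up for this example; the NHANES variable names match those used earlier in this document.

```r
# Toy data standing in for joined NHANES tables (values are made up)
toy <- data.frame(
  ridageyr = c(25, 47, 68, 80),
  riagendr = c(1, 2, 2, 1),
  bmxbmi   = c(22.5, 31.2, NA, 28.0)
)

# Recode on the plain data frame, before any design object exists
toy$sex <- factor(toy$riagendr, levels = c(1, 2),
                  labels = c("Male", "Female"))
toy$obese <- toy$bmxbmi >= 30  # a missing BMI stays NA
toy$age_group <- cut(
  toy$ridageyr,
  breaks = c(0, 40, 60, Inf),
  right  = FALSE,              # intervals are [0, 40), [40, 60), [60, Inf)
  labels = c("<40", "40-59", "60+")
)
```

The recoded data frame can then be passed to create_design() in the same way as the joined datasets in the earlier examples.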