Predictive Modeling of Mental Health and Wellness: Synergies of Lifestyle and Socioeconomic Factors (2021-2023 Cohort)¶
Name: Aaron Dumont
Project Website: https://adumont2.github.io
Project Direct Website Notebook link: https://adumont2.github.io/Dumont-Milestone1.html
Project Plan¶
Collaboration Plan
I am completing this final tutorial as a solo project. I am utilizing a local Jupyter environment running within a Docker Dev Container for iterative development, and a GitHub repository (adumont2.github.io) for version control and hosting the final public-facing portfolio piece.
Dataset Selection & Motivation To explore the determinants of wellness, I have chosen to use the National Health and Nutrition Examination Survey (NHANES). Administered by the CDC, NHANES provides a uniquely powerful, micro-level dataset that combines self-reported survey data (lifestyle, mental health) with objective clinical examination data (body measurements, lab results). I am interested in using my background in healthcare to leverage my learning and application of data science to improve wellness for a better world.
Specifically for this project, I am utilizing the most recently completed August 2021–August 2023 cycle. My motivation for selecting this specific timeframe is to capture contemporary, post-pandemic health and wellness trends. I aim to investigate the complex relationships between modifiable lifestyle factors—such as rigorous physical training, sleep architecture, and metabolic markers—and mental wellness. By identifying strong predictors of mental health struggles while controlling for socioeconomic confounders in a modern cohort, we can better understand holistic wellness strategies.
Proposed Questions
- Predictive Modeling: Can we accurately predict a positive screening for clinical depression (using the PHQ-9 scoring system) based on a matrix of modifiable lifestyle variables (vigorous physical activity minutes, sleep duration, and BMI), while controlling for the socioeconomic baseline (income-to-poverty ratio)?
- Correlative Analysis: Is the protective effect of vigorous physical activity against depression severity maintained across all age demographics in the post-pandemic era, or does its statistical significance decrease in older populations?
Data Adequacy The NHANES dataset is well-suited to answer these questions because it contains the Patient Health Questionnaire (PHQ-9), alongside highly specific physical activity metrics, clinical BMI, and demographic data. If these metrics yield low predictive power, I will attempt to incorporate systemic inflammatory markers (like C-Reactive Protein) in Milestone 2.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set visualization style for professional graphs
sns.set_theme(style="whitegrid")
Extraction, Transform, and Load (ETL)¶
Data Source & Collection: The data is sourced directly from the CDC's NHANES 2021-2023 repository. NHANES collects data via a complex, stratified, multistage probability sample of the civilian, non-institutionalized US population.
ETL Challenges & Formatting:
- File Format & Merging: The data is provided in SAS Transport format (
.XPT) using the "L" suffix to denote the 21-23 cycle. I extracted five separate tables (Demographics, Body Measures, Sleep, Physical Activity, and Depression) directly via URL and performed an inner join on the respondent sequence number (SEQN). - Cryptic Variables: Raw SAS variables (e.g.,
PAQ660,INDFMPIR) were mapped to human-readable names to adhere to tidy data principles. - Handling Missing Data (NaNs): - Target Variable: Rows missing the PHQ-9 depression score were dropped, as we cannot train a model on an unknown target.
- Survey Logic NaNs: In NHANES, if a participant answers "No" to engaging in vigorous activity, the subsequent question regarding "minutes of vigorous activity" is skipped (NaN). I explicitly filled these specific NaNs with
0to reflect zero minutes. - Standard Imputation: Remaining missing features (like unrecorded BMI) were filled with the dataset median to preserve the cohort size for exploratory analysis.
- Survey Logic NaNs: In NHANES, if a participant answers "No" to engaging in vigorous activity, the subsequent question regarding "minutes of vigorous activity" is skipped (NaN). I explicitly filled these specific NaNs with
# 1. Extraction: Load the 2021-2023 data directly from the NEW CDC URLs
base_url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/"
print("Downloading and reading 2021-2023 SAS XPT files from CDC...")
df_demo = pd.read_sas(base_url + 'DEMO_L.xpt') # Demographics
df_body = pd.read_sas(base_url + 'BMX_L.xpt') # Body Measures
df_sleep = pd.read_sas(base_url + 'SLQ_L.xpt') # Sleep
df_activity = pd.read_sas(base_url + 'PAQ_L.xpt') # Physical Activity
df_depr = pd.read_sas(base_url + 'DPQ_L.xpt') # Depression Screener (PHQ-9)
df_crp = pd.read_sas(base_url + 'HSCRP_L.xpt') # NEW: High-Sensitivity C-Reactive Protein (Inflammation)
# 2. Transform: Merge all 6 datasets on the unique sequence number (SEQN)
df = pd.merge(df_demo, df_body, on='SEQN', how='inner')
df = pd.merge(df, df_sleep, on='SEQN', how='inner')
df = pd.merge(df, df_activity, on='SEQN', how='inner')
df = pd.merge(df, df_depr, on='SEQN', how='inner')
df = pd.merge(df, df_crp, on='SEQN', how='inner') # Merging new lab data
# 3. Transform: Calculate PHQ-9 Depression Score
phq_cols = [f'DPQ0{i}0' for i in range(1, 10)]
df[phq_cols] = df[phq_cols].replace([7, 9], np.nan)
df['PHQ9_Score'] = df[phq_cols].sum(axis=1)
# 4. Transform: Rename to readable columns and filter adults (20+)
col_mapping = {
'SEQN': 'ID',
'RIAGENDR': 'Gender',
'RIDAGEYR': 'Age',
'BMXBMI': 'BMI',
'SLD012': 'Sleep_Hours',
'PAD820': 'Vigorous_Activity_Min',
'INDFMPIR': 'Income_Poverty_Ratio',
'LBXHSCRP': 'CRP_mgL' # NEW: Clinical inflammation marker
}
df_clean = df.rename(columns=col_mapping)[['ID', 'Gender', 'Age', 'BMI', 'Sleep_Hours',
'Vigorous_Activity_Min', 'Income_Poverty_Ratio', 'CRP_mgL', 'PHQ9_Score']]
df_clean = df_clean[df_clean['Age'] >= 20]
# Map Gender
df_clean['Gender'] = df_clean['Gender'].map({1.0: 'Male', 2.0: 'Female'})
# Create Binary Target: Clinical cutoff for depression is a score >= 10
df_clean['Depressed'] = df_clean['PHQ9_Score'].apply(lambda x: 1 if x >= 10 else 0 if pd.notnull(x) else np.nan)
# 5. Handle NaNs & Set Dtypes
df_clean = df_clean.dropna(subset=['PHQ9_Score'])
df_clean['Vigorous_Activity_Min'] = df_clean['Vigorous_Activity_Min'].fillna(0)
# Median imputation for remaining features including the new CRP clinical data
for col in ['BMI', 'Sleep_Hours', 'Income_Poverty_Ratio', 'CRP_mgL']:
df_clean[col] = df_clean[col].fillna(df_clean[col].median())
# Set logical datatypes
df_clean = df_clean.astype({
'ID': 'int64',
'Age': 'int64',
'Sleep_Hours': 'float64',
'Vigorous_Activity_Min': 'float64',
'CRP_mgL': 'float64',
'PHQ9_Score': 'int64',
'Depressed': 'int64'
})
print("\nETL Complete. Dataset Shape:", df_clean.shape)
display(df_clean.head())
Downloading and reading 2021-2023 SAS XPT files from CDC... ETL Complete. Dataset Shape: (6064, 10)
| ID | Gender | Age | BMI | Sleep_Hours | Vigorous_Activity_Min | Income_Poverty_Ratio | CRP_mgL | PHQ9_Score | Depressed | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130378 | Male | 43 | 27.0 | 9.5 | 45.0 | 5.00 | 1.78 | 0 | 0 |
| 1 | 130379 | Male | 66 | 33.5 | 9.0 | 45.0 | 5.00 | 2.03 | 1 | 0 |
| 2 | 130380 | Female | 44 | 29.7 | 8.0 | 0.0 | 1.41 | 5.62 | 2 | 0 |
| 3 | 130386 | Male | 34 | 30.2 | 7.5 | 30.0 | 1.33 | 1.05 | 1 | 0 |
| 4 | 130387 | Female | 68 | 42.6 | 3.0 | 0.0 | 1.32 | 3.96 | 0 | 0 |
Exploratory Data Analysis (EDA)¶
To understand the multidimensional relationships in this recent dataset, I generated summary statistics across physical, psychological, and socioeconomic vectors, followed by a correlation heatmap.
Relevance of these statistics:
- Base Rate: Establishes the class imbalance dictating the need for specific sampling techniques during Milestone 2 model training.
- Activity & Income Medians: Provide a baseline for the general population's fitness and socioeconomic stability in the 2021-2023 timeframe.
- Grouped Means: Comparing physical activity levels between depressed and non-depressed cohorts provides an immediate, raw look at one of my core hypothesis questions.
- The Correlation Heatmap: This graphic is crucial for feature selection. It reveals collinearity between our predictive variables (e.g., how strongly Income correlates with BMI) and their individual linear relationships with the target PHQ-9 score, guiding our machine learning approach.
# --- Summary Statistics (5 generated) ---
print("--- EDA Summary Statistics ---\n")
# 1. Base rate of depression
dep_rate = df_clean['Depressed'].mean() * 100
print(f"1. Proportion of cohort screening positive for depression (PHQ9 >= 10): {dep_rate:.2f}%")
# 2. Median Activity
med_activity = df_clean['Vigorous_Activity_Min'].median()
print(f"2. Median vigorous physical activity (minutes/day) across cohort: {med_activity:.1f}")
# 3. Median Income Ratio
med_income = df_clean['Income_Poverty_Ratio'].median()
print(f"3. Median Income-to-Poverty Ratio (higher = greater financial stability): {med_income:.2f}")
# 4 & 5. Grouped statistic: Average Activity by Depression Status
activity_by_dep = df_clean.groupby('Depressed')['Vigorous_Activity_Min'].mean()
print(f"4. Average vigorous activity for NON-depressed individuals: {activity_by_dep[0]:.1f} mins/day")
print(f"5. Average vigorous activity for DEPRESSED individuals: {activity_by_dep[1]:.1f} mins/day\n")
# --- Graphic: Correlation Heatmap ---
plt.figure(figsize=(10, 8))
# Select only the continuous numerical variables for the heatmap
numeric_cols = ['Age', 'BMI', 'Sleep_Hours', 'Vigorous_Activity_Min', 'Income_Poverty_Ratio', 'PHQ9_Score']
corr_matrix = df_clean[numeric_cols].corr()
# Generate a mask for the upper triangle to make it easier to read
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
# Draw the heatmap
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
vmax=0.3, vmin=-0.3, square=True, linewidths=.5, cbar_kws={"shrink": .8})
plt.title('Correlation Heatmap: Lifestyle, Socioeconomic Factors, and Depression (2021-2023)', fontsize=14, pad=20)
plt.show()
--- EDA Summary Statistics --- 1. Proportion of cohort screening positive for depression (PHQ9 >= 10): 11.05% 2. Median vigorous physical activity (minutes/day) across cohort: 0.0 3. Median Income-to-Poverty Ratio (higher = greater financial stability): 2.82 4. Average vigorous activity for NON-depressed individuals: 37.8 mins/day 5. Average vigorous activity for DEPRESSED individuals: 66.8 mins/day
Discussion of EDA Findings¶
The summary statistics reveal an interesting and counterintuitive finding regarding physical activity. The median vigorous activity for the entire cohort is 0.0 minutes, meaning more than half of the US adult population reports doing absolutely zero vigorous exercise. However, when looking at the means, the depressed cohort reports a significantly higher average (66.8 mins/day) than the non-depressed cohort (37.8 mins/day).
This highlights a potential "outlier effect" in our data. Because the depressed cohort is much smaller (roughly 11% of the sample), a few individuals reporting high amounts of activity appear to markedly skew the average upward for that group. This finding is very relevant to my core question, as it proves that relying solely on linear averages for behavioral data is misleading. Moving into Milestone 2, I will need to handle these outliers carefully or use non-linear models to accurately predict depression from lifestyle factors.
Furthermore, the Correlation Heatmap shows a slight negative correlation (-0.14) between the Income-to-Poverty Ratio and the PHQ-9 Score, which supports the hypothesis that higher socioeconomic stability has a mildly protective effect against depression severity.
# Set custom aesthetic palette
custom_palette = ['#232D4B', '#E57200']
sns.set_palette(custom_palette)
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Exploratory Data Analysis: Distributions and Relationships', fontsize=16, fontweight='bold', y=1.02)
# Graph 2: Distribution of PHQ-9 Scores
sns.histplot(data=df_clean, x='PHQ9_Score', bins=27, kde=True, color='#232D4B', ax=axes[0, 0])
axes[0, 0].set_title('Graph 2: Distribution of PHQ-9 Depression Scores', fontsize=12)
axes[0, 0].set_xlabel('PHQ-9 Score')
axes[0, 0].set_ylabel('Count of Individuals')
# Graph 3: Depression Base Rate by Gender
sns.barplot(data=df_clean, x='Gender', y='Depressed', errorbar=None, palette=custom_palette, hue='Gender', legend=False, ax=axes[0, 1])
axes[0, 1].set_title('Graph 3: Clinical Depression Rate by Gender', fontsize=12)
axes[0, 1].set_ylabel('Proportion Depressed (PHQ9 >= 10)')
# Graph 4: BMI vs Depression Status (Boxplot)
sns.boxplot(data=df_clean, x='Depressed', y='BMI', palette=custom_palette, hue='Depressed', legend=False, ax=axes[1, 0])
axes[1, 0].set_title('Graph 4: Clinical BMI Distribution by Depression Status', fontsize=12)
axes[1, 0].set_xticks([0, 1])
axes[1, 0].set_xticklabels(['Not Depressed (0)', 'Depressed (1)'])
axes[1, 0].set_xlabel('Depression Status') # Added to override the x-axis label
axes[1, 0].set_ylabel('Body Mass Index (BMI)')
# Graph 5: Inflammation (CRP) vs Depression Status (Violin Plot)
# Capping CRP at 15 mg/L for visualization purposes to cut extreme outliers
df_vis = df_clean[df_clean['CRP_mgL'] <= 15]
sns.violinplot(data=df_vis, x='Depressed', y='CRP_mgL', palette=custom_palette, hue='Depressed', legend=False, inner="quartile", ax=axes[1, 1])
axes[1, 1].set_title('Graph 5: Systemic Inflammation (CRP) by Depression Status', fontsize=12)
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(['Not Depressed (0)', 'Depressed (1)'])
axes[1, 1].set_xlabel('Depression Status') # Added to override the x-axis label
axes[1, 1].set_ylabel('C-Reactive Protein (mg/L)')
plt.tight_layout()
plt.show()
from scipy import stats
# 1. Isolate the CRP data for the two groups
crp_not_depressed = df_clean[df_clean['Depressed'] == 0]['CRP_mgL'].dropna()
crp_depressed = df_clean[df_clean['Depressed'] == 1]['CRP_mgL'].dropna()
# 2. Run the Mann-Whitney U test
# We use 'greater' because our hypothesis is that the depressed group has HIGHER inflammation
stat, p_value = stats.mannwhitneyu(crp_depressed, crp_not_depressed, alternative='greater')
# 3. Print the results cleanly
print(f"Mann-Whitney U Statistic: {stat}")
print(f"P-Value: {p_value}")
if p_value < 0.05:
print("\nResult: STATISTICALLY SIGNIFICANT (p < 0.05).")
print("The depressed cohort has significantly higher systemic inflammation.")
else:
print("\nResult: NOT STATISTICALLY SIGNIFICANT.")
print("We cannot confidently say the inflammation levels are different.")
Mann-Whitney U Statistic: 2009218.0 P-Value: 1.1068184701508483e-06 Result: STATISTICALLY SIGNIFICANT (p < 0.05). The depressed cohort has significantly higher systemic inflammation.
Proposed Machine Learning Models (Milestone 2 Outline)¶
To rigorously answer the core research questions established in Milestone 1, I propose building the following two distinct models:
Model 1: Binary Classification of Clinical Depression (Logistic Regression / Random Forest)¶
- Objective: Predict whether an individual screens positive for clinical depression based on a matrix of lifestyle and physiological markers.
- Dependent Variable (Target):
Depressed(Binary: 0 = Not Depressed, 1 = Depressed). - Independent Variables (Features):
Age,Gender,BMI,Sleep_Hours,Vigorous_Activity_Min,Income_Poverty_Ratio, andCRP_mgL. - Methodology: I will utilize Logistic Regression to establish baseline odds ratios for each modifiable factor (e.g., "Does an extra hour of sleep reduce the odds of depression?"). I will also train a Random Forest Classifier to capture non-linear relationships, such as the "outlier effect" observed in the vigorous physical activity data during EDA. Performance will be evaluated using F1-score and ROC-AUC due to the heavy class imbalance (only ~11% positive).
Model 2: Continuous Prediction of Mental Wellness (Multiple Linear Regression)¶
- Objective: Understand the proportional impact of socioeconomic stability and physical health on the raw severity of depressive symptoms.
- Dependent Variable (Target):
PHQ9_Score(Continuous: 0 to 27). - Independent Variables (Features):
Age,BMI,Sleep_Hours,Vigorous_Activity_Min,Income_Poverty_Ratio, andCRP_mgL. - Methodology: I will deploy Multiple Linear Regression. By interpreting the coefficients of the standardized features, I can identify which variable (e.g., metabolic inflammation via CRP vs. socioeconomic stability via Income Ratio) has the largest relative effect on moving a patient up or down the PHQ-9 severity scale. I will use $R^2$ and Root Mean Squared Error (RMSE) to evaluate the model's goodness of fit.