Predictive Modeling of Mental Health and Wellness: Synergies of Lifestyle, Socioeconomic, and Biomarker Factors (NHANES 2021–2023)¶

Course: CMPS 6790 — Introduction to Data Science
Name: Aaron Dumont
Project Website: https://adumont2.github.io
Notebook Direct Link: https://adumont2.github.io/Dumont-FinalTutorial.html

Introduction and Motivation¶

Depression is a leading cause of disability worldwide. According to the National Institute of Mental Health, an estimated 21 million adults in the United States experienced at least one major depressive episode in 2021 alone. While clinical screening tools like the Patient Health Questionnaire-9 (PHQ-9) are widely validated for identifying depression, they are primarily administered in clinical settings — meaning most at-risk individuals are never screened.

This raises a critical question for data science: can routinely collected lifestyle and physiological metrics — physical activity, sleep, diet security, alcohol consumption, body composition, and systemic inflammation — reliably predict who is at risk for clinical depression? If so, such models could power population-level screening tools that identify at-risk individuals before they present to a clinic.

The CDC's National Health and Nutrition Examination Survey (NHANES) provides a uniquely powerful dataset for this investigation. Unlike most health surveys, NHANES combines self-reported behavioral data with objective clinical measurements from physical exams and laboratory tests. This fusion of subjective and objective data is rare, and it is precisely what makes NHANES ideal for building interpretable, multi-modal predictive models.

For this tutorial, I use the most recent completed cycle: August 2021 – August 2023. My motivation for selecting this timeframe is to capture contemporary, post-pandemic health trends. The COVID-19 pandemic reshaped physical activity patterns, dietary habits, alcohol consumption, and food security across the US population — all factors hypothesized to affect mental health.

Research Questions¶

  1. Predictive Modeling (Classification): Can we accurately predict a positive screening for clinical depression (PHQ-9 ≥ 10) using a combination of modifiable lifestyle factors (vigorous physical activity, sleep duration, alcohol consumption), physiological biomarkers (BMI, C-Reactive Protein), and socioeconomic indicators (income-to-poverty ratio, food security status)?

  2. Explanatory Modeling (Regression): Among these predictors, which variables have the largest proportional effect on the continuous severity of depressive symptoms (PHQ-9 score), and does systemic inflammation (CRP) independently predict depression severity after controlling for lifestyle and socioeconomic factors?

Tutorial Roadmap¶

This notebook follows the complete Data Science Lifecycle:

  1. ETL — Extract, merge, and clean 8 NHANES component tables into a single analytic dataset
  2. EDA — Explore distributions, relationships, and group differences to inform model design
  3. Modeling — Build and evaluate a classification model (Logistic Regression and Random Forest) and a regression model (Multiple OLS Regression)
  4. Conclusions — Synthesize findings into actionable recommendations for public health and clinical practice
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve, f1_score)
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import warnings
# Suppress convergence/deprecation warnings for cleaner tutorial output;
# in production, review all warnings before suppressing.
warnings.filterwarnings('ignore')

# Set visualization style for professional graphs
sns.set_theme(style="whitegrid")
custom_palette = ['#232D4B', '#E57200']  # UVA-inspired palette
sns.set_palette(custom_palette)

Extraction, Transform, and Load (ETL)¶

Data Sources¶

All data is sourced directly from the CDC's NHANES 2021–2023 public-use repository. NHANES collects data through a complex, stratified, multistage probability sample of the civilian, non-institutionalized US population, combining in-home interviews with standardized physical examinations and laboratory tests at Mobile Examination Centers.

I extract and merge eight separate component tables to build a comprehensive analytic dataset:

# NHANES Table File Key Variables Rationale
1 Demographics DEMO_L.xpt Age, Gender, Income-to-Poverty Ratio Core demographics and socioeconomic confounder
2 Body Measures BMX_L.xpt BMI Objective physiological marker of metabolic health
3 Sleep Disorders SLQ_L.xpt Sleep Duration (hours) Modifiable lifestyle factor tied to mental health
4 Physical Activity PAQ_L.xpt Vigorous Activity (min/day) Key modifiable lifestyle factor under investigation
5 Depression Screener DPQ_L.xpt PHQ-9 item scores → total score Target variable for both models
6 hs-CRP Laboratory HSCRP_L.xpt High-sensitivity C-Reactive Protein (mg/L) Objective biomarker of systemic inflammation (Osimo et al., 2019)
7 Alcohol Use ALQ_L.xpt Average drinks per day (past 12 months) Behavioral risk factor with strong depression comorbidity (Boden & Fergusson, 2011)
8 Food Security FSQ_L.xpt Adult food security category (FSDAD) Socioeconomic determinant of mental health (Pourmotabbed et al., 2020)

The inclusion of CRP (an objective laboratory biomarker), alcohol use (a behavioral risk factor), and food security (a socioeconomic determinant) was motivated by the published literature linking each to depression outcomes. This allows us to test whether physiological, behavioral, and structural factors contribute independently to depression risk — a question with direct implications for designing multi-level interventions.

For broader population-level context, the CDC PLACES dataset and the Behavioral Risk Factor Surveillance System (BRFSS) provide complementary county- and state-level depression prevalence estimates. While these aggregate datasets cannot be merged at the individual level with NHANES, they serve as useful benchmarks for validating whether our micro-level findings align with population trends.

ETL Challenges¶

  1. File Format & Multi-Table Merge: Data is provided in SAS Transport format (.XPT). I performed sequential inner joins on the unique respondent sequence number (SEQN) across all 8 tables to ensure every record has complete cross-domain information. Because not all NHANES participants complete every exam component, the inner join reduces the sample from 11,933 demographic records to 6,337 respondents who appear in all 8 tables (and 6,064 adults after filtering to age ≥ 20 and dropping rows missing the PHQ-9 target). This attrition is expected — NHANES laboratory and questionnaire modules are administered to age- and exam-specific subsamples — and the resulting analytic cohort is still large enough for robust modeling.
  2. Cryptic Variable Names: Raw SAS variable codes (e.g., PAD820, INDFMPIR, ALQ130) were mapped to human-readable names following tidy data principles.
  3. Survey Skip-Logic NaNs & Special Codes: In NHANES, if a participant answers "No" to engaging in vigorous activity, the follow-up question about minutes is skipped (recorded as NaN). These were explicitly filled with 0 to reflect zero minutes of activity. Similarly, participants who reported never drinking alcohol have NaN for average drinks per day, filled with 0. For ALQ130 (alcohol), response codes 777 (Refused) and 999 (Don't know) were recoded to NaN before zero-filling. Note: the 2021–2023 cycle dropped the household-level food security variable (FSDHH); we use FSDAD (adult food security category) instead, which is derived from the 10-item US Food Security Survey Module.
  4. Target Variable Integrity: Rows missing the PHQ-9 depression score were dropped — we cannot train or evaluate models on an unknown outcome.
  5. Standard Imputation: Remaining missing values in features (e.g., unrecorded BMI, CRP) were filled with the column median to preserve cohort size for exploratory analysis. This is a conservative choice; more sophisticated imputation (e.g., MICE) could be explored in future work.
In [2]:
# =============================================================================
# 1. EXTRACTION: Download 2021-2023 SAS XPT files directly from the CDC
# =============================================================================
base_url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/"

print("Downloading 8 NHANES 2021-2023 component tables from CDC...")
df_demo     = pd.read_sas(base_url + 'DEMO_L.xpt')     # Demographics
df_body     = pd.read_sas(base_url + 'BMX_L.xpt')      # Body Measures
df_sleep    = pd.read_sas(base_url + 'SLQ_L.xpt')      # Sleep
df_activity = pd.read_sas(base_url + 'PAQ_L.xpt')      # Physical Activity
df_depr     = pd.read_sas(base_url + 'DPQ_L.xpt')      # Depression Screener (PHQ-9)
df_crp      = pd.read_sas(base_url + 'HSCRP_L.xpt')    # hs-CRP (Inflammation Lab)
df_alcohol  = pd.read_sas(base_url + 'ALQ_L.xpt')      # Alcohol Use (NEW)
df_food     = pd.read_sas(base_url + 'FSQ_L.xpt')      # Food Security (NEW)

print(f"  Demographics:    {df_demo.shape[0]:,} records")
print(f"  Body Measures:   {df_body.shape[0]:,} records")
print(f"  Sleep:           {df_sleep.shape[0]:,} records")
print(f"  Physical Activity: {df_activity.shape[0]:,} records")
print(f"  Depression (PHQ): {df_depr.shape[0]:,} records")
print(f"  hs-CRP Lab:      {df_crp.shape[0]:,} records")
print(f"  Alcohol Use:     {df_alcohol.shape[0]:,} records")
print(f"  Food Security:   {df_food.shape[0]:,} records")

# =============================================================================
# 2. TRANSFORM: Merge all 8 datasets on the unique respondent ID (SEQN)
# =============================================================================
df = df_demo.copy()
for table in [df_body, df_sleep, df_activity, df_depr, df_crp, df_alcohol, df_food]:
    df = pd.merge(df, table, on='SEQN', how='inner')

print(f"\nAfter inner join across all 8 tables: {df.shape[0]:,} records")

# =============================================================================
# 3. TRANSFORM: Compute PHQ-9 Depression Score
# =============================================================================
# PHQ-9 consists of 9 items (DPQ010-DPQ090), each scored 0-3.
# Codes 7 (refused) and 9 (don't know) are treated as missing.
phq_cols = [f'DPQ0{i}0' for i in range(1, 10)]
df[phq_cols] = df[phq_cols].replace([7, 9], np.nan)
df['PHQ9_Score'] = df[phq_cols].sum(axis=1)

# =============================================================================
# 4. TRANSFORM: Rename variables, create features, filter adults
# =============================================================================
col_mapping = {
    'SEQN': 'ID',
    'RIAGENDR': 'Gender',
    'RIDAGEYR': 'Age',
    'BMXBMI': 'BMI',
    'SLD012': 'Sleep_Hours',
    'PAD820': 'Vigorous_Activity_Min',
    'INDFMPIR': 'Income_Poverty_Ratio',
    'LBXHSCRP': 'CRP_mgL',
    'ALQ130': 'Alcohol_Drinks_Day',      # NEW: Average alcoholic drinks per day
    'FSDAD': 'Food_Security_Category'     # NEW: Adult food security (1=Full, 2=Marginal, 3=Low, 4=Very Low)
}

keep_cols = list(col_mapping.values()) + ['PHQ9_Score']
df_clean = df.rename(columns=col_mapping)[keep_cols]

# Filter to adults aged 20+
df_clean = df_clean[df_clean['Age'] >= 20]

# Map Gender to readable labels
df_clean['Gender'] = df_clean['Gender'].map({1.0: 'Male', 2.0: 'Female'})

# Create binary depression target: PHQ-9 >= 10 is the clinical cutoff
df_clean['Depressed'] = df_clean['PHQ9_Score'].apply(
    lambda x: 1 if x >= 10 else 0 if pd.notnull(x) else np.nan
)

# Create binary food insecurity indicator (Low or Very Low security)
df_clean['Food_Insecure'] = df_clean['Food_Security_Category'].apply(
    lambda x: 1 if x >= 3 else 0 if pd.notnull(x) else np.nan
)

# =============================================================================
# 5. LOAD: Handle missing values and set data types
# =============================================================================
# Drop rows missing the target variable
df_clean = df_clean.dropna(subset=['PHQ9_Score'])

# Clean special codes: ALQ130 uses 777=Refused, 999=Don't know
df_clean['Alcohol_Drinks_Day'] = df_clean['Alcohol_Drinks_Day'].replace([777, 999], np.nan)

# Skip-logic NaN fills: 0 activity = didn't exercise; 0 drinks = doesn't drink
df_clean['Vigorous_Activity_Min'] = df_clean['Vigorous_Activity_Min'].fillna(0)
df_clean['Alcohol_Drinks_Day'] = df_clean['Alcohol_Drinks_Day'].fillna(0)

# Median imputation for remaining continuous features
for col in ['BMI', 'Sleep_Hours', 'Income_Poverty_Ratio', 'CRP_mgL']:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Mode imputation for categorical food security
df_clean['Food_Insecure'] = df_clean['Food_Insecure'].fillna(0)
df_clean['Food_Security_Category'] = df_clean['Food_Security_Category'].fillna(
    df_clean['Food_Security_Category'].mode()[0]
)

# Set logical data types
df_clean = df_clean.astype({
    'ID': 'int64', 'Age': 'int64',
    'Sleep_Hours': 'float64', 'Vigorous_Activity_Min': 'float64',
    'CRP_mgL': 'float64', 'Alcohol_Drinks_Day': 'float64',
    'PHQ9_Score': 'int64', 'Depressed': 'int64', 'Food_Insecure': 'int64'
})

print(f"\nFinal analytic dataset: {df_clean.shape[0]:,} adults, {df_clean.shape[1]} variables")
print(f"\n--- Data Types ---")
print(df_clean.dtypes)
print(f"\n--- First 5 Rows ---")
display(df_clean.head())
print(f"\n--- Descriptive Statistics ---")
display(df_clean.describe().round(2))
Downloading 8 NHANES 2021-2023 component tables from CDC...
  Demographics:    11,933 records
  Body Measures:   8,860 records
  Sleep:           8,501 records
  Physical Activity: 8,153 records
  Depression (PHQ): 6,337 records
  hs-CRP Lab:      8,727 records
  Alcohol Use:     6,337 records
  Food Security:   11,933 records

After inner join across all 8 tables: 6,337 records

Final analytic dataset: 6,064 adults, 13 variables

--- Data Types ---
ID                          int64
Gender                     object
Age                         int64
BMI                       float64
Sleep_Hours               float64
Vigorous_Activity_Min     float64
Income_Poverty_Ratio      float64
CRP_mgL                   float64
Alcohol_Drinks_Day        float64
Food_Security_Category    float64
PHQ9_Score                  int64
Depressed                   int64
Food_Insecure               int64
dtype: object

--- First 5 Rows ---
ID Gender Age BMI Sleep_Hours Vigorous_Activity_Min Income_Poverty_Ratio CRP_mgL Alcohol_Drinks_Day Food_Security_Category PHQ9_Score Depressed Food_Insecure
0 130378 Male 43 27.0 9.5 45.0 5.00 1.78 0.0 1.0 0 0 0
1 130379 Male 66 33.5 9.0 45.0 5.00 2.03 3.0 1.0 1 0 0
2 130380 Female 44 29.7 8.0 0.0 1.41 5.62 1.0 1.0 2 0 0
3 130386 Male 34 30.2 7.5 30.0 1.33 1.05 2.0 1.0 1 0 0
4 130387 Female 68 42.6 3.0 0.0 1.32 3.96 0.0 1.0 0 0 0
--- Descriptive Statistics ---
ID Age BMI Sleep_Hours Vigorous_Activity_Min Income_Poverty_Ratio CRP_mgL Alcohol_Drinks_Day Food_Security_Category PHQ9_Score Depressed Food_Insecure
count 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00 6064.00
mean 136346.30 53.85 29.81 7.71 41.04 2.91 3.72 1.69 1.55 3.51 0.11 0.18
std 3437.86 17.18 7.30 1.58 378.79 1.54 7.14 2.17 0.97 4.57 0.31 0.39
min 130378.00 20.00 11.10 2.00 0.00 0.00 0.11 0.00 1.00 0.00 0.00 0.00
25% 133335.50 39.00 24.70 7.00 0.00 1.62 0.87 0.00 1.00 0.00 0.00 0.00
50% 136391.00 57.00 28.50 8.00 0.00 2.82 1.83 1.00 1.00 2.00 0.00 0.00
75% 139304.25 68.00 33.50 8.50 40.00 4.56 3.90 2.00 2.00 5.00 0.00 0.00
max 142310.00 80.00 74.80 14.00 9999.00 5.00 150.92 15.00 4.00 26.00 1.00 1.00

Exploratory Data Analysis (EDA)¶

Before building predictive models, we must understand the structure, distributions, and key relationships within our data. The analyses below are organized in three stages:

  1. Baseline statistics and correlations — Establish the depression base rate, class imbalance, and linear relationships among features to guide model selection.
  2. Distribution and group-comparison visualizations — Examine how each candidate predictor differs between depressed and non-depressed cohorts, directly informing feature selection.
  3. Statistical hypothesis testing and cross-tabulations — Formally test key relationships and quantify group differences across multiple dimensions.

Each analysis is accompanied by a discussion of what the finding means for our modeling strategy.

In [3]:
# =============================================================================
# EDA Stage 1: Summary Statistics
# =============================================================================
print("=" * 70)
print("SUMMARY STATISTICS")
print("=" * 70)

# 1. Depression base rate (establishes class imbalance)
dep_rate = df_clean['Depressed'].mean() * 100
print(f"\n1. Depression prevalence (PHQ-9 >= 10): {dep_rate:.1f}%")

# 2. Median vigorous activity across cohort
med_activity = df_clean['Vigorous_Activity_Min'].median()
print(f"2. Median vigorous activity (min/day):   {med_activity:.1f}")

# 3. Median income-to-poverty ratio
med_income = df_clean['Income_Poverty_Ratio'].median()
print(f"3. Median income-to-poverty ratio:       {med_income:.2f}")

# 4. Food insecurity rate
food_insec_rate = df_clean['Food_Insecure'].mean() * 100
print(f"4. Food insecurity prevalence:            {food_insec_rate:.1f}%")

# 5. Median alcohol consumption
med_alcohol = df_clean['Alcohol_Drinks_Day'].median()
mean_alcohol = df_clean['Alcohol_Drinks_Day'].mean()
print(f"5. Median alcohol drinks/day:             {med_alcohol:.1f} (mean: {mean_alcohol:.2f})")

# 6-7. Activity by depression status
activity_by_dep = df_clean.groupby('Depressed')['Vigorous_Activity_Min'].mean()
print(f"6. Mean vigorous activity — NOT depressed: {activity_by_dep[0]:.1f} min/day")
print(f"7. Mean vigorous activity — DEPRESSED:     {activity_by_dep[1]:.1f} min/day")
======================================================================
SUMMARY STATISTICS
======================================================================

1. Depression prevalence (PHQ-9 >= 10): 11.0%
2. Median vigorous activity (min/day):   0.0
3. Median income-to-poverty ratio:       2.82
4. Food insecurity prevalence:            18.2%
5. Median alcohol drinks/day:             1.0 (mean: 1.69)
6. Mean vigorous activity — NOT depressed: 37.8 min/day
7. Mean vigorous activity — DEPRESSED:     66.8 min/day
In [4]:
# =============================================================================
# EDA Stage 1: Correlation Heatmap
# =============================================================================
plt.figure(figsize=(10, 8))

numeric_cols = ['Age', 'BMI', 'Sleep_Hours', 'Vigorous_Activity_Min',
                'Income_Poverty_Ratio', 'CRP_mgL', 'Alcohol_Drinks_Day',
                'Food_Insecure', 'PHQ9_Score']
corr_matrix = df_clean[numeric_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
            vmax=0.3, vmin=-0.3, square=True, linewidths=.5,
            cbar_kws={"shrink": .8})
plt.title('Graph 1: Correlation Heatmap — All Features vs. PHQ-9 Score (2021-2023)',
          fontsize=13, pad=20)
plt.tight_layout()
plt.show()
No description has been provided for this image

Discussion: Summary Statistics and Correlation Structure¶

Several findings from the summary statistics directly shape our modeling approach:

Class imbalance is substantial. Only about 11% of the adult cohort screens positive for clinical depression. This means a naive classifier that predicts "not depressed" for everyone would achieve ~89% accuracy — a meaningless result. For Model 1, we must use class-weighted algorithms or resampling techniques to ensure the model actually learns the minority class.

Physical activity shows a paradoxical pattern. The median vigorous activity across the cohort is 0.0 minutes — more than half of US adults report zero vigorous exercise. Yet the mean activity is actually higher in the depressed cohort than the non-depressed cohort. This paradox arises because the depressed group is small (~11%), so a handful of highly active individuals dramatically skew the average. This is a textbook demonstration of why relying on linear averages for skewed behavioral data is misleading, and it motivates the use of non-linear models like Random Forest that can handle such distributional quirks.

The correlation heatmap reveals weak linear relationships. No single feature has a strong linear correlation with PHQ-9 score (all |r| < 0.20). The Income-to-Poverty Ratio shows a slight negative correlation, and Food Insecurity shows a positive correlation — both consistent with the hypothesis that socioeconomic hardship is associated with worse mental health. The weak linear correlations further reinforce that non-linear classifiers may outperform Logistic Regression for the binary prediction task.

Collinearity between predictors is minimal. The feature-to-feature correlations are uniformly low, which is encouraging — it means each variable contributes relatively independent information, and multicollinearity will not be a major concern for our regression model.

EDA Stage 2: Distribution and Group-Comparison Visualizations¶

The summary statistics established the depression base rate and revealed that linear correlations are weak. The following four visualizations explore the distributional differences between depressed and non-depressed cohorts across our key predictors. Each plot is chosen to answer a specific question about feature utility for model building.

In [5]:
# =============================================================================
# EDA Stage 2: Four-Panel Visualization
# =============================================================================
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Exploratory Data Analysis: Distributions and Relationships',
             fontsize=16, fontweight='bold', y=1.02)

# Graph 2: Distribution of PHQ-9 Scores
# Question: How is the target variable distributed? Is binary cutoff appropriate?
sns.histplot(data=df_clean, x='PHQ9_Score', bins=27, kde=True,
             color='#232D4B', ax=axes[0, 0])
axes[0, 0].axvline(x=10, color='#E57200', linestyle='--', linewidth=2, label='Clinical Cutoff (≥10)')
axes[0, 0].legend()
axes[0, 0].set_title('Graph 2: Distribution of PHQ-9 Depression Scores', fontsize=12)
axes[0, 0].set_xlabel('PHQ-9 Score')
axes[0, 0].set_ylabel('Count of Individuals')

# Graph 3: Depression Rate by Gender
# Question: Does depression prevalence differ by gender?
sns.barplot(data=df_clean, x='Gender', y='Depressed', errorbar=None,
            palette=custom_palette, hue='Gender', legend=False, ax=axes[0, 1])
axes[0, 1].set_title('Graph 3: Clinical Depression Rate by Gender', fontsize=12)
axes[0, 1].set_ylabel('Proportion Depressed (PHQ-9 ≥ 10)')

# Graph 4: BMI Distribution by Depression Status
# Question: Is there a visible physiological difference in body composition?
sns.boxplot(data=df_clean, x='Depressed', y='BMI', palette=custom_palette,
            hue='Depressed', legend=False, ax=axes[1, 0])
axes[1, 0].set_title('Graph 4: BMI Distribution by Depression Status', fontsize=12)
axes[1, 0].set_xticks([0, 1])
axes[1, 0].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1, 0].set_xlabel('Depression Status')
axes[1, 0].set_ylabel('Body Mass Index (BMI)')

# Graph 5: Systemic Inflammation (CRP) by Depression Status
# Question: Is there objective biomarker evidence of depression-related inflammation?
df_vis = df_clean[df_clean['CRP_mgL'] <= 15]  # Cap at 15 mg/L to reduce outlier distortion
sns.violinplot(data=df_vis, x='Depressed', y='CRP_mgL', palette=custom_palette,
               hue='Depressed', legend=False, inner="quartile", ax=axes[1, 1])
axes[1, 1].set_title('Graph 5: Systemic Inflammation (CRP) by Depression Status', fontsize=12)
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1, 1].set_xlabel('Depression Status')
axes[1, 1].set_ylabel('C-Reactive Protein (mg/L)')

plt.tight_layout()
plt.show()
No description has been provided for this image

Discussion: Distribution and Group-Comparison Findings¶

Graph 2 — PHQ-9 Distribution: The distribution is heavily right-skewed, with the vast majority of scores clustering between 0 and 5. The clinical cutoff at ≥10 (orange dashed line) isolates a relatively small tail, confirming the ~11% positive rate from our summary statistics. This skew has two implications for modeling: (1) the binary classification approach (Model 1) is appropriate because the continuous score is not normally distributed, and (2) any linear regression on the raw PHQ-9 score (Model 2) may have difficulty with the distributional assumptions of OLS.

Graph 3 — Depression by Gender: Females show a meaningfully higher depression rate than males (roughly 14% vs. 8%). This gender gap is consistent with the broader epidemiological literature (Salk et al., 2017). Gender should be included as a feature in our models, and we should consider whether model performance differs by subgroup.

Graph 4 — BMI by Depression Status: The depressed cohort shows a slightly higher median BMI and a wider interquartile range compared to the non-depressed cohort. However, the overlap between the two distributions is substantial, suggesting that BMI alone is a weak discriminator. It may contribute meaningful signal in combination with other features but is unlikely to be a dominant predictor.

Graph 5 — CRP by Depression Status: This is the most visually striking difference among our predictors. The depressed cohort shows both a higher median CRP level and a wider, more right-skewed distribution of systemic inflammation. CRP is an objective laboratory measurement — not self-reported — which makes this finding particularly valuable. It motivates a formal statistical test (below) and strongly supports including CRP as a predictor in both models.

In [6]:
# =============================================================================
# EDA Stage 2 (continued): New Predictor Visualizations
# =============================================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Graph 6: Alcohol Consumption by Depression Status
sns.boxplot(data=df_clean, x='Depressed', y='Alcohol_Drinks_Day',
            palette=custom_palette, hue='Depressed', legend=False, ax=axes[0])
axes[0].set_title('Graph 6: Alcohol Consumption by Depression Status', fontsize=12)
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Not Depressed', 'Depressed'])
axes[0].set_xlabel('Depression Status')
axes[0].set_ylabel('Avg Alcoholic Drinks / Day')
axes[0].set_ylim(0, 10)  # Cap visualization to reduce extreme outlier distortion

# Graph 7: Food Insecurity Rate by Depression Status
food_dep = df_clean.groupby('Depressed')['Food_Insecure'].mean().reset_index()
sns.barplot(data=food_dep, x='Depressed', y='Food_Insecure',
            palette=custom_palette, hue='Depressed', legend=False, ax=axes[1])
axes[1].set_title('Graph 7: Food Insecurity Rate by Depression Status', fontsize=12)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1].set_xlabel('Depression Status')
axes[1].set_ylabel('Proportion Food Insecure')

# Graph 8: Sleep Duration by Depression Status
sns.violinplot(data=df_clean, x='Depressed', y='Sleep_Hours',
               palette=custom_palette, hue='Depressed', legend=False,
               inner="quartile", ax=axes[2])
axes[2].set_title('Graph 8: Sleep Duration by Depression Status', fontsize=12)
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['Not Depressed', 'Depressed'])
axes[2].set_xlabel('Depression Status')
axes[2].set_ylabel('Sleep Duration (hours)')

plt.tight_layout()
plt.show()
No description has been provided for this image

Discussion: Alcohol, Food Security, and Sleep Findings¶

Graph 6 — Alcohol Consumption: The depressed cohort tends to show slightly higher median alcohol consumption, though both groups have highly skewed distributions with many zeros (non-drinkers). The presence of heavy-drinking outliers in both groups suggests alcohol may contribute non-linearly — occasional heavy drinking may matter more than daily averages. This is another reason non-linear classifiers like Random Forest may capture patterns that Logistic Regression misses.

Graph 7 — Food Insecurity: This is a compelling finding. The rate of food insecurity is dramatically higher among the depressed cohort compared to the non-depressed cohort. Food insecurity is a measure of structural socioeconomic hardship that goes beyond income alone — it captures the lived experience of material deprivation. This strong group difference justifies its inclusion as a feature and suggests that socioeconomic factors may be among the strongest predictors in our models.

Graph 8 — Sleep Duration: The depressed cohort shows a wider distribution of sleep duration, with more individuals at both extremes (very short and very long sleep). This is consistent with the clinical understanding that depression is associated with both insomnia and hypersomnia. The bimodal pattern suggests that a simple linear relationship between sleep hours and depression may not capture the full picture — both too little and too much sleep are risk indicators.

In [7]:
# =============================================================================
# EDA Stage 3: Statistical Hypothesis Testing
# =============================================================================
print("=" * 70)
print("STATISTICAL TESTS: Depressed vs. Non-Depressed Cohort Comparisons")
print("=" * 70)

# Mann-Whitney U test for CRP (non-normal, continuous)
crp_nd = df_clean[df_clean['Depressed'] == 0]['CRP_mgL'].dropna()
crp_d  = df_clean[df_clean['Depressed'] == 1]['CRP_mgL'].dropna()
stat_crp, p_crp = stats.mannwhitneyu(crp_d, crp_nd, alternative='greater')
print(f"\n1. CRP (inflammation):")
print(f"   Mann-Whitney U = {stat_crp:,.0f}, p = {p_crp:.2e}")
print(f"   {'*** SIGNIFICANT' if p_crp < 0.05 else 'Not significant'}: Depressed cohort has {'higher' if p_crp < 0.05 else 'similar'} inflammation.")

# Mann-Whitney U test for Alcohol
alc_nd = df_clean[df_clean['Depressed'] == 0]['Alcohol_Drinks_Day'].dropna()
alc_d  = df_clean[df_clean['Depressed'] == 1]['Alcohol_Drinks_Day'].dropna()
stat_alc, p_alc = stats.mannwhitneyu(alc_d, alc_nd, alternative='greater')
print(f"\n2. Alcohol (drinks/day):")
print(f"   Mann-Whitney U = {stat_alc:,.0f}, p = {p_alc:.4f}")
print(f"   {'*** SIGNIFICANT' if p_alc < 0.05 else 'Not significant'}")

# Chi-squared test for Food Insecurity (categorical)
contingency = pd.crosstab(df_clean['Depressed'], df_clean['Food_Insecure'])
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
print(f"\n3. Food Insecurity (categorical):")
print(f"   Chi-squared = {chi2:.1f}, df = {dof}, p = {p_chi:.2e}")
print(f"   {'*** SIGNIFICANT' if p_chi < 0.05 else 'Not significant'}: Depression and food insecurity are {'associated' if p_chi < 0.05 else 'independent'}.")

# Mann-Whitney U test for Sleep
sleep_nd = df_clean[df_clean['Depressed'] == 0]['Sleep_Hours'].dropna()
sleep_d  = df_clean[df_clean['Depressed'] == 1]['Sleep_Hours'].dropna()
stat_slp, p_slp = stats.mannwhitneyu(sleep_d, sleep_nd, alternative='two-sided')
print(f"\n4. Sleep Duration (two-sided test):")
print(f"   Mann-Whitney U = {stat_slp:,.0f}, p = {p_slp:.4f}")
print(f"   {'*** SIGNIFICANT' if p_slp < 0.05 else 'Not significant'}")
======================================================================
STATISTICAL TESTS: Depressed vs. Non-Depressed Cohort Comparisons
======================================================================

1. CRP (inflammation):
   Mann-Whitney U = 2,009,218, p = 1.11e-06
   *** SIGNIFICANT: Depressed cohort has higher inflammation.

2. Alcohol (drinks/day):
   Mann-Whitney U = 2,075,518, p = 0.0000
   *** SIGNIFICANT

3. Food Insecurity (categorical):
   Chi-squared = 243.0, df = 1, p = 8.83e-55
   *** SIGNIFICANT: Depression and food insecurity are associated.

4. Sleep Duration (two-sided test):
   Mann-Whitney U = 1,639,981, p = 0.0001
   *** SIGNIFICANT
In [8]:
# =============================================================================
# EDA Stage 3: Cross-Tabulation Summary
# =============================================================================
print("\n" + "=" * 70)
print("CROSS-TABULATION: Mean Feature Values by Depression Status")
print("=" * 70 + "\n")

cross_tab = df_clean.groupby('Depressed').agg(
    N=('ID', 'count'),
    Avg_Age=('Age', 'mean'),
    Avg_BMI=('BMI', 'mean'),
    Avg_Sleep=('Sleep_Hours', 'mean'),
    Avg_Activity=('Vigorous_Activity_Min', 'mean'),
    Avg_CRP=('CRP_mgL', 'mean'),
    Avg_Alcohol=('Alcohol_Drinks_Day', 'mean'),
    Avg_Income_Ratio=('Income_Poverty_Ratio', 'mean'),
    Pct_Food_Insecure=('Food_Insecure', 'mean'),
    Pct_Female=('Gender', lambda x: (x == 'Female').mean())
).round(3)

cross_tab.index = ['Not Depressed (0)', 'Depressed (1)']
display(cross_tab.T)
======================================================================
CROSS-TABULATION: Mean Feature Values by Depression Status
======================================================================

Not Depressed (0) Depressed (1)
N 5394.000 670.000
Avg_Age 54.455 48.981
Avg_BMI 29.645 31.107
Avg_Sleep 7.736 7.471
Avg_Activity 37.838 66.813
Avg_CRP 3.577 4.913
Avg_Alcohol 1.614 2.307
Avg_Income_Ratio 2.984 2.285
Pct_Food_Insecure 0.154 0.401
Pct_Female 0.542 0.631

Discussion: Statistical Tests and Cross-Tabulation¶

The hypothesis tests confirm what the visualizations suggested:

CRP (inflammation) is significantly higher in the depressed cohort (Mann-Whitney U = 2,009,218, p = 1.11 × 10⁻⁶). This objective laboratory finding aligns with the growing body of research on the "inflammatory hypothesis of depression" — the idea that chronic low-grade systemic inflammation both contributes to and results from depressive disorders. Importantly, CRP is not self-reported, so this finding is less susceptible to reporting bias than behavioral variables.

Alcohol consumption is also significantly elevated in the depressed cohort (p < 0.0001), consistent with the well-documented comorbidity between alcohol use and depression.

Food insecurity shows a strong, statistically significant association with depression (χ² = 243.0, p = 8.83 × 10⁻⁵⁵). This is the most striking finding in our dataset: 40.1% of the depressed cohort is food insecure, compared to just 15.4% of the non-depressed cohort. This reinforces that structural socioeconomic factors are not merely confounders — they may be primary drivers of mental health outcomes at the population level.

Sleep duration differs significantly between groups (p = 0.0001), with the depressed cohort averaging 7.47 hours versus 7.74 hours for the non-depressed — a modest but statistically reliable difference that, combined with the wider distributional spread seen in Graph 8, supports including sleep as a model feature.

The cross-tabulation provides a comprehensive side-by-side comparison. Key patterns: the depressed cohort is younger on average (49.0 vs. 54.5 years), has higher BMI (31.1 vs. 29.6), higher CRP (4.9 vs. 3.6 mg/L), lower income-to-poverty ratio (2.29 vs. 2.98), much higher food insecurity (40.1% vs. 15.4%), higher alcohol consumption (2.3 vs. 1.6 drinks/day), and a higher proportion of females (63.1% vs. 54.2%). These patterns are mutually reinforcing and together paint a profile of the at-risk population segment.

From EDA to Model Design¶

These EDA findings directly motivate our two models and their specifications:

  • Model 1 (Classification): The class imbalance (~11% positive rate), weak linear correlations (all |r| < 0.20), and distributional quirks (activity paradox, sleep bimodality) suggest that non-linear classifiers might outperform linear ones. To test this, we train both Logistic Regression (as a linear, interpretable baseline) and Random Forest (to capture potential non-linearities and interactions) and compare their ROC-AUC performance. Both models use class_weight='balanced' to address the imbalance.
  • Model 2 (Regression): The cross-tabulation reveals that food insecurity, income, CRP, alcohol, and gender all differ substantially between cohorts. OLS regression will quantify the independent contribution of each predictor to continuous PHQ-9 severity, allowing us to answer: does CRP add predictive value beyond lifestyle and socioeconomic factors? The nine features selected for both models are exactly those that showed meaningful group differences or theoretical motivation in this EDA — no feature is included without empirical or scientific justification.

Machine Learning Models¶

Having established the key distributional patterns and group differences in our EDA, we now build two complementary models to answer our research questions. Both models use the same set of nine features identified through our exploratory analysis.

Model 1 (Classification) uses a standard 80/20 stratified train-test split to evaluate out-of-sample predictive performance. Model 2 (OLS Regression) is fit on the full dataset because its purpose is explanatory inference — estimating the magnitude and significance of each predictor's independent contribution to PHQ-9 severity — rather than out-of-sample prediction. This is the standard approach in epidemiological and social science research, where coefficient interpretation and statistical significance are the primary outputs. The OLS diagnostic plots (residual and Q-Q) verify whether the model's assumptions hold.

In [9]:
# =============================================================================
# MODEL 1: Binary Classification of Clinical Depression
# =============================================================================
print("=" * 70)
print("MODEL 1: Binary Classification — Logistic Regression vs. Random Forest")
print("=" * 70)

# Prepare features and target
feature_cols = ['Age', 'BMI', 'Sleep_Hours', 'Vigorous_Activity_Min',
                'Income_Poverty_Ratio', 'CRP_mgL', 'Alcohol_Drinks_Day', 'Food_Insecure']

# Encode Gender as binary
df_model = df_clean.copy()
df_model['Gender_Female'] = (df_model['Gender'] == 'Female').astype(int)
feature_cols_full = feature_cols + ['Gender_Female']

X = df_model[feature_cols_full]
y = df_model['Depressed']

# Train-test split (80/20, stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(f"\nTraining set: {X_train.shape[0]:,} samples ({y_train.mean()*100:.1f}% positive)")
print(f"Test set:     {X_test.shape[0]:,} samples ({y_test.mean()*100:.1f}% positive)")

# Scale features for Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Logistic Regression (linear baseline) ---
lr = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
y_prob_lr = lr.predict_proba(X_test_scaled)[:, 1]

print("\n--- Logistic Regression ---")
print(classification_report(y_test, y_pred_lr, target_names=['Not Depressed', 'Depressed']))
auc_lr = roc_auc_score(y_test, y_prob_lr)
print(f"ROC-AUC: {auc_lr:.3f}")

# --- Random Forest (non-linear) ---
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                            max_depth=10, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

print("\n--- Random Forest ---")
print(classification_report(y_test, y_pred_rf, target_names=['Not Depressed', 'Depressed']))
auc_rf = roc_auc_score(y_test, y_prob_rf)
print(f"ROC-AUC: {auc_rf:.3f}")
======================================================================
MODEL 1: Binary Classification — Logistic Regression vs. Random Forest
======================================================================

Training set: 4,851 samples (11.0% positive)
Test set:     1,213 samples (11.0% positive)

--- Logistic Regression ---
               precision    recall  f1-score   support

Not Depressed       0.94      0.72      0.82      1079
    Depressed       0.22      0.66      0.33       134

     accuracy                           0.71      1213
    macro avg       0.58      0.69      0.58      1213
 weighted avg       0.86      0.71      0.76      1213

ROC-AUC: 0.725

--- Random Forest ---
               precision    recall  f1-score   support

Not Depressed       0.90      0.95      0.92      1079
    Depressed       0.22      0.12      0.16       134

     accuracy                           0.86      1213
    macro avg       0.56      0.53      0.54      1213
 weighted avg       0.82      0.86      0.84      1213

ROC-AUC: 0.674
In [10]:
# =============================================================================
# MODEL 1: ROC Curves and Feature Importance
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# ROC Curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)

axes[0].plot(fpr_lr, tpr_lr, color='#232D4B', lw=2, label=f'Logistic Regression (AUC={auc_lr:.3f})')
axes[0].plot(fpr_rf, tpr_rf, color='#E57200', lw=2, label=f'Random Forest (AUC={auc_rf:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', lw=1, label='Random Chance')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('Graph 9: ROC Curves — Depression Classification', fontsize=12)
axes[0].legend(loc='lower right')

# Feature Importance (Random Forest)
importances = rf.feature_importances_
feat_imp = pd.Series(importances, index=feature_cols_full).sort_values(ascending=True)
feat_imp.plot(kind='barh', color='#232D4B', ax=axes[1])
axes[1].set_title('Graph 10: Random Forest Feature Importance', fontsize=12)
axes[1].set_xlabel('Importance (Gini)')

plt.tight_layout()
plt.show()

# Logistic Regression Coefficients (Odds Ratios)
print("\n--- Logistic Regression: Odds Ratios ---")
odds_ratios = pd.DataFrame({
    'Feature': feature_cols_full,
    'Coefficient': lr.coef_[0],
    'Odds_Ratio': np.exp(lr.coef_[0])
}).sort_values('Odds_Ratio', ascending=False)
display(odds_ratios.round(3))
No description has been provided for this image
--- Logistic Regression: Odds Ratios ---
Feature Coefficient Odds_Ratio
7 Food_Insecure 0.380 1.463
6 Alcohol_Drinks_Day 0.220 1.247
8 Gender_Female 0.135 1.145
1 BMI 0.116 1.123
5 CRP_mgL 0.064 1.066
3 Vigorous_Activity_Min 0.024 1.025
2 Sleep_Hours -0.132 0.876
0 Age -0.200 0.819
4 Income_Poverty_Ratio -0.273 0.761
In [12]:
# =============================================================================
# MODEL 1: Logistic Regression — Odds Ratio Visualization
# =============================================================================
# The Random Forest feature importance (Graph 10) shows how the ensemble
# ranks features by Gini impurity. For Logistic Regression, the equivalent
# interpretability tool is the odds ratio — exp(coefficient) — which quantifies
# each feature's multiplicative effect on the odds of depression.

odds_ratios = pd.DataFrame({
    'Feature': feature_cols_full,
    'Coefficient': lr.coef_[0],
    'Odds_Ratio': np.exp(lr.coef_[0])
}).sort_values('Odds_Ratio', ascending=True)

# Clean feature labels for display
label_map = {
    'Food_Insecure': 'Food Insecure', 'Alcohol_Drinks_Day': 'Alcohol (Drinks/Day)',
    'Gender_Female': 'Gender (Female)', 'BMI': 'BMI', 'CRP_mgL': 'CRP (mg/L)',
    'Vigorous_Activity_Min': 'Vigorous Activity (Min)', 'Sleep_Hours': 'Sleep Hours',
    'Age': 'Age', 'Income_Poverty_Ratio': 'Income-to-Poverty Ratio'
}
labels = [label_map[f] for f in odds_ratios['Feature']]
colors = ['#E57200' if or_val > 1 else '#232D4B' for or_val in odds_ratios['Odds_Ratio']]

fig, ax = plt.subplots(figsize=(10, 6))

# Lollipop-style forest plot: line from OR=1 to each feature's OR
for idx, (or_val, col) in enumerate(zip(odds_ratios['Odds_Ratio'].values, colors)):
    ax.plot([1.0, or_val], [idx, idx], color=col, linewidth=2, alpha=0.6)
ax.scatter(odds_ratios['Odds_Ratio'].values, range(len(odds_ratios)),
           color=colors, s=120, zorder=3, edgecolors='white', linewidths=1)

ax.axvline(x=1.0, color='black', linewidth=1.5, linestyle='--', alpha=0.7)
ax.set_yticks(range(len(odds_ratios)))
ax.set_yticklabels(labels, fontsize=11)
ax.set_xlabel('Odds Ratio (Standardized Coefficients)', fontsize=12)
ax.set_title('Graph 10a: Logistic Regression — Odds Ratios for Depression', fontsize=13)

# Annotate — protective labels LEFT of dot (away from line), risk labels RIGHT
for idx, val in enumerate(odds_ratios['Odds_Ratio'].values):
    if val >= 1:
        ax.text(val + 0.025, idx, f'OR = {val:.2f}', va='center', ha='left',
                fontsize=10, fontweight='bold')
    else:
        ax.text(val - 0.025, idx, f'OR = {val:.2f}', va='center', ha='right',
                fontsize=10, fontweight='bold')

ax.set_xlim(0.55, 1.60)

# Subtle background shading for protective vs. risk regions
ax.axvspan(0.55, 1.0, alpha=0.04, color='#232D4B')
ax.axvspan(1.0, 1.60, alpha=0.04, color='#E57200')

# Region labels positioned below data, clear of x-axis ticks
ax.text(0.75, -1.4, '\u2190 Protective', fontsize=10, color='#232D4B',
        ha='center', fontstyle='italic', fontweight='bold')
ax.text(1.30, -1.4, 'Risk \u2192', fontsize=10, color='#E57200',
        ha='center', fontstyle='italic', fontweight='bold')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(-1.8, len(odds_ratios) - 0.5)

plt.tight_layout()
plt.show()

# Print the full odds ratio table for reference
display(odds_ratios.round(3))
No description has been provided for this image
Out[12]:
FeatureCoefficientOdds_Ratio
7Food_Insecure0.3801.463
6Alcohol_Drinks_Day0.2201.247
8Gender_Female0.1351.145
1BMI0.1161.123
5CRP_mgL0.0641.066
3Vigorous_Activity_Min0.0241.025
2Sleep_Hours-0.1320.876
0Age-0.2000.819
4Income_Poverty_Ratio-0.2730.761

Discussion: Model 1 — Binary Classification Results¶

Logistic Regression achieved an ROC-AUC of 0.725 and an F1-score of 0.33 for the depressed class (precision = 0.22, recall = 0.66). Although overall accuracy is 71%, the model correctly identifies 66% of truly depressed individuals — a meaningful recall rate for a screening application where catching at-risk individuals matters more than avoiding false alarms.

The odds ratios quantify each feature's association with depression after controlling for all others:

  • Food Insecurity (OR = 1.46) is the strongest risk factor: being food insecure increases the odds of clinical depression by 46%, holding all else constant.
  • Alcohol consumption (OR = 1.25) and Female gender (OR = 1.15) also elevate risk.
  • Income-to-Poverty Ratio (OR = 0.76) is the strongest protective factor: each unit increase in the ratio reduces depression odds by 24%.
  • Age (OR = 0.82) shows a protective effect — older adults in this cohort screen lower, possibly reflecting survivorship bias or cohort effects.
  • CRP (OR = 1.07) shows a modest but positive association, consistent with the inflammatory hypothesis.

Random Forest achieved a lower ROC-AUC of 0.674 and a much lower depressed-class F1-score of 0.16 (recall = 0.12). This is a notable result: contrary to our initial expectation, the linear model outperformed the non-linear ensemble. This likely reflects the fact that with only 670 depressed cases and 9 features, the Random Forest overfits to noise in the training data — the weak, diffuse signal in this dataset is better captured by the regularized linear decision boundary of Logistic Regression than by the deep splits of 200 decision trees. This is an important data science lesson: more complex models are not always better, especially when the signal-to-noise ratio is low and the minority class is small.

The ROC curves (Graph 9) visually confirm the Logistic Regression advantage across all probability thresholds. The odds ratio plot (Graph 10a) provides the Logistic Regression analogue of the Random Forest feature importance (Graph 10). It visually confirms that Food Insecurity and Alcohol are the strongest risk factors, while Income-to-Poverty Ratio and Age are protective. The feature importance plot (Graph 10) shows that the Random Forest relied most heavily on Age and CRP — both continuous variables with wide ranges that are easy to split on — rather than the binary Food Insecurity variable that proved most informative in the Logistic Regression framework.

In [11]:
# =============================================================================
# MODEL 2: Multiple Linear Regression — Predicting PHQ-9 Severity
# =============================================================================
print("=" * 70)
print("MODEL 2: OLS Regression — Predicting Continuous PHQ-9 Score")
print("=" * 70)

# Prepare features with Gender encoded
X_reg = df_model[feature_cols_full].astype(float)
y_reg = df_model['PHQ9_Score'].astype(float)

# Add constant for statsmodels OLS
X_reg_const = sm.add_constant(X_reg)

# Fit OLS model
ols_model = sm.OLS(y_reg, X_reg_const).fit()
print(ols_model.summary())
======================================================================
MODEL 2: OLS Regression — Predicting Continuous PHQ-9 Score
======================================================================
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             PHQ9_Score   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     77.89
Date:                Sun, 19 Apr 2026   Prob (F-statistic):          4.45e-137
Time:                        18:45:03   Log-Likelihood:                -17489.
No. Observations:                6064   AIC:                         3.500e+04
Df Residuals:                    6054   BIC:                         3.506e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     3.7929      0.439      8.640      0.000       2.932       4.654
Age                      -0.0199      0.003     -5.961      0.000      -0.026      -0.013
BMI                       0.0318      0.008      3.999      0.000       0.016       0.047
Sleep_Hours              -0.1434      0.035     -4.064      0.000      -0.213      -0.074
Vigorous_Activity_Min  3.976e-05      0.000      0.270      0.787      -0.000       0.000
Income_Poverty_Ratio     -0.1802      0.040     -4.544      0.000      -0.258      -0.102
CRP_mgL                   0.0245      0.008      3.023      0.003       0.009       0.040
Alcohol_Drinks_Day        0.3385      0.026     12.838      0.000       0.287       0.390
Food_Insecure             2.0589      0.159     12.913      0.000       1.746       2.371
Gender_Female             0.7898      0.113      6.968      0.000       0.568       1.012
==============================================================================
Omnibus:                     1933.494   Durbin-Watson:                   2.031
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5734.417
Skew:                           1.669   Prob(JB):                         0.00
Kurtosis:                       6.399   Cond. No.                     3.02e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [12]:
# =============================================================================
# MODEL 2: Diagnostic Plots
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residual Plot
residuals = ols_model.resid
fitted = ols_model.fittedvalues

axes[0].scatter(fitted, residuals, alpha=0.3, s=10, color='#232D4B')
axes[0].axhline(y=0, color='#E57200', linestyle='--', linewidth=2)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Graph 11: Residual Plot — OLS Regression', fontsize=12)

# Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Graph 12: Q-Q Plot — Residual Normality', fontsize=12)
axes[1].get_lines()[0].set_color('#232D4B')
axes[1].get_lines()[1].set_color('#E57200')

plt.tight_layout()
plt.show()

# Model performance metrics
from sklearn.metrics import r2_score, mean_squared_error
y_pred_ols = ols_model.predict(X_reg_const)
r2 = r2_score(y_reg, y_pred_ols)
rmse = np.sqrt(mean_squared_error(y_reg, y_pred_ols))
print(f"\nR-squared: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")
print(f"Mean PHQ-9 Score: {y_reg.mean():.2f}")
No description has been provided for this image
R-squared: 0.1038
RMSE: 4.33
Mean PHQ-9 Score: 3.51

Discussion: Model 2 — OLS Regression Results¶

The OLS regression answers our second research question: which predictors have the largest independent effect on PHQ-9 severity? The model is jointly significant (F = 77.89, p < 10⁻¹³⁷) and explains 10.4% of the variance in PHQ-9 scores (R² = 0.104, Adj. R² = 0.102, RMSE = 4.33).

Key coefficient interpretations (all controlling for the other eight predictors):

  • Food Insecurity has the largest effect: being food insecure is associated with a 2.06-point higher PHQ-9 score (p < 0.001). On a 0–27 scale where the clinical cutoff is 10, this is a substantial shift — equivalent to roughly half the distance between "minimal" and "mild" symptom categories.
  • Gender (Female) is associated with a 0.79-point increase (p < 0.001), confirming the gender disparity observed in the EDA.
  • Alcohol consumption shows a 0.34-point increase per additional drink per day (p < 0.001). A person averaging 3 drinks/day would score roughly 1 point higher than a non-drinker, all else equal.
  • Income-to-Poverty Ratio has a −0.18-point effect per unit (p < 0.001): moving from poverty (ratio = 1) to middle income (ratio = 3) is associated with a 0.36-point lower PHQ-9 score.
  • CRP (inflammation) independently predicts severity with a coefficient of 0.025 per mg/L (p = 0.003). While statistically significant, this is a small effect — a person with CRP of 10 mg/L (high inflammation) would score only 0.25 points higher than a person with CRP of 0. CRP adds independent information, but its practical effect size is modest compared to food insecurity and alcohol.
  • Sleep shows a protective effect (−0.14 per hour, p < 0.001): each additional hour of sleep is associated with a slightly lower depression score.
  • Vigorous Physical Activity is not statistically significant (p = 0.787). This is an important null finding: after controlling for socioeconomic status, alcohol, sleep, and inflammation, vigorous exercise does not independently predict depression severity in this cross-sectional dataset. This does not mean exercise is irrelevant to mental health — it may operate indirectly through BMI, sleep, or CRP — but it is not a significant direct predictor in our multivariate model.

Model fit (R² = 10.4%): A low R² is expected and informative. Depression is a complex, multifactorial condition influenced by genetics, childhood adversity, social relationships, therapy access, and other factors not captured in NHANES. The fact that just nine lifestyle and socioeconomic variables explain over 10% of the variance — with food insecurity alone accounting for a substantial share — is itself a meaningful public health finding.

Diagnostic plots: The residual plot (Graph 11) shows heteroscedasticity: residuals fan out at higher fitted values, which is expected given the right-skewed PHQ-9 distribution. The Q-Q plot (Graph 12) departs from normality in the upper tail, again reflecting the skew. These violations mean that the OLS standard errors may be slightly biased; in future work, robust standard errors (HC3) or a generalized linear model (e.g., negative binomial regression for count data) could address this.

Convergence across models: Both Model 1 and Model 2 agree that Food Insecurity and Income-to-Poverty Ratio are the most impactful predictors, while CRP contributes modest but statistically significant independent information. This convergent evidence across classification and regression frameworks strengthens our confidence in these findings.

Conclusions and Recommendations¶

Summary of Findings¶

This tutorial walked through the complete Data Science Lifecycle — from extracting and merging 8 NHANES component tables, through exploratory analysis of distributions and group differences, to building and evaluating two complementary predictive models. The key findings are:

  1. Depression is prevalent but detectable. Approximately 11% of the 6,064 US adults in this cohort screened positive for clinical depression (PHQ-9 ≥ 10). The Logistic Regression classifier achieved an ROC-AUC of 0.725 and identified 66% of depressed individuals (recall = 0.66) using only nine routinely collected variables — demonstrating that population-level screening with non-clinical data is feasible, if imperfect.

  2. Socioeconomic factors are the most powerful predictors. Food insecurity emerged as the single strongest predictor in both models: an odds ratio of 1.46 for binary depression classification and a 2.06-point increase in continuous PHQ-9 score. Income-to-poverty ratio was the strongest protective factor (OR = 0.76). Together, these findings underscore that mental health cannot be separated from material living conditions.

  3. Objective biomarkers add independent but modest value. C-Reactive Protein was significantly elevated in the depressed cohort (mean 4.9 vs. 3.6 mg/L, p = 1.11 × 10⁻⁶) and contributed independent predictive power beyond self-reported lifestyle factors (OLS coef = 0.025, p = 0.003). However, its practical effect size is small — much smaller than food insecurity or alcohol consumption. CRP is best understood as one signal among many, not a standalone biomarker for depression.

  4. Simpler models outperformed complex ones. Logistic Regression (AUC = 0.725) outperformed Random Forest (AUC = 0.674) for depression classification, likely because the signal in this dataset is weak and diffuse — better suited to a regularized linear boundary than an ensemble of deep decision trees. This is a practical reminder that model complexity should match signal strength.

  5. Physical activity was not a significant predictor after controlling for other factors. Despite its importance in clinical recommendations, vigorous exercise did not independently predict PHQ-9 severity in the OLS model (p = 0.787). Its effects may be mediated through BMI, sleep, or inflammation rather than operating as a direct predictor.

Recommendations¶

For Public Health Policy:

  • Food security programs may have underappreciated mental health benefits. Our data suggests that a food-insecure individual scores, on average, 2 points higher on the PHQ-9 than a food-secure individual with otherwise identical characteristics — a clinically meaningful difference. Addressing material deprivation could be as impactful for population mental health as expanding access to clinical screening.
  • Routine inclusion of inflammatory biomarkers (CRP) in primary care wellness panels could provide a modest additional signal for depression risk, though its effect size (OR = 1.07) suggests it should complement, not replace, behavioral and socioeconomic screening.

For Clinical Practice:

  • PHQ-9 screening should be paired with assessment of food security status and other social determinants. The 40.1% food insecurity rate among the depressed cohort (vs. 15.4% in the non-depressed cohort) suggests that a positive food insecurity screen may warrant closer mental health follow-up.
  • The gender disparity in depression prevalence (63.1% female in depressed cohort vs. 54.2% overall) persists in the post-pandemic era and should inform targeted outreach.

For Data Science:

  • When working with health survey data, be cautious of skip-logic NaNs (e.g., zero-filling vigorous activity for non-exercisers), class imbalance (the naive accuracy of 89% is meaningless here), and the difference between statistical significance and practical effect size (CRP is significant at p = 0.003 but moves the PHQ-9 by only 0.025 per mg/L).
  • More complex models are not always better. With 670 positive cases and 9 features, Logistic Regression outperformed Random Forest — a useful counterexample to the assumption that ensemble methods always dominate.
  • Combining self-reported behavioral data with objective laboratory measurements (CRP, BMI) produces richer models, but the self-reported socioeconomic variables (food security, income) proved more predictive than the laboratory values in this dataset.

Limitations¶

  • Cross-sectional design: NHANES is a snapshot, not a longitudinal study. We can identify associations but cannot establish causation. Food insecurity may cause depression, or depression may cause food insecurity (through job loss), or both may share common upstream causes.
  • Sample attrition from inner joins: The inner join across 8 tables reduced the sample from 11,933 to 6,064 adults. Individuals who did not complete certain exam components may differ systematically from those who did, potentially introducing selection bias.
  • Median imputation: Filling missing values with medians preserves sample size but may attenuate associations. More sophisticated imputation methods (e.g., MICE) could be explored.
  • Self-reported behavioral data: Physical activity and alcohol consumption are self-reported and subject to social desirability bias. CRP and BMI, being objectively measured, are not affected by this limitation.
  • Survey weights not applied: NHANES uses a complex sampling design with survey weights to produce nationally representative estimates. This tutorial uses unweighted analysis for pedagogical simplicity, which may slightly bias prevalence estimates.
  • OLS assumption violations: The residual and Q-Q plots reveal heteroscedasticity and non-normality, consistent with the right-skewed PHQ-9 distribution. Robust standard errors or a negative binomial GLM could improve inference in future work.
  • Model scope: Depression is influenced by many factors not present in NHANES (genetics, childhood trauma, social support, therapy access). Our model explains 10.4% of the variance — the remaining 89.6% reflects unmeasured complexity.

Future Directions¶

  • Apply NHANES survey weights for nationally representative estimates.
  • Incorporate longitudinal data (e.g., MIDUS study) to test causal pathways between food insecurity and depression.
  • Explore additional biomarkers (vitamin D, thyroid function) and lifestyle variables (screen time, social participation).
  • Use robust standard errors or negative binomial regression to address heteroscedasticity.
  • Deploy the Logistic Regression classifier as an interactive screening tool for community health workers, using the odds ratios to generate individualized risk scores.

References and Resources¶

Data Sources¶

  • CDC NHANES 2021-2023 Public-Use Data
  • CDC PLACES: Local Data for Better Health
  • BRFSS: Behavioral Risk Factor Surveillance System

Clinical & Scientific References¶

  • PHQ-9 Instrument (APA)
  • NIMH: Major Depression Statistics
  • Osimo, E.F. et al. (2019). Prevalence of low-grade inflammation in depression: a systematic review and meta-analysis. Molecular Psychiatry. DOI
  • Boden, J.M. & Fergusson, D.M. (2011). Alcohol and depression. Addiction. DOI
  • Pourmotabbed, A. et al. (2020). Food insecurity and mental health: a systematic review and meta-analysis. Advances in Nutrition. DOI
  • Salk, R.H. et al. (2017). Gender differences in depression in representative national samples. Psychological Bulletin. DOI

Technical References¶

  • scikit-learn Documentation
  • statsmodels OLS Documentation
  • seaborn Visualization Library
  • Arie's Coding Guide (CMPS 3140)

Notebook Conversion¶

To generate the HTML version of this notebook for the GitHub Pages site:

jupyter nbconvert --to html Dumont-FinalTutorial.ipynb

Rename the output to index.html and upload to your adumont2.github.io repository.