Predictive Modeling of Mental Health and Wellness: Synergies of Lifestyle, Socioeconomic, and Biomarker Factors (NHANES 2021–2023)¶
Course: CMPS 6790 — Data Science
Name: Aaron Dumont
Project Website: https://adumont2.github.io
Notebook Direct Link: https://adumont2.github.io/Dumont-FinalTutorial.html
Can a single survey question predict depression better than a blood test?
Clinical depression is a common and important clinical problem. In this tutorial, we'll show that food insecurity — a single question on a household survey — is a stronger predictor of clinical depression than C-Reactive Protein, the gold-standard inflammatory biomarker drawn from a vial of blood. We'll build the models, examine the odds ratios, and walk through every step from raw CDC data to public-health recommendations. By the end, you'll have a complete, reproducible classification + regression pipeline using only free, publicly available data.
What you'll learn: how to merge 8 SAS files into a tidy analytic dataset, when to use median vs. mean imputation, why simpler models sometimes beat random forests, and how to interpret odds ratios for a broad audience.
Introduction and Motivation¶
Depression is a leading cause of disability worldwide. According to the National Institute of Mental Health, an estimated 21 million adults in the United States experienced at least one major depressive episode in 2021 alone. While clinical screening tools like the Patient Health Questionnaire-9 (PHQ-9) are widely validated for identifying depression, they are primarily administered in clinical settings — meaning most at-risk individuals are never screened.
This raises a critical question for data science: can routinely collected lifestyle and physiological metrics — physical activity, sleep, diet security, alcohol consumption, body composition, and systemic inflammation — reliably predict who is at risk for clinical depression? If so, such models could power population-level screening tools that identify at-risk individuals before they present to a clinic.
The CDC's National Health and Nutrition Examination Survey (NHANES) provides a uniquely powerful dataset for this investigation. Unlike most health surveys, NHANES combines self-reported behavioral data with objective clinical measurements from physical exams and laboratory tests. This fusion of subjective and objective data is rare, and it is precisely what makes NHANES ideal for building interpretable, multi-modal predictive models.
For this tutorial, we use the most recent completed cycle: August 2021 – August 2023. The motivation for selecting this timeframe is to capture contemporary, post-pandemic health trends. The COVID-19 pandemic reshaped physical activity patterns, dietary habits, alcohol consumption, and food security across the US population — all factors hypothesized to affect mental health.
Research Questions¶
Predictive Modeling (Classification): Can we accurately predict a positive screening for clinical depression (PHQ-9 ≥ 10) using a combination of modifiable lifestyle factors (vigorous physical activity, sleep duration, alcohol consumption), physiological biomarkers (BMI, C-Reactive Protein), and socioeconomic indicators (income-to-poverty ratio, food security status)?
Explanatory Modeling (Regression): Among these predictors, which variables have the largest proportional effect on the continuous severity of depressive symptoms (PHQ-9 score), and does systemic inflammation (CRP) independently predict depression severity after controlling for lifestyle and socioeconomic factors?
Tutorial Roadmap¶
This notebook follows the complete Data Science Lifecycle:
- ETL — Extract, merge, and clean 8 NHANES component tables from separate files into a single analytic dataset
- EDA — Explore distributions, relationships, and group differences to inform model design
- Modeling — Build and evaluate a classification model (Logistic Regression and Random Forest) and a regression model (Multiple OLS Regression)
- Conclusions — Synthesize findings into actionable recommendations for public health and clinical practice
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, roc_curve, f1_score)
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import warnings
# Suppress convergence/deprecation warnings for cleaner tutorial output;
# in production, review all warnings before suppressing.
warnings.filterwarnings('ignore')
# Set visualization style for professional graphs
sns.set_theme(style="whitegrid")
custom_palette = ['#232D4B', '#E57200']
sns.set_palette(custom_palette)
print("Libraries loaded and visualization style configured.")
Libraries loaded and visualization style configured.
Extraction, Transform, and Load (ETL)¶
Data Sources¶
All data is sourced directly from the CDC's NHANES 2021–2023 public-use repository. NHANES collects data through a complex, stratified, multistage probability sample of the civilian, non-institutionalized US population, combining in-home interviews with standardized physical examinations and laboratory tests at Mobile Examination Centers.
We extract and merge eight separate component tables — each representing a distinct data collection instrument with its own methodology — to build a comprehensive analytic dataset. These tables span three fundamentally different data collection modalities: (1) household interview questionnaires (Demographics, Sleep, Physical Activity, Depression Screener, Alcohol Use, Food Security), (2) Mobile Examination Center (MEC) physical measurements (Body Measures), and (3) laboratory assays on collected biospecimens (high sensitivity C-reactive protein (hs-CRP)). Integrating self-reported behavioral data with objective physical and laboratory measurements is a core strength of NHANES and mirrors the multi-source data fusion challenges common in applied data science. Importantly, these eight tables represent three fundamentally distinct datasets — each with its own sampling frame, collection protocol, and data type: self-reported questionnaire responses (interview), objective anthropometric measurements (MEC exam), and biochemical assay results (laboratory). Merging them into a single analytic table as a data integration task required careful handling of missingness patterns that differ across collection modalities.
| # | NHANES Table | File | Key Variables | Rationale |
|---|---|---|---|---|
| 1 | Demographics | DEMO_L.xpt |
Age, Gender, Income-to-Poverty Ratio | Core demographics and socioeconomic confounder |
| 2 | Body Measures | BMX_L.xpt |
BMI | Objective physiological marker of metabolic health |
| 3 | Sleep Disorders | SLQ_L.xpt |
Sleep Duration (hours) | Modifiable lifestyle factor tied to mental health |
| 4 | Physical Activity | PAQ_L.xpt |
Vigorous Activity (min/day) | Key modifiable lifestyle factor under investigation |
| 5 | Depression Screener | DPQ_L.xpt |
PHQ-9 item scores → total score | Target variable for both models |
| 6 | hs-CRP Laboratory | HSCRP_L.xpt |
High-sensitivity C-Reactive Protein (mg/L) | Objective biomarker of systemic inflammation (Osimo et al., 2019) |
| 7 | Alcohol Use | ALQ_L.xpt |
Average drinks per day (past 12 months) | Behavioral risk factor with strong depression comorbidity (Boden & Fergusson, 2011) |
| 8 | Food Security | FSQ_L.xpt |
Adult food security category (FSDAD) | Socioeconomic determinant of mental health (Pourmotabbed et al., 2020) |
The inclusion of hs-CRP (an objective laboratory biomarker), alcohol use (a behavioral risk factor), and food security (a socioeconomic determinant) was motivated by the published literature linking each to depression outcomes. This allows us to test whether physiological, behavioral, and structural factors contribute independently to depression risk — a question with direct implications for designing multi-level interventions.
For broader population-level context, the CDC PLACES dataset and the Behavioral Risk Factor Surveillance System (BRFSS) provide complementary county- and state-level depression prevalence estimates. These aggregate datasets cannot be merged at the individual level with NHANES, but they serve as useful benchmarks for validating whether our micro-level findings align with population trends. For example, BRFSS reports a national adult depression prevalence of approximately 18–21%, compared to our NHANES-derived screening rate of 11% — a difference likely attributable to the PHQ-9 ≥ 10 clinical threshold versus BRFSS's self-reported diagnosis measure. Reconciling these prevalence estimates across survey instruments is a natural direction for future work.
ETL Challenges¶
- File Format & Multi-Table Merge: Data is provided in SAS Transport format (
.XPT). We performed sequential inner joins on the unique respondent sequence number (SEQN) across all 8 tables to ensure every record has complete cross-domain information. Because not all NHANES participants complete every exam component, the inner join reduces the sample from 11,933 demographic records to 6,337 respondents who appear in all 8 tables (and 6,064 adults after filtering to age ≥ 20 and dropping rows missing the PHQ-9 target). This attrition is expected — NHANES laboratory and questionnaire modules are administered to age- and exam-specific subsamples — and the resulting analytic cohort is still large enough for robust modeling. - Cryptic Variable Names: Raw SAS variable codes (e.g.,
PAD820,INDFMPIR,ALQ130) were mapped to human-readable names following tidy data principles. - Survey Skip-Logic NaNs & Special Codes: In NHANES, if a participant answers "No" to engaging in vigorous activity, the follow-up question about minutes is skipped (recorded as NaN). These were explicitly filled with
0to reflect zero minutes of activity. Similarly, participants who reported never drinking alcohol have NaN for average drinks per day, filled with0. For ALQ130 (alcohol), response codes 777 (Refused) and 999 (Don't know) were recoded to NaN before zero-filling. Note: the 2021–2023 cycle dropped the household-level food security variable (FSDHH); we useFSDAD(adult food security category) instead, which is derived from the 10-item US Food Security Survey Module. - Target Variable Integrity: Rows missing the PHQ-9 depression score were dropped — we cannot train or evaluate models on an unknown outcome.
- Standard Imputation: Remaining missing values in features (e.g., unrecorded BMI, CRP) were filled with the column median to preserve cohort size for exploratory analysis (median was used over mean due to less sensitivity to skewing). This is a conservative choice; more sophisticated imputation (e.g., Multiple Imputation by Chained Equations (MICE)) could be explored in future work.
💡 Aside — Why median over mean? When data is right-skewed (as ours is for CRP, alcohol, and PHQ-9 scores), the mean gets pulled toward the long tail by a small number of extreme values. The median doesn't move with outliers, so it produces a more representative "typical value" for imputation. As a rule of thumb: if
df.skew()is greater than 1 or less than -1, prefer the median.
# =============================================================================
# 1. EXTRACTION: Download 2021-2023 SAS XPT files directly from the CDC
# =============================================================================
base_url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/"
print("Downloading 8 NHANES 2021-2023 component tables from CDC...")
df_demo = pd.read_sas(base_url + 'DEMO_L.xpt') # Demographics
df_body = pd.read_sas(base_url + 'BMX_L.xpt') # Body Measures
df_sleep = pd.read_sas(base_url + 'SLQ_L.xpt') # Sleep
df_activity = pd.read_sas(base_url + 'PAQ_L.xpt') # Physical Activity
df_depr = pd.read_sas(base_url + 'DPQ_L.xpt') # Depression Screener (PHQ-9)
df_crp = pd.read_sas(base_url + 'HSCRP_L.xpt') # hs-CRP (Inflammation Lab)
df_alcohol = pd.read_sas(base_url + 'ALQ_L.xpt') # Alcohol Use (NEW)
df_food = pd.read_sas(base_url + 'FSQ_L.xpt') # Food Security (NEW)
print(f" Demographics: {df_demo.shape[0]:,} records")
print(f" Body Measures: {df_body.shape[0]:,} records")
print(f" Sleep: {df_sleep.shape[0]:,} records")
print(f" Physical Activity: {df_activity.shape[0]:,} records")
print(f" Depression (PHQ): {df_depr.shape[0]:,} records")
print(f" hs-CRP Lab: {df_crp.shape[0]:,} records")
print(f" Alcohol Use: {df_alcohol.shape[0]:,} records")
print(f" Food Security: {df_food.shape[0]:,} records")
# =============================================================================
# 2. TRANSFORM: Merge all 8 datasets on the unique respondent ID (SEQN)
# =============================================================================
df = df_demo.copy()
for table in [df_body, df_sleep, df_activity, df_depr, df_crp, df_alcohol, df_food]:
df = pd.merge(df, table, on='SEQN', how='inner')
print(f"\nAfter inner join across all 8 tables: {df.shape[0]:,} records")
# =============================================================================
# 3. TRANSFORM: Compute PHQ-9 Depression Score
# =============================================================================
# PHQ-9 consists of 9 items (DPQ010-DPQ090), each scored 0-3.
# Codes 7 (refused) and 9 (don't know) are treated as missing.
phq_cols = [f'DPQ0{i}0' for i in range(1, 10)]
df[phq_cols] = df[phq_cols].replace([7, 9], np.nan)
df['PHQ9_Score'] = df[phq_cols].sum(axis=1)
# =============================================================================
# 4. TRANSFORM: Rename variables, create features, filter adults
# =============================================================================
col_mapping = {
'SEQN': 'ID',
'RIAGENDR': 'Gender',
'RIDAGEYR': 'Age',
'BMXBMI': 'BMI',
'SLD012': 'Sleep_Hours',
'PAD820': 'Vigorous_Activity_Min',
'INDFMPIR': 'Income_Poverty_Ratio',
'LBXHSCRP': 'CRP_mgL',
'ALQ130': 'Alcohol_Drinks_Day', # NEW: Average alcoholic drinks per day
'FSDAD': 'Food_Security_Category' # NEW: Adult food security (1=Full, 2=Marginal, 3=Low, 4=Very Low)
}
keep_cols = list(col_mapping.values()) + ['PHQ9_Score']
df_clean = df.rename(columns=col_mapping)[keep_cols]
# Filter to adults aged 20+
df_clean = df_clean[df_clean['Age'] >= 20]
# Map Gender to readable labels
df_clean['Gender'] = df_clean['Gender'].map({1.0: 'Male', 2.0: 'Female'})
# Create binary depression target: PHQ-9 >= 10 is the clinical cutoff
df_clean['Depressed'] = df_clean['PHQ9_Score'].apply(
lambda x: 1 if x >= 10 else 0 if pd.notnull(x) else np.nan
)
# Create binary food insecurity indicator (Low or Very Low security)
df_clean['Food_Insecure'] = df_clean['Food_Security_Category'].apply(
lambda x: 1 if x >= 3 else 0 if pd.notnull(x) else np.nan
)
# =============================================================================
# 5. LOAD: Handle missing values and set data types
# =============================================================================
# Drop rows missing the target variable
df_clean = df_clean.dropna(subset=['PHQ9_Score'])
# Clean special codes: ALQ130 uses 777=Refused, 999=Don't know
df_clean['Alcohol_Drinks_Day'] = df_clean['Alcohol_Drinks_Day'].replace([777, 999], np.nan)
# Skip-logic NaN fills: 0 activity = didn't exercise; 0 drinks = doesn't drink
df_clean['Vigorous_Activity_Min'] = df_clean['Vigorous_Activity_Min'].fillna(0)
df_clean['Alcohol_Drinks_Day'] = df_clean['Alcohol_Drinks_Day'].fillna(0)
# Median imputation for remaining continuous features
for col in ['BMI', 'Sleep_Hours', 'Income_Poverty_Ratio', 'CRP_mgL']:
df_clean[col] = df_clean[col].fillna(df_clean[col].median())
# Mode imputation for categorical food security
df_clean['Food_Insecure'] = df_clean['Food_Insecure'].fillna(0)
df_clean['Food_Security_Category'] = df_clean['Food_Security_Category'].fillna(
df_clean['Food_Security_Category'].mode()[0]
)
# Set logical data types
df_clean = df_clean.astype({
'ID': 'int64', 'Age': 'int64',
'Sleep_Hours': 'float64', 'Vigorous_Activity_Min': 'float64',
'CRP_mgL': 'float64', 'Alcohol_Drinks_Day': 'float64',
'PHQ9_Score': 'int64', 'Depressed': 'int64', 'Food_Insecure': 'int64'
})
print(f"\nFinal analytic dataset: {df_clean.shape[0]:,} adults, {df_clean.shape[1]} variables")
print(f"\n--- Data Types ---")
print(df_clean.dtypes)
print(f"\n--- First 5 Rows ---")
display(df_clean.head())
print(f"\n--- Descriptive Statistics ---")
display(df_clean.describe().round(2))
Downloading 8 NHANES 2021-2023 component tables from CDC... Demographics: 11,933 records Body Measures: 8,860 records Sleep: 8,501 records Physical Activity: 8,153 records Depression (PHQ): 6,337 records hs-CRP Lab: 8,727 records Alcohol Use: 6,337 records Food Security: 11,933 records After inner join across all 8 tables: 6,337 records Final analytic dataset: 6,064 adults, 13 variables --- Data Types --- ID int64 Gender object Age int64 BMI float64 Sleep_Hours float64 Vigorous_Activity_Min float64 Income_Poverty_Ratio float64 CRP_mgL float64 Alcohol_Drinks_Day float64 Food_Security_Category float64 PHQ9_Score int64 Depressed int64 Food_Insecure int64 dtype: object --- First 5 Rows ---
| ID | Gender | Age | BMI | Sleep_Hours | Vigorous_Activity_Min | Income_Poverty_Ratio | CRP_mgL | Alcohol_Drinks_Day | Food_Security_Category | PHQ9_Score | Depressed | Food_Insecure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 130378 | Male | 43 | 27.0 | 9.5 | 45.0 | 5.00 | 1.78 | 0.0 | 1.0 | 0 | 0 | 0 |
| 1 | 130379 | Male | 66 | 33.5 | 9.0 | 45.0 | 5.00 | 2.03 | 3.0 | 1.0 | 1 | 0 | 0 |
| 2 | 130380 | Female | 44 | 29.7 | 8.0 | 0.0 | 1.41 | 5.62 | 1.0 | 1.0 | 2 | 0 | 0 |
| 3 | 130386 | Male | 34 | 30.2 | 7.5 | 30.0 | 1.33 | 1.05 | 2.0 | 1.0 | 1 | 0 | 0 |
| 4 | 130387 | Female | 68 | 42.6 | 3.0 | 0.0 | 1.32 | 3.96 | 0.0 | 1.0 | 0 | 0 | 0 |
--- Descriptive Statistics ---
| ID | Age | BMI | Sleep_Hours | Vigorous_Activity_Min | Income_Poverty_Ratio | CRP_mgL | Alcohol_Drinks_Day | Food_Security_Category | PHQ9_Score | Depressed | Food_Insecure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 | 6064.00 |
| mean | 136346.30 | 53.85 | 29.81 | 7.71 | 41.04 | 2.91 | 3.72 | 1.69 | 1.55 | 3.51 | 0.11 | 0.18 |
| std | 3437.86 | 17.18 | 7.30 | 1.58 | 378.79 | 1.54 | 7.14 | 2.17 | 0.97 | 4.57 | 0.31 | 0.39 |
| min | 130378.00 | 20.00 | 11.10 | 2.00 | 0.00 | 0.00 | 0.11 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 25% | 133335.50 | 39.00 | 24.70 | 7.00 | 0.00 | 1.62 | 0.87 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 50% | 136391.00 | 57.00 | 28.50 | 8.00 | 0.00 | 2.82 | 1.83 | 1.00 | 1.00 | 2.00 | 0.00 | 0.00 |
| 75% | 139304.25 | 68.00 | 33.50 | 8.50 | 40.00 | 4.56 | 3.90 | 2.00 | 2.00 | 5.00 | 0.00 | 0.00 |
| max | 142310.00 | 80.00 | 74.80 | 14.00 | 9999.00 | 5.00 | 150.92 | 15.00 | 4.00 | 26.00 | 1.00 | 1.00 |
Exploratory Data Analysis (EDA)¶
Before building predictive models, we must understand the structure, distributions, and key relationships within our data. The analyses below are organized in three stages:
- Baseline statistics and correlations — Establish the depression base rate, class imbalance, and linear relationships among features to guide model selection.
- Distribution and group-comparison visualizations — Examine how each candidate predictor differs between depressed and non-depressed cohorts, directly informing feature selection.
- Statistical hypothesis testing and cross-tabulations — Formally test key relationships and quantify group differences across multiple dimensions.
Each analysis is accompanied by a discussion of what the finding means for our modeling strategy.
💡 Aside — Why start with a correlation matrix? Before building any model, we want to know whether the predictors are linearly related to the target and to each other. Strong predictor-to-target correlations suggest signal. Strong predictor-to-predictor correlations (multicollinearity) can destabilize regression coefficients. A heatmap can provide very critical information upfront.
# =============================================================================
# EDA Stage 1: Summary Statistics
# =============================================================================
print("=" * 70)
print("SUMMARY STATISTICS")
print("=" * 70)
# 1. Depression base rate (establishes class imbalance)
dep_rate = df_clean['Depressed'].mean() * 100
print(f"\n1. Depression prevalence (PHQ-9 >= 10): {dep_rate:.1f}%")
# 2. Median vigorous activity across cohort
med_activity = df_clean['Vigorous_Activity_Min'].median()
print(f"2. Median vigorous activity (min/day): {med_activity:.1f}")
# 3. Median income-to-poverty ratio
med_income = df_clean['Income_Poverty_Ratio'].median()
print(f"3. Median income-to-poverty ratio: {med_income:.2f}")
# 4. Food insecurity rate
food_insec_rate = df_clean['Food_Insecure'].mean() * 100
print(f"4. Food insecurity prevalence: {food_insec_rate:.1f}%")
# 5. Median alcohol consumption
med_alcohol = df_clean['Alcohol_Drinks_Day'].median()
mean_alcohol = df_clean['Alcohol_Drinks_Day'].mean()
print(f"5. Median alcohol drinks/day: {med_alcohol:.1f} (mean: {mean_alcohol:.2f})")
# 6-7. Activity by depression status
activity_by_dep = df_clean.groupby('Depressed')['Vigorous_Activity_Min'].mean()
print(f"6. Mean vigorous activity — NOT depressed: {activity_by_dep[0]:.1f} min/day")
print(f"7. Mean vigorous activity — DEPRESSED: {activity_by_dep[1]:.1f} min/day")
====================================================================== SUMMARY STATISTICS ====================================================================== 1. Depression prevalence (PHQ-9 >= 10): 11.0% 2. Median vigorous activity (min/day): 0.0 3. Median income-to-poverty ratio: 2.82 4. Food insecurity prevalence: 18.2% 5. Median alcohol drinks/day: 1.0 (mean: 1.69) 6. Mean vigorous activity — NOT depressed: 37.8 min/day 7. Mean vigorous activity — DEPRESSED: 66.8 min/day
# =============================================================================
# EDA Stage 1: Correlation Heatmap
# =============================================================================
plt.figure(figsize=(10, 8))
numeric_cols = ['Age', 'BMI', 'Sleep_Hours', 'Vigorous_Activity_Min',
'Income_Poverty_Ratio', 'CRP_mgL', 'Alcohol_Drinks_Day',
'Food_Insecure', 'PHQ9_Score']
corr_matrix = df_clean[numeric_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f", cmap='coolwarm',
vmax=0.3, vmin=-0.3, square=True, linewidths=.5,
cbar_kws={"shrink": .8})
plt.title('Graph 1: Correlation Heatmap — All Features vs. PHQ-9 Score (2021-2023)',
fontsize=13, pad=20)
plt.tight_layout()
plt.show()
Discussion: Summary Statistics and Correlation Structure¶
Several findings from the summary statistics directly shape our modeling approach:
Class imbalance is substantial. Only about 11% of the adult cohort screens positive for clinical depression. This means a naive classifier that predicts "not depressed" for everyone would achieve ~89% accuracy — a meaningless result. For Model 1, we must use class-weighted algorithms or resampling techniques to ensure the model actually learns the minority class.
Physical activity shows a paradoxical pattern. The median vigorous activity across the cohort is 0.0 minutes — more than half of US adults report zero vigorous exercise. Yet the mean activity is actually higher in the depressed cohort than the non-depressed cohort. This paradox arises because the depressed group is small (~11%), so a handful of highly active individuals dramatically skew the average. This is a textbook demonstration of why relying on linear averages for skewed behavioral data is misleading, and it motivates the use of non-linear models like Random Forest that can handle such distributional quirks.
The correlation heatmap reveals weak linear relationships. No single feature has a strong linear correlation with PHQ-9 score (most |r| < 0.20). The Income-to-Poverty Ratio shows a slight negative correlation, and Food Insecurity shows a positive correlation — both consistent with the hypothesis that socioeconomic hardship is associated with worse mental health. The weak linear correlations further reinforce that non-linear classifiers may outperform Logistic Regression for the binary prediction task.
Collinearity between predictors is low/below multicollinearity threshold. Most feature-to-feature correlations are weak (|r| < 0.15), though Food Insecurity and Income-to-Poverty Ratio show a moderate negative correlation (r = -0.41), which makes substantive sense — lower income is associated with food insecurity. CRP and BMI also show a modest relationship (r = 0.27), consistent with the well-documented link between adiposity and systemic inflammation. Neither correlation approaches the multicollinearity threshold (|r| > 0.80), so both variables can be retained as independent predictors.
EDA Stage 2: Distribution and Group-Comparison Visualizations¶
The summary statistics established the depression base rate and revealed that linear correlations are weak. The following four visualizations explore the distributional differences between depressed and non-depressed cohorts across our key predictors. Each plot is chosen to answer a specific question about feature utility for model building.
# =============================================================================
# EDA Stage 2: Four-Panel Visualization
# =============================================================================
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Exploratory Data Analysis: Distributions and Relationships',
fontsize=16, fontweight='bold', y=1.02)
# Graph 2: Distribution of PHQ-9 Scores
# Question: How is the target variable distributed? Is binary cutoff appropriate?
sns.histplot(data=df_clean, x='PHQ9_Score', bins=27, kde=True,
color='#232D4B', ax=axes[0, 0])
axes[0, 0].axvline(x=10, color='#E57200', linestyle='--', linewidth=2, label='Clinical Cutoff (≥10)')
axes[0, 0].legend()
axes[0, 0].set_title('Graph 2: Distribution of PHQ-9 Depression Scores', fontsize=12)
axes[0, 0].set_xlabel('PHQ-9 Score')
axes[0, 0].set_ylabel('Count of Individuals')
# Graph 3: Depression Rate by Gender
# Question: Does depression prevalence differ by gender?
sns.barplot(data=df_clean, x='Gender', y='Depressed', errorbar=None,
palette=custom_palette, hue='Gender', legend=False, ax=axes[0, 1])
axes[0, 1].set_title('Graph 3: Clinical Depression Rate by Gender', fontsize=12)
axes[0, 1].set_ylabel('Proportion Depressed (PHQ-9 ≥ 10)')
# Graph 4: Food Insecurity Rate by Depression Status
# Question: Does food insecurity differ sharply between depressed and non-depressed cohorts?
food_dep = df_clean.groupby('Depressed')['Food_Insecure'].mean().reset_index()
sns.barplot(data=food_dep, x='Depressed', y='Food_Insecure',
palette=custom_palette, hue='Depressed', legend=False, ax=axes[1, 0])
axes[1, 0].set_title('Graph 4: Food Insecurity Rate by Depression Status', fontsize=12)
axes[1, 0].set_xticks([0, 1])
axes[1, 0].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1, 0].set_xlabel('Depression Status')
axes[1, 0].set_ylabel('Proportion Food Insecure')
# Graph 5: Systemic Inflammation (CRP) by Depression Status
# Question: Is there objective biomarker evidence of depression-related inflammation?
df_vis = df_clean[df_clean['CRP_mgL'] <= 15] # Cap at 15 mg/L to reduce outlier distortion
sns.violinplot(data=df_vis, x='Depressed', y='CRP_mgL', palette=custom_palette,
hue='Depressed', legend=False, inner="quartile", ax=axes[1, 1])
axes[1, 1].set_title('Graph 5: Systemic Inflammation (CRP) by Depression Status', fontsize=12)
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1, 1].set_xlabel('Depression Status')
axes[1, 1].set_ylabel('C-Reactive Protein (mg/L)')
plt.tight_layout()
plt.show()
Discussion: Distribution and Group-Comparison Findings¶
Graph 2 — PHQ-9 Distribution: The distribution is heavily right-skewed, with the vast majority of scores clustering between 0 and 5. The clinical cutoff at ≥10 (orange dashed line) isolates a relatively small tail, confirming the ~11% positive rate from our summary statistics. This skew has two implications for modeling: (1) the binary classification approach (Model 1) is appropriate because the continuous score is not normally distributed, and (2) any linear regression on the raw PHQ-9 score (Model 2) may have difficulty with the distributional assumptions of OLS.
Graph 3 — Depression by Gender: Females show a meaningfully higher depression rate than males (roughly 14% vs. 8%). This gender gap is consistent with the broader epidemiological literature (Salk et al., 2017). Gender should be included as a feature in our models, and we should consider whether model performance differs by subgroup.
Graph 4 — Food Insecurity: This is a compelling finding. The rate of food insecurity is dramatically higher among the depressed cohort compared to the non-depressed cohort. Food insecurity is a measure of structural socioeconomic hardship that goes beyond income alone — it captures the lived experience of material deprivation. This strong group difference justifies its inclusion as a feature and suggests that socioeconomic factors may be among the strongest predictors in our models.
Graph 5 — CRP by Depression Status: This is the most visually striking difference among our predictors. The depressed cohort shows both a higher median CRP level and a wider, more right-skewed distribution of systemic inflammation. CRP is an objective laboratory measurement — not self-reported — which makes this finding particularly valuable. It motivates a formal statistical test (below) and strongly supports including CRP as a predictor in both models.
# =============================================================================
# EDA Stage 2 (continued): New Predictor Visualizations
# =============================================================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Graph 6: Alcohol Consumption by Depression Status
sns.boxplot(data=df_clean, x='Depressed', y='Alcohol_Drinks_Day',
palette=custom_palette, hue='Depressed', legend=False, ax=axes[0])
axes[0].set_title('Graph 6: Alcohol Consumption by Depression Status', fontsize=12)
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Not Depressed', 'Depressed'])
axes[0].set_xlabel('Depression Status')
axes[0].set_ylabel('Avg Alcoholic Drinks / Day')
axes[0].set_ylim(0, 10) # Cap visualization to reduce extreme outlier distortion
# Graph 7: BMI Distribution by Depression Status
sns.boxplot(data=df_clean, x='Depressed', y='BMI', palette=custom_palette,
hue='Depressed', legend=False, ax=axes[1])
axes[1].set_title('Graph 7: BMI Distribution by Depression Status', fontsize=12)
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(['Not Depressed', 'Depressed'])
axes[1].set_xlabel('Depression Status')
axes[1].set_ylabel('Body Mass Index (BMI)')
# Graph 8: Sleep Duration by Depression Status
sns.violinplot(data=df_clean, x='Depressed', y='Sleep_Hours',
palette=custom_palette, hue='Depressed', legend=False,
inner="quartile", ax=axes[2])
axes[2].set_title('Graph 8: Sleep Duration by Depression Status', fontsize=12)
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['Not Depressed', 'Depressed'])
axes[2].set_xlabel('Depression Status')
axes[2].set_ylabel('Sleep Duration (hours)')
plt.tight_layout()
plt.show()
Discussion: Alcohol, Food Security, and Sleep Findings¶
Graph 6 — Alcohol Consumption: The depressed cohort tends to show slightly higher median alcohol consumption, though both groups have highly skewed distributions with many zeros (non-drinkers). The presence of heavy-drinking outliers in both groups suggests alcohol may contribute non-linearly — occasional heavy drinking may matter more than daily averages. This is another reason non-linear classifiers like Random Forest may capture patterns that Logistic Regression misses.
Graph 7 — BMI by Depression Status: The depressed cohort shows a slightly higher median BMI and a wider interquartile range compared to the non-depressed cohort. However, the overlap between the two distributions is substantial, suggesting that BMI alone is a weak discriminator. It may contribute meaningful signal in combination with other features but is unlikely to be a dominant predictor.
Graph 8 — Sleep Duration: The depressed cohort shows a wider distribution of sleep duration, with more individuals at both extremes (very short and very long sleep). This is consistent with the clinical understanding that depression is associated with both insomnia and hypersomnia. The bimodal pattern suggests that a simple linear relationship between sleep hours and depression may not capture the full picture — both too little and too much sleep are risk indicators.
💡 Aside — Picking the right test: For two binary/categorical variables (e.g., depressed × food-insecure), the chi-squared test of independence is appropriate. For a continuous variable compared across two groups, the choice depends on distribution: a t-test if the variable is approximately normal, or Mann-Whitney U if it's skewed. Below, the skewness check confirms CRP (skew = 8.04) and alcohol (skew = 2.57) are heavily right-skewed — well past the conventional threshold of 1 — so Mann-Whitney is appropriate. Sleep duration is approximately symmetric (skew ≈ −0.09), but we apply Mann-Whitney consistently across all continuous comparisons; it remains valid for symmetric data, with only a small loss of statistical power. Food insecurity is categorical, so chi-squared is the right call. The choice is driven by data type and distribution, not by preference.
print("Skewness check (>1 = right-skewed, supports non-parametric test):")
print(f" CRP_mgL: skew = {df_clean['CRP_mgL'].skew():.2f}")
print(f" Alcohol_Drinks_Day: skew = {df_clean['Alcohol_Drinks_Day'].skew():.2f}")
print(f" Sleep_Hours: skew = {df_clean['Sleep_Hours'].skew():.2f}")
print()
print("→ CRP and Alcohol are heavily right-skewed; Mann-Whitney U is appropriate.")
print("→ Sleep is approximately symmetric; t-test would also be valid, but Mann-Whitney is used for consistency.")
print()
Skewness check (>1 = right-skewed, supports non-parametric test): CRP_mgL: skew = 8.04 Alcohol_Drinks_Day: skew = 2.57 Sleep_Hours: skew = -0.09 → CRP and Alcohol are heavily right-skewed; Mann-Whitney U is appropriate. → Sleep is approximately symmetric; t-test would also be valid, but Mann-Whitney is used for consistency.
EDA Stage 3: Hypothesis Testing¶
# =============================================================================
# EDA Stage 3: Statistical Hypothesis Testing
# =============================================================================
print("=" * 70)
print("EDA Stage 3: STATISTICAL TESTS: Depressed vs. Non-Depressed Cohort Comparisons")
print("=" * 70)
# Mann-Whitney U test for CRP (non-normal, continuous)
crp_nd = df_clean[df_clean['Depressed'] == 0]['CRP_mgL'].dropna()
crp_d = df_clean[df_clean['Depressed'] == 1]['CRP_mgL'].dropna()
stat_crp, p_crp = stats.mannwhitneyu(crp_d, crp_nd, alternative='greater')
print(f"\n1. CRP (inflammation):")
print(f" Mann-Whitney U = {stat_crp:,.0f}, p = {p_crp:.2e}")
print(f" {'*** SIGNIFICANT' if p_crp < 0.05 else 'Not significant'}: Depressed cohort has {'higher' if p_crp < 0.05 else 'similar'} inflammation.")
# Mann-Whitney U test for Alcohol
alc_nd = df_clean[df_clean['Depressed'] == 0]['Alcohol_Drinks_Day'].dropna()
alc_d = df_clean[df_clean['Depressed'] == 1]['Alcohol_Drinks_Day'].dropna()
stat_alc, p_alc = stats.mannwhitneyu(alc_d, alc_nd, alternative='greater')
print(f"\n2. Alcohol (drinks/day):")
print(f" Mann-Whitney U = {stat_alc:,.0f}, p = {p_alc:.4f}")
print(f" {'*** SIGNIFICANT' if p_alc < 0.05 else 'Not significant'}: Depressed cohort has higher average alcohol consumption.")
# Chi-squared test for Food Insecurity (categorical)
contingency = pd.crosstab(df_clean['Depressed'], df_clean['Food_Insecure'])
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
print(f"\n3. Food Insecurity (categorical):")
print(f" Chi-squared = {chi2:.1f}, df = {dof}, p = {p_chi:.2e}")
print(f" {'*** SIGNIFICANT' if p_chi < 0.05 else 'Not significant'}: Depressed cohort has higher rates of food insecurity (40.1% vs. 15.4%).")
# Mann-Whitney U test for Sleep
sleep_nd = df_clean[df_clean['Depressed'] == 0]['Sleep_Hours'].dropna()
sleep_d = df_clean[df_clean['Depressed'] == 1]['Sleep_Hours'].dropna()
stat_slp, p_slp = stats.mannwhitneyu(sleep_d, sleep_nd, alternative='two-sided')
print(f"\n4. Sleep Duration (two-sided test):")
print(f" Mann-Whitney U = {stat_slp:,.0f}, p = {p_slp:.4f}")
print(f" {'*** SIGNIFICANT' if p_slp < 0.05 else 'Not significant'}: Depressed cohort has significantly different sleep duration.")
======================================================================
EDA Stage 3: STATISTICAL TESTS: Depressed vs. Non-Depressed Cohort Comparisons
======================================================================
1. CRP (inflammation):
Mann-Whitney U = 2,009,218, p = 1.11e-06
*** SIGNIFICANT: Depressed cohort has higher inflammation.
2. Alcohol (drinks/day):
Mann-Whitney U = 2,075,518, p = 0.0000
*** SIGNIFICANT: Depressed cohort has higher average alcohol consumption.
3. Food Insecurity (categorical):
Chi-squared = 243.0, df = 1, p = 8.83e-55
*** SIGNIFICANT: Depressed cohort has higher rates of food insecurity (40.1% vs. 15.4%).
4. Sleep Duration (two-sided test):
Mann-Whitney U = 1,639,981, p = 0.0001
*** SIGNIFICANT: Depressed cohort has significantly different sleep duration.
# =============================================================================
# EDA Stage 3: Cross-Tabulation Summary
# =============================================================================
print("\n" + "=" * 70)
print("CROSS-TABULATION: Mean Feature Values by Depression Status")
print("=" * 70 + "\n")
cross_tab = df_clean.groupby('Depressed').agg(
N=('ID', 'count'),
Avg_Age=('Age', 'mean'),
Avg_BMI=('BMI', 'mean'),
Avg_Sleep=('Sleep_Hours', 'mean'),
Avg_Activity=('Vigorous_Activity_Min', 'mean'),
Avg_CRP=('CRP_mgL', 'mean'),
Avg_Alcohol=('Alcohol_Drinks_Day', 'mean'),
Avg_Income_Ratio=('Income_Poverty_Ratio', 'mean'),
Pct_Food_Insecure=('Food_Insecure', 'mean'),
Pct_Female=('Gender', lambda x: (x == 'Female').mean())
).round(3)
cross_tab.index = ['Not Depressed (0)', 'Depressed (1)']
display(cross_tab.T)
====================================================================== CROSS-TABULATION: Mean Feature Values by Depression Status ======================================================================
| Not Depressed (0) | Depressed (1) | |
|---|---|---|
| N | 5394.000 | 670.000 |
| Avg_Age | 54.455 | 48.981 |
| Avg_BMI | 29.645 | 31.107 |
| Avg_Sleep | 7.736 | 7.471 |
| Avg_Activity | 37.838 | 66.813 |
| Avg_CRP | 3.577 | 4.913 |
| Avg_Alcohol | 1.614 | 2.307 |
| Avg_Income_Ratio | 2.984 | 2.285 |
| Pct_Food_Insecure | 0.154 | 0.401 |
| Pct_Female | 0.542 | 0.631 |
Discussion: Statistical Tests and Cross-Tabulation¶
The hypothesis tests confirm what the visualizations suggested:
CRP (inflammation) is significantly higher in the depressed cohort (Mann-Whitney U = 2,009,218, p = 1.11 × 10⁻⁶). This objective laboratory finding aligns with the growing body of research on the "inflammatory hypothesis of depression" — the idea that chronic low-grade systemic inflammation both contributes to and results from depressive disorders. Importantly, CRP is not self-reported, so this finding is less susceptible to reporting bias than behavioral variables.
Alcohol consumption is also significantly elevated in the depressed cohort (p < 0.0001), consistent with the well-documented comorbidity between alcohol use and depression.
Food insecurity shows a strong, statistically significant association with depression (χ² = 243.0, p = 8.83 × 10⁻⁵⁵). This is the most striking finding in our dataset: 40.1% of the depressed cohort is food insecure, compared to just 15.4% of the non-depressed cohort. This reinforces that structural socioeconomic factors are not merely confounders — they may be primary drivers of mental health outcomes at the population level.
Sleep duration differs significantly between groups (p = 0.0001), with the depressed cohort averaging 7.47 hours versus 7.74 hours for the non-depressed — a modest but statistically reliable difference that, combined with the wider distributional spread seen in Graph 8, supports including sleep as a model feature.
The cross-tabulation provides a comprehensive side-by-side comparison. Key patterns: the depressed cohort is younger on average (49.0 vs. 54.5 years), has higher BMI (31.1 vs. 29.6), higher CRP (4.9 vs. 3.6 mg/L), lower income-to-poverty ratio (2.29 vs. 2.98), much higher food insecurity (40.1% vs. 15.4%), higher alcohol consumption (2.3 vs. 1.6 drinks/day), and a higher proportion of females (63.1% vs. 54.2%). These patterns are mutually reinforcing and together paint a profile of the at-risk population segment.
From EDA to Model Design¶
These EDA findings directly motivate our two models and their specifications:
- Model 1 (Classification): The class imbalance (~11% positive rate), weak linear correlations (most |r| < 0.20), and distributional quirks (activity paradox, sleep bimodality) suggest that non-linear classifiers might outperform linear ones. To test this, we train both Logistic Regression (as a linear, interpretable baseline) and Random Forest (to capture potential non-linearities and interactions) and compare their ROC-AUC performance. Both models use
class_weight='balanced'to address the imbalance. - Model 2 (Regression): The cross-tabulation reveals that food insecurity, income, CRP, alcohol, and gender all differ substantially between cohorts. OLS regression will quantify the independent contribution of each predictor to continuous PHQ-9 severity, allowing us to answer: does CRP add predictive value beyond lifestyle and socioeconomic factors? The nine features selected for both models are exactly those that showed meaningful group differences or theoretical motivation in this EDA — no feature is included without empirical or scientific justification.
Machine Learning Models¶
Having established the key distributional patterns and group differences in our EDA, we now build two complementary models to answer our research questions. Both models use the same set of nine features identified through our exploratory analysis.
Model 1 (Classification) uses a standard 80/20 stratified train-test split to evaluate out-of-sample predictive performance. Model 2 (OLS Regression) is fit on the full dataset because its purpose is explanatory inference — estimating the magnitude and significance of each predictor's independent contribution to PHQ-9 severity — rather than out-of-sample prediction. This is the standard approach in epidemiological and social science research, where coefficient interpretation and statistical significance are the primary outputs. The OLS diagnostic plots (residual and Q-Q) verify whether the model's assumptions hold.
💡 Aside — Two pre-modeling steps worth understanding:
- Stratified train/test split. Because only 11% of our sample is depressed, a random 80/20 split could (by chance) leave very few positive cases in the test set. Stratifying preserves the 11% rate in both halves.
- Z-standardization (mean=0, SD=1). Logistic regression with regularization is sensitive to scale — a feature measured in milligrams (CRP) and one measured in minutes (activity) shouldn't compete on raw units. Standardizing also lets us interpret coefficients as effects-per-standard-deviation, which makes odds ratios directly comparable across features.
💡 Aside — Why fit two classifiers? We have elected to fit two different classifiers (logistic regression and random forest). Logistic regression assumes the log-odds of depression are a linear combination of predictors. Random Forests make no such assumption — they capture nonlinearities and interactions automatically. If RF dramatically outperforms logistic regression, that's evidence the relationship is nonlinear. If logistic wins, the relationship is approximately linear and the simpler model is preferred for both interpretability and generalization.
# =============================================================================
# MODEL 1: Binary Classification of Clinical Depression
# =============================================================================
print("=" * 70)
print("MODEL 1: Binary Classification — Logistic Regression vs. Random Forest")
print("=" * 70)
# Prepare features and target
feature_cols = ['Age', 'BMI', 'Sleep_Hours', 'Vigorous_Activity_Min',
'Income_Poverty_Ratio', 'CRP_mgL', 'Alcohol_Drinks_Day', 'Food_Insecure']
# Encode Gender as binary
df_model = df_clean.copy()
df_model['Gender_Female'] = (df_model['Gender'] == 'Female').astype(int)
feature_cols_full = feature_cols + ['Gender_Female']
X = df_model[feature_cols_full]
y = df_model['Depressed']
# Train-test split (80/20, stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=y
)
print(f"\nTraining set: {X_train.shape[0]:,} samples ({y_train.mean()*100:.1f}% positive)")
print(f"Test set: {X_test.shape[0]:,} samples ({y_test.mean()*100:.1f}% positive)")
# Scale features for Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# --- Logistic Regression (linear baseline) ---
lr = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
y_prob_lr = lr.predict_proba(X_test_scaled)[:, 1]
print("\n--- Logistic Regression ---")
print(classification_report(y_test, y_pred_lr, target_names=['Not Depressed', 'Depressed']))
auc_lr = roc_auc_score(y_test, y_prob_lr)
print(f"ROC-AUC: {auc_lr:.3f}")
# --- Random Forest (non-linear) ---
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
max_depth=10, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]
print("\n--- Random Forest ---")
print(classification_report(y_test, y_pred_rf, target_names=['Not Depressed', 'Depressed']))
auc_rf = roc_auc_score(y_test, y_prob_rf)
print(f"ROC-AUC: {auc_rf:.3f}")
# =============================================================================
# Cross-Validation: Verify results are not sensitive to the train/test split
# =============================================================================
# A single 80/20 split could produce optimistic or pessimistic results by chance.
# Stratified 5-fold CV repeats the evaluation across 5 non-overlapping folds,
# each preserving the ~11% depression base rate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Logistic Regression CV (requires scaling inside each fold via pipeline)
from sklearn.pipeline import Pipeline
lr_pipeline = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42))
])
cv_scores_lr = cross_val_score(lr_pipeline, X, y, cv=cv, scoring='roc_auc')
# Random Forest CV (no scaling needed)
cv_scores_rf = cross_val_score(rf, X, y, cv=cv, scoring='roc_auc')
print("\n" + "=" * 70)
print("CROSS-VALIDATION: 5-Fold Stratified ROC-AUC")
print("=" * 70)
print(f"\nLogistic Regression: {cv_scores_lr.mean():.3f} ± {cv_scores_lr.std():.3f} (folds: {', '.join(f'{s:.3f}' for s in cv_scores_lr)})")
print(f"Random Forest: {cv_scores_rf.mean():.3f} ± {cv_scores_rf.std():.3f} (folds: {', '.join(f'{s:.3f}' for s in cv_scores_rf)})")
print(f"\nLogistic Regression {'outperforms' if cv_scores_lr.mean() > cv_scores_rf.mean() else 'underperforms'} Random Forest across all folds.")
print(f"The narrow standard deviation (±{cv_scores_lr.std():.3f}) confirms the 80/20 holdout result is stable.")
======================================================================
MODEL 1: Binary Classification — Logistic Regression vs. Random Forest
======================================================================
Training set: 4,851 samples (11.0% positive)
Test set: 1,213 samples (11.0% positive)
--- Logistic Regression ---
precision recall f1-score support
Not Depressed 0.94 0.72 0.82 1079
Depressed 0.22 0.66 0.33 134
accuracy 0.71 1213
macro avg 0.58 0.69 0.58 1213
weighted avg 0.86 0.71 0.76 1213
ROC-AUC: 0.725
--- Random Forest ---
precision recall f1-score support
Not Depressed 0.90 0.95 0.92 1079
Depressed 0.22 0.12 0.16 134
accuracy 0.86 1213
macro avg 0.56 0.53 0.54 1213
weighted avg 0.82 0.86 0.84 1213
ROC-AUC: 0.674
======================================================================
CROSS-VALIDATION: 5-Fold Stratified ROC-AUC
======================================================================
Logistic Regression: 0.710 ± 0.009 (folds: 0.704, 0.700, 0.719, 0.723, 0.703)
Random Forest: 0.684 ± 0.017 (folds: 0.694, 0.674, 0.703, 0.657, 0.694)
Logistic Regression outperforms Random Forest across all folds.
The narrow standard deviation (±0.009) confirms the 80/20 holdout result is stable.
# =============================================================================
# MODEL 1: ROC Curves and Feature Importance
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# ROC Curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
axes[0].plot(fpr_lr, tpr_lr, color='#232D4B', lw=2, label=f'Logistic Regression (AUC={auc_lr:.3f})')
axes[0].plot(fpr_rf, tpr_rf, color='#E57200', lw=2, label=f'Random Forest (AUC={auc_rf:.3f})')
axes[0].plot([0, 1], [0, 1], 'k--', lw=1, label='Random Chance')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('Graph 9: ROC Curves — Depression Classification', fontsize=12)
axes[0].legend(loc='lower right')
# Feature Importance (Random Forest)
importances = rf.feature_importances_
feat_imp = pd.Series(importances, index=feature_cols_full).sort_values(ascending=True)
feat_imp.plot(kind='barh', color='#232D4B', ax=axes[1])
axes[1].set_title('Graph 10: Random Forest Feature Importance', fontsize=12)
axes[1].set_xlabel('Importance (Gini)')
plt.tight_layout()
plt.show()
# Logistic Regression Coefficients (Odds Ratios)
print("\n--- Logistic Regression: Odds Ratios ---")
odds_ratios = pd.DataFrame({
'Feature': feature_cols_full,
'Coefficient': lr.coef_[0],
'Odds_Ratio': np.exp(lr.coef_[0])
}).sort_values('Odds_Ratio', ascending=False)
display(odds_ratios.round(3))
--- Logistic Regression: Odds Ratios ---
| Feature | Coefficient | Odds_Ratio | |
|---|---|---|---|
| 7 | Food_Insecure | 0.380 | 1.463 |
| 6 | Alcohol_Drinks_Day | 0.220 | 1.247 |
| 8 | Gender_Female | 0.135 | 1.145 |
| 1 | BMI | 0.116 | 1.123 |
| 5 | CRP_mgL | 0.064 | 1.066 |
| 3 | Vigorous_Activity_Min | 0.024 | 1.025 |
| 2 | Sleep_Hours | -0.132 | 0.876 |
| 0 | Age | -0.200 | 0.819 |
| 4 | Income_Poverty_Ratio | -0.273 | 0.761 |
# =============================================================================
# MODEL 1: Logistic Regression — Odds Ratio Forest Plot with 95% CIs
# =============================================================================
# To produce CIs that are exactly calibrated to the sklearn model (which uses
# L2 regularization and balanced class weights), we compute standard errors
# directly from the model's Hessian matrix rather than refitting with statsmodels.
# Compute balanced class weights (same formula sklearn uses internally)
n_samples = len(y_train)
n_classes = 2
class_counts = np.bincount(y_train.astype(int))
w0 = n_samples / (n_classes * class_counts[0])
w1 = n_samples / (n_classes * class_counts[1])
sample_wts = np.where(y_train == 1, w1, w0)
# Compute standard errors from the weighted Hessian of the logistic likelihood
p_hat = lr.predict_proba(X_train_scaled)[:, 1]
W = p_hat * (1 - p_hat) * sample_wts
X_const = np.column_stack([np.ones(X_train_scaled.shape[0]), X_train_scaled])
H = X_const.T @ (X_const * W[:, np.newaxis]) # Hessian: X'WX
cov_matrix = np.linalg.inv(H)
se_all = np.sqrt(np.diag(cov_matrix))
se_coefs = se_all[1:] # drop intercept
# Build odds ratio table from sklearn coefficients + Hessian-derived CIs
sklearn_coefs = lr.coef_[0]
z_scores = sklearn_coefs / se_coefs
p_values = 2 * (1 - stats.norm.cdf(np.abs(z_scores)))
odds_ratios = pd.DataFrame({
'Feature': feature_cols_full,
'Coefficient': sklearn_coefs,
'Odds_Ratio': np.exp(sklearn_coefs),
'CI_low': np.exp(sklearn_coefs - 1.96 * se_coefs),
'CI_high': np.exp(sklearn_coefs + 1.96 * se_coefs),
'p_value': p_values
}).sort_values('Odds_Ratio', ascending=True)
odds_ratios['Significant'] = odds_ratios['p_value'] < 0.05
# Clean feature labels
label_map = {
'Food_Insecure': 'Food Insecure', 'Alcohol_Drinks_Day': 'Alcohol (Drinks/Day)',
'Gender_Female': 'Gender (Female)', 'BMI': 'BMI', 'CRP_mgL': 'CRP (mg/L)',
'Vigorous_Activity_Min': 'Vigorous Activity (Min)', 'Sleep_Hours': 'Sleep Hours',
'Age': 'Age', 'Income_Poverty_Ratio': 'Income-to-Poverty Ratio'
}
labels = [label_map[f] for f in odds_ratios['Feature']]
fig, ax = plt.subplots(figsize=(10, 6))
for idx, row in enumerate(odds_ratios.itertuples()):
color = '#E57200' if row.Odds_Ratio > 1 else '#232D4B'
alpha = 1.0 if row.Significant else 0.35
# CI whiskers with cap marks
ax.plot([row.CI_low, row.CI_high], [idx, idx],
color=color, linewidth=2.5, alpha=alpha, solid_capstyle='round')
ax.plot([row.CI_low, row.CI_low], [idx - 0.12, idx + 0.12],
color=color, linewidth=2, alpha=alpha)
ax.plot([row.CI_high, row.CI_high], [idx - 0.12, idx + 0.12],
color=color, linewidth=2, alpha=alpha)
# Point estimate
ax.scatter(row.Odds_Ratio, idx, color=color, s=100, zorder=4,
edgecolors='white', linewidths=1, alpha=alpha)
# Label with significance flag
label_text = f'OR = {row.Odds_Ratio:.2f}'
if not row.Significant:
label_text += ' (n.s.)'
if row.Odds_Ratio >= 1:
ax.text(row.CI_high + 0.02, idx, label_text, va='center', ha='left',
fontsize=9.5, fontweight='bold', alpha=alpha)
else:
ax.text(row.CI_low - 0.02, idx, label_text, va='center', ha='right',
fontsize=9.5, fontweight='bold', alpha=alpha)
ax.axvline(x=1.0, color='black', linewidth=1.5, linestyle='--', alpha=0.7)
ax.set_yticks(range(len(odds_ratios)))
ax.set_yticklabels(labels, fontsize=11)
ax.set_xlabel('Odds Ratio (95% CI, Standardized Coefficients)', fontsize=12)
ax.set_title('Graph 10a: Logistic Regression \u2014 Odds Ratios for Depression', fontsize=13)
ax.axvspan(ax.get_xlim()[0], 1.0, alpha=0.04, color='#232D4B')
ax.axvspan(1.0, ax.get_xlim()[1], alpha=0.04, color='#E57200')
ax.text(0.72, -1.4, '\u2190 Protective', fontsize=10, color='#232D4B',
ha='center', fontstyle='italic', fontweight='bold')
ax.text(1.35, -1.4, 'Risk \u2192', fontsize=10, color='#E57200',
ha='center', fontstyle='italic', fontweight='bold')
ax.text(0.72, -2.1, 'CI crossing dashed line = not significant',
fontsize=9, color='#666666', ha='center', fontstyle='italic')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlim(0.52, 1.68)
ax.set_ylim(-2.5, len(odds_ratios) - 0.5)
plt.tight_layout()
plt.show()
# Print the full odds ratio table with CIs for reference
print("\nOdds Ratios with 95% Confidence Intervals:")
display(odds_ratios[['Feature', 'Odds_Ratio', 'CI_low', 'CI_high', 'p_value', 'Significant']].round(4))
Odds Ratios with 95% Confidence Intervals:
| Feature | Odds_Ratio | CI_low | CI_high | p_value | Significant | |
|---|---|---|---|---|---|---|
| 4 | Income_Poverty_Ratio | 0.7614 | 0.7122 | 0.8141 | 0.0000 | True |
| 0 | Age | 0.8188 | 0.7695 | 0.8713 | 0.0000 | True |
| 2 | Sleep_Hours | 0.8764 | 0.8310 | 0.9244 | 0.0000 | True |
| 3 | Vigorous_Activity_Min | 1.0247 | 0.9756 | 1.0762 | 0.3309 | False |
| 5 | CRP_mgL | 1.0660 | 1.0025 | 1.1334 | 0.0413 | True |
| 1 | BMI | 1.1230 | 1.0580 | 1.1920 | 0.0001 | True |
| 8 | Gender_Female | 1.1450 | 1.0761 | 1.2183 | 0.0000 | True |
| 6 | Alcohol_Drinks_Day | 1.2466 | 1.1739 | 1.3238 | 0.0000 | True |
| 7 | Food_Insecure | 1.4626 | 1.3784 | 1.5520 | 0.0000 | True |
# =============================================================================
# Natural-Scale Odds Ratios
# Graph shows PER-SD ORs, which are useful for visually
# comparing feature strengths on a common scale. But for the binary variables
# (Food_Insecure, Gender_Female), the more intuitive quantity is the OR for
# a 0 -> 1 transition. Below we convert per-SD coefficients to per-natural-unit
# coefficients by dividing by each feature's original SD: beta_natural = beta_std / sigma.
# =============================================================================
feature_sds = X_train[feature_cols_full].std().values
natural_coefs = sklearn_coefs / feature_sds
natural_se = se_coefs / feature_sds
natural_ors_df = pd.DataFrame({
'Feature': feature_cols_full,
'Per_SD_OR': np.exp(sklearn_coefs),
'Natural_OR': np.exp(natural_coefs),
'Natural_CI_low': np.exp(natural_coefs - 1.96 * natural_se),
'Natural_CI_high': np.exp(natural_coefs + 1.96 * natural_se),
'Unit': ['per year', 'per kg/m²', 'per hour', 'per minute',
'per ratio unit', 'per mg/L', 'per drink/day',
'0 → 1 (food insecure)', '0 → 1 (female)']
}).round(4)
print("Natural-Scale Odds Ratios (per natural unit of each predictor):")
print("(Use these for substantive interpretation, especially for binary variables.)\n")
display(natural_ors_df)
Natural-Scale Odds Ratios (per natural unit of each predictor): (Use these for substantive interpretation, especially for binary variables.)
| Feature | Per_SD_OR | Natural_OR | Natural_CI_low | Natural_CI_high | Unit | |
|---|---|---|---|---|---|---|
| 0 | Age | 0.8188 | 0.9884 | 0.9848 | 0.9920 | per year |
| 1 | BMI | 1.1230 | 1.0160 | 1.0077 | 1.0243 | per kg/m² |
| 2 | Sleep_Hours | 0.8764 | 0.9198 | 0.8892 | 0.9514 | per hour |
| 3 | Vigorous_Activity_Min | 1.0247 | 1.0001 | 0.9999 | 1.0002 | per minute |
| 4 | Income_Poverty_Ratio | 0.7614 | 0.8386 | 0.8032 | 0.8756 | per ratio unit |
| 5 | CRP_mgL | 1.0660 | 1.0095 | 1.0004 | 1.0188 | per mg/L |
| 6 | Alcohol_Drinks_Day | 1.2466 | 1.1073 | 1.0770 | 1.1385 | per drink/day |
| 7 | Food_Insecure | 1.4626 | 2.6715 | 2.2918 | 3.1141 | 0 → 1 (food insecure) |
| 8 | Gender_Female | 1.1450 | 1.3130 | 1.1589 | 1.4876 | 0 → 1 (female) |
💡 Aside — Reading an odds ratio: OR = 1 means no effect; OR < 1 is protective; OR > 1 is risk. Because we standardized features before fitting, the per-SD ORs in Graph 10a are directly comparable across predictors — a CRP per-SD OR of 1.07 means a one-standard-deviation increase in CRP raises the odds of depression by about 7%, on the same scale as the per-SD OR for any other feature. For binary predictors, however, the more intuitive quantity is the natural-scale OR for a 0→1 transition (table above): the natural OR for food insecurity is ≈ 2.7 — being food insecure (vs. food secure) is associated with roughly 2.7× the odds of clinical depression, holding all else constant. The two ORs answer different questions: per-SD for cross-feature comparison, natural-scale for "what is the effect of this specific exposure?" See Gelman (2008) for a discussion of standardization and cross-predictor comparability, and Hosmer, Lemeshow, & Sturdivant (2013) for OR interpretation more generally.
Discussion: Model 1 — Binary Classification Results¶
Logistic Regression achieved an ROC-AUC of 0.725 and an F1-score of 0.33 for the depressed class (precision = 0.22, recall = 0.66). Although overall accuracy is 71%, the model correctly identifies 66% of truly depressed individuals — a meaningful recall rate for a screening application where catching at-risk individuals matters more than avoiding false alarms.
The odds ratios (with 95% confidence intervals derived from the model's Hessian matrix, calibrated to the sklearn coefficients) quantify each feature's association with depression after controlling for all others. A CI that crosses 1.0 indicates the effect is not statistically significant at the 0.05 level:
- Food Insecurity is the strongest risk factor (per-SD OR = 1.46; natural OR = 2.67, 95% CI [2.29, 3.11]): being food insecure is associated with roughly 2.7× the odds of clinical depression compared to being food secure, holding all else constant. Both CIs fall entirely above 1.0, confirming significance.
- Alcohol consumption (per-SD OR = 1.25; natural OR = 1.11 per drink/day, 95% CI [1.08, 1.14]) and Female gender (per-SD OR = 1.15; natural OR = 1.31 for female vs. male, 95% CI [1.16, 1.49]) also elevate risk, with CIs above 1.0.
- Income-to-Poverty Ratio (OR = 0.76) is the strongest protective factor: each unit increase in the ratio reduces depression odds by 24%, with a CI entirely below 1.0.
- Age (OR = 0.82) shows a protective effect — older adults in this cohort screen lower, possibly reflecting survivorship bias or cohort effects.
- CRP (OR = 1.07) shows a modest but positive association, consistent with the inflammatory hypothesis.
- Vigorous Activity (OR ≈ 1.02) has a CI spanning 1.0, indicating it is not a significant independent predictor after controlling for other features — consistent with its non-significant OLS coefficient (p = 0.787).
Random Forest achieved a lower ROC-AUC of 0.674 and a much lower depressed-class F1-score of 0.16 (recall = 0.12). This is a notable result: contrary to our initial expectation, the linear model outperformed the non-linear ensemble. This likely reflects the fact that with only 670 depressed cases and 9 features, the Random Forest overfits to noise in the training data — the weak, diffuse signal in this dataset is better captured by the regularized linear decision boundary of Logistic Regression than by the deep splits of 200 decision trees. This is an important data science lesson: more complex models are not always better, especially when the signal-to-noise ratio is low and the minority class is small.
The ROC curves (Graph 9) visually confirm the Logistic Regression advantage across all probability thresholds. The odds ratio plot (Graph 10a) provides the Logistic Regression analogue of the Random Forest feature importance (Graph 10). It visually confirms that Food Insecurity and Alcohol are the strongest risk factors, while Income-to-Poverty Ratio and Age are protective. The feature importance plot (Graph 10) shows that the Random Forest relied most heavily on Age and CRP — both continuous variables with wide ranges that are easy to split on — rather than the binary Food Insecurity variable that proved most informative in the Logistic Regression framework.
Cross-validation confirms stability. A 5-fold stratified cross-validation reproduces the Logistic Regression advantage across all folds, with narrow standard deviations around the mean AUC. This confirms that the 80/20 holdout result is not an artifact of a lucky split — the Logistic Regression consistently outperforms the Random Forest regardless of how the data is partitioned. The stability of the CV scores also suggests that our dataset, despite having only ~670 positive cases, provides enough signal for reliable model comparison.
# =============================================================================
# MODEL 2: Multiple Linear Regression — Predicting PHQ-9 Severity
# =============================================================================
print("=" * 70)
print("MODEL 2: OLS Regression — Predicting Continuous PHQ-9 Score")
print("=" * 70)
# Prepare features with Gender encoded
X_reg = df_model[feature_cols_full].astype(float)
y_reg = df_model['PHQ9_Score'].astype(float)
# Add constant for statsmodels OLS
X_reg_const = sm.add_constant(X_reg)
# Fit OLS model
ols_model = sm.OLS(y_reg, X_reg_const).fit()
print(ols_model.summary())
======================================================================
MODEL 2: OLS Regression — Predicting Continuous PHQ-9 Score
======================================================================
OLS Regression Results
==============================================================================
Dep. Variable: PHQ9_Score R-squared: 0.104
Model: OLS Adj. R-squared: 0.102
Method: Least Squares F-statistic: 77.89
Date: Sat, 02 May 2026 Prob (F-statistic): 4.45e-137
Time: 15:27:54 Log-Likelihood: -17489.
No. Observations: 6064 AIC: 3.500e+04
Df Residuals: 6054 BIC: 3.506e+04
Df Model: 9
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 3.7929 0.439 8.640 0.000 2.932 4.654
Age -0.0199 0.003 -5.961 0.000 -0.026 -0.013
BMI 0.0318 0.008 3.999 0.000 0.016 0.047
Sleep_Hours -0.1434 0.035 -4.064 0.000 -0.213 -0.074
Vigorous_Activity_Min 3.976e-05 0.000 0.270 0.787 -0.000 0.000
Income_Poverty_Ratio -0.1802 0.040 -4.544 0.000 -0.258 -0.102
CRP_mgL 0.0245 0.008 3.023 0.003 0.009 0.040
Alcohol_Drinks_Day 0.3385 0.026 12.838 0.000 0.287 0.390
Food_Insecure 2.0589 0.159 12.913 0.000 1.746 2.371
Gender_Female 0.7898 0.113 6.968 0.000 0.568 1.012
==============================================================================
Omnibus: 1933.494 Durbin-Watson: 2.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5734.417
Skew: 1.669 Prob(JB): 0.00
Kurtosis: 6.399 Cond. No. 3.02e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Testing a Theoretically-Motivated Interaction Term¶
A natural follow-up question is whether the effect of one predictor depends on the value of another — known as an interaction effect. This is distinct from the collinearity check we performed earlier (which asks whether predictors are correlated with each other), and it is a separate question from what additive OLS regression captures by default.
A theoretically-motivated candidate is Food_Insecure × Income_Poverty_Ratio: the buffering hypothesis that higher income mitigates the mental health impact of food hardship. We test this directly by adding the interaction term to the OLS model and comparing R².
💡 Aside — Interaction vs. collinearity vs. non-linearity: These are three different concepts that get conflated. Collinearity asks "are predictors correlated with each other?" (we addressed this with the heatmap). Interactions ask "does the effect of one predictor depend on the level of another?" (tested below). Non-linearity asks "is the effect of a single predictor curved rather than straight?" (would require polynomial terms or splines). Our Random Forest model captures both interactions and non-linearities automatically, and the fact that it underperformed additive logistic regression (AUC 0.674 vs 0.725) is suggestive that these effects are weak in aggregate — but it doesn't rule out specific targeted interactions like the one tested below.
# =============================================================================
# OLS REGRESSION WITH INTERACTION TERM: Food_Insecure × Income_Poverty_Ratio
# Testing the "buffering hypothesis" — does income moderate the effect of
# food insecurity on depression severity?
# =============================================================================
# Build feature matrix with the interaction term added
X_interact = X_reg.copy()
X_interact['FoodInsec_x_Income'] = (
X_interact['Food_Insecure'] * X_interact['Income_Poverty_Ratio']
)
X_interact_const = sm.add_constant(X_interact)
# Fit the expanded model
ols_interact = sm.OLS(y_reg, X_interact_const).fit()
# Compare to the original additive model
print("=" * 70)
print("MODEL COMPARISON: Additive vs. Interaction")
print("=" * 70)
print(f"Original additive model R²: {ols_model.rsquared:.4f}")
print(f"Model with interaction R²: {ols_interact.rsquared:.4f}")
print(f"Change in R² (ΔR²): {ols_interact.rsquared - ols_model.rsquared:+.4f}")
print(f"Interaction coefficient: {ols_interact.params['FoodInsec_x_Income']:+.4f}")
print(f"Interaction p-value: {ols_interact.pvalues['FoodInsec_x_Income']:.4f}")
ci = ols_interact.conf_int().loc['FoodInsec_x_Income']
print(f"Interaction 95% CI: [{ci[0]:+.4f}, {ci[1]:+.4f}]")
print()
sig = ols_interact.pvalues['FoodInsec_x_Income'] < 0.05
print(f"Verdict: Interaction is {'SIGNIFICANT' if sig else 'NOT significant'} at α = 0.05")
print()
print(ols_interact.summary())
======================================================================
MODEL COMPARISON: Additive vs. Interaction
======================================================================
Original additive model R²: 0.1038
Model with interaction R²: 0.1042
Change in R² (ΔR²): +0.0004
Interaction coefficient: +0.1954
Interaction p-value: 0.1090
Interaction 95% CI: [-0.0436, +0.4344]
Verdict: Interaction is NOT significant at α = 0.05
OLS Regression Results
==============================================================================
Dep. Variable: PHQ9_Score R-squared: 0.104
Model: OLS Adj. R-squared: 0.103
Method: Least Squares F-statistic: 70.37
Date: Sat, 02 May 2026 Prob (F-statistic): 1.14e-136
Time: 15:27:54 Log-Likelihood: -17487.
No. Observations: 6064 AIC: 3.500e+04
Df Residuals: 6053 BIC: 3.507e+04
Df Model: 10
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const 3.8630 0.441 8.757 0.000 2.998 4.728
Age -0.0198 0.003 -5.932 0.000 -0.026 -0.013
BMI 0.0313 0.008 3.931 0.000 0.016 0.047
Sleep_Hours -0.1414 0.035 -4.003 0.000 -0.211 -0.072
Vigorous_Activity_Min 4.084e-05 0.000 0.278 0.781 -0.000 0.000
Income_Poverty_Ratio -0.2035 0.042 -4.818 0.000 -0.286 -0.121
CRP_mgL 0.0243 0.008 2.988 0.003 0.008 0.040
Alcohol_Drinks_Day 0.3384 0.026 12.838 0.000 0.287 0.390
Food_Insecure 1.7146 0.267 6.410 0.000 1.190 2.239
Gender_Female 0.7898 0.113 6.969 0.000 0.568 1.012
FoodInsec_x_Income 0.1954 0.122 1.603 0.109 -0.044 0.434
==============================================================================
Omnibus: 1931.447 Durbin-Watson: 2.031
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5720.508
Skew: 1.668 Prob(JB): 0.00
Kurtosis: 6.394 Cond. No. 3.07e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.07e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Result: The interaction is not statistically significant. The Food_Insecure × Income_Poverty_Ratio term has a coefficient of +0.1954 (95% CI: [−0.044, +0.434], p = 0.109), and adding it to the model leaves R² essentially unchanged (Δ ≈ +0.001). The buffering hypothesis — that higher income mitigates the depression-elevating effect of food insecurity — is not supported by these data. If anything, the (non-significant) positive sign on the interaction coefficient points in the opposite direction, suggesting that food insecurity's effect may even be slightly amplified at higher income levels. One plausible interpretation is that food insecurity at higher income levels may signal acute crisis (job loss, medical bankruptcy, family disruption) rather than chronic poverty, which is a substantively different exposure — but with p = 0.109, this is a hypothesis to flag for future work, not a finding to claim.
Why this is consistent with the broader picture: This result aligns with the Random Forest comparison reported earlier. RF captures arbitrary interactions automatically and still underperformed additive logistic regression (AUC 0.674 vs. 0.725) — collectively, both pieces of evidence point to interactions being weak in this feature set. Note also that adding the interaction term inflated the standard errors on the parent variables: Food_Insecure's coefficient dropped from 2.06 to 1.71, and Income_Poverty_Ratio shifted from −0.18 to −0.20, with both standard errors widening. This is the expected consequence of introducing a term that is, by construction, correlated with both of its components — the model splits explanatory variance across collinear predictors rather than capturing genuinely new information. The condition number rose modestly (3.02e+03 → 3.07e+03), elevated but consistent with mild multicollinearity expected when adding an interaction term constructed from two existing predictors.
Implication for the conclusions: The low R² (≈10.4%) is not explained by missing interaction structure among the measured variables — it reflects the genuinely multifactorial nature of depression. Variance not captured here lies in factors absent from NHANES: genetics, childhood trauma, social support, therapy access, medication use, life events. This reinforces the limitation noted in the Conclusions section, and the addition of this peer-suggested test makes the limitation more rigorous, not less.
This analysis was added in response to feedback from Areen Khalaila — thank you for the valuable feedback.
# =============================================================================
# MODEL 2: Diagnostic Plots
# =============================================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Residual Plot
residuals = ols_model.resid
fitted = ols_model.fittedvalues
axes[0].scatter(fitted, residuals, alpha=0.3, s=10, color='#232D4B')
axes[0].axhline(y=0, color='#E57200', linestyle='--', linewidth=2)
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Graph 11: Residual Plot — OLS Regression', fontsize=12)
# Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Graph 12: Q-Q Plot — Residual Normality', fontsize=12)
axes[1].get_lines()[0].set_color('#232D4B')
axes[1].get_lines()[1].set_color('#E57200')
plt.tight_layout()
plt.show()
# Model performance metrics
from sklearn.metrics import r2_score, mean_squared_error
y_pred_ols = ols_model.predict(X_reg_const)
r2 = r2_score(y_reg, y_pred_ols)
rmse = np.sqrt(mean_squared_error(y_reg, y_pred_ols))
print(f"\nR-squared: {r2:.4f}")
print(f"RMSE: {rmse:.2f}")
print(f"Mean PHQ-9 Score: {y_reg.mean():.2f}")
R-squared: 0.1038 RMSE: 4.33 Mean PHQ-9 Score: 3.51
Discussion: Model 2 — OLS Regression Results¶
The OLS regression answers our second research question: which predictors have the largest independent effect on PHQ-9 severity? The model is jointly significant (F = 77.89, p < 10⁻¹³⁷) and explains 10.4% of the variance in PHQ-9 scores (R² = 0.104, Adj. R² = 0.102, RMSE = 4.33).
Key coefficient interpretations (all controlling for the other eight predictors):
- Food Insecurity has the largest effect: being food insecure is associated with a 2.06-point higher PHQ-9 score (p < 0.001). On a 0–27 scale where the clinical cutoff is 10, this is a substantial shift — equivalent to roughly half the distance between "minimal" and "mild" symptom categories.
- Gender (Female) is associated with a 0.79-point increase (p < 0.001), confirming the gender disparity observed in the EDA.
- Alcohol consumption shows a 0.34-point increase per additional drink per day (p < 0.001). A person averaging 3 drinks/day would score roughly 1 point higher than a non-drinker, all else equal.
- Income-to-Poverty Ratio has a −0.18-point effect per unit (p < 0.001): moving from poverty (ratio = 1) to middle income (ratio = 3) is associated with a 0.36-point lower PHQ-9 score.
- CRP (inflammation) independently predicts severity with a coefficient of 0.025 per mg/L (p = 0.003). While statistically significant, this is a small effect — a person with CRP of 10 mg/L (high inflammation) would score only 0.25 points higher than a person with CRP of 0. CRP adds independent information, but its practical effect size is modest compared to food insecurity and alcohol.
- Sleep shows a protective effect (−0.14 per hour, p < 0.001): each additional hour of sleep is associated with a slightly lower depression score.
- Vigorous Physical Activity is not statistically significant (p = 0.787). This is an important null finding: after controlling for socioeconomic status, alcohol, sleep, and inflammation, vigorous exercise does not independently predict depression severity in this cross-sectional dataset. This does not mean exercise is irrelevant to mental health — it may operate indirectly through BMI, sleep, or CRP — but it is not a significant direct predictor in our multivariate model.
Model fit (R² = 10.4%): A low R² is expected and informative. Depression is a complex, multifactorial condition influenced by genetics, childhood adversity, social relationships, therapy access, and other factors not captured in NHANES. The fact that just nine lifestyle and socioeconomic variables explain over 10% of the variance — with food insecurity alone accounting for a substantial share — is itself a meaningful public health finding.
Diagnostic plots: The residual plot (Graph 11) shows heteroscedasticity: residuals fan out at higher fitted values, which is expected given the right-skewed PHQ-9 distribution. The Q-Q plot (Graph 12) departs from normality in the upper tail, again reflecting the skew. These violations mean that the OLS standard errors may be slightly biased; in future work, robust standard errors (HC3) or a generalized linear model (e.g., negative binomial regression for count data) could address this.
Convergence across models: Both Model 1 and Model 2 agree that Food Insecurity and Income-to-Poverty Ratio are the most impactful predictors, while CRP contributes modest but statistically significant independent information. This convergent evidence across classification and regression frameworks strengthens our confidence in these findings.
Conclusions and Recommendations - Insight and Policy Decision¶
Summary of Findings¶
This tutorial walked through the complete Data Science Lifecycle — from extracting and merging 8 NHANES component tables from different files, through exploratory analysis of distributions and group differences, to building and evaluating two complementary predictive models. The key findings are:
Depression is prevalent but detectable. Approximately 11% of the 6,064 US adults in this cohort screened positive for clinical depression (PHQ-9 ≥ 10). The Logistic Regression classifier achieved an ROC-AUC of 0.725 and identified 66% of depressed individuals (recall = 0.66) using only nine routinely collected variables — demonstrating that population-level screening with non-clinical data is feasible, if imperfect.
Socioeconomic factors are the most powerful predictors. Food insecurity emerged as the single strongest predictor in both models: an odds ratio of 1.46 for binary depression classification and a 2.06-point increase in continuous PHQ-9 score. Income-to-poverty ratio was the strongest protective factor (OR = 0.76). Together, these findings underscore that mental health cannot be separated from material living conditions.
Objective biomarkers add independent but modest value. C-Reactive Protein was significantly elevated in the depressed cohort (mean 4.9 vs. 3.6 mg/L, p = 1.11 × 10⁻⁶) and contributed independent predictive power beyond self-reported lifestyle factors (OLS coef = 0.025, p = 0.003). However, its practical effect size is small — much smaller than food insecurity or alcohol consumption. CRP is best understood as one signal among many, not a standalone biomarker for depression.
Simpler models outperformed complex ones. Logistic Regression (AUC = 0.725) outperformed Random Forest (AUC = 0.674) for depression classification, likely because the signal in this dataset is weak and diffuse — better suited to a regularized linear boundary than an ensemble of deep decision trees. This is a practical reminder that model complexity should match signal strength.
Physical activity was not a significant predictor after controlling for other factors. Despite its importance in clinical recommendations, vigorous exercise did not independently predict PHQ-9 severity in the OLS model (p = 0.787). Its effects may be mediated through BMI, sleep, or inflammation rather than operating as a direct predictor.
Recommendations¶
For Public Health Policy:
- Food security programs may have underappreciated mental health benefits. Our data suggests that a food-insecure individual scores, on average, 2 points higher on the PHQ-9 than a food-secure individual with otherwise identical characteristics — a clinically meaningful difference. Addressing material deprivation could be as impactful for population mental health as expanding access to clinical screening.
- Routine inclusion of inflammatory biomarkers (CRP) in primary care wellness panels could provide a modest additional signal for depression risk, though its effect size (OR = 1.07) suggests it should complement, not replace, behavioral and socioeconomic screening.
For Clinical Practice:
- PHQ-9 screening should be paired with assessment of food security status and other social determinants. The 40.1% food insecurity rate among the depressed cohort (vs. 15.4% in the non-depressed cohort) suggests that a positive food insecurity screen may warrant closer mental health follow-up.
- The gender disparity in depression prevalence (63.1% female in depressed cohort vs. 54.2% overall) persists in the post-pandemic era and should inform targeted outreach.
For Data Science:
- When working with health survey data, be cautious of skip-logic NaNs (e.g., zero-filling vigorous activity for non-exercisers), class imbalance (the naive accuracy of 89% is meaningless here), and the difference between statistical significance and practical effect size (CRP is significant at p = 0.003 but moves the PHQ-9 by only 0.025 per mg/L).
- More complex models are not always better. With 670 positive cases and 9 features, Logistic Regression outperformed Random Forest — a useful counterexample to the assumption that ensemble methods always dominate.
- Combining self-reported behavioral data with objective laboratory measurements (CRP, BMI) produces richer models, but the self-reported socioeconomic variables (food security, income) proved more predictive than the laboratory values in this dataset.
Limitations¶
- Food insecurity operationalization: The binary Food Insecure variable was derived from the NHANES FSDAD variable (Adult Food Security Category), which is based on 10 items from the U.S. Food Security Survey Module. Following the standard USDA convention, we classified categories 3 (Low Food Security) and 4 (Very Low Food Security) as food insecure, and categories 1 (Full) and 2 (Marginal) as food secure. Missing values were conservatively imputed as food secure (0), which would tend to attenuate the observed association — meaning the true effect of food insecurity on depression may be even stronger than reported.
- Cross-sectional design: NHANES is a snapshot, not a longitudinal study. We can identify associations but cannot establish causation. Food insecurity may cause depression, or depression may cause food insecurity (through job loss), or both may share common upstream causes.
- Sample attrition from inner joins: The inner join across 8 tables reduced the sample from 11,933 to 6,064 adults. Individuals who did not complete certain exam components may differ systematically from those who did, potentially introducing selection bias.
- Median imputation: Filling missing values with medians preserves sample size but may attenuate associations. More sophisticated imputation methods (e.g., multiple imputation methods) could be explored.
- Self-reported behavioral data: Physical activity and alcohol consumption are self-reported and subject to social desirability bias. CRP and BMI, being objectively measured, are not affected by this limitation.
- Survey weights not applied: NHANES uses a complex sampling design with survey weights to produce nationally representative estimates. This tutorial uses unweighted analysis for pedagogical simplicity, which may slightly bias prevalence estimates.
- OLS assumption violations: The residual and Q-Q plots reveal heteroscedasticity and non-normality, consistent with the right-skewed PHQ-9 distribution. Robust standard errors or a negative binomial generalized linear model (GLM) could improve inference in future work.
- Model scope: Depression is influenced by many factors not present in NHANES (genetics, childhood trauma, social support, therapy access, etc). Our model explains 10.4% of the variance — the remaining 89.6% reflects unmeasured complexity.
Future Directions¶
- Apply NHANES survey weights for nationally representative estimates.
- Incorporate longitudinal data (e.g., MIDUS study) to test causal pathways between food insecurity and depression.
- Explore additional biomarkers (vitamin D, thyroid function) and lifestyle variables (screen time, social participation).
- Use robust standard errors or negative binomial regression to address heteroscedasticity.
- Deploy the Logistic Regression classifier as an interactive screening tool for community health workers, using the odds ratios to generate individualized risk scores.
References and Resources¶
Data Sources¶
- CDC NHANES 2021-2023 Public-Use Data
- CDC PLACES: Local Data for Better Health
- BRFSS: Behavioral Risk Factor Surveillance System
Clinical & Scientific References¶
- PHQ-9 Instrument (APA)
- NIMH: Major Depression Statistics
- Osimo, E.F. et al. (2019). Prevalence of low-grade inflammation in depression: a systematic review and meta-analysis. Molecular Psychiatry. DOI
- Boden, J.M. & Fergusson, D.M. (2011). Alcohol and depression. Addiction. DOI
- Pourmotabbed, A. et al. (2020). Food insecurity and mental health: a systematic review and meta-analysis. Advances in Nutrition. DOI
- Salk, R.H. et al. (2017). Gender differences in depression in representative national samples. Psychological Bulletin. DOI
Technical References¶
- scikit-learn Documentation
- statsmodels OLS Documentation
- seaborn Visualization Library
- Arie's Coding Guide (CMPS 3140)
- Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine, 27(15), 2865–2873. DOI
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley. DOI