Komal Shahid - Data Science Portfolio

Showcasing projects in machine learning, deep learning, and AI applications

View the Project on GitHub UKOMAL/Komal-Shahid-DS-Portfolio

Mental Health Treatment-Seeking Prediction

Predicting Treatment-Seeking Behavior in Tech Workers

**Status: Complete** | Milestone M1 ✅ | Milestone M2 ✅ | Milestone M3 ✅

Local working mirror: Extra Course/=project1/project1-dsc680/ — see README_DSC680_SYNC.md in that folder for rsync commands to and from this directory.

Repo path: projects/project1-dsc680/ (DSC680 Project 1 in the portfolio repo).


Problem Statement

Mental health conditions affect approximately 1 in 4 people, yet treatment remains significantly under-utilized in the tech industry due to stigma, access barriers, and organizational gaps. Understanding what drives individuals to seek treatment is critical for HR teams and mental health advocates.

This project analyzes the Open Source Mental Illness (OSMI) 2016 Survey to predict treatment-seeking behavior and identify the key organizational and personal factors that influence this decision.

Key Research Question: What factors most strongly predict whether a tech worker will seek mental health treatment?


Dataset

| Attribute | Details |
|---|---|
| Source | OSMI 2016 Mental Health in Tech Survey |
| Sample Size | 1,259 respondents |
| Target Variable | Sought treatment for mental health condition (binary: yes/no) |
| Features | 25 demographic, work-environment, and health-related variables |
| Class Balance | Imbalanced (requires SMOTE) |
| Data Quality | Missing values handled via imputation and removal |
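The class imbalance noted above is addressed with SMOTE, which creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a from-scratch illustration of that idea, not the project's code. The function name `smote_oversample`, its parameters, and the toy rows are assumptions made for the example, and it presumes features are already numeric.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding the row itself)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical, already-numeric minority-class rows (not OSMI data).
minority_rows = [[0.0, 1.0], [1.0, 1.0], [0.5, 2.0], [1.5, 1.5]]
new_rows = smote_oversample(minority_rows, n_new=6, k=2, seed=42)
print(len(new_rows), "synthetic rows created")
```

Oversampling is normally applied to the training split only, so that synthetic rows never leak into the data used for evaluation.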

Sample survey items:


Methodology

1. Exploratory Data Analysis (EDA)

2. Data Preprocessing

3. Model Development & Comparison

Four models were trained and compared:

| Model | AUC-ROC | Accuracy | Macro F1 | Interpretability | Winner? |
|---|---|---|---|---|---|
| Logistic Regression | 0.723 | 68.2% | 0.68 | Excellent (coefficients) | Yes |
| Random Forest | 0.716 | 70.1% | 0.67 | Good (feature importance) | |
| XGBoost | 0.714 | 69.8% | 0.66 | Fair (SHAP required) | |
| Neural Network | 0.709 | 68.9% | 0.65 | Poor (black box) | |

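Decision: Logistic Regression was selected because it has the highest AUC-ROC (0.723) and the highest Macro F1 (0.68), and its coefficients are fully interpretable. The other three models score slightly higher on accuracy (up to 70.1% vs. 68.2%), but those marginal gains don’t justify the loss of explainability for HR stakeholder communication.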
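For readers less familiar with the metrics in the comparison, the following is a minimal from-scratch sketch of AUC-ROC and Macro F1. It is illustrative only: the labels and scores are toy values, not project results, and the helper names `auc_roc` and `macro_f1` are assumptions for this example.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels and predicted probabilities (illustrative, not OSMI results).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold

auc = auc_roc(y_true, scores)
f1 = macro_f1(y_true, y_pred)
print(f"AUC-ROC: {auc:.3f}  Macro F1: {f1:.3f}")
```

AUC-ROC depends only on how the scores rank respondents, while accuracy and Macro F1 depend on the chosen threshold. That is one reason a model can lead on AUC-ROC and trail on accuracy.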

4. Explainability & Fairness


Key Findings

Top 5 Predictive Factors

  1. Work Interference (strongest signal): Employees whose conditions interfere with work are more likely to seek treatment
  2. Family History: A family history of mental illness correlates with treatment-seeking
  3. Mental Health Benefits: Availability of benefits increases treatment likelihood
  4. Previous Diagnosis: Prior mental health diagnosis strongly predicts current treatment-seeking
  5. Supervisor Support: Perception of supportive supervisors increases help-seeking
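Because the selected model is a logistic regression, a ranking like the one above can be read directly from its coefficients: exponentiating a coefficient gives an odds ratio. The sketch below shows the conversion. The feature names and coefficient values are hypothetical placeholders, not the project's fitted values.

```python
import math

# Hypothetical coefficients for illustration only; NOT the project's fitted values.
coefficients = {
    "work_interfere": 1.10,
    "family_history": 0.70,
    "benefits": 0.40,
}


def odds_ratios(coefs):
    """Convert log-odds coefficients to odds ratios via exp(beta)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


ratios = odds_ratios(coefficients)
# Rank by coefficient magnitude, so strong negative effects would also surface.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
for name in ranked:
    print(f"{name}: odds ratio {ratios[name]:.2f}")
```

Coefficient magnitudes are only comparable across features when the inputs share a scale, for example binary indicators or standardized numeric columns.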

Insights for HR Teams

Model Performance


Results & Deliverables

Primary Outputs

Milestone 1 - Project Proposal

Milestone 2 - Analysis & Whitepaper

Milestone 3 - Final Presentation

Folder Structure

projects/project1-dsc680/
├── README.md                    # This file
├── code/                        # Analysis pipeline (Python)
├── figures/                     # Generated plots and charts
├── milestone1_proposal/         # M1: proposal PDFs/DOCX, rubric notes
├── milestone2_whitepaper/       # M2: whitepaper, infographic (HTML/PDF/DOCX)
├── milestone3_final/            # M3: presentation, Q&A, final whitepaper
├── discussions/                 # (local only — gitignored) Canvas discussion drafts
└── references/                  # Source links and reading list

Technical Stack

Data Processing:  Pandas, NumPy
Analysis:         Jupyter Notebook, scikit-learn
Modeling:         scikit-learn, XGBoost, LightGBM
Explainability:   SHAP, permutation importance
Visualization:    Matplotlib, Seaborn, Plotly
Evaluation:       Cross-validation, bootstrapping
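Permutation importance, listed under Explainability, measures how much a model's score drops when one feature column is shuffled. The sketch below implements the idea in plain Python with a stand-in model. The function names, the toy model, and the data are assumptions for illustration. scikit-learn ships its own version as `sklearn.inspection.permutation_importance`, though the README does not say which implementation the project used.

```python
import random


def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column is shuffled independently."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in model: predicts 1 when feature 0 equals 1, ignores feature 1.
def toy_model(row):
    return int(row[0] == 1)


X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0, 1, 0]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

A feature the model ignores scores exactly zero, while shuffling a feature the model relies on lowers accuracy.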

How to Run the Code

Prerequisites

pip install -r requirements.txt
# Requires: pandas, numpy, scikit-learn, xgboost, shap, jupyter

Step 1: Exploratory Data Analysis

jupyter notebook notebooks/01_eda_exploratory_analysis.ipynb

Generates:

Step 2: Preprocessing & Feature Engineering

jupyter notebook notebooks/02_preprocessing_feature_engineering.ipynb

Outputs:

Step 3: Model Training & Evaluation

jupyter notebook notebooks/03_model_training_evaluation.ipynb

Or run directly:

python src/model_training.py --model logistic_regression

Produces:

Step 4: Full Pipeline (CLI)

python src/full_pipeline.py --data-path data/raw/OSMI_2016_survey.csv

Expected output:

Loading data: OSMI_2016_survey.csv (1259 samples)
Preprocessing: Handling missing values, encoding categories
Training: Logistic Regression with 5-fold CV
Results:
  AUC-ROC: 0.723 [95% CI: 0.698-0.748]
  Accuracy: 68.2%
  Macro F1: 0.68
SHAP analysis: Top 5 features identified
Fairness audit: No significant demographic disparities detected
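The expected output reports a 95% confidence interval for AUC-ROC. The README does not state how that interval was computed. One common approach, consistent with the bootstrapping listed in the technical stack, is the percentile bootstrap sketched below. The function names, the synthetic data, and the resample count are assumptions for illustration.

```python
import random


def auc_roc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(metric, y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric of (labels, scores)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes; draw again
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic labels and scores for illustration only (not OSMI data).
rng = random.Random(1)
y_true = [i % 2 for i in range(60)]
scores = [rng.gauss(0.6 if y == 1 else 0.4, 0.15) for y in y_true]

point = auc_roc(y_true, scores)
lower, upper = bootstrap_ci(auc_roc, y_true, scores)
print(f"AUC-ROC: {point:.3f} [95% CI: {lower:.3f}-{upper:.3f}]")
```

The interval should be computed on held-out predictions. Resampling training-set predictions would understate the uncertainty.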

References

Academic & Clinical References

  1. Azocar, F., Cohen, D., & Ponce, A. N. (2003). “Paying ‘nowhere near enough’ for mental health.” Journal of the American Medical Association, 289(8), 953-955.
  2. Open Source Mental Illness. (2016). Mental health in tech survey. Retrieved from https://www.osmihelp.org/
  3. Kessler, R. C., et al. (2001). “The prevalence and correlates of untreated serious mental illness.” Health Services Research, 36(6 Pt 1), 987-1007.

Technical References

  1. SHAP: Lundberg, S. M., & Lee, S. I. (2017). “A unified approach to interpreting model predictions.” Advances in Neural Information Processing Systems.
  2. SMOTE: Chawla, N. V., et al. (2002). “SMOTE: Synthetic minority over-sampling technique.” Journal of Artificial Intelligence Research, 16, 321-357.

Data Ethics


Limitations & Future Work

Known Limitations

  1. Self-reported data: Respondents may underreport mental health conditions due to stigma
  2. Tech-specific sample: Findings may not generalize to non-tech industries
  3. Temporal limitation: The data are from 2016, and organizational attitudes toward mental health have likely shifted since then
  4. Selection bias: Respondents self-selected, so people already engaged with mental health topics are likely over-represented

Future Directions


Contact & Questions

For questions about this project, please reach out:


Last Updated: April 2026

Status: Complete & Documented