Python code and data link:
https://drive.google.com/drive/folders/1kC621QtmjG3C_2ok-I53fPYqDRf9r8RK?usp=sharing
A. Data Preparation
- Import & Clean Data: Read the factor data with ratings and defaults; handle missing values.
- Target Variable Setup: Calculate yearly default rates and define the target (default flag).
- Data Split: Train-test split for robust model validation.
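The preparation steps can be sketched in pandas; the synthetic frame and column names (`factor`, `rating`, `default_flag`) are assumptions standing in for the linked dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the factor file (column names are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year":         rng.integers(2015, 2021, size=200),
    "factor":       rng.normal(size=200),
    "rating":       rng.integers(1, 8, size=200),
    "default_flag": rng.integers(0, 2, size=200),
})
df.loc[rng.choice(200, 10, replace=False), "factor"] = np.nan

# Handle missing values (median imputation here; choose per factor).
df["factor"] = df["factor"].fillna(df["factor"].median())

# Yearly default rates, and the binary target for the model.
yearly_dr = df.groupby("year")["default_flag"].mean()

# Stratified split preserves the default rate in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    df[["factor", "rating"]], df["default_flag"],
    test_size=0.3, stratify=df["default_flag"], random_state=42)
```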
B. Statistical Screening of Independent Variables
- Stability Check (PSI): Ensure each variable is stable over time via the Population Stability Index.
- Discriminatory Power (KS Statistic): Select variables that distinguish well between defaulters and non-defaulters.
- Predictive Power (IV & WoE): Retain only variables with high Information Value, computed from Weights of Evidence.
- Stationarity (ADF Test): Remove non-stationary series using the Augmented Dickey-Fuller test.
- Multicollinearity (VIF): Drop highly correlated variables using Variance Inflation Factors.
- Partial Correlation: Remove redundant or confounding variables.
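A minimal sketch of two of these screening statistics, PSI and IV/WoE, alongside the KS statistic from SciPy. The quantile binning and the PSI < 0.1 cutoff are common rules of thumb, not prescriptions from this pipeline:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one variable."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

def information_value(x, y, bins=5):
    """IV summed over quantile bins of x; y is the 0/1 default flag."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.digitize(x, edges[1:-1])          # bin index 0..bins-1
    iv = 0.0
    for b in range(bins):
        # Share of all bads / all goods falling in this bin (floored).
        bad = max((y[idx == b] == 1).sum() / max(y.sum(), 1), 1e-6)
        good = max((y[idx == b] == 0).sum() / max((1 - y).sum(), 1), 1e-6)
        iv += (good - bad) * np.log(good / bad)  # WoE-weighted term
    return iv

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
x = rng.normal(size=500) + 1.0 * y             # an informative factor
stable = psi(x[:250], x[250:]) < 0.1           # rule of thumb: PSI < 0.1
ks = ks_2samp(x[y == 1], x[y == 0]).statistic  # discriminatory power
```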
C. Logistic Regression & Model Construction
- Stepwise Logistic Regression: Build the core model by stepwise selection on p-values (< 0.01).
- PD Estimation: Generate scores and PD predictions, with monotonicity checks.
- Diagnostics: Autocorrelation (Durbin-Watson) and heteroskedasticity (Breusch-Pagan) tests ensure statistical robustness.
D. Model Testing
- Rating Assignment: Cluster PD outputs into rating buckets using K-means for interpretability.
- Validation Tests: Jeffreys test and the KS statistic compare predicted vs. actual default distributions.
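A sketch of K-means rating buckets and a per-bucket Jeffreys test; the PDs here are simulated, and Beta(d + 1/2, n − d + 1/2) is the standard Jeffreys-prior posterior for the bucket default rate:

```python
import numpy as np
from scipy.stats import beta
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
pd_hat = rng.beta(1, 20, size=500)     # stand-in PD predictions
defaults = rng.binomial(1, pd_hat)     # realized defaults

# K-means on the one-dimensional PDs gives interpretable rating buckets.
k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
    pd_hat.reshape(-1, 1))

# Re-index buckets so rating 0 carries the lowest mean PD (monotone scale).
order = np.argsort([pd_hat[labels == c].mean() for c in range(k)])
rating = np.empty_like(labels)
for new, old in enumerate(order):
    rating[labels == old] = new

# Jeffreys test per bucket: posterior Beta(d + 1/2, n - d + 1/2) for the
# true default rate; the p-value is the posterior mass below the predicted PD.
for r in range(k):
    n = (rating == r).sum()
    d = defaults[rating == r].sum()
    p_pred = pd_hat[rating == r].mean()
    jeffreys_p = beta.cdf(p_pred, d + 0.5, n - d + 0.5)
```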
E. Final Model Validation
- Accuracy Checks: AUC-ROC, F1, recall, precision, and log loss across the train and test sets.
- Cross-Validation: K-fold CV for model generalization.
- Regularization Checks:
  - Lasso Regression – identifies non-contributing features.
  - Ridge Regression – tests coefficient stability.
- Model Comparison: Combine and review coefficients from the Logit, Lasso, and Ridge models.
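The validation battery could look like the following scikit-learn sketch; the penalty strengths `C` are illustrative, and a near-unpenalized fit stands in for the plain Logit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (roc_auc_score, f1_score, recall_score,
                             precision_score, log_loss)

rng = np.random.default_rng(4)
X = rng.normal(size=(600, 5))          # three of the five factors are noise
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] - 1))))

# Near-unpenalized fit (large C) stands in for the stepwise Logit model.
logit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Accuracy checks on the fitted PDs.
p = logit.predict_proba(X)[:, 1]
auc, ll = roc_auc_score(y, p), log_loss(y, p)
f1, rec, prec = (f1_score(y, p > 0.5), recall_score(y, p > 0.5),
                 precision_score(y, p > 0.5))

# K-fold CV for generalization; coefficient table across the three fits.
cv_auc = cross_val_score(logit, X, y, cv=5, scoring="roc_auc")
coefs = np.vstack([m.coef_[0] for m in (logit, lasso, ridge)])
lasso_dropped = np.isclose(lasso.coef_[0], 0)   # features Lasso zeroes out
```

Side-by-side, the three coefficient rows show which factors survive the L1 penalty and how stable the remaining ones are under L2 shrinkage.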
This pipeline balances statistical rigor with regulatory expectations, providing a ready-to-explain model for auditors, regulators, and internal committees. It's a great base for both Basel and IFRS 9/CECL-aligned PD model builds.