Python code and data link:
https://drive.google.com/drive/folders/1kC621QtmjG3C_2ok-I53fPYqDRf9r8RK?usp=sharing
A. Data Preparation
- Import & Clean Data: Read factor data with ratings and defaults; handle missing values.
- Target Variable Setup: Calculate yearly default rates and define the target (default flag).
- Data Split: Train-test split for robust model validation. (A minimal sketch of these steps follows this list.)
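A minimal data-prep sketch in pandas/scikit-learn, assuming the broad steps above. The file name and column names (factor_data.csv, default, rating, year) and the 70/30 split are illustrative assumptions, not taken from the linked Drive folder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names - adjust to the linked dataset
df = pd.read_csv("factor_data.csv")                    # factors, ratings, defaults
df = df.dropna(subset=["default"])                     # drop rows without a target
df = df.fillna(df.median(numeric_only=True))           # impute numeric gaps

df["default_flag"] = (df["default"] == 1).astype(int)  # binary target
yearly_dr = df.groupby("year")["default_flag"].mean()  # yearly default rates

X = df.drop(columns=["default", "default_flag", "rating", "year"])
y = df["default_flag"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```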
B. Statistical Screening of Independent Variables
- Stability Check (PSI): Ensure variable stability over time.
- Discriminatory Power (KS-Stat): Select variables that distinguish well between defaulters and non-defaulters.
- Predictive Power (IV & WoE): Retain only those with high predictive value.
- Stationarity (ADF Test): Remove non-stationary series.
- Multicollinearity (VIF): Drop highly correlated variables.
- Partial Correlation: Remove redundant/confounding variables. (A sketch of the screening battery follows this list.)
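A sketch of the screening battery using numpy/pandas/scipy/statsmodels, continuing from the split above. The column name "factor_1" is illustrative, and the cut-offs are common rules of thumb rather than values from the shared notebook (e.g. keep if PSI < 0.10, IV > 0.10, ADF p-value < 0.05, VIF < 5).

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.outliers_influence import variance_inflation_factor

def psi(expected, actual, bins=10):
    """Population Stability Index of one variable between two periods."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # open-ended outer bins
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

def iv_woe(x, y, bins=10):
    """Information Value via Weight of Evidence over quantile bins."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], bins, duplicates="drop")
    grp = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    bad = grp["sum"] / grp["sum"].sum()              # share of defaults per bin
    good = (grp["count"] - grp["sum"]) / (grp["count"] - grp["sum"]).sum()
    woe = np.log((good + 1e-6) / (bad + 1e-6))
    return float(((good - bad) * woe).sum())

# Discriminatory power: KS statistic between defaulters and non-defaulters
ks_stat, _ = ks_2samp(X_train.loc[y_train == 1, "factor_1"],
                      X_train.loc[y_train == 0, "factor_1"])
# Stationarity: ADF test p-value of the factor series
adf_p = adfuller(X_train["factor_1"].dropna())[1]
# Multicollinearity: VIF per retained factor
vifs = [variance_inflation_factor(X_train.values, i)
        for i in range(X_train.shape[1])]
```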
C. Logistic Regression & Model Construction
- Stepwise Logistic Regression: Build the core model, retaining only factors with p-values < 0.01.
- PD Estimation: Generate scores and PD predictions with monotonicity checks.
- Diagnostics: Autocorrelation (Durbin-Watson) and heteroskedasticity (Breusch-Pagan) tests ensure statistical robustness. (Sketched below.)
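A minimal stepwise-logit sketch with the two diagnostics named above, continuing from the screened X_train/y_train. The backward-elimination loop and the application of Durbin-Watson/Breusch-Pagan to the response residuals follow the pipeline description; the exact implementation in the shared notebook may differ.

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

def backward_stepwise(X, y, alpha=0.01):
    """Drop the least significant factor until all p-values are < alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        if pvals.max() < alpha:
            return model, cols
        cols.remove(pvals.idxmax())     # remove the worst factor and refit
    raise ValueError("no significant factors")

model, kept = backward_stepwise(X_train, y_train)
pd_hat = model.predict(sm.add_constant(X_train[kept]))   # predicted PDs

resid = y_train - pd_hat
print("Durbin-Watson:", durbin_watson(resid))            # ~2 means no autocorrelation
bp_lm, bp_p, _, _ = het_breuschpagan(resid, sm.add_constant(X_train[kept]))
print("Breusch-Pagan p-value:", bp_p)
```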
D. Model Testing
- Rating Assignment: Cluster PD outputs into buckets using K-means for interpretability.
- Validation Tests:
  - Jeffreys Test and KS-Stat – Compare predicted vs actual default distributions. (A sketch follows this list.)
E. Final Model Validation
- Accuracy Checks: AUC-ROC, F1, Recall, Precision, Log Loss across Train/Test sets.
- Cross-Validation: K-Fold CV for model generalization.
- Regularization Checks:
  - Lasso Regression – Identifies non-contributing features.
  - Ridge Regression – Tests coefficient stability.
- Model Comparison: Combine and review coefficients from the Logit, Lasso, and Ridge models. (A sketch of this battery follows this list.)
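A sketch of the final validation battery, continuing from model, kept, and the train/test splits above. The 0.5 cut-off, default regularization strength, and 5 folds are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, f1_score, recall_score,
                             precision_score, log_loss)
from sklearn.model_selection import cross_val_score

pred = model.predict(sm.add_constant(X_test[kept]))   # PDs on the test set
flag = (pred > 0.5).astype(int)                       # illustrative cut-off
print("AUC:", roc_auc_score(y_test, pred), "F1:", f1_score(y_test, flag),
      "Recall:", recall_score(y_test, flag),
      "Precision:", precision_score(y_test, flag),
      "LogLoss:", log_loss(y_test, pred))

# K-Fold cross-validation for generalization
lr = LogisticRegression(max_iter=1000)
print("5-fold AUC:", cross_val_score(lr, X_train[kept], y_train,
                                     cv=5, scoring="roc_auc"))

# L1 zeroes out non-contributing factors; L2 shrinks unstable coefficients
lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X_train[kept], y_train)
ridge = LogisticRegression(penalty="l2").fit(X_train[kept], y_train)
print(pd.DataFrame({"logit": model.params.drop("const"),
                    "lasso": pd.Series(lasso.coef_[0], index=kept),
                    "ridge": pd.Series(ridge.coef_[0], index=kept)}))
```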
This pipeline balances statistical rigor with regulatory expectations, providing a ready-to-explain model for auditors, regulators, and internal committees. It is a solid base for both Basel- and IFRS 9/CECL-aligned PD model builds.
Huber M-estimation uses a loss function (1) that transitions from squared error to absolute error at a threshold δ, chosen from the expected distribution of the residuals, e.g. δ = m · Stdev of the errors:
- If |rᵢ| ≤ m · Stdev, the weight is wᵢ = 1.
- If |rᵢ| > m · Stdev, the weight is wᵢ = (m · Stdev) / |rᵢ|.
Steps:
1. Fit the Ordinary Least Squares (OLS) regression Y (CCF) = β₀ + β₁X + ε, i.e. β̂ = (XᵀX)⁻¹XᵀY.
2. Compute the residuals rᵢ = yᵢ − ŷᵢ.
3. Apply the Huber loss function (1) to calculate the weights wᵢ.
4. Re-fit the weighted regression β̂ = (XᵀWX)⁻¹XᵀWY, where W = diag(w₁, …, wₙ).
5. Repeat steps 2–4 until the coefficients converge (iteratively reweighted least squares). (A numerical sketch follows below.)
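A numpy sketch of steps 1–5 on synthetic CCF data. The tuning constant m = 1.345 is a common default for Huber estimation and an assumption here, as is the fixed iteration count standing in for a convergence check.

```python
import numpy as np

def huber_wls(X, y, m=1.345, n_iter=20):
    """Huber M-estimation via iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])       # add intercept column
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)      # step 1: OLS fit
    for _ in range(n_iter):
        r = y - X1 @ beta                            # step 2: residuals
        delta = m * r.std(ddof=1)                    # threshold δ = m · Stdev
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))  # step 3
        W = np.diag(w)
        beta = np.linalg.solve(X1.T @ W @ X1, X1.T @ W @ y)       # step 4
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
ccf = 0.4 + 0.2 * x + rng.normal(scale=0.05, size=200)
ccf[:5] += 1.0                                       # inject a few outliers
print(huber_wls(x.reshape(-1, 1), ccf))              # ≈ [0.4, 0.2], outliers downweighted
```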