Python Code:
https://drive.google.com/drive/folders/1SVGLABBtkLU7kxZPArwnQK9JJ8bnYS8-?usp=sharing
PD Model Pipeline – Technical Summary
- Built a binary PD classification framework (Default = f(Rating threshold)).
- Compared Logistic Regression vs XGBoost under an identical preprocessing pipeline.
Data & Setup
- Train/test split: 80/20
- Target: binary default flag
- Time feature retained for macro alignment
- Leakage controls applied (rating/date removed)
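A minimal sketch of this setup, assuming hypothetical column and file names (rating, date, pd_dataset.csv) and an illustrative rating threshold; the actual cut-off differs in the pipeline:

    # Sketch only: file name, column names and RATING_THRESHOLD are placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("pd_dataset.csv")

    RATING_THRESHOLD = 15                                    # illustrative cut-off
    df["default_flag"] = (df["rating"] >= RATING_THRESHOLD).astype(int)

    # Leakage controls: the rating that defines the target and the raw date are dropped
    X = df.drop(columns=["default_flag", "rating", "date"])
    y = df["default_flag"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42, stratify=y    # 80/20 split
    )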
Feature Pre-Filtering (Macro Risk Controls)
Applied sequential filtering:
- PSI (<0.1) → removes unstable variables across train/test
- KS (>0.1) → ensures discriminatory power
- IV (0.02–2) → retains predictive but non-dominant features
- ADF (p < 0.05) → ensures stationarity in macro series
- VIF (<10) → removes multicollinearity
Final feature set = intersection of all filters.
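A sketch of how the filters could be chained, continuing from the split above; the thresholds follow the list, while the binning choices and helper functions are assumptions rather than the exact implementation:

    # Assumes X_train, X_test, y_train from the setup sketch above.
    import numpy as np
    import pandas as pd
    from scipy.stats import ks_2samp
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def psi(train_col, test_col, bins=10):
        """Population Stability Index between train and test distributions."""
        edges = np.histogram_bin_edges(train_col, bins=bins)
        p = np.histogram(train_col, bins=edges)[0] / len(train_col) + 1e-6
        q = np.histogram(test_col, bins=edges)[0] / len(test_col) + 1e-6
        return np.sum((p - q) * np.log(p / q))

    def information_value(x, y, bins=10):
        """IV from a quantile binning of x against the binary target y."""
        tmp = pd.DataFrame({"bin": pd.qcut(x, bins, duplicates="drop"), "y": y})
        grp = tmp.groupby("bin", observed=True)["y"].agg(["sum", "count"])
        bad = grp["sum"] / grp["sum"].sum() + 1e-6
        good = (grp["count"] - grp["sum"]) / (grp["count"] - grp["sum"]).sum() + 1e-6
        return np.sum((good - bad) * np.log(good / bad))

    selected = []
    for col in X_train.columns:
        psi_ok = psi(X_train[col], X_test[col]) < 0.1                       # stability
        ks_ok = ks_2samp(X_train.loc[y_train == 1, col],
                         X_train.loc[y_train == 0, col]).statistic > 0.1    # discrimination
        iv_ok = 0.02 <= information_value(X_train[col], y_train) <= 2       # predictive power
        adf_ok = adfuller(X_train[col].dropna())[1] < 0.05                  # stationarity
        if psi_ok and ks_ok and iv_ok and adf_ok:
            selected.append(col)

    # VIF pass on the survivors: keep only features with VIF < 10
    vif = pd.Series([variance_inflation_factor(X_train[selected].values, i)
                     for i in range(len(selected))], index=selected)
    final_features = vif[vif < 10].index.tolist()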
Logistic Regression (Model Selection)
- Exhaustive subset selection over combinations of n_vars candidate variables
- Statsmodels Logit estimation
- Selection criteria:
- Maximize: Pseudo R²
- Minimize: average p-values
- Output: best interpretable variable set
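A sketch of the subset search; the combined score (pseudo R² minus average p-value) is one plausible way to implement the two criteria, not necessarily the exact rule used:

    # Assumes final_features, X_train, y_train from the sketches above; n_vars is illustrative.
    import itertools
    import statsmodels.api as sm

    n_vars = 4
    best_score, best_subset, best_model = float("-inf"), None, None

    for subset in itertools.combinations(final_features, n_vars):
        exog = sm.add_constant(X_train[list(subset)])
        try:
            fit = sm.Logit(y_train, exog).fit(disp=0)
        except Exception:
            continue                                  # skip non-converging subsets
        score = fit.prsquared - fit.pvalues.mean()    # high fit, low average p-value
        if score > best_score:
            best_score, best_subset, best_model = score, subset, fit

    print("Best variable set:", best_subset)
    print(best_model.summary())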
XGBoost Model
- Gradient boosting classifier (fallback: sklearn GBC)
- Feature selection via importance ranking
- Top-N features retained
- Non-linear interaction capture enabled automatically
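A sketch of the boosted stage with the sklearn fallback; the hyperparameters and TOP_N are illustrative:

    # Assumes X_train, y_train, final_features from the sketches above.
    import pandas as pd

    try:
        from xgboost import XGBClassifier
        clf = XGBClassifier(n_estimators=300, max_depth=3, learning_rate=0.05,
                            eval_metric="logloss")
    except ImportError:
        from sklearn.ensemble import GradientBoostingClassifier
        clf = GradientBoostingClassifier(n_estimators=300, max_depth=3, learning_rate=0.05)

    clf.fit(X_train[final_features], y_train)

    # Keep the top-N features by importance and refit on the reduced set
    TOP_N = 10
    importances = pd.Series(clf.feature_importances_, index=final_features)
    top_features = importances.sort_values(ascending=False).head(TOP_N).index.tolist()
    clf.fit(X_train[top_features], y_train)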
Predictions
- Logistic: logit → sigmoid transformation to PD
- XGBoost: probability output directly
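A sketch making the logit → sigmoid step explicit (equivalent to calling the statsmodels predict method), alongside the direct XGBoost probabilities:

    # Assumes best_model, best_subset, clf, top_features and X_test from the sketches above.
    import numpy as np
    import statsmodels.api as sm

    exog_test = sm.add_constant(X_test[list(best_subset)], has_constant="add")
    log_odds = exog_test @ best_model.params                 # linear predictor (logit)
    pd_logit = 1.0 / (1.0 + np.exp(-log_odds))               # sigmoid -> PD

    pd_xgb = clf.predict_proba(X_test[top_features])[:, 1]   # P(default) directly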
Evaluation Metrics
Computed for train & test:
- AUC (ranking power)
- KS (class separation)
- Accuracy
- Precision / Recall
- F1-score
- LogLoss (calibration)
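A sketch of the metric block for one model on one sample; KS is computed here as the two-sample statistic between the score distributions of defaulters and non-defaulters:

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                                 recall_score, f1_score, log_loss)

    def evaluate(y_true, y_prob, threshold=0.5):
        y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
        y_pred = (y_prob >= threshold).astype(int)
        return {
            "AUC": roc_auc_score(y_true, y_prob),
            "KS": ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic,
            "Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred, zero_division=0),
            "Recall": recall_score(y_true, y_pred, zero_division=0),
            "F1": f1_score(y_true, y_pred, zero_division=0),
            "LogLoss": log_loss(y_true, y_prob),
        }

    # e.g. evaluate(y_test, pd_logit) and evaluate(y_test, pd_xgb), and the same on train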
Feature Interpretability
- Logistic: coefficient sign + magnitude (regulatory usable)
- XGBoost: feature importance ranking only
Comparison Logic
- Model comparison based on:
- Discriminatory power (AUC, KS)
- Stability (train vs test gap)
- Calibration (LogLoss)
- Trade-off:
- Logistic = interpretability + stability
- XGBoost = predictive lift + non-linearity
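A sketch of the comparison table, assuming train-sample predictions (pd_logit_train, pd_xgb_train) produced the same way as the test predictions above:

    import pandas as pd

    preds = {
        "Logistic": {"train": pd_logit_train, "test": pd_logit},
        "XGBoost":  {"train": pd_xgb_train,  "test": pd_xgb},
    }

    rows = {}
    for name, p in preds.items():
        train_m, test_m = evaluate(y_train, p["train"]), evaluate(y_test, p["test"])
        rows[name] = {
            "AUC (test)": test_m["AUC"],
            "KS (test)": test_m["KS"],
            "LogLoss (test)": test_m["LogLoss"],
            "AUC gap (train-test)": train_m["AUC"] - test_m["AUC"],   # stability check
        }

    print(pd.DataFrame(rows).T)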
Huber M-estimation uses a loss function (1) that transitions from squared error to absolute error at a threshold δ:
L(r_i) = 0.5 * r_i^2 if |r_i| ≤ δ, and L(r_i) = δ * (|r_i| − 0.5 * δ) if |r_i| > δ   (1)
δ is chosen from the expected distribution of the residuals, e.g. δ = m * Stdev of errors.
The resulting weights are:
- If |r_i| ≤ m * Stdev, weight w_i = 1
- If |r_i| > m * Stdev, weight w_i = m * Stdev / |r_i|
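A small sketch of the weighting rule; m = 1.345 is a common tuning constant, assumed here for illustration:

    import numpy as np

    def huber_weights(residuals, m=1.345):
        scale = m * np.std(residuals)                  # δ = m * Stdev of errors
        return np.where(np.abs(residuals) <= scale, 1.0, scale / np.abs(residuals))

    # e.g. huber_weights(np.array([0.5, -1.0, 4.0]))  ->  [1.0, 1.0, ~0.70]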
Steps:
1. Specify the model: Y (CCF) = β0 + β1 * X + ε
2. Fit the Ordinary Least Squares (OLS) regression: β̂ = (X^T X)^−1 X^T Y
3. Compute the residuals: r_i = y_i − ŷ_i
4. Apply the Huber loss function (1) to the residuals to obtain the weights w_i.
5. Refit the regression with the weight matrix W = diag(w_i): β̂ = (X^T W X)^−1 X^T W Y
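An end-to-end sketch of these steps with statsmodels (a single weighted refit is shown; in practice the weight/refit loop can be iterated until the coefficients stabilise). Variable names for the CCF data are placeholders:

    import numpy as np
    import statsmodels.api as sm

    def huber_weighted_fit(X, y, m=1.345):
        X_const = sm.add_constant(X)
        ols = sm.OLS(y, X_const).fit()                 # β̂ = (X^T X)^-1 X^T Y
        resid = ols.resid                              # r_i = y_i - ŷ_i
        scale = m * np.std(resid)                      # δ = m * Stdev of errors
        w = np.where(np.abs(resid) <= scale, 1.0, scale / np.abs(resid))
        return sm.WLS(y, X_const, weights=w).fit()     # β̂ = (X^T W X)^-1 X^T W Y

    # Usage with hypothetical CCF data:
    # fit = huber_weighted_fit(df_ccf[["utilisation"]], df_ccf["ccf"])
    # print(fit.params)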