Friday, May 8, 2026

Logistic Regression vs XGBoost

Python Code:
https://drive.google.com/drive/folders/1SVGLABBtkLU7kxZPArwnQK9JJ8bnYS8-?usp=sharing

PD Model Pipeline – Technical Summary

Built binary PD classification framework (Default = f(Rating threshold)).
Compared Logistic Regression vs XGBoost under identical preprocessing pipeline.

Data & Setup

Train/test split: 80/20
Target: binary default flag
Time feature retained for macro alignment
Leakage controls applied (rating/date removed)

Feature Pre-Filtering (Macro Risk Controls)

Applied sequential filtering:

PSI (<0.1) → removes unstable variables across train/test
KS (>0.1) → ensures discriminatory power
IV (0.02–2) → retains predictive but non-dominant features
ADF (p < 0.05) → ensures stationarity in macro series
VIF (<10) → removes multicollinearity

Final feature set = intersection of all filters.
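
As an illustration of the first filter, here is a minimal PSI sketch in plain Python. It is not taken from the linked code: the equal-width binning on the training range, the 10 bins, and the 1e-6 floor for empty bins are all assumptions made for this example (quantile bins are also common).

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index of one variable between two samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant variable

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # values outside the training range are clamped into the edge bins
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # small floor (assumption) avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical data: one stable and one shifted test sample
train = [i / 100 for i in range(100)]
test_same = [i / 100 + 0.005 for i in range(100)]
test_shift = [i / 100 + 0.5 for i in range(100)]

print(psi(train, test_same))   # small: variable passes the PSI < 0.1 filter
print(psi(train, test_shift))  # large: variable would be dropped
```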

Logistic Regression (Model Selection)

Exhaustive subset selection using combinations of n_vars
Statsmodels Logit estimation
Selection criterion:
- Maximize: Pseudo R²
- Minimize: average p-values
Output: best interpretable variable set

XGBoost Model

Gradient boosting classifier (fallback: sklearn GBC)
Feature selection via importance ranking
Top-N features retained
Non-linear interaction capture enabled automatically

Predictions

Logistic: logit → sigmoid transformation to PD
XGBoost: probability output directly

Evaluation Metrics

Computed for train & test:

AUC (ranking power)
KS (class separation)
Accuracy
Precision / Recall
F1-score
LogLoss (calibration)
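
The two ranking metrics can be sketched without any library, which makes their definitions explicit. The function names and the toy data below are hypothetical; a production pipeline would more likely call a library implementation. Both functions assume the sample contains at least one default and one non-default.

```python
def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pairs = sorted(zip(scores, y_true))
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    cum_bad = cum_good = 0
    ks, i = 0.0, 0
    while i < len(pairs):
        j = i
        # tied scores are processed together so they do not create artificial gaps
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

def auc(y_true, scores):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count 0.5."""
    bad = [s for s, y in zip(scores, y_true) if y == 1]
    good = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

# Hypothetical default flags and predicted PDs
y = [0, 0, 0, 0, 1, 0, 1, 1]
pd_hat = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
print(auc(y, pd_hat), ks_statistic(y, pd_hat))
```

Computing both on train and test with the same functions gives the train vs test gap directly.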

Feature Interpretability

Logistic: coefficient sign + magnitude (regulatory usable)
XGBoost: feature importance ranking only

Comparison Logic

Model comparison based on:
- Discriminatory power (AUC, KS)
- Stability (train vs test gap)
- Calibration (LogLoss)
Trade-off:
- Logistic = interpretability + stability
- XGBoost = predictive lift + non-linearity

Wednesday, March 18, 2026

Detecting Outliers Using Medcouple – A Simple, Robust Approach

Python Implementation along with Required Files:

https://drive.google.com/drive/folders/1i6ZN3noeTN9MCDk8fqbA1RndzV1L49dh?usp=drive_link

Risk modeling often means dealing with skewed distributions. Standard methods like Z-scores or the basic IQR rule can fail, either missing real outliers or flagging valid extreme values.

To address this, I used a Medcouple-based method, which is a robust, skewness-aware outlier detection technique.

How It Works

Compute Ratios – Transform raw variables into a ratio (e.g., X2 / X1).
Center Around Median – Scale values relative to the median to preserve asymmetry.
Estimate Spread Robustly – Use quartiles above and below the median to calculate IQR.
Measure Skewness (Medcouple) – A robust statistic capturing asymmetry without being influenced by extremes.
Adjust Outlier Bounds – Expand or shrink thresholds based on skewness for accurate detection.
Identify Outliers – Flag observations outside the skewness-adjusted bounds.
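
The last three steps can be sketched as follows. This is not the code from the linked folder: it uses the widely cited adjusted-boxplot bounds (exponential factors of 3 and 4 applied to the medcouple), a naive O(n²) medcouple that skips pairs tied at the median, and inclusive quartiles from the standard library. All of these are assumptions for illustration, and the sample is hypothetical.

```python
import math
import statistics

def medcouple(values):
    """Naive medcouple: median of the kernel h over pairs that straddle the median."""
    m = statistics.median(values)
    upper = [v for v in values if v >= m]
    lower = [v for v in values if v <= m]
    # pairs with xi == xj (both equal to the median) are skipped, a simplification
    h = [((xi - m) - (m - xj)) / (xi - xj)
         for xi in upper for xj in lower if xi != xj]
    return statistics.median(h)

def adjusted_bounds(values, k=1.5):
    """Skewness-adjusted outlier bounds built on Q1, Q3 and the medcouple."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(values)
    if mc >= 0:   # right-skewed: widen the upper bound, tighten the lower one
        return q1 - k * math.exp(-4 * mc) * iqr, q3 + k * math.exp(3 * mc) * iqr
    return q1 - k * math.exp(-3 * mc) * iqr, q3 + k * math.exp(4 * mc) * iqr

# Hypothetical right-skewed ratios
ratios = [1, 1.5, 2, 2.5, 3, 4, 6, 9, 14, 22, 35]
low, high = adjusted_bounds(ratios)
outliers = [r for r in ratios if r < low or r > high]
print(medcouple(ratios), (low, high), outliers)
```

With a positive medcouple the upper bound moves out and the lower bound moves in, which is the expand-or-shrink behavior described in the steps above.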

Benefits

Handles skewed and heavy-tailed distributions
Preserves meaningful extreme values
Improves data quality for modeling and analysis

Attached Files

To make this reproducible, I’m sharing:

Excel replication – See the method step by step in Excel
Python implementation – Fully automated outlier detection
Input data used in Python – The dataset for replication

Saturday, September 27, 2025

PIT PD Modeling Using Systematic Factor Approach

Python Code and Data: https://drive.google.com/drive/folders/1d7vkT9SeXlELPjRRKDU3qUezibQaUL-y?usp=sharing

This post describes a robust methodology for estimating Point-in-Time (PIT) Probability of Default (PD) for non-defaulted obligors under IFRS9, combining obligor-level characteristics with macroeconomic indicators. The approach bridges regulatory compliance with practical portfolio forecasting.

Key Steps:

TTC PD Calculation:
- We start with a Through-the-Cycle (TTC) PD model at the obligor level, capturing borrower-specific risk factors such as financial ratios, credit history, and product attributes.
- Macroeconomic variables (MEVs) are averaged over a historical period to normalize for economic cycles, ensuring stability and compliance with regulatory TTC requirements.
Systematic Factor Extraction (Credit Cycle Index):
- To incorporate the impact of economic cycles on forward-looking PDs, we applied Principal Component Analysis (PCA) to a set of macroeconomic indicators.
- The first principal component serves as a credit cycle index, representing the systematic risk factor that drives correlated changes in credit quality across obligors.
Forecasted Credit Cycle:
- Using macroeconomic forecasts for upcoming quarters, we projected the credit cycle index forward, maintaining the relationship with historical MEVs.
- This allows us to translate macroeconomic expectations into a forward-looking credit environment.
PIT PD Estimation via Vasicek Transformation:
- The TTC PDs were adjusted to PIT PDs using a Vasicek-based single-factor model, incorporating a correlation coefficient (ρ) to reflect the sensitivity of obligors to the systematic credit cycle.
- This transformation ensures that obligors’ forward-looking PDs respond dynamically to expected changes in the macroeconomic environment.
Portfolio-Level Forecast:
- The final output is a matrix of PIT PDs, with each obligor in rows and forecasted quarters in columns, allowing granular IFRS9 expected credit loss calculations while remaining aligned with Basel and EBA guidance.
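
The Vasicek step and the resulting obligor-by-quarter matrix can be sketched as below. The sign convention (a positive index value means benign conditions and lowers PD), the obligor names, the TTC PDs, the forecast path and ρ = 0.12 are all assumptions made for this example, not values from the linked code.

```python
from statistics import NormalDist

N = NormalDist()  # standard normal

def pit_pd(ttc_pd, z, rho):
    """Single-factor Vasicek PD conditional on the systematic factor z."""
    return N.cdf((N.inv_cdf(ttc_pd) - rho ** 0.5 * z) / (1 - rho) ** 0.5)

ttc_pds = {"obligor_A": 0.01, "obligor_B": 0.05}  # hypothetical TTC PDs
cci_forecast = [-1.0, -0.5, 0.0, 0.5]             # hypothetical standardized credit cycle index
rho = 0.12                                        # hypothetical correlation

# Rows are obligors, columns are forecast quarters
pit_matrix = {
    name: [pit_pd(p, z, rho) for z in cci_forecast]
    for name, p in ttc_pds.items()
}
for name, row in pit_matrix.items():
    print(name, [round(v, 4) for v in row])
```

Only the index path changes across columns, so no obligor-level forecasts are needed.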

Benefits of this Methodology:

Combines obligor-specific risk and macroeconomic trends for accurate PIT PD forecasting.
Compliant with IFRS9 and regulatory expectations for forward-looking credit risk modeling.
Avoids the need for future obligor-level forecasts, which are often unavailable.
Easily scalable to large portfolios for quarterly IFRS9 reporting.

Saturday, September 20, 2025

Estimating Obligor-Level PIT PDs in Low-Default Portfolios

Excel Example:
https://docs.google.com/spreadsheets/d/1vx3K1YsKxPD3wIc229W2QcghbUIOSe55/edit?usp=sharing&ouid=115594792889982302405&rtpof=true&sd=true

In low-default portfolios, estimating Point-in-Time (PIT) probability of default (PD) at the obligor level is particularly challenging due to data scarcity. To address this, I implemented the Benjamin, Cathcart and Ryan (BCR) approach and extended it with idiosyncratic adjustments for borrower-level differentiation.

Portfolio Level

Macro Driver: GDP YoY is standardized into a Z-score.
Systematic Link: Vasicek’s single-factor model connects portfolio Through-the-Cycle (TTC) PDs to the macroeconomic cycle.
Calibration: Goal Seek ensures unconditional PIT PDs are consistent with observed default frequencies.

Obligor Level

Start from TTC PDs and apply PIT adjustments consistently across obligors.
To differentiate obligors within the same quarter, I introduced idiosyncratic shifts (e.g., Debt-to-Equity ratio).
This framework can be extended using Principal Component Analysis (PCA) across multiple borrower-level factors (leverage, liquidity, profitability, etc.) to extract orthogonal risk drivers for richer differentiation.
These shifts or factors adjust each obligor’s threshold in the Vasicek model, producing distinct PIT PDs.
Finally, obligor PITs are rescaled so their average aligns with the calibrated portfolio PIT.

This approach ensures regulatory consistency at the portfolio level while producing economically intuitive obligor-level PDs — higher leverage or weaker fundamentals result in higher PIT PDs, while PCA allows multiple dimensions of risk to be captured systematically.
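
One possible reading of the obligor-level steps is sketched below; the Excel workbook may implement them differently. The shift is assumed to be added to the obligor's default threshold in standard-normal units, and the final rescaling is a simple proportional one. The function name and all inputs are hypothetical.

```python
from statistics import NormalDist, mean

N = NormalDist()

def obligor_pit_pds(ttc_pds, shifts, z, rho, portfolio_pit):
    """Shift each obligor's Vasicek threshold, then rescale to the portfolio PIT PD."""
    raw = [
        N.cdf((N.inv_cdf(p) + s - rho ** 0.5 * z) / (1 - rho) ** 0.5)
        for p, s in zip(ttc_pds, shifts)
    ]
    scale = portfolio_pit / mean(raw)   # proportional rescaling (assumption)
    return [min(r * scale, 1.0) for r in raw]

# Three obligors with the same TTC PD but different leverage-driven shifts
pds = obligor_pit_pds(
    ttc_pds=[0.02, 0.02, 0.02],
    shifts=[-0.2, 0.0, 0.3],   # hypothetical: higher leverage gives a positive shift
    z=-0.5, rho=0.15, portfolio_pit=0.03,
)
print([round(p, 4) for p in pds])
```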

Thursday, August 14, 2025

Import Macro Data from MOSPI into Python:

Step 1 — Capture the Download URL (One-Time Setup)

Open the MOSPI page: https://esankhyiki.mospi.gov.in/
Search for your dataset (CPI, WPI, IIP, etc.)
Press F12 → Network tab and tick Preserve log
Click the Download button on the page
In the network log, find the .xlsx request (e.g., cpi_8.xlsx) and copy the Request URL

Step 2 — Python Automation and Data Processing

# CPI Python code (can be copied directly)
import pandas as pd
import requests

url = "https://api.mospi.gov.in/api/download/CPI/cpi_8.xlsx"
output_file = "cpi_8.xlsx"

# Download the workbook and save it locally
response = requests.get(url)
response.raise_for_status()  # stop on HTTP errors
with open(output_file, "wb") as f:
    f.write(response.content)

Ind_CPI = pd.read_excel(output_file)

# Build a month-end date from year and month_code (code 0 is mapped to 12)
Ind_CPI['month_end'] = pd.to_datetime(
    Ind_CPI['year'].astype(str) + '-' + Ind_CPI['month_code'].replace(0, 12).astype(str) + '-01'
) + pd.offsets.MonthEnd(0)

# Average the index per month and group, then pivot the groups into columns
Ind_CPI = (
    Ind_CPI
    .groupby(['month_end', 'group'], as_index=False)['index']
    .mean()
)
Ind_CPI = Ind_CPI.pivot(index='month_end', columns='group', values='index')
Ind_CPI = Ind_CPI.reset_index()

# Map each month to its quarter end
Ind_CPI['Quarter_End'] = pd.to_datetime(Ind_CPI['month_end']) + pd.offsets.QuarterEnd(0)
Ind_CPI = Ind_CPI.drop(columns=['month_end'])

Ind_CPI_qtr = (
    Ind_CPI
    .groupby('Quarter_End', as_index=False)
    .mean(numeric_only=True)  # averages each subgroup's monthly values into quarterly
)

# x = number of quarters for the change; it was undefined in the original,
# so 4 (year-on-year) is an assumption. Set 1 for quarter-on-quarter.
x = 4

Ind_CPI_qtr_pc = Ind_CPI_qtr.copy()
# pct_change returns a fraction; multiply by 100 for %
Ind_CPI_qtr_pc.iloc[:, 1:] = Ind_CPI_qtr_pc.iloc[:, 1:].pct_change(periods=x)
Ind_CPI_qtr_pc = Ind_CPI_qtr_pc.dropna().reset_index(drop=True)
Ind_CPI_qtr_pc['Quarter_End'] = Ind_CPI_qtr_pc['Quarter_End'].dt.date

Friday, August 1, 2025

PD Model Development in Python:

Python code and data link:

https://drive.google.com/drive/folders/1kC621QtmjG3C_2ok-I53fPYqDRf9r8RK?usp=sharing

A. Data Preparation

Import & Clean Data: Read factor data with ratings and defaults; handle missing values.
Target Variable Setup: Calculate yearly default rates and define the target (Default flag).
Data Split: Train-test split for robust model validation.

B. Statistical Screening of Independent Variables

Stability Check (PSI): Ensure variable stability over time.
Discriminatory Power (KS-Stat): Select variables that distinguish well between default and non-default.
Predictive Power (IV & WoE): Retain only those with high predictive value.
Stationarity (ADF Test): Remove non-stationary series.
Multicollinearity (VIF): Drop highly correlated variables.
Partial Correlation: Remove redundant/confounding variables.
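
For the predictive-power screen, WoE and IV can be computed from binned good/bad counts as in the sketch below. The sign convention WoE = ln(% good / % bad), the 0.5 continuity correction for empty cells, and the example counts are assumptions for illustration rather than details of the linked code.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) counts per bin of one variable."""
    tot_good = sum(g for g, _ in bins)
    tot_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        # 0.5 continuity correction keeps WoE finite for empty cells (assumption)
        pg = max(g, 0.5) / tot_good
        pb = max(b, 0.5) / tot_bad
        woe = math.log(pg / pb)
        woes.append(woe)
        iv += (pg - pb) * woe
    return woes, iv

# Hypothetical counts for three bins of a leverage-type variable
woes, iv = woe_iv([(400, 10), (300, 30), (100, 60)])
print([round(w, 3) for w in woes], round(iv, 3))
```

A monotonic WoE pattern across ordered bins is usually what one looks for before keeping a variable.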

C. Logistic Regression & Model Construction

Stepwise Logistic Regression: Based on p-values (< 0.01), build the core model.
PD Estimation: Generate scores and PD predictions with monotonicity checks.
Diagnostics: Autocorrelation (Durbin-Watson) and Heteroskedasticity (Breusch-Pagan) tests ensure statistical robustness.

D. Model Testing

Rating Assignment: Cluster PD outputs into buckets using K-means for interpretability.
Validation Tests:
- Jeffreys Test and KS-Stat – Compare predicted vs actual default distributions.

E. Final Model Validation

Accuracy Checks: AUC-ROC, F1, Recall, Precision, Log Loss across Train/Test sets.
Cross-Validation: K-Fold CV for model generalization.
Regularization Checks:
- Lasso Regression – Identifies non-contributing features.
- Ridge Regression – Tests coefficient stability.
Model Comparison: Combine and review coefficients from Logit, Lasso, and Ridge models.

This pipeline balances statistical rigor with regulatory expectations, providing a ready-to-explain model for auditors, regulators, and internal committees. It’s a great base for both Basel and IFRS9/CECL-aligned PD model builds.

Tuesday, July 22, 2025

BCR Approach with Python for Low-Default Portfolios

Access the full Python code and input data here: [https://drive.google.com/drive/folders/1T8clLUy9h42pn-WZ9Z89f3Bo98uQb1U4?usp=sharing]

Key features of the Python implementation:

Inverse Vasicek calibration using root_scalar() to estimate PD
Time-series PD projection based on GDP path volatility
Goodness-of-fit validation using Binomial hypothesis testing
Bayesian posterior estimation of PD with credible intervals
Stress scenario simulation – evaluates PD under adverse GDP shocks
Sensitivity analysis – assesses how varying asset correlation (ρ) affects PD outcomes
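
The calibration step can be sketched as follows. To stay dependency-free this example uses a hand-written bisection in place of scipy's root_scalar(), and it assumes a constant number of obligors per period and a per-period PD. The z-score path and ρ are hypothetical inputs.

```python
from statistics import NormalDist

N = NormalDist()

def conditional_pd(pd_uncond, z, rho):
    """Vasicek PD conditional on the standardized macro factor z."""
    return N.cdf((N.inv_cdf(pd_uncond) - rho ** 0.5 * z) / (1 - rho) ** 0.5)

def expected_defaults(pd_uncond, z_path, n_obligors, rho):
    return sum(n_obligors * conditional_pd(pd_uncond, z, rho) for z in z_path)

def calibrate_pd(observed_defaults, z_path, n_obligors, rho,
                 lo=1e-9, hi=0.5, tol=1e-12):
    """Find the PD whose expected default count matches the observed count."""
    # expected defaults increase with pd_uncond, so bisection converges
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_defaults(mid, z_path, n_obligors, rho) < observed_defaults:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z_path = [0.8, 0.3, -0.4, -1.2, -0.6, 0.1, 0.5, 0.9]  # hypothetical GDP z-scores
pd_hat = calibrate_pd(observed_defaults=150, z_path=z_path,
                      n_obligors=54_000, rho=0.20)
print(pd_hat)
```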

Sunday, July 13, 2025

BCR Approach for Regulatory Reporting (PD in Low-Default Portfolios)

Attached: Excel workbook with full BCR implementation (macro-driven, quarterly defaults, confidence intervals, and model diagnostics)
https://docs.google.com/spreadsheets/d/1ltVYX4vhGeTllOQkEWKrED2DV33eb55M/edit?usp=sharing&ouid=115594792889982302405&rtpof=true&sd=true

Excel Implementation Details (see link above)

Historical GDP YoY data as a macro factor
Z-score standardization of macro series
Conditional PD computed using Vasicek model
Expected defaults calculated for each period
Goal Seek used to backsolve for portfolio PD
Bayesian posterior estimate and 95% confidence bound from Beta distribution

To calculate PD for low-default portfolios like sovereigns, large corporates, or prime mortgages, the Benjamin, Cathcart, and Ryan (BCR) methodology (2006) is a robust statistical approach used in both regulatory modeling and internal risk management.

It leverages:

- Vasicek (asymptotic single risk factor) model
- Observed macroeconomic conditions (e.g., GDP YoY growth)
- Asset correlation
- Observed default count over a given period

Excel Implementation Steps (Included in Attached File)

Inputs:

Number of obligors (e.g., 54,000)
Observed defaults (e.g., 150)
Correlation (e.g., 0.20)
Historical GDP YoY (%) path as macro risk driver

Thursday, May 22, 2025

VIF (Variance Inflation Factor) vs. GVIF (Generalized VIF):

Handling Multicollinearity with Factor Variables
When building regression models, checking multicollinearity is crucial.

For continuous predictors, VIF helps identify multicollinearity. But categorical variables with multiple levels (factors) need special treatment.

A factor with k levels is represented by d=k−1 dummy variables. These dummies are inherently correlated because they encode the same categorical feature.
In this case we use the Generalized Variance Inflation Factor (GVIF) — which measures multicollinearity jointly for all dummy variables representing a factor.

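When calculating GVIF for a factor variable, all of its dummy variables are assessed jointly against the other predictors in the model:
GVIFj = det(R11) × det(R22) / det(R)

where R11 is the correlation matrix of the factor's dummy variables, R22 is the correlation matrix of the remaining predictors, and R is the correlation matrix of all predictors together. For a single-column predictor (d = 1) this reduces to the ordinary VIF, 1/(1−Rj^2).

Adjusted GVIF = GVIF^(1/(2⋅d))

d = number of dummy variables (degrees of freedom for the factor).
This adjustment makes GVIF comparable across factors with different d. It is on the scale of the square root of a standard VIF, so square it before applying the usual VIF thresholds.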
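
A minimal sketch of the calculation, using the determinant-based definition GVIF = det(R11) × det(R22) / det(R) on a predictor correlation matrix. The helper names and the example matrix are hypothetical, and the determinant is computed by plain Gaussian elimination so that no library is needed.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def submatrix(matrix, idx):
    return [[matrix[i][j] for j in idx] for i in idx]

def gvif(corr, factor_cols):
    """Return (GVIF, GVIF^(1/(2d))) for the columns that encode one factor."""
    others = [i for i in range(len(corr)) if i not in factor_cols]
    g = det(submatrix(corr, factor_cols)) * det(submatrix(corr, others)) / det(corr)
    return g, g ** (1 / (2 * len(factor_cols)))

# Hypothetical correlation matrix: columns 0 and 1 are dummies of one factor
corr = [[1.0, -0.5, 0.3],
        [-0.5, 1.0, 0.2],
        [0.3, 0.2, 1.0]]
print(gvif(corr, [0, 1]))
```

Passing a single column index returns the ordinary VIF as the first value, which is a quick sanity check on the inputs.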

Key reasons to consider GVIF:
Treating the factor’s dummies jointly prevents misleading interpretations of multicollinearity.
It reflects the true inflation of variance caused by correlation between the factor and other predictors.
Helps in making informed decisions about feature selection and model stability.

Wednesday, March 19, 2025

Incorporating IPCC Climate Projections into Probability of Default:

For a detailed description of the theory and an Excel workbook example, refer to the linked folder:

https://drive.google.com/drive/folders/1vAcH8ge2KEFBPxxJt06cPgIr5XVKhyOr?usp=sharing

Saturday, January 25, 2025

Huber M-estimation (CCF for EAD):

Huber M-estimation is a robust regression technique used to limit the influence of outliers on model parameters. Here it is applied to estimate CCF (Credit Conversion Factor) models for EAD (Exposure at Default).

Huber M-estimation ensures robust parameter estimation by minimizing the impact of extreme observations (e.g., outliers in utilization rates, credit line drawdowns, or other key drivers).

Huber M-estimation uses a loss function (1) that transitions from squared error to absolute error at a threshold δ:

(1) L(r) = r^2 / 2 if ∣r∣ ≤ δ, and L(r) = δ * (∣r∣ − δ/2) if ∣r∣ > δ

The threshold δ is chosen based on the expected distribution of the residuals, e.g. δ = m * Stdev of errors. The corresponding weights are:

- If ∣ri∣ ≤ m * Stdev, the weight wi = 1
- If ∣ri∣ > m * Stdev, the weight wi = m * Stdev / ∣ri∣

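Steps:

1- Specify the model Y (CCF) = β0 + β1X + ϵ and fit the Ordinary Least Squares (OLS) regression: β^ = (X^T X)^−1 * X^T Y

2- Compute the residuals ri = yi − y^i

3- Apply the Huber loss function (1) to the residuals to calculate the weights wi, which form the diagonal matrix W.

4- Using the weights, modify the regression: β^ = (X^T*W*X)^−1 * X^T*W*Y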
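
These steps can be sketched for a single predictor as below. The reweighting is repeated for a fixed number of passes, which is a common extension of the one-pass description; m = 1.345 is a conventional tuning constant, and the toy CCF data are hypothetical. None of this is taken from a published implementation of this post.

```python
import statistics

def wls(x, y, w):
    """Weighted least squares for y = b0 + b1 * x (closed form, one predictor)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ym - b1 * xm, b1

def huber_fit(x, y, m=1.345, n_iter=20):
    b0, b1 = wls(x, y, [1.0] * len(x))        # start from OLS
    for _ in range(n_iter):
        r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        delta = m * statistics.stdev(r)       # delta = m * Stdev of errors
        if delta == 0:
            break                             # perfect fit, nothing to reweight
        w = [1.0 if abs(ri) <= delta else delta / abs(ri) for ri in r]
        b0, b1 = wls(x, y, w)
    return b0, b1

# Hypothetical data on the line y = 2x + 1 with one gross outlier at the end
x = list(range(10))
y = [2 * xi + 1 for xi in x]
y[9] = 100
print("OLS  :", wls(x, y, [1.0] * len(x)))
print("Huber:", huber_fit(x, y))
```

Because the scale here is the plain standard deviation, which the outlier itself inflates, the pull toward the outlier is reduced rather than removed.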

Friday, June 14, 2024

Change Point Detection Time Series

Change Point Detection Methods

Kernel Change Point Detection:

Kernel change point detection method detects changes in the distribution of the data, not just changes in the mean or variance.

Kernel Method is utilized to map the data into a high-dimensional feature space, where changes are more easily detectable. This approach uses the Maximum Mean Discrepancy (MMD) to measure the difference between the distributions of segments of the time series.

Steps:

1- Data and Kernel Function: Consider a univariate time series {x1,x2,…,xn}. We start by choosing a kernel function k(x,y) to measure similarity between points.

2- Construction of Kernel Matrix: A kernel matrix K is constructed, where each element Kij=k(xi,xj).

For the linear kernel, this is Kij=xi⋅xj (K = X^TX, with X the 1×n row vector of observations).

3- Maximum Mean Discrepancy (MMD):

MMD measures how different two samples are by comparing the average pairwise similarity within each sample with the average pairwise similarity between the samples. A large value indicates that the two underlying distributions differ.

MMD is used to measure the difference between the distributions before and after a candidate change point t.

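For each candidate change point t, with the first segment {x1,…,xt} and the second segment {xt+1,…,xn}:

MMD²(t) = (1/t²) Σ(i,j ≤ t) Kij + (1/(n−t)²) Σ(i,j > t) Kij − (2/(t(n−t))) Σ(i ≤ t, j > t) Kij

In the above equation: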

- The first term measures the similarity within the first segment.

- The second term measures the similarity within the second segment.

- The third term measures the similarity between the two segments.

4- To detect the change point, the MMD values are computed for all candidate change points t, and the one that maximizes the MMD value is chosen: t* = argmax over t of MMD²(t)
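
The four steps can be sketched in plain Python as below. The example uses a Gaussian (RBF) kernel with σ = 1 and a minimum segment size of 2; both are assumptions for illustration, and the Excel workbook may use a different kernel. The triple loop is O(n³), which is fine for short series only.

```python
import math

def rbf(a, b, sigma=1.0):
    """Gaussian kernel; sigma is a hypothetical bandwidth choice."""
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

def mmd_change_point(x, kernel=rbf, min_size=2):
    """Return (t, MMD^2) where t is the length of the first segment at the maximum."""
    n = len(x)
    K = [[kernel(a, b) for b in x] for a in x]   # kernel matrix
    best_t, best_val = None, -math.inf
    for t in range(min_size, n - min_size + 1):
        left, right = range(t), range(t, n)
        within_left = sum(K[i][j] for i in left for j in left) / t ** 2
        within_right = sum(K[i][j] for i in right for j in right) / (n - t) ** 2
        between = sum(K[i][j] for i in left for j in right) / (t * (n - t))
        val = within_left + within_right - 2 * between
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val

# Hypothetical series with a level shift after the 10th observation
series = [0.0] * 10 + [5.0] * 10
t_star, mmd2 = mmd_change_point(series)
print(t_star, mmd2)
```

Swapping in `kernel=lambda a, b: a * b` gives the linear kernel, for which the statistic reduces to the squared difference of the two segment means.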

Excel Example: https://docs.google.com/spreadsheets/d/1IdC-ss1VjaL2QVQdABNwuIPfRphDtlZi/edit?usp=sharing&ouid=115594792889982302405&rtpof=true&sd=true