Model Selection Criteria:
(A.)
Number of Independent Variables: The first step is to determine the minimum number of observations the data must have for the variables to be included in the model. This is required for precise coefficient estimates and powerful hypothesis tests in regression.
- Criteria: The residual degrees of freedom should be greater than 30 for the training set (using the 80/20 rule to split the data into training and testing sets). Further, per the rule of thumb suggested in Frank Harrell's "Regression Modeling Strategies": to detect reasonable-size effects with reasonable power, we need at least 10 observations per parameter (covariate) estimated.
Example: The number of observations required for 1 independent variable is 32, for 2 it is 42, for 3 it is 52, and so on (as shown in the sketch below).
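This rule can be captured in a short Python sketch. The function name min_observations is illustrative, and the base of 32 is an assumption derived from requiring residual df = n - 2 > 30 for a model with an intercept and one slope:

def min_observations(n_predictors):
    # Base case: with 1 predictor the model estimates 2 parameters
    # (intercept + slope), so residual df = n - 2 > 30 gives n = 32.
    base = 32
    # Each additional covariate requires 10 more observations (Harrell's rule of thumb).
    return base + 10 * (n_predictors - 1)

for k in (1, 2, 3):
    print(k, "predictor(s):", min_observations(k))  # prints 32, 42, 52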
(B.)
Drop the models in which any coefficient is insignificant (p-value greater than the significance level of .05) or which fail the basic assumptions of regression, such as homoscedasticity. Further, drop the 90% of models with the lowest R-squared. The shortlisted family of models is named the Unrestricted Models (a screening sketch follows below).
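A minimal sketch of this screening step, assuming statsmodels is available; the Breusch-Pagan test stands in for the homoscedasticity check, and names such as screen_models, df and candidate_models are hypothetical:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def screen_models(df, y, candidate_models):
    # candidate_models: list of predictor-name lists, e.g. [["x1"], ["x1", "x2"]]
    kept = []
    for predictors in candidate_models:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[y], X).fit()
        all_significant = (fit.pvalues < 0.05).all()   # every coefficient significant
        bp_pvalue = het_breuschpagan(fit.resid, X)[1]  # H0: homoscedastic residuals
        if all_significant and bp_pvalue > 0.05:
            kept.append((predictors, fit.rsquared))
    # Keep only the top 10% of surviving models by R-squared.
    kept.sort(key=lambda t: t[1], reverse=True)
    return kept[:max(1, len(kept) // 10)]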
(C.)
Testing for Confounding: Confounding occurs when a coefficient appears significant for the outcome only because of the presence of another variable.
- Restricted regression on the family of Unrestricted Models finalized in step (B.): drop the variable with the lowest correlation with the dependent variable and the highest correlation with any of the covariates.
- If the change in the coefficients of the restricted model relative to the unrestricted model is more than 10% in absolute terms, then compute the partial correlation (which measures the relationship between two variables while removing the influence of the other variables).
- Drop the variable with the highest partial correlation from the unrestricted model and re-run the regression. All models left now will have to pass the LASSO regression test (a sketch of this confounding check follows below).
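The confounding check can be sketched as follows, assuming pandas and statsmodels; coefficient_shift and partial_corr are illustrative names, and the partial-correlation formula shown is the standard first-order one (a single control variable):

import numpy as np
import statsmodels.api as sm

def coefficient_shift(df, y, predictors, dropped):
    # Percent change in each remaining coefficient after dropping one variable.
    full = sm.OLS(df[y], sm.add_constant(df[predictors])).fit()
    remaining = [p for p in predictors if p != dropped]
    restricted = sm.OLS(df[y], sm.add_constant(df[remaining])).fit()
    return {p: abs((restricted.params[p] - full.params[p]) / full.params[p]) * 100
            for p in remaining}

def partial_corr(df, x, y, control):
    # Correlation between x and y with the control variable partialled out.
    r_xy = df[x].corr(df[y])
    r_xc = df[x].corr(df[control])
    r_yc = df[y].corr(df[control])
    return (r_xy - r_xc * r_yc) / np.sqrt((1 - r_xc**2) * (1 - r_yc**2))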
(D.)
Lasso Regression: Run Lasso regression on all the models finalized in step (C.) to find whether any coefficients shrink to zero. The tuning parameter lambda is chosen using cross-validation.
- If no coefficient of the model shrinks to zero, then finalize the model with the highest R-squared and significant p-values.
- If some independent variables shrink to zero, then drop those variables, run linear regression post-LASSO, and repeat steps (A.) to (D.) for the models shortlisted (see the sketch below).
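A sketch of the LASSO step with scikit-learn, where LassoCV chooses lambda (called alpha in scikit-learn) by cross-validation; lasso_screen and the 5-fold choice are assumptions for illustration:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_screen(X, y, names):
    # LASSO is scale-sensitive, so standardize the predictors first.
    Xs = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=5).fit(Xs, y)  # lambda chosen by 5-fold cross-validation
    dropped = [n for n, c in zip(names, lasso.coef_) if np.isclose(c, 0.0)]
    return dropped, lasso.alpha_      # variables shrunk to zero, chosen lambda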
(E.)
Final Model: The model selected after satisfying all the above conditions will be the best model among the linear models in the family.
Note:
Assumption: the given variables or the transformed variables fulfil the initial conditions of regression, and the variables do not suffer from the following (the tests named are sketched in code after this list):
- Autocorrelation: An important assumption is that the value taken by the disturbance term in any observation is determined independently of its values in all other observations. If autocorrelation is present:
o Estimates of the regression coefficients are inefficient.
o Forecasts based on the regression equations are sub-optimal.
o The usual significance tests on the coefficients are invalid (the standard errors of the coefficients are calculated under the assumption of independent, homoscedastic disturbances).
Test for autocorrelation: Ljung-Box Q-test (sketched below).
- Stochastic trend (unit root): If a time series has a unit root, it shows a systematic pattern that is unpredictable; the series is not stationary.
Test for unit root: Augmented Dickey-Fuller test (sketched below).
- Deterministic trend: The model will not work on out-of-sample data or a validation dataset; sample estimates are not true estimates of the population.
Test: KPSS test (sketched below).
- Lack of strict exogeneity: the disturbance term should be uncorrelated with the regressors in all periods.
- Simultaneous equation bias: A simultaneous system of equations makes the regressors correlated with the disturbance term, which biases the estimates.
Solution: Depending on the data, instrumental variables or dummy variables.
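The three tests named above are all available in statsmodels. A combined sketch, where resid and series are placeholder inputs and the usual 5% level is assumed for decisions:

from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller, kpss

def time_series_diagnostics(resid, series):
    # Ljung-Box Q-test on the residuals: H0 = no autocorrelation.
    lb = acorr_ljungbox(resid, lags=[10])
    print("Ljung-Box p-value:", lb["lb_pvalue"].iloc[0])

    # Augmented Dickey-Fuller test: H0 = unit root (stochastic trend).
    adf_p = adfuller(series)[1]
    print("ADF p-value:", adf_p, "(small p => reject unit root)")

    # KPSS test: H0 = stationarity around a deterministic trend.
    kpss_p = kpss(series, regression="ct")[1]
    print("KPSS p-value:", kpss_p, "(small p => trend non-stationarity)")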