Saturday, May 25, 2019

Best Model Selection Criteria


Model Selection Criteria:
(A.)  Number of Independent Variables: The first step is to determine how many independent variables the model can include given the sample size. Enough observations per variable are required for precise coefficient estimates and powerful hypothesis tests in regression.
-        Criteria: The residual degrees of freedom should be at least 30 on the training set (the data are split 80/20 into training and testing sets). In addition, following the rule of thumb suggested in Frank Harrell's "Regression Modeling Strategies", detecting reasonable-size effects with reasonable power requires at least 10 observations per parameter (covariate) estimated.
Example: The number of observations required is 32 for 1 independent variable, 42 for 2, 52 for 3, and so on (30 residual degrees of freedom plus 2 estimated parameters for the first variable, then 10 more observations for each additional variable).
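The arithmetic behind the example can be written down as a small helper. This is only a sketch: the function names are hypothetical, and the formula (32 for the first variable, plus 10 per additional variable) is inferred from the figures 32, 42, 52 above rather than stated explicitly. The second helper assumes the requirement applies to the 80% training share, as the criteria suggest.

```python
def min_training_observations(n_predictors: int) -> int:
    """Training-set size implied by the example figures (32, 42, 52, ...)."""
    if n_predictors < 1:
        raise ValueError("need at least one independent variable")
    # 30 residual degrees of freedom + intercept + first slope = 32,
    # then 10 more observations for every additional variable.
    return 32 + 10 * (n_predictors - 1)


def min_total_observations(n_predictors: int, train_percent: int = 80) -> int:
    """Total sample size so that the training share alone meets the minimum."""
    needed = min_training_observations(n_predictors)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-needed * 100 // train_percent)


for k in (1, 2, 3):
    print(k, min_training_observations(k), min_total_observations(k))
```

With an 80/20 split, a model with 2 independent variables would need 42 training observations, so roughly 53 observations in total.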
(B.)  Drop every model in which any coefficient is insignificant (p-value greater than the significance level of 0.05) or which does not fulfil the basic assumptions of regression, such as homoscedasticity. Then drop the 90% of models with the lowest R-squared. The shortlisted family of models is called the Unrestricted Models.
(C.)  Testing for Confounding: Confounding occurs when a coefficient appears significant for the outcome only because of the presence of another variable.
-        Restricted regression on the family of Unrestricted Models finalized in step (B.):
Drop the variable that has the lowest correlation with the dependent variable and the highest correlation with any of the other covariates.
-         If the coefficients of the restricted model change by more than 10% in absolute terms relative to the unrestricted model, then compute the partial correlation (which measures the relationship between two variables after removing the influence of the other variables).
-        Drop the variable with the highest partial correlation from the unrestricted model and rerun the regression.
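The two quantities used in this step, the relative change in a coefficient and the partial correlation, can be sketched with plain NumPy. The function names and the synthetic data are assumptions made for illustration; the partial correlation is computed by the standard residual method (regress both variables on the controls, then correlate the residuals).

```python
import numpy as np


def ols_coefficients(y, X):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def relative_change(unrestricted, restricted):
    """Absolute change of a coefficient relative to the unrestricted model."""
    return np.abs(restricted - unrestricted) / np.abs(unrestricted)


def partial_correlation(a, b, controls):
    """Correlation of a and b after regressing both on the controls."""
    design = np.column_stack([np.ones(len(a)), controls])
    resid_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    resid_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(resid_a, resid_b)[0, 1])


# Synthetic illustration: x and z are correlated and both affect y.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = 1.0 * x + 2.0 * z + 0.5 * rng.normal(size=n)

unrestricted = ols_coefficients(y, np.column_stack([x, z]))  # [intercept, x, z]
restricted = ols_coefficients(y, z)                          # x dropped
change_z = relative_change(unrestricted[2], restricted[1])
pcorr_yx = partial_correlation(y, x, z)
print(f"change in z coefficient: {change_z:.0%}, partial corr(y, x | z): {pcorr_yx:.2f}")
```

In this illustration the coefficient on z moves by far more than 10% once x is dropped, which is the situation in which the partial correlation would be examined.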
All remaining models must then pass the LASSO regression test.
(D.)  Lasso Regression: Apply Lasso regression to all the models finalized in step (C.) to find out whether any coefficients shrink to zero. The tuning parameter lambda is chosen by cross-validation.

-        If no coefficient of the model shrinks to zero, then finalize the model with the highest R-squared and significant p-values.
-        If some independent variables shrink to zero, then drop those variables and run linear regression post-LASSO.
Then repeat steps (I) to (v) for the shortlisted models.
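A minimal sketch of the Lasso step, assuming scikit-learn. `post_lasso` is a hypothetical helper name; standardizing the predictors before the Lasso fit and using 5 folds are choices made for this illustration, not something the procedure above prescribes. Note that scikit-learn calls the tuning parameter `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler


def post_lasso(X, y, cv=5, random_state=0):
    """Select variables with cross-validated Lasso, then refit plain OLS on them."""
    # Standardize so the penalty treats all predictors on the same scale.
    X_std = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=cv, random_state=random_state).fit(X_std, y)
    kept = np.flatnonzero(lasso.coef_ != 0)  # indices of surviving variables
    if kept.size == 0:
        return kept, None
    # Post-LASSO: ordinary linear regression on the surviving variables only.
    refit = LinearRegression().fit(X[:, kept], y)
    return kept, refit


# Synthetic illustration: two real predictors and three pure-noise columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=300)

kept, refit = post_lasso(X, y)
print("kept columns:", kept)
print("post-LASSO coefficients:", refit.coef_)
```

If `kept` contains every column, no coefficient shrank to zero and the model goes forward unchanged; otherwise the refitted model replaces it.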
(E.)  Final Model: The model that satisfies all the above conditions is selected as the best model among the linear models in the family.
Note:
Assumption: The given variables, or their transformed versions, fulfil the initial conditions of regression, and the variables do not have:
-        Autocorrelation: An important assumption is that the value taken by the disturbance term in any observation is determined independently of its values in all other observations. When the disturbances are autocorrelated:
o   Estimates of the regression coefficients are inefficient.
o   Forecasts based on the regression equations are sub-optimal.
o   The usual significance tests on the coefficients are invalid (the standard errors of the coefficients are calculated on the assumption of uncorrelated, homoscedastic disturbances).
Test for autocorrelation: Ljung-Box Q-test
-        Stochastic trend (unit root): If a time series has a unit root, it shows a systematic pattern that is unpredictable, and the series is not stationary.
Test for unit root: Augmented Dickey-Fuller test
-         Deterministic trend: The model will not work for out-of-sample data or for the validation dataset, because the sample estimates are not true estimates of the population.
Test: KPSS test
-        Strictly exogenous:
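-        Simultaneous equation bias: In a simultaneous system of equations, an explanatory variable is jointly determined with the dependent variable and is therefore correlated with the disturbance term, which biases the OLS estimates.
Solution: Depending on the data, use an instrumental variable or a dummy variable.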
