Log and Exponential Transformations and Other Statistics notes:
Baseline Category in R
In categorical variables, the baseline category serves as the reference group against which the effects of other categories are measured. In R, when performing regression analysis with categorical variables, the first level of the factor is chosen as the baseline by default. However, this can be changed based on analytical needs or interpretability by releveling the factor so that another category serves as the reference.
Transformations: Log and Exponential Models
Transformations such as log and exponential are used to linearize relationships between variables, making linear regression models more applicable when the original relationship is non-linear.
- Log Transformation: Applying the log function to one or more variables can help in stabilizing variance and making the relationship between variables more linear. For example, a log transformation of the independent variable log(x) is useful when dealing with multiplicative effects.
- Exponential Transformation: An exponential transformation might be applied to the dependent variable to model exponential growth or decay processes. The exponential model can describe how changes in the independent variable have multiplicative effects on the dependent variable.
QQ Plot: Normality CheckA QQ (Quantile-Quantile) plot is a graphical tool to assess if a dataset follows a particular distribution, such as the normal distribution. If the points in a QQ plot lie roughly along a straight line, the data is considered to follow that distribution. Heavy tails suggest the presence of outliers or deviations from the assumed distribution.
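For a concrete sense of this check, here is a minimal R sketch (using the built-in mtcars data as a stand-in) that plots a fitted model's residuals against normal quantiles:

```r
# Fit a simple linear model on a built-in dataset (placeholder example)
fit <- lm(mpg ~ wt, data = mtcars)

# QQ plot of the residuals: points close to the reference line suggest normality
qqnorm(residuals(fit))
qqline(residuals(fit))
```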
Model Comparison
- R² (R-squared): A measure of the proportion of variance in the dependent variable that is predictable from the independent variables. It is used for comparing the goodness of fit for different models on the same data. Comparing R² across models with different dependent variables or datasets is not appropriate.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): Both are used for model selection among a finite set of models. They take into account the goodness of fit of the model and the complexity of the model, helping to balance between overfitting and underfitting.
- Log Model: The coefficient in a log-transformed regression can be interpreted as the percentage change in the dependent variable for a one percent change in the independent variable, holding other variables constant.
- Exponential Model: In an exponential model, the coefficient can be interpreted in terms of multiplicative effects on the dependent variable for a one-unit change in the independent variable.
- Heteroscedasticity Assumption Failure: Heteroscedasticity occurs when the variance of the error terms varies across levels of an independent variable, violating one of the key OLS assumptions. This can lead to biased estimates of standard errors, affecting confidence intervals and hypothesis tests.
- Robust Standard Errors: These are adjusted standard errors that account for heteroscedasticity, providing more reliable hypothesis testing. However, they correct only the standard errors for inconsistency; they do not correct bias in the coefficient estimates themselves.
- Omega (Ω) in Practice: In the context of heteroscedasticity, Ω represents the true variance-covariance matrix of the error terms, which is rarely known in practice. Various techniques, including robust standard errors and heteroscedasticity-consistent (HC) estimators, are used to approximate Ω without explicitly knowing it.
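To make the log-model interpretation above concrete, here is a minimal R sketch with simulated data; the variable names and the true elasticity of 0.8 are made up for illustration:

```r
set.seed(1)
x <- runif(200, 1, 100)
y <- exp(0.5 + 0.8 * log(x) + rnorm(200, sd = 0.2))  # simulated with elasticity 0.8

# Log-log model: the slope b is read as "a 1% increase in x -> about a b% change in y"
loglog <- lm(log(y) ~ log(x))
coef(loglog)
```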
OLS (Ordinary Least Squares) Regression
OLS regression is the most common method used for linear regression analysis. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points. The "best fit" is determined by minimizing the sum of the squared differences between the observed values and the values predicted by the linear model. This method assumes that the errors (residuals) between the observed values and the model's predicted values are homoscedastic, meaning they have the same variance across all levels of the independent variables.
Key Characteristics of OLS:
- Assumes homoscedasticity (constant variance of errors).
- The estimates are obtained by minimizing the sum of squared residuals.
- Under the Gauss-Markov theorem, OLS estimators are the Best Linear Unbiased Estimators (BLUE) if the assumptions hold, including linearity, independence, and homoscedasticity of errors.
Key Characteristics of HC:
- Does not assume constant variance of errors across observations.
- Adjusts the standard errors of the OLS estimates to be consistent in the presence of heteroscedasticity.
- There are several versions of HC standard errors (e.g., HC0, HC1, HC2, HC3), with different adjustments for small sample sizes or other concerns.
- The regression model can be estimated using OLS to obtain parameter estimates.
- Then, HC standard errors are calculated to correct for heteroscedasticity, allowing for more reliable hypothesis testing and confidence intervals.
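A sketch of this two-step workflow in R, assuming the sandwich and lmtest packages are available (the model itself is a placeholder on built-in data):

```r
library(sandwich)  # assumed available: provides vcovHC() for HC covariance estimates
library(lmtest)    # assumed available: provides coeftest()

fit <- lm(mpg ~ wt + hp, data = mtcars)  # step 1: ordinary OLS estimates

# Step 2: same coefficients, but HC3 standard errors that allow for heteroscedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))
```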
....................................
Setting the Baseline
When dealing with categorical variables, regression analysis assigns a baseline category as the reference point. In R, by default, the first level of a factor is chosen as the baseline. But worry not! You can easily adjust this based on your analysis to ensure better interpretability.
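For example, a minimal sketch of releveling in R; the data frame, factor, and level names here (df, region, "South") are made up:

```r
# Make "South" the baseline instead of whatever level happens to come first
df$region <- relevel(factor(df$region), ref = "South")

fit <- lm(price ~ region, data = df)
summary(fit)  # each region coefficient is now read relative to "South"
```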
Transforming for Linearity
The magic of linear regression lies in its ability to model linear relationships between variables. But what if your data isn't linear? Fear not! Log and exponential transformations come to the rescue. Log transformations help stabilize variance and create a more linear relationship, while exponential transformations model exponential growth or decay.
Understanding Baseline Interpretations
In the context of regression with categorical variables, the baseline interpretation refers to the expected change in the dependent variable (the variable you're trying to predict) when moving from the baseline category to another category, assuming all other variables remain constant.
Checking for Normality: The QQ Plot
The QQ (Quantile-Quantile) plot is a visual tool to assess if your data follows a normal distribution, a key assumption for linear regression. If the points in the QQ plot roughly follow a straight line, your data is considered normal. However, heavy tails suggest outliers or deviations from normality, potentially affecting your analysis.
Model Selection: Choosing the Best Fit
When it comes to model selection, several metrics help us choose the best fit for our data. Here are a few to remember:
- R-squared (R²): This metric indicates the proportion of variance in the dependent variable explained by the independent variables (predictors). It's useful for comparing models on the same data, but not across models with different dependent variables or datasets.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): These criteria help balance model goodness-of-fit with complexity, preventing overfitting (a model that performs well on training data but poorly on unseen data).
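A small R sketch of comparing two candidate models with these criteria (placeholder formulas on built-in data); lower AIC/BIC means a better trade-off between fit and complexity:

```r
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(m1, m2)  # compare Akaike information criterion
BIC(m1, m2)  # compare Bayesian information criterion
```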
Coefficients in log-transformed models represent the percentage change in the dependent variable for a one-percent change in the independent variable, holding other variables constant. On the other hand, coefficients in exponential models translate to multiplicative effects on the dependent variable for a one-unit change in the independent variable.
Heteroscedasticity: A Potential Roadblock
Heteroscedasticity occurs when the variance of the error terms (the difference between predicted and actual values) varies across levels of an independent variable. This violates a key assumption of linear regression and can lead to biased standard error estimates and unreliable inference.
Robust Standard Errors: A Reliable Alternative
To address heteroscedasticity, robust standard errors are used. These adjust the standard errors of the model, providing more reliable hypothesis testing even in the presence of non-constant variance.
Feature Engineering: Beyond Transformations
Sometimes, transformations like log or exponential might not fully capture the relationship between variables. This is where feature engineering comes in. It involves creating new predictors or transforming existing ones to better model the underlying relationships. Additionally, machine learning models can often handle non-linearity more flexibly, potentially reducing the need for transformations.
OLS Regression: The Workhorse
Ordinary Least Squares (OLS) regression is the most common method used for linear regression analysis. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points, minimizing the sum of squared errors between observed and predicted values.
Beyond the Basics: Addressing Challenges
This post provides a foundational understanding of regression analysis. However, the journey doesn't end here! Here's a glimpse into some advanced topics you might encounter:
- HC (Heteroscedasticity-Consistent) Regression: This method tackles heteroscedasticity by adjusting the standard errors of the OLS estimates.
- Non-Linear Relationships: When data exhibits non-linearity, alternative models like survival models or classification models (e.g., logit and probit) might be better suited.
- Multi-Level Data: This refers to data with hierarchical structures, requiring specialized analysis techniques.
Check for Linearity:
normality
multicollinearity
Check for Non-Linearity:
GLM (Generalized Linear Model) assumptions:
-> The Y observations are independent.
-> The distribution of Y is from the exponential family.
-> A linear relationship between Y and X is not required.
-> A common (constant) variance is not required.
Parametric distributions (used for parametric survival estimation):
-> Weibull
-> Gompertz
-> Log-logistic
Maximum Likelihood:
likelihood function:
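For reference (a standard form, not from the class notes): with right-censored data the likelihood combines the density for observed events and the survival function for censored observations,

\[
L(\theta) = \prod_{i=1}^{n} f(t_i \mid \theta)^{d_i} \, S(t_i \mid \theta)^{1 - d_i},
\]

where \(d_i = 1\) if the event was observed for subject \(i\) and \(d_i = 0\) if that subject was censored.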
Non-Parametric Estimation:
Parametric equations alone may not reflect the actual situations, since they are based only on coefficients; non-parametric estimates are instead built directly from the observed data.
Semi-Parametric Estimation: the hazard rate lambda is modelled as a function of time t and coefficients B, lambda(t | X) = lambda0(t) * exp(XB).
Baseline hazard lambda0(t) (the hazard when all covariates are zero, so it does not depend on the covariates): when the multiplier exp(XB) > 1 it expands the baseline hazard, and when exp(XB) < 1 there is a contraction of the baseline.
Example: smoking expands the hazard of death whereas working out reduces it; the baseline hazard itself remains roughly the same.
Kaplan-Meier Analysis by groups:
Cox Proportional Hazards Model (semi-parametric): the parameters appear in only one part of the equation; the covariate effects are parametric while the baseline hazard is left unspecified.
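A minimal R sketch of both, assuming the survival package and a hypothetical data frame df with columns time, status (1 = event observed), group, and age:

```r
library(survival)  # assumed available

# Kaplan-Meier curves estimated separately for each group
km <- survfit(Surv(time, status) ~ group, data = df)
plot(km)

# Cox proportional hazards model: exp(coef) multiplies the baseline hazard
cox <- coxph(Surv(time, status) ~ group + age, data = df)
summary(cox)
```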
Multi-Level Data:
Example:
Level 1 -> Houses
Level 2 -> Subdivisions
Level 3 -> Counties
Level 4 -> Years
Level 5 -> Decades
Organize the data into levels based on similarity, so that we capture the averages and patterns within similar-level data; otherwise the averages would be skewed.
OATS Data:
Regions 1 through 6 in the US where the oats come from.
Random Effect Model:
Limitation: there can be situations where a house has its own subdivision with no other houses in it, so a group contains only a single observation.
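A sketch of a random-intercept model for this kind of nested data, assuming the lme4 package and a hypothetical houses data frame with price, sqft, subdivision, and county columns:

```r
library(lme4)  # assumed available

# Random intercepts for counties and for subdivisions nested within counties,
# so averages are pooled within similar groups instead of being skewed together
fit <- lmer(price ~ sqft + (1 | county/subdivision), data = houses)
summary(fit)
```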
## Generally, the accepted way to get rich is to use researched knowledge and experienced knowledge to generate a product that reaches, is useful to, and is replicable for a large-scale population.
Survival Models:
Alternate Names: duration analysis, failure analysis
Event: patient died, person got a job, loan was repaid (always binary)
Time Scale: years, months, weeks, days
Origin of the event:
Why can't we use linear regression?
a) The dependent variable and residuals are not normally distributed:
the data is not normal, the time of entry into the sample may follow a Poisson distribution, and Y is assumed to have a continuous probability distribution.
b) Y may be censored (incomplete, time series, binary):
the patient hasn't died at the current time and we don't know whether they will die at a future time; subjects may have dropped out (or we stopped tracking them) before the study ended.
Example: how do you find the probability of dying at time t, given that you have not died up to time t (assuming a maximum life span of 100 years)?
Answer: P(die at time t | still alive at t - epsilon) = P(die at time t) / P(survived up to time t)
Censored Data:
1) Patient died
2) Patient survived
3) Patient dropped out
4) Patient entered the study later
5) Patient died
6) Patient survived
Hazard Rate: the probability that the event happens at time t, given that it has not happened at any time < t.
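In symbols (a standard definition, added for reference), the hazard rate is the instantaneous event rate at time t conditional on survival up to t:

\[
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
\]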
Popular Distributions:
1) Normal -> variance = sigma^2 | PDF (probability density function) = [1/(sigma*sqrt(2*pi))] * e^(-(x-mu)^2/(2*sigma^2))
2) Bernoulli -> single trial with only two outcomes (binary) -> yes/no -> PMF = p for success, 1-p for failure | by symmetry the distribution is centered when p = 1/2 | variance = p(1-p), at most 1/4
3) Binomial -> probability of observing a set of Bernoulli trials -> Pr(X = k) = C(n, k) * p^k * (1-p)^(n-k), where C(n, k) = n!/(k!(n-k)!) | parameters: n = total number of trials and p = probability of success; k = the number of successes observed
4) Poisson: failures are independent and they don't happen at the same time (example: how many smoke detectors fail at my home at the same time), assuming there are no other factors causing the failures, meaning the failures are independent.
5) Uniform Distribution: any number in the bucket between a and b is equally likely (the density is a flat line) | PDF = 1/(b-a) if a < x < b, and 0 otherwise | E(X) = (a+b)/2 and Var = (b-a)^2/12 | the standard uniform is on [0, 1] | the density does not increase or decrease across the interval.
Can a Von Neumann-type computer generate truly random numbers? The answer is no: when we set a random seed (e.g., seed = 42), it always generates the same set of "random" numbers, because a pseudo-random generator deterministically produces numbers (based on the uniform distribution) from the seed we assign.
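A quick R illustration of that point: re-seeding the generator reproduces exactly the same "random" draws.

```r
set.seed(42)
runif(3)   # three pseudo-random numbers

set.seed(42)
runif(3)   # identical to the three numbers above, because the generator is deterministic
```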
RGTI -> quantum computer block chain company.
Talking about vacation next week - > we want to
#missed first 5 minute of the class
Original Notes:
Baseline category in R: when producing the linear model (regression), R chooses the first factor level as the baseline, but we can change the baseline per our needs.
Transformations: Log and exponential models
Baseline interpretations: coefficients are read relative to the baseline; a one-unit change in a variable (or a switch away from the baseline category) is associated with an increase/decrease of magnitude x in the outcome.
We apply the log function to the independent variable to linearize the data (for QQ plot).
QQ Plot: Normality Check: heavy tails on the left and right indicate outliers. If the values on the QQ plot lie on the same straight line, then the data is normal.
Model Comparison:
R^2: a weaker comparison metric; only the R^2 of models fitted to the same data should be compared. Never compare the R^2 of two different models.
QQ plot: also a weak tool for comparing models.
Universal model comparison is done with AIC and BIC.
Interpretation of LOG and Exponential Model:
-> Both models are used to make the data look more linear.
-> Log model -> increase or decrease is in terms of 1%
-> Exponential model -> 1 Unit
log(y) = a + b*log(x)
Interpretation: if I increase x by 1%, y increases by b%. Example: demand-supply elasticity curve.
Feature Engineering: required when log and exponential transformations alone do not work, i.e., do not make the data linear on the QQ plot.
-> make new predictors (feature engineering)
-> Machine learning models deal with nonlinearity more naturally.
Note: we are trying to linearize the data (through transformations) otherwise the assumptions will fail.
Heteroscedasticity Assumption: if the assumption fails, the standard errors (and therefore the inference) are biased. Do we know omega in practice? No.
So, we assume an omega and see how the model works. We try to infer omega based on how the data responds.
Robust Standard Errors: they only fix the variance estimates (standard errors); they do not fix bias in the coefficients.
OLS (Ordinary Least Squares) and HC (Heteroscedasticity-Consistent) regression are both statistical methods used in econometrics and statistics for estimating the parameters of a linear regression model. Let's break down what each of these terms means and how they are applied:
OLS (Ordinary Least Squares) Regression
OLS regression is the most common method used for linear regression analysis. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points. The "best fit" is determined by minimizing the sum of the squared differences between the observed values and the values predicted by the linear model. This method assumes that the errors (residuals) between the observed values and the model's predicted values are homoscedastic, meaning they have the same variance across all levels of the independent variables.
Key Characteristics of OLS:
- Assumes homoscedasticity (constant variance of errors).
- The estimates are obtained by minimizing the sum of squared residuals.
- Under the Gauss-Markov theorem, OLS estimators are the Best Linear Unbiased Estimators (BLUE) if the assumptions hold, including linearity, independence, and homoscedasticity of errors.
Key Characteristics of HC:
- Does not assume constant variance of errors across observations.
- Adjusts the standard errors of the OLS estimates to be consistent in the presence of heteroscedasticity.
- There are several versions of HC standard errors (e.g., HC0, HC1, HC2, HC3), with different adjustments for small sample sizes or other concerns.
- The regression model can be estimated using OLS to obtain parameter estimates.
- Then, HC standard errors are calculated to correct for heteroscedasticity, allowing for more reliable hypothesis testing and confidence intervals.
It is not uncommon to see seemingly better fit statistics (e.g., R^2) from plain OLS than from HC models with robust standard errors.
STEPS:
1) Estimate the basic OLS.
2) Obtain the fitted values from the least-squares fit and compute the weights.
3) Pass the weights to a new lm() call.
Weighted Least Squares (WLS): the procedure above, with weights built from the fitted squared residuals (see the sketch below).
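A sketch of these steps in R; modelling the log squared residuals to estimate the variance function is an assumed choice for illustration, and the model itself is a placeholder on built-in data:

```r
# Step 1: basic OLS
ols <- lm(mpg ~ wt, data = mtcars)

# Step 2: estimate how the error variance changes, then form the weights
aux <- lm(log(residuals(ols)^2) ~ wt, data = mtcars)
w   <- 1 / exp(fitted(aux))          # weights = inverse of the fitted variance

# Step 3: pass the weights to a new lm() call (weighted least squares)
wls <- lm(mpg ~ wt, data = mtcars, weights = w)
summary(wls)
```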
Generalized Least Squares (GLS): more general; when omega is the identity (all weights equal to 1), it becomes OLS.
Flexible GLS: reduces the steps of OLS (a + bx). If you nail the right omega function, this method works very well.
CLASSIFICATION MODELS:
Logit and Probit models:
Predicting coronary heart disease: we want to classify higher chances of heart disease from behavioural and medical variables. The outcome is binary.
Notice that the LM (linear model) runs without warning. However, the residual plot will not be normal, and the beta values of all variables will be very small, indicating little correlation.
The inference will be misleading even though the probability may be high. The regression line will be straight, while the observed outcomes sit on the top and bottom horizontal lines (1 and 0) in the chart.
Beta ->
LOGIT: popular in health science. Logistic regression. Susceptible to heteroscedasticity. ln(P/(1-P)) = B1 + B2*X2 + ... + Bk*Xk
PROBIT: popular in econometrics and political science. Robust to heteroscedasticity. Based on integrating the normal distribution; the higher XB is, the more likely the event is to happen.
X does not have a constant effect on Y.
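A minimal sketch of fitting both in R; the data frame heart and its columns (chd, age, smoker, chol) are hypothetical:

```r
# Hypothetical data: chd is a 0/1 indicator of coronary heart disease
logit_fit  <- glm(chd ~ age + smoker + chol, data = heart,
                  family = binomial(link = "logit"))
probit_fit <- glm(chd ~ age + smoker + chol, data = heart,
                  family = binomial(link = "probit"))

summary(logit_fit)    # coefficients are on the log-odds scale
exp(coef(logit_fit))  # odds ratios for a one-unit change in each predictor
```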
Odds vs probabilities:
p = 0 -> odds = 0 (odds express the chance of the event relative to the chance of it not happening; probability is the proportion out of all outcomes)
w = P/(1-P) -> formula for odds
Observation alone is not an indicator of causality.
Goodness of fit: pseudo-R^2 (rather than R^2).
Assumptions of the LOGIT model:
1) Non-linear (logit) transformation of the probability
2) No multicollinearity
Accuracy using train and test data:
Are false positives the same as false negatives?
Accuracy = (TP+TN)/(TP+FP+TN+FN) | diagonal / all
Precision = TP/(TP+FP) | right column
Sensitivity = TP/(TP+FN) (same as recall)
Recall = TP/(TP+FN)
Confusion matrix:
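A sketch of building the confusion matrix and the metrics above in R, reusing the hypothetical logit_fit from earlier and a hypothetical hold-out set test:

```r
# Predicted class at a 0.5 probability cutoff (hypothetical model and test data)
pred <- ifelse(predict(logit_fit, newdata = test, type = "response") > 0.5, 1, 0)
cm   <- table(Predicted = pred, Actual = test$chd)
cm

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)   # also called sensitivity
```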
The problem of an unbalanced sample:
example: identifying 1 terrorist among a million people,
or identifying fraudulent transactions among a million legitimate transactions.
How do we work with multiple classes (not just binary 0 and 1)? (see the sketch after this list)
-> multinomial logit or ordered logit (depending on the dependent variable)
-> logistic regression assumes a Bernoulli distribution with no ordering between outcomes
-> multinomial logit assumes a multinomial distribution, with the class probabilities (e.g., three of them) adding up to 1
-> ordered logit assumes an ordering of the DV and uses cumulative events for the log-odds computation
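A closing sketch of both extensions in R, assuming the nnet and MASS packages and a hypothetical customers data frame:

```r
library(nnet)  # assumed available: multinom() for multinomial logit
library(MASS)  # assumed available: polr() for ordered logit

# Multinomial logit: unordered outcome (hypothetical 'segment' factor)
multi_fit <- multinom(segment ~ income + age, data = customers)

# Ordered logit: outcome with a natural ordering (low < medium < high)
customers$rating <- factor(customers$rating,
                           levels = c("low", "medium", "high"), ordered = TRUE)
ord_fit <- polr(rating ~ income + age, data = customers, method = "logistic")
summary(ord_fit)
```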