Sunday, November 2, 2025

What is SHAP? How can it be used for Linear Regression?

 **SHAP** (SHapley Additive exPlanations) is a unified framework for interpreting model predictions based on cooperative game theory. For linear regression, it provides a mathematically elegant way to explain predictions.


---


## **How SHAP Works for Linear Regression**


### **Basic Concept:**

SHAP values distribute the "credit" for a prediction among the input features fairly, based on their marginal contributions.


### **For Linear Models:**

In linear regression: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \)


The **SHAP value** for feature \( i \) is:

\[

\phi_i = \beta_i (x_i - \mathbb{E}[x_i])

\]


Where:

- \( \beta_i \) = regression coefficient for feature \( i \)

- \( x_i \) = feature value for this specific observation

- \( \mathbb{E}[x_i] \) = expected (average) value of feature \( i \) in the dataset


---


## **Key Properties**


### **1. Additivity**

\[

\sum_{i=1}^n \phi_i = \hat{y} - \mathbb{E}[\hat{y}]

\]

The sum of all SHAP values equals the difference between the prediction and the average prediction.


### **2. Efficiency**

The full difference between the prediction and the baseline is distributed among the features, so no part of the prediction is left unexplained.


### **3. Symmetry & Fairness**

Features with identical effects get identical SHAP values.


---


## **Example**


Suppose we have a linear model:

\[

\text{Price} = 10 + 5 \times \text{Size} + 3 \times \text{Bedrooms}

\]

Dataset averages: Size = 2, Bedrooms = 3, so the average prediction = \( 10 + 5\times2 + 3\times3 = 29 \)


For a house with:

- Size = 4, Bedrooms = 2

- Predicted Price = \( 10 + 5\times4 + 3\times2 = 36 \)


**SHAP values:**

- ϕ_Size = \( 5 \times (4 - 2) = 10 \)

- ϕ_Bedrooms = \( 3 \times (2 - 3) = -3 \)

- Baseline = 29 (the average prediction, \( \mathbb{E}[\hat{y}] \))


**Verification:** \( 29 + 10 - 3 = 36 \), which equals the predicted price, so the SHAP values exactly account for the deviation from the average prediction.
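
As a quick check, the SHAP values above can be reproduced with a few lines of plain Python using the formula \( \phi_i = \beta_i (x_i - \mathbb{E}[x_i]) \) (a minimal sketch based only on the numbers in this example):

```python
# Coefficients, dataset averages, and the specific house from the example
beta = {"Size": 5, "Bedrooms": 3}
intercept = 10
averages = {"Size": 2, "Bedrooms": 3}
house = {"Size": 4, "Bedrooms": 2}

# SHAP value for each feature: phi_i = beta_i * (x_i - E[x_i])
phi = {f: beta[f] * (house[f] - averages[f]) for f in beta}

baseline = intercept + sum(beta[f] * averages[f] for f in beta)   # 29, the average prediction
prediction = intercept + sum(beta[f] * house[f] for f in beta)    # 36

print(phi)                                          # {'Size': 10, 'Bedrooms': -3}
print(baseline + sum(phi.values()) == prediction)   # True: additivity holds
```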


---


## **Benefits for Linear Regression**


### **1. Unified Feature Importance**

- Shows how much each feature contributed to a specific prediction

- Unlike coefficients, SHAP values are prediction-specific


### **2. Directional Impact**

- Positive SHAP value → Feature increased the prediction

- Negative SHAP value → Feature decreased the prediction


### **3. Visualization**

- **SHAP summary plots**: Show feature importance across all predictions

- **Force plots**: Explain individual predictions

- **Dependence plots**: Show feature effects


---


## **Comparison with Traditional Interpretation**


| **Traditional** | **SHAP Approach** |
|-----------------|-------------------|
| Coefficient βᵢ | SHAP value ϕᵢ |
| Global effect | Local + Global effects |
| "One size fits all" | Prediction-specific explanations |
| Hard to compare scales | Comparable across features |


---


## **Practical Usage**


```python
import shap
import numpy as np
from sklearn.linear_model import LinearRegression

# Example data (the original snippet assumes X and y are already defined)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 10 + 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit linear model
model = LinearRegression().fit(X, y)

# Calculate SHAP values (shap selects an appropriate explainer for the linear model)
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

# Visualize
shap.summary_plot(shap_values, X)       # feature importance across all predictions
shap.plots.waterfall(shap_values[0])    # explain the first prediction
```


---


## **Why Use SHAP for Linear Regression?**


Even though linear models are inherently interpretable, SHAP provides:

- **Consistent methodology** across different model types

- **Better visualization** tools

- **Local explanations** for individual predictions

- **Feature importance** that accounts for data distribution


SHAP makes the already interpretable linear models even more transparent and user-friendly for explaining predictions.

Goldfeld-Quandt Test

 ## **Goldfeld-Quandt Test**


The **Goldfeld-Quandt test** is a statistical test used to detect **heteroscedasticity** in a regression model.


---


### **What is Heteroscedasticity?**

Heteroscedasticity occurs when the **variance of the errors** is not constant across observations. This violates one of the key assumptions of ordinary least squares (OLS) regression.


---


### **Purpose of Goldfeld-Quandt Test**

- Checks if the **error variance** is related to one of the explanatory variables

- Tests whether heteroscedasticity is present in the data

- Helps determine if robust standard errors or other corrections are needed


---


### **How the Test Works**


1. **Order the data** by the suspected heteroscedasticity-causing variable


2. **Split the data** into three groups:

   - Group 1: First \( n \) observations (low values)

   - Group 2: Middle \( m \) observations (typically excluded)

   - Group 3: Last \( n \) observations (high values)


3. **Run separate regressions** on Group 1 and Group 3


4. **Calculate the test statistic**:

   \[

   F = \frac{\text{RSS}_3 / (n - k)}{\text{RSS}_1 / (n - k)}

   \]

   Where:

   - \( \text{RSS}_3 \) = Residual sum of squares from high-value group

   - \( \text{RSS}_1 \) = Residual sum of squares from low-value group

   - \( n \) = number of observations in each group

   - \( k \) = number of parameters estimated


5. **Compare to F-distribution** with \( (n-k, n-k) \) degrees of freedom
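
This procedure can be sketched directly with numpy and scipy; the data below is synthetic, and the 20% middle drop and variable names are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Synthetic data where the error variance grows with the ordering variable
rng = np.random.default_rng(0)
n_total = 200
size = np.sort(rng.uniform(1, 5, n_total))                     # ordering variable
y = 10 + 5 * size + rng.normal(scale=size, size=n_total)       # spread increases with size

def rss(x, y):
    """Residual sum of squares from a simple OLS fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

drop = n_total // 5                  # exclude the middle ~20% of observations
n = (n_total - drop) // 2            # observations in each outer group
k = 2                                # parameters estimated (intercept + slope)

rss_low = rss(size[:n], y[:n])       # Group 1: low values
rss_high = rss(size[-n:], y[-n:])    # Group 3: high values

F = (rss_high / (n - k)) / (rss_low / (n - k))
p_value = 1 - stats.f.cdf(F, n - k, n - k)
print(F, p_value)                    # large F / small p-value suggests heteroscedasticity
```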


---


### **Interpretation**


- **Large F-statistic** → Evidence of heteroscedasticity

- **Small F-statistic** → No evidence of heteroscedasticity

- If \( F > F_{\text{critical}} \), reject null hypothesis of homoscedasticity


---


### **When to Use**

- When you suspect variance increases/decreases with a specific variable

- When you have a medium to large dataset

- When you can identify which variable might cause heteroscedasticity


---


### **Limitations**

- Requires knowing which variable causes heteroscedasticity

- Sensitive to how data is split

- Less reliable with small samples

- Middle exclusion reduces power


---


### **Example Application**

If you're modeling house prices and suspect error variance increases with house size, you would:

1. Order data by house size

2. Run Goldfeld-Quandt test using house size as the ordering variable

3. If test shows heteroscedasticity, use robust standard errors or transform variables


The test helps ensure your regression inferences are valid by checking this important assumption.
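
In practice, statsmodels provides this test out of the box. A minimal sketch of the house-price workflow described above (the data is synthetic and the names are illustrative):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

# Synthetic house data where the error spread grows with house size
rng = np.random.default_rng(1)
house_size = rng.uniform(50, 300, 150)
price = 20 + 0.9 * house_size + rng.normal(scale=0.05 * house_size)

X = sm.add_constant(house_size)      # column 0 = intercept, column 1 = size

# Order observations by house size (idx=1) and run the Goldfeld-Quandt test
f_stat, p_value, _ = het_goldfeldquandt(price, X, idx=1)
print(f_stat, p_value)

# If the test rejects homoscedasticity, refit with robust (HC3) standard errors
robust_model = sm.OLS(price, X).fit(cov_type="HC3")
print(robust_model.summary())
```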

What is the OLS summary in linear regression?

## **OLS Summary and Confidence Intervals**


The **OLS (Ordinary Least Squares) summary** is the output from fitting a linear regression model; it provides key statistics about the model's performance and coefficients.


### **Default Confidence Interval in OLS Summary**

By default, most statistical software packages (Python's statsmodels, R, etc.) show the **95% confidence interval** for model coefficients in the OLS summary output.


### **What the OLS Summary Typically Includes:**

- **Coefficient estimates** (β values)

- **Standard errors** of coefficients

- **t-statistics and p-values**

- **95% confidence intervals** for each coefficient

- **R-squared and Adjusted R-squared**

- **F-statistic** for overall model significance

- **Log-likelihood, AIC, BIC** (in some packages)
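
A minimal statsmodels sketch showing where these quantities appear (the data here is synthetic and purely illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 2.5 + 0.8 * X[:, 0] + rng.normal(scale=1.0, size=100)

X = sm.add_constant(X)
results = sm.OLS(y, X).fit()

print(results.summary())               # coefficients, std errors, t, p, 95% CI, R², F, AIC/BIC
print(results.conf_int(alpha=0.05))    # the default 95% confidence intervals
print(results.conf_int(alpha=0.01))    # 99% intervals, if a different level is needed
```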

Can hypothesis tests on coefficients be used to decide whether to keep or drop variables in linear regression?

 **True**

---

## **Explanation**

In linear regression, we often use **hypothesis tests on coefficients** to decide whether to keep or drop variables.

### **Typical Procedure:**

1. **Set up hypotheses** for each predictor \( X_j \):

   - \( H_0: \beta_j = 0 \) (variable has no effect)

   - \( H_1: \beta_j \neq 0 \) (variable has a significant effect)


2. **Compute t-statistic**:

   \[

   t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}

   \]

   where \( \text{SE}(\hat{\beta}_j) \) is the standard error of the coefficient.


3. **Compare to critical value** or use **p-value**:

   - If p-value < significance level (e.g., 0.05), reject \( H_0 \) → **keep** the variable

   - If p-value ≥ significance level, fail to reject \( H_0 \) → consider **dropping** the variable


---


### **Example:**

In regression output:

```
            Coefficient   Std Error   t-stat   p-value
Intercept   2.5           0.3         8.33     <0.001
X1          0.8           0.4         2.00     0.046
X2          0.1           0.5         0.20     0.842
```

- **X1** (p = 0.046): Significant at α=0.05 → **keep**

- **X2** (p = 0.842): Not significant → consider **dropping**
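
A minimal statsmodels sketch of how this keep/drop decision is typically automated (the data and the 0.05 threshold are illustrative, not taken from the output above):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: X1 matters, X2 is pure noise
rng = np.random.default_rng(7)
X1 = rng.normal(size=200)
X2 = rng.normal(size=200)
y = 2.5 + 0.8 * X1 + rng.normal(size=200)

X = sm.add_constant(np.column_stack([X1, X2]))
results = sm.OLS(y, X).fit()

print(results.tvalues)   # t-statistics for intercept, X1, X2
print(results.pvalues)   # corresponding p-values

# Flag candidate variables to drop (p >= alpha)
alpha = 0.05
for name, p in zip(["const", "X1", "X2"], results.pvalues):
    decision = "keep" if p < alpha else "consider dropping"
    print(f"{name}: p = {p:.3f} -> {decision}")
```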


---


### **Note:**

While this is common practice, variable selection shouldn't rely **only** on p-values — domain knowledge, model purpose, and multicollinearity should also be considered. But the statement itself is **true**: hypothesis testing on coefficients is indeed used for deciding whether to keep/drop variables.

How to find the percentage increase in variance given the VIF?

 ## **Step-by-Step Solution**


### **1. Understanding VIF Formula**

The Variance Inflation Factor is:

\[

\text{VIF} = \frac{\text{Actual variance of coefficient}}{\text{Variance with no multicollinearity}}

\]


Given: **VIF = 1.8**


### **2. Interpret the VIF Value**

\[

1.8 = \frac{\text{Actual variance}}{\text{Variance with no multicollinearity}}

\]


This means the actual variance is **1.8 times** what it would be with no multicollinearity.


### **3. Calculate Percentage Increase**

If variance with no multicollinearity = 1 (base), then:

- Actual variance = 1.8

- **Increase** = 1.8 - 1 = 0.8

- **Percentage increase** = \( \frac{0.8}{1} \times 100\% = 80\% \)


---


## **Final Answer**

\[

\boxed{80}

\]


The variance of the coefficient is **80% greater** than what it would be if there was no multicollinearity.


---


### **Verification**

- VIF = 1.0 → 0% increase (no multicollinearity)

- VIF = 2.0 → 100% increase (variance doubles)

- VIF = 1.8 → 80% increase ✓


This makes intuitive sense: moderate multicollinearity (VIF = 1.8) inflates the variance by 80% compared to the ideal case.
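
The same arithmetic as a quick sketch in Python (the VIF value is the one given in the question):

```python
vif = 1.8
pct_increase = (vif - 1) * 100   # increase relative to the no-multicollinearity baseline
print(f"{pct_increase:.0f}%")    # 80%
```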

What is the Variance Inflation Factor?

## **Variance Inflation Factor (VIF)**


The **Variance Inflation Factor (VIF)** measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

---

### **Formula**

For predictor \( X_k \):

\[

\text{VIF}_k = \frac{1}{1 - R_k^2}

\]

where \( R_k^2 \) is the R-squared value from regressing \( X_k \) on all other predictors.
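
statsmodels implements this calculation directly; below is a minimal sketch (the predictors and their names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative, partly collinear predictors
rng = np.random.default_rng(0)
size = rng.normal(150, 30, 100)
rooms = size / 30 + rng.normal(scale=0.5, size=100)   # deliberately correlated with size
bathrooms = rng.integers(1, 4, 100).astype(float)

X = sm.add_constant(pd.DataFrame({"size": size, "rooms": rooms, "bathrooms": bathrooms}))

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # ignore the constant's row; high values flag collinear predictors
```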

---


### **Interpretation**

- **VIF = 1**: No multicollinearity

- **1 < VIF ≤ 5**: Moderate correlation (usually acceptable)

- **5 < VIF ≤ 10**: High multicollinearity (may be problematic)

- **VIF > 10**: Severe multicollinearity (coefficient estimates are unstable)

---

## **How VIF is Helpful**

1. **Detects Multicollinearity**

   - Identifies when predictors are highly correlated with each other

   - Helps understand which variables contribute to collinearity

2. **Assesses Regression Coefficient Stability**

   - High VIF → large standard errors → unreliable coefficient estimates

   - Helps decide if some variables should be removed or combined

3. **Guides Model Improvement**

   - Suggests when to:

     - Remove redundant variables

     - Combine correlated variables (e.g., using PCA)

     - Use regularization (Ridge regression)

4. **Better Model Interpretation**

   - With lower multicollinearity, coefficient interpretations are more reliable

   - Each predictor's effect can be isolated more clearly

---

### **Example Usage**

If you have predictors: House Size, Number of Rooms, Number of Bathrooms

- Regress "Number of Rooms" on "House Size" and "Number of Bathrooms"

- High \( R^2 \) → High VIF → these variables contain overlapping information

- Solution: Maybe use only "House Size" and one other, or create a composite feature

---

**Bottom line**: VIF helps build more robust, interpretable models by identifying and addressing multicollinearity issues.



 


What is a Q-Q plot and what are its benefits?

A **Q-Q plot** (quantile-quantile plot) is a graphical tool used to compare two probability distributions by plotting their quantiles against each other.

---

## **How it works**

- One distribution’s quantiles are on the x-axis, the other’s on the y-axis.
- If the two distributions are similar, the points will fall roughly along the **line \(y = x\)** (the 45° diagonal).
- Deviations from this line indicate how the distributions differ in shape, spread, or tails.

---

## **Types of Q-Q plots**

1. **Two-sample Q-Q plot**: Compare two empirical datasets.
2. **Theoretical Q-Q plot**: Compare sample data to a theoretical distribution (e.g., normal Q-Q plot to check normality).
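
As an illustration of the second type, here is a minimal sketch of a theoretical (normal) Q-Q plot using scipy and matplotlib (the sample is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic, mildly right-skewed sample
rng = np.random.default_rng(0)
sample = rng.gamma(shape=4.0, scale=1.0, size=300)

# Normal Q-Q plot: sample quantiles vs. theoretical normal quantiles
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()
```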

---

## **Benefits of Q-Q plots**

1. **Visual check for distribution similarity**  
   - Quickly see if two datasets come from the same distribution family.

2. **Assess normality**  
   - Common use: Normal Q-Q plot to check if data is approximately normally distributed.

3. **Identify tails behavior**  
   - Points deviating upward at the top → right tail of sample is heavier than theoretical.  
   - Points deviating downward at the top → right tail is lighter.

4. **Detect skewness**  
   - A curved pattern suggests skew.

5. **Spot outliers**  
   - Points far off the line may be outliers.

6. **Compare location and scale differences**  
   - If points lie on a straight line with slope ≠ 1 → scale difference.  
   - If intercept ≠ 0 → location shift.

---

## **Example interpretation**

- **Straight diagonal line**: Distributions are the same.
- **Straight line with slope > 1**: Sample has greater variance.
- **S-shaped curve**: Tails differ (one distribution has heavier or lighter tails).
- **Concave up**: Sample distribution is right-skewed relative to theoretical.
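
For a two-sample Q-Q plot, the quantiles of two datasets can be compared directly; a minimal sketch with synthetic samples (a shifted, wider sample plotted against a standard normal reference):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
a = rng.normal(0, 1, 500)        # reference sample
b = rng.normal(0.5, 2, 500)      # shifted and more spread out

# Compare the two empirical distributions quantile by quantile
q = np.linspace(0.01, 0.99, 99)
qa, qb = np.quantile(a, q), np.quantile(b, q)

plt.scatter(qa, qb, s=10)
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
plt.plot(lims, lims, "r--")      # 45° reference line y = x
plt.xlabel("Quantiles of a")
plt.ylabel("Quantiles of b")
plt.show()                       # slope > 1 here reflects b's larger spread; the offset reflects the location shift
```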