Sunday, May 25, 2025

What are typical thresholds for imputation?

When deciding whether to impute missing data, the proportion of missing values is an important factor.

General Rule of Thumb for Imputation Thresholds:

Missing % of a Column | Recommended Action
< 5% | Impute (mean, median, mode, etc.), or drop rows if the impact is negligible
5–30% | Consider imputation; carefully analyze patterns and impact
> 30% | Consider dropping the column or using advanced methods (e.g., model-based imputation)

If the example dataset has 20K rows and only 18 values are missing, that is 0.09% missing (18 / 20,000). Since it falls in the < 5% range, simple imputation (mean, median, or mode) or dropping those rows is a reasonable choice.
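The arithmetic and the rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of None to mark a missing value, and the exact band boundaries (5 and 30 treated as inclusive upper edges of the middle band) are assumptions made here, not fixed conventions.

```python
def missing_percentage(values):
    """Return the percentage of entries that are missing (None)."""
    if not values:
        raise ValueError("column is empty")
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


def recommended_action(pct_missing):
    """Map a missing percentage to the rule-of-thumb bands."""
    if pct_missing < 5:
        return "impute (mean/median/mode) or drop rows"
    if pct_missing <= 30:
        return "consider imputation; analyze patterns and impact"
    return "consider dropping the column or model-based imputation"


# Hypothetical column: 20,000 rows, 18 of them missing.
column = [1.0] * (20_000 - 18) + [None] * 18
pct = missing_percentage(column)
print(f"{pct:.2f}% missing -> {recommended_action(pct)}")
```

The thresholds are guidelines, so in practice the bands would be adjusted to the dataset and to how the column is used downstream.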

1. Mean Imputation

Definition:

Replace missing values in a column with the average (mean) of the non-missing values.

Formula:

Mean = (∑ x_i) / n, where x_i are the non-missing values and n is their count

When to Use:

The data is roughly symmetric (for example, approximately normally distributed).

No significant outliers present.

Pros:

Simple and fast.

Preserves the overall mean of the data.

Cons:

Sensitive to outliers — large or small extreme values can skew the mean.

Reduces variability in the data: every imputed value equals the mean, so the column's variance is understated.

Example:

Data: [10, 12, 13, 11, NA]

Mean = 11.5 → impute missing value with 11.5
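A minimal sketch of this example in plain Python, assuming missing values are represented as None (the helper name impute_mean is made up for illustration). It also checks the two properties listed above: the mean is preserved, while the variance shrinks.

```python
from statistics import fmean, pvariance


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = fmean(observed)
    return [fill if v is None else v for v in values]


data = [10, 12, 13, 11, None]
observed = [v for v in data if v is not None]
imputed = impute_mean(data)

print(imputed)                          # [10, 12, 13, 11, 11.5]
print(fmean(observed), fmean(imputed))  # mean is unchanged: 11.5 11.5
# Population variance drops because the filled value sits exactly at the mean.
print(pvariance(observed), pvariance(imputed))
```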

2. Median Imputation

Definition:

Replace missing values with the median (middle value) of the non-missing data.

When to Use:

The data is skewed (not symmetric).

Outliers are present — median is more robust to them.

Pros:

Largely unaffected by outliers.

Maintains the central tendency better in skewed distributions.

Cons:

Doesn’t preserve the column's mean (or total) the way mean imputation does.

Offers little advantage over the mean for symmetric distributions, where the two values nearly coincide.

Example:

Data: [10, 12, 100, 11, NA]

Mean = 33.25 (inflated by 100)

Median = 11.5 → more representative → impute with 11.5
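The same comparison can be sketched in plain Python, again assuming None marks a missing value and using a hypothetical helper named impute_median:

```python
from statistics import fmean, median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a median from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


data = [10, 12, 100, 11, None]
observed = [v for v in data if v is not None]

print(fmean(observed))      # 33.25, pulled upward by the outlier 100
print(median(observed))     # 11.5, the average of the middle values 11 and 12
print(impute_median(data))  # [10, 12, 100, 11, 11.5]
```

With an even number of observed values, statistics.median averages the two middle values, which is why the fill value here is 11.5 rather than 11 or 12.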

