When deciding whether to impute missing data, the proportion of missing values is an important factor.
General Rule of Thumb for Imputation Thresholds:
Missing % of a Column | Recommended Action
< 5% | Impute (mean, median, mode, etc.) or drop rows if impact is negligible
5–30% | Consider imputation; carefully analyze patterns and impact
> 30% | Consider dropping the column or using advanced methods (e.g., model-based imputation)
If the example dataset has 20,000 rows and only 18 values are missing, that is 0.09% missing (18 / 20,000). Since this falls in the < 5% range, simple imputation (or dropping the affected rows) is appropriate.
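The rule of thumb above can be sketched as a small helper. This is only an illustration: the function names `missing_percent` and `recommend_action` are hypothetical, missing values are assumed to be represented as `None`, and the cut-offs are the rule-of-thumb thresholds from this post rather than hard limits.

```python
def missing_percent(values):
    """Return the percentage of entries that are missing (None)."""
    if not values:
        raise ValueError("column is empty")
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


def recommend_action(pct):
    """Map a missing percentage to the rule-of-thumb action."""
    if pct < 5:
        return "impute (mean, median, mode) or drop rows"
    elif pct <= 30:
        return "consider imputation; analyze patterns and impact"
    else:
        return "consider dropping the column or advanced methods"


# Illustrative column: 20,000 rows, 18 of them missing (assumed layout).
column = [1.0] * (20_000 - 18) + [None] * 18
pct = missing_percent(column)
action = recommend_action(pct)
print(f"{pct:.2f}% missing -> {action}")
```

With 18 missing values out of 20,000, the helper reports 0.09% and lands in the first band.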
1. Mean Imputation
Definition:
Replace missing values in a column with the average (mean) of the non-missing values.
Formula:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where $x_1, \dots, x_n$ are the $n$ non-missing values in the column.
When to Use:
The data is normally distributed (i.e., symmetric).
No significant outliers present.
Pros:
Simple and fast.
Preserves the overall mean of the data.
Cons:
Sensitive to outliers — large or small extreme values can skew the mean.
Reduces variability in the data: every imputed value is the same number, so the column's variance shrinks.
Example:
Data: [10, 12, 13, 11, NA]
Mean = 11.5 → impute missing value with 11.5
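A minimal sketch of this example in plain Python, assuming missing values are stored as `None`. The helper name `impute_mean` is hypothetical. It also compares the population variance before and after imputation to illustrate the reduced-variability drawback listed above.

```python
from statistics import mean, pvariance


def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


data = [10, 12, 13, 11, None]
observed = [v for v in data if v is not None]
imputed = impute_mean(data)

print(imputed)               # [10, 12, 13, 11, 11.5]
print(mean(imputed))         # 11.5, the same mean as the observed values
print(pvariance(observed))   # 1.25
print(pvariance(imputed))    # 1.0, smaller because the filled value adds no spread
```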
2. Median Imputation
Definition:
Replace missing values with the median (middle value) of the non-missing data.
When to Use:
The data is skewed (not symmetric).
Outliers are present — median is more robust to them.
Pros:
Largely unaffected by outliers.
Maintains the central tendency better in skewed distributions.
Cons:
Does not preserve the column mean, unlike mean imputation.
Offers little advantage for symmetric distributions, where the mean and median nearly coincide.
Example:
Data: [10, 12, 100, 11, NA]
Mean = 33.25 (inflated by 100)
Median = 11.5 → more representative → impute with 11.5
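The same example as a short sketch, again assuming `None` marks a missing value. The helper name `impute_median` is hypothetical. Printing the mean alongside the median shows how much the single outlier (100) pulls the mean.

```python
from statistics import mean, median


def impute_median(values):
    """Replace None entries with the median of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


data = [10, 12, 100, 11, None]
observed = [v for v in data if v is not None]

print(mean(observed))        # 33.25, inflated by the outlier 100
print(median(observed))      # 11.5, the middle of the sorted values 10, 11, 12, 100
print(impute_median(data))   # [10, 12, 100, 11, 11.5]
```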