Data drift (also called distribution shift) means the statistical properties of input features change over time compared to training data.
• Example: A fraud detection model trained on transaction patterns from 2023 may see very different spending patterns in 2025.
• Types:
• Covariate shift: Change in feature distributions (e.g., average age of customers rises from 30 → 45).
• Prior probability shift: Change in target distribution (e.g., fraud rate increases).
• Concept drift: Relationship between features and target changes (e.g., fraudsters use new methods).
How Drift is Measured (Structured Data)
You compare the distribution of features (train vs. current data) using statistical tests or divergence metrics.
Common methods:
1. Population Stability Index (PSI)
• Used heavily in credit risk / finance.
• Measures how much a variable’s distribution has shifted over time.
• Rule of thumb:
• PSI < 0.1 → no drift
• 0.1–0.25 → moderate drift
• 0.25 → significant drift
2. Kullback–Leibler Divergence (KL Divergence)
• Measures how one probability distribution diverges from another.
• Asymmetric → KL(P‖Q) ≠ KL(Q‖P).
3. Jensen–Shannon Divergence (JS Divergence)
• Symmetric version of KL divergence.
• Outputs bounded values (0–1).
4. Kolmogorov–Smirnov Test (KS Test)
• Non-parametric test comparing cumulative distributions of two samples.
• Often used in fraud detection / credit scoring.
Yes — the algorithms you mentioned (PSI, KL, JS, KS) are all useful for structured data drift detection.
Drift in Unstructured Data
Unstructured data = text, images, audio, video. Drift here is harder to measure because distributions are not just numbers.
Methods:
1. Text Drift
• Compare embeddings of text using cosine similarity.
• Measure drift in word distributions (TF-IDF, BERT embeddings).
2. Image Drift
• Use feature embeddings (CNN, CLIP) → compare with KL/JS divergence or Maximum Mean Discrepancy (MMD).
3. Audio Drift
• Extract spectrogram features / embeddings → compare distributions.
So, for unstructured data, embedding-based drift detection is common.
Benefits of Calculating Drift
1. Model Monitoring → Ensures model is still valid in production.
2. Early Warning System → Detect changes in customer behavior, fraud, medical conditions.
3. Data Quality Assurance → Spot broken pipelines (e.g., a column suddenly all zeros).
4. Regulatory Compliance → Finance/healthcare require continuous monitoring.
5. Reduce Business Risk → Prevent degraded predictions causing revenue loss.
Summary
• Drift = change in statistical distribution of data between training & production.
• Structured data drift → measured via PSI, KL, JS, KS, etc.
• Unstructured data drift → embeddings + divergence tests.
• Benefits = monitoring, risk management, compliance, early alerts.
No comments:
Post a Comment