Saturday, April 5, 2025

What is Bagging (Bootstrap Aggregating) - Part 1

Bootstrap sampling: Bagging involves creating multiple subsets of the original training data by sampling with replacement. This means that some data points may be included multiple times in a single subset, while others may be left out.

Aggregation: After training a separate base learner (e.g., a decision tree) on each of these bootstrap samples, the predictions of these learners are aggregated. For classification, this is typically done by majority voting. For regression, it's usually done by averaging.
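
As a rough illustration of these two steps, here is a minimal sketch using scikit-learn's BaggingClassifier; the synthetic dataset and parameter values are illustrative placeholders, not something from the original question.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True -> each base learner (a decision tree by default) is trained
# on a sample drawn with replacement from the training data; class predictions
# are then aggregated across the trees to give the final prediction.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)
bag.fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))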

Let's look at why the other options are not the primary definition of Bagging:

decreasing impurity: Decreasing impurity is a goal within individual decision tree algorithms (like CART) when deciding how to split nodes. While Bagging often uses decision trees as base learners, decreasing impurity is a mechanism within the individual trees, not the core concept of Bagging itself.

Cross-Validation: Cross-validation is a technique used for evaluating the performance of a model by splitting the data into multiple folds for training and validation. It's a model evaluation technique, not the definition of Bagging. Bagging can be evaluated using cross-validation.

Sampling with replacement: While sampling with replacement is a key part of the bootstrap sampling step in Bagging, it doesn't encompass the entire Bagging process. The aggregation step is equally crucial.

The true statements about Bagging are:

Makes the model more robust: By training multiple models on different subsets of the data and aggregating their predictions, Bagging reduces the impact of noisy data points and outliers, leading to a more stable and reliable model.

Guards you against overfitting the model to original data: The bootstrap sampling process introduces randomness, and the aggregation step smooths out the predictions. This reduces the variance of the model and makes it less likely to memorize the training data, thus mitigating overfitting.

Follows parallel model building: Each base learner in a Bagging ensemble is trained independently on its respective bootstrap sample. This allows for parallel training, which can significantly reduce the overall training time, especially with large datasets and multiple computing resources.

The false statement is:

Follows sequential model building: Sequential model building is a characteristic of boosting techniques (like AdaBoost and Gradient Boosting), where each subsequent model tries to correct the errors made by the previous ones. Bagging, on the other hand, builds its models in parallel.
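
In scikit-learn, this parallelism is exposed through the n_jobs parameter; a minimal sketch (the values are illustrative):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# n_jobs=-1 trains the independent base learners on all available CPU cores,
# which is possible precisely because Bagging builds its models in parallel.
bag = BaggingClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)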

If p is the probability of choosing an observation, then which of the following are true for sampling with replacement?

p is different for some observations

p remains same at each stage for all observations

p depends on the number of columns

p depends on the number of base estimators

For sampling with replacement, where an observation is put back into the population after being selected, the probability of choosing any specific observation remains the same at each stage of the sampling process.

Here's why the other options are incorrect:

p is different for some observations: In simple random sampling with replacement, each observation has an equal chance of being selected at every draw. Therefore, the probability p is the same for all observations.

p depends on the number of columns: The probability of selecting an observation in the sampling process depends on the number of observations (the population size), not the number of columns (features) in the dataset.

p depends on the number of base estimators: The probability of choosing an observation during the sampling process in Bagging is determined by the bootstrap sampling method applied to the original dataset, independent of the number of base estimators that will be trained.

Therefore, the correct statement is:

p remains the same at each stage for all observations

What is meant by "observation" here? Is it a sample?

Yes, in this context, an observation directly refers to a sample from your dataset.

Here's why:

Dataset Structure: In machine learning and statistics, your dataset is typically organized into rows and columns. Each row represents a single instance or observation of the phenomenon you are studying. These rows are also often referred to as samples or data points.

Sampling with Replacement: When you perform bootstrap sampling in Bagging, you are repeatedly selecting rows (which are your observations or samples) from your original training dataset, with the possibility of selecting the same row multiple times.

Probability of Choosing: The probability p refers to the chance of selecting a specific row (a specific observation/sample) from the dataset during each draw of the bootstrap sampling process.

Therefore, in the context of Bagging and the question about sampling with replacement, an "observation" is indeed synonymous with a "sample" or a "data point" from your training dataset.

On average, what percentage of samples gets selected in sampling with replacement?

95%

100%

38%

63%

Probability of Not Being Selected: In a single draw with replacement from a dataset of size n, the probability of a specific observation not being selected is (1 - 1/n).

Probability of Not Being Selected in n Draws: When you perform n draws (to create a bootstrap sample of the same size as the original dataset), the probability that a specific observation is never selected is (1 - 1/n)^n.

Limit as n Approaches Infinity: As the number of samples n becomes large, the value of (1 - 1/n)^n approaches 1/e (where e is Euler's number), which is approximately 0.368.

Probability of Being Selected: Therefore, the probability that a specific observation is selected at least once in a bootstrap sample is 1 - (1 - 1/n)^n, which approaches 1 - 0.368 = 0.632.

So, on average, approximately 63.2% of the original samples will be present in a bootstrap sample (due to some samples being selected multiple times and others not at all). The closest option provided is 63%.
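
A quick numerical check of this limit, plus a small simulation (the population size n is arbitrary):

import numpy as np

n = 10_000  # arbitrary population size
rng = np.random.default_rng(0)

# Analytical probability that a given observation appears at least once
p_selected = 1 - (1 - 1 / n) ** n
print(f"1 - (1 - 1/n)^n = {p_selected:.4f}")  # ~0.632

# Simulation: draw one bootstrap sample of size n and count unique observations
bootstrap_idx = rng.integers(0, n, size=n)    # sampling with replacement
frac_selected = np.unique(bootstrap_idx).size / n
print(f"Fraction of observations selected: {frac_selected:.4f}")  # ~0.632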


Which of the below problems of Decision Trees can be overcome by Random Forest?

Overfitting

Instability due to changes in data

Interpretability

Computational complexity


Based on the common problems associated with Decision Trees, the issues that Random Forests effectively overcome are:

Overfitting: Decision Trees, especially if grown to a large depth, tend to overfit the training data. They learn the noise and specific details of the training set, leading to poor generalization on unseen data. Random Forests mitigate overfitting by creating an ensemble of many trees, each trained on a random subset of the data and a random subset of features. The final prediction is an aggregation of the predictions from all the trees, which reduces the variance and makes the model more robust.   

Instability due to changes in data: Decision Trees can be highly sensitive to small variations in the training data. A slight change can lead to a completely different tree structure. Random Forests are more stable because the final prediction is based on the consensus of many trees. The impact of a single noisy or slightly different data point is less likely to drastically alter the overall prediction.   

While Random Forests offer significant improvements in these areas, they generally do not inherently solve the following problems of Decision Trees:

Interpretability: Decision Trees are relatively easy to interpret as their decision-making process can be visualized as a tree with clear rules. Random Forests, being an ensemble of many trees, are much harder to interpret. While you can get feature importance scores, understanding the exact decision path for a particular prediction becomes complex. In fact, Random Forests often sacrifice some interpretability for better predictive performance.   

Computational complexity: While individual decision trees can be relatively fast to train and predict with (depending on their depth and the size of the data), Random Forests, as they involve training multiple trees, are generally more computationally expensive in terms of training time and memory usage. The prediction time might also be slightly higher as it involves aggregating predictions from multiple trees. However, the parallel nature of building the trees in a Random Forest can help reduce the overall training time on multi-core processors.   

Therefore, the primary problems of Decision Trees that are overcome by Random Forests are Overfitting and Instability due to changes in data.
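
As a rough, hedged illustration of the variance reduction, one can compare a single unconstrained decision tree with a random forest on the same synthetic data (the dataset and settings below are illustrative, not from the original question):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# A single fully grown tree tends to overfit (high variance);
# the forest averages many such trees, which usually generalizes better.
models = [
    ("Decision tree", DecisionTreeClassifier(random_state=0)),
    ("Random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")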

In Random Forest, to get n different models with the same algorithm, which one of the below can we use?

Bootstrap aggregation

Cross Validation

Normalization

None of These

The correct answer is Bootstrap aggregation.


Here's why:

Bootstrap aggregation (Bagging): This is the core technique used in Random Forests to create multiple models. It involves:

Bootstrap sampling: Creating multiple subsets of the original training data by sampling with replacement. Each subset is used to train a different decision tree.

Aggregation: Combining the predictions of all the individual trees (e.g., through majority voting for classification or averaging for regression) to make the final prediction.

Let's look at why the other options are incorrect:

Cross Validation: This is a technique used to evaluate the performance of a model by training and testing it on different subsets of the data. It doesn't inherently create multiple models within the Random Forest algorithm itself.

Normalization: This is a data preprocessing technique used to scale features to a similar range. It's not directly involved in the process of creating multiple models in a Random Forest.

None of These: Since Bootstrap aggregation is the method used, this option is incorrect.

Random Forest randomly picks a subset of independent variables for each node's split. If m is the size of the subset and M is the total number of independent variables, where m is generally less than M, is this correct?

Yes, this is correct. In Random Forests:

At each node in each decision tree, the algorithm randomly selects a subset of m independent variables (features) from the total M available independent variables.

The best split for that node is then chosen from only these m randomly selected variables.

Typically, the size of this subset m is significantly smaller than the total number of variables M (m << M). Common choices for m are the square root of M (for classification) or M/3 (for regression), but it can be tuned.

This random selection of features at each split is a key mechanism in Random Forests that contributes to the diversity among the individual trees. By not considering all the features for every split, the trees become less correlated, which in turn helps to reduce the variance of the ensemble and prevents overfitting.
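
In scikit-learn this subset size is controlled by the max_features parameter; a minimal sketch following those common heuristics (the values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider roughly sqrt(M) features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Regression: a common heuristic is about M/3 features per split
# (a float is interpreted as a fraction of the total number of features)
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=0)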

Do Random Forests need to be pruned to get good predictions? If so, why?

While individual decision trees within a Random Forest are often grown to their full depth without explicit pruning, the Random Forest as an ensemble achieves good prediction and avoids overfitting through its inherent mechanisms, making explicit pruning of individual trees often unnecessary and sometimes even detrimental.   

Here's a breakdown of why explicit pruning is generally not needed in Random Forests for good prediction:

Bagging (Bootstrap Aggregation): Each tree is trained on a different bootstrap sample of the original data. This introduces randomness and ensures that each tree sees a slightly different perspective of the data. This process itself helps to reduce overfitting by training on multiple variations of the dataset.   

Random Feature Subsampling: At each node split, only a random subset of features is considered. This further decorrelates the trees, making them less likely to overfit to specific noise in the data. Each tree focuses on different aspects of the features.   

Ensemble Averaging/Voting: The final prediction is made by averaging the predictions of all the regression trees or by majority voting for classification trees. This aggregation process smooths out the individual errors and reduces the variance of the overall model, which is a key aspect of preventing overfitting.   

Why explicit pruning is often not performed and can be counterproductive:

Bias-Variance Tradeoff: Individual, unpruned decision trees tend to have low bias (they can fit the training data very well) but high variance (they are sensitive to noise in the training data). Random Forests leverage this by combining many high-variance, low-bias trees. The aggregation reduces the variance significantly, leading to a good overall bias-variance tradeoff.   

Loss of Diversity: Pruning individual trees might make them more similar to each other, reducing the diversity within the ensemble. This loss of diversity can weaken the power of the ensemble to generalize well.

Computational Cost: While pruning can reduce the complexity of a single tree, performing pruning on every tree in a large Random Forest can add significant computational overhead.

However, there are some scenarios where controlling the growth of individual trees (which can be seen as a form of pre-pruning) might be beneficial:

Computational Constraints: If you have extremely large datasets and building very deep trees is computationally prohibitive, you might cap the maximum depth or raise the minimum number of samples per leaf (see the sketch after this list).

Very Noisy Data: In cases with extremely high levels of noise, limiting tree growth might offer a slight improvement in generalization by preventing individual trees from fitting the noise too closely.
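
A minimal sketch of such growth controls in scikit-learn (the parameter values are arbitrary examples, not recommendations):

from sklearn.ensemble import RandomForestClassifier

# Default: trees are grown out fully, with no explicit pruning
rf_default = RandomForestClassifier(n_estimators=100, random_state=0)

# Optional "pre-pruning" style limits for very large or very noisy data
rf_limited = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,        # cap the depth of each tree
    min_samples_leaf=5,  # require at least 5 samples in each leaf
    random_state=0,
)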

In conclusion:

While the individual decision trees in a Random Forest are typically grown without explicit pruning, the ensemble's inherent mechanisms of bagging and random feature subsampling effectively prevent overfitting and lead to good predictive performance. Explicitly pruning individual trees is generally not necessary and can sometimes reduce the effectiveness of the Random Forest by decreasing the diversity of the ensemble. The focus in Random Forests is on building a diverse set of potentially overfit individual trees and then letting the aggregation process create a robust and well-generalizing model.  


In a classification setting, for a new test data point, the final prediction by a random forest is done by taking which one of the below?

average of individual predictions

mode of the individual predictions

minimum of individual predictions

median of individual predictions

In a classification setting, the final prediction of a random forest for a new test data point is the mode (majority vote) of the individual trees' predictions, while in a regression setting it is the average of the individual predictions.
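
That aggregation can be reproduced by hand from a fitted forest's individual trees. Note that scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities (a soft vote), so this hard majority vote can occasionally disagree with it; the data and model below are purely illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

x_new = X[:1]  # stand-in for a new test data point

# Collect each individual tree's prediction (the trees output class indices)...
tree_preds = np.array([tree.predict(x_new)[0] for tree in rf.estimators_])
# ...and take the mode (majority vote) as the ensemble prediction
values, counts = np.unique(tree_preds, return_counts=True)
print("Majority-vote class:", rf.classes_[int(values[np.argmax(counts)])])
print("Forest's own prediction:", rf.predict(x_new)[0])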

What is stratify in sklearn?

In train_test_split() from sklearn.model_selection, the stratify parameter ensures that the proportions of classes (labels) are the same in both the training and testing sets as in the original dataset.

This is especially useful when dealing with imbalanced datasets, to prevent the train/test split from introducing further class imbalance.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Here, stratify=y ensures that the distribution of labels in y is maintained in both training and testing sets.
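
Continuing the snippet above (and assuming y is a 1-D array or Series of class labels), the effect can be verified by comparing the label proportions:

import numpy as np

# Class proportions should be nearly identical across the three sets
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    _, counts = np.unique(labels, return_counts=True)
    print(name, np.round(counts / counts.sum(), 3))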

What is the significance of class_weight in Random Forest?

class_weight is used in Random Forest (and other classifiers like Logistic Regression, SVM, etc.) to handle class imbalance by assigning more importance (weight) to the minority class.
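
A minimal sketch (the imbalance ratio and weights below are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# "balanced" weights classes inversely proportional to their frequencies
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=0)

# Or pass explicit weights, e.g. make errors on class 1 count five times more
rf_manual = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0)

rf_balanced.fit(X, y)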

Which of the below hyperparameters are tunable in Random Forest?


max_depth

max_features

min_samples_split

min_samples_leaf

max_depth


Controls the maximum depth of each decision tree.

Prevents overfitting if set properly.

✅ Tunable

max_features


The number of features to consider when looking for the best split.

Can be a float, int, "sqrt", "log2", etc.

✅ Tunable

min_samples_split


The minimum number of samples required to split an internal node.

Higher values can prevent overfitting.

✅ Tunable

min_samples_leaf


The minimum number of samples required to be at a leaf node.

Helps control tree complexity.

✅ Tunable
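
All four of these can be tuned together; a minimal sketch using GridSearchCV (the small grid and synthetic data are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "max_depth": [None, 10],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)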
