Saturday, February 25, 2023

Why you should do one-hot encoding

Many machine learning algorithms require all input and output variables to be numeric, although some, such as decision trees, can work on categorical data directly.


One-hot encoding helps here because it transforms categorical data into numerical data; in other words, it turns strings into numbers so that we can apply our machine learning algorithms without any problems.


For example, take an “animals” column with three kinds of values:

animals = ['dog', 'cat', 'mouse']

One-hot encoding creates as many new columns as there are unique kinds of animals in the “animals” column, and fills them with 0s and 1s: each row gets a 1 in the column matching its value and 0s everywhere else. So, if you have 100 kinds of animals in your “animals” column, one-hot encoding will create 100 new columns, all filled with 1s and 0s.
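
As a minimal sketch of this, assume the three animals above sit in a pandas data frame with a single “animals” column. The name animals_df is only illustrative and does not come from the original example:

```python
import pandas as pd

# Hypothetical data frame built from the three example animals
animals_df = pd.DataFrame({'animals': ['dog', 'cat', 'mouse']})

# One new 0/1 column is created per unique value in the "animals" column
encoded = pd.get_dummies(animals_df, columns=['animals'], dtype=int)

print(encoded)
```

With three unique animals, the result has three new columns, and each row holds a single 1.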



This process can lead to some trouble, namely the so-called “Dummy Variable Trap”.



The Dummy Variable Trap is a scenario where the new variables become highly correlated with each other: because every row has exactly one 1 across the new columns, any one column can be predicted from the others. In other words, one-hot encoding can lead to multicollinearity, so we always have to analyze the new features (that is, the new columns) and decide whether to drop some of them.
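
One common way to handle this is to drop one of the new columns, since its value is implied by the remaining ones. The sketch below uses an illustrative animals_df (an assumption, not data from this post). It first checks that the full set of new columns always sums to 1 per row, then uses the drop_first option of pd.get_dummies to leave one column out:

```python
import pandas as pd

# Hypothetical data frame used only for illustration
animals_df = pd.DataFrame({'animals': ['dog', 'cat', 'mouse', 'dog']})

# Full encoding: one column per unique animal
full = pd.get_dummies(animals_df, columns=['animals'], dtype=int)

# Each row contains exactly one 1, so the columns always add up to 1
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# drop_first=True removes the first category's column to break that dependency
reduced = pd.get_dummies(animals_df, columns=['animals'], drop_first=True, dtype=int)
print(reduced)
```

Whether dropping a column is worth it depends on the model; it mainly matters for models that are sensitive to multicollinearity, such as linear regression.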



The simplest way to perform one-hot encoding is directly in pandas. Consider a data frame, df, like the one created in the scikit-learn example further down. To encode it we can simply write the following lines of code:



#one-hot encoding

df3 = pd.get_dummies(df, dtype=int)

#showing new head

df3.head()



A more convoluted way of doing this is with scikit-learn, which provides a OneHotEncoder class:


import pandas as pd

from sklearn.preprocessing import OneHotEncoder


# initializing values

data = {'Name':['Tom', 'Jack', 'Nick', 'John',

                'Tom', 'Jack', 'Nick', 'John',

                'Tom', 'Jack', 'Nick', 'John',],

        'Time':[20, 21, 19, 18,

                20, 100, 19, 18,

                21, 22, 21, 20]

}

#creating dataframe

df = pd.DataFrame(data)

#showing head

df.head()


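# categories not seen during fit are ignored at transform time instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')

# fit on the 'Name' column; the result is a sparse matrix, so convert it to a dense array
encoder_df = pd.DataFrame(encoder.fit_transform(df[['Name']]).toarray())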


#merge one-hot encoded columns back with original DataFrame

df2 = df.join(encoder_df)

#drop columns with strings

df2.drop('Name', axis=1, inplace=True)

#showing new head

df2.head()
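
To illustrate what handle_unknown='ignore' does, here is a small sketch that fits the encoder on the four names used above and then transforms a name it has never seen. The name 'Alice' and the variables train and new_rows are assumptions made up for this example:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the four names from the example above
train = pd.DataFrame({'Name': ['Tom', 'Jack', 'Nick', 'John']})
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Name']])

# 'Alice' is a hypothetical name the encoder did not see during fit
new_rows = pd.DataFrame({'Name': ['Tom', 'Alice']})
encoded = encoder.transform(new_rows[['Name']]).toarray()

# Categories learned during fit, one per output column
print(encoder.categories_[0])

# The known name gets a single 1; the unknown name becomes a row of all 0s
print(encoded)
```

Without handle_unknown='ignore', the encoder raises an error by default when it meets a category it did not see during fit.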



references:

https://towardsdatascience.com/how-and-why-performing-one-hot-encoding-in-your-data-science-project-a1500ec72d85#:~:text=In%20these%20cases%2C%20one%2Dhot,Learning%20algorithms%20without%20any%20problems.
