Many machine learning algorithms require all input and output variables to be numeric, although some, such as decision trees, can work on categorical data directly.
One-hot encoding helps here because it transforms categorical data into numerical data; in other words, it turns strings into numbers so that we can apply our machine learning algorithms without any problems. For example, suppose an “animals” column holds these values:
animals = ['dog', 'cat', 'mouse']
One-hot encoding will create as many new columns as there are unique kinds of animals in the “animals” column, and the new columns will be filled with 0s and 1s: in each row, the column matching that row's animal gets a 1 and all the others get a 0. So, if you have 100 kinds of animals in your “animals” column, one-hot encoding will create 100 new columns, all filled with 1s and 0s.
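To make the column count concrete, here is a minimal sketch using pandas. The four-row data frame is made up for illustration (it repeats 'dog' once so the number of rows differs from the number of unique values), and the name `encoded` is just a placeholder.

```python
import pandas as pd

# Toy data (illustrative only): four rows, three unique animals.
animals_df = pd.DataFrame({"animals": ["dog", "cat", "mouse", "dog"]})

# One new 0/1 column per unique value in the "animals" column.
encoded = pd.get_dummies(animals_df, columns=["animals"], dtype=int)

print(encoded.columns.tolist())
# ['animals_cat', 'animals_dog', 'animals_mouse']

# Each row has exactly one 1, so every row sums to 1.
print(encoded.sum(axis=1).tolist())
# [1, 1, 1, 1]
```

Three unique animals give three new columns, regardless of how many rows the data has.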
This process can lead to some trouble, namely the so-called “Dummy Variable Trap”.
The Dummy Variable Trap is a scenario where the new variables become highly correlated with each other. Because exactly one of the new columns is 1 in every row, any one column can be predicted from the others, so one-hot encoding can introduce multicollinearity. This means that we always have to analyze the new features (that is, the new columns) and decide whether to drop some of them; a common remedy is to drop one column per encoded variable.
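The redundancy behind the trap can be checked directly. The sketch below reuses a made-up “animals” column (illustrative data, not from the original post) and shows that one encoded column is fully determined by the others, then uses the `drop_first=True` option of `pd.get_dummies` as one possible way to remove it.

```python
import pandas as pd

# Toy data (illustrative only).
animals_df = pd.DataFrame({"animals": ["dog", "cat", "mouse", "dog"]})

full = pd.get_dummies(animals_df, columns=["animals"], dtype=int)

# Every row sums to 1, so 'animals_cat' can be rebuilt from the other two columns.
rebuilt_cat = 1 - full["animals_dog"] - full["animals_mouse"]
print((rebuilt_cat == full["animals_cat"]).all())
# True

# drop_first=True drops the first category, leaving k-1 columns for k categories.
reduced = pd.get_dummies(animals_df, columns=["animals"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
# ['animals_dog', 'animals_mouse']
```

With the first column dropped, a row of all zeros stands for 'cat', so no information is lost. Whether dropping is needed depends on the model: it mostly matters for linear models, where the redundant column is a problem.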
The simplest way to perform one-hot encoding is directly in pandas. Consider a data frame, df, with a string column Name and a numeric column Time (it is built in the scikit-learn example further down). To encode it, we can simply write the following line of code:
#one-hot encoding
df3 = pd.get_dummies(df, dtype=int)
#showing new head
df3.head()
A more convoluted way of doing this is with scikit-learn, which provides the OneHotEncoder class:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
#initializing values
data = {'Name': ['Tom', 'Jack', 'Nick', 'John',
                 'Tom', 'Jack', 'Nick', 'John',
                 'Tom', 'Jack', 'Nick', 'John'],
        'Time': [20, 21, 19, 18,
                 20, 100, 19, 18,
                 21, 22, 21, 20]
        }
#creating dataframe
df = pd.DataFrame(data)
#showing head
df.head()
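#fit the encoder on the 'Name' column; categories unseen during fit are encoded as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
#fit_transform returns a sparse matrix, so convert it to a dense array and label the columns
encoder_df = pd.DataFrame(encoder.fit_transform(df[['Name']]).toarray(),
                          columns=encoder.get_feature_names_out())
#merge one-hot encoded columns back with original DataFrame
df2 = df.join(encoder_df)
#drop columns with strings
df2 = df2.drop(columns='Name')
#showing new head
df2.head()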
References:
https://towardsdatascience.com/how-and-why-performing-one-hot-encoding-in-your-data-science-project-a1500ec72d85