

One hot encoding is a process that converts categorical variables into a form that ML algorithms can use to do a better job in prediction.
So, you’re playing with ML models and you encounter this “one hot encoding” term all over the place. You see the sklearn documentation for the one hot encoder and it says: “Encode categorical integer features using a one-hot aka one-of-K scheme.” It’s not all that clear, right? Or at least it was not for me. So let’s look at what one hot encoding actually is.
╔═════════════╦═══════════════════╦═══════╗
║ CompanyName ║ Categorical value ║ Price ║
╠═════════════╬═══════════════════╬═══════╣
║ VW          ║ 1                 ║ 20000 ║
║ Acura       ║ 2                 ║ 10011 ║
║ Honda       ║ 3                 ║ 50000 ║
║ Honda       ║ 3                 ║ 10000 ║
╚═════════════╩═══════════════════╩═══════╝
The categorical value is the number assigned to each distinct entry in the dataset. For example, if there were another company in the dataset, it would be given the categorical value 4. As the number of unique entries increases, so does the range of categorical values.
The previous table is just a representation. In reality, the categorical values start from 0 and go all the way up to N-1, where N is the number of categories.
As you probably already know, the categorical value assignment can be done using sklearn’s LabelEncoder.
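To make the label assignment concrete, here is a minimal dependency-free sketch of the idea. It is not sklearn’s implementation: the helper name `label_encode` is made up for illustration, and it assumes labels are handed out in sorted order starting at 0, which is also how LabelEncoder orders its classes.

```python
def label_encode(values):
    """Map each distinct category to an integer in 0..N-1 (hypothetical helper)."""
    # Sort the distinct categories so the mapping is deterministic.
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


companies = ["VW", "Acura", "Honda", "Honda"]
codes, mapping = label_encode(companies)
print(mapping)  # {'Acura': 0, 'Honda': 1, 'VW': 2}
print(codes)    # [2, 0, 1, 1]
```

Note that these numbers differ from the illustrative table above, which numbers the companies from 1 in order of appearance. Either way, the integers are arbitrary identifiers.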
Now let’s get back to one hot encoding. Say we follow the instructions given in sklearn’s documentation for one hot encoding and then do a little cleanup. We end up with the following:
╔════╦═══════╦═══════╦═══════╗
║ VW ║ Acura ║ Honda ║ Price ║
╠════╬═══════╬═══════╬═══════╣
║ 1  ║ 0     ║ 0     ║ 20000 ║
║ 0  ║ 1     ║ 0     ║ 10011 ║
║ 0  ║ 0     ║ 1     ║ 50000 ║
║ 0  ║ 0     ║ 1     ║ 10000 ║
╚════╩═══════╩═══════╩═══════╝
A 0 indicates that the category is absent for that row, while a 1 indicates that it is present.
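The table above can be reproduced with a few lines of plain Python. This is only a sketch of the technique, not what sklearn does internally. The helper name `one_hot_encode` is hypothetical, and it assumes the columns should appear in order of first appearance so that they line up with the table.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds exactly one 1 (hypothetical helper)."""
    # Keep categories in order of first appearance to match the table above.
    categories = list(dict.fromkeys(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


companies = ["VW", "Acura", "Honda", "Honda"]
prices = [20000, 10011, 50000, 10000]

categories, rows = one_hot_encode(companies)
print(categories + ["Price"])
for row, price in zip(rows, prices):
    print(row + [price])
```

Each row contains a single 1, in the column of the company that row belongs to, and the Price column is carried along unchanged.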
Before we proceed further, can you think of one reason why label encoding alone is not sufficient for training the model? Why do you need one hot encoding?
The problem with label encoding is that it assumes that the higher the categorical value, the better the category. “Wait, what!?”
Let me explain: what this form of organization presupposes is VW < Acura < Honda, based on the categorical values. Suppose your model internally calculates an average. Then we get (1 + 3) / 2 = 2, which implies that the average of VW and Honda is Acura. This is definitely a recipe for disaster. This model’s predictions would have a lot of errors.
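The arithmetic in the paragraph above can be checked directly. The sketch below uses the 1-based label codes from the first table and, for comparison, shows what the same average looks like on one-hot vectors. The variable names are just for illustration.

```python
# Label codes taken from the first table (illustrative, 1-based).
labels = {"VW": 1, "Acura": 2, "Honda": 3}

# Averaging the label codes of VW and Honda lands exactly on Acura's code.
label_average = (labels["VW"] + labels["Honda"]) / 2
print(label_average == labels["Acura"])  # True

# With one-hot vectors, the same average is not any company's vector.
one_hot = {"VW": [1, 0, 0], "Acura": [0, 1, 0], "Honda": [0, 0, 1]}
one_hot_average = [(a + b) / 2 for a, b in zip(one_hot["VW"], one_hot["Honda"])]
print(one_hot_average)                      # [0.5, 0.0, 0.5]
print(one_hot_average in one_hot.values())  # False
```

With one-hot vectors the average reads as “half VW, half Honda” and never collides with Acura, so no ordering between the companies is implied.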
This is why we use a one hot encoder to perform “binarization” of the category and include it as a feature to train the model.
Another example: suppose you have a “flower” feature which can take the values “daffodil”, “lily”, and “rose”. One hot encoding converts the “flower” feature into three features, “is_daffodil”, “is_lily”, and “is_rose”, which are all binary.
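As a small sketch of the flower example, the function below turns a single “flower” value into the three binary features named above. The helper `encode_flower` and the choice to reject unknown values with an error are assumptions made for illustration.

```python
FLOWERS = ["daffodil", "lily", "rose"]  # the three values from the example


def encode_flower(flower):
    """Turn one 'flower' value into three binary 'is_*' features (hypothetical helper)."""
    if flower not in FLOWERS:
        raise ValueError(f"unknown flower: {flower!r}")
    return {f"is_{name}": int(flower == name) for name in FLOWERS}


print(encode_flower("lily"))  # {'is_daffodil': 0, 'is_lily': 1, 'is_rose': 0}
```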
Lead image via https://i.stack.imgur.com/mfsNd.png