How to Handle Categorical Data with Encoding


Introduction

Most machine learning algorithms operate on numbers, so categorical data such as text labels has to be converted to a numeric form before a model can be trained on it. This is where encoding methods come into play. Encoding assigns a numeric code to each category, acting as a translator between human-readable labels and the numeric input a model expects. In this article, we will discuss how to handle categorical data with two encoding techniques: Label Encoding and One-Hot Encoding.

For demonstration, we will use the following table as an example.

Fruit       Color
Orange      Orange
Apple       Red
Banana      Yellow
Strawberry  Red
Blueberry   Blue

Label Encoding

Let’s start with the first encoding technique: Label Encoding. This method assigns an integer to each distinct category in a column. For the Color column in the table above, scikit-learn's LabelEncoder sorts the categories alphabetically before numbering them, giving {Blue: 0, Orange: 1, Red: 2, Yellow: 3}. The model can then work with these integers instead of the text labels. Here is an example in Python.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}

df = pd.DataFrame(data)

# Using Label Encoding
label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])

print("Label Encoding:")
print(df[['Color', 'Color_LabelEncoded']])

Output

Color   Color_LabelEncoded
Orange  1
Red     2
Yellow  3
Red     2
Blue    0
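
To check which integer each category received, the fitted encoder can be inspected directly. The following is a minimal sketch, assuming the same Color values as in the table; the variable names `mapping` and `decoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same Color values as in the example table
colors = pd.Series(['Orange', 'Red', 'Yellow', 'Red', 'Blue'], name='Color')

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; each position is that category's code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))
```

The printed mapping follows alphabetical order of the labels, not the order in which they appear in the data.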

One-Hot Encoding

One-Hot Encoding transforms each distinct category of a categorical variable into its own binary column. A row has a 1 in the column for its category and a 0 in every other column. These separate “flags” let machine learning models use the categories without implying any order among them.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}

df = pd.DataFrame(data)

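# Using One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df[['Color']]).toarray()

# The encoder sorts categories alphabetically, so take the column names from
# categories_ rather than from the order in which colors appear in the data
df_onehot = pd.DataFrame(onehot_encoded, columns=[f'Color_{color}' for color in onehot_encoder.categories_[0]])

print("One-Hot Encoding:")
print(df_onehot)

Output

Color_Blue  Color_Orange  Color_Red  Color_Yellow
0.0         1.0           0.0        0.0
0.0         0.0           1.0        0.0
0.0         0.0           0.0        1.0
0.0         0.0           1.0        0.0
1.0         0.0           0.0        0.0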
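
For a quick look at the data, pandas alone can produce the same kind of binary columns. The sketch below assumes `pd.get_dummies` is an acceptable substitute for the scikit-learn encoder. It names the columns from the categories themselves, in sorted order, so the labels always match the values.

```python
import pandas as pd

# Same dataset as in the examples above
df = pd.DataFrame({
    'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
    'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue'],
})

# One binary column per distinct color; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)
print(dummies)
```

Unlike a fitted OneHotEncoder, `get_dummies` does not remember the categories it saw, so it is better suited to exploration than to a pipeline that must encode new data consistently.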

Summary

Label Encoding and One-Hot Encoding are methods to convert categorical data into a numerical format for machine learning. Label Encoding assigns a unique integer to each category. This is compact, but many models will read the integers as an order or magnitude, so it fits best when the categories are ordinal and the integers reflect that order, or when the algorithm is insensitive to it. One-Hot Encoding creates a binary column for each category and imposes no order, which makes it the safer choice for nominal data such as colors, at the cost of one extra column per category. The choice depends on the data’s nature and the algorithm’s requirements.
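
When a column really is ordinal, the order has to be supplied explicitly, because alphabetical sorting rarely matches it. The following is a minimal sketch using scikit-learn's OrdinalEncoder; the Size column and its values are hypothetical and not part of the fruit table.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, used only for illustration
sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# The categories argument fixes the order: Small -> 0, Medium -> 1, Large -> 2
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes['Size_Encoded'] = ordinal_encoder.fit_transform(sizes[['Size']]).ravel()

print(sizes)
```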