Introduction
When we train a model, most algorithms expect numerical input and cannot work with categorical data directly. This is where encoding methods come into play: they assign a numeric code to each category, acting as translators between human-readable labels and the math-focused world of machine learning. By encoding categorical data well, we simplify tricky data and make it usable by models. In this article, we will discuss how to handle categorical data with two common encoding techniques.
For demonstration, we will use the following table as an example.
| Fruit | Color |
|---|---|
| Orange | Orange |
| Apple | Red |
| Banana | Yellow |
| Strawberry | Red |
| Blueberry | Blue |
Label Encoding
Let’s start with the first encoding technique: Label Encoding. This method assigns an integer to each category in a column. Note that scikit-learn’s `LabelEncoder` numbers the categories in alphabetical order, so for the table above the colors are mapped as {Blue: 0, Orange: 1, Red: 2, Yellow: 3}. This makes it simpler for our model to work with the information. Here is an example in Python.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}
df = pd.DataFrame(data)

# Using Label Encoding (categories are numbered in alphabetical order)
label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])
print("Label Encoding:")
print(df[['Color', 'Color_LabelEncoded']])
```
Output
| Color | Color_LabelEncoded |
|---|---|
| Orange | 1 |
| Red | 2 |
| Yellow | 3 |
| Red | 2 |
| Blue | 0 |
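Two details of `LabelEncoder` are worth knowing: the learned mapping can be inspected through its `classes_` attribute, and `inverse_transform` maps the integer codes back to the original labels. A short sketch:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Orange', 'Red', 'Yellow', 'Red', 'Blue']
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ lists the categories in the order they were numbered (alphabetical)
mapping = {label: i for i, label in enumerate(encoder.classes_)}
print(mapping)  # {'Blue': 0, 'Orange': 1, 'Red': 2, 'Yellow': 3}

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```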
One-Hot Encoding
One-Hot Encoding transforms each category of a categorical variable into its own binary column. Instead of a single numeric code, every category gets a separate 0/1 “flag”, which avoids implying any order between categories and makes the data easier for many algorithms to process and analyze.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}
df = pd.DataFrame(data)

# Using One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df[['Color']]).toarray()

# get_feature_names_out returns the column names in the encoder's own
# (alphabetical) order, so the names stay aligned with the values
df_onehot = pd.DataFrame(onehot_encoded,
                         columns=onehot_encoder.get_feature_names_out(['Color']))
print("One-Hot Encoding:")
print(df_onehot)
```
Output
| Color_Blue | Color_Orange | Color_Red | Color_Yellow |
|---|---|---|---|
| 0.0 | 1.0 | 0.0 | 0.0 |
| 0.0 | 0.0 | 1.0 | 0.0 |
| 0.0 | 0.0 | 0.0 | 1.0 |
| 0.0 | 0.0 | 1.0 | 0.0 |
| 1.0 | 0.0 | 0.0 | 0.0 |
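pandas offers a one-line alternative: `pd.get_dummies` builds the same indicator columns without going through scikit-learn. A minimal sketch using the Color column from above:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']})

# One indicator column per category; columns come out in alphabetical order
df_dummies = pd.get_dummies(df['Color'], prefix='Color')
print(df_dummies)
```

This is convenient for quick analysis, though for machine-learning pipelines the fitted `OneHotEncoder` is often preferred because it remembers the category set and can apply it consistently to new data.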
Summary
Label Encoding and One-Hot Encoding are methods to convert categorical data into a numerical format for machine learning. Label Encoding assigns each category a unique integer, which implicitly introduces an ordering between categories; One-Hot Encoding creates a binary column per category, keeping the categories independent of one another. Because of that implied ordering, Label Encoding fits ordinal data (or tree-based models that can ignore the ordering), while One-Hot Encoding is the safer choice for nominal data. The choice depends on the data’s nature and the algorithm’s requirements.
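When a column really is ordinal, scikit-learn’s `OrdinalEncoder` lets you spell out the order explicitly instead of relying on alphabetical numbering. A sketch with a hypothetical `Size` column (not part of the fruit example above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# categories= fixes the order explicitly: Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])
print(df)
```

Here the integer codes genuinely reflect the category order, so a model treating them as numbers is using real information rather than an alphabetical accident.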