Introduction
When we train a model, most algorithms expect numerical input and cannot work with categorical data directly. This is where encoding methods come into play: they assign a numeric code to each category, acting as translators between human-readable labels and the math-focused world of machine learning. By encoding categorical data well, we simplify tricky data and make it usable by models. In this article, we will discuss how to handle categorical data with two common encoding techniques.
For demonstration, we will use the following table as an example.
| Fruit | Color |
|---|---|
| Orange | Orange |
| Apple | Red |
| Banana | Yellow |
| Strawberry | Red |
| Blueberry | Blue |
Label Encoding
Let’s start with the first encoding technique: Label Encoding. This method assigns an integer to each category in a column. Note that scikit-learn’s `LabelEncoder` numbers the categories in alphabetical order, so for the table above the colors are mapped as {Blue: 0, Orange: 1, Red: 2, Yellow: 3}. This makes it simpler for our model to work with the information. Here is an example in Python.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}
df = pd.DataFrame(data)

# Using Label Encoding (categories are numbered in alphabetical order)
label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])
print("Label Encoding:")
print(df[['Color', 'Color_LabelEncoded']])
```
Output
| Color | Color_LabelEncoded |
|---|---|
| Orange | 1 |
| Red | 2 |
| Yellow | 3 |
| Red | 2 |
| Blue | 0 |
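Two details of `LabelEncoder` are worth knowing: the learned mapping can be inspected through its `classes_` attribute, and `inverse_transform` maps the integer codes back to the original labels. A short sketch:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Orange', 'Red', 'Yellow', 'Red', 'Blue']
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ lists the categories in the order they were numbered (alphabetical)
mapping = {label: i for i, label in enumerate(encoder.classes_)}
print(mapping)  # {'Blue': 0, 'Orange': 1, 'Red': 2, 'Yellow': 3}

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```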
One-Hot Encoding
One-Hot Encoding transforms each category of a categorical variable into its own binary column. Instead of a single numeric code, every category gets a separate 0/1 “flag”, which avoids implying any order between categories and makes the data easier for many algorithms to process and analyze.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Creating the dataset
data = {'Fruit': ['Orange', 'Apple', 'Banana', 'Strawberry', 'Blueberry'],
        'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']}
df = pd.DataFrame(data)

# Using One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df[['Color']]).toarray()

# get_feature_names_out returns the column names in the encoder's own
# (alphabetical) order, so the names stay aligned with the values
df_onehot = pd.DataFrame(onehot_encoded,
                         columns=onehot_encoder.get_feature_names_out(['Color']))
print("One-Hot Encoding:")
print(df_onehot)
```
Output
| Color_Blue | Color_Orange | Color_Red | Color_Yellow |
|---|---|---|---|
| 0.0 | 1.0 | 0.0 | 0.0 |
| 0.0 | 0.0 | 1.0 | 0.0 |
| 0.0 | 0.0 | 0.0 | 1.0 |
| 0.0 | 0.0 | 1.0 | 0.0 |
| 1.0 | 0.0 | 0.0 | 0.0 |
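pandas offers a one-line alternative: `pd.get_dummies` builds the same indicator columns without going through scikit-learn. A minimal sketch using the Color column from above:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Orange', 'Red', 'Yellow', 'Red', 'Blue']})

# One indicator column per category; columns come out in alphabetical order
df_dummies = pd.get_dummies(df['Color'], prefix='Color')
print(df_dummies)
```

This is convenient for quick analysis, though for machine-learning pipelines the fitted `OneHotEncoder` is often preferred because it remembers the category set and can apply it consistently to new data.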
Summary
Label Encoding and One-Hot Encoding are methods to convert categorical data into a numerical format for machine learning. Label Encoding assigns each category a unique integer, which implicitly introduces an ordering between categories; One-Hot Encoding creates a binary column per category, keeping the categories independent of one another. Because of that implied ordering, Label Encoding fits ordinal data (or tree-based models that can ignore the ordering), while One-Hot Encoding is the safer choice for nominal data. The choice depends on the data’s nature and the algorithm’s requirements.
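When a column really is ordinal, scikit-learn’s `OrdinalEncoder` lets you spell out the order explicitly instead of relying on alphabetical numbering. A sketch with a hypothetical `Size` column (not part of the fruit example above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# categories= fixes the order explicitly: Small < Medium < Large
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])
print(df)
```

Here the integer codes genuinely reflect the category order, so a model treating them as numbers is using real information rather than an alphabetical accident.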