Feature scaling is a crucial preprocessing step in machine learning that standardizes or rescales the range of independent variables or features. It ensures that each feature contributes proportionately to model training, avoiding undue influence due to differing scales. This is essential because many machine learning algorithms rely on distance-based metrics, and features with large scales can dominate those with smaller scales.
Let’s start with an example. Here we have a table:
| Employee | Working Experience (Years) | Annual Salary | Salary Class |
|---|---|---|---|
| A | 1 | 50,000 | Low |
| B | 2 | 70,000 | Medium |
| C | 4 | 80,000 | Medium |
| D | 6 | 90,000 | High |
| E | 8 | 100,000 | High |
Now, we want to use the k-nearest neighbours (KNN) algorithm to categorize the data based on salary class. In KNN, distance plays a critical role in determining the neighbours: the algorithm calculates the distance between the input data point and every data point in the training set. The most commonly used distance measure is the Euclidean distance, written as:
Distance = \(\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
Let’s calculate the distance between employees A and B without scaling the data.
Distance between A and B = \(\sqrt{(2 - 1)^2 + (70000 - 50000)^2} \approx 20000.00003\)
Notice how the feature with the larger scale overpowers the other. In this example, the one-year difference in working experience contributes only about 0.00003 to the distance, while the salary difference contributes essentially all of it. This is why we need to scale large-scale features.
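To see this concretely, here is a minimal sketch using min-max scaling, with the feature ranges taken from the table above (experience 1 to 8 years, salary 50,000 to 100,000):

import numpy as np

# Employees A and B as (experience, salary) points from the table
a = np.array([1, 50_000])
b = np.array([2, 70_000])

# Distance on the raw features: the salary difference dominates completely
print(np.linalg.norm(a - b))  # ~20000.00003

# Min-max scale each feature using the ranges from the table
mins = np.array([1, 50_000])
maxs = np.array([8, 100_000])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)

# Distance on the scaled features: both features now contribute
print(np.linalg.norm(a_scaled - b_scaled))  # ~0.43

After scaling, the experience difference is no longer drowned out by the salary difference.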
Importance of Feature Scaling
Algorithm Sensitivity
Algorithm sensitivity refers to how much an algorithm reacts to changes in data. It’s like how different people react to hot and cold temperatures – some might handle it well, while others might feel extreme discomfort.
Certain algorithms, such as k-nearest neighbours (KNN), support vector machines (SVM), and models based on gradient descent, can be particularly sensitive to the way data is scaled. Scaling means adjusting the size of the numbers so they’re all in a similar range. Think of it as using a sensitive scale. If you put a heavy object on one side and a light one on the other, the heavy one will have a bigger impact. Algorithms can be like that too – if features have different magnitudes, those with larger magnitudes might have a stronger effect on the outcome.
Convergence Speed
In optimization-based algorithms, feature scaling can improve convergence speed. In the context of gradient descent, a feature with a much larger scale than the others can dominate the gradient updates. When the algorithm takes a step in the direction of the gradient, the larger-scale feature leads to a much larger change in the cost function than the smaller-scale features, resulting in unbalanced parameter updates.
Imagine you’re climbing a mountain to reach the lowest point. You want to take steps in the direction that goes downhill the fastest. But here’s the catch: the ground isn’t even – some parts are steep, and some are gentle slopes.
In the same way, machine learning algorithms need to find the lowest point (the best solution). The “steps” they take are like changes to the model’s parameters. But if some features in your data have bigger numbers (large scale), they can “pull” the algorithm more. It’s like steep parts of the ground – they make you go down faster.
So, if one feature has large numbers compared to others, it can “overpower” the algorithm’s decisions. It might ignore the other features that could help it find the best solution.
Feature scaling makes all the features play fair. It’s like stretching or squishing the ground, so all parts become equally important. This helps the algorithm take balanced steps and find the best solution more quickly.
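Here is a rough sketch of this effect; the data, coefficients, and learning rates below are made up purely for illustration. It runs plain gradient descent on two features with very different scales, then on the same features after standardization:

import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales, e.g. years of experience and salary
X = np.column_stack([rng.uniform(1, 10, 200),
                     rng.uniform(50_000, 100_000, 200)])
y = X @ np.array([2.0, 0.0001]) + rng.normal(0, 0.5, 200)

def mse_after_gradient_descent(X, y, lr, steps):
    # Plain batch gradient descent on mean squared error, with a bias column
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = 2 / len(y) * Xb.T @ (Xb @ w - y)
        w -= lr * grad
    return np.mean((Xb @ w - y) ** 2)

# On the raw features the step size must be tiny to avoid divergence,
# so after 1,000 steps the loss is still far from its minimum
print(mse_after_gradient_descent(X, y, lr=1e-10, steps=1000))

# After standardization a much larger step size is stable, and the loss
# drops close to the noise level within the same number of steps
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(mse_after_gradient_descent(X_std, y, lr=0.1, steps=1000))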
Common Feature Scaling Techniques
Min-Max Scaling (Normalization)
Min-Max Scaling, also known as normalization, is a method that rescales feature values to a common range, usually between 0 and 1. For example, take a dataset of people's heights ranging from 150cm to 190cm. With Min-Max scaling, 150cm becomes 0 and 190cm becomes 1, so all values lie in a uniform range, making them comparable and helping algorithms work better. Have a look at the figure below.
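Concretely, Min-Max scaling maps each feature value \(x\) to
\(x' = \frac{x - x_{min}}{x_{max} - x_{min}}\)
so in the height example, a height of 170cm maps to \((170 - 150) / (190 - 150) = 0.5\).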
Standardization (Z-Score Scaling)
Standardization, also called Z-score normalization, makes your data have a mean of 0 and a standard deviation of 1. It’s like repositioning your data on a common scale. Think of it as adjusting temperatures from Fahrenheit to Celsius. Even though the numbers change, the relationship between temperatures remains the same. Similarly, standardization rescales your data while maintaining the relationships between feature values.
Here is an example:
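Take any feature value \(x\); standardization converts it to a z-score
\(z = \frac{x - \mu}{\sigma}\)
where \(\mu\) is the feature's mean and \(\sigma\) is its standard deviation. A value equal to the mean becomes 0, and a value one standard deviation above the mean becomes 1.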
Feature Scaling in Python
This is the Python code I used to generate the figures in the previous section.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# First, generate some sample data with numpy
sample_data = np.random.rand(50) * np.random.randint(10000, 15000)
not_scaled_data = pd.DataFrame({'Salary': sample_data})
# Then, we need to initialize a scaler; choose one of the two
scaler = MinMaxScaler()
# scaler = StandardScaler()  # uncomment to use standardization instead
# Now, we scale the data with the chosen scaler
scaled_data = scaler.fit_transform(not_scaled_data[['Salary']])
scaled_data = pd.DataFrame(scaled_data, columns=['Salary'])
# Finally, plot the original and scaled data side by side
fig, ax = plt.subplots(1, 2, figsize=(20, 5))
ax[0].tick_params(axis='both', which='major', labelsize=15)
ax[1].tick_params(axis='both', which='major', labelsize=15)
sns.scatterplot(x=not_scaled_data.index, y=not_scaled_data['Salary'], ax=ax[0])
sns.scatterplot(x=scaled_data.index, y=scaled_data['Salary'], ax=ax[1])
plt.show()
Guidelines
- Always perform feature scaling when dealing with distance-based algorithms like KNN or SVM.
- Algorithms like decision trees and random forests are not sensitive to feature scaling.
- Feature scaling should be applied consistently to both training and test data: fit the scaler on the training data only, then apply the same transformation to the test data (see the sketch after this list).
- Avoid scaling binary features or categorical variables transformed through one-hot encoding.
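A minimal sketch of that training/test guideline, reusing the not_scaled_data DataFrame from the code above (the split itself is illustrative):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, test = train_test_split(not_scaled_data, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Fit on the training data only, so no information leaks from the test set
train_scaled = scaler.fit_transform(train[['Salary']])
# Reuse the same fitted scaler (same mean and standard deviation) on the test data
test_scaled = scaler.transform(test[['Salary']])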
Some algorithms, like neural networks, might implicitly handle varying feature scales due to their architecture and optimization techniques. However, it’s recommended to experiment and observe the model’s performance with and without scaling.
In essence, feature scaling ensures that features are on a similar scale, preventing issues caused by varying magnitudes. It’s a crucial step in preparing data for machine learning, enabling models to learn more effectively and produce accurate predictions.