Logistic regression is a statistical technique commonly used for binary classification tasks, where the goal is to predict the probability of a categorical outcome with two possible classes, such as yes/no, true/false, or 0/1. Despite its name, logistic regression is a classification model, not a regression algorithm.
In logistic regression, we model the relationship between the input variables and the outcome by estimating the probability of that outcome. The goal is to predict the probability of an event, such as the likelihood of a customer making a purchase or a medical condition being present. This makes logistic regression particularly useful in situations where we want to understand the probability of a binary outcome based on a set of input variables.
There are two components in logistic regression:
- Logistic Regression Model
- Logistic Function
In general, the Logistic Regression Model is responsible for estimating the relationship between the input variables and the probability of the binary outcome. It does this by fitting a logistic function to the data. The logistic function, also known as the sigmoid function, transforms a linear combination of the input variables into a value between 0 and 1.
Let’s see the idea in detail.
The Logistic Regression Model
Mathematically, the logistic regression model can be written as:

ln(p(x) / (1 - p(x))) = β0 + β1x1 + β2x2 + ... + βnxn

Where:

- p(x) represents the probability of the event occurring given the input variables x.
- ln denotes the natural logarithm.
- β0, β1, β2, ..., βn are the coefficients associated with the independent variables x1, x2, ..., xn (input variables).
The odds, p(x) / (1 - p(x)), are defined as the ratio of the probability of the event occurring to the probability of it not occurring. In logistic regression, the relationship between the independent variables and the logarithm of the odds is assumed to be linear. However, the logarithm of the odds is not itself interpretable as a probability: it can be any real-valued number, positive or negative. To turn it into a probability that falls within the range [0, 1], the logistic function is applied.
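As a quick worked example with made-up numbers: suppose a one-variable model has coefficients β0 = -1 and β1 = 0.5. For an input x1 = 2, the log-odds are -1 + 0.5 × 2 = 0, so the odds are e^0 = 1 and the probability is 1 / (1 + 1) = 0.5.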
The Logistic Function (Sigmoid Function)
To comprehend logistic regression, it is essential to grasp the concept of the logistic function (also known as the sigmoid function). The logistic function transforms any real-valued input into a value between 0 and 1. It is defined as:

f(x) = 1 / (1 + e^(-x))

Where:

- f(x) is the output value between 0 and 1.
- e denotes the base of the natural logarithm (approximately 2.71828).
- x represents the linear combination of the independent variables.
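To make this concrete, here is a minimal sketch of the sigmoid in Python (the sample inputs are arbitrary):
import numpy as np
def sigmoid(x):
    # Map any real-valued input to the (0, 1) interval
    return 1 / (1 + np.exp(-x))
# Arbitrary log-odds values: very negative maps near 0, zero maps to 0.5, very positive maps near 1
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # [0.00669285 0.5        0.99330715]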
To understand how the logistic regression model combines with the logistic function, here is a more intuitive form of the equation:

p(x) = 1 / (1 + e^(-(β0 + β1x1 + β2x2 + ... + βnxn)))

By plugging the logistic regression model into the logistic function, we transform its output into a value within the range [0, 1]. Plotted against the log-odds, this produces the characteristic S-shaped sigmoid curve.
Applying Logistic Regression
Once the coefficients have been estimated, predictions can be made using the logistic regression model. Given a new set of input variables, the following steps are typically employed:
- Calculate the linear combination of the independent variables and their corresponding coefficients.
- Apply the logistic function to the linear combination to obtain the predicted probability.
- Apply a threshold (e.g., 0.5) to convert the probability into a binary class label.
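Before turning to a library, here is a minimal sketch of these three steps in plain NumPy. The coefficients below are made-up values standing in for estimates from a fitted model:
import numpy as np
# Hypothetical fitted coefficients: intercept beta0 and weights for two inputs
beta0 = -0.5
betas = np.array([1.2, -0.7])
# A new observation with two input variables
x_new = np.array([0.8, 1.5])
# Step 1: linear combination of the inputs and their coefficients (the log-odds)
z = beta0 + np.dot(betas, x_new)
# Step 2: the logistic function converts the log-odds into a probability
p = 1 / (1 + np.exp(-z))
# Step 3: threshold at 0.5 to get a binary class label
label = int(p >= 0.5)
print(z, p, label)  # z ≈ -0.59, p ≈ 0.36, label = 0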
Here’s an example Python code that demonstrates logistic regression using the Scikit-Learn library:
# Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
# Generate some sample data for binary classification
np.random.seed(0)
X = np.random.randn(200, 2)
y = np.repeat([0, 1], 100)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Create a logistic regression model
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Visualize the decision boundary
h = 0.02 # Step size in the mesh
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.title("Logistic Regression Decision Boundary")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Please note that this code assumes you have the necessary dependencies installed (numpy, matplotlib, and scikit-learn). You can install them using pip if you don’t have them already:
pip install numpy matplotlib scikit-learn
This code generates a synthetic dataset for binary classification, splits it into training and test sets, trains a logistic regression model, makes predictions on the test set, and evaluates the model’s performance using a confusion matrix and classification report. It also visualizes the decision boundary of the logistic regression model. Note, however, that the labels here are unrelated to the random features, so the model’s accuracy will hover near chance; feel free to modify the code to fit your specific use case or dataset.
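If you need the predicted probabilities rather than hard labels, for example to use a threshold other than 0.5, scikit-learn exposes them via predict_proba. A short sketch continuing from the model trained above:
# Probability of the positive class for each test sample (column 1 is P(y = 1))
probs = model.predict_proba(X_test)[:, 1]
# Apply a custom threshold, e.g. 0.3, to trade more false positives for fewer false negatives
y_pred_custom = (probs >= 0.3).astype(int)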
Advantages
Logistic regression’s main advantages include simplicity and interpretability, making it easier to understand and convey results. It handles various types of independent variables, including categorical, continuous, and ordinal, and allows interaction terms. Its computational efficiency makes it suitable for large datasets and real-time applications.
Furthermore, logistic regression offers probabilistic predictions, providing probabilities instead of just binary labels. This aids decision-making when balancing false positives and negatives is crucial. Additionally, issues such as outliers and missing data can be mitigated with standard techniques, for example multiple imputation or indicator variables for missing values. Even when its assumptions aren’t perfectly met, it often performs well, provided the predictors carry sufficient signal.
Disadvantages
Logistic regression, despite its advantages, has certain limitations. One key limitation is its assumption of linearity between the independent variables and the logarithm of the odds, which may not hold when the relationships are nonlinear. In such situations, alternatives such as adding polynomial or interaction features, or switching to an inherently nonlinear model, may be more suitable.
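One common remedy stays within logistic regression itself: expand the inputs with polynomial terms so the linear log-odds can bend. A minimal sketch with scikit-learn, using toy data with a deliberately circular class boundary:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Toy data: class 1 lies outside the unit circle, so the boundary is nonlinear
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)
# Degree-2 features (x1^2, x1*x2, x2^2, ...) let a linear model capture the circle
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))  # near 1.0 on this toy data, since the true boundary is quadratic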
Moreover, logistic regression assumes independence of observations, which can be problematic with time series or clustered data. In such cases, alternatives like mixed-effects models or generalized estimating equations (GEE) should be considered.
Additionally, dealing with high-dimensional or imbalanced datasets can pose challenges. Regularization techniques can help with high-dimensional datasets, while oversampling, undersampling, or specialized algorithms like SMOTE can address imbalanced data.
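Within scikit-learn, both issues can be addressed directly on the estimator; a minimal sketch (the parameter values are illustrative):
from sklearn.linear_model import LogisticRegression
# 'balanced' reweights each class inversely to its frequency in the training data
weighted_model = LogisticRegression(class_weight="balanced")
# L2 regularization is on by default; C is its inverse strength (smaller C = stronger penalty)
regularized_model = LogisticRegression(C=0.1)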
Conclusion
Logistic regression utilizes the logistic function and maximum likelihood estimation to model the relationship between independent variables and a binary dependent variable. By understanding the mathematical foundations discussed above, you are now equipped to apply logistic regression to binary classification problems. To go further, check out my post on how to train a machine learning model.