A decision tree is a versatile and widely used machine learning algorithm, applied primarily to classification and regression tasks. It is a tree-like structure in which every possible choice splits into a different branch. Decision trees are easy to understand and explain, which makes them valuable for data-driven decision making. In this article, we will explain the underlying logic and see how a decision tree makes its decisions.
Table of Contents
- What is a Decision Tree
- The Structure of Decision Tree
- Entropy
- Information Gain
What is a Decision Tree
A decision tree is a supervised model that uses a tree-like structure to make predictions from input data. It begins at a starting point called the root and branches out for every possible outcome, with each branching step controlled by a condition on a feature. The technique serves both classification and regression tasks, produces easily interpretable models, and finds use across many domains.
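Before digging into the mechanics, here is a minimal sketch of a decision tree in code. It uses scikit-learn, which the rest of this article does not rely on; it is shown only as one common implementation, and the toy data is made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny toy dataset: two numeric features and a binary label.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# criterion="entropy" makes the tree use the same entropy / information-gain
# logic explained later in this article.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["feature_0", "feature_1"]))
print(clf.predict([[1, 0]]))  # -> [1]
```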
The Structure of Decision Tree
Its structure includes a root node, branches, internal nodes, and terminal nodes, creating a hierarchical, tree-shaped arrangement. Here is a quick review of each node type:
Root Node: The root node is the initial point of the decision tree. At this node, a feature is selected to make the first decision.
Internal Node: An internal node (also called decision node) is a decision point to further split the data based on specific feature conditions.
Terminal Node: A terminal node (also called leaf node) represents the decision tree’s endpoints, which are the possible outcomes.
Let’s say we have a table:
Person | Education Level | Income | Marital Status | Loan Approval |
---|---|---|---|---|
A | Graduate | 60000 | Married | Denied |
B | High School | 30000 | Single | Denied |
C | Graduate | 75000 | Single | Approved |
D | Graduate | 70000 | Married | Denied |
E | Graduate | 45000 | Single | Denied |
F | Postgraduate | 80000 | Married | Approved |
In this table, we have data about individuals’ income, education level, and marital status. The target variable is “Loan Approval,” which indicates whether a loan application was approved or denied.
This is how it looks as a decision tree: the root node splits on Education Level, the "High School" branch leads directly to Denied, the "Postgraduate" branch leads directly to Approved, and the "Graduate" branch still needs further splitting on the remaining features.
As you might notice, the decision tree automatically denies those with “High School” education and approves those with “Postgraduate” education. This is because all records/rows with “High School” or “Postgraduate” are consistently assigned to the same outcome. Therefore, there would be no further splitting along that branch of a decision tree. The decision tree recognizes that no additional information or distinctions can be gained from that specific feature since it uniformly leads to a single outcome.
But here comes the question:
- How does a decision tree decide which feature should be the root node?
To understand how, we need to know two important metrics: Entropy and Information Gain.
Entropy
Entropy is a metric that represents the uncertainty in a subset after splitting. In other words, entropy evaluates the effectiveness of a split in a decision tree. For a binary outcome it returns a number between 0 and 1: the lower the value, the less uncertainty. Here is the formula to calculate the entropy:
\(E(X) = -\sum_{i=1}^{n} p_i \log_2(p_i)\)
- \(n\) = number of outcomes
- \(p_i\) = probability of outcome, \(i\)
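To make the formula concrete, here is a minimal sketch of an entropy function in Python. The name `entropy` is just an illustrative choice, not a reference to any particular library; the later sketches in this article reuse it.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * log2(p)  # p is always > 0 here, so log2(p) is defined
    return result

# A pure subset has no uncertainty; a 50/50 split has maximum uncertainty.
print(entropy(["Denied", "Denied", "Denied"]))  # 0.0
print(entropy(["Approved", "Denied"]))          # 1.0
```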
Consider the loan approval example above: the decision tree produces 3 subsets after splitting by education level. We want to know how many "Approved" and "Denied" records are in each subset. Here is the overview:
Subset | Approved | Denied |
---|---|---|
High School | 0 | 1 |
Graduate | 1 | 3 |
Postgraduate | 1 | 0 |
Let’s calculate the entropy of these subsets. Since there are two outcomes, the entropy formula becomes:
\(E(X) = -\sum_{i=1}^{2} p_i \log_2(p_i) = -p_1 \cdot \log_2(p_1) - p_2 \cdot \log_2(p_2)\)
\(E(\text{High School}) = -(\frac{0}{1}) \cdot \log_2(\frac{0}{1}) - (\frac{1}{1}) \cdot \log_2(\frac{1}{1}) = 0\)
\(E(\text{Graduate}) = -(\frac{1}{4}) \cdot \log_2(\frac{1}{4}) - (\frac{3}{4}) \cdot \log_2(\frac{3}{4}) = 0.8113\)
\(E(\text{Postgraduate}) = -(\frac{1}{1}) \cdot \log_2(\frac{1}{1}) - (\frac{0}{1}) \cdot \log_2(\frac{0}{1}) = 0\)
(By convention, \(0 \cdot \log_2(0)\) is treated as 0, since a class that never occurs contributes no uncertainty.)
Notice how the entropy represents the uncertainty? Since the “High School” and “Postgraduate” subsets each contain only one outcome, there is no uncertainty, so the result is 0. On the other hand, “Graduate” contains a mix of both possible outcomes, so uncertainty exists.
Remember that the objective of entropy is to evaluate the effectiveness of a decision point. We can do that by taking a weighted average of the entropy values above, where each subset is weighted by its share of the records:
\(\text{Weighted Average} = (\frac{1}{6}\cdot 0) + (\frac{4}{6} \cdot 0.8113) + (\frac{1}{6} \cdot 0) = 0.541\)
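Using the `entropy` helper sketched earlier, the per-subset entropies and their weighted average can be reproduced like this (the subset contents follow the table above):

```python
# Per-subset entropies for the Education Level split, then their weighted average.
subsets = {
    "High School":  ["Denied"],
    "Graduate":     ["Approved", "Denied", "Denied", "Denied"],
    "Postgraduate": ["Approved"],
}
total = sum(len(labels) for labels in subsets.values())  # 6 records in total

weighted_average = 0.0
for name, labels in subsets.items():
    e = entropy(labels)
    weighted_average += (len(labels) / total) * e
    print(f"E({name}) = {e:.4f}")            # 0.0000, 0.8113, 0.0000

print(f"Weighted average = {weighted_average:.3f}")  # 0.541
```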
Information Gain
Information gain simply means the reduction of uncertainty. It is a metric that tells us how much the entropy would be reduced if we selected a certain feature as the decision point.
\(\text{Information Gain} = E(S) - E(S|X)\)
- \(E(S)\) = the entropy of the entire dataset
- \(E(S|X)\) = the weighted average entropy of the subsets after splitting on feature \(X\)
The entire dataset contains 2 “Approved” and 4 “Denied” records, so \(E(S) = -(\frac{2}{6}) \cdot \log_2(\frac{2}{6}) - (\frac{4}{6}) \cdot \log_2(\frac{4}{6}) = 0.918\). The information gain when we split by education level is therefore:
\(\text{Information Gain} = E(S) - E(S|X) = 0.918 - 0.541 = 0.377\)
This means that if we split the dataset by education level, we reduce the entropy by 0.377. To select the best feature for a decision node, the decision tree computes the information gain for every candidate feature and selects the one with the highest gain.
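Continuing the same sketch, the calculation above can be checked in a couple of lines (this assumes the `entropy` helper and the `weighted_average` value from the earlier snippets):

```python
# Entropy of the whole dataset (2 Approved, 4 Denied), then the information
# gain of the Education Level split.
all_labels = ["Denied", "Denied", "Approved", "Denied", "Denied", "Approved"]

e_s = entropy(all_labels)                      # ~0.918
info_gain = e_s - weighted_average             # weighted_average ~0.541 from above
print(f"E(S) = {e_s:.3f}")                     # 0.918
print(f"Information gain = {info_gain:.3f}")   # 0.377
```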
Summary
A decision tree is an algorithm frequently used in classification and regression problems. It is a tree-structured algorithm that contains 3 different types of nodes: the root node, internal nodes, and terminal nodes. To decide which feature to use as the root, a decision tree does the following (see the sketch after this list):
- Select a feature
- Split the data on the selected feature and calculate the weighted entropy of the resulting subsets
- Calculate the information gain using the entropy from step 2
- Repeat steps 1–3 for the other features
- Select the feature with the highest information gain
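Here is a minimal sketch of that feature-selection loop, reusing the `entropy()` helper from earlier. For simplicity every feature is treated as categorical; a real implementation would also search threshold splits for numeric features such as Income, so that column is left out here.

```python
def information_gain(records, feature, target="Loan Approval"):
    """Entropy reduction from splitting a list of record dicts on `feature`."""
    labels = [r[target] for r in records]
    parent_entropy = entropy(labels)

    # Step 2: split the data on the selected feature.
    groups = {}
    for r in records:
        groups.setdefault(r[feature], []).append(r[target])

    # Weighted average entropy of the resulting subsets.
    weighted = sum(len(g) / len(records) * entropy(g) for g in groups.values())

    # Step 3: information gain = parent entropy minus weighted child entropy.
    return parent_entropy - weighted

records = [
    {"Education Level": "Graduate",     "Marital Status": "Married", "Loan Approval": "Denied"},
    {"Education Level": "High School",  "Marital Status": "Single",  "Loan Approval": "Denied"},
    {"Education Level": "Graduate",     "Marital Status": "Single",  "Loan Approval": "Approved"},
    {"Education Level": "Graduate",     "Marital Status": "Married", "Loan Approval": "Denied"},
    {"Education Level": "Graduate",     "Marital Status": "Single",  "Loan Approval": "Denied"},
    {"Education Level": "Postgraduate", "Marital Status": "Married", "Loan Approval": "Approved"},
]

# Steps 4-5: repeat for every candidate feature and keep the best one.
gains = {f: information_gain(records, f) for f in ["Education Level", "Marital Status"]}
print(gains)                        # Education Level ~0.377, Marital Status 0.0
print(max(gains, key=gains.get))    # Education Level
```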
Decision trees stand as an invaluable asset in the realm of data-driven decision-making. Their ability to unravel complexities, offer clear insights, and guide strategic choices makes them an essential tool for those seeking to navigate the ever-evolving landscape of information with confidence and precision.