In this article, we will explain how decision trees work and build a tree by hand.

The code used in this article can be found in this repo.

Definition of the Problem

We will decide whether one should go to work today. In this demo project, we consider the following features.

| feature | possible values |
| --- | --- |
| health | 0: feeling bad, 1: feeling good |
| weather | 0: bad weather, 1: good weather |
| holiday | 0: not holiday, 1: holiday |

For more compact notation, we write $\{0,1\}^3$ for the set of three binary features, each taking the value 0 or 1. In general, $\{0,1\}^d$ denotes $d$ binary features.

Our prediction is also binary, 0 or 1, with 0 indicating staying at home and 1 indicating going to the office.

In the same compact notation, the prediction can be denoted as $\{0,1\}^1$.

How to Describe a Decision Tree

In theory, we would expect a decision tree like the following.

```mermaid
graph TD
    A[health] --> |feeling bad| E[stay home]
    A[health] --> |feeling good| B[weather]
    B --> |bad weather| E
    B --> |good weather| C[holiday]
    C --> |holiday| E
    C --> |not holiday| G[go to the office]
```
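The diagram above is just a nested sequence of if/else checks. Here is the same hand-crafted rule as a minimal Python sketch (the function name `should_go_to_office` is our own, not from the repo):

```python
def should_go_to_office(health: int, weather: int, holiday: int) -> int:
    """Hand-crafted decision rule mirroring the expected tree.

    All inputs are binary: health (1 = feeling good), weather
    (1 = good weather), holiday (1 = holiday). Returns 1 for
    "go to the office" and 0 for "stay home".
    """
    if health == 0:   # feeling bad -> stay home
        return 0
    if weather == 0:  # bad weather -> stay home
        return 0
    if holiday == 1:  # holiday -> stay home
        return 0
    return 1          # feeling good, good weather, not a holiday
```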

It is straightforward to show that a model mapping $\{0,1\}^d$ to $\{0,1\}^1$ needs a tree of depth at most $d+1$ and at most $2^d$ leaves: each of the $d$ features is tested at most once along any path, and each of the $2^d$ input combinations can end in its own leaf. In our simple example, some of the branches are truncated based on our understanding of the problem; in principle, the “feeling bad” branch could also continue to the next level.
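Concretely, for our toy problem with $d = 3$:

$$\text{max depth} = d + 1 = 4, \qquad \text{max leaves} = 2^d = 2^3 = 8.$$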

Data

In practice, however, we do not craft trees from experience; we build them from data.

The following is a sample of the full dataset.

|   | health | weather | holiday | go_to_office |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 |

Full Data
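Since the labels follow the hand-crafted rule, one plausible way to forge such a dataset is to sample random binary feature vectors and label them with that rule. The sketch below is only an illustration (the sample size and random seed are arbitrary); the repo may generate its data differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed

# Draw random binary feature vectors.
n_samples = 100  # hypothetical size; the real dataset may differ
features = rng.integers(0, 2, size=(n_samples, 3))
df = pd.DataFrame(features, columns=["health", "weather", "holiday"])

# Label each row with the hand-crafted rule: go to the office only if
# feeling good, the weather is good, and it is not a holiday.
df["go_to_office"] = (
    (df["health"] == 1) & (df["weather"] == 1) & (df["holiday"] == 0)
).astype(int)

print(df.head())
```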

Build a Tree

The following figure shows a decision tree trained on the dataset.

[Figure: decision tree trained on the full data]

This is a very good result: the learned tree matches our theoretical expectation.

Of course it does; we forged the dataset from that very expectation.
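The training step itself is only a few lines in most libraries. A minimal sketch with scikit-learn, assuming the data lives in a DataFrame `df` like the one sketched above (the repo may be set up differently):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = df[["health", "weather", "holiday"]]
y = df["go_to_office"]

# "entropy" corresponds to the information-gain criterion used by ID3.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned rules as text to compare with the expected tree.
print(export_text(clf, feature_names=["health", "weather", "holiday"]))
```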

Overfitting

Fully grown trees are very likely to overfit the data, since they keep splitting until every leaf is pure. Moreover, the size of a fully grown tree grows exponentially with the number of features, which requires a lot of computational resources.

Applying Occam’s razor, we prefer smaller trees as long as they can explain the data well.

To achieve this, we either limit how the tree grows during training (pre-pruning) or prune the tree after it has been built (post-pruning).

Pruning replaces the subtree rooted at a node with a single leaf whenever certain conditions, typically based on cost estimates, are satisfied.
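For instance, scikit-learn exposes both strategies as hyperparameters of `DecisionTreeClassifier`; the parameter values below are purely illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: limit how the tree grows during training.
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,         # stop splitting below this depth
    min_samples_leaf=5,  # require at least 5 samples in every leaf
)

# Post-pruning: grow the tree, then prune it back with minimal
# cost-complexity pruning, controlled by ccp_alpha.
post_pruned = DecisionTreeClassifier(
    criterion="entropy",
    ccp_alpha=0.01,      # larger alpha -> more aggressive pruning
)
```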

Remarks

The Iterative Dichotomizer 3 algorithm, aka the ID3 algorithm, is one of the most famous decision tree algorithms. The following is a “flowchart” of the algorithm.

```mermaid
graph TD
    Leaf("Prepare samples in node")
    MajorityVote["Calculate majority vote"]
    Assign[Assign label to node]
    Leaf --> MajorityVote --> Assign
    Assign --> Split1[Split on feature 1]
    Assign --> Splitdots["..."]
    Assign --> Splitd[Split on feature d]
    subgraph "split on a subset of features"
        Split1 --> |"Split on feature 1"| B1["Calculate gain of split"]
        Splitdots --> |"..."| Bdots["..."]
        Splitd --> |"Split on feature d"| Bd["Calculate gain of split"]
    end
    B1 --> C["Use the split with the largest gain"]
    Bdots --> C
    Bd --> C
    C --> Left["Prepare samples in left node"]
    C --> Right["Prepare samples in right node"]
    subgraph "left node"
        MajorityVoteL["Calculate majority vote"]
        AssignL(Assign label to left node)
        Left --> MajorityVoteL --> AssignL
    end
    subgraph "right node"
        MajorityVoteR["Calculate majority vote"]
        Right --> MajorityVoteR
        AssignR(Assign label to right node)
        MajorityVoteR --> AssignR
    end
```

To “calculate gain of split”, we use information gain or Gini impurity.
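For concreteness, below is a NumPy-only sketch of both impurity measures and of an ID3-style recursion that follows the flowchart above: compute the majority vote, assign the label, try every remaining feature, and recurse on the split with the largest gain. The function names and the `max_depth` guard are our own additions, not necessarily how the repo implements it.

```python
from collections import Counter

import numpy as np


def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def gini(y):
    """Gini impurity of a label array (could be used in place of entropy)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def information_gain(y, mask):
    """Reduction in entropy when y is split by a boolean mask."""
    n = len(y)
    n_left, n_right = (~mask).sum(), mask.sum()
    if n_left == 0 or n_right == 0:
        return 0.0
    return (
        entropy(y)
        - (n_left / n) * entropy(y[~mask])
        - (n_right / n) * entropy(y[mask])
    )


def id3(X, y, features, depth=0, max_depth=5):
    """Return a nested-dict tree following the ID3 flowchart."""
    # Majority vote gives the label assigned to this node.
    label = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or not features or depth >= max_depth:
        return {"label": label}

    # Try every remaining feature and keep the split with the largest gain.
    gains = {f: information_gain(y, X[:, f] == 1) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] <= 0.0:
        return {"label": label}

    mask = X[:, best] == 1
    rest = [f for f in features if f != best]
    return {
        "label": label,
        "feature": best,
        "left": id3(X[~mask], y[~mask], rest, depth + 1, max_depth),   # feature == 0
        "right": id3(X[mask], y[mask], rest, depth + 1, max_depth),    # feature == 1
    }
```

Calling `id3(X, y, features=[0, 1, 2])` on the forged dataset (with `X` and `y` as NumPy arrays) should recover the same splits as the hand-crafted tree above.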