The Spicy Secrets of Decision Trees

Rohan Sharma
6 min read · Nov 25, 2023

With branches reaching for insights, decision trees have become a reliable compass guiding us through the wilderness of data. Let’s peel back the layers of the decision tree algorithm, understand how it works, and see what makes it so attractive in Machine Learning. This blog aims to be a comprehensive guide to the intuition behind decision trees and how they work.

A very basic idea of decision trees

At its core, a decision tree is a greedy, top-down, recursive partitioning approach to the data.

A decision tree essentially performs the straightforward task of partitioning data. In classification problems, this means strategically dividing the data to maximise information gain at each split. As we navigate through the layers of a decision tree, it becomes evident that our own thought process aligns remarkably well with the algorithm’s logic: much like the tree, our minds instinctively analyse information by progressively categorising it on specific criteria, gradually homing in on a final decision. This natural congruence highlights the inherent simplicity and intuitive nature of the decision tree methodology, making it a powerful and relatable tool for data analysis.
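To make this concrete, here is a minimal sketch of the greedy, top-down recursion in plain Python and NumPy. It is an illustration under simplifying assumptions rather than a reference implementation: axis-aligned binary splits, the Gini impurity (discussed below) as the splitting criterion, and a hard depth cap. The names gini, best_split and build_tree are my own.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (feature, threshold) pair that most reduces impurity."""
    best, best_gain = None, 0.0
    parent, n = gini(y), len(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # Size-weighted impurity of the children; the gain is the drop from the parent.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best_gain:
                best_gain, best = parent - child, (f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively partition the data, stopping at purity, max_depth, or no useful split."""
    split = None if (depth == max_depth or gini(y) == 0.0) else best_split(X, y)
    if split is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # predict the majority class
    f, t = split
    mask = X[:, f] <= t
    return {"feature": f, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}
```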

How do we choose the split that maximises our information gain?

Encountering concepts like misclassification, Gini loss, and cross-entropy loss can be initially perplexing. In my early encounters, these ideas left me scratching my head, struggling to grasp their essence. However, with a deeper understanding, I’ve come to appreciate their intuitive nature. What once seemed bewildering now unfolds as a logical and comprehensible framework.

Misclassification Loss

The essence lies in counting how many misclassified entries exist within a partition, assuming we default to predicting the class with the highest probability in that partition. The objective is clear: orchestrate a split that minimises misclassifications. An efficient split is one that significantly reduces the count of misclassified entries. In symbols, the misclassification loss of a partition is simply one minus the proportion of its majority class, and a split is scored by the size-weighted loss of its children.
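A small numeric sketch of that definition (the helper names are my own, not part of any library):

```python
import numpy as np

def misclassification_loss(y):
    """Fraction of entries misclassified if we always predict the partition's
    majority class: 1 minus the proportion of the majority class."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def split_loss(y_left, y_right):
    """Loss of a candidate split: the children's losses weighted by their sizes."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * misclassification_loss(y_left)
            + len(y_right) * misclassification_loss(y_right)) / n

y = np.array([0] * 80 + [1] * 20)
print(misclassification_loss(y))                           # 0.2: 20% of entries are misclassified
print(split_loss(np.array([0] * 80), np.array([1] * 20)))  # 0.0: a perfect split removes all errors
```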

The misclassification loss, while insightful, grapples with a limitation: many genuinely useful splits register little or no gain. This happens because the loss is a piecewise-linear function of the class proportion. In particular, whenever both children end up predicting the same class as the parent, the weighted loss of the children equals the parent’s loss exactly, so the measured gain is zero even if the split has made the partitions more homogeneous. To surmount this challenge, we turn our attention to alternative loss functions: Cross-Entropy and Gini.

Cross-Entropy and the Gini Index

We overcome this limitation by using loss functions that are strictly concave in the class proportion: the Gini index (a quadratic curve) and cross-entropy. Unlike the piecewise-linear misclassification curve, where child proportions falling on the same linear segment as the parent yield no measurable gain, a strictly concave curve rewards any split that moves the child proportions away from the parent’s. This gives the algorithm a more reliable signal, enhancing its decisiveness in navigating complex decision boundaries.
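A classic numeric illustration of the difference (a variant of the 400-per-class example used in The Elements of Statistical Learning): two candidate splits of a perfectly balanced parent that are indistinguishable under misclassification loss, yet clearly ranked by Gini. The code below is a small self-contained check with helper names of my own choosing.

```python
import numpy as np

def misclassification(counts):
    p = counts / counts.sum()
    return 1.0 - p.max()

def gini(counts):
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(loss, children):
    """Size-weighted impurity of the child partitions."""
    n = sum(c.sum() for c in children)
    return sum(c.sum() / n * loss(c) for c in children)

parent = np.array([400, 400])                           # 400 examples of each class
split_a = [np.array([300, 100]), np.array([100, 300])]  # both children stay impure
split_b = [np.array([200, 400]), np.array([200, 0])]    # second child becomes pure

for name, loss in [("misclassification", misclassification), ("gini", gini)]:
    gain_a = loss(parent) - split_impurity(loss, split_a)
    gain_b = loss(parent) - split_impurity(loss, split_b)
    print(f"{name:18s} gain A = {gain_a:.3f}, gain B = {gain_b:.3f}")

# misclassification  gain A = 0.250, gain B = 0.250  -> cannot tell the splits apart
# gini               gain A = 0.125, gain B = 0.167  -> prefers B, which carves out a pure node
```

Cross-entropy ranks the two splits the same way as Gini; strict concavity is the property that matters here.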

What about Regression Trees?

Regression trees, a specialised form of decision trees tailored for regression problems, employ a similar partitioning methodology with a distinct twist in the loss function. Unlike their classification counterparts, regression trees estimate the value for each partition by calculating the mean within that specific segment. Once the mean is determined, the loss is computed as the squared distance of the values within the partition from this mean. The algorithm then chooses the splits that minimise these squared losses, giving an effective strategy for predicting continuous outcomes in regression scenarios.
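A sketch of the regression version of the split search, under the same simplifying assumptions as before (a single numeric feature, squared loss around the partition mean); the names are again my own.

```python
import numpy as np

def squared_loss(y):
    """Sum of squared distances from the partition's mean prediction."""
    return np.sum((y - y.mean()) ** 2)

def best_regression_split(x, y):
    """Pick the threshold on one feature that minimises the total squared
    loss of the two resulting partitions."""
    best_t, best_loss = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        loss = squared_loss(left) + squared_loss(right)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_regression_split(x, y))  # t = 3.0 with total squared loss of about 0.10,
                                    # separating the low-mean cluster from the high-mean one
```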

Overfitting in Decision Trees

As evident from the partitioning process described earlier, there is a risk of the algorithm splitting far too finely. If left unchecked, this incessant partitioning can continue until each unique value, or even each individual data point, occupies its own partition. The consequence? A model exhibiting excessively high variance, inevitably succumbing to overfitting and performing poorly on unseen data.

To avert this pitfall, it becomes imperative to introduce regularization techniques to decision trees. Consider the following strategies (a scikit-learn sketch of these knobs follows the list):

  1. Minimum Leaf Size: Halt further partitioning when a certain minimum leaf size is reached. This instills a practical constraint, preventing the model from overly fine-tuning to individual data points.
  2. Maximum Depth: Impose a predefined maximum depth for the decision tree. This ensures a cap on the number of hierarchical layers, mitigating the risk of overly complex and overfit models.
  3. Maximum Number of Nodes: Restrict the tree by defining the maximum allowable number of nodes. By controlling the growth of the tree, this constraint promotes a balance between model complexity and generalisation.
  4. Pruning based on Loss Decrease: Introduce a criterion for pruning the tree when the decrease in loss falls below a certain threshold. While this approach has its merits, applying it while growing the tree can prove counterintuitive: it may halt the algorithm prematurely and miss beneficial splits. Not every split contributes significantly on its own; some splits that look suboptimal individually set the stage for subsequent divisions that lead to much better overall partitioning. Striking the right balance between growth and regularization is key to harnessing the predictive power of decision trees effectively. Hence we opt for post-pruning in most cases: grow the tree fully, then cut back the branches that do not pay for themselves.
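For readers using scikit-learn, the four strategies map roughly onto constructor arguments of DecisionTreeClassifier. The specific values below are arbitrary, chosen only to show which knob is which, and X_train / y_train in the pruning comment are placeholders.

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds (roughly) to one strategy from the list above.
tree = DecisionTreeClassifier(
    min_samples_leaf=20,         # 1. minimum leaf size
    max_depth=6,                 # 2. maximum depth
    max_leaf_nodes=50,           # 3. cap on tree size (scikit-learn limits leaf nodes rather than all nodes)
    min_impurity_decrease=1e-3,  # 4. pre-pruning: ignore splits whose loss decrease is below the threshold
)

# Post-pruning is done separately via cost-complexity pruning:
# path = DecisionTreeClassifier().fit(X_train, y_train).cost_complexity_pruning_path(X_train, y_train)
# pruned = DecisionTreeClassifier(ccp_alpha=some_alpha_from_path).fit(X_train, y_train)
```

The ccp_alpha route grows the tree fully and then prunes back the weakest links, which matches the post-pruning preference described above.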

Decision Trees on Linearly Separable Data

Obviously, in this case a linear algorithm will do a better job than a decision tree (just visualise it from the partitioning framework: a tree has to stack up many axis-aligned splits to approximate a single tilted boundary). Consider the simplicity of a linear model’s decision boundary: a straight line in two dimensions, or a hyperplane in higher dimensions. For linearly separable data, this straightforward separation is usually more effective and efficient than the intricacies introduced by decision tree partitioning.

Choosing the right algorithm hinges on the inherent structure of the data. Linear algorithms shine when the relationships within the data can be adequately captured by linear boundaries. Decision trees, with their capacity to model complex, non-linear relationships, are more valuable in situations where such intricate structures exist.

In essence, understanding the characteristics of your data and the underlying patterns is crucial in selecting the most appropriate algorithm. For linearly separable data, a linear model often proves to be a more straightforward and effective choice.

Runtime Complexity for Decision Trees

The time complexity of a decision tree differs between training and testing.

A. Testing Time Complexity (Inference):

  • Complexity: O(d)
  • Explanation: During testing or inference, a prediction walks down the tree from the root to a leaf, making one comparison at each level. The time complexity is therefore linear in the depth of the tree, denoted "d"; a short traversal sketch follows.
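As a sanity check on the O(d) claim, here is how inference could look on the dictionary tree produced by the earlier build_tree sketch: a single walk from root to leaf.

```python
def predict(node, x):
    """Walk from the root to a leaf; the loop runs at most `max_depth` times."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]
```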

B. Training Time Complexity:

  • Complexity: O(f · d · n)
  • Explanation: The training process constructs the tree level by level, and its cost is driven by the number of features (f), the depth of the tree (d), and the number of training examples (n).
  • Breakdown:
  • f (Number of Features): at every node, the algorithm evaluates candidate splits for each feature.
  • d (Depth of the Tree): each training example appears in at most one node per level, so the work done on it scales with the depth.
  • n (Number of Training Examples): each candidate split is scored over the examples that reach that node, so the per-level work is roughly proportional to n.

References

Lecture 10 — Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)

Statistical Learning: More details on Trees

Written by Rohan Sharma

Data Scientist, S&P Global Commodities Insights