Random Forest Regression

Definition of Regression tree:

A regression tree (a decision tree with a numerical target variable) is a machine learning algorithm where clusters of target output variables with corresponding input variables are determined following optimal classification criteria derived from the training data.

This is the common definition for Regression tree, other definitions can be discussed in the article

Short introduction

A simple regression tree algorithm starts with defining an optimal splitting value separating input and corresponding output training data into two clusters (subdomains of input training data and corresponding output data). In the following step each cluster is split again, till at the end of each tree branch a cluster is left that contains only a small more or less homogeneous subdomain of the output data (a so-called leaf). The training of the regression tree on known output data is based on finding the most efficient split for each input variable at each level of the regression tree. This consists of determining for each split the average value of the target variables in the two clusters and searching the optimal split that minimizes the total sum of the squared residuals of the target training variables with respect to the average value in each of the two clusters. The simple regression tree tends to overfit the training data; the noise (observation inaccuracies, random fluctuations) in the training data is reflected in the regression tree. Noise can be minimized by considering a multitude of regression trees, each trained on a subset of the training data and a subset of the input variables. The subsets are chosen at random and some of the training data may appear more than once in the subsets. This multitude of regression trees is called a random forest. The final output is the mean of the outputs of all the regression trees. By constructing all possible different random forests, and comparing the final outputs with test data, the random forest with the best prediction will be selected as the final choice. Some regression trees make use of a boosting algorithm (Adaboost) to enhance the performance of a weak classifier by attributing different weights to the outcomes of the forest of random regression trees.

Analysis technique	Strengths	Limitations	Application example
Prediction tool based on machine learning from training data	* Handles nonlinear relationships * Does classification and regression * Resilient to data noise and data gaps * Computationally efficient * Low overfitting risk	* Black box, no easy interpretation of results, no probability estimates * Less efficient if many trees * Not reliable outside the range of trained situations transformation	Time series forecasting Pattern recognition from images, e.g. interpretation remote sensing images

For more detailed explanations see:

StatQuest: Regression Trees, Clearly Explained by Josh Starmer

StatQuest: Random Forests Part 1 - Building, Using and Evaluating by Josh Starmer

Wikipedia Machine learning decision tree

Wikipedia Random forest

More

Navigation

Tools

Random Forest Regression

Short introduction

Related articles