In machine learning (ML), decision trees are models that apply a series of if-else decisions to input data and classify it according to the answers. Regression trees are a specific form of decision tree used to predict numerical outputs rather than class labels. A regression tree is built from a data set, drawn from either historical records or an experiment, that contains input variables called predictors and the output variable the user wants to predict. The finished tree is then used to predict the output for future inputs, as in the sketch below.
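To make the if-else structure concrete, here is a minimal sketch of a hand-built regression tree in Python; the predictors (dosage, age), the thresholds, and the leaf values are hypothetical, chosen only for illustration.

```python
# A minimal sketch of a regression tree as nested if-else decisions.
# The predictors (dosage, age), thresholds, and leaf values are hypothetical.
def predict_effectiveness(dosage: float, age: float) -> float:
    """Walk the tree from the root; each branch tests one predictor."""
    if dosage < 14.5:      # root decision node
        return 4.2         # leaf: mean output of the training rows in this region
    elif age < 50:         # internal decision node
        return 52.8
    else:
        return 20.0

print(predict_effectiveness(dosage=10, age=30))  # 4.2
```

Each leaf stores the average output of the training observations that fall into that region, which is what the tree returns as its numerical prediction.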
Building a regression tree begins with careful analysis of the data set. The output is plotted against each predictor, much as a user would plot data for a regression. Then, for each plot, the residual sum of squares (RSS) is calculated at candidate split points across the range of the predictor. Each candidate point divides the observations into two groups, and the RSS sums the squared differences between each observation and the mean of its group; a higher RSS indicates greater spread around the group averages. The value that produces the minimum RSS for each predictor is selected as that predictor's threshold. The predictor whose best threshold yields the lowest RSS of all becomes the root of the tree, and the tree is then built downward by applying the same RSS comparison to each resulting subset, as in the sketch below.
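This split-selection step can be sketched in a few lines of Python. The helper names (rss, best_split) and the toy data are assumptions for illustration, not a standard library API; the sketch tries a threshold between each pair of adjacent values of a single predictor and keeps the one with the lowest total RSS.

```python
# A minimal sketch of choosing a split threshold by minimizing RSS,
# assuming a single predictor x and a numeric target y (toy data below).
def rss(values):
    """Sum of squared differences from the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Try a threshold between each pair of adjacent x values and
    return the one whose two groups have the lowest combined RSS."""
    pairs = sorted(zip(x, y))
    best = (float("inf"), None)  # (minimum RSS so far, threshold)
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yy for xx, yy in pairs if xx < threshold]
        right = [yy for xx, yy in pairs if xx >= threshold]
        total = rss(left) + rss(right)
        if total < best[0]:
            best = (total, threshold)
    return best

x = [1, 2, 3, 4, 5, 6]
y = [5, 6, 5, 20, 21, 19]
print(best_split(x, y))  # threshold 3.5 cleanly separates the two plateaus
```

A full tree builder would run this search for every predictor, pick the predictor and threshold with the overall minimum RSS, and then recurse on the two resulting subsets.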
When first created, a new regression tree is prone to overfitting. Regression tree pruning optimizes the tree by removing or collapsing decision nodes based on the bias and variance of the resulting predictions. Pruning creates different variations of the tree and keeps the one that performs best on a validation data set. It is often performed in reverse order, meaning the last node generated is the first considered for elimination, and the cost complexity pruning algorithm can guide the process by penalizing tree size, as in the sketch below. Other methods combine multiple tree variants to produce a more accurate and better supported prediction; these include bagging, boosting, and the random forest method.
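As one hedged sketch of cost complexity pruning, scikit-learn's DecisionTreeRegressor exposes a cost_complexity_pruning_path method that yields candidate size penalties (ccp_alpha); larger penalties prune more nodes. The synthetic data below is an assumption standing in for a real data set.

```python
# A sketch of cost complexity pruning, assuming scikit-learn is available
# and using synthetic data in place of a real data set.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha in the path is a penalty on tree size; larger values prune more.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = None, float("-inf")
for alpha in path.ccp_alphas:
    # Guard against tiny negative alphas caused by floating-point error.
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=max(alpha, 0.0))
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)  # R^2 on the held-out validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, validation R^2={best_score:.3f}")
```

The loop retrains a pruned tree for each candidate penalty and keeps the one that scores best on the validation set, which mirrors the validation-based selection described above.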