In this chapter, we will learn about the boosting methods in Sklearn, which enables building an ensemble model.
Boosting methods build ensemble model in an increment way. The main principle is to build the model incrementally by training each base model estimator sequentially. In order to build powerful ensemble, these methods basically combine several week learners which are sequentially trained over multiple iterations of training data. The sklearn.ensemble module is having following two boosting methods.
It is one of the most successful boosting ensemble method whose main key is in the way they give weights to the instances in dataset. That’s why the algorithm needs to pay less attention to the instances while constructing subsequent models.
For creating a AdaBoost classifier, the Scikit-learn module provides sklearn.ensemble.AdaBoostClassifier. While building this classifier, the main parameter this module use is base_estimator. Here, base_estimator is the value of the base estimator from which the boosted ensemble is built. If we choose this parameter’s value to none then, the base estimator would be DecisionTreeClassifier(max_depth=1).
In the following example, we are building a AdaBoost classifier by using sklearn.ensemble.AdaBoostClassifier and also predicting and checking its score.
from sklearn.ensemble import AdaBoostClassifier from sklearn.datasets import make_classification X, y = make_classification(n_samples = 1000, n_features = 10,n_informative = 2, n_redundant = 0,random_state = 0, shuffle = False) ADBclf = AdaBoostClassifier(n_estimators = 100, random_state = 0) ADBclf.fit(X, y)
AdaBoostClassifier(algorithm = 'SAMME.R', base_estimator = None, learning_rate = 1.0, n_estimators = 100, random_state = 0)
Once fitted, we can predict for new values as follows −
print(ADBclf.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
[1]
Now we can check the score as follows −
ADBclf.score(X, y)
0.995
We can also use the sklearn dataset to build classifier using Extra-Tree method. For example, in an example given below, we are using Pima-Indian dataset.
from pandas import read_csv from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score from sklearn.ensemble import AdaBoostClassifier path = r"C:\pima-indians-diabetes.csv" headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(path, names = headernames) array = data.values X = array[:,0:8] Y = array[:,8] seed = 5 kfold = KFold(n_splits = 10, random_state = seed) num_trees = 100 max_features = 5 ADBclf = AdaBoostClassifier(n_estimators = num_trees, max_features = max_features) results = cross_val_score(ADBclf, X, Y, cv = kfold) print(results.mean())
0.7851435406698566
For creating a regressor with Ada Boost method, the Scikit-learn library provides sklearn.ensemble.AdaBoostRegressor. While building regressor, it will use the same parameters as used by sklearn.ensemble.AdaBoostClassifier.
In the following example, we are building a AdaBoost regressor by using sklearn.ensemble.AdaBoostregressor and also predicting for new values by using predict() method.
from sklearn.ensemble import AdaBoostRegressor from sklearn.datasets import make_regression X, y = make_regression(n_features = 10, n_informative = 2,random_state = 0, shuffle = False) ADBregr = RandomForestRegressor(random_state = 0,n_estimators = 100) ADBregr.fit(X, y)
AdaBoostRegressor(base_estimator = None, learning_rate = 1.0, loss = 'linear', n_estimators = 100, random_state = 0)
Once fitted we can predict from regression model as follows −
print(ADBregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))
[85.50955817]
It is also called Gradient Boosted Regression Trees (GRBT). It is basically a generalization of boosting to arbitrary differentiable loss functions. It produces a prediction model in the form of an ensemble of week prediction models. It can be used for the regression and classification problems. Their main advantage lies in the fact that they naturally handle the mixed type data.
For creating a Gradient Tree Boost classifier, the Scikit-learn module provides sklearn.ensemble.GradientBoostingClassifier. While building this classifier, the main parameter this module use is ‘loss’. Here, ‘loss’ is the value of loss function to be optimized. If we choose loss = deviance, it refers to deviance for classification with probabilistic outputs.
On the other hand, if we choose this parameter’s value to exponential then it recovers the AdaBoost algorithm. The parameter n_estimators will control the number of week learners. A hyper-parameter named learning_rate (in the range of (0.0, 1.0]) will control overfitting via shrinkage.
In the following example, we are building a Gradient Boosting classifier by using sklearn.ensemble.GradientBoostingClassifier. We are fitting this classifier with 50 week learners.
from sklearn.datasets import make_hastie_10_2 from sklearn.ensemble import GradientBoostingClassifier X, y = make_hastie_10_2(random_state = 0) X_train, X_test = X[:5000], X[5000:] y_train, y_test = y[:5000], y[5000:] GDBclf = GradientBoostingClassifier(n_estimators = 50, learning_rate = 1.0,max_depth = 1, random_state = 0).fit(X_train, y_train) GDBclf.score(X_test, y_test)
0.8724285714285714
We can also use the sklearn dataset to build classifier using Gradient Boosting Classifier. As in the following example we are using Pima-Indian dataset.
from pandas import read_csv from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score from sklearn.ensemble import GradientBoostingClassifier path = r"C:\pima-indians-diabetes.csv" headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] data = read_csv(path, names = headernames) array = data.values X = array[:,0:8] Y = array[:,8] seed = 5 kfold = KFold(n_splits = 10, random_state = seed) num_trees = 100 max_features = 5 ADBclf = GradientBoostingClassifier(n_estimators = num_trees, max_features = max_features) results = cross_val_score(ADBclf, X, Y, cv = kfold) print(results.mean())
0.7946582356674234
For creating a regressor with Gradient Tree Boost method, the Scikit-learn library provides sklearn.ensemble.GradientBoostingRegressor. It can specify the loss function for regression via the parameter name loss. The default value for loss is ‘ls’.
In the following example, we are building a Gradient Boosting regressor by using sklearn.ensemble.GradientBoostingregressor and also finding the mean squared error by using mean_squared_error() method.
import numpy as np from sklearn.metrics import mean_squared_error from sklearn.datasets import make_friedman1 from sklearn.ensemble import GradientBoostingRegressor X, y = make_friedman1(n_samples = 2000, random_state = 0, noise = 1.0) X_train, X_test = X[:1000], X[1000:] y_train, y_test = y[:1000], y[1000:] GDBreg = GradientBoostingRegressor(n_estimators = 80, learning_rate=0.1, max_depth = 1, random_state = 0, loss = 'ls').fit(X_train, y_train)
Once fitted we can find the mean squared error as follows −
mean_squared_error(y_test, GDBreg.predict(X_test))
5.391246106657164