This chapter focuses on the polynomial features and pipelining tools in scikit-learn.
Linear models trained on non-linear functions of the data generally retain the fast performance of linear methods while being able to fit a much wider range of data. That is why such models, linear in their weights but trained on non-linear features, are widely used in machine learning.
For example, a simple linear regression can be extended by constructing polynomial features from the input features.
Mathematically, a standard linear regression model for 2-D data looks like this −
$$Y=W_{0}+W_{1}X_{1}+W_{2}X_{2}$$
Now, we can combine the features in second-order polynomials, and our model will look as follows −
$$Y=W_{0}+W_{1}X_{1}+W_{2}X_{2}+W_{3}X_{1}X_{2}+W_{4}X_1^2+W_{5}X_2^2$$
The above is still a linear model, because Y is linear in the weights W0 to W5; only the features are non-linear in X1 and X2. The resulting polynomial regression therefore belongs to the same class of linear models and can be solved with the same techniques.
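To make the "still linear" point concrete, here is a minimal NumPy sketch. The data, the weights in `true_w`, and the helper `second_order_features` are hypothetical and chosen only for illustration. It builds the second-order columns by hand and recovers the weights with ordinary least squares.

```python
import numpy as np

# Hypothetical 2-D inputs and known weights W0..W5 (illustrative values only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
true_w = np.array([1.0, 2.0, -3.0, 0.5, 4.0, -1.5])


def second_order_features(X):
    """Return the columns [1, X1, X2, X1*X2, X1^2, X2^2], matching the equation above."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])


Z = second_order_features(X)
y = Z @ true_w  # noise-free targets generated by the second-order model

# Ordinary least squares on Z: the problem is linear in the weights.
w_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.round(w_hat, 6))
```

Because the targets are noise-free, the estimated weights should match `true_w` up to floating-point error. With noisy data they would only be approximate.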
To do so, scikit-learn provides the PolynomialFeatures transformer in the sklearn.preprocessing module. It transforms an input data matrix into a new data matrix containing all polynomial combinations of the input features up to the given degree.
The following table lists the parameters used by PolynomialFeatures −

Sr.No | Parameter & Description |
---|---|
1 | degree − integer, default = 2. It represents the degree of the polynomial features. |
2 | interaction_only − Boolean, default = False. If set to True, only interaction features are produced, i.e. features that are products of at most degree distinct input features. |
3 | include_bias − Boolean, default = True. It includes a bias column, i.e. the feature in which all polynomial powers are zero. |
4 | order − str in {‘C’, ‘F’}, default = ‘C’. This parameter represents the order of the output array in the dense case. ‘F’ order is faster to compute, but it may slow down subsequent estimators. |
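The effect of interaction_only and include_bias is easiest to see on a single sample. The following is a small sketch, assuming a hypothetical one-row input with features a = 2 and b = 3. It compares the columns produced under each setting.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one hypothetical sample with features a = 2, b = 3

# Default settings produce the columns [1, a, b, a^2, a*b, b^2].
full = PolynomialFeatures(degree=2).fit_transform(X)

# interaction_only=True drops the pure powers a^2 and b^2: [1, a, b, a*b].
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

# include_bias=False drops the leading column of ones: [a, b, a^2, a*b, b^2].
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

print(full)     # [[1. 2. 3. 4. 6. 9.]]
print(inter)    # [[1. 2. 3. 6.]]
print(no_bias)  # [[2. 3. 4. 6. 9.]]
```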
The following table lists the attributes exposed by PolynomialFeatures after fitting −

Sr.No | Attributes & Description |
---|---|
1 | powers_ − array, shape (n_output_features, n_input_features). powers_[i, j] is the exponent of the jth input in the ith output. |
2 | n_input_features_ − int. As the name suggests, it gives the total number of input features. Recent scikit-learn releases expose this as n_features_in_ instead. |
3 | n_output_features_ − int. As the name suggests, it gives the total number of polynomial output features. |
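As an illustration of how powers_ describes the output columns, the sketch below fits the transformer on a small hypothetical array. It then rebuilds the transformed matrix by hand from the exponents. The variable `manual` is introduced here only for the comparison.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)  # hypothetical input: 3 samples, 2 features
poly = PolynomialFeatures(degree=2).fit(X)

print(poly.n_output_features_)  # 6
print(poly.powers_)             # one row of exponents per output feature

# Rebuild each output column as the product over inputs of X[:, j] ** powers_[i, j].
manual = np.prod(X[:, np.newaxis, :] ** poly.powers_[np.newaxis, :, :], axis=2)
print(np.allclose(manual, poly.transform(X)))  # True
```

Each row of powers_ reads as a monomial: for example, the row [1, 1] corresponds to the interaction term formed by multiplying the two input features.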
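The following Python script reshapes an array of 8 values into shape (4, 2) and uses the PolynomialFeatures transformer to expand it into degree-2 polynomial features of shape (4, 6) −

    from sklearn.preprocessing import PolynomialFeatures
    import numpy as np

    # Build a (4, 2) input matrix from the values 0..7.
    Y = np.arange(8).reshape(4, 2)

    # Expand each row [a, b] into [1, a, b, a^2, a*b, b^2].
    poly = PolynomialFeatures(degree=2)
    poly.fit_transform(Y)

This produces the following output −

    array([[ 1.,  0.,  1.,  0.,  0.,  1.],
           [ 1.,  2.,  3.,  4.,  6.,  9.],
           [ 1.,  4.,  5., 16., 20., 25.],
           [ 1.,  6.,  7., 36., 42., 49.]])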
This kind of preprocessing, i.e. transforming an input data matrix into a new data matrix of a given degree, can be streamlined with the Pipeline tools, which chain multiple estimators into one.
The Python script below uses scikit-learn’s Pipeline tools to streamline the preprocessing and fits data generated from an order-3 polynomial −

    # First, import the necessary packages.
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    import numpy as np

    # Next, create a Pipeline object that chains the two steps.
    Stream_model = Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('linear', LinearRegression(fit_intercept=False)),
    ])

    # Create the inputs and generate targets from a known order-3 polynomial.
    x = np.arange(5)
    y = 3 - 2 * x + x ** 2 - x ** 3

    # Fit the pipeline; x is reshaped to a column because scikit-learn expects 2-D input.
    Stream_model = Stream_model.fit(x[:, np.newaxis], y)

    # Inspect the coefficients learned by the linear step.
    Stream_model.named_steps['linear'].coef_

This produces the following output −

    array([ 3., -2.,  1., -1.])
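The above output shows that the linear model trained on polynomial features recovers the exact input polynomial coefficients. Note that fit_intercept=False is used because PolynomialFeatures already adds a bias column, so the constant term 3 appears as the first coefficient.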
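Once fitted, the pipeline can be used like any single estimator. The following sketch rebuilds the same pipeline under the shorter, hypothetical name `model` and predicts on two illustrative unseen inputs in `x_new`. The polynomial expansion is applied automatically inside predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same two-step pipeline as above, under a hypothetical name.
model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression(fit_intercept=False)),
])

x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model.fit(x[:, np.newaxis], y)

# Illustrative unseen inputs; predict() expands them to polynomial features first.
x_new = np.array([[5], [6]])
pred = model.predict(x_new)

# Values of the generating polynomial at the same points, for comparison.
expected = 3 - 2 * x_new.ravel() + x_new.ravel() ** 2 - x_new.ravel() ** 3
print(pred)
print(expected)  # [-107 -189]
```

Since the training data is noise-free and the model has the right degree, the predictions should agree with the generating polynomial up to floating-point error.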