Online learning is a subfield of machine learning that allows to scale supervised learning models to massive datasets. The basic idea is that we don’t need to read all the data in memory to fit a model, we only need to read each instance at a time.
In this case, we will show how to implement an online learning algorithm using logistic regression. As in most of supervised learning algorithms, there is a cost function that is minimized. In logistic regression, the cost function is defined as −
$$J(\theta) \: = \: \frac{-1}{m} \left [ \sum_{i = 1}^{m}y^{(i)}log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) log(1 - h_{\theta}(x^{(i)})) \right ]$$
where J(θ) represents the cost function and hθ(x) represents the hypothesis. In the case of logistic regression it is defined with the following formula −
$$h_\theta(x) = \frac{1}{1 + e^{\theta^T x}}$$
Now that we have defined the cost function we need to find an algorithm to minimize it. The simplest algorithm for achieving this is called stochastic gradient descent. The update rule of the algorithm for the weights of the logistic regression model is defined as −
$$\theta_j : = \theta_j - \alpha(h_\theta(x) - y)x$$
There are several implementations of the following algorithm, but the one implemented in the vowpal wabbit library is by far the most developed one. The library allows training of large scale regression models and uses small amounts of RAM. In the creators own words it is described as: "The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research".
We will be working with the titanic dataset from a kaggle competition. The original data can be found in the bda/part3/vw folder. Here, we have two files −
In order to convert the csv format to the vowpal wabbit input format use the csv_to_vowpal_wabbit.py python script. You will obviously need to have python installed for this. Navigate to the bda/part3/vw folder, open the terminal and execute the following command −
python csv_to_vowpal_wabbit.py
Note that for this section, if you are using windows you will need to install a Unix command line, enter the cygwin website for that.
Open the terminal and also in the folder bda/part3/vw and execute the following command −
vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --sgd --l1 0.00000001 --l2 0.0000001 --learning_rate 0.5 --loss_function logistic
Let us break down what each argument of the vw call means.
-f model.vw − means that we are saving the model in the model.vw file for making predictions later
--binary − Reports loss as binary classification with -1,1 labels
--passes 20 − The data is used 20 times to learn the weights
-c − create a cache file
-q ff − Use quadratic features in the f namespace
--sgd − use regular/classic/simple stochastic gradient descent update, i.e., nonadaptive, non-normalized, and non-invariant.
--l1 --l2 − L1 and L2 norm regularization
--learning_rate 0.5 − The learning rate αas defined in the update rule formula
The following code shows the results of running the regression model in the command line. In the results, we get the average log-loss and a small report of the algorithm performance.
-loss_function logistic creating quadratic features for pairs: ff using l1 regularization = 1e-08 using l2 regularization = 1e-07 final_regressor = model.vw Num weight bits = 18 learning rate = 0.5 initial_t = 1 power_t = 0.5 decay_learning_rate = 1 using cache_file = train_titanic.vw.cache ignoring text input in favor of cache input num sources = 1 average since example example current current current loss last counter weight label predict features 0.000000 0.000000 1 1.0 -1.0000 -1.0000 57 0.500000 1.000000 2 2.0 1.0000 -1.0000 57 0.250000 0.000000 4 4.0 1.0000 1.0000 57 0.375000 0.500000 8 8.0 -1.0000 -1.0000 73 0.625000 0.875000 16 16.0 -1.0000 1.0000 73 0.468750 0.312500 32 32.0 -1.0000 -1.0000 57 0.468750 0.468750 64 64.0 -1.0000 1.0000 43 0.375000 0.281250 128 128.0 1.0000 -1.0000 43 0.351562 0.328125 256 256.0 1.0000 -1.0000 43 0.359375 0.367188 512 512.0 -1.0000 1.0000 57 0.274336 0.274336 1024 1024.0 -1.0000 -1.0000 57 h 0.281938 0.289474 2048 2048.0 -1.0000 -1.0000 43 h 0.246696 0.211454 4096 4096.0 -1.0000 -1.0000 43 h 0.218922 0.191209 8192 8192.0 1.0000 1.0000 43 h finished run number of examples per pass = 802 passes used = 11 weighted example sum = 8822 weighted label sum = -2288 average loss = 0.179775 h best constant = -0.530826 best constant’s loss = 0.659128 total feature number = 427878
Now we can use the model.vw we trained to generate predictions with new data.
vw -d test_titanic.vw -t -i model.vw -p predictions.txt
The predictions generated in the previous command are not normalized to fit between the [0, 1] range. In order to do this, we use a sigmoid transformation.
# Read the predictions preds = fread('vw/predictions.txt') # Define the sigmoid function sigmoid = function(x) { 1 / (1 + exp(-x)) } probs = sigmoid(preds[[1]]) # Generate class labels preds = ifelse(probs > 0.5, 1, 0) head(preds) # [1] 0 1 0 0 1 0