In this chapter, how to measure performance of out-of-memory datasets will be explained.
In previous sections, we have discussed about various methods to validate the performance of our NN, but the methods we have discussed, are ones that deals with the datasets that fit in the memory.
Here, the question arises what about out-of-memory datasets, because in production scenario, we need a lot of data to train NN. In this section, we are going to discuss how to measure performance when working with minibatch sources and manual minibatch loop.
While working with out-of-memory dataset, i.e. minibatch sources, we need slightly different setup for loss, as well as metric, than the setup we used while working with small datasets i.e. in-memory datasets. First, we will see how to set up a way to feed data to the trainer of NN model.
Following are the implementation steps−
Step 1 − First, from cntk.io module import the components for creating the minibatch source as follows−
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITY_REPEAT
Step 2 − Next, create a new function named say create_datasource. This function will have two parameters namely filename and limit, with a default value of INFINITELY_REPEAT.
def create_datasource(filename, limit =INFINITELY_REPEAT)
Step 3 − Now, within the function, by using StreamDef class crate a stream definition for the labels that reads from the labels field that has three features. We also need to set is_sparse to False as follows−
labels_stream = StreamDef(field=’labels’, shape=3, is_sparse=False)
Step 4 − Next, create to read the features filed from the input file, create another instance of StreamDef as follows.
feature_stream = StreamDef(field=’features’, shape=4, is_sparse=False)
Step 5 − Now, initialise the CTFDeserializer instance class. Specify the filename and streams that we need to deserialize as follows −
deserializer = CTFDeserializer(filename, StreamDefs(labels= label_stream, features=features_stream)
Step 6 − Next, we need to create instance of minisourceBatch by using deserializer as follows −
Minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=limit) return minibatch_source
Step 7 − At last, we need to provide training and testing source, which we created in previous sections also. We are using iris flower dataset.
training_source = create_datasource(‘Iris_train.ctf’) test_source = create_datasource(‘Iris_test.ctf’, limit=1)
Once you create MinibatchSource instance, we need to train it. We can use the same training logic, as used when we worked with small in-memory datasets. Here, we will use MinibatchSource instance, as the input for the train method on loss function as follows −
Following are the implementation steps−
Step 1 − In order to log the output of the training session, first import the ProgressPrinter from cntk.logging module as follows −
from cntk.logging import ProgressPrinter
Step 2 − Next, to set up the training session, import the trainer and training_session from cntk.train module as follows−
from cntk.train import Trainer, training_session
Step 3 − Now, we need to define some set of constants like minibatch_size, samples_per_epoch and num_epochs as follows−
minbatch_size = 16 samples_per_epoch = 150 num_epochs = 30 max_samples = samples_per_epoch * num_epochs
Step 4 − Next, in order to know how to read data during training in CNTK, we need to define a mapping between the input variable for the network and the streams in the minibatch source.
input_map = { features: training_source.streams.features, labels: training_source.streams.labels }
Step 5 − Next to log the output of the training process, initialize the progress_printer variable with a new ProgressPrinter instance. Also, initialize the trainer and provide it with the model as follows−
progress_writer = ProgressPrinter(0) trainer: training_source.streams.labels
Step 6 − At last, to start the training process, we need to invoke the training_session function as follows −
session = training_session(trainer, mb_source=training_source, mb_size=minibatch_size, model_inputs_to_streams=input_map, max_samples=max_samples, test_config=test_config) session.train()
Once we trained the model, we can add validation to this setup by using a TestConfig object and assign it to the test_config keyword argument of the train_session function.
Following are the implementation steps−
Step 1 − First, we need to import the TestConfig class from the module cntk.train as follows−
from cntk.train import TestConfig
Step 2 − Now, we need to create a new instance of the TestConfig with the test_source as input−
Test_config = TestConfig(test_source)
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITY_REPEAT def create_datasource(filename, limit =INFINITELY_REPEAT) labels_stream = StreamDef(field=’labels’, shape=3, is_sparse=False) feature_stream = StreamDef(field=’features’, shape=4, is_sparse=False) deserializer = CTFDeserializer(filename, StreamDefs(labels=label_stream, features=features_stream) Minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=limit) return minibatch_source training_source = create_datasource(‘Iris_train.ctf’) test_source = create_datasource(‘Iris_test.ctf’, limit=1) from cntk.logging import ProgressPrinter from cntk.train import Trainer, training_session minbatch_size = 16 samples_per_epoch = 150 num_epochs = 30 max_samples = samples_per_epoch * num_epochs input_map = { features: training_source.streams.features, labels: training_source.streams.labels } progress_writer = ProgressPrinter(0) trainer: training_source.streams.labels session = training_session(trainer, mb_source=training_source, mb_size=minibatch_size, model_inputs_to_streams=input_map, max_samples=max_samples, test_config=test_config) session.train() from cntk.train import TestConfig Test_config = TestConfig(test_source)
------------------------------------------------------------------- average since average since examples loss last metric last ------------------------------------------------------ Learning rate per minibatch: 0.1 1.57 1.57 0.214 0.214 16 1.38 1.28 0.264 0.289 48 [………] Finished Evaluation [1]: Minibatch[1-1]:metric = 69.65*30;
As we see above, it is easy to measure the performance of our NN model during and after training, by using the metrics when training with regular APIs in CNTK. But, on the other side, things will not be that easy while working with a manual minibatch loop.
Here, we are using the model given below with 4 inputs and 3 outputs from Iris Flower dataset, created in previous sections too−
from cntk import default_options, input_variable from cntk.layers import Dense, Sequential from cntk.ops import log_softmax, relu, sigmoid from cntk.learners import sgd model = Sequential([ Dense(4, activation=sigmoid), Dense(3, activation=log_softmax) ]) features = input_variable(4) labels = input_variable(3) z = model(features)
Next, the loss for the model is defined as the combination of the cross-entropy loss function, and the F-measure metric as used in previous sections. We are going to use the criterion_factory utility, to create this as a CNTK function object as shown below−
import cntk from cntk.losses import cross_entropy_with_softmax, fmeasure @cntk.Function def criterion_factory(outputs, targets): loss = cross_entropy_with_softmax(outputs, targets) metric = fmeasure(outputs, targets, beta=1) return loss, metric loss = criterion_factory(z, labels) learner = sgd(z.parameters, 0.1) label_mapping = { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2 }
Now, as we have defined the loss function, we will see how we can use it in the trainer, to set up a manual training session.
Following are the implementation steps −
Step 1 − First, we need to import the required packages like numpy and pandas to load and preprocess the data.
import pandas as pd import numpy as np
Step 2 − Next, in order to log information during training, import the ProgressPrinter class as follows−
from cntk.logging import ProgressPrinter
Step 3 − Then, we need to import the trainer module from cntk.train module as follows −
from cntk.train import Trainer
Step 4 − Next, create a new instance of ProgressPrinter as follows −
progress_writer = ProgressPrinter(0)
Step 5 − Now, we need to initialise trainer with the parameters the loss, the learner and the progress_writer as follows −
trainer = Trainer(z, loss, learner, progress_writer)
Step 6 −Next, in order to train the model, we will create a loop that will iterate over the dataset thirty times. This will be the outer training loop.
for _ in range(0,30):
Step 7 − Now, we need to load the data from disk using pandas. Then, in order to load the dataset in mini-batches, set the chunksize keyword argument to 16.
input_data = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'], index_col=False, chunksize=16)
Step 8 − Now, create an inner training for loop to iterate over each of the mini-batches.
for df_batch in input_data:
Step 9 − Now inside this loop, read the first four columns using the iloc indexer, as the features to train from and convert them to float32 −
feature_values = df_batch.iloc[:,:4].values feature_values = feature_values.astype(np.float32)
Step 10 − Now, read the last column as the labels to train from, as follows −
label_values = df_batch.iloc[:,-1]
Step 11 − Next, we will use one-hot vectors to convert the label strings to their numeric presentation as follows −
label_values = label_values.map(lambda x: label_mapping[x])
Step 12 − After that, take the numeric presentation of the labels. Next, convert them to a numpy array, so it is easier to work with them as follows −
label_values = label_values.values
Step 13 − Now, we need to create a new numpy array that has the same number of rows as the label values that we have converted.
encoded_labels = np.zeros((label_values.shape[0], 3))
Step 14 − Now, in order to create one-hot encoded labels, select the columns based on the numeric label values.
encoded_labels[np.arange(label_values.shape[0]), label_values] = 1.
Step 15 − At last, we need to invoke the train_minibatch method on the trainer and provide the processed features and labels for the minibatch.
trainer.train_minibatch({features: feature_values, labels: encoded_labels})
from cntk import default_options, input_variable from cntk.layers import Dense, Sequential from cntk.ops import log_softmax, relu, sigmoid from cntk.learners import sgd model = Sequential([ Dense(4, activation=sigmoid), Dense(3, activation=log_softmax) ]) features = input_variable(4) labels = input_variable(3) z = model(features) import cntk from cntk.losses import cross_entropy_with_softmax, fmeasure @cntk.Function def criterion_factory(outputs, targets): loss = cross_entropy_with_softmax(outputs, targets) metric = fmeasure(outputs, targets, beta=1) return loss, metric loss = criterion_factory(z, labels) learner = sgd(z.parameters, 0.1) label_mapping = { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2 } import pandas as pd import numpy as np from cntk.logging import ProgressPrinter from cntk.train import Trainer progress_writer = ProgressPrinter(0) trainer = Trainer(z, loss, learner, progress_writer) for _ in range(0,30): input_data = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'], index_col=False, chunksize=16) for df_batch in input_data: feature_values = df_batch.iloc[:,:4].values feature_values = feature_values.astype(np.float32) label_values = df_batch.iloc[:,-1] label_values = label_values.map(lambda x: label_mapping[x]) label_values = label_values.values encoded_labels = np.zeros((label_values.shape[0], 3)) encoded_labels[np.arange(label_values.shape[0]), label_values] = 1. trainer.train_minibatch({features: feature_values, labels: encoded_labels})
------------------------------------------------------------------- average since average since examples loss last metric last ------------------------------------------------------ Learning rate per minibatch: 0.1 1.45 1.45 -0.189 -0.189 16 1.24 1.13 -0.0382 0.0371 48 [………]
In the above output, we got both the output for the loss and the metric during training. It is because we combined a metric and loss in a function object and used a progress printer in the trainer configuration.
Now, in order to evaluate the model performance, we need to perform same task as with training the model, but this time, we need to use an Evaluator instance to test the model. It is shown in the following Python code−
from cntk import Evaluator evaluator = Evaluator(loss.outputs[1], [progress_writer]) input_data = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'], index_col=False, chunksize=16) for df_batch in input_data: feature_values = df_batch.iloc[:,:4].values feature_values = feature_values.astype(np.float32) label_values = df_batch.iloc[:,-1] label_values = label_values.map(lambda x: label_mapping[x]) label_values = label_values.values encoded_labels = np.zeros((label_values.shape[0], 3)) encoded_labels[np.arange(label_values.shape[0]), label_values] = 1. evaluator.test_minibatch({ features: feature_values, labels: encoded_labels}) evaluator.summarize_test_progress()
Now, we will get the output something like the following−
Finished Evaluation [1]: Minibatch[1-11]:metric = 74.62*143;