In this chapter, we will learn how to work with in-memory and large datasets in CNTK.
There are several ways to feed data into a CNTK trainer; which one to use depends on the size of the dataset and the format of the data. A dataset can be small enough to fit in memory, or too large to load at once.
In this section, we are going to work with in-memory datasets. For this, we will use the following two frameworks: Numpy and Pandas.
Here, we will work with a randomly generated, numpy-based dataset in CNTK. In this example, we are going to simulate data for a binary classification problem. Suppose we have a set of observations with 4 features and want to predict two possible labels with our deep learning model.
For this, we must first generate a set of labels containing a one-hot vector representation of the labels we want to predict. It can be done with the help of the following steps −
Step 1 − Import the numpy package and set the number of samples to generate as follows −
import numpy as np
num_samples = 20000
Step 2 − Next, generate a label mapping by using np.eye function as follows −
label_mapping = np.eye(2)
Step 3 − Now by using np.random.choice function, collect the 20000 random samples as follows −
y = label_mapping[np.random.choice(2,num_samples)].astype(np.float32)
Step 4 − Now at last by using np.random.random function, generate an array of random floating point values as follows −
x = np.random.random(size=(num_samples, 4)).astype(np.float32)
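As a quick sanity check on the steps above, the following sketch regenerates a much smaller version of the dataset and inspects its shape, data type, and label encoding. The fixed seed and the reduced sample count are assumptions made only to keep the check small and repeatable; they are not part of the original example.

```python
import numpy as np

# Assumption: a fixed seed and a small sample count, for a repeatable check only.
np.random.seed(0)
num_samples = 8

label_mapping = np.eye(2)  # row 0 -> [1, 0], row 1 -> [0, 1]
y = label_mapping[np.random.choice(2, num_samples)].astype(np.float32)
x = np.random.random(size=(num_samples, 4)).astype(np.float32)

print(x.shape, x.dtype)  # (8, 4) float32
print(y.shape, y.dtype)  # (8, 2) float32
print(y.sum(axis=1))     # every row sums to 1: exactly one label per sample
```

Each row of y is one row of the identity matrix, which is what makes it a one-hot vector.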
Both arrays are converted to 32-bit floating-point numbers with astype(np.float32), so that they match the format expected by CNTK. Now that the data is ready, let's build the network with the following steps −
Step 5 − Import the Dense and Sequential layer functions from cntk.layers module as follows −
from cntk.layers import Dense, Sequential
Step 6 − Now, we need to import the activation function for the layers in the network, together with the input_variable and default_options functions used in the later steps. Let us import sigmoid as the activation function −
from cntk import input_variable, default_options
from cntk.ops import sigmoid
Step 7 − Now, we need to import the loss function to train the network. Let us import binary_cross_entropy as loss function −
from cntk.losses import binary_cross_entropy
Step 8 − Next, we need to define the default options for the network. Here, we will be providing the sigmoid activation function as a default setting. Also, create the model by using the Sequential layer function as follows −
with default_options(activation=sigmoid):
    model = Sequential([Dense(6), Dense(2)])
Step 9 − Next, initialise an input_variable with 4 input features serving as the input for the network.
features = input_variable(4)
Step 10 − Finally, to complete the network, we need to connect the features variable to the NN.
z = model(features)
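To make the shape of this network concrete, here is a minimal numpy sketch of the arithmetic that two sigmoid-activated dense layers perform. It is only an illustration: the weights are hypothetical random values, and CNTK initialises, stores, and evaluates its own parameters differently.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)

# Hypothetical weights; CNTK creates and manages its own parameters.
w1, b1 = rng.normal(size=(4, 6)), np.zeros(6)  # Dense(6): 4 inputs -> 6 units
w2, b2 = rng.normal(size=(6, 2)), np.zeros(2)  # Dense(2): 6 units -> 2 outputs

batch = rng.random(size=(3, 4))                # three samples, four features each
hidden = sigmoid(batch @ w1 + b1)
output = sigmoid(hidden @ w2 + b2)

print(hidden.shape)  # (3, 6)
print(output.shape)  # (3, 2)
```

Every sample produces two output values between 0 and 1, one per possible label.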
Now that we have a NN, let us train it on the in-memory dataset with the help of the following steps −
Step 11 − To train this NN, first we need to import learner from cntk.learners module. We will import sgd learner as follows −
from cntk.learners import sgd
Step 12 − Along with that, import the ProgressPrinter from the cntk.logging module and create an instance of it.
from cntk.logging import ProgressPrinter
progress_writer = ProgressPrinter(0)
Step 13 − Next, define a new input variable for the labels as follows −
labels = input_variable(2)
Step 14 − In order to train the NN model, next, we need to define a loss using the binary_cross_entropy function. Also, provide the model z and the labels variable.
loss = binary_cross_entropy(z, labels)
Step 15 − Next, initialize the sgd learner as follows −
learner = sgd(z.parameters, lr=0.1)
Step 16 − At last, call the train method on the loss function. Also, provide it with the input data, the sgd learner and the progress_writer −
training_summary = loss.train((x, y), parameter_learners=[learner], callbacks=[progress_writer])
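The idea behind the learner can be sketched in plain numpy. The example below is not CNTK's sgd learner: it is full-batch gradient descent on a single sigmoid layer, with hypothetical random data, and it assumes the loss is the binary cross-entropy summed over the two outputs and averaged over the samples. It only illustrates the update rule "parameter minus learning rate times gradient".

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bce(p, t):
    # Binary cross-entropy, summed over outputs and averaged over samples.
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.random(size=(32, 4))                  # hypothetical features
t = np.eye(2)[rng.integers(0, 2, size=32)]    # hypothetical one-hot labels

w, b = np.zeros((4, 2)), np.zeros(2)
lr = 0.1                                      # same learning rate as in the steps above

before = bce(sigmoid(x @ w + b), t)
for _ in range(100):
    p = sigmoid(x @ w + b)
    grad_logits = (p - t) / len(x)            # gradient of the loss w.r.t. the pre-activations
    w -= lr * (x.T @ grad_logits)             # gradient step on the weights
    b -= lr * grad_logits.sum(axis=0)         # gradient step on the biases
after = bce(sigmoid(x @ w + b), t)

print(after < before)  # True
```

Because the labels here are random, the loss can only drop a little; the point is the direction of the update, not the final value.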
Complete example −
import numpy as np
num_samples = 20000
label_mapping = np.eye(2)
y = label_mapping[np.random.choice(2, num_samples)].astype(np.float32)
x = np.random.random(size=(num_samples, 4)).astype(np.float32)
from cntk.layers import Dense, Sequential
from cntk import input_variable, default_options
from cntk.ops import sigmoid
from cntk.losses import binary_cross_entropy
with default_options(activation=sigmoid):
    model = Sequential([Dense(6), Dense(2)])
features = input_variable(4)
z = model(features)
from cntk.learners import sgd
from cntk.logging import ProgressPrinter
progress_writer = ProgressPrinter(0)
labels = input_variable(2)
loss = binary_cross_entropy(z, labels)
learner = sgd(z.parameters, lr=0.1)
training_summary = loss.train((x, y), parameter_learners=[learner], callbacks=[progress_writer])
Output −
Build info:
   Built time: *** ** **** 21:40:10
   Last modified date: *** *** ** 21:08:46 2019
   Build type: Release
   Build target: CPU-only
   With ASGD: yes
   Math lib: mkl
   Build Branch: HEAD
   Build SHA1:ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
   MPI distribution: Microsoft MPI
   MPI version: 7.0.12437.6
-------------------------------------------------------------------
average   since   average   since   examples
loss      last    metric    last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.52      1.52    0         0       32
1.51      1.51    0         0       96
1.48      1.46    0         0       224
1.45      1.42    0         0       480
1.42      1.4     0         0       992
1.41      1.39    0         0       2016
1.4       1.39    0         0       4064
1.39      1.39    0         0       8160
1.39      1.39    0         0       16352
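The loss in this log settles at about 1.39. One plausible reading, which the source does not state, is that the labels are random and carry no signal, so the best the network can do is predict 0.5 for both outputs. Assuming the loss sums the binary cross-entropy over the two outputs, that prediction gives 2 × ln 2, as the following sketch computes:

```python
import math

def binary_cross_entropy(predictions, targets):
    # Summed over the outputs of a single sample (an assumption about the reduction).
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(predictions, targets))

# An uninformed prediction of 0.5 per output against a one-hot target.
print(round(binary_cross_entropy([0.5, 0.5], [1.0, 0.0]), 3))  # 1.386
```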
Numpy arrays are one of the most basic ways of storing data, but they are limited in what they can contain. For example, a single n-dimensional array can contain data of only one data type. For many real-world cases, however, we need a library that can handle more than one data type in a single dataset.
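The single-data-type limitation is easy to see in a short sketch. The values below are hypothetical iris-like measurements; when a string is mixed in, numpy converts every element to a string.

```python
import numpy as np

measurements = np.array([5.1, 3.5, 1.4, 0.2])
mixed = np.array([5.1, 3.5, 1.4, 0.2, "Iris-setosa"])

print(measurements.dtype)   # float64
print(mixed.dtype.kind)     # U, meaning every element is now a unicode string
print(mixed[0] + mixed[1])  # 5.13.5, string concatenation rather than arithmetic
```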
Pandas is a Python library that makes it easier to work with such datasets. It introduces the concept of a DataFrame (DF) and allows us to load datasets stored on disk in various formats as DFs. For example, we can read data stored as CSV, JSON, Excel, etc.
You can learn Python Pandas library in more detail at https://www.howcodex.com/python_pandas/index.htm.
In this example, we are going to classify iris flowers into three possible species based on four properties. We have created this deep learning model in the previous sections too. The model is as follows −
from cntk.layers import Dense, Sequential
from cntk import input_variable, default_options
from cntk.ops import sigmoid, log_softmax
model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])
features = input_variable(4)
z = model(features)
The above model contains one hidden layer and an output layer with three neurons to match the number of classes we can predict.
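The output layer uses log_softmax as its activation. As a rough numpy illustration of what that function computes (not CNTK's implementation), the sketch below turns three hypothetical raw scores into log-probabilities.

```python
import numpy as np

def log_softmax(v):
    shifted = v - v.max()  # subtract the maximum for numerical stability
    return shifted - np.log(np.exp(shifted).sum())

scores = np.array([2.0, 1.0, 0.1])  # hypothetical raw outputs of the three neurons
log_probs = log_softmax(scores)

print(np.exp(log_probs).sum())      # the exponentiated outputs sum to 1 (up to rounding)
print(int(np.argmax(log_probs)))    # 0, the index of the most likely class
```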
Next, we will use the train method and loss function to train the network. For this, first we must load and preprocess the iris dataset, so that it matches the expected layout and data format for the NN. It can be done with the help of following steps −
Step 1 − Import the numpy and Pandas packages as follows −
import numpy as np
import pandas as pd
Step 2 − Next, use the read_csv function to load the dataset into memory −
df_source = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
Step 3 − Now, we need to create a dictionary that will be mapping the labels in the dataset with their corresponding numeric representation.
label_mapping = {'Iris-Setosa': 0, 'Iris-Versicolor': 1, 'Iris-Virginica': 2}
Step 4 − Now, by using iloc indexer on the DataFrame, select the first four columns as follows −
x = df_source.iloc[:, :4].values
Step 5 − Next, we need to select the species column as the labels for the dataset. It can be done as follows −
y = df_source['species'].values
Step 6 − Now, we need to map the labels in the dataset to their numeric representation by using label_mapping, and then convert them into one-hot encoded arrays by using the one_hot helper function, which is defined further below.
y = np.array([one_hot(label_mapping[v], 3) for v in y])
Step 7 − Next, to use the features and the mapped labels with CNTK, we need to convert them both to floats −
x = x.astype(np.float32)
y = y.astype(np.float32)
The labels are stored in the dataset as strings, and CNTK cannot work with strings. Instead, it needs one-hot encoded vectors representing the labels. For this, we define a function called one_hot, which must be defined before it is called in step 6 −
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
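As an alternative sketch, the same encoding can be produced for all labels at once by indexing an identity matrix, as in the np.eye trick from the Numpy example. The three species values below are hypothetical sample labels, not rows read from iris.csv.

```python
import numpy as np

label_mapping = {'Iris-Setosa': 0, 'Iris-Versicolor': 1, 'Iris-Virginica': 2}
# Hypothetical sample labels, standing in for the species column.
species = np.array(['Iris-Virginica', 'Iris-Setosa', 'Iris-Versicolor'])

indices = np.array([label_mapping[name] for name in species])
encoded = np.eye(3, dtype=np.float32)[indices]  # pick one identity row per label

print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Each row contains a single 1 at the position of the label's numeric index and 0 everywhere else.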
Now that we have the numpy arrays in the correct format, we can use them to train our model with the help of the following steps −
Step 8 − First, we need to import the loss function to train the network. Let us import cross_entropy_with_softmax as the loss function −
from cntk.losses import cross_entropy_with_softmax
Step 9 − To train this NN, we also need to import learner from cntk.learners module. We will import sgd learner as follows −
from cntk.learners import sgd
Step 10 − Along with that import the ProgressPrinter from cntk.logging module as well.
from cntk.logging import ProgressPrinter progress_writer = ProgressPrinter(0)
Step 11 − Next, define a new input variable for the labels as follows −
labels = input_variable(3)
Step 12 − In order to train the NN model, next, we need to define a loss using the cross_entropy_with_softmax function. Also provide the model z and the labels variable.
loss = cross_entropy_with_softmax(z, labels)
Step 13 − Next, initialise the sgd learner as follows −
learner = sgd(z.parameters, 0.1)
Step 14 − At last, call the train method on the loss function. Also, provide it with the input data, the sgd learner and the progress_writer.
training_summary = loss.train((x, y), parameter_learners=[learner], callbacks=[progress_writer], minibatch_size=16, max_epochs=5)
Complete example −
from cntk.layers import Dense, Sequential
from cntk import input_variable, default_options
from cntk.ops import sigmoid, log_softmax
model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])
features = input_variable(4)
z = model(features)
import numpy as np
import pandas as pd
# Helper that turns a numeric label into a one-hot vector; defined before it is used.
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
df_source = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
label_mapping = {'Iris-Setosa': 0, 'Iris-Versicolor': 1, 'Iris-Virginica': 2}
x = df_source.iloc[:, :4].values
y = df_source['species'].values
y = np.array([one_hot(label_mapping[v], 3) for v in y])
x = x.astype(np.float32)
y = y.astype(np.float32)
from cntk.losses import cross_entropy_with_softmax
from cntk.learners import sgd
from cntk.logging import ProgressPrinter
progress_writer = ProgressPrinter(0)
labels = input_variable(3)
loss = cross_entropy_with_softmax(z, labels)
learner = sgd(z.parameters, 0.1)
training_summary = loss.train((x, y), parameter_learners=[learner], callbacks=[progress_writer], minibatch_size=16, max_epochs=5)
Output −
Build info:
   Built time: *** ** **** 21:40:10
   Last modified date: *** *** ** 21:08:46 2019
   Build type: Release
   Build target: CPU-only
   With ASGD: yes
   Math lib: mkl
   Build Branch: HEAD
   Build SHA1:ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
   MPI distribution: Microsoft MPI
   MPI version: 7.0.12437.6
-------------------------------------------------------------------
average   since   average   since   examples
loss      last    metric    last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.1       1.1     0         0       16
0.835     0.704   0         0       32
1.993     1.11    0         0       48
1.14      1.14    0         0       112
[………]
In the previous section, we worked with small in-memory datasets using Numpy and Pandas, but not all datasets are so small. Datasets containing images, videos, or sound samples are especially large. To work with such large datasets, CNTK provides MinibatchSource, a component that can load data in chunks. Some of the features of the MinibatchSource component are as follows −
MinibatchSource can help prevent the NN from overfitting by automatically randomizing the samples read from the data source.
It has a built-in transformation pipeline, which can be used to augment the data.
It loads the data on a background thread, separate from the training process.
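The idea of reading data in chunks and randomizing it along the way can be sketched with a plain Python generator. This is only a conceptual illustration under simplifying assumptions: it is not how MinibatchSource is implemented, it shuffles only within each chunk, and it does not use a background thread. The function name and its parameters are hypothetical.

```python
import random

def minibatches(samples, minibatch_size, chunk_size, seed=0):
    """Yield minibatches from `samples`, shuffling within fixed-size chunks."""
    rng = random.Random(seed)
    pending = []
    for start in range(0, len(samples), chunk_size):
        chunk = list(samples[start:start + chunk_size])  # only one chunk is loaded at a time
        rng.shuffle(chunk)
        pending.extend(chunk)
        while len(pending) >= minibatch_size:
            yield pending[:minibatch_size]
            pending = pending[minibatch_size:]
    if pending:
        yield pending  # final, possibly smaller, minibatch

batches = list(minibatches(list(range(10)), minibatch_size=4, chunk_size=5))
print([len(b) for b in batches])  # [4, 4, 2]
```

Every sample is delivered exactly once per pass, but the order differs from the order on disk.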
In the following sections, we are going to explore how to use a minibatch source with out-of-memory data to work with large datasets. We will also explore how to use it to feed data for training a NN.
In the previous section, we used the iris flower example and worked with a small in-memory dataset using Pandas DataFrames. Here, we will replace the code that uses data from a pandas DF with MinibatchSource. First, we need to create an instance of MinibatchSource with the help of the following steps −
Step 1 − First, import the components for the MinibatchSource from the cntk.io module as follows −
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITY_REPEAT
Step 2 − Now, by using the StreamDef class, create a stream definition for the labels.
labels_stream = StreamDef(field='labels', shape=3, is_sparse=False)
Step 3 − Next, to read the features field from the input file, create another instance of StreamDef as follows.
features_stream = StreamDef(field='features', shape=4, is_sparse=False)
Step 4 − Now, we need to provide the iris.ctf file as input and initialise the deserializer as follows −
deserializer = CTFDeserializer('iris.ctf', StreamDefs(labels=labels_stream, features=features_stream))
Step 5 − At last, we need to create an instance of MinibatchSource by using the deserializer as follows −
minibatch_source = MinibatchSource(deserializer, randomize=True)
Complete example −
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITY_REPEAT
labels_stream = StreamDef(field='labels', shape=3, is_sparse=False)
features_stream = StreamDef(field='features', shape=4, is_sparse=False)
deserializer = CTFDeserializer('iris.ctf', StreamDefs(labels=labels_stream, features=features_stream))
minibatch_source = MinibatchSource(deserializer, randomize=True)
As you have seen above, we are reading the data from the 'iris.ctf' file. This file uses a format called CNTK Text Format (CTF). We must create a CTF file so that the MinibatchSource instance we created above has data to read. Let us see how we can create a CTF file.
Step 1 − First, we need to import the pandas and numpy packages as follows −
import pandas as pd
import numpy as np
Step 2 − Next, we need to load our data file, i.e. iris.csv into memory. Then, store it in the df_source variable.
df_source = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
Step 3 − Now, by using the iloc indexer, take the content of the first four columns as the features. Also, use the data from the species column as the labels, as follows −
features = df_source.iloc[:, :4].values
labels = df_source['species'].values
Step 4 − Next, we need to create a mapping between the label name and its numeric representation. It can be done by creating label_mapping as follows −
label_mapping = {'Iris-Setosa': 0, 'Iris-Versicolor': 1, 'Iris-Virginica': 2}
Step 5 − Now, convert the labels to a set of one-hot encoded vectors as follows −
labels = [one_hot(label_mapping[v], 3) for v in labels]
As we did before, we need a utility function called one_hot to encode the labels. It must be defined before it is called in step 5, and it can be written as follows −
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
Now that we have loaded and preprocessed the data, it's time to store it on disk in the CTF file format. In a CTF file, each field on a line starts with a | followed by the name of the stream. We can write the file with the help of the following Python code −
with open('iris.ctf', 'w') as output_file:
    for index in range(0, features.shape[0]):
        feature_values = ' '.join([str(x) for x in np.nditer(features[index])])
        label_values = ' '.join([str(x) for x in np.nditer(labels[index])])
        output_file.write('|features {} |labels {}\n'.format(feature_values, label_values))
Complete example −
import pandas as pd
import numpy as np
# Helper that turns a numeric label into a one-hot vector; defined before it is used.
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
df_source = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
features = df_source.iloc[:, :4].values
labels = df_source['species'].values
label_mapping = {'Iris-Setosa': 0, 'Iris-Versicolor': 1, 'Iris-Virginica': 2}
labels = [one_hot(label_mapping[v], 3) for v in labels]
with open('iris.ctf', 'w') as output_file:
    for index in range(0, features.shape[0]):
        feature_values = ' '.join([str(x) for x in np.nditer(features[index])])
        label_values = ' '.join([str(x) for x in np.nditer(labels[index])])
        output_file.write('|features {} |labels {}\n'.format(feature_values, label_values))
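To show what a single line of such a file looks like, the sketch below formats one sample as a CTF line and parses it back. Both helper functions are hypothetical and handle only dense fields named with a leading |; they are not CNTK's reader and ignore the rest of the format.

```python
def to_ctf_line(features, labels):
    # Each field is written as "|name value value ...".
    feature_values = ' '.join(str(v) for v in features)
    label_values = ' '.join(str(v) for v in labels)
    return '|features {} |labels {}'.format(feature_values, label_values)

def parse_ctf_line(line):
    # Split on "|" and read "name value value ..." from each non-empty part.
    fields = {}
    for part in line.split('|')[1:]:
        name, *values = part.split()
        fields[name] = [float(v) for v in values]
    return fields

line = to_ctf_line([5.1, 3.5, 1.4, 0.2], [1.0, 0.0, 0.0])
print(line)  # |features 5.1 3.5 1.4 0.2 |labels 1.0 0.0 0.0
print(parse_ctf_line(line))
```

The stream names in the line, features and labels, are the same names that the field arguments of the StreamDef instances refer to.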
Once we have created the MinibatchSource instance, we can use it to train the model. We can use the same training logic as we used with small in-memory datasets. Here, we will pass the MinibatchSource instance as the input to the train method on the loss function as follows −
Step 1 − In order to log the output of the training session, first import the ProgressPrinter from cntk.logging module as follows −
from cntk.logging import ProgressPrinter
Step 2 − Next, to set up the training session, import the Trainer and training_session from the cntk.train module as follows −
from cntk.train import Trainer, training_session
Step 3 − Now, we need to define a set of constants, namely minibatch_size, samples_per_epoch and num_epochs, as follows −
minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
Step 4 − Next, so that CNTK knows how to read data during training, we need to define a mapping between the input variables of the network and the streams in the minibatch source.
input_map = {
    features: minibatch_source.streams.features,
    labels: minibatch_source.streams.labels
}
Step 5 − Next, to log the output of the training process, initialise the progress_writer variable with a new ProgressPrinter instance as follows −
progress_writer = ProgressPrinter(0)
Step 6 − At last, we need to invoke the train method on the loss as follows −
train_history = loss.train(
    minibatch_source,
    parameter_learners=[learner],
    model_inputs_to_streams=input_map,
    callbacks=[progress_writer],
    epoch_size=samples_per_epoch,
    max_epochs=num_epochs)
Complete example −
# features, labels, loss, learner and minibatch_source come from the model and MinibatchSource listings above.
from cntk.logging import ProgressPrinter
from cntk.train import Trainer, training_session
minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
input_map = {
    features: minibatch_source.streams.features,
    labels: minibatch_source.streams.labels
}
progress_writer = ProgressPrinter(0)
train_history = loss.train(
    minibatch_source,
    parameter_learners=[learner],
    model_inputs_to_streams=input_map,
    callbacks=[progress_writer],
    epoch_size=samples_per_epoch,
    max_epochs=num_epochs)
Output −
-------------------------------------------------------------------
average   since   average   since   examples
loss      last    metric    last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.21      1.21    0         0       32
1.15      0.12    0         0       96
[………]
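The constants defined for the training session determine how much data the model sees. The following arithmetic sketch assumes that a minibatch size of 16 is actually used for every minibatch; the train call shown above does not pass minibatch_size explicitly, so treat that part as an assumption.

```python
import math

minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30

# The last minibatch of an epoch is smaller when the sizes do not divide evenly.
minibatches_per_epoch = math.ceil(samples_per_epoch / minibatch_size)
total_samples = samples_per_epoch * num_epochs

print(minibatches_per_epoch)  # 10
print(total_samples)          # 4500
```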