Keras: Getting Started with AI

A great way to start writing code with AI is to use Keras, an open source easy-to-learn library that can use multiple frameworks.

I wanted to start learning about artificial intelligence (AI) and writing code, but I wasn’t sure where to start. I didn’t want to write in C++ or use a directed acyclic graph (DAG) or some other computer-science-centric language. I wanted to be able to write something straightforward, without complication, that allowed me to define clearly what I wanted to do.

The early years of AI saw the emergence of a wide range of frameworks. Each had its own language for reading the data, defining the model, and solving weights. Almost all of the frameworks I started learning used Python, which exposed my fears of learning a new language. Which framework should I select and why? Was one a clear leader?

I didn’t see any pointers that said, “go this way,” and I didn’t see any articles that explained the tradeoffs, but I did read about Keras, and that is where I started. (Note that I can’t write about the things I do or learn in my day job, but this subject I learned before joining.)

A Brief History of Keras

Keras was developed by François Chollet in 2014 out of necessity for an open source implementation of recurrent neural networks (RNNs) and the long short-term memory (LSTM) models that train them. Keras is Python based, and was first released in March 2015. It developed a following fairly quickly because the models were popular at the time.

Up to version 2.3, Keras supported multiple frameworks (back ends) that included TensorFlow, Microsoft Cognitive Toolkit (commonly referred to as CNTK), Theano, and PlaidML. By version 2.4, Keras only supported TensorFlow. In version 3.0 and subsequent versions, Keras once again supported multiple frameworks, including TensorFlow, PyTorch, and JAX.

Fundamentally, Keras abstracts away the details of the back-end frameworks so that you call functions from Keras to build and train models. The application programming interface (API) is consistent regardless of the back end and, in many cases, is much easier to use than those in a specific framework.

This last point is important. You don’t have to learn TensorFlow or PyTorch or JAX to use the framework effectively. You can learn Keras and use all three frameworks. Then, if you want, you can pick one framework to learn more in depth, having learned the concepts and developed ideas with Keras.

Keras and VGG16

Getting started with Keras is not difficult. Rather than use the MNIST dataset of 60,000 grayscale images as an example, I’ll use a VGG16 model as the example. It is a larger model than those typically trained on MNIST and more useful, so it’s more realistic. Moreover, it is easy to understand because it is a sequential convolutional neural network (CNN).

The VGG16 model was developed as part of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014. Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (Department of Engineering Science, University of Oxford) showcased their model at the contest and introduced it in a paper. You can see that the name of the group, Visual Geometry Group, gave rise to the name (VGG).

The model is focused on object detection and classification tasks. The model comprises:

  • 13 convolutional layers
  • 3 fully connected layers
  • 5 pooling layers

The pooling layers are not counted in the total number of layers, resulting in 16 total layers (hence the name). You can see the layout of the layers in a thorough blog post on VGG16.

I won’t go into the details of the model, but I want to show you an example of Keras-based code that implements a smaller model in the style of VGG16 and trains it on CIFAR-10, a very simple dataset of small images that includes cats and dogs.

System Assumptions

I’ll assume your system is Linux or macOS just because I haven’t tested Windows. I’ll also assume you have at least 16GB of memory and a GPU that is compatible with Keras with at least 4GB of dedicated video memory. I used an NVIDIA GPU.

You should have a recent version of Python installed on your system – I used Python 3.8.10 – and a specific Python environment built for testing, which I’ll discuss. I don’t recommend using your base environment, although I’ve been known to do that. (It’s simple enough to erase Python3 and reinstall it or just force a reinstall.)

Keras Install and Back-End Framework

Before you start writing code, you have to install Keras and a matching back-end framework on your system. The Keras website says you can install it with PyPI:

$ pip install --upgrade keras

This command should install the GPU version of Keras as well. If you like, you can check the Keras version with the commands:

$ python3
>>> import keras
>>> print(keras.__version__)

You also need to install one of the back ends – TensorFlow, PyTorch, or JAX – on your system. From the Keras URL you can go to the page for whichever back end you want to use. I used TensorFlow. The PyPI command to install TensorFlow is simple:

$ pip install tensorflow

This command should also install the GPU version of TensorFlow. Be sure to check the version installed:

$ python3
>>> import tensorflow as tf
>>> print(tf.__version__)

It doesn’t make too much difference, but if you are curious, I used TensorFlow 2.9.2 and Keras 2.9.0. The TensorFlow version is a bit old; 2.16.1 is the latest as of this writing, but I already had it installed. My Keras is also a bit old. I think Keras 3.6 is the latest, and my version isn't even 3.x. I haven’t tested the code with Keras 3.x yet, so your mileage may vary if you go that route.

CIFAR-10

The dataset I use is CIFAR-10. It is a very common dataset in computer vision and classification, but the images are low resolution (32x32). However, the low resolution makes the training go faster.

CIFAR-10 stands for Canadian Institute for Advanced Research with 10 outputs or classes. The 10 image classes are:

  • airplanes
  • cars
  • birds
  • cats
  • deer
  • dogs
  • frogs
  • horses
  • ships
  • trucks
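
The labels in the dataset are integers rather than names, so a small lookup table can help when you inspect results. The following is a minimal sketch that assumes the standard CIFAR-10 label order (0 for airplane through 9 for truck); the names CLASS_NAMES and label_to_name are purely illustrative and not part of Keras:

```python
# CIFAR-10 class names in label-index order (assumed standard ordering).
# "automobile" is the dataset's name for the class called "cars" above.
CLASS_NAMES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def label_to_name(label):
    """Return the class name for an integer label (0-9)."""
    return CLASS_NAMES[label]


print(label_to_name(3))  # cat
```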

Python Modules

To start writing code, open your favorite editor in your working directory. (The Keras dataset routine used later downloads the data for you the first time it is called, so nothing has to be unzipped by hand.) The first few lines in the code should load the needed modules (Listing 1). This code is probably not strictly Pythonic, but I don’t write Pythonic code. I like to solve problems, not worry about proper syntax. If it works, it works.

Listing 1: Importing Modules

import os
 
os.environ["KERAS_BACKEND"] = "tensorflow"
 
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras import datasets, layers, models
from keras import regularizers
from keras.layers import Dense, Dropout, BatchNormalization
 
import numpy as np

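The entire Keras library is imported (import keras), followed by the specific Keras modules, classes, and functions to be used. Note that the KERAS_BACKEND environment variable is set before Keras is imported so that it is already in place when Keras loads.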

Starting the Code

Start the code by reading the data with a built-in Keras function:

(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

This line uses a predefined Keras dataset routine named cifar10. It loads the dataset and, in this case, splits the data into two parts. The first part, (train_images, train_labels), is a tuple of the images and their corresponding labels that will be used to train the model. The labels specify the correct classification of each image.

The second part, (test_images, test_labels), is used to validate the model to check whether it works on non-training data. If you only check the model with the training data, you run the risk of overfitting your model, which will then perform poorly on non-training data.

The next part of the code (Listing 2) standardizes the data and converts the labels to categories. The first two lines divide each pixel value by 255, the maximum value a pixel can have, so that all values fall between 0 and 1. The next three lines set the number of classes and convert the integer labels to one-hot category data that can be used in training and checking the state of the model (validation).

Listing 2: Standardizing Data

# Standardizing (255 is the maximum value a pixel can have)
train_images = train_images / 255
test_images = test_images / 255
 
# One hot encoding the target class (labels)
num_classes = 10
train_labels = keras.utils.to_categorical(train_labels, num_classes)
test_labels = keras.utils.to_categorical(test_labels, num_classes)
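
To make the two steps in Listing 2 concrete, here is a minimal pure-Python sketch of what they do to a few values. The helper names scale_pixels and one_hot are hypothetical stand-ins for illustration only; in the real code, the division acts on whole NumPy arrays and keras.utils.to_categorical does the encoding:

```python
def scale_pixels(pixels, max_value=255):
    """Scale integer pixel values (0-255) to floats in the range 0 to 1."""
    return [p / max_value for p in pixels]


def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    encoded = [0.0] * num_classes
    encoded[label] = 1.0
    return encoded


print(scale_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(3, 10))              # 1.0 in position 3 (cats), 0.0 elsewhere
```

The one-hot vector is the "100% probability of one class, 0% of all others" target that the model output is compared against during training.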

Building the Model

This next step of constructing the VGG16 model is really fun because it shows off the power of Keras. The model is sequential starting with the first layer, then moving to the second layer, and so on, down to the final layer, which is the output. The first two lines of the model are:

model = Sequential()
model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(32,32,3)))

The first line initializes the model with the Sequential class from Keras (keras.models.Sequential). The next line is the first layer of the model, also referred to as the input layer. It is a convolutional layer (Conv2D) with an input_shape of 32x32 (a two-dimensional (2D) image) and three channels (RGB values). The first layer uses 32 convolutional filters (first argument) with a kernel size of 3x3 (second argument).

To decode this a bit more, a convolutional filter is applied to the image. The filter is 3x3 and scans the entire 32x32 image in some fashion such as left to right and top to bottom. Thirty-two of these filters run across the image. The values of the 32 3x3 filters are parameters of the model that are to be adjusted during training. The input to the filter is an image from the collection of training images, and the output is a convolved image that goes through an activation function and is passed to the next layer. In this case, it is the rectified linear unit (ReLU) function (relu).
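
The scanning described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (a single channel, a single filter, no padding, no bias term), not how Keras implements Conv2D; the function names are made up for the example:

```python
def relu(x):
    """Rectified linear unit: negative values become 0."""
    return max(0.0, x)


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and apply ReLU to each sum."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    output = []
    for r in range(rows):              # top to bottom
        row = []
        for c in range(cols):          # left to right
            total = 0.0
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(relu(total))
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [                             # the nine values that training would adjust
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
print(conv2d_valid(image, kernel))     # [[0.0, 0.0], [2.0, 0.0]]
```

Without padding, the 4x4 input shrinks to a 2x2 output, which is the effect the padding argument discussed shortly is meant to avoid.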

As a reminder, the parameters of this layer are the values in each 3x3 convolution filter for all 32 filters. These are the weights that are to be trained (initially they are just random values).
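
If you want to know how many trainable values that adds up to, the count can be worked out by hand. The sketch below assumes the usual Conv2D arrangement in which each filter spans all input channels and carries one bias term (the Keras default); the function name is hypothetical:

```python
def conv2d_param_count(kernel_size, in_channels, filters, use_bias=True):
    """Number of trainable parameters in a 2D convolutional layer."""
    weights = kernel_size * kernel_size * in_channels * filters
    biases = filters if use_bias else 0
    return weights + biases


# First layer: 3x3 filters over 3 input channels (RGB), 32 filters.
print(conv2d_param_count(3, 3, 32))   # 896
# Second layer: 3x3 filters over the 32 channels from the first layer.
print(conv2d_param_count(3, 32, 32))  # 9248
```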

The padding argument controls how the edges of the image are handled. With padding='same', the input is padded evenly to the left/right and up/down so that the output is the same size as the input (i.e., the image has not changed size). Without the padding, the 3x3 filter could not be centered on the pixels around the edges of the input image, and the output would shrink.
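
The effect of the padding choice on the output size can be checked with a short calculation. This is a sketch of the commonly used size formulas for 'same' and 'valid' padding, with a hypothetical helper name, rather than anything taken from the Keras source:

```python
import math


def conv_output_size(input_size, kernel_size, padding="same", stride=1):
    """Output width (or height) of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


print(conv_output_size(32, 3, padding="same"))   # 32
print(conv_output_size(32, 3, padding="valid"))  # 30
```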

Finally, the activation function for this layer, the ReLU function, is very common in image models, but I won’t discuss it in this article.

Next is a normalization layer:

model.add(layers.BatchNormalization())

This function applies a transformation so that the mean of the output is around zero and the standard deviation is around 1.
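
The core of that transformation is easy to write down. The sketch below normalizes a small list of values and is only an approximation of the idea: the actual Keras layer works per channel on each batch, also learns a scale and an offset, and uses running averages when the model is not training. The function name and the sample values are made up:

```python
import math


def batch_normalize(values, epsilon=1e-3):
    """Shift and scale values so the mean is ~0 and the standard deviation ~1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon keeps the division safe when the variance is close to zero.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
print(normalized)
```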

One thing you should note is that Keras takes care of making sure all of the output dimensions from the input layer with the 2D convolution match the inputs to the batch normalization layer.

Next is another convolutional layer that is identical to the first layer, except you don’t need to specify the input size because Keras ensures that the input size of this layer matches the output size from the previous layer:

model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu'))

After this second 2D convolution layer, you have another batch normalization layer followed by a new layer type, MaxPooling2D:

model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))

The max pooling layer has a pool size, 2x2 in this case, that is used to scan the entire input image (left to right, top to bottom). The pool is like a filter. Within this pool, the largest value is used for the output, which downsizes the image from the previous layer. In this case, it reduces the image by a factor of two in each dimension.

The strides argument in the max pooling layer defines how the pool window moves across the image. It is not specified here, so it defaults to the pool size, and the window moves two pixels to the right/left or down/up. Choosing a stride smaller than the pool size allows you to overlap pools (e.g., to retain some knowledge from other pools), and in general the stride allows the pooling layer to downsize the image more or less than the pool size.
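
A small pure-Python sketch shows both behaviors on a 4x4 image. It assumes a single channel and no padding, and the function name is illustrative; as in Keras, the stride falls back to the pool size when it is not given:

```python
def max_pool2d(image, pool_size=2, stride=None):
    """Keep the largest value in each pool window."""
    if stride is None:
        stride = pool_size          # non-overlapping windows by default
    rows = (len(image) - pool_size) // stride + 1
    cols = (len(image[0]) - pool_size) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
print(max_pool2d(image))            # [[4, 5], [3, 7]]
print(max_pool2d(image, stride=1))  # [[4, 3, 5], [4, 7, 7], [3, 7, 7]]
```

With the default stride, the 4x4 image is halved to 2x2. With a stride of 1, the windows overlap and the output only shrinks to 3x3.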

Again, note that Keras takes care of matching the output from the previous layer to the input for the next layer. If there is a fundamental problem, Keras gives an error and stops.

The max pooling layer is followed by a new layer called a Dropout layer:

model.add(layers.Dropout(0.3))

This layer randomly sets inputs to 0 with a specified frequency – in this case 0.3 (30%) – which helps prevent overfitting of the training data. Dropout is only active during training; it is not applied when the model is evaluated on the test dataset.
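
The idea can be sketched in plain Python. The version below assumes the common "inverted dropout" approach, in which the inputs that survive are scaled up by 1/(1 - rate) so the overall sum stays roughly the same, which is also how the Keras layer is documented to behave. The function name and the seeded random generator are only for illustration:

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1/(1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)              # fixed seed so the run is repeatable
inputs = [1.0] * 10
print(dropout(inputs, 0.3, rng))    # roughly 30% of the entries become 0.0
```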

The first six layers of the model (call it a block) are shown in Listing 3. 

Listing 3: First Block

model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(32,32,3)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.3))

The next six layers of the model (Listing 4) are the same except for some small changes:

  • input_shape does not need to be specified in the first 2D convolutional layer.
  • Convolutional filters number 64 instead of 32.
  • The dropout rate has been increased to 0.5 (50%).

Listing 4: Second Block

model.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.5))

A third block (Listing 5) is the same, except for some small parameter changes as before (number of convolutional filters, etc.).

Listing 5: Third Block

model.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.5))

Only one more small block of model layers remains. Recall that the model is processing 2D images, which doesn’t work well when categorizing images: How do you map them? The Keras Flatten function converts a 2D image to a 1D vector:

model.add(layers.Flatten())

This layer reshapes the data from the previous layer, which is a stack of 2D feature maps, to a one-dimensional vector that can be used in “classic” fully connected layers. In this way, you convert an image into something you can ultimately use as output to classify the image.
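
It can be helpful to work out how long that vector is. The sketch below assumes the three blocks from the listings, where 'same' padding leaves the image size alone and each 2x2 max pooling layer halves it; the function name is hypothetical:

```python
def flattened_size(image_size, block_filters):
    """Length of the vector produced by Flatten after the given blocks.

    block_filters holds the number of filters in each block; every block
    is assumed to end in a 2x2 max pooling layer that halves the image.
    """
    size = image_size
    channels = None
    for filters in block_filters:
        size = size // 2            # 32 -> 16 -> 8 -> 4
        channels = filters
    return size * size * channels


print(flattened_size(32, [32, 64, 128]))  # 4 * 4 * 128 = 2048
```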

At this point, you add another layer, which is referred to as a dense layer. Think of the classic model of a neural network, which has layers of neurons in a column (i.e., a 1D vector). This dense layer is going to process the inputs with a ReLU activation function:

model.add(layers.Dense(128, activation='relu'))

This particular model then adds a BatchNormalization layer followed by a Dropout layer:

model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))

Finally, the last layer is a dense layer where the output number of neurons is 10, which matches the number of classes of images:

model.add(layers.Dense(num_classes, activation='softmax'))    # num_classes = 10

Because this is the last layer, you need to use a softmax activation function. You can read more about the softmax function and why softmax is used for classification models.

An important consideration is that when an image is processed through the model, you get numbers for all 10 classes that are the probabilities that the processed image falls into any of those classes. You will never get an image with a 100% (1.0) probability in a specific class and a zero in all other classes. Neural networks generalize; they don’t give you a 100% specific answer. However, if you look at all of the probabilities, you can probably tell to which class the image belongs because it has the largest value.
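
A short sketch shows how raw scores from the last layer become probabilities and how the predicted class is picked. The scores here are invented for illustration, and the softmax function is written out by hand rather than taken from Keras:

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for the 10 classes (index 3 is cats, index 5 is dogs).
scores = [1.2, 0.3, -0.5, 2.8, 0.0, 2.1, -1.0, 0.4, 0.9, -0.2]
probs = softmax(scores)
predicted = probs.index(max(probs))

print(predicted)              # 3
print(round(sum(probs), 6))   # 1.0
```

No class reaches a probability of exactly 1.0, but the largest value still identifies the most likely class.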

If you have an image in which two or more classes have almost the same probability, the model is having a difficult time determining to which class the image belongs. If this happens, you likely need to gather more training data, create a larger model, or both. Remember that neural networks can be fooled by images just like humans.

The entire model should look like Listing 6. Next, you need to compile the model, which builds the code that TensorFlow (or whatever back end you chose) runs for the model.

Listing 6: The Model

model = Sequential()
 
model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu', input_shape=(32,32,3)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.3))
 
model.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(64, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.5))
 
model.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(128, (3,3), padding='same', activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(pool_size=(2,2)))
model.add(layers.Dropout(0.5))
 
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_classes, activation='softmax'))    # num_classes = 10

Compiling the Model

The next step is to compile the model:

model.compile(optimizer='adam', loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])

Keras configures the overall learning process, specifically the optimizer, the loss function, and the metrics. The optimizer to be used is adam, a stochastic gradient descent optimizer that adapts its updates with estimates of the first- and second-order moments of the gradients.
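
To give a feel for what the moment estimates do, here is a sketch of a single adam update for one parameter, following the commonly published form of the algorithm. The default values mirror those documented for the Keras optimizer, but the function itself is a simplified illustration and not the Keras implementation, which may differ in details:

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one adam update to a single parameter.

    m and v are the running first- and second-moment estimates;
    t is the step number, starting at 1.
    """
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)  # close to 0.999: the first step is about one learning rate
```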

For the model to be trained, the difference between the true class of a training image (100% probability of a specific class and 0% probability of all other classes) and the results of the model is collected to produce a loss. For a specific image, the loss is the error between the true probabilities and the model output probabilities, which is used by the optimizer to determine the new parameters. Typically, these parameters are the weights in the model for the next iterations. For this problem the loss function is defined as a categorical crossentropy function.
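
For a single image, the categorical crossentropy can be computed by hand. The sketch below uses the textbook definition with a made-up three-class example; the small epsilon guard against log(0) is an assumption of this sketch, and the function is not the Keras implementation:

```python
import math


def categorical_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Loss between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, epsilon)) for t, p in zip(y_true, y_pred))


y_true = [0.0, 0.0, 1.0]                 # the correct class is index 2
good = categorical_crossentropy(y_true, [0.1, 0.2, 0.7])
poor = categorical_crossentropy(y_true, [0.6, 0.3, 0.1])

print(round(good, 3))  # 0.357
print(round(poor, 3))  # 2.303
```

The more probability the model assigns to the correct class, the smaller the loss, which is what the optimizer tries to achieve.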

Train!

At this point, you can train the model, which means that the training images are passed through the model and the loss function is computed. The gradients of the loss with respect to the model parameters (weights) are then computed in what is called the backpropagation step, and the adam algorithm uses them to compute the changes to the parameters and applies them.

After the parameters have been updated with the new values, the process repeats until the model has converged, which can be when the loss function you defined when the model was compiled stops changing very much, the changes in the parameters are very small, or both. This point is referred to as the “stopping criteria.” This process sounds like a loop, doesn’t it? However, you don’t have to write the details of the training loops, including computing the loss function and the updates to the model parameters. Keras has a method named fit that does all of this for you:

history = model.fit(train_images, train_labels, batch_size=64, epochs=20, validation_data=(test_images, test_labels))

One pass through the entire training dataset is referred to as an epoch, and this training loop only runs 20 epochs. Just change the value for epochs to run as many passes as you want.

Notice that you pass in the validation data to the function. It will run the test images through the model at the end of each epoch to check how well it’s learning. The batch_size=64 tells the fit function to take the images in groups of 64 and run them through the model. It computes the gradients for the adam optimizer from each group of 64 and updates the parameters after each group. The subject of batch size has been the topic of several papers. A small batch size usually means convergence will be slow, but it also uses less memory because not as many images are stored at once. A large batch size means more memory usage and can result in early convergence.
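
The batch size also determines how many parameter updates happen in each epoch. A quick calculation, assuming the 50,000 training images in CIFAR-10 (the function name is illustrative):

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every training sample once."""
    return math.ceil(num_samples / batch_size)


print(steps_per_epoch(50000, 64))   # 782
print(steps_per_epoch(50000, 256))  # 196
```

The value for a batch size of 64 matches the 782 steps per epoch reported in the training output in Listing 7. The last batch is smaller than the others because 50,000 is not an exact multiple of 64.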

Sample Output

I won't bore you with all the output, but I do want to share some of it. Listing 7 shows the output for the first three epochs of a run with epochs set to 100. Note that val_accuracy and val_loss are the accuracy and loss for the validation data, not the training data.

Listing 7: First Three Epochs

================
== TensorFlow ==
================
 
NVIDIA Release 24.08-tf2 (build 106933591)
TensorFlow Version 2.16.1
Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2024 The TensorFlow Authors.  All rights reserved.
 
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
 
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
 
NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.6 driver version 560.35.03 with kernel driver version 535.161.08.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
 
Epoch 1/100
782/782 ━━━━━━━━━━━━━━━━━━━━ 21s 12ms/step - accuracy: 0.3156 - loss: 2.1379 - val_accuracy: 0.4105 - val_loss: 1.7636
Epoch 2/100
782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.5481 - loss: 1.2612 - val_accuracy: 0.5852 - val_loss: 1.2059
Epoch 3/100
139/782 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.6390 - loss: 1.0173

Summary

Keras is a great place to start your journey in AI. When you write your code in Keras, you can choose whichever back-end framework you want. Keras has much more capability, such as checking for convergence and stopping, writing checkpoints during training, summarizing the model, writing callback functions called by the fit function at various times during training, and more. You can also plot the training history from the results in the history variable (i.e., the output from the fit function).

Keras is great for experimenting and learning about various hyperparameters, which are the values you specify when creating the model, compiling the model, or running the training. Simple examples include the number of convolutional filters used with specific layers, the dropout rate, the activation function, and so on. There is no real science to selecting these hyperparameters, and you might have to try several options to get some feel for improving the accuracy of the model or improving convergence.

Keras has a very large set of examples for a wide range of topics, such as transformer-based examples to learn about how these popular building blocks work in generative AI examples, time series examples, audio examples, graph examples, structured data examples, and style transfer examples, as well as a nice section on generative examples that go beyond the transformer examples.
