We will cover the following tasks in 1 hour and 13 minutes:
We will understand the Rhyme interface and our learning environment. You will get a virtual machine with Jupyter Notebook and TensorFlow already installed, which is everything you need for this course. Jupyter Notebooks are very popular with data scientists and machine learning engineers because you can write code in some cells and use other cells for documentation.
What is Overfitting?
Overfitting occurs when the accuracy of the model on the training data keeps increasing or stays constant with more epochs, while its accuracy on the validation data peaks after a certain number of epochs and then starts decreasing. Let’s import the libraries that we will need. We will use TensorFlow and Keras, along with NumPy, the fundamental package for scientific computing in Python.
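To make the definition of overfitting concrete, here is a minimal sketch in plain Python. The accuracy values are made up for illustration, and `best_epoch` is a hypothetical helper, not something from the course notebook.

```python
def best_epoch(val_accuracy):
    """Return the 1-based epoch at which validation accuracy peaks."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1


# Hypothetical per-epoch accuracies, invented for this illustration only.
train_acc = [0.80, 0.88, 0.93, 0.96, 0.98, 0.99]
val_acc = [0.79, 0.85, 0.88, 0.87, 0.86, 0.85]

peak = best_epoch(val_acc)

# Sign of overfitting: validation accuracy peaked before the last epoch
# while training accuracy kept improving afterwards.
overfitting = peak < len(val_acc) and train_acc[-1] > train_acc[peak - 1]
print(f"validation accuracy peaks at epoch {peak}; overfitting: {overfitting}")
```

With real training runs, the same lists could be read from the history that Keras returns after training.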
We are going to use movie reviews from the Internet Movie Database (IMDB) as our dataset. They are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets consist of equal numbers of positive and negative reviews. The dataset is pre-processed so that each example is an array of integers representing the words of the movie review. Each label is an integer value of either 0 (negative sentiment) or 1 (positive sentiment). We will multi-hot encode our data. In the dataset, every example is a list of numbers representing the words that appear in that review. Multi-hot encoding turns each list into a vector of 0s and 1s, with a 1 at the index of every word that is present. We are going to work with only the 10,000 most common words in the dictionary.
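The multi-hot encoding step can be sketched with NumPy as follows. This is an illustrative version under assumptions: the function name `multi_hot_sequences` and the toy reviews are hypothetical, and in the notebook the integer sequences would presumably come from the Keras IMDB dataset loader rather than being typed in by hand.

```python
import numpy as np

NUM_WORDS = 10000  # only the 10,000 most common words are kept


def multi_hot_sequences(sequences, dimension=NUM_WORDS):
    """Turn lists of word indices into rows of 0s and 1s."""
    results = np.zeros((len(sequences), dimension), dtype=np.float32)
    for i, word_indices in enumerate(sequences):
        # Set a 1 at every index that occurs in this review.
        # Repeated words still produce a single 1.
        results[i, word_indices] = 1.0
    return results


# Two toy "reviews" given as word indices (hypothetical data).
reviews = [[3, 5, 5, 9999], [0, 3]]
encoded = multi_hot_sequences(reviews)
print(encoded.shape)
```

Each row has one column per word in the vocabulary, so every review becomes a fixed-length input regardless of how many words it contains.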
Creating the Baseline Model
To find an appropriate size for our model, we normally start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until we see diminishing returns on the validation loss. We’ll create three models - a baseline model, a smaller version and a larger version - and then we will compare them.
Creating Model Variants
The three models have the same number of layers but differ in the number of nodes in the first two layers. We will train all three models using the fit method. We will pass in the training data, run the training for 20 epochs, and set the batch size to 512. We will use our test set as the validation set.
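As a rough sketch of how the three variants and the training call might look in Keras: the layer widths (16, 4 and 512), the optimizer, and the helper name `build_model` are assumptions made for illustration, not values taken from the course. Small random data stands in for the encoded IMDB reviews so that the snippet runs without a download.

```python
import numpy as np
import tensorflow as tf

NUM_WORDS = 10000  # vocabulary size stated in the text


def build_model(units, input_dim=NUM_WORDS):
    """Binary classifier; `units` sets the width of the first two layers."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy", "binary_crossentropy"])
    return model


# Assumed widths for the three variants (not from the course).
baseline_model = build_model(16)
smaller_model = build_model(4)
bigger_model = build_model(512)

# Random stand-in data; replace with the multi-hot encoded reviews.
rng = np.random.default_rng(0)
x_demo = (rng.random((64, NUM_WORDS)) < 0.01).astype("float32")
y_demo = rng.integers(0, 2, size=(64,)).astype("float32")

# With the real data: training set as input, test set as validation_data,
# epochs=20 and batch_size=512 as described above.
history = baseline_model.fit(x_demo, y_demo,
                             epochs=2, batch_size=32,
                             validation_data=(x_demo, y_demo),
                             verbose=0)
print(sorted(history.history.keys()))
```

The returned history object records the loss and metrics per epoch for both the training and validation data, which is what the plotting step uses.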
Plot History Function
We will define a function that plots binary cross entropy against epochs, given the three history objects that we got from training the three models. As usual, we are using Matplotlib’s PyPlot module.
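A plotting function along these lines could look like the sketch below. It is an assumption-laden illustration: the name `plot_history`, its arguments, and the sample numbers are hypothetical, and it takes plain dictionaries of per-epoch values (the kind of dictionary found in the `history` attribute of a Keras training result) so that it runs without training a model.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit inside a notebook
import matplotlib.pyplot as plt


def plot_history(histories, key="binary_crossentropy"):
    """Plot training (solid) and validation (dashed) curves for each model.

    `histories` is a list of (name, metrics) pairs, where `metrics` maps
    metric names to lists of per-epoch values.
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    for name, metrics in histories:
        epochs = range(1, len(metrics[key]) + 1)
        val_line, = ax.plot(epochs, metrics["val_" + key],
                            linestyle="--", label=name + " val")
        # Reuse the colour so both curves of one model match.
        ax.plot(epochs, metrics[key],
                color=val_line.get_color(), label=name + " train")
    ax.set_xlabel("Epochs")
    ax.set_ylabel(key.replace("_", " ").title())
    ax.legend()
    return fig


# Made-up numbers, for illustration only.
demo = {"binary_crossentropy": [0.6, 0.4, 0.2],
        "val_binary_crossentropy": [0.6, 0.5, 0.55]}
other = {"binary_crossentropy": [0.7, 0.5, 0.4],
         "val_binary_crossentropy": [0.7, 0.55, 0.5]}
fig = plot_history([("baseline", demo), ("smaller", other)])
```

Drawing the training and validation curves of each model in the same colour makes the gap between them, which is the sign of overfitting, easy to see.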
Plotting the Training and Validation Loss
The more capacity a neural network has, the more quickly it can model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss). Of the three models, the smaller one seems to be doing the best job!
A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and is done by adding a cost associated with having large weights to the loss function of the network.
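The idea of adding a cost for large weights can be written out in a few lines of NumPy. This is a sketch of the L2 form of the penalty only: the coefficient 0.001, the weight values, and the function names are assumptions chosen for illustration. In Keras, the same kind of penalty can be attached to a layer through its `kernel_regularizer` argument, for example with `tf.keras.regularizers.l2`.

```python
import numpy as np


def l2_penalty(weights, l2=0.001):
    """L2 cost: coefficient times the sum of squared weights."""
    return l2 * float(np.sum(np.square(weights)))


def regularized_loss(data_loss, weight_matrices, l2=0.001):
    """Total loss = loss on the data + L2 cost of every weight matrix."""
    return data_loss + sum(l2_penalty(w, l2) for w in weight_matrices)


# Two hypothetical weight matrices that differ only in scale.
small_weights = np.array([[0.1, -0.2], [0.05, 0.0]])
large_weights = 10 * small_weights

# Same data loss (0.30), different totals because of the penalty.
small_total = regularized_loss(0.30, [small_weights])
large_total = regularized_loss(0.30, [large_weights])
print(small_total, large_total)
```

Because the penalty grows with the square of each weight, the optimizer is pushed toward solutions with smaller weights whenever they fit the data about equally well.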
L2 Model vs Baseline
We will look at the impact of regularization on our models. We have two models with the same network architecture. Despite the same architecture, the regularized model is much more resistant to overfitting as we can see from this plot. The L2 model is definitely an improvement over the base model.
Dropout is a very common regularization technique used in neural networks. It consists of setting random output features of a layer to 0 during training. Let’s say we apply dropout to a layer that returns a 10-dimensional vector. Some of these 10 outputs will then be randomly set to 0, or dropped out of the vector. The fraction of the values that are set to 0 is called the dropout rate. Units are dropped only during training, not during testing. To balance for the fact that more units are active at testing than at training, the outputs have to be rescaled. Either the layer's output values are scaled down at test time by the fraction of units that were kept (1 minus the dropout rate), or, as Keras does, the surviving outputs are scaled up by the inverse of that fraction during training and left unchanged at test time.
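The mechanics of dropout can be sketched in NumPy as below. This illustrates the "inverted" variant, in which the surviving outputs are scaled up during training and nothing is changed at test time (the approach the Keras `Dropout` layer takes). The function name, the rate of 0.5, and the all-ones input vector are assumptions for illustration.

```python
import numpy as np


def dropout(outputs, rate, training, rng):
    """Inverted dropout on a layer's output vector."""
    if not training or rate == 0.0:
        # At test time every unit stays active and nothing is rescaled.
        return outputs
    # Keep each value with probability (1 - rate), drop it otherwise.
    keep_mask = rng.random(outputs.shape) >= rate
    # Scale the survivors up so the expected sum stays the same.
    return outputs * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)
layer_output = np.ones(10)  # a 10-dimensional layer output

train_output = dropout(layer_output, rate=0.5, training=True, rng=rng)
test_output = dropout(layer_output, rate=0.5, training=False, rng=rng)
print(train_output)
print(test_output)
```

During training, each entry is either 0 or twice its original value; at test time the vector passes through untouched.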
Dropout Model vs Baseline
Finally, we plot the training and validation loss for the models using Dropout and Weight Regularization. Just like Weight Regularization, Dropout also reduces overfitting. We went through a lot of concepts related to overfitting, and hopefully you now have a good understanding of what it is and how to mitigate it.
About the Host (Amit Yadav)
I am a Software Engineer with many years of experience in writing commercial software. My current areas of interest include computer vision and sequence modelling for automated signal processing using deep learning as well as developing chatbots.