Deep papers: VGGNet

With this blog post, I'm going to discuss the very popular and widely used convolutional neural network architecture proposed by the Visual Geometry Group of Oxford University's Department of Engineering Science, widely known as VGGNet. As the original paper title suggests, "Very Deep Convolutional Networks for Large-Scale Image Recognition", it was built for and evaluated on ILSVRC, where it achieved a 23.7% top-1 validation error and 6.8% in both top-5 validation and test error. Here, while we implement the original paper's networks with TensorFlow Keras, we are also going to run some experiments to see how these architectures behave with a relatively small dataset (more than 10k but less than 100k images). Let's dive in and have some fun.

How is it different?

As the authors mention, their work was done to find out how the depth of a convolutional network affects its accuracy, and they also try to explain why small receptive fields with more depth are more effective than shallow networks with larger receptive fields. Besides the architectural changes, they also discuss multi-scale training. One unique thing they tried in this paper was a single kernel size for all convolutional layers, which is 3x3. They use 16 to 19 weight layers, meaning the total number of convolutional and dense layers in the network. Another thing to notice is that they did not use local response normalization in their networks, which had been used in state-of-the-art architectures before VGGNet.
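To see why stacked 3x3 kernels are attractive, the paper points out that three stacked 3x3 convolutions cover the same 7x7 effective receptive field as a single 7x7 convolution, but with fewer parameters. The quick check below is my own sketch of that arithmetic (assuming C input and C output channels and ignoring biases); it is not code from the paper.

```python
# Parameter-count comparison behind the VGG paper's argument:
# three stacked 3x3 conv layers vs. one 7x7 conv layer,
# both with C input channels and C output channels (biases ignored).
def conv_params(kernel_size, in_channels, out_channels):
    return kernel_size * kernel_size * in_channels * out_channels

C = 256  # example channel count; any value gives the same ratio
stacked_3x3 = 3 * conv_params(3, C, C)   # 3 * 9C^2 = 27C^2
single_7x7 = conv_params(7, C, C)        # 49C^2

print(stacked_3x3, single_7x7)
print(single_7x7 / stacked_3x3)          # ~1.8x more parameters for the 7x7 layer
```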

Architecture

There are six architectures in the original paper that the authors used to prove their theory. Let's examine them one by one. In this section, we discuss only the architecture, not the implementation.

There are six networks in the paper with different layers and depths. Let's look at the main networks, which differ in terms of depth. There are four different depths: 11, 13, 16, and 19 weight-layer networks, all with convolutional layers of 3x3 receptive field and 1x1 stride. They share an identical set of three densely connected layers: two layers with 4096 nodes and one with 1000 nodes for the final predictions. The rectified linear unit (ReLU) is used to activate the hidden layers, with an L2 weight decay of 5e-4. For the first two fully connected layers, a dropout of 50% was used. Max-pooling is used for downsampling, with a 2x2 pool size and a 2x2 stride.
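To make the shared structure concrete, here is a minimal Keras sketch of the repeating pattern all the configurations follow: a block of 3x3 convolutions with ReLU and 5e-4 L2 weight decay, followed by 2x2 max-pooling. This is my own illustration, not code released with the paper.

```python
from tensorflow.keras import layers, regularizers

def vgg_block(x, num_convs, filters, weight_decay=5e-4):
    """A VGG-style block: `num_convs` 3x3 convolutions (stride 1, ReLU,
    L2 weight decay) followed by 2x2 max-pooling with stride 2."""
    for _ in range(num_convs):
        x = layers.Conv2D(
            filters, kernel_size=3, strides=1, padding="same",
            activation="relu",
            kernel_regularizer=regularizers.l2(weight_decay),
        )(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)
```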

The 11-layer, 13-layer, 16-layer and 19-layer architectures

There are two other configurations. One is an 11-layer network where the first convolutional layer is followed by local response normalization (LRN), used to show that LRN does not improve network accuracy. The other is a 16-layer network with a 1x1 convolutional layer as the final layer of the 3rd, 4th and 5th convolutional blocks.

The 16-layer network with 1x1 convolutions, and the 11-layer network with LRN

Implementation with Keras

I'm not going to put the code for all the networks here, only the 16-layer implementation. Stochastic gradient descent is used as the optimizer, with a 1e-2 initial learning rate and 0.9 momentum, as in the VGGNet paper. A 1000-way softmax is used as the final activation.
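Below is a minimal sketch of the 16-layer configuration (configuration D in the paper) in tf.keras, wired up the way the paper describes: five blocks of 3x3 convolutions, three fully connected layers with dropout on the first two, a 1000-way softmax, and SGD with a 1e-2 initial learning rate and 0.9 momentum. Treat it as an illustration, not a faithful reproduction of the original training code.

```python
from tensorflow.keras import layers, regularizers, models, optimizers

WEIGHT_DECAY = 5e-4

def build_vgg16(input_shape=(224, 224, 3), num_classes=1000):
    """Sketch of VGG configuration D: 13 conv layers + 3 dense layers."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # (number of 3x3 conv layers, filters) for the five blocks
    for num_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(num_convs):
            x = layers.Conv2D(
                filters, 3, strides=1, padding="same", activation="relu",
                kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.Flatten()(x)
    for _ in range(2):
        x = layers.Dense(4096, activation="relu",
                         kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)
        x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_vgg16()
model.compile(
    optimizer=optimizers.SGD(learning_rate=1e-2, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
```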

Implementation of the special cases:
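For the two special configurations, only small pieces differ, so here is a hedged sketch of how each change could be expressed in Keras: configuration C replaces the last convolution of blocks 3, 4 and 5 with a 1x1 convolution, and configuration A-LRN adds local response normalization after the first convolutional layer. tf.keras has no built-in LRN layer, so the sketch wraps tf.nn.local_response_normalization in a Lambda layer with AlexNet-style hyperparameters as an example; both snippets are my illustrations, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

WEIGHT_DECAY = 5e-4

# Configuration C: the final convolution of blocks 3, 4 and 5 uses a 1x1 kernel.
def block_with_1x1(x, filters, num_3x3=2):
    for _ in range(num_3x3):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)
    x = layers.Conv2D(filters, 1, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

# Configuration A-LRN: local response normalization after the first conv layer
# (hyperparameter values here follow AlexNet as an example).
def first_block_with_lrn(x):
    x = layers.Conv2D(64, 3, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(WEIGHT_DECAY))(x)
    x = layers.Lambda(lambda t: tf.nn.local_response_normalization(
        t, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75))(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)
```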


Experiments

First, the most noticeable point, and also the authors' main target, is to prove that deeper networks are better when we have to solve very large-scale classification problems. To do this they use four different depths, with 11, 13, 16 and 19 weight layers, and compare the accuracy of each architecture. In the VGGNet paper they used two training methods, fixed-scale and multi-scale training; here, for the experiments, we only use fixed-scale training with 96x96 RGB images.
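The experimental setup looks roughly like the sketch below: the same VGG architectures, but built for 96x96 inputs and however many classes the small dataset has. The directory paths, batch size and epoch count are placeholders, not the exact values behind the graphs, and `build_vgg16` refers to the sketch from the previous section.

```python
import tensorflow as tf

IMG_SIZE = (96, 96)        # fixed-scale training resolution used in these experiments
BATCH_SIZE = 64            # placeholder value
TRAIN_DIR = "data/train"   # placeholder path
VAL_DIR = "data/val"       # placeholder path

train_ds = tf.keras.utils.image_dataset_from_directory(
    TRAIN_DIR, image_size=IMG_SIZE, batch_size=BATCH_SIZE, label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    VAL_DIR, image_size=IMG_SIZE, batch_size=BATCH_SIZE, label_mode="categorical")

num_classes = len(train_ds.class_names)

# build_vgg16 is the sketch from the previous section, rebuilt for 96x96 inputs.
model = build_vgg16(input_shape=(96, 96, 3), num_classes=num_classes)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=50)  # placeholder epochs
```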

Validation accuracy against training steps on the original dataset (9k training images, 4k validation images)

Validation accuracy against training steps on the augmented dataset (39k training images, 16k validation images)

By observing these graphs alone, we can't conclude that deeper networks worked better than shallow networks, or that depth is not that important to network accuracy, because of a few factors:
  1. this dataset is not as large as ImageNet, on which these networks were originally trained;
  2. the stochastic nature of the optimizer, since we use the same SGD configuration as the original paper;
  3. and of course the complexity of the dataset: this is not as complex a task as a large-scale image classification problem like ImageNet.
But clearly, we can see in these graphs that as we increase the training set size with augmentation methods, validation accuracy declines, yet the deepest network holds its ground. These graphs show the full training history, but the networks were trained for a fixed number of training steps, so first we need to look at the loss graphs to see whether these networks start overfitting at any point.

Overfitting points in the original dataset

Overfitting points in the augmented dataset

Now we can mark those points on our validation accuracy graphs and see whether we can get any clues about what's happening here.


Now we have a clearer insight. When we consider the overfitting points in these graphs, on the original dataset the best model is the 13-layer network; second place goes to both the 19-layer and 11-layer networks, and the 16-layer network is in third place, but all of them are very close in terms of validation accuracy when they start to overfit. The situation is totally different on the augmented dataset: two of the deeper networks (the 13-layer and 16-layer networks) fall far behind, the 11-layer network gains more accuracy than them, but the 19-layer network ends up more accurate than all of them.

Local response normalization

As described in the AlexNet paper, "This sort of response normalization [LRN] implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed using different kernels" (AlexNet, section 3.3). Nowadays LRN layers are rarely used in CNNs because they bring very little or no improvement. In our experiment, LRN didn't make any improvement to the network; in fact, it performed worse than the same network without the LRN layer.
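As a quick illustration of what LRN actually computes, here is a small NumPy sketch of the across-channel formula from the AlexNet paper, b_i = a_i / (k + alpha * sum_j a_j^2)^beta, where the sum runs over n neighbouring channels. The hyperparameter values below are AlexNet's defaults, not something tuned for VGG, and the function is my own illustration.

```python
import numpy as np

def local_response_normalization(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Across-channel LRN for a single spatial position.
    `a` is a 1-D array of activations, one value per channel/kernel."""
    num_channels = a.shape[0]
    b = np.empty_like(a)
    for i in range(num_channels):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
        b[i] = a[i] / denom
    return b

# A strongly activated channel pushes down the normalized values of its neighbours.
print(local_response_normalization(np.array([0.1, 5.0, 0.1, 0.1])))
```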

Comparison with and without local response normalization

Conclusion

With all these facts, we can now say a few things:
  1. depth is a factor that affects the accuracy of a network, but it is not such a big factor when it comes to small datasets like this one;
  2. deeper networks take longer to train and need more computational power;
  3. the accuracy of a network depends on many things, such as the complexity of the data and the amount of data available for training;
  4. there is little reason to use LRN; use other regularization methods instead.

References

Simonyan, K. and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV]
