Deep Network Structure
Convolutional neural networks can be regarded as trainable feed-forward multi-layered artificial neural networks that comprise multiple feature extraction stages [36,45]. Each feature extraction stage consists of convolutional layers with learnable filters, pooling layers, and activation (non-linearity) layers [36].
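For concreteness, the sketch below shows one such feature extraction stage, assuming a PyTorch implementation; the framework, channel counts, and input size are illustrative assumptions, not taken from this work:

```python
import torch
import torch.nn as nn

# One feature extraction stage: a convolutional layer with learnable filters,
# followed by a non-linearity (activation) layer and a pooling layer.
# All sizes below are illustrative assumptions.
stage = nn.Sequential(
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=3, padding=1),  # learnable 3x3 filters
    nn.ReLU(),                                                            # non-linearity layer
    nn.MaxPool2d(kernel_size=2),                                          # pooling layer
)

x = torch.randn(1, 6, 32, 32)  # e.g. a 32 x 32 patch with 6 spectral bands
print(stage(x).shape)          # torch.Size([1, 16, 16, 16])
```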
Our proposed fully convolutional neural network (F-CNN) consists of three convolutional layers. Training samples from the Landsat images and their corresponding label image samples were fed into the proposed network to train the model. Each layer applies a convolution operation to its input using learnable filters and passes the output feature maps to the next convolutional layer [36]. The first two convolutional layers (L1, L2) use kernels (sets of learnable filters) of size 3 × 3. These filter dimensions allow the network to slide across the entire width and height of a local region and to generate pixel-wise probabilities for each class based on contextual information. The network takes pixel-patches (with the target pixel at the centre of each patch) as input rather than single pixels. The filters of the last convolutional layer allow the network to learn the patterns of the different class types in the training data, with each kernel producing the weights for one class label. Therefore, to keep the number of output feature maps of the last convolutional layer equal to the number of classes, the size of the filters in the last layer is matched to the height and width of its input feature maps; a sketch of this architecture is given below.
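A minimal sketch of this three-layer design, again assuming PyTorch and illustrative hyper-parameters (6 input bands, 32 filters per hidden layer, 4 classes, 7 × 7 patches; none of these values are specified above):

```python
import torch
import torch.nn as nn

num_bands, num_classes = 6, 4  # assumed values for illustration

net = nn.Sequential(
    nn.Conv2d(num_bands, 32, kernel_size=3),  # L1: 3 x 3 learnable filters
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3),         # L2: 3 x 3 learnable filters
    nn.ReLU(),
    # Last layer: the filter size (3 x 3) matches the height and width of the
    # feature maps reaching it, and the number of filters equals the number
    # of classes, so the output holds one score per class.
    nn.Conv2d(32, num_classes, kernel_size=3),
)

patch = torch.randn(1, num_bands, 7, 7)          # target pixel at the patch centre
scores = net(patch)                              # shape: [1, num_classes, 1, 1]
probs = torch.softmax(scores.flatten(1), dim=1)  # pixel-wise class probabilities
```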
The size (length × height) of the output feature maps generated by a convolutional layer depends on the number of pooling operations, the stride, and the size of the convolutional filters [46]. Let Y and X be the height and width of the input to the convolutional layer, S the stride, and P the padding; Equations (1) and (2) then determine the height and width of each convolutional layer's output.
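Assuming Equations (1) and (2) take the standard convolution output-size form, H_out = (Y - F + 2P)/S + 1 and W_out = (X - F + 2P)/S + 1 with F the filter size, the relation can be checked numerically:

```python
def conv_output_size(in_size: int, filter_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output height/width of a convolutional layer, assuming the standard
    formula (in_size - filter_size + 2 * padding) // stride + 1."""
    return (in_size - filter_size + 2 * padding) // stride + 1

# e.g. a 7 x 7 input through a 3 x 3 filter, stride 1, no padding -> 5 x 5,
# and a second 3 x 3 convolution -> 3 x 3, matching the sketch above
print(conv_output_size(7, 3))  # 5
print(conv_output_size(5, 3))  # 3
```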