https://images.unsplash.com/photo-1499781350541-7783f6c6a0c8?ixlib=rb-1.2.1&q=85&fm=jpg&crop=entropy&cs=srgb&ixid=eyJhcHBfaWQiOjYzOTIxfQ

Introduction:


LFFD is an anchor-free face detector and belongs to the one-stage category of object detection mechanisms.

Receptive Fields (RFs) are deployed as natural "anchors".

Contents:


  • Receptive Field (RF) vs Effective Receptive Field (ERF)
  • Large faces vs tiny faces
  • RFs as Natural "Anchor"
  • Network Architecture

    Unlike in fully connected networks, where the value of each unit depends on the entire input to the network, a unit in convolutional networks only depends on a region of the input. This region in the input is the receptive field for that unit.

Source: Understanding the Effective Receptive Field in Deep Convolutional Neural Networks ✏️✏️✏️

Receptive Field (RF) vs Effective Receptive Field (ERF):


https://i1.wp.com/syncedreview.com/wp-content/uploads/2017/05/24.png?resize=950%2C485&ssl=1

Image Source

Receptive field (RF)


Consider the scenario above: (visualisation on lower left) 💡💡💡

  1. Input image (blue 💙) with padding 1 (refer to the visualisation on the upper left)
  2. 1st convolutional filter with kernel size 3x3 and stride 2 (shaded portion on the input image - refer to the green 💚 lines projected from the feature map to the input image)
  3. 2nd convolutional filter with kernel size 3x3 and stride 2 (shaded portion on the input image - refer to the orange 🧡 lines projected from the feature map to the input image)
  4. First feature map (green 💚) with output size 3x3, the result of convolving the first conv filter with the input image
  5. Second feature map (orange 🧡) with output size 2x2, the result of convolving the second conv filter with the first feature map
  6. The size of the feature maps reduces successively due to stride 2 🚧🚧🚧🚧
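To make the feature-map sizes above concrete, here is a minimal sketch assuming PyTorch and a 5x5 input (the exact input size and the padding of the second layer are my assumptions, chosen so the outputs match the figure):

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions with stride 2, as in the scenario above.
# Padding is assumed so the output sizes match the stated 3x3 and 2x2 maps.
conv1 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
conv2 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 1, 5, 5)          # assumed 5x5 input image (blue)
fmap1 = conv1(x)                     # first feature map (green)
fmap2 = conv2(fmap1)                 # second feature map (orange)

print(fmap1.shape)                   # torch.Size([1, 1, 3, 3])
print(fmap2.shape)                   # torch.Size([1, 1, 2, 2])
```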

    If you notice, the receptive field (RF) of the first neuron belonging to the first convolutional filter (green 💚) is a 3x3 patch of the input image (blue 💙). The output feature map consists of 9 neurons/features (each neuron will have an RF of 3x3).

Similarly, the receptive field (RF) of the first neuron belonging to the second convolutional filter (orange 🧡) is a 7x7 patch of the input image (blue 💙). The output feature map consists of 4 neurons/features (each neuron will have an RF of 7x7).

With this intuition we can state that the neurons in shallow layers have small RFs and those in deeper layers have large RFs.

📔 For the mathematical equations you could refer to the following link
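As a small substitute for the full derivation, here is a sketch of the standard RF recurrence (the RF grows by (kernel - 1) * jump per layer, where jump is the product of the strides of all previous layers), applied to the two layers above; the numbers match the 3x3 and 7x7 patches mentioned earlier:

```python
# Theoretical receptive-field size of stacked convolutions (sketch).
layers = [
    {"kernel": 3, "stride": 2},   # 1st convolutional filter (green)
    {"kernel": 3, "stride": 2},   # 2nd convolutional filter (orange)
]

rf, jump = 1, 1
for i, layer in enumerate(layers, start=1):
    rf += (layer["kernel"] - 1) * jump   # RF grows with kernel size and accumulated stride
    jump *= layer["stride"]              # jump = product of previous strides
    print(f"after conv{i}: RF = {rf}x{rf}")

# after conv1: RF = 3x3
# after conv2: RF = 7x7
```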

Effective Receptive Field (ERF)


However, one thing that the authors (Luo, Wenjie et al.) of the paper "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks" notice is that not all pixels in the receptive field contribute equally to the output unit's response.

The pixels that actually contribute to the output are what we call the Effective Receptive Field (ERF).

This becomes clear if you look at which pixels are repeatedly covered by the kernel as it is convolved over the input image: the edge pixels are covered fewer times than the centre pixels. (You could try this out in your head and it will make more sense.)

Thus the ERF plays an important role when it comes to learning.
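One quick way to see the ERF in practice is to back-propagate from a single central output unit and look at the gradient magnitude over the input: pixels with larger gradients contribute more to that unit. A minimal sketch, assuming PyTorch and a toy two-layer network (not the LFFD backbone):

```python
import torch
import torch.nn as nn

# Toy two-layer network, only for visualising the ERF idea.
net = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1),
)

x = torch.ones(1, 1, 64, 64, requires_grad=True)
out = net(x)

# Back-propagate from the central output unit only.
cy, cx = out.shape[2] // 2, out.shape[3] // 2
out[0, 0, cy, cx].backward()

erf = x.grad.abs()[0, 0]
# Only pixels inside the theoretical RF get non-zero gradients, and the
# central pixels (covered by more kernel positions) tend to dominate.
print((erf > 0).sum(), erf.argmax())
```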


✏️✏️✏️

Large faces vs tiny faces


The paper treats large and tiny faces differently.

The assumption is that tiny faces are more difficult to recognize as compared to larger faces.

In order to capture tiny faces, more contextual information is needed to come to a conclusion.

As the paper states:

  • for tiny/small faces, ERFs have to cover the faces as well as sufficient context information
  • for medium faces, ERFs only have to contain the faces with little context information
  • for large faces, only keeping them in RFs is enough

✏️✏️✏️

RFs as Natural "Anchor"


Face detection falls under the category of object detection. The authors avoid the anchor boxes used in general object detection pipelines and instead use the receptive fields as natural anchors.

Since faces have an aspect ratio of roughly 1:1, kernels with a width:height ratio of 1 (square kernels) are used, which generate square-shaped RFs well suited to detecting faces. For the matching strategy, an RF box is matched to a ground-truth (GT) box if and only if its centre falls inside the GT bbox, rather than by thresholding IoU (see the sketch below).
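A minimal sketch of this centre-in-box matching strategy (the array names and shapes are my own assumptions, not the authors' code):

```python
import numpy as np

def match_rf_to_gt(rf_centers, gt_boxes):
    """Match each RF position to a ground-truth box by the centre-in-box rule.

    rf_centers: (N, 2) array of (x, y) centres of the receptive fields
    gt_boxes:   (M, 4) array of GT boxes as (x1, y1, x2, y2)
    Returns an (N,) array with the index of the matched GT box, or -1 for background.
    """
    labels = np.full(len(rf_centers), -1, dtype=np.int64)
    for i, (cx, cy) in enumerate(rf_centers):
        for j, (x1, y1, x2, y2) in enumerate(gt_boxes):
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                labels[i] = j          # centre falls inside this GT bbox -> positive
                break
    return labels

# Example: two RF centres, one face box
rf_centers = np.array([[50, 50], [200, 200]])
gt_boxes = np.array([[30, 30, 80, 80]])
print(match_rf_to_gt(rf_centers, gt_boxes))   # [ 0 -1]
```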

✏️✏️✏️

Network Architecture


https://img.techpowerup.org/200917/screenshot-from-2020-09-17-11-11-00.png

The network architecture is divided into 4 parts.

  • tiny part - has 10 convolution layers
  • small part
  • medium part
  • large part

The first two layers downsample the input by a factor of 4 in total (each layer has stride 2).
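A minimal sketch of that downsampling stem, assuming a PyTorch implementation (channel counts and activation choice are illustrative, not the official configuration):

```python
import torch
import torch.nn as nn

class DownsampleStem(nn.Module):
    """Two stride-2 convolutions -> overall downsampling factor of 4 (sketch only)."""
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.conv1(x))   # /2
        x = self.relu(self.conv2(x))   # /4 overall
        return x

x = torch.randn(1, 3, 640, 640)
print(DownsampleStem()(x).shape)       # torch.Size([1, 64, 160, 160])
```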

Sources:


Code (under construction):


  • [ ] Data Augmentation
  • [ ] LFFD backbone CNN
  • [ ] Priors and matching strategy
  • [ ] LFFD Loss

Implementation:


The LFFD Loss function is similar to an SSD Multibox Loss function.

It consists of:

- negative-to-positive ratio: during the loss computation, most of the bboxes would by default belong to the class 'background'. In order to avoid this imbalance, we keep a limit on the number of background bboxes that contribute to the loss function (try a value between 3 and 10) - see the sketch after this list

-
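Below is a minimal sketch of the SSD-style hard negative mining that this ratio controls, assuming PyTorch; the tensor names and the two-class layout are assumptions, not the final implementation:

```python
import torch
import torch.nn.functional as F

def hard_negative_mining(cls_logits, labels, neg_pos_ratio=7):
    """Keep all positive boxes and only the hardest negatives.

    cls_logits: (N, 2) raw scores for [background, face]
    labels:     (N,)   1 for face boxes, 0 for background
    Returns a boolean mask selecting the boxes that enter the classification loss.
    """
    loss = F.cross_entropy(cls_logits, labels, reduction="none")

    pos_mask = labels > 0
    num_pos = int(pos_mask.sum())
    num_neg = neg_pos_ratio * max(num_pos, 1)   # cap on background boxes

    # Rank negatives by their loss and keep only the hardest ones.
    neg_loss = loss.clone()
    neg_loss[pos_mask] = -1.0                   # exclude positives from the ranking
    _, idx = neg_loss.sort(descending=True)
    neg_mask = torch.zeros_like(pos_mask)
    neg_mask[idx[:num_neg]] = True

    return pos_mask | neg_mask

# Example usage
logits = torch.randn(100, 2)
labels = torch.zeros(100, dtype=torch.long)
labels[:5] = 1                                  # 5 positive boxes
mask = hard_negative_mining(logits, labels, neg_pos_ratio=7)
print(int(mask.sum()))                          # 5 positives + 35 hardest negatives = 40
```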