4F12 Revisions

Computer vision

Posted by Jingbiao on April 10, 2022

Question 1 Image structures

Implementing Gaussian/Low-Pass Filters

The Fourier transform of a Gaussian is itself a Gaussian, so Gaussian smoothing acts as a low-pass filter with a frequency cut-off at around $1/\sigma$.

Why smoothing is necessary:

  • Differentiation amplifies high-frequency noise, so suppressing the noise first makes the computed derivatives more reliable.
  • By varying the variance of the Gaussian, we can select features at different scales of interest.
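A minimal numpy sketch of why smoothing matters before differentiation (the helper name and parameter values are my own): the derivative of a noisy signal is dominated by amplified noise, while the derivative of the Gaussian-smoothed signal stays close to the true derivative.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1D Gaussian, truncated at 3 sigma and normalised to sum to 1
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 500)
signal = np.sin(t) + 0.1 * rng.standard_normal(t.size)

smoothed = np.convolve(signal, gaussian_kernel(5.0), mode="same")
d_raw = np.diff(signal)       # derivative dominated by amplified noise
d_smooth = np.diff(smoothed)  # close to the true derivative

print(np.std(d_raw) > np.std(d_smooth))  # True: smoothing tames the noise
```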

Fourier transform / Convolution

Fourier-domain filtering is more efficient for low-pass filtering with large kernel sizes, but it carries a fixed overhead: two forward and one inverse Fourier transform. Direct 2D convolution is more efficient for small kernels, and can be made more efficient still by splitting the 2D Gaussian into two separable 1D convolutions. With an image pyramid, the sequential smoothing can be implemented with smaller kernels, further improving efficiency.
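As a sketch of the separable trick (pure numpy, helper names hypothetical): a 2D Gaussian kernel factorises into the outer product of two 1D kernels, so a row-wise then column-wise 1D convolution reproduces the full 2D convolution at O(2k) multiplies per pixel instead of O(k²).

```python
import numpy as np

def gauss1d(sigma):
    # truncated, normalised 1D Gaussian kernel
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def smooth_separable(img, sigma):
    """Row-wise then column-wise 1D convolution: O(2k) multiplies per
    pixel instead of O(k^2) for the full 2D kernel of width k."""
    g = gauss1d(sigma)
    tmp = np.apply_along_axis(np.convolve, 1, img, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, g, mode="same")

rng = np.random.default_rng(1)
img = rng.random((32, 32))
sep = smooth_separable(img, sigma=1.5)

# brute-force 2D convolution with the outer-product kernel, interior
# pixels only (avoids differing boundary treatments)
g = gauss1d(1.5)
K = np.outer(g, g)
r = len(g) // 2
direct = np.zeros_like(img)
for i in range(r, 32 - r):
    for j in range(r, 32 - r):
        direct[i, j] = np.sum(img[i - r:i + r + 1, j - r:j + r + 1] * K)

print(np.allclose(sep[r:-r, r:-r], direct[r:-r, r:-r]))  # True
```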

Image pyramid

Implementing image pyramid

An image pyramid is an efficient way to sample smoothed images at different scales.

  1. Sparse sampling, using discrete sigma levels $\sigma_i = 2^{i/s} \sigma_0$ for each smoothed image $S_{\sigma}(x,y)$. The smoothing sigma doubles every $s$ intervals, which represents one octave.
  2. Within each octave, smoothing is applied incrementally. Since \( G(\sigma_1) * G(\sigma_2) = G(\sqrt{\sigma_1^2 + \sigma_2^2}) \), the next level can be obtained by convolving with an incremental kernel, $S(\sigma_{i+1}) = S(\sigma_i) * G(\sigma_k)$, where \( \sigma_k = \sigma_i\sqrt{2^{2/s}-1} \).
  3. When $\sigma_i = 2\sigma_0$, subsample the image to 1/4 of the resolution, i.e. reduce height and width by a factor of 2, by keeping every other pixel. The incremental smoothing kernels can then be reused for the new, subsampled octave.
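The three steps above can be sketched in numpy (the smoothing helper and the values $\sigma_0 = 1.6$, $s = 3$ are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def gauss1d(sigma):
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def smooth(img, sigma):
    # separable 2D Gaussian smoothing
    g = gauss1d(sigma)
    tmp = np.apply_along_axis(np.convolve, 1, img, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, g, mode="same")

def build_pyramid(img, n_octaves=3, s=3, sigma0=1.6):
    """Each octave holds s+1 levels; within an octave each level is
    smoothed from the previous one with sigma_k = sigma_i * sqrt(2^(2/s) - 1).
    Once sigma has doubled, the image is subsampled by keeping every
    other pixel, and the same incremental kernels are reused."""
    pyramid = []
    level = smooth(img, sigma0)
    for _ in range(n_octaves):
        octave = [level]
        sigma_i = sigma0
        for _ in range(s):
            sigma_k = sigma_i * np.sqrt(2 ** (2 / s) - 1)
            octave.append(smooth(octave[-1], sigma_k))
            sigma_i *= 2 ** (1 / s)
        pyramid.append(octave)
        level = octave[-1][::2, ::2]  # sigma_i = 2 * sigma0 here
    return pyramid

pyr = build_pyramid(np.random.default_rng(0).random((64, 64)))
print([o[0].shape for o in pyr])  # [(64, 64), (32, 32), (16, 16)]
```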

Using the image pyramid to localise blob-like features

The Laplacian of Gaussian can be approximated either via a Taylor expansion or with the difference of Gaussians (DoG): \( S(x,y,k\sigma) - S(x,y,\sigma) \approx (k-1) \sigma^2 \nabla^2 S(x,y,\sigma) \), using \( \nabla^2 \left( G_{\sigma_i} * I \right)(x,y) = \left( \nabla^2 G_{\sigma_i} \right) * I(x,y) = \nabla^2 S_{\sigma_i}(x,y) \), provided that $k$ is close to 1, typically around 1.2-1.6. This can be computed efficiently by subtracting adjacent images within an octave. The minima/maxima of the DoG responses give the key points' locations and scales: each candidate pixel is compared with its 26 neighbours (a 3×3×3 cube across position and scale). The scale is determined by the sigma $\sigma$, and the location $(x,y)$ by the position of the max/min.
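A brute-force sketch of the 26-neighbour extremum test (function name hypothetical; real implementations additionally discard weak and edge-like responses):

```python
import numpy as np

def dog_extrema(octave):
    """Subtract adjacent levels of one octave to get DoG images, then keep
    pixels that are the unique max or min of their 3x3x3 (26-neighbour)
    cube across space and scale."""
    dog = np.stack([octave[i + 1] - octave[i] for i in range(len(octave) - 1)])
    keypoints = []
    for k in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[k - 1:k + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[k, y, x]
                unique = (cube == v).sum() == 1
                if unique and (v == cube.max() or v == cube.min()):
                    keypoints.append((k, y, x))  # scale index and location
    return keypoints

# hypothetical usage: a response present in only one level of a toy
# octave shows up as a single DoG extremum at that scale and position
levels = [np.zeros((11, 11)) for _ in range(4)]
levels[2][5, 5] = 1.0
print(dog_extrema(levels))  # [(1, 5, 5)]
```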

Note that the Laplacian of Gaussian is a band-pass filter. This can be shown either by taking its Fourier transform or by considering the DoG approximation: since the two Gaussians cut off at different frequencies, their difference passes the band between those two cut-off frequencies.

Finding the feature orientation (to achieve orientation invariance)

Sample a 16 by 16 patch of pixels around each feature point from $S(x,y,\sigma_i)$. Compute the gradient vector at each pixel to obtain an orientation and magnitude. Accumulate the gradients, weighted by a Gaussian window (with larger $\sigma$), into a histogram of 36 bins, each covering 10 degrees of orientation. The largest bin gives the dominant orientation; accuracy can be improved by interpolating between neighbouring bins.
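A sketch of the 36-bin orientation histogram in numpy (the window $\sigma$ and the lack of bin interpolation are simplifications of my own):

```python
import numpy as np

def dominant_orientation(patch, sigma=8.0):
    """36-bin orientation histogram of a patch, weighted by gradient
    magnitude and a Gaussian window; returns the peak bin centre in
    degrees (no interpolation between bins)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    window = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    hist, _ = np.histogram(ang, bins=36, range=(0, 360), weights=mag * window)
    return 10 * int(np.argmax(hist)) + 5

# hypothetical usage: a horizontal intensity ramp has gradients along +x
ramp = np.tile(np.arange(16.0), (16, 1))
print(dominant_orientation(ramp))  # 5, the centre of the 0-10 degree bin
```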

How does SIFT encode the feature shape and achieve invariance to lighting and viewpoint changes?

  • The 16 by 16 patch of pixels is normalised for scale and orientation, using the blob-like feature detection and the orientation histogram, which helps with 2D viewpoint changes. The gradients are also smoothed.
  • The feature is encoded with gradient vectors rather than raw intensities, which makes the descriptor robust to lighting changes.
  • The local 4 by 4 histograms of gradients give invariance to small viewpoint changes. With 8 orientation bins each, they form the 128-D descriptor ($4 \times 4 \times 8 = 128$).
  • Furthermore, the 128-D descriptor vector is normalised to unit length for invariance to overall gradient magnitude. It is then thresholded, truncating any value larger than 0.2 to 0.2, and renormalised, providing robustness to affine lighting changes.
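The final normalise, clip, renormalise step can be sketched as follows (the 0.2 threshold follows the description above; the function name is my own):

```python
import numpy as np

def normalise_descriptor(desc, clip=0.2):
    """Unit-normalise, truncate entries above 0.2, then renormalise.
    The clip limits the influence of any single large gradient."""
    d = desc / np.linalg.norm(desc)
    d = np.minimum(d, clip)
    return d / np.linalg.norm(d)

# hypothetical usage: one dominant gradient entry gets capped
desc = np.zeros(128)
desc[0] = 10.0
desc[1:5] = 1.0
out = normalise_descriptor(desc)
print(round(np.linalg.norm(out), 6))  # 1.0: unit length after renormalising
```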

Limitations of SIFT

  • Fails at strong edges / occluding boundaries
  • Poor at larger viewpoint changes (> 30 degrees)