Tuesday, January 22, 2019

Exploit SSE/AVX Instructions to Speed up the Computation of Separable Convolution


※ The full code for this post is available here.

      In my previous posts, this one and this one, I discussed various SSE/AVX optimizations for efficient general 2D convolution. In this post we attack convolution algorithmically, by using separable convolution. Many commonly used kernels, symmetric ones in particular, are separable.
   

零. Mathematical background.

  Convolution - by definition

  \[y[m, n] = x[m, n]*h[m, n] \equiv \sum_{j = -\infty}^{\infty}\sum_{i = -\infty}^{\infty}x[i, j]\cdot h[m-i, n-j]\]

If the kernel function, \(h[m, n]\), is the product of two vectors, \(h[m,n] = h_{1}[m] \cdot h_{2}[n]\) (where \(h_{1}[m]\) is a vertical vector and \(h_{2}[n]\) is horizontal), the convolution is separable: \[y[m, n] = \sum_{j = -\infty}^{\infty}h_{2}[j] \left ( \sum_{i = -\infty}^{\infty}h_{1}[i]\cdot x[m-i, n-j] \right )\]
 ※ \(h_{1}[m]\) and \(h_{2}[n]\) are interchangeable in the separable form.
 In that case, the full convolution can be computed with two 1D convolution passes. A detailed proof can be found here.
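
For example, the 3x3 binomial kernel (a common approximation to a small Gaussian) is the outer product of two 1D vectors:

\[\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}\]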
   
     For most applications, the kernel function is separable (e.g. a Gaussian, or down-sampling). Instead of \(O(w \cdot h \cdot k^2 )\), where \(w\) and \(h\) are the width and height of the input and \(k\) is the kernel width, the separable convolution costs \(O(w\cdot h \cdot 2 k)\), so the work drops by a factor of roughly \(k/2\). This is a big win for wide kernels (e.g. 15x15 or larger), which are common when blurring an image.
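
Concretely, for the 31x31 kernel used in the benchmarks below, the work per output pixel shrinks by about

\[\frac{k^{2}}{2k} = \frac{31^{2}}{2\cdot 31} = \frac{961}{62} \approx 15.5\]

so the separable form performs roughly fifteen times fewer multiply-accumulates (ignoring border handling).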

一.  Serial implementation:

   I assume the input data are padded with a border of half the kernel width on each side, as described in my previous post. Following the separable formula, the code is shown below:

SeparableConvolutionColumnSerial:
for (j = 0; j < height; j++) {
 int x;

 x = kernel_radius;
 
 /* zero the left border of this row of the intermediate buffer */
 memset(p_column_done_extended_output +
  j*extended_width, 0, kernel_radius * sizeof(float));

 for (i = 0; i < width; i++) {

  int jj;
  int y_mul_input_width;
  float sum;

  sum = 0;
  y_mul_input_width = j*extended_width;

  for (jj = 0; jj < kernel_length; jj++) {

   sum += p_kernel_column[jj]
    * p_extended_input[y_mul_input_width + x];

   y_mul_input_width += extended_width;
  }/*for kernel*/

  p_column_done_extended_output[j*extended_width + x] 
   = sum;
  x += 1;
 }/*for width*/

 /* zero the right border, so the row pass can read kernel_radius
    elements on each side without branching */
 memset(p_column_done_extended_output +
  j*extended_width + (kernel_radius + width), 0, 
  kernel_radius * sizeof(float));

}/*for j*/

SeparableConvolutionRowSerial:
int y_mul_input_width;
y_mul_input_width = 0;

for (j = 0; j < height; j++) {

 for (i = 0; i < width; i++) {
  int x;
  int ii;
  float sum;

  x = i;
  sum = 0;

  for (ii = 0; ii < kernel_length; ii++) {

   sum += p_kernel_row[ii]
    * p_column_done_extended_input[y_mul_input_width + x];
   x += 1;
  }/*for kernel_length*/

  p_output[j*width + i] = sum;
 }/*for width*/

 y_mul_input_width += extended_width;
}/*for j*/


And the functions are called as follows:
SeparableConvolutionColumnSerial(width, height, p_extended_input,
   kernel_length, p_kernel_column, p_separable_column_intermediate);

SeparableConvolutionRowSerial(width, height, p_separable_column_intermediate,
 kernel_length, p_kernel_row, p_separable_output_cpu);
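
For context, a minimal sketch of the surrounding setup might look like the following. It assumes the padding scheme of my previous post, kernel_radius = kernel_length / 2 and extended_width = width + 2 * kernel_radius; the buffer names match the calls above, but the allocation code itself is an assumption, not a copy of the full source.

int kernel_radius = kernel_length / 2;
int extended_width = width + 2 * kernel_radius;

/* intermediate buffer: one padded row per output row */
float *p_separable_column_intermediate =
 (float*)malloc(extended_width * height * sizeof(float));
/* final output: width x height */
float *p_separable_output_cpu =
 (float*)malloc(width * height * sizeof(float));

/* ...run the column pass and then the row pass, as shown above... */

free(p_separable_column_intermediate);
free(p_separable_output_cpu);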


Compared with the definition form fully optimized with AVX (the best result of my previous post), the separable convolution is very fast:
input = 1000x1000, kernel = 31x31, 10 rounds

Method                    Runtime, ms
Convolution, AVX          4018
Separable, Serial         335

      It is quite natural: algorithmic optimization usually brings more benefit than programming optimization.


 二. Use vectorization instructions to improve the performance:

     Before modifying the code to use SSE/AVX instructions, let us examine the inner loop of SeparableConvolutionColumnSerial, which iterates over kernel_length:
:
y_mul_input_width = j*extended_width;

for (jj = 0; jj < kernel_length; jj++) {

 sum += p_kernel_column[jj]
  * p_extended_input[y_mul_input_width + x];

 y_mul_input_width += extended_width;
}/*for kernel*/
:

      The accesses through p_extended_input are not consecutive: the stride between successive addresses is extended_width. Strided memory access is bad for performance and makes the vectorization instructions nearly useless.
     On closer inspection, it is easy to see that the two inner loops, over kernel_length and width, can be swapped. The code after the swap is:
/* the swapped loops accumulate into the intermediate buffer, so zero it all first */
memset(p_column_done_extended_output, 0,
  extended_width * height * sizeof(float));

for (j = 0; j < height; j++) {
 int jj;

 int y_mul_input_width;

 y_mul_input_width = j*extended_width;

 for (jj = 0; jj < kernel_length; jj++) {

  int x;
  float kernel_element;

  x = kernel_radius;
  kernel_element = p_kernel_column[jj];

  for (i = 0; i < width; i++)
  {
   float product;
   product = kernel_element
    * p_extended_input[y_mul_input_width + x];

   p_column_done_extended_output[j*extended_width + x]
    += product;
   x += 1;

  }/*for width*/

  y_mul_input_width += extended_width;
 }/*for kernel*/

}/*for j*/


Graphically, the procedure is illustrated in the figure below:


With this modification, the SeparableConvolutionColumn function can be optimized with SSE/AVX.
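
For illustration, the vectorized inner loop of the column pass might look roughly like this (SSE, four floats per step, unaligned loads; a sketch derived from the swapped serial code above, not the exact code of the full source):

/* broadcast one kernel element, accumulate 4 output pixels per step */
__m128 m_kernel_element = _mm_set1_ps(p_kernel_column[jj]);
float *p_mov_input = (float*)p_extended_input
 + y_mul_input_width + kernel_radius;
float *p_mov_output = p_column_done_extended_output
 + j*extended_width + kernel_radius;

for (i = 0; i < width / 4; i++) {
 __m128 m_input = _mm_loadu_ps(p_mov_input);
 __m128 m_acc = _mm_loadu_ps(p_mov_output);

 m_acc = _mm_add_ps(m_acc, _mm_mul_ps(m_kernel_element, m_input));
 _mm_storeu_ps(p_mov_output, m_acc);

 p_mov_input += 4;
 p_mov_output += 4;
}/*for sse*/
/* a scalar tail loop (not shown) handles the remaining width % 4 pixels */

Because the kernel element is the same for every pixel in the row, a single _mm_set1_ps outside the width loop suffices; only the loads, one multiply, one add, and a store remain inside.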


Modifying the two functions to use SSE4/AVX is not difficult (but a little tedious). Below is the key part of SeparableConvolutionRowSSE4; the inner loop has been vectorized:
float *p_mov_kernel;
float *p_mov_column_done_extended_input;
:
p_mov_kernel = (float*)p_kernel_row;
p_mov_column_done_extended_input 
 = (float*)p_column_done_extended_input + y_mul_input_width + x;

for (ii = 0; ii < steps; ii++) {
 __m128 m_kernel, m_input;
 __m128 m_temp0;

 float temp_sum;

 m_kernel = _mm_loadu_ps(p_mov_kernel);
 m_input = _mm_loadu_ps(p_mov_column_done_extended_input);

 /* 0xf1: multiply and sum all four lanes, write the sum to lane 0 */
 m_temp0 = _mm_dp_ps(m_kernel, m_input, 0xf1);
 temp_sum = _mm_cvtss_f32(m_temp0);

 sum += temp_sum;

 p_mov_kernel += step_size;
 p_mov_column_done_extended_input += step_size;
}/*for sse*/
:

The functions SeparableConvolutionColumnSSE4, SeparableConvolutionColumnAVX, and SeparableConvolutionRowAVX are quite similar to this one, so I leave them out.

The performance is shown in this table:
input = 1000x1000, kernel = 31x31, 10 rounds

Method    Runtime, ms
Serial    335 (100%)
SSE4      169 (198%)
AVX       142 (236%)


The improvement is not as good as for the definition form (about 275% in the SSE4 case, 300% for AVX), so there must be room for improvement.

三.  Align the kernel data address to 16.
      The kernel data are accessed at consecutive addresses, so it is simple to improve the speed of loading them into the SSE/AVX registers: replace malloc/free with _aligned_malloc/_aligned_free when allocating the kernel buffer, and use _mm_load_ps/_mm256_load_ps instead of _mm_loadu_ps/_mm256_loadu_ps to load the kernel data.
     Because the change is simple, the full code is omitted; a rough sketch is shown below.
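
Roughly, the change amounts to this (a sketch only; _aligned_malloc/_aligned_free are the MSVC-style allocators named above):

/* allocate the kernel buffer with 16-byte alignment */
float *p_kernel_row = (float*)_aligned_malloc(
 kernel_length * sizeof(float), 16);
:
/* the aligned load replaces _mm_loadu_ps; p_mov_kernel stays 16-byte
   aligned because it advances by step_size = 4 floats per iteration */
m_kernel = _mm_load_ps(p_mov_kernel);
:
_aligned_free(p_kernel_row);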

The result:
input = 1000x1000, kernel = 31x31, 10 rounds

Method                 Runtime, ms
Serial                 335 (100%)
SSE4                   169 (198%)
AVX                    142 (236%)
SSE4, aligned to 16    160 (209%)
AVX, aligned to 16     141 (237%)

There is no obvious improvement; there must be another way to achieve better performance.


四. Swap the inner loops of the row function.
       Loops can be swapped not only in the column function, but also in the row function.

Review the function SeparableConvolutionRowSerial:
:
for (i = 0; i < width; i++) {
 int x;
 int ii;
 float sum;

 x = i;
 sum = 0;

 for (ii = 0; ii < kernel_length; ii++) {

  sum += p_kernel_row[ii]
   * p_column_done_extended_input[y_mul_input_width + x];
  x += 1;
 }/*for kernel_length*/

 p_output[j*width + i] = sum;
}/*for width*/
:

     The outer loop's trip count determines how many times the inner loop is started, and that trip count, width, is much greater than kernel_length. Moreover, because kernel_length is small, each run of the inner loop does not last long, so the inner loop is entered and exited frequently.
     A loop is essentially a conditional branch: modern CPUs handle the branch well while the loop keeps iterating, but the end of a loop inevitably stalls the pipeline. Exiting a loop frequently therefore hurts performance noticeably, so swapping the two inner loops should bring an improvement.

Here is SeparableConvolutionRowSerial with the two inner loops swapped:
int y_mul_input_width;
y_mul_input_width = 0;

/* p_output is accumulated into below, so it must be zeroed first */
memset(p_output, 0, width * height * sizeof(float));

for (j = 0; j < height; j++) {
 int ii;
 for (ii = 0; ii < kernel_length; ii++) {
  
  int x;
  float kernel_element;

  x = ii;
  kernel_element = p_kernel_row[ii];

  for (i = 0; i < width; i++) {

   float product;

   product = kernel_element
    * p_column_done_extended_input[y_mul_input_width + x];
   p_output[j*width + i] += product;
   x += 1;
  }/*for width*/
 }/*kernel*/
 
 y_mul_input_width += extended_width;
}/*for j*/
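
For illustration, the width-in-inner AVX loop might look roughly like this (eight floats per step, unaligned loads, <immintrin.h> intrinsics; a sketch derived from the serial code above, not the exact code of the full source):

/* broadcast one kernel element, accumulate 8 output pixels per step */
__m256 m_kernel_element = _mm256_set1_ps(p_kernel_row[ii]);
float *p_mov_input = (float*)p_column_done_extended_input
 + y_mul_input_width + ii;
float *p_mov_output = p_output + j*width;

for (i = 0; i < width / 8; i++) {
 __m256 m_input = _mm256_loadu_ps(p_mov_input);
 __m256 m_acc = _mm256_loadu_ps(p_mov_output);

 m_acc = _mm256_add_ps(m_acc, _mm256_mul_ps(m_kernel_element, m_input));
 _mm256_storeu_ps(p_mov_output, m_acc);

 p_mov_input += 8;
 p_mov_output += 8;
}/*for avx*/
/* a scalar tail loop (not shown) handles the remaining width % 8 pixels */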

Deriving the SSE and AVX versions from the swapped serial code takes some work but is straightforward. The results are below:
input = 1000x1000, kernel = 31x31, 10 rounds

Method                    Runtime, ms
Serial                    335 (100%)
SSE4                      169 (198%)
AVX                       142 (236%)
SSE4, aligned to 16       160 (209%)
AVX, aligned to 16        141 (237%)
Serial, width in inner    281 (119%)
SSE4, width in inner      131 (255%)
AVX, width in inner       91 (368%)


The performance can be called "nice", especially for AVX.
The original procedure (with kernel_length in the innermost loop) uses eleven AVX/SSE instructions to compute the dot product, whereas the width-in-inner version needs only five. With noticeably fewer instructions, it is no surprise that the width-in-inner version is much faster.

※ I have also tried the FMA instruction set, but the performance was essentially the same as the AVX version.
Summary:

     0. Algorithmic and mathematical optimization can bring big rewards.
     1. Before using vectorization instructions, clean up and optimize the serial version first; it may bring some inspiration.
     2. The loop with the larger trip count should be placed innermost.
 

