Sunday, January 20, 2019

Use SSE/AVX to Improve 2D Convolution Performance for a Specified Kernel Size



※ The full code of this post can be downloaded here.

      This post is a continuation of my previous post, which demonstrates how to use SSE/AVX instructions to improve the performance of 2D convolution. The best result in that post is about a 300% improvement compared with the serial version. That is good, but it can go further.

     I use two techniques for the further optimization: loop unrolling and data alignment. They improve the performance noticeably, but the trade-off is that the code works for a specific kernel size only; it is no longer generic.
 
 壹. Unroll the innermost loop.

   Essentially, a loop is a kind of conditional branch, which can stall the CPU pipeline and lower performance. Thus, for a specific kernel length, unrolling the innermost loop can bring further improvement.

   Since Visual C++ does not support a pragma hint for loop unrolling, I unroll the loop manually. I do it on the SSE4 version, for a kernel length of 21:

 __m128 m_kernel, m_src;
 __m128 m_temp0;

 p_mov_kernel = p_kernel + kernel_length*jj;
 p_mov_input = p_extended_input + y*extended_width + i;

#define ONE_SSE4_STEP()     \
 {      \
  float temp_sum;   \
        \
  m_kernel = _mm_loadu_ps(p_mov_kernel); \
  m_src = _mm_loadu_ps(p_mov_input); \
  \
  \
  m_temp0 = _mm_dp_ps(m_kernel, m_src, 0xf1); \
  temp_sum = _mm_cvtss_f32(m_temp0); \
  \
  sum += temp_sum; \
       \
  p_mov_kernel += sizeof(__m128) / sizeof(float); \
  p_mov_input += sizeof(__m128) / sizeof(float); \
 }

 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();   /* five 4-tap SSE steps cover 20 of the 21 taps */

 sum += p_mov_kernel[0] * p_mov_input[0];   /* the remaining 21st tap */

 y += 1;

Below are the run times (10 rounds; times in ms).


size \ kernel      21x21
1000               1312 (304.1%)
2000               5035 (308.0%)
4000               20762 (302.1%)

In all cases, the performance of the unrolled SSE4 version improves to over 300%, almost equivalent to the AVX version.


貳. Pad the kernel data to be 16-byte aligned.
 
      The data-loading operation for SSE/AVX can also be improved. In the previous code, I use _mm_loadu_ps:

m_kernel = _mm_loadu_ps(p_mov_kernel);    

     The instruction _mm_load_ps is faster than _mm_loadu_ps, but _mm_load_ps requires the accessed address to be 16-byte aligned: if _mm_load_ps accesses an address that is not aligned to 16 bytes, the program crashes.
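
As a side note, a small guard like the following sketch makes the aligned/unaligned distinction explicit (load4 is a hypothetical helper, not part of the post's code); the approach taken below instead guarantees alignment by padding the kernel rows:

#include <stdint.h>
#include <xmmintrin.h>   /* __m128, _mm_load_ps, _mm_loadu_ps */

/* pick the fast aligned load only when the address really is 16-byte aligned */
static inline __m128 load4(const float *p)
{
 if (((uintptr_t)p & 0xF) == 0)
  return _mm_load_ps(p);    /* aligned: faster */
 return _mm_loadu_ps(p);    /* unaligned: safe fallback */
}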

    To achieve this, I put some padding into the kernel data, as the figure below shows:




Preparing the padded kernel data:

int kernel_length_alignment16;
float *p_aligned_kernel_matrix;

kernel_length_alignment16
 = (1 + kernel_length * sizeof(float) / 16)*(16/sizeof(float));

p_aligned_kernel_matrix = 
 (float*)_aligned_malloc(kernel_length_alignment16*kernel_length*sizeof(float), 16);

memset(p_aligned_kernel_matrix, 0, 
 kernel_length_alignment16*kernel_length * sizeof(float));

for (j = 0; j < kernel_length; j++) {

 float *p_mov;
 p_mov = p_aligned_kernel_matrix + kernel_length_alignment16*j;

 for (i = 0; i < kernel_length; i++) 
  p_mov[i] = p_kernel_matrix[j*kernel_length + i];
}/*for j*/

:

_aligned_free(p_aligned_kernel_matrix);
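
To make the formula concrete: for kernel_length = 21, kernel_length * sizeof(float) = 84 bytes and 84 / 16 = 5, so kernel_length_alignment16 = (1 + 5) * (16 / 4) = 24 floats. Each padded kernel row thus occupies 96 bytes (21 real coefficients followed by 3 zeros), and since the buffer comes from _aligned_malloc(..., 16), every row starts on a 16-byte boundary.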


Modify the pointer offset into the kernel data to match the 16-byte-aligned layout, and replace _mm_loadu_ps with _mm_load_ps:

int steps;
int kernel_length_alignment16;
:
kernel_length_alignment16 
 = (1 + kernel_length * sizeof(float) / 16)*(16/sizeof(float));

:

for (jj = 0; jj < kernel_length; jj++) {
 :
 p_mov_kernel = p_kernel + kernel_length_alignment16*jj;
 
#define ONE_SSE4_KERNEL_ALIGNED16_STEP() \
 {       \
  float temp_sum;   \
        \
  m_kernel = _mm_load_ps(p_mov_kernel); \
  :
 }

 ONE_SSE4_KERNEL_ALIGNED16_STEP();
 :
}/*jj*/
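
The macro body is elided above; the following is only my reading of what the full aligned step probably looks like, written as a hypothetical helper. The point is that only the kernel load switches to _mm_load_ps; the image row is still read with _mm_loadu_ps, because the input buffer itself is not padded. Advancing the kernel pointer by 4 floats (16 bytes) per step keeps every aligned load legal, since each padded kernel row starts on a 16-byte boundary:

#include <smmintrin.h>   /* SSE4.1 */

/* sketch of one aligned 4-tap step (hypothetical helper, mirrors ONE_SSE4_STEP) */
static inline void OneAlignedStep(const float **pp_kernel, const float **pp_input,
 float *p_sum)
{
 __m128 m_kernel = _mm_load_ps(*pp_kernel);   /* padded kernel row: 16-byte aligned */
 __m128 m_src    = _mm_loadu_ps(*pp_input);   /* image row: still unaligned */
 *p_sum += _mm_cvtss_f32(_mm_dp_ps(m_kernel, m_src, 0xf1));
 *pp_kernel += 4;   /* +16 bytes, so the next _mm_load_ps stays aligned */
 *pp_input  += 4;
}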


For a 1000x1000 image with kernel size = 21 and 10 rounds, the run time is 1167 ms: the improvement rises from 302.1% (unrolled version) to 330.0%, compared with the serial-extension version.
I also implement the AVX unrolled and kernel-aligned version, based on the shuffle procedure; its performance is 377.6%.

參. Summary

The table below summarizes all the methods, including those implemented in my previous post:

Optimization Method                        Performance
Serial boundary                            95.8%
Serial Extension                           100%
SSE1 shuffle                               230.7%
SSE1 shuffle mov-ptr                       247.7%
SSE3 shuffle mov-ptr                       252.7%
SSE3 h-add mov-ptr                         225.0%
SSE4 dot mov-ptr                           275.7%
AVX dot mov-ptr                            231.4%
AVX h-add mov-ptr                          268.1%
AVX shuffle mov-ptr                        294.5%
SSE4 dot unroll mov-ptr †                  304.4%
SSE4 dot unroll kernel alignment †         330.0%
AVX shuffle unroll kernel alignment †      377.6%

※ All the vectorization versions are based on the extension method.
† Unrolling and kernel alignment make the code work for the specified kernel size only, no longer generic.



Reference :

     Richard Gerber, Aart J. C. Bik, Kevin Smith, Xinmin Tian. The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition. Intel Press, 2005.

     Agner Fog's Optimization manuals
