Sunday, January 20, 2019

Use SSE/AVX to Improve 2D Convolution Performance for a Specified Kernel Size



※ The full code of this post can be downloaded here.

      This post is a continuation of my previous post, which demonstrates how to use SSE/AVX instructions to improve the performance of 2D convolution. The best result in that post is about a 300% improvement compared with the serial version. That is good, but it can go further.

     I use two techniques for the further optimization: loop unrolling and data alignment. They improve the performance noticeably, but the trade-off is that the code works for a specific kernel size only; it is no longer generic.
 
 壹. Unroll the innermost loop.

   Essentially, a loop is a kind of conditional branch, which can stall the CPU pipeline and lower performance. Thus, for a specific kernel length, unrolling the innermost loop can bring further improvement.

   Since Visual C++ does not support a pragma hint for loop unrolling, I unroll the loop manually. I do it on the SSE4 version, for a kernel length of 21:

 __m128 m_kernel, m_src;
 __m128 m_temp0;

 p_mov_kernel = p_kernel + kernel_length*jj;
 p_mov_input = p_extended_input + y*extended_width + i;

#define ONE_SSE4_STEP()     \
 {      \
  float temp_sum;   \
        \
  m_kernel = _mm_loadu_ps(p_mov_kernel); \
  m_src = _mm_loadu_ps(p_mov_input); \
  \
  \
  m_temp0 = _mm_dp_ps(m_kernel, m_src, 0xf1); \
  temp_sum = _mm_cvtss_f32(m_temp0); \
  \
  sum += temp_sum; \
       \
  p_mov_kernel += sizeof(__m128) / sizeof(float); \
  p_mov_input += sizeof(__m128) / sizeof(float); \
 }

 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();
 ONE_SSE4_STEP();   /* five 4-tap SSE steps cover 20 of the 21 taps */

 sum += p_mov_kernel[0] * p_mov_input[0];   /* the remaining 21st tap */

 y += 1;

Below are the run times (10 rounds; times in ms).


size \ kernel      21x21
1000               1312 (304.1%)
2000               5035 (308.0%)
4000               20762 (302.1%)

In all cases, the performance of the unrolled SSE4 version improves to over 300%, almost equivalent to the AVX version.


貳. Pad the kernel data to be 16-byte aligned.
 
      The data-loading operation for SSE/AVX can also be improved. In the previous code, I use _mm_loadu_ps:

m_kernel = _mm_loadu_ps(p_mov_kernel);    

     The instruction _mm_load_ps is faster than _mm_loadu_ps, but _mm_load_ps requires the accessed address to be 16-byte aligned: if _mm_load_ps accesses an address that is not aligned to 16 bytes, the program crashes.
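
As a side note, a small guard like the following sketch makes the aligned/unaligned distinction explicit (load4 is a hypothetical helper, not part of the post's code); the approach taken below instead guarantees alignment by padding the kernel rows:

#include <stdint.h>
#include <xmmintrin.h>   /* __m128, _mm_load_ps, _mm_loadu_ps */

/* pick the fast aligned load only when the address really is 16-byte aligned */
static inline __m128 load4(const float *p)
{
 if (((uintptr_t)p & 0xF) == 0)
  return _mm_load_ps(p);    /* aligned: faster */
 return _mm_loadu_ps(p);    /* unaligned: safe fallback */
}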

    To achieve this, I put some padding into the kernel data, as the figure below shows:




Preparing the padded kernel data:

int kernel_length_alignment16;
float *p_aligned_kernel_matrix;

kernel_length_alignment16
 = (1 + kernel_length * sizeof(float) / 16)*(16/sizeof(float));

p_aligned_kernel_matrix = 
 (float*)_aligned_malloc(kernel_length_alignment16*kernel_length*sizeof(float), 16);

memset(p_aligned_kernel_matrix, 0, 
 kernel_length_alignment16*kernel_length * sizeof(float));

for (j = 0; j < kernel_length; j++) {

 float *p_mov;
 p_mov = p_aligned_kernel_matrix + kernel_length_alignment16*j;

 for (i = 0; i < kernel_length; i++) 
  p_mov[i] = p_kernel_matrix[j*kernel_length + i];
}/*for j*/

:

_aligned_free(p_aligned_kernel_matrix);
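
To make the formula concrete: for kernel_length = 21, kernel_length * sizeof(float) = 84 bytes and 84 / 16 = 5, so kernel_length_alignment16 = (1 + 5) * (16 / 4) = 24 floats. Each padded kernel row thus occupies 96 bytes (21 real coefficients followed by 3 zeros), and since the buffer comes from _aligned_malloc(..., 16), every row starts on a 16-byte boundary.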


Modify the pointer offset into the kernel data to match the 16-byte-aligned layout, and replace _mm_loadu_ps with _mm_load_ps:

int steps;
int kernel_length_alignment16;
:
kernel_length_alignment16 
 = (1 + kernel_length * sizeof(float) / 16)*(16/sizeof(float));

:

for (jj = 0; jj < kernel_length; jj++) {
 :
 p_mov_kernel = p_kernel + kernel_length_alignment16*jj;
 
#define ONE_SSE4_KERNEL_ALIGNED16_STEP() \
 {       \
  float temp_sum;   \
        \
  m_kernel = _mm_load_ps(p_mov_kernel); \
  :
 }

 ONE_SSE4_KERNEL_ALIGNED16_STEP();
 :
}/*jj*/
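
The macro body is elided above; the following is only my reading of what the full aligned step probably looks like, written as a hypothetical helper. The point is that only the kernel load switches to _mm_load_ps; the image row is still read with _mm_loadu_ps, because the input buffer itself is not padded. Advancing the kernel pointer by 4 floats (16 bytes) per step keeps every aligned load legal, since each padded kernel row starts on a 16-byte boundary:

#include <smmintrin.h>   /* SSE4.1 */

/* sketch of one aligned 4-tap step (hypothetical helper, mirrors ONE_SSE4_STEP) */
static inline void OneAlignedStep(const float **pp_kernel, const float **pp_input,
 float *p_sum)
{
 __m128 m_kernel = _mm_load_ps(*pp_kernel);   /* padded kernel row: 16-byte aligned */
 __m128 m_src    = _mm_loadu_ps(*pp_input);   /* image row: still unaligned */
 *p_sum += _mm_cvtss_f32(_mm_dp_ps(m_kernel, m_src, 0xf1));
 *pp_kernel += 4;   /* +16 bytes, so the next _mm_load_ps stays aligned */
 *pp_input  += 4;
}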


For a 1000x1000 image with kernel size = 21 and 10 rounds, the run time is 1167 ms: the improvement rises from 302.1% (unrolled version) to 330.0%, compared with the serial-extension version.
I also implement the AVX unrolled and kernel-aligned version, based on the shuffle procedure; its performance is 377.6%.

參. Summary

The table below summarizes all the methods, including those implemented in my previous post:

Optimization Method                        Performance
Serial boundary                            95.8%
Serial Extension                           100%
SSE1 shuffle                               230.7%
SSE1 shuffle mov-ptr                       247.7%
SSE3 shuffle mov-ptr                       252.7%
SSE3 h-add mov-ptr                         225.0%
SSE4 dot mov-ptr                           275.7%
AVX dot mov-ptr                            231.4%
AVX h-add mov-ptr                          268.1%
AVX shuffle mov-ptr                        294.5%
SSE4 dot unroll mov-ptr †                  304.4%
SSE4 dot unroll kernel alignment †         330.0%
AVX shuffle unroll kernel alignment †      377.6%

※ All the vectorization versions are based on the extension method.
† Unrolling and kernel alignment make the code work for the specified kernel size only, no longer generic.



Reference :

     Richard Gerber, Aart J. C. Bik, Kevin Smith, Xinmin Tian. The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition. Intel Press, 2005.

     Agner Fog's Optimization manuals
