※ The full code of this post can be downloaded here.
The x86 architecture has the Streaming SIMD Extensions (SSE) instruction sets. By processing 4 data items in parallel, SSE can theoretically improve performance to 400%. Advanced Vector Extensions (AVX) is 8-wide and can improve it to 800%! It is really amazing.
※ The vectorization instruction set of the ARM architecture is called NEON.
Theory aside, in practice it is difficult for SSE to give us over 300% without careful coding. In some cases SSE might even worsen the speed.
The figure below shows what vectorization instructions are.
Thus, vectorizing a procedure needs two conditions:
1. SIMD-wide registers to hold multiple data items.
2. Instructions that operate on the SIMD-wide data.
If no single vectorization instruction meets the programmer's requirement, it is necessary to use a sequence of SIMD instructions to achieve the same goal, as this figure illustrates for the dot product:
The sequence may not be short, and there are always some overhead operations that cannot be vectorized (e.g. loop control, shuffles, etc.). Therefore, an improvement to the full 400% is unreachable. In my experience, if we can improve to over 220%, the optimization can be called "nice work".
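As a minimal sketch of such a sequence (my own illustration, assuming SSE1 and two 4-float operands; the convolution code later uses the same pattern):

```c
#include <xmmintrin.h>

/* Minimal sketch: dot product of two 4-float vectors with SSE1 only.
   The multiply is one instruction, but summing the 4 products needs
   extra shuffle/add steps -- this is the overhead mentioned above. */
static float dot4_sse(const float *a, const float *b)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* p0, p1, p2, p3 */
    __m128 high = _mm_movehl_ps(prod, prod);                    /* p2, p3, p2, p3 */
    __m128 sum2 = _mm_add_ps(prod, high);                       /* p0+p2, p1+p3, .. */
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    return _mm_cvtss_f32(sum1);                                 /* p0+p1+p2+p3 */
}
```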
Of course, if the procedure is very simple, e.g. multiplying an array by a constant, the improvement can be very close to 400%.
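For instance, a sketch of the "array times a constant" case (my own example, assuming n is a multiple of 4) needs only one load, one multiply and one store per 4 floats, so almost everything is vectorized:

```c
#include <xmmintrin.h>

void ScaleArraySSE(float *p_data, int n, float c)
{
    int i;
    __m128 m_c = _mm_set1_ps(c); /* broadcast the constant to all 4 lanes */

    for (i = 0; i < n; i += 4) {
        __m128 m_v = _mm_loadu_ps(p_data + i);
        _mm_storeu_ps(p_data + i, _mm_mul_ps(m_v, m_c));
    }/*for i*/
}/*ScaleArraySSE*/
```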
The SSE registers, called xmm, are 128 bits wide and can hold integer and floating-point types. The AVX registers, called ymm, are 256 bits wide.
AVX looks better, right? Theoretically, yes, AVX is the better choice. But: not all CPUs support the AVX instruction sets (e.g. CPU models at the Pentium level and below); if the instruction set does not directly support the operation you want, the wider the data, the longer the sequence of operations becomes; besides, the current AVX instruction sets are not as diverse as SSE. Therefore, a procedure adopting AVX is not necessarily faster than its SSE counterpart.
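Because not every CPU supports AVX, it is worth checking at run time before dispatching to an AVX code path. Below is a sketch using the MSVC intrinsics __cpuid and _xgetbv (my own helper, not part of this post's downloadable code):

```c
#include <intrin.h>
#include <immintrin.h>

/* Returns 1 if both the CPU and the OS support AVX, 0 otherwise. */
static int IsAvxSupported(void)
{
    int regs[4];
    __cpuid(regs, 1);

    /* CPUID leaf 1, ECX bit 27: OSXSAVE, ECX bit 28: AVX */
    if (!(regs[2] & (1 << 27)) || !(regs[2] & (1 << 28)))
        return 0;

    /* XCR0 bits 1 and 2: the OS preserves XMM and YMM state */
    return ((_xgetbv(0) & 0x6) == 0x6) ? 1 : 0;
}/*IsAvxSupported*/
```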
First, I would like to show the simplest SSE code example.
The following code sums two arrays serially:
```c
float *p_src1, *p_src2, *p_dst;
int n;
:
/* init p_src1, p_src2 and p_dst */
n = 1000 * 10000;

#pragma loop(no_vector)  /* disable MSVC auto-vectorization of this loop */
for (i = 0; i < n; i++)
    p_dst[i] = p_src1[i] + p_src2[i];
```
The SSE version :
```c
#include <xmmintrin.h>
:
float *p_src1, *p_src2, *p_dst;
int n, end;
int step_size;

n = 1000 * 10000;
:
/* init p_src1, p_src2 and p_dst */

step_size = sizeof(__m128) / sizeof(float);  /* 4 floats per __m128 */
end = n / step_size;

for (i = 0; i < end; i++) {
    __m128 m128_tmp[3];
    m128_tmp[0] = _mm_loadu_ps(p_src1 + i*step_size);
    m128_tmp[1] = _mm_loadu_ps(p_src2 + i*step_size);
    m128_tmp[2] = _mm_add_ps(m128_tmp[0], m128_tmp[1]);
    _mm_storeu_ps(p_dst + i*step_size, m128_tmp[2]);
}/*for i*/

/* handle the remaining elements serially */
for (i = end*step_size; i < n; i++)
    p_dst[i] = p_src1[i] + p_src2[i];
```
I use these instructions:
_mm_loadu_ps : loads 4-wide unaligned data from memory.
_mm_add_ps : 4-wide add.
_mm_storeu_ps : stores 4-wide unaligned data to memory.
In my environment (VS2015 on an i5-4750, Windows 7), the result is:
100 rounds :
Serial : 1820 ms
SSE : 1568 ms
It is a simple case, so why does the improvement not reach 400%? Because the processing has no branches and is totally data-independent, the out-of-order engine in the CPU can already bring its full effect on the serial code. Even so, the SSE version is faster.
In most cases, the procedure is not so simple: the effect of the out-of-order engine is much smaller, so exploiting SSE instructions brings a much larger gain.
In addition, it is worth mentioning that there is still some room to optimize the SSE code above. For example, use _mm_load_ps/_mm_store_ps (with aligned memory from _aligned_malloc/_aligned_free) instead of _mm_loadu_ps/_mm_storeu_ps to improve memory access speed; or use moving pointers to avoid computing the access address on every iteration.
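A minimal sketch combining both ideas for the array-sum example (my own illustration, assuming n is a multiple of 4; not the exact code in the download):

```c
#include <xmmintrin.h>
#include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC) */

float *p_src1 = (float*)_aligned_malloc(n * sizeof(float), 16);
float *p_src2 = (float*)_aligned_malloc(n * sizeof(float), 16);
float *p_dst  = (float*)_aligned_malloc(n * sizeof(float), 16);
:
/* init p_src1 and p_src2 */

{
    const float *p_mov1 = p_src1;
    const float *p_mov2 = p_src2;
    float *p_mov_dst = p_dst;
    int i;

    for (i = 0; i < n; i += 4) {
        /* aligned loads/stores are legal thanks to _aligned_malloc(.., 16) */
        _mm_store_ps(p_mov_dst,
            _mm_add_ps(_mm_load_ps(p_mov1), _mm_load_ps(p_mov2)));
        p_mov1 += 4;
        p_mov2 += 4;
        p_mov_dst += 4;
    }/*for i*/
}

_aligned_free(p_src1);
_aligned_free(p_src2);
_aligned_free(p_dst);
```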
零. Optimization of the serial code of convolution :
This post emphasizes programming optimization; hence, I do not discuss algorithmic optimizations of convolution, e.g. separable convolution and so forth. I discuss the general form only.
The complete function for the general form of convolution is below:
```c
int ConvolutionSerialCPU(int width, int height, float *p_input,
    int kernel_length, float *p_kernel, float *p_output)
{
    int i, j;
    int kernel_radius;

    if (0 == width || 0 == height)
        return -1;

    if (kernel_length > width || kernel_length > height) {
        return -2;
    }/*if */

    if (NULL == p_input || NULL == p_kernel || NULL == p_output) {
        return -3;
    }

    kernel_radius = kernel_length / 2;

    for (j = 0; j < height; j++) {
        for (i = 0; i < width; i++) {

            int ii, jj;
            float sum;
            sum = 0;

            for (jj = 0; jj < kernel_length; jj++) {
                for (ii = 0; ii < kernel_length; ii++) {
                    int x, y;
                    x = i + ii - kernel_radius;
                    y = j + jj - kernel_radius;

                    if ((x < 0 || x >= width)
                        || (y < 0 || y >= height))
                    {
                        continue;
                    }/*out of bound*/

                    sum += p_kernel[kernel_length*jj + ii]
                        * p_input[y*width + x];
                }/*ii*/
            }/*jj*/

            p_output[j*width + i] = sum;
        }/*for i*/
    }/*for j*/

    return 0;
}/*ConvolutionSerialCPU*/
```
Note that the "if" block in the innermost loop is a performance killer due to branch misprediction:
```c
if ((x < 0 || x >= width) || (y < 0 || y >= height)) {
    continue;
}/*out of bound*/
```
The instruction pipeline in the CPU is optimized for sequential code; it has to be stalled and flushed when a conditional branch is mispredicted.
To avoid the branch, I pad the source data; I call the result the "extended source". The code to extend the input is:
```c
kernel_radius = kernel_length/2; /* assume kernel_length is an odd number */
extended_width = width + 2 * kernel_radius;
extended_height = height + 2 * kernel_radius;

p_extended_data = (float*)malloc(
    extended_width*extended_height * sizeof(float));

for (j = 0; j < extended_height; j++) {
    for (i = 0; i < extended_width; i++) {

        if (i < kernel_radius || i >= (kernel_radius + width)
            || j < kernel_radius || j >= (kernel_radius + height))
        {
            p_extended_data[i + extended_width*j] = 0;
        } else {
            int ii, jj;
            jj = j - kernel_radius;
            ii = i - kernel_radius;
            p_extended_data[i + extended_width*j]
                = p_input_data[jj*width + ii];
        }/*if */
    }/*for i*/
}/*for j*/
```
And the convolution loop that deals with the extended input becomes simpler:
```c
extended_width = width + kernel_length - 1;

for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {

        int ii, jj;
        float sum;
        sum = 0;

        for (jj = 0; jj < kernel_length; jj++) {
            for (ii = 0; ii < kernel_length; ii++) {
                int x, y;
                x = i + ii;
                y = j + jj;

                sum += p_kernel[kernel_length*jj + ii]
                    * p_extended_input[extended_width*y + x];
            }/*for ii*/
        }/*for jj*/

        p_output[j*width + i] = sum;
    }/*for i*/
}/*for j*/
```
For 1200x1200 source data with a 15x15 kernel, over 10 rounds, the run times are:
ConvolutionSerialCPU : 3056 ms
ConvolutionSerialExtensionCPU : 2696 ms
It is a typical example of "trading space for time". The performance increases by 13.4%; the side effect is wasting some memory on zeros, but I think it is worthwhile.
壹. Optimization for SSE 1 :
The SSE1 instruction set features operations on single-precision floating point, so I adopt it for the "multiply then sum" procedure inside the innermost loop. See the following figure:
It takes 4 instructions in total.
The corresponding code is :
```c
#include <xmmintrin.h>
:
for (ii = 0; ii < steps; ii++) {

    __m128 m_kernel, m_src;
    __m128 m_temp0, m_temp1, m_temp2, m_temp3, m_temp4;
    float temp_sum;
    int x, y;

    x = i + ii*step_size;
    y = j + jj;

    m_kernel = _mm_loadu_ps(
        p_kernel + kernel_length*jj + ii*step_size);
    m_src = _mm_loadu_ps(
        p_extended_input + y*extended_width + x);

    m_temp0 = _mm_mul_ps(m_kernel, m_src);               /* p0, p1, p2, p3 */
    m_temp1 = _mm_movehl_ps(_mm_setzero_ps(), m_temp0);  /* p2, p3, 0, 0 */
    m_temp2 = _mm_add_ps(m_temp0, m_temp1);              /* p0+p2, p1+p3, .. */
    m_temp3 = _mm_shuffle_ps(m_temp2, _mm_setzero_ps(),
        _MM_SHUFFLE(0, 0, 0, 1));                        /* p1+p3, p0+p2, 0, 0 */
    m_temp4 = _mm_add_ps(m_temp3, m_temp2);              /* p0+p1+p2+p3 in element 0 */

    temp_sum = _mm_cvtss_f32(m_temp4);
    sum += temp_sum;
}/*for ii*/
```
Using the same test size and number of rounds, the run time is 1354 ms. Compared with the serial extension version, it is improved to 199.7%.
If we scrutinize the code, we find some places for improvement. For example, the address calculations are complicated and sit inside the inner loop:
```c
x = i + ii*step_size;
y = j + jj;

m_kernel = _mm_loadu_ps(
    p_kernel + kernel_length*jj + ii*step_size);
m_src = _mm_loadu_ps(
    p_extended_input + y*extended_width + x);
```
We can instead use two pointers to hold the current addresses of the source and the kernel, and advance them with '+='. The corresponding code is:
```c
int y;
y = j;

for (jj = 0; jj < kernel_length; jj++) {

    float *p_mov_input;
    float *p_mov_kernel;

    p_mov_kernel = p_kernel + kernel_length*jj;
    p_mov_input = p_extended_input + y*extended_width + i;

    for (ii = 0; ii < steps; ii++) {

        __m128 m_kernel, m_src;
        __m128 m_temp0, m_temp1, m_temp2, m_temp3, m_temp4;
        float temp_sum;

        m_kernel = _mm_loadu_ps(p_mov_kernel);
        m_src = _mm_loadu_ps(p_mov_input);

        m_temp0 = _mm_mul_ps(m_kernel, m_src);
        m_temp1 = _mm_movehl_ps(_mm_setzero_ps(), m_temp0);
        m_temp2 = _mm_add_ps(m_temp0, m_temp1);
        m_temp3 = _mm_shuffle_ps(m_temp2, _mm_setzero_ps(),
            _MM_SHUFFLE(0, 0, 0, 1));
        m_temp4 = _mm_add_ps(m_temp3, m_temp2);

        temp_sum = _mm_cvtss_f32(m_temp4);
        sum += temp_sum;

        p_mov_kernel += step_size;
        p_mov_input += step_size;
    }/*for ii*/

    y += 1; /* advance to the next row of the extended source */
}/*for jj*/
```
The address computation has been reduced from two multiplications and two additions to a single addition.
Using the same test size and rounds, the run time is 1211 ms: improved to 222.6% of the serial extension version.
貳. Optimization for SSE 3.
SSE3 features horizontal operations; can we exploit them in computing the sum? The procedure is shown in the figure below.
The code is :
```c
#include <pmmintrin.h>
:
for (ii = 0; ii < steps; ii++) {

    __m128 m_kernel, m_src, m_dst;
    __m128 m_temp0, m_temp1, m_temp2, m_temp3;
    float temp_sum;

    m_kernel = _mm_loadu_ps(p_mov_kernel);
    m_src = _mm_loadu_ps(p_mov_input);

    m_temp0 = _mm_mul_ps(m_kernel, m_src);              /* a0, a1, a2, a3 */
    m_temp1 = _mm_movehdup_ps(m_temp0);                  /* a1, a1, a3, a3 */
    m_temp2 = _mm_add_ps(m_temp0, m_temp1);              /* a0+a1, a1+a1, a2+a3, a3+a3 */
    m_temp3 = _mm_movehl_ps(_mm_setzero_ps(), m_temp2);  /* a2+a3, a3+a3, 0, 0 */
    m_dst = _mm_add_ps(m_temp2, m_temp3);                /* a0+a1+a2+a3 in element 0 */

    temp_sum = _mm_cvtss_f32(m_dst);
    sum += temp_sum;

    p_mov_kernel += step_size;
    p_mov_input += step_size;
}/*for ii*/
```
The performance turns out to be about the same as the SSE1 version. Here, I analyze why, by counting the throughput of each instruction in the two sequences:
[Figure: per-instruction throughput counts of the two sequences, e.g. _mm_add_ps (1)]
※ Counting throughput to predict performance is very inaccurate; it can only explain a few limited cases and is not suitable for forecasting the improvement.
※ You can refer to here for the definitions of latency and throughput.
Below is another SSE3 version, which uses the horizontal-add instruction, but it is a little worse (1311 ms, only 205.6%):
```c
:
m_temp0 = _mm_mul_ps(m_kernel, m_src);                    /* a0, a1, a2, a3 */
m_temp1 = _mm_hadd_ps(m_temp0, _mm_setzero_ps());         /* a0+a1, a2+a3, 0, 0 */
m_temp2 = _mm_hadd_ps(m_temp1, _mm_setzero_ps());         /* a0+a1+a2+a3, 0, 0, 0 */
temp_sum = _mm_cvtss_f32(m_temp2);
:
```
參. Optimization for SSE4 :

SSE4.1 provides the dot-product instruction _mm_dp_ps, which performs the "multiply then sum" step in a single instruction. The core of the inner loop becomes:

```c
#include <smmintrin.h>
:
/* 0xf1: multiply all 4 element pairs and store the sum in element 0 */
m_temp1 = _mm_dp_ps(m_kernel, m_src, 0xf1);
temp_sum = _mm_cvtss_f32(m_temp1);
:
```
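As a sketch (my reconstruction under the same moving-pointer scheme as the SSE1 version, not necessarily the exact downloadable code), the inner loop then looks like:

```c
#include <smmintrin.h>  /* SSE4.1 */
:
for (ii = 0; ii < steps; ii++) {

    __m128 m_kernel, m_src, m_dp;

    m_kernel = _mm_loadu_ps(p_mov_kernel);
    m_src = _mm_loadu_ps(p_mov_input);

    /* 0xf1: multiply all 4 pairs and put the sum into element 0 */
    m_dp = _mm_dp_ps(m_kernel, m_src, 0xf1);
    sum += _mm_cvtss_f32(m_dp);

    p_mov_kernel += step_size;
    p_mov_input += step_size;
}/*for ii*/
```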
肆. Optimization for AVX :

An AVX register holds 8 floats, so each kernel row is processed in 8-wide steps first; the remainder of the row is handled with the SSE4 procedure above. The step sizes are computed as:

```c
int step_size_avx;
int steps_avx;
int remainder_avx;

int step_size_sse;
int steps_sse;
int remainder_sse;
:
step_size_avx = sizeof(__m256) / sizeof(float);   /* 8 */
steps_avx = kernel_length / step_size_avx;
remainder_avx = kernel_length % step_size_avx;

step_size_sse = sizeof(__m128) / sizeof(float);   /* 4 */
steps_sse = remainder_avx / step_size_sse;
remainder_sse = remainder_avx % step_size_sse;
```
If the remainder modulo 8 is greater than or equal to 4, I use the SSE4 procedure to deal with those data (for example, with kernel_length = 15: steps_avx = 1, remainder_avx = 7, steps_sse = 1, remainder_sse = 3):
```c
for (ii = 0; ii < steps_sse; ii++) {

    __m128 m_kernel, m_src;
    __m128 m_temp0, m_temp1;
    float temp_sum;

    m_kernel = _mm_loadu_ps(p_mov_kernel);
    m_src = _mm_loadu_ps(p_mov_input);

    m_temp1 = _mm_dp_ps(m_kernel, m_src, 0xf1);
    temp_sum = _mm_cvtss_f32(m_temp1);

    sum += temp_sum;
}/*(kernel_length%8)/4 > 0*/
```
The 8-wide portion of each kernel row is handled with _mm256_dp_ps:

```c
#include <immintrin.h>
:
__m256 m256_kernel, m256_src;
float temp_sum;

m256_kernel = _mm256_loadu_ps(p_mov_kernel);
m256_src = _mm256_loadu_ps(p_mov_input);

{
    __m256 m256_temp1;

    /* _mm256_dp_ps computes a 4-wide dot product in each 128-bit lane */
    m256_temp1 = _mm256_dp_ps(m256_kernel, m256_src, 0xf1);

    {
        __m128 m128_low, m128_high;
        __m128 m128_sum;

        m128_low = _mm256_castps256_ps128(m256_temp1);
        m128_high = _mm256_extractf128_ps(m256_temp1, 1);
        m128_sum = _mm_add_ps(m128_low, m128_high);

        temp_sum = _mm_cvtss_f32(m128_sum);
    }
}
```
The run time is 1206 ms (223.5%), a little worse than the SSE4 version.
Another AVX attempt :
```c
#include <immintrin.h>
:
__m256 m256_kernel, m256_src;
__m256 m256_temp0;
__m128 m128_temp0, m128_temp1, m128_temp2, m128_temp3;
__m128 m128_temp4, m128_temp5, m128_temp6;
float temp_sum;

m256_kernel = _mm256_loadu_ps(p_mov_kernel);
m256_src = _mm256_loadu_ps(p_mov_input);

m256_temp0 = _mm256_mul_ps(m256_kernel, m256_src);

m128_temp0 = _mm256_castps256_ps128(m256_temp0);     /* low 4 products */
m128_temp1 = _mm256_extractf128_ps(m256_temp0, 1);   /* high 4 products */
m128_temp2 = _mm_add_ps(m128_temp0, m128_temp1);

#if(0) // crash !!
m128_temp3 = _mm_movehl_ps(_mm_setzero_ps(), m128_temp2);
#else
m128_temp3 = _mm_shuffle_ps(m128_temp2, _mm_setzero_ps(),
    _MM_SHUFFLE(0, 0, 3, 2));
#endif

m128_temp4 = _mm_add_ps(m128_temp2, m128_temp3);
m128_temp5 = _mm_shuffle_ps(m128_temp4, _mm_setzero_ps(),
    _MM_SHUFFLE(0, 0, 0, 1));
m128_temp6 = _mm_add_ps(m128_temp4, m128_temp5);

temp_sum = _mm_cvtss_f32(m128_temp6);
```
I use the AVX instruction to multiply two packs of 8 floats, split the result into two 128-bit (SSE) packets, and then use the SSE procedure to combine the two packets.
※ The lines in the #if and #else branches are equivalent. But if I use the line containing _mm_movehl_ps instead of _mm_shuffle_ps, the release build from VS2015 reports that the instruction is illegal and crashes, and I do not know why.
The run time is 1042 ms, improved to 258.7%.
The version below uses the AVX horizontal add:
```c
#include <immintrin.h>
:
__m256 m256_kernel, m256_src;
__m256 m256_temp0;
__m256 m256_temp1, m256_temp2;
float temp_sum;

m256_kernel = _mm256_loadu_ps(p_mov_kernel);
m256_src = _mm256_loadu_ps(p_mov_input);

m256_temp0 = _mm256_mul_ps(m256_kernel, m256_src);
m256_temp1 = _mm256_hadd_ps(m256_temp0, _mm256_setzero_ps());
m256_temp2 = _mm256_hadd_ps(m256_temp1, _mm256_setzero_ps());

{
    __m128 m128_low, m128_high;
    __m128 m128_sum;

    m128_low = _mm256_castps256_ps128(m256_temp2);
    m128_high = _mm256_extractf128_ps(m256_temp2, 1);
    m128_sum = _mm_add_ps(m128_low, m128_high);

    temp_sum = _mm_cvtss_f32(m128_sum);
}
```
伍. Benchmark:
All data are from 10 runs; the run-time unit is milliseconds.
1000x1000
instruction\kernel | 9x9 | 15x15 | 21x21 | 31x31 |
---|---|---|---|---|
Serial extension | 568 | 1929 | 3990 | 8815 |
SSE | 300 | 845 | 1611 | 3474 |
SSE3 shuffle | 292 | 852 | 1576 | 3506 |
SSE4 | 242 | 816 | 1429 | 3256 |
AVX shuffle | 206 | 727 | 1323 | 3058 |
2000x2000
instruction\kernel | 9x9 | 15x15 | 21x21 | 31x31 |
---|---|---|---|---|
Serial extension | 2185 | 7691 | 15511 | 34821 |
SSE | 1150 | 3401 | 5927 | 13571 |
SSE3 shuffle | 1174 | 3400 | 5435 | 13572 |
SSE4 | 971 | 3248 | 4917 | 12549 |
AVX shuffle | 841 | 2896 | 4849 | 10864 |
4000x4000
instruction\kernel | 9x9 | 15x15 | 21x21 | 31x31 |
---|---|---|---|---|
Serial extension | 8733 | 30899 | 62712 | 141096 |
SSE | 4674 | 13816 | 24655 | 54979 |
SSE3 shuffle | 4694 | 13629 | 23279 | 54079 |
SSE4 | 3953 | 12995 | 21922 | 49503 |
AVX shuffle | 3471 | 11525 | 19606 | 43701 |
Note that the larger the kernel size, the more significant the improvement (AVX improves to over 300%). This is natural: the larger the kernel, the smaller the share of preparation overhead relative to the computation of the sum.
The key points for using vectorization instructions to improve performance are:
零. Avoid conditional branches in performance-critical code; they slow it down immensely.
一. Reduce the computation of memory addresses; it clearly affects performance.
二. Try several procedures to achieve the same goal. Some single instructions are sluggish and may be slower than a sequence of simpler instructions.
The performance is not bad, right? Yes, but the optimization is not finished: if the kernel size is fixed, there is still room for improvement. For that, please refer to my next post.
References :
"Fastest way to do horizontal float vector sum on x86", Stack Overflow
Agner Fog's Optimization Manuals