Single Instruction, Multiple Data (SIMD) execution is a common feature of modern CPUs. It improves the throughput of repeated arithmetic by applying one instruction to several data elements at once on specialized vector hardware. Because each vector instruction processes several lanes of data, a vectorized loop needs substantially fewer iterations. When paired with work distribution between threads, the throughput gained can be even more dramatic.

OpenMP has supported SIMD compilation since version 4.0, published in 2013. The extension allows loops to be compiled for vectorized execution with the simd construct.

A loop can be vectorized by preceding it with a #pragma omp simd directive, just as you would designate a multi-threaded loop with #pragma omp parallel for. In fact, you can do both, creating a loop that splits its iteration space among the threads of a team and has each thread process its share with vectorized instructions. As with parallel loops, designating a loop for vectorization imposes extra restrictions on the kinds of operations that can occur inside it. The most important restriction concerns control flow: execution may not branch into or out of the loop body (a break, for example, is not allowed), and in practice only simple if/else constructs vectorize well.

#pragma omp simd
for (int i = 0; i < vec.size(); ++i)
    if (vec[i] > 0)
        vec[i] += x;
    else
        vec[i] += y;

The compiler typically handles a branch like this by masking the vector operation: it performs the operation for all lanes of the SIMD unit, but writes the results for only some of them.
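
To make the masking idea more concrete, here is a minimal sketch that writes the same branch in a "compute both, then select" form, which is roughly what a masked vector operation does per lane. It also uses the combined parallel for simd construct, so the loop is split among threads and each thread's share is vectorized. The function name add_by_sign and the choice of float elements are assumptions made for illustration; a compiler is free to generate different code than this suggests.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: adds x to positive elements and y to all others.
void add_by_sign(std::vector<float>& vec, float x, float y) {
    float* data = vec.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(vec.size());

    // Threads split the iteration space; each thread's chunk is vectorized.
    #pragma omp parallel for simd
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        // Both candidate results are computed for every element...
        const float if_positive = data[i] + x;
        const float otherwise = data[i] + y;
        // ...and the comparison acts as a per-lane mask that picks one.
        data[i] = (data[i] > 0.0f) ? if_positive : otherwise;
    }
}
```

If the code is built without OpenMP enabled (for GCC and Clang, that means omitting -fopenmp), the pragma is ignored and the loop simply runs as ordinary scalar code with the same result.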

For loops that are memory-bound, the speed at which concurrently executed loads and stores complete depends on how contiguous and how well aligned the accesses are. Ideally, no vectorized load or store should cross a cache line boundary. Scattered or misaligned memory accesses will result in worse performance. OpenMP lets you assert the alignment of variables with the aligned clause, which enables the compiler to use faster aligned load/store instructions within the loop.
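
As a rough illustration of the aligned clause, the sketch below promises the compiler that both pointers are 64-byte aligned. The function name scale_add and the 64-byte figure (a common cache line size) are assumptions for this example. The clause is a promise made by the programmer, not a check: passing a pointer that is not actually aligned this way results in undefined behaviour.

```cpp
#include <cstddef>

// Hypothetical helper: computes a[i] += s * b[i] for i in [0, n).
// Assumption: the caller guarantees a and b are both 64-byte aligned.
void scale_add(float* a, const float* b, float s, std::size_t n) {
    // aligned(list : alignment) asserts the alignment of a and b in bytes.
    #pragma omp simd aligned(a, b : 64)
    for (std::size_t i = 0; i < n; ++i)
        a[i] += s * b[i];
}
```

A caller could satisfy the promise by declaring its buffers with alignas(64), for example alignas(64) float a[16];.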

Dependencies within the iteration space can also cause problems in SIMD loops, as in the loop below:

#pragma omp simd
for (int i = 0; i < vec.size() - 7; ++i)
    vec[i] += vec[i+7];

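Iteration i reads the element that iteration i+7 later overwrites, so iterations seven or more apart must not execute in the same vector. If the vector length is greater than seven, this loop may give unexpected results. OpenMP does allow control over this behaviour through the safelen clause, which sets a maximum safe vector length for the loop.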
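
A minimal sketch of how safelen might be applied to this loop follows. Because iterations seven apart touch the same element, the example caps the number of iterations that may run concurrently at seven. The function name add_shifted, the int element type, and the early return for short vectors (which avoids the unsigned wrap-around of vec.size() - 7) are assumptions added for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: adds to each element the value seven places ahead.
void add_shifted(std::vector<int>& vec) {
    // size() is unsigned, so size() - 7 would wrap around for short vectors.
    if (vec.size() <= 7) return;

    int* data = vec.data();
    const std::size_t n = vec.size() - 7;

    // Iterations i and i + 7 touch the same element, so at most seven
    // consecutive iterations may be executed together in one vector.
    #pragma omp simd safelen(7)
    for (std::size_t i = 0; i < n; ++i)
        data[i] += data[i + 7];
}
```

Note that safelen only sets an upper bound; the compiler may still choose a shorter vector length, such as four, that suits the hardware.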

Function calls inside a vectorized loop should also be considered. Normally, calls to functions won’t get vectorized unless the called function is inlined. OpenMP offers a way to fix this, with the declare simd construct. The compiler builds both a scalar and a vector version of declare simd functions, such that a vectorized loop can call the correct version. To be valid, vectorized functions must meet the same requirements as vectorized loops.
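
The following sketch shows one way declare simd could be used. The function clamp_unit and its caller clamp_all are hypothetical names chosen for this example; the point is only that the directive asks the compiler to also generate a vector variant of the function, which the simd loop can then call without relying on inlining.

```cpp
#include <cstddef>
#include <vector>

// Ask the compiler to build a vector version alongside the scalar one.
#pragma omp declare simd
float clamp_unit(float v) {
    // Simple branching only, so the body can be handled with masking.
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// Hypothetical caller: clamps every element of vec to the range [0, 1].
void clamp_all(std::vector<float>& vec) {
    float* data = vec.data();
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(vec.size());

    #pragma omp simd
    for (std::ptrdiff_t i = 0; i < n; ++i)
        data[i] = clamp_unit(data[i]);  // may resolve to the vector variant
}
```

Whether the vector variant is actually used depends on the compiler and target; vectorization reports (for example, GCC's -fopt-info-vec) can help confirm what was generated.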

While most modern compilers support automatic vectorization of loops, detecting dependencies is still a challenge. Using OpenMP gives the control back to the programmer to vectorize loops in these situations. As with parallel constructs, skillful use of SIMD constructs can prove a useful addition to an optimizer’s toolkit.