High-Performance Computing, as applied to numerical modelling, is the art of optimizing numerical approximations to mathematical models on modern compute hardware, in service of professionals across many fields of science and engineering. Numerical approximations are of interest because, in general, realistic problems have no analytical solution that can be derived with pen and paper. Advances in computer hardware have led to stunningly realistic models that allow researchers to investigate phenomena such as how proteins fold, how the human heart beats, and how stars are formed. But how do we know that our numerical predictions are correct, and not simply an artifact of a bug in the code?
First and foremost, it is important to have a few simple test cases with a (pseudo-)analytical result that is accurate to within the limitations of the numerical approximation used. This can be very difficult, especially because the test must be set up in a way that minimizes the expected numerical errors. This post will therefore focus on consistency tests, which are much easier to design and automate. Although none of the following are always satisfied, scientific numerical models should:
- Give the same result when run twice with an identical setup
- Give the same result when run on different hardware types
- Give the same result when split over multiple machines versus run on a single machine
Unfortunately, “give the same result” does not always mean “expect a binary match”. In all cases, some level of understanding of the expected result will be required to decide what is “similar enough”.
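What "similar enough" means is problem-dependent, but the comparison machinery itself is simple. Below is a minimal sketch combining an absolute tolerance (for values near zero) with a relative tolerance (for large values); the function name and tolerance values are illustrative, and real tolerances must be chosen by someone who understands the expected numerical error:

```python
def similar_enough(a, b, rel_tol=1e-6, abs_tol=1e-12):
    """True if every pair of values agrees to a combined tolerance."""
    assert len(a) == len(b), "results must have the same shape"
    return all(
        abs(x - y) <= max(abs_tol, rel_tol * max(abs(x), abs(y)))
        for x, y in zip(a, b)
    )

print(similar_enough([1.0, 2.0], [1.0, 2.0 + 1e-7]))  # True
print(similar_enough([1.0, 2.0], [1.0, 2.1]))         # False
```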
Automated analytical tests are difficult to set up and assess because numerical approximations do not typically exhibit an exact match with the analytical solution. Instead, they converge to within a certain level of error. Many numerical methods have books dedicated to the expected errors, how to mitigate them, and how to determine upper bounds for the error. Testing at this level is often done by having an expert in the field manually assess the result; the result is then saved as-is, and all future versions of the code are expected to reproduce it. I do not like this method because an improvement to an algorithm can result in unnecessarily failed tests. It is worth automating analytical tests by extracting the desired information from the results whenever possible (e.g., extract a single scalar of interest instead of saving a 3D example output as a reference). If that is not possible, then the method of assessing the correctness of the reference result should be well documented.
A better class of analytical tests ensures that properties such as reciprocity, symmetry, and image theory are satisfied. One of my favorite tests is for perfectly planar reflecting boundary conditions in wave equation modelling. Analytically, an infinite planar Neumann or Dirichlet boundary can be replaced by a mirror image (with a possible minus sign) of the setup on one side of the plane. This holds no matter how complicated the model is. Therefore, two different tests can be set up that are expected to produce the same result, including the numerical errors in the wave propagation. The only difference should come from the error in the implementation of the reflective boundary, which is prone to shifts in location if implemented incorrectly.
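A toy illustration of the idea (a sketch, not our production test): a 1D wave equation with a simple second-order explicit scheme, where a Dirichlet (u = 0) boundary at one end of the domain is compared against a doubled domain containing a sign-flipped mirror image of the setup. With this particular ordering of the stencil arithmetic, the mirrored run only ever negates values (negation is exact in floating point), so the two runs should agree to the last bit on the shared half of the domain:

```python
def step(u, u_prev, c2):
    """One explicit time step; both end points held fixed (Dirichlet)."""
    u_next = u[:]
    for i in range(1, len(u) - 1):
        lap = (u[i + 1] + u[i - 1]) - 2.0 * u[i]
        u_next[i] = 2.0 * u[i] - u_prev[i] + c2 * lap
    return u_next

def run(u0, steps, c2=0.25):
    u_prev, u = u0[:], u0[:]
    for _ in range(steps):
        u_prev, u = u, step(u, u_prev, c2)
    return u

n, steps = 50, 200
pulse = [0.0] * (n + 1)
pulse[n // 2] = 1.0            # initial disturbance away from the boundary

# Half-domain run: index 0 is the reflecting (Dirichlet, u = 0) boundary.
half = run(pulse, steps)

# Full-domain run: mirror the setup with a minus sign about index n.
full0 = [0.0] * (2 * n + 1)
for i in range(1, n + 1):
    full0[n + i] = pulse[i]
    full0[n - i] = -pulse[i]
full = run(full0, steps)

# The mirrored run reproduces the boundary run exactly, bit for bit.
print(all(half[i] == full[n + i] for i in range(n + 1)))  # True
```

In a production code the boundary implementation would differ from the interior stencil, and a coherent, spatially shifted mismatch between the two runs would point at a bug in that boundary.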
In summary, analytical tests are essential, but should be implemented in a way that allows for improvements to numerical accuracy and minimizes the need for domain experts to repeatedly manually assess the result.
Expected Result of Multiple Runs with an Identical Setup
One of the simplest automated tests is to run the same numerical experiment twice, and assert that the same answer is reproduced. In many cases, a binary match is expected. It is worth putting in extra effort to ensure a binary match is achievable.
It is common for parallel algorithms involving a reduction to give a different result depending on the order of operations because floating-point arithmetic is not associative and scheduling is non-deterministic. Note that if the order of operations remains fixed, a parallel implementation of a reduction should give a binary match when run multiple times.
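The non-associativity is easy to demonstrate with ordinary doubles; the values below are contrived so the effect is large rather than subtle. The two expressions mimic two different reduction schedules over the same data:

```python
# Summing identical values in two different orders, as two different
# reduction schedules might.
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((values[0] + values[1]) + values[2]) + values[3]
pairwise = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right)  # 1.0 -- the first 1.0 is absorbed by 1e16
print(pairwise)       # 2.0 -- the large terms cancel before the small ones add
```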
A second exception is algorithms that require pseudo-random number generation. In that case, I recommend consistently seeding the random number generation on every processing unit to maintain reproducible results.
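A minimal sketch of the idea, assuming each processing unit can derive a private generator from a global seed and its rank (the seeding scheme here is illustrative, not a recommendation for statistical quality):

```python
import random

def worker_stream(global_seed, rank, n):
    """A reproducible, per-rank pseudo-random stream."""
    rng = random.Random(f"{global_seed}/{rank}")  # private generator per rank
    return [rng.random() for _ in range(n)]

run1 = [worker_stream(1234, rank, 4) for rank in range(3)]
run2 = [worker_stream(1234, rank, 4) for rank in range(3)]

print(run1 == run2)        # True: bit-identical across runs
print(run1[0] == run1[1])  # False: each rank gets its own stream
```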
Finally, we have found that unaligned allocations on the CPU can have run-to-run variations if that memory is operated on within a vectorized loop. The size of the remainder loop (the leftover elements when the number of iterations is not a multiple of the SIMD vector length) changes depending on the alignment of the allocated memory and the remainder loop produces a subtly different numerical result. Solutions include disabling vectorization for that loop (#pragma novector), vectorizing the remainder loop (#pragma vector vecremainder), or aligning the memory.
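Python has no SIMD remainder loops, but the mechanism can be imitated: sum a buffer with fixed-width "lanes" plus scalar peel and remainder loops, and vary where the aligned body starts. Shifting the split point reorders the additions exactly the way a different allocation alignment does; the data below is contrived so the resulting difference is large rather than subtle:

```python
def chunked_sum(data, width, offset):
    """Sum with a scalar peel loop, lane-wise body, and scalar remainder."""
    total = 0.0
    for x in data[:offset]:              # peel loop before the aligned body
        total += x
    body = data[offset:]
    n = len(body) - len(body) % width    # elements covered by full chunks
    lanes = [0.0] * width                # one accumulator per SIMD lane
    for i in range(0, n, width):
        for j in range(width):
            lanes[j] += body[i + j]
    for x in body[n:]:                   # remainder loop
        total += x
    for lane in lanes:                   # horizontal reduction of the lanes
        total += lane
    return total

# Contrived values chosen so the order of additions visibly matters.
data = [1e16, 1.0, -1e16, 1.0]
aligned = chunked_sum(data, width=2, offset=0)  # lanes pair 1e16 with -1e16
shifted = chunked_sum(data, width=2, offset=1)  # the 1.0s are absorbed by 1e16
print(aligned, shifted)  # 2.0 0.0
```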
This type of test is commonly used during our optimization passes as an initial assessment because it is easily automated, requires no domain knowledge, and fails if a race condition or out-of-bounds access error is introduced. Our overnight test suite runs small test cases 3 or 4 times to help catch uncommon race conditions.
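The harness itself can be trivial. A sketch, where `simulate` is a hypothetical stand-in for the real model entry point:

```python
import random

def simulate(seed=42):
    """Hypothetical stand-in for a small deterministic model run."""
    rng = random.Random(seed)              # seeded, so runs are comparable
    state = [rng.random() for _ in range(100)]
    return sum(x * x for x in state)

# Run the same small case several times and insist on a binary match.
results = [simulate() for _ in range(4)]
print(all(r == results[0] for r in results))  # True
```

This serial version passes trivially; the value of the harness is that a parallel `simulate` with a race condition or non-deterministic reduction order will eventually fail it.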
Expected Result When Solving Problems with Different Compute Architectures
Our applications are usually developed to run on x64 CPUs and NVIDIA GPUs. Much of the code is structured to keep high level portions of the algorithm in common base classes, but the core of the algorithm ends up being written twice. Furthermore, data access patterns and optimization details differ between the two platforms. Therefore, the likelihood of an identical bug being written in both versions of the code is greatly reduced. For consistency, we run automated tests to ensure both implementations of the code produce similar results.
For different hardware types, a binary match is not expected. In fact, the differences between CPU and GPU can be substantial, and it is difficult to assess whether the two implementations are close enough. I find that acceptable differences should be relatively incoherent, and that coherent differences are usually caused by bugs. Architectures with and without support for fused multiply-add (FMA) tend to show the largest differences. Production code should obviously use FMAs, but the NVIDIA compiler flag to disable them (--fmad=false) and the Intel equivalent (-no-fma) are handy when investigating questionable cases.
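One possible heuristic for quantifying "coherent" (an illustration, not the exact method we use): rounding noise tends to flip sign from sample to sample, while a bug such as a shifted stencil tends to produce long same-sign runs in the difference between the two results:

```python
def longest_same_sign_run(a, b, eps=0.0):
    """Length of the longest run of consecutive same-sign differences."""
    run = best = prev_sign = 0
    for x, y in zip(a, b):
        d = x - y
        sign = (d > eps) - (d < -eps)    # -1, 0, or +1
        run = run + 1 if (sign != 0 and sign == prev_sign) else (1 if sign else 0)
        prev_sign = sign
        best = max(best, run)
    return best

# Incoherent (sign-flipping) noise versus a coherent, one-sided offset.
incoherent = longest_same_sign_run([1.0, 2.0, 3.0, 4.0],
                                   [1.0 + 1e-9, 2.0 - 1e-9, 3.0 + 1e-9, 4.0 - 1e-9])
coherent = longest_same_sign_run([1.0, 2.0, 3.0, 4.0],
                                 [0.9, 1.9, 2.9, 3.9])
print(incoherent, coherent)  # 1 4
```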
I strongly recommend against running hardware comparison tests on random data, or any data that is not designed to produce a meaningful result. Numerical algorithms are expected to have accuracy limitations and suffer from floating point errors, but nothing shows that better than taking a high-order derivative of meaningless data. Don’t waste your time trying to justify why different architectures do not match in these cases.
Expected Result When Solving Problems on Differing Numbers of Processing Units
Real-world numerical models are often split over multiple processing units because they are too large to fit on a single one. Here I am referring to processing units in general: individual cores on a CPU, NUMA nodes, discrete GPU chips, or whatever your base computation unit is. Assuming you are able to get an exact binary match for multiple runs on a single device, you should be able to get an exact binary match across multiple devices. The only caveat is that, once again, you need to be careful of how remainder loops change when the numerical model is split over multiple devices. The easiest solution is to ensure that optimized code is padded and aligned; otherwise, the remainder loop should be vectorized with compiler directives.
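A toy version of the split-versus-single test, assuming a 1D three-point stencil and a one-cell halo exchanged between two halves. Because every output element sees exactly the same inputs combined in the same order, the stitched result should match the single-device run bit for bit:

```python
def smooth(u):
    """Three-point stencil over the interior; end points copied unchanged."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i + 1]) * 0.5 + u[i]
    return out

data = [((i * 37) % 11) / 7.0 for i in range(20)]

# Single-device run.
single = smooth(data)

# Two-device run: each half carries one halo cell from its neighbor.
mid = len(data) // 2
left = smooth(data[: mid + 1])    # one halo cell from the right half
right = smooth(data[mid - 1 :])   # one halo cell from the left half
split = left[:-1] + right[1:]     # drop the halo cells when stitching

print(split == single)  # True: bit-for-bit match
```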
Consistency tests often expose the strangest bugs: the ones that cause late-time NaNs, but only for specific sizes, and the ones that cause a segmentation fault on your customer's hardware but work just fine for you. An example bug that we found this week was introduced during the optimization of GPU code for modelling waves in an anisotropic material. For anisotropic modelling, several terms in the underlying differential equation have no effect in isotropic regions and a subtle effect in anisotropic ones. Data was improperly loaded for a subset of threads in a way that only affected one of those subtle terms in a small subset of the total volume. As such, the code passed analytical tests but failed consistency tests. It took two developers more than a day to track down, but detecting and fixing peculiar bugs like that increases confidence in both the code and the test suite.