Software optimization has always been one of my favourite tasks in software development. However, the hardware you are optimizing for, is a moving target (unless of course you are doing retroprogramming/oldskool demoscening, where you have nicely fixed targets). That nice algorithm that you fine-tuned last year? It may not be all that optimal for today’s CPUs anymore.
One area where this was most apparent, was in the move from single-core CPUs to multi-core CPUs, in the early 2000s. Prior to the first commonly available consumer dualcore x86 CPUs from Intel and AMD, multi-core/multi-CPU systems were very expensive and were mainly used in the server and supercomputer markets. Most consumers would just have single-core systems, and this is also what most developers would target.
The thing with optimizing tasks for single-core systems is that it is a relatively straightforward process (pun intended). That is, in most cases, you can concentrate on just getting a single job done as quickly as possible. This job will consist of a sequence of 1 or more processing steps, which will be executed one after another in a single thread. Take for example the decoding of a JPEG image. There are a number of steps in decoding an image, roughly:
- Huffman decoding
- Zero run-length decoding
- Inverse DCT
- De-zigzag the 8×8 blocks
- Upscaling U and V components
- YUV->RGB conversion
There can be only one
For a single-threaded solution, most of the optimization will be about making each individual step run as quickly as possible. Another thing to optimize is making the transition from one step to the next as efficient as possible, using the most optimal data flow with the least amount of copying or transforming data between steps.
But that is about it. If you want to decode 10 JPEG images, or even 100 JPEG images, you will process them one after another. Regardless of how many images you want to process, the code is equally optimal in every case. There are some corner-cases though, since even a single-core system consists of various bits of hardware which may be able to process in parallel. For example, you could have disk transfers that can be performed in the background via DMA. Your OS might provide APIs to perform asynchronous disk access with this functionality. Or your system may have a GPU that can run in parallel with the CPU. But let us stick to just the CPU-part for the sake of this article.
How does one parallelize their code? That is the big question. And there is a reason why the title is a question. I do not pretend to have an answer. What I do have however, is various ideas and concepts, which I would like to discuss. These may or may not apply to the problem you are currently trying to solve.
When you are used to optimizing for single-core systems, then intuitively you might take the same approach to parallelization: you will try to make each step of the process as fast as possible, by applying parallel algorithms where possible, and trying to use as many threads/cores as you can, to maximize performance. This is certainly a valid approach in some cases. For example, if you want to decode a single JPEG image as quickly as possible, then this is the way to do it. You will get the lowest latency for a single image.
However, you will quickly run into Amdahl’s law this way: not every step can be parallelized to the same extent. After the zero-length decoding, you can process the data in 8×8 blocks, which is easy to parallelize. However, the Huffman decoding is very difficult to parallelize, for the simple reason that each code in the stream has a variable length, so you do not know where the next code starts until you have decoded the previous one. This means that you will not make full use of all processing resources in every step. Another issue is that you need to have explicit synchronization between the different steps now. For example, the 8×8 blocks are separated into Y, U and V components. But at the end, when you want to convert from YUV to RGB, you need to have all three components decoded before you can do the conversion. Instead of just waiting for a single function to return a result, you may now need to wait for all threads to have completed their processing, causing extra overhead/inefficiency.
When you want to decode more than one image, this may not be the fastest way. The well-parallelized parts may be able to use all cores at the same time, but the serial parts will have very inefficient use of the cores. So the scaling will be less than linear with core count. You will not be getting the best possible throughput.
People from the server world are generally used to exploiting parallelism in another way: if they want to process multiple items at the same time, they will just start multiple processes. In this case, you can run as many of the single-core optimized JPEG decoders side-by-side as you have cores in your system. Generally this is the most efficient way, if you want to decode at least as many JPEG images as you have cores. You mostly avoid Amdahl’s law here, because each core runs very efficient code, and all cores can be used at all times. The main way in which Amdahl’s law will manifest itself in such a situation is in the limitations of shared resources in the system, such as cache, memory and disk I/O. For this reason, scaling will still not be quite linear in most cases, but it generally is as good as it gets, throughput-wise.
However, in this case, if you have say 16 cores, but you only want to decode 10 images, then you will have 6 idle cores, so like the earlier parallelized approach, you are still not making full use of all processing resources in that case, so again your throughput is not optimal.
You could try to run the above parallelized solution for multiple images in parallel, but then you run into other problems: since the parallel parts are designed to use as many resources as available for a single image, running two or more instances in parallel will be very inefficient, because the instances will be fighting for the resources and end up starving eachother.
It gets even more complicated if we would add a time component: say you want to decode a batch of JPEG images, but they are not available at the same time. The images are coming in from some kind of external source, and you do not know exactly when they are coming in, or how many you need to process at a time. Say, if you expect a batch of 100 images, you may get 3 images at a time, then nothing for a while, then another 10 images, etc. So you never know how many images you want to process at the same time, or how many cores you may have available. How would you try to make this as efficient as possible in the average case? So the question is: do you optimize for lowest latency, highest throughput, or some balance between them?
Take a cue from hardware
I think it is interesting to look at CPUs and GPUs at this point, because they need to solve very similar problems, but at a lower level. Namely, they have a number of execution units, and they have batches of instructions and data coming in in unpredictable circumstances. Their job is to allocate the execution units as efficiently as possible.
An interesting parallel to draw between the above two software solutions of parallelizing the decoding of JPEG images and GPU technology is the move from VLIW to scalar SIMD processing in GPUs.
To give some background: GPUs traditionally processed either 3d (XYZ) or 4d vectors (XYZW), or RGB/ARGB colours (effectively also 3d or 4d vectors). So it makes sense to introduce vectorized SIMD instructions (much like MMX, SSE and AVX on x86 CPUs):
add vec1.xyzw, vec2.xyzw, vec3.xyzw
So a single add-instruction can add vectors of up to 4 elements. However, in some cases, you may only want to add some of the elements, perhaps just 1 or 2:
add vec1.x, vec2.x, vec3.x
add vec1.xy, vec2.xy, vec3.xy
The below image is a nice illustration of this I believe:
What you see here is an approach where a single processing unit can process vectors of up to 5 elements wide. You can see that the first instruction is 5 wide, so you get full utilization there. Most other instructions however are only 1d or 2d, and there is one more 4d one near the end. So most of the time, the usage of the 5d processing unit is very suboptimal. This is very similar to the above example where you try to parallelize a JPEG decoder and optimize for a single image: some parts may be ’embarrassingly parallel’ and can use all the cores in the CPU. Other parts can extract only limited or even no parallelism, leaving most of the units idle. Let’s call this ‘horizontal’ parallelization. In the case of the GPU, it is instruction-level parallelism.
The solution with GPUs was to turn the processing around by 90 degrees. Instead of trying to extract parallelism from the instructions themselves, you treat all code as if it is purely scalar. So if your shader code looked like this:
add vec1.xyzw, vec2.xyzw, vec3.xyzw
The compiler would actually compile it as a series of scalar operations, like this:
add vec1.x, vec2.x, vec3.x
add vec1.y, vec2.y, vec3.y
add vec1.z, vec2.z, vec3.z
add vec1.w, vec2.w, vec3.w
The parallelism comes from the fact that the same shader is run on a large batch of vertices or pixels, so you can place many of these ‘scalar threads’ side-by-side, running on a SIMD unit. For example, if you take a unit like the above 5d vector unit, you could pack 5 scalar threads this way, and always make full use of the execution unit. It is also easy to make it far wider than just 5 elements, and still have great efficiency, as long as you have enough vertices or pixels to feed. Let’s call this ‘vertical’ parallelization. In the case of the GPU, this is thread-level parallelism.
Now, you can probably see the parallel with the above two examples of the JPEG decoding. One tries to extract as much parallelism from each step as possible, but will not reach full utilization of all cores at all times, basically a ‘horizontal’ approach. The other does not try to extract parallelism from the decoding code itself, but instead parallelizes by running multiple decoders side-by-side, ‘vertically’.
Sharing is caring
The ‘horizontal’ and ‘vertical’ approaches here are two extremes. My example above with images coming in ‘randomly’ shows that you may not always want to use one of these extremes. Are there some hybrid forms possible?
In hardware, there certainly are. On CPUs we have SMT/HyperThreading, to share the execution units of a single core between multiple threads. The idea behind this is that the instructions in a single thread will not always keep all execution units busy. There will be ‘bubbles’ in the execution pipeline, for example when an instruction has to wait for data to become available from cache or memory. By feeding instructions from multiple threads at a time, the execution units can be used more efficiently, and bubbles can be reduced.
GPUs have recently acquired very similar functionality, known as asynchronous compute shaders. This allows you to feed multiple workloads, both graphics and compute tasks, simultaneously, so that (if things are balanced out properly) the GPU’s execution units can be used more efficiently, because one task can use resources that would otherwise remain an idle ‘bubble’ during another task.
The software equivalent of this is the threadpool: a mechanism where a number of threads are always available (usually the same amount as you have cores in your machine), and these threads can receive any workload. This has some advantages:
- Creating or destroying threads on demand is quite expensive, a threadpool has lower overhead
- Dependencies can be handled by queuing a next task to start when a current task completes.
- The workloads are scheduled dynamically, so as long as you have enough workloads, you can always keep all threads/cores busy. You do not have to worry about how many threads you need to run in parallel at a given time
That last one might require an example to clarify. Say you have a system with 8 cores. Normally you will want to run 8 tasks in parallel. If you were to manually handle the threads, then you could create 8 threads, but that only works if you’re running only one instance of that code. If you were to run two, then it would create 16 threads, and they would be fighting for the 8 cores. You could try to make it smart, but then you’d probably quickly come to the conclusion that the proper way to do that is to… create a threadpool.
Because if you use a threadpool, it would always have 8 threads running for your 8 cores. If you create 8 independent workload tasks, it can run them all in parallel. If you created 16 however, it would run the first 8 in parallel, and then start with the 9th as soon as the first task is complete. So it will always keep 8 tasks running.
Another advantage is that you can run any type of task in parallel. So in the case that the images do not come in at the same time, the different images can be in different steps. Instead of the massively parallel steps hogging all the CPU cores, the non-parallelized steps can be run in parallel with the massively parallel ones, finding a decent balance of resource sharing.
In this case, I suppose the key is to find the right level of granularity. You could in theory create a separate task for each 8×8 block for every step. But that will create a lot of overhead in starting, stopping and scheduling each individual task. So you may want to group large batches of 8×8 blocks together in single tasks. You might also want to group multiple decoding steps together on the same batch of 8×8 blocks, to reduce the total amount of tasks further.
Anyway, these are just some ideas of how you can parallelize your code in various ways on modern multi-core systems. Whichever is the best way depends on your specific needs. Perhaps there are also other approaches that I have not mentioned yet, or which I am not even aware of. As I said, I don’t have all the answers, just some ideas to try. Feel free to share your experiences in the comments!