Multi-core and multi-threading performance (the multi-core myth?)

Today I read a new article on AnandTech, discussing the Bulldozer architecture in detail, focusing on where it performs well and, more importantly, where it does not, and why. There weren’t too many surprises in the article, as I had already pointed out most of the problem areas (decoding, integer ALU configuration, shared FPU, cache configuration…) years ago. But that is not really what I want to talk about today. What I want to focus on is multi-core and multi-threading performance in general.

The thing is, time and time again I see people recommending a CPU with more cores for applications that require more threads. As Johan de Gelas points out in the aforementioned article, it is not that simple. Although the Bulldozer-based Opterons have 16 cores, they often have trouble keeping up with the older 12-core Magny-Cours-based Opterons.

Now, it seems there is a common misconception. Perhaps it is because the younger generations have grown up with multi-core processors only. At any rate, let me point out the following:

In order to use multiple cores at the same time, multiple threads are required. The inverse is not true!

That is: a single core is not limited to running a single thread.

First things first

Let me explain the basics of threading first. A thread is essentially a single sequence of instructions. A process (a running instance of a program) consists of one or more threads. A processor core is a unit capable of processing a sequence of instructions. So there is a direct relation between threads and cores. For the OS, a thread is a unit of workload which can be scheduled to execute on a single core.
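
To make that relation concrete, here is a minimal C++ sketch (my own illustration, not from any particular OS): the process starts out with one thread running main(), and creating a second thread hands the OS another independent unit of workload to schedule.

#include <iostream>
#include <thread>

// Each std::thread is a separate sequence of instructions; the OS scheduler
// may run the two threads on different cores, or multiplex them on one core.
void count_down(const char* name) {
    for (int i = 3; i > 0; --i)
        std::cout << name << ": " << i << "\n";
}

int main() {
    std::thread worker(count_down, "worker");  // second schedulable unit
    count_down("main");                        // the first thread keeps working too
    worker.join();  // wait for the worker thread to finish
}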

This scheduling appears to be overlooked by many people. Obviously threads and multitasking have been around far longer than multi-core systems (and for the sake of this article, we can place systems with multiple single-core CPUs and CPUs with multiple cores in the same category). Since very early on, sharing the resources of the CPU between multiple users, multiple programs, or multiple parts of a program (threads) has been a very important feature.

In order to make a single core able to run multiple threads, a form of time-division multiplexing was used. To simplify things a bit: the OS sets up a timer which interrupts the system at a fixed interval. A single interval is known as a time slice. Every time this interrupt occurs, the OS runs the scheduling routine, which picks the next thread that is due to be executed. The context of the core is then switched from the currently running thread to the new thread, and execution continues.
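
As a toy illustration of this time-division multiplexing (a deliberately simplified C++ sketch; a real OS would save and restore full register state, handle priorities, and so on), each loop iteration below plays the role of one timer interrupt:

#include <deque>
#include <iostream>

int main() {
    std::deque<int> run_queue = {1, 2, 3};  // three runnable 'threads' (just IDs here)
    for (int slice = 0; slice < 9; ++slice) {
        int current = run_queue.front();    // scheduler picks the next thread
        run_queue.pop_front();
        std::cout << "slice " << slice << ": running thread " << current << "\n";
        run_queue.push_back(current);       // back of the queue until its next turn
    }
}

Each ‘thread’ gets every third slice, so all three make steady progress on a single core.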

Since these timeslices are usually very short (on the order of 10-20 ms, depending on the exact OS and configuration), as a user you generally don’t even notice the switches. For example, if you play an mp3 file, the CPU has to decode the audio in small blocks, and send them to the sound card. The sound card will signal when it is done playing, and this will trigger the mp3 player to load new blocks from the mp3 file, decode them, and send them to the sound card. However, even a single-core CPU has no problem playing an mp3 in the background while you continue work in other applications. Your music will not skip, and your applications will run about as well as when no music is playing.

On a modern system that is no surprise, as playing an mp3 takes < 1% CPU time, so its impact is negligible. However, if we go further back in time, we can see just how well this scheme really works. For example, playing an mp3 in the background worked well even in the early days of Windows 95 and Pentium processors. An mp3 would easily take 20-30% CPU time to decode. But since the OS scheduler did its job well enough, nearly all of the remaining 70-80% was available to the rest of the system. So most applications still worked fine. Things like web browsers or word processors don’t need all that much CPU time. They just need a bit of CPU time at the right moment, so that they are responsive to keystrokes, mouse clicks and such. And if the OS scheduler does a good enough job, the response time is only one or two timeslices, so in the range of 20-40 ms. This is fast enough for people not to notice a delay.

Or, let’s go back even further… The Commodore Amiga was the first home/personal computer with a multitasking OS, back in 1985. It only had a 7 MHz Motorola 68000 processor. But the video embedded below shows how well multitasking worked, even on such an old and slow machine (from about 3:52 on, and again at 7:07):

As you can see, even such a modest system can handle multiple heavy applications at the same time. Even though computers have multiple cores these days, there are usually many more threads than there are cores, so thread switching (multiplexing) is still required.

Multitasking vs multi-threading

The terms multitasking and multi-threading are used somewhat interchangeably. While they are slightly different in concept, at the lower technical level (OS scheduling and CPU cores), the difference is very minor.

Multitasking means performing multiple tasks at the same time. The term is used more widely than just in computing, but within the domain of computers, a task generally refers to a single application/process. So multitasking means you are using multiple applications at the same time. Which you always do, these days. You may have an IM or mail client open in the background, or a browser, or just a malware scanner, or whatnot. And the OS itself also has various background processes running.

Multi-threading means running multiple threads at the same time. Generally this term is used when talking about a single process which uses more than one thread.

The ‘at the same time’ is as seen from the user’s perspective. As explained earlier, the threads/processes are multiplexed, running for a timeslice at a time. So at the level of a CPU core, only one thread is running at any moment, but at the OS level, multiple threads/processes can be in a ‘running’ state, meaning that they will be periodically scheduled to run on the CPU. When we refer to a running thread or process, we generally refer to this running state, not to whether it is actually executing on the CPU at that instant (the time slices are so short that there can be dozens of thread switches per second, far too fast to observe in real time, so it is generally meaningless to look at threading at this level).

A process is a container for threads, as far as the OS scheduler is concerned. Each process has at least one thread. When there are multiple threads inside a single process, there may be extra rules on which threads get scheduled when (different thread priorities and such). Other than that, running multiple processes and running multiple threads are mostly the same thing: after each timeslice, the OS scheduler determines the next thread to run for each CPU core, and switches the context to that thread.

There are threads, and then there are threads

Not all threads are created equal. This seems to be another point of confusion for many people. In this age of multitasking and multi-threading, it is quite common for a single process to use multiple threads. In some cases, the programmer may not even be aware of it. The OS may start some threads in the background for some of the functions he calls, and some of the objects he uses, even though he only uses a single thread himself. In other cases, the OS is designed in a way that requires the programmer to use a thread for certain things, so that the thread can wait for an event to occur without freezing up the rest of the application.

And that is exactly the point: even though there may be many threads in a process, they are not necessarily in a ‘running’ state. When a thread is waiting for an event, it is no longer being scheduled by the OS, and therefore it does not take any CPU time. The actual waiting is done by the OS itself: it simply removes the thread from the list of running threads, and puts it on a waiting list instead. When the event occurs, the OS puts the thread back on the running list, so the event can be processed. Usually the thread is also scheduled right away, so that it can respond to the event as quickly as possible.
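
Here is a small C++ sketch of such an event-driven thread (my own example, using a standard condition variable): while it is blocked in wait(), the thread sits on the OS waiting list and consumes no CPU time at all.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable event;
bool event_occurred = false;

void waiting_worker() {
    std::unique_lock<std::mutex> lock(m);
    // The OS parks this thread on a waiting list here; it uses no CPU time
    // until another thread signals the event.
    event.wait(lock, [] { return event_occurred; });
    std::cout << "event processed\n";  // woken up only to handle the event
}

int main() {
    std::thread worker(waiting_worker);
    {
        std::lock_guard<std::mutex> lock(m);
        event_occurred = true;   // the 'event' happens...
    }
    event.notify_one();          // ...and the OS makes the worker runnable again
    worker.join();
}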

Applications that use these types of threads are still seen as ‘single-threaded’ by most people, because the work is still mostly done by one thread, while any other threads are mostly waiting for an event to occur, waking up only to process the event, and then going back to sleep again. As a result, such an application will appear to only use a single core. The additional threads may be processed on other cores, but their processing needs are so minor that they probably don’t even register in CPU usage stats. Even if you only had a single core, you probably would not notice the difference, since the threads could be scheduled efficiently on a single core (just like the example of playing an mp3 file earlier).

To really take advantage of a multi-core system, an application should split up the main processing into multiple threads as well. Its algorithms need to be parallelized. This is only possible to a point however, depending on the algorithm. To give a very simple example:

e = a + b + c + d

You could parallelize a part of that, like so:

t0 = a + b
t1 = c + d
e = t0 + t1

t0 and t1 can be calculated in parallel threads. However, to calculate e, you need the results of both threads. So part of the algorithm can be parallel, but another part is implicitly sequential: it depends on results from earlier calculations, so there is no way to run it in parallel with the calculations it depends on.
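
Expressed in code, the example could look like this (a minimal C++ sketch of my own; for two additions the threading overhead would dwarf the actual work, so it is purely illustrative):

#include <future>
#include <iostream>

int sum4(int a, int b, int c, int d) {
    // t0 and t1 may be computed on different cores, in parallel...
    auto t0 = std::async(std::launch::async, [=] { return a + b; });
    auto t1 = std::async(std::launch::async, [=] { return c + d; });
    // ...but the final addition must wait for both results:
    // this is the sequential part of the algorithm.
    return t0.get() + t1.get();
}

int main() { std::cout << sum4(1, 2, 3, 4) << "\n"; }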

Amdahl’s law deals with these limitations of parallel computing. In one sentence, it says this:

The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
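
In formula form (the usual formulation, which the quote above does not spell out): if a fraction p of the work can be parallelized over N cores, the overall speedup is

S(N) = 1 / ((1 - p) + p / N)

Even with infinitely many cores, the speedup can never exceed 1 / (1 - p). So if 10% of a program is sequential (p = 0.9), no number of cores will ever make it more than 10 times faster.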

The sequential parts result in situations where the threads for one step in the algorithm have to wait for the threads of the previous step to signal that they’re ready. The more sequential parts there are in a program, the less benefit it will have from multiple cores, and the more benefit it will have from better single-threaded performance of each core.

And that brings me back to the original point: the misconception that the number of cores is the only factor in the performance of multithreaded software.

The multi-core myth

This is a myth that bears a lot of resemblance to the Megahertz myth that Apple so aptly pointed out back in 2001, and which was also used to explain the AMD Athlon’s superior performance compared to Pentium 4s running at higher clockspeeds.

The Megahertz myth was a result of people being conditioned to see the clockspeed as an absolute measure of performance. Clockspeed is a valid measure of performance only as long as you are talking about the same microarchitecture. So yes, if you have two Pentium 4 processors, the one with the higher clockspeed is faster. Up to the Pentium 4, the architectures of Intel and competing x86 processors were always quite similar in performance characteristics, so as a result, the clockspeeds were also quite comparable. An Athlon and a Pentium II or III were not very far apart at the same clockspeed.

However, when the architectures are different, clockspeed becomes quite a meaningless measure of performance. For example, the first Pentiums were introduced at 66 MHz, the same clockspeed as the 486DX2-66 that went before them. However, since the Pentium had a superscalar pipeline, it could often execute 2 instructions per cycle, where the 486 executed at most one. The Pentium also had a massively improved FPU. So although both CPUs ran at 66 MHz, the Pentium was a great deal faster in most cases.

Likewise, since Apple used PowerPC processors, and AMD’s Athlon was much more similar to the Pentium III than the Pentium 4 in architecture, clockspeed meant very little in performance comparisons.

Today we see the same regarding the core-count of a CPU. When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads. However, since AMD and Intel have very different microarchitectures, the core-count becomes a rather meaningless measure of performance.

As explained above, a single core can handle multiple threads via the OS scheduler. Now, roughly put, if a single-core CPU is more than twice as fast as one core of another dual-core CPU, then this single-core CPU can also run two threads faster than the dual-core CPU.

In fact, if we factor in Amdahl’s law, we can see that in most cases, the single core does not even have to be twice as fast. Namely, not all threads will be running at all times. Some of the time they will be waiting to synchronize sequential parts. As explained above, this waiting does not take any actual CPU time, since it is handled by the OS scheduler (in various cases you will want to use more threads than your system has cores, so that the extra threads can fill up the CPU time that would otherwise go to waste while threads are waiting for some event to occur).
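
To put some illustrative (made-up) numbers on that: say 30% of the work is sequential. Taking one slow core’s single-threaded execution time as 1, the dual-core CPU needs at best

0.3 + 0.7 / 2 = 0.65

of that time, while a single-core CPU that is about 1.54 times as fast finishes the entire workload in 1 / 1.54 ≈ 0.65. In this example the single core only has to be roughly 1.5 times as fast, not twice as fast, to keep up with the dual-core.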

Another side-effect of the faster single core is that parts that are strictly sequential in nature (where only one thread is active) are processed faster. Amdahl’s law essentially formulates a rather paradoxical phenomenon:

The more cores you add to a CPU, the faster the parallel parts of an application are processed, so the more the performance becomes dependent on the performance in the sequential parts

In other words: the single-threaded performance becomes more important. And that is what makes the multi-core myth a myth!

What we see today is that Intel’s single-threaded performance is a whole lot better than AMD’s. This not only gives them an advantage in single-threaded tasks, but also makes them perform very well in multi-threaded tasks. We see that Intel’s CPUs with 4 cores can often outperform AMD’s Bulldozer architecture with 6 or even 8 cores, simply because Intel’s 4 cores are that much faster. Generally, the extra cores come at the cost of lower clockspeed as well (in order to keep temperatures and power consumption within reasonable limits), so it is generally a trade-off with single-threaded performance anyway, even with CPUs using the same microarchitecture.

The above should also give you a better understanding of why Intel’s HyperThreading (Simultaneous Multithreading) works so nicely. With HyperThreading, a physical core is split up into two logical cores. These two logical cores may not be as fast as two physical cores would be, but that is not always necessary. Threads are not running all the time. If the cores were faster, it would just mean some threads would be waiting longer.

The idea seems to fit Amdahl’s law quite well: for sequential parts you will only use one logical core, which will have the physical core all to itself, so you get the excellent single-threaded performance that Intel’s architecture has to offer. For the parallelized parts, all logical cores can be used. Now, they may not be as fast as the physical cores, but you have twice as many. And each logical core will still exceed half the speed of a physical core, so there is still performance gain.
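
As a quick sanity check with made-up numbers: if each of the two logical cores runs at, say, 60% of the speed of the full physical core, then together they deliver 2 × 0.6 = 1.2 times the throughput of the physical core in the parallel parts: a 20% gain for very little extra hardware.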

One of the things that SMT is good at is reducing overall response time. Since you can run more threads in parallel, new workloads can be picked up more quickly, rather than having to wait for other workloads to be completed first. This is especially interesting for things like web or database servers. Sun (now Oracle) took this idea to the extreme with their Niagara architecture. Each physical core can run up to 8 threads through SMT. For an 8-core CPU that is a total of 64 threads. These threads may not be all that fast individually, but there are a lot of them active at the same time, which brings down response time for server tasks. Because of SMT, the total transistor count for a 64-thread chip is extremely low, as is the power consumption.

So, to conclude, the story of multithreading performance is not as simple as just looking at the number of cores. The single-threaded performance per core and technologies such as SMT also have a large impact on the overall performance. Single-threaded performance is always a good thing, both in single-threaded and multi-threaded scenarios. With multi-core, your mileage may vary, depending on the algorithms used by the application (to what extent are they parallelized, and to what extent do they have to remain sequential?), and on the performance of the individual cores. Just looking at the core-count of a CPU is about as meaningless a way to determine overall performance as just looking at the clockspeed.

We see that Intel continues to be dedicated to improving single-threaded performance. With Bulldozer, AMD decided to trade single-threaded performance for having more cores on die. This is a big part of the reason why Bulldozer is struggling to perform in so many applications, even heavily multithreaded ones.


136 Responses to Multi-core and multi-threading performance (the multi-core myth?)

  1. Bob says:

    This is a great blog entry and I think very important as we start to see mobile devices based on Medfield hit the market. Many bloggers are lamenting that it is a single-core CPU at 1.6 GHz, but in many cases it is likely to deliver similar if not better performance than dual-core and quad-core ARM SoCs, especially in the 2 GHz version.

  2. MacOS9 says:

    A damn refreshing entry on multitasking and multiprocessing if I ever read one; no longer will I be jealous of 4-, 6-, and more-core processors with my modest Core2Duos.

    (I remember there being comparisons on the internet years ago in the 90s between a PowerPC [the 604 by IBM] running at 120 MHz and the popular Pentium Pro of the mid-1990s running at 166 MHz, with the PowerPC said to be just as fast – although that may have been a comparison between RISC and CISC and had nothing to do with multitasking.)

    By the way, since I’m already on the subject, what are Scali’s views on RISC vs. CISC processors? I remember Apple’s controversial switch from PowerPC (RISC) to CISC-based Intel around 2005 – but perhaps newer Intel processors (those supporting 64-bit) have by now also moved in the direction of RISC?

  3. MacOS9 says:

    From what I’ve gathered by reading this entry, my question may be irrelevant, but I look forward to Scali’s comments nonetheless.

  4. MacOS9 says:

    The article in question is called “Cisc or Risc? Or Both?” (my apologies for the several posts, since I wasn’t able to paste in the URL).

  5. MacOS9 says:

    Well let me continue embarrassing myself by offering one more post: have now stumbled onto your “RISC/CISC” article of late Feb. that has answered most of my questions: in other words, the complexity (or lack thereof) of instruction sets is largely irrelevant in the “post-RISC” world I take it?

    • Scali says:

      The Pentium Pro was actually the first Intel x86 CPU that used a RISC backend. It couldn’t quite keep up with the PowerPC yet, but a few hundred extra MHz could compensate for it. As CPUs became larger and more complex, the extra cost of the CISC-legacy became smaller. Today, Intel even has a competitive x86 CPU for smartphones and tablets, the Atom codenamed Medfield: https://scalibq.wordpress.com/2012/01/23/intel-medfield-vs-arm/
      So yes, it is largely irrelevant. In theory a RISC CPU may still be slightly more efficient, but in practice, Intel can compensate by using superior manufacturing technology (they are at 22 nm, where the competition is at 28, 32 or 40 nm), and by other technologies such as HyperThreading.
      The irony is that this time it’s the ARM CPUs that aren’t very impressive in terms of performance per GHz.

  6. anthonyvenable110 says:

    Reblogged this on anthonyvenable110.

  7. Pingback: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ | Itsaat

  8. Pingback: The Multi-Core Myth | Shwuzzle

  9. chris says:

    Very interesting read. So if clock speed (when analyzing different architectures) doesn’t matter, what should I be looking at when I buy my next processor? How does one determine how “fast” a processor executes a thread?

    • Scali says:

      Well, the only way to get a decent performance estimate is to look at it on a per-application basis.
      The only reliable way to know which CPU runs application X best, is to just benchmark them and compare the results.
      Of course, application Y could be a completely different story again.
      There is no simple answer.
      Which is why there are so many myths… The MHz myth and the multi-core myth approach the problem of performance from opposite angles, and they both fail equally hard at indicating overall performance.

      • chris says:

        For a user like myself who mainly runs things like a browser and netbeans (to a lesser extent LAMP stack locally) how is that feasible? Even for a small business determining which hardware to purchase to run a database server, how is that really feasible…

      • Scali says:

        Funny how some people can’t accept that not all answers are ‘feasible’. If you want to have a computer for generic use, then your performance needs will be some weighted average of the applications you will be using, where the weights depend on your wants/needs/priorities.
        You think that asking again will change my answer? No it won’t. I’m a realist, reality is not always ‘feasible’. Performance of a CPU architecture cannot be caught in a single measurement.

        Having said that, I can give you a simple tip: filter your application needs based on how performance-critical they are.
        Browsers are not performance-critical whatsoever. Depending on the size of your projects and how you use it, Netbeans is generally not performance-critical either.
        A local LAMP stack (presumably for development purposes) is usually not performance-critical either (you’re the only user, where the technology can easily scale up to hundreds or thousands of users at a time).

        Also, if you want to purchase a database server, you’ve already answered yourself how feasible it is: ‘database server’, so you only have to look at database performance, and more specifically, only at the specific database software you intend to run on it.

  10. Pingback: Thought this was cool: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ « CWYAlpha

  11. Pingback: Of GCs and Multi-cores « Ping!

  12. Sreejith says:

    So what about the dual-core and quad-core processors in mobile phones? Increasing the number of cores in them doesn’t really count…

  13. Hambster says:

    Hi,
    It’s a good post. May I refer to your post on my blog?

  14. elpipo says:

    Good article, it has brought me back to school.

    As I’m getting old and, little by little, drifting away from this field, could you also discuss the case of virtualisation? Are cores really allocated to virtual machines, or is this just an abstraction? Is it stupid to say “I want one core for each virtual machine”? How do hypervisors handle real-time allocation? …

    • Scali says:

      I think those details are implementation-specific. I don’t use virtualization much myself, but I do use VirtualBox or VMWare from time to time for some cross-platform testing/development, and I believe they just map the virtual cores to threads on the host OS, without specifically pinning them down on cores (as you could do with SetThreadAffinityMask() in Windows).
      Which would make sense, since this would still allow the host to dynamically allocate cores, and do proper load balancing. I suppose you could still use the process affinity mask to have each instance run on a subset of the cores, but I think it would be a bad idea in general.

  15. Christos says:

    The Megahertz myth would have you believe that a higher clock means faster.
    Well, it’s true, for a given CPU architecture.
    The “multi-core myth” will have you believe that more cores equals more speed.
    Well, again it’s true, only this time there are two parameters to consider instead of one:
    architecture and application.
    Intel fueled the MHz myth because it was what they had to offer relative to the opposition; now AMD will advertise more cores for the same reason.
    You could then argue about the “64-bit” myth that AMD started with the Athlon 64, which offered nothing in terms of performance while offering a lot in terms of marketing.
    But still, it was the way forward and someone had to make the first step.
    More cores equals more performance, generally speaking, and anyone who doesn’t understand the pitfalls involved will just end up paying a bit more for a bit less; not a big deal.
    At the end of the day it’s exactly the same as in every aspect of life: the more knowledgeable make more knowledgeable decisions.
    No point in getting too crazy about it.
    At the end of the day I doubt those who know little about computers will read any of this anyway.
    They just buy a PC with “Intel inside” cause it’s all they know.
    So the world is a good place to be right now, it seems, because the name they all know makes the best CPUs at the moment; how lucky is that.
    Come to think of it, the MHz myth must have done people’s pockets a much greater injustice back then.

    • Scali says:

      Not sure what point you’re trying to make. Everything you say is already literally mentioned in the article, or is so obvious that it would not need to be mentioned.
      Or well… some part is rather dubious. Namely, you say Intel fueled the MHz myth… You mean to imply that AMD did not? Which would be funny, since as I pointed out, Apple coined the term ‘MHz myth’ when they were competing with their PowerPC CPUs against the Intel Pentium III and the AMD Athlon. And the Pentium III and the Athlon were in a tight race towards the GHz mark at the time (with AMD actually winning that race).

  16. Pingback: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ | @EconomicMayhem

  17. Keith Walden says:

    Not really a computer geek … (no negative inflection intended) 🙂 … I’m closer to one of those ding-dongs that look for “Intel inside” and hope for the best when I buy a computer, as I drive a truck with not much free time. Years ago I was a technician and still have a Tek 466b storage scope and associated equipment I mean to set up to troubleshoot with. I do like to tinker with computers, and my P4E desktop (3 GHz / 4 GB / 800 MHz) pooped with a motherboard issue. Was wondering whether to buy a new one or replace it, and decided to try R+R’ing the motherboard first. My application is mainly internet for googling things, ebay etc and simple programs that would probably run fine on a Pentium 1 or 486. I would like to see videos without buffering, have it be wireless, and have it be fast enough to be usable for those things in the next few years. I was considering buying an HP Phoenix 8-core, which is their top model, instead for $650 (recon) and may still, but wondering where that will get me in performance improvement for the things I do? Really? Probably not far. The P4E is most likely fast enough to accommodate the fastest residential internet service for the near future, as well as the Phoenix would… the Phoenix probably won’t get me anywhere but $600 in the hole. I think most people use their computer for the internet today as I do, with 1-3 people possibly at a time. I live alone so the P4 is probably fine… maybe if there were additional users simultaneously the multi-cores would be better; guess it depends on the sequential/parallel content of normal internet activity. Is there a measure of that available?

  18. Lauro Andrea says:

    Thanks for your blog here.
    I was researching the virtues of having a 12-core server for our biometric time clock, and whether threads are directly proportional to cores, and have seen that it is not quite so. I am not so much engaged in the search for speed as I am for capacity for multiple requests for data. Though there is a central server and we have about 400 time clocks around the country, with everybody clocking in in the space of five minutes at around 7:55 to 8:00 am, we will be using sync software to cache the data while it piles up and send it server-bound after the lines are less clogged, around 9:00 am or later.
    Still busy on a Sunday but your blog had me riveted. Thanks.

  19. I’ve looked all over the internet and I’m surprised no one discusses this in the context of multi-core CPUs: I had a dual Celeron setup that made me smile because I could do CPU-intensive stuff without having to wait. Like big file transfers. Especially: big file transfers. Then multi-core came along. I figured this was going to be just great. No futzing around trying to tell the OS that I have 2 CPUs plugged in. Just assemble and go. I upgraded from dual Celerons directly to a dual-core Athlon at something like 3800 MHz. File transfer speed was AMAZING. But guess what? I had to wait! The dual Celeron handled the workload so as to leave one CPU free, so I NEVER EVER had to wait. This feature of dual-love is gone now, and I’m guessing it’s because OSes and apps are multi-thread aware, so they do a “superior” job of scheduling. So, in this case better CPU + better software = worse user satisfaction in a major area of computer use. Even things like virus scans or indexing operations take the computer away for effectively the whole duration of the procedure. I want my old system back, but everything is obsoleted by the new OSes and browsers. You would think there would be a way to configure your system so that you don’t have to wait if you don’t want to. Anyone know of a way to do this?

    • Scali says:

      Well, there is a difference between having multiple CPUs/cores and having multiple computers.
      Multiple cores still share the same memory (and generally also share some of the cache), the same chipset, the same harddisk controller, the same harddisks etc.
      It’s simply physically impossible to do high speed file transfers without impacting the overall performance of the system in some way, because you are using a lot of shared resources in the process.
      Perhaps the older Celeron system just appeared to work better because the balance worked out better for you (the harddisk, chipset and memory may have been relatively fast compared to the CPUs, so there was a larger budget to share between the CPUs, so to say).

      In essence, multi-CPU systems are not that different from multi-core ones. In fact, the first generation of dual-core x86 CPUs (both Pentium D and Athlon64 X2) were little more than two CPUs connected on the same package/die.
      And OS schedulers have not changed all that dramatically over the years. They have mainly seen improvements in being more HT-aware and NUMA-aware. But that would not apply to your situation.

      • >>And OS schedulers have not changed all that dramatically over the years. They have mainly seen improvements in being more HT-aware and NUMA-aware. But that would not apply to your situation.<<

        Maybe my mind is playing tricks, but I seem to remember looking at the performance tab in task manager and seeing one cpu unused during a file transfer (in the celeron days). Is that possible? If anyone out there has an old dual-PII, could you please tell me if I'm crazy? I guess it depends on the OS. I had Win2K Pro and liked it a lot better than Windows 7 Home.

        As for the new systems, seems like they could offer an option for how you want your resources allocated so you can keep working during long cpu-intensive procedures. Maybe it's just my OS.

  20. Came across this today:
    http://www.techspot.com/community/topics/making-a-program-remember-its-priority.3913/
    Haven’t tried it yet, but looks like it’s pretty easy to set priority for anything you can run with a shortcut — which is pretty much everything!

  21. Abhishek M says:

    Each core should still be able to run a different task. So a multi-core processor should give more throughput than a single core any day right?

    • Scali says:

      No. I think you didn’t quite understand what I wrote here 🙂

      “When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads. However, since AMD and Intel have very different microarchitectures, the core-count becomes a rather meaningless measure of performance.

      As explained above, a single core can handle multiple threads via the OS scheduler. Now, roughly put, if a single core CPU is more than twice as fast as one core of another dualcore CPU, then this single core CPU can also run two threads faster than the dualcore CPU.”

      • Abhishek M says:

        No. I think you’re generalizing too much :).

        I wasn’t talking about Intel or AMD. I didn’t even have a specific vendor in mind.

        Theoretically, if everything else is the same, a multi-core processor system is able to do more work than a single-core processor system. Of course, you need tools to do that. Languages like Scala and Erlang are already doing that. Simultaneous Multi Threading on a single core is efficient but it is good only for instruction level parallelism.

        Of course, a simple user with minimal needs doesn’t need to worry about that. A good OS will do the trick for him.

      • Scali says:

        “Theoretically, if everything else is the same, an multi-core processor system is able to do more work than a single core processor system.”

        Which makes you the one who is generalizing too much.
        Aside from that, this is a different statement from the one you made earlier. This new statement sounds exactly like the one I already made:
        “When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads”
        Which I would not argue against, obviously.
        Your original statement was however:
        “So a multi-core processor should give more throughput than a single core any day right?”

        I take the ‘any day’ to mean that a multi-core processor will ALWAYS give more throughput (and that is an AMD-specific marketing term, so it was already quite obvious which vendor you had in mind, as if rooting for moar coars was not enough of an indication).
        Which is not true, since multi-core processors will always share a certain level of resources, being on a single socket. This means they share the data bus, and usually also the memory controller and at least some of the cache.
        So in I/O-limited situations multi-core processors will not have more throughput, because the throughput is limited by the available bandwidth of the system, which is the same regardless of the number of cores.

        Add to that the fact that the more cores a CPU has, the lower it will be clocked, and in practice multi-core processors may actually have less throughput in certain situations than single core variations. So it does not hold ‘any day’.

        “Simultaneous Multi Threading on a single core is efficient but it is good only for instruction level parallelism.”

        You are mixing up terminology here. Instruction level parallelism is a term related to superscalar processing, and is applied on a per-thread basis. It is not meant to be applied to SMT, where you use two or more streams of instructions rather than one.

  22. Pingback: What could be my bottleneck?

  23. Mark says:

    This is a very nice explanation I stumbled across when trying to find out if it’s worth spending more on an i7 4770, which has hyperthreading, instead of the i5 4670, which is exactly the same speed CPU. If not now, then maybe in the future with better optimized programs (I plan to use it 7+ years).
    When I started reading, you mentioned my daily usage, such as multitasking with music playback, Firefox with a lot of tabs open (and more importantly plugins like e.g. Disconnect, which does freeze FF on my Q6600 because of blocking a lot of tracking links on some pages; not sure if it’s CPU related though). From this explanation I started to think the i5 was more than fast enough, because it has a lot of single-threaded performance and as such could run a lot of threads, as my daily usage would seem light enough to let threads access the CPU efficiently.
    But then you started mentioning the hyperthreading, which is very useful if I understand correctly. So HTT would seem something to consider when buying a CPU.

    Now does this mean that even on an already fast CPU like the i5 with 4 cores, the extra hyperthreading from the i7 as the only difference would still justify the more expensive i7 (if not now, maybe in the future)?

    Hopefully you can help me with this question.

    • Scali says:

      Whether or not HT is worth the extra money depends a lot on what kind of applications you use, and how many you use at a time.
      In some cases it may not give you any extra performance at all, while in other cases you may get 30-40% extra.

      In my case, I always go for the i7. If you were to calculate the raw numbers, I’m sure the i5 would give more performance for the buck on average… But I gladly pay the small extra premium to get that little boost from HT when I require it.

      • Mark says:

        I already know it does give a boost to packing tools, encryption and media editing. But this is not something I do daily. What I’m trying to find out is if the extra HTT on an already very fast quad-core CPU will, now or in the future, speed up browsing, mail, etc.
        This is very hard to find out (a lot of different opinions on that subject) and, if I’m interpreting your response correctly, it is purely a personal decision and not something very measurable or foreseeable?

      • Scali says:

        Well, as you say… You get a boost in some applications, not in others. It’s a very personal thing: how often do you use such applications, and how important is it to you to get a bit of extra performance?
        It’s much the same as with many other CPU-specs… Do you want to pay a bit extra to get more cache, or is the model with less cache good enough for you? Do you want a few hundred MHz extra clockspeed? Etc.

  24. Mark says:

    Thanks for your opinion. I’ve made my choice: although currently the i5 is the best bang for the buck, I’m going for the i7 4771 because I think (hope) it will be more future-proof.

  25. Vishal says:

    After reading all the above comments and your post, I have come to a conclusion: ‘Intel has higher single-threaded performance than AMD, and multiple cores do not matter at all in performance’.

    • Scali says:

      It’s not that multiple cores don’t matter at all in performance. Both the per-core performance and the number of cores affect performance, it’s just that per-core performance is more important (as Amdahl’s Law says). A lot of people seem to ignore per-core performance completely, and only look at core-count, which is completely wrong.

  26. Rahul Bansal says:

    Hi Scali,
    I came here when I was trying to pick one out of – http://cpuboss.com/cpus/Intel-Xeon-E3-1245V2-vs-AMD-Opteron-3280
    I have 2 different sets of applications:
    (a) a LEMP server with lots of WordPress sites
    (b) an FFMPEG-based encoding server with node.js for the front-end
    Some servers have both of the above.
    After reading your article, I think the Intel Xeon E3-1245V2 scores over the AMD Opteron 3280 (higher clock rate with multithreading).
    But the AMD Opteron 3280 has 8 MB of L2 cache, which is 8 times bigger than the Intel Xeon E3-1245V2’s. How will that affect things?
    I know it will depend on the application a lot, hence I listed both sets of applications above.
    Thanks for one of the most amazing articles I have ever read on this topic! It finally settled the thread vs. core debate (at least for me).

  27. Z says:

    Thanks for a very informative read! I’ve been out of computing for a while but it is nearly time to replace my 2005 Pentium D 3.0GHz Dell. I have upgraded it as far as it is possible to do without replacing the motherboard or CPU. I have been extremely pleased with this processor. It is a beast!

    I am familiar with threading but was confused by the benchmarks comparing “single-threaded” and “multi-threaded” performance between the AMD FX 8350 Black Edition and Intel i7-4770K that I found on anandtech.com. Since the main programs I use are PhotoShop CS3, Nikon’s View NX2, some general gaming, and I am looking at video editing in the near future, I really needed to understand what this difference really is. According to these benchmarks (I have always taken benchmarks with a grain of salt… I’ve been computing since the Timex Sinclair 1000 and Commodore 64, personally owned a 386SX33, Cyrix 6×86, Pentium II I think it was, then this Pentium D, and just picked up this laptop with an AMD A10-5750M and love it so far) it SEEMS like, for my specific uses, the AMD FX 8350 MIGHT be the way to go.

    But then I decided that I really needed to understand the differences in performance between these two chips, specifically the single- vs. multi- threading. I can tell you that finding information on this specific question is not very easy. I very much appreciate the time you put into writing this because I think you’ve probably saved me from making a less than optimal decision.

    I THINK that, despite my desire to fuel the competition between Intel and AMD by buying another AMD processor, the Intel i7 would be the wiser choice. It should last me at least five years before I’d need to upgrade to whatever someone comes up with next. If I were ONLY gaming and didn’t need it to last as long as possible, I probably would get the AMD, since it is much less expensive.

    So my question to you is, have I understood your article and do you agree with this choice for my specific listed uses? Video involves a lot of sequential processing, right? I know there are “video benchmarks” but, like I said earlier, I don’t think benchmarks are always accurate and I don’t plan on using the format they listed in the benchmarks I’ve seen.

    I think it is more informative to understand the way a program is coded to be processed, especially in today’s multi-core computing landscape. And, if I have understood what you have written, the “single-thread” performance is a potential bottleneck, essentially. Because this article means that, no matter how many simultaneously (parallel) running cores are involved they can and will still be waiting for the results of a single calculation at some point. Intel’s Hyperthreading gives them a huge advantage for this reason, correct?

    I understand enough of the way computers actually work (specifically MS OS’s) to know you are spot on in what you have written. I just need to make sure I understand the implications and consequences of CPU choice.

    Thank you again for this well-written and information-packed article. Or blog post, or whatever you want to call it. I couldn’t stop reading it once I started!

    • Scali says:

      In short, AMD is *never* the way to go… This article is already a few years old, and new iterations of Core i7 CPUs have opened up the gap even wider.

      Anyway, video processing/encoding can be parallelized to a great degree. You also see that newer video processing software tends to offload work to the GPU for that reason. Which means that aside from GPU-performance, the single-threaded CPU performance becomes most important (the CPU will mainly be preparing and feeding commands to the GPU, which is a task that is hard to parallelize, so it still relies mainly on single-threaded performance, much like conventional graphics acceleration in games and such. So, especially when you’re gaming, you’d want an Intel CPU… in many cases just a cheap dualcore will do the job just fine). Intel also has the super-fast QuickSync technology for encoding video.
      But when you are doing CPU-only video processing, you’ll want some balance of good parallel performance and good single-threaded performance. The exact performance requirements depend a lot on which software you use, and what kind of operations you are performing, so it is difficult to make any generalized claims about that.

      And, if I have understood what you have written, the “single-thread” performance is a potential bottleneck, essentially. Because this article means that, no matter how many simultaneously (parallel) running cores are involved they can and will still be waiting for the results of a single calculation at some point.

      Yes, that is basically what Amdahl’s law says. Basically the idea is that if you improve single-threaded performance, you improve performance across ALL parts of the code, both parallel and sequential. But if you improve multi-threaded performance (as in: adding more cores, but not making the cores faster), the bottleneck merely shifts towards the sequential part of the code.

      Intel’s Hyperthreading gives them a huge advantage for this reason, correct?

      Yes and no. Hyperthreading, as in simultaneous multithreading, does not necessarily improve this situation. However, the approach that Intel has taken, does. Namely, Intel focuses on maximum single-threaded performance. Which speeds up all code. Then they add hyperthreading to double the number of threads that the CPU can handle. This brings down the single-threaded performance when two threads are running on a core simultaneously, but the net effect is that the two threads run faster than they would without HT. In terms of the added complexity (only about 5% extra transistors per core required for HT), there is quite a significant boost in multi-threaded scenarios (which may be 20-30% on average, and can be over 50% more performance).
      So in terms of performance-per-transistor, or performance-per-watt, Intel’s HT is very efficient.
      And given that the sequential parts of the code would generally slow down the parallel parts anyway, in practice it is not that much of a problem that the multithreaded code does not run as quickly on 4 cores with HT as it would on 8 dedicated cores. With 8 dedicated cores, in a lot of cases the cores would be sitting idle for longer, waiting for other sequential parts to complete.
      So having the 4 extra threads is a nice bonus, and the fact that the extra threads slow down the single-threaded performance of the cores is usually less important.

  28. Z says:

    Sorry, got the Pentium D 3.0 GHz in 2006, first quarter. And by “out of computing for a while” I meant I’d stopped reading up on the myriad new chips being developed years ago, and have been doing research for a couple of months to get back up to speed on what’s available. I had allowed myself to lose touch with what was going on. I have always been fascinated by computers though, and use them on a daily basis at work, at home, at play, and hopefully to finally become self-employed. Thanks again!

  29. Required says:

    Ah, I love the tears of the Intel “but our less cores are faster so there!” fanboys, like Scali is. Just beautiful replies like his reply to Abhishek M. Thank you for crying in this post Scali, you misguided Intel moron.

  30. Hùng Lê says:

    Dear all,
    How do you calculate the optimal number of threads, assuming that the JVM has enough Java heap space and the task can be split entirely?
    Thank you,
    Jimmy

  31. eulises melo says:

    Excellent summary of multitasking and multithreading… an eye-opener for 89% of the CEOs and VPs running these businesses. The only advancement will be to focus on programming…
    John L. Gustafson pointed out in 1988 what is now known as Gustafson’s law: people typically are not interested in solving a fixed problem in the shortest possible period of time, as Amdahl’s law describes, but rather in solving the largest possible problem (e.g., the most accurate possible approximation) in a fixed “reasonable” amount of time. If the non-parallelizable portion of the problem is fixed, or grows very slowly with problem size (e.g., O(log n)), then additional processors can increase the possible problem size without limit.
    Your article tackled the myth and provided the solution to the current misconception and bottleneck. I’m heading to TigerDirect and buying me a Core 2 Duo, forget that i3. jaja good job.
    eulises melo

  32. Excellent blog! By reading this article, one question came to mind. Say I have a program which requires 5 threads, and I have a 4-core machine; how will the threads work? Will 4 threads be allocated one to each of the 4 cores, with the 5th thread sharing time? Or will something else happen? Waiting for the responses.

    • Scali says:

      It all depends on the workload of these threads, and the priorities they are assigned. If all threads are working all of the time, and they all have equal priority, then the scheduler will try to switch threads in and out so that all 5 threads get an equal amount of CPU time (the time-division multiplexing mentioned above). If some of the threads have higher priority than others, these will get more CPU time, so the scheduler will switch to these threads more often than to the others. And when a thread enters some kind of waiting state for a certain event, it is no longer being scheduled, so the remaining threads will get all the CPU time (in which case the CPU may even become partly idle, when there are fewer than 4 active threads).

  33. Burgmeister says:

    Hey Scali, everything I’ve read here on your website thus far has been pretty helpful. I didn’t understand 100 percent of it (I’m not a very tech-y person), but I think I get the gist of what you’re saying.

    So, what I got from your article and from reading all of these posts is that clock speed and number of cores basically aren’t good indicators of speed, and that the only real way to determine how fast a certain processor is, is to look at how it performs on a “per application” basis. If that’s true, does that mean I have to find benchmark tests online (or conduct my own) in order to really get a good estimate of how fast a particular processor is, or is there some easier way? For example: if I’m comparing an i5 with an i7 that have similar clock speeds and numbers of cores, is there some easier way to tell how well each will perform without actually using them first-hand? Or are the architectures simply too different for them to be compared just by looking at their specs?

    Also, I still don’t entirely understand what “applications” hyperthreading would be helpful in. I’ve heard that for gaming, you don’t really need it, which is what I would primarily be using a new computer for. Is hyperthreading generally good for gaming, or is it better for multitasking?

    Lastly, how much does having extra threads, on average, increase speed? For example, on an i7 4770 there are 4 cores and 8 threads. How much does having those extra 4 threads help, if at all, with using a single application (game) at a time?

    I know these questions are really general, but like I said, I’m not that knowledgeable when it comes to computers…that’s why I’m asking you :).

    • Scali says:

      For example: If I’m comparing an i5 with an i7 that have similar clock speeds and numbers of cores, is there some easier way to tell how well each will perform without actually using them first-hand? Or is the architecture simply too different for them to be compared just by looking at their specs?

      Well, in the case of the i3, i5 and i7 (assuming they are of the same generation), the architecture is the same. The main difference is in the number of cores and whether or not HyperThreading is enabled. An i5 and i7 both have 4 cores, so at the same clockspeed, HyperThreading will make the difference. And then it depends on how much an application can benefit from the extra threads on the i7.

      Is hyperthreading generally good for gaming, or is it better for multitasking?

      Games generally don’t benefit that much from many cores/threads. Most games don’t benefit beyond 4 cores. Partly because driving a GPU is very much a single-threaded affair: there is only one GPU, and all the instructions have to be executed in a strict order to get correct rendering results.
      Until recently, graphics APIs had little or no support for multithreading at all. In DirectX 11, you can prepare lists of commands on other threads/cores, but it is still rather limited (and AMD’s drivers have a broken implementation of it). DirectX 12 and Mantle may gain more from extra cores/threads.

      Other tasks in games are also a bit hard to parallelize, such as physics and AI. So generally those don’t scale too well past 4 cores either.
      So, HyperThreading on an i7 does not really do much for games. On a core i3 it’s a different story: you only have 2 cores, and with HT you can run 4 threads, which will bring the CPU performance close to that of a regular 4-core i5 in most games.

      Lastly, how much does having extra threads, on average, increase speed? For example, on an i7 4770 there are 4 cores and 8 threads. How much does having those extra 4 threads help, if at all, with using a single application (game) at a time?

      Well, that’s the whole point of this article: any ‘average’ is completely useless, because the results per application are all over the place. In statistical terms: the standard deviation is very large.

  34. Mike says:

    Interesting read. Going back in time, this can help explain the “original” intentions. Things have only improved since.

    http://www.xbitlabs.com/articles/cpu/display/pentium4-3066_2.html#sect1

    “The 3.06 GHz Pentium 4 enabled Hyper-Threading Technology that was first supported in Foster-based Xeons. This began the convention of virtual processors (or virtual cores) under x86 by enabling multiple threads to be run at the same time on the same physical processor. By shuffling two (ideally differing) program instructions to simultaneously execute through a single physical processor core, the goal is to best utilize processor resources that would have otherwise been unused from the traditional approach of having these single instructions wait for each other to execute singularly through the core. This initial 3.06 GHz 533FSB Pentium 4 Hyper-Threading enabled processor was known as Pentium 4 HT and was introduced to mass market by Gateway in November 2002.” (http://en.wikipedia.org/wiki/Pentium_4)

  35. John S says:

    Excellent article. I only wish more could be written explaining how the CPU works in terms the average consumer can understand. I think many people buy PCs based either on salesmen’s suggestions, friends, or possibly reviews. They do little to even research what they actually need.
    I think it’s clear today that speed in MHz is very deceiving; a 2.0 GHz tablet and a 2.0 GHz laptop can be very different in speed. The trickery of hyperthreading is just that, a trick. A physical core will not perform 2X better using hyperthreading. In fact, I am surprised Intel even brought HT back. I guess the brick wall of speed has again caused Intel to go into the bag of tricks like HT and Turbo mode to try and distinguish itself from AMD. I guess it works from a retail platform, and it sells to people who know little about how HT or Turbo mode even work, or when they work. Unless you’re into performance figures, actual user experience will hardly be able to tell if HT is working or if Turbo mode is enabled. I myself have always been more concerned with power management than raw processing power. Maybe that’s why today we see far more emphasis on the power use of the chip than on speed. We have plenty of speed for what most people need. It’s like the problem with the PowerPC chip Apple was using. It was never bad, but the case was always made that Intel was far exceeding the PowerPC chip in speed, and that made Apple look bad.

    • Scali says:

      Intel never claimed that HT will give twice the performance. Heck, not even a second physical core will give twice the performance (as should be obvious from reading this article).
      HT is not ‘just a trick’ however. As I said elsewhere: https://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

      Now clearly, HyperThreading is just a marketing term as well, but it is Intel’s term for their implementation of SMT, which is the commonly accepted term for a multithreading approach in CPU design, one that has been around long before Intel implemented HyperThreading (IBM started researching it in 1968, to give you an idea of the historical perspective here).

      As I also said, I was never surprised that Intel brought HT back. x86 is not a very efficient instruction set, so a lot of execution units are left idle on a modern backend. HT is an efficient way to make better use of these execution units. It probably turns out positive in terms of power usage as well (after all, Intel even uses a form of HT on their Atom series). But I have never seen anyone do any tests on that.

  36. M.Shazly says:

    Where has this blog been ALL my life? Thanks for the informative post, man. Keep up the good work.

  37. Gustavo says:

    Nice article. But when buying a computer, people should think about cost vs. performance.
    I was comparing the prices for a new PC, since my Core 2 Duo E7400 is too slow to handle NetBeans 8, QtCreator 4, Gimp, Blender, MySQL Workbench, MySQL Server, MySQL Client, Dia, Mozilla Thunderbird, Google Chrome and the project (Qt compiling at the same time I code in NetBeans on another project) – all running at the same time. The Core 2 Duo just can’t handle all those workloads in reasonable time anymore.

    So, after comparing, I found that I can buy an i7 4770 for more than double the price of an FX 8350. But I would be far from getting double the performance, even comparing single-threaded performance. For the same price as the FX 8350 I can only get an i7 950. And that i7 loses to it in all benchmark tests. So, no doubt, I’m buying an FX 8350.

    One thing that should have been said here: there is software optimized for Intel, and there is software optimized for AMD. Intel leads the market, and so there is much more software optimized for it. If you take games optimized for AMD and benchmark them against an Intel (the all-powerful Haswell – of course using the same hardware besides the processor), you will see an FX 8350 with better performance in FPS (for example) than the Intel. It might not be too relevant, but I think it is worth the comment, especially when you pay 2-3 times more and don’t get even 1.5 times more performance.

    • Scali says:

      My blog is a technical piece. I don’t want to go into price, only technical merits of different architectures.
      Prices change all the time (AMD competes aggressively on price, which is why their 8-cores are so affordable these days, making them compete with Intel CPUs with a much lower core count, transistor count, power consumption etc). Besides, prices are never linear with performance; they rise ever more steeply as performance goes up.

  38. Wes says:

    You wrote: “x86 is not a very efficient instruction set, so a lot of execution units are left idle on a modern backend. HT is an efficient way to make better use of these execution units.” So does this mean that even when a program is running flat out at 100% on a core, behind the scenes there is still some idle capacity?

    I’m doing a project that involves a single-threaded sequential executable (no FPU or GPU use) that reads in data from a file, then spends anywhere from a few hours to a couple of weeks number crunching, then dumps the results to a file. The file I/O is insignificant compared to the calculation phase, which runs with no I/O at 100% of a core. Even though the executable is single-threaded, I can run multiple instances of this program in parallel. By the end, over 100 instances of the program have to be run. I have temporary access to a spare server with 2 Intel E5620 processors (8 physical cores, 16 logical cores) and 96 GB of RAM.

    Would it be faster in this unusual situation to run only 8 instances of the program at a time? I was under the impression since they’re running at 100% anyway, there wouldn’t be any gains from HT by running more than 8. But if you’re saying that there are still idle cycles in the “backend” anyway, then running more than 8 would reduce the total run time. Is this correct?

    • Scali says:

      Yes, if you look at a modern x86 backend, you will see many execution units. Usually you have something like 3 ALUs, 2 load/store units, 2 FPU/SIMD units (okay, it’s slightly more complicated than that; see e.g. Nehalem/Sandy Bridge/Haswell here for details: http://www.anandtech.com/show/6355/intels-haswell-architecture/8)
      Each of these can execute an instruction per cycle in the best case. So in theory you could have a maximum IPC of 3+2+2 = 7 instructions per cycle.
      In practice, however, most code will struggle to execute at more than 2 instructions per cycle.
      x86 itself is partly to blame: the instructions are not very powerful, and the small architectural register set makes it hard to keep the backend fed. Besides, decoding x86 instructions is very complex and expensive, which means CPUs can’t decode more than about 3 instructions per cycle; the reordering (retirement) after out-of-order execution is a bottleneck as well. So you’ll never be able to reach the theoretical maximum of the combined execution units anyway.

      So that means a lot of units are sitting idle most of the time. Which is why the concept of HT makes so much sense.
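      To make this a bit more tangible, here is a minimal sketch (numbers will vary per CPU and compiler, and an aggressive auto-vectorizer can blur the comparison): it times one long dependent chain of operations against the same amount of work split over four independent chains. On a superscalar core, the independent version typically runs noticeably faster, simply because the extra ALUs can actually be put to work:

```cpp
#include <chrono>
#include <cstdio>

static long long ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const long long N = 400000000;

    // One long dependent chain: every operation needs the previous result,
    // so the extra ALUs have nothing to do.
    long long a = 1;
    auto t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; i++)
        a = (a ^ i) + i;
    long long dep_ms = ms_since(t0);

    // Four independent chains doing the same total number of operations:
    // now the core can keep several ALUs busy at once.
    long long b0 = 1, b1 = 2, b2 = 3, b3 = 4;
    t0 = std::chrono::steady_clock::now();
    for (long long i = 0; i < N; i += 4) {
        b0 = (b0 ^ i) + i;
        b1 = (b1 ^ i) + i;
        b2 = (b2 ^ i) + i;
        b3 = (b3 ^ i) + i;
    }
    long long ind_ms = ms_since(t0);

    // Printing the results keeps the optimizer from deleting the loops.
    printf("dependent:   %lld ms (%lld)\n", dep_ms, a);
    printf("independent: %lld ms (%lld)\n", ind_ms, b0 + b1 + b2 + b3);
}
```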

      And yes, in your case it’s certainly possible that HT will improve performance. Namely, if we assume that your code can only average <= 2 ALU instructions per cycle, then you'll have some spare ALU time that an extra HT thread could make use of.
      Whether you actually gain in practice depends roughly on the following parameters:
      1) How efficient is the ALU use of your code?
      2) How efficiently can HT schedule two threads of your code on a single physical core?
      3) What effect does having two threads on a physical core have on the cache and memory subsystem? (If a single thread was completely bandwidth-limited by the cache/memory anyway, then even though there may be idle ALUs, you can't get any workload to them.)

      You could, however, consider rewriting your code. Namely, even though your code is integer-only, you could still create two variations of it: one targeting the regular integer ALUs, another targeting the SIMD ALUs (you can do integer operations with MMX/SSE2+).
      Because you will have a different 'mix' of instructions this way, running the 'regular' and 'SIMD' threads on a single physical core may give you better results with HT.
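      A rough sketch of that idea (crunch_scalar and crunch_simd are hypothetical placeholders for your two variants; it is also an assumption that logical CPUs 0 and 1 are the two HT siblings of the same physical core, which you should verify with GetLogicalProcessorInformation on your machine):

```cpp
#include <windows.h>
#include <thread>

// Pin the calling thread to one logical processor.
static void pin_to_logical_cpu(DWORD cpu) {
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpu);
}

// Hypothetical placeholders for the two variants discussed above:
// the same algorithm built on the scalar integer ALUs...
static void crunch_scalar() { /* integer-ALU implementation */ }
// ...and built on the SIMD integer units (MMX/SSE2 intrinsics).
static void crunch_simd()   { /* SSE2 implementation */ }

int main() {
    // Assumption: logical CPUs 0 and 1 are the two HT siblings of the
    // same physical core; check the numbering on your machine first.
    std::thread t0([] { pin_to_logical_cpu(0); crunch_scalar(); });
    std::thread t1([] { pin_to_logical_cpu(1); crunch_simd(); });
    t0.join();
    t1.join();
}
```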

      • Wes says:

        Thanks, that’s good to know.

        It took me a while to get the code working, so I don’t know that I want to mess with it, but I’m assuming that by compiling with the appropriate options it will use MMX/SSE wherever appropriate. I’m using two different compilers (for verification purposes): gcc with “-O3 -march=core2 -msse4.2”, and the Microsoft C compiler with “/Ox /favor:INTEL64”, which I understand uses SSE instructions.

      • Scali says:

        If you compile for AMD64 architecture, MS will use SSE2 by default.
        For 32-bit, SSE2 is not available on all CPUs, so you should use /arch:SSE2 to enable it explicitly.
        However, that is not the right way to get the code I am talking about.
        Such optimizations will generally only replace naive FPU code with SSE2 equivalents. What I’m talking about requires *ALL* the (ALU) code to be SSE2, which you can only achieve by using intrinsics or inline assembly.

        Never rely on a compiler to generate optimal code for MMX/SSE.
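        To illustrate the difference, here is a minimal sketch (sum_scalar and sum_sse2 are just illustrative names): the scalar loop is left to the optimizer, while the intrinsics version guarantees that the integer work goes through the SSE2 units:

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstdio>

// Scalar version: the compiler will normally emit plain integer ALU code.
static int sum_scalar(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

// Explicit SSE2 version: the adds are guaranteed to go through the SIMD
// integer units. Assumes n is a multiple of 4 to keep the sketch short.
static int sum_sse2(const int* p, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i*)(p + i)));
    int tmp[4];
    _mm_storeu_si128((__m128i*)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

int main() {
    int data[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    printf("%d %d\n", sum_scalar(data, 8), sum_sse2(data, 8)); // 36 36
}
```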

  39. Deepak Roy says:

    I just stumbled upon this article rather by accident, and it has really helped. With this octa-core vs quad-core vs dual-core fight going on in the mobile world, I didn’t want to compromise on speed and was about to invest in an expensive phone. Now I am better off buying a dual-core which advertises better multi-threading performance, like the ASUS Zenfone 5.

    Thanks a ton again. Deepak, India

  40. Pingback: More about multi-thread! | Graphic Debugging Tool

  41. tomant90 says:

    TLDR; @Scali It is 2015, I think it is time you revised your article and cleared up some of the confusion you have caused. You sound like a professor at the University of California. This article has some merits; however, you have completely overlooked crucial details. You speak of “measures of performance” but only seem to concern yourself with processing speed. If you are talking about overall performance, you need to consider the effects of such operations on the hardware level as well. The CPU has to telegraph a series of high and low signals to perform computations, move data and such. Due to enthalpy and the impedance values of the billions of integrated components, the system will undoubtedly produce an excess of energy dispersed in the form of heat (most likely, infrared radiation). The temperature is determined by far too many factors to predict; however, one thing is certain. Due to the laws of thermodynamics, the temperature of the system will rise until it has reached thermal equilibrium, or until the surrounding elements have reached the same temperature as the system. This accounts for most system failures in personal electronic computers and devices on the market, which is why nearly all computers have fans. Getting more to the point, these CPUs are soldered onto a BGA with a lead-free thermal compound (usually of Tin-Silver-Copper) due to RoHS restrictions. This alloy is not sufficient to handle the thermal exigencies of a system that has well exceeded the maximum permissible operating temperature (a common problem caused by simply forgetting to clean your fan). A phenomenon known as tinning can occur as well, causing setting issues (another point of failure). Furthermore, chip manufacturers have been producing products to meet the exorbitant demands of modern society for the past 10 years or so; apparently with less care towards efficiency. The consequence is often a single-core that can compute threads of execution at a frequency of about 3.7GHz and upwards. Part of this problem comes from America’s fascination with Moore’s Law, trying to meet the curve by increasing the speed of the processors exponentially. This is done by reducing the physical dimensions of the transistors but keeping the processor dies at about the usual dimensions of 450 mm². What is the problem? More power needs more power. Transistors can be fabricated at the nano-scale in the tens of nanometers. The ideal initiative (among others) would be to reduce the size of the unitary dies to thwart the high power requirements of the entire system. Smaller dimensions with high density of well-structured integrated combinational logic units mean less power consumption. Larger dimensions and higher density of poorly-structured combinational logic mean those high and low signals need to travel farther and consume more power. More threads mean more signal switching (multiplexed, remember?). More switching means more power. More power means more heat. It does not take a physics background to know that when you increase the thermal potential (enthalpy) of a system you, in turn, increase the thermal equilibrium of the system (if other factors remain unchanged), leading back to my first point. Multi-core processors are not perfect; however, they often have lower power requirements, in part, due to the smaller dimensions of the individual dies versus a single-core equivalent. Since you apparently enjoy basing your arguments on suppositions, I will lend you one of my own.
If a single-core processor had twice the speed of a dual-core of similar architecture, the single-core would process the sequential operations twice as fast, requiring twice the amount of power and dissipating twice as much heat (I explained why above). Also, the parallel operations would process at half the speed (you apparently forgot each of those dual-core processors is multi-threaded as well). You have some valid points here. For one, multi-threading is crucial to both single- and multi-core CPUs. Secondly, clock speed is not a true test of a CPU’s effectiveness and overall performance. Lastly, more cores are not inherently optimal if the hardware/software interface is poorly designed. However, you seem to have a misunderstanding of the function of multi-threading in general and you completely misunderstood Amdahl’s Law as though it were an argument against multi-core systems. Simply put, more processing units allow for execution of more tasks per unit of work. Multithreading is a means to increase the effectiveness of a single core; simulating parallelism by reducing the amount of time the CPU spends in polling loops (waiting). These two are not mutually exclusive of one another, but rather complementary, as most, if not all, multi-core systems utilize multithreading. Amdahl’s Law is a way of predicting the theoretical time savings of utilizing a multi-core system by parallelizing portions of the program (in favor of multi-core). Even if you could not parallelize a large portion of a program, the benefits of multi-core CPUs still prevail because modern systems are constantly running background applications such as cron jobs, daemons and the like. Multithreading is not limited to a single process. There are multi-threading techniques that switch execution between multiple threads of multiple processes. In conclusion, the overall performance of a CPU goes beyond the number of processors. It is a delicate balance between hardware logic, the timing and efficiency of interrupts, well-designed programming interfaces and other factors. Multi-core technology is part of the solution of the future of computing as business needs create more exigencies. Multi-core is the next logical step after multi-threading (vertical scaling in addition to horizontal scaling); it is not a myth. It is up to software developers to responsibly parallelize their applications in effective ways to better utilize those multiple cores, and chip manufacturers to advance research and development in these areas. It is 2015 and chip manufacturers are still building multi-core processors. Would Intel, a $55 billion company, waste money on fruitless pursuits with their Intel Core product lineup?

    A few more points off topic. First, the processing unit itself (separate from its housing) is referred to as a die, as it is cut from a much larger semiconductor wafer; the substrate on which the integrated circuit is printed. Second, never compare AMD to Intel. This has nothing to do with single- vs. multi- core technology, Intel just manufactures better products. Third, Pentium is obsolete; Core is the only way to go at this point in time. Last, single-threading is opposed to multi-threading not multi-core processing (you stated: “With Bulldozer, AMD decided to trade single-threaded performance for having more cores on die”). Like I stated earlier, multi-threading and multi-core technologies are complements of one another.

    • Scali says:

      TL;DR indeed. If you need that many words to try and make a point, it’s not worth reading.

    • k1net1cs says:

      If you can’t be bothered to put up a structured reply by splitting up points into paragraphs, then don’t expect people to be bothered analyzing what you’ve written.

      Unless that was intentional.

  42. TH says:

    Thanks, good explanation. I understand the difference better now.

  43. Pingback: AMD processors | wellinformeduser

  44. Pingback: well informed user

  45. Scali Is An Idiot says:

    This is the dumbest, most self-contradicting, asinine explanation of why Intel CPUs outperform AMD’s I have ever seen in my life.

    This guy obviously is just an Intel fanboy trying to talk people in circles so they actually believe the crap he is typing.

    • Scali says:

      Well, why don’t you start by pointing out a few of said contradictions… Because so far it just sounds like baseless accusations from an AMD fanboy (the article isn’t specifically about Intel or AMD, but apparently you see everything as Intel vs AMD).

      Do you have any idea how dumb you sound anyway? This article has been posted on many sites over the years, and has been referenced even in course materials at universities. Try googling for examples.

  46. Danny Lee says:

    Which chip manufacturer (AMD or Intel) offers the best implementation of simultaneous multithreading (SMT)? Is it an essential CPU feature for virtualization platforms?
    I’m just a bit confused as to the advantages and disadvantages of each.
    If you had a choice, which would you prefer?

    Love your blog!
    Such great information.
    Thank you so much in advance.

  47. Alex says:

    What do you think about Zen? Can it save AMD?

  48. momina razzaq says:

    Please tell me: is a multi-core processor more efficient than a multi-chip processor? If yes, then why?

    • Scali says:

      A multi-core processor *can* be more efficient, because it’s easier to make high-speed connections between different units on a single chip than between 2 or more chips. You can also share things like cache in an efficient way. But it is not a guarantee. In some cases, multi-core processors are nothing more than multiple chips copy-pasted on a single chip, using the same bus to connect them as you would use between separate chips.

  49. snookerbetting says:

    Hi Scali, really great explanations, thanks.

    Could you please help me with clarifying one question? I’ve been looking for the answer all over the internet.

    If we have a heavy single-threaded, high-priority process (say, Solidworks) running on a CPU with HT (say, a 3.8 GHz Skylake i3), and this app needs all the clock cycles it can get, can one core deliver almost its full computing power to this process (say, 3.7 GHz)? Is this how it works?

    Before reading your article I didn’t know much about HT, and thought that one process simply can’t get more than 50% of a physical core’s clock cycles on an i3 (1.85 GHz with the given CPU).

    It seems you wrote exactly about it:

    for sequential parts you will only use one logical core, which will have the physical core all to itself, so you get the excellent single-threaded performance that Intel’s architecture has to offer

    • Scali says:

      It depends on whether you’re talking about the hardware-level or the software-level. At the hardware-level, all cores work in parallel (both physical and logical ones), so they always get all clock cycles in the machine. The thing is that there can be resource contention, so not all cycles will be used effectively, there may be some stalls.

      On the software side, it depends on how the OS scheduler divides the workload. In theory, yes, a single process/thread can get ~100% CPU time.

      • snookerbetting says:

        Just to make sure I got it right.

        If we have two CPUs based on the same microarchitecture: the first one has 4 cores clocked at 2.5 GHz, the second has 2 cores with HT clocked at 4 GHz (neither exists in Intel’s actual lineup, of course). And we have a small number of active and background tasks that would keep the second ‘logical core’ of the first physical core busy (and presume the OS schedules their execution on the second physical core). A heavy, mostly single-threaded ‘ideal app’ (no stalls) could in theory get close to 4 GHz on the second chip (hyper-threaded!) and only 2.5 GHz on the first chip, right?

        You know where I’m going here. 🙂 My friend is looking for a new PC for AutoCAD and Solidworks modeling, and we figured out that these apps are mostly single-threaded. So if he is on a low budget, I think he should possibly choose a 3.7-3.9 GHz i3 over a 3.3 GHz i5 and spend the 40-50 bucks saved on extra DDR4, an SSD, or a better GPU.

      • Scali says:

        No, you can’t add up clockspeeds of different cores. Regardless of how many cores a CPU has, if it runs at a speed of 2 GHz, it runs at 2 GHz. HT shares the execution units of a core between multiple threads. It does not add extra clock cycles or anything. It just means that one core runs 2 threads, so that thread 2 can use the execution units that thread 1 is not using, and vice versa. Which CPU is the best for your needs depends on how many threads you are likely to run at the same time, and how much work these threads need to do. If you generally just run 1 thread at a time, then yes, more clockspeed will win over more cores. When you run multiple threads, see article 🙂

      • snookerbetting says:

        Yeah, thanks, I’m already familiar with how HT works (Ars Technica has a very good article about it); my only misunderstanding was this ‘can one app, in theory, have almost all execution resources on a hyper-threaded core’ thing.

      • Scali says:

        Yes, it can. Although some parts are split between the two threads (HT is implemented on top of the out-of-order-execution logic), in more recent HT processors these shared resources have been dimensioned so wide that they’re overkill for a single thread, and even when split, they will rarely form a bottleneck. So for a single thread, there is little or no performance hit for enabling HT.

      • snookerbetting says:

        Perfect. Thank you and good luck.

  50. Philo says:

    It’s 2016, and I want to address this as I see exactly the same misconception I have always seen with respect to multi-core (and before that, multi-CPU) architectures. You seem to be speaking as though someone will only use their computer for one thing.

    I’m running Windows 8.1, and right now my desktop is running a backup, downloading files, browsing the web (in the past a web page was a static thing; this is no longer true, so every open tab is a process), rendering video from Premiere, running Rainmeter (desktop gadgets), and of course all the usual stuff an OS does – network stack, DNS lookups, talking to various USB devices, and so on.

    The OS manages those processes between the eight virtual cores in my desktop. While it’s true that switching threads costs cycles, it’s also true that context switching costs cycles. So having eight cores to assign threads to means one eighth the context switches for the cores.

    Add in that when code misbehaves and locks a thread, there are still other cores for the task scheduler to work with. Folks with single-core machines may remember the situation of having some task lock up their PC, where they could get in a keystroke or a mouseclick once a minute, thanks to the multitasking nature of Windows. People with multi-core architectures may note this doesn’t happen as much as it used to. (Of course, a poorly-written app can still lock a PC hard if it tries hard enough!)

    Most of the benchmarks I’ve seen make the same presumption – that a single application is the only concern. What I’d like to see (and it may exist, I just haven’t gone looking this time around) is a benchmark that works like a multitasker – open up four instances of Word, an instance of Photoshop, several browsers (and browse to pages on the local hard drive to eliminate remote server factors), etc. Run a bunch of unrelated tasks simultaneously and clock both total time of execution as well as summing the times across all tasks.

    Because we can argue theory all day long; it’s the actual benchmark that tells us what we really want to know.

    • Scali says:

      You seem to be speaking as though someone will only use their computer for one thing.

      Not really. While the point was mainly to explain how you should or shouldn’t optimize software and hardware for certain tasks, the story doesn’t really change that much whether you’re running one process or many. It always boils down to the number of threads vs the amount of time each thread needs. To which process each thread belongs is not that relevant in this particular context.

      I’m running Windows 8.1, and right now my desktop is running a backup, downloading files, browsing the web (in the past a web page was a static thing; this is no longer true, so every open tab is a process), rendering video from Premiere, running Rainmeter (desktop gadgets), and of course all the usual stuff an OS does – network stack, DNS lookups, talking to various USB devices, and so on.

      As the article explains: most of these threads are dormant most of the time, and only wake up for a brief period to handle a certain event (things like downloading or making backups are usually I/O-limited, not CPU-limited). There is no reason to allocate physical cores to such threads, because the cores would just be idling most of the time. The impact of running such threads on other cores is minimal.
      Adobe Premiere is the only truly CPU-heavy task you mention (and possibly the web browser if you have many dynamic pages open, eg YouTube, Facebook or such). You will agree with me that most other stuff you mention would work just as well on a PC with one or two cores as it would with 4 or 8. Or do you really want to claim that we couldn’t run a network stack, do DNS lookups, make backups, download files or connect USB devices before we had 4 cores or more?
      You make exactly the kind of flawed reasoning that inspired this article in the first place: “Oh, I have X tasks, therefore I need X cores”.

      Add in that when code misbehaves and locks a thread, there are still other cores for the task scheduler to work with. Folks with single-core machines may remember the situation of having some task lock up their PC, where they could get in a keystroke or a mouseclick once a minute, thanks to the multitasking nature of Windows. People with multi-core architectures may note this doesn’t happen as much as it used to. (Of course, a poorly-written app can still lock a PC hard if it tries hard enough!)

      That’s not the point though.
      Firstly, I’m not arguing against multi-core, and saying we should all go back to single-core, obviously.
      Secondly, what you describe only makes a difference between 1 and 2 cores.
      With a single-core machine, realtime-priority tasks can all but lock up a Windows system. With a dual-core machine, Windows schedules things slightly differently, and one core will always remain ‘non-realtime’, to ensure some level of responsiveness, no matter what.
      This doesn’t change with more cores. An 8-core machine will still get sluggish when you saturate it with realtime-priority threads. Then again, that is exactly what you want. You want to use all the available CPU cycles for that workload. So it gets just as unresponsive as a dual-core machine would under a ‘misbehaving’ workload.

      Because we can argue theory all day long; it’s the actual benchmark that tells us what we really want to know.

      I don’t think they will, because the test you describe is a completely arbitrary workload.
      The point of my article is that everyone should understand the *mechanics* of multithreading and multi-core processing. So that they have enough knowledge to:
      1) Understand that they should not just be blinded by number of cores, but look at their own requirements.
      2) Be able to set up their own benchmark for their specific needs.
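      For example, a bare-bones skeleton for such a do-it-yourself benchmark (the dummy workload is just a placeholder; substitute the kind of work you actually care about, and look at how the wall-clock time scales with the number of threads):

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder workload; substitute the task you actually care about.
static void workload() {
    volatile double x = 1.0;
    for (int i = 0; i < 50000000; i++)
        x = x * 1.0000001 + 0.5;
}

int main() {
    // Time 1..N concurrent instances of the workload and watch how the
    // wall-clock time scales with the thread count on your machine.
    unsigned max_threads = std::thread::hardware_concurrency();
    for (unsigned n = 1; n <= max_threads; n++) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; i++)
            pool.emplace_back(workload);
        for (auto& t : pool)
            t.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - t0).count();
        printf("%u thread(s): %lld ms\n", n, (long long)ms);
    }
}
```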

  51. conradca says:

    The only benefit of multiple cores is to allow a process to solve CPU-bound problems faster, by applying the equivalent of multiple CPUs to the problem. I am not sure that this actually works, because the cores compete for access to the memory bus and cache, which might make a multi-threaded application slower.

  52. aaabbb says:

    Per this, there’s only about 20% benefit that HT can give – https://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/

    • Scali says:

      It depends very much on what kind of software you’re running. IBM has reported some cases where they got over 50% better performance from enabling SMT.
      It should be possible to create some synthetic scenario where you will get close to 100% better performance (run two threads, where neither accesses any data from memory/cache, and choose the instructions so that thread 1 only uses instructions from one set of execution units, and thread 2 only uses instructions from another, disjoint set of execution units).
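      A sketch of what such an experiment could look like (register-only loops; int_work and fp_work are just illustrative stand-ins, and for a clean single-core measurement you would additionally pin both threads to the two logical CPUs of one physical core, e.g. with SetThreadAffinityMask on Windows). Compare the int+int and fp+fp timings against int+fp:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Register-only integer workload: hammers the integer ALUs.
static void int_work() {
    long long a = 1, b = 3;
    for (long i = 0; i < 400000000; i++) { a += b; b ^= a; }
    volatile long long sink = a + b; (void)sink;
}

// Register-only floating-point workload: hammers the FP/SIMD units.
static void fp_work() {
    double a = 1.0;
    for (long i = 0; i < 400000000; i++) a = a * 1.0000001 + 0.5;
    volatile double sink = a; (void)sink;
}

static long long time_pair(void (*f)(), void (*g)()) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(f), t2(g);
    t1.join(); t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    printf("int+int: %lld ms\n", time_pair(int_work, int_work));
    printf("fp+fp:   %lld ms\n", time_pair(fp_work, fp_work));
    printf("int+fp:  %lld ms\n", time_pair(int_work, fp_work));
}
```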

      • Thomas says:

        In relation to Intel’s HT implementation, who cares what IBM reports with their SMT implementation.

        Both share the same principle, but their goals, design and implementation (the actual details) aren’t the same.

      • Scali says:

        Nice try fanboy, but I didn’t say this was about IBM hardware. This was IBM reporting the gains for some of their x86 software on Pentium 4-class CPUs with HT. They got 50+% gains in some cases.
        See here:
        http://www.ibm.com/developerworks/library/l-htl/

        The results on Linux kernel 2.4.19 show Hyper-Threading technology could improve multithreaded applications by 30%. Current work on Linux kernel 2.5.32 may provide performance speed-up as much as 51%.

  53. Thomas says:

    Thank you for clarifying. I apologise for jumping to the wrong conclusion.

    Now to your post.

    A paper from 2003 showing a P4-class CPU getting great HT gains. No argument here.

    Can you find evidence of the Core-class CPUs getting 50% HT gains, or anything for Intel+HT CPUs in the last decade? There’s no point in you saying “depends on what kind of software” when current hardware doesn’t support those gains.

    • Scali says:

      Why would I have to do any of that?
      The following claim was posted:

      Per this, there’s only about 20% benefit that HT can give – https://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/

      This more or less implies that there is some kind of ‘hard limit’ to how much benefit HT can give, and that this limit is around 20%.
      So I respond by pointing out that IBM has reported gains of over 50% from HT in some cases. This proves the claim of ~20% wrong (which should be obvious anyway to anyone who understands HT).

      I don’t see why any of the evidence you are now demanding would be relevant. And I don’t see why you suddenly move the goalposts from 20% to 50%.
      All I see is that 99% of all people have no idea what HT really is, and how it really works, and somehow everyone is ‘against’ HT or SMT in general for some strange reason.
      The 1% who understand what SMT is and how it works, have no problem understanding that in certain synthetic cases you would be able to get close to 100% gains (as I explained above). And by extension, they have no problem understanding that there may be practical situations where you can get (significantly) more than 20%.

      • Thomas says:

        I guess my text isn’t translating over the internet to what I think I’m typing. I’m not demanding anything from you and I’m not shifting goal posts.

        You don’t have to do anything, I’m just curious to see if you have anything like the 50% for current generation processors. If not, all good, don’t hassle yourself. I’m just trying to get a handle on this before making my (likely dumb) conclusion.

        Considering what I’m asking for doesn’t have any relevance, I guess my conclusion is that I can get over 50% from HT when using some decade-old processor with some decade-old Linux kernel, because some guy at IBM benchmarked his ‘Chat Room’ software.

      • Scali says:

        I’m just curious to see if you have anything like the 50% for current generation processors.

        Why are you asking me? Does this blog look like a benchmarking blog? Do I seem to be the kind of person that keeps track of these things? Use Google if you are looking for information, or conduct your own benchmarks to answer your questions.

        I’m pretty sure there will be plenty of 20+% cases somewhere out there, possibly even 50+%, since HT has evolved a lot since the early Pentium 4 days, and the OSes and libraries have become more HT-aware. But I personally am not interested in such statistics, so I neither conduct such benchmarks myself, nor do I categorize such data.

  54. Pingback: nVidia’s GeForce GTX 1080, and the enigma that is DirectX 12 | Scali's OpenBlog™

  55. JD says:

    So, if I am most interested in multitasking (running multiple instances of stock-trading software with more than 10 windows over multiple screens, with multiple studies and graphs, 20 browser tabs, streaming video, MS Office software, etc.), what would be preferable: an Intel Core i7 6800K with 6 cores/12 threads at 3.6 GHz, or an Intel Core i7 6700K with 4 cores/8 threads running at 4.2 GHz?

  56. Pingback: DirectX 12 and Vulkan: what it is, and what it isn’t | Scali's OpenBlog™

  57. Snappy says:

    All threads are processes. Intel follows the IBM concept of the Power PC when it comes to how processing behaves. You have a master/slave scenario with cores and threads. The way multithreaded works is that you can have a single processor with multiple processes that don’t follow the master/slave rule; instead they flow like running water and thus exchange data in a kind of message-passing left and right scenario. Here is where things get tricky: not all apps are designed for threads; in the hyperthreaded scenario it doesn’t matter what your app is designed for. The processing level is smart enough to determine message passing but does gain an advantage if the app is written with threads in mind. Multithreaded in this regard sucks: if the app is not designed for threads your programs run slower, but on the flip side, if they are multithreaded, processing is much faster than hyperthreading as the exchange becomes more apparent to the processor. SMT is Intel’s answer to multithreading, they bought the rights from AMD for use of it 😛

    • Scali says:

      All threads are processes. Intel follows the IBM concept of the Power PC when it comes to how processing behaves.

      Context is important here (pun intended). What a thread is to an OS or application, is not the same as for the CPU. A CPU doesn’t know what a process is at all. A process is an abstraction at the OS-level.
      Generally, all threads have their own thread context, but share the same process-context (eg, they all run in the same address space, and can share handles/objects belonging to that process). A CPU doesn’t know anything about this. The OS does however.

      SMT is Intel’s answer to multithreading, they bought the rights from AMD for use of it

      I assume you meant IBM instead of AMD here.

  58. helloacm says:

    Hello, I have done some experiments, and it turns out that a single-threaded process can utilize all cores as well; please see https://helloacm.com/multi-processes-experiments-when-can-windows-utilize-all-the-cores/

    • Scali says:

      A single-threaded process can only utilize one core. You can run multiple single-threaded processes in parallel to use all cores. But a single thread by definition can only run on one core at once (although with modern OSes, the scheduler may make it ‘hop’ from one core to the next).
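      You can actually watch that hopping happen. On Windows, for example, something like this will print a line every time the scheduler has migrated the thread (GetCurrentProcessorNumber reports which logical CPU the calling thread is on at that instant):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // A single-threaded busy loop: it only ever runs on one core at a
    // time, but the scheduler is free to migrate it between cores.
    DWORD last = (DWORD)-1;
    for (long long i = 0; i < 2000000000LL; i++) {
        DWORD cpu = GetCurrentProcessorNumber();
        if (cpu != last) {
            printf("now on logical CPU %lu\n", (unsigned long)cpu);
            last = cpu;
        }
    }
}
```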

  59. Tarique Hasheem says:

    Awesome blog! Great knowledge and information. Commodore computer was a kick to look at!

  60. Ksec says:

    Any educated guess as to why ARM still has not included any SMT in their designs?

  61. cloud squid says:

    Thanks for the great article! Still relevant 3 years later.

    One question I have is: If I want to run 4 independent single-threaded processes (similar workloads) on a CPU, would the overall performance be better on a 4-core CPU with better single-threaded (single-core) performance or on a 4-core CPU with better multi-thread performance?

    Looking forward to hearing your opinion.

    Thanks!

    • Scali says:

      Thanks for the great article! Still relevant 3 years later.

      Try 5 years 🙂 It’s from 2012. Still the post with the most hits on this blog.

      One question I have is: If I want to run 4 independent single-threaded processes (similar workloads) on a CPU, would the overall performance be better on a 4-core CPU with better single-threaded (single-core) performance or on a 4-core CPU with better multi-thread performance?

      It’s impossible to answer that, since it depends on so many factors, such as:
      1) Just what are these workloads, and how can they be distributed? Is latency important?
      2) What exactly is a ‘4-core CPU with better single-threaded performance’, and what is a ‘4-core CPU with better multi-thread performance’?

      I think if anything, the point of the article is that the usual “more programs/threads means more cores” is a much too simplistic view, and you can’t really predict performance very well in the general sense.

  62. Pingback: How can a single core processor run multiple tasks at once? | jogendra@.net

  63. sslz says:

    Hello,

    I came to your post because I was confused about what to pick as a first machine for learning multi-core, multi-threaded programming:

    – single socket or dual socket
    – many slow cores (2.0 GHz x20) vs. fewer, faster cores (3.7 GHz x8)
    – an old and cheap Westmere vs. a Skylake Xeon with AVX512

    My aim is to develop some compute-intensive applications. Do you have any advice for a beginner setup, such that one can gradually build up the skills in this area?

    many thanks

    • Scali says:

      It depends more on what you want to do. An extreme example is polygon rasterization. We have GPUs, which have an extreme amount of cores, but the individual cores are very weak. Because rasterization is embarrassingly parallel, this type of processor works extremely well here.

      But, if you just want to get started, it doesn’t really matter what you pick. It’s more about getting experience. Just experiment with how you can distribute algorithms and data over multiple cores/threads, how to synchronize them, how to benchmark etc.

      • sslz says:

        Thank you very much.

        I know it would be easier just to pick up a multi-core workstation to get started. But there are so many “dimensions” to learn in multi-core, multi-threaded programming:
        – can’t learn NUMA if I buy only one socket
        – the optimal number of cores to get started: is 8 cores too few? Are 20 cores enough?
        – CUDA, OpenCL (now ROCm) etc. have specific requirements for power supply and PCIe slots
        – a motherboard that can house two double-width GPUs seems better, as I could test different algorithms
        – AVX/AVX2/AVX512 vectorization might help for some parallel problems, instead of porting to GPUs

        Anyway, I know I could just settle for the cheapest old Westmere multi-core workstation as a start.
        However, I just want to avoid replacing the workstation later, i.e. a double investment.

        A workstation to meet all those requirements, two socket + multi-core + two GPU + AVX512, means a lot of money to spend.

        So still hesitant what to pick 😦

      • Scali says:

        Yes, there are so many different angles to multi-core/multithreading… It’s going to be very difficult, not to mention expensive, to buy a single machine that can do it all. But it would take years to master all the different types of programming, so you’d need to take things one at a time anyway. In theory you could start anywhere, and just replace/upgrade the hardware over the years, as you progress. An extra dimension of course is that the hardware gets outdated over time… If you start with CUDA 5 years from now, you wouldn’t want a 5-year old GPU to start on.

        I personally would start on the CPU-side anyway. GPGPU is more difficult than conventional multithreading. Not to mention it has a more limited range of applications. Perhaps starting with two sockets is a bit much, there’s quite a premium to pay for such machines. I think just a single-socket system with perhaps 6 cores and 12 threads may be a good enough starting point for the first few years of learning how to optimize multi-threaded applications.

  64. sslz says:

    Your advice sounds good. I can’t have the best of both worlds as a start!
    (EDIT: I could as Dell quoted USD 7000!!! for its latest and greatest 7920 dual socket three-double-width GPU tower)

    Currently I am torn between the used Ivy Bridge dual socket dual double-width PCIe Dell, and the brand new single socket Dell Skylake

    For the used Ivy Bridge Dell, I am not sure if that is a good idea, because the machine is, I guess, already 4.5 years old. I will be on my own to support it if anything goes wrong.

    Also, you said GPGPU is very limited. Are these the advantages of CPU-side multi-threading:
    – large memory 512GB
    – handling of branchy codes
    – no latency of sending/receiving across slow PCIe buses
    – and most importantly scale the codes across multiple sockets in one server.

    For the last point, how important is NUMA coding? Is NUMA a priority I should learn together with single socket multi-threading? I imagine that if I do (learn NUMA), I could “spread” my codes across multiple “dual-socket” (or “quad-socket”) servers in a grid environment.

    On the other hand, if I buy the latest single-socket Dell Skylake, the path will be from 1) multithreading to 2) AVX vectorization to 3) a single GPU to parallelize some random-number generation. And I would miss NUMA.

    Anyway, I want to make a quick decision, because I know you never learn without even a start!

    • Scali says:

      Currently I am torn between the used Ivy Bridge dual socket dual double-width PCIe Dell, and the brand new single socket Dell Skylake

      Well, multi-socket machines do pose some unique challenges. You get to play with some NUMA issues, and you have to be more careful with sharing data, because sharing data between cores on the same socket is faster than between the two sockets.
      But you could also argue that this is an ‘advanced’ topic, so it is not something you would start with until you have good experience with multi-threading in general.

      Also, you said GPGPU is very limited. Are these the advantages of CPU-side multi-threading:
      – large memory 512GB
      – handling of branchy codes
      – no latency of sending/receiving across slow PCIe buses
      – and most importantly scale the codes across multiple sockets in one server.

      Yes, that is a reasonable overview. Also, the type of cores is quite different. GPUs may have many more cores than CPUs do, but their ‘single-threaded’ performance (if there were such a thing in GPGPU) is generally not as good as on a CPU.
      Also, the memory on video cards is very different from system memory. It tends to be high-latency, but also very high bandwidth. And the caches are not as advanced as on a CPU. They are very efficient with linear memory accesses, but not with random access patterns.

      As for the ‘branchy code’… Once you get into MMX, SSE and AVX on an x86 CPU, you will write code in much the same way as GPGPU does, and then you will have the same issues with ‘branchy’ code: you generally won’t actually branch; instead, all lanes compute the results for both paths, and bitmasking is used to discard the results of the ‘not taken branch’. With MMX/SSE/AVX you do this manually; with GPGPU the compiler compiles the scalar threads into SIMD form automatically, so the actual bitmasking and such is abstracted away.
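      For illustration, the masking trick looks like this with SSE2 intrinsics (a minimal sketch computing a per-lane max(a, b); a GPU compiler generates essentially this pattern for an if/else):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdio>

int main() {
    __m128i a = _mm_set_epi32(1, 20, 3, 40);
    __m128i b = _mm_set_epi32(10, 2, 30, 4);

    // The 'branch condition' is evaluated for all four lanes at once:
    // a lane becomes all-ones where a > b, all-zeros otherwise.
    __m128i mask = _mm_cmpgt_epi32(a, b);

    // Both 'branches' are computed; the mask keeps a where the condition
    // was true, and b where it was false (the 'not taken branch' results
    // are simply discarded by the bitwise operations).
    __m128i maxab = _mm_or_si128(_mm_and_si128(mask, a),
                                 _mm_andnot_si128(mask, b));

    int out[4];
    _mm_storeu_si128((__m128i*)out, maxab);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // 40 30 20 10
}
```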

      For the last point, how important is NUMA coding? Is NUMA a priority I should learn together with single socket multi-threading? I imagine that if I do (learn NUMA), I could “spread” my codes across multiple “dual-socket” (or “quad-socket”) servers in a grid environment.

      I think you are taking the concept of NUMA one step too far. When you have a grid of multiple servers, that is considered a ‘cluster’. NUMA is a more ‘integrated’ form, where there are multiple buses, but still inside the same system (such as a multi-socket system where each CPU has its own memory controller, with a part of the system memory attached to it).
      Cluster computing is an entirely different kind of system scaling. In that case, the memory contents have to be shared via some kind of network connection instead of a bus on the mainboard. So it would require yet another approach to programming, and again, it has its own collection of problems on which it works well.

  65. Pingback: Common Misconceptions About a Multi-Core Processor – Network Services Inc.

  66. Matt Joy says:

    I was searching to clear up some of my OS concepts when I came upon this blog. It has helped me a lot, but I still have some doubts:
    1) Do multithreading & multiprocessing mean you have to use more than 1 worker or not? (Is the actual meaning of these that things are done in parallel?)
    2) If you say that a single core is still capable of multithreading, then what is the difference between asynchronous programming and multithreading? (https://stackoverflow.com/questions/34680985/what-is-the-difference-between-asynchronous-programming-and-multithreading)

    • Scali says:

      It’s sometimes confusing, as not everyone uses the same definitions for the same terms. I’ll try to give the definitions I use, and then try to answer your question:
      Multithreading: multiple threads of CPU code can be managed by the system at once (this does not necessarily mean that more than one thread is actually being executed at a time, like on a single core system)
      Multiprocessing: A system with multiple processors that are active at the same time. This could be a system with multiple CPUs (single core or otherwise), a system with a CPU with multiple cores, or a system with different types of processors working together (eg a CPU and a GPU, or a DSP or such).
      Asynchronous: This literally means ‘not synchronous’. In its most basic form, it means that when you execute tasks, they are not running in a strict order. It means it is unpredictable which task completes when.

      So to get back to your questions:
      1) I guess that is a question of definitions. As I mentioned in the article, a single core will execute a single thread at a time. So if you have a multiprocessing system, and you want to actually keep all your processing units busy, then they all need to run a thread. But what is the definition of a ‘worker’ in this case? If it is a running thread, then yes, they will all need a running ‘worker’ thread to fully utilize the multiprocessing system. However, if it is a more high-level ‘worker’ service/module, then perhaps that ‘worker’ will spawn multiple threads by itself, in which case a single ‘worker’ may still be able to use the multiprocessing capabilities.

      ‘Multithreading’ depends on the context… A ‘multithreading system’ is capable of running multiple threads. It’s still a multithreading system even when you only run one thread.
      But a ‘multithreading application’ is an application that actually uses multiple threads.

      Where multiprocessing is actually parallel, because you have multiple physical processors running at the same time, multithreading is not necessarily parallel, as a single core can run multiple threads through timeslicing.

      2) Multithreading is just one way of having multiple tasks ‘in flight’ at the same time. I would say that ‘asynchronous processing’ is a more basic, more low-level concept. You can implement asynchronous processing through multithreading, but there are also other ways.
      Hardware interrupts are a good example. If we take disk access as an example… It is possible to perform asynchronous reads and writes.
      Let’s take an async read as example.
      The most basic version of disk access would be for the CPU to send a read command to the disk controller, and then block while waiting for the disk controller to complete reading, so that all data is in memory. Of course this is wasting precious CPU time.
      If the disk controller uses a hardware interrupt to signal that it has completed the operation, the CPU can respond to that instead.
      So you would then send a BeginRead-command to the disk controller. Once the read is set up, the CPU can continue doing other things.
      The read is now running ‘asynchronously’, it is running ‘in the background’, while the main thread that you called it from, will continue running.
      At some point (you don’t know when, difficult to predict), the disk controller will complete, and send an interrupt to the CPU.
      The CPU will handle that interrupt, and it will perform the EndRead-logic to finish the read.
      In Windows, for example, a BeginRead is done by using the ReadFileEx() API:
      https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfileex
      You can pass a completion routine, which is basically a callback function that you register to be called when the read is complete.
      Windows will interrupt your current thread, and run the callback function. After the callback is complete, it will return to your current thread.

      So this is an example of having multiple tasks ‘in flight’ at the same time (your main thread and the disk read operation), while only having a single thread in your program. So technically it is asynchronous, but not multithreaded.
      This specific example also does not require more than one core, as the disk controller can do its work without any help from the CPU. The tasks actually do run in parallel, at the same time. One task on the CPU, one on the disk controller.
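      A minimal sketch of that pattern (‘test.dat’ is just a placeholder filename, and error handling is kept to a bare minimum). One subtlety: Windows delivers the completion routine on your own thread, but only while that thread is in an ‘alertable’ state, such as inside SleepEx:

```cpp
#include <windows.h>
#include <cstdio>

static char buffer[4096];
static volatile bool done = false;

// The completion routine: Windows runs this on our own thread, but only
// while that thread is in an alertable wait.
static VOID CALLBACK on_read_done(DWORD err, DWORD bytes, LPOVERLAPPED ov) {
    (void)ov;
    printf("read completed: error=%lu, bytes=%lu\n",
           (unsigned long)err, (unsigned long)bytes);
    done = true;
}

int main() {
    // The file must be opened for overlapped (asynchronous) I/O.
    HANDLE f = CreateFileA("test.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov = {}; // read from offset 0
    if (!ReadFileEx(f, buffer, sizeof(buffer), &ov, on_read_done)) return 1;

    // The 'BeginRead' has been issued; this thread is now free to do other
    // work while the disk controller does its job...
    while (!done)
        SleepEx(INFINITE, TRUE); // alertable wait: lets the callback run

    CloseHandle(f);
}
```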

      In fact, the timeslicing that runs multiple threads on a single core is actually based on hardware interrupts as well. The system has a hardware timer, which can be programmed to generate interrupts at a given interval (the timeslice). You can install an interrupt handler for the timer interrupt, which will manage the threads. So your thread scheduler is actually a callback routine that is triggered asynchronously by the hardware timer.

      With some of the oldskool demo stuff I’ve made, there’s also no actual multithreading, and there is only a single-core CPU. But there still are ‘background’ tasks, there is still asynchronous processing going on. For example, music has to run at a strict tempo, so that is usually done via a timer interrupt. 3D graphics routines tend to just run ‘as fast as the CPU will allow’, so they are not synchronized to anything. So the 3D rendering will run asynchronously from the music.
      And in some cases there is data being read from disk in the background, while music and graphics are playing in the foreground.
      It’s all based on timers and other hardware interrupts. Like poor man’s multithreading.

  67. Pingback: [SOLVED] Running two threads at the same time – BugsFixing

  68. Pingback: Running two threads at the same time – w3toppers.com
