Multi-core and multi-threading performance (the multi-core myth?)

Today I read a new article on Anandtech, discussing the Bulldozer architecture in detail: where it performs well, and more importantly, where it does not, and why. There weren’t too many surprises in the article, as I pointed out most of the problematic areas (decoding, integer ALU configuration, shared FPU, cache configuration…) years ago. But that is not really what I want to talk about today. What I want to focus on is multi-core and multi-threading performance in general.

The thing is, time and time again I see people recommending a CPU with more cores for applications that require more threads. As Johan de Gelas points out in the aforementioned article, it is not that simple. Although the Bulldozer-based Opterons have 16 cores, they often have trouble keeping up with the older 12-core Magny-Cours-based Opterons.

Now, it seems there is a common misconception. Perhaps it is because the younger generations have grown up with multi-core processors only. At any rate, let me point out the following:

In order to use multiple cores at the same time, multiple threads are required. The inverse is not true!

That is: a single core is not limited to running a single thread.

First things first

Let me explain the basics of threading first. A thread is essentially a single sequence of instructions. A process (a running instance of a program) consists of one or more threads. A processor core is a unit capable of processing a sequence of instructions. So there is a direct relation between threads and cores. For the OS, a thread is a unit of workload which can be scheduled to execute on a single core.

This scheduling appears to get overlooked by many people. Obviously threads and multitasking have been around far longer than multi-core systems (and for the sake of this article, we can place systems with multiple single-core CPUs and CPUs with multiple cores in the same category). Since very early on, sharing the resources of the CPU between multiple users, multiple programs, or multiple parts of a program (threads), has been a very important feature.

In order to make a single core able to run multiple threads, a form of time-division multiplexing was used. To simplify things a bit: the OS sets up a timer which interrupts the system at a fixed interval. A single interval is known as a time slice. Every time this interrupt occurs, the OS runs the scheduling routine, which picks the next thread that is due to be executed. The context of the core is then switched from the currently running thread to the new thread, and execution continues.
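Here is a minimal C++ sketch of the idea (names and numbers made up purely for illustration): it deliberately starts four times as many threads as the machine has cores, and the OS scheduler time-slices them all across the available cores, so every thread still runs to completion.

#include <iostream>
#include <thread>
#include <vector>

// Burn some CPU time so each thread actually competes for a core.
void busy_work(int id)
{
    volatile unsigned long long sum = 0;
    for (unsigned long long i = 0; i < 200000000ULL; ++i)
        sum += i;
    std::cout << "thread " << id << " done\n";
}

int main()
{
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1; // the query may fail; assume one core
    std::cout << "hardware cores: " << cores << "\n";

    // Oversubscribe on purpose: more threads than cores.
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < cores * 4; ++i)
        threads.emplace_back(busy_work, static_cast<int>(i));

    // All threads finish, because the scheduler multiplexes them.
    for (auto& t : threads)
        t.join();
}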

Since these timeslices are usually very short (on the order of 10-20 ms, depending on the exact OS and configuration), as a user you generally don’t even notice the switches. For example, if you play an mp3 file, the CPU has to decode the audio in small blocks, and send them to the sound card. The sound card will signal when it is done playing, and this will trigger the mp3 player to load new blocks from the mp3 file, decode them, and send them to the sound card. However, even a single-core CPU has no problem playing an mp3 in the background while you continue work in other applications. Your music will not skip, and your applications will run about as well as when no music is playing.

On a modern system that is no surprise, as playing an mp3 takes < 1% CPU time, so its impact is negligible. However, if we go further back in time, we can see just how well this scheme really works. For example, playing an mp3 in the background worked well even in the early days of Windows 95 and Pentium processors. An mp3 would easily take 20-30% CPU time to decode. But since the OS scheduler did its job well enough, nearly all of the remaining 70-80% was available to the rest of the system. So most applications still worked fine. Things like web browsers or word processors don’t need all that much CPU time. They just need a bit of CPU time at the right moment, so that they are responsive to keystrokes, mouse clicks and such. And if the OS scheduler does a good enough job, then the response time is only one or two timeslices, so in the range of 20-40 ms. This is fast enough for people not to notice a delay.

Or, let’s go back even further… The Commodore Amiga was the first home/personal computer with a multitasking OS, back in 1985. It only had a 7 MHz Motorola 68000 processor. But look at how well multitasking worked, even on such an old and slow machine (from about 3:52 on, and again at 7:07):

As you can see, even such a modest system can handle multiple heavy applications at the same time. Even though computers have multiple cores these days, there are usually many more threads than there are cores, so thread switching (multiplexing) is still required.

Multitasking vs multi-threading

The terms multitasking and multi-threading are used somewhat interchangeably. While they are slightly different in concept, at the lower technical level (OS scheduling and CPU cores), the difference is very minor.

Multitasking means performing multiple tasks at the same time. The term itself is used more widely than in computing alone, but within the domain of computers, a task generally refers to a single application/process. So multitasking means you are using multiple applications at the same time, which you always do these days. You may have an IM or mail client open in the background, or a browser, or just a malware scanner, or whatnot. And the OS itself also has various background processes running.

Multi-threading means running multiple threads at the same time. Generally this term is used when talking about a single process which uses more than one thread.

The ‘at the same time’ is as seen from the user’s perspective. As explained earlier, the threads/processes are multiplexed, running for a timeslice at a time. So at the level of a CPU core, only one thread is running at a time, but at the OS level, multiple threads/processes can be in a ‘running’ state, meaning that they will be periodically scheduled to run on the CPU. When we refer to a running thread or process, we generally refer to this running state, not to whether it is actually executing on the CPU at that instant. Since the time slices are so short, there can be dozens of thread switches per second, which is too fast to observe in realtime, so it is generally meaningless to look at threading at this level.

A process is a container for threads, as far as the OS scheduler is concerned. Each process has at least one thread. When there are multiple threads inside a single process, there may be extra rules on which threads get scheduled when (different thread priorities and such). Other than that, running multiple processes and running multiple threads are mostly the same thing: after each timeslice, the OS scheduler determines the next thread to run for each CPU core, and switches the context to that thread.

There are threads, and then there are threads

Not all threads are created equal. This seems to be another point of confusion for many people. In this age of multitasking and multi-threading, it is quite common for a single process to use multiple threads. In some cases, the programmer may not even be aware of it: the OS may start background threads for some of the functions they call, and some of the objects they use, even though the programmer only uses a single thread explicitly. In other cases, the OS is designed in a way that demands that the programmer use a thread for certain things, so that the thread can wait for a certain event to occur without freezing up the rest of the application.

And that is exactly the point: Even though there may be many threads in a process, they are not necessarily in a ‘running’ state. When a thread is waiting for an event, it is no longer being scheduled by the OS. Therefore it does not take any CPU time. The actual waiting is done by the OS itself. It simply removes the thread from the list of running threads, and puts it in a waiting list instead. If the event occurs, the OS will put the thread back in the running list again, so the event can be processed. Usually the thread is also scheduled right away, so that it can respond to the event as quickly as possible.
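As a small illustration (a C++ sketch, not production code): the waiting thread below blocks on a condition variable, consuming no CPU time at all, until the main thread signals the event.

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool event_occurred = false;

void waiter()
{
    std::unique_lock<std::mutex> lock(m);
    // The OS takes this thread off the running list here; it uses
    // no CPU time until notify_one() puts it back.
    cv.wait(lock, []{ return event_occurred; });
    std::cout << "event processed\n";
}

int main()
{
    std::thread t(waiter);
    std::this_thread::sleep_for(std::chrono::seconds(1)); // waiter sits at 0% CPU
    {
        std::lock_guard<std::mutex> lock(m);
        event_occurred = true;
    }
    cv.notify_one(); // wake the waiter so it can process the event
    t.join();
}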

Applications that use these types of threads are still seen as ‘single-threaded’ by most people, because the work is still mostly done by one thread, while any other threads are mostly waiting for an event to occur, waking up only to process the event, and then going back to sleep again. As a result, such an application will appear to only use a single core. The additional threads may be processed on other cores, but their processing needs are so minor that they probably don’t even register in CPU usage stats. Even if you only had a single core, you probably would not notice the difference, since the threads could be scheduled efficiently on a single core (just like the example of playing an mp3 file earlier).

To really take advantage of a multi-core system, an application should split up its main processing into multiple threads as well. Its algorithms need to be parallelized. However, this is only possible up to a point, depending on the algorithm. To give a very simple example:

e = a + b + c + d

You could parallelize a part of that, like so:

t0 = a + b
t1 = c + d
e = t0 + t1

t0 and t1 can be calculated in parallel threads. However, to calculate e, you need the results of both threads. So part of the algorithm can be parallel, but another part is inherently sequential: it depends on results from earlier calculations, so there is no way to run it in parallel with the calculations it depends on.
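In code, that structure could look like this (a C++ sketch using std::async; the variable names match the example above, the values are made up). The two partial sums run concurrently, but the final addition has to wait for both:

#include <future>
#include <iostream>

int main()
{
    int a = 1, b = 2, c = 3, d = 4;

    // t0 and t1 can be computed in parallel threads...
    auto t0 = std::async(std::launch::async, [=]{ return a + b; });
    auto t1 = std::async(std::launch::async, [=]{ return c + d; });

    // ...but this step is sequential: e needs both results first.
    int e = t0.get() + t1.get();
    std::cout << "e = " << e << "\n";
}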

Amdahl’s law deals with these limitations of parallel computing. In one sentence, it says this:

The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
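In formula form (the standard statement of Amdahl’s law, with p the fraction of the program that can be parallelized and n the number of processors):

\[ S(n) = \frac{1}{(1 - p) + \frac{p}{n}} \]

Even with infinitely many cores, the speedup can never exceed 1/(1 - p): a program that is 95% parallelizable will never run more than 20 times faster, no matter how many cores you throw at it.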

The sequential parts result in situations where threads for one step in the algorithm have to wait for the threads of the previous step to signal that they’re ready. The more sequential parts there are in a program, the less benefit it will have from multiple cores, and the more benefit it will have from the single-threaded performance of each core.

And that brings me back to the original point: people who think that the number of cores is the only factor in the performance of multithreaded software.

The multi-core myth

This is a myth that bears a lot of resemblance to the Megahertz-myth that Apple so aptly pointed out back in 2001, and which was also used to defend the AMD Athlon’s superior performance compared to Pentium 4s running at higher clockspeed.

The Megahertz-myth was a result of people being conditioned to see clockspeed as an absolute measure of performance. It is a valid measure of performance only as long as you are talking about the same microarchitecture. So yes, if you have two Pentium 4 processors, the one with the higher clockspeed is faster. Up to the Pentium 4, the architectures of Intel and competing x86 processors were always quite similar in performance characteristics, so as a result, the clockspeeds were also quite comparable. An Athlon and a Pentium II or III were not very far apart at the same clockspeed.

However, when the architectures are different, clockspeed becomes quite a meaningless measure of performance. For example, the first Pentiums were introduced at 66 MHz, the same clockspeed as the 486DX2-66 that preceded them. However, since the Pentium had a superscalar pipeline, it could often perform 2 instructions per cycle, where the 486 did at most one. The Pentium also had a massively improved FPU. So although both CPUs ran at 66 MHz, the Pentium was a great deal faster in most cases.

Likewise, since Apple used PowerPC processors, and AMD’s Athlon was much more similar to the Pentium III than the Pentium 4 in architecture, clockspeed meant very little in performance comparisons.

Today we see the same regarding the core-count of a CPU. When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads. However, since AMD and Intel have very different microarchitectures, the core-count becomes a rather meaningless measure of performance.

As explained above, a single core can handle multiple threads via the OS scheduler. Now, roughly put, if a single-core CPU is more than twice as fast as one core of another dual-core CPU, then this single-core CPU can also run two threads faster than the dual-core CPU.

In fact, if we factor in Amdahl’s law, we can see that in most cases, the single core does not even have to be twice as fast. Namely, not all threads will be running at all times. Some of the time they will be waiting to synchronize sequential parts. As explained above, this waiting does not take any actual CPU time, since it is handled by the OS scheduler (in various cases you will want to use more threads than your system has cores, so that the extra threads can fill up the CPU time that would otherwise go to waste while threads are waiting for some event to occur).
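A quick worked example (the numbers are picked purely for illustration): suppose 80% of a program is parallelizable. By Amdahl’s law, a dual-core CPU then gives a speedup of at most

\[ S(2) = \frac{1}{0.2 + \frac{0.8}{2}} = \frac{1}{0.6} \approx 1.67 \]

so a single core that is only about 1.67 times as fast per thread already matches the dual-core here, well short of twice as fast.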

Another side-effect of the faster single core is that parts that are strictly sequential in nature (where only one thread is active) are processed faster. Amdahl’s law essentially formulates a rather paradoxical phenomenon:

The more cores you add to a CPU, the faster the parallel parts of an application are processed, so the more the performance becomes dependent on the performance in the sequential parts

In other words: the single-threaded performance becomes more important. And that is what makes the multi-core myth a myth!

What we see today is that Intel’s single-threaded performance is a whole lot higher than AMD’s. This not only gives them an advantage in single-threaded tasks, but also makes them perform very well in multi-threaded tasks. We see that Intel’s CPUs with 4 cores can often outperform AMD’s Bulldozer architecture with 6 or even 8 cores, simply because Intel’s 4 cores are that much faster. Generally, the extra cores come at the cost of lower clockspeed as well (in order to keep temperatures and power consumption within reasonable limits), so it is generally a trade-off with single-threaded performance anyway, even with CPUs using the same microarchitecture.

The above should also give you a better understanding of why Intel’s HyperThreading (Simultaneous Multithreading) works so nicely. With HyperThreading, a physical core is split up into two logical cores. These two logical cores may not be as fast as two physical cores would be, but that is not always necessary. Threads are not running all the time. If the cores were faster, it would just mean some threads would be waiting longer.

The idea seems to fit Amdahl’s law quite well: for sequential parts you will only use one logical core, which will have the physical core all to itself, so you get the excellent single-threaded performance that Intel’s architecture has to offer. For the parallelized parts, all logical cores can be used. Now, they may not be as fast as the physical cores, but you have twice as many. And each logical core will still exceed half the speed of a physical core, so there is still performance gain.
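To put some made-up numbers on that: if each of the two logical cores runs at, say, 60% of the speed of the full physical core, the parallel parts get 2 × 0.6 = 1.2 times the throughput of the bare core, while the sequential parts still run a single thread at effectively the full speed of the physical core. As long as a logical core stays above 50% of the physical core’s speed, the parallel parts come out ahead.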

One of the things that SMT is good at is reducing overall response time. Since you can run more threads in parallel, new workloads can be picked up more quickly, rather than having to wait for other workloads to be completed first. This is especially interesting for things like web or database servers. Sun (now Oracle) took this idea to the extreme with their Niagara architecture. Each physical core can run up to 8 threads through SMT. For an 8-core CPU that is a total of 64 threads. These threads may not be all that fast individually, but there are a lot of them active at the same time, which brings down response time for server tasks. Because of SMT, the total transistor count for a 64-thread chip is extremely low, as is the power consumption.

So, to conclude, the story of multithreading performance is not as simple as just looking at the number of cores. The single-threaded performance per core and technologies such as SMT also have a large impact on the overall performance. Single-threaded performance is always a good thing, both in single-threaded and multi-threaded scenarios. With multi-core, your mileage may vary, depending on the algorithms used by the application (to what extent are they parallelized, and to what extent do they have to remain sequential?), and on the performance of the individual cores. Just looking at the core-count of a CPU is about as meaningless a way to determine overall performance as just looking at the clockspeed.

We see that Intel continues to be dedicated to improving single-threaded performance. With Bulldozer, AMD decided to trade single-threaded performance for having more cores on die. This is a big part of the reason why Bulldozer is struggling to perform in so many applications, even heavily multithreaded ones.


75 Responses to Multi-core and multi-threading performance (the multi-core myth?)

  1. Bob says:

    This is a great blog entry, and I think very important as we start to see mobile devices based on Medfield hit the market. Many bloggers are lamenting that it is a single-core CPU at 1.6 GHz, but in many cases it is likely to deliver similar if not better performance than dual-core and quad-core ARM SoCs, especially in the 2 GHz version.

  2. MacOS9 says:

    A damn refreshing entry on multitasking and multiprocessing if I ever read one; no longer will I be jealous of 4-, 6-, and more-core processors with my modest Core2Duos.

    (I remember there being comparisons on the internet years ago in the 90s between a PowerPC [the 604 by IBM] running at 120MHz, said to be as fast as the popular Pentium Pro of the mid-1990s running at 166MHz – although that may have been a comparison between RISC and CISC and had nothing to do with multitasking.)

    By the way, since I’m already on the subject, what are Scali’s views on RISC vs. CISC processors? I remember Apple’s controversial switch from PowerPC (RISC) to CISC-based Intel around 2005 – but perhaps newer Intel processors (those supporting 64-bit) have also gone by now in the direction of RISC?

  3. MacOS9 says:

    From what I’ve gathered by reading this entry, my question may be irrelevant, but I look forward to Scali’s comments nonetheless.

  4. MacOS9 says:

    The article in question is called “Cisc or Risc? Or Both?” (my apologies for the several posts, since I wasn’t able to paste in the URL).

  5. MacOS9 says:

    Well, let me continue embarrassing myself by offering one more post: I have now stumbled onto your “RISC/CISC” article of late Feb. that has answered most of my questions. In other words, the complexity (or lack thereof) of instruction sets is largely irrelevant in the “post-RISC” world, I take it?

    • Scali says:

      The Pentium Pro was actually the first Intel x86 CPU that used a RISC backend. It couldn’t quite keep up with the PowerPC yet, but a few hundred extra MHz could compensate for it. As CPUs became larger and more complex, the extra cost of the CISC-legacy became smaller. Today, Intel even has a competitive x86 CPU for smartphones and tablets, the Atom codenamed Medfield: http://scalibq.wordpress.com/2012/01/23/intel-medfield-vs-arm/
      So yes, it is largely irrelevant. In theory a RISC CPU may still be slightly more efficient, but in practice, Intel can compensate by using superior manufacturing technology (they are at 22 nm, where the competition is at 28, 32 or 40 nm), and by other technologies such as HyperThreading.
      The irony is that this time it’s the ARM CPUs that aren’t very impressive in terms of performance per GHz.

  6. Pingback: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ | Itsaat

  7. Pingback: The Multi-Core Myth | Shwuzzle

  8. chris says:

    Very interesting read. So if clock speed (when analyzing different architectures) doesn’t matter, what should I be looking at when I buy my next processor? How does one determine how “fast” a processor executes a thread?

    • Scali says:

      Well, the only way to get a decent performance estimate is to look at it on a per-application basis.
      The only reliable way to know which CPU runs application X best, is to just benchmark them and compare the results.
      Of course, application Y could be a completely different story again.
      There is no simple answer.
      Which is why there are so many myths… The MHz myth and the multi-core myth approach the problem of performance from opposite angles, and they both fail equally hard at indicating overall performance.

      • chris says:

        For a user like myself who mainly runs things like a browser and netbeans (to a lesser extent LAMP stack locally) how is that feasible? Even for a small business determining which hardware to purchase to run a database server, how is that really feasible…

      • Scali says:

        Funny how some people can’t accept that not all answers are ‘feasible’. If you want to have a computer for generic use, then your performance needs will be some weighted average of the applications you will be using, where the weights depend on your wants/needs/priorities.
        You think that asking again will change my answer? No it won’t. I’m a realist, reality is not always ‘feasible’. Performance of a CPU architecture cannot be caught in a single measurement.

        Having said that, I can give you a simple tip: filter your application needs based on how performance-critical they are.
        Browsers are not performance-critical whatsoever. Depending on the size of your projects and how you use it, Netbeans is generally not performance-critical either.
        A local LAMP stack (presumably for development purposes) is usually not performance-critical either (you’re the only user, where the technology can easily scale up to hundreds or thousands of users at a time).

        Also, if you want to purchase a database server, you’ve already answered yourself how feasible it is: ‘database server’, so you only have to look at database performance, and more specifically, only at the specific database software you intend to run on it.

  9. Pingback: Thought this was cool: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ « CWYAlpha

  10. Pingback: Of GCs and Multi-cores « Ping!

  11. Sreejith says:

    So what about the dual-core and quad-core processors in mobile phones? Increasing the number of cores in them doesn’t really count…

  12. Hambster says:

    Hi,
    It’s a good post. May I refer to your post on my blog?

  13. elpipo says:

    Good article, it has brought me back to school.

    As I’m getting old and little by little drifting away from this field, could you also discuss the case of virtualisation? Are cores really allocated to a virtual machine, or is this just an abstraction? Is it stupid to say “I want one core for each virtual machine”? How does a hypervisor handle real-time allocation? …

    • Scali says:

      I think those details are implementation-specific. I don’t use virtualization much myself, but I do use VirtualBox or VMWare from time to time for some cross-platform testing/development, and I believe they just map the virtual cores to threads on the host OS, without specifically pinning them down on cores (as you could do with SetThreadAffinityMask() in Windows).
      Which would make sense, since this would still allow the host to dynamically allocate cores, and do proper load balancing. I suppose you could still use the process affinity mask to have each instance run on a subset of the cores, but I think it would be a bad idea in general.

  14. Christos says:

    The Megahertz myth would have you believe that a higher clock means faster.
    Well, it’s true, for a given CPU architecture.
    The “multi core myth” would have you believe that more cores equals more speed.
    Well, again it’s true, only this time there are two parameters to consider instead of one:
    architecture and application.
    Intel fueled the MHz myth because it was what they had to offer relative to the opposition; now AMD advertises more cores for the same reason.
    You could then argue about the “64-bit” myth that AMD started with the Athlon 64, which offered nothing in terms of performance while offering a lot in terms of marketing.
    But still, it was the way forward and someone had to make the first step.
    More cores equals more performance, generally speaking, and anyone who doesn’t understand the pitfalls involved will just end up paying a bit more for a bit less, not a big deal.
    At the end of the day it’s exactly the same as in every aspect of life: the more knowledgeable make more knowledgeable decisions.
    No point in getting too crazy about it.
    At the end of the day I doubt those who know little about computers will read any of this anyway.
    They just buy a PC with “Intel inside” because it’s all they know.
    So the world is a good place to be right now, it seems, because the name they all know makes the best CPUs at the moment; how lucky is that.
    Come to think of it, the MHz myth must have done people’s pockets much greater injustice then.

    • Scali says:

      Not sure what point you’re trying to make. Everything you say is already literally mentioned in the article, or is so obvious that it would not need to be mentioned.
      Or well… some part is rather dubious. Namely, you say Intel fueled the MHz myth… You mean to imply that AMD did not? Which would be funny, since as I pointed out, Apple coined the term ‘MHz myth’ when they were competing with their PowerPC CPUs against the Intel Pentium III and the AMD Athlon. And the Pentium III and the Athlon were in a tight race towards the GHz mark at the time (with AMD actually winning that race).

  15. Pingback: Multi-core and multi-threading performance (the multi-core myth?) | Scali’s OpenBlog™ | @EconomicMayhem

  16. Keith Walden says:

    Not really a computer geek … (no negative inflection intended) :) … I’m closer to one of those ding-dongs that look for “Intel inside” and hope for the best when I buy a computer, as I drive a truck with not much free time. Years ago I was a technician and still have a Tek 466b storage scope and associated equipment I mean to set up to troubleshoot with. I do like to tinker with computers, and my P4E desktop (3GHz/4GB/800MHz) pooped out with a motherboard issue. Was wondering whether to buy a new one or replace it, and decided to try R+R’ing the motherboard first. My application is mainly internet for googling things, ebay etc and simple programs that would probably run fine on a Pentium 1 or 486. I would like to see videos without buffering, have it be wireless, and have it be fast enough to be usable for those things in the next few years. I was considering buying an HP Phoenix 8-core, which is their top model, instead for $650 (recon) and may still, but wondering: where will that get me in performance improvement for the things I do? Really? Probably not far. The P4E is fast enough to accommodate the fastest residential internet service most likely, and for the near future I think, as well as the Phoenix… the Phoenix probably won’t get me anywhere but $600 in the hole. I think most people use their computer for the internet today as I do, with 1-3 people possibly at a time. I live alone so the P4 is probably fine… maybe if there were additional users simultaneously the multi-cores would be better; guess it depends on the sequential/parallel content of normal internet activity. Is there a measure of that available?

  17. Lauro Andrea says:

    Thanks for your blog here.
    I was researching the virtues of having a 12-core server for our biometric time clock, and how threads are directly proportional to cores, and have seen that it is not quite so. I am not so much engaged in the search for speed as I am for capacity for multiple requests for data. There is a central server, and we have about 400 time clocks around the country, with everybody clocking in in the space of five minutes, around 7:55 to 8:00 am; we will be using sync software to cache the data while it piles up and send it server-bound after the lines are less clogged, around 9:00 am or later.
    Still busy on a Sunday but your blog had me riveted. Thanks.

  18. I’ve looked all over the internet and I’m surprised no one discusses this in the context of multi-core CPUs: I had a dual Celeron setup that made me smile because I could do CPU-intensive stuff without having to wait. Like big file transfers. Especially: big file transfers. Then multi-core came along. I figured this was going to be just great. No futzing around trying to tell the OS that I have 2 CPUs plugged in. Just assemble and go. I upgraded from dual Celerons directly to a dual-core Athlon at something like 3800 MHz. File transfer speed was AMAZING. But guess what? I had to wait! The dual Celeron handled the workload so as to leave one CPU free, so I NEVER EVER had to wait. This feature of dual-love is gone now, and I’m guessing it’s because OSes and apps are multi-thread aware, so they do a “superior” job of scheduling. So, in this case better CPU + better software = worse user satisfaction in a major area of computer use. Even things like virus scans or indexing operations take the computer away for effectively the whole duration of the procedure. I want my old system back, but everything is obsoleted by the new OSes and browsers. You would think there would be a way to configure your system so that you don’t have to wait if you don’t want to. Anyone know of a way to do this?

    • Scali says:

      Well, there is a difference between having multiple CPUs/cores and having multiple computers.
      Multiple cores still share the same memory (and generally also share some of the cache), the same chipset, the same harddisk controller, the same harddisks etc.
      It’s simply physically impossible to do high speed file transfers without impacting the overall performance of the system in some way, because you are using a lot of shared resources in the process.
      Perhaps the older Celeron system just appeared to work better because the balance worked out better for you (the harddisk, chipset and memory may have been relatively fast compared to the CPUs, so there was a larger budget to share between the CPUs, so to say).

      In essence, multi-CPU systems are not that different from multi-core ones. In fact, the first generation of dual-core x86 CPUs (both Pentium D and Athlon64 X2) were little more than two CPUs connected on the same package/die.
      And OS schedulers have not changed all that dramatically over the years. They have mainly seen improvements in being more HT-aware and NUMA-aware. But that would not apply to your situation.

      • >>And OS schedulers have not changed all that dramatically over the years. They have mainly seen improvements in being more HT-aware and NUMA-aware. But that would not apply to your situation.<<

        Maybe my mind is playing tricks, but I seem to remember looking at the performance tab in task manager and seeing one cpu unused during a file transfer (in the celeron days). Is that possible? If anyone out there has an old dual-PII, could you please tell me if I'm crazy? I guess it depends on the OS. I had Win2K Pro and liked it a lot better than Windows 7 Home.

        As for the new systems, seems like they could offer an option for how you want your resources allocated so you can keep working during long cpu-intensive procedures. Maybe it's just my OS.

  19. Came across this today:

    http://www.techspot.com/community/topics/making-a-program-remember-its-priority.3913/

    Haven’t tried it yet, but looks like it’s pretty easy to set priority for anything you can run with a shortcut — which is pretty much everything!

  20. Abhishek M says:

    Each core should still be able to run a different task. So a multi-core processor should give more throughput than a single core any day right?

    • Scali says:

      No. I think you didn’t quite understand what I wrote here :)

      “When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads. However, since AMD and Intel have very different microarchitectures, the core-count becomes a rather meaningless measure of performance.

      As explained above, a single core can handle multiple threads via the OS scheduler. Now, roughly put, if a single core CPU is more than twice as fast as one core of another dualcore CPU, then this single core CPU can also run two threads faster than the dualcore CPU.”

      • Abhishek M says:

        No. I think you’re generalizing too much :).

        I wasn’t talking about Intel or AMD. I didn’t even have a specific vendor in mind.

        Theoretically, if everything else is the same, a multi-core processor system is able to do more work than a single core processor system. Of course, you need tools to do that. Languages like Scala and Erlang are already doing that. Simultaneous Multi Threading on a single core is efficient but it is good only for instruction level parallelism.

        Of course, a simple user with minimal needs doesn’t need to worry about that. A good OS will do the trick for him.

      • Scali says:

        “Theoretically, if everything else is the same, a multi-core processor system is able to do more work than a single core processor system.”

        Which makes you the one who is generalizing too much.
        Aside from that, this is a different statement from the one you made earlier. This new statement sounds exactly like the one I already made:
        “When comparing CPUs with the same microarchitecture, a CPU (at the same clockspeed) with more cores will generally do better in multithreaded workloads”
        Which I would not argue against, obviously.
        Your original statement was however:
        “So a multi-core processor should give more throughput than a single core any day right?”

        I take the ‘any day’ to mean that a multi-core processor will ALWAYS give more throughput (and that is an AMD-specific marketing term, so it was already quite obvious which vendor you had in mind, as if rooting for moar coars was not enough of an indication).
        Which is not true, since multi-core processors will always share a certain level of resources, being on a single socket. This means they share the data bus, and usually also the memory controller and at least some of the cache.
        So in I/O-limited situations multi-core processors will not have more throughput, because the throughput is limited by the available bandwidth of the system, which is the same regardless of the number of cores.

        Add to that the fact that the more cores a CPU has, the lower it will be clocked, and in practice multi-core processors may actually have less throughput in certain situations than single core variations. So it does not hold ‘any day’.

        “Simultaneous Multi Threading on a single core is efficient but it is good only for instruction level parallelism.”

        You are mixing up terminology here. Instruction level parallelism is a term related to superscalar processing, and is applied on a per-thread basis. It is not meant to be applied to SMT, where you use two or more streams of instructions rather than one.

  21. Pingback: What could be my bottleneck?

  22. Mark says:

    This is a very nice explanation I stumbled across when trying to find out if it’s worth spending more on an i7 4770, which has hyperthreading, instead of the i5 4670, which is exactly the same speed CPU. If it’s not worth it now, then maybe in the future with better optimized programs (I plan to use it 7+ years).
    When I started reading, you mentioned my daily usage, such as multitasking with music playback, and Firefox with a lot of tabs open (and more importantly plugins like e.g. Disconnect, which does freeze FF on my Q6600 because it blocks a lot of tracking links on some pages; not sure if it’s CPU related though). From this explanation I started to think the i5 was more than fast enough, because it has a lot of single-threaded performance and as such could run a lot of threads, as my daily usage would seem light enough to let threads access the CPU efficiently.
    But then you started mentioning hyperthreading, which is very useful if I understand correctly. So HTT would seem something to consider when buying a CPU.

    Now does this mean that, even on an already fast CPU like the i5 with 4 cores, the extra hyperthreading of the i7 as the only difference would still justify the more expensive i7 (if not now, then maybe in the future)?

    Hopefully you can help me with this question.

    • Scali says:

      Whether or not HT is worth the extra money depends a lot on what kind of applications you use, and how many you use at a time.
      In some cases it may not give you any extra performance at all, while in other cases you may get 30-40% extra.

      In my case, I always go for the i7. If you were to calculate the raw numbers, I’m sure the i5 would give more performance for the buck on average… But I gladly pay the small extra premium to get that little boost from HT when I require it.

      • Mark says:

        I already know it gives a boost to packing tools, encryption and media editing. But this is not something I do daily. What I’m trying to find out is whether the extra HTT on an already very fast quad-core CPU will, now or in the future, speed up browsing, mail, etc.
        This is very hard to find out (there are a lot of different opinions on the subject), and if I’m interpreting your response correctly, it is purely a personal decision and not something very measurable or foreseeable?

      • Scali says:

        Well, as you say… You get a boost in some applications, not in others. It’s a very personal thing: how often do you use such applications, and how important is it to you to get a bit of extra performance?
        It’s much the same as with many other CPU-specs… Do you want to pay a bit extra to get more cache, or is the model with less cache good enough for you? Do you want a few hundred MHz extra clockspeed? Etc.

  23. Mark says:

    Thanks for your opinion. I’ve made my choice: although currently the i5 is the best for the buck, I’m going for the i7 4771 because I think (hope) it will be more future-proof.

  24. Vishal says:

    After reading all the above comments and your post, I have come to a conclusion: ‘Intel has higher single-threaded performance than AMD, and multiple cores do not matter at all in performance’.

    • Scali says:

      It’s not that multiple cores don’t matter at all in performance. Both the per-core performance and the number of cores affect performance, it’s just that per-core performance is more important (as Amdahl’s Law says). A lot of people seem to ignore per-core performance completely, and only look at core-count, which is completely wrong.

  25. Rahul Bansal says:

    Hi Scali,
    I came here when I was trying to pick one out of – http://cpuboss.com/cpus/Intel-Xeon-E3-1245V2-vs-AMD-Opteron-3280
    I have 2 different sets of applications:
    (a) LEMP server with lots of wordpress sites
    (b) FFMPEG based encoding server with node.js for front-end
    Some servers have both of the above.
    After reading your article, I think the Intel-Xeon-E3-1245V2 scores over the AMD-Opteron-3280 (higher clock rate with multithreading).
    But the AMD-Opteron-3280 has 8MB of L2 cache, which is 8 times bigger than the Intel-Xeon-E3-1245V2’s. How will that affect things?
    I know it will depend a lot on the application, hence I listed both sets of applications above.
    Thanks for one of the most amazing articles I ever read on this topic! It finally settled the thread vs. core debate (at least for me).

  26. Z says:

    Thanks for a very informative read! I’ve been out of computing for a while but it is nearly time to replace my 2005 Pentium D 3.0GHz Dell. I have upgraded it as far as it is possible to do without replacing the motherboard or CPU. I have been extremely pleased with this processor. It is a beast!

    I am familiar with threading but was confused by the benchmarks comparing “single-threaded” and “multi-threaded” performance between the AMD FX 8350 Black Edition and the Intel i7-4770K that I found on anandtech.com. Since the main programs I use are PhotoShop CS3, Nikon’s View NX2, some general gaming, and video editing in the near future, I really needed to understand what this difference really is. According to these benchmarks (I have always taken benchmarks with a grain of salt… I’ve been computing since the Timex Sinclair 1000 and Commodore 64, personally owned a 386SX33, a Cyrix 6×86, a Pentium II I think it was, then this Pentium D, and just picked up this laptop with an AMD A10-5750M and love it so far) it SEEMS like, for my specific uses, the AMD FX 8350 MIGHT be the way to go.

    But then I decided that I really needed to understand the differences in performance between these two chips, specifically the single- vs. multi- threading. I can tell you that finding information on this specific question is not very easy. I very much appreciate the time you put into writing this because I think you’ve probably saved me from making a less than optimal decision.

    I THINK that, despite my desire to fuel the competition between Intel and AMD by buying another AMD processor, the Intel i7 would be the wiser choice. It should last me at least five years before I’d need to upgrade to whatever someone comes up with next. If I were ONLY gaming and didn’t need it to last as long as possible, I probably would get the AMD, since it is much less expensive.

    So my question to you is, have I understood your article and do you agree with this choice for my specific listed uses? Video involves a lot of sequential processing, right? I know there are “video benchmarks” but, like I said earlier, I don’t think benchmarks are always accurate and I don’t plan on using the format they listed in the benchmarks I’ve seen.

    I think it is more informative to understand the way a program is coded to be processed, especially in today’s multi-core computing landscape. And, if I have understood what you have written, the “single-thread” performance is a potential bottleneck, essentially. Because this article means that, no matter how many simultaneously (parallel) running cores are involved they can and will still be waiting for the results of a single calculation at some point. Intel’s Hyperthreading gives them a huge advantage for this reason, correct?

    I understand enough of the way computers actually work (specifically MS OS’s) to know you are spot on in what you have written. I just need to make sure I understand the implications and consequences of CPU choice.

    Thank you again for this well-written and information-packed article. Or blog post, or whatever you want to call it. I couldn’t stop reading it once I started!

    • Scali says:

      In short, AMD is *never* the way to go… This article is already a few years old, and new iterations of Core i7 CPUs have opened up the gap even wider.

      Anyway, video processing/encoding can be parallelized to a great degree. You also see that newer video processing software tends to offload work to the GPU for that reason. Which means that aside from GPU-performance, the single-threaded CPU performance becomes most important (the CPU will mainly be preparing and feeding commands to the GPU, which is a task that is hard to parallelize, so it still relies mainly on single-threaded performance, much like conventional graphics acceleration in games and such. So, especially when you’re gaming, you’d want an Intel CPU… in many cases just a cheap dualcore will do the job just fine). Intel also has the super-fast QuickSync technology for encoding video.
      But when you are doing CPU-only video processing, you’ll want some balance of good parallel performance and good single-threaded performance. The exact performance requirements depend a lot on which software you use, and what kind of operations you are performing, so it is difficult to make any generalized claims about that.

      And, if I have understood what you have written, the “single-thread” performance is a potential bottleneck, essentially. Because this article means that, no matter how many simultaneously (parallel) running cores are involved they can and will still be waiting for the results of a single calculation at some point.

      Yes, that is basically what Amdahl’s law says. Basically the idea is that if you improve single-threaded performance, you improve performance across ALL parts of the code, both parallel and sequential. But if you improve multi-threaded performance (as in: adding more cores, but not making the cores faster), the bottleneck merely shifts towards the sequential part of the code.

      Intel’s Hyperthreading gives them a huge advantage for this reason, correct?

      Yes and no. Hyperthreading, as in simultaneous multithreading, does not necessarily improve this situation. However, the approach that Intel has taken, does. Namely, Intel focuses on maximum single-threaded performance. Which speeds up all code. Then they add hyperthreading to double the number of threads that the CPU can handle. This brings down the single-threaded performance when two threads are running on a core simultaneously, but the net effect is that the two threads run faster than they would without HT. In terms of the added complexity (only about 5% extra transistors per core required for HT), there is quite a significant boost in multi-threaded scenarios (which may be 20-30% on average, and can be over 50% more performance).
      So in terms of performance-per-transistor, or performance-per-watt, Intel’s HT is very efficient.
      And given that the sequential parts of the code would generally slow down the parallel parts anyway, in practice it is not that much of a problem that the multithreaded code does not run as quickly on 4 cores with HT as it would on 8 dedicated cores. With 8 dedicated cores, in a lot of cases the cores would be sitting idle for longer, waiting for other sequential parts to complete.
      So having the 4 extra threads is a nice bonus, and the fact that the extra threads slow down the single-threaded performance of the cores is usually less important.

  27. Z says:

    Sorry, got the Pentium D 3.0GHz in 2006, first quarter. And by “out of computing for a while” I meant I’d stopped reading up on the myriad new chips being developed years ago, and have been doing research for a couple of months to get back up to speed on what’s available. I had allowed myself to lose touch with what was going on. I have always been fascinated by computers though, and use them on a daily basis at work, at home, at play, and hopefully to finally become self-employed. Thanks again!

  28. Required says:

    Ah, I love the tears of the Intel “but our less cores are faster so there!” fanboys, like Scali is. Just beautiful replies like his reply to Abhishek M. Thank you for crying in this post Scali, you misguided Intel moron.

  29. Hùng Lê says:

    Dear all
    How do you calculate the optimal number of threads, assuming that the JVM has enough Java heap space and the task can be split entirely?
    Thank you
    Jimmy

  30. eulises melo says:

    Excellent summary of multitasking and multithreading… an eye opener to 89% of the CEOs and VPs running these businesses. The only advancement will be to focus on programming…
    John L. Gustafson pointed out in 1988 what is now known as Gustafson’s law: people typically are not interested in solving a fixed problem in the shortest possible period of time, as Amdahl’s law describes, but rather in solving the largest possible problem (e.g., the most accurate possible approximation) in a fixed “reasonable” amount of time. If the non-parallelizable portion of the problem is fixed, or grows very slowly with problem size (e.g., O(log n)), then additional processors can increase the possible problem size without limit.
    Your article tackled the myth and provided the solution to the current misconception and bottleneck. I’m heading to TigerDirect and buying me a Core 2 Duo, forget that i3. jaja good job.
    eulises melo

  31. Excellent blog! While reading this article, one question came to my mind. Say I have a program which requires 5 threads, and I have a 4-core machine; how will the threads work? Will 4 of the threads be allocated one to each of the 4 cores, with the 5th thread sharing time? Or will something else happen? Waiting for the responses.

    • Scali says:

      It all depends on the workload of these threads, and the priorities they are assigned. If all threads are working all of the time, and they all have equal priority, then the scheduler will try to switch threads in and out so that all 5 threads get an equal amount of CPU time (the time-division multiplexing mentioned above). If some of the threads have higher priority than others, these will get more CPU time, so the scheduler will switch to these threads more often than to the others. And when a thread enters some kind of waiting state for a certain event, it is no longer being scheduled, so the remaining threads will get all the CPU time (in which case the CPU may even become partly idle, when there are fewer than 4 active threads).

  32. Burgmeister says:

    Hey Scali, everything I’ve read here on your website thus far has been pretty helpful. I didn’t understand 100 percent of it (I’m not a very tech-y person), but I think I get the gist of what you’re saying.

    So, what I got from your article and from reading all of these posts is that clock speed and number of cores basically aren’t good indicators of speed, and that the only real way to determine how fast a certain processor is, is to look at how it performs on a “per application basis”. If that’s true, does that mean I have to find benchmark tests online (or conduct my own) in order to really get a good estimate of how fast a particular processor is, or is there some easier way? For example: If I’m comparing an i5 with an i7 that have similar clock speeds and numbers of cores, is there some easier way to tell how well each will perform without actually using them first-hand? Or is the architecture simply too different for them to be compared just by looking at their specs?

    Also, I still don’t entirely understand what “applications” hyperthreading would be helpful in. I’ve heard that for gaming, you don’t really need it, which is what I would primarily be using a new computer for. Is hyperthreading generally good for gaming, or is it better for multitasking?

    Lastly, how much does having extra threads, on average, increase speed? For example, on an i7 4770 there are 4 cores and 8 threads. How much does having those extra 4 threads help, if at all, with using a single application (game) at a time?

    I know these questions are really general, but like I said, I’m not that knowledgeable when it comes to computers…that’s why I’m asking you :).

    • Scali says:

      For example: If I’m comparing an i5 with an i7 that have similar clock speeds and numbers of cores, is there some easier way to tell how well each will perform without actually using them first-hand? Or is the architecture simply too different for them to be compared just by looking at their specs?

      Well, in the case of the i3, i5 and i7 (assuming they are of the same generation), the architecture is the same. The main difference is in the number of cores and whether or not HyperThreading is enabled. An i5 and i7 both have 4 cores, so at the same clockspeed, HyperThreading will make the difference. And then it depends on how much an application can benefit from the extra threads on the i7.

      Is hyperthreading generally good for gaming, or is it better for multitasking?

      Games generally don’t benefit that much from many cores/threads. Most games don’t benefit beyond 4 cores. Partly because driving a GPU is very much a single-threaded affair: there is only one GPU, and all the instructions have to be executed in a strict order to get correct rendering results.
      Until recently, graphics APIs had little or no support for multithreading at all. In DirectX 11, you can prepare lists of commands on other threads/cores, but it is still rather limited (and AMD’s drivers have a broken implementation of it). DirectX 12 and Mantle may gain more from extra cores/threads.

      Other tasks in games are also a bit hard to parallelize, such as physics and AI. So generally those don’t scale too well past 4 cores either.
      So, HyperThreading on an i7 does not really do much for games. On a core i3 it’s a different story: you only have 2 cores, and with HT you can run 4 threads, which will bring the CPU performance close to that of a regular 4-core i5 in most games.

      Lastly, how much does having extra threads, on average, increase speed? For example, on an i7 4770 there are 4 cores and 8 threads. How much does having those extra 4 threads help, if at all, with using a single application (game) at a time?

      Well, that’s the whole point of this article: any ‘average’ is completely useless, because the results per application are all over the place. In statistical terms: the standard deviation is very large.

  33. Mike says:

    Interesting read. Going back in time, this can help explain the “original” intentions. Things have only improved since.

    http://www.xbitlabs.com/articles/cpu/display/pentium4-3066_2.html#sect1

    “The 3.06 GHz Pentium 4 enabled Hyper-Threading Technology that was first supported in Foster-based Xeons. This began the convention of virtual processors (or virtual cores) under x86 by enabling multiple threads to be run at the same time on the same physical processor. By shuffling two (ideally differing) program instructions to simultaneously execute through a single physical processor core, the goal is to best utilize processor resources that would have otherwise been unused from the traditional approach of having these single instructions wait for each other to execute singularly through the core. This initial 3.06 GHz 533FSB Pentium 4 Hyper-Threading enabled processor was known as Pentium 4 HT and was introduced to mass market by Gateway in November 2002.” (http://en.wikipedia.org/wiki/Pentium_4)

  34. John S says:

    Excellent article; I only wish more could be written explaining how the CPU works in a way the average consumer can understand. I think many people buy PCs based either on salesmen’s suggestions, friends, or possibly reviews. They do little to even research what they actually need.
    I think it’s clear today that speed in MHz is very deceiving: a 2.0 GHz tablet and a 2.0 GHz laptop can be very different in speed. The trickery of hyper-threading is just that, a trick. A physical core will not perform 2X better using hyper-threading. In fact, I am surprised Intel even brought HT back. I guess the brick wall of speed has again caused Intel to go into the bag of tricks like HT and Turbo mode to try and distinguish itself from AMD. I guess it works from a retail platform, and it sells to people who know little about how HT or Turbo mode even work or when they work. Unless you’re into performance figures, actual user experience will hardly be able to tell if HT is working or if Turbo mode is enabled. I myself have always been more concerned with power management than raw processing power. Maybe that’s why today we see far more emphasis on the power use of the chip than on speed. We have plenty of speed for what most people need. It’s like the problem with the PowerPC chip Apple was using. It was never bad, but the case was always made that Intel was far exceeding the PowerPC chip in speed, and that made Apple look bad.

    • Scali says:

      Intel never claimed that HT will give twice the performance. Heck, not even a second physical core will give twice the performance (as should be obvious from reading this article).
      HT is not ‘just a trick’ however. As I said elsewhere: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

      Now clearly, ‘HyperThreading’ is just a marketing term as well, but it is Intel’s name for their implementation of SMT (simultaneous multithreading), which is a commonly accepted term for this multithreading approach in CPU design, and one that had been in use long before Intel implemented HyperThreading (IBM started researching it in 1968, to give you an idea of the historical perspective here).

      As I also said, I was never surprised that Intel brought HT back. x86 is not a very efficient instruction set, so a lot of execution units are left idle on a modern backend. HT is an efficient way to make better use of these execution units. It probably turns out positive in terms of power usage as well (after all, Intel even uses a form of HT on their Atom series), but I have never seen anyone do any tests on that.

  35. M.Shazly says:

    Where has this blog been ALL my life? Thanks for the informative post, man. Keep up the good work.

  36. Gustavo says:

    Nice article. But when buying a computer, people should also think about cost versus performance.
    I was comparing prices for a new PC, since my Core 2 Duo E7400 is too slow to handle NetBeans 8, QtCreator 4, Gimp, Blender, MySQL Workbench, MySQL Server, MySQL Client, Dia, Mozilla Thunderbird, Google Chrome and my project (compiling Qt at the same time as I code in NetBeans on another project) – all running at the same time. The Core 2 Duo just can’t handle all those workloads in a reasonable time anymore.

    So, after comparing, I found that I can buy an i7 4770 for more than double the price of an FX 8350. But I would be far from getting double the performance, even comparing single-threaded performance. For the same price as the FX 8350 I can only get an i7 950, and that i7 loses in every benchmark test. So, no doubt, I’m buying an FX 8350.

    One thing that should have been said here: there is software optimized for Intel, and there is software optimized for AMD. Intel leads the market, and so there is much more software optimized for it. If you take games optimized for AMD and benchmark them against an Intel (the all-powerful Haswell – of course using the same hardware apart from the processor), you will see the FX 8350 getting better performance (in FPS, for example) than the Intel. It might not be too relevant, but I think it is worth the comment, especially when you pay 2–3 times more and don’t even get 1.5 times the performance.

    • Scali says:

      My blog is a technical piece; I don’t want to go into price, only the technical merits of different architectures.
      Prices change all the time (AMD competes aggressively on price, which is why their 8-cores are so affordable these days, making them compete with Intel CPUs that have much lower core counts, transistor counts, power consumption, etc.). Besides, price is never linear with performance; it rises exponentially as performance goes up.

  37. Wes says:

    You wrote: “x86 is not a very efficient instruction set, so a lot of execution units are left idle on a modern backend. HT is an efficient way to make better use of these execution units.” So does this mean that even when a program is running flat out at 100% on a core, there is still some idle capacity behind the scenes?

    I’m doing a project that involves a single-threaded sequential executable (no FPU or GPU use) that reads data from a file, spends anywhere from a few hours to a couple of weeks number crunching, then dumps the results to a file. The file I/O is insignificant compared to the calculation phase, which runs with no I/O at 100% of a core. Even though the executable is single-threaded, I can run multiple instances of the program in parallel. By the end, over 100 instances of the program have to be run. I have temporary access to a spare server with 2 Intel E5620 processors (8 physical cores, 16 logical cores) and 96 GB of RAM.

    Would it be faster in this unusual situation to run only 8 instances of the program at a time? I was under the impression that since they’re running at 100% anyway, there wouldn’t be any gains from HT in running more than 8. But if you’re saying that there are still idle cycles in the “backend” anyway, then running more than 8 would reduce the total run time. Is this correct?

    • Scali says:

      Yes, if you look at a modern x86 backend, you will see many execution units. Usually you have something like 3 ALUs, 2 load/store units and 2 FPU/SIMD units (okay, it’s slightly more complicated in reality; see e.g. Nehalem/Sandy Bridge/Haswell here for details: http://www.anandtech.com/show/6355/intels-haswell-architecture/8).
      Each of these can execute one instruction per cycle in the best case, so in theory you could reach a maximum IPC of 3+2+2 = 7 instructions per cycle.
      In practice, however, most code will struggle to execute more than 2 instructions per cycle.
      x86 itself is partly to blame: the instructions are not very powerful, and the small register set makes it hard to keep the backend properly fed. Besides, decoding x86 instructions is very complex and expensive, which means these CPUs can’t decode more than 3 or 4 instructions per cycle, and the reordering and retirement after out-of-order execution is a bottleneck as well. So you’ll never be able to reach the theoretical maximum of the combined execution units anyway.

      So that means a lot of units are sitting idle most of the time, which is why the concept of HT makes so much sense.

      And yes, in your case it’s certainly possible that HT will improve performance. Namely, if we assume that your code can only average <= 2 ALU instructions per cycle, then you'll have some spare ALU time that an extra HT thread could make use of.
      Whether you actually gain in practice depends roughly on the following parameters:
      1) How efficient is the ALU use of your code?
      2) How efficiently can HT schedule two threads of your code on a single physical core?
      3) What effect does having two threads on a physical core have on the cache and memory subsystem? (If a single thread was already completely bandwidth-limited by the cache/memory, then even though there may be idle ALUs, you can’t feed any work to them.)

      You could, however, consider rewriting your code. Even though your code is integer-only, you could still create two variations of it: one targeting the regular integer ALUs, the other targeting the SIMD ALUs (you can do integer operations with MMX/SSE2+).
      Because the two variants have a different ‘mix’ of instructions, running a ‘regular’ and a ‘SIMD’ thread on the same physical core may give you better results with HT; see the sketch below.
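
      To sketch what I mean (a minimal illustrative example, not your actual workload; the function names are made up): the same integer accumulation can be written once for the regular ALUs and once for the SSE2 units using intrinsics.

          #include <emmintrin.h>  /* SSE2 integer intrinsics */
          #include <stddef.h>
          #include <stdint.h>

          /* Scalar variant: keeps the regular integer ALUs busy. */
          int64_t sum_scalar(const int32_t *data, size_t n)
          {
              int64_t sum = 0;
              for (size_t i = 0; i < n; i++)
                  sum += data[i];
              return sum;
          }

          /* SIMD variant: the same work routed through the SSE2 units,
             adding four 32-bit integers per instruction. (Per-lane sums
             can overflow 32 bits on huge inputs; this only illustrates
             the idea of a different instruction mix.) */
          int64_t sum_sse2(const int32_t *data, size_t n)
          {
              __m128i acc = _mm_setzero_si128();
              size_t i = 0;
              for (; i + 4 <= n; i += 4)
                  acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(data + i)));

              int32_t lanes[4];
              _mm_storeu_si128((__m128i *)lanes, acc);
              int64_t sum = (int64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];

              for (; i < n; i++)  /* handle the leftover elements */
                  sum += data[i];
              return sum;
          }

      The idea would then be to pair an instance built around the scalar variant with one built around the SIMD variant on each physical core, so the two HT threads mostly compete for different execution units.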

      • Wes says:

        Thanks, that’s good to know.

        It took me a while to get the code working, so I don’t know that I want to mess with it, but I’m assuming that by compiling with the appropriate options it will use MMX/SSE whenever appropriate. I’m using two different compilers (for verification purposes): gcc with "-O3 -march=core2 -msse4.2" and the Microsoft C compiler with "/Ox /favor:INTEL64", which I understand uses SSE instructions.

      • Scali says:

        If you compile for the AMD64 architecture, the MS compiler will use SSE2 by default.
        For 32-bit, SSE2 is not available on all CPUs, so you have to use /arch:SSE2 to enable it.
        However, that is not the right way to get the kind of code I am talking about.
        Such options generally only replace naive x87 FPU code with scalar SSE2. What I’m talking about requires *ALL* the (ALU) code to be SSE2, which you can only achieve with intrinsics or inline assembly.

        Never rely on a compiler to generate optimal code for MMX/SSE.
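
        To make the distinction concrete, a minimal sketch (the function names are just for illustration):

            #include <emmintrin.h>

            /* With the MS compiler on x64 (or /arch:SSE2 on 32-bit), plain
               floating-point code like this is compiled to *scalar* SSE2
               instructions (mulss/addss) instead of x87 FPU code: */
            float scale(float x)
            {
                return x * 0.5f + 1.0f;
            }

            /* That does not turn your integer loops into packed SIMD code.
               To guarantee packed integer instructions (paddd etc.), you
               have to write them yourself with intrinsics: */
            __m128i add4(__m128i a, __m128i b)
            {
                return _mm_add_epi32(a, b);
            }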

  38. Deepak Roy says:

    I just stumbled upon this article rather by accident, and it has really helped. With this octa-core vs quad-core vs dual-core fight going on in the mobile world, I didn’t want to compromise on speed and was about to invest in an expensive phone. Now I am better off buying a dual-core that advertises better multi-threading performance, like the ASUS Zenfone 5.

    Thanks a ton again. Deepak, India
