Why the need for CPUs?

GPUs are like Formula 1 cars: they will absolutely blow everything else out of the water at one very specific task. However, an F1 car would be absolutely useless for going shopping with your family.
A CPU is like a big family SUV: it can do pretty much everything to a reasonable degree. It can go quite fast, it can go a lot of places, and you can put more people and items in it.

To get more technical, GPUs are processors optimised for parallel computing. The math required to calculate what a 3D object should look like on a screen is easily run in parallel, so GPUs were developed to take advantage of that. CPUs, on the other hand, can handle things like networking, AI, application state computations, IO and so on much better than a GPU can.

To get to a math level, which is really what computers are doing, GPUs are processors designed to perform matrix algebra as fast as possible. If you can express your problem in the form of matrices, a GPU will absolutely demolish it. If you cannot express your problem in terms of matrix operations, then a regular CPU will be much faster.
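
As a rough illustration of the "3D object as matrix math" point, here is a minimal pure-Python sketch. The helper names (`mat_vec`, `rotation_z`) and the sample vertices are made up for this example. It rotates a few vertices with one 3x3 matrix; the thing to notice is that every vertex is transformed independently, which is what makes the work easy to spread across many GPU threads.

```python
import math

def mat_vec(m, v):
    """Multiply a 3x3 matrix (a list of rows) by a 3-component vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def rotation_z(theta):
    """Rotation by theta radians around the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

# Hypothetical vertex data for illustration only.
vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 2.0]]
rot = rotation_z(math.pi / 2)

# No vertex depends on any other vertex, so on a GPU each
# iteration of this loop could be handled by its own thread.
rotated = [mat_vec(rot, v) for v in vertices]
```

On a CPU this loop runs one vertex at a time; a GPU would run the same multiply for thousands of vertices at once.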

GPUs are getting a lot of attention because a lot of problems can be expressed in the form of matrix operations. Simulations, machine learning and graphics are all very matrix-intensive calculations, so faster GPUs mean more powerful simulations, better graphics and more useful machine learning models.
The above is mostly on point. The only thing I would add is that a GPU is good at a lot more than just matrix operations. In general, they’re good at Single Instruction Multiple Data (SIMD) operations, where matrix multiplication is just one such example.
 
Processing multiple data with one operation is very matrixy to me :laugh:

I am curious though, is there a SIMD operation that couldn't be expressed as linear algebra?
 
Transfer functions for AI, Tree based ML models, Monte Carlo simulations, Binomial Tree Options Pricing, Ray Tracing, Vertex Shading, Geometry Shading, Pixel Shading, Protein Folding, Prefix Sums, Bitonic Sorting, etc. just for starters.

Pretty much anything you would write a CUDA kernel or HLSL shader for. There would be no point to these languages if you could just make a cuBLAS call to use prepackaged linear algebra routines.
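
To make one item from that list concrete, here is a minimal pure-Python sketch of a bitonic sorting network (the function name `bitonic_sort` is just for this example, and it assumes the input length is a power of two). Every "thread" `i` runs the same compare-and-swap step on its own pair of elements, which is classic SIMD-style work. The compare-and-swap is a min/max, which is not a linear operation, so it can't be written as a matrix product.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            # On a GPU, each i would be one thread; all threads execute
            # the same compare-exchange on disjoint pairs of elements.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

Because the pairs in each pass are disjoint, the inner `for` loop could run in any order, or all at once, and give the same result.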
 
Those, again, all look pretty matrixy. But I think I see what you are getting at.

Correct me if I am wrong, but the efficiency seems to also come from not needing to move data around as it stays in memory on the GPU.


Devoting more transistors to data processing, for example, floating-point computations, is beneficial for highly parallel computations; the GPU can hide memory access latencies with computation, instead of relying on large data caches and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors.

In general, an application has a mix of parallel parts and sequential parts, so systems are designed with a mix of GPUs and CPUs in order to maximize overall performance. Applications with a high degree of parallelism can exploit this massively parallel nature of the GPU to achieve higher performance than on the CPU.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
 
Those again, all look pretty matrixy. But think I see what you are getting at.
None of those really use matrices at all (except some of the rendering processing although that's probably < 5% of the shader code today). I've implemented all of the above except for protein folding. They do plenty of math, sure, but it's not matrix math.

Correct me if I am wrong, but the efficiency seems to also come from not needing to move data around as it stays in memory on the GPU.
It comes from a bunch of things. The "cores" are grouped so that a single instruction controls the register and data movements for a block of them (blocks of 32 on modern NVIDIA GPUs, called a "warp"), so you only need one front-end decoder for 32 threads/cores. The bit you quoted is saying that if a program running on the GPU does a memory read, it doesn't have to sit and wait for the result to come back like an older CPU would. Instead it swaps in another "warp" and runs that while the memory is being fetched. It can do this in a single clock, while a CPU has to do an expensive context switch.
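
The warp-swapping idea can be sketched with a toy simulation in Python. This is a deliberately simplified model, not how any real scheduler works: it assumes a single issue slot, one instruction per cycle, a fixed memory latency, and a made-up `simulate` function in which every instruction is followed by a load that stalls its warp.

```python
def simulate(num_warps, loads_per_warp, mem_latency):
    """Return (total_cycles, busy_cycles) for a toy single-issue scheduler."""
    ready_at = [0] * num_warps               # cycle at which each warp may issue again
    remaining = [loads_per_warp] * num_warps  # instructions each warp still has to issue
    cycle = 0
    busy = 0
    while any(remaining):
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                # Issue one instruction, then this warp waits on its load.
                remaining[w] -= 1
                ready_at[w] = cycle + 1 + mem_latency
                busy += 1
                break                        # only one warp can issue per cycle
        cycle += 1
    return cycle, busy

one_warp = simulate(num_warps=1, loads_per_warp=4, mem_latency=3)
four_warps = simulate(num_warps=4, loads_per_warp=4, mem_latency=3)
```

With one warp the issue slot sits idle while each load is outstanding. With four warps there is always another warp ready to go, so the slot is busy every cycle and four times the work finishes in only slightly more time.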

Since CPUs can't swap out their threads in a single clock, they try to minimize the wait time. This means implementing things like Out of Order Execution (to continue processing independent parts of the instruction stream while waiting, if there are any), and it also means adding a lot of cache to hopefully get the data back faster most of the time. GPUs can use the die area that CPUs spend on front-end decoders, cache, Out of Order Execution, etc. to pack in more cores, which is why they can effectively fit 1000s of cores (16384 on the RTX 4090) onto the die. Also, since GPU problems tend not to be as memory intensive, GPUs can invest in much faster but smaller memory technology, so that memory bandwidth is many times higher than on a CPU (but the latency is worse).

Note that CPUs do have some SIMD support (see AVX-512). Xeon server cores can do ~32 32-bit FMAs (fused multiply-add instructions) in a single clock, so each core is roughly as performant as 32 GPU cores, per clock. Since they run at ~2x the clock, a single Xeon core is worth ~64 GPU cores. It would require a ~256-core Xeon CPU running at 4 GHz to compete with a GPU in raw FP performance (so a GPU is O(10x) faster). On top of that, GPUs today have additional tensor units that do fast matrix multiplies, which can add a further 5-10x multiplier for matrix multiplies (so, comparing high-end CPUs and high-end GPUs, GPUs are ~50-100x faster for matrix multiplies, and ~10-20x faster for general SIMD).
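
The back-of-the-envelope numbers above can be reproduced in a few lines of Python. The constants are assumptions made to match the post, not measurements: two 512-bit FMA units per core (which is where ~32 FMAs per clock comes from), a CPU clock about twice the GPU clock, and the 16384-core GPU mentioned earlier.

```python
AVX512_BITS = 512          # width of one AVX-512 vector register
FLOAT_BITS = 32            # single-precision float
FMA_UNITS_PER_CORE = 2     # assumption: two 512-bit FMA units per Xeon core
CLOCK_RATIO = 2            # assumption: CPU clock is ~2x the GPU clock
GPU_CORES = 16384          # core count quoted earlier in the thread

lanes = AVX512_BITS // FLOAT_BITS                        # floats per vector
fma_per_clock = lanes * FMA_UNITS_PER_CORE               # FMAs per core per clock
gpu_cores_per_cpu_core = fma_per_clock * CLOCK_RATIO     # after the clock advantage
cpu_cores_needed = GPU_CORES // gpu_cores_per_cpu_core   # cores to match the GPU

print(lanes, fma_per_clock, gpu_cores_per_cpu_core, cpu_cores_needed)
```

This only compares peak FMA throughput; it ignores memory bandwidth, tensor units and how well real code keeps either chip busy.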
 
The FPU was integrated into the CPU in the x86 architecture from the 80486 onwards...

The x86 architecture is inefficient both from a processing and power point of view. This is why the world is going the ARM route, Apple took the first step.

From power consumption, yes. They are not more powerful though. ARM has certain things it excels at, but overall can't reach the performance levels of x86 yet. We'll see if it ever gets there.
 
Ah interesting, so the math explains why GPUs excel at mining. I do wonder how optimized systems are given these roles and if a major architecture overhaul is in our near future... not talking quantum or anything. I guess it all boils down to cost vs benefit.
I do find it interesting that the UI hasn't changed dramatically - the way we interact with systems (other than mobile perhaps).

Who knows.
They could probably create something better if they can completely drop support for any and all legacy applications and OSes. Only Apple or some very specialized hardware can really get away with something like that, and they went the power efficiency route instead of the raw power route.

As for the UI, it has actually changed quite significantly over the years, but you can't change a UI to a completely different paradigm without alienating lots of users. How can you make something so much better than what we have at the moment that people would be willing to put in the effort to learn it? People create polls when FB changes its layout and complain non-stop when Microsoft moves the start button.
 
The FPU was integrated into the CPU in the x86 architecture from the 80486 onwards...

The x86 architecture is inefficient both from a processing and power point of view. This is why the world is going the ARM route, Apple took the first step.

They were on RISC (PowerPC, "I was there, 3000 years ago") before they went Intel.
 
From power consumption, yes. They are not more powerful though. ARM has certain things it excels at, but overall can't reach the performance levels of x86 yet. We'll see if it ever gets there.
Instruction sets are why x86 is so 'heavy' compared to ARM.
CISC vs RISC: different architectures are better at different things & require a different amount of work on the coding side to get those things to work.

The same mindset applies to CPU vs GPU, architecture, cost & power for its function.

ARM & x86 are on a similar level of speed & cores at this point in time, so if your app CAN work on ARM, it'll probably perform just as well as it did on x86, which is how we see these amazing Apple ARM benchmarks.

We can apply the same question to other hardware, why do we need RAM when CPUs have L3 Cache?
(yes yes, Intel Sapphire Rapids with HBM)
 
Short answer: They're not as versatile as CPUs.

You still need a processor to handle the wide variety of tasks and even the instructions for the GPU. GPUs simply complement this by running whatever can be run in parallel.
 
ARM has surpassed the crappy ole x86. This is why I use ARM for anything embedded or even server-ish where I can.
It's ironic because back in the late 1990s we used to look at the Pentium instruction execution times and laugh because we had access to other machines doing the same thing at half the speed. But yes, Bill Gates..
 
They took the first step towards ARM is what I meant. Before that they were on Motorola stuff, m68k... 68020, 68030, then onto PowerPC architecture as you said. I know.. I had Macintosh computers from that era.

The FPU was only integrated on the 68040 & 68060. The 68020 & 68030 used a separate 68881/68882 coprocessor.
 