A few thoughts about today's CPU era...
Posted January 6, 2019 at 15:44 by wyldckat
So for whatever reason I went on a nostalgic thought process an hour or so ago and began reading about:
- EPYC vs Skylake-SP: https://www.anandtech.com/show/11544...-of-the-decade
- KNL vs Skylake-SP: https://www.emc.com/collateral/white...skylake-wp.pdf
Intel Phi Co-processors KNC
So the Phi Co-processors were available for cheap, some 200-300€ per card, because they were getting rid of stock. They boasted a potential 64GHz of cumulative CPU clock, of which 16GHz may be plausible to take advantage of, given that it was a monster with 64 threads and 8 memory channels, but each core (16 of them) could only run at ~1.1 GHz.
- The downside? It required porting code to it, even though it was an x86_64 architecture and was running a small Linux-like OS within it, as if it were a CPU with Android on a USB stick, except it was a PCI-E card on a PCI-E slot...
- The result:
  - Takes too long to make any code work with it and it was essentially something akin to a gaming console, i.e. expensive hardware designed for a specific purpose.
  - Would be plausible to use them, if they had done things right...
- That said, that's how NVidia does its job with CUDA... but GPUs can crank number crunching all the way up to some 1000-4000 FPUs, so 64 threads sharing 16 or 8 FPUs was borderline nonsense...
AMD A10-7850K
This is what to me felt like the technology that could disrupt all others: 4 cores that could be used to manage 512 FPUs, all on the same die, not requiring memory offloading through PCI-E lanes... this was like a dream come true, the quintessential technology holy grail for high performance computing, if there was ever one as such. Hypothetically this harbored 512 FPUs at 720MHz, which would add up to ~368GHz of cumulative potential CPU clock power, which, along with 4 x86_64 cores @3.7GHz to shepherd them, would allow for a killing in HPC...
But the memory bottleneck of only having 2 channels at 2133MHz was like having only some 150 FPUs to herd, when compared to a GPU card with GDDR5 at 7GHz...
However, even if that were the case, that wouldn't be all too bad, given that it would give a ratio of about 38 FPUs per core; compared to the 4-16 float arrays in AVX, the A10-7850K would still make a killing...
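The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch: "cumulative clock" is the informal metric used in this post (units times clock), not a benchmark, and the names below are made up for illustration. Scaling the FPU count by the ratio of memory data rates (2133 vs 7000) is an assumption about how the ~150 figure can be reached; it ignores channel count and bus width.

```python
# Rough arithmetic for the A10-7850K figures quoted above.
# "Cumulative clock" = number of units x clock per unit (informal metric).

def cumulative_clock_ghz(units: int, clock_ghz: float) -> float:
    """Return the informal 'cumulative clock' in GHz."""
    return units * clock_ghz

gpu_fpus = 512          # FPUs on the die, as quoted in the post
gpu_clock_ghz = 0.72    # 720 MHz
cpu_cores = 4

total = cumulative_clock_ghz(gpu_fpus, gpu_clock_ghz)

# Assumption: the memory bottleneck is modelled as a simple ratio of
# data rates (system RAM at 2133 vs GDDR5 at 7000), nothing more.
ram_rate = 2133
gddr5_rate = 7000
usable_fpus = gpu_fpus * ram_rate / gddr5_rate
fpus_per_core = usable_fpus / cpu_cores

print(f"cumulative clock: {total:.2f} GHz")
print(f"FPUs the memory can feed: {usable_fpus:.0f}")
print(f"FPUs per CPU core: {fpus_per_core:.1f}")
```

The results land close to the rounded figures in the text (~368GHz, some 150 FPUs, roughly 38 per core).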
Unfortunately:
- It's not exactly easy to code for it, mostly because of the stack that needs to be installed...
- Which wouldn't be so bad, given that the competition is CUDA, which also relies on the same kind of installation hazards...
- But the thing that eventually held me back on ever doing anything with it was that the Kaveri architecture had a bug that rendered it not supportable in AMD's ROCm development efforts: https://github.com/RadeonOpenCompute...ment-270193586
Intel Xeon Phi KNL
Knights Landing... The Phi was somewhat inspired by D&D, in the sense that Knights took up their arms and went on an adventure, in search of, or on the hunt for, a better home: https://en.wikipedia.org/wiki/Xeon_Phi
- Knights Ferry - began traveling by boat...
- Knights Corner - nearly there...
- Knights Landing - reached the hunting/fighting grounds...
- Knights Hill - conquered the hill... although it was canceled, because they didn't exactly conquer it...
- Knights Mill - began working on it... but it was mostly oriented towards deep learning...
The problem: 64-72 cores running at 1.1GHz is pointless for FPU processing, if you only have 64-72 of the bloody critters, ain't it? Compared to the countless FPUs on a GPGPU, this is peanuts...
Intel Skylake-SP
They finally learned their lessons with the KNL and gave the x86_64 architecture proper infrastructure for scaling, at least from my understanding of the "KNL vs Skylake-SP" document I mentioned at the start of this post.
They even invested in AVX512... 8 double-precision or 16 single-precision lanes per 512-bit vector unit (all crunched in a single clock cycle, instead of just one FPU per core in the common x86 architecture), which I guess run at 2.2 to 3GHz instead of 1.1GHz, so effectively making them 2 to 3 times faster than GPU FPUs.
I'm not even going to venture an estimate of how the potential CPU clock power of these AVX512 units compares to a GPU, for a very simple reason: they can only reach 6 channels at a maximum of 2666 MHz, which pales in comparison to the 7GHz or more that exists nowadays in GDDR5/6 technology on GPU cards.
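For reference, the number of values a vector unit handles per instruction follows directly from the register width. The sketch below uses the standard register widths (128-bit SSE, 256-bit AVX/AVX2, 512-bit AVX-512); the helper name is made up for illustration. It also shows where the "4-16 float arrays in AVX" range mentioned for the A10-7850K comparison comes from.

```python
# Lanes per vector register = register width / element width.

def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits

for name, bits in (("SSE", 128), ("AVX/AVX2", 256), ("AVX-512", 512)):
    floats = lanes(bits, 32)   # single precision
    doubles = lanes(bits, 64)  # double precision
    print(f"{name}: {floats} floats or {doubles} doubles per register")
```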
AMD EPYC
This made me laugh, once I saw the design architecture: https://www.anandtech.com/show/11544...f-the-decade/2
So the trick was fairly simple:
- Have 4 Ryzen CPUs connected to each other through an Infiniband-like connection between all 4 of them.
- Each Ryzen CPU has only 2 memory channels, but can have up to 8 cores and 2 threads per core...
- Has 2666 MHz RAM... being accessed through a total of 8 memory channels.
Either way, the EPYC CPUs are nearly equivalent to 4 mainstream-grade CPUs in each socket, connected via an Infiniband-like fabric, for roughly the size of a single credit card...
Playstation 4 and Xbox One
- Octa-Core AMD x86-64 "Jaguar"-based CPU
- AMD Radeon with a ton of shaders (~700 to ~2500 at around 1.2GHz), depending on the version...
- 8-12 GB GDDR5 RAM, depending on the version, but mostly shared between CPU and GPU...
What I expect in the near future
To me, the plan is simple, given that Moore's law gave out years ago, because it is hard to scale lithography down any further and we are now reaching the limit on how small a transistor can be and still hold a charge without sneezing...
- Specialization: we are already seeing this on several fronts:
  - ASICs were created for the bitcoin mining thingamabob... a clear sign of the future, even though it's a pain in the butt to code for... since we are coding the actual hardware... but that's how GPUs appeared in the first place and the AI-oriented tech coming out on current CPUs is the same kind of thing, as well as AVX tech et al.
  - ARM and RISC CPUs, where trimming down hardware specs can help make CPUs run cooler and with less power on our precious smartphones and tablets...
  - You can even design your own RISC CPU nowadays: https://www.youtube.com/watch?v=jNnCok1H3-g
- x86_64 needs to go past its primordial soup design and go all out in integration:
  - 3D stacking of core groups, with liquid cooling running between stacks, because heat extraction through copper is likely not enough.
  - Intertwining GDDR RAM between those stacks.
  - Cumulative memory channels should be non-ubiquitous, akin to AMD EPYC design.
  - Essentially create a cluster within a single socket, which is essentially what an AMD EPYC nearly is...
Comments
As I was revising what I wrote, I began rethinking what I read at https://en.wikipedia.org/wiki/Eighth...les#Comparison - for the PS4 memory specs, and I quote:
Quote:
- Original PS4:
  - 8 GB GDDR5 RAM @ 1375 MHz (5500 MHz effective) (176.0 GB/s)
  - 4.5–5.5 GB (flexible memory) available for games
- PS4 Pro:
  - 8 GB GDDR5 RAM @ 1700 MHz (6800 MHz effective) (217.6 GB/s)
But then I look at the Wikipedia page for GDDR5: https://en.wikipedia.org/wiki/GDDR5_SDRAM - and they mention:
Quote:
Hynix 40 nm class "2 Gb" (2 × 1024³ bit) GDDR5 was released in 2010. It operates at 7 GHz effective clock-speed and processes up to 28 GB/s. "2 Gb" GDDR5 memory chips will enable graphics cards with 2 GB or more of onboard memory with 224 GB/s or higher peak bandwidth.
- 1x Command Clock CK at 1.25 or 1.375 GHz
- 2x Write Clock WCK at 2.5 or 2.75 GHz (each)
- 2.5×2+1.25 = 6.25 GHz
- 1.375+2×2.75 = 6.875 GHz
Anyway, if my understanding of this is correct, then the 6-8 GHz specs that the GPU cards keep advertising aren't all that great, when compared to the 8x2.6GHz that the EPYC CPUs have nowadays...
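A quick way to put these numbers side by side is to convert everything to peak bandwidth: transfers per second times bytes per transfer. The sketch below is a rough check, not a measurement. The 256-bit GDDR5 bus width is an assumption (it is the width that reproduces the quoted 176.0 and 217.6 GB/s figures), the 64-bit width per DDR4 channel is the standard one, "2666 MHz" is treated as 2666 MT/s, and the function name is made up for illustration.

```python
# Peak memory bandwidth = transfer rate x bus width in bytes.

def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for a given transfer rate (MT/s) and bus width."""
    return transfer_rate_mts * 1e6 * (bus_width_bits / 8) / 1e9

# GDDR5: the quoted "effective" rate is 4x the quoted memory clock.
ps4 = peak_bandwidth_gbs(4 * 1375, 256)      # assumed 256-bit bus
ps4_pro = peak_bandwidth_gbs(4 * 1700, 256)  # assumed 256-bit bus

# DDR4: one channel is 64 bits wide.
epyc = peak_bandwidth_gbs(2666, 8 * 64)        # 8 channels
skylake_sp = peak_bandwidth_gbs(2666, 6 * 64)  # 6 channels

for name, value in (("PS4", ps4), ("PS4 Pro", ps4_pro),
                    ("EPYC, 8 channels", epyc),
                    ("Skylake-SP, 6 channels", skylake_sp)):
    print(f"{name}: {value:.1f} GB/s")
```

Under these assumptions the 8-channel EPYC figure comes out in the same range as the original PS4, while the 6-channel Skylake-SP figure is noticeably lower.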
But still, having RAM embedded into the motherboard or on a stacked die would allegedly reduce latency by quite a bit, as in graphics cards...
Posted January 6, 2019 at 16:01 by wyldckat