A few thoughts about today's CPU era...

Posted January 6, 2019 at 15:44 by wyldckat

So for whatever reason I went on a nostalgic thought process an hour or so ago and began reading about:
The nostalgic reason for this was that I had briefly gotten a chance to work with a couple of Intel Phi Co-processors a couple of years ago and never got the time to work on them. I had also gotten an AMD A10-7850K for myself and likewise never got the time to work on it either.


Intel Phi Co-processors KNC

So the Phi Co-processors were available for cheap, some 200-300 per card, because they were getting rid of stock. They boasted a potential 64GHz of cumulative CPU clock, of which 16GHz may be plausible to take advantage of, given that each card was a monster with 64 threads and 8 memory channels, but each core (16 of them) could only run at ~1.1 GHz.
  • The downside? It required porting code to it, even though it was x86_64 architecture and was using a small Linux-like OS within it, as if it were a CPU with Android on a USB stick, but it was a PCI-E card on a PCI-E slot....
  • The result:
    • Takes too long to make any code work with it and it was essentially something akin to a gaming console, i.e. expensive hardware designed for a specific purpose.
    • Would be plausible to use them, if they had done things right...
  • That said, that's how NVidia does its job with CUDA... but GPUs can crank number crunching all the way up to some 1000-4000 FPUs, so 64 threads sharing 16 or 8 FPUs was borderline nonsense...


AMD A10-7850K
This is what, to me, felt like the technology that could disrupt all others: 4 cores that could be used to manage 512 FPUs, all on the same die, not requiring memory offloading through PCI-E lanes... this was like a dream come true, the quintessential technology holy grail for high performance computing, if there ever was one. Hypothetically this harbored 512 FPUs at 720MHz, which would add up to ~368GHz of cumulative potential CPU clock power, which, along with 4 x86_64 cores @3.7GHz to shepherd them, would allow for a killing in HPC...

But the memory bottleneck of only having 2 channels at 2133MHz was like having only some 150 FPUs to herd, when compared to a GPU card with GDDR5 at 7GHz...

However, even if that were the case, it wouldn't be all too bad, given that it would give a ratio of about 38 FPUs per core; compared to the 4-16 float arrays in AVX, the A10-7850K would still make a killing...
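
The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the naive "cumulative clock" metric used in this post (units times clock); the function name `cumulative_clock_ghz` is made up for illustration and the metric ignores memory bandwidth, instruction mix and everything else that matters in practice.

```python
def cumulative_clock_ghz(units: int, clock_ghz: float) -> float:
    """Naive 'cumulative clock': number of units times their clock, in GHz."""
    return units * clock_ghz


# Figures quoted in the post for the A10-7850K.
gpu_fpus = 512
gpu_clock_ghz = 0.72
cpu_cores = 4

# The post's rough guess of how many FPUs two DDR3 channels can keep fed.
usable_fpus = 150

total = cumulative_clock_ghz(gpu_fpus, gpu_clock_ghz)
usable = cumulative_clock_ghz(usable_fpus, gpu_clock_ghz)
fpus_per_core = usable_fpus / cpu_cores

print(f"All FPUs:       {total:.2f} GHz cumulative")
print(f"Memory-limited: {usable:.2f} GHz cumulative")
print(f"FPUs per core:  {fpus_per_core:.1f}")
```

The 150 usable FPUs is the guess made above, not a measured value, so the "memory-limited" line is only as good as that guess.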

Unfortunately:
  1. It's not exactly easy to code for it, mostly because of the stack that needs to be installed...
  2. Which wouldn't be so bad, given that the competition is CUDA, which also relies on the same kind of installation hazards...
  3. But the thing that eventually held me back on ever doing anything with it was that Kaveri architecture had a bug that rendered it not supportable in AMD's ROCm development efforts: https://github.com/RadeonOpenCompute...ment-270193586
I still wish I could find the time and inspiration to try and figure out what I could still do with this APU... but a cost/benefit analysis states that it's not worth the effort.


Intel Xeon Phi KNL
Knight's Landing... The Phi was somewhat inspired by D&D, in the sense that Knights took up their arms and went on an adventure, in search or hunt for a better home: https://en.wikipedia.org/wiki/Xeon_Phi
  1. Knights Ferry - began traveling by boat...
  2. Knights Corner - nearly there...
  3. Knights Landing - reached the hunting/fighting grounds...
  4. Knights Hill - conquered the hill... albeit was canceled, because they didn't exactly conquer it...
  5. Knights Mill - began working on it... but it was mostly oriented towards deep learning...
KNL was essentially a nice CPU, in the sense that we didn't need to cross-compile and could instead focus work on optimizing for this CPU. The pseudo-Level 4 cache, technically named MCDRAM: https://en.wikipedia.org/wiki/MCDRAM - was akin to having GPU-rated RAM (by which I mean akin to GDDR5) near the 64-72 cores that the CPU had...

The problem: 64-72 cores running at 1.1GHz is pointless for FPU processing, if you only have 64-72 of the bloody critters, ain't it? Compared to the countless FPUs on a GPGPU, this is peanuts...


Intel Skylake-SP
They finally learned their lessons with the KNL and gave the x86_64 architecture proper infrastructure for scaling, at least from my understanding of the "KNL vs Skylake-SP" document I mentioned at the start of this post.

They even invested in AVX512... 512-bit vector units that handle 8 double-precision or 16 single-precision values at once (all crunched in a single instruction, instead of one value at a time as in scalar x86 code), which I guess run at 2.2 to 3GHz instead of 1.1GHz, so effectively clocking them 2 to 3 times faster than GPU FPUs.

I'm not even going to venture at estimating how these AVX512 units compare to a GPU in potential cumulative clock power, for a very simple reason: the CPU can only reach 6 memory channels at a maximum of 2666 MHz, which pales in comparison to the 7GHz or more that exist nowadays in GDDR5/6 technology on GPU cards.


AMD EPYC
This made me laugh, once I saw the design architecture: https://www.anandtech.com/show/11544...f-the-decade/2
So the trick was fairly simple:
  1. Have 4 Ryzen CPUs connected to each other in an Infiniband-like connection between all 4 of them.
  2. Each Ryzen CPU has only 2 memory channels, but can have up to 8 cores and 2 threads per core...
  3. Has 2666 MHz RAM... being accessed through a total of 8 memory channels.
This is what the Knight's thingamabob should have been right from the start... this is the kind of technology that will allow extending to the next logical step: 3D CPU stacks, with liquid cooling running between them...

Either way, the EPYC CPUs are nearly equivalent to 4 mainstream-grade CPUs in each socket, connected via an Infiniband-like link, in roughly the size of a single credit card...
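
To put the channel counts in perspective, here is a rough sketch of the peak theoretical DDR4 bandwidth for the configurations mentioned so far. It assumes that each DDR4 channel is 64 bits wide (8 bytes per transfer) and that "2666 MHz" means 2666 mega-transfers per second; neither is spelled out above, and the helper name `ddr_peak_gb_s` is hypothetical. Real sustained bandwidth will be lower.

```python
DDR4_CHANNEL_BYTES = 8  # assumption: one 64-bit channel moves 8 bytes per transfer


def ddr_peak_gb_s(channels: int, mega_transfers_per_s: float) -> float:
    """Peak theoretical bandwidth in GB/s for a number of DDR4 channels."""
    bytes_per_s = channels * mega_transfers_per_s * 1e6 * DDR4_CHANNEL_BYTES
    return bytes_per_s / 1e9


# One Ryzen die: 2 channels. One EPYC socket: 4 dies x 2 channels.
ryzen_die = ddr_peak_gb_s(2, 2666)
epyc_socket = ddr_peak_gb_s(4 * 2, 2666)

# Skylake-SP, as described in the previous section: 6 channels.
skylake_sp = ddr_peak_gb_s(6, 2666)

print(f"Ryzen die:   {ryzen_die:.1f} GB/s")
print(f"EPYC socket: {epyc_socket:.1f} GB/s")
print(f"Skylake-SP:  {skylake_sp:.1f} GB/s")
```

Under those assumptions the 8-channel socket gets four times the peak bandwidth of a single die, which is the whole point of the trick.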


Playstation 4 and Xbox One
  • Octa-Core AMD x86-64 "Jaguar"-based CPU
  • AMD Radeon with a ton of shaders (~700 to ~2500 at around 1.2GHz), depending on the version...
  • 8-12 GB GDDR5 RAM, depending on the version, but mostly shared between CPU and GPU...
All in the same board... sharing GDDR5 RAM... this is like the holy grail of modern computing which could proliferate in some HPC environments such as CFD and FEM... and it is only being used for gaming. Really? Seriously??


What I expect in the near future
To me, the plan is simple, given that Moore's law gave out years ago due to it being hard to scale down lithography and that we are now reaching the smallest possible limit on how much a transistor can hold a charge without sneezing...
  1. Specialization: we are already seeing this in several fronts:
    1. ASICs were created for the bitcoin mining thingamabob... a clear sign of the future, even though it's a pain in the butt to code for... since we are coding the actual hardware... but that's how GPUs appeared in the first place and the AI-oriented tech coming out on current CPUs is the same kind of thing, as well as AVX tech et al.
    2. ARM and RISC CPUs, where trimming down hardware specs can help make CPUs run cooler and with less power on our precious smartphones and tablets...
    3. You can even design your own RISC CPU nowadays: https://www.youtube.com/watch?v=jNnCok1H3-g
  2. x86_64 needs to go past its primordial soup design and go all out in integration:
    1. 3D stacking of core groups, with liquid cooling running between stacks, because copper extraction is likely not enough.
    2. Intertwining GDDR RAM between those stacks.
    3. Cumulative memory channels should be non-uniform (NUMA-style), akin to AMD EPYC design.
    4. Essentially create a cluster within a single socket, which is essentially what an AMD EPYC nearly is...
Posted in Rantings

Comments

  1. Old Comment
    As I was revising what I wrote, I began rethinking what I read at https://en.wikipedia.org/wiki/Eighth...les#Comparison - for the PS4 memory specs, and I quote:
    Quote:
    • Original PS4:
      • 8 GB GDDR5 RAM @ 1375 MHz (5500 MHz effective) (176.0 GB/s)
      • 4.5-5.5 GB (flexible memory) available for games
    • PS4 Pro
      • 8 GB GDDR5 RAM @ 1700 MHz (6800 MHz effective) (217.6 GB/s)
    Which means that GDDR5 is being used as a 4 channel group... at least here...

    But then I look at the wikipedia page for GDDR5: https://en.wikipedia.org/wiki/GDDR5_SDRAM - and they mention:
    Quote:
    Hynix 40 nm class "2 Gb" (2 × 1024³ bit) GDDR5 was released in 2010. It operates at 7 GHz effective clock-speed and processes up to 28 GB/s. "2 Gb" GDDR5 memory chips will enable graphics cards with 2 GB or more of onboard memory with 224 GB/s or higher peak bandwidth.
    But after reading the initial introduction section, it looks like the effective clock is in fact the addition of 3 clocks:
    • 1x Command Clock CK at 1.25 or 1.375 GHz
    • 2x Write Clock WCK at 2.5 or 2.75 GHz (each)
    So becoming:
    • 2×2.5 + 1.25 = 6.25 GHz
    • 1.375 + 2×2.75 = 6.875 GHz
    OK... so... it's in fact 3 channels? But the specs on the PS4 imply 4 channels...

    Anyway, if my understanding of this is correct, then the 6-8 GHz specs that the GPU cards keep advertising aren't all that great, when compared to the 8x2.6GHz that the EPYC CPUs have nowadays...
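
One way to sanity-check the quoted figures is to assume that peak bandwidth = effective transfer rate × bus width / 8, and solve for the bus width. That relation is an assumption on my part about how the quoted GB/s values were derived, and the helper name `implied_bus_bits` is made up for this sketch; only the input numbers come from the two quotes.

```python
def implied_bus_bits(bandwidth_gb_s: float, effective_mhz: float) -> float:
    """Bus width in bits implied by a peak bandwidth and an effective rate.

    Assumes bandwidth = effective rate * bus width / 8 (bits to bytes).
    """
    bits_per_s = bandwidth_gb_s * 1e9 * 8
    transfers_per_s = effective_mhz * 1e6
    return bits_per_s / transfers_per_s


# Figures from the console comparison quote.
ps4 = implied_bus_bits(176.0, 5500)
ps4_pro = implied_bus_bits(217.6, 6800)

# Figures from the GDDR5 quote: one chip, then a whole card.
hynix_chip = implied_bus_bits(28.0, 7000)
hynix_card = implied_bus_bits(224.0, 7000)

print(f"PS4:        {ps4:.0f} bit")
print(f"PS4 Pro:    {ps4_pro:.0f} bit")
print(f"Hynix chip: {hynix_chip:.0f} bit")
print(f"Hynix card: {hynix_card:.0f} bit")
```

If that relation holds, the 4x factor between 1375 MHz and 5500 MHz "effective" would be a per-pin data-rate multiplier rather than a channel count, and the total bandwidth would come from how wide the bus is. Treat this as an interpretation of the quoted numbers, not something either page states in those words.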


    But still, having RAM embedded into the motherboard or on a stacked die would allegedly reduce latency by quite a bit, as in graphics cards...
    Posted January 6, 2019 at 16:01 by wyldckat