Secrets of the PS4: Heavily modified Radeon, supercharged APU design (Part 2)

In the Gamasutra interview, Cerny states the following: “we added another bus to the GPU that allows it to read directly from system memory or write directly to system memory… As a result, if the data that’s being passed back and forth between CPU and GPU is small, you don’t have issues with synchronization between them anymore… We can pass almost 20 gigabytes a second down that bus. That’s not very small in today’s terms — it’s larger than the PCIe on most PCs!”
That sounds almost exactly like Garlic, with some additional HSA features baked in. Remember, the point of HSA is to allow the CPU and GPU to share a common set of pointers and swap data more efficiently. This suggests that the PS4’s interconnect structure looks something like this:
PS4's APU design
This simplified structure shows the GPU with the lion’s share of access to memory bandwidth. Both the Onion and Garlic interfaces are faster than they were in Llano, and they’re tied to much faster memory, but they function in the same basic way. This is the most logical design based on what AMD has done before, it incorporates the direct memory bus that Cerny discusses, and it would be the easiest system for AMD to design given the firm’s limited resources.
The disadvantage is that it’s not particularly efficient. This table of available CPU-GPU bandwidth in Llano, broken down by the type of operation being conducted, shows the problem:
AMD Llano's cache bandwidth
Much of the anecdotal information on the PS4 suggests that the chip is designed for a much greater degree of sharing than Piledriver/Llano. We suspect that Option #1 integrates HSA-like features with an APU-style design. AMD would have had to make a number of improvements to bring these various capabilities up to more uniform bandwidth and latency, but these were improvements the company was planning to make with HSA in any case.
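As a rough check on Cerny’s comparison with PCIe, the sketch below computes the theoretical one-direction bandwidth of an x16 link from the PCIe 2.0 and 3.0 signaling rates and line encodings. The function name and the 20GBps constant (taken from the "almost 20 gigabytes a second" in the quote) are illustrative assumptions, and real-world throughput is lower than these peak figures.

```python
# Hypothetical helper: theoretical peak bandwidth of a PCIe link, one direction.
def pcie_peak_gbps(gt_per_s, payload_bits, encoded_bits, lanes=16):
    """Return peak bandwidth in GB/s (decimal) for one direction of the link.

    gt_per_s      -- raw signaling rate per lane, in gigatransfers per second
    payload_bits  -- data bits carried per encoded symbol group
    encoded_bits  -- bits actually sent on the wire per symbol group
    """
    bits_per_s_per_lane = gt_per_s * payload_bits / encoded_bits
    return bits_per_s_per_lane / 8 * lanes

# Assumption: "almost 20 gigabytes a second" rounded up to 20 for comparison.
PS4_DIRECT_BUS_GBPS = 20.0

links = {
    "PCIe 2.0 x16": pcie_peak_gbps(5.0, 8, 10),     # 8b/10b encoding
    "PCIe 3.0 x16": pcie_peak_gbps(8.0, 128, 130),  # 128b/130b encoding
}

for name, gbps in links.items():
    verdict = "slower" if gbps < PS4_DIRECT_BUS_GBPS else "faster"
    print(f"{name}: {gbps:.2f} GB/s peak, {verdict} than the quoted PS4 bus")
```

Under these assumptions, both common x16 configurations of the time come in under the quoted figure, which is consistent with Cerny’s remark.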

Option 2: Hearkening back to R600 and a modern ring bus

AMD could have opted for a ring bus. Ring buses are great for joining multiple components in a high-bandwidth, low-latency configuration where data is shared across multiple elements. Intel uses a ring bus for Sandy Bridge and Ivy Bridge, and AMD’s first unified-shader desktop GPU (R600) used one as well. The advantage of a ring bus is that it’d be simple. Not every component needs the same amount of memory bandwidth (the estimated 176GBps of memory bandwidth would be wasted on the CPUs), so you end up with 20GBps of bandwidth for the CPU cores and 176GBps of bandwidth for the GPU.
PS4 ringbus
Sony has some experience with ring buses (the PS3’s Cell architecture used one to manage communication between the various processing elements), but we don’t think this is a likely approach for the PS4. There’s no particular problem that a ring bus would solve, and no specific use-case that strongly suggests AMD would adopt one. Intel has used a ring bus in Sandy Bridge and Ivy Bridge, but the GPUs in those chips are tiny compared to the 18 CU design that’s built into the PlayStation 4.
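To put the 20GBps and 176GBps figures discussed in this section in per-frame terms, the following sketch divides each bandwidth by a frame rate. The 60 frames-per-second target and all names are assumptions made for illustration; the bandwidth numbers are the estimates quoted above, not measured values.

```python
# Estimated bandwidths quoted in the article, in GB/s (decimal).
BANDWIDTH_GBPS = {"CPU cores": 20.0, "GPU": 176.0}

# Assumption: a 60 fps target, chosen only to make the arithmetic concrete.
ASSUMED_FPS = 60

def mb_per_frame(gbps, fps=ASSUMED_FPS):
    """Upper bound on data moved per frame, in MB, at the given bandwidth."""
    return gbps * 1000.0 / fps

budget = {name: mb_per_frame(gbps) for name, gbps in BANDWIDTH_GBPS.items()}
ratio = BANDWIDTH_GBPS["GPU"] / BANDWIDTH_GBPS["CPU cores"]

for name, mb in budget.items():
    print(f"{name}: at most {mb:.0f} MB per frame at {ASSUMED_FPS} fps")
print(f"GPU-to-CPU bandwidth ratio: {ratio:.1f}x")
```

These are theoretical ceilings. They illustrate the asymmetry between the CPU and GPU paths rather than predict what a game would actually move each frame.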
