Secrets of the PS4: Heavily modified Radeon, supercharged APU design(Part 1)

PS4 concept art
For months, there have been rumors that the PS4 wouldn’t just be more powerful than Microsoft’s upcoming Xbox — it would be capable of certain compute workloads that Redmond’s next-generation console wouldn’t be able to touch. In an interview last week, Sony lead hardware architect Mark Cerny shed some major light on what these capabilities look like. We’ve taken his comments and combined them with what we’ve learned from other sources to build a model of how the PS4 is likely organized, and what it can do.

First, we now know the PS4 is a single system on a chip (SoC) design. According to Cerny, all eight CPU cores, the GPU, and a number of other custom units are all on the same die. Typically, when we talk about SoC design, we distinguish between “on-die” and “on-package.”
Wii U's package
Components are on-package if they’re part of a finished processor but aren’t fabbed in a single unit. The Wii U, for example, has the CPU and GPU on-package, but not on-die. Building the entire PS4 in a monolithic die could cut costs long-term and improve performance, but is riskier in the short-term.

An overhauled GPU

According to Cerny, the GPU powering the PS4 is an ATI Radeon with “a large number of modifications.” From the GPU’s perspective, the large RAM pool doesn’t count as innovative. The PS4 has a unified pool of 8GB of RAM, but AMD’s Graphics Core Next GPU architecture (hereafter abbreviated GCN) already ships with 6GB of GDDR5 aboard workstation cards. The biggest change to the graphics processor is Sony’s modification to the command processor, described as follows:
The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands — the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that’s in the system.
That’s a fairly bold statement. Let’s look at the relevant portion of the HD 7970′s structure:
AMD GCN front-end
Here, you can see the Asynchronous Compute Engines and the GPU Command Processor. AMD has always said that it could add more Asynchronous Compute Engine blocks to this structure to facilitate a greater degree of parallelization, but I think Cerny mixed his apples and oranges here, possibly on purpose. First, he refers to specific hardware blocks, then segues into discussing queue depths. AMD released a different slide in its early GCN unveils that may shed some additional light on this topic.
Each ACE can fetch queue information from the Command Processor and can switch between asynchronous compute tasks depending on what’s coming next. GCN was designed with some support for out-of-order processing, and it sounds as though Sony has expanded the chip’s ability to monitor and schedule how tasks are executed. It’s entirely possible that Sony has added additional ACEs to GCN to support a greater amount of asynchronous computing capability, but simply stuffing the front of the chip with 61 additional ACEs wouldn’t magically make more execution resources available.
Now we turn our attention to the memory architecture. We know the PS4 uses a 256-bit memory bus and Cerny specifies 176GB of bandwidth. That works out to a GDDR5 clock speed of 1375MHz, which is comfortably within the current range of GDDR5 products already on the market. We’ve put together a set of what we consider to be the top three most likely structures, their strengths, and their weaknesses.

Option 1: A supercharged APU-style design

AMD has published a great deal of information on Llano and Trinity’s APU design. Llano and Trinity share a common structure that looks like this:
AMD APU diagram
In Llano and Trinity, the CPU-GPU communication path varies a great deal depending onwhich kind of data is being communicated. The solid line (Onion) is a lower-bandwidth bus (2x16B) that allows the GPU to snoop the CPU cache. The dotted lines are the Radeon Memory Bus (Garlic). This is a direct link between the GPU and the Unified North Bridge architecture, which contains the memory controller.