AMD Polaris Architecture Whitepaper: A Partial Analysis

Polaris Architecture whitepaper: http://radeon.com/polaris-architecture-whitepaper/

PDF version: http://radeon.wpengine.netdna-cd … -Final-08042016.pdf






RX480 GPU, Polaris 10



Underlying the Polaris architecture is the choice of process technology, which determines what is physically possible. Active (or dynamic) power consumption increases linearly with the number of computational units, but cubically when boosting frequency through higher voltage (e.g., 15% higher frequency and voltage increases power consumption by 52%). As a result, graphics processors tend to prefer lower frequencies and use greater density to deploy more computational units that operate in parallel. For the last five years, graphics processors have relied on 28nm high-k/metal nodes (see fig. 2).

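【The foundation of the Polaris architecture is the process node, which determines what is physically possible. Dynamic power grows linearly as compute units are added, but cubically when frequency is raised by increasing voltage (for example, raising frequency and voltage by 15% increases power by 52%). GPUs therefore generally run at lower frequencies and use high-density processes to fit more parallel compute units. For the past five years, GPUs have all used 28nm HKMG processes.】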
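The 52% figure can be checked with a few lines of arithmetic. The sketch below assumes the usual first-order CMOS model, in which dynamic power is proportional to the number of units times voltage squared times frequency. The function name and its parameters are made up for illustration and do not come from the whitepaper.

```python
def relative_dynamic_power(units=1.0, freq=1.0, voltage=1.0):
    """Dynamic power relative to a baseline, assuming P ~ N * V^2 * f."""
    return units * voltage ** 2 * freq

# 15% higher frequency obtained with 15% higher voltage: 1.15 ** 3
boost = relative_dynamic_power(freq=1.15, voltage=1.15)
print(f"frequency/voltage boost: +{(boost - 1) * 100:.0f}% power")

# 15% more compute units at unchanged frequency and voltage
wider = relative_dynamic_power(units=1.15)
print(f"more compute units:      +{(wider - 1) * 100:.0f}% power")
```

Under this model, the same 15% of extra throughput costs about 52% more power when bought with clock speed, but only about 15% more when bought with additional units, which is why GPUs go wide rather than fast.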

Figure 2: Evolution of process nodes utilized by Radeon™ graphics since 2005.


For Polaris GPUs, AMD selected Global Foundries’ 14nm FinFET-based process technology, which is the densest foundry GPU process available for production to date. FinFET transistors are crucial to reducing power consumption and enable operating voltages that are 150mV lower than the previous generation, thereby cutting active power by approximately 30% from a 1V baseline.

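【For Polaris GPUs, AMD chose GF's 14nm FinFET process, currently the densest GPU process available. FinFETs are crucial for reducing power: compared with the previous generation, the 14nm FinFET operating voltage is 150mV lower, which cuts active power by roughly 30%.】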
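The "approximately 30%" can be sanity-checked the same way. This minimal sketch assumes active power scales with the square of the operating voltage at a fixed frequency; the function name is hypothetical, and the only inputs are the 1V baseline and the 150mV reduction quoted above.

```python
def active_power_ratio(v_new, v_old):
    """Ratio of active power after a voltage change, assuming P ~ V^2 at fixed frequency."""
    return (v_new / v_old) ** 2

ratio = active_power_ratio(v_new=1.0 - 0.150, v_old=1.0)
saving = 1 - ratio
print(f"active power saving: {saving * 100:.1f}%")
```

The simple model gives a saving a little under 28%, which is in line with the whitepaper's rounded figure.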


【FinFET process: key parameters】

Table 1 contains publicly available details on key dimensions for modern FinFET process nodes. For example, the table illustrates that the 14nm transistor spacing (i.e. contacted gate pitch) is approximately 15% smaller than TSMC 16nm spacing, while the SRAM used for caches and register files is 10% smaller. Overall, these process technology advantages translate into GPUs with more compute units, which allows for parallelism and better power efficiency.





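【Note: this is presumably intended for VR, similar to NVIDIA's VRWorks Audio. That technology borrows the idea of ray-traced rendering and takes the 3D scene into account: by mapping audio interactions onto the objects in the scene, it produces audio that is consistent with hearing and acoustics and sounds more natural, including direction and distance.】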

The second quality-of-service technique, compute unit reservation, is even more potent and general-purpose. As the name suggests, programmers can partition the execution resources of the Polaris GPU for compute tasks using API extensions. Specifically, compute units (CUs) in the shader array are reserved for a queue in one of the ACEs, ensuring dedicated resources are available for work-groups from the queue. This is a powerful tool for developers to avoid contention between multiple tasks.  


Figure 4 illustrates an example, where a queue for audio tasks is assigned 4 CUs, while the remaining compute units are available to all tasks. Partitioning the CUs ensures that audio tasks will have the lowest possible latency and jitter, although the CUs are no longer available for other tasks.


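First, the P10 GPU in this table has only 36 CUs and 2304 SPs in total (do the math!). What the passage above means is that when running VR, 4 CUs (256 SPs) may be used for VR audio while the remaining 32 CUs (2048 SPs) handle other tasks. It does not mean that the full chip has 40 CUs.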
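The arithmetic can be written down as a small sketch. It only models the bookkeeping of a CU reservation and is not AMD's API: `partition_cus`, the queue name "audio", and the "shared" label are all made up for illustration. The figure of 64 SPs per CU follows from 2304 / 36.

```python
SP_PER_CU = 64  # stream processors per compute unit (2304 SPs / 36 CUs)

def partition_cus(total_cus, reserved):
    """Split a GPU's CUs into reserved groups plus a shared pool.

    reserved maps a (hypothetical) queue name to the number of CUs set aside
    for it. Returns a dict mapping each group to (CU count, SP count).
    """
    used = sum(reserved.values())
    if used > total_cus:
        raise ValueError("cannot reserve more CUs than the GPU has")
    shared = total_cus - used
    plan = {name: (n, n * SP_PER_CU) for name, n in reserved.items()}
    plan["shared"] = (shared, shared * SP_PER_CU)
    return plan

plan = partition_cus(36, {"audio": 4})
for name, (cus, sps) in plan.items():
    print(f"{name}: {cus} CUs, {sps} SPs")
```

With 4 CUs reserved for audio, the plan is 4 CUs (256 SPs) for the audio queue and 32 CUs (2048 SPs) for everything else, which adds up to the 36 CUs of the full chip.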



In a similar vein, the Polaris geometry engines can detect triangles that have no area, and discard them during the input assembly stage. As vertex indices are read from the input buffer, the Polaris geometry engine will check if two or more vertices have the same coordinates (i.e., degenerate triangles). The degenerate triangles are culled before they are passed to the vertex shaders, which increases throughput by reducing the amount of work done and reducing the energy consumed. By eliminating the vertex fetches for degenerate triangles, Polaris can increase throughput by up to 3X for certain scenes. 
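【This is a front-end enhancement. As mentioned in an earlier post, Polaris adds a Primitive Discard Accelerator that removes useless (unrenderable or invisible) polygons early in the rendering pipeline, even before they reach the vertex shaders. AMD claims that the higher the MSAA level, the larger the gain, because MSAA increases the load these useless polygons place on the GPU. In certain scenes, throughput can rise by up to 3x.】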
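As a rough software analogy of the degenerate-triangle check described above, the sketch below walks an index buffer three indices at a time and drops any triangle in which two vertices share the same coordinates. It is an illustration only, not a description of how the hardware does it: `cull_degenerate` and the sample data are assumptions. It also catches only coincident vertices, not zero-area triangles whose three distinct vertices are collinear.

```python
def cull_degenerate(vertices, indices):
    """Return the triangles (as index triples) whose three vertices are all distinct.

    vertices: list of coordinate tuples; indices: flat list, 3 indices per triangle.
    """
    kept = []
    for i in range(0, len(indices) - 2, 3):
        a, b, c = indices[i:i + 3]
        pa, pb, pc = vertices[a], vertices[b], vertices[c]
        # Two vertices at the same position means the triangle has no area.
        if pa == pb or pb == pc or pa == pc:
            continue
        kept.append((a, b, c))
    return kept

vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 0, 0)]  # vertex 3 duplicates vertex 1
indices = [0, 1, 2,   # regular triangle
           0, 1, 1,   # repeated index
           0, 1, 3]   # different indices, same coordinates
print(cull_degenerate(vertices, indices))
```

Only the first triangle survives. The other two never reach the (imaginary) vertex shading stage, which is where the throughput and power savings come from.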



【Improvements to AMD's delta color compression: on Polaris, effective bandwidth increases by up to 35%】

The Polaris memory interface has been updated to both increase bandwidth and also operate more efficiently with compression (fig. 7). The Polaris render back-end is designed to compress color buffers to save power and more effectively use the available memory bandwidth. Delta color compression is a lossless algorithm that dynamically divides a color buffer into several blocks and was first deployed in 3rd-generation GCN solutions (e.g. GPUs codenamed “Tonga,” “Fiji,” and “Antigua”). A single pixel in each block is written using a normal representation and all other pixels in the block are encoded as a difference from the first value. The block size is dynamically chosen based on access patterns and the data patterns to maximize the benefits. The peak compression ratio is 8:1 for a 256-byte block. Since many objects have large patches of similar colors (e.g. clothing and cars), the delta color compression takes advantage of this locality to improve performance. While 3rd-generation GCN and Polaris use similar algorithms, Polaris is more aggressive and compresses even more blocks, thereby saving more bandwidth and power. 
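The "one pixel stored normally, the rest stored as differences" idea can be illustrated with a toy encoder. AMD's actual block format is not described here, so everything in this sketch is an assumption: the function names, the RGBA tuples, and the per-channel bit estimate. It only shows why blocks of similar colors compress well and that the scheme is lossless.

```python
def delta_encode(block):
    """Store the first pixel as-is and every other pixel as a per-channel difference."""
    base = block[0]
    deltas = [tuple(c - b for c, b in zip(px, base)) for px in block[1:]]
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original block exactly from the base pixel and the deltas."""
    return [base] + [tuple(b + d for b, d in zip(base, dl)) for dl in deltas]

def bits_needed(deltas):
    """Bits per channel needed to hold the largest signed delta (0 if all are zero)."""
    worst = max((abs(d) for dl in deltas for d in dl), default=0)
    return worst.bit_length() + 1 if worst else 0

# Four nearly identical RGBA pixels, e.g. a patch of red cloth
block = [(200, 30, 30, 255), (201, 30, 31, 255), (199, 29, 30, 255), (200, 30, 30, 255)]
base, deltas = delta_encode(block)
print(deltas)
print(bits_needed(deltas), "bits per channel instead of 8")
```

Because the deltas in a patch of similar colors are tiny, each one fits in far fewer bits than a full 8-bit channel, and decoding adds them back to the base pixel to recover the block exactly.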


The biggest savings come when the color buffer is read back for subsequent computations, i.e. render to texture mode. The Polaris shader cores can read and transparently decompress the compressed color data thereby saving read bandwidth in the memory and caches as well. 



In fact, the compression is so efficient that the Polaris architects were able to reduce the number of render back-ends. Polaris has a compact 256-bit memory interface that uses cost-effective GDDR5 memory, but delivers similar end-user performance to the GPU codenamed “Hawaii”, which used a 512-bit memory interface, all while consuming less power (fig. 8).
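To put the two bus widths in perspective, raw peak bandwidth is simply bus width times per-pin data rate. The excerpt does not quote the memory data rates, so the 8 Gbps (Polaris 10) and 5 Gbps (Hawaii) values below are assumptions taken from commonly cited board specifications, and the function name is made up for illustration.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Raw peak memory bandwidth in GB/s: (bus width in bytes) * (per-pin rate in Gbps)."""
    return bus_width_bits / 8 * data_rate_gbps

polaris = peak_bandwidth_gbs(256, 8.0)  # assumed 8 Gbps GDDR5
hawaii = peak_bandwidth_gbs(512, 5.0)   # assumed 5 Gbps GDDR5
print(f"Polaris 10: {polaris:.0f} GB/s, Hawaii: {hawaii:.0f} GB/s, ratio {polaris / hawaii:.2f}")
```

Under these assumed rates Polaris has less raw bandwidth than Hawaii, so the shortfall has to be covered by using each byte more efficiently, which is the role the whitepaper assigns to delta color compression.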



