【AMD Zen 微架构解析 Part1：双调度器，微指令缓存以及缓存架构】AMD Zen Microarchitecture: Dual Schedulers, Micro-Op Cache and Memory Hierarchy Revealed
In their own side event this week, AMD invited select members of the press and analysts to come and discuss the next layer of Zen details. In this piece, we’re discussing the microarchitecture announcements that were made, as well as a look to see how this compares to previous generations of AMD core designs.
Prediction, Decode, Queues and Execution【分支预测、解码、队列以及执行】
First up, let’s dive right into the block diagram as shown:
If we focus purely on the left to start, we can see most of the high-level microarchitecture details including basic caches, the new inclusion of an op-cache, some details about decoders and dispatch, scheduler arrangements, execution ports and load/store arrangements. A number of slides later in the presentation talk about cache bandwidth.
Firstly, one of the bigger deviations from previous AMD microarchitecture designs is the presence of a micro-op cache (it might be worth noting that these slides sometimes say op when it means micro-op, creating a little confusion). AMD’s Bulldozer design did not have an operation cache, requiring it to fetch details from other caches to implement frequently used micro-ops. Intel has been implementing a similar arrangement for several generations to great effect (some put it as a major stepping stone for Conroe), so to see one here is quite promising for AMD. We weren’t told the scale or extent of this buffer, and AMD will perhaps give that information in due course.
Aside from the as-expected ‘branch predictor enhancements’, which are as vague as they sound, AMD has not disclosed the decoder arrangements in Zen at this time, but has listed that they can decode four instructions per cycle to feed into the operations queue. This queue, with the help of the op-cache, can deliver 6 ops/cycle to the schedulers. The reasons behind the queue being able to dispatch more per cycle is if the decoder can supply an instruction which then falls into two micro-ops (which makes the instruction vs micro-op definitions even muddier). Nevertheless, this micro-op queue helps feed the separate integer and floating point segments of the CPU. Unlike Intel who uses a combined scheduler for INT/FP, AMD’s diagram suggests that they will remain separate with their own schedulers at this time.
The INT side of the core will funnel the ALU operations as well as the AGU/load and store ops. The load/store units can perform 2 16-Byte loads and one 16-Byte store per cycle, making use of the 32 KB 8-way set associative write-back L1 Data cache. AMD has explicitly made this a write back cache rather than the write through cache we saw in Bulldozer that was a source of a lot of idle time in particular code paths. AMD is also stating that the load/stores will have lower latency within the caches, but has not explained to what extent they have improved.
【整数部分包括ALU、AGU以及LS操作。LS单元每周期可以执行2次16字节的load以及1次16字节的store操作，利用32KB 8路组相连 回写式L1数据缓存。AMD明确说明这是回写式缓存，而不是推土机上的穿透式缓存（在一定条件下会带来大量的闲置时间）。AMD声称缓存内的LS操作延迟会更低，但没再做进一步说明。】
The FP side of the core will afford two multiply ports and two ADD ports, which should allow for two joined FMAC operations or one 256-bit AVX per cycle. The combination of the INT and FP segments means that AMD is going for a wide core and looking to exploit a significant amount of instruction level parallelism. How much it will be able to depends on the caches and the reorder buffers ? no real data on the buffers has been given at this time, except that the cores will have a +75% bigger instruction scheduler window for ordering operations and a +50% wider issue width for potential throughput. The wider cores, all other things being sufficient, will also allow AMD’s implementation of simultaneous multithreading to potentially take advantage of multiple threads with a linear and naturally low IPC.
【每核心浮点部分包括两个乘法端口，两个ADD端口，每周期能够执行两条捆绑的FMAC命令或者一条256bit AVX。把整数和浮点部分合起来看，Zen核心在指令级并行上将会有很大提升 – 提升多少取决于缓存和重排序缓存 – 这次没给出ROB的具体数据，只说排序操作的指令调度窗口将会增大75%，发射宽度提升50%。即便是天生IPC就低的AMD处理器，核心并行性越好，其他的方面就有效率多了，这也使得这次用的SMT在多线程上占得先机。】
Deciphering the New Cache Hierarchy
The cache hierarchy is a significant deviation from recent previous AMD designs, and most likely to its advantage. The L1 data cache is both double in size and increased in associativity compared to Bulldozer, as well as being write-back rather than write-through. It also uses an asymmetric load/store implementation, identifying that loads happen more often than stores in the critical paths of most work flows. The instruction cache is no longer shared between two cores as well as doubling in associativity, which should decrease the proportion of cache misses. AMD states that both the L1-D and L1-I are low latency, with details to come.
The L2 cache sits at half a megabyte per core with 8-way associativity, which is double that of Intel’s Skylake which has 256 KB/core and is only 4-way. On the other hand, Intel’s L3/LLC on their high-end Skylake SKUs is at 2 MB/core or 8 MB/CPU, whereas Zen will feature 1 MB/core and both are at 16-way associativity.
【L2缓存变成了每核心512KB，8路相连，这是Intel Skylake上256kb 4路关联的两倍。另一方面，Intel的L3在高端Skylake i7上是每核心2MB，每CPU8MB，在Zen上则是每核心1MB，这两者都是16路关联。】
Edit 7:18am: Actually, the slide above is being slightly evasive in its description. It doesn’t say how many cores the L3 cache is stretched over, or if there is a common LLC between all cores in the chip. However, we have recieved information from a source (which can’t be confirmed via public AMD documents) that states that Zen will feature two sets of 8MB L3 cache between two groups of four cores each, giving 16 MB of L3 total. This would means 2 MB/core, but it also implies that there is no last-level unified cache in silicon across all cores, which Intel has. The reasons behind something like this is typically to do with modularity, and being able to scale a core design from low core counts to high core counts. But it would still leave a Zen core with the same L3 cache per core as Intel.
【实际上上面的ppt在描述上有点暧昧。没有说多少核心共享8M L3，更没说是否每颗芯片上的所有核心都是共享同一个L3的。然而我们从一个消息来源获得的信息（在AMD官方ppt上找不到的）表明，Zen的8核芯片上是4个核心为一个簇，每个簇4个核心共享8M L3，8核芯片有两组8MB，共16MB的 L3。这样的话就是每核心2MB，但这也说明了Zen的L3不是完全共享的，然而Intel的是完全共享的。这样做的原因估计和模块化有点关系，通过增加这样的模块可以做出从4核心直到32核心，但Zen的每核心L3和Intel的依然都是每核心2MB（没有任何优势)】
What this means, between the L2 and the L3, is that AMD is putting more lower level cache nearer the core than Intel, and as it is low level it becomes separate to each core which can potentially improve single thread performance. The downside of bigger and lower (but separate) caches is how each of the cores will perform snoop in each other’s large caches to ensure clean data is being passed around and that old data in L3 is not out-of-date. AMD’s big headline number overall is that Zen will offer up to 5x cache bandwidth to a core over previous designs.
【这也意味着，AMD的L1和L2比Intel更大、延迟更低。而且L1、L2距离核心更近，还是每核心独立的，在单线程性能上会有显著提升。但更大的独立L1/L2带来的坏处是，每个核心都要监听其他核心的缓存，确保 1.传递的是干净数据、2. L3上的原数据不过期。AMD给出的总体数字是，Zen在缓存带宽上是前代的5倍。】
Low Power, FinFET and Clock Gating
When AMD launched Carrizo and Bristol Ridge for notebooks, one of the big stories was how AMD had implemented a number of techniques to improve power consumption and subsequently increase efficiency. A number of those lessons have come through with Zen, as well as a few new aspects in play due to the lithography.
First up is the FinFET effect. Regular readers of AnandTech and those that follow the industry will already be bored to death with FinFET, but the design allows for a lower power version of a transistor at a given frequency. Now of course everyone using FinFET can have a different implementation which gives specific power/performance characteristics, but Zen on the 14nm FinFET process at Global Foundries is already a known quantity with AMD’s Polaris GPUs which are built similarly. The combination of FinFET with the fact that AMD confirmed that they will be using the density-optimised version of 14nm FinFET (which will allow for smaller die sizes and more reasonable efficiency points) also contributes to a shift of either higher performance at the same power or the same performance at lower power.
【首先就是FinFET。虽然大部分的人都已经熟悉FinFET到吐了，但我们还是要介绍一下。FinFET设计能在给定频率下设计出晶体管的低功耗版本。每个FinFET代工厂给出的技术指标都不同，但Zen用的GF 14nm技术和Polaris GPU的不会差太多，这意味着AMD使用的是14nm的追求密度版本，能在同等功耗下达成更高性能，或者低功耗下达成同等性能。】
AMD stated in the brief that power consumption and efficiency was constantly drilled into the engineers, and as explained in previous briefings, there ends up being a tradeoff between performance and efficiency about what can be done for a number of elements of the core (e.g. 1% performance might cost 2% efficiency). For Zen, the micro-op cache will save power by not having to go further out to get instruction data, improved prefetch and a couple of other features such as move elimination will also reduce the work, but AMD also states that cores will be aggressively clock gated to improve efficiency.
We saw with AMD’s 7th Gen APUs that power gating was also a target with that design, especially when remaining at the best efficiency point (given specific performance) is usually the best policy. The way the diagram above is laid out would seem to suggest that different parts of the core could independently be clock gated depending on use (e.g. decode vs FP ports), although we were not able to confirm if this is the case. It also relies on having very quick (1-2 cycle) clock gating implementations, and note that clock gating is different to power-gating, which is harder to implement.
On Zen, each core will be able to support two threads in what is called ‘simulatenous multi-threading’. Intel has supported their version of SMT for a number of years, and other CPU manufacturers like IBM support up to 8 threads per core on their POWER8 platform designs. Building a core to be able to use multiple threads can be tough, as it requires a lot of resources to make sure that the threads do not block each other by consuming all the cache and buffers in play. But AMD will equip Zen with SMT which means we will see 8C/16T parts hitting the market.
Unlike Bulldozer, where having a shared FP unit between two threads was an issue for floating point performance, Zen’s design is more akin to Intel’s in that each thread will appear as an independent core and there is not that resource limitation that BD had. With sufficient resources, SMT will allow the core instructions per clock to improve, however it will be interesting to see what workloads will benefit and which ones will not.
Timeframe and Availability【时间表、供货日期】
At the presentation, it was given that Zen will be available in volume in 2017. As the AM4 platform will share a socket with Bristol Ridge, users are likely to see Bristol Ridge systems from AMD’s main OEM partners, like Dell and others, enter the market before separate Zen CPUs will hit the market for DIY builders. It’s a matter of principle that almost no consumer focused semiconductor company releases a product for the sale season, and Q1 features such events as CES, which gives a pretty clear indication of when we can expect to get our hands on one.
【在ppt上写着Zen将会在17年大量出货。由于AM4平台上Summit Ridge和Bristol Ridge使用同样的插槽，可能我们能从AMD的OEM们那里先见到Bristol Ridge进入市场。没有哪个主卖消费级产品的半导体厂商会在年末清仓季发布新品，而第一季度会有CES之类的大型展会，那时候我们肯定能拿到手。】
It’s worth noting that AMD said that as we get closer to launch, further details will come as well as deeper information about the design. It was also mentioned that the marketing strategy is also currently being determined, such that Zen may not actually be the retail product name for the line of processors (we already have Summit Ridge as the platform codename, but that could change for retail as well).
AMD has gone much further into their core design than I expected this week. When we were told we had a briefing, and there were 200-odd press and analysts in the room, I was expecting to hear some high level puff about the brand and a reiteration of their commitment to the high end. To actually get some slides detailing parts of the microarchitecture, even at a basic cache level, was quite surprising and it somewhat means that AMD might have stolen the show with the news this week.