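Chipsets aside, AMD intends to stick with the same 1331-pin socket and the same platform, AM4, until 2020, unless some new technology (such as PCIe 4.0 or DDR5) forces a change of pinout. Ryzen AM4 motherboards are compatible with the existing AM4 APUs, as well as with Raven Ridge, the Zen-based APUs due out this year. This is why almost all Zen motherboards include integrated display outputs, even though the current Ryzen processors can’t use them.
But whether to buy Zen is a complicated question, and it’s complicated for many of the same reasons that Bulldozer underwhelmed. For heavily multithreaded workloads, Zen is far more cost-effective than Broadwell-E and broadly its equal in performance. Sometimes Broadwell-E is a bit faster and sometimes Zen is a bit faster, though in workloads that depend particularly on AVX or memory bandwidth, Broadwell-E holds a commanding lead. This makes AMD’s $499 part a credible alternative to Intel’s $1,049 one.
But in other workloads, Intel’s superior single-threaded performance matters more. In games including Grand Theft Auto V, Battlefield 4, and Ashes of the Singularity, the Kaby Lake i7-7700K pulls ahead of the 1800X, even though it has only half as many cores and threads.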
Before the company’s new Zen offerings, it’s fair to say that AMD’s last attempt at building a performance desktop processor was not tremendously successful.
The Bulldozer core released in 2011 had a design that can, at best, be described as idiosyncratic. AMD made three bets with Bulldozer: that general purpose workloads would become increasingly multithreaded, that floating point intensive workloads would become increasingly GPU-driven, and that it would be able to aggressively scale clock speeds.
Accordingly, AMD created processors with oodles of simultaneous threads, relatively long pipelines, relatively narrow pipelines, and relatively few floating point resources. The idea was that clock speed and the GPU would make up for the narrowness and lack of floating point capability. AMD hoped all those threads would be working hard.
Each Bulldozer module could run two threads simultaneously, with two independent integer pipelines and one shared floating point pipeline within a module. Desktop processors shipped with two, three, or four modules for four, six, or eight threads in total. Compared to Bulldozer’s predecessor, the K10, each integer pipeline was narrow: two arithmetic logic units (ALUs) and two address generating units (AGUs), instead of three of each in K10. So, too, was the floating point pipeline, with two 128-bit fused multiply-add (FMA) units that could be paired together to perform a single 256-bit AVX FMA instruction. AMD designed the processor with a base clock speed goal of 4.4GHz.
Bulldozer? Bullshit, more like
None of AMD’s gambles paid off. The high-end desktop parts, with their four modules and eight threads, had an abundance of integer threads. But most consumer workloads still can’t be distributed evenly across eight threads. Single-threaded performance continues to matter a great deal. On the other hand, the sharing of the floating point units means that applications stuffed with floating point arithmetic have too few resources to work with. While GPU-based computing is important in certain workloads, such as scientific supercomputing, mainstream applications still require the CPU for floating point number crunching. Bulldozer leaves them short.
Even these issues might have been tolerable if the clock speed goals had been reached. A processor can get away with low instructions per cycle (IPC) if it runs at a high enough clock speed, but AMD came nowhere close to its 4.4GHz base goal. The top-end, four-module part had a base frequency of 3.6GHz. It could boost up to 4.2GHz under reduced workloads. This is a long way short of the design goal.
As a result, the first Bulldozer processors were in many workloads slower, yet more expensive, than their K10 predecessors. They were wholly uncompetitive with contemporaneous Intel parts.
AMD did iterate the design. The top-end second generation Bulldozer, named Piledriver, raised the base clock to 4.7GHz and the boost clock to 5.0GHz. Combined with some internal improvements, this made it about 40 percent faster than the top Bulldozer. This came at a power cost, however: to hit those clock speeds, the processor drew 220W, compared to 125W for the Bulldozer.
The third-generation Steamroller made improvements to IPC and gained about nine percent over Piledriver. Fourth-generation Excavator added as much as 15 percent more IPC over Steamroller. However, neither Steamroller nor Excavator was used in high-end desktop processors. The performance desktop space was ceded entirely to Intel.
AMD did use Steamroller and Excavator in some of its APUs (“Accelerated Processing Units,” which is to say, CPUs with integrated GPUs). But even in this space, the Bulldozer family has proved limiting. Mobile-oriented APUs in the 10-25W space only have one Excavator module (two threads). Their performance is substantially lower than that of Intel’s chips in the same power envelope, and Intel manages to squeeze four threads (albeit only on two cores) onto its low-power processors.
And to reach the ultra-low-power 3-7W space, AMD offers nothing at all with a Bulldozer-family core. The company has had chips operating in very low-power envelopes, but these have all used derivatives of the Bobcat core. This is a completely different processor design, developed for mobile and low-power operation. Bobcat derivatives are also used in the PlayStation 4, PlayStation 4 Pro, Xbox One, and Xbox One S.
By comparison, Intel’s designs run the gamut (albeit with staggered release schedules); its Broadwell design ranges from two-core, four-thread mobile parts with a power draw of as little as 3.5W, up to 22-core, 44-thread server chips drawing 145W (or higher clocked 12-core, 24-thread parts drawing 160W).
Time for something new
By 2013, AMD had realized that Bulldozer was never going to be the processor that the company wanted it to be. A new architecture was necessary. AMD had a few particular goals for this: the new architecture had to be a viable challenger in the high-end desktop market, and it had to offer at least 40 percent better IPC than Excavator.
Like Intel before it, AMD wanted its new design to span the full range from fanless mobile through server and high-end desktop. So this improved IPC needed to be wedded to improved power efficiency. But AMD isn’t giving up on the Bulldozer ideas completely: the company still believes that high numbers of simultaneous threads are the future, and some of the design decisions suggest that AMD still sees GPUs as being central to serious floating point number crunching.
Four years in the making, the Zen core is the result of this new approach. And where Bulldozer failed to meet its objectives, AMD says that Zen has soundly beaten its 40 percent IPC improvement goal. In the single-threaded Cinebench R15 benchmark at a constant 3.4GHz, Zen achieves a score 58 percent higher than Excavator and 76 percent higher than Piledriver. The typical IPC improvement, when compared to Excavator, is around 52 percent. It does this at significantly lower power draw, too: in multithreaded Cinebench R15, the performance per watt is more than double what it was for Piledriver.
Compared to the Bulldozer family, Zen is so much better across the board that it makes for an interesting, if uneven, competitor to what Intel is offering. Years have passed since AMD could even hope to be considered a performance rival to its much larger competitor, but with Zen, AMD finally has an architecture that can compete.
What makes it tick
The basic building block of Zen is the Core Complex (CCX): a unit containing four cores, capable of running eight simultaneous threads. True to AMD’s belief in high core and thread counts for desktop processors, the first Ryzen 7 series processors include two CCXes, for a total of eight cores and 16 threads. Three versions are launching: the 1800X, a 3.6-4.0GHz part at $499/£490; the 1700X, 3.4-3.8GHz at $399/£390; and the $329/£320 1700 at 3.0-3.7GHz.
In the second quarter, these will be joined by Ryzen 5. The R5 1600X will be a six-core, 12-thread chip running at 3.6-4.0GHz (two CCXes, with one core from each disabled), and the 1500X will be a four-core, eight-thread chip at 3.5-3.7GHz (just a single CCX).
Zen scales up, too. At some point this year, AMD will launch server processors, codenamed “Naples,” containing eight CCXes for a total of 32 cores and 64 threads.
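The core and thread counts quoted for each part follow directly from the CCX arrangement. Here is a minimal sketch of that arithmetic, assuming four cores per CCX and two threads per core as described above; the function name and the `disabled_per_ccx` parameter are illustrative only, not anything AMD publishes.

```python
CORES_PER_CCX = 4      # from the text: four cores per Core Complex
THREADS_PER_CORE = 2   # simultaneous multithreading: two threads per core


def zen_config(ccx_count, disabled_per_ccx=0):
    """Return (cores, threads) for a part built from `ccx_count` CCXes.

    `disabled_per_ccx` is a hypothetical knob for parts that ship with
    some cores in each CCX switched off.
    """
    cores = ccx_count * (CORES_PER_CCX - disabled_per_ccx)
    return cores, cores * THREADS_PER_CORE


print(zen_config(2))      # Ryzen 7: (8, 16)
print(zen_config(2, 1))   # Ryzen 5 1600X: (6, 12)
print(zen_config(1))      # Ryzen 5 1500X: (4, 8)
print(zen_config(8))      # Naples: (32, 64)
```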
This design decision already sets AMD apart from Intel. Intel’s processor range is strangely bifurcated. The company’s latest core is Kaby Lake, but Kaby Lake is only available in two- and four-core versions, some with simultaneous multithreading (SMT), others without. To go beyond four cores, you have to switch to a previous-generation architecture: Broadwell.
Broadwell, a 14nm shrink of the previous Haswell architecture, was first introduced in September 2014. Currently, every processor that’s “bigger” than a four-core, eight-thread mainstream desktop or mobile part is built using the Broadwell core. This includes not just the enthusiast-oriented Broadwell-E parts that offer six, eight, or 10 cores and 12, 16, or 20 threads. It also includes Broadwell-EP server parts, right up to the Xeon E7-8894V4 that was launched just two weeks ago. This is an 8-socket-capable 24-core, 48-thread chip that won’t give you much change from $9,000.
These first Ryzen processors straddle that discontinuity in the Intel line-up. The R7 1700 is going more or less head to head with the Kaby Lake i7-7700K. The latter uses Intel’s refined 14nm process and newest architecture with the best single-threaded performance, running at 4.2-4.5GHz. But the 1700X and 1800X are competing against Broadwell designs: the six-core, 12-thread 3.6-3.8GHz i7-6850K ($620/£580 or so) and the eight-core, 16-thread 3.2-3.7GHz i7-6900K ($1,049/£1,000 or so), respectively. In upping the core count, Intel forces you to sacrifice its latest core and most power-efficient process, and to switch to older, less frequently updated chipsets to boot (the current X99 dates back to late 2014).
Bigger, beefier cores
Zen core block diagram.
Each of those new cores is equipped with many more execution resources than Bulldozer. On the integer side, Zen has four ALUs and two AGUs. On the floating point side, the shared floating point unit concept has been scrapped: each core now has a pair of 128-bit FMA units of its own. The floating point units are organized as separate add and multiply pipes to handle a more diverse instruction mix when not performing multiply-accumulate operations. But 256-bit AVX instructions have to be split up across the two FMA units and tie up all the floating point units.
This is a big step up from Bulldozer, essentially doubling the integer and floating point resources available to each core. Compared to Broadwell and Skylake, however, things are murkier. AMD’s four ALUs are similar to each other though not identical; some instructions have to be processed on a particular unit (only one has a full multiplier, only one has a divider), and they can’t be run on other units even if they’re available. Intel’s are a bit more diverse, so for some instruction mixes, Intel’s four ALUs may effectively offer less than AMD’s.
Complicating this further, AMD says that six instructions total can be dispatched per cycle across the 10 pipelines (four ALU, two AGU, four FP) in the core. Broadwell and Skylake can both issue eight instructions per cycle. Four of those go to AGUs; Skylake has two general-purpose AGUs and two more specialized ones. The other four perform arithmetic of some kind, either integer or floating point.
Intel groups its functional units behind four dispatch ports, numbered 0, 1, 5, and 6. All four ports include a regular integer ALU, but port 0 also has an AVX FMA unit, a divider, and a branch unit. Port 1 has a second AVX FMA unit but no divider. Ports 5 and 6 have neither an FMA nor a divider. This means that in one cycle, the processor can schedule two AVX FMA operations or one divide and one AVX FMA. But the processor can’t do a divide and two FMAs.
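The port constraint is easier to see as a small matching problem. The sketch below models only the units named in the paragraph above (it is not a complete description of Intel’s ports) and asks whether a set of operations can each be given a port of its own in a single cycle. The `PORTS` table and `can_dispatch` function are illustrative names.

```python
from itertools import permutations

# Only the units named in the text; real ports host more than this.
PORTS = {
    0: {"alu", "fma", "div", "branch"},
    1: {"alu", "fma"},
    5: {"alu"},
    6: {"alu"},
}


def can_dispatch(ops):
    """Return True if every op in `ops` can be given its own port this cycle."""
    if len(ops) > len(PORTS):
        return False
    # Try every way of handing distinct ports to the ops, in order.
    for assignment in permutations(PORTS, len(ops)):
        if all(op in PORTS[port] for op, port in zip(ops, assignment)):
            return True
    return False


print(can_dispatch(["fma", "fma"]))          # two FMAs: ports 0 and 1
print(can_dispatch(["div", "fma"]))          # divide on port 0, FMA on port 1
print(can_dispatch(["div", "fma", "fma"]))   # the divide and one FMA both need port 0
```

Brute force is fine here because there are only four ports; a real scheduler would not search permutations.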
In principle, this means that in one cycle, Zen could dispatch four integer operations and two floating point operations. Skylake could dispatch four integer operations, but this would tie up all four ports, leaving it unable to also dispatch any floating point operations. On the other hand, Skylake and Broadwell could both dispatch four integer operations and four address operations in a single cycle. Zen would only manage two address operations.
Bulldozer’s foibles aren’t quite a thing of the past
Isolating these differences to measure the impact of the designs is nigh impossible. Nonetheless, there are a few areas where the advantage of one design or the other is clear. Both of Intel’s chips have two full 256-bit AVX FMA units that can be used simultaneously. For code that can take advantage of this, Skylake and Broadwell’s performance should easily double Zen’s. For many years, AMD has been pushing the GPU as the best place to perform this kind of floating point-intensive parallel workload. So on some level this discrepancy makes sense, but programs that depend heavily on AVX instructions are likely to strongly favor the Intel chips.
This becomes visible in, for example, one of Geekbench’s floating point tests, SGEMM. This is a matrix multiplication test that uses, when available, AVX and FMA instructions for best performance. On a single thread, the 6900K manages about 90 billion single-precision floating point operations per second (90 gigaflops). The 1800X, by contrast, only offers 53 gigaflops. Although the 1800X’s higher clock speed helps a little, the Intel chip is doing twice as much work with each cycle. A few hundred megahertz isn’t enough to offset this architectural difference.
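A back-of-the-envelope calculation shows where the "twice as much work" figure comes from. This sketch assumes single-precision (32-bit) elements, counts each FMA as two floating point operations, and assumes both chips sat at their quoted maximum boost clocks during the single-threaded run. The benchmark text does not state the clocks, so they are labeled as assumptions.

```python
def peak_flops_per_cycle(fma_units, vector_bits, element_bits=32):
    """Theoretical peak FLOPs per cycle for a core's FMA units."""
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2   # one FMA = one multiply plus one add


intel_peak = peak_flops_per_cycle(fma_units=2, vector_bits=256)   # 32
zen_peak = peak_flops_per_cycle(fma_units=2, vector_bits=128)     # 16

# Measured SGEMM results from the text, divided by ASSUMED clocks (GHz).
MEASURED_GFLOPS = {"i7-6900K": 90, "R7 1800X": 53}
ASSUMED_GHZ = {"i7-6900K": 3.7, "R7 1800X": 4.0}   # assumption: max boost

per_cycle = {
    chip: MEASURED_GFLOPS[chip] / ASSUMED_GHZ[chip] for chip in MEASURED_GFLOPS
}
ratio = per_cycle["i7-6900K"] / per_cycle["R7 1800X"]

print(intel_peak, zen_peak)
print({chip: round(value, 1) for chip, value in per_cycle.items()})
print(round(ratio, 2))
```

Under those assumed clocks the measured per-cycle gap comes out a little under 2x, in line with the 2:1 ratio of theoretical peaks.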
Of course, this is the kind of workload that in some ways proves AMD’s point: GPU-accelerated versions of the same matrix multiply operation can hit 800 or more gigaflops. If your computational requirements include a substantial number of matrix multiplications, you’re not going to want to do that work on your CPU.
The long-standing difficulty for AMD, and for general purpose GPU computation as a whole, is how to cope when only some of your workload is a good match for the GPU. Moving data back and forth between CPU and GPU imposes overheads and often requires developers to switch between development tools and programming languages. There are solutions to this, such as AMD’s Heterogeneous System Architecture (HSA) and OpenCL, but they’re still awaiting widespread industry adoption.
One particular Geekbench subtest showed a strong advantage in the other direction. Geekbench includes tests of the cryptographic instructions found in all mainstream processors these days. In a test of single-threaded performance, the Ryzen trounces the Broadwell-E, encrypting at 4.5GB/s compared to 2.7GB/s. Ryzen has two AES units that both reside within the floating point portion of the processor. Broadwell only has one, giving the AMD chip a big lead.
This situation is reversed when moving from one thread to 16: the Intel system can do 24.4GB/s while the AMD only does 10.2GB/s. This suggests that the test becomes bandwidth-limited with high thread counts, allowing the 6900K’s quad-memory channels to give it a lead over the 1800X’s dual channels. Even though the Ryzen has more computational resources to throw at this particular problem, that doesn’t help when the processor sits twiddling its thumbs waiting for data.
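The scaling figures make the bandwidth argument concrete. The sketch below uses only the four throughput numbers quoted above plus each platform’s memory channel count; treating throughput per channel as a proxy for bandwidth pressure is a simplification of ours, not something Geekbench reports.

```python
# Throughput in GB/s from the text; channel counts from the platform specs.
RESULTS = {
    "R7 1800X": {"single": 4.5, "multi": 10.2, "channels": 2},
    "i7-6900K": {"single": 2.7, "multi": 24.4, "channels": 4},
}


def scaling(chip):
    """How many times faster the 16-thread run is than the 1-thread run."""
    r = RESULTS[chip]
    return r["multi"] / r["single"]


def per_channel(chip):
    """16-thread throughput divided across the memory channels."""
    r = RESULTS[chip]
    return r["multi"] / r["channels"]


for chip in RESULTS:
    print(chip, round(scaling(chip), 2), round(per_channel(chip), 2))
```

The 1800X scales by a little over 2x across 16 threads while the 6900K scales by about 9x, yet the per-channel throughput of the two lands in the same ballpark, which is consistent with memory bandwidth rather than compute being the limit.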
And when it comes to single-threaded performance, the Kaby Lake i7-7700K pulls well ahead of both platforms. With its combination of superior IPC and faster clock speeds, neither Broadwell-E nor Zen can keep up.
Feeding those instruction units are the instruction decoder and out-of-order machinery. Again, AMD has made substantial improvements relative to Bulldozer. As is now the norm among x86 processors, Zen decodes x86 instructions into micro-ops (µops) that are then scheduled and executed within the processor. In Bulldozer, repeated instructions (such as those in a loop) would have to be repeatedly fetched and decoded. Zen adds a µop cache storing 2,000 µops so that loops can bypass the decoding. Intel first introduced an equivalent data structure in its Sandy Bridge architecture released at the start of 2011.
This is combined with a much smarter branch predictor. Branch predictors try to guess which instructions will be executed after a branch before the processor actually knows for sure. If the branch predictor guesses correctly, the processor’s pipelines can be kept full. If not, it will have to flush its pipelines and waste some amount of work.
The Zen branch predictor is smarter (it guesses correctly more of the time) and cheaper (the penalty of an incorrect guess has been reduced by three cycles). AMD now describes the branch predictor as being a neural network because it’s based on perceptrons. A perceptron takes a number of weighted inputs and adds them together. If their sum is greater than zero then the perceptron’s value is 1. Otherwise, that value is 0.
Perceptrons are interesting for branch predictors because they can track lots of input states to provide their decision on whether a branch is taken or not taken. This makes them a good match even for long loops. Bulldozer is suspected to use perceptrons, too. But it’s only with Zen that AMD seems to have cottoned on to the fact that you can call this a neural net, and hence bring to mind visions of artificial intelligence and Arnold Schwarzenegger, making the thing sound incredibly advanced.
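To make the perceptron idea concrete, here is a textbook-style sketch of a single-perceptron predictor driven by recent branch history. It is emphatically not AMD’s design, whose details are not public; the class name, history length, and training threshold are all arbitrary choices for illustration.

```python
class PerceptronPredictor:
    """One perceptron over recent branch outcomes (illustrative only)."""

    def __init__(self, history_length=8, threshold=16):
        self.history = [1] * history_length    # +1 = taken, -1 = not taken
        self.weights = [0] * history_length
        self.bias = 0
        self.threshold = threshold             # keep training while unsure

    def _sum(self):
        return self.bias + sum(w * h for w, h in zip(self.weights, self.history))

    def predict(self):
        """Return 1 (taken) if the weighted sum is above zero, else 0."""
        return 1 if self._sum() > 0 else 0

    def update(self, taken):
        """Train on the real outcome, then shift it into the history."""
        total = self._sum()
        predicted = 1 if total > 0 else 0
        outcome = 1 if taken else -1
        if predicted != int(taken) or abs(total) <= self.threshold:
            self.bias += outcome
            self.weights = [w + outcome * h
                            for w, h in zip(self.weights, self.history)]
        self.history = self.history[1:] + [outcome]


# A branch that alternates taken / not taken.
predictor = PerceptronPredictor()
hits = []
for i in range(40):
    taken = (i % 2 == 0)
    hits.append(predictor.predict() == int(taken))
    predictor.update(taken)
print(sum(hits[20:]), "of the last 20 predictions correct")
```

The weights end up encoding how each past outcome correlates with the next one, which is why this structure copes with patterns that a simple saturating counter cannot.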
Feeding all this is a bigger, better cache system. The level 1 cache is now a write-back cache (instead of Bulldozer’s write-through), which makes it faster and reduces memory traffic. The level 1 and 2 caches both offer about twice the bandwidth they did in Bulldozer. The level 3 cache offers five times the bandwidth, and it’s also more complex. Each core within a CCX has 2MB of L3 cache, for 8MB across the CCX and 16MB across the entire processor. This cache is shared, but the speed at which it can be accessed will vary. The cache slice closest to a core is, naturally, the fastest. The other three are a little slower.
A big chunk of each CCX is cache.
Communication between the CCXes uses what AMD calls the Infinity Fabric. AMD is a little opaque when describing this. The basic principle is that this is a high-speed, cache-coherent interface and bus that can be used both within a CCX (the power management and security microcontrollers both connect to it, as do PCIe and memory controllers) and between CCXes. It can even be used between sockets on a motherboard.
At least for multisocket situations, AMD describes Infinity Fabric as “Coherent HyperTransport plus enhancements,” but on other occasions the company has said it isn’t based on HyperTransport.
To top it all off, Zen supports simultaneous multithreading. Almost all the resources in the core are “competitively shared,” which means that in the absence of a second thread, the first thread should generally be able to use all the available execution resources. Instructions are dispatched on a round robin basis, with alternate cycles executing from alternate threads.
Gains from SMT varied. Cinebench showed healthy improvement, scoring some 40 percent higher just from having SMT enabled. This is highly workload dependent, however; multithreaded Geekbench saw a boost of just under 10 percent from enabling SMT. While SMT is normally a big win (a free boost to multithreaded programs at no real detriment to single-threaded ones), we did notice that Hitman (2016) shed around 10 percent of its framerate from having SMT enabled.
Apart from situations in which the difference in execution resources is significant, the general pattern is that all of this work on Zen leaves it a little slower than Intel’s designs on a per-cycle basis. (Let’s estimate five percent behind Broadwell and 15 percent behind Skylake.) But a combination of higher clock speeds and a solid simultaneous multithreading implementation means that it can keep up with the Broadwell-E. In Cinebench R15, for example, the 1800X ties with the 6900K in single-threaded tests, and it’s about six percent better than the Intel chip in multithreaded. In single-threaded Geekbench 4, the AMD processor again ties the Intel one, though its multithreaded performance trails Intel by about 20 percent. This is a reflection of the different instruction mix and bandwidth dependence of the different tests.
Accordingly, AMD has beaten its IPC goal relative to Bulldozer and is within spitting distance of Intel’s two-and-a-half-year-old design. Thanks to the clock speed and core count, this means Zen is competitive with Intel across a wide range of workloads.
Relative to Excavator, Zen does this at a significant reduction in power usage, too.
If boosting IPC by 52 percent was impressive, the power reduction is even more so: in multithreaded Cinebench, AMD claims that performance per watt has increased by 269 percent. For the same power draw, Zen scores 3.7 times what Excavator achieved.
That efficiency improvement comes from a range of sources. One big improvement is out of AMD’s hands: performance per watt is improved by 70 percent from the switch to GlobalFoundries’ 14nm FinFET manufacturing process (and should GlobalFoundries falter, AMD has validated the design on Samsung’s 14nm process too); even mobile-oriented Excavator parts are currently built on a now-ancient 28nm process. But the rest of the improvement is down to AMD’s engineers.
Some 129 percent of the improvement is attributed to the new, better architecture. As well as being a lot faster, it’s also more efficient. That µop cache doesn’t just relieve the burden on instruction fetching and decoding; it also saves power. It’s cheaper to read from the µop cache than to read from the level 1 instruction cache and then run through the decoder. Similarly, the improved branch predictor means that the processor spends less time speculatively executing the wrong branch, a task that represents wasted energy.
The integer cores also include features that help performance and boost efficiency. Some of the most common x86 instructions are move instructions that copy data from memory to register, register to register, or register to memory. These register to register moves are eliminated by the core, replaced instead by register renaming, a trick first used in Bulldozer.
x86 also includes instructions that are dedicated to manipulating the stack; these instructions will simultaneously read or write a value from memory and add or subtract from a dedicated register (the stack pointer). While Bulldozer includes some special handling of the stack to reduce the dependencies between instructions using the stack (thereby improving the scope for parallel execution), Zen includes a more complex stack engine that can eliminate certain stack manipulation instructions. This both improves performance (again, by allowing more parallel execution) and reduces power usage.
The design of the core was also extensively optimized to use less power. Integrated circuits are built of a variety of standard units such as NAND and NOT logic gates, flipflops, and even more complex elements such as half and full adders. For each of these components (called standard cells), a range of designs is possible with different trade-offs between performance, size, and power consumption.
Fast flipflops, on the left, are large and power-hungry. Most of Zen uses slower, more efficient ones, on the right.
AMD built a large library of standard cells with different characteristics. For example, it has five different flipflop designs. The fastest is twice as fast as the slowest, but it takes about 80 percent more space and uses more than twice as much power. Armed with this library, Zen was optimized to use the smaller, slower, more efficient parts where it can and the faster, larger, high-performance parts when it must. In Zen, the high-performance design is used for fewer than 10 percent of the flipflops, with the efficient one used about 60 percent of the time.
The result is a careful balance of performance in the critical paths where it matters and efficiency where less performance is needed.
Zen’s power management is complex and capable. Like any other modern processor, it’s aggressively clock gated, enabling unused areas of the chip to be temporarily turned off. But more than that, it has an integrated power management controller that monitors and adjusts the voltage used by each core according to temperature and loading. It’s a system that AMD is calling SenseMI (where “MI” stands for “machine intelligence”).
Just as Intel did with Skylake, this takes most power management roles out of the hands of software and the operating system, and it bakes them into silicon. Operating systems can respond to processor power management events in tens or hundreds of milliseconds; hardware on the chip can respond in just a handful of milliseconds. This allows much tighter control over voltages and clock speeds. And again, like Skylake, the operating system is responsible for setting the coarse-grained power of the chip (it can throttle the whole thing back by many hundreds or even thousands of megahertz), but when operating at maximum performance (the “P0” power state), control is turned over to the processor itself.
Scattered across each CCX are a whole bunch of sensors: 20 thermal diodes, 48 power supply monitors, nine voltage droop detectors, and more than 1,300 critical path monitors. These sensors are all connected to an Infinity Fabric control plane, feeding their data into a power management unit. They report their readings 1,000 times a second, and are accurate to 1mA, 1mV, 1mW, and 1°C.
On the one hand, this system allows Zen to cut back its voltages to the bare minimum needed to operate correctly. Each core has its own voltage regulator, and they’ll be set as low as possible to maintain clock speeds. Not all cores are created equal; some are naturally faster than others. Those faster cores are given slightly less power, creating the thermal headroom to boost the power given to other, weaker cores. A fast core can use as much as five percent less power than a slow one at the same speed.
This per-core adjustment is in contrast to AMD’s previous Adaptive Voltage and Frequency Scaling (AVFS) implementations, where the voltage was set chip-wide. In the past, the voltage had to be high enough to support all the cores. Likewise, Intel’s cores all share a power plane and operate at the same voltage. Setting it on a core-by-core basis saves power.
The voltage regulation uses on-chip low drop-out (LDO) linear regulators. Intel’s Haswell and Broadwell also used on-chip regulators of a more complex, less efficient type (FIVRs, fully integrated voltage regulators). In principle, Intel’s system can handle a wider range of core voltages compared to AMD, but in practice, AMD said that this capability wasn’t particularly useful. All the cores operate at about the same voltage, so the LDO was a better approach. Skylake is believed to drop the FIVRs, moving voltage regulation back onto the motherboard; this precludes making the kind of high-speed fine adjustment that AMD claims for Zen. If the voltage droops too low, Zen will slow a core’s clock speed while it recovers.
With its 25MHz increments, the Zen clock speed can vary smoothly between its standard clock and its boosted maximum.
The counterpart to cutting the voltage as aggressively as possible to maintain a given frequency is pushing the frequency up while not exceeding a given power draw. The same control system that cuts power can also push up the clock speed of each core to make maximum use of the processor’s power budget. This kind of turbo boosting is nothing new, but AMD’s system, “Precision Boost,” has a couple of novel attributes. The first is that the boosting is fine-grained; it can adjust the clock speed in 25MHz increments.
Second, the clock speed can be pushed beyond the standard boosted maximum if the sensors detect that there is still thermal headroom available, which AMD is calling Extended Frequency Range (XFR). If the heatsink and cooler are significantly above the minimum spec and hence the chip is running cool, XFR will add up to 100MHz extra clock speed (for the 1800X and 1700X; it adds only 50MHz to the 1700), in essence giving a tiny overclock “for free.”
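The granularity of Precision Boost and the extra headroom from XFR can be pictured as a ladder of 25MHz rungs. The sketch below simply enumerates those rungs between the published base and boost clocks, with the XFR allowance added on top. It assumes the ladder starts at the base clock, which is a simplification, and `clock_steps` is a made-up helper name.

```python
STEP_MHZ = 25   # Precision Boost granularity, from the text


def clock_steps(base_mhz, boost_mhz, xfr_mhz=0):
    """List the clock speeds (MHz) available from base up to boost plus XFR."""
    return list(range(base_mhz, boost_mhz + xfr_mhz + 1, STEP_MHZ))


r7_1800x = clock_steps(3600, 4000, xfr_mhz=100)
r7_1700 = clock_steps(3000, 3700, xfr_mhz=50)

print(len(clock_steps(3600, 4000)), "steps for the 1800X without XFR")
print(len(r7_1800x), "steps with XFR, topping out at", r7_1800x[-1], "MHz")
print(len(r7_1700), "steps for the 1700 with XFR, topping out at", r7_1700[-1], "MHz")
```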
These refinements and optimizations to power distribution, timing, and voltage account for the remaining 70 percent of the performance per watt improvement.
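The three contributions AMD cites add up to the headline number. A quick check of that bookkeeping, treating the percentages as additive (which is how the 269 percent total is presented) and using descriptive labels of our own:

```python
# Percentage-point contributions to performance per watt, as cited by AMD.
CONTRIBUTIONS = {
    "14nm FinFET process": 70,
    "new architecture": 129,
    "power, timing, and voltage refinements": 70,
}

total_gain_percent = sum(CONTRIBUTIONS.values())
multiplier = 1 + total_gain_percent / 100   # a 269% increase is about 3.7x

print(total_gain_percent)
print(round(multiplier, 2))
```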
When it comes to “real” overclocking, however, the R7 processors are a mixed bag. On the one hand, they’re all multiplier unlocked, at least when used in conjunction with the right chipset, so you’re free to experiment in whatever way you see fit. AMD estimates that most 1800Xes will manage 4.2GHz across all eight cores, as long as they’re bumped up to 1.45V. AMD also estimates, however, that 1.45V will shorten the life of the chip.
Enabling overclocking mode, however, disables much of the fancy power management. There’s no way, for example, to say that you’d like XFR to add 200MHz of extra speed, subject to thermal constraints. Nor can you, say, set an 1800X to run at a 3.8GHz base, 4.2GHz turbo (bumping both by 200MHz). As soon as overclocking is enabled, the processor simply runs at a fixed speed (when in the P0 power state).
This stands in contrast with Intel, where you can specify the turbo multipliers for 1-, 2-, 3-, and 4-core boosting, and with Kaby Lake, where you can even set a reduced multiplier for use during AVX-intensive workloads. As smart as Zen is in normal operation, it’s very simplistic in overclocked mode.
A platform that includes a chipsetless chipset
The Ryzen R7 is not just a processor; it’s a system-on-chip. Whether it’s used in this capacity, however, will depend on which motherboard and chipset it’s paired with.
The processor itself has 20 PCIe 3.0 lanes, 4 USB 3.1 generation 1 (5 Gb/s) controllers, and a further 4 mixed PCIe and I/O lanes; these can be grouped for a single x4 NVMe device or split into 2 SATA plus 1 x2 NVMe or 2 SATA plus x2 PCIe. There are also two DDR4 memory channels.
Of those 20 PCIe 3.0 lanes, four are normally used to communicate with the chipset. AMD has three regular chipsets: the high-end X370, mid-range B350, and low-end A320. All include USB 3.1 generation 2 (10 Gb/s) controllers (two for the X370 and B350, one for the A320), six USB 2 controllers, some additional SATA controllers (4, 2, and 2 for X370, B350, and A320, respectively), two SATA Express ports (which can be used as 4 SATA 3.0 ports), and some PCIe 2.0 lanes (8, 6, and 4, respectively). The X370 and B350 both enable overclocking, and the X370 allows the 16 remaining PCIe 3.0 lanes from the processor to be split into 2×8 channels for dual GPU support.
But for small systems, the extra I/O and extra size of a chipset might be undesirable. Accordingly, there are two other chipsets, X300 and A/B300, that greatly diminish what it means to be a chipset. With X300 or A/B300, the only I/O capabilities are those within the processor itself; they don’t include USB 3.1 generation 2, they don’t include SATA Express, they don’t add any PCIe 2.0 lanes. The chipsets provide a handful of functions, mainly around security and provision of a Trusted Platform Module. As such they are tiny: AMD says they’ll fit on a chip the size of a fingernail.
Because these chipsets do so little, they don’t need four PCIe lanes from the processor; there’s a dedicated SPI link for them. This means that all 20 of the processor’s PCIe lanes become available. The X300 also enables overclocking and dual GPUs.
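The lane accounting across the five chipsets can be summarized in a short sketch. The numbers are the ones given in the text; the `Chipset` class, its field names, and `free_cpu_lanes` are illustrative assumptions, not anything from AMD's documentation.

```python
from dataclasses import dataclass

CPU_PCIE3_LANES = 20  # PCIe 3.0 lanes provided by the processor itself


@dataclass(frozen=True)
class Chipset:
    name: str
    uplink_lanes: int  # processor PCIe 3.0 lanes used to reach the chipset
    pcie2_lanes: int   # PCIe 2.0 lanes the chipset adds
    extra_sata: int    # additional SATA controllers the chipset adds


CHIPSETS = {
    c.name: c
    for c in (
        Chipset("X370", 4, 8, 4),
        Chipset("B350", 4, 6, 2),
        Chipset("A320", 4, 4, 2),
        # X300 and A/B300 sit on a dedicated SPI link and add no I/O.
        Chipset("X300", 0, 0, 0),
        Chipset("A/B300", 0, 0, 0),
    )
}


def free_cpu_lanes(chipset_name):
    """PCIe 3.0 lanes left on the processor after the chipset link."""
    return CPU_PCIE3_LANES - CHIPSETS[chipset_name].uplink_lanes


for name in CHIPSETS:
    print(f"{name}: {free_cpu_lanes(name)} PCIe 3.0 lanes free")
```

With a regular chipset, 16 lanes remain for graphics; with X300 or A/B300, all 20 do.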
Regardless of chipset, AMD intends to use the same socket, Socket 1331, and the same platform, AM4, until 2020, unless some new technology (such as PCIe 4.0 or DDR5) forces it to change the package pinout. Ryzen AM4 motherboards are compatible with existing AM4 APUs, and they’ll also work with Raven Ridge Zen-based APUs that should become available later this year. This is why pretty much all of the Zen motherboards include integrated display outputs, even though right now, Ryzen processors can’t use them.
AMD’s gambles starting to pay off
It’s already clear that Zen is a tremendously more successful architecture than Bulldozer was. Bulldozer was barely competitive with AMD’s existing products, and it was left in the dust by Intel’s chips. That’s not true of Zen. AMD’s new architecture is performance-competitive with Intel’s Broadwell, and it’s efficient, too.
But the question of whether to buy a Zen is a complex one, and it’s complex for many of the same reasons that Bulldozer was so weak. For highly multithreaded workloads, Zen is much more cost-effective than Broadwell-E, while performing at almost exactly the same level. Sometimes Broadwell-E will be a little faster, sometimes Zen will be a little faster, but with the exception of workloads that are particularly dependent on AVX or memory bandwidth, it’s a wash. That makes AMD’s $499 part an easy and attractive alternative to Intel’s $1,049 chip.
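The size of that price gap is easy to work out. The sketch below uses only the list prices and core counts quoted in this article; street prices will differ, and the helper names are illustrative.

```python
# List prices (USD) and core counts as quoted in this article.
PARTS = {
    "Ryzen 7 1800X": {"price": 499, "cores": 8, "threads": 16},
    "Core i7-6900K": {"price": 1049, "cores": 8, "threads": 16},
}


def price_per_core(part):
    """List price divided by physical core count."""
    spec = PARTS[part]
    return spec["price"] / spec["cores"]


def price_ratio(expensive, cheap):
    """How many times more the first part costs than the second."""
    return PARTS[expensive]["price"] / PARTS[cheap]["price"]


for name in PARTS:
    print(f"{name}: ${price_per_core(name):.2f} per core")

ratio = price_ratio("Core i7-6900K", "Ryzen 7 1800X")
print(f"Intel/AMD price ratio: {ratio:.2f}x")
```

At list price, that is roughly $62 per core for the 1800X against about $131 for the 6900K, a gap of about 2.1 times for the same core and thread count.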
I’d expect developers, for example, to be keen to get their hands on Zen as soon as they can. Software compilation, especially of C++, is readily multithreaded and almost invariably CPU-bound, and doubling the number of cores and threads is a huge performance win. Video game streamers are another demographic that AMD is courting with much the same logic: a chip with lots of cores has the capacity to both play a game and perform high-quality video compression, which is something that Intel’s four-core parts struggle with.
But in other workloads, Intel’s greater single-threaded performance is the more important factor. In games including Grand Theft Auto V, Battlefield 4, and Ashes of the Singularity, the Kaby Lake i7-7700K will pull ahead of the 1800X in spite of having half the number of cores and threads.
Just as it was with Bulldozer, AMD’s hope is that developers will build their software to scale better onto larger numbers of cores, allowing Zen to more consistently show its benefits over Intel’s chips. But a look at the Steam Hardware Survey illustrates why, in the short term at least, developers might be reluctant to do that: 2- and 4-core processors make up the overwhelming majority of gaming systems.
The big difference, however, is that Zen is still good enough to be competitive even in workloads that aren’t optimal. An i7-7700K may offer slightly higher frame rates and slightly better performance for a particular workload, but where Bulldozer was just downright embarrassing for single-threaded tasks, Zen is still good, good enough that you may well be willing to take the performance hit in some programs so that you can enjoy the big wins in others.
That’s simply not a position that AMD has been in for a long, long time. It’s been many years since PC enthusiasts have had to seriously consider which processor to buy, but AMD has built something that’s absolutely worth consideration. It might not be the right chip for everyone, but this time around, it is at least the right chip for someone.