Intel Launches Cooper Lake: 3rd Generation Xeon Scalable for 4P/8P Servers
by Dr. Ian Cutress on June 18, 2020 9:00 AM EST
We’ve known about Intel’s Cooper Lake platform for a number of quarters. As far as we understand, it was initially planned as a custom silicon variant of Cascade Lake for Intel’s high-profile customers; it was subsequently productized and positioned to fill a gap in Intel’s roadmap caused by the delayed development of 10nm for Xeon. Cooper Lake was set to be a full-range update to the product stack, but in the last quarter Intel declared that the platform would end up solely in the hands of its priority customers, and only as a quad-socket or higher platform. Today, Intel launches Cooper Lake, and confirms that Ice Lake is set to come out later this year, aimed at the 1P/2P markets.
Count Your Coopers: BFloat16 Support
Cooper Lake Xeon Scalable is officially designated as Intel’s 3rd Generation of Xeon Scalable for high-socket count servers. Ice Lake Xeon Scalable, when it launches later this year, will also be called 3rd Generation Xeon Scalable, but for low-socket count (1P/2P) servers.
For Cooper Lake, Intel has made three key additions to the platform. First is the addition of AVX512-based BF16 instructions, allowing users to take advantage of the BF16 number format. A number of key AI workloads, typically done in FP32 or FP16, can now be performed in BF16 to get almost the same throughput as FP16 for almost the same range of FP32. Facebook made a big deal about BF16 in its presentation last year at Hot Chips, where it forms a critical part of its Zion platform. At the time the presentation was made, there was no CPU on the market that supported BF16, which led to this amusing exchange at the conference:
BF16 (bfloat16) is a way of encoding a number in binary that attempts to take advantage of the range of a 32-bit number, but in a 16-bit format such that double the compute can be packed into the same number of bits. The simple table looks a bit like this:
|Data Type Representations|
Using BF16 numbers rather than FP32 numbers would also mean that memory bandwidth requirements, as well as system-to-system network requirements, could be halved. At the scale of a Facebook, an Amazon, or a Tencent, that is an appealing prospect. At the time of the presentation at Hot Chips last year, Facebook confirmed that it already had silicon working on its datasets.
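To make the encoding concrete, here is a minimal Python sketch of how an FP32 value maps onto BF16: the 16-bit pattern is simply the top half of the 32-bit one (1 sign bit, 8 exponent bits, 7 mantissa bits). The function names are our own, and the conversion shown is plain truncation for simplicity; real hardware conversions generally round instead, so treat this as an illustration of the format rather than of Intel’s instructions.

```python
import struct

def fp32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 pattern: sign, 8 exponent, 7 mantissa.

    Assumption: simple truncation, chosen for clarity. Hardware typically rounds.
    """
    return fp32_bits(x) >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Expand a BF16 pattern back to FP32 by zero-filling the low 16 bits."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]

def buffer_bytes(n_values: int, bits_per_value: int) -> int:
    """Bytes needed to store n_values at the given width."""
    return n_values * bits_per_value // 8

if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e38):
        bf16 = fp32_to_bf16_bits(value)
        print(f"{value!r}: bf16=0x{bf16:04X} -> {bf16_bits_to_fp32(bf16)!r}")
    # Same element count, half the bytes to move through memory or the network.
    print(buffer_bytes(1_000_000, 32), buffer_bytes(1_000_000, 16))
```

Because the exponent field is untouched, very large and very small FP32 values survive the round trip with roughly two to three significant decimal digits, which is the range-over-precision trade described above.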
Doubling Socket-to-Socket Interconnect Bandwidth
The second upgrade that Intel has made to Cooper Lake over Cascade Lake is in socket-to-socket interconnect. Traditionally Intel’s Xeon processors have relied on a form of QPI/UPI (Ultra Path Interconnect) in order to connect multiple CPUs together to act as one system. In Cascade Lake Xeon Scalable, the top end processors each had three UPI links running at 10.4 GT/s. For Cooper Lake, we have six UPI links also running at 10.4 GT/s. However, these links still only have three controllers behind them, so each CPU can still only connect directly to three other CPUs, but the bandwidth of each connection can be doubled.
This means that in Cooper Lake, each CPU-to-CPU connection involves two UPI links, each running at 10.4 GT/s, for a total of 20.8 GT/s. Because the number of links is doubled, rather than the standard itself evolving, there are no power efficiency improvements beyond anything Intel has done to the manufacturing process. Note that double the bandwidth between sockets is still a good thing, even if latency and power per bit are still the same.
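The link budget above is simple enough to check in a few lines of Python. This is only a sketch of the arithmetic using the figures quoted in the text; the function and variable names are ours.

```python
UPI_LINK_RATE_GT_S = 10.4  # per-link transfer rate quoted for both generations

def links_per_peer(total_links: int, peers: int) -> int:
    """UPI links available for each directly connected CPU."""
    return total_links // peers

def peer_rate_gt_s(total_links: int, peers: int) -> float:
    """Aggregate transfer rate between one CPU and one directly connected peer."""
    return links_per_peer(total_links, peers) * UPI_LINK_RATE_GT_S

# Both generations connect each CPU to at most three peers.
cascade_lake = peer_rate_gt_s(total_links=3, peers=3)
cooper_lake = peer_rate_gt_s(total_links=6, peers=3)

if __name__ == "__main__":
    print(f"Cascade Lake: {cascade_lake:.1f} GT/s per peer")
    print(f"Cooper Lake:  {cooper_lake:.1f} GT/s per peer")
```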
Intel still uses the double pinwheel topology for its eight socket designs, ensuring at most two hops to any required processor in the set. Eight sockets is the limit for a glueless network; we have already seen companies like Microsoft build servers with 32 sockets using additional glue logic.
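To see why the wiring pattern matters when each CPU can only reach three peers directly, the Python sketch below measures the worst-case hop count for two eight-node layouts with three connections per node. Both layouts are hypothetical stand-ins chosen by us for illustration, not Intel’s documented double pinwheel wiring: a cube-style layout, and a ring with cross links.

```python
from collections import deque

def max_hops(adj: dict[int, list[int]]) -> int:
    """Worst-case shortest-path length between any two nodes (graph diameter)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for peer in adj[node]:
                if peer not in dist:
                    dist[peer] = dist[node] + 1
                    queue.append(peer)
        if len(dist) != len(adj):
            raise ValueError("topology is not fully connected")
        worst = max(worst, max(dist.values()))
    return worst

# Hypothetical layout A: sockets as cube corners, linked along the cube edges.
cube = {i: [i ^ 1, i ^ 2, i ^ 4] for i in range(8)}

# Hypothetical layout B: a ring of eight sockets plus a link to the opposite one.
ring_with_cross_links = {i: [(i + 1) % 8, (i - 1) % 8, (i + 4) % 8] for i in range(8)}

if __name__ == "__main__":
    print("cube:", max_hops(cube), "hops worst case")
    print("ring with cross links:", max_hops(ring_with_cross_links), "hops worst case")
```

With the same three connections per socket, the cube layout needs three hops in the worst case while the ring with cross links needs only two, which is the property the article attributes to Intel’s eight socket design.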
Memory and 2nd Gen Optane
The third upgrade for Cooper Lake is the memory support. Intel now supports DDR4-3200 with the Cooper Lake Xeon Platinum parts, but only in a 1 DIMM per channel (1 DPC) configuration. 2 DPC is supported, but only at DDR4-2933. Support for DDR4-3200 technically gives the system a boost from 23.46 GB/s per channel to 25.60 GB/s, an increase of 9.1%.
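The per-channel figures follow from the transfer rate multiplied by the channel width. A quick Python sketch, assuming the standard 64-bit (8-byte) DDR4 data bus per channel and decimal gigabytes; the function name is ours:

```python
def channel_bandwidth_gb_s(mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one memory channel in GB/s (1 GB = 1000 MB).

    Assumption: a 64-bit (8-byte) data bus per channel, as is standard for DDR4.
    """
    return mega_transfers_per_s * bus_bytes / 1000

ddr4_2933 = channel_bandwidth_gb_s(2933)
ddr4_3200 = channel_bandwidth_gb_s(3200)
increase_pct = (ddr4_3200 / ddr4_2933 - 1) * 100

if __name__ == "__main__":
    print(f"DDR4-2933: {ddr4_2933:.2f} GB/s per channel")
    print(f"DDR4-3200: {ddr4_3200:.2f} GB/s per channel")
    print(f"Increase:  {increase_pct:.1f}%")
```

These are theoretical peaks; sustained bandwidth in a real system will be lower.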
The base models of Cooper Lake will also be updated to support 1.125 TiB of memory, up from 1 TB. This allows for a 12 DIMM scenario where six modules are 64 GB and six modules are 128 GB. One of the complaints about Cascade Xeons was that in 1 TB mode, it would not allow for an even capacity per memory channel when it was filled with memory, so Intel have rectified this situation. In this scenario, it means that the six 128 GB modules could also be Optane. Why Intel didn’t go for the full 12 * 128 GB scenario, we’ll never know.
The higher memory capacity processors will support 4.5 TB of memory, and be listed as ‘HL’ processors.
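The 12 DIMM base configuration described above can be tallied with a short Python sketch. The helper name is ours, and the 12 x 128 GB line is only the hypothetical full population the article wonders about, not a supported configuration.

```python
def total_capacity_gb(modules: dict[int, int]) -> int:
    """Sum capacity for a mapping of module size in GB -> module count."""
    return sum(size_gb * count for size_gb, count in modules.items())

# Six 64 GB modules plus six 128 GB modules, as described for the base models.
base_config_gb = total_capacity_gb({64: 6, 128: 6})

# Hypothetical full population of twelve 128 GB modules, for comparison only.
full_128_gb = total_capacity_gb({128: 12})

if __name__ == "__main__":
    print(f"Base config: {base_config_gb} GB = {base_config_gb / 1024} TiB")
    print(f"12 x 128 GB: {full_128_gb} GB = {full_128_gb / 1024} TiB")
```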
Cooper Lake will also support Intel’s second generation 200-series Optane DC Persistent Memory, codenamed Barlow Pass. 200-series Optane DCPMM will still be available in 128 GB, 256 GB, and 512 GB modules, the same as the first generation, and will also run at the same DDR4-2666 memory speed. Intel claims that this new generation of Optane offers 25% higher memory bandwidth than the previous generation, which we assume comes down to a new generation of Optane controller on the memory and software optimization at the system level.
Intel states that the 25% performance increase is when they compare 1st gen Optane DCPMM to 2nd gen Optane DCPMM at 15 W, both operating at DDR4-2666. Note that the first-gen could operate in different power modes, from 12 W up to 18 W. We asked Intel if the second generation was the same, and they stated that 15 W is the maximum power mode offered in the new generation.
azfacea - Thursday, June 18, 2020
are u suggesting these will compete with IBM z platform or something else on reliability? clearly this is not a reliability play. its commodity x86. and if max core count and max memory and max IO of 8s server does not beat a 4s EPYC, not sure what the selling point is, never mind charging a premium.
unless there is particular order from like facebook for BFloat16 its not going anywhere. with a 2x perf disadvantage even that wont be enof for long.
SarahKerrigan - Thursday, June 18, 2020
Not on reliability, just on scalability. 4s/8s x86 is largely replacing RISC/UNIX (*not* z, which is a separate animal.)
As for 4s Epyc... you realize that Epyc only goes to 2s, right? If you want a really big tightly-bound x86 system, whether to replace RISC/UNIX or just because you have an interconnect-sensitive app that eats a lot of RAM, Intel goes higher than AMD. That's not a value judgment, it's a statement of fact. That's also an incredibly niche market and always has been - but it's one with good margins, which presumably is why Intel still bothers.
kc77 - Thursday, June 18, 2020
No, Epyc can scale further than that. Second, these chips top out at 28 cores. AMD is at a double density advantage (Actually it's worse). Hell you have to go to 8S on these parts just to beat out the 2S AMD counter parts. The power and density lost is crazy. These are super niche parts. Aside from FaceBook I don't see anyone else getting these.
Deicidium369 - Thursday, June 18, 2020
And it seems to the people making the decisions about what goes into the Datacenter - AMD supposed "advantages" are meaningless. The 4 and 8 socket Cooper Lake is destined for hyperscalers.
Zibi - Thursday, June 18, 2020
Like Facebook OCP Delta Lake Cooper Lake perhaps ?
Too bad it's 2S xD
Deicidium369 - Friday, June 19, 2020
Cooper Lake is 4 and 8 sockets - designed for AI / Hyper scalers
Ice Lake SP is single and dual socket 38C and 64 PCIe4 lanes per socket.
Deicidium369 - Friday, June 19, 2020
Ice Lake SP has 76 cores and offers 128 lanes of PCIe4 in a dual socket system - this is the mainstream platform - most servers in traditional data centers are 2 socket... makes for an efficient VM farm - better to have 2 dual socket than a single 4 socket. And with the significant IPC increase the Sunny Cove brought (~20%) makes the 76 cores in a dual socket config equivalent to 90 or 91 cores when compared to Skylake derived Comet and Cooper and by extension Epyc. So Epyc may have 128 cores in a dual socket config - that really is not a huge advantage anymore - and with the same # of PCIe4 lanes.. Epyc shows little advantage here.
You would be hard pressed to find any motherboard that supports more the 2 Epyc CPUs. there is a poster on Reddit Optilasgar that explains why more than 2 sockets on Epyc are basically not possible
Yeah the 4/8 socket will go mostly to the hyper scalers - Facebook was one of the driving factors for Cooper Lake at 4 or 8 sockets - but you can bet they won't be the only hyper scalers getting them.
Apparently what you see as AMDs advantage isn't what the large customers - hyper scalers or traditional data centers want - revenue show that to be true.
Spunjji - Friday, June 19, 2020
@Deicidum
"the significant IPC increase the Sunny Cove brought (~20%) makes the 76 cores in a dual socket config equivalent to 90 or 91 cores when compared to Skylake derived Comet and Cooper and by extension Epyc."
What's this "by extension Epyc" nonsense? Everybody knows Epyc has better IPC than Skylake.
We don't know the clock speeds for Ice Lake SP either, but if it ends up anything like the mobile variants then the IPC increase will be eaten by the clock speed decrease.
Deicidium369 - Saturday, June 20, 2020
Yeah server variants with 270W are going to have the same clocks as the 15W mobile variant,,,
you are really grasping at straws. AMD Epyc are roughly comparable to Skylake derived cores - so Comet Lake and Cooper Lake are Skylake derived cores, and Epyc is trying to compete with Skylake - therefore - by extension.. means that Sunny Cove has a 20% IPC advantage over Skylake - which is Comet Lake, Cooper Lake, and AMD Epyc.
mtfbwy - Thursday, June 18, 2020
Then why are the 'rate' numbers for SPEC CPU 2017 Dominated by EPYC? spots #1, #2, and #3 are all EPYC, with socket count of 16, 24, and 32.
While the "glue" in this case is software instead of a hardware node-controller, it still makes for a scale-up server; the same technology is also used with Xeons for customers running workloads like SAP HANA - it makes for a far cheaper and more flexible architecture to scale up your memory.