Arm's Cortex-A76 CPU Unveiled: Taking Aim at the Top for 7nm
by Andrei Frumusanu on May 31, 2018 3:01 PM EST
Cortex A76 µarch - Backend
Switching to the back-end of the core we have a look at the execution core.
The integer core contains 6 issue queues and execution ports (4 depicted in the slide plus 2 load/store pipelines). There are 3 integer execution pipelines: two ALUs capable of simple arithmetic operations, and a complex pipeline that also handles multiplication, division and CRC ops. Each of the three integer pipelines is served by a 16-deep issue queue. An issue queue of the same size serves the single branch execution port.
Two load/store units are the remaining ports of the integer core, and each is served by its own 12-deep issue queue. The issue queue stages are 3 cycles deep, and while I mentioned that rename/dispatch is 1 stage deep, the dispatch stage actually overlaps with the first cycle of the issue queue stages.
The ASIMD/floating point core contains two pipelines which are served by two 16-deep issue queues.
When it comes to the backend of a CPU core the two most important metrics are instruction throughput and latency. Where the A76 in particular improves a lot is in terms of instruction latency as it’s able to shave off cycles on very important instructions.
To give a better overview of the improvements I created a table with the most common instruction types. The execution throughput and latencies presented here are for AArch64 instructions and, if not otherwise noted, represent operations on 64-bit data for integer and 64-bit (double precision) FP.
|Backend Execution Throughput and Latency|
| |Cortex-A75 Throughput|Cortex-A75 Latency|Cortex-A76 Throughput|Cortex-A76 Latency|Exynos M3 Throughput|Exynos M3 Latency|
|Integer Arithmetic (Add, sub)|2|1|3|1|4|1|
|Integer Multiply 32b|1|3|1|2|2|3|
|Integer Multiply 64b|1|3|1|2|1| |
|Integer Multiply Accumulate|1|3|1|2|1|3|
|Integer Division 32b|0.25|12|0.2|< 12|1/12 - 1|< 12|
|Integer Division 64b|0.25|12|0.2|< 12|1/21 - 1|< 21|
|Shift ops (Lsl)|2|1|3|1|3|1|
|FP Multiply Accumulate|2|5|2|4|3|4|
|FP Division (S-form)|0.2-0.33|6-10|0.66|7|>0.16| |
|ASIMD Multiply Accumulate|1|4|1|4|1|3|
|ASIMD FP Arithmetic|2|3|2|2|3|2|
|ASIMD FP Multiply|2|3|2|3|1|3|
|ASIMD FP Chained MAC (VMLA)|2|6|2|5|3|5|
|ASIMD FP Fused MAC (VFMA)|2|5|2|4|3|4|
On the integer operations side the A76 improves the multiplication and multiply accumulate latencies from 3 cycles down to 2 cycles, with the throughput remaining the same when compared to the A75. Obviously, because the A76 has 3 integer pipelines, simple arithmetic operations see a 50% increase in throughput versus the A75’s 2 pipelines.
The much larger and more important improvements can be found in the “VX” (vector execution) pipelines, which are in charge of FP and ASIMD operations. Arm calls the new pipeline a “state-of-the-art” design and this is finally the result that’s been hyped up for several years now.
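To make the difference between latency and throughput concrete, here is a minimal Python sketch using the integer figures from the table above. The function names and the simplified cycle model (a fully dependent chain is latency-bound, fully independent work is throughput-bound) are illustrative assumptions, not a description of how the real core schedules instructions.

```python
import math

def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops where each op consumes the previous result.

    Assumed model: nothing can overlap, so the chain is purely latency-bound.
    """
    return n_ops * latency

def independent_cycles(n_ops: int, throughput: float, latency: int) -> int:
    """Cycles for n_ops with no dependencies between them.

    Assumed model: ops issue at `throughput` per cycle, then the last
    op still needs `latency` cycles to produce its result.
    """
    if n_ops == 0:
        return 0
    issue_cycles = math.ceil(n_ops / throughput)
    return issue_cycles - 1 + latency

# 100 dependent 64-bit multiplies: latency 3 (A75) vs 2 (A76)
print(dependent_chain_cycles(100, 3))   # 300
print(dependent_chain_cycles(100, 2))   # 200

# 90 independent adds: throughput 2 (A75) vs 3 (A76), latency 1 on both
print(independent_cycles(90, 2, 1))     # 45
print(independent_cycles(90, 3, 1))     # 30
```

Under this model the shorter multiply latency only pays off in dependent code, while the third ALU only pays off when there is enough independent work to fill it.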
Floating point arithmetic operations have been reduced in latency from 3 cycles down to 2 cycles, and multiply accumulate has also shaved off a cycle from 5 cycles down to 4.
What Arm means by the “Dual 128bit ASIMD” with doubled execution bandwidth is that for the A75 and prior only one of the vector pipelines was capable of 128-bit while the other one was still 64-bit. For the A76 both vector pipelines are now 128-bit, so 128-bit (quad-word) vector operations see a doubling of the execution throughput.
Moving on to more details of the data handling side, we again see the two load/store pipelines, which was something first implemented on the A73 and A75. Although depicted as one issue queue in the slide, the LD/S pipelines each have their own queue at 12 entries deep.
The data cache is fixed at 64KB and is 4-way associative. Load latency remains at 4 cycles. DTLB lookup runs in its own pipeline, separate from tag and data lookup. Arm’s goal here is maximum MLP (memory level parallelism) to be able to feed the core.
In a perfect machine everything would already be located in the caches, so it’s important to have very robust prefetching capabilities. On the A76 we see new 4th-generation prefetchers introduced to get nearer to this goal of perfect cache-hit operation. In all, the A76 has 4 different prefetching engines running in parallel, looking at various data patterns and loading data into the caches.
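The pipeline widths described above can be turned into a quick sanity check. This is a minimal sketch under stated assumptions: each pipe starts at most one operation per cycle, and a pipe simply cannot take an operation wider than its datapath (real hardware may instead split wide operations). The names `ops_per_cycle`, `A75_PIPES` and `A76_PIPES` are made up for illustration.

```python
def ops_per_cycle(pipe_widths_bits, op_width_bits):
    """Count how many ops of the given width can start per cycle.

    Assumptions: one op per pipe per cycle, and a pipe only accepts
    ops that are no wider than its datapath.
    """
    return sum(1 for width in pipe_widths_bits if width >= op_width_bits)

# Vector pipe widths as described in the text
A75_PIPES = [128, 64]    # one 128-bit pipe, one 64-bit pipe
A76_PIPES = [128, 128]   # both pipes 128-bit

print(ops_per_cycle(A75_PIPES, 128))  # 1
print(ops_per_cycle(A76_PIPES, 128))  # 2
print(ops_per_cycle(A75_PIPES, 64))   # 2
print(ops_per_cycle(A76_PIPES, 64))   # 2
```

In this simplified view only the 128-bit operations double in throughput; 64-bit operations could already use both pipes on the older design.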
In terms of the A76 cache hierarchy Arm is said to have made no compromises and got the best of both worlds in terms of bandwidth and latency. The 64KB L1 instruction cache reads up to 32B/cycle and the same bandwidth applies to the L1 data cache in both directions. The L1 is a writeback cache. The L2 cache is configurable in 256 or 512KB sizes and is D-side inclusive with the same 2x 32B/cycle write and read interfaces up to the exclusive L3 cache in the 2nd generation DSU.
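To put the 32B/cycle figure into more familiar units, the following sketch converts bytes per cycle into GB/s. The 3.0GHz clock is purely an assumed example value and is not stated in this section; the function name is likewise hypothetical.

```python
def peak_bandwidth_gb_s(bytes_per_cycle: float, freq_ghz: float) -> float:
    """Peak bandwidth in GB/s for an interface moving `bytes_per_cycle`.

    bytes/cycle * 1e9 cycles/s per GHz / 1e9 bytes per GB = bytes_per_cycle * freq_ghz
    """
    return bytes_per_cycle * freq_ghz

ASSUMED_FREQ_GHZ = 3.0  # assumption for illustration only

# One 32B/cycle interface (e.g. L1 data cache reads)
print(peak_bandwidth_gb_s(32, ASSUMED_FREQ_GHZ))      # 96.0

# Read and write interfaces combined (2x 32B/cycle)
print(peak_bandwidth_gb_s(2 * 32, ASSUMED_FREQ_GHZ))  # 192.0
```

These are theoretical peaks for the interface width alone; sustained bandwidth depends on hit rates and everything further out in the memory system.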
Overall the microarchitectural improvements on the core are said to improve memory bandwidth to DRAM by up to 90% in microbenchmarks.
All in all the microarchitecture of the A76 could be summed up in a few focus design points: Maximise memory performance throughout the core by looking at every single cycle. During the design phase the engineers were looking at feature changes with a sensitivity of up to 0.25% in performance or power – if that metric was fulfilled then it was deemed to be a worthwhile change in the core. Small percentages then in turn add up to create significant figures in the end product.
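How quickly such small gains add up can be sketched with a compounding calculation. The count of 40 changes is an arbitrary example, and treating the gains as independent and multiplicative is an assumption made for illustration, not something Arm stated.

```python
def compounded_gain(per_change_gain: float, n_changes: int) -> float:
    """Overall speedup factor if each change multiplies performance
    by (1 + per_change_gain). Assumes the gains compound independently.
    """
    return (1.0 + per_change_gain) ** n_changes

# 40 hypothetical changes, each worth 0.25%
total = compounded_gain(0.0025, 40)
print(f"{(total - 1.0) * 100:.1f}% overall")
```

With these example inputs the total lands at roughly 10%, which is why a 0.25% threshold is not as pedantic as it first sounds.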
The focus on bandwidth and latency is said to have been extreme, and Arm was very adamant in re-iterating that vendors need to implement an equally capable memory subsystem on the SoC to take full advantage of the microarchitecture. A figure that was put out there was 0.25% of performance per nanosecond of latency to main memory. As we’ve seen in the Snapdragon 845, one of the reasons the SoC didn’t quite reach Arm’s projected performance metrics was the degraded memory latency figures, which might have been introduced by the L4 system cache in the SoC. In the future vendors will need to focus more on providing latency-sensitive memory subsystems, as otherwise they’ll be leaving free performance and power on the table, with differences that could amount to basically a generational difference in CPU IP.
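The quoted sensitivity figure can be turned into a rough estimate of what extra memory latency costs. This is a minimal sketch that assumes the relationship is linear over the range of interest, which the source does not claim; the function name and the 20ns example are hypothetical.

```python
def perf_loss_percent(extra_latency_ns: float,
                      sensitivity_pct_per_ns: float = 0.25) -> float:
    """Estimated performance loss in percent for added DRAM latency.

    Assumes a linear model: loss = sensitivity * extra latency.
    The default sensitivity is the figure quoted in the text.
    """
    return extra_latency_ns * sensitivity_pct_per_ns

# Hypothetical SoC whose memory subsystem adds 20ns of latency
print(perf_loss_percent(20))  # 5.0
```

Under this assumed model a few tens of nanoseconds are enough to cost several percent of performance.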
tipoo - Thursday, May 31, 2018
Still a 4-wide front end, I don't imagine it'll catch A10, maybe A9 per core then eh.
wicketr - Thursday, May 31, 2018
I just don't understand why ARM doesn't at least come out with a design that can match the Monsoon cores of an A11, or even the power of what will likely be the next A12 cores. It seems like ARM is eternally 2-3 steps behind Apple on this and they need to catch up.
shadowx360 - Thursday, May 31, 2018
Probably their power/efficiency constraints. They manage to get the same performance as a M3 core with a 4 wide instead of 6 wide decoder and half the power usage. The A11 cores are absolute monsters at power draw at max performance but Apple is able to tweak the hell out of the rest of the device and OS to get the battery life in check. Android OEMs don't have that much control.
wicketr - Thursday, May 31, 2018
And I could understand the power issues for phones, but not all ARM chips are destined for phones. Some can go into cars or gaming consoles that are always plugged in and well ventilated.
I just think they should come out with another tier ( Cortex A9X series) that can go toe-to-toe with Apple's best even if it is too power hungry for phones. Just come up with a design and see where we're at.
Wilco1 - Thursday, May 31, 2018
Using a much larger core to get modest extra performance wouldn't make sense even in less power constrained cases. Not every market is happy with just 2 huge cores, so power and area efficiency remain important. For laptops binning for frequency and adding turbo modes would make far more sense.
BillBear - Friday, June 1, 2018
>Using a much larger core to get modest extra performance wouldn't make sense even in less power constrained cases.
It makes perfect sense if you don't care that your core is large, because you aren't just selling a SOC. For Qualcomm, increased die size means reduced profit. For Apple, it does not.
For instance, Apple's Cyclone core from 2013:
>With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor, it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.
Matthmaroo - Monday, June 4, 2018
Apple has so many built in advantages - huge R&D, excellent engineering, closed system ... android manufacturers are disadvantaged to Apple in so many ways
close - Tuesday, June 5, 2018
ARM has to build a "one size fits all" kind of solution. Unlike Apple they are not catering for a single customer with full control over every aspect of HW and SW development, and the profits associated with that.
Plus, achieving the power that the Apple cores bring doesn't come cheap. Samsung's Exynos is still lagging behind and it's not like Samsung doesn't have expertise or deep pockets.
techconc - Tuesday, June 5, 2018
Yeah, but when you have a big little architecture, OEMs could choose the most efficient combination to meet their needs. There needs to be a powerful single core option that's available for the ARM platform. Until ARM goes there, the rest of the ARM community will be behind Apple. Remember, not all workloads can take advantage of multiple cores. At best ARM will be approaching 2016 level Apple A series core performance.
bananaforscale - Saturday, June 9, 2018
Excellent engineering? Like the bendgate, touch screen problems etc. that were *engineering screwups*?