Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out
by Johan De Gelas on June 7, 2006 12:00 PM EST. Posted in IT Computing
Secure Sockets Layer RSA Performance
Secure Web communication is possible through the use of the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa", we can measure the number of RSA signing operations (signs) that a system can perform per second.
While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signature Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, thus one 8 core T1 has 8 MAUs. To make use of those 8 MAUs, you have to run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed we used the command "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.
We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAUs of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads, to show you that the peak is attained at 8 threads. Likewise, the Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal, and so on.
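As a rough sketch of how such a concurrency sweep could be scripted, the helper below builds the "openssl speed" command line for a given number of threads. The two base commands are the ones quoted above; the use of the "-multi" option to run several benchmark processes in parallel and the "rsa1024" algorithm name to restrict the run to 1024-bit keys are assumptions on our side, not a description of how the numbers in this article were produced. The function names are hypothetical.

```python
import shutil
import subprocess


def build_speed_command(threads, key_bits=1024, use_pkcs11=False):
    """Return the argument list for one `openssl speed` run.

    use_pkcs11=True adds `-engine pkcs11`, the route through the Solaris
    Cryptographic Framework mentioned in the article. `-multi` (assumed
    here) asks openssl to run several benchmark processes in parallel.
    """
    cmd = ["openssl", "speed"]
    if use_pkcs11:
        cmd += ["-engine", "pkcs11"]
    if threads > 1:
        cmd += ["-multi", str(threads)]
    cmd.append(f"rsa{key_bits}")
    return cmd


def run_sweep(thread_counts, execute=False, **options):
    """Print (and optionally execute) one benchmark per concurrency level."""
    for threads in thread_counts:
        cmd = build_speed_command(threads, **options)
        print(" ".join(cmd))
        # Only run the benchmark when asked to and when openssl is installed.
        if execute and shutil.which("openssl"):
            subprocess.run(cmd, check=False)


# Dry run: show the commands for the concurrency levels discussed in the text.
run_sweep([1, 2, 4, 8, 16, 32], use_pkcs11=True)
```

Passing execute=True runs the benchmarks for real; each run takes a while, so the default is a dry run that only prints the commands.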
Notice that the 8 MAUs of the Sun T1 only come into full action once we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8 core DL585. Without MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.
It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be a result of the fact that the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key make use of rotations. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC accounts for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (memory disambiguation, 4-wide decode ...) that the Woodcrest has.
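To see why ADC shows up so often in RSA code, consider how a 1024-bit number is handled on a CPU with 32-bit or 64-bit registers: it is split into machine words ("limbs"), and every addition has to pass the carry from one limb to the next. The sketch below mimics that in Python, assuming 32-bit limbs; the helper names are ours, and real big-number libraries do this in hand-tuned assembly rather than in a loop like this.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value, count):
    """Split an integer into `count` little-endian 32-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_with_carry(a_limbs, b_limbs):
    """Add two equally long limb arrays, returning (sum_limbs, carry_out).

    Each loop iteration does what one ADC instruction does in hardware:
    add two words plus the carry left over from the previous step.
    """
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        result.append(total & LIMB_MASK)  # low 32 bits stay in this limb
        carry = total >> LIMB_BITS        # overflow moves to the next limb
    return result, carry


# A 1024-bit operand needs 32 limbs of 32 bits each.
LIMBS = 1024 // LIMB_BITS
x = (1 << 1024) - 1  # all bits set: the carry ripples through every limb
y = 1
sum_limbs, carry_out = add_with_carry(to_limbs(x, LIMBS), to_limbs(y, LIMBS))
print(from_limbs(sum_limbs), carry_out)
```

Every step depends on the carry of the previous one, so the number of such add-with-carry operations a core can sustain per clock has a direct effect on this kind of code.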
Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encryption code makes heavy use of rotates or shifts or both. We also did a quick test with Hyper-Threading on and off. In this case Hyper-Threading sped up the encryption (signs/s) by 20 to 28%.
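For readers who have never looked at hash code, this is the kind of rotate we are talking about. The first function is the usual 32-bit rotate-left as found in hash functions such as SHA-1; with a barrel shifter it maps to a single fast instruction. The second function is purely a conceptual illustration of doing the same work one bit position at a time; it is not a model of how NetBurst actually implements shifts, only a way to show why a rotate by many positions gets expensive when no single-cycle shifter is available. Both function names are ours.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate the 32-bit value x left by n positions in one step."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl32_stepwise(x, n):
    """Same result, built from n single-bit rotates (conceptual only)."""
    x &= MASK32
    for _ in range(n % 32):
        top_bit = x >> 31                    # bit that falls off the left
        x = ((x << 1) & MASK32) | top_bit    # shift by one, wrap it around
    return x


value = 0x12345678
print(hex(rotl32(value, 8)), hex(rotl32_stepwise(value, 8)))
```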
To end the RSA sign/s benchmark, we'll make a quick comparison between quad core AMD Opteron 2.4 GHz, quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA bit lengths.
RSA Encryption (Signs/s) | Opteron 2.4 GHz, 4 threads | Xeon 5160 3 GHz, 4 threads | Sun T1 with MAU, 32 threads |
512 bit | 19003 | 21194 | 35613 |
1024 bit | 6098 | 6240 | 10722 |
2048 bit | 1145 | 1087 | 1918 |
4096 bit | 185 | 164 | 1 |
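To make the table easier to read, the small script below computes relative performance from the numbers above. The figures are copied from the table; the dictionary layout and the helper name are just our choice for this sketch.

```python
# Signs per second, copied from the table above.
SIGNS_PER_SECOND = {
    512:  {"opteron": 19003, "xeon": 21194, "t1_mau": 35613},
    1024: {"opteron": 6098,  "xeon": 6240,  "t1_mau": 10722},
    2048: {"opteron": 1145,  "xeon": 1087,  "t1_mau": 1918},
    4096: {"opteron": 185,   "xeon": 164,   "t1_mau": 1},
}


def ratio(key_bits, system, baseline):
    """How many times faster `system` is than `baseline` at a key length."""
    row = SIGNS_PER_SECOND[key_bits]
    return row[system] / row[baseline]


for bits in sorted(SIGNS_PER_SECOND):
    print(
        f"{bits:4d} bit: "
        f"Xeon/Opteron = {ratio(bits, 'xeon', 'opteron'):.2f}, "
        f"T1 with MAU/Opteron = {ratio(bits, 't1_mau', 'opteron'):.2f}"
    )
```

The ratios show two trends: the Xeon's lead over the Opteron shrinks and then reverses as the key length grows, and the T1's advantage collapses at 4096 bits.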
Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.
When doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048 bits. At 1024 bits, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.
Again, the Opteron takes the lead. The Sun T1, even with the 8 MAUs, is only half as fast as four Opterons or Woodcrests, but this is hardly an issue. Encrypting or signing will slow down a server much quicker than verifying keys.
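Why is verifying so much cheaper than signing? Both are modular exponentiations, but a verify uses the public exponent, which is usually a small number (65537 is a common choice, and the one we assume here), while a sign uses a private exponent that is as long as the key. The sketch below counts the modular multiplications in a plain square-and-multiply exponentiation for both cases. The modulus and the "private-sized" exponent are random numbers for illustration, not a real RSA key, and real libraries such as OpenSSL use faster techniques for signing, so the actual gap is smaller than this count suggests.

```python
import random


def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation.

    Returns (base**exponent % modulus, number of modular multiplications).
    """
    if exponent < 1:
        raise ValueError("exponent must be >= 1")
    result = base % modulus
    operations = 0
    # Skip the '0b' prefix and the leading 1 bit, which `result` already covers.
    for bit in bin(exponent)[3:]:
        result = (result * result) % modulus      # one squaring per bit
        operations += 1
        if bit == "1":
            result = (result * base) % modulus    # extra multiply for a 1 bit
            operations += 1
    return result, operations


rng = random.Random(2006)  # fixed seed so the run is repeatable
MODULUS = rng.getrandbits(1024) | (1 << 1023) | 1          # arbitrary odd 1024-bit number
PUBLIC_EXPONENT = 65537                                     # assumed common public exponent
PRIVATE_SIZED_EXPONENT = rng.getrandbits(1024) | (1 << 1023)  # stand-in for a private exponent
message = rng.getrandbits(1000)

verify_value, verify_ops = square_and_multiply(message, PUBLIC_EXPONENT, MODULUS)
sign_value, sign_ops = square_and_multiply(message, PRIVATE_SIZED_EXPONENT, MODULUS)
print("verify-style multiplications:", verify_ops)
print("sign-style multiplications:  ", sign_ops)
```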
Both the verifies/s and signs/s benchmarks are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our Web server SSL test is a work in progress.
91 Comments
rayl - Thursday, June 8, 2006 - link
"Best Performance/Watt in the high end"
Which part of performance per watt do you not understand? Do more, pay less.
MrKaz - Thursday, June 8, 2006 - link
Dual Opteron 275 HE 2CPU's (275HE) - 4 GB RAM 192 Watts!!!
Dual Opteron 275 2CPU's - 4 GB RAM 239 Watts!!!
Dual Xeon 5160 3 GHz 2 CPU's - 4 GB RAM 245 Watts!!!
http://www.intel.com/performance/server/xeon/ppw.h...
Even Intel numbers show Xeon 3.6Ghz on par with AMD (obvious fake)
And the do more pay less, is not like you say on the server market, while your PC is doing lot of work (processing) with a computer game, most servers stand there doing almost nothing. Our servers for example from 0:00 to 8:00 do almost zero. Even in the day they work very little. Our Xeon 2.4 is more than enough, and I think most people think the same. Of course this depends a lot what you do, but this is generic. I think you know why virtualization is very important right?
rayl - Thursday, June 8, 2006 - link
Isn't this obvious to you? Those are power consumption numbers at 100% CPU load. This is where the performance/watt number really matters.
If you're running idle, the power saving mode starts kicking in, and you'll need a separate table to draw your conclusion.
Why this preoccupation with power consumption? 6-watts for a performance leap; it's moot.
coldpower27 - Thursday, June 8, 2006 - link
It will be interesting to note the Delta difference between 1 Woodcrest 5160 and 2 is 59W as reported by TechReport, and since the TDP for Woodcrest 5160 is 80W TDP we can extrapolate and since the TDP for Woodcrest 5148 is 40W I can expect it to spew about 30W per processor.
245W - (2x29W) = 187W
This bring the Low Power Woodcrest system to ~ the same power usage as the HE Opteron 275's even with the heat spewing FB-DIMM's with higher performance per watt, pretty impressive.
Questar - Thursday, June 8, 2006 - link
Yeah I'm worried about those six watts of power when I'm getting twice the performance.
fikimiki - Thursday, June 8, 2006 - link
You forgot about Intel chipset consumption - 22 Watts.
So Intel has 245+22=267 vs. 192 and even if you are running in power-saving mode, chipset is running all the time...
coldpower27 - Thursday, June 8, 2006 - link
No, wrong. They measured the system power consumption, hence why the Woodcrest systems are so hungry in comparison to the Opteron; the FB-DIMMs are what's eating away at the wattage.
So in the end it's 223 + 22 = 245, if indeed the chipset is consuming 22W.
Questar - Thursday, June 8, 2006 - link
That was system power consumption - it included the chipset dufus.
Saist - Wednesday, June 7, 2006 - link
I am going to make the argument that evaluating only one version of Linux in this type of situation is not a good idea in and of itself. Not to knock Gentoo directly, it is a fine distro to itself, but it has a very small slice of the Linux market. It would have made more sense for Anandtech to have benchmarked using other distribution types for a couple of reasons.
The first reason is the ability to duplicate the tests. This is actually a strike against Gentoo for what the operating system is. While it is possible to duplicate an installation of Gentoo and the applications used, generating an exact copy of the exact configuration used without a clear description of the compile targets used is very hard. This means that anybody wishing to reproduce these results on their own will be very hard-pressed to do so.
The second reason is commercial and residential use. Gentoo has its market, that market just isn't very widespread. It would have made more sense for Anandtech to have tested an RPM based distro such as Mandriva, RedHat, Fedora Core, Novell Suse, or OpenSuse against a .deb based distro such as Debian(sid), Ubuntu, Mepis, or Xandros. The reason why it would have made more sense is that .deb and .rpm distros are actually used in the commercial and residential spheres, and used in great quantities. Had Anandtech used a distribution that is in active use it would mean more to buyers currently looking to replace their Windows computers with a new system.
It would only be in the interests of providing a point of perspective that one would test a different type of Linux distribution like Gentoo or Slackware.
Going back to the first point, had Anandtech benchmarked these on a Debian based system it would be fairly easy to duplicate the tests. Anandtech would just need to list the base version of the Debian distro they used, list the apt-repositories they pulled from, and the applications in apt that were pulled. Anybody else who comes along afterwards with a Debian based distro would easily be able to duplicate the steps and the benchmarks.
The overall point is that while it is nice to see a non-dedicated Linux site approaching hardware, this isn't the way to approach it. As it stands now, the Anandtech tests are useless, regardless of whatever results the benchmarks returned.
BasMSI - Thursday, June 8, 2006 - link
These tests are also 100% useless.....
The MSI K2-102 is NUMA aware....
But for some reason the K8N-Master isn't shown in the graphs....that board is NOT NUMA aware.
I'm also missing the HP server everywhere in the graphs.
I really believe all these tests are done on the K8N-Master board for all Opteron tests.
No way the graphs are showing all the systems.
These tests are a total fraud, letting us believe Intel all of a sudden became that fast.
No way on earth I believe any of these results.
Also, why using Gentoo? Why not Debian 64bit?
This puzzles me, as Gentoo is compiled but not known to be faster on every system.
Why not using precompiled Linuxes? Like Debian 64bit....that one is stable as hell and incredible fast!
Too much parameters missing here to get any judgement at all.
Do it better, this is 100% rubbish.
Bas.