Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out
by Johan De Gelas on June 7, 2006 12:00 PM EST
Posted in IT Computing
Secure Socket Layers RSA Performance
Secure Web communication is possible through the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa", we can measure the number of RSA private key operations (signs) that a system can perform per second.
While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signature Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, so one 8-core T1 has 8 MAUs. To make use of those 8 MAUs, you have to run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed, we used the command "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.
We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAUs of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads to show you that the peak is attained at 8 threads. Likewise, the Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal, and so on.
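As a rough illustration of how the two invocations above could be combined with the different concurrency levels, here is a minimal Python sketch that only builds the command lines. It assumes the concurrency was produced with the "-multi" option of "openssl speed", which forks parallel benchmark processes; the article does not say how the threads were actually started, so treat that as a guess. The helper name build_speed_cmd and the "rsa1024" algorithm selector are likewise choices made for this example.

```python
def build_speed_cmd(threads, bits=1024, engine=None):
    """Return the argv list for one `openssl speed` RSA run.

    threads : number of parallel benchmark processes (assumed to map to -multi)
    bits    : RSA key length to benchmark, e.g. 1024 -> "rsa1024"
    engine  : optional engine id, e.g. "pkcs11" as used for the Sun T1
    """
    cmd = ["openssl", "speed"]
    if engine is not None:
        cmd += ["-engine", engine]
    if threads > 1:
        cmd += ["-multi", str(threads)]
    cmd.append(f"rsa{bits}")
    return cmd


# Concurrency levels taken from the text; nothing is executed here.
runs = [
    ("Opteron / Xeon, 4 threads", 4, None),
    ("HP DL585, 8 threads", 8, None),
    ("Sun T1 with MAU, 32 threads", 32, "pkcs11"),
]
for label, threads, engine in runs:
    print(label, "->", " ".join(build_speed_cmd(threads, engine=engine)))
```

Passing the resulting list to subprocess.run would start the benchmark, provided an openssl binary (and, on the T1, the pkcs11 engine) is available on the machine.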
Notice that the 8 MAUs of the Sun T1 only come into full action once we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8-core DL585. Without the MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.
It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be a result of the fact that the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key do. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC accounts for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (memory disambiguation, 4-wide decode ...) that the Woodcrest has.
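To see why ADC throughput matters for RSA, it helps to look at how a big number is added word by word. The sketch below is only an illustration in Python, not the code OpenSSL runs: it assumes 64-bit words ("limbs") stored least significant first, and the helper names are invented for this example. Each loop iteration stands in for one ADC instruction, and every iteration needs the carry produced by the previous one.

```python
MASK64 = (1 << 64) - 1  # assumed word size: 64 bits


def to_limbs(n, count):
    """Split a non-negative integer into `count` 64-bit limbs, least significant first."""
    return [(n >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def add_with_carry(a_limbs, b_limbs):
    """Add two equal-length limb arrays; returns (sum_limbs, carry_out)."""
    carry = 0
    out = []
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry        # one ADC: a + b + incoming carry
        out.append(total & MASK64)   # low 64 bits stay in this limb
        carry = total >> 64          # carry flag feeds the next ADC
    return out, carry


# A 1024-bit operand is 16 limbs, so one addition is a chain of 16 dependent ADCs.
x = (1 << 1024) - 1
limbs, carry_out = add_with_carry(to_limbs(x, 16), to_limbs(1, 16))
print(len(limbs), carry_out)
```

The multi-precision multiplications inside RSA accumulate their partial products with similar carry chains, which is one plausible reason a core that sustains two ADCs per cycle gets ahead here.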
Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encryption code makes heavy use of rotates, shifts, or both. We also did a quick test with Hyper-Threading on and off. In this case, Hyper-Threading sped up the encryption (signs/s) by 20 to 28%.
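For readers who have not met the operation, a rotate shifts a word and feeds the bits that fall off one end back in at the other. A minimal Python sketch of a 32-bit rotate-left follows; the function name rotl32 is chosen for this example. SHA-1, for instance, rotates 32-bit words by 1, 5 and 30 bits in every round, which is the kind of code that suffers without a barrel shifter.

```python
def rotl32(x, n):
    """Rotate the 32-bit value x left by n bits."""
    n %= 32
    x &= 0xFFFFFFFF
    # Bits shifted out at the top re-enter at the bottom.
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


print(hex(rotl32(0x12345678, 8)))
```

A CPU with a barrel shifter does this in a single cycle regardless of n; without one, the cost grows with the work needed to move the bits into place.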
To wrap up the RSA sign/s benchmark, we'll make a quick comparison between the quad-core AMD Opteron 2.4 GHz, the quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA bit lengths.
RSA Encryption (Signs/s)
Key length | Opteron 2.4 GHz, 4 threads | Xeon 5160 3 GHz, 4 threads | Sun T1 with MAU, 32 threads |
512 bit | 19003 | 21194 | 35613 |
1024 bit | 6098 | 6240 | 10722 |
2048 bit | 1145 | 1087 | 1918 |
4096 bit | 185 | 164 | 1 |
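The relative standings are easier to read as ratios. The short Python sketch below takes the signs/s figures from the table above and divides them by the Opteron result for each key length; the dictionary layout and names are just one convenient way to arrange the numbers for this example.

```python
# Signs/s from the table above: (Opteron 2.4 GHz, Xeon 5160 3 GHz, Sun T1 with MAU)
SIGNS_PER_SEC = {
    512: (19003, 21194, 35613),
    1024: (6098, 6240, 10722),
    2048: (1145, 1087, 1918),
    4096: (185, 164, 1),
}


def ratios(table):
    """Express the Xeon and T1 results relative to the Opteron for each key length."""
    out = {}
    for bits, (opteron, xeon, t1) in table.items():
        out[bits] = {
            "xeon_vs_opteron": xeon / opteron,
            "t1_vs_opteron": t1 / opteron,
        }
    return out


RATIOS = ratios(SIGNS_PER_SEC)
for bits in sorted(RATIOS):
    r = RATIOS[bits]
    print(f"{bits:>4} bit  Xeon/Opteron {r['xeon_vs_opteron']:.2f}  T1/Opteron {r['t1_vs_opteron']:.2f}")
```

Read this way, the T1's lead over the Opteron shrinks from about 1.9x at 512 bits to about 1.7x at 2048 bits, and the Xeon is ahead of the Opteron only at 512 and 1024 bits.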
Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.
When doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048 bits. At a 1024-bit key length, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.
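Why a verify is so much cheaper than a sign can be shown with a toy version of the modular exponentiation that both operations perform. The Python sketch below uses plain square-and-multiply and counts the modular multiplications. It assumes the common public exponent 65537, which the article does not state, and it ignores the CRT and windowing tricks that real libraries use for signing, so the counts only indicate the order of magnitude. The tiny key (n = 3233, e = 17, d = 2753) is a textbook example and far too small for real use.

```python
def modexp_count(base, exponent, modulus):
    """Left-to-right square-and-multiply.

    Returns (result, number of modular multiplications performed).
    """
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus   # square for every exponent bit
        mults += 1
        if bit == "1":
            result = (result * base) % modulus  # extra multiply for each 1 bit
            mults += 1
    return result, mults


# Textbook toy key: n = 61 * 53, public exponent 17, private exponent 2753.
n, e, d = 3233, 17, 2753
signature, _ = modexp_count(65, d, n)      # "sign" with the private exponent
recovered, _ = modexp_count(signature, e, n)  # "verify" with the public exponent
print(recovered)

# Cost comparison: short public exponent versus a 1024-bit exponent
# with half of its bits set (a stand-in for a private exponent).
public_cost = modexp_count(2, 65537, n)[1]
private_like_cost = modexp_count(2, int("10" * 512, 2), n)[1]
print(public_cost, private_like_cost)
```

With these assumptions, the public exponent needs a few dozen multiplications at most, while the 1024-bit exponent needs well over a thousand, which matches the direction of the sign and verify numbers above.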
Again, the Opteron takes the lead. Even with the 8 MAUs, the Sun T1 is only half as fast as four Opterons or Woodcrests, but this is hardly an issue. Encrypting or signing will slow down a server much more quickly than verifying keys.
Both the verifies/s and signs/s benchmarks are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our Web server SSL test is a work in progress.
91 Comments
snorre - Thursday, June 8, 2006 - link
Anandtech is going down the drain, there are no doubts left about it IMHO.
"Woodcrest" may be a nice improvement for Intel, but comparing it to clearly crippled (both software and hardware wise) Opteron systems is pretty lame by any standard.
Remember: Fool us once shame on us, fool us twice shame on YOU!
This is your third strike in my book, so now you're officially out in THG hell.
I hope you wake up and smell the coffee soon...
Slappi - Thursday, June 8, 2006 - link
Exactly.
I just can't believe what I am seeing here.
This site was once THE HARDWARE SITE for me and I always recommended it to others.
If Intel has a better chip hey that's great! But.... what is with the OBVIOUS underhanded reporting against AMD and for INTEL that has been going on here for the past few months?
It is so blatant here that I am starting to wonder if Intel's new chips are a lot of smoke and mirrors. If it is such a great chip it should speak for itself, not with all this closed testing and crippled AMD machines. Makes me wonder.
You would think after reading all the Anand Intel press that the new CPUs could cure cancer and cook dinner.
duploxxx - Thursday, June 8, 2006 - link
i can give 2 pages full of rather strange figures and comparisons about this review. but i hope you'll bring the readers the windows benches fast and compare with other published benches so everybody can see that the linux optimization can shift wherever you want.
you use a workstation/budget motherboard against the intel server board. use a sun galaxy or hp proliant.
the specint and specfp are not correct, even intel gives way other numbers
some benches are done with one socket others with 2 socket. why?
mysql benches are optimized for two cores thats very clear.. the performance drop on opteron is much more than the one on woodcrest. knowing the architecture of the opteron this should be the other way round. the opteron is lacking here due to the motherboard
you can extrapolate it in a different way showing different results, again you use 2 different opterons and use this difference to calculate 3.0, both setups are workstation and therefore performance is wrong. some benches you even talk and calculate 2 systems but not showing on the graphs.
your conclusion: is rather funny. you state that the woodcrest is the best performing server on a platform that has maybe 2% worldwide support with benches that can not be compared to other publications. no linear power consumption with other servers because no equal hardware setup and most systems use 2gb/cpu thats a +28w consumption for the woodcrest.
as stated from line 1 give some real world benches where people can compare with other posted results.
zsdersw - Thursday, June 8, 2006 - link
The MSI K8N Master2-FAR board is a server motherboard. So are the boards in the other two Opteron servers.
MrKaz - Thursday, June 8, 2006 - link
I don’t know if you all already have realized but that is what it will look like the 4x4 boards.
And that’s NOT a server board, ONLY ONE of the processors is accessing directly to the memory and that must IMPACT the performance.
http://www.msi.com.tw/images/product_img/mbd_img/9...
AnandThenMan - Thursday, June 8, 2006 - link
Anyone that calls that MSI mobo a "server board" is a freakin retard.
As for this "review" it has to be the worst on Anandtech in at least 6 months.
zsdersw - Thursday, June 8, 2006 - link
I guess MSI themselves must be retards then. Look where it's listed: http://www.msi.com.tw/program/products/server/svr/...
ashyanbhog - Thursday, June 8, 2006 - link
for those who think MSI board must be good because they list it on their server pages,
Just look at the memory banks
MSI has a single bank, forcing the 2nd CPU to share the memory channel, reducing memory bandwidth to both CPUs, and increasing memory latencies. They are discarding NUMA capabilities to keep the price at around 250$
http://www.msi.com.tw/program/products/server/svr/...
Now check Tyan k8we and Supermicro h8dci boards linked below. Notice that they all carry two separate memory banks, giving each processor its own dedicated bank. This doubles the available memory bandwidth and keeps latencies low.
http://www.tyan.com/products/html/thunderk8we.html
http://www.supermicro.com/Aplus/motherboard/Optero...
Iwill D8kn is another similar board that I can recall. They all recommend that you put at least one card in each bank in a two processor setup to utilize the extra bandwidth.
But adding this extra bank comes at a cost, all the above boards are priced around $500 mark. It's common knowledge in the AMD community that one needs to get the boards with separate memory banks if one is looking for a high performance machine.
If you still have doubt, check the review on GamePC, linked below. Notice that the Tyan TIGER k8we, (with single memory channel to both CPUs like the MSI board) is beaten in every benchmark by Tyan THUNDER k8we (which has dedicated memory channels for both CPUs)
BasMSI - Friday, June 9, 2006 - link
MSI lists them as Workstation boards, not server boards.
http://www.msi.com.tw/program/products/server/svr/...
They should have used the K8D-Master series, those are server boards and do have NUMA.
zsdersw - Friday, June 9, 2006 - link
It's under the "Server and Rackmount" section of their website.