SpyderByte.com ;Technical Portals 
 News & Information Related to Linux High Performance Computing, Linux Clustering and Cloud Computing

High Performance Linpack on Xeon 5500 v. Opteron
Posted by Shelly Kelley, Thursday June 18 2009 @ 03:07PM EDT

Written by Shane Corder - Cluster Engineer Tuesday, 16 June 2009 16:23

This year has brought big advances to the CPU industry with the arrival of the Intel Xeon 5500 series "Nehalem" and the AMD Opteron 2400 series "Istanbul". While many benchmarks comparing the two have been published, they all seem to fall a bit short of an accurate comparison of the two platforms. The HPC industry standard for benchmarking is HPL (High Performance Linpack), on which the "Top 500" list is based. Many industry insiders look to the results of this benchmark for insight into how their systems will perform.

This benchmark was performed with a single goal in mind: to show the peak performance in terms of GFLOPS (billions of floating point operations per second). The maximum theoretical GFLOPS per system varies widely depending on the number of cores, the clock speed, and the number of floating point operations each core can complete per clock cycle (referred to here as IPC, instructions per cycle). CPUs from just a few years ago could manage only 2 IPC, but today's newer architectures reach 4 IPC. As a comparison, an older two-processor system with dual core Opterons running at 2.2 GHz had a theoretical peak of only 17.6 GFLOPS per machine (4 cores x 2.2 GHz x 2 IPC). With today's new CPUs, a quad core Opteron system at the same clock speed has a theoretical peak of 70.4 GFLOPS (8 cores x 2.2 GHz x 4 IPC). That is twice the performance per core, because the CPU handles 4 IPC compared to 2 IPC. Even though both the Nehalem and Istanbul systems provide 4 IPC, there are other architectural design decisions that can impact performance. Both have on-processor memory controllers and large caches, but core counts, processor interconnect speeds, and memory configurations all vary.

Testing:

HPL benchmarks that have been released by others appear to have no standardization of the running parameters. Our benchmarks were run with the same compiler, MPI, and linear algebra library on both systems. In our recent testing and through real world experience, we have found that the Intel compilers and Intel Math Kernel Library (MKL) usually provide the best performance. Instead of just settling on Intel's toolkit, we tried various compilers, including Intel, GNU, and Portland Group. We also tested various linear algebra libraries, including MKL, AMD Core Math Library (ACML), and libGOTO from the University of Texas. All of the testing showed we could achieve the highest performance when using both the Intel compilers and MKL (even on the AMD system), so these were used as the base of our benchmarks.

The benchmarks were run on a dual-processor Opteron 2435 Istanbul system (6 core 2.6GHz processors with 16GB of 800MHz DDR2) and a dual-processor X5550 Nehalem system (quad core 2.66GHz processors with 12GB of 1333MHz DDR3). An attempt was made to keep the systems identical in every other way. The same power supply, hard drive, and operating system were used (even though these parameters shouldn't affect the performance of HPL). The amount of RAM differs because the Nehalem performs best when all three channels of its tri-channel memory architecture are populated, whereas the Opteron is dual channel. Since HPL performs best when using as much memory as it can, we adjusted the problem size (N in the HPL configuration file) to use as close to 100% of the RAM on the system as possible.
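To make the link between N and memory concrete, here is a minimal sketch that estimates how much RAM the HPL matrix occupies. It assumes the N x N matrix of 8-byte doubles dominates memory use and that "GB" means GiB; the function names and the block size of 256 are hypothetical choices for illustration, not settings taken from the article. By this estimate, the N values reported in the results table come to roughly 80% of each system's RAM, which leaves headroom for the operating system and MPI buffers.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL solves a dense system of 64-bit floating point values


def matrix_fraction_of_ram(n: int, ram_gib: float) -> float:
    """Fraction of installed RAM taken by the N x N coefficient matrix."""
    return n * n * BYTES_PER_DOUBLE / (ram_gib * 1024**3)


def problem_size_for(ram_gib: float, fraction: float, block: int = 256) -> int:
    """Largest N, rounded down to a multiple of `block`, whose matrix fits
    in `fraction` of RAM. `block` is a hypothetical block size."""
    n = int(math.sqrt(fraction * ram_gib * 1024**3 / BYTES_PER_DOUBLE))
    return n - n % block


if __name__ == "__main__":
    # N values and RAM sizes as reported in the article
    for label, n, ram in (("Nehalem X5550", 35840, 12), ("Istanbul 2435", 41216, 16)):
        share = matrix_fraction_of_ram(n, ram)
        print(f"{label}: N={n} uses {share:.1%} of {ram} GiB")
```

Calling `problem_size_for(12, 0.8)` or `problem_size_for(16, 0.8)` gives a starting value of N for a given memory budget; in practice N is then tuned by trial runs.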


CPU Model | Problem Size (N) | Theoretical Peak | Actual Peak | Efficiency | Node Cost | $ per GFLOP
--- | --- | --- | --- | --- | --- | ---
Nehalem X5550 2.66GHz | 35840 | 85.12 GFLOPS | 74.03 GFLOPS | 86.97% | $3,800.00 | $51.33
Istanbul 2435 2.6GHz | 41216 | 124.8 GFLOPS | 99.38 GFLOPS | 79.63% | $3,500.00 | $35.21
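The Theoretical Peak column can be reproduced from core count, clock speed, and operations per cycle. The sketch below assumes each system has two processors (consistent with the 12 cores mentioned for the Istanbul system) and 4 floating point operations per core per cycle; the function name is hypothetical.

```python
def theoretical_peak_gflops(sockets: int, cores_per_socket: int,
                            clock_ghz: float, flops_per_cycle: int = 4) -> float:
    """Peak GFLOPS = total cores x clock (GHz) x floating point ops per cycle."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    # Dual-socket configurations, as assumed in the lead-in above
    systems = {
        "Nehalem X5550 2.66GHz": (2, 4, 2.66),
        "Istanbul 2435 2.6GHz": (2, 6, 2.6),
    }
    for name, (sockets, cores, ghz) in systems.items():
        peak = theoretical_peak_gflops(sockets, cores, ghz)
        print(f"{name}: {peak:.2f} GFLOPS theoretical peak")
```

The same function with `flops_per_cycle=2` gives the 17.6 GFLOPS figure quoted for the older dual core 2.2 GHz Opteron system.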


When viewing HPL results there are two interesting figures to look at: the Actual Peak, which is what the benchmark measures, and the Theoretical Peak, which is the best performance the processors can provide. The ratio of the two is referred to as the efficiency. We've also included the rough prices of the systems and a dollars per GFLOP rating. As you can see, AMD beats Intel on GFLOPS per dollar and peak performance, but loses on overall efficiency. This shows us that while the 6 cores per CPU that AMD Istanbul offers provide better raw horsepower, the overall system architecture is not as balanced as Intel's Nehalem. The lower efficiency rating is most likely caused by the lack of memory bandwidth and by increased cache snoops in the Istanbul system. The CPUs sit idle for longer while waiting for data from main memory and while checking for cache hits across all of the system's 12 cores. Memory bandwidth can have a huge impact on overall system performance, but it is beyond the scope of this document; it will be covered in a later post.
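The derived columns follow directly from the measured numbers. A minimal sketch, using the figures from the table (the class and field names are hypothetical, and the last digit can differ from the table depending on rounding):

```python
from dataclasses import dataclass


@dataclass
class HplResult:
    name: str
    theoretical_gflops: float
    measured_gflops: float
    node_cost_usd: float

    @property
    def efficiency(self) -> float:
        """Measured performance as a fraction of theoretical peak."""
        return self.measured_gflops / self.theoretical_gflops

    @property
    def dollars_per_gflop(self) -> float:
        """Node cost divided by measured (not theoretical) GFLOPS."""
        return self.node_cost_usd / self.measured_gflops


# Values copied from the results table
nehalem = HplResult("Nehalem X5550", 85.12, 74.03, 3800.00)
istanbul = HplResult("Istanbul 2435", 124.8, 99.38, 3500.00)

if __name__ == "__main__":
    for r in (nehalem, istanbul):
        print(f"{r.name}: {r.efficiency:.2%} efficient, "
              f"${r.dollars_per_gflop:.2f} per GFLOP")
```

Dividing cost by measured rather than theoretical GFLOPS is what lets the Istanbul system come out ahead on price despite its lower efficiency.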

So, while the Nehalem may have the best performance per core and higher efficiency, the Istanbul does a good job of making up for its deficiencies with additional cores. When choosing a system architecture for your next cluster, HPL should be only one of the benchmarks you use in your evaluation. We will be updating this blog with more performance results and benchmarks over the next couple of months. If you have any ideas or code that you'd like to see tested, please let us know: send an email to info@advancedclustering.com

     Copyright © 2001-2011 LinuxHPC.org
Linux is a trademark of Linus Torvalds
All other trademarks are those of their owners.