Written by Shane Corder - Cluster Engineer
Tuesday, 16 June 2009 16:23
This year has brought big advances to the CPU industry with the arrival of the Intel Xeon 5500 series "Nehalem" and the AMD Opteron 2400 series "Istanbul". While many benchmarks comparing the two systems have been published, they all seem to fall a bit short of an accurate comparison of the two platforms. The HPC industry standard for benchmarking is HPL (High Performance Linpack), on which the "Top 500" list is based. Many industry insiders look to the results of this benchmark for insight into how their systems will perform.
This benchmark was performed with a single goal in mind: to show the peak performance in terms of GFLOPS (billions of floating point operations per second). The maximum theoretical GFLOPS per system varies widely with the number of cores, the clock speed, and the IPC (instructions per cycle, here meaning the number of floating point operations each core can complete per clock cycle). CPUs from just a few years ago could only manage an IPC of 2, but today's newer architectures reach an IPC of 4. To give you a comparison, an older dual core Opteron running at 2.2 GHz had a theoretical peak of only 17.6 GFLOPS per machine. With today's new CPUs, a quad core Opteron has a theoretical peak of 70.6 GFLOPS. That is roughly twice the performance per core, because the CPU can handle an IPC of 4 compared to 2. Even though both the Nehalem and Istanbul systems provide an IPC of 4, there are other architectural design decisions that can impact performance. Both have on-processor memory controllers and large caches, but core counts, processor interconnect speeds, and memory configurations all vary.
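The arithmetic behind those theoretical peaks can be sketched in a few lines of Python. This is only an illustration, and it rests on two assumptions that are not stated above: that each machine has two sockets (which is what makes the 17.6 GFLOPS figure work out), and that the quad core Opteron also runs at 2.2 GHz. Under those assumptions the formula gives 70.4 GFLOPS, close to the 70.6 quoted above. The function name is our own.

```python
def theoretical_peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle):
    """Peak GFLOPS = sockets x cores per socket x clock (GHz) x FLOPs per cycle per core."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle


# Assumption: two sockets per machine, both CPUs clocked at 2.2 GHz.
dual_core = theoretical_peak_gflops(sockets=2, cores_per_socket=2, clock_ghz=2.2, flops_per_cycle=2)
quad_core = theoretical_peak_gflops(sockets=2, cores_per_socket=4, clock_ghz=2.2, flops_per_cycle=4)

print(f"dual core Opteron: {dual_core:.1f} GFLOPS per machine")
print(f"quad core Opteron: {quad_core:.1f} GFLOPS per machine")

# Per-core comparison: 4 cores in the older machine, 8 in the newer one.
print(f"per-core speedup: {(quad_core / 8) / (dual_core / 4):.1f}x")
```

Doubling the operations per cycle doubles the per-core peak at the same clock speed, which is the "roughly twice" mentioned above.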
HPL benchmarks that have been released by others appear to have no standardization of the running parameters. Our benchmarks were run with the same compiler, MPI, and linear algebra library on both systems. In our recent testing and through real world experience, we have found that the Intel compilers and Intel Math Kernel Library (MKL) usually provide the best performance. Instead of just settling on Intel's toolkit, we tried various compilers, including Intel, GNU, and Portland Group. We also tested various linear algebra libraries, including MKL, AMD Core Math Library (ACML), and libGOTO from the University of Texas. All of the testing showed that we could achieve the highest performance when using both the Intel compilers and Intel MKL, even on the AMD system, so these were used as the base of our benchmarks.
The benchmarks were run on an Opteron 2435 Istanbul system (6 core 2.6GHz processor with 16GB of 800MHz DDR2) and an X5550 Nehalem system (quad core 2.66GHz processor with 12GB of 1333MHz DDR3). An attempt was made to keep the systems identical in every other way. The same power supply, hard drive, and operating system were used (even though these parameters shouldn't affect the performance of HPL).
The amount of RAM differs because the Nehalem performs best when using its triple-channel memory architecture, versus the Opteron's dual channel. Since HPL performs best when using as much memory as it can, we adjusted the problem size (N in the HPL configuration file) to use as close to 100% of the RAM on the system as possible.
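As a rough guide to how N relates to memory: HPL factors a dense N x N matrix of double precision values (8 bytes each), so the matrix needs about 8 x N^2 bytes. The sketch below picks the largest N that fits in a given fraction of RAM and rounds it down to a multiple of the block size. The function name, the block size of 128, and the treatment of "GB" as 2^30 bytes are assumptions made for illustration, not the values used in our runs.

```python
import math


def hpl_problem_size(mem_bytes, fraction, block_size):
    """Largest N (a multiple of block_size) whose N x N matrix of
    8-byte doubles fits within `fraction` of `mem_bytes`."""
    usable_bytes = int(mem_bytes * fraction)
    n = math.isqrt(usable_bytes // 8)        # N^2 * 8 bytes <= usable_bytes
    return (n // block_size) * block_size    # round down to a multiple of NB


GIB = 2**30  # assumption: "GB" in the system specs means 2^30 bytes

# fraction=1.0 is the upper bound; in practice the OS needs some memory too,
# so a slightly smaller fraction is used.
for label, mem in (("Istanbul, 16GB", 16 * GIB), ("Nehalem, 12GB", 12 * GIB)):
    for fraction in (1.0, 0.9):
        n = hpl_problem_size(mem, fraction, block_size=128)
        print(f"{label}: fraction={fraction:.1f} -> N={n}")
```

Rounding N down to a multiple of the block size is a common convention rather than a requirement, and the best fraction depends on what else is running on the node.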
Table: HPL results for the Nehalem X5550 2.66GHz and Istanbul 2435 2.6GHz systems (columns include Problem Size (N) and $ per GFLOP).
When viewing HPL results there are two interesting figures to look at: the Actual Peak, which is what the benchmark measures, and the Theoretical Peak, which is the best performance the processor can provide in theory. The ratio of the two is referred to as the efficiency. We've also included the rough prices of the systems and a price per GFLOP rating. As you can see, AMD beats Intel on price per GFLOP and peak performance, but loses on overall efficiency. This shows us that while the 6 cores per CPU that AMD Istanbul offers provide better raw horsepower, the overall system architecture is not as balanced as Intel's Nehalem. The lower efficiency rating is most likely caused by limited memory bandwidth and increased cache snooping in the Istanbul system. The CPUs sit idle for longer while waiting for data from main memory and while checking for cache hits across all of the system's 12 cores. Memory bandwidth can have a huge impact on overall system performance, but a full treatment is beyond the scope of this document; it will be covered in a later post.
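For readers who want to compute the same ratings for their own systems, the two derived figures are simple ratios. The sketch below uses made-up round numbers purely to show the arithmetic; they are not the measurements from our runs, and the function names are our own.

```python
def efficiency(actual_gflops, theoretical_gflops):
    """Fraction of the theoretical peak that HPL actually achieved."""
    return actual_gflops / theoretical_gflops


def dollars_per_gflop(system_price, actual_gflops):
    """System price divided by measured HPL performance (lower is better)."""
    return system_price / actual_gflops


# Illustrative placeholder values only, NOT results from this article.
measured = 80.0      # GFLOPS reported by HPL
peak = 100.0         # theoretical peak GFLOPS
price = 4000.0       # rough system price in dollars

print(f"efficiency:  {efficiency(measured, peak):.1%}")
print(f"$ per GFLOP: {dollars_per_gflop(price, measured):.2f}")
```

A system can win on price per GFLOP while losing on efficiency, since the first depends on price and measured performance and the second only on how close the measurement comes to the theoretical peak.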
So, while the Nehalem may have the best performance per core and higher efficiency, the Istanbul does a good job of making up for its deficiencies with additional cores. When choosing a system architecture for your next cluster, HPL should be only one of the benchmarks you use in your evaluation. We will be updating this blog with more performance results and benchmarks over the next couple of months. If you have any ideas or code that you'd like to see tested, please let us know by sending an email to email@example.com.