SpyderByte.com ;Technical Portals 
 The #1 Site for News & Information Related to Linux High Performance Technical Computing, Linux High Availability and Linux Parallel Clustering

New Features in Linux 2.6 - Key Performance Improvements Summarized
Posted by Philip Carinhas, Tuesday March 07 2006 @ 05:36PM EST

Over the last several years, Linux has gained acceptance as the operating system of choice in many scientific and commercial environments. Its performance has improved significantly compared to traditional UNIX flavors, particularly on smaller SMP systems with up to 4 processors. More recently, the emphasis has shifted to Linux performance in mid-range to high-end enterprise-class environments, consisting of SMP systems that are configured with 64 CPUs. The scalability and performance of Linux 2.6 are therefore paramount for applications that must scale to high CPU counts on large systems. This article highlights some of the performance and scalability improvements of the Linux 2.6 kernel.

The Virtual Memory (VM) Subsystem

Most modern computer architectures support more than one memory page size. To illustrate, the IA-32 architecture supports 4KB pages as well as 4MB large pages (2MB when PAE is enabled). The Linux 2.4 kernel utilized large pages only for mapping the kernel image. Large pages are primarily intended to improve the performance of high performance computing applications, as well as database applications that have large working sets. More generally, any memory-access-intensive application that utilizes large amounts of virtual memory may benefit from large pages. Linux 2.6 can utilize 2MB or 4MB large pages, AIX uses 16MB large pages, and Solaris large pages are 4MB in size. The performance improvements are attributable to fewer translation lookaside buffer (TLB) misses, because each TLB entry covers a much larger range of memory. Large pages also improve memory prefetching by eliminating the need to restart prefetch operations on 4KB boundaries.
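The TLB argument can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the 1GB working set and the 64-entry TLB are hypothetical figures chosen for illustration, and the helper names are invented here. It shows how many page translations a working set needs at each page size, and how much memory a fixed number of TLB entries can cover.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def pages_needed(working_set_bytes, page_size_bytes):
    """Pages (and therefore distinct translations) required to map the working set."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division


def tlb_reach(entries, page_size_bytes):
    """Bytes of memory that can be addressed without a TLB miss."""
    return entries * page_size_bytes


if __name__ == "__main__":
    working_set = 1 * GB   # hypothetical application working set
    tlb_entries = 64       # hypothetical TLB size, varies by CPU
    for label, size in (("4KB", 4 * KB), ("2MB", 2 * MB), ("4MB", 4 * MB)):
        print(label,
              "pages needed:", pages_needed(working_set, size),
              "TLB reach (KB):", tlb_reach(tlb_entries, size) // KB)
```

With these assumed numbers, a 4KB page size needs several hundred thousand translations for the working set, while a 4MB page size needs only a few hundred, which is why large pages reduce TLB pressure.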

CPU Scheduler

The Linux 2.6 scheduler is a multi-queue scheduler that assigns a run-queue to each CPU, promoting a local scheduling approach. The previous incarnation of the Linux scheduler kept all runnable tasks on a single run-queue, implemented as a linked list of threads, and utilized the concept of goodness to determine which thread to execute next. In Linux 2.6, the single run-queue lock was replaced with a per-CPU lock, ensuring better scalability on SMP systems. The new per-CPU run-queue scheme decomposes the run-queue into a number of buckets (in priority order) and utilizes a bitmap to identify the buckets that hold runnable tasks. Locating the next task to execute requires only a read from the bitmap to identify the first bucket with runnable tasks, followed by choosing the first task in that bucket's queue.
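The bucket-and-bitmap lookup described above can be sketched in a few lines. This is an illustrative model only, not kernel code: the class name, the choice of 8 priority levels, and the convention that a lower number means a higher priority are assumptions made for the example.

```python
from collections import deque

NUM_PRIORITIES = 8  # illustrative; chosen small for readability


class RunQueue:
    """Toy per-CPU run-queue: one FIFO bucket per priority plus a bitmap."""

    def __init__(self, num_priorities=NUM_PRIORITIES):
        self.buckets = [deque() for _ in range(num_priorities)]
        self.bitmap = 0  # bit p is set exactly when bucket p is non-empty

    def enqueue(self, task, priority):
        self.buckets[priority].append(task)
        self.bitmap |= 1 << priority

    def pick_next(self):
        if self.bitmap == 0:
            return None  # nothing runnable on this CPU
        # Isolate the lowest set bit: the first bucket holding runnable tasks.
        priority = (self.bitmap & -self.bitmap).bit_length() - 1
        bucket = self.buckets[priority]
        task = bucket.popleft()
        if not bucket:
            self.bitmap &= ~(1 << priority)  # bucket drained, clear its bit
        return task


# One independent run-queue (and, in a real kernel, one lock) per CPU.
per_cpu_queues = [RunQueue() for _ in range(4)]
```

Because the lookup inspects a fixed-size bitmap rather than walking a list of tasks, its cost in this model does not grow with the number of runnable tasks.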

Linux 2.6 also provides a Non-Uniform Memory Access (NUMA) aware extension to the new scheduler. The focus is on increasing the likelihood that memory references are local rather than remote on NUMA systems. The NUMA-aware extension augments the existing CPU scheduler implementation via a node-balancing framework. Next to the preemptible kernel support in Linux 2.6, the Native POSIX Threading Library (NPTL) represents the next-generation POSIX threading solution for Linux, and has therefore received a lot of attention from the performance community. The new threading implementation has several major advantages, such as in-kernel POSIX signal handling. In a well-designed multi-threaded application, fast user space synchronization (futex) can be utilized. In contrast to Linux 2.4, the futex framework avoids a scheduling collapse during heavy lock contention among different threads.

I/O Scheduling

The I/O scheduler in Linux is the interface between the generic block layer and the low-level device drivers. The block layer provides functions that are utilized by file systems and the virtual memory manager to submit I/O requests to block devices. As prioritized resource management seeks to regulate the use of a disk subsystem by an application, the I/O scheduler is considered an important kernel component in the I/O path.

It is further possible to tune disk usage in the kernel layers above and below the I/O scheduler. One option is to adjust the I/O pattern generated by the file system or the virtual memory manager (VMM). Another option is to adjust the way specific device drivers or device controllers handle the I/O requests. Further, a new read-ahead algorithm designed and implemented by Dominique Heger and Steve Pratt for Linux 2.6 significantly boosts read I/O throughput for all of the I/O schedulers discussed below.

The Deadline I/O scheduler available in Linux 2.6 incorporates a per-request expiration based approach, and operates on five I/O queues. The basic idea behind the implementation is to aggressively reorder requests to improve I/O performance while simultaneously ensuring that no I/O request is starved. More specifically, the scheduler introduces the notion of a per-request deadline, which is used to assign a higher preference to read requests than to write requests. To summarize, read requests are to be satisfied within a specified, short time period, whereas write requests are given a considerably longer expiration time. When the block device driver is ready to launch another disk I/O request, the core algorithm of the deadline scheduler is invoked. In simplified form, the first action is to check whether I/O requests are waiting in the dispatch queue; if so, there is no additional decision to be made on what to execute next. Otherwise, a new set of I/O requests has to be moved to the dispatch queue.
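The interplay between reordering and expiration can be illustrated with a reduced model. The sketch below is an assumption-laden simplification, not the kernel's implementation: it models a single pair of queues (one sorted by sector, one in arrival order) for one request direction, omits the dispatch queue and batching, and the class name, method names, and the 0.5 second expiration used in the usage lines are all hypothetical.

```python
import bisect
from collections import deque


class DeadlineQueue:
    """Toy model: the same requests held in sector order and in arrival order."""

    def __init__(self, expire):
        self.expire = expire   # seconds a request may wait before it must be served
        self.by_sector = []    # sorted list of (sector, arrival_time)
        self.fifo = deque()    # the same tuples, oldest first

    def add(self, sector, now):
        req = (sector, now)
        bisect.insort(self.by_sector, req)
        self.fifo.append(req)

    def next_request(self, now, head_sector):
        if not self.fifo:
            return None
        oldest = self.fifo[0]
        if now - oldest[1] >= self.expire:
            req = oldest  # deadline passed: serve it regardless of position
        else:
            # Otherwise continue in ascending sector order from the head
            # position, wrapping around to the lowest sector at the end.
            i = bisect.bisect_left(self.by_sector, (head_sector, float("-inf")))
            req = self.by_sector[i % len(self.by_sector)]
        self.by_sector.remove(req)
        self.fifo.remove(req)
        return req[0]


if __name__ == "__main__":
    q = DeadlineQueue(expire=0.5)
    q.add(900, now=0.0)
    q.add(100, now=0.1)
    q.add(120, now=0.2)
    print(q.next_request(now=0.3, head_sector=90))   # nearest sector ahead of the head
    print(q.next_request(now=0.6, head_sector=100))  # oldest request has expired
```

In this model the request for sector 900 is passed over while nearer requests are available, but it is served as soon as its expiration time is reached, which is the starvation guarantee the text describes.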

The Anticipatory I/O scheduler's design attempts to reduce the per-thread read response time. It introduces a controlled delay component into the dispatching equation. The delay is invoked before a new request is dispatched to the device driver, thereby allowing a thread that has just finished its I/O request to submit a new one. Based on locality, this increases the chances that the next request dispatched will require only a small seek. The tradeoff between reduced seeks and decreased disk utilization (due to the additional delay in dispatching a request) is managed by a cost-benefit calculation.
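The shape of such a cost-benefit decision can be sketched as follows. This is a deliberately simplified toy and not the formula used by the kernel: the function name, its parameters, and the millisecond figures are all assumptions introduced for illustration.

```python
def should_anticipate(expected_think_time_ms, nearby_seek_ms, pending_seek_ms):
    """Toy decision rule: is it cheaper to idle briefly than to seek away?

    expected_think_time_ms: assumed estimate of how long the thread that just
        completed a read takes to issue its next one
    nearby_seek_ms: assumed seek cost for that anticipated, nearby request
    pending_seek_ms: assumed seek cost for the best request already queued
    """
    cost_of_waiting = expected_think_time_ms + nearby_seek_ms
    cost_of_moving_on = pending_seek_ms
    return cost_of_waiting < cost_of_moving_on


if __name__ == "__main__":
    # A thread that re-issues quickly next to the disk head: waiting pays off.
    print(should_anticipate(expected_think_time_ms=2, nearby_seek_ms=1, pending_seek_ms=8))
    # A slow thread: the disk would sit idle longer than the seek it avoids.
    print(should_anticipate(expected_think_time_ms=10, nearby_seek_ms=1, pending_seek_ms=8))
```

The point of the sketch is only that the delay is worthwhile when the idle time plus a short seek costs less than the long seek it avoids; a real scheduler has to estimate these quantities from observed behavior.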

The Completely Fair Queuing (CFQ) I/O scheduler can be considered an extension of the better-known stochastic fair queuing (SFQ) scheduler implementation. Both implementations focus on the fair allocation of I/O bandwidth among all the initiators of I/O requests. An SFQ-based scheduler design was initially proposed for some network subsystems. The goal is to distribute the available I/O bandwidth as equally as possible among the I/O requests.
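One simple way to picture fair allocation among initiators is a round-robin over per-initiator queues. The sketch below illustrates that general idea only; it is not the CFQ implementation, and the class name, the `quantum` parameter, and the initiator labels are hypothetical.

```python
from collections import deque


class FairQueueSketch:
    """Toy fair queuing: one FIFO per initiator, served round-robin."""

    def __init__(self, quantum=1):
        self.quantum = quantum  # max requests taken from one initiator per round
        self.queues = {}        # initiator -> deque of pending requests

    def submit(self, initiator, request):
        self.queues.setdefault(initiator, deque()).append(request)

    def dispatch_round(self):
        """Collect up to `quantum` requests from every initiator, in turn."""
        batch = []
        for initiator in list(self.queues):
            pending = self.queues[initiator]
            for _ in range(min(self.quantum, len(pending))):
                batch.append((initiator, pending.popleft()))
            if not pending:
                del self.queues[initiator]  # drop drained queues
        return batch


if __name__ == "__main__":
    fq = FairQueueSketch(quantum=1)
    for block in (10, 11, 12):
        fq.submit("A", block)   # a busy initiator
    fq.submit("B", 500)         # an occasional initiator
    print(fq.dispatch_round())  # one request from A, one from B
```

Even though initiator A has submitted three times as many requests, B's single request is served in the first round instead of waiting behind all of A's.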

The Linux 2.6 Noop I/O scheduler can be considered a minimal I/O scheduler that performs only basic merging and sorting. Its main use is with non-disk-based block devices such as memory devices, as well as with specialized software or hardware environments that incorporate their own I/O scheduling and caching functionality and therefore require only minimal assistance from the kernel. Consequently, for large-scale I/O configurations that incorporate RAID controllers and many disk drives, the noop scheduler has the potential to outperform the other three I/O schedulers.
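On kernels that expose the per-device scheduler through sysfs, the file /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one with square brackets. The following sketch parses that format; the helper names and the default device name `sda` are assumptions for illustration, and the file may be absent on kernels or devices that do not provide it.

```python
def parse_scheduler_line(line):
    """Split a line such as 'noop anticipatory deadline [cfq]' into
    (active_scheduler, list_of_available_schedulers)."""
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # brackets mark the scheduler currently in use
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Read the scheduler setting for one block device (device name is assumed)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler_line(f.read())


if __name__ == "__main__":
    print(parse_scheduler_line("noop anticipatory deadline [cfq]\n"))
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device, which makes it straightforward to compare schedulers against a given workload.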


The Linux 2.6 kernel represents another evolutionary step forward, building upon its predecessors to boost application performance through enhancements to the VM subsystem, the CPU scheduler, and the I/O scheduler. In addition, this new version of the kernel delivers important functional enhancements in security, scalability, and networking. This outline only highlights the major performance features in Linux 2.6. Please visit the Fortuitous website at http://www.fortuitous.com for the full article on Linux 2.6 performance enhancements. Fortuitous Technologies provides high quality IT services, focusing on performance tuning and capacity planning.

     Copyright © 2001-2007 LinuxHPC.org
Linux is a trademark of Linus Torvalds
All other trademarks are those of their owners.