The Lustre File System on Copper

1. Introduction

HPC systems at ERDC employ the Lustre file system for their home directories as well as their work directories. Lustre is a high-performance parallel file system. The work directory, in particular, is dedicated to the temporary storage of large data sets produced during execution of user batch jobs, a purpose for which Lustre is especially well suited.

2. Lustre Basics

Lustre is a robust file system which consists of servers and storage. A Metadata Server (MDS) makes metadata (for example, ownership and permissions of a file or directory) available to Lustre clients. Object Storage Servers (OSSs) provide file I/O service for Object Storage Targets (OSTs) which provide the data storage. A Lustre parallel file system achieves its performance by automatically partitioning data into chunks and writing the chunks in round-robin fashion across multiple OSTs. This process, called "striping," plays a vital role in writing or reading very large files because it can significantly improve file I/O speed by eliminating single-disk I/O bottlenecks.

Striping offers two benefits for large files: 1) increased bandwidth, because multiple processes can access the same file simultaneously, and 2) more balanced usage across the pool of OSTs. However, striping also has disadvantages: 1) increased overhead from network operations and server contention, 2) potentially degraded I/O performance when striping settings are inappropriate, and 3) wasted disk space when stripes are much larger than the data written to them. Users can configure the size and number of stripes used for any file, but determining the best settings sometimes requires experimentation.

The term "stripe count" refers to the number of stripes into which a file is divided when written, in other words, the number of OSTs that are used to store the file. Thus, each stripe of the file will reside on a different OST. "Stripe size" refers to the size of the data blocks written to an OST.

Suppose, for example, that 200 MBytes are to be written to a file that was created with a stripe count of 10 and a stripe size of 1 MByte. When writing begins, ten 1-MByte blocks are written simultaneously to 10 different OSTs. Once those 10 blocks have been filled, Lustre writes another ten 1-MByte blocks to the same 10 OSTs. This process continues until the entire file has been written. Upon completion, the file exists as 20 1-MByte blocks of data on each of 10 separate OSTs.
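
The round-robin layout in this example can be checked with a little shell arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the OST numbers it prints are relative positions (0 to 9) within the file's set of stripes, not the actual OST indices, which Lustre chooses.

```shell
# Hypothetical helpers; all sizes and offsets are in MBytes.

# Print the relative OST (0 .. stripe_count-1) holding the block at an offset.
ost_for_offset() (
    offset_mb=$1
    stripe_size_mb=$2
    stripe_count=$3
    echo $(( (offset_mb / stripe_size_mb) % stripe_count ))
)

# Print how many stripe-size blocks each OST holds, assuming the file size
# divides evenly across the stripes, as it does in the example above.
blocks_per_ost() (
    file_mb=$1
    stripe_size_mb=$2
    stripe_count=$3
    echo $(( file_mb / stripe_size_mb / stripe_count ))
)

ost_for_offset 0 1 10      # first block: relative OST 0
ost_for_offset 10 1 10     # eleventh block wraps around to relative OST 0
ost_for_offset 37 1 10     # block at offset 37 MBytes: relative OST 7
blocks_per_ost 200 1 10    # 20 blocks on each of the 10 OSTs
```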

The following table lists technical specifications for the Lustre file system on Copper.

Specifications of Copper Lustre Work File System

File System    Maximum Capacity    Number of OSTs    OST Capacity    Default Stripe Count    Default Stripe Size
/lustre/work   219 TBytes          20                10.9 TBytes     2                       1 MByte

3. Lustre Stripe Guidance

As mentioned above, one of the primary benefits of striping large files is increased I/O performance when reading and writing. A secondary benefit is that spreading large files over multiple OSTs helps prevent individual OSTs from filling.

The default stripe counts and stripe sizes have been chosen to balance the needs of I/O performance and efficient use of disk space. The stripe size multiplied by the stripe count is the minimum amount of space that will be allocated for any file. For example, a file containing only 10 KBytes of actual data will still be allocated 4 MBytes of space if its stripe count is 4 and its stripe size is 1 MByte. For this reason, files smaller than 10 MBytes should be striped with a count of 1. In addition, setting the stripe count too high can degrade I/O performance. Therefore, you are urged to carefully match stripe specifications to your data.

Striping must also be compatible with the application's I/O strategy. Increasing the stripe count and/or stripe size should be done as needed for multi-node I/O, and it is strongly advised when creating files that are larger than 40 GBytes. As a rule, when writing a large volume of data, an application should try to use all the OSTs. If writing a single file, set the stripe count to the number of OSTs. When writing more files than there are OSTs, set the stripe count to 1. If the number of files being written is fewer than the number of OSTs, set the stripe count so that all OSTs will be used.
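
The file-count rules above can be sketched as a small shell function. The function name is hypothetical, the OST count defaults to the 20 OSTs of /lustre/work, and the ceiling division is only one reasonable reading of "set the stripe count so that all OSTs will be used"; it assumes Lustre spreads the stripes of the different files across different OSTs.

```shell
# Hypothetical helper: suggest a stripe count from the number of files being
# written and the number of OSTs (default 20, as on Copper's /lustre/work).
stripe_count_for_files() (
    num_files=$1
    num_osts=${2:-20}
    if [ "$num_files" -ge "$num_osts" ]; then
        # At least as many files as OSTs: one stripe per file.
        echo 1
    else
        # Fewer files than OSTs: round up so the files together can cover
        # every OST. A single file gets a stripe count equal to num_osts.
        echo $(( (num_osts + num_files - 1) / num_files ))
    fi
)

stripe_count_for_files 1     # single file: 20
stripe_count_for_files 4     # four files: 5 stripes each
stripe_count_for_files 64    # more files than OSTs: 1
```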

As a rule of thumb, use one stripe per 10-100 GBytes (one stripe per 10 GBytes is recommended), up to a maximum stripe count of 20. If you plan to use the file as job input, consider adjusting the stripe count based on the number of parallel processes. Copper storage is configured with a default stripe size of 1 MByte. In some cases, experimentation may show improved performance with stripe sizes of 2 or 4 MBytes. When writing very large files, note that the ORS tape archive system cannot archive files larger than ~800 GBytes; for performance reasons, a practical maximum of ~200 GBytes is recommended.
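
The size-based rule of thumb can be sketched in the same way. The function name is hypothetical; the defaults (one stripe per 10 GBytes, at most 20 stripes) come from the guidance above, and the result is a starting point for experimentation rather than a guaranteed optimum.

```shell
# Hypothetical helper: suggest a stripe count for a file of a given size.
#   $1 = file size in GBytes (whole number)
#   $2 = GBytes per stripe (default 10)
#   $3 = maximum stripe count (default 20)
stripe_count_for_size() (
    size_gb=$1
    gb_per_stripe=${2:-10}
    max_count=${3:-20}
    # Round up to a whole number of stripes.
    count=$(( (size_gb + gb_per_stripe - 1) / gb_per_stripe ))
    if [ "$count" -lt 1 ]; then count=1; fi
    if [ "$count" -gt "$max_count" ]; then count=$max_count; fi
    echo "$count"
)

stripe_count_for_size 5      # small file: 1
stripe_count_for_size 35     # 35 GBytes: 4
stripe_count_for_size 500    # capped at the maximum: 20
```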

In addition to striping considerations, good Lustre performance depends on avoiding small I/O requests and large numbers of files. It is better to gather small requests into a buffer and write the buffer when it is full. On Cray machines, the iobuf facility is recommended for this kind of I/O aggregation. (For more information, load the iobuf module and see "man iobuf".) Application-level I/O facilities that may offer improved performance include MPI-IO, ADIOS, NetCDF, and HDF5.
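
The buffering idea can also be illustrated at the shell level. The sketch below is not the iobuf facility; it is a stand-in that pipes many small writes through the standard dd utility, which gathers its input into 1-MByte output blocks (matching the default stripe size) before writing them. The output file name is hypothetical.

```shell
# Hypothetical output file; on Copper this would normally live under $WORKDIR.
out=${TMPDIR:-/tmp}/aggregated_records.txt

# Each printf emits one small record. Rather than appending to the file one
# record at a time, the stream passes through dd, which collects its input
# into 1024k (1-MByte) output blocks and writes each block when it is full.
i=1
while [ "$i" -le 1000 ]; do
    printf 'record %d\n' "$i"
    i=$((i + 1))
done | dd obs=1024k of="$out" 2>/dev/null

wc -l < "$out"    # number of records that reached the file
```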

4. Lustre Striping Commands

As mentioned earlier, the file system stripe count and size have default settings. Stripe parameters can be set for individual files and set or changed for directories. Directories can be given a stripe setting so that all new files created in that directory (and under any sub-directory) share that setting. Utilities and application libraries are provided to control the striping of an individual file at creation time. However, changing the stripe parameters on an existing file has no effect. You must first create an empty file with the desired striping characteristics and then write your data to it. Likewise, changing the stripe parameters on a directory does not change the striping of files already existing in that directory. Only new files created in the modified directory will inherit the changed striping.
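
Because an existing file cannot be restriped in place, one common approach is to create an empty file with the desired stripe count, copy the data into it, and then replace the original. The function below is a minimal sketch of that procedure. The function name and the temporary-file suffix are hypothetical, and the sketch assumes that cp writes into the pre-created file rather than replacing it.

```shell
# Hypothetical helper: give an existing file a new stripe count by copying it.
#   $1 = path to the existing file
#   $2 = desired stripe count
restripe_file() (
    file=$1
    count=$2
    tmp="$file.restripe.$$"
    # 1. Create an empty file that carries the desired stripe count.
    lfs setstripe -c "$count" "$tmp" || exit 1
    # 2. Copy the data into the pre-striped file.
    cp -p "$file" "$tmp" || exit 1
    # 3. Replace the original; mv does not change the striping of the new file.
    mv "$tmp" "$file"
)

# Example (run on a Lustre file system):
#   restripe_file "$WORKDIR/LargeFile" 8
```

The copy temporarily requires enough free space to hold both versions of the file.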

4.1. The lfs getstripe Command

The characteristics of a file or a directory can be found by using "lfs getstripe".

$ lfs getstripe MyDir
MyDir
stripe_count:  2 stripe_size:   1048576 stripe_offset:  -1

The output shows that files created in the directory MyDir will be stored in two stripes of 1048576 bytes (1 MByte) unless explicitly striped before writing. The stripe_offset of -1 means that each file will have an OST offset determined by Lustre (see "lfs setstripe", next).
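
In scripts, individual values can be pulled out of this output. The sketch below parses the sample output shown above with awk; the helper name is hypothetical, and because the exact output format of "lfs getstripe" varies between Lustre versions, the field labels should be checked against your system.

```shell
# Sample output copied from the example above. On a Lustre system this text
# would instead come from:  lfs getstripe MyDir
sample='MyDir
stripe_count:  2 stripe_size:   1048576 stripe_offset:  -1'

# Hypothetical helper: read getstripe output on stdin and print the value
# that follows the given label (for example, stripe_count).
getstripe_field() (
    awk -v key="$1:" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
)

printf '%s\n' "$sample" | getstripe_field stripe_count    # prints 2
printf '%s\n' "$sample" | getstripe_field stripe_size     # prints 1048576
```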

4.2. The lfs setstripe Command

To set the striping for a file or directory, use the "lfs setstripe" command:

lfs setstripe --size stripe-size --index OST-start-index --count stripe-count file-or-directory

size - # of bytes written to one OST before cycling to the next
index - starting OST (default of -1 is round robin, highly recommended)
count - # of OSTs (default is 1)

The "lfs setstripe" command can change the stripe size, but the default stripe size is recommended. The command can also set the position of the first stripe among the OSTs, called the offset (the index option above). Users should not specify an offset; instead, allow the Lustre file system to choose one.

The following creates an empty file named LargeFile with a stripe count of 8.

$ cd $WORKDIR
$ lfs setstripe -c 8 LargeFile

Next, set the stripe count to 16 for a new directory named LargeDir. Note that any subdirectories created under LargeDir will inherit its new stripe characteristics.

$ cd $WORKDIR
$ mkdir LargeDir
$ lfs setstripe -c 16 LargeDir

4.3. Striping in Practice

The "lfs setstripe" command can be placed in the PBS batch script or executed interactively before job submission. Files copied into a directory, such as with cp, cat, scp, or tar, inherit the striping of the destination directory, as do files created there during program execution. Moving a file with the mv command, however, does not change its striping characteristics. Again, note that changing the striping parameters for a directory does not change the striping of files already in that directory; only new files written into that directory inherit the revised characteristics.

Additional information can be found by viewing the lfs man page on any HPC system.