Introduction and Policy Guide

Table of Contents

1. Introduction

1.1. Document Scope and Assumptions

This document provides an introduction, listing of policies, and guidance on the use of the Open Research Systems (ORS) High Performance Computers and Servers at the Department of Defense Supercomputing Resource Center (DSRC) at the U.S. Army Engineer Research and Development Center (ERDC) in Vicksburg, Mississippi.

1.2. ERDC DSRC - Who We Are

The ERDC DSRC is one of five Department of Defense (DoD) High Performance Computing (HPC) Modernization Program sites providing HPC access to DoD users and contractors, including access to ORS systems. Access to all ERDC systems and servers is subject to DoD regulations and access controls. Users are required to sign a user agreement acknowledging applicable policies and detailing acceptable use of the systems.

1.3. What We Offer

The ERDC DSRC currently hosts the ORS High Performance Computer (HPC) and archive server.

Current Systems
  System          Description
  Copper          Cray XE6 with 14,720 compute cores on 460 compute nodes
  Archive Server  Mass Storage System with a petascale tape file system

ERDC is also home to the Data Analysis and Assessment Center (DAAC) for the HPCMP. Users may access visualization tools on Copper, such as EnSight, ParaView, VisIt, TecPlot, Matlab, and others. Users may also request custom-made scientific visualization images and animations created by DAAC personnel from user data.

1.4. HPC System Configurations

1.4.1. Copper

Copper has 460 compute nodes, each with two sixteen-core AMD Opteron 2.5-GHz processors (a total of 14,720 compute cores). Each of Copper's compute nodes has 64 GBytes of DDR3 memory, and the nodes are interconnected by a single-rail Gemini High Speed Network. All memory on a node is shared by the cores on that node, but memory is not shared across nodes. Copper has two login nodes, each with four quad-core AMD Opteron 2.7-GHz processors. Copper provides the PGI, Cray, Intel, and GNU compilers, with either the MPICH2 or Cray MPT Message Passing Interface (MPI) libraries. For additional information, see the System Configuration section of the Copper User Guide.

1.4.2. Archive Server

The ORS HPC system has access to an online archival mass storage system, Wiseman, that provides long-term storage for users' files on a petascale tape file system residing on a robotic tape library. A 26-TByte disk cache fronts the tape file system and temporarily holds files while they are being transferred to or from tape.

For additional information on using the archive server, see the Archive Server User Guide.

1.5. Requesting Assistance

The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 7:00 a.m. - 7:00 p.m. Central, Monday - Friday (excluding Federal holidays).

You can contact the ERDC DSRC directly in any of the following ways for after-hours support.

  • E-mail: dsrchelp@erdc.hpc.mil
  • Phone: 1-800-500-4722 or (601) 634-4400
  • Fax: (601) 634-5126
  • U.S. Mail:
    U.S. Army Engineer Research and Development Center
    ATTN: CEERD-IH-D HPC Service Center
    3909 Halls Ferry Road
    Vicksburg, MS 39180-6199

For more detailed contact information, see our Contact Page.

2. Policies

2.1. Interactive Use

On all ERDC HPC systems, including ORS, interactive executions on login nodes are restricted to 15 minutes of CPU time per process. Login nodes are shared by all users; if you need to run a computationally intensive application or executable for more than 15 minutes, you should use PBS to schedule access to a compute node.

Copper allows you to get an interactive login session on a compute node. To do this, you must use the PBS option "-l ccm=1" and then use the ccmlogin command to log in. (See the CCM example in the Copper User Guide.)
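
As an illustration only, the sketch below shows one way such a session might be requested. Project_ID, the queue name, and the resource values are placeholders, and the exact resource-request syntax for Copper is given in the Copper User Guide.

  # Request an interactive job with Cluster Compatibility Mode enabled.
  # Project_ID, queue, walltime, and node/core values are placeholders.
  qsub -I -A Project_ID -q debug -l ccm=1 -l walltime=01:00:00 -l select=1:ncpus=32

  # Once the job starts on the batch service node, log into your
  # assigned compute node:
  ccmlogin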

2.2. Session Lifetime

To provide users with a more secure high performance computing environment, we have implemented a lifetime limit on all terminal/window sessions. Regardless of activity, any terminal or window session connected to the ERDC DSRC will automatically terminate after 20 hours.

2.3. Purge Policy

All files in $WORKDIR are subject to purges if they have not been accessed in more than 30 days, or if space in $WORKDIR becomes critically low. Using the touch command (or similar) to prevent files from being purged is prohibited.
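
To see which of your files are at risk, the standard Linux find command can list files in $WORKDIR that have not been accessed in more than 30 days (shown here as a simple illustration, not a site-specific tool):

  # List files under $WORKDIR whose last access time is more than 30 days old.
  find $WORKDIR -type f -atime +30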

2.4. Special Requests

All special requests for allocated HPC resources, including increased priority within queues, increased queue limits on the maximum number of CPUs and wall time, and dedicated use, should be directed to the HPC Help Desk.

3. Using HPC Resources

3.1. File Systems

3.1.1. $WORKDIR

$WORKDIR (/work) is the local temporary file system (i.e., local high-speed disk) available on all ERDC DSRC HPC systems and is available to all ORS users. All files in $WORKDIR that have not been accessed in more than 30 days are subject to purges.

$WORKDIR is not intended to be used as a permanent file storage area by users. This file system is periodically purged of old files based on file access time. Users are responsible for saving their files to archive or transferring them to their own systems in a timely manner. Disk space limits, or quotas, may be imposed on users who continually store files on $WORKDIR for more than 30 days.

The $WORKDIR file system is NOT backed up or exported to any other system. If files or directory structures are deleted, or if a catastrophic disk failure occurs, they cannot be recovered.

REMEMBER: It is your responsibility to transfer files that need to be saved to a location that allows for long-term storage, such as your archival ($ARCHIVE_HOME) or, for smaller files, home ($HOME) directory locations. Please note that your archival storage area has no quota, but your home directory does!

3.1.2. $HOME

When you log on, you will be placed in your home directory. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes and may be used to store small user files, but it has limited capacity and is not backed up; therefore, it should not be used for long-term storage.

IMPORTANT: a hard limit of 30 GBytes is imposed on content in $HOME.
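
To check how much of this limit you are using, a standard disk-usage command is sufficient (a simple illustration, not a site-specific quota tool):

  # Report the total size of your home directory.
  du -sh $HOME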

3.1.3. Archival File System

The archival file system consists of a tape library and a 204-TByte disk cache. Files transferred to the archive server must be temporarily stored on the disk cache before they can be written to tape. Similarly, files being retrieved from tape are temporarily stored on the disk cache before being transferred to the destination system. The system has only a few tape drives, and these are shared by all users and by ongoing tape file system maintenance.

As with all tape file systems, writing one large tarball containing hundreds of files is quicker and easier on the tape drives than writing hundreds of smaller files. This is because writing only one file requires only one tape mount, one seek, and one write, but writing hundreds of small files may require many tape mounts, many seeks and many writes. For the same reason, retrieving one tarball from tape is much quicker and easier than retrieving hundreds of individual files. The recommended maximum size of a single tarball is 200 to 300 GBytes. Tarballs larger than 300 GBytes will take an inordinate amount of time to transfer to or from archive, and larger files can prevent efficient packing of files on tapes.
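
As an illustration of this guidance (directory and file names below are placeholders), bundle a directory of many small files into one tarball in $WORKDIR before sending it to the archive:

  # Create one tarball from a directory containing many small files.
  cd $WORKDIR
  tar -cf results_set01.tar results_set01/

  # Confirm the tarball falls within the recommended 200-300 GByte range.
  ls -lh results_set01.tar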

For additional information on using the archive server, see the Archive Server User Guide.

3.2. Network File Transfers

The preferred method for transferring files over the network is to use the encrypted (Kerberized) file transfer programs scp or sftp. In cases where large numbers of files (> 1,000) and large amounts of data (> 100 GBytes) must be transferred, contact the HPC Help Desk for assistance with the process. Depending on the nature of the transfer, transfer time may be improved by reordering the data retrieval from tapes, taking advantage of available bandwidth to/from the Center, or dividing the transfer into smaller parts; the ERDC DSRC staff will assist you to the extent that they are able. Limitations such as available resources and network problems outside the Center can be expected, so you should allow sufficient time for transfers.
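
For example (a sketch only; the host name and paths below are placeholders rather than verified system addresses), a single file can be copied from Copper to a local workstation, or an interactive transfer session can be opened:

  # Copy one file from Copper to the current directory on your workstation.
  scp username@copper.erdc.hpc.mil:/work/username/results_set01.tar .

  # Or browse and transfer files interactively.
  sftp username@copper.erdc.hpc.mil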

For additional information on file transfers, see the "File Transfers" section of the Copper User Guide.

3.3. Cray XE6 (Copper)

The Crays have a full Linux OS on their login nodes, but the compute nodes run the Cray Linux Environment (CLE), a lightweight kernel designed for computational performance. The nodes are interconnected via a Gemini High Speed Network.

PBS batch scripts on the Crays run on service nodes with a full Linux OS. These service nodes are shared by all running batch jobs. You must use the aprun command to launch your parallel executable on your job's assigned compute nodes.
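
The minimal batch script below is a sketch only; Project_ID, the queue name, and the resource values are placeholders, and the exact resource-request syntax for Copper is given in the Copper User Guide.

  #!/bin/bash
  #PBS -A Project_ID
  #PBS -q standard
  #PBS -l walltime=02:00:00
  #PBS -l select=2:ncpus=32:mpiprocs=32

  cd $WORKDIR
  # Everything above runs on a shared batch service node; aprun launches
  # the parallel executable on the job's assigned compute nodes.
  aprun -n 64 ./my_mpi_program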

By default on the Crays, codes are compiled statically using the compiler wrapper commands ftn, cc, and CC. These commands invoke the corresponding compilers from the currently loaded environment module. On the Crays, the Portland Group (PGI) compiler is loaded by default. You can use the "module swap" command to switch to another version of PGI or to an entirely different compiler suite.
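
For example, you can swap to the GNU suite and rebuild with the wrappers. The PrgEnv module names below follow the usual Cray convention and should be confirmed with "module avail"; source and executable names are placeholders.

  # Swap the default PGI programming environment for the GNU suite.
  module swap PrgEnv-pgi PrgEnv-gnu

  # The wrappers invoke whichever compiler suite is currently loaded.
  ftn -o my_fortran_prog my_prog.f90
  cc  -o my_c_prog my_prog.c
  CC  -o my_cxx_prog my_prog.cpp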

Dynamically linked executables can also be compiled and run on the Crays' compute nodes. For examples of compiling dynamically linked executables, see $SAMPLES_HOME/Programming/CCM-DSL_Example on the Crays. By building dynamically linked executables, you can define static arrays that are larger than 2 GBytes.
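
A hedged sketch of such a build follows; the -dynamic and -mcmodel=medium flags shown are typical for the Cray wrappers and for static data larger than 2 GBytes, but the CCM-DSL sample referenced above is the authoritative example for this site.

  # Build a dynamically linked executable; -mcmodel=medium permits static
  # data larger than 2 GBytes (flags are illustrative; see the sample).
  cc -dynamic -mcmodel=medium -o my_dynamic_prog my_prog.c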

On the Crays under CLE, using Cluster Compatibility Mode (CCM), you can log directly into a compute node from an interactive PBS batch job. To do this, use the PBS option "-l ccm=1", and then use the ccmlogin command once the job starts on the batch service node. Also under CCM, with the PBS option "-l ccm=1", you can run UNIX commands, scripts, or serial executables on your job's first assigned compute node using the ccmrun command. See the examples in $SAMPLES_HOME/Parallel_Environment/CCMRUN_Example on the Crays.
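
A minimal ccmrun sketch is shown below; Project_ID, the queue name, and the resource values are placeholders, and the CCMRUN_Example above is the authoritative version.

  #!/bin/bash
  #PBS -A Project_ID
  #PBS -q standard
  #PBS -l walltime=01:00:00
  #PBS -l select=1:ncpus=32
  #PBS -l ccm=1

  cd $WORKDIR
  # ccmrun executes the serial tool on the job's first assigned compute
  # node instead of on the shared batch service node.
  ccmrun ./my_serial_postprocessor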

IMPORTANT! On the Crays, PBS batch scripts run on a batch service node shared by other users' PBS batch scripts. Commands like aprun, ccmlogin, and ccmrun are the only way to send work to your assigned compute nodes. Do not perform computationally intensive work directly in your batch script; use these commands to send such tasks to the compute nodes.

For additional information on using Copper, see the Copper User Guide.