When you log in, you will be directed to one of the login nodes. The login nodes should only be used for basic tasks such as file editing, code compilation, and job submission. Please do not run production jobs on the login nodes. If you run a production job on a login node, it will be administratively terminated. Instead, use the cluster's compute resources for production jobs. In this document, you will learn how to submit jobs to the cluster. To learn about how the recent node sharing implementation affects jobs, please review the Job Resource Requirements section.
The priority of a job influences how quickly it executes on the cluster’s compute resources. The general factors that affect priority are listed below.
- Jobs that request more nodes receive higher priority.
- The longer a job waits in the queue, the higher its priority.
- The number of jobs a single user submits affects the priority of those jobs. At the time of this writing, at most twenty jobs submitted by the same user can execute simultaneously. This limitation only applies to jobs submitted to the institutional condos.
Job Access Control
The cluster uses a variety of mechanisms to determine how to schedule jobs. All these mechanisms can be manipulated by users to ensure that their jobs are scheduled and executed in a reasonable time period.
In the context of the cluster, a condo is a logical group of compute nodes that acts as an independent cluster to effectively schedule jobs. Institutional and private condos exist on the cluster. All users belong to their respective institutional condo, whether that be UTK or UTHSC. Private condos are owned and used by individual investors and their associated projects. With a private condo, the investor and their project have exclusive use of the compute nodes in the condo. Investors can choose to share their private condo with the entire cluster under the Responsible Node Sharing implementation, but this is not a requirement.
Projects and Reservations
At its core, a project is simply a standard UNIX / Linux group. In the cluster’s condo-based scheduling model, however, projects are a key component. Projects control access to institutional and private condos. For UTK institutional users, the default project is ACF-UTK0011. For UTHSC institutional users, the default project is ACF-UTHSC0001. Other projects follow the same project identifier format. To determine which projects you belong to, please log in to the User Portal and look under the “Projects” header. Always ensure you use the correct project when submitting jobs.
Reservations are special allocations granted to a project. A reservation grants exclusive access to a set of nodes for a specific time, such as two days. All the users in the reservation should receive the reservation’s identifier, but if they do not, the showres command will output all the reservations assigned to the user. If applicable, always ensure you use the correct reservation when submitting jobs.
The scheduler uses queues to organize jobs. At the time of this writing, the cluster uses the batch queue and the debug queue. The former queue is the default queue to which all jobs are submitted. Users are not required to specify this queue when they submit jobs. The latter queue must be specified to the scheduler. Figure 2.1 shows how to use this queue for job scripts, while Figure 2.2 shows how to use this queue for interactive jobs. In general, the debug queue should only be used to test code. All jobs submitted to the debug queue are limited to an hour of walltime.
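Because Figures 2.1 and 2.2 are not reproduced here, the following is a minimal sketch of how the debug queue might be requested. It relies on the standard qsub -q option. The resource values, the file name debug_job.pbs, and the program ./test_code are placeholder assumptions, not values taken from the figures.

```shell
# Write a small job script that asks for the debug queue.
# The file name, resource request, and ./test_code are illustrative only.
cat > debug_job.pbs <<'EOF'
#!/bin/bash
#PBS -A ACF-UTK0011
#PBS -q debug
#PBS -l nodes=1:ppn=1,walltime=00:30:00

cd "$PBS_O_WORKDIR"
./test_code
EOF

# From a login node, submit the script with:
#   qsub debug_job.pbs
# or request the debug queue for an interactive job with:
#   qsub -I -A ACF-UTK0011 -q debug -l nodes=1:ppn=1,walltime=00:30:00
```

The thirty-minute walltime is chosen only to stay under the one-hour limit on the debug queue.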
Partitions and Features
Similar nodes are contained within partitions. In addition to grouping like nodes, partitions enable users to specify the node set they wish to use. The current partitions are listed below. At the time of this writing, jobs default to using the general, beacon, and rho partitions. Be aware that because Rho is included in the default partition, you will only receive 2GB of memory per core if you do not specify a partition or ppn value. Thus, ensure that your job uses the appropriate partition. To learn more about the nodes within each partition, review the System Overview document. To learn how to target specific partitions, please review the Writing Job Scripts document.
Some partitions consist of multiple node sets. To target specific node sets within a partition, use a feature attribute. For instance, to target the Beacon GPU nodes, use the beacon_gpu feature. The available features are listed below. More information on how to use feature attributes is in the Writing Job Scripts document. Additionally, the Targeting GPU Nodes section of this document shows how the feature attribute is used in a job script.
- sigma (general partition)
- sigma_bigcore (general partition)
- skylake (general partition)
- beacon_gpu (beacon partition)
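As a rough illustration, a job script could pair a partition with one of the features above. This fragment assumes the scheduler accepts Moab-style partition and feature requests on -l lines. That syntax is not shown in this document, so treat it as an assumption and confirm it against the Writing Job Scripts document.

```shell
# Assumed syntax: Moab-style resource extensions on -l lines.
#PBS -l partition=beacon
#PBS -l feature=beacon_gpu
```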
QoS (Quality of Service) Attributes
QoS (Quality of Service) attributes define node allocations and wallclock limitations. At the time of this writing, opportunistic users are limited to 48 jobs, 24 nodes, and 24 hours of walltime for their jobs. Table 2.1 outlines the available QoS attributes.
| QoS Attribute | Min. Allocation | Max. Allocation | Wall Clock Limit |
|---|---|---|---|
| Condo | 1 Node | Condo Max. | 28 Days |
| Campus | 1 Node | 24 Nodes | 24 Hours |
| Overflow | 1 Node | 24 Nodes | 24 Hours |
| Long (UTHSC Projects Only) | 1 Node | 24 Nodes | 6 Days |
By default, jobs run in the institutional condos use the campus QoS attribute. UTHSC users can specify the long QoS attribute if their project has access to it. With the long QoS attribute, jobs can run for up to six days. The overflow QoS attribute allows users to run their jobs in a condo and spill over into the user's institutional condo if necessary. The condo QoS attribute permits users with private condos on the cluster to run their jobs on those nodes. If you wish to run a job in a private condo, use the condo QoS attribute in your job script for non-interactive jobs or with your qsub command for interactive jobs.
In general, it is best to specify the QoS attribute that applies to your situation rather than rely on the defaults. Please review the Writing Job Scripts document to learn how to use these QoS attributes in your jobs.
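A hedged sketch of naming a QoS attribute explicitly is shown here. It assumes the scheduler accepts a Moab-style qos request on an -l line, which this document does not show. The walltime simply matches the campus limit in Table 2.1.

```shell
# Assumed syntax: qos passed as a Moab-style resource extension.
#PBS -A ACF-UTK0011
#PBS -l qos=campus
#PBS -l walltime=24:00:00

# Interactive equivalent under the same assumption:
#   qsub -I -A ACF-UTK0011 -l qos=campus,walltime=24:00:00
```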
Job Resource Requirements
Responsible node sharing on the cluster has altered the process for job submission. Previously, a single-node job would consume an entire node's resources. Now, multiple jobs can share the same node. This increases the cluster's throughput and resource utilization, benefiting users and administration. More information is available in the Responsible Node Sharing document. Rather than deal with what node sharing is, this section deals with how it affects the resources allocated to jobs.
To facilitate node sharing, the resource manager applies default resource allocations to jobs that do not specify what they require. By default, a job that does not specify the resources it requires will use the default partition, which consists of the beacon, rho, and general partitions. The job will receive a single core and 2GB of memory. Job scripts that previously worked on the cluster will likely not function in the same way due to these changes. When the job is submitted, the resource manager informs the user about these default values and provides guidance on requesting additional resources. Figure 3.1 shows this output from a submitted interactive job that does not specify its resource requirements.
In Figure 3.1, observe that the resource manager allocates a single core to the job. This value can be changed by users with the ppn (processors per node) option. It is specified on the same line as the -l nodes=<num-nodes> option. Figure 3.2 shows how the ppn option is used in the context of a job script. The usage is the same in an interactive job minus the #PBS directive prefix.
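Since Figure 3.2 is not reproduced here, the following sketch shows one way the ppn option might appear in a job script. The single node and the one-hour walltime are assumptions. The eight cores match the value discussed for Figure 3.2. The body counts the lines of $PBS_NODEFILE, which Torque fills with one line per allocated core, as a quick check of what the job received.

```shell
#!/bin/bash
#PBS -A ACF-UTK0011
#PBS -l nodes=1:ppn=8,walltime=01:00:00

# Inside a running job, $PBS_NODEFILE names a file with one line per
# allocated core. Outside a job it is unset, so the count stays at 0.
cores=0
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    cores=$(wc -l < "$PBS_NODEFILE")
fi
echo "Cores allocated to this job: $cores"
```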
The value specified for the ppn option directly determines the amount of memory allocated to the job. The higher the ppn value is, the more memory the job receives. Users cannot manually specify their memory requirements in job scripts or with the qsub command. They must provide a higher ppn value to obtain more memory for the job. If they attempt to specify the amount of memory they desire, the resource manager will reject the job. In Figure 3.2, the job would receive eight cores' worth of memory, which will vary with the partition and feature. Another important factor to consider with the ppn option is the partition and feature in use by the job. A job that runs on the beacon partition receives 16GB of memory per core, while the same job on the rho partition receives 2GB of memory per core. The general partition has additional caveats because three node sets belong to it. The formula that calculates the memory allocation a job receives is depicted in Figure 3.3. For reference purposes, Table 3.1 documents the amount of memory available per core on each node set in the cluster.
|Partition||Feature||Memory per Core (MB)|
Use Table 3.1 when determining which partitions and features to use with your job. Note that the default partition, which is used when no partition is specified, uses Rho's per-core memory amount of 2048MB. The general partition uses sigma_bigcore's per-core memory amount of 4682MB. If you wish to receive more memory, specify a partition and feature that provides a higher per-core memory amount. Additional information is available in the Partitions and Features section of this document. The hardware resources available to each node set are documented in the System Overview document.
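The relationship between ppn and memory can be sketched with simple arithmetic. This example assumes the allocation on a node is the ppn value multiplied by the per-core memory amount. Figure 3.3 is not reproduced here, so that formula is an assumption. The two per-core values come from the paragraph above.

```shell
# Estimate per-node memory as ppn x memory-per-core (MB).
# Assumption: this mirrors the formula in Figure 3.3, which is not shown here.
ppn=8
rho_mb_per_core=2048         # default partition (Rho's per-core amount)
bigcore_mb_per_core=4682     # general partition (sigma_bigcore's per-core amount)

default_mb=$((ppn * rho_mb_per_core))
general_mb=$((ppn * bigcore_mb_per_core))

echo "default partition, ppn=$ppn: ${default_mb} MB"
echo "general partition, ppn=$ppn: ${general_mb} MB"
```

Under that assumption, the same ppn=8 request receives more than twice the memory on the general partition as it does on the default partition.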
In scenarios where your job requires all the resources available to a node, the resource manager provides the option to make the job node-exclusive. Node-exclusivity allocates an entire node to the job so that it does not share resources with other jobs. Be aware that using this option may delay the execution of your job. Figure 3.4 shows the PBS directive to include in a job script to specify node-exclusivity. The option is the same for interactive jobs minus the #PBS directive prefix.
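Figure 3.4 is not reproduced here. As a hedged sketch, Moab documents a naccesspolicy=singlejob resource extension that requests a node for a single job. Whether Figure 3.4 uses exactly this directive is an assumption, so verify it against the Writing Job Scripts document before relying on it.

```shell
# Assumed directive: Moab's node access policy extension.
#PBS -l nodes=1
#PBS -l naccesspolicy=singlejob
```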
For more information on writing job scripts that include these options, please review the Writing Job Scripts document. To learn how to monitor and troubleshoot jobs with problems related to node sharing, please refer to the Monitoring and Modifying Jobs document.
Non-interactive batch jobs are submitted to the scheduler using job scripts. These scripts contain PBS directives and shell commands. To learn more about job scripts, please visit the Writing Job Scripts document. Please be aware that if you set your job to be node-exclusive, it may take longer for your job to run depending on resource availability.
If you already have a job script ready, then follow the process outlined below to submit your job to the cluster.
- Change your directory to Lustre scratch space. Figure 4.1 shows the command that will place you in this space. All non-interactive batch jobs should be submitted from this space for the best performance.
- Use the qsub command to submit the job. Figure 4.2 shows the syntax for this command.
- If successful, a job identifier should appear to indicate that the job was submitted. Execute qstat -a to verify that the job was submitted and is queued. A “Q” should appear under the “S” column of qstat. For more information on monitoring your jobs, please review the Monitoring and Modifying Jobs document.
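The steps above can be sketched as a short shell session. SCRATCHDIR and my_job.pbs are hypothetical placeholders, because Figures 4.1 and 4.2 are not reproduced here. Substitute the scratch location and script name that apply to you. The guard around qsub only keeps the sketch harmless when it is run somewhere other than a login node.

```shell
# SCRATCHDIR and my_job.pbs are hypothetical placeholders.
work_dir="${SCRATCHDIR:-$PWD}"
cd "$work_dir" || exit 1

if command -v qsub >/dev/null 2>&1; then
    job_id=$(qsub my_job.pbs)     # prints the job identifier on success
    echo "Submitted job: $job_id"
    qstat -a                      # look for "Q" in the "S" column
else
    echo "qsub not found; run these commands on a login node." >&2
fi
```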
If you need to pass arguments to your job, use the qsub -F option. Figure 4.3 shows the syntax for this option. For more information about command-line arguments, please review the Writing Job Scripts document.
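Figure 4.3 is not reproduced here, so the following is a minimal sketch of how arguments passed with -F reach a job script as positional parameters. The script name args_job.pbs, the variable names, and the argument values are all hypothetical.

```shell
# Hypothetical job script that reads two positional arguments.
cat > args_job.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=00:10:00
input_file=$1
iterations=$2
echo "input=$input_file iterations=$iterations"
EOF

# On the cluster, quote the whole argument list after -F:
#   qsub -F "data.txt 10" args_job.pbs
# Running the script locally shows how the arguments arrive:
local_output=$(sh args_job.pbs data.txt 10)
echo "$local_output"
```

Quoting the argument list keeps qsub from treating the individual arguments as its own options.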
Interactive Jobs
Interactive jobs enable users to directly manipulate the cluster’s compute resources. Rather than drafting a job script to submit to the cluster, the user runs the qsub command with the appropriate options to submit the job. In general, the options for interactive jobs are the same options used with job scripts. The difference is that the options are not specified as PBS directives, but as command-line options to qsub. Table 4.1 lists the pertinent options for interactive jobs. Figure 4.4 shows the syntax to use for a basic interactive job.
| Option | Description |
|---|---|
| -I | Submits an interactive job to the scheduler. |
| -A <account> | Runs the interactive job under the project specified in <account>. |
| | Instructs the scheduler to make the job node-exclusive. The job will consume the entire node. Interactive jobs should not require an entire node. If your job does, consider using a non-interactive job script. |
| -v | Passes the specified variables to the interactive job. Provide the list of variables in a comma-delimited format. |
| -q | Specifies the queue in which the job should be placed. At the time of this writing, only the |
| -l | Defines the resources required by jobs. Refer to Table 3.2 in the Writing Job Scripts document for more information on the arguments this option accepts. |
Please note that the first option in Figure 4.4 is an upper-case “i.” The second option is a lower-case “l.”
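Figure 4.4 is not reproduced here. The sketch below assembles a basic interactive request from the -I, -A, and -l options described in this section. The resource values are placeholders, and the project shown is the default UTK project.

```shell
project="ACF-UTK0011"                          # default UTK project
resources="nodes=1:ppn=2,walltime=01:00:00"    # placeholder request

# -I is an upper-case "i"; -l is a lower-case "L".
interactive_cmd="qsub -I -A $project -l $resources"
echo "$interactive_cmd"
```

Run the printed command at a login-node prompt to start the session.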
Once you submit an interactive job, the scheduler will queue the job and execute it when resources are available. Generally, a small, hour-long interactive job should begin within five minutes of submission; however, if the cluster is experiencing high resource utilization, it could take longer.
When you finish your work in the interactive job, issue the exit command to complete the job and return to the login node.
Parallel Jobs with mpirun
The mpirun command facilitates the execution of MPI programs. These programs execute in parallel across multiple nodes to enhance performance and resource utilization. When you use mpirun, you can specify the total number of ranks you want the program to use, in addition to the number of processes you wish to run on each node. By specifying the number of ranks and processes, you have greater control over the execution of your jobs on the ACF.
Before you use mpirun in your job, please review the System Overview document to familiarize yourself with the core counts of each node. Understanding the number of cores at your disposal is critical to using mpirun correctly.
To specify the number of ranks for your MPI program, use the -n option of mpirun. For instance, if you execute mpirun -n 16 ./test_job on a single Beacon node, one rank will be placed on each core because one Beacon node has a total of sixteen cores between two processors.
mpirun is not limited to one rank per core, however; nodes can be oversubscribed. To oversubscribe a node is to specify more ranks than the node has cores. By default, additional ranks will not be placed until all the cores on each node are filled. To illustrate this process, consider a job that has requested four Rho nodes. Each Rho node has sixteen cores; in this case, the job has 64 cores allocated to it. If this job executes mpirun -n 256 ./rho_job, the first 64 ranks will be placed one per core across all four nodes. After all 64 cores have received a rank, another 64 ranks will be placed, again one per core. This process continues until all 256 ranks have been placed, which leaves four ranks on each core.
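The arithmetic behind this oversubscription example can be checked with a few lines of shell. The variable names are illustrative, and the numbers are the ones used in the paragraph above.

```shell
nodes=4
cores_per_node=16      # Rho node core count
ranks=256

total_cores=$((nodes * cores_per_node))
passes=$((ranks / total_cores))       # full passes over every core
leftover=$((ranks % total_cores))     # ranks left for a final partial pass

echo "total cores: $total_cores"
echo "ranks per core: $passes (leftover ranks: $leftover)"
```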
If the number of ranks is fewer than the available cores on a node, the ranks are evenly spread across processors. As mentioned previously, one Beacon node has sixteen cores between two processors. If a job executes mpirun -n 8 on one of these nodes, four ranks will be placed on the first processor and four ranks will be placed on the second processor.
For greater control over rank placement, mpirun provides the -ppn option. ppn (processes per node) defines how many ranks should execute on each node. By default, ranks are placed based upon the number of cores each node contains. As an example, using mpirun -n 45 -ppn 15 ./ppn_job across three Beacon nodes would place sixteen ranks on each of the first two nodes and thirteen ranks on the last one. To override this behavior, pass the -f $PBS_NODEFILE option to mpirun so that it can use the -ppn option properly. If you execute mpirun -n 45 -ppn 15 -f $PBS_NODEFILE ./ppn_job, it will place fifteen ranks on each of the three Beacon nodes.
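The two placements described above can be reproduced with a small model. This is only a sketch of the fill order the text describes, not of mpirun's internal logic. The layout function and its variable names are hypothetical.

```shell
# Print how many ranks land on each node when every node takes at most
# $2 ranks before the next node is used.
# Arguments: total ranks, ranks per node, number of nodes.
layout() {
    remaining=$1
    per_node=$2
    node_count=$3
    result=""
    n=1
    while [ "$n" -le "$node_count" ]; do
        if [ "$remaining" -gt "$per_node" ]; then
            placed=$per_node
        else
            placed=$remaining
        fi
        result="$result$placed "
        remaining=$((remaining - placed))
        n=$((n + 1))
    done
    echo "${result% }"
}

default_layout=$(layout 45 16 3)   # default: fill each 16-core Beacon node
ppn_layout=$(layout 45 15 3)       # with -ppn 15 honored on every node
echo "default: $default_layout"
echo "with -ppn 15: $ppn_layout"
```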
Before you attempt to run an MPI program, verify that you have loaded the appropriate compiler and MPI implementation with the module list command. By default, Intel’s MPI implementation is loaded into your environment. You can switch to other implementations with the module swap command. Please refer to the Modules document for more information on how to use the module commands. If you intend to use a Python MPI program, load the mpi4py module.
Targeting GPU Nodes
If your jobs require GPUs (graphics processing units), the process for job submission differs. For non-interactive jobs, you specify a partition and feature set that contains GPU nodes in your batch script. For interactive jobs, you specify these options with the -l (lower-case “l”) option of the qsub command. You must also load the relevant modules that will use the GPUs, such as tensorflow-gpu. For more information on modules and the commands to manipulate them, please refer to the Working with Modules document.
If you intend to use the Beacon GPU nodes, use the ACF-UTK0011 or ACF-UTHSC0001 project for both interactive and non-interactive jobs. At the time of this writing, the Beacon GPU nodes are the only GPU nodes available to all ACF users. Otherwise, specify a project to which you belong that provides access to GPU nodes. Please refer to the System Overview document for more information on which condos have GPUs.
Figure 5.1 shows a sample batch script that targets GPUs on the ACF using the -A option. If necessary, replace the tensorflow-gpu module with the modulefile you require. For the other options, please refer to the Writing Job Scripts document for more information. Note that ./gpu_job refers to the code that will execute on the nodes allocated to your job. Verify that it is in the same directory as the batch script if you use the example as-is. Please note that the line numbers are for reference purposes and should not be included in your job script.
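Figure 5.1 is not reproduced here. The following is a hedged sketch of what such a batch script might look like, written without line numbers. The partition and feature lines assume Moab-style resource extensions, and the resource values and the file name gpu_job.pbs are placeholders. Only the project, the tensorflow-gpu module, and ./gpu_job come from the text.

```shell
# Write a sample GPU job script; the -l lines use assumed syntax.
cat > gpu_job.pbs <<'EOF'
#!/bin/bash
#PBS -A ACF-UTK0011
#PBS -l nodes=1:ppn=1,walltime=01:00:00
#PBS -l partition=beacon
#PBS -l feature=beacon_gpu

cd "$PBS_O_WORKDIR"
module load tensorflow-gpu
./gpu_job
EOF

# Submit from a login node with:
#   qsub gpu_job.pbs
```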
To target GPU nodes with an interactive job, follow the general process described in the Interactive Jobs section of this document. For the -A option, specify ACF-UTK0011 or ACF-UTHSC0001 if you intend to use Beacon GPU nodes. If not, specify a project to which you belong that provides GPU resources.
After the interactive job starts, load the relevant modulefiles with the module load command so that you can utilize the allocated GPUs. Please refer to the Working with Modules document for more information. To query for the GPU and its available driver, execute the nvidia-smi command after the interactive job starts. Figure 5.2 shows the output of this command.
Targeting Backfill Resources
For short, small jobs, users can use backfill resources. These resources are otherwise idle and will enable jobs to quickly execute. To see which resources are currently available in the backfill, execute the showbf command. Figure 6.1 shows the output of showbf if resources are available.
In the case of Figure 6.1, the user could write a job script that targeted the monster partition and expect to quickly land on the node because it is considered a backfill resource. Be aware that backfill resources may or may not be available depending on the cluster’s current resource utilization.
Last Updated: 04 / 23 / 2020