What is Responsible Node Sharing?
Responsible node sharing is a capability for the condos within the Open Research resources of the UT Advanced Computing Facility (ACF). It allows a user to submit a batch job whose resource specifications let the job scheduler run it on either the institutional or private condos, and it allows that job to share a single compute node with other jobs. Running jobs from different users, and even from different workgroups, on the same compute node significantly increases both the throughput of the ACF and the efficiency with which ACF resources are used.
Under responsible node sharing, the ACF job scheduler manages all of the jobs that run on a single compute node and places together only those jobs whose combined processor core and memory requirements fit on that node. Because the ACF is made up of different node types with different numbers of processor cores and different amounts of memory, the scheduler must manage all job requests and provision each job with the resources it needs.
Responsible node sharing will become the default behavior for the institutional condos. Private condo owners will be asked whether their condos can participate in responsible node sharing to benefit users outside their workgroup, but they will not be required to participate.
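The fit rule described above can be illustrated with a minimal sketch in Python. This is not the ACF scheduler's actual logic. The Job and Node classes, the fits_on_node function, and the 16-core, 32 GB node are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cores: int          # processor cores the job requests
    memory_gb: float    # memory the job is allocated, in GB


@dataclass
class Node:
    total_cores: int
    total_memory_gb: float


def fits_on_node(node, running, candidate):
    """Return True if the candidate job can share the node with the running jobs."""
    used_cores = sum(job.cores for job in running)
    used_memory = sum(job.memory_gb for job in running)
    return (used_cores + candidate.cores <= node.total_cores
            and used_memory + candidate.memory_gb <= node.total_memory_gb)


# Hypothetical node that already hosts two jobs from different users.
node = Node(total_cores=16, total_memory_gb=32.0)
running = [Job(cores=8, memory_gb=16.0), Job(cores=4, memory_gb=8.0)]

print(fits_on_node(node, running, Job(cores=4, memory_gb=8.0)))  # True: fills the node exactly
print(fits_on_node(node, running, Job(cores=6, memory_gb=8.0)))  # False: needs more cores than remain
```

In this sketch a job is rejected as soon as either cores or memory would be oversubscribed, which is why accurate resource requests matter for sharing to work.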
When Will Responsible Node Sharing Go into Effect?
The scheduling of the institutional condos will be changed on April 15, 2020 to implement responsible node sharing. Private condo owners will be asked whether responsible node sharing can be implemented on their private condos. When an owner indicates that sharing is allowed, the scheduler will be modified at that time to allow sharing of that condo's compute nodes with a maximum time limit of three hours. Once approved, a private condo owner's nodes will participate in responsible node sharing, but jobs run by users outside of the owner's workgroup will be limited to two hours. As a result, the maximum amount of time a workgroup member will have to wait for their job(s) to run on the private condo will be two hours. The two-hour time limit applies only to private condos and will not be applied to the institutional condos.
What Do ACF Users Need to Do?
For responsible node sharing to work effectively, users need to specify their job resources accurately. At the time of this writing, users must specify the number of nodes their job requires and the wall-clock time for the job. Users should also specify the number of cores their job requires with the ppn option. If this option is not specified, the scheduler will allocate a single core to the job. Memory will be allocated to jobs based on the number of cores requested. If your job requires additional memory, specify more cores. The memory amount will be calculated with this formula:
Memory allocated to the job = (cores requested / total cores on the node) x total memory on the node
Please note that there is a static amount of memory per core based on the node set. For more information about specifying job resources, please visit the Running Jobs document.
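As a worked example of the memory calculation, the sketch below assumes the allocation is the job's proportional share of the node's memory, that is, the fraction of the node's cores requested multiplied by the node's total memory. The function names and the 16-core, 32 GB node are hypothetical; actual core counts and memory sizes depend on the node set.

```python
import math


def job_memory_gb(cores_requested, node_cores, node_memory_gb):
    """Memory allocated to a job, assuming a proportional share of the node."""
    if not 1 <= cores_requested <= node_cores:
        raise ValueError("cores_requested must be between 1 and node_cores")
    return cores_requested / node_cores * node_memory_gb


def cores_for_memory(memory_needed_gb, node_cores, node_memory_gb):
    """Smallest core count whose proportional memory share covers the need."""
    memory_per_core = node_memory_gb / node_cores
    cores = max(1, math.ceil(memory_needed_gb / memory_per_core))
    if cores > node_cores:
        raise ValueError("job needs more memory than one node provides")
    return cores


# Hypothetical node: 16 cores and 32 GB, so 2 GB per core.
print(job_memory_gb(1, 16, 32))     # 2.0 (the default single core)
print(job_memory_gb(4, 16, 32))     # 8.0
print(cores_for_memory(9, 16, 32))  # 5 (4 cores would give only 8 GB)
```

Under these assumptions, a job that needs 9 GB on such a node would request five cores even if it only computes on one, because memory is tied to the core count.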
What If My Job Should Not Run on a Shared Compute Node?
Users always have the option to request that a compute node be used exclusively by their job. To set up this requirement, specify the “-n” node-exclusive option on the batch job.
What If I Have More Questions?
Questions about responsible node sharing can be sent to firstname.lastname@example.org.
Last Updated: 04 / 14 / 2020