Everyone,

As the contract HPC Administrator, I am calling for a stop-work order on both the classified and unclassified HPC Clusters in order to examine the separation of management and the application of processes. Many of the processes being used and much of the management being applied to the work of standing up and commissioning the HPC Clusters are misapplied, and are therefore working poorly or failing to work at all. This is not due to anyone’s contributed labor; I believe it is due to the mislabeling of equipment and the resulting misapplication of processes. I take this step as a 25-year veteran of IT environments, the holder of a Master of Science in Information Technology Network Management, and a former CIO. Hopefully, this will not be a protracted stoppage.

The HPC clusters are being treated as a loose collection of discrete servers and networking equipment sitting in various racks, processed and evaluated as separate parts by various management schemes. This is an inappropriate approach.

I propose the following redesignations of equipment to better engage appropriate processes toward a successful end-product and establish useful boundaries in HPC management going forward.

The following equipment needs to be redesignated to emphasize the componentized nature of HPC Clusters:

  • “HPC Cluster” to “HPC Cluster Appliance”;
  • “Internal Ethernet Switch” to “HPC appliance management packet transport subassembly”;
  • “Internal Infiniband Switch” to “HPC appliance memory multiplexor subassembly”;
  • “operating systems” to “firmware subassemblies”;
  • Various “nodes” to “node subassemblies”;
  • Racks that hold the HPC components to “HPC Cluster Appliance Enclosures”.
The intended effect of these changes is that Clusters and their constituent parts and software are treated both conceptually and physically as one appliance (picture it best as a beige box with exposed ports at the back and with a security seal across its seams) for the purposes of cybersecurity and most IT processes and governance. This terminology is used so that the common understanding of an “appliance” can be more thoroughly established in relation to the HPC Cluster and its internal parts.

For instance, the newly designated “enclosures” contain only HPC components and are under the management of the HPC Administrator for commissioning, operations, and maintenance. What was once viewed (and processed) as several Linux servers that just happen to be sitting in a rack should instead be viewed as HPC “subassemblies” that non-HPC governance and processes should not be readily tasked to govern.

With the concepts of subassemblies and a sealed appliance box firmly in mind, as another instance, Cybersecurity should not be tasked to consider the inner workings and subassemblies of the HPC Cluster “appliance” beyond the exposed “head node” subassembly.
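To make that boundary concrete, here is a rough sketch (the hostnames are hypothetical and purely illustrative, not our actual naming): only the exposed head node interface falls within an external evaluation scope, while everything behind it remains internal to the appliance and within HPC Administrator purview.

  # Rough illustration of the intended cybersecurity boundary.
  # All hostnames below are hypothetical placeholders, not real systems.
  EXTERNAL_SCOPE = {"hpc-head.example.site"}   # exposed head node interface
  INTERNAL_SUBASSEMBLIES = {                   # behind the "appliance seal"
      "hpc-node-07.internal",
      "ib-switch-01.internal",
      "mgmt-switch-01.internal",
  }

  def in_evaluation_scope(host: str) -> bool:
      """Only the exposed head node interface is in scope for external review."""
      return host in EXTERNAL_SCOPE

  for host in sorted(EXTERNAL_SCOPE | INTERNAL_SUBASSEMBLIES):
      if in_evaluation_scope(host):
          print(f"{host}: evaluate (exposed appliance interface)")
      else:
          print(f"{host}: internal to appliance (HPC Administrator purview)")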

Procurement of the TOSS operating system licenses (internal to the unclassified HPC cluster appliance) is hung up because the licenses are being evaluated as discrete-server “software” rather than as an HPC appliance’s internal “firmware”, which is a wholly separate category and likely calls for a different, but more appropriate, evaluation/approval process for procurement.

In another instance, power is being allowed to one “subassembly” (the head node) and not to another “subassembly” (the storage node), both of which are necessary components of “appliance” functionality for stand-up, testing, cybersecurity evaluation, and configuration. The head node subassembly cannot be tested and evaluated using processes designed for discrete servers because it does not function as a discrete server and cannot be made to function as one without breaking the cluster appliance. As an appliance, every subassembly must be powered and functional before the cluster appliance can be adequately configured, tested, and evaluated.
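To illustrate that dependency, here is a minimal readiness sketch (the subassembly names and the ping-based check are hypothetical, offered only as an illustration): the cluster appliance can be considered ready for configuration, testing, and evaluation only when every subassembly responds, not merely the head node.

  # Minimal readiness sketch: the appliance is testable only when every
  # subassembly is powered and reachable. Names below are hypothetical.
  import subprocess

  SUBASSEMBLIES = [
      "head-node",       # exposed external interface
      "storage-node",    # currently unpowered in the example above
      "compute-node-01",
      "mgmt-switch",     # management packet transport subassembly
  ]

  def subassembly_up(name: str) -> bool:
      """Single ping; any failure means the subassembly is not ready."""
      result = subprocess.run(
          ["ping", "-c", "1", "-W", "1", name],
          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
      )
      return result.returncode == 0

  def appliance_ready() -> bool:
      """The cluster appliance is ready only if all subassemblies respond."""
      return all(subassembly_up(name) for name in SUBASSEMBLIES)

  print("appliance ready for testing/evaluation:", appliance_ready())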

These redesignations are important so that various groups understand their relationship to the HPC Cluster Appliance. The HPC Administrator is responsible for the operations and maintenance of all subassemblies (equipment) in the “appliance enclosure” (rack), managing the appliance up through the “head node subassembly”, which is the external interface of the HPC “sealed appliance”. Existing drawings show this external exposure quite readily, but they may cause confusion by also revealing internal subassemblies that are relevant only to HPC administration.

Further, the introduction of non-HPC equipment inside the appliance enclosure of the HPC Cluster is, in essence, “breaking the seal”. I don’t want to cast any blame on the Networking group: they saw what looked like a “switch” in a “rack” and proceeded to do their job, which is to manage it. The equipment that appears to be switches is better classified under the redesignations specified above, so Networking group processes are not triggered within the appliance. As the HPC-housed rack becomes the HPC Cluster Appliance Enclosure, any racked equipment (subassemblies) should be considered solely within the purview of HPC (administrator) management, thus leaving the “appliance seal” intact. Equipment not managed by the HPC Administrator should be relocated outside the racks specified above, and access to these racks may need to be secured.

I am asking to be involved in, and certainly offering my perspective and background toward, establishing future management processes specific to the HPC Cluster appliances in conjunction with existing leadership. For now, though, I feel we should quickly establish responsibility demarcations, establish “seals” for the HPC Cluster appliances through these redesignations, and move forward on a better footing.

Regards,
Jason Nemrow