docs-csm's Introduction

Cray System Management Documentation

Scope and audience

The documentation included here describes the Cray System Management (CSM) software, how to install or upgrade CSM software, and related supporting operational procedures to manage an HPE Cray EX system. CSM software is the foundation upon which other software product streams for the HPE Cray EX system depend.

The CSM installation prepares and deploys a distributed system across a group of management nodes organized into a Kubernetes cluster which uses Ceph for utility storage. Each of these nodes functions as a Kubernetes master node, a Kubernetes worker node, or a utility storage node running Ceph.

System services on these nodes are provided as containerized micro-services packaged for deployment via Helm charts. Kubernetes orchestrates these services and schedules them on Kubernetes worker nodes with horizontal scaling. Horizontal scaling increases or decreases the number of service instances as demand for them varies, such as when booting many compute nodes or application nodes.

This information is intended for system installers, system administrators, and network administrators of the system. It assumes some familiarity with standard Linux and open source tools, such as shell scripts, revision control with git, configuration management with Ansible, and the YAML, JSON, and TOML file formats.

Table of contents

  1. Introduction to CSM Installation

    This chapter introduces the use of CSM software to manage the HPE Cray EX system. It describes the scenarios for installing and upgrading CSM software, how product stream updates for CSM are delivered, the operational activities performed after installation for ongoing management of the HPE Cray EX system, the differences between the previous release and this release, and the conventions used in this documentation.

  2. Bare-Metal Steps

    This chapter outlines how to set up default credentials for River BMCs and ServerTech PDUs, which must be done before the initial installation of CSM, in order to enable HSM software to interact with River Redfish BMCs and PDUs.

  3. Update CSM Product Stream

    This chapter explains how to get the CSM product release, get any patches, update to the latest documentation, and check for any Field Notices or Hotfixes.

  4. Install CSM

    This chapter provides an ordered list of procedures to follow when performing an initial install or a reinstall of CSM software. See the separate "Upgrade CSM" chapter for upgrade procedures.

  5. Upgrade CSM

    This chapter provides an ordered list of procedures for upgrading CSM software and indicates when to perform operational tasks as part of the software upgrade workflow. See the separate "Install CSM" chapter for initial install and reinstall procedures.

  6. CSM Operational Activities

    This chapter provides an unordered set of administrative procedures required to operate an HPE Cray EX system with CSM software, grouped into several major areas:

    • CSM Product Management
    • Artifact Management
    • Boot Orchestration
    • Compute Rolling Upgrade
    • Configuration Management
    • Console Management
    • Firmware Management
    • Hardware State Manager
    • Image Management
    • Kubernetes
    • Network Management
    • Node Management
    • Package Repository Management
    • Power Management
    • Resiliency
    • River Endpoint Discovery Service
    • Security And Authentication
    • System Configuration Service
    • System Layout Service
    • System Management Health
    • Utility Storage
    • Validate CSM Health

  7. CSM Troubleshooting Information

    This chapter provides information about some known issues in the system and tips for troubleshooting Kubernetes.

  8. CSM Background Information

    This chapter provides background information about the NCNs (non-compute nodes) which function as management nodes for the HPE Cray EX system. This information is not normally needed to install or upgrade software, but provides background which might be helpful for troubleshooting an installation.

  9. CSM REST API Documentation

    This chapter provides documentation on the REST APIs of the services in CSM.

  10. Glossary

    This chapter provides explanations of terms and acronyms used throughout the rest of this documentation.

Copyright and license

See LICENSE.

docs-csm's People

Contributors

alexanderkingh, bklei, cdelatte-hpe, denniswalker, djlapenta, dzou-hpe, github-actions[bot], heemstra, jacobsalmela, jimostrom-hpe, johren-hpe, kburns-hpe, kimjensen-hpe, kosonenj, leliasen-hpe, lukebates123, mbuchmann-hpe, mharding-hpe, mitcharf, mtupitsyn, ndavidson-hpe, nrockershousen, phalseth-hpe, rsjostrand-hpe, rustydb, schooler-hpe, seanwallace, spillerc-hpe, trad511, zcrisler

docs-csm's Issues

Compare and contrast text for getting the SYSTEM_DOMAIN_NAME

Compare and contrast the text for getting the SYSTEM_DOMAIN_NAME
in these two sections of the manual:

operations/conman/Access_Console_Log_Data_Via_the_System_Monitoring_Framework_SMF.md

operations/system_management_health/Access_System_Management_Health_Services.md

In the former we read

Determine the external domain name by running the following command on any NCN:

kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations.yaml}' |
        base64 -d | grep "external:"
      external: SHASTA_EXTERNAL_DOMAIN.com

whilst in the latter, we read

The SYSTEM_DOMAIN_NAME value in the URLs on this page is an Ansible
variable that can be retrieved as follows. It is expected to be the
system’s fully qualified domain name (FQDN).

(ncn-mw#) This command can be run on any master or worker NCN.

kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | base64 -d | grep "external:"

Example output:

      external: SYSTEM_DOMAIN_NAME

The latter makes more sense, as it doesn't presume that
your site will be a dot-com, whilst the addition of that
dot-com in the former might have the uninitiated, and/or
naive, thinking that they have to add a dot-com to any
name that gets returned.

Probably worth taking the text from the latter and replacing
the text in the former with it, so as to be consistent.
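
As a rough illustration of what a unified instruction could offer, the sketch below prints only the value of the external: key, so that nothing needs to be appended to or stripped from it. The function name external_domain is hypothetical, and the sketch assumes the decoded customizations.yaml contains the key in the form shown in the two excerpts above.

```shell
# Hypothetical helper: print the value of the first "external:" key found
# in YAML text supplied on standard input.
external_domain() {
    awk '$1 == "external:" { print $2; exit }'
}

# Assumed usage on a master or worker NCN, reusing the kubectl query quoted
# above (not executed here):
#   kubectl get secret site-init -n loftsman \
#       -o jsonpath='{.data.customizations\.yaml}' | base64 -d | external_domain
```

Whatever the function prints would then be used as-is for SYSTEM_DOMAIN_NAME.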

FWIW, there's even a third way to do it presented in

operations/package_repository_management/Restrict_Admin_Privileges_in_Nexus.md

but that's so different to the other two as to not cause confusion.

HTH

Update manual wipe

Need to update some of the manual steps to align with the latest changes for the automatic wipe.

Originally posted by @rustydb in #2014 (comment)

Namely:

  • include wiping nvme
  • use a for loop to avoid early exits on bulk calls to wipefs where some targets may be skipped
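
A minimal sketch of the for-loop idea, not the actual procedure text: each target is wiped by its own call, so a failure on one device is reported and the loop carries on. The WIPE_CMD variable and the wipe_each function are assumptions made for illustration, and the default is a dry run that only echoes the wipefs command.

```shell
# Wipe each target on its own so that one failure does not stop the rest.
# WIPE_CMD is an assumption for illustration; it defaults to a dry run.
WIPE_CMD=${WIPE_CMD:-echo wipefs --all --force}

wipe_each() {
    failed=0
    for dev in "$@"; do
        # Unquoted on purpose (sh/bash): WIPE_CMD holds a command plus options.
        if ! $WIPE_CMD "$dev"; then
            echo "skipped: $dev" >&2
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if any target was skipped.
    [ "$failed" -eq 0 ]
}
```

With the default setting, wipe_each /dev/sda /dev/nvme0n1 would only print the two wipefs commands (the device paths are placeholders). Setting WIPE_CMD='wipefs --all --force' would run them for real.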

Will new releases tofu the RELEASE_NOTES.md?

How will ./RELEASE_NOTES.md be updated for releases newer than 1.2 (and/or for older ones)? I.e., will the 1.3 info go above the 1.2 info (Text over, full under), or will some other process be used? I'm not seeing other / older release notes in the repo.

Need Way to Auto-Delete Branches

We can't auto-delete branches via the GitHub repository settings because we need to allow time for backports to be created, and a backport can only be created while the original branch still exists.

Use a GitHub action to do this automatically.
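
One possible shape for the selection step such an action might run, sketched under several assumptions: the age of a branch's last commit stands in for the time since its merge, the grace period is 30 days, and the names stale_branches, GRACE_DAYS and NOW are hypothetical. A real workflow would also need to confirm that each branch was merged before deleting it.

```shell
# Hypothetical filter: read "<branch> <unix-commit-time>" lines on stdin and
# print the branches whose last commit is older than GRACE_DAYS days.
# NOW can be overridden to make the result reproducible.
stale_branches() {
    grace_days=${GRACE_DAYS:-30}
    now=${NOW:-$(date +%s)}
    awk -v now="$now" -v days="$grace_days" \
        '$2 + days * 86400 < now { print $1 }'
}

# Assumed usage inside a scheduled workflow step (not executed here):
#   git for-each-ref --format='%(refname:short) %(committerdate:unix)' \
#       refs/remotes/origin | stale_branches
```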

HELLO, I Want to get a driver of HPE CRAY EX235A.

Hello, everyone!
I recently got an AMD MI250X GPU card. It was evidently a part from an HPE Cray EX235a system.
I am having trouble using it under Ubuntu. Perhaps there is no suitable driver for Ubuntu?
Do you have drivers or a BIOS that can bring my card to life? The GPU is marked D65201-OB Rev13.
Thank you so much!

An observation on Clean_Up_After_a_BOS-BOA_Job_is_Completed_or_Cancelled.md

Not so much an "issue" with, as an observation prompting a question on, the text in the file

operations/boot_orchestration/Clean_Up_After_a_BOS-BOA_Job_is_Completed_or_Cancelled.md

We read (my formatting):

ConfigMap for BOA:

This ConfigMap contains the configuration information that the BOA job uses.
The BOA pod mounts a ConfigMap named boot-session at /mnt/boot_session
inside the pod.
This ConfigMap has a random UUID name, such as
e786def5-37a6-40db-b36b-6b67ebe174ee.
This name does not obviously connect it to the BOA job.

however, I am currently in the process of cleaning up four days'
worth of BOS Sessions, created whilst trying to solve an issue
that eventually appeared to solve itself, and I am constantly seeing
that the ConfigMap's UUID doesn't appear to be random, but in
fact appears to be the same as the BOS Session ID, for example:

# export BOA_JOB_NAME=boa-7a98ae9d-512b-4623-8a90-4d3c6426e5fd
#
# export BOS_SESSION_ID=${BOA_JOB_NAME#boa-}
# echo  $BOS_SESSION_ID
7a98ae9d-512b-4623-8a90-4d3c6426e5fd
#
# kubectl -n services describe job ${BOA_JOB_NAME} | \
  grep -A3 boot-session:
   boot-session:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      7a98ae9d-512b-4623-8a90-4d3c6426e5fd
    Optional:  false
#

so, have I just been "lucky", or has the underlying process been
altered so that the BOS Session ID is now propagated into the
ConfigMap UUID for the BOA ?

In which case the documentation doesn't reflect that change.

Authoring of triple-backtick code blocks prevents rendering in Jekyll/Liquid

There are a swathe of instances, throughout the source of these docs,
where "code blocks" are begun with the triple-backtick marker, but
are never closed with a corresponding one.

Whilst appreciating that you can "get away with it" in some rendering
environments, it's rather poor form, and, more importantly, it does
break rendering in some environments, with Jekyll/Liquid being the
one I have been trying to use locally.

Given that there are examples of code block markup where the author
has done the "right thing", it's clearly not some "internal style guide
authoring thing": more likely the result of an ad-hoc approach?

Any chance of fixing this up?

Reason for not forking is that there is just too much of it to clean up!
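
To help gauge the size of the clean-up, a rough sketch for locating affected files follows. It assumes that fences are written as three backticks at the start of a line (optionally indented) and that an odd count of such lines means a block was left open; tilde fences and nested fences are not handled. The function name unbalanced_fences is hypothetical.

```shell
# Print the name of every file given as an argument that contains an odd
# number of fence lines, i.e. at least one code block that is never closed.
unbalanced_fences() {
    for f in "$@"; do
        # Skip anything that cannot be read.
        [ -r "$f" ] || continue
        # grep -c exits non-zero when the count is 0, hence "|| true".
        n=$(grep -c '^[[:space:]]*`\{3\}' "$f" || true)
        if [ $((n % 2)) -ne 0 ]; then
            echo "$f"
        fi
    done
}
```

Running it as unbalanced_fences operations/*/*.md from the top of the repository would list the candidates under that directory.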

Add Contribution File

We need a .github/CONTRIBUTING.md file, containing details for how another user/developer/maintainer may contribute new content or make amendments to existing content.
