doc-hpc's Introduction

SUSE Linux Enterprise for High Performance Computing (SLE-HPC) Documentation

This is the source for the official SUSE Linux Enterprise for High Performance Computing (SLE-HPC) documentation.

This repository hosts the documentation sources including translations (if available).

Released versions of the documentation are published at https://documentation.suse.com/sle-hpc/.

Beta documentation versions are available at https://susedoc.github.io/, where all commits to main and maintenance branches are automatically built.

Branches

Table 1. Overview of important branches

Name             Purpose
main             Current working branch
maintenance/*    Maintenance for released versions

On Feb 20, 2021, we switched to a new default branch. The default branch is now called main.

Use the main branch as the basis of your commits/new feature branches.

How to update your local repository

If you created a local clone or GitHub fork of this repo before Feb 20, 2021, do the following:

git branch -m master main
git fetch origin
git branch -u origin/main main
git pull -r
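
The commands above assume that a local master branch still exists. The following is a minimal sketch of a guarded variant that does nothing when there is no local master branch. The function name migrate_default_branch and the GIT variable are illustrative assumptions and are not part of this repository.

```shell
# Command used to call Git; overridable for testing (assumption for this sketch).
GIT=${GIT:-git}

# Rename a local "master" branch to "main" and re-point it at origin/main.
# Does nothing if there is no local "master" branch, for example in a fresh clone.
migrate_default_branch() {
    if ! "$GIT" show-ref --verify --quiet refs/heads/master; then
        echo "No local master branch found; nothing to do."
        return 0
    fi
    # Same four steps as listed above; stop at the first one that fails.
    "$GIT" branch -m master main &&
    "$GIT" fetch origin &&
    "$GIT" branch -u origin/main main &&
    "$GIT" pull -r
}
```

Run migrate_default_branch from inside your local clone. If a local main branch already exists next to master, the rename step fails and the remaining steps are skipped.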

Reporting bugs

Bugs are collected on https://bugzilla.suse.com. If possible, please check for duplicates before creating a new report. When creating a new report, use SUSE Linux Enterprise HPC as the product, and in the next step select the version. Select Documentation as the component.

Contributing

Thank you for contributing to this repo. Please adhere to the following guidelines when creating a pull request:

  1. If you are contributing to the most recent release (currently SLE-HPC 15 SP4), create your pull request against the main branch. This branch is protected.

  2. If you are contributing to a previous release, create your pull request against the respective maintenance/<RELEASENUMBER> branch. These branches are also protected.

  3. Make sure all validation (GitHub Actions) checks are passed.

  4. For your pull request to be reviewed, tag relevant SMEs from the development team (if applicable) and a member of the SLE-HPC doc team: Tahlia Richardson (@tahliar).

    Note
    If your pull request changes multiple files or reorganises content, please build locally using DAPS or daps2docker (see instructions below) to verify that the files build. GitHub Actions only validates the XML and does not ensure that the builds are correct.

  5. Implement any required changes, or fix any merge conflicts if relevant. If you have any questions, ping a documentation team member in #team-suse-docs on Slack.

  6. For help on style and structure, refer to the Documentation Style Guide.
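
As a sketch of steps 1 and 2 above, a feature branch can be started from the branch that the pull request will target. The helper name start_feature_branch, the GIT variable, and the feature branch name are hypothetical. The base branch is either main or the respective maintenance/<RELEASENUMBER> branch.

```shell
# Command used to call Git; overridable for testing (assumption for this sketch).
GIT=${GIT:-git}

# Hypothetical helper: start a feature branch from the branch the pull request
# will target ("main" for the most recent release, "maintenance/<RELEASENUMBER>"
# for a previous release).
# $1: base branch name, $2: name of the new feature branch
start_feature_branch() {
    base=$1
    feature=$2
    # Update remote-tracking branches, then branch off the chosen base.
    "$GIT" fetch origin &&
    "$GIT" checkout -b "$feature" "origin/$base"
}
```

For example, start_feature_branch main my-fix creates my-fix from origin/main. Because main and the maintenance branches are protected, changes reach them only through a pull request from such a feature branch.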

Editing DocBook

To contribute to the documentation, you need to write DocBook.

  • You can learn about DocBook syntax at http://docbook.org/tdg5/en/html.

  • SUSE documents are generally built with DAPS (package daps) and the SUSE XSL Stylesheets (package suse-xsl-stylesheets). Ideally, you should get these from the repository Documentation:Tools. However, slightly older versions are also available from the SLE and openSUSE repositories.

Building documentation

To build the documentation (HTML and PDF by default), you can either use DAPS directly or use daps2docker. Both tools only work on Linux.

  • Use daps2docker if you use any Linux distribution that includes Docker and Systemd, want to be set up as quickly as possible, and only want to build HTML, PDF, or EPUB.

  • Use DAPS directly if you are using a recent version of openSUSE, and want to use any of the advanced features of DAPS, such as building Mobipocket or spell-checking documents.

Using daps2docker

  1. Install Docker.

  2. Clone the daps2docker repository from https://github.com/openSUSE/daps2docker.

  3. Within the cloned repository, run ./daps2docker.sh /PATH/TO/DOC-DIR to build HTML and PDF documents.
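
The steps above can be sketched as a small shell function, assuming Docker is already installed and running. The function name build_with_daps2docker, the GIT variable, and the default checkout location are assumptions made for this example.

```shell
# Command used to call Git; overridable for testing (assumption for this sketch).
GIT=${GIT:-git}

# Build the documents in a documentation directory with daps2docker.
# $1: absolute path to the documentation directory (absolute, because the
#     function changes into the daps2docker checkout before running the script)
# $2: where to keep the daps2docker clone (hypothetical default: ./daps2docker)
build_with_daps2docker() {
    doc_dir=$1
    checkout=${2:-daps2docker}
    # Clone daps2docker only if there is no checkout yet.
    if [ ! -d "$checkout" ]; then
        "$GIT" clone https://github.com/openSUSE/daps2docker "$checkout" || return 1
    fi
    # Run the script from within the cloned repository, as described above.
    ( cd "$checkout" && ./daps2docker.sh "$doc_dir" )
}
```

Call it as build_with_daps2docker /PATH/TO/DOC-DIR. The subshell keeps the directory change from affecting the calling shell.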

Using DAPS directly

  • daps -d DC-<YOUR_BOOK> validate: Make sure what you have written is well-formed XML and valid DocBook 5

  • daps -d DC-<YOUR_BOOK> pdf: Build a PDF document

  • daps -d DC-<YOUR_BOOK> html: Build multi-page HTML document

  • daps -d DC-<YOUR_BOOK> html --single: Build single-page HTML document

  • Learn more at https://opensuse.github.io/daps
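
The commands above can be chained so that a build only starts after validation succeeds. This ordering is a suggestion for this sketch, not a requirement stated by DAPS. The function name build_book and the DAPS variable are assumptions.

```shell
# Command used to call DAPS; overridable for testing (assumption for this sketch).
DAPS=${DAPS:-daps}

# Validate a book and, only if validation succeeds, build PDF and multi-page HTML.
# $1: DC file, for example DC-<YOUR_BOOK>
build_book() {
    dc_file=$1
    if ! "$DAPS" -d "$dc_file" validate; then
        echo "Validation failed for $dc_file; skipping builds." >&2
        return 1
    fi
    # Build the HTML only if the PDF build succeeded.
    "$DAPS" -d "$dc_file" pdf &&
    "$DAPS" -d "$dc_file" html
}
```

Run build_book DC-<YOUR_BOOK> from the directory that contains the DC file.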

doc-hpc's Issues

Take commits from `master` branch

@lproven It appears that Egbert accidentally pushed three commits to a (newly-created) master branch some time in April. Could you check whether these commits are still relevant for main? (If you have time.)

Specifically, this is about the following commits: bc72357 / 0354ed3 / b85e09e
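
One possible way to check this is to test whether each commit is reachable from main. The sketch below uses git merge-base --is-ancestor for that. The function name report_commits and the GIT variable are assumptions, and the check compares hashes only, so a cherry-picked copy of a commit is reported as missing.

```shell
# Command used to call Git; overridable for testing (assumption for this sketch).
GIT=${GIT:-git}

# For each commit, report whether it is reachable from the target branch.
# $1: target branch; remaining arguments: commit hashes
# "not in" also covers hashes that are unknown in the local clone, and
# cherry-picked commits, which have a different hash on the target branch.
report_commits() {
    target=$1
    shift
    for commit in "$@"; do
        if "$GIT" merge-base --is-ancestor "$commit" "$target"; then
            echo "$commit: already in $target"
        else
            echo "$commit: not in $target"
        fi
    done
}
```

After git fetch origin, this could be called as report_commits origin/main bc72357 0354ed3 b85e09e.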

Technical reference/developer guide?

Would it make sense to move the "HPC user libraries" chapter (and maybe also the spack chapter) to a separate guide called Technical Reference or Developer Guide or something similar?

Is the explanation of down and down* correct?

@mslacken @e4t Could you check whether this FAQ entry is correct?

Current version

doc-hpc/xml/slurm.xml

Lines 1046 to 1058 in 6b585df

<para> What is the difference between the state <literal>down</literal>
and <literal>down*</literal>? </para>
</question>
<answer>
<para> A <literal>*</literal> shown after a status code means that the
node is not responding. </para>
<para> Thus, when a node is marked as <literal>down*</literal>, this means
that the node is not reachable due to network issues, or its
<literal>slurmd</literal> is not running. </para>
<para> In the <literal>down</literal> state, the node is reachable, but
the node was rebooted unexpectedly, the hardware does not match the
description in <filename>slurm.conf</filename>, or a healthcheck was
configured with the <literal>HealthCheckProgram</literal>. </para>

Original version which explained "down" twice

doc-hpc/xml/slurm.xml

Lines 1327 to 1333 in 38b3183

<para>What is the difference between the state <literal>down</literal> and <literal>down*</literal>?</para>
</question>
<answer>
<para>
When a node is marked as <literal>down</literal> this means that the node is not reachable due to network issues or the <literal>slurmd</literal> is not running. In the <literal>down</literal> state the node is reachable, but the node was rebooted unexpectedly, the hardware does not match the description in <filename>slurm.conf</filename> or a healthcheck configured with the <literal>HealthCheckProgram</literal>.
</para>
</answer>

How to know the new version of slurm to upgrade to?

@mslacken @e4t The documented command to upgrade slurm is zypper install --force-resolution slurm_VERSION, e.g. slurm_20_11. This means you need to know the version number to include it in the command.

How does the admin find out what the new version number is? From update advisories? Or is there a search command you can run to find out?

Outstanding items from SLE-12488?

SLE-12488 lists basic documentation requirements for HPC. I want to mark it as Done now that the Administration Guide exists, but I don't want to miss any items that still need to be done.

Here is the list of items and their status:

Admin:

  • Installation and configuration of the workload manager [Done]
  • Provisioning of compute nodes, document clustduct [DOCTEAM-158]
  • List available HPC libraries, describe their installation features [Done] Does it need to be improved/expanded?
  • Describe Installation of toolchains and libraries [Done] Does it need to be improved/expanded?
  • Describe available admin tools [Done and Done] Are any missing?
  • Considerations about how to set up RDMA [SLE-11046]

User:

  • Interact with the workload manager [Done] Does it need more?
  • Use environment modules [Done?] Is this adequately covered?
  • Check what software components are available [I'm not sure if we have this?]
  • Build own software component [Spack?]
  • [Do we still need a separate User Guide?]

Installation guide

There is a draft quick start guide in PR #10, written in February 2021. It was still a WIP and is now out of date.
(PDF build for reference: art-hpc-install_color_en.pdf)

Here are some notes to start working towards a new installation guide:

Basic cluster:

  • One management server with HPC Management Server role
  • Two compute nodes with HPC Compute Node role

Optional nodes:

  • Additional management servers
  • Additional compute nodes
  • Storage node
  • Database node

There's also a system role called HPC Development Node. Should we give guidance on when to use this?

Install procedure in original draft guide (PR #10):

Prepare management server:

  • Install SLE with HPC Management Server role
  • Set up genders database
  • Set up SSH access
  • Set up network storage
  • (Note: these steps in the draft assume that the compute nodes exist already. There's no step for installing them. So in a future draft these steps should go after setting up the nodes.)

Prepare cluster:

  • Install munge and distribute key
  • Set up and sync NTP
  • Sync genders database
  • Add NFS mount option
  • Install slurm

Latest install procedure for new guide (needs review):

Prepare management server:

  • Install SLE with HPC Management Server role
  • Which other setup tasks need to be done manually before setting up the compute nodes?

Prepare compute nodes:

  • Deploy compute nodes with clustduct
  • How much does clustduct do? Seems like it can install the nodes? Does it create VMs? Would this mean a prerequisite is to have libvirt installed? What about bare metal nodes (is anyone likely to have those)?
  • Which other setup tasks need to be done manually?

Install Slurm?

Upgrade guide

Would it be worth having an Upgrade Guide?

The first step would just be a link to the SLES Upgrade Guide, but then Slurm needs to be upgraded manually. Are there any other HPC-specific upgrade or post-upgrade tasks that would make a guide (or article) worthwhile? Maybe you can use pdsh to help upgrade the compute nodes?

What is meant here -- the Slurm database or slurmdbd?

@mslacken @e4t Could you take a look at this line and check whether this is supposed to mean the Slurm database or slurmdbd?

The original wording had the clearly typo'd slurmbd [note the position of b and d]. This has since been changed to slurmdb, but I am not sure that the change was correct.

Current version

doc-hpc/xml/slurm.xml

Lines 794 to 797 in 6b585df

<title>Convert primary slurmdb first</title>
<para> If a backup database daemon is used, the primary one needs to be
converted first. The backup will not start until this has happened.
</para>

Original version with slurmbd typo

doc-hpc/xml/slurm.xml

Lines 1027 to 1031 in 38b3183

<title>Convert primary slurmbd first</title>
<para>
If a backup database daemon is used, the primary one needs to be
converted first. The backup will not start until this has happend.
</para>
