
msr-safe's Introduction

NAME

msr_safe - kernel module implementing access-control lists for model-specific registers

SYNOPSIS

/dev/cpu/<cpuid>/msr_safe /dev/cpu/msr_batch /dev/cpu/msr_allowlist /dev/cpu/msr_version msr-save

OVERVIEW

msr_safe provides controlled userspace access to model-specific registers (MSRs). It allows system administrators to give register-level read access and bit-level write access to trusted users in production environments. This access is useful where kernel drivers have not caught up with new processor features, or where performance constraints require batch access across dozens or hundreds of registers.

SETUP

Building the kernel module requires Linux kernel headers. Best practice for production environments is to create msr-user and msr-admin groups. Members of the former can read and write MSRs using either the per-CPU interface or the batch interface, subject to the restrictions specified in the allowlist. Members of the latter can also change the contents of the allowlist.

git clone https://github.com/LLNL/msr-safe
cd msr-safe
make
sudo insmod ./msr-safe.ko
sudo chmod g+rw /dev/cpu/*/msr_safe /dev/cpu/msr_*
sudo chgrp msr-user /dev/cpu/*/msr_safe /dev/cpu/msr_batch /dev/cpu/msr_version
sudo chgrp msr-admin /dev/cpu/msr_allowlist

msr_safe uses dynamically allocated major device numbers. These can conflict with devices that use hard-coded numbers. To work around this, major device numbers can be specified during module load.

sudo insmod msr-safe.ko \
                [ mdev_msr_safe=<#> ] \
                [ mdev_msr_allowlist=<#> ] \
                [ mdev_msr_batch=<#> ] \
                [ mdev_msr_version=<#> ] 

Use rmmod(8) to unload msr-safe.

sudo rmmod msr-safe

DESCRIPTION

/dev/cpu/msr_allowlist

Contains a list of model-specific registers and their write masks. Supports read(2), write(2) and open(2). Any MSR access using msr_safe or msr_batch is checked against this list. An MSR can be read if its address is present in the list. An MSR can only be written if its address is present in the list and at least one bit is writable as indicated by the write mask. For example, the following entry marks the MSR at address 0x10 (the time stamp counter) as read-only, as the write mask is 0.

0x00000010 0x0000000000000000 # "MSR_TIME_STAMP_COUNTER"

This entry allows MSR_PERF_CTL (at address 0x199) to be read, but only the bottom sixteen bits are writeable.

0x00000199 0x000000000000ffff # "MSR_PERF_CTL"
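The two entries above can be interpreted with a few lines of C. This is a minimal userspace sketch, not the kernel's parser; the names `allowlist_entry`, `parse_allowlist_line`, and `entry_is_writable` are hypothetical and do not come from msr-safe.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace view of one allowlist line. */
struct allowlist_entry
{
    uint32_t msr;   /* MSR address */
    uint64_t wmask; /* 1-bits mark the bits that may be written */
};

/* Parse "<address> <write mask> [# comment]".
 * Returns 0 on success, -1 if the line does not start with two hex fields. */
int parse_allowlist_line(const char *line, struct allowlist_entry *e)
{
    uint32_t msr;
    uint64_t wmask;

    assert(line != NULL && e != NULL);
    if (2 != sscanf(line, "%" SCNx32 " %" SCNx64, &msr, &wmask))
    {
        return -1;
    }
    e->msr = msr;
    e->wmask = wmask;
    return 0;
}

/* Any listed MSR is readable; it is writable only if some mask bit is set. */
int entry_is_writable(const struct allowlist_entry *e)
{
    return e->wmask != 0;
}
```

With these helpers, the MSR_TIME_STAMP_COUNTER line parses to address 0x10 and is reported as read-only, while the MSR_PERF_CTL line parses to address 0x199 with the low sixteen bits writable.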

It is up to the system administrator to create appropriate per-architecture, per-user allowlists. The "safety" of a particular MSR depends on the totality of the environment. The msr-safe repo provides sample allowlists that have been useful in other installations; they may or may not be appropriate for yours.

To see the existing allowlist:

cat /dev/cpu/msr_allowlist

The output will look something like:

# MSR      Write mask
0x00000010 0x0000000000000000
0x00000017 0x0000000000000000
0x000000C1 0x0000000000000000
...

Comments are not preserved.

To install a new allowlist:

cat <new_allowlist> > /dev/cpu/msr_allowlist

Writing, appending, or modifying a loaded allowlist discards the existing allowlist.

Parsing a new allowlist is done in two passes. If an error occurs during the first pass the existing allowlist is undisturbed. If an error occurs during the second pass the allowlist is reset to be empty. In practice, the most common second-phase error is the discovery of a duplicate allowlist entry. See ERRORS for details.
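Because a duplicate entry discovered in the second pass leaves the allowlist empty, it may be worth checking a candidate allowlist for duplicates in userspace before writing it. The following is a sketch under that assumption; `first_duplicate` is a hypothetical helper, not part of msr-safe, and it expects the MSR addresses to have been extracted from the file already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first address that repeats an earlier one,
 * or -1 if every address is unique. Quadratic, which is acceptable for
 * the few hundred entries a typical allowlist holds. */
long first_duplicate(const uint32_t *msrs, size_t n)
{
    assert(msrs != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
    {
        for (size_t j = 0; j < i; j++)
        {
            if (msrs[i] == msrs[j])
            {
                return (long)i;
            }
        }
    }
    return -1;
}
```

A non-negative return value identifies the offending entry, so the file can be fixed before the existing allowlist is put at risk.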

/dev/cpu/<cpuid>/msr_safe

Per logical-cpu interface for model-specific registers. Supports llseek(2), read(2), write(2), and open(2). Reads or writes a single MSR at a time. To access multiple MSRs and/or MSRs across multiple logical CPUs, use /dev/cpu/msr_batch.

The most common approach is to use pread(2) and pwrite(2), as these combine the seek operation with reading and writing. Alternatively, the device supports SEEK_SET and SEEK_CUR parameters to llseek(2), but not SEEK_END. Both reads and writes must be exactly 8 bytes.
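The pread(2)/pwrite(2) approach can be sketched as follows, with the MSR address used as the file offset and exactly 8 bytes transferred per call. The wrapper names `msr_open`, `msr_read`, and `msr_write` are hypothetical; only the device path and the 8-byte rule come from the text above.

```c
#define _XOPEN_SOURCE 700 /* pread/pwrite under strict -std=c11 */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Open the msr_safe device for one logical CPU; flags as for open(2). */
int msr_open(unsigned cpu, int flags)
{
    char path[64];
    int n = snprintf(path, sizeof(path), "/dev/cpu/%u/msr_safe", cpu);
    if (n < 0 || (size_t)n >= sizeof(path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    return open(path, flags);
}

/* Read one MSR. Returns 0 on success, -1 with errno set on failure. */
int msr_read(int fd, uint32_t msr, uint64_t *val)
{
    assert(val != NULL);
    ssize_t n = pread(fd, val, sizeof(*val), (off_t)msr);
    if (n == (ssize_t)sizeof(*val))
    {
        return 0;
    }
    if (n >= 0)
    {
        errno = EIO; /* short transfer: treat as an I/O error */
    }
    return -1;
}

/* Write one MSR. Returns 0 on success, -1 with errno set on failure. */
int msr_write(int fd, uint32_t msr, uint64_t val)
{
    ssize_t n = pwrite(fd, &val, sizeof(val), (off_t)msr);
    if (n == (ssize_t)sizeof(val))
    {
        return 0;
    }
    if (n >= 0)
    {
        errno = EIO;
    }
    return -1;
}
```

A caller would typically do `int fd = msr_open(0, O_RDWR);` once per CPU and reuse the descriptor; on failure, errno carries one of the values listed under ERRORS.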

/dev/cpu/msr_batch

Batch interface for MSR access. Only supports ioctl(2), with the first parameter being the file descriptor, the second parameter being X86_IOC_MSR_BATCH (defined in msr_safe.h), and the third parameter being a pointer to a struct msr_batch_array.

struct msr_batch_array
{
    __u32 numops;             // In: # of operations in operations array
    struct msr_batch_op *ops; // In: Array[numops] of operations
};

The maximum numops is system-dependent, but 30k operations is not unheard-of. Each op is contained in a struct msr_batch_op:

struct msr_batch_op
{
    __u16 cpu;     // In: CPU to execute {rd/wr}msr instruction
    __u16 isrdmsr; // In: 0=wrmsr, non-zero=rdmsr
    __s32 err;     // Out: set if error occurred with this operation
    __u32 msr;     // In: MSR address
    __u64 msrdata; // In/Out: Data to write or data that was read
    __u64 wmask;   // Out: Write mask applied to wrmsr
};

The cpu uses the same numbering found in /dev/cpu/<cpuid>. A zero value for isrdmsr indicates a write operation, any other value indicates a read operation. err is populated by the kernel if there is an error on a particular operation, and will be one of ENXIO (the virtual CPU does not exist or is offline), EACCES (the requested MSR was not found in the allowlist), or EROFS (a write operation was attempted on an MSR with a write mask of 0).

msr is the address of the model-specific register. msrdata is the value that will be written to or read from the MSR, respectively. Finally, the wmask records the writemask for the MSR provided in the allowlist.
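To make the field descriptions concrete, here is a small sketch that fills an operations array with one read (or one write) per CPU. The struct definitions are local mirrors of the ones shown above, written with stdint types so the block stands alone; real code should include msr_safe.h instead. `fill_read_batch` and `fill_write_batch` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structs from msr_safe.h (assumed identical layout). */
struct msr_batch_op
{
    uint16_t cpu;
    uint16_t isrdmsr;
    int32_t err;
    uint32_t msr;
    uint64_t msrdata;
    uint64_t wmask;
};

struct msr_batch_array
{
    uint32_t numops;
    struct msr_batch_op *ops;
};

/* One rdmsr of `msr` on each of CPUs 0..ncpus-1. */
void fill_read_batch(struct msr_batch_array *batch, struct msr_batch_op *ops,
                     uint16_t ncpus, uint32_t msr)
{
    assert(batch != NULL && ops != NULL);
    memset(ops, 0, ncpus * sizeof(*ops)); /* clears err, msrdata, wmask */
    for (uint16_t i = 0; i < ncpus; i++)
    {
        ops[i].cpu = i;
        ops[i].isrdmsr = 1; /* non-zero means rdmsr */
        ops[i].msr = msr;
    }
    batch->numops = ncpus;
    batch->ops = ops;
}

/* One wrmsr of `value` to `msr` on each of CPUs 0..ncpus-1. */
void fill_write_batch(struct msr_batch_array *batch, struct msr_batch_op *ops,
                      uint16_t ncpus, uint32_t msr, uint64_t value)
{
    fill_read_batch(batch, ops, ncpus, msr);
    for (uint16_t i = 0; i < ncpus; i++)
    {
        ops[i].isrdmsr = 0; /* zero means wrmsr */
        ops[i].msrdata = value;
    }
}
```

The filled `struct msr_batch_array` is then passed by pointer as the third argument of ioctl(2) with X86_IOC_MSR_BATCH; after the call, `msrdata` holds the values read and `wmask` the write mask taken from the allowlist.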

/dev/cpu/msr_safe_version

Starting with version 1.6, this device contains the loaded version of msr-safe.

RETURN VALUES

On success, calls to write(2) and read(2) return the number of bytes written or read, which in the case of /dev/cpu/<cpu>/msr_safe will be 8 (as only a single register per call may be written to or read from). llseek(2) returns the new file offset. open(2) returns the new file descriptor. ioctl(2) returns 0.

On error, all of these system calls return -1 and set errno to the appropriate value. The errors listed below are specific to msr_safe. The man pages for the individual system calls describe additional errors that may occur.

ERRORS

/dev/cpu/msr_allowlist

write(2)

E2BIG <count> exceeds MAX_WLIST_BSIZE (defined as (128 * 1024) + 1)

EILSEQ Unexpected EOF.

EINVAL Address or writemask caused parsing error.

EFAULT Kernel copy_from_user() failed.

ENOMEM Kernel unable to allocate memory to hold the raw or parsed allowlist.

ENOMSG No valid allowlist entries found.

ENOTUNIQ Duplicate allowlist entries found.

ERANGE Address or writemask is too large for an unsigned long long.

read(2)

E2BIG The read(2) <count> parameter was less than 60 bytes.

EFAULT Kernel copy_to_user() failed.

llseek(2)

EINVAL The <whence> parameter was neither SEEK_CUR nor SEEK_SET, e.g., SEEK_END.

/dev/cpu/<cpuid>/msr_safe

read(2)

EACCES The MSR requested is not in the allowlist.

EBUSY Requested virtual CPU is (temporarily?) locked.

EFAULT Kernel copy_to_user() failed.

EIO A general protection fault occurred. See the description for EIO errors in the /dev/cpu/msr_batch section below.

EINVAL Number of bytes requested to read is something other than 8.

ENXIO Requested virtual CPU does not exist or is offline.

write(2)

EACCES The MSR requested is not in the allowlist.

EBUSY Requested virtual CPU is (temporarily?) locked.

EFAULT Kernel copy_from_user() failed.

EIO A general protection fault occurred. See the description for EIO errors in the /dev/cpu/msr_batch section below.

EINVAL Number of bytes requested to write is something other than 8.

ENXIO Requested virtual CPU does not exist or is offline.

open(2)

EIO Model-specific registers not supported on this virtual CPU.

ENXIO Requested virtual CPU does not exist or is offline.

/dev/cpu/msr_batch

ioctl(2)

All of the operations in the batch will be executed. Each operation may result in an EIO, ENXIO, EACCES, or EROFS error, which will be recorded in the msr_batch_op struct. If any operation caused an error, the first such error becomes the return value for ioctl(2).

E2BIG Kernel unable to allocate memory to hold the array of operations.

EACCES An individual operation requested an MSR that is not present in the allowlist.

EBADF The msr_batch file was not opened for reading.

EFAULT Kernel copy_from_user() or copy_to_user() failed.

EINVAL Number of requested batch operations is <=0.

EIO A general protection fault occurred. On Intel processors this can be caused by a) attempting to access an MSR outside of ring 0, b) attempting to access a non-existent or reserved MSR address, c) writing 1-bits to a reserved area of an MSR, d) writing a non-canonical address to MSRs that take memory addresses, or e) writing to MSR bits that are marked as read-only.

ENOMEM Kernel unable to allocate memory to hold the results of zalloc_cpumask_var().

ENOTTY Invalid ioctl command. As of this writing the only ioctl command supported on this device is X86_IOC_MSR_BATCH, defined in msr_safe.h.

ENXIO An individual operation requested a virtual CPU that does not exist or is offline.

EROFS An individual operation requested a write to a read-only MSR.
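Since the ioctl(2) return value reflects only the first failing operation, a caller that wants every failure has to scan the `err` fields after the call. The following is a minimal sketch of that scan. The struct is a local mirror of the one in msr_safe.h, the helper names are hypothetical, and because the sign convention of `err` is not stated here, any non-zero value is treated as a failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct msr_batch_op from msr_safe.h (assumed layout). */
struct msr_batch_op
{
    uint16_t cpu;
    uint16_t isrdmsr;
    int32_t err;
    uint32_t msr;
    uint64_t msrdata;
    uint64_t wmask;
};

/* Index of the first operation with a non-zero err, or -1 if none failed. */
long first_failed_op(const struct msr_batch_op *ops, uint32_t numops)
{
    assert(ops != NULL || numops == 0);
    for (uint32_t i = 0; i < numops; i++)
    {
        if (ops[i].err != 0)
        {
            return (long)i;
        }
    }
    return -1;
}

/* Number of operations in the batch that recorded an error. */
size_t count_failed_ops(const struct msr_batch_op *ops, uint32_t numops)
{
    size_t failed = 0;
    assert(ops != NULL || numops == 0);
    for (uint32_t i = 0; i < numops; i++)
    {
        failed += (ops[i].err != 0);
    }
    return failed;
}
```

Both helpers only read the array, so they can be called on the same `ops` buffer that was handed to the ioctl.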

open(2)

There are no msr_safe-specific error conditions.

msr-save

The msrsave utility provides a mechanism for saving and restoring MSR values based on entries in the allowlist. To restore MSR values, the register must have an appropriate writemask.
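One way to picture why the writemask matters for a restore is to compute the value a masked restore would aim for: writable bits come from the saved value, all other bits keep their current contents. This sketch illustrates that idea only; it is an assumption about the general technique, not a description of how msrsave is implemented, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "MSR values are 64 bits wide");

/* Combine a saved value with the current register contents, taking only
 * the bits that the allowlist write mask permits to change. */
uint64_t masked_restore_value(uint64_t current, uint64_t saved, uint64_t wmask)
{
    return (current & ~wmask) | (saved & wmask);
}

/* A register with a zero write mask cannot be restored at all. */
int is_restorable(uint64_t wmask)
{
    return wmask != 0;
}
```

For the MSR_PERF_CTL entry shown earlier (write mask 0xffff), only the low sixteen bits of the saved value would be carried over.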

Modification of MSRs that are marked as safe in the allowlist may impact subsequent users on a shared HPC system. It is important the resource manager on such a system use the msrsave utility to save and restore MSR values between allocating compute nodes to users. An example of this has been implemented for the SLURM resource manager as a SPANK plugin. This plugin can be built with the "make spank" target and installed with the "make install-spank" target. This uses the SLURM SPANK infrastructure to make a popen(3) call to the msrsave command line utility in the job epilogue and prologue.

The version of msrsave (and msr-safe) can be modified by updating the following compiler flag:

-DVERSION=\"MAJOR.MINOR.PATCH\"

The msrsave version can be queried with:

msrsave --version

Security

Model-specific registers

The safety of a particular model-specific register depends on the system environment. The sample allowlists provided were developed for non-classified high performance computing systems where only a single non-privileged user at a time can access a given compute node. These lists should be re-evaluated for use in other environments, particularly multi-user environments.

Filesystem permissions

msr-safe is designed to support multiple classes of users, each of which would have their own group and allowlist. Best practice is to unload and reload the msr-safe kernel module when changing device ownership or permissions. If this is not done, a lower-privileged user can open /dev/cpu/msr_batch and retain the file descriptor. If the permissions (and allowlist) are then changed to allow higher-privileged users to run, and the allowlist remains readable by the less-privileged user, the less-privileged user can continue using their original file descriptor with the higher-privileged allowlist.

FAQ

Can I append or modify an allowlist in place?

No. Each write(2) call discards the previous allowlist.

What happens if an allowlist is changed during an ioctl(2) call?

The kernel records all of the relevant writemasks in the struct msr_batch_op prior to executing the ops. If the allowlist is changed during a call, the new allowlist will be applied to subsequent calls.

How many operations can fit into one batch?

Determining the formula to provide an upper bound is almost certainly more trouble than it's worth, but we have easily gotten 30k entries in a single batch on production machines.

What happens if a CPU is taken offline or brought back online?

We haven't had a good reason to wire up hotplugging. If the collection of online CPUs changes, it's best to unload and reload the msr-safe kernel module.

What happens if a CPU is taken offline and a user still has an open file descriptor for that device?

The kernel checks to see if a CPU is online. Attempts to access MSRs using that file descriptor should generate an error.

Can the batch API be extended to do other operations such as polling?

It can and it has. If you need this functionality please let us know. The code is brittle enough that we don't use it in production, but we are happy to share.

EXAMPLE CODE

/* This example assumes the user has the following permissions:
 *
 * write        /dev/cpu/msr_allowlist
 * read/write   /dev/cpu/<cpu_number>/msr_safe
 * read         /dev/cpu/msr_batch
 *
 * Typically, only the administrator will have write permissions
 * on the allowlist.
 *
 * Production code should have more robust error handling than
 * what is shown here.
 *
 * This example should be able to run successfully on an x86
 * processor from the past ten years or so.
 *
 */


#include <stdio.h>      // printf(3)
#include <assert.h>     // assert(3)
#include <fcntl.h>      // open(2)
#include <unistd.h>     // write(2), pwrite(2), pread(2)
#include <string.h>     // strlen(3), memset(3)
#include <stdint.h>     // uint8_t
#include <inttypes.h>   // PRIu8
#include <stdlib.h>     // exit(3)
#include <sys/ioctl.h>  // ioctl(2)

#include "../msr_safe.h"   // batch data structs

#define MSR_MPERF 0xE7

char const *const allowlist = "0xE7 0xFFFFFFFFFFFFFFFF\n";  // MPERF

static uint8_t const nCPUs = 32;

void set_allowlist()
{
    int fd = open("/dev/cpu/msr_allowlist", O_WRONLY);
    assert(-1 != fd);
    ssize_t nbytes = write(fd, allowlist, strlen(allowlist));
    assert(strlen(allowlist) == nbytes);
    close(fd);
}

void measure_serial_latency()
{
    int fd[nCPUs], rc;
    char filename[255];
    uint64_t data[nCPUs];
    memset(data, 0, sizeof(uint64_t)*nCPUs);

    // Open each of the msr_safe devices (one per CPU)
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = snprintf(filename, 254, "/dev/cpu/%"PRIu8"/msr_safe", i);
        assert(-1 != rc);
        fd[i] = open(filename, O_RDWR);
        assert(-1 != fd[i]);
    }
    // Write 0 to each MPERF register
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = pwrite(fd[i], &data[i], sizeof(uint64_t), MSR_MPERF);
        assert(8 == rc);
    }

    // Read each MPERF register
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = pread(fd[i], &data[i], sizeof(uint64_t), MSR_MPERF);
        assert(8 == rc);
    }

    // Show results
    printf("Serial cycles from first write to last read: "
           "%"PRIu64" (on %"PRIu8" CPUs)\n",
           data[nCPUs - 1], nCPUs);
}

void measure_batch_latency()
{
    struct msr_batch_array rbatch, wbatch;
    struct msr_batch_op r_ops[nCPUs], w_ops[nCPUs];
    int fd, rc;

    fd = open("/dev/cpu/msr_batch", O_RDONLY);
    assert(-1 != fd);

    for (uint8_t i = 0; i < nCPUs; i++)
    {
        r_ops[i].cpu = w_ops[i].cpu = i;
        r_ops[i].isrdmsr = 1;
        w_ops[i].isrdmsr = 0;
        r_ops[i].msr = w_ops[i].msr = MSR_MPERF;
        w_ops[i].msrdata = 0;
    }
    rbatch.numops = wbatch.numops = nCPUs;
    rbatch.ops = r_ops;
    wbatch.ops = w_ops;

    rc = ioctl(fd, X86_IOC_MSR_BATCH, &wbatch);
    assert(-1 != rc);
    rc = ioctl(fd, X86_IOC_MSR_BATCH, &rbatch);
    assert(-1 != rc);

    printf("Batch cycles from first write to last read: "
           "%llu (on %"PRIu8" CPUs)\n",
           r_ops[nCPUs - 1].msrdata, nCPUs);
}

int main()
{
    set_allowlist();
    measure_serial_latency();
    measure_batch_latency();
    return 0;
}

Release

msr-safe is released under the GPL v2.0 license. For more details, please see the LICENSE and NOTICE files.

SPDX-License-Identifier: GPL-2.0-only

LLNL-CODE-807679

License and LLNL release number have been corrected to match internal records.

msr-safe's People

Contributors

bensallen, bgeltz, cmcantalupo, dianarg, hramrach, infamicstudios, jf6b, kfan326, kshoga1, mcfadden8, rountree, rountree-alt, slabasan, tpatki


msr-safe's Issues

RPM spec incompatible with SUSE

I'm currently using msr-safe on SLES 15.2. I've built the following RPMs from the reference spec:

msr-safe-1.5.0.git5876ae79a155-1.x86_64.rpm
msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64.rpm

Spec: https://github.com/LLNL/msr-safe/blob/main/rpm/msr-safe.spec

When installing the non-kmp RPM, I see the following output:

# zypper -n --root $CHROOT --no-gpg-checks install msr-safe*rpm
Reading installed packages...
Resolving package dependencies...

The following 2 NEW packages are going to be installed:
  msr-safe msr-safe-kmp-default

The following 2 packages have no support information from their vendor:
  msr-safe msr-safe-kmp-default

2 new packages to install.
Overall download size: 211.4 KiB. Already cached: 0 B. After the operation, additional 1022.5 KiB will
be used.
Continue? [y/n/v/...? shows all options] (y): y

Checking for file conflicts: (2 skipped) .........................................................[done]
Warning: 2 packages had to be excluded from file conflicts check because they are not yet downloaded.

    Note: Checking for file conflicts requires not installed packages to be downloaded in advance in
    order to access their file lists. See option '--download-in-advance / --dry-run --download-only'
    in the zypper manual page for details.

Retrieving package msr-safe-1.5.0.git5876ae79a155-1.x86_64         (1/2),  24.9 KiB (141.8 KiB unpacked)
msr-safe-1.5.0.git5876ae79a155-1.x86_64.rpm:
    Package is not signed!
msr-safe-1.5.0.git5876ae79a155-1.x86_64 (Plain RPM files cache): Signature verification failed [6-File is unsigned]
Accepting package despite the error. (--no-gpg-checks)

(1/2) Installing: msr-safe-1.5.0.git5876ae79a155-1.x86_64 ........................................[done]
Additional rpm output:
/var/tmp/rpm-tmp.QtGMlg: line 2: weak-modules: command not found


Retrieving package msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64
                                                                   (2/2), 186.6 KiB (880.7 KiB unpacked)
msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64.rpm:
    Package is not signed!
msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64 (Plain RPM files cache): Signature verification failed [6-File is unsigned]
Accepting package despite the error. (--no-gpg-checks)

(2/2) Installing: msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64 ..............[done]
Additional rpm output:
comm: /dev/fd/63: No such file or directory
sort: fflush failed: 'standard output': Broken pipe


Executing %posttrans scripts .....................................................................[done]

The important line is about "weak-modules: command not found". This command is not present on SUSE/SLES. This is only a warning at install time, so the installation proceeds OK.

The following issues are encountered when attempting to uninstall:

# zypper --root=${CHROOT} remove msr-safe msr-safe-kmp-default
Reading installed packages...
Resolving package dependencies...

The following 2 packages are going to be REMOVED:
  msr-safe msr-safe-kmp-default

2 packages to remove.
After the operation, 1022.5 KiB will be freed.
Continue? [y/n/v/...? shows all options] (y): y
(1/2) Removing msr-safe-1.5.0.git5876ae79a155-1.x86_64 ..........................................[error]
Removal of (308)msr-safe-1.5.0.git5876ae79a155-1.x86_64(@System) failed:
Error: Subprocess failed. Error: RPM failed: /var/tmp/rpm-tmp.DTTEWJ: line 5: weak-modules: command not found
error: %preun(msr-safe-1.5.0.git5876ae79a155-1.x86_64) scriptlet failed, exit status 127
error: msr-safe-1.5.0.git5876ae79a155-1.x86_64: erase failed

Abort, retry, ignore? [a/r/i] (a): i
(2/2) Removing msr-safe-kmp-default-1.5.0.git5876ae79a155_k5.3.18_24.67-1.x86_64 .........................................................................................................................[done]

Here, the missing weak-modules command is an error that prevents the non-kmp RPM from being uninstalled. The workaround (identified by @cmcantalupo) to uninstall the RPM once you're in this state is to temporarily create a weak-modules command that is a no-op based on the true command:

sudo cp -p $(which true) /usr/sbin/weak-modules
(from within the CHROOT if applicable)

Then remove the weak-modules hack once the uninstall is complete to ensure it is not accidentally used in the future.

The fact remains that weak-modules missing from SUSE prevents uninstall of the msr-safe rpm while the msr-safe-kmp-default rpm uninstalls fine in this case. The spec needs to be fixed up to account for this issue.

What is not clear to me is whether the weak-modules logic in the spec file is even needed for SUSE-based distros. At least once in the past I updated the kernel on my VNFS while I had msr-safe installed, and things seemed to propagate properly in spite of this command missing. I saw entries created in my new kernel's weak-updates directory with symlinks back to my original install:

[bgeltz@mcfly1 modules]$ ls -la ./5.3.18-24.67-default/weak-updates/extra/msr-safe/msr-safe.ko
lrwxrwxrwx 1 root root 60 Jun 11 15:12 ./5.3.18-24.67-default/weak-updates/extra/msr-safe/msr-safe.ko -> /lib/modules/5.3.18-24.64-default/extra/msr-safe/msr-safe.ko

So it seems possible that this all happens by some other automatic mechanism in SUSE and this logic is not necessary. If that's the case, this snippet suggested by @cmcantalupo should be all that's needed:

if which weak-modules; then
	echo /lib/modules/%{latest_kernel}/extra/msr-safe/msr-safe.ko | weak-modules --add-modules
fi

Thoughts on the way to resolve this properly?

msrsave-application

It would be very useful if there were a utility that could be used to save and restore MSR values that have write permission in the white-list. We are working on such an application here:

https://github.com/cmcantalupo/msr-safe/tree/cmcantalupo-msrsave

One common use case would be for a resource manager to create a save file in a job prologue and then restore the values in a job epilogue. This is similar to the utility provided by the geopmpolicy application here:

http://geopm.github.io/geopm/man/geopmpolicy.1.html

but applied to all writable MSRs not just the specific MSRs written by geopm. Any comments or questions are welcome.

Fails to compile on kernel 4.8

Running make results in the following:

make -C /lib/modules/4.8.0-26-generic/build M=/local/msr-safe modules 
make[1]: Entering directory '/usr/src/linux-headers-4.8.0-26-generic'
  CC [M]  /local/msr-safe/msr_entry.o
/local/msr-safe/msr_entry.c: In function 'msr_seek':
/local/msr-safe/msr_entry.c:61:19: error: 'struct inode' has no member named 'i_mutex'; did you mean 'i_mode'?
  mutex_lock(&inode->i_mutex);
                   ^~
/local/msr-safe/msr_entry.c:74:21: error: 'struct inode' has no member named 'i_mutex'; did you mean 'i_mode'?
  mutex_unlock(&inode->i_mutex);
                     ^~
scripts/Makefile.build:289: recipe for target '/local/msr-safe/msr_entry.o' failed
make[2]: *** [/local/msr-safe/msr_entry.o] Error 1
Makefile:1489: recipe for target '_module_/local/msr-safe' failed
make[1]: *** [_module_/local/msr-safe] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.8.0-26-generic'
Makefile:31: recipe for target 'all' failed
make: *** [all] Error 2

Add note about binaries using stock msr kernel

The binary needs to have CAP_SYS_RAWIO permissions if using the stock msr kernel (and not msr-safe). Add a note about this requirement in the README.

From msr.c:

if (!capable(CAP_SYS_RAWIO))
        return -EPERM;

msrsave fails to restore 64-bit controls

For example, the TURBO_RATIO_LIMIT MSR uses all 64 bits. From what I remember, the reasoning behind this behavior of msrsave was that if any MSR in the whitelist has a completely open mask (all F) then it must be a counter, and therefore should not be restored. I'm not sure how we can tell the difference between these from the whitelist alone. @cmcantalupo

Remove polling from `read(2)` and `write(2)` interface to `msr_safe`

The current read() interface to the /dev/cpu/*/msr_safe devices allows the user to specify a buffer size of 8n bytes. If the specified size is >8, the requested read is repeated until the buffer is full.

Strangely enough, the write() interface has exactly the same loop, causing the write to be executed once for every 8 bytes in the buffer.

Both of these loops date back to the original msr kernel module. I'd suggest removing both loops as the batch interface provides a richer interface for polling multiple MSRs.

Remove `ioctl` interface from `msr_safe`

Background

msr-safe was designed originally to extend the interfaces of the stock msr kernel module, in part to make it easier to argue that it should be included into the mainline linux kernel. After a concerted push by our colleagues at RedHat and Intel, we were told that wasn't going to happen.

A while later we discovered that the ioctl interface in msr-safe was not checking the allowlist before doing reads and writes. This was not consistent with our security model, and there was a bit of a scramble to get it patched. The fix (as I recall) was to add a capabilities check for CAP_SYS_RAWIO in the msr_ioctl() function: it was too much work to add the allowlist check for an interface that no one was using, and if someone had the privs to make a binary with that capability, they already had root. (Note that being root is not sufficient to pass a capability check.)

Proposal

I'm now reviewing this code as part of a documentation pass and would like to remove this interface. There are a bunch of other ideas that have been floating around for a while, so those combined with the interface change will probably warrant a major version bump.

Compile warning stack too large

make[1]: Entering directory '/usr/src/linux-headers-5.4.0-42-generic'
  CC [M]  /home/rountree/repos/msr-safe/msr-smp.o
/home/rountree/repos/msr-safe/msr-smp.c: In function ‘msr_safe_batch’:
/home/rountree/repos/msr-safe/msr-smp.c:108:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=]
  108 | }
      | ^

The underlying problem is allocating a struct cpuset on the stack. On x86 builds this can track up to 8192 CPUs. We have not yet run into an issue with this, but I'd like to fix it all the same.

MAJOR Device Number Can Conflict with Other Modules

It looks like msr-safe dynamically picks the MAJOR device number. Depending on when we load msr-safe during boot, this leads to later-loaded kernel modules failing to load.

In our specific case, we're seeing ib_umad with MAJOR 231 failing to load with: user_mad: couldn't register device number consistently in RHEL7.5, RHEL's RDMA and OPA adapters. If msr-safe is unloaded, ib_umad can be loaded. Then msr-safe can be loaded, since it picks a higher MAJOR number.

Presumably a workaround for this is to make the msr-safe systemd unit After: rdma, but this seems pretty fragile.

Other ideas:

  • Could msr-safe add a module parameter to specify the major number?
  • Can msr-safe hard code an unused major number (if there is such a thing)?

Thoughts?

remove capabilities check

As part of backwards compatibility with the stock msr kernel module we wanted to allow userspace programs with CAP_SYS_RAWIO to access MSRs without an allowlist check. In practice, this hasn't proven to be practical.

In environments where users have the ability to create CAP_SYS_RAWIO-capable programs, they also have the necessary privileges to set file permission on msr-safe devices. In environments where users don't have those privileges, they're probably not supposed to have unfettered access to MSRs.

I'm not sure I can come up with a plausible scenario where CAP_SYS_RAWIO makes sense for the stock msr module. I see even less of a need for us to carry it forward absent the backwards compatibility argument.

msrsave: Print changes on restore

Requesting a new feature of msrsave.

On restore, print the changes, including the MSR changed, the old value, and optionally the new value. We're interested in using this functionality at job end, when we restore a set of default values on nodes, to provide internal logging of what users are doing with msr-safe. Ideally the output printed should be easily machine-parsable, e.g., JSON or CSV.

It would be nice to have this upstreamed and combined with kernel lockdown

Since lockdown was added in 5.4 (https://lwn.net/Articles/706637/) I feel that your module should be combined with it and hopefully upstreamed.

Unfortunately, if one uses the early lockdown option, LSM can't be used to allow access to desired MSRs, in my case, for undervolting.

msr-safe would be a perfect way to allow certain MSRs to be readable and/or writable from userspace with lockdown enabled, and from there apparmor or whatever can control access to /dev/msr past that point, allowing only whitelisted users/applications to touch the allowed registers.

No way to determine buffer size contents at runtime

MAX_WLIST_BSIZE is the maximum size of an allowlist and is #defined in msr_whitelist.c as ((128 * 1024) + 1).

The maximum size of a batch array is (2^14-1), which is a function of the ioctl() interface.

While both limits can be added to the documentation, it's not clear that most users will have access to the documentation (as we're not installing man pages in expected places and the sysadmin will be doing the installation).

Should these limits be provided as part of the msr_version device?

Misc cleanup

Deprecate dev branch, bring v4.10 kernel updates and updated error reporting into master branch.

whitelist removal

Commit e7fcf0b removed the whitelist directory, which had contained whitelist files for different architectures. These are quite useful generally, but are also required to build the RPM which installs these files. On the tip of the dev branch the rpm/build_rpm.sh script breaks when looking for those files. What was the reason behind removing the whitelist files?

Getting Errors during make

I fired

CPPFLAGS="-I/usr/src/kernels/2.6.32-696.20.1.el6.x86_64/include -I$PWD -I$PWD/msrsave" make

inside of msr-safe directory and getting following errors:

cc -I/usr/src/kernels/2.6.32-696.20.1.el6.x86_64/include -I/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe -I/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msrsave -c -o msrsave/msrsave.o msrsave/msrsave.c
cc -I/usr/src/kernels/2.6.32-696.20.1.el6.x86_64/include -I/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe -I/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msrsave -c -o msrsave/msrsave_main.o msrsave/msrsave_main.c
cc msrsave/msrsave.o msrsave/msrsave_main.o -o msrsave/msrsave
make -C /lib/modules/2.6.32-696.20.1.el6.x86_64/build M=/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe modules
make[1]: Entering directory `/usr/src/kernels/2.6.32-696.20.1.el6.x86_64'
  CC [M]  /home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_entry.o
  CC [M]  /home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.o
/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.c: In function ‘find_in_whitelist’:
/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.c:303:9: error: implicit declaration of function ‘hash_or_each_possible’ [-Werror=implicit-function-declaration]
        hash_or_each_possible(whitelist_hash, entry, node, hlist, msr);
        ^
/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.c:303:60: error: ‘hlist’ undeclared (first use in this function)
        hash_or_each_possible(whitelist_hash, entry, node, hlist, msr);
                                                           ^
/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.c:303:60: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
make[2]: *** [/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe/msr_whitelist.o] Error 1
make[1]: *** [_module_/home/cc/vfaculty/nsingh.vfaculty/software/tars/msr-safe] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.32-696.20.1.el6.x86_64'
make: *** [all] Error 2

Compiler : gcc-4.9.3
Kernel: 2.6.32-696.20.1.el6.x86_64

Broadwell Whitelist (064F)

It appears the project doesn't include a whitelist for Broadwell-EP yet. Does anyone have one put together?

CPUID output:

  processor type  = primary processor (0)
  family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
  model           = 0xf (15)
  stepping id     = 0x1 (1)
  extended family = 0x0 (0)
  extended model  = 0x4 (4)
  (simple synth)  = Intel Xeon (Broadwell), 14nm

Thanks,

Ben
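The "064F" in the title follows from the CPUID fields listed above: family 6, extended model 4, model 0xF. A minimal sketch of that derivation is below. It uses the usual CPUID convention that the extended model is prepended for families 6 and 15 and the extended family is added for family 15. The function names are hypothetical, and the two-hex-digit family plus two-hex-digit model layout is inferred from the identifiers that appear in these issues (0645, 0657, 064F).

```c
#include <stdio.h>

/* Display family: the extended family is added only when family == 0xF. */
static unsigned display_family(unsigned family, unsigned ext_family)
{
    return family == 0xF ? family + ext_family : family;
}

/* Display model: the extended model supplies the high nibble
 * for families 6 and 15. */
static unsigned display_model(unsigned family, unsigned model,
                              unsigned ext_model)
{
    if (family == 0x6 || family == 0xF)
        return (ext_model << 4) | model;
    return model;
}

/* Hypothetical helper: format the four-hex-digit identifier, e.g. "064F".
 * Returns 0 on success, -1 if the buffer is too small. */
static int cpu_id_string(char *out, size_t outlen, unsigned family,
                         unsigned ext_family, unsigned model,
                         unsigned ext_model)
{
    int n = snprintf(out, outlen, "%02X%02X",
                     display_family(family, ext_family),
                     display_model(family, model, ext_model));
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with the values from the CPUID output above (family 6, extended family 0, model 0xF, extended model 4), cpu_id_string produces "064F".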

User provided ioctl() write mask is ignored

I believe this needs to be "&=", not "="; otherwise you are ignoring the user's request to write only a subset of the bits in the MSR.

op->wmask = msr_whitelist_writemask(op->msr);

As it is written, if someone provides a write mask in the ioctl structure that sets only the first 14 bits of the MSR_PKG_POWER_LIMIT msr, but the white list gives permission to write the entire register, then when they call into the batch ioctl all bits above bit 14 will be zeroed in the register.
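The effect can be illustrated in plain userspace C. This is a sketch of the masking arithmetic only, not the module's code: both function names are hypothetical, and the read-modify-write formula is the standard one rather than something quoted from msr_batch.c.

```c
#include <stdint.h>

/* The proposed "&=": a bit is writable only if both the caller
 * and the allowlist permit it. */
static uint64_t effective_wmask(uint64_t user_wmask, uint64_t allowlist_wmask)
{
    return user_wmask & allowlist_wmask;
}

/* Read-modify-write: bits set in wmask come from new_val,
 * all other bits keep their old value. */
static uint64_t masked_write(uint64_t old_val, uint64_t new_val,
                             uint64_t wmask)
{
    return (old_val & ~wmask) | (new_val & wmask);
}
```

For example, with an old register value of 0xAAAA1234, a new value of 0x0ABC, a user mask of 0x3FFF (the low 14 bits), and a fully writable allowlist entry, the combined mask gives 0xAAAA0ABC. Using the allowlist mask alone gives 0x0ABC, which zeroes every bit above bit 13, matching the behavior described above.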

Cannot get max non-turbo ratio and minimum ratio because register in whitelist missing

I haven't checked all of them, but some whitelists (0645, 0646, 0657, and the not-yet-added 064F) are missing the register MSR_PLATFORM_INFO, which contains values required to calculate the clock of the invariant TSC and the minimum clock speed.

Since the bits are read-only, an empty write mask is required:
0x000000CE 0x0000000000000000 # "SMSR_PLATFORM_INFO"

P.S. The title is wrong. I thought the register held the number of turbo ratio limits, but it actually holds frequency ratios.
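Once the register is readable, the two ratios can be extracted as sketched below. The bit positions (15:8 for the maximum non-turbo ratio, 47:40 for the maximum efficiency ratio, i.e. the minimum ratio) follow Intel's published layout for MSR_PLATFORM_INFO on recent Core and Xeon parts and should be checked against the manual for a given model. The function names are hypothetical, and the bus clock is passed in by the caller because it is not part of this register.

```c
#include <stdint.h>

#define MSR_PLATFORM_INFO 0xCE

/* Bits 15:8: maximum non-turbo ratio. */
static unsigned max_non_turbo_ratio(uint64_t platform_info)
{
    return (unsigned)((platform_info >> 8) & 0xFF);
}

/* Bits 47:40: maximum efficiency ratio (the minimum operating ratio). */
static unsigned min_ratio(uint64_t platform_info)
{
    return (unsigned)((platform_info >> 40) & 0xFF);
}

/* Frequency in MHz for a ratio. The bus clock is an input; the caller
 * must supply the correct value for the part in question. */
static unsigned ratio_to_mhz(unsigned ratio, unsigned bus_clock_mhz)
{
    return ratio * bus_clock_mhz;
}
```

The raw value would come from a read of /dev/cpu/<cpuid>/msr_safe at offset 0xCE, which only succeeds once the entry above is in the whitelist.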

whitelist for wl_064e

The whitelist file for my computer would be wl_064e, but it is not in the whitelists folder. Where can I get it?
Also, how are these whitelists made? Can I make my own by choosing which MSRs to access?

Bizarre ENXIO when dealing with /dev/cpu/msr_whitelist

I try not to file issues before I can be 100% sure that it's not something I'm doing, but I'm pretty lost.

We have msr-safe installed across our compute cluster. Once in a blue moon, the msr-safe init script seems to fail:
Aug 10 17:15:08 nid04370 msr-safe[5812]: /usr/sbin/msr-safe: line 19: /dev/cpu/msr_whitelist: No such device or address

This is where things get weird:

stat /dev/cpu/msr_whitelist 
  File: '/dev/cpu/msr_whitelist'
  Size: 0               Blocks: 0          IO Block: 4096   character special file
Device: 6h/6d   Inode: 21126       Links: 1     Device type: f0,0
Access: (0660/crw-rw----)  Uid: (    0/    root)   Gid: (  476/     msr)
Access: 2018-08-10 17:15:08.352126407 +0000
Modify: 2018-08-10 17:15:08.352126407 +0000
Change: 2018-08-10 17:15:08.352126407 +0000
 Birth: -

cat /usr/share/msr-safe/whitelists/wl_0657 >> /dev/cpu/msr_whitelist
-bash: /dev/cpu/msr_whitelist: No such device or address

cat /dev/cpu/msr_whitelist
cat: /dev/cpu/msr_whitelist: No such device or address

file /dev/cpu/msr_whitelist
/dev/cpu/msr_whitelist: character special (240/0)

Running strace gives a little hint:

open("/dev/cpu/msr_whitelist", O_RDONLY) = -1 ENXIO (No such device or address)

The only way to fix this is to reload the module, and the failure only seems to happen during a boot. It seems to happen only once in a couple hundred nodes, but we have a lot of nodes, so it's a problem. I can't reproduce the failure outside of boot; I've been running systemctl restart and modprobe and so on over and over in a loop, and it's always fine.

Next step is to deploy a version of msr-safe with DDEBUG compiled in. However, that's going to take time to deploy. For now, I'm filing this to see if anyone has some ideas.

Issues when building msr-safe rpms

Hi,
when using your specfile for building rpms, one has to add the dependency

BuildRequires: systemd-units

to be able to build the rpms using mock.
Without this dependency, the build cannot substitute %{_unitdir} and %{_udevrulesdir}. See this.

msr_whitelist input/output formatting

The output of cat on the current whitelist is formatted differently from the format the whitelist expects as input.

cat msr_whitelist > file.out
Sample printout:
MSR: 00000010 Write Mask: 0000000000000000
MSR: 00000017 Write Mask: 0000000000000000
MSR: 000000c1 Write Mask: ffffffffffffffff
MSR: 000000c2 Write Mask: ffffffffffffffff
MSR: 000000c3 Write Mask: ffffffffffffffff
MSR: 000000c4 Write Mask: ffffffffffffffff

If you are trying to overwrite the whitelist, it should be formatted like this:

MSR Write Mask # Comment

0x00000010 0x0000000000000000 # "SMSR_TIME_STAMP_COUNTER"
0x00000017 0x0000000000000000 # "SMSR_PLATFORM_ID"
0x000000C1 0xffffffffffffffff # "SMSR_PMC0"
0x000000C2 0xffffffffffffffff # "SMSR_PMC1"
0x000000C3 0xffffffffffffffff # "SMSR_PMC2"

We should make the outputs here be the same for ease in reading/writing the whitelist.
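Until the two formats match, a line of the current output can be rewritten into the input format with a small userspace helper. The sketch below is one way to do it. The function name is hypothetical, and the output casing (uppercase address, lowercase mask) simply mirrors the sample input lines above. Comments such as "SMSR_PMC0" are not present in the output and so cannot be restored.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Convert "MSR: 000000c1 Write Mask: ffffffffffffffff"
 * into    "0x000000C1 0xffffffffffffffff".
 * Returns 0 on success, -1 if the line does not match or out is too small. */
static int convert_whitelist_line(const char *line, char *out, size_t outlen)
{
    uint32_t msr;
    uint64_t mask;
    int n;

    if (sscanf(line, "MSR: %" SCNx32 " Write Mask: %" SCNx64,
               &msr, &mask) != 2)
        return -1;

    n = snprintf(out, outlen, "0x%08" PRIX32 " 0x%016" PRIx64, msr, mask);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Applying this to each line of file.out yields text in the input format shown above.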

msrsave restores wrong values

This seems to happen in situations where some MSRs from the whitelist are not available on a system (because of hyperthreading, SKU, etc.). The output from msrsave -r <file> seems to be writing back values from other offsets:

0x0000000000000FA4, 0x0000000000000000, 0x0000000000000200
0x0000000000000FA5, 0x0000000000000200, 0x0000000000000033
0x0000000000000FA6, 0x0000000000000033, 0x0000000000000000
0x0000000000000FB0, 0x0000000000030000, 0x0000000000000000
0x0000000000000FB4, 0x0000000000000000, 0x0000000000000200
0x0000000000000FB5, 0x0000000000000200, 0x0000000000000033
0x0000000000000FB6, 0x0000000000000033, 0x0000000000000000
0x00000000000000C3, 0x0000000000000000, 0xFB6A58813AEA28CF
0x00000000000000C4, 0x0000000000000000, 0xFB6A58813AEA28CF
0x0000000000000199, 0x0000000000002500, 0x0000000000000003
0x000000000000019B, 0x0000000000000003, 0x0000000000000000
0x000000000000019C, 0x00000000883F0800, 0x00000000883F0000
0x00000000000001AD, 0x1B1B1B1D20222325, 0xFB6A58813AEA28CF
0x00000000000001B0, 0x0000000000000007, 0x0000000000000003
0x000000000000038F, 0x000000070000000F, 0x000000000000000C
0x0000000000000610, 0x8003851000148438, 0x8000000000000000
0x0000000000000E00, 0x0000000000030000, 0x0000000000000000
0x0000000000000E03, 0x0000000000000000, 0x0000000000000200

This line looks suspicious to me:

write_val = restore_buffer[i + num_msr + j];

I'll do more testing to see if changing that to * helps.
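To see why the additive form is suspicious, consider a flat buffer that stores num_msr values per CPU. The sketch below assumes that i is the CPU index and j is the MSR index, which the issue does not state explicitly. The helper names are hypothetical.

```c
#include <stddef.h>

/* Row-major index: each CPU i owns a contiguous run of num_msr slots. */
static size_t restore_index(size_t i, size_t num_msr, size_t j)
{
    return i * num_msr + j;
}

/* The additive form from the issue, kept only for comparison. */
static size_t suspicious_index(size_t i, size_t num_msr, size_t j)
{
    return i + num_msr + j;
}
```

With num_msr = 3, the additive form maps (i=0, j=1) and (i=1, j=0) to the same slot, 4, so values from one position would be written back to another. The multiplicative form maps them to 1 and 3.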

Missing documentation on the need to set msrsave version number when building from source

When you build msr-safe from source using the instructions in the README, the version number reported by msrsave --version is 0.0.0. Upon inspecting the code, it looks like 0.0.0 is the default value that the VERSION variable is set to if VERSION hasn't already been defined in the compiler flags. The build commands included in the README don't define VERSION. Some tools like GEOPM are checking the msrsave version number to ensure that newer features of msr-safe are supported in the installed version; retrieving 0.0.0 causes an error.
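The fallback pattern described above, and the kind of check a client might run against the reported version, can be sketched as follows. Only the VERSION macro and its 0.0.0 default come from the issue. parse_version and version_at_least are hypothetical names, and this is not GEOPM's actual check.

```c
#include <stdio.h>

/* Fallback used when the build does not pass -DVERSION=\"x.y.z\". */
#ifndef VERSION
#define VERSION "0.0.0"
#endif

/* Parse "major.minor.patch". Returns 0 on success, -1 on malformed input. */
static int parse_version(const char *s, int *major, int *minor, int *patch)
{
    return sscanf(s, "%d.%d.%d", major, minor, patch) == 3 ? 0 : -1;
}

/* Returns 1 if version string s is >= maj.min.pat, else 0.
 * Malformed strings are treated as too old. */
static int version_at_least(const char *s, int maj, int min, int pat)
{
    int a, b, c;

    if (parse_version(s, &a, &b, &c) != 0)
        return 0;
    if (a != maj)
        return a > maj;
    if (b != min)
        return b > min;
    return c >= pat;
}
```

A binary built without the define reports 0.0.0, so any check of the form version_at_least(reported, 1, 0, 0) fails even though the sources are current.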

compiler warnings with gcc 10.2.1

cc  -DVERSION=\"1.6.0\" -fPIC -shared -c msrsave/msrsave_main.c -o msrsave/msrsave_main.o
cc  -DVERSION=\"1.6.0\" -fPIC -shared -c msrsave/msrsave.c -o msrsave/msrsave.o
cc  -DVERSION=\"1.6.0\" msrsave/msrsave_main.o msrsave/msrsave.o -o msrsave/msrsave
make -C /lib/modules/3.10.0-1160.71.1.1chaos.ch6.x86_64/build M=/home/rountree/copperopolis/msr-safe modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-1160.71.1.1chaos.ch6.x86_64'
  CC [M]  /home/rountree/copperopolis/msr-safe/msr_entry.o
In file included from ./arch/x86/include/asm/mem_encrypt.h:18,
                 from include/linux/mem_encrypt.h:20,
                 from ./arch/x86/include/asm/processor-flags.h:5,
                 from ./arch/x86/include/asm/processor.h:4,
                 from ./arch/x86/include/asm/cpufeature.h:7,
                 from /home/rountree/copperopolis/msr-safe/msr_entry.c:25:
include/linux/init.h:312:6: warning: ‘init_module’ specifies less restrictive attribute than its target ‘msr_init’: ‘cold’ [-Wmissing-attributes]
  312 |  int init_module(void) __attribute__((alias(#initfn)));
      |      ^~~~~~~~~~~
/home/rountree/copperopolis/msr-safe/msr_entry.c:369:1: note: in expansion of macro ‘module_init’
  369 | module_init(msr_init);
      | ^~~~~~~~~~~
/home/rountree/copperopolis/msr-safe/msr_entry.c:282:19: note: ‘init_module’ target declared here
  282 | static int __init msr_init(void)
      |                   ^~~~~~~~
In file included from ./arch/x86/include/asm/mem_encrypt.h:18,
                 from include/linux/mem_encrypt.h:20,
                 from ./arch/x86/include/asm/processor-flags.h:5,
                 from ./arch/x86/include/asm/processor.h:4,
                 from ./arch/x86/include/asm/cpufeature.h:7,
                 from /home/rountree/copperopolis/msr-safe/msr_entry.c:25:
include/linux/init.h:318:7: warning: ‘cleanup_module’ specifies less restrictive attribute than its target ‘msr_exit’: ‘cold’ [-Wmissing-attributes]
  318 |  void cleanup_module(void) __attribute__((alias(#exitfn)));
      |       ^~~~~~~~~~~~~~
/home/rountree/copperopolis/msr-safe/msr_entry.c:392:1: note: in expansion of macro ‘module_exit’
  392 | module_exit(msr_exit)
      | ^~~~~~~~~~~
/home/rountree/copperopolis/msr-safe/msr_entry.c:371:20: note: ‘cleanup_module’ target declared here
  371 | static void __exit msr_exit(void)
      |                    ^~~~~~~~
  CC [M]  /home/rountree/copperopolis/msr-safe/msr_allowlist.o
/home/rountree/copperopolis/msr-safe/msr_allowlist.o: warning: objtool: write_allowlist.cold()+0x0: frame pointer state mismatch
  CC [M]  /home/rountree/copperopolis/msr-safe/msr-smp.o
  CC [M]  /home/rountree/copperopolis/msr-safe/msr_batch.o
  LD [M]  /home/rountree/copperopolis/msr-safe/msr-safe.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/rountree/copperopolis/msr-safe/msr-safe.mod.o
  LD [M]  /home/rountree/copperopolis/msr-safe/msr-safe.ko
make[1]: Leaving directory `/usr/src/kernels/3.10.0-1160.71.1.1chaos.ch6.x86_64'

Identify and fix compiler/kernel version incompatibilities

For kernel version 3.10.0-1160.71.1.1chaos.ch6.x86_64

gcc/4.8-redhat good
gcc/4.9.3 error
gcc/6.1.0 error
gcc/7.1.0 error
gcc/7.3.0 good
gcc/8.1.0 good
gcc/8.3.1 good
gcc/9.3.1 warnings
gcc/10.2.1 warnings
gcc 11.2.0-19ubuntu1 good

Eventually I'd like to get a pile of VMs set up to test lots of kernel and compiler versions, and fix up what we can in the code, then document what can't be fixed.

Checking for duplicate entries in whitelist

If you insert a whitelist with duplicate MSR entries, the kernel will throw this non-intuitive error:

$ cat whitelists/wl_0655-tmp > /dev/cpu/msr_whitelist
cat: write error: Invalid argument

We could implement a check for duplicate entries and throw a more appropriate "duplicate" error before trying to insert the whitelist.
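One possible shape for such a check, written as a userspace pre-check rather than kernel code, is sketched below. It assumes the MSR addresses have already been parsed into an array. The function names are hypothetical, and sorting is just one simple way to find repeats.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 32-bit MSR addresses. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;

    return (x > y) - (x < y);
}

/* Sorts addrs in place, then scans neighbors for a repeat.
 * Returns 1 and stores the lowest duplicated address in *dup,
 * or returns 0 if every address is unique. */
static int find_duplicate_msr(uint32_t *addrs, size_t n, uint32_t *dup)
{
    size_t i;

    qsort(addrs, n, sizeof addrs[0], cmp_u32);
    for (i = 1; i < n; i++) {
        if (addrs[i] == addrs[i - 1]) {
            *dup = addrs[i];
            return 1;
        }
    }
    return 0;
}
```

A caller could then report the offending address by name instead of relying on the "Invalid argument" returned by the write.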

Update msr-safe rpm

The msr-safe kernel module now accepts parameters, so we need to add support to the systemd scripts so that parameters can be set in the sysconfig file.

Migrated from PR #48:
Should we use /etc/modprobe.conf or its equivalents instead of the init script's modprobe?

Incorrect cpuid_info.model for ZEN_RYZEN?

On my system (Fedora 31, kernel 5.5.8, AMD Ryzen 5 2600, likwid master:10159a66), the cpuid_info.model is set to 0x08 and not recognized in topology.c:topology_setName. This is not surprising, given that topology.h contains #define ZEN_RYZEN 0x01 under /* AMD K8 */.

When I add cpuid_info.model = ZEN_RYZEN; to the start of the method, the likwid programs work just fine.

Update msrsave parser to comply with msr_whitelist output

Commit ca64cf1 updated the whitelist enumeration so that its output can be fed back in as a new whitelist. The msrsave parser needs to be updated to comply with the new format.

Previous format:

MSR: 00000010 Write Mask: 0000000000000000
MSR: 00000017 Write Mask: 0000000000000000
MSR: 000000c1 Write Mask: 0000000000000000
MSR: 000000c2 Write Mask: 0000000000000000
MSR: 000000c3 Write Mask: 0000000000000000
MSR: 000000c4 Write Mask: 0000000000000000
MSR: 000000c5 Write Mask: 0000000000000000
MSR: 000000c6 Write Mask: 0000000000000000
MSR: 000000c7 Write Mask: 0000000000000000
MSR: 000000c8 Write Mask: 0000000000000000

Updated format:

# MSR        Write Mask        # Comment
0x00000010    0x0000000000000000
0x00000017    0x0000000000000000
0x000000c1    0x0000000000000000
0x000000c2    0x0000000000000000
0x000000c3    0x0000000000000000
0x000000c4    0x0000000000000000
0x000000c5    0x0000000000000000
0x000000c6    0x0000000000000000
0x000000c7    0x0000000000000000

Submitted by @maiterth.
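A line parser for the updated format could look like the sketch below. It is not the msrsave implementation. The function name and return convention are hypothetical, and it assumes that blank lines and lines starting with "#" carry no entry.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one line of the updated format, e.g.
 *   "0x000000c1    0xffffffffffffffff"
 * Returns 1 if an entry was parsed, 0 for a blank or comment line,
 * and -1 if the line is malformed. Trailing "# comment" text is ignored. */
static int parse_whitelist_line(const char *line, uint32_t *msr,
                                uint64_t *mask)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '#')
        return 0;

    /* The hex conversions accept an optional "0x" prefix. */
    if (sscanf(line, "%" SCNx32 " %" SCNx64, msr, mask) != 2)
        return -1;
    return 1;
}
```

Lines in the previous "MSR: ... Write Mask: ..." format return -1 with this parser, which makes a format mismatch visible instead of silently producing wrong values.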
