
FAIR MAST Data Management System

Overview

Development Setup

Mac Users:

If you are using a Mac for development, use podman instead of docker. Follow the installation guide to set it up, then follow the setup steps below.

Linux/Windows Users:

If you are using Linux or Windows, make sure you have docker and docker-compose installed on your system.

Setup

We will be using the Python package manager uv to install our dependencies. As a first step, make sure this is installed with:

pip install uv

Secondly, clone the repository:

git clone git@github.com:ukaea/fair-mast.git
cd fair-mast

You can use either conda or venv to set up the environment. Follow the instructions below depending on your preference.

Option 1: Using Conda

Assuming you already have conda installed on your system:

conda create -n mast python=3.11
conda activate mast
uv pip install -r requirements.txt

Option 2: Using venv

Ensure you are using Python version 3.11:

uv venv venv
source venv/bin/activate
uv pip install -r requirements.txt

Use uv --help for additional commands, or refer to the documentation if needed.

Start the Data Management System

Run the development containers to start the postgres database, fastapi, and minio services locally. The development environment will watch the source directory and automatically reload changes to the API as you work.

Mac Users:

podman compose \
--env-file dev/docker/.env.dev  \
-f dev/docker/docker-compose.yml \
up \
--build

Unlike Docker, Podman does not shut down containers on its own. To shut down Podman completely, run:

podman compose -f dev/docker/docker-compose.yml down   
podman volume rm --all

Linux/Windows Users:

docker-compose \
--env-file dev/docker/.env.dev  \
-f dev/docker/docker-compose.yml \
up \
--build

The following services will be started:

  • FastAPI REST & GraphQL Server - will start running at http://localhost:8081.
    • The REST API documentation is at http://localhost:8081/redoc.
    • The GraphQL API documentation is at http://localhost:8081/graphql.
  • Postgres Database Server - will start running at http://localhost:5432
  • Postgres Admin Server - will start running at http://localhost:8081/pgadmin
  • Minio S3 Storage Server - will start running at http://localhost:9000.
    • The admin web GUI will be running at http://localhost:8081/minio/ui.
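
Once the services are up, you can sanity-check the REST API from Python. This is a minimal sketch: it assumes a /json/shots endpoint on the local server, in line with the /json/shots/aggregate and /json/signals routes referenced elsewhere in this document.

import requests

# Query the local development API; assumes the FastAPI container is listening on
# port 8081 and serves a /json/shots endpoint like the hosted API does.
response = requests.get("http://localhost:8081/json/shots", timeout=10)
response.raise_for_status()
print(response.json())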

Populate the Database

To create the database and populate it with content, we first need the metadata files. These are stored in the repository using Git LFS.

To retrieve these data files, run the following commands in your terminal:

git lfs install
git lfs fetch
git lfs pull

Assuming the pull succeeded, the data files should exist within tests/mock_data/mini in your local checkout. We can create the database and ingest the data using the following command:

Mac Users:

podman exec -it mast-api python -m src.api.create /code/data/mini

Linux/Windows Users:

docker exec -it mast-api python -m src.api.create /code/data/mini

Running Unit Tests

Verify that everything is set up correctly by running the unit tests.

To run the unit tests, input the following command inside your environment:

python -m pytest -rsx tests/ --data-path="INSERT FULL PATH TO DATA HERE"

The data path will be along the lines of ~/fair-mast/tests/mock_data/mini.

This will run some unit tests for the REST and GraphQL APIs against a testing database, created from the data in --data-path.
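
The --data-path flag is a custom pytest option. If you need to wire up a similar option yourself, the usual pattern is a pytest_addoption hook in conftest.py; the sketch below is illustrative and is not necessarily how this repository implements it.

import pytest

def pytest_addoption(parser):
    # Register the custom --data-path command line option with pytest.
    parser.addoption("--data-path", action="store", help="Full path to the test data directory")

@pytest.fixture
def data_path(request):
    # Expose the option value to tests as a fixture.
    return request.config.getoption("--data-path")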

Uploading Data to the Minio Storage

First, follow the instructions to install the minio client tool.

Next, configure the endpoint location. The development minio installation runs at localhost:9000 with the following default username and password:

mc alias set srv http://localhost:9000 minio99 minio123;

Then you can copy data to the bucket using:

mc cp --recursive <path-to-data> srv/mast
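
As a quick check that the copy worked, you can list the bucket from Python with s3fs. This is a sketch assuming s3fs is installed and the default development credentials shown above.

import s3fs

# Connect to the local development minio instance with the default dev credentials.
fs = s3fs.S3FileSystem(
    key="minio99",
    secret="minio123",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
# List the contents of the mast bucket to confirm the upload.
print(fs.ls("mast"))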

Production Deployment

Run the production containers to start the postgres database, fastapi, and minio services. This will also start an nginx proxy and make sure HTTPS is set up:

docker compose --env-file dev/docker/.env.dev  -f dev/docker/docker-compose.yml -f dev/docker/docker-compose-prod.yml up --build --force-recreate --remove-orphans -d

To shut down the production deployment, run the following command:

docker compose --env-file dev/docker/.env.dev  -f dev/docker/docker-compose.yml -f dev/docker/docker-compose-prod.yml down

To also destroy the volumes (including the metadata database), you may add the --volumes parameter:

docker compose --env-file dev/docker/.env.dev  -f dev/docker/docker-compose.yml -f dev/docker/docker-compose-prod.yml down --volumes

Note that every time you destroy the volumes, the production server will mint a new certificate for HTTPS. Let's Encrypt currently limits this to 5 per week.

You'll need to download and ingest the production data like so:

mkdir -p data/mast/meta
rsync -vaP <CSD3-USERNAME>@login.hpc.cam.ac.uk:/rds/project/rds-sPGbyCAPsJI/archive/metadata data/
docker exec -it mast-api python -m src.api.create /code/data/index

Building Documentation

See the guide to building documentation here

Ingestion to S3

The following section details how to ingest data into the s3 storage on freia with UDA.

  1. SSH onto freia and set up a local development environment following the instructions above.
  2. Parse the metadata for all signals and sources for a list of shots with the following command:
mpirun -n 16 python3 -m src.archive.create_uda_metadata data/uda campaign_shots/tiny_campaign.csv 

This will create the metadata for the tiny campaign. You may do the same for full campaigns such as M9.

  3. Run the ingestion pipeline by submitting the following job:
qsub ./jobs/freia_write_datasets.qsub campaign_shots/tiny_campaign.csv s3://mast/level1/shots

This will submit a job to the freia job queue that will ingest all of the shots in the tiny campaign and push them to the s3 bucket.
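
For orientation only, the sketch below shows the general shape of pushing a single shot's dataset to the bucket as zarr. The real pipeline lives in src/archive and is driven by the job above; the dataset contents, shot number, and credentials here are placeholders.

import xarray as xr

# Placeholder dataset standing in for one ingested shot.
ds = xr.Dataset({"example_signal": ("time", [0.0, 1.0, 2.0])}, coords={"time": [0, 1, 2]})

# Hypothetical destination under the level1 shots prefix used by the job above.
ds.to_zarr(
    "s3://mast/level1/shots/30390.zarr",
    mode="w",
    storage_options={
        "key": "minio99",          # placeholder credentials and endpoint
        "secret": "minio123",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)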

fair-mast's Issues

Better configuration for ruff checks.

We currently run ruff checks in the CI using chartboost/ruff-action@v1, pointing to ./src. This triggered an issue where it was checking .ipynb files that we didn't want checked. We should configure how ruff runs its checks more explicitly, probably in the pyproject.toml file.

Add license

Add license:

  • To the website
  • To the s3 bucket
  • To the database
  • To the shot files

Ingesting control change data

Graham McArdle has provided a text file containing information about how shot settings were modified.

From the email chain with Graham:

there is a history.txt file that is continuously updated with info about edits made when PCS is running. I've assembled all the changes in the history of MAST into a file on my freia home drive ~gmcardle/mast_history.txt (it's too big for email attachment)

As I mentioned the other day, each of the shot setup files also record changes but they recursively repeat the history of shots that were restored. There are pros and cons of each here. The history file appears to be more robust and monotonic, but it also contains history of irrelevant changes made for fiddling about and running test shots (note that shot numbers in the 900000 range are test shots). If you specifically want to look at changes from one shot to the next you would get that from this history file, but if a shot is restored from an earlier shot you'd have to look back at that earlier shot to find what was done. Where this falls down at times is that we can also prepare a setup offline and restore the setup. In that case the change history of that prepared setup can't be found here but is embedded in the shot setup file.

I'm not entirely sure what you'd like to do with this data but please let me know if you think the shot setup history content is also useful. As I mentioned, there are also issues with shot setup history corruption but I might be able to repair some of that based on the contents of this contiguous log file

We need to investigate this data and consider how we can sensibly integrate this information.

Fix segfault on M7 campaign ingestion

Currently, the ingestion script for parsing data from pyuda is failing for the M7 campaign. This is blocking us from proceeding with the ingestion of more data into s3.

Investigate data version control system

Adding this as a more generic issue (I've closed the LakeFS specific investigation).

We need a data versioning system that can integrate with our current S3 storage without adding too much complexity for users accessing our data.

Write workflow script to harmonise unit names

A lot of the units are inconsistent and could be simplified. Take a look at the result of this query:

https://mastapp.site/json/signals/aggregate?data=units$distinct:,units$count:&groupby=units

We should at least try to map them to a minimal, consistent set, or do the best we can.
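
For illustration, a harmonisation script could start from that aggregate query and apply a hand-written mapping. The mapping entries below are placeholders, not an agreed set.

import requests

# Fetch the distinct unit strings and their counts from the aggregate endpoint.
url = ("https://mastapp.site/json/signals/aggregate"
       "?data=units$distinct:,units$count:&groupby=units")
print(requests.get(url, timeout=30).json())

# Hand-written mapping from inconsistent unit strings to a canonical form (placeholders).
UNIT_MAP = {"Volts": "V", "volt": "V", "Amps": "A", "seconds": "s"}

def harmonise(unit: str) -> str:
    # Fall back to the original string when no mapping is defined.
    return UNIT_MAP.get(unit, unit)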

Investigate using podman instead of docker for production

We have already started to use podman for Mac users on the development side of things, and have changed the README for this.

The production setup still uses Docker though; we need to look into whether we can directly swap Docker for Podman in these commands.

Bulk Data Access Performance for S3

When bulk downloading the data, it can be quite slow, especially when we are being selective with the data stored in s3.

This is because we have many, many small files. When we want to download everything, we have to page through all of the keys in the s3 bucket, which can take some time.

I am wondering how we can make this process more performant.

  • One solution is to have a file index and query that, but it would mean introducing an access layer.
  • Another solution is to push users towards local caching so they only load what they need:
import xarray as xr

endpoint_url = "https://s3.echo.stfc.ac.uk"
url = "s3://mast/test/shots/tiny/30390.zarr"
ds = xr.open_dataset("filecache::" + url, engine="zarr", group="rba",
                     storage_options={"s3": {"anon": True, "endpoint_url": endpoint_url},
                                      "filecache": {"cache_storage": "/tmp/files"}})
ds

The second solution has implications for how you access the data on HPC systems. For CSD3 we should have an internally facing s3 storage. For other sites they will need an internet connection.

Upgrade flexibility of ingestion workflow

After working with Ale on DEFUSE, we identified a few signals that are both useful and can be transformed into a nicer representation.

We should upgrade the ingestion workflow to apply these transforms.

For example, we now know how to transform:

  • LCFS
  • xsx tangential and horizontal cameras

In this issue we should also address mapping to different signal names (i.e. IMAS).

We should be able to map both from UDA and from zarr.

We should also address mapping units.

We should support re-ingesting only a subset of signals/datasets.

Integrate DEFUSE into the Database

  • Look at the outputs of DEFUSE and how we can integrate them into the metadata/database.
  • IMAS still does not have this, so we should decide that ourselves.
  • Find more to feed into DEFUSE.

Write workflow script to harmonise variable names

We have missed a few of the different names for time axes. We should create a mapping file for this and try to fix the missing ones. For example, AYC_TE_CORE is not normalised.

We should make the script and mappings file similar to how we will handle units in #22.
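
By analogy with the units work, here is a minimal sketch of applying such a mapping with xarray; AYC_TE_CORE comes from the issue above, while the target name is a placeholder.

import xarray as xr

# Illustrative mapping from raw variable names to normalised ones.
NAME_MAP = {"AYC_TE_CORE": "ayc_te_core"}

def apply_name_map(ds: xr.Dataset) -> xr.Dataset:
    # Rename only the entries actually present in this dataset.
    return ds.rename({old: new for old, new in NAME_MAP.items() if old in ds})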

Parse out provenance data

Nathan has a tool for parsing the provenance data out from each of the signals on Freia. We need to modify this tool to be parallelised across shots to be more efficient so that we can run it over the entire MAST back catalogue.

Investigate using podman instead of docker desktop

Investigate how we can use podman instead of docker desktop. We need to make sure we can still use docker compose. Please make sure to write some documentation in a readme to describe how to deploy it!

Developing CI

Need to start building a CI pipeline. Best to start with the simple things like linting and formatting (my vote goes to ruff as it covers all the things that black, flake8 and isort do) and then probably look at testing just the API. May need to spin up a temporary database using Sqlite or similar for this. See #3.

Add support for directly accessing the database

Make it so the user can directly query the database with pandas.

  • We need a read-only user.
  • We need to add an intake definition to abstract away the boilerplate.
  • We need to add user examples, including using duckdb.
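
A minimal sketch of what direct access could look like with pandas, assuming a read-only user has been created; the connection string below is a placeholder.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: a read-only user on the metadata database.
engine = create_engine("postgresql+psycopg2://readonly:readonly@localhost:5432/mast")

# Pull a sample of the shots table straight into a DataFrame.
shots = pd.read_sql("SELECT shot_id, campaign FROM shots LIMIT 10", engine)
print(shots)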

Do we still have need of minio?

The docker compose config still sets up a minio instance. Now that we have our Ceph storage at STFC, is this still required?

It could be useful for testing, as a build-up and tear-down option.

Pros:

  • We won't need to have test data in the production store.
  • Tests are still valid if we move storage elsewhere.

Cons:

  • Need to figure out how to handle test data. Git LFS/pull from production/etc...
  • Another component to understand and manage.

Fix slow pagination issue

When accessing the list of signals (https://mastapp.site/json/signals), which contains a very large number of records, it is noticeably slow to load. I have traced the source of this slow loading to how we are currently doing pagination.

Currently, we are paginating using offset pagination where we have to count the number of items in the table for every query. The slow line of code is line 177 in the snippet below:

fair-mast/src/api/crud.py (lines 173 to 179 at ea97ab5):

def get_pagination_metadata(
    db: Session, query: Query, page: int, per_page: int, url: str
) -> t.Dict[str, str]:
    count_query = select(func.count()).select_from(query)
    total_count = db.execute(count_query).scalar_one()
    total_pages = math.ceil(total_count / per_page)

I think the correct solution is to swap to cursor-based pagination. See the difference here. For most tables we should be able to use the UUID as the cursor. For shots we might use the shot_id, although for consistency it might be nicer to give shots UUIDs as well.
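
For reference, a minimal sketch of keyset (cursor) pagination over a UUID column; the model, session, and column names are stand-ins for the real ones.

from sqlalchemy import select

def next_page(db, SignalModel, last_uuid=None, per_page=50):
    # Order by the UUID cursor column and fetch one page.
    query = select(SignalModel).order_by(SignalModel.uuid).limit(per_page)
    if last_uuid is not None:
        # Only rows strictly after the cursor, so no COUNT over the whole table is needed.
        query = query.where(SignalModel.uuid > last_uuid)
    return db.execute(query).scalars().all()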

Fix unit tests

The unit tests are currently failing because I've changed stuff and not kept up with them. Please can you help fix these tests. It is most likely the test and not the code which is wrong at the moment.

To run the tests you can do:

python -m pytest tests

See what breaks and make them all green!

Pagination for aggregate queries

Context: We have moved to using fastapi.paginate for requests now, which is straightforward for most requests and provides cursor-based pagination for our responses. This paginate function only needs the database and the query. The function needs to return a CursorPage[model] object.

Problem: When trying to do cursor pagination for aggregate queries, however, it returns an error about missing columns, based on what the aggregate query is asking for:

mast-api | sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column "min_shot_id" does not exist
mast-api | LINE 1: ..._id) AS min_shot_id, max(shot_id) AS max_shot_id, min_shot_i...
mast-api | ^
mast-api |
mast-api | [SQL: SELECT min(shot_id) AS min_shot_id, max(shot_id) AS max_shot_id, min_shot_id AS _sqlakeyset_oc_3
mast-api | FROM shots ORDER BY _sqlakeyset_oc_3 DESC
mast-api | LIMIT %(param_1)s]
mast-api | [parameters: {'param_1': 51}]
mast-api | (Background on this error at: https://sqlalche.me/e/14/f405)

Current attempts: The first attempt was to create a model based on what we should expect from the response; however, creating this dummy model did not fix the issue.

How to reproduce:

  • Using the alternate aggregate function for shots:
@app.get("/json/shots/aggregate")
def get_shots_aggregate(
    request: Request,
    response: Response,
    db: Session = Depends(get_db),
    params: AggregateQueryParams = Depends(),
) -> CursorPage[AggModel]:
    
    query = crud.aggregate_query(
        ShotModel, params.data, params.groupby, params.filters, params.sort
    )
    return paginate(db, query)
  • Use the request: http://localhost:8081/json/shots/aggregate?data=shot_id$min:,shot_id$max:&groupby=campaign&sort=-min_shot_id
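
One possible (untested) direction is to wrap the aggregate query in a subquery, so that aliases like min_shot_id become real columns the paginator can order by; a sketch:

from sqlalchemy import select

def wrap_aggregate(query):
    # Turn aggregate aliases (e.g. min_shot_id) into concrete columns of a subquery,
    # so that ORDER BY min_shot_id no longer references an undefined column.
    subq = query.subquery()
    return select(subq).order_by(subq.c.min_shot_id)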
