kids-first / kf-api-release-coordinator

🤹‍♀️ Coordinate multiple services to synchronize data releases

Home Page: https://kids-first.github.io/kf-api-release-coordinator

License: Apache License 2.0

Shell 0.40% Python 99.13% Dockerfile 0.47%

api django graphene graphql python rest

kf-api-release-coordinator's Issues

Add author field to task service

It will be required to have task services associated with authors to control who may modify or remove a service.

Get release by id endpoint

Implement an endpoint that will return detailed info about a specific release, including information for any downstream tasks.

GET /releases/RE_00000001
{
    "_links": {
        "self": "/releases/RE_00000001"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": {
        "kf_id": "RE_00000001",
        "state": "running",
        "studies": [
            "ST_00000001"
        ],
        "date_submitted": "2018-03-19T20:12:24.702813+00:00",
        "tasks": [
            {
                "name": "data model rollover",
                "kf_id": "TA_3G2409A2",
                "release_id": "RE_AB28FG90",
                "state": "running",
                "progress": "50%",
                "date_submitted": "2018-03-19T20:12:24.702813+00:00"
            }
        ]
    }
}

Add Testing Integration

We should use CircleCI to run PR status checks on the Coordinator Service codebase.

Store ego service token in cache

The service token is currently being stored in the settings. The proper way to store it would be to put it in an in-memory cache where it may be retrieved and sent as a header to ego.
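A minimal sketch of such an in-memory cache follows. It assumes a hypothetical `fetch_token` callable that requests a fresh token from ego, a fixed time-to-live, and a `Bearer` authorization header; none of these names or choices come from the coordinator codebase.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Hold one service token in memory and refresh it once it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], str],  # hypothetical: asks ego for a new token
        ttl_seconds: float = 3600.0,     # assumed lifetime, not an ego setting
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if missing or expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._ttl
        return self._token

    def auth_header(self) -> dict:
        """Build the header to send with requests (header format is assumed)."""
        return {"Authorization": f"Bearer {self.get()}"}
```

Injecting the clock keeps the expiry logic testable without sleeping. In a multi-process deployment a shared cache backend may be a better fit than a per-process object.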

Draft design doc for DOI assignment

We are part of DataCite, so we have the ability to mint DOIs: https://support.datacite.org/v1.1/reference. It would be great if DOI assignment could be part of the release process so that a release can be cited in publications.

Initial Thoughts:

We'd probably want to make Kids First DRC a "data center" for DOI generation
It would be nifty if the DOI could both resolve to a webpage (standard DOI functionality) and be used with something like https://dataguids.org/ or http://n2t.net/ to resolve to the GUID handle of the data itself

Run coordinator in dev mode in compose

We should construct the compose file to run in a dev-friendly manner. This means running the dev webserver instead of nginx and mounting the local code directory to the container.

Heartbeat endpoint

Add /heartbeat endpoint to trigger a task to check on all task services.
This endpoint will be hit periodically to check on the health status of task services.
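The check that such an endpoint triggers might look like the sketch below. The `ping` callable is a hypothetical stand-in for an HTTP request to each service, and the `"down"` status value is an assumption; only `"good"` appears in the sample responses.

```python
from typing import Callable, Dict, List


def check_task_services(
    services: List[dict], ping: Callable[[str], bool]
) -> Dict[str, str]:
    """Map each service's kf_id to "good" or "down" based on a ping of its URL."""
    statuses = {}
    for service in services:
        try:
            healthy = ping(service["url"])
        except Exception:
            # A ping that raises (timeout, connection error) counts as unhealthy.
            healthy = False
        statuses[service["kf_id"]] = "good" if healthy else "down"
    return statuses
```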

Create task service to run dumps on release

When a release is created, a task should run that tars all of the data related to that release and stores it in S3.

This task service may be a new repo.

Rewrite states into django-fsm

There is currently a bunch of logic that determines what state transition is happening in order to respond appropriately. This could be cleaned up greatly if everything is changed to use django-fsm.
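To illustrate the idea without depending on Django, here is a plain-Python sketch of an explicit transition table. django-fsm would express the same rules declaratively on the model; the state names and allowed transitions below are assumptions pieced together from other issues, not the coordinator's actual state graph.

```python
# Assumed state graph: each key lists the states it may move to.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "canceling"},
    "running": {"staged", "failed", "canceling"},
    "staged": {"publishing", "canceling"},
    "publishing": {"published", "failed", "canceling"},
    "canceling": {"canceled"},
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"cannot move from {current!r} to {target!r}")
    return target
```

Keeping every rule in one table means that responding to a transition no longer requires inferring which transition occurred.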

Tasks that fail get marked as canceled

When a task fails, it is set to the failed state and then issues a cancel release task.
This job then cancels all tasks in the release, setting their state to canceled, including the failed task.
The failed task should not have its state changed.
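One way to express the intended behavior is sketched below, assuming tasks are plain dictionaries with a `state` key. The function name is hypothetical, and the real fix would live in the coordinator's cancel job.

```python
from typing import List

# States that a cancel job should leave untouched (assumed set).
PRESERVED_STATES = {"failed", "canceled"}


def cancel_release_tasks(tasks: List[dict]) -> List[dict]:
    """Return copies of the tasks with every non-preserved task set to canceled."""
    return [
        task if task["state"] in PRESERVED_STATES else {**task, "state": "canceled"}
        for task in tasks
    ]
```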

Get a task in a release endpoint

Implement an endpoint to get information about a task in a release

GET /releases/RE_00000001/tasks/TA_00000001
{
    "_links": {
        "self": "/releases/RE_00000001/tasks/TA_00000001"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": {
        "name": "data model rollover",
        "kf_id": "TA_3G2409A2",
        "release_id": "RE_AB28FG90",
        "state": "running",
        "progress": "50%",
        "date_submitted": "2018-03-19T20:12:24.702813+00:00"
    }
}

Coordinator Service V1 Roadmap

The coordinator should follow the spec outlined in the swagger.

Integrate with Jenkins

The Release Coordinator needs to be integrated with Jenkins so that it may use our current development flow for CI and CD.

Tasks stuck in canceling state

A task managed to get stuck in the canceling state and is not able to progress to canceled as it's not an allowed transition. We need to investigate how this happened and find how to prevent it from happening again.

Standardize loading of secrets

We should standardize on loading secrets from vault using either the vault python client to load into settings, or vault binary to load directly into the environment.

Get task services endpoint

Implement endpoint to list all task services in the coordinator.

GET /task-services
{
    "_links": {
        "self": "/task-services"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": [
        {
            "kf_id": "TS_00000001",
            "name": "Cavatica",
            "url": "https://cavatica.io/tasks",
            "health_status": "good"
        }
    ]
}

Push events to sns

It would be nice to push Events to an SNS topic when they are created so that we may potentially use them to send notifications or trigger other tasks.
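A minimal sketch of the publish step, assuming the client is a boto3 SNS client (whose `publish` call accepts `TopicArn`, `Subject`, and `Message`) and that each event is a dictionary with a hypothetical `event_type` key:

```python
import json
from typing import Any


def publish_event(sns_client: Any, topic_arn: str, event: dict) -> dict:
    """Serialize an event and publish it to the given SNS topic."""
    return sns_client.publish(
        TopicArn=topic_arn,
        # `event_type` is an assumed field name, used only for the subject line.
        Subject=f"Release coordinator event: {event['event_type']}",
        Message=json.dumps(event, sort_keys=True),
    )
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves credential handling to the caller.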

Cancel release endpoint

DELETE /releases/RE_00000001
{
    "_links": {
        "self": "/releases/RE_00000001"
    },
    "_status": {
        "code": 200,
        "message": "success"
    }
}

User login through Ego

It would be preferable to utilize ego in authenticating users and granting them the ability to create releases for certain studies based on their ego roles.
Ego should have an OAuth provider.

Add review process

Now that user roles are implemented, we should have a process that requires a release to go through a review and sign-off by an administrator.

User creates a release (only allowed to use studies they are in the group for)
Release runs ETL
Release is staged and ready for review
Admin either approves release, or returns with comments
If approved, the release is published (either immediately or at a later scheduled date)

Get task service by id endpoint

Implement endpoint to get detailed information about a task service

{
    "_links": {
        "self": "/task-services"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": [
        {
            "kf_id": "TS_00000001",
            "name": "Cavatica",
            "url": "https://cavatica.io/tasks",
            "health_status": "good"
        }
    ]
}

Demonstrate the coordinator service - task service interaction

Either document the release publish process with real request/response bodies or
create a notebook that demonstrates a sample release publish using a mock coordinator and task service

Allow filtering releases by status

This will allow us to easily get the latest published release and therefore the latest version number.
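The lookup this filter enables can be sketched as follows, assuming release dictionaries shaped like the sample responses, a `published` state value, and a hypothetical `version` field on each release:

```python
from datetime import datetime
from typing import List, Optional


def latest_release(releases: List[dict], state: str = "published") -> Optional[dict]:
    """Return the most recently submitted release in the given state, if any."""
    matching = [r for r in releases if r["state"] == state]
    if not matching:
        return None
    # date_submitted is an ISO 8601 timestamp, as in the sample responses.
    return max(matching, key=lambda r: datetime.fromisoformat(r["date_submitted"]))
```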

Get releases endpoint

Implement an endpoint to list all releases in the Coordinator.

GET /releases
{
    "_links": {
        "self": "/resource/{resourceId}"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": [
        {
            "kf_id": "RE_00000001",
            "state": "running",
            "studies": [
                "ST_00000001"
            ],
            "date_submitted": "2018-03-19T20:12:24.702813+00:00",
            "tasks": [
                {
                    "name": "data model rollover",
                    "kf_id": "TA_3G2409A2",
                    "release_id": "RE_AB28FG90",
                    "state": "running",
                    "progress": "50%",
                    "date_submitted": "2018-03-19T20:12:24.702813+00:00"
                }
            ]
        }
    ]
}

Coordinator sends a GET request to the Snapshot's POST /tasks endpoint

The updated Snapshot was deployed in QA, and the release coordinator QA logs show the following error:

problem requesting task for start: b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n<title>Error</title>\n</head>\n<body>\n<pre>Cannot GET /tasks</pre>\n</body>\n</html>\n'

When any HTTP request other than POST is sent to the Snapshot’s tasks endpoint, it returns:

GET https://kf-task-snapshot-qa.kidsfirstdrc.org/task

Cannot GET /task

I am looking at the master branch of release coordinator and these lines are related to this error: https://github.com/kids-first/kf-api-release-coordinator/blob/master/coordinator/tasks.py#L145-L152

Rewrite events to trigger on state transitions

Once #63 has been implemented, it will be preferable to send SNS messages whenever any state change occurs. This will also be made easy by the built-in transition signals that django-fsm provides.

Issue release endpoint

Create an endpoint that will receive a list of studies to create a release for. This endpoint will begin the release process.

POST /releases { "studies": ["ST_00000001"] }
{
    "_links": {
        "self": "/resource/{resourceId}"
    },
    "_status": {
        "code": 200,
        "message": "success"
    },
    "results": {
        "kf_id": "RE_00000001",
        "state": "running",
        "studies": [
            "ST_00000001"
        ],
        "date_submitted": "2018-03-19T20:12:24.702813+00:00",
        "tasks": [
            {
                "name": "data model rollover",
                "kf_id": "TA_3G2409A2",
                "release_id": "RE_AB28FG90",
                "state": "running",
                "progress": "50%",
                "date_submitted": "2018-03-19T20:12:24.702813+00:00"
            }
        ]
    }
}

GET releases by version number: /releases/<version | kf_id>

GET a release by its given version number and return the same JSON response schema as if fetching the release by its kf_id.

This supports use cases in the Release Archives so that the client doesn't have to fetch all releases and filter by version.

Add Release Notes model

Release Notes should optionally be added to any study in a release.
They will relate to both a study and a release and be used to describe changes for that particular study in the release.

Add service tokens

Now that the coordinator is public, we should provide a token for each service to use to authenticate itself when updating its state with the coordinator.
This will prevent other users from updating a task that is not the task service itself.
The token should be allowed to be refreshed at any time. Support for multiple tokens may be desirable, but perhaps for a later feature.
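A minimal sketch of issuing and checking such a token with the Python standard library is shown below. The function names are hypothetical, and where the token is stored (for example, on the task service record) is left open.

```python
import hmac
import secrets


def new_service_token() -> str:
    """Generate a random URL-safe token; call again to refresh it."""
    return secrets.token_urlsafe(32)


def token_matches(presented: str, stored: str) -> bool:
    """Compare tokens in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode(), stored.encode())
```

Refreshing a token amounts to generating a new value and replacing the stored one, after which the old token no longer matches.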

Write design doc for review process

We should outline the process for reviewing and signing off on a release.

Define task service for jenkins

A service definition with a deployment pipeline and variables will need to be written for the coordinator service's deployment. We may wish to use the type-1 definition or create a new type if we decide to use a different tech stack.

Update spec/diagrams w/ File Removals

Need to define how the coordinator will communicate file removals in release publish

Add enabled property to task services

An enabled property on a task could allow services to be registered with the coordinator, but not run as part of a release. This may be helpful when adding new tasks that may not be stable yet.

Update Jenkins file

The Jenkinsfile needs to be updated to use the service type 1 module like other projects.

Poll for status of running tasks

We currently require that tasks report back their status when they fail or change state.
We should periodically poll all tasks that are in a running or publishing state to check if they have failed or finished.
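The polling pass could be sketched as below, assuming tasks are dictionaries with a `state` key and a hypothetical `fetch_state` callable that asks the owning task service for the task's current state:

```python
from typing import Callable, List

# Only tasks in these states are polled, per the description above.
POLLED_STATES = {"running", "publishing"}


def poll_tasks(tasks: List[dict], fetch_state: Callable[[dict], str]) -> List[dict]:
    """Return the tasks with refreshed states for those that are still in flight."""
    updated = []
    for task in tasks:
        if task["state"] in POLLED_STATES:
            task = {**task, "state": fetch_state(task)}
        updated.append(task)
    return updated
```

A scheduler would call this on an interval. Tasks in any other state are passed through without a request being made.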

Add pytest.ini

We should add a pytest.ini file that sets the correct Django settings module so that we only need to execute pytest to run all of our tests.

Add SNS topic and lambda to infrastructure

#40 adds the ability to define an SNS_ARN for a topic in SNS to publish events to. To take advantage of this, a releases topic should be created to receive these messages, and a corresponding lambda listening to this topic should be created to send Slack notifications from the events.

Implement a mock task service

Implement a mock task service that can be used to demonstrate the release publishing process from the task service point of view

Add events

An event model would be helpful for looking through a release's history and for debugging and monitoring anything that happens during a release.

This could be done with an events table that optionally relates to a release, task, or service. Each release, task, or service, could then have a timeline of events such as task received, task published, release canceled, or health check.

Add release canceled event

Add state fields to events

Events are always used to log transitions between two states. To make them easier to parse, the event model should have starting and final state fields that indicate the transition. This allows more detailed information to be placed in the message field of the event.
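As a rough illustration, the event shape with explicit state fields could look like the dataclass below. The field names `from_state` and `to_state` and the helper function are assumptions; the actual change would be made on the Django event model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """An event recording a transition between two states."""

    message: str      # free-form detail, no longer needed to encode the states
    from_state: str   # assumed field name for the starting state
    to_state: str     # assumed field name for the final state
    release_id: Optional[str] = None
    task_id: Optional[str] = None


def task_failed_event(task: dict, reason: str) -> Event:
    """Build the event for a task moving from its current state to failed."""
    return Event(
        message=reason,
        from_state=task["state"],
        to_state="failed",
        release_id=task["release_id"],
        task_id=task["kf_id"],
    )
```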

Add more logging

Debugging certain errors such as task failures is difficult when there is little information about why a task failed.

More verbosity is needed around bad responses from tasks and release cancellations.

Releases get stuck in canceling state

Releases will often get stuck in the canceling state upon failure or user cancellation.
Releases should move to canceling, cancel all tasks, then move to canceled almost immediately.

Add author to events

The author that initiates releases should be included in the event logs.

Write design doc for semantic versioning

We should outline how we will implement semantic versioning in releases.

Allow tasks to be filtered by task service id

The /tasks endpoint should be filterable by service id so that the /task-services endpoint does not need to return every task for a service.

Eg:

GET /tasks?task_service=TA_00000000

Decide on version tagging strategy

To be able to easily identify whether the current release of data supersedes a user's current working data, it's important to have some sequential version tag such as 1.3.2. It would make sense that we version on the study level, however, this becomes tricky as studies may be included in one release, but not another.

Definitions

Kids First-wide data release

This type of release will happen when the underlying data model has changed, adding or removing information globally across all studies in Kids First

Single study release

This type of release will only involve a single study. This will likely be the most often used type of release that is initiated by users.

Multi-study release

This type of release will include multiple, but not all, studies. This will probably not be as frequent as other methods.

Preview only 'release'

This type of release follows the release process up until the publish step, at which point it is cancelled. This release type is desirable if the user only wants to preview what the data will look like in the portal, or if some quality issue causes the release to be cancelled before being published.

Tagging strategy

After a Kids First-wide release, it would make most sense to increment the major version number
eg: 1.3.12 -> 2.0.0

After a single study release, it seems straightforward to increment the minor version
eg: 1.3.12 -> 1.4.0

After a preview, the patch version may be incremented
eg: 1.3.12 -> 1.3.13

The difficult question is what happens after a multi-study release. According to semver, as long as the version numbers increase, it is valid, meaning that gaps in versions are acceptable. Following this rule, it may be best to add one to the max minor version of the studies involved in the release, eg:
SD_1 - last release version: 1.4.3
SD_2 - last release version: 1.9.23
SD_3 - last release version: 1.3.11
The version number for the next release containing all three of the above:
1.10.0
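The proposed tagging rules can be sketched as a small function. The release kind labels (`wide`, `study`, `multi`, `preview`) are hypothetical names for the four release types defined above, and the multi-study rule is read as "take the highest last version among the studies and bump its minor number".

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Split a 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def next_version(kind: str, last_versions: Iterable[str]) -> str:
    """Compute the next release version from the studies' last versions."""
    # Tuples compare element by element, so max() picks the highest version.
    major, minor, patch = max(parse_version(v) for v in last_versions)
    if kind == "wide":                # Kids First-wide release: bump major
        return f"{major + 1}.0.0"
    if kind in ("study", "multi"):    # single or multi-study: bump minor
        return f"{major}.{minor + 1}.0"
    if kind == "preview":             # preview only: bump patch
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release kind: {kind!r}")


print(next_version("multi", ["1.4.3", "1.9.23", "1.3.11"]))  # prints 1.10.0
```

Comparing versions as integer tuples rather than strings matters here, since a string comparison would rank 1.9.23 above 1.10.0.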

Update a task in a release endpoint

Implement an endpoint that tasks may ping with their current state and progress.

PATCH /releases/RE_00000001/tasks/TA_00000001 { "state": "running", "progress": "50%" }
{
  "_links": {
    "self": "/resource/{resourceId}"
  },
  "_status": {
    "code": 200,
    "message": "success"
  },
  "results": {
    "name": "data model rollover",
    "kf_id": "TA_3G2409A2",
    "release_id": "RE_AB28FG90",
    "state": "running",
    "progress": "50%",
    "date_submitted": "2018-03-19T20:12:24.702813+00:00"
  }
}

Implement status endpoint

The /status endpoint should return basic information about the current state of the Coordinator Service similar to the Data Service status endpoint:

GET /status
{
    "branch": "master",
    "code": 200,
    "commit": "aef3b5a",
    "message": "success",
    "tags": [
        "rc"
    ],
    "version": "2.0.4"
}