kids-first / kf-api-release-coordinator
🤹♀️ Coordinate multiple services to synchronize data releases
Home Page: https://kids-first.github.io/kf-api-release-coordinator
License: Apache License 2.0
Task services must be associated with authors so that we can control who may modify or remove a service.
Implement an endpoint that will return detailed info about a specific release, including information for any downstream tasks.
GET /releases/RE_00000001
{
"_links": {
"self": "/releases/RE_00000001"
},
"_status": {
"code": 200,
"message": "success"
},
"results": {
"kf_id": "RE_00000001",
"state": "running",
"studies": [
"ST_00000001"
],
"date_submitted": "2018-03-19T20:12:24.702813+00:00",
"tasks": [
{
"name": "data model rollover",
"kf_id": "TA_3G2409A2",
"release_id": "RE_AB28FG90",
"state": "running",
"progress": "50%",
"date_submitted": "2018-03-19T20:12:24.702813+00:00"
}
]
}
}
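The sample responses in these issues share one envelope: `_links`, `_status`, and `results`. A minimal sketch of a helper that builds it, assuming a hypothetical function name `envelope` that is not taken from the Coordinator codebase:

```python
from typing import Any, Dict


def envelope(self_link: str, results: Any = None, code: int = 200,
             message: str = "success") -> Dict[str, Any]:
    """Wrap a payload in the _links/_status/results envelope.

    `envelope` is a hypothetical helper, not part of the Coordinator API.
    """
    body: Dict[str, Any] = {
        "_links": {"self": self_link},
        "_status": {"code": code, "message": message},
    }
    # Responses without a payload (such as a delete) omit the results key.
    if results is not None:
        body["results"] = results
    return body


release = {"kf_id": "RE_00000001", "state": "running", "studies": ["ST_00000001"]}
response = envelope("/releases/RE_00000001", release)
```

Keeping the envelope in one place would make it harder for individual endpoints to drift from the shared response shape.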
We should use CircleCI to run PR status checks on the Coordinator Service codebase.
The service token is currently stored in the settings. A better approach would be to keep it in an in-memory cache, from which it can be retrieved and sent as a header to ego.
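One way to picture the in-memory cache is a small object that fetches the token on demand and refreshes it after a time-to-live. This is only a sketch: the `TokenCache` class, the `fetch` callable, the TTL, and the `Bearer` header format are all assumptions, not details of ego or the Coordinator.

```python
import time
from typing import Callable, Dict, Optional


class TokenCache:
    """Hold a service token in memory and refetch it once it expires."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical callable that obtains a fresh token
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic can be tested
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token

    def auth_header(self) -> Dict[str, str]:
        # The Bearer scheme is an assumption about what ego expects.
        return {"Authorization": f"Bearer {self.get()}"}
```

In a Django deployment the same idea could be backed by the configured cache backend instead of a module-level object.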
We are part of DataCite - so we have the ability to mint DOIs: https://support.datacite.org/v1.1/reference. It would be great if we could have this be part of the release process so it can be citable as part of publications.
Initial Thoughts:
We should construct the compose file to run in a dev-friendly manner. This means running the dev webserver instead of nginx and mounting the local code directory to the container.
Add a /heartbeat endpoint to trigger a task to check on all task services.
This endpoint will be hit periodically to check on the health status of task services.
When a release is created, a task should run that tars all of the data related to that release and stores it in S3.
This task service may be a new repo.
There is currently a bunch of logic that determines what state transition is happening in order to respond appropriately. This could be cleaned up greatly if everything is changed to use django-fsm.
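The core idea behind a state machine library is a single table of allowed transitions that every state change is checked against. The sketch below shows that idea in plain Python rather than with django-fsm itself; the state names are drawn from these issues, but the full list and the set of allowed moves are assumptions.

```python
from typing import Dict, Set

# Illustrative subset of states; which moves are allowed is an assumption.
ALLOWED: Dict[str, Set[str]] = {
    "running": {"publishing", "canceling", "failed"},
    "publishing": {"published", "canceling", "failed"},
    "canceling": {"canceled"},
    "canceled": set(),
    "failed": set(),
    "published": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

With the rules in one table, the response logic only needs to ask whether a move is valid instead of inferring which transition is happening.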
When a task fails, it is set to the failed state and then issues a cancel release task.
This job then cancels all tasks in the release, setting their state to canceled, including the failed task.
The failed task should not have its state changed.
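The intended behavior can be sketched as a cancel step that skips any task already in the failed state. The function name and the dictionary shape of a task are assumptions for illustration only.

```python
from typing import Dict, List


def cancel_release_tasks(tasks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the tasks with every non-failed task set to canceled."""
    updated = []
    for task in tasks:
        new_task = dict(task)
        # A failed task keeps its state so the cause of the cancellation stays visible.
        if new_task["state"] != "failed":
            new_task["state"] = "canceled"
        updated.append(new_task)
    return updated
```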
Implement an endpoint to get information about a task in a release
GET /releases/RE_00000001/tasks/TA_00000001
{
"_links": {
"self": "/releases/RE_00000001/tasks/TA_00000001"
},
"_status": {
"code": 200,
"message": "success"
},
"results": {
"name": "data model rollover",
"kf_id": "TA_3G2409A2",
"release_id": "RE_AB28FG90",
"state": "running",
"progress": "50%",
"date_submitted": "2018-03-19T20:12:24.702813+00:00"
}
}
The coordinator should follow the spec outlined in the swagger.
The Release Coordinator needs to be integrated with Jenkins so that it may use our current development flow for CI and CD.
A task managed to get stuck in the canceling state and is not able to progress to canceled, as that is not an allowed transition. We need to investigate how this happened and find out how to prevent it from happening again.
We should standardize on loading secrets from vault using either the vault python client to load into settings, or vault binary to load directly into the environment.
Implement endpoint to list all task services in the coordinator.
GET /task-services
{
"_links": {
"self": "/task-services"
},
"_status": {
"code": 200,
"message": "success"
},
"results": [
{
"kf_id": "TS_00000001",
"name": "Cavatica",
"url": "https://cavatica.io/tasks",
"health_status": "good"
}
]
}
It would be nice to push Events to an SNS topic when they are created so that we may potentially use them to send notifications or trigger other tasks.
DELETE /releases/RE_00000001
{
"_links": {
"self": "/releases/RE_00000001"
},
"_status": {
"code": 200,
"message": "success"
}
}
It would be preferable to utilize ego in authenticating users and granting them the ability to create releases for certain studies based on their ego roles.
Ego should have an OAuth provider.
Now that user roles are implemented, we should have a process that requires a release to go through a review and sign-off by an administrator.
Implement endpoint to get detailed information about a task service
{
"_links": {
"self": "/task-services"
},
"_status": {
"code": 200,
"message": "success"
},
"results": [
{
"kf_id": "TS_00000001",
"name": "Cavatica",
"url": "https://cavatica.io/tasks",
"health_status": "good"
}
]
}
Either document the release publish process with real request/response bodies or create a notebook that demonstrates a sample release publish using a mock coordinator and task service.
This will allow us to easily get the latest published release and therefore the latest version number.
Implement an endpoint to list all releases in the Coordinator.
GET /releases
{
"_links": {
"self": "/resource/{resourceId}"
},
"_status": {
"code": 200,
"message": "success"
},
"results": [
{
"kf_id": "RE_00000001",
"state": "running",
"studies": [
"ST_00000001"
],
"date_submitted": "2018-03-19T20:12:24.702813+00:00",
"tasks": [
{
"name": "data model rollover",
"kf_id": "TA_3G2409A2",
"release_id": "RE_AB28FG90",
"state": "running",
"progress": "50%",
"date_submitted": "2018-03-19T20:12:24.702813+00:00"
}
]
}
]
}
The updated Snapshot was deployed in QA, and the release coordinator QA logs return the following error:
problem requesting task for start: b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="utf-8">\n<title>Error</title>\n</head>\n<body>\n<pre>Cannot GET /tasks</pre>\n</body>\n</html>\n'
When any HTTP request other than POST is sent to the Snapshot’s tasks endpoint, it returns:
GET https://kf-task-snapshot-qa.kidsfirstdrc.org/task
Cannot GET /task
I am looking at the master branch of release coordinator and these lines are related to this error: https://github.com/kids-first/kf-api-release-coordinator/blob/master/coordinator/tasks.py#L145-L152
Once #63 has been implemented, it will be preferable to send SNS messages whenever any sort of state change occurs. This will also be made easy through the built-in events that django-fsm provides.
Create an endpoint that will receive a list of studies to create a release for. This endpoint will begin the release process.
POST /releases { "studies": ["ST_00000001"] }
{
"_links": {
"self": "/resource/{resourceId}"
},
"_status": {
"code": 200,
"message": "success"
},
"results": {
"kf_id": "RE_00000001",
"state": "running",
"studies": [
"ST_00000001"
],
"date_submitted": "2018-03-19T20:12:24.702813+00:00",
"tasks": [
{
"name": "data model rollover",
"kf_id": "TA_3G2409A2",
"release_id": "RE_AB28FG90",
"state": "running",
"progress": "50%",
"date_submitted": "2018-03-19T20:12:24.702813+00:00"
}
]
}
}
GET a release by its given version number and return the same JSON response schema as if fetching the release by its kf_id.
This supports use cases in the Release Archives so that the client doesn't have to fetch all releases and filter by version.
Release Notes should optionally be added to any study in a release.
They will relate to both a study and a release and be used to describe changes for that particular study in the release.
Now that the coordinator is public, we should provide a token for each service to use to authenticate itself when updating its state with the coordinator.
This will prevent other users from updating a task that is not the task service itself.
The token should be allowed to be refreshed at any time. Support for multiple tokens may be desirable, but perhaps for a later feature.
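A minimal sketch of issuing and checking such a token using only the Python standard library. The function names are hypothetical, and storing only a SHA-256 digest of the token on the coordinator side is a design assumption, not something these issues specify.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def issue_token() -> Tuple[str, str]:
    """Return (plaintext token for the service, digest for the coordinator to store)."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode()).hexdigest()


def verify_token(presented: str, stored_digest: str) -> bool:
    """Check a presented token against the stored digest in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

Refreshing a token then amounts to calling `issue_token()` again and replacing the stored digest, which invalidates the previous token.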
We should outline the process for reviewing and signing off on a release.
A service definition will need to be created, with a deployment pipeline and variables for the coordinator service's deployment. We may wish to use the type-1 definition or create a new type if we decide to use a different tech stack.
Need to define how the coordinator will communicate file removals in release publish
An enabled property on a task could allow services to be registered with the coordinator, but not run as part of a release. This may be helpful when adding new tasks that may not be stable yet.
The Jenkinsfile needs to be updated to use the service type 1 module like other projects.
We currently require that tasks report back their status when they fail or change state.
We should periodically poll all tasks that are in a running or publishing state to check if they have failed or finished.
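The polling pass could look roughly like the sketch below. The `fetch_state` callable stands in for whatever request the coordinator would make to a task service; its name and the dictionary shape of a task are assumptions.

```python
from typing import Callable, Dict, Iterable, List

POLLED_STATES = {"running", "publishing"}


def poll_tasks(tasks: Iterable[Dict[str, str]],
               fetch_state: Callable[[Dict[str, str]], str]) -> List[Dict[str, str]]:
    """Ask each running or publishing task for its state; return the ones that changed."""
    changed = []
    for task in tasks:
        if task["state"] not in POLLED_STATES:
            continue  # tasks in other states are not polled
        reported = fetch_state(task)
        if reported != task["state"]:
            changed.append({**task, "state": reported})
    return changed
```

Passing the fetch function in keeps the loop testable without a live task service.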
We should add a pytest.ini file that sets the correct Django settings module, so that we only need to execute pytest to run all of our tests.
#40 adds the ability to define an SNS_ARN for a topic in SNS to publish events to. To take advantage of this, a releases topic should be created to receive these messages, and a corresponding lambda listening to this topic should be created to send Slack notifications from the events.
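As a rough sketch of the lambda's core, the function below turns an SNS-triggered Lambda event into Slack webhook payloads. SNS delivers each notification under `Records[i]["Sns"]["Message"]`, and Slack incoming webhooks accept a JSON body with a `text` field. The assumption here is that the coordinator publishes events as JSON with a `message` key; posting the payloads to a webhook URL is left out.

```python
import json
from typing import Any, Dict, List


def slack_payloads(sns_event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Build one Slack payload per SNS record in a Lambda event."""
    payloads = []
    for record in sns_event.get("Records", []):
        raw = record["Sns"]["Message"]
        try:
            body = json.loads(raw)
        except ValueError:
            body = None  # not JSON, fall back to the raw message text
        if isinstance(body, dict):
            # The "message" key is an assumed field of the coordinator's events.
            text = body.get("message", raw)
        else:
            text = raw
        payloads.append({"text": str(text)})
    return payloads
```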
Implement a mock task service that can be used to demonstrate the release publishing process from the task service point of view
An event model would be helpful for looking through a release's history and for debugging and monitoring anything that happens during a release.
This could be done with an events table that optionally relates to a release, task, or service. Each release, task, or service could then have a timeline of events such as task received, task published, release canceled, or health check.
Events are always used to log transitions between two states. To make them easier to parse, the event model should have starting and final state fields that indicate the transition. This allows more detailed information to be placed in the message field of the event.
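A sketch of what such an event record might hold, written as a plain dataclass rather than a Django model. The field names, including `starting_state` and `final_state`, are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Event:
    """One logged transition, optionally tied to a release, task, or service."""

    message: str
    starting_state: Optional[str] = None
    final_state: Optional[str] = None
    release_id: Optional[str] = None
    task_id: Optional[str] = None
    task_service_id: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def summary(self) -> str:
        # The transition is structured data, so the message can carry the detail.
        return f"{self.starting_state} -> {self.final_state}: {self.message}"


event = Event("task reported an error", "running", "failed", task_id="TA_00000001")
```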
Debugging certain errors such as task failures is difficult when there is little information about why a task failed.
More verbosity is needed around bad responses from tasks and release cancellations.
Releases will often get stuck in the canceling state upon failure or user cancellation.
Releases should move to canceling, cancel all tasks, then move to canceled almost immediately.
The author that initiates releases should be included in the event logs.
We should outline how we will implement semantic versioning in releases.
The /tasks endpoint should be filterable by service id so that all tasks for a service are not returned in the /task-services endpoint.
E.g.:
GET /tasks?task_service=TA_00000000
To be able to easily identify whether the current release of data supersedes a user's current working data, it's important to have some sequential version tag such as 1.3.2. It would make sense to version at the study level; however, this becomes tricky as studies may be included in one release but not another.
Kids First-wide release: This type of release will happen when the underlying data model has changed, adding or removing information globally across all studies in Kids First.
Single study release: This type of release will only involve a single study. This will likely be the type of release most often initiated by users.
Multi-study release: This type of release will include multiple, but not all, studies. This will probably not be as frequent as other methods.
Preview: This type of release follows the release process up until the publish step, at which point it is cancelled. This release type is desirable if the user only wants to preview what the data will look like in the portal, or if there is some quality issue that causes the release to be cancelled before being published.
After a Kids First-wide release, it would make most sense to increment the major version number, e.g. 1.3.12 -> 2.0.0
After a single study release, it seems straightforward to increment the minor version, e.g. 1.3.12 -> 1.4.0
After a preview, the patch version may be incremented, e.g. 1.3.12 -> 1.3.13
The difficult question is what happens after a multi-study release. According to semver, a version sequence is valid as long as the numbers increase, meaning that gaps between versions are acceptable. Following this rule, it may be best to add one to the maximum minor version of the studies involved in the release, e.g.:
SD_1 - last release version: 1.4.3
SD_2 - last release version: 1.9.23
SD_3 - last release version: 1.3.11
The version number for the next release containing all three of the above:
1.10.0
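The versioning rules described here can be sketched in a few lines. The function names are hypothetical, and the handling of studies that sit on different major versions (only the highest major is considered) is an assumption, since the issue only covers studies sharing a major version.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def fmt(version: Version) -> str:
    return ".".join(str(part) for part in version)


def bump(version: str, part: str) -> str:
    """Increment one part of a version and reset the lower parts."""
    major, minor, patch = parse(version)
    if part == "major":      # Kids First-wide release
        return fmt((major + 1, 0, 0))
    if part == "minor":      # single study release
        return fmt((major, minor + 1, 0))
    if part == "patch":      # preview
        return fmt((major, minor, patch + 1))
    raise ValueError(f"unknown version part: {part}")


def next_multi_study_version(last_versions: Iterable[str]) -> str:
    """Add one to the maximum minor version among the studies in the release."""
    parsed = [parse(v) for v in last_versions]
    major = max(v[0] for v in parsed)
    # Assumption: only studies on the highest major version are compared.
    minor = max(v[1] for v in parsed if v[0] == major)
    return fmt((major, minor + 1, 0))


print(next_multi_study_version(["1.4.3", "1.9.23", "1.3.11"]))  # prints 1.10.0
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "1.10.0" would sort before "1.9.23".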
Implement an endpoint that tasks may ping with their current state and progress.
PATCH /releases/RE_00000001/tasks/TA_00000001 { "state": "running", "progress": "50%" }
{
"_links": {
"self": "/resource/{resourceId}"
},
"_status": {
"code": 200,
"message": "success"
},
"results": {
"name": "data model rollover",
"kf_id": "TA_3G2409A2",
"release_id": "RE_AB28FG90",
"state": "running",
"progress": "50%",
"date_submitted": "2018-03-19T20:12:24.702813+00:00"
}
}
The /status endpoint should return basic information about the current state of the Coordinator Service, similar to the Data Service status endpoint:
GET /status
{
"branch": "master",
"code": 200,
"commit": "aef3b5a",
"message": "success",
"tags": [
"rc"
],
"version": "2.0.4"
}