kids-first / kf-api-study-creator
Powers investigator-driven data staging. Backend for Data Tracker app
Home Page: https://kids-first.github.io/kf-api-study-creator
License: Apache License 2.0
Currently all JWTs are assumed to be valid. We should instead be getting the public key used to validate them from ego; the public key is available from /oauth/token/public_key on ego.
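A minimal sketch of that flow, assuming the PyJWT and requests packages and a hypothetical EGO_API setting holding ego's base url (neither is existing code):

import jwt
import requests
from django.conf import settings


def validate_ego_token(encoded_token):
    """Return the token's claims if it was signed by ego, otherwise None."""
    # The key only changes when ego is redeployed, so this response could be cached
    resp = requests.get(f"{settings.EGO_API}/oauth/token/public_key", timeout=10)
    resp.raise_for_status()
    public_key = resp.content

    try:
        return jwt.decode(encoded_token, public_key, algorithms=["RS256"])
    except jwt.exceptions.InvalidTokenError:
        return None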
Object tags containing at least the app name, kf_id, and date created should be added to files uploaded to S3.
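A rough sketch using boto3's put_object_tagging; the exact tag keys and the point at which this runs (right after upload) are assumptions:

from datetime import datetime, timezone

import boto3


def tag_uploaded_file(bucket, key, kf_id):
    """Attach the required object tags to a file that was just uploaded to S3."""
    s3 = boto3.client("s3")
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={
            "TagSet": [
                {"Key": "app", "Value": "kf-api-study-creator"},
                {"Key": "kf_id", "Value": kf_id},
                {"Key": "date_created", "Value": datetime.now(timezone.utc).isoformat()},
            ]
        },
    )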
We should remove the authentication middleware from the development configuration to allow developers to use /graphql unrestricted.
Files uploaded as in #9 and stored in S3 as in #10 need to be downloadable by users. This may need to be a separate endpoint such as /data, as GraphQL does not support file transfers.

The endpoint will be:

GET /download/study/[studyId]/file/[fileId]?token=[token]

where studyId refers to the study that the file belongs to, fileId is the internal file identifier, and token is an ego JWT whose user must be in the studyId group to download. This will download the latest version of the file.

GET /download/study/[studyId]/file/[fileId]/version/[version]?token=[token]

This will download the given version of the file.
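One possible wiring of those routes in Django; the view module and view names are placeholders, not existing code:

from django.urls import path

from creator.files import views  # hypothetical module holding the download view

urlpatterns = [
    # Latest version of a file
    path("download/study/<study_id>/file/<file_id>", views.download, name="download"),
    # A specific version of a file
    path(
        "download/study/<study_id>/file/<file_id>/version/<version_id>",
        views.download,
        name="download-version",
    ),
]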
The settings.py module should be split up into different files so that we can configure the application depending on what deployment environment we're working in. creator/settings.py should become:
creator/settings/dev.py
creator/settings/test.py
creator/settings/prd.py
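A common way to do the split is to pull shared settings into a base module and override per environment; the base.py shown here is an assumption about how the package would be laid out, not an agreed layout:

# creator/settings/dev.py
from .base import *  # noqa: F401,F403 -- settings shared by every environment

DEBUG = True
ALLOWED_HOSTS = ["*"]
# Development-only conveniences (e.g. dropping the auth middleware) go here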
Each study should connect to some sort of owner/user/investigator (I'll leave the naming up to you) node. This will allow us to group and manage studies by investigator.
Use case query:
{
  allStudies {
    edges {
      node {
        id
        kfId
        name
        investigator
      }
    }
  }
}
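One possible shape for the relation backing that query; the model and field names here are placeholders, per the naming note above:

from django.db import models


class Investigator(models.Model):
    """Owner/user/investigator node that studies can be grouped under."""
    name = models.CharField(max_length=200)


class Study(models.Model):
    kf_id = models.CharField(max_length=11, primary_key=True)
    name = models.CharField(max_length=500)
    # Grouping and managing studies by investigator happens through this relation
    investigator = models.ForeignKey(
        Investigator, null=True, related_name="studies", on_delete=models.SET_NULL
    )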
The entrypoint.sh should sync studies with the dataservice every time a container is run.
Update documentation with: DATASERVICE data as PRELOAD_DATA, and fix minor typos.
The GET /download/study/<study_id>/file/<file_id> endpoint should allow a file download to occur if the user belongs to the study_id group or has an ADMIN role. It should return 403 otherwise.

Users will have to specify their token in the Authorization header as usual during the GET request to verify their identity. This means that blind sharing of the file urls will not be possible; sharing will instead have to occur through some interface that handles sending of the token.
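The check inside the download view could look roughly like this; request.user carrying groups and roles populated by an auth middleware is an assumption:

from django.http import HttpResponseForbidden


def check_download_auth(request, study_id):
    """Return a 403 response unless the user is in the study's group or is an admin."""
    user = request.user
    if study_id in getattr(user, "groups", []):
        return None
    if "ADMIN" in getattr(user, "roles", []):
        return None
    return HttpResponseForbidden(f"Not allowed to download files from {study_id}")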
Right now you can only fetch a single study via its node id. Ideally we would like to be able to get a study via its kfId, as that is what would be used in the url for the single study view, something like /study/<kfid>/files.
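A graphene-django style sketch of such a query; StudyNode, the module paths, and the query name studyByKfId are all assumptions:

import graphene

from creator.studies.models import Study  # assumed module path
from creator.studies.schema import StudyNode  # assumed node type


class Query(graphene.ObjectType):
    study_by_kf_id = graphene.Field(StudyNode, kf_id=graphene.String(required=True))

    def resolve_study_by_kf_id(self, info, kf_id):
        # Look the study up by its kfId rather than the relay node id
        return Study.objects.filter(kf_id=kf_id).first()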
Add an on_create hook on the study model to send a request to the dataservice to create a new study. This should trigger both a new object in the dataservice and a new bucket for the study.
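The hook could be a post_save signal along these lines; the DATASERVICE_URL setting, module path, and request payload are assumptions, and the bucket is presumed to be created downstream of this call:

import requests
from django.conf import settings
from django.db.models.signals import post_save
from django.dispatch import receiver

from creator.studies.models import Study  # assumed module path


@receiver(post_save, sender=Study)
def create_study_in_dataservice(sender, instance, created, **kwargs):
    """When a study is first saved locally, create the matching study in the dataservice."""
    if not created:
        return
    resp = requests.post(
        f"{settings.DATASERVICE_URL}/studies",
        json={"name": instance.name, "external_id": instance.external_id},
        timeout=30,
    )
    resp.raise_for_status()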
We should switch to using postgres under the hood sooner rather than later to avoid any possibility of adding models that are not compatible between the two.

Below are the fields on the dataservice's study model. These should also exist on the study model in the study creator:
{
  "attribution": "https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001168.v1.p1",
  "created_at": "2018-05-22T21:12:42.999818+00:00",
  "data_access_authority": "dbGaP",
  "external_id": "phs001168",
  "kf_id": "SD_9PYZAHHE",
  "modified_at": "2018-05-22T21:12:42.999823+00:00",
  "name": "Genomic Studies of Orofacial Cleft Birth Defects",
  "release_status": "Pending",
  "short_name": "Orofacial Cleft: European Ancestry",
  "version": "v1.p1",
  "visible": true
}
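Rough Django translations of those fields onto the study creator's model; the field lengths and defaults below are guesses, not taken from the dataservice schema:

from django.db import models


class Study(models.Model):
    kf_id = models.CharField(max_length=11, primary_key=True)
    name = models.CharField(max_length=500)
    short_name = models.CharField(max_length=500, blank=True)
    external_id = models.CharField(max_length=200)
    attribution = models.CharField(max_length=500, blank=True)
    data_access_authority = models.CharField(max_length=200, blank=True)
    version = models.CharField(max_length=50, blank=True)
    release_status = models.CharField(max_length=50, blank=True)
    visible = models.BooleanField(default=True)
    created_at = models.DateTimeField(auto_now_add=True)
    modified_at = models.DateTimeField(auto_now=True)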
It seems anyone can download using urls returned in a file's downloadUrl field. This should only be allowed for requests containing a valid JWT of a user that belongs to the study that the file is part of.
When an S3 error occurs, the File object is still created although the Object is not, and boto's error message for the failed upload is returned to the client. This should instead return a standard 'problem uploading' error message and not create a new object.
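A sketch of the intended behavior inside the upload mutation: write to storage first, return a generic message on failure, and only create the File row on success. The model fields and the exact storage call are assumptions:

from botocore.exceptions import ClientError
from django.core.files.storage import default_storage
from graphql import GraphQLError

from creator.files.models import File  # assumed module path


def create_file(upload, study):
    """Create a File record only after the S3 upload has succeeded."""
    try:
        key = default_storage.save(f"source/uploads/{upload.name}", upload)
    except ClientError:
        # Hide boto's raw error and leave no dangling File behind
        raise GraphQLError("Problem uploading file")
    return File.objects.create(study=study, name=upload.name, key=key)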
We should add the django-cors-headers module to allow preflight requests and add CORS headers to responses.
For now, it's probably ok to allow all origins and we can limit it further once it's deployed.
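The settings changes for django-cors-headers would look roughly like this (allow-all for now, per the note above):

INSTALLED_APPS += ["corsheaders"]

# CorsMiddleware needs to run before anything that can generate a response
MIDDLEWARE = ["corsheaders.middleware.CorsMiddleware"] + MIDDLEWARE

CORS_ORIGIN_ALLOW_ALL = True  # tighten to an explicit origin list once deployed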
A mutation to modify file descriptor fields needs to be added so that file properties may be modified after they have been uploaded.
A user should only be able to upload a file if the file is being uploaded to a batch in a study that they belong to.
Our deployments are dependent on loading variables from the environment. Using django-dotenv will make this easier and allow us to configure the application from various .env files and override them directly with variables in the environment.
We may wish to only do this or #24 as they may address the same issue.
The settings file should be split out into development, testing, and production to allow different configuration based on environment.
Users should be able to pass their ego JWT in the Authorization header as a Bearer token. When a request with a token comes in, it should be validated against ego's public key to ensure the token was issued by ego. The user's role and groups should be populated on the user's context in the Django request in an authentication middleware for future authorization.
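A sketch of such a middleware; validate_ego_token is the hypothetical helper from the public-key issue above, and the claim layout (context.user.roles/groups) is an assumption about ego's token format:

from creator.authentication import validate_ego_token  # hypothetical helper, see above


class EgoUser:
    """Lightweight user object carrying what authorization checks need."""

    def __init__(self, claims):
        context = claims.get("context", {}).get("user", {})
        self.roles = context.get("roles", [])
        self.groups = context.get("groups", [])


class EgoAuthenticationMiddleware:
    """Validate the Bearer token and populate role/groups on the request's user."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        header = request.META.get("HTTP_AUTHORIZATION", "")
        if header.startswith("Bearer "):
            claims = validate_ego_token(header.split(" ", 1)[1])
            if claims:
                request.user = EgoUser(claims)
        return self.get_response(request)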
Files uploaded through GraphQL in #9 need to be uploaded to the proper S3 study bucket. We should use the django-storages module to support S3 uploads for this.

Files uploaded to the API should be placed in:

s3://kf-study-us-east-1-{env}-sd-XXXXXXXX/source/uploads
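With django-storages, the storage backend can be pointed at the right bucket and prefix per study; the bucket-name formatting below is only a guess based on the path above:

from storages.backends.s3boto3 import S3Boto3Storage


def study_storage(kf_id, env="dev"):
    """Return a storage backend aimed at a study's upload prefix."""
    bucket = f"kf-study-us-east-1-{env}-{kf_id.lower().replace('_', '-')}"
    return S3Boto3Storage(bucket_name=bucket, location="source/uploads")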
@baileyckelly commented on Tue Sep 18 2018
Notes from Sprint Planning:
More clarification is needed about which expected values are required and which are optional, and whether this is a dynamic set of values.
File types should expose a downloadUrl field in their schema that points to the url where they may be downloaded.

Need to check the user's permission to view files to sort out which files to return in the allFiles query. The allFiles query should return only files that are in a study whose group the user belongs to.
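A sketch of that filtering in the resolver, assuming the auth middleware puts the user's study groups and roles on the request's user; FileNode and the module paths are assumptions:

import graphene
from graphene_django.fields import DjangoConnectionField

from creator.files.models import File  # assumed module path
from creator.files.schema import FileNode  # assumed node type


class Query(graphene.ObjectType):
    all_files = DjangoConnectionField(FileNode)

    def resolve_all_files(self, info, **kwargs):
        user = info.context.user
        if "ADMIN" in getattr(user, "roles", []):
            return File.objects.all()
        # Only files that belong to a study the user is in the group of
        return File.objects.filter(study__kf_id__in=getattr(user, "groups", []))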
The download and download-latest views can be combined into one view to avoid writing the same download logic twice.
Write a manage command to sync the study creator's study objects with those that exist in the dataservice. This should be run with ./manage.py syncdataservice during the entry point, or on demand. This command will scrape the /studies endpoint of the dataservice and insert any studies that exist in the dataservice but not in the study creator.
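Skeleton of that command (it would live at creator/management/commands/syncdataservice.py); the DATASERVICE_URL setting and the pagination handling are assumptions about the dataservice's response format:

import requests
from django.conf import settings
from django.core.management.base import BaseCommand

from creator.studies.models import Study  # assumed module path


class Command(BaseCommand):
    help = "Sync studies from the dataservice into the study creator"

    def handle(self, *args, **options):
        url = f"{settings.DATASERVICE_URL}/studies"
        while url:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            for study in body.get("results", []):
                # Insert studies that exist in the dataservice but not here
                Study.objects.update_or_create(
                    kf_id=study["kf_id"],
                    defaults={"name": study.get("name") or ""},
                )
            # Follow the dataservice's paginated links, if any
            next_link = body.get("_links", {}).get("next")
            url = settings.DATASERVICE_URL + next_link if next_link else None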
Need to add a service type 1 Jenkinsfile.
Upload instructions are outdated now. Modify to:
Curl example
^^^^^^^^^^^^
.. code-block:: bash

   curl localhost:8080/graphql \
       -F operations='{ "query": "mutation ($file: Upload!, $studyId: String!) { createFile(file: $file, studyId: $studyId) { success } }", "variables": { "file": null, "studyId": <study kf id> } }' \
       -F map='{ "0": ["variables.file"] }' \
       -F 0=@<your filepath>
The Object schema should return a downloadUrl field for that specific version of the file, similar to the downloadUrl on the File object.

Need to check the user's permission to view files to sort out which objects to return in the allVersions query. The allVersions query should return only objects that are in a study whose group the user belongs to.
The Batch concept still needs to be better defined, but it's leaning towards being more of a selection of entities created from a study's files. This makes files being directly related to a study more natural, so the Files entity should point directly to a Study rather than a Batch.
Currently, only the latest file version may be downloaded from the /download endpoint. We should support specifying what version to download through an endpoint:

GET /download/study/<study_id>/file/<file_id>/version/<version_id>

This will have the same authorization mechanism as the download-by-file endpoint.
Add a state field to Version that is an enumeration of one of the below:
Pending Review
Changes Needed
Approved
Processed
The state flow will look something like the following, although it won't be enforced:

               +> Changes Needed       +> Changes Needed
               |                       |
               |                       |
Pending Review +------------> Approved +-------------> Processed
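As Django choices this could look like the following; the three-letter codes are placeholders, not agreed-upon values:

from django.db import models


class Version(models.Model):
    STATES = (
        ("PEN", "Pending Review"),
        ("CHN", "Changes Needed"),
        ("APP", "Approved"),
        ("PRC", "Processed"),
    )
    state = models.CharField(max_length=3, choices=STATES, default="PEN")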
The createFile mutation should return the new file in its schema.
Need a mutation for batches so that users may create new batches through the API.
The ('SAM', 'Sample Manifest') enum on the file_type field in the File model should be changed to ('SEQ', 'Sequencing Manifest').
We should rename FileEssence to simply File.
/graphql does not need CSRF checks on it. We can exclude the middleware on the route in our urls.py.
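The usual pattern in urls.py (shown here with the stock graphene-django view; a file-upload view would be wrapped the same way):

from django.urls import path
from django.views.decorators.csrf import csrf_exempt
from graphene_django.views import GraphQLView

urlpatterns = [
    # Exempt only this route from CSRF checks rather than disabling the middleware globally
    path("graphql", csrf_exempt(GraphQLView.as_view(graphiql=True))),
]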
Instead of using the base Dockerfile and installing test dependencies in the entrypoint.sh, it may be better to create a second Dockerfile.dev that includes these dependencies already installed and runs a different entrypoint to either pre-populate with mock data or from the data service.
CircleCI should be set up to run tests for PR status checks.
I wanted to use my KF dataservice docker container with the KF study creator API container so that I could preload studies from my dataservice.
Not sure if this is the best approach, but I added the kf-data-stack user-defined network to both of the docker compose files for the dataservice and the study creator. Then I just set PRELOAD_DATA to http://<dataservice web server container name>.

Maybe also add something to the Sphinx docs about using your dockerized dataservice.
Create a GraphQL API with a first-draft data model and mock data for evaluating use cases.
The primary key, id, on the File model is currently an incremental integer. This should be changed to a unique uuid type to cause less confusion when evaluating a url. The id field should also be renamed to uuid so that it will not conflict with the graphql id field.
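What the change could look like on the File model (the same pattern would apply to Object); other fields are omitted here:

import uuid

from django.db import models


class File(models.Model):
    # Non-sequential primary key, named uuid so it doesn't clash with the graphql id
    uuid = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)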
The primary key, id, on the Object model is currently an incremental integer. This should be changed to a unique uuid type to cause less confusion when evaluating a url. The id field should also be renamed to uuid so that it will not conflict with the graphql id field.
Save the shared study and file upload setup process to a fixture for test_download and test_query.
We need to allow uploading of a file through the API. This is possible through the GraphQL multipart request spec, and there is an existing library for Django, graphene-file-upload.
Need to have some restrictions on the size of files being uploaded to the study creator.