chanzuckerberg / cryoet-data-portal

CryoET Data Portal

License: MIT License

Python 12.19% JavaScript 1.73% Dockerfile 0.08% TypeScript 75.76% CSS 1.08% HCL 2.29% MDX 6.62% Makefile 0.12% Shell 0.09% Roff 0.01% mIRC Script 0.02%

cryoet-data-portal's Introduction

Community Cryo-Electron Tomography Data Portal

CryoET Data Portal frontend, docs, tools, and API client.

Please see our Landing Page here!

cryoet-data-portal's Issues

Description of the kinds of data hosted on Landing Page

User story: (1.2) I want to see the description of the kinds of data hosted (e.g., type and count), so that I can quickly get an idea of what I might find

Tasks

total count of datasets
total count of species
total count of tomograms

The count will auto update as new datasets are added

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?node-id=1929%3A65457&mode=dev

Annotation dates are not serializable

There are multiple date objects in Annotation objects in the data portal:
'deposition_date': datetime.date(2023, 4, 1), 'release_date': datetime.date(2023, 6, 1), 'last_modified_date': datetime.date(2023, 6, 1). These objects are not JSON serializable. I suggest converting them to ISO 8601 strings before adding them to the data portal.

I ran into this trying to pretty print annotations, but this would affect other workflows that involve serializing annotations.

>>> json.dumps(annotations[0].to_dict(), indent=4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/kharrington/micromamba/envs/nesoi/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type date is not JSON serializable
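
Until the dates are stored as strings, a caller can work around the error on the client side. The following is a minimal sketch, assuming the dict returned by to_dict() only contains date objects as its non-serializable values; the annotation dict below is a hypothetical stand-in, not real portal output.

```python
import json
from datetime import date

# Hypothetical stand-in for the dict returned by Annotation.to_dict().
annotation = {
    "object_name": "ribosome",
    "deposition_date": date(2023, 4, 1),
    "release_date": date(2023, 6, 1),
    "last_modified_date": date(2023, 6, 1),
}


def encode_dates(value):
    """Fallback for json.dumps: render date objects as ISO 8601 strings."""
    if isinstance(value, date):
        return value.isoformat()
    # Anything else is still unsupported, so keep the standard error.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


serialized = json.dumps(annotation, indent=4, default=encode_dates)
print(serialized)
```

The default hook is only called for values that json cannot encode on its own, so strings and numbers pass through unchanged.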

Find relevant data (Browse all data page)

Epic for tracking all issues/tasks related to implementing the browse all data page (aka general search page)

P0:

P1:

Design:
https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?mode=dev

Investigate at dataset level (single dataset page)

Epic for tracking all issues/tasks related to implementing a single dataset page

P0:

P1:

#34
#486

No method to download an annotation alone

The only possible way to download an annotation is to use TomoVoxelSpacing.download_all.

AWS documentation

Generate JSON for neuroglancer

User story:
I want to visualize data from the data portal via an interactive web viewer, so that I am able to view my data in real time.

Create a JSON that frontend can use to create an encoded link to the neuroglancer to visualize the tomogram.
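
One possible shape for this, as a rough sketch: Neuroglancer reads its viewer state as JSON from the URL fragment after "#!", so the frontend could URL-encode a state object and append it to the viewer address. The viewer base URL, the tomogram location, and the layer and layout choices below are placeholders chosen for illustration, not values confirmed by the portal.

```python
import json
from urllib.parse import quote, unquote

# Assumptions for illustration only: neither URL is taken from the portal.
NEUROGLANCER_BASE = "https://neuroglancer-demo.appspot.com"
TOMOGRAM_ZARR_URL = "https://example.org/path/to/tomogram.zarr"


def build_neuroglancer_link(zarr_url, layer_name="tomogram"):
    """Build a viewer link whose fragment carries the JSON state."""
    state = {
        "layers": [
            {"type": "image", "source": f"zarr://{zarr_url}", "name": layer_name}
        ],
        "layout": "4panel",
    }
    # Compact JSON, then percent-encode it so it is safe inside a URL fragment.
    encoded = quote(json.dumps(state, separators=(",", ":")))
    return f"{NEUROGLANCER_BASE}/#!{encoded}"


link = build_neuroglancer_link(TOMOGRAM_ZARR_URL)
# Round trip: decode the fragment back into the original state.
decoded_state = json.loads(unquote(link.split("#!", 1)[1]))
print(decoded_state["layers"][0]["source"])
```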

Setup staging environment

Create a staging environment to test all merged changes.

View other tomograms and segmentation annotations in Neuroglancer

User story: (4.6) I want to open a tomogram in Neuroglancer, so that I can visualize the data

Tasks

option to view other tomograms (besides canonical) in Neuroglancer
open Neuroglancer in a separate tab OR navigate back to original page it was launched from
view segmentation annotations

Link to design:

Download annotation only

filter on single dataset page - to narrow down runs

Tasks

2 filters: tilt series quality score, tilt range
general AND/OR logic:
- AND logic between different filters
- OR logic for multiple values within one filter
when filter is applied, show as a blue tag that the user can "X" out of to remove
only show the filter values that we have available in the database
once filter is applied, show number of search results in the runs table

Note: sort is out of scope

Link to design: https://www.figma.com/file/oUtMvkDdADObFmvOJVVy9G/Phase-2-Deprioritized-Designs-%26-Explorations?node-id=29%3A9586&mode=dev

all runs associated with dataset (list view)

User story: (3.5) I want to see all runs associated with this dataset + filter/sort within the dataset so that I can easily navigate large amounts of data within the dataset

Tasks

show list of runs

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=757%3A86414&mode=dev

mrc pyramid is not necessary

Those who download tomogram MRC files are end users who need them at a particular voxel_size, so the multi-scale pyramid is not useful to them. In addition, the pyramid scales all three axes equally, which causes tilt_series, which are stacks of images, to skip parts of the tilt sequence.

Recommendation: remove the MRC multiscale pyramid and generate pyramids only for Zarr files.

Single Run page - Annotations table view

User story: (4.5) I want to see essential annotation metadata displayed on the table so that I can find the annotation of interest

Tasks

table has the following columns: Annotation, annotation object, object count, precision, recall
show pagination - max out at 20
show how many annotations total and how many are being displayed (e.g. 20 of 25 Annotations)
Show blue tag where ground truth is available

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=1310%3A11321&mode=dev

Setup frontend framework

Setting up framework
Research on nextjs vs remix

Create FAQ page

create FAQ page to host documentation
allow user to navigate there from the FAQ menu dropdown
allow PRs to directly update the text on the FAQ page
each question/section can be collapsed and expanded (accordion style), defaulting to collapsed view when users first navigate to page

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?node-id=2234%3A141854&mode=dev

detailed dataset metadata

User story: (3.4) I want to see detailed dataset metadata, so that I can get information on the experiment as a whole

Tasks

ability to expand/navigate for full list of metadata fields

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=757-86414&mode=design&t=Vl2vV88PQ7K0PR5R-0

Create CORS support for cloudfront

To allow Neuroglancer to request files from the CloudFront API gateway, we have to add the Neuroglancer domain to the supported cross-origins.

key photo at dataset level

User story: (3.2) I want to see a key photo at the dataset level, so that I can get a glimpse of what I might expect to see in the dataset

Tasks

surface key photo (single large photo at the top of single dataset page)
surface key photo on the browse all data page (small photos)
note: key photos are submitted at run level, so we will have to pick one to show at dataset level. Other times, the key photo comes from EMPIAR (in which case we would only have 1 and will not have to select which to show)

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=757%3A86414&mode=dev

Agree on metadata changes

We have a variety of different metadata changes coming up for Phase 2. Before we can make these changes to our json and GQL schema, let's get approval for what these fields should be called and what their structure should be.

Draft changes are here: https://docs.google.com/document/d/1zcD9OyY86gaogVZOBZPwcn0fxZ5QSFrlqXeaodJX9BI/edit#heading=h.be2fyivqz1zp

Browse all data (show # of datasets)

User story: (2.1) I want to browse all data, so that I am aware of what is available

Tasks

Show total number of datasets

Link to design:

high-level description of dataset

User story: (3.1) I want to see a high-level description of the dataset, so that I can decide whether I want to proceed with downloading

Tasks

Short description of dataset + author, EMPIAR ID, etc. See design for specific text

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=757%3A86414&mode=dev

Querying annotations returns duplicate HTTPS URLs

Querying some of the annotations returns more objects than I'd expect, some of which have duplicate metadata URLs.

See the following reproducer.

from cryoet_data_portal import Client, Tomogram
client = Client()
tomo = next(Tomogram.find(client, [Tomogram.name == 'TS_026']))
annos = list(tomo.tomogram_voxel_spacing.annotations)
for a in annos:
    print(a.https_metadata_path)

# https://files.cryoetdataportal.cziscience.com/10000/TS_026/Tomograms/VoxelSpacing13.48/Annotations/sara_goetz-fatty_acid_synthase-1.0.json
# https://files.cryoetdataportal.cziscience.com/10000/TS_026/Tomograms/VoxelSpacing13.48/Annotations/sara_goetz-ribosome-1.0.json
# https://files.cryoetdataportal.cziscience.com/10000/TS_026/Tomograms/VoxelSpacing13.48/Annotations/sara_goetz-ribosome-1.0.json
# https://files.cryoetdataportal.cziscience.com/10000/TS_026/Tomograms/VoxelSpacing13.48/Annotations/sara_goetz-fatty_acid_synthase-1.0.json

I suspect that we're returning multiple S3 versions of each bucket object, but I'm not certain.
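
As a client-side stopgap, the duplicates can be dropped by key while keeping the original order. This is only a sketch and does not address the root cause; unique_by_key is a hypothetical helper, and the paths are the four values printed by the reproducer above.

```python
def unique_by_key(items, key):
    """Return items in their original order, keeping the first item per key."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


base = "https://files.cryoetdataportal.cziscience.com/10000/TS_026/Tomograms/VoxelSpacing13.48/Annotations/"
paths = [
    base + "sara_goetz-fatty_acid_synthase-1.0.json",
    base + "sara_goetz-ribosome-1.0.json",
    base + "sara_goetz-ribosome-1.0.json",
    base + "sara_goetz-fatty_acid_synthase-1.0.json",
]

unique_paths = unique_by_key(paths, key=lambda p: p)
print(len(unique_paths))
```

With the annotation objects themselves, the same helper could be called as unique_by_key(annos, key=lambda a: a.https_metadata_path).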

Error in Client.get_by_id

version == 1.0.0

from cryoet_data_portal import Client, Dataset
myclient=Client()
item = Dataset.get_by_id(myclient, 10000)

This gave the following error:

  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/cryoet_data_portal/_gql_base.py", line 259, in get_by_id
    return client.find_one(cls, [cls.id == id])
  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/cryoet_data_portal/_client.py", line 58, in find_one
    for result in self.find(*args, **kwargs):
  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/cryoet_data_portal/_client.py", line 53, in find
    response = self.client.execute(self.build_query(cls, gql_type, query_filters))
  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/gql/client.py", line 403, in execute
    return self.execute_sync(
  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/gql/client.py", line 221, in execute_sync
    return session.execute(
  File "/Users/anchi.cheng/Library/Python/3.9/lib/python/site-packages/gql/client.py", line 860, in execute
    raise TransportQueryError(
gql.transport.exceptions.TransportQueryError: {'extensions': {'code': 'validation-failed', 'path': '$.selectionSet.datasets.args.where'}, 'message': "expected an object for type 'datasets_bool_exp', but found an enum value"}

Workaround:

    # Use find with an explicit filter instead of get_by_id
    items = Dataset.find(myclient, [Dataset.id == 10004])
    for item in items:
        print(item.title)

Filter on browse all data page

User Story: 2.2 I want to filter through the data, so that I can narrow down my selection

13 filters total, organized within 7 groups (see design)
general AND/OR logic:
- AND logic between different filters (e.g., I want to see object name = ribosome AND camera manufacturer = Gatan)
- OR logic for multiple values within one filter (e.g., I want to see datasets that have Ribosomes OR Membranes annotated)
- exception to the above rule: multi-select under the "available files" section is AND logic (e.g., i want to see datasets where it includes raw frames AND tilt series)
when filter is applied, show as a blue tag that the user can "X" out of to remove
only show the filter values that we have available in the database (i.e., we wouldn't list "Mesh" under object shape type if it doesn't exist in our metadata)
once filter is applied, show number of search results in the dataset table
note: see specific filter menu types in design
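
The AND/OR rules above can be sketched as a single predicate. This is an illustration of the logic only, written in Python rather than the frontend's language; the field names and sample datasets are hypothetical and do not reflect the real schema.

```python
def matches(dataset, filters, all_of_fields=("available_files",)):
    """AND across filters; OR within a filter, except fields in all_of_fields."""
    for field, selected in filters.items():
        if not selected:
            continue  # an empty filter does not restrict anything
        values = set(dataset.get(field, ()))
        wanted = set(selected)
        if field in all_of_fields:
            # Exception: every selected value must be present (AND logic).
            if not wanted <= values:
                return False
        elif not (wanted & values):
            # Default: at least one selected value must be present (OR logic).
            return False
    return True


# Hypothetical sample data for illustration.
datasets = [
    {"id": 1, "object_name": ["ribosome"], "camera_manufacturer": ["Gatan"],
     "available_files": ["raw frames", "tilt series"]},
    {"id": 2, "object_name": ["membrane"], "camera_manufacturer": ["Gatan"],
     "available_files": ["tilt series"]},
    {"id": 3, "object_name": ["ribosome"], "camera_manufacturer": ["FEI"],
     "available_files": ["raw frames", "tilt series"]},
]
filters = {
    "object_name": ["ribosome", "membrane"],
    "camera_manufacturer": ["Gatan"],
    "available_files": ["raw frames", "tilt series"],
}

result_ids = [d["id"] for d in datasets if matches(d, filters)]
print(result_ids)
```

Dataset 2 is excluded because it lacks raw frames, and dataset 3 because its camera manufacturer does not match.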

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=1637-37001&mode=design&t=QGhDNstOreaiFP49-0

Understand the CryoET Portal Website (Landing Page)

Epic for tracking all issues/tasks related to implementing the Portal landing page

Copy for the landing page (with marketing)
#26

Create preview images from tomogram

Generate snapshots of tomogram, so that user can get a glimpse of what they might expect to see in the dataset.

Configure download dialog from single run page

User story: (5.5) I want to download specific tomograms and/or annotations so that I have targeted data without unnecessarily using up memory

Tasks

download button on single run page to open "Configure Download" dialog
dialog contains the following elements: title, dataset name, run name, and 2 download options (see below)
- #120
- #121
directions to download all run data via API, with link to API instructions

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=2315-169419&mode=design&t=8myNEdr3dZigyAP2-0

empty, reuse

Setup infrastructure for front end application

Configure a workflow to build code and deploy using the happy path. Also, set up infrastructure for the repository and containers to deploy the application.

Run quality metrics

User story: (4.5) I want to see quality metrics, so that I can decide how reliable the data is

Tasks

tilt series quality metrics shown directly in UI (separated from metadata to make it more accessible)

Link to design:

Single Run page - Metadata within a run

User story: (4.2) I want to see all the metadata within a run, so that I can access detailed information

Tasks

full list of metadata accessible but initially tucked away (likely via panel) - open via "more info" that's at the top right corner of the page

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=1310%3A11321&mode=dev

A client with an invalid URL hangs indefinitely on find

I can create a client with an invalid URL successfully, then attempt to find datasets (or presumably other things). I'd expect this to error (ideally on initialization of the client), but instead it hangs indefinitely when calling find.

The following code should reproduce this issue

from cryoet_data_portal import Client, Dataset
client = Client("https://graphql.catdataportal.cziscience.com/v1/graphql")
datasets = list(Dataset.find(client, [Dataset.id == 10000]))
# hangs indefinitely
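
Until the client validates its URL, a caller could run a quick reachability check with a timeout before calling find. This is a sketch only: endpoint_reachable is a hypothetical helper built on the standard library, and whether the hang is actually caused by a missing timeout is an assumption, not something the report confirms.

```python
import urllib.error
import urllib.request


def endpoint_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the server at url answers at all within timeout seconds."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return False


# A malformed URL is rejected without any network access.
print(endpoint_reachable("not a url"))
```

The opener parameter exists so the check can be exercised with a fake in tests instead of a real network call.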

Download a single dataset

User story: (5.1) I want to download a single dataset, so that I can use it to develop/train ML models

Tasks:

P0:

download button on single dataset page to open download dataset dialog
dialog contains the following elements: title, dataset name, and 2 download tabs for users to choose from
tab 1: download via AWS S3
- step 1 asks user to select their save destination (this is optional for user) - P1
step 2 is an AWS S3 snippet that user can copy
provide link to AWS instructions
- note: instructions are WIP by Dannielle
tab 2: download via API
info box to link user to API documentation
- link is: https://chanzuckerberg.github.io/cryoet-data-portal/python-api.html

P1:

surface dataset ID for user to copy
asks user to select their save destination (this is optional for user)

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=2315-169422&mode=design&t=8myNEdr3dZigyAP2-0

Setup build pipeline with happy for frontend

Build and deploy container using happy
Have the option to deploy only frontend components for dev env

Filter on single runs page - to narrow down annotations

Add the following 6 filters to the single run page to narrow down annotations:

1. Annotation Author
- new filter following existing patterns, should return annotation authors, and not dataset authors
2. Object Name
- filter exists on Browse All page
3. GO ID
- new filter, see filter specs in designs
4. Object Shape Type
- filter exists on Browse All page
5. Method Type
- new filter following existing patterns. Should only have a max of 3 dropdown options
- dependent on #443 and #442
6. Annotation Software
- new filter following existing patterns

Link to design: https://www.figma.com/file/q6Z394Xy6wUmaXQ9YFjZH7/Kevin---2023-Spillover-Work?type=design&node-id=2118-11276&mode=design&t=W43r2ZQSbiy3s1ES-0

browse on 2 separate tabs (dataset, run)

User Story: (2.6) I want the filter, sort, and search results to be shown on 2 separate tabs (dataset, run), so that I can better navigate different levels of data

Link to design:

Sort data (on browse all data page)

User Story: (2.3) I want to sort the data, so that the data is arranged in a meaningful order

sort by organism, # of runs, object type, etc. (see designs for final list)

Link to design:

add example for using ilike method to construct query in API documentation

like, ilike, and _in do not work like other operators.

I had to use them as attributes of the field, or I get a syntax error. For example,

from cryoet_data_portal import Client, Annotation
myclient = Client()
results = Annotation.find(myclient,[Annotation.object_name.ilike('fatty acid synthase%')])

while

results = Annotation.find(myclient,[Annotation.object_name ilike 'fatty acid synthase%'])

gives a syntax error.
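
A likely explanation, sketched with a toy class: Python lets a class overload built-in operators such as == through special methods, but ilike is not a Python operator, so it can only be exposed as a method. The Field class and the tuple-shaped filters below are hypothetical and are not the cryoet_data_portal implementation.

```python
class Field:
    """Toy stand-in for a model field; not the real library class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # "==" is a real Python operator, so it can be overloaded.
        return (self.name, "_eq", other)

    def ilike(self, pattern):
        # "ilike" is not Python syntax, so it has to be a method call.
        return (self.name, "_ilike", pattern)

    def _in(self, values):
        # "in" cannot return a filter object either, hence a method.
        return (self.name, "_in", list(values))


object_name = Field("object_name")
eq_filter = object_name == "ribosome"
ilike_filter = object_name.ilike("fatty acid synthase%")
in_filter = object_name._in(["ribosome", "membrane"])
print(eq_filter, ilike_filter, in_filter)
```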

Browse all data (list view & pagination)

User story: (2.1) I want to browse all data, so that I am aware of what is available

Tasks

Show dataset in list view
pagination

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?mode=dev

Open tomogram in Neuroglancer

User story: (4.6) I want to open a tomogram in Neuroglancer, so that I can visualize the data

Tasks

default to opening canonical tomogram in Neuroglancer
- run page
- dataset page

Link to design:

OME-Zarr tomogram scale is not physically correct

Currently, the OME-Zarr metadata (i.e. the top-level zattrs) describe unit-less spatial dimensions of z/y/x and multi-resolution scales of (1, 2, 4).

While these scales are relatively correct (i.e. they represent the scaling factors from the highest resolution tomogram), they do not capture the physical spacing between the voxels of the tomogram as I would expect.

Instead, I would expect the spatial dimensions to include a supported OME-NGFF unit (e.g. angstrom) and then absorb the voxel spacing/sizes into the multi-resolution scales (e.g. 13.48, 26.96, 53.92).

Not a major issue, but this would mean that tools that read this metadata automatically set the scale of the data to be physically correct with respect to other data and to visualization tools like scale bars.
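
For concreteness, here is a sketch of what the suggested metadata could look like, following the OME-NGFF 0.4 layout of axes plus per-level scale transformations. The helper name, the level paths "0", "1", "2", and the version string are assumptions; only the unit and the scale values come from the description above.

```python
import json


def multiscales_metadata(voxel_spacing, factors=(1, 2, 4), unit="angstrom"):
    """Build multiscales metadata with physical voxel spacing per level."""
    axes = [{"name": n, "type": "space", "unit": unit} for n in ("z", "y", "x")]
    datasets = [
        {
            "path": str(level),  # assumed array path for each pyramid level
            "coordinateTransformations": [
                # Absorb the voxel spacing into the scale of each level.
                {"type": "scale", "scale": [round(voxel_spacing * factor, 6)] * 3}
            ],
        }
        for level, factor in enumerate(factors)
    ]
    return {"multiscales": [{"version": "0.4", "axes": axes, "datasets": datasets}]}


zattrs = multiscales_metadata(13.48)
print(json.dumps(zattrs, indent=2))
```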

Single run page header area

User story: (4.1) I want to see high-level description of the run so that I can quickly decide if I would like to investigate further

Tasks

count of # of files (Frames, Tilt-series, Tomograms, Annotations)
Overview of tilt series, including tilt quality, tilt range, tilt scheme
Overview of Tomogram, including resolutions available, tomogram processing, annotated objects

Link to design: https://www.figma.com/file/WEmbsjtlBUtRy7pzmuCCjj/CryoET-Data-Portal---Phase-2-Designs?type=design&node-id=1310%3A11321&mode=dev