The covigator from tron-bioinformatics

covigator's Issues

Remove "effect" column from recurrent mutations table

When taking screenshots for the manuscript I noticed that this column takes up space and is not very informative. Whatever effect a mutation has is already encoded in the HGVS code, which is enough for the trained eye.

Can we make the need to click apply more explicit

We agreed we cannot get rid of the apply button, but maybe we can make it more explicit that it needs to be clicked.

Change threshold for low frequency mutations to 2 % and store mutations >= 50 % and < 80 % in different table

Update main system diagram in the documentation

Annotate lineages with described mutations

Following up on #61, we want to include in the DB the mutations that are described as making up a given lineage. This is slightly different from the mutations that are observed in the samples from this lineage.

We could create a table LineageVariant with one FK pointing to the Lineage table and another FK pointing to the Variant table.
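
The proposed table could look roughly like the sketch below. It assumes SQLAlchemy's declarative mapping; the `Lineage` and `Variant` classes are minimal stand-ins for the existing tables, and all column names and the variant identifier format are hypothetical.

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Lineage(Base):
    # Stand-in for the existing Lineage table (column name is an assumption)
    __tablename__ = "lineage"
    pango_lineage_id = Column(String, primary_key=True)


class Variant(Base):
    # Stand-in for the existing Variant table (column name is an assumption)
    __tablename__ = "variant"
    variant_id = Column(String, primary_key=True)


class LineageVariant(Base):
    # One row per (lineage, described mutation) pair. The composite primary
    # key lets the same variant appear in several lineages without duplicates.
    __tablename__ = "lineage_variant"
    pango_lineage_id = Column(
        String, ForeignKey("lineage.pango_lineage_id"), primary_key=True)
    variant_id = Column(
        String, ForeignKey("variant.variant_id"), primary_key=True)


# In-memory SQLite database, for illustration only
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Illustrative rows; the identifier format "23403:A>G" is hypothetical
    session.add_all([
        Lineage(pango_lineage_id="B.1.1.7"),
        Lineage(pango_lineage_id="B.1.617.2"),
        Variant(variant_id="23403:A>G"),
        LineageVariant(pango_lineage_id="B.1.1.7", variant_id="23403:A>G"),
        LineageVariant(pango_lineage_id="B.1.617.2", variant_id="23403:A>G"),
    ])
    session.commit()
    rows = (
        session.query(LineageVariant.pango_lineage_id)
        .filter(LineageVariant.variant_id == "23403:A>G")
        .order_by(LineageVariant.pango_lineage_id)
        .all()
    )
    lineages_with_variant = [row[0] for row in rows]
```

With a FK to the Variant table, a described mutation that has never been observed cannot be referenced until a Variant row exists for it.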

Find an appropriate resource that describes the mutations in each lineage
Define an appropriate data model (the same variant may occur in multiple lineages...)
Parse the mutation representation
Load the data in the DB at startup time

Potential issues:

Are indels represented in a normal form compatible with what we use in CoVigator?
MNVs may not be reported together, we may need to merge them
The Variant table contains the functional annotations coming from SnpEff. This information is obtained from the VCF files that come out of the pipeline. We won't be able to add this information if the database does not contain this variant beforehand.

Frequencies are not segregated by months in the recurrent mutations table

In the recurrent mutations tab, the top table of recurrent mutations can show two metrics segregated by month: the count within the month (the default) and the frequency within the month (ie: the fraction of samples collected in that month that show the mutation). The frequency option is not working properly.
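
For reference, the two metrics can be sketched with pandas as below. This is only an illustration of the intended calculation, not the dashboard's code; the data frames, column names and variant identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical input: one row per (sample, mutation) observation
observations = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "month": ["2021-01", "2021-01", "2021-02", "2021-02"],
    "variant_id": ["23403:A>G", "23403:A>G", "23403:A>G", "241:C>T"],
})

# Hypothetical input: all samples per month, with or without mutations
samples = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4", "s5"],
    "month": ["2021-01", "2021-01", "2021-02", "2021-02", "2021-02"],
})


def monthly_counts_and_frequencies(observations, samples):
    # Count within the month: distinct samples showing each mutation
    counts = (
        observations.groupby(["variant_id", "month"])["sample_id"]
        .nunique()
        .rename("count")
        .reset_index()
    )
    # Denominator: distinct samples collected in each month
    totals = (
        samples.groupby("month")["sample_id"]
        .nunique()
        .rename("total")
        .reset_index()
    )
    merged = counts.merge(totals, on="month", how="left")
    # Frequency within the month: count divided by samples in that month
    merged["frequency"] = merged["count"] / merged["total"]
    return merged


monthly = monthly_counts_and_frequencies(observations, samples)
```

Counting distinct samples in both the numerator and the denominator keeps every frequency between 0 and 1.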

Add table in home with features in each dataset

Annotate lineages with names (eg: Delta)

I initially thought of having a new lineages table in the database, where the PK is the Pangolin lineage identifier (eg: B.1.1.7) and where we additionally store metadata such as the WHO designation (eg: Delta), a VOC flag, the parent lineage, etc.

This is a good resource to fetch such data: https://github.com/cov-lineages/constellations
There may be others.

Subtasks:

Define data model for the lineage table
Research the best resource to fetch such information
Integrate this data in the CoVigator package (via git submodule if possible)
Load data in DB at DB startup

Implement a simplistic lineage selection with a dropdown box that opens the page of the selected lineage

Before we implement a search tab, which poses more challenges, we could select a lineage in the lineages dashboard with a dropdown.

Plot label and figure description are too close in the C19 Data Portal overview tab

Add exclusion criterion for samples without collection date

In the ENA dataset we found 105 samples that have an empty collection date. These were otherwise processed properly and have mutations called.

We want to add an exclusion criterion for new samples coming in from both datasets and apply this exclusion retroactively.
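
A minimal sketch of such a check follows, assuming samples are available as simple records. The `Sample` class and its field names are hypothetical; the same function could be applied both to incoming samples and to the samples already stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Sample:
    # Hypothetical minimal sample record
    run_accession: str
    collection_date: Optional[date]


def split_by_collection_date(
        samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Return (included, excluded); samples without a date are excluded."""
    included = [s for s in samples if s.collection_date is not None]
    excluded = [s for s in samples if s.collection_date is None]
    return included, excluded


example_samples = [
    Sample("run_1", date(2021, 3, 1)),
    Sample("run_2", None),
]
included, excluded = split_by_collection_date(example_samples)
```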

Cite pangolin lineages in docs

Update ENA accessor to include Nanopore data

The accessor queries the ENA REST API and stores each sample's metadata (ie: URLs to FASTQs, country, collection date, etc.). Optionally, it also downloads the FASTQ files and stores them in the file structure where they can later be found.

The accessor needs to stop excluding nanopore samples when querying the REST API (instrument_platform)
The accessor needs to know the library strategies used for Nanopore sequencing (Amplicon,...)
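
The sketch below shows how a query including Nanopore samples might be assembled. It is not the accessor's code: the ENA Portal API search endpoint, the query syntax, the platform value `OXFORD_NANOPORE`, the strategy value `AMPLICON` and the list of fields are assumptions that need to be checked against the current ENA documentation. No request is sent; only the URL is built.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API search service
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"
# NCBI taxonomy identifier of SARS-CoV-2
SARS_COV_2_TAX_ID = 2697049


def build_ena_query(platforms, library_strategies):
    """Build the query expression (the syntax is an assumption)."""
    platform_clause = " OR ".join(
        f'instrument_platform="{p}"' for p in platforms)
    strategy_clause = " OR ".join(
        f'library_strategy="{s}"' for s in library_strategies)
    return (f"tax_eq({SARS_COV_2_TAX_ID}) "
            f"AND ({platform_clause}) AND ({strategy_clause})")


def build_ena_url(platforms, library_strategies):
    params = {
        "result": "read_run",
        "query": build_ena_query(platforms, library_strategies),
        # Hypothetical selection of metadata fields
        "fields": "run_accession,fastq_ftp,country,collection_date,"
                  "instrument_platform,library_strategy",
        "format": "json",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


query = build_ena_query(["ILLUMINA", "OXFORD_NANOPORE"], ["AMPLICON"])
url = build_ena_url(["ILLUMINA", "OXFORD_NANOPORE"], ["AMPLICON"])
```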

Change VAF range in intrahost tab to include VAFs below 10 %

Search optimization of CoVigator dashboard with keywords

Lineage summary dashboard

ENA API update

The ENA API was updated in May 2023. We need to check if the changes affect the CoVigator accessor module.

More information here:
https://docs.google.com/document/d/1RPHmK8Pvm9UxSa21Ej3MkGoGYO9baSxwxk_dOuWWyNE/edit#heading=h.jq87g3izg5xr

Smooth the lineages plot

In the lineages tab, the second plot shows the abundance of each lineage per day. The lines have a lot of local peaks that hamper the interpretation of the plot.
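
One possible approach is a centered rolling mean over the daily values, sketched below with pandas. The window size and the layout of the data frame (one row per day, one column per lineage) are assumptions, and other smoothers may suit the plot better.

```python
import pandas as pd


def smooth_daily_counts(daily_counts: pd.DataFrame,
                        window_days: int = 7) -> pd.DataFrame:
    """Centered rolling mean; min_periods=1 keeps the first and last days."""
    return daily_counts.rolling(
        window=window_days, center=True, min_periods=1).mean()


# Hypothetical daily abundance of one lineage with local peaks
raw = pd.DataFrame(
    {"B.1.1.7": [0, 10, 0, 10, 0]},
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)
smoothed = smooth_daily_counts(raw, window_days=3)
```

A larger window gives a smoother line but delays and flattens genuine changes in abundance.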

Avoid possible SQL injections by using placeholders

Within queries.py some SQL queries are created by string formatting. This is a possible entry point for SQL injections.
SQLAlchemy uses the TextClause object for simple SQL queries. In it, you can specify placeholders that are replaced with values from a dictionary when the query is executed with the execute method.

Check compatibility with pandas.read_sql_query
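
A minimal sketch of the placeholder approach follows, using an in-memory SQLite database. The table, columns and function name are hypothetical and do not come from queries.py; compatibility with pandas.read_sql_query is not covered here.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database, for illustration only
engine = create_engine("sqlite://")


def count_samples_by_country(conn, country):
    # ":country" is a named placeholder; the value is bound by the driver
    # instead of being formatted into the SQL string
    query = text("SELECT COUNT(*) FROM sample WHERE country = :country")
    return conn.execute(query, {"country": country}).scalar()


with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE sample (run_accession TEXT, country TEXT)"))
    conn.execute(
        text("INSERT INTO sample VALUES (:run_accession, :country)"),
        [
            {"run_accession": "run_1", "country": "Germany"},
            {"run_accession": "run_2", "country": "Germany"},
            {"run_accession": "run_3", "country": "France"},
        ],
    )
    germany_count = count_samples_by_country(conn, "Germany")
    # An injection attempt is treated as a plain string value
    injected_count = count_samples_by_country(conn, "Germany' OR '1'='1")
```

With string formatting the second call would have matched every row; with a bound parameter it matches none.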

Some monthly frequencies are slightly above 1.0 in table precomputed_top_occurrence

Prepare proposal of plots and information to show (slide-based)

Some ideas from our team meeting:

Geographical distribution on a map. Challenge: both datasets are biased towards a few countries, so we need to devise some normalisation to alleviate this issue. We can use plotly's Scattergeo for this purpose (https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Scattergeo.html)
Top recurrent mutations not in the lineage consensus. Implement a table similar to the top recurrent mutations table, but showing only those mutations observed in a certain lineage that are not described as being part of the lineage. Rationale: these mutations may correspond to technical artifacts or show lineage evolution.
Compare two lineages. Venn diagram of both mutation sets
Consensus sequence across all samples belonging to the same lineage. We would need to recalculate the consensus every time a new sample comes in. It would also be useful to compare it with the described consensus sequence. Any difference between both sequences may be due to technical artifacts or legitimate lineage evolution
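
The comparisons in the second and third ideas reduce to set operations, as in the sketch below. The function name and the mutation identifiers are hypothetical placeholders, not real lineage definitions.

```python
def compare_mutation_sets(first, second):
    """Split two mutation sets into the three regions of a Venn diagram."""
    return {
        "first_only": first - second,
        "shared": first & second,
        "second_only": second - first,
    }


# Placeholder identifiers, not real lineage definitions
described = {"mut_a", "mut_b", "mut_c"}
observed = {"mut_b", "mut_c", "mut_d"}

regions = compare_mutation_sets(described, observed)
# Mutations observed in the lineage but not described as part of it
not_in_consensus = regions["second_only"]
```

The same function serves both cases: described versus observed mutations of one lineage, or the mutation sets of two different lineages.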

Edit the first paragraph in the landing page

Use lineage metadata to enrich the lineages plot in the dashboard

Use lineage metadata from #61 to enrich the dashboard. Update the combo boxes in the lineages plot where we can select the lineages to show.

Add some teaser info on each tab

Search bar for lineages

A search bar in the lineages dashboard that allows searching at least for Pangolin identifiers and WHO designations.
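
The matching logic behind such a search bar could be as simple as the sketch below: a case-insensitive substring match on both names. The function name and the in-memory mapping are hypothetical; in the dashboard the data would presumably come from the lineage metadata in the DB.

```python
def search_lineages(query, who_designations):
    """Return Pangolin identifiers whose identifier or WHO designation
    contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return sorted(
        pango for pango, who in who_designations.items()
        if needle in pango.lower() or (who and needle in who.lower())
    )


# Pangolin identifier -> WHO designation (None when there is none)
who_designations = {
    "B.1.1.7": "Alpha",
    "B.1.617.2": "Delta",
    "B.1.177": None,
}

by_who_name = search_lineages("delta", who_designations)
by_pango_prefix = search_lineages("b.1.1", who_designations)
```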

Harmonize colors of mutations in legends in mutation stats tab

The MNVs have two different colors in the screenshot below. Also, in the left plot it is difficult to distinguish insertions from MNVs.

Make grey boxes on tabs clickable to highlight associated plot

It is difficult to map each of these grey boxes on the top left to the corresponding plot.

Capture feedback from the dashboard

We tried in the past to capture user feedback by providing buttons to create GitHub issues in the acknowledgments section. We have not received anything so far, either because the dashboard is not yet published and has not reached many people, or because the buttons are somewhat hidden.

It may be helpful to have a more pervasive "provide feedback" feature, with a visible button available from all locations of the dashboard. The article below describes how to do this in a Dash dashboard.
https://medium.com/codex/how-to-create-a-dashboard-with-a-contact-form-using-python-and-dash-ee3aacffd349

Link to the lineage summary (eg: https://covigator.tron-mainz.de/lineages/B.1.1.7) from the lineages tab of other dashboards

Implement content of the dashboard

Add reference to the preprint in CoVigator dashboard acknowledgements and in the documentation

Hover tooltips in lineages plot are difficult to use

Performance of the samples tab is not great

Describe the bug
When switching to the samples tab it takes a long time (more than 10 seconds and less than a minute) to show the data.
This behaviour is also reproducible when the GISAID source is selected while already in the samples tab. When the ENA source is selected everything is much faster.

To Reproduce
Steps to reproduce the behavior:

Switch to the samples tab

Expected behavior
Something faster!

Additional context
Reproducible from any environment

In the intrahost tab the selection of mutations to expand details requires clicking apply and running expensive queries again; this could be separated into an independent button "view details"
