covigator's People

Contributors

dependabot[bot], johausmann, patricksorn, priesgo, tronbfxdev

Forkers

priesgo

covigator's Issues

Remove "effect" column from recurrent mutations table

When taking screenshots for the manuscript I noticed that this column takes up space and is not very informative. Whatever effect a mutation has is already encoded in its HGVS code, which is enough for the trained eye.

Annotate lineages with described mutations

Following up on #61, we want to include in the DB the mutations that are described as defining a given lineage. This is slightly different from the mutations that are observed in the samples from this lineage.

We could create a table LineageVariant with one FK pointing to the Lineage table and another FK pointing to the Variant table.

  • Find an appropriate resource that describes the mutations in each lineage
  • Define an appropriate data model (the same variant may occur in multiple lineages...)
  • Parse the mutation representation
  • Load the data in the DB at startup time
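
A minimal sketch of the LineageVariant table proposed above, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The table names, column names and the placeholder identifiers (LINEAGE_A, 100:A>G) are hypothetical and are not taken from the CoVigator schema. The composite primary key lets the same variant appear in several lineages while preventing duplicate pairs.

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Lineage(Base):
    # Hypothetical stand-in for the real Lineage table
    __tablename__ = "lineage"
    pango_lineage_id = Column(String, primary_key=True)


class Variant(Base):
    # Hypothetical stand-in for the real Variant table
    __tablename__ = "variant"
    variant_id = Column(String, primary_key=True)


class LineageVariant(Base):
    """One row per (lineage, variant) pair described as defining the lineage."""
    __tablename__ = "lineage_variant"
    pango_lineage_id = Column(
        String, ForeignKey("lineage.pango_lineage_id"), primary_key=True)
    variant_id = Column(
        String, ForeignKey("variant.variant_id"), primary_key=True)


def build_example_session() -> Session:
    """Creates the tables in memory and loads two placeholder lineages
    that share one placeholder variant."""
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    session = Session(engine)
    session.add_all([
        Lineage(pango_lineage_id="LINEAGE_A"),
        Lineage(pango_lineage_id="LINEAGE_B"),
        Variant(variant_id="100:A>G"),
        LineageVariant(pango_lineage_id="LINEAGE_A", variant_id="100:A>G"),
        LineageVariant(pango_lineage_id="LINEAGE_B", variant_id="100:A>G"),
    ])
    session.commit()
    return session
```

Querying `LineageVariant` filtered by `variant_id` then returns every lineage in which that variant is described.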

Potential issues:

  • Are indels represented in a normal form compatible with what we use in CoVigator?
  • MNVs may not be reported together; we may need to merge them
  • The Variant table contains the functional annotations coming from SnpEff. This information is obtained from the VCF files that come out of the pipeline. We won't be able to add this information if the database does not contain this variant beforehand.
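
The MNV point above could be handled along these lines. This is only a sketch: it assumes the lineage resource reports single-base substitutions as (position, reference, alternate) and that substitutions at consecutive positions belong to the same MNV, which may not hold in general. The `Snv` type and the function name are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Snv(NamedTuple):
    position: int   # 1-based genomic position of the first base
    reference: str
    alternate: str


def merge_adjacent_snvs(snvs: Iterable[Snv]) -> List[Snv]:
    """Merges substitutions at consecutive positions into a single MNV."""
    merged: List[Snv] = []
    for snv in sorted(snvs):  # tuples sort by position first
        if merged:
            last = merged[-1]
            # The next substitution starts right after the previous one ends
            if last.position + len(last.reference) == snv.position:
                merged[-1] = Snv(
                    last.position,
                    last.reference + snv.reference,
                    last.alternate + snv.alternate,
                )
                continue
        merged.append(snv)
    return merged
```

For example, three substitutions G>A, G>A and G>C at positions 28881, 28882 and 28883 would be merged into a single GGG>AAC at position 28881, while an isolated substitution elsewhere is left untouched.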

Frequencies are not segregated by months in the recurrent mutations table

In the recurrent mutations tab, the top table of recurrent mutations can show two metrics segregated by month: the count within the month (the default) and the frequency within the month (i.e. the fraction of samples collected in that month that carry the mutation). The frequency option is not working properly.
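
To make the intended difference between the two metrics explicit, here is a small sketch in plain Python. It assumes each sample is given as a (collection month, set of variant identifiers) pair; the function name and the input shape are hypothetical and do not reflect how CoVigator stores this data.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def monthly_counts_and_frequencies(
        samples: Iterable[Tuple[str, Set[str]]],
        variant_id: str) -> Dict[str, Tuple[int, float]]:
    """Returns {month: (count, frequency)} for one variant.

    count     = samples collected in the month that carry the variant
    frequency = count divided by all samples collected in that month
    """
    samples_per_month: Counter = Counter()
    carriers_per_month: Counter = Counter()
    for month, variants in samples:
        samples_per_month[month] += 1
        if variant_id in variants:
            carriers_per_month[month] += 1
    return {
        month: (carriers_per_month[month], carriers_per_month[month] / total)
        for month, total in sorted(samples_per_month.items())
    }
```

With one carrier out of two samples in one month and two carriers out of four in the next, the counts differ (1 and 2) but the frequency is 0.5 in both months, so the two views should not show the same numbers.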

Annotate lineages with names (eg: Delta)

I initially thought of having a new lineages table in the database, where the PK is the Pangolin lineage identifier (e.g. B.1.1.7) and which additionally holds metadata such as the WHO designation (e.g. Delta), a VOC flag, the parent lineage, etc.

This is a good resource to fetch such data: https://github.com/cov-lineages/constellations
There may be others.

Subtasks:

  • Define data model for the lineage table
  • Research the best resource to fetch such information
  • Integrate this data in the CoVigator package (via git submodule if possible)
  • Load data in DB at DB startup

Add exclusion criteria for samples without collection date

In the ENA dataset we found 105 samples that have an empty collection date. These were otherwise processed properly and have mutations called.

We want to add an exclusion criterion for new samples coming in from both datasets and apply this exclusion retroactively.
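
One way to apply the same rule to incoming and to already loaded samples is a single predicate, sketched below. It assumes samples are available as dictionaries with a `collection_date` key; that shape and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def lacks_collection_date(sample: Dict) -> bool:
    """True when the collection date is missing, None or blank."""
    value = sample.get("collection_date")
    return value is None or not str(value).strip()


def split_by_exclusion(
        samples: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Returns (kept, excluded) using the same rule for every sample."""
    kept: List[Dict] = []
    excluded: List[Dict] = []
    for sample in samples:
        (excluded if lacks_collection_date(sample) else kept).append(sample)
    return kept, excluded
```

Running `split_by_exclusion` over the samples already in the database would give the retroactive list, and calling `lacks_collection_date` at ingestion time would cover new samples.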

Update ENA accessor to include Nanopore data

The accessor queries the ENA REST API and stores each sample's metadata (e.g. URLs to FASTQs, country, collection date, etc.). Optionally, it also downloads the FASTQ files and stores them in the file structure where they can later be found.

  • The accessor needs to stop excluding Nanopore samples when querying the REST API (instrument_platform)
  • The accessor needs to know the library strategies used for Nanopore sequencing (Amplicon, ...)
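
As a rough illustration of the first point, the sketch below builds a search URL for the ENA Portal API that accepts more than one instrument platform. It does not reflect how the CoVigator accessor builds its query, which is not shown here. The list of fields, the helper names and the chosen library strategy are assumptions, and the platform and strategy values should be checked against ENA's controlled vocabulary.

```python
from typing import Sequence
from urllib.parse import urlencode

ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"
SARS_COV_2_TAXON_ID = 2697049  # NCBI taxonomy identifier

# Assumed subset of metadata fields; the real accessor may request others
FIELDS = ("run_accession", "fastq_ftp", "country", "collection_date",
          "instrument_platform", "library_strategy")


def build_ena_query(platforms: Sequence[str],
                    library_strategies: Sequence[str]) -> str:
    """Builds the query expression restricting platform and strategy."""
    platform_clause = " OR ".join(
        f'instrument_platform="{p}"' for p in platforms)
    strategy_clause = " OR ".join(
        f'library_strategy="{s}"' for s in library_strategies)
    return (f"tax_eq({SARS_COV_2_TAXON_ID}) "
            f"AND ({platform_clause}) AND ({strategy_clause})")


def build_ena_url(platforms: Sequence[str],
                  library_strategies: Sequence[str]) -> str:
    """Returns the full search URL; no request is sent here."""
    params = {
        "result": "read_run",
        "query": build_ena_query(platforms, library_strategies),
        "fields": ",".join(FIELDS),
        "format": "json",
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"
```

Calling `build_ena_url(["ILLUMINA", "OXFORD_NANOPORE"], ["AMPLICON"])` yields a URL that no longer filters out Nanopore runs.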

Smooth the lineages plot

In the lineages tab, the second plot shows the abundance of each lineage per day. The lines have a lot of local peaks that hamper the interpretation of the plot.
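
One simple option is a centered moving average over the daily values, sketched below in plain Python. The 7-day default window is an assumption, not a decision taken in this issue, and other smoothers would work too. With pandas, `Series.rolling(window, center=True, min_periods=1).mean()` should give the same kind of result.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 7) -> List[float]:
    """Centered moving average; windows are truncated at both ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Take up to `half` neighbours on each side of position i
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A wider window removes more of the local peaks but also delays and flattens genuine changes in lineage abundance, so the window size is a trade-off.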

Avoid possible SQL injections by using placeholders

Within queries.py some SQL queries are built by string formatting. This is a possible entry point for SQL injection.
SQLAlchemy uses the TextClause object for simple SQL queries. In it, you can specify placeholders that are replaced with values from a dictionary when the query is executed using the execute method.

  • Check compatibility with pandas.read_sql_query
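
A sketch of the placeholder approach, assuming SQLAlchemy 1.4 or later with a matching pandas version and an in-memory SQLite database. The `sample` table, its columns and the function names are hypothetical and unrelated to the actual queries in queries.py. Whether `pandas.read_sql_query` accepts a TextClause with `params` should still be verified against the versions pinned by the project.

```python
import pandas as pd
from sqlalchemy import create_engine, text


def make_example_engine():
    """Creates an in-memory database with two placeholder samples."""
    engine = create_engine("sqlite://")
    with engine.begin() as conn:
        conn.execute(text(
            "CREATE TABLE sample (run_accession TEXT, country TEXT)"))
        conn.execute(
            text("INSERT INTO sample VALUES (:run_accession, :country)"),
            [
                {"run_accession": "RUN_1", "country": "Germany"},
                {"run_accession": "RUN_2", "country": "France"},
            ],
        )
    return engine


def samples_by_country(connection, country: str) -> pd.DataFrame:
    """The value is passed as a bound parameter, never formatted in."""
    query = text(
        "SELECT run_accession FROM sample WHERE country = :country")
    return pd.read_sql_query(
        query, con=connection, params={"country": country})
```

Because the value travels separately from the SQL text, an input such as `Germany' OR '1'='1` is compared literally against the country column and matches no rows.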

Prepare proposal of plots and information to show (slide-based)

Some ideas from our team meeting:

  • Geographical distribution on a map. Challenge: both datasets are biased towards a few countries, so we need to devise some normalisation to alleviate this issue. We can use plotly's Scattergeo for this purpose (https://plotly.github.io/plotly.py-docs/generated/plotly.graph_objects.Scattergeo.html)
  • Top recurrent mutations not in the lineage consensus. Implement a table similar to the top recurrent mutations table, but showing only those mutations observed in a certain lineage that are not described as being part of the lineage. Rationale: these mutations may correspond to technical artifacts or show lineage evolution.
  • Compare two lineages. Venn diagram of the two mutation sets
  • Consensus sequence across all samples belonging to the same lineage. We would need to recalculate the consensus every time a new sample comes in. It would be useful to also compare it with the described consensus sequence. Any difference between both sequences may be due to technical artifacts or legitimate lineage evolution
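
The numbers behind the Venn diagram idea above come down to set operations, as in this sketch. The function name and the mutation identifiers in the usage note are hypothetical placeholders, not real lineage definitions.

```python
from typing import Dict, Iterable, Set


def compare_mutation_sets(first: Iterable[str],
                          second: Iterable[str]) -> Dict[str, Set[str]]:
    """Returns the three regions of a two-set Venn diagram."""
    first_set, second_set = set(first), set(second)
    return {
        "only_first": first_set - second_set,
        "shared": first_set & second_set,
        "only_second": second_set - first_set,
    }
```

Comparing `{"m1", "m2", "m3"}` with `{"m2", "m3", "m4"}` gives `m1` only in the first, `m2` and `m3` shared, and `m4` only in the second. The same function could serve the second idea in the list: passing the mutations observed in a lineage as `first` and the described ones as `second` leaves the undescribed candidates in `only_first`.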

Search bar for lineages

A search bar in the lineages dashboard that allows searching at least for Pangolin identifiers and WHO designations.
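
The matching logic behind such a search bar could be as simple as the sketch below. It assumes the lineages are available as a mapping from Pangolin identifier to WHO designation (or None when there is none); the function name and that input shape are hypothetical.

```python
from typing import List, Mapping, Optional


def search_lineages(lineages: Mapping[str, Optional[str]],
                    term: str) -> List[str]:
    """Case-insensitive substring match on Pangolin id or WHO designation."""
    needle = term.strip().lower()
    if not needle:
        return []
    return sorted(
        pango_id
        for pango_id, who_label in lineages.items()
        if needle in pango_id.lower()
        or (who_label is not None and needle in who_label.lower())
    )
```

With a mapping such as `{"B.1.1.7": "Alpha", "B.1.617.2": "Delta"}`, searching for `delta` returns `["B.1.617.2"]` and searching for `b.1.1` returns `["B.1.1.7"]`.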

Capture feedback from the dashboard

We tried in the past to capture user feedback by providing buttons to create GitHub issues in the acknowledgments section. We have not received anything so far, either because the work is not yet published and we have not reached many people, or because the buttons are somewhat hidden.

It may be helpful to have a more prominent feedback feature, with a visible button available from every location in the dashboard. The article below describes how to do this in a Dash dashboard.
https://medium.com/codex/how-to-create-a-dashboard-with-a-contact-form-using-python-and-dash-ee3aacffd349

Performance of the samples tab is not great

Describe the bug
When switching to the samples tab it takes a long time (more than 10 seconds and less than a minute) to show the data.
The same behaviour occurs when the GISAID source is selected while already in the samples tab. When the ENA source is selected everything is much faster.

To Reproduce
Steps to reproduce the behavior:

  1. Switch to the samples tab

Expected behavior
Something faster!

Additional context
Reproducible from any environment
