
patlas's Introduction

pATLAS logo

Join the chat at https://gitter.im/plasmidATLAS/Lobby


Description

Plasmid Atlas is a web-based tool that empowers researchers to easily and rapidly access information related to the plasmids present in NCBI's RefSeq database. In pATLAS each node (or circle) represents a plasmid, and each link between two plasmids means that those two plasmids share roughly 90% or more average nucleotide identity (the default maximum mash distance of 0.1).

With this tool we have two main goals:

  1. Increase the accessibility of plasmid-relevant metadata and make it easier for users to query that metadata.
  2. Improve the ease of interpreting results from High Throughput Sequencing (HTS) for plasmid detection.

Citation

Tiago F Jesus, Bruno Ribeiro-Gonçalves, Diogo N Silva, Valeria Bortolaia, Mário Ramirez, João A Carriço; Plasmid ATLAS: plasmid visual analytics and identification in high-throughput sequencing data, Nucleic Acids Research, gky1073, https://doi.org/10.1093/nar/gky1073

Documentation

If you are interested in learning how to use pATLAS, please refer to the GitBook documentation.


Development

Dependencies

  • Mash (2.0) - You can download Mash version 2.0.0 directly here: linux and OSX. Other releases were not tested, but may be downloaded from the Mash GitHub releases page.

  • Postgresql (>= 10.0) - These scripts use a PostgreSQL database to store the results: releases page

  • Python 3 and its corresponding pip.

  • To install all other dependencies just run: pip install -r requirements.txt

Backend Scripts

MASHix.py

MASHix.py is the main script to generate the database. It generates a matrix of pairwise comparisons between the sequences in the input fasta file(s). Note that it reads multifastas, i.e., each header in the fasta is treated as a reference sequence.

Options:

Main options:
'-i','--input_references' - 'Provide the input fasta files to parse.
                            These inputs will be joined into a
                            master fasta.'

'-o','--output' - 'Provide an output tag'

'-t', '--threads' - 'Provide the number of threads to be used'

'-db', '--database_name' - 'This argument must be provided as the last
argument. It states the database name that must be used.'
MASH related options:
'-k','--kmers' - 'Provide the k-mer size to be passed to mash
                sketch. Default: 21'

'-p','--pvalue' - 'Provide the p-value to consider a distance
                significant. Default: 0.05'

'-md','--mashdist' - 'Provide the maximum mash distance to be
                    included in the matrix. Default: 0.1'
Other options:
'-no_rm', '--no-remove' - 'Specify if you do not want to remove the
                        output concatenated fasta.'

'-hist', '--histograms' - 'Check the distribution of distance
                        values by plotting histograms.'

'-non', '--nodes_ncbi' - 'specify the path to the file containing
                        nodes.dmp from NCBI'

'-nan', '--names_ncbi' - 'specify the path to the file containing
                        names.dmp from NCBI'

'--search-sequences-to-remove' - 'Run only the part of the script
                                 that is required to generate the
                                 filtered fasta. This allows, for
                                 instance, debugging sequences that
                                 shouldn't be removed by the 'cds'
                                 and 'origin' keywords.'
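
As an illustration, a hypothetical full run could look like the following (the input fastas, taxonomy dumps, output tag and database name are placeholders, not files shipped with this repo):

# e.g.
MASHix.py -i plasmids_1.fas plasmids_2.fas -o my_run -t 4 \
    -k 21 -p 0.05 -md 0.1 -non nodes.dmp -nan names.dmp \
    -db my_database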

Database customization:

I don't like the database name! How do I change it?

Go to db_manager/config_default.py and edit the following line:

SQLALCHEMY_DATABASE_URI = 'postgresql:///<custom_database_name>'
I don't like the table name inside the database! How do I change it?

Go to db_manager/db_app/models.py and edit the following line:

 __tablename__ = "<custom_table_name>"

Database migration from one server to another

Database export
pg_dump <db_name> > <file_name.sql>
Database import
psql -U <user_name> -d <db_name> -f <file_name.sql>
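
For example, moving a database named patlas_db (a hypothetical name) to a new server, assuming the default postgres user, could look like:

# on the old server
pg_dump patlas_db > patlas_db.sql
# on the new server - the target database must exist before importing
createdb patlas_db
psql -U postgres -d patlas_db -f patlas_db.sql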

Supplementary scripts

abricate2db.py

This script inherits a class from ODiogoSilva/Templates and uses it to parse ABRicate outputs, dumping them to a psql database depending on the input type provided.

Options:

"-i", "--input_file" - "Provide the abricate file to parse to db.
                        It can accept more than one file in the case of
                        resistances."

"-db_psql", "--database_name" - "his argument must be provided as the
                                last argument. It states the database
                                name that must be used."

"-db", "--db" - "Provide the db to output in psql models."

"-id", "--identity" - "minimum identity to be reported to db"

"-cov", "--coverage" - "minimum coverage do be reported to db"

"-csv", "--csv" - "Provide card csv file to get correspondence between
                    DNA accessions and ARO accessions. Usually named
                    aro_index.csv. By default this file is already
                    available in patlas repo with a specific path:
                    'db_manager/db_app/static/csv/aro_index.csv'"

taxa_fetch.py

This script is located in the utils folder and can be used to generate a JSON file with the corresponding taxonomic tree. For a given species, it fetches the genus, family and order to which it belongs. Note: for plasmids, some filtering is applied to the resulting taxids and list of species, which other users may want to skip.

Options:

-i INPUT_LIST, --input_list INPUT_LIST
                        provide a file with a list of species. Each
                        species should be on its own line.
-non NODES_FILE, --nodes_ncbi NODES_FILE
                        specify the path to the file containing
                        nodes.dmp from NCBI
-nan NAMES_FILE, --names_ncbi NAMES_FILE
                        specify the path to the file containing
                        names.dmp from NCBI
-w, --weirdos         This option allows the user to add checks for
                        weird entries. This is mainly used to parse the
                        plasmids refseq, so if you do not want this to
                        be used, use this option.
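
As an illustration, a minimal invocation could look like the following (species_list.txt is a hypothetical file with one species per line; nodes.dmp and names.dmp come from the NCBI taxonomy dump):

# e.g.
taxa_fetch.py -i species_list.txt -non nodes.dmp -nan names.dmp -w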

List of entries that will be filtered by the weirdos option:
  • From taxonomy levels:

    • "bug"
    • "insect"
    • "angiosperm"
    • "fungus"
    • "cnidarian"
    • "mudpuppy"
    • "mantid"
    • "mussel"
  • From species in fasta headers:

    • "orf"
    • "unknown"
    • "Uncultured"
    • "uncultured"
    • "Peanut"
    • "Pigeon"
    • "Wheat"
    • "Beet"
    • "Blood"
    • "Onion"
    • "Tomato"
    • "Zea"
    • "Endosymbiont"
    • "Bacillaceae"
    • "Comamonadaceae"
    • "Enterobacteriaceae"
    • "Opitutaceae"
    • "Rhodobacteraceae"
    • "Bacterium"
    • "Endophytic"
  • It also attempts to fix some bugs in species naming like the following:

    • "B bronchiseptica"
    • "S pyogenes"

Note: Yes, people like to give interesting names to bacteria...

pATLAS API

Schematics of the pATLAS database creation

Workflow for database creation

  1. Download the plasmid sequences available in NCBI RefSeq.
  2. Extract the fasta from the tar.gz archive.
  3. Download and extract the NCBI taxonomy, which will be fed to pATLAS.
  4. Clone this repository:
git clone https://github.com/tiagofilipe12/pATLAS
  5. Install its dependencies.

  6. Configure the database:

createdb <database_name>
pATLAS/patlas/db_manager/db_create.py <database_name>
  7. Run MASHix.py - the output will include a filtered fasta file (master_fasta_*.fas).
  8. Run ABRicate with the CARD, ResFinder, PlasmidFinder and VFDB databases:
# e.g.
abricate --db card <master_fasta*.fas> > abr_card.tsv
abricate --db resfinder <master_fasta*.fas> > abr_resfinder.tsv
abricate --db vfdb <master_fasta*.fas> > abr_vfdb.tsv
abricate --db plasmidfinder <master_fasta*.fas> > abr_plasmidfinder.tsv
  9. Download the CARD index required by the abricate2db.py script (aro_index.csv).
  10. Run abricate2db.py, using all the previous tsv files as input:
# e.g.
abricate2db.py -i abr_plasmidfinder.tsv -db plasmidfinder \
    -id 80 -cov 90 -csv aro_index.csv -db_psql <database_name>
  11. Dump the database to an sql file.

Automation of these steps

These steps are fully automated in the nextflow pipeline pATLAS-db-creation.
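
For instance, assuming the pipeline is published on GitHub under the same handle as pATLAS, it could be launched with:

# hypothetical invocation - check the pipeline's README for its parameters
nextflow run tiagofilipe12/pATLAS-db-creation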

Creating a custom version of pATLAS

If you want to add your own plasmids to the pATLAS database without asking for them to be added to the pATLAS website, you can provide custom fasta files when building the database, using the -i option of MASHix.py, and then follow the steps described above.

Run pATLAS locally

Docker compose

You can run pATLAS locally without many requirements by using patlas-compose. This will automatically handle the installation of version 1.5.2 of pATLAS and launch the service in a local instance. For that, you just require Docker and Docker Compose installed.

Then, follow these simple steps:

  • Clone the patlas-compose repository:
git clone https://github.com/bfrgoncalves/patlas-compose
  • Enter the patlas-compose folder
cd patlas-compose
  • Launch the compose:
docker-compose up
  • Wait for the line * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit) to show up, meaning that the service is now running.

  • Access on 127.0.0.1:5000 or 0.0.0.0:5000.

Note: This methodology is highly recommended.

From scratch

pATLAS can be run locally if you have PostgreSQL installed and configured. After that, you just need to:

  1. Clone this repository:
git clone https://github.com/tiagofilipe12/pATLAS
  2. Create your custom database version, generate the default pATLAS database, or download the sql file from version 1.5.2 (the tar.gz archive). Note: if you download the sql file from version 1.5.2 you may skip steps 3 to 4 and continue with step 5.

  3. Make sure all the necessary files are in place.

  • By default pATLAS generates an import_to_vivagraph.json file in the folder <tag_provided_to_o_flag>/results. Place this file in the patlas/db_manager/db_app/static/json folder.
  • Make the session read the new import_to_vivagraph.json file by changing a variable named devel from false to true in patlas/db_manager/db_app/static/js/pATLASGlobals.js (see the sketch below).
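
A one-liner sketch of that toggle, assuming the variable is declared literally as devel = false in that file (a hypothetical pattern - check the actual declaration before running):

# hypothetical; adjust the pattern to the real declaration
sed -i 's/devel = false/devel = true/' patlas/db_manager/db_app/static/js/pATLASGlobals.js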
  4. Create the database that the front end will run on:
createdb <your_database>
  5. Load the generated sql file.

  6. Install backend dependencies:

# within the root directory of this repository
pip install -r requirements.txt
  7. Install frontend dependencies:
# change directory to the static directory where `index.html` will look
# for its dependencies
cd patlas/db_manager/db_app/static/
# then install them (package.json is located in this directory)
yarn install
  8. Compile the node modules into something the html can understand, using webpack:
# You can also use a local installation of webpack.
# entry-point.js is the config file where all the imported modules are
# called
node_modules/webpack/bin/webpack.js entry-point.js
  9. Then execute the script run.py:
# within the root directory of this repository
cd patlas/db_manager
./run.py <your_database>

Note: the database name is extremely important, since it tells the frontend where to fetch the data.

  10. Go to 127.0.0.1:5000.

Optimization of resource usage by the web page

Using devel = true isn't very efficient, so you can let the force directed graph render in a devel = true session; when you are satisfied, pause the force layout using the buttons available in pATLAS and press Shift+Ctrl+Space at the same time. This will take a while, but eventually it will generate a file named filtered.json. Once you have this file, add it to the patlas/db_manager/db_app/static/json folder and change the devel variable back to false. pATLAS will then use the previously saved positions to render a pre-rendered network.

patlas's People

Contributors

bfrgoncalves, cimendes, odiogosilva, tiagofilipe12


patlas's Issues

taxa_fetch.py refactor

taxa_fetch.py should be refactored so that it is loaded by MASHix.py instead of running separately. This implies that the doc dictionary will carry the taxa information and be committed just once, rather than removing the previous entry and adding a new one each time taxa information is added to the psql database.

database cleanup

For some reason the latest NCBI plasmid database (from 20/7/2017) has genes mixed in with the plasmid sequences. To remove them, search the headers for CDS and match the string using .lower(), because both "CDS" and "cds" occur.
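
A minimal shell sketch of such a check, assuming the concatenated fasta is named master_fasta.fas (a hypothetical name):

# list headers that contain "cds" in any casing
grep '^>' master_fasta.fas | grep -i 'cds'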

Add export image

Add an option to export current visualization (as pdf, png and jpg).

Minimap

Add a mini-map to the bottom-right corner.

Dark mode

Add a dark mode to visualization.

progress bar broken

The progress bar became broken after inserting a pool.join() to wait for the multiprocessing process to finish.

Duplicated links removal needs refactor

Currently, duplicated links are being removed in the js front end; however, this could be done more efficiently in the python back end while creating the json file with all entries, with something like this gist. hash() could be used to improve script efficiency, but it may not be worth it given that the strings are small (needs testing).

Also, it should be considered whether the json file should follow a structure more similar to the database: {acc: { length: x, links: [a, b, c]}}. This would be nicer for js to parse, but it would require more refactoring on the front end side.

Fix plasmid names

Plasmid names are being retrieved as something like pLMG9303 instead of pLMG930.3. The database needs to be re-worked in order to correct this issue.

reader is not defined

When clicking cancel selection in the file modals, the reader variable is not defined, which makes the button useless.

add new taxa tree

Add new taxa_tree.json file to populate the taxa menus within the app.

Add cluster visualization

A way to cycle between clusters should be implemented; there is already a way to search for accessions, which could help to find the cluster associated with a given sequence.

Linked node selection

When two nodes are selected on mouse click, after deselecting one, the linked node is also deselected despite the initial node still being selected.
A check has to be implemented to see if the linked node is still selected by another node.

Remove gifs

Remove example gifs that are not used anymore.

Update database

Before releasing the full database, it should be updated from NCBI, given that this database receives updates every 3 months, which often breaks fasta parsing.

error while filtering with no taxa filters

When trying to submit while no taxa filters are applied, an error message is raised:

Uncaught ReferenceError: assocFamilyGenusGenus is not defined
    at HTMLButtonElement.<anonymous> (visualization_functions.js:855)
    at HTMLButtonElement.dispatch (jquery-3.1.1.js:5201)
    at HTMLButtonElement.elemData.handle (jquery-3.1.1.js:5009)

Although this doesn't affect the final result and a proper warning is raised for the user, error messages in the console should be avoided, so the instances where assocFamilyGenus, assocOrderGenus and assocGenus are undefined should be handled.

more than 20 colors

Currently the visualization has no support for more than 20 colors for each taxa level. This should be addressed in future versions.

multi-level selection issue

Multi-level selection of taxa has an issue when all 4 levels are selected, rendering no selection at all.

conflict between legends and reset buttons

When the read filter legend is triggered, and taxa filters are then appended to the legend, the list of all species present in the legend is not removed until the next instance of taxa filters.

Concurrency

Nodes being added asynchronously causes the browser to freeze in Firefox and on PCs with fewer resources.

Tried to implement concurrency like this:

const limit = 10
let running = 0

const scheduler = () => {
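  // intent: keep at most `limit` addAllNodes calls in flight; each
  // completion decrements `running` and re-invokes scheduler to top up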
  while(running < limit && json.nodes.length > 0) {
     const array = json.nodes.shift()
     console.log(array)
     addAllNodes(array, () => {
     running--
     if (json.nodes.length > 0) {
       scheduler()
      }
    })
    running++
  }
}

scheduler()

This fails with "too much recursion" because scheduler is being executed inside scheduler.

add loading information for plots

Linked with #74. Plots should benefit from loading information where the user can see the queries that are being made and the ones that have already completed.

Memory overkill

When many sequences are given as input, pairwise comparisons can become very intensive, and the function mash_distance_matrix stores a lot of entries, which might consume a lot of memory.

Add circular plots for coverage

Taking the results from the samtools depth file generated by PlasmidCoverage, it would be nice to generate a plot with the coverage depth of all positions of a given plasmid.
However, this should be done only for the results under the defined cutoff of the PlasmidCoverage script, in order to avoid an overload of information.

We should check if plotly or any other js library has implemented any kind of circular histogram that we can re-use.

distance filters after re-run

Distance filters after a re-run currently don't have the actual distance value (the database only has the accession), so it would be important to populate the database with the accession numbers plus distances.

Currently this has the following structure:

{"significantLinks": ["NC_010869_1", "NC_025192_1"], .... }

However, a more nested structure, with name and distance linked together (e.g. accession|distance instead of accession), would be easier to implement in a first instance.

Add labels to nodes

Metadata such as the accession number could be displayed in a label next to the corresponding node, as a way to quickly visualize it. However, this might be very confusing... but perhaps there is another way.

This would be very useful for displaying images outside pATLAS, as png or jpg.

Simplify requirements

Requirements include a lot of unused packages, and versioning should be handled more loosely than it is at the moment.

Add a slider for coverage

Coverage results could have a slider similar to the length filters, enabling the user to select and deselect previously selected nodes with a certain coverage.
The legend should also be updated while interacting with this slider, but only upon submitting a definitive range of coverage percentages.

change filtering

Right now the filter iterates through all nodes and removes those that don't have a color attributed or a link to a colored node. However, this behavior results in slow loading times, and should therefore be replaced by queries to the database that retrieve the information on the nodes and generate a new, smaller json to render a new instance of the graph.

four additional nodes spamming links in visualization.html

In the example provided in modules/dict_temp_005_l4.json, four additional links are being created that link to every node. From a total of 5384 sequences retrieved in python, 5388 nodes are being created, of which 4 connect to every other node.
Note that currently only 4 links are being stored in the json file, so visualization.html should not have nodes with more than 4 links and should have 5384 nodes instead of 5388.

add UI for graph control

A UI to control the graph of the vivagraph display may help establish a better visualization. Therefore, add a div that allows specifying and changing parameters for the vivagraph layout.

add multi drag and drop for nodes on selected nodes

A way to implement this could be to get the relative positions of all selected nodes to each other; then, when the dragged node gets a new position, all the others would be set by their relative position to the dragged node.

Display metadata box

A metadata box could be displayed on some click event (a button or something else). This metadata could show:

Already available (check the listGiFilter variable):

  • number of nodes per species
  • number of nodes per genus
  • number of nodes per family
  • number of nodes per order
  • number of nodes per length

Still needing implementation:

  • number of nodes with a given resistance gene / plasmid family
