bystrogenomics / bystro

Bystro genetic analysis (annotation, filtering, statistics)

License: Apache License 2.0

Perl 33.07% Shell 1.40% Dockerfile 0.18% Rust 2.08% Python 56.41% Cython 0.06% Makefile 0.06% Go 2.73% Jupyter Notebook 4.01%

bioinformatics bioinformatics-algorithms bioinformatics-analysis bioinformatics-databases bioinformatics-pipeline bioinformatics-scripts genomics genomics-search

bystro's Issues

Add minimal VCF file format support (SeattleSeq format)

Enumerate, and strip common missing values

dbSNP: unknown (function)
clinvar: not provided, not specified, no assertion criteria provided, no interpretation for the single variant, no assertion for the individual variant, see cases : akotlar@7df8409 , 7883b7f

While these provide some information, barring evidence to the contrary, I think we shouldn't waste space on their storage.
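A minimal sketch of how such placeholder values could be stripped before storage. The set below is copied from the list above; the function name `strip_missing`, the case-insensitive matching, and applying one shared set to every field are assumptions, not a description of what Bystro does.

```python
from typing import Optional

# Placeholder strings named in this issue. Treating them as one shared,
# case-insensitive set is an assumption made for this sketch.
MISSING_VALUES = {
    "unknown",
    "not provided",
    "not specified",
    "no assertion criteria provided",
    "no interpretation for the single variant",
    "no assertion for the individual variant",
}


def strip_missing(value: Optional[str]) -> Optional[str]:
    """Return None for empty or placeholder values, else the trimmed value."""
    if value is None:
        return None
    trimmed = value.strip()
    if not trimmed or trimmed.lower() in MISSING_VALUES:
        return None
    return trimmed
```

A per-track set (for example, "unknown" only for the dbSNP function field) would be stricter and may be preferable if a placeholder is meaningful in some other field.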

Coerce ints due to Elasticsearch removing coerce option on integers

Due to this decision elastic/elasticsearch#25861

Edit: We need to stay with ES 5.6 for now. 6.x+ remove split_on_whitespace, which dramatically changes queries. In practice, even with the 'split_queries_on_whitespace' option, queries operate very differently.

Improve error messages

When 0 variants are annotated, an error is generated with the message "Couldn't read statistics file". This is true, but it is not the core issue: there is no error, simply no data.

Finish transition to camelCase

The project started off defining snake_case for variables that were configurable at run time via YAML, and camelCase elsewhere, in part because command line users may not have liked or been used to camelCase.

This was stupid and confusing.

Better online db versioning

Store hash of database in YAML after build
Automatically identify available database builds by querying nodes
Show database build date, version in dropdown
Show deprecation messages before switching databases.
Always provide deprecated database for at least 2 weeks after deprecation

Set threshold for p = 1 to .9

https://github.com/akotlar/bystro/blob/d4c952b7f454acad8533cfd0ce522d54bf0698dc/lib/SeqFromQuery.pm#L828

Can make configurable.

Support FLAG types in VCF files

Used in new gnomAD ... segdup / lcr flags.
They appear as:

AC=2;AF=6.52443e-05;AN=...CSQ=A|intergenic_variant|MODIFIER||||||||||||||||1||||SNV|1||||||||||||||||||||||||||||||||||||||||||||;segdup

So we need to check for the presence of a key that has no equals sign.
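A rough sketch of that check, written in Python for illustration only. The function name `parse_info` and the choice to represent a flag as `True` are assumptions; values are left as strings rather than typed from the VCF header.

```python
from typing import Dict, Union

InfoValue = Union[str, bool]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Parse a VCF INFO column; entries without '=' are stored as True."""
    parsed: Dict[str, InfoValue] = {}
    if info in ("", "."):
        return parsed
    for entry in info.split(";"):
        if not entry:
            continue
        # partition splits on the first '=' only; sep is '' for a bare flag.
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed
```

With this, an entry such as `segdup` at the end of the INFO column is kept as a key whose value is `True`, while `AC=2` is kept as the string `"2"`.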

Create Singularity, Kubernetes, etc containers

Docker is popular, but other containers are used. For instance, some at NIH use Singularity

Create Docker container
Create Kubernetes cluster driven by said Docker container
Create Singularity container

typo

https://github.com/akotlar/bystro/blob/9ce714d7a075d240330ab7c928aa7494bb77cb2a/lib/Seq/DBManager.pm#L664

Improve query builder.

In the web app, move from regex to something like PEG/ohm.

Fix Pankaj synonym issue: synonym name should match exactly
Prototype Ohm query syntax

Simplify transaction management

Remove all cleanUp() besides the checkpoints.

In general, how can we utilize LMDB more effectively? This is mostly interesting for the future Go transition, but it feels like our current dbRead vs dbReadCursorUnsafe solution is not completely satisfactory.

Depth of coverage

Alex,

Is there any proposal to also include the depth of coverage statistics in the summary output?

thanks!!

Store region data as array in region db

Currently region data is stored as a hash, but with integer keys; this doesn't seem particularly useful, except in maybe the case that features are split between region and site, but that could be handled in a more deterministic way to reduce the sparsity of the site and region arrays.

Expanded HGVS notation

Bystro currently supports HGVS search in coding regions.

The questions are:

Should we expand HGVS support to non-coding regions?
Should we permanently store the HGVS notation in a tab-delimited field?

Allow database to be built or pulled

Working on GenPro; realizing that it should be easier to start up the program.

Proposal: add a YAML config property, that provides the link to the remote resource where the version of the database specified should be uploaded to, and then pulled.

Something like

repository:
  path: "s3://" or "http://" or "/path/to"
  buildDate: 10/27/18 11:22pm

When the user first uses the config file, the program should check whether the database exists at the given database_dir, and if it does not, fetch it if the repository property exists.

This will allow users to supply custom databases.

Potentially this could be extended to multiple databases. This would mean allowing per-track database configuration (as opposed to having a singleton with a fixed database_dir). It would of course cost access time, but may be reasonable in cases like GenPro, where we may want to allow users to build (or fetch) highly dynamic databases (per experiment). In GenPro's case, the ability to fetch from a remote resource would mean memoization to a remote resource (as opposed to an in-memory data structure).
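The startup check described above might look roughly like the following Python sketch. The config is assumed to be already parsed from YAML into a dict; the function name `plan_database_action`, the three return values, and the rule that a non-empty directory counts as an existing database are all hypothetical.

```python
import os
from typing import Any, Dict


def plan_database_action(config: Dict[str, Any]) -> str:
    """Decide what to do at startup: 'use_local', 'fetch', or 'build'."""
    database_dir = config["database_dir"]
    # Assumption: a non-empty directory means the database already exists.
    if os.path.isdir(database_dir) and os.listdir(database_dir):
        return "use_local"
    # Fetch only when the proposed repository property gives a path.
    repository = config.get("repository") or {}
    if repository.get("path"):
        return "fetch"
    return "build"
```

The actual download (s3://, http://, or a local path) is left out here, since each scheme would need its own client.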

Update tests

nearest-dev branch currently contains most up-to-date tests: https://github.com/akotlar/bystro/tree/nearest-dev

TODO:

Complete integration tests for all tracks (insert / fetch)
- in future revisions explore creating more granular tests
- some of this is limited by architectural choices; inlining -> performance+, but more complex tests
Complete unit tests for DB Manager functions
Create unit test for Output.pm
Create unit tests for less important, clearly working utility functions (like IO package)
Create low-level unit tests for gene track's TX builder
- its function is already verified in gene track tests, but useful for future development
Write tests for fields that use delimiters that are also used as Bystro delimiters; ensure we aren't generating extra fields in subsequent versions. Currently everything works appropriately, but is fragile because of the lack of tests (can verify at bystro.io using hbox/dead)
Write test to check that newline characters are stripped from db-inserted values.

Add tests for mis-sorted files

We had an issue where VCF track builds were being cut short, because those tracks had unexpected chromosomes as an artifact of liftover.

Need to write tests for all tracks, especially those prone to liftover artifacts, showing handling of multiple chromosomes when program expects only specified chromosomes (which is the case when multiple files are present)

Create AWS instance launcher (for Spot market)

Currently we require 2 steps, since user-data is executed as root, and our scripts assume Bystro is being installed in the home directory.

Simplest solution is to install and launch somewhere from root.

A smarter, better long-term solution would be to use cloud-init to allow installation at whichever path is desired.

Utils::LiftOverCadd: allow whitelist

With the release of CADD 1.4, our major use case for liftover goes away until the next human assembly release. However, we still need to lift over the GRCh37 MT to hg19's chrM (pre-patch).

A whitelist will allow this in-app, rather than as a separate processing step.

Add sampleMaf

Contains the number of non-missing alleles at the site. Allows for queries that are maximally flexible. For instance, we could filter variants that are either in gnomAD or are at low frequency in our sample.

Cut b11.0.0 release from master

The master branch is a substantial improvement of the b10 codebase, including a new "nearest" track that uses an ahead-of-time de-duplication strategy to reduce disk space and improve annotation performance, and which allows the calculation of the distance to the nearest features.

Currently used to calculate nearest gene and nearest TSS distances (as well as to list details about those genes/TSSs), and to create a refSeq.gene track, which contains pLi, pNull, pHI, lofTool, GDI, and more.

Furthermore, building now uses LMDB cursors, and is remarkably faster (build times are < 1/2 of b10).

TODOS

"*" May be deferred for first minor (feature) release

** Likely to be deferred to 2nd (feature) release.

What to do with complex variants that are both a deletion and a SNP?

Example from gnomAD:

chr10 723260 rs61831381 GCCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA ACCATCATCACCATGCCCAGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACACCATCATCACCACTCCCCACGTCACATGACAGGGATACAGTACGTGTCAGGGGTTTCACTGTGTGGGAAAAGGTCACGCCATCATCACCATGCCCGGCGTCACGTGACATGGATAGAGTACATGTCAGGGGTATCACTGTGTGGGAAAAGGTCACA,A 2362232.76

Example 2:

chr10 735488 rs56079144 ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT TCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,TCCAGACCCGGGACAGAGTGAGGCT,AGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGACAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT,T,ACCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTAAGGCTCCAGACCCGAAGAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGACAGAGTGAGGCTCCAGACCCGGATAGAGTGAGGCTCCAGACCCGGATAGAAGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGGACAGAGTGAGGCT

Example 3:

chr10 737933 rs534100935 GTAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA ATAGAGTGAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGATAGAGTAAGGCTTCAGACCCAGGTAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGGATAGAGGGAGGCTCCAGACCCGGACAGAGGGAGGCCCCAGACCCGGGACAGAGTGAGGCTCCAGACCCGAATAGAGTAAGGCTCCAGACCCGGA,A

Write md5 hash of tracks configuration to db

Ensure that if the YAML configuration is substantially modified (i.e., the track configuration is modified), the database complains.

This should not include absolute paths, database_dir or files_dir, which may be better suited as environment variables.
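One possible shape for this, sketched in Python. The config is assumed to be already parsed from YAML into a dict; the function name `tracks_config_hash` and the JSON canonicalization are assumptions. Only the two top-level keys named above are excluded, so absolute paths nested inside track definitions would still need handling.

```python
import hashlib
import json
from typing import Any, Dict

# Machine-specific keys that should not affect the stored hash.
EXCLUDED_KEYS = ("database_dir", "files_dir")


def tracks_config_hash(config: Dict[str, Any]) -> str:
    """Return an md5 hex digest of the config, ignoring machine-specific keys."""
    stable = {k: v for k, v in config.items() if k not in EXCLUDED_KEYS}
    # sort_keys makes the digest independent of key order in the YAML file.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

At open time, the stored digest can be compared with the digest of the current config, and a mismatch reported before any reads or writes happen.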

This TODO is really about the initiation of use of blockchain to track state.

VCF builder

Builds VCF file, for use primarily with gnomAD, ExAC, etc.

Support more compressed formats

We may be able to gain decompression efficiency by supporting lz4, bgzip. Block-compressed formats can be decompressed using multiple threads.

DbManager should check if dbdata defined

We expect that the dbmanager will store only structures of data (one track of information at each index).

It is moderately safer, and slower, to check that the site is defined, rather than flash.

Create Docker container

This will be far easier to launch on the command line, and could be useful when we enable private instance launching from https://bystro.io

Cristina student todo's

A mixture of web and local tasks:

Create new save filters, Go or Perl.
Create in-line documentation on web: documentation should appear for new users (or users who haven't seen the function previously), when they are on a page/section with that function. Can be pretty easily written in Angular Material.
Document new fields going up in master (web)
Document new UNIT SEPARATOR (ASCII 31) for overlapping fields
Document filters
Contribute to VCF / plink export
Contribute to Hail integration

Document sites that didn't liftover to hg38 for gnomad

There should be a list of sites/coordinates that didn't lift over from hg19 to hg38, so that, for quality control, missing values at those sites can be separated from missing data that represents private mutations.

Fix b10 hg38 gnomad (early exit)

Default behavior when encountering unexpected chromosomes was to skip and exit early. Fixing this will restore missing hg38 sites.
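The intended behavior, skipping rows on unexpected chromosomes while continuing to read the file, can be illustrated with a small Python sketch. The row shape `(chrom, pos)` and the function name `wanted_rows` are hypothetical and stand in for the real track builder's input.

```python
from typing import Iterable, Iterator, Set, Tuple

Row = Tuple[str, int]


def wanted_rows(rows: Iterable[Row], wanted: Set[str]) -> Iterator[Row]:
    """Yield rows on wanted chromosomes, skipping (not stopping at) others."""
    for chrom, pos in rows:
        if chrom not in wanted:
            # Skip the unexpected chromosome, but keep reading the file.
            continue
        yield (chrom, pos)
```

The same input doubles as a regression test for mis-sorted files: a wanted chromosome that reappears after an unexpected one must still be emitted.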

Build Error on Docker Machine

I'm using docker-machine on Windows 10. When I try to build from the Dockerfile with docker build -t bystro . the build exits with exit code 127 (command not found) when running install/install-go-packages.sh. Additionally, the script produces similar warnings when installing lmdb, but does not exit.

This is my terminal output (at step 11):

Step 11/13 : RUN . install/install-go-packages.sh
 ---> Running in 7700bf77c2b1
: not found install/install-go-packages.sh:
-e

Installing go packages (bystro-vcf, stats, snp)

: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
Made /root/go path
: not found install/install-go-packages.sh:
: not found install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
: not found: install/install-go-packages.sh:
The command '/bin/sh -c . install/install-go-packages.sh' returned a non-zero code: 127

Note, the script does run when I run it manually through the console.

The error is likely caused by my use of docker machine. The default vm that docker-machine creates does not have go installed, and it does fail to execute the script with exit code 127.

Output allele count and allele number

These are useful for finding singletons.

Permission configurability

We currently need to set read permissions on output files, so that processes on other nodes can read them without having the same user/group (files are authorized by web server, inaccessible from outside world without authorization).

TODOS:

Modify permissions on only files owned by Bystro, rather than all in output folder (only an issue if using --temp_dir "/some/path" without --archive)
Allow output permission to be set in YAML config

Add VCF export

Incorporate Dave's script...complicating factor is that it requires sdx files. The obvious solution is to have it read LMDB instead.

Export to VCF format

Will require using the tab-delimited statistics file to get the sample list, plus Dave's VCF converter; simply: tail -n +3 statistics.tsv | cut -f1 > sample_list.txt && seqantToVcf etc.
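For reference, the sample-list step of that pipeline can be sketched in Python. The function name `sample_list` is hypothetical; the layout it assumes (first two lines skipped, sample name in the first tab-delimited column) is taken only from the shell command above.

```python
from typing import Iterable, List


def sample_list(statistics_lines: Iterable[str]) -> List[str]:
    """Return column 1 of every line from the third line onward."""
    samples = []
    for line_number, line in enumerate(statistics_lines, start=1):
        if line_number < 3:
            continue  # same effect as `tail -n +3`
        samples.append(line.rstrip("\n").split("\t")[0])  # same as `cut -f1`
    return samples
```

It accepts any iterable of lines, so an open file handle on statistics.tsv can be passed directly.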

Would be nice to update Dave's program to use LMDB db.

Note that, as it stands, we will keep multiallelics on separate lines. Could add a facility to recombine multiallelics.

Generate sample-list output from bystro-vcf
Add support for sample-list in YAML config, Bystro Seq.pm
Generate sample-list output from bystro-snp
Propagate sample-list during saving from query
Add Dave Cutler's converter program
Make, use Rust implementation

Add support for fam files and case/control male/female allele frequencies

Need to support an optional fam parameter in bystro-snp and bystro-vcf, and of course pass through the fam file during upload.

Store partially-overlapping sparse tracks in join?

We should decide whether the join track for genes should require the gene to be fully covered by the joining track (currently we configure clinvar in our hg19.yml and hg38.yml builds).

Some missing data in refSeq not being undef'd

mRNA field shows up as an empty string in search

Use semantic versioning for both database and program

Right now program version is intimately tied to database version.

We either need to decouple them, or use semantic versioning to track all changes, such that any identified database bugs that require a rebuild increment the corresponding minor version digit.

Revise stripping of delimiters

For instance: RH C/c Polymorphism currently gets transformed in master to RH C c Polymorphism.

We could replace our delimiters with commas or underscores, to preserve the fact that these aren't separate tokens (which google will interpret correctly), and which will allow us to index them as concatenated in elastic.

Ex: RH C/c -> RH C-c or RH C_c would both work well. In google RH C,c works best, returns the same results as RH C/c.
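A small Python illustration of the difference between stripping and replacing. The function name `replace_delimiter` and its defaults are hypothetical, and "/" is used as the delimiter only because it is the one in the example.

```python
def replace_delimiter(value: str, delimiter: str = "/", replacement: str = "_") -> str:
    """Replace a reserved delimiter instead of turning it into a space."""
    return value.replace(delimiter, replacement)


# Current behavior described above: the delimiter becomes a space,
# so "C/c" is split into two separate tokens.
stripped = "RH C/c Polymorphism".replace("/", " ")

# Proposed behavior: the delimiter becomes "_" (or "-"), keeping one token.
joined = replace_delimiter("RH C/c Polymorphism")
hyphenated = replace_delimiter("RH C/c Polymorphism", replacement="-")
```

Here `stripped` is "RH C c Polymorphism", `joined` is "RH C_c Polymorphism", and `hyphenated` is "RH C-c Polymorphism".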

Alternatively our overlapDelimiter could be changed to \\, but I think this makes parsing much more difficult, and should be a last resort.

Edit: By discussion with Thomas, will try \ for now.

Improve upload reliability

Users from Albert Einstein have run into issues with large uploads (tens of GB).

We should add ability to retry chunks
If uploading from s3, we should run the upload completely in background, rather than as a synchronous event that the user needs to keep a connection open during (meaning don’t tie to request/response lifecycle; start upload and return).
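Chunk retry could follow the usual retry-with-backoff pattern, sketched here in Python. The `send_chunk` callable, the choice of `OSError` as the retryable failure, and the default attempt count and delay are all assumptions; the real upload client is not shown.

```python
import time
from typing import Callable


def upload_with_retry(
    send_chunk: Callable[[bytes], None],
    chunk: bytes,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> int:
    """Send one chunk, retrying on OSError; return the attempt that succeeded."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            send_chunk(chunk)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

Because each chunk is retried on its own, a dropped connection late in a large upload costs one chunk rather than the whole file.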

Cc @wingolab

Add pLi scores.

Very important. Seems at least as useful as CADD, and maybe more sensitive.

Add chrPerFile support

This is a low-priority update. Its only benefit is to allow faster skipping of previously-built chromosomes.

Something along the lines of

sub makeChromCheckFunction {
  # $file is passed in only so that it can be named in log messages
  my ($self, $file, $onNew, $onExit) = @_;

  return sub {
    my ($currentChr, $newChr) = @_;

    if( ($currentChr && $currentChr ne $newChr) || !$currentChr ) {
      if($self->chrPerFile) {
        # $currentChr is undefined on the first line of a file; only a change
        # from one defined chromosome to another violates chrPerFile
        if($currentChr && $currentChr ne $newChr) {
          # if the user guarantees one chromosome per file, this is a fatal error
          $self->log('fatal', $self->name . ": Expected one chromosome in $file, found at least 2.");
        }

        # "last FH_LOOP" relies on the caller invoking this closure
        # from inside a loop labeled FH_LOOP
        if(!$self->chrIsWanted($newChr)) {
          $self->log('warn', $self->name . ": $newChr unwanted, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        if(!$self->completionMeta->okToBuild($newChr)) {
          $self->log('warn', $self->name . ": $newChr wanted, but completed, and chrPerFile flag set; exiting file");
          last FH_LOOP;
        }

        $onNew->($currentChr, $newChr);

        return $newChr;
      }

      # Without chrPerFile, return undef so the caller skips this chromosome
      return $self->chrIsWanted($newChr) && $self->completionMeta->okToBuild($newChr) ? $newChr : undef;
    }

    return $currentChr;
  };
}

Investigate use of named databases

In this version, every track would get a separate named database, as opposed to a key in the serialized data structure.

The advantage is a substantially easier insertion model, which will allow us to modularly update the database.

The disadvantage may be read performance and size; each database will need a header; need to investigate size, but may be 16 bytes. Also, we will need to deserialize N times for N tracks, although the deserialization will be simpler.

If annotation performance or database size are substantially impacted, or this change causes significantly higher CPU usage during annotation, the tradeoff will likely not be worth it. Currently on the master branch, build times are 1 day with 3 additional whole-genome tracks (refSeq.gene, nearest.refSeq, nearestTss.refSeq), which cumulatively take ~ 7 hours. We re-run builds no more than once per month.

Update install documentation
Update fields documentation
Add documentation on building

Nearest tssName and tssDist

TODO:

Validate that both nearest.refSeq and nearestTss.refSeq are accurate
Decide whether these track names are ok
Decide whether we report all desired data
Document parsing of these fields (since they are de-duplicated in a way that refSeq isn't).

Add ploidy (het ploidy and homozygote ploidy)

This will be used to allow dropping of samples, without screwing up allele numbers.

We should also include an allele number (maybe "sampleAn") field; this will allow easy updates to homozygosity, heterozygosity and missingness when dropping samples.