variantspark's People

Contributors

arashbayatdev, bauerlab, bhosking, dependabot[bot], lynnlangit, piotrszul, plyte, rocreguant, yatish0833

variantspark's Issues

Fix readthedocs build to include pydocs strings.

In the current readthedocs build, the autogenerated documentation for the Python packages is blank.
I suspect it's because the required dependencies are not available in the readthedocs virtual environment and need to be configured.

Add scalacheck to the build

Add scala check to the build (based on spark one).
Update the build, travis-ci and contribution guides.
Make sure all the scala files conform.

Enable running variant spark on AWS EMR

This will include:

  • variant-spark setup on EMR
  • fixing any issue that may arise (like local filesystem writes etc)
  • scripts to facilitate creation of a variant-spark enabled EMR cluster
  • scripts to submit variant-spark commands to the EMR cluster
  • demo and documentation

Add FAQ to address potential vs-emr install issues for VS on AWS EMR

FAQ for OSX users
Q: Do I need to download the entire source from GitHub to install VS on AWS EMR?
A: No, but it's the simplest way to get the files you need for the install

Q: If I get some kind of permissions error when I attempt to install vs-emr what should I do?
A: Install using sudo -H pip install --user ./python

Q: If I attempt to use vs-emr and I get the command not found error what should I do?
A: Run sudo find / -name "vs-emr" to find the install path, then run vs-emr using the full path, for example sudo /private/var/root/.local/bin/vs-emr

Q: What should I do if I attempt to create an EMR cluster and it fails?
A: Read the error message.
A1: If the error is default config not found, create the directory with mkdir ~/.vs_emr, copy the config file into it with cp conf/config.min.yaml ~/.vs_emr/config.yaml, then edit config.yaml to replace values as required (do NOT use quotes around values).
A2: If the error is unknown options: --<some parameter>,<some value>, etc..., check your version of the awscli client tool with aws --version; the minimum supported version is aws-cli 1.10.22. To upgrade awscli, run sudo -H pip install awscli --upgrade --user
TIP: you can attempt to run the generated awscli code directly in your terminal (i.e. without vs-emr) to get more detailed error messages. An example is at the end of this note. You can remove EMR options as needed, depending on your version of awscli.
screen shot 2018-02-12 at 7 06 15 pm

Q: How do I know that the vs-emr command succeeded in creating a cluster?
A: You will see a "ClusterId" value (as shown below). TIP: Remember to submit your analysis job immediately, as the cluster is set to auto-terminate.

screen shot 2018-02-12 at 7 08 11 pm

Adding ALT field to output file

In the output file of importance analysis, each variant is identified by its site, which is the combination CHR_POS. However, there are cases where multiple bi-allelic variants appear at the same site. It would be great to add the ALT field to the output file so that each variant is identified as CHR_POS_ALT. This would resolve the ambiguity.
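As a rough illustration of the proposed identifier (not the actual VariantSpark output code), the sketch below builds a CHR_POS_ALT key. The function name `variant_id` and the example coordinates are hypothetical.

```python
def variant_id(chrom: str, pos: int, alt: str) -> str:
    """Build a CHR_POS_ALT identifier for one bi-allelic variant."""
    return f"{chrom}_{pos}_{alt}"


# Two bi-allelic variants at the same (made-up) site share CHR_POS
# but get distinct identifiers once ALT is appended.
ids = [variant_id("7", 117559590, "A"), variant_id("7", 117559590, "T")]
print(ids)
```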

Publish the vs-emr to pypi

Convert it to a proper Python package to avoid all the installation issues.
Optionally, publish it to PyPI.

Write concise user documentation

Write a concise user manual using sphinx and make it available at readthedocs.
The audience should be primarily bioinformaticians.
The scope should cover:

  • installing variant spark
  • running importance analysis with the command line and Python APIs
  • references for available functions and options
  • running on cluster and clouds (AWS, Databricks)
  • a simple developer guide (building testing).

The proposed outline is here (CSIRO internal only): https://docs.google.com/document/d/1kIZ69VoDTdQhhC0eLBYM_w3v9bG77ZB2lbOYfdXKaWU/edit

The preview of the version committed to the feature branch [ _i76_docs ] is at: https://variantspark.readthedocs.io/en/i76_docs/

Add API for Pairwise Operations

Implement the following PairWise operations:

  • Distance Metrics (Euclidean and Manhattan)
  • Shared Variants Counts:
    -- SharedAltCount - counts the number of alternative alleles shared by two genotypes. This can be 0, 1 or 2 per variant, summed over all variants.
    -- AtLeastOneAltShardCount - counts the number of variants with at least one shared alt allele between two genotypes.
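The operations listed above could be sketched as follows. This is only an illustration, not the VariantSpark API: it assumes each sample is a list of unphased genotypes encoded as alt-allele counts (0, 1 or 2), so the number of alt alleles two genotypes share at one variant is the smaller of the two counts. All function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[int], b: Sequence[int]) -> float:
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[int], b: Sequence[int]) -> int:
    """Manhattan distance between two genotype vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shared_alt_count(a: Sequence[int], b: Sequence[int]) -> int:
    """Alt alleles shared per variant (0, 1 or 2), summed over all variants."""
    return sum(min(x, y) for x, y in zip(a, b))


def at_least_one_alt_shared_count(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of variants where both samples carry at least one alt allele."""
    return sum(1 for x, y in zip(a, b) if min(x, y) >= 1)


sample_a = [0, 1, 2, 2, 0]
sample_b = [1, 1, 2, 1, 0]
print(manhattan(sample_a, sample_b), shared_alt_count(sample_a, sample_b))
```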

Include in:

  • command line
  • Scala API
  • Hail Integration API

FakeFamily improvements

Some ideas on how to improve fake family generation:

  • Add reading all files from HDFS (including ped and spec)
  • Review and improve hail support for phased genotypes
  • Add one-pass generation of independent mutations (generate mutations for all individuals in one pass and distribute them per individual)
  • Scala/Python API for fake family generation
  • Performance improvements for offspring genotype generation (use indexed sequence rather than HashMap)
  • Add mutations based on fasta (actual sequence file)

AIR - unbiased gini-based importance score - Algorithm

This procedure provides a Gini-based variable importance method that corrects the bias caused by variables having different numbers of categories (minor-allele-frequency bias in GWAS) and also shows some promising results regarding correlation issues.

The idea is to create a pseudo variable for each variable in the dataset by permuting its values, and to add these pseudo variables to the model. The random forest is then run, and the importance of each pseudo variable is subtracted from that of its original variable, which removes the bias.

The pseudo variables are added only conceptually; in practice no variables are added to the model, which saves runtime and memory.

A link to the paper:
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty373/4994791

The procedure is implemented in R in ranger. Here is an example of how to use it:
ranger(data=data,dependent.variable.name = "y", importance = "impurity_corrected")

The "impurity_corrected" option selects this corrected importance.

The procedure works as follows:

  1. Before fitting the RF, perform a single random reordering of the sample IDs.
  2. Instead of sampling mTry variables from {1,...,p}, sample mTry variables from {1,...,2p}.
  3. At a given node in a tree, choose the split variable and split value in the regular manner. If a sampled index i (from the subset in step 2) satisfies 1<=i<=p, use variable X_(i) as usual; if p+1<=i<=2p, use the pseudo variable X_(i-p)*, which is X_(i-p) with the sample IDs permuted as in step 1 (i.e. the labels are permuted).
  4. Calculate the mean Gini decrease as usual (Gini importance) I_G for each of the 2p variables. This results in 2p importance scores.
  5. Calculate the new importance score of each original variable X_(i), 1<=i<=p, as AIR(X_(i)) = I_G(X_(i)) - I_G(X_(i)*)
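The index bookkeeping in the steps above can be sketched in a few lines. This is a minimal illustration, not the ranger or VariantSpark implementation: it uses 0-based indices (so pseudo variables are indices p..2p-1), takes the Gini importances as a given list, and all function names are hypothetical.

```python
import random
from typing import List


def permute_samples(n_samples: int, rng: random.Random) -> List[int]:
    """Step 1: one random reordering of the sample indices, fixed for the whole forest."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return order


def candidate_column(X: List[List[int]], index: int, p: int,
                     permutation: List[int]) -> List[int]:
    """Step 3: values seen by a split candidate with 0-based index in [0, 2p).

    Indices below p are original variables; index p + j is the pseudo
    variable of column j, read through the permuted sample order, so no
    extra column is ever stored.
    """
    if index < p:
        return [row[index] for row in X]
    j = index - p
    return [X[s][j] for s in permutation]


def air_scores(gini: List[float], p: int) -> List[float]:
    """Step 5: importance of each original variable minus that of its pseudo variable."""
    return [gini[i] - gini[i + p] for i in range(p)]


X = [[0, 1], [1, 2], [2, 0], [1, 1]]  # 4 samples, p = 2 variables
p = 2
rng = random.Random(0)
permutation = permute_samples(len(X), rng)
candidates = rng.sample(range(2 * p), 2)  # step 2: mTry = 2 drawn from 2p indices
print(candidates, [candidate_column(X, i, p, permutation) for i in candidates])
```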

Next we want to produce a null distribution for computing p-values. For that we take all the resulting negative AIR importances and mirror them around zero (creating a set of importances made up of all the negative scores and the absolute values of those scores).
We use this set of importance scores to compute an empirical cumulative distribution.

We then check where the importance of each variable falls in this distribution; this gives the p-value for that score. (In simpler words, we have a list of scores made up of the negatives and their absolute values; we sort this list, and the p-value of an importance score is its rank in this list, counted from the largest score down, divided by the length of the list.)

Since the list produced by the mirroring (to which all the variables that were not given a score can be added as 0) should be very big (roughly the number of variables in the model), we can extract very small p-values from it, which should be enough. If it is still not enough, it might be worth running the procedure twice, which doubles the number of scores in the estimated null distribution (in that case the importance scores of the two runs could also be averaged to make the results more stable).
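A small sketch of the p-value computation described above, under one stated assumption: the p-value of a score is taken to be the fraction of null scores that are at least as large as it. The function names and the `n_unscored` parameter (the number of variables without a score, added as zeros) are hypothetical.

```python
from bisect import bisect_left
from typing import List


def mirrored_null(air: List[float], n_unscored: int = 0) -> List[float]:
    """Sorted null scores: the negative AIR values, their mirror images, optional zeros."""
    negatives = [s for s in air if s < 0]
    return sorted(negatives + [-s for s in negatives] + [0.0] * n_unscored)


def p_value(score: float, null: List[float]) -> float:
    """Fraction of null scores that are >= the observed score."""
    if not null:
        raise ValueError("the null distribution is empty")
    n_at_least = len(null) - bisect_left(null, score)
    return n_at_least / len(null)


air = [-0.2, -0.1, 0.05, 0.5]
null = mirrored_null(air)  # [-0.2, -0.1, 0.1, 0.2]
print([p_value(s, null) for s in air])
```

With only a handful of negative scores the smallest non-zero p-value is coarse (1 divided by the list length), which is why a long list matters.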

I'm happy to answer if you have any questions for it as I figure my explanation might not be the best :)

Optimised tree growing method

I recommend the following improvements to the VariantSpark Random Forest importance analysis.

  1. Compute and write the importance scores to a file after every 1000 trees built.

  2. Automatically identify when enough trees have been built. If the first suggestion is implemented, the importance scores at each step (every 1000 trees) can be compared with those computed in the previous step. If little has changed, we can stop building more trees.

  3. Frequently (every -rbs trees) dump the models (built trees) to disk, and allow previously built models to be integrated into a new run. If the process crashes half way, the model produced so far can be used in the next run.
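The stopping rule in suggestion 2 might look like the sketch below. It is an illustration only: `grow_batch` is a hypothetical callback that grows the next batch of trees and returns the cumulative importance scores, and "little change" is assumed to mean that the largest absolute change in any score is below a tolerance `tol`.

```python
from typing import Callable, Dict, Tuple

Importances = Dict[str, float]


def max_abs_change(prev: Importances, curr: Importances) -> float:
    """Largest absolute change in any variable's importance between two snapshots."""
    keys = set(prev) | set(curr)
    return max((abs(curr.get(k, 0.0) - prev.get(k, 0.0)) for k in keys), default=0.0)


def grow_until_stable(grow_batch: Callable[[int], Importances],
                      batch_size: int = 1000,
                      max_trees: int = 10000,
                      tol: float = 1e-3) -> Tuple[int, Importances]:
    """Grow trees in batches; stop once the importances change by less than tol."""
    prev = None
    curr: Importances = {}
    n_trees = 0
    while n_trees < max_trees:
        curr = grow_batch(batch_size)  # hypothetical: grows trees, returns cumulative scores
        n_trees += batch_size
        if prev is not None and max_abs_change(prev, curr) < tol:
            break
        prev = curr
    return n_trees, curr


# Canned snapshots stand in for a real forest in this demo.
snapshots = iter([
    {"v1": 0.5, "v2": 0.10},
    {"v1": 0.4, "v2": 0.12},
    {"v1": 0.4001, "v2": 0.1201},
    {"v1": 0.4, "v2": 0.12},
])
n_trees, final = grow_until_stable(lambda n: next(snapshots), max_trees=4000)
print(n_trees, final)
```

Comparing the ranking of the top variables instead of raw scores would be another reasonable criterion; the tolerance-based check is just the simplest one to show.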

Output trees in JSON format

Implement a readable output format, e.g. JSON, for the trained trees. This will help to evaluate the resulting interactions and visualise them.
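One possible shape for such output is sketched below. The `Node` class and its field names are hypothetical and do not reflect VariantSpark's internal tree representation; the sketch assumes binary trees in which every internal node has both children.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A node of a binary decision tree (hypothetical layout)."""
    variable: Optional[str] = None      # split variable; None for a leaf
    threshold: Optional[float] = None   # split point; None for a leaf
    prediction: Optional[int] = None    # predicted class; set for leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def to_dict(node: Node) -> dict:
    """Convert a tree to nested dicts, keeping only the fields that apply."""
    if node.left is None or node.right is None:
        return {"prediction": node.prediction}
    return {
        "variable": node.variable,
        "threshold": node.threshold,
        "left": to_dict(node.left),
        "right": to_dict(node.right),
    }


def tree_to_json(root: Node) -> str:
    """Serialise one tree as indented JSON."""
    return json.dumps(to_dict(root), indent=2)


tree = Node(variable="1_12345", threshold=0.5,
            left=Node(prediction=0),
            right=Node(variable="2_67890", threshold=1.5,
                       left=Node(prediction=0), right=Node(prediction=1)))
print(tree_to_json(tree))
```

Nesting the split variables this way makes candidate interactions visible as parent/child pairs.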

Add support for ingesting non-VCF data sources using DataBricks scala API

At present only VCF files are exposed using the scala API for ingesting feature data. It would be useful to allow easy ingestion of parquet files as this would broaden the usefulness of VariantSpark beyond genomics. The class ParquetFeatureSource appears to offer this functionality already but is not available within the Databricks environment.

Develop benchmarking datasets

Develop synthetic datasets/models that allow comparing the 'power' and the ability to detect significant variables under various conditions (e.g. interactions, noise, correlated variables, etc.). In particular, compare against traditional GWAS methods such as single-locus logistic regression.
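As a starting point, a toy generator for one such condition (a two-variant interaction with label noise) might look like this. Everything here is an assumption made for illustration: genotypes are drawn uniformly from 0/1/2, which is not realistic allele-frequency modelling, and the function name and parameters are hypothetical.

```python
import random
from typing import List, Tuple


def simulate(n_samples: int, n_variants: int,
             causal: Tuple[int, int] = (0, 1),
             noise: float = 0.1,
             seed: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Simulate genotypes and a binary phenotype driven by a pure interaction.

    A sample is a case only if it carries an alt allele at BOTH causal
    variants; each label is then flipped with probability `noise`.
    """
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_variants)]
                 for _ in range(n_samples)]
    labels = []
    for row in genotypes:
        label = 1 if row[causal[0]] > 0 and row[causal[1]] > 0 else 0
        if rng.random() < noise:
            label = 1 - label  # label noise
        labels.append(label)
    return genotypes, labels


genotypes, labels = simulate(n_samples=200, n_variants=20)
print(sum(labels), "cases out of", len(labels))
```

Because the ground truth is known, the ranks that each method assigns to the causal variants can be compared directly.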

Output file line ending

Currently the output file of importance analysis ("-of") uses Windows line endings ("\r\n").
I suggest using Linux line endings ("\n") instead, which are compatible with Linux tools such as AWK.
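Until the writer is changed, the output can be cleaned up after the fact. The sketch below is a workaround, not part of VariantSpark: it simply drops every carriage return, which assumes the file contains no bare "\r" used as a line break on its own. The function name is hypothetical.

```python
def to_unix_line_endings(text: str) -> str:
    """Drop carriage returns so that every line ends with a single newline."""
    return text.replace("\r", "")


# Made-up two-line excerpt with Windows line endings.
raw = "variable,importance\r\n1_12345,0.42\r\n"
print(repr(to_unix_line_endings(raw)))
```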

NoClassDefFoundError after running VariantSpark (CLI)

After installing the VariantSpark environment (Java, Spark, Scala) and building VariantSpark following the documentation, running VariantSpark from the terminal results in:

Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ManifestFactory$
at au.csiro.variantspark.cli.VariantSparkApp$.main(VariantSparkApp.scala:23)

This can be caused by not setting the SPARK_HOME environment variable to:

export SPARK_HOME=

Using a package manager such as Homebrew may sidestep this problem.

ImportanceAnalysis fails on EMR with 64 core workers and bgz encoded files on S3.

While reading the bgz encoded vcf file from S3 variant-spark fails with the following exception:

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1069)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1237)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:10)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:94)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:39)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:211)
at sun.reflect.GeneratedMethodAccessor330.invoke(Unknown Source)
