
basset's Introduction

Basset

Deep convolutional neural networks for DNA sequence analysis.

Basset provides researchers with tools to:

  1. Train deep convolutional neural networks to learn highly accurate models of DNA sequence activity such as accessibility (via DNaseI-seq or ATAC-seq), protein binding (via ChIP-seq), and chromatin state.
  2. Interpret the principles learned by the model.

Read more about the method in the manuscript here:

DR Kelley, J Snoek, JL Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research 26 (7), 990-999.

As well as the follow-up work here:

DR Kelley, YA Reshef, M Bileschi, D Belanger, CY McLean, J Snoek. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Research 28 (5), 739-750.

This follow-up work has an associated repo where continued development on this toolkit now occurs. You can run Basset-style peak prediction using Basenji, and I recommend using that software because I can better support it. See here.


Installation

Basset has a few dependencies because it uses both Torch7 and Python and takes advantage of a variety of packages available for both.

First, I recommend installing Torch7 from here. If you plan on training models on a GPU, make sure that you have CUDA installed; Torch should find it.

For the Python dependencies, I highly recommend the Anaconda distribution. The only library missing is pysam, which you can install through Anaconda or manually from here. You'll also need bedtools for data preprocessing. If you don't want to use Anaconda, check out the full list of dependencies here.

Basset relies on the environment variable BASSETDIR to orient itself. In your startup script (e.g. .bashrc), write

    export BASSETDIR=the/dir/where/basset/is/installed

To make the code available for use in any directory, also write

    export PATH=$BASSETDIR/src:$PATH
    export PYTHONPATH=$BASSETDIR/src:$PYTHONPATH
    export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
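
A quick way to verify the setup (a minimal sketch, not part of Basset itself): check from Python that the variable is visible before running the pipeline scripts.

    import os

    # Minimal sanity check: confirm BASSETDIR is visible to the pipeline scripts.
    basset_dir = os.environ.get('BASSETDIR')
    assert basset_dir, 'BASSETDIR is not set; add the export to your startup script'
    print(os.path.join(basset_dir, 'src'))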

To download and install the remaining dependencies, run

    ./install_dependencies.py

Alternatively, Dr. Lee Zamparo generously volunteered his Docker image.

To download and install additional useful data, like my best pre-trained model and public datasets, run

    ./install_data.py

Documentation

Basset is under active development, so don't hesitate to ask for clarifications or additional features, documentation, or tutorials.


Tutorials

These are a work in progress, so forgive incompleteness for the moment. If there's a task that you're interested in that I haven't included, feel free to post it as an Issue at the top.

basset's People

Contributors

davek44, mlbileschi


basset's Issues

Training always converges after 2 or 3 epochs

Do you have any suggestions on exploring the hyper-parameter space in terms of learning rate, etc.? My concern is that with Basset the prediction error on the validation set always converges after 2 or 3 epochs. My guess is that selecting better hyper-parameter values will improve the prediction error. Have you encountered this issue?

Thanks,

  • Gabriel
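
For what it's worth, a hedged sketch of a small random search over the learning rate. The learning_rate key mirrors the parameter dumps printed by basset_train.lua elsewhere on this page, but the one-key-per-line job-file format is an assumption; verify it against models/pretrained_params.txt. learn.h5 is a placeholder for your own training file.

    import random
    import subprocess

    # Hedged sketch: write one-key job files and launch basset_train.lua for each.
    # The 'learning_rate <value>' line is an assumption about the job-file format.
    for trial in range(5):
        lr = 10 ** random.uniform(-4, -2)
        job_file = 'params_%d.txt' % trial
        with open(job_file, 'w') as f:
            f.write('learning_rate %f\n' % lr)
        subprocess.call(['basset_train.lua', '-job', job_file, '-stagnant_t', '10',
                         '-save', 'cnn_%d' % trial, 'learn.h5'])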

Wrong size Error while running basset_train.lua with my own dataset

Hi Dave,
I followed the instructions mentioned in the tutorial file: new_data_iso.ipynb
In my dataset, I have only one positive bed file and one negative bed file; after preprocessing I got:
68534 training sequences
6000 test sequences
6000 validation sequences
saved in learn_cd4.h5 (sequence length 600, as you suggested).
Then I created a blank params.txt to hold the training parameters.
In the last step I tried to run "basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5"

I got the following error:

[liuqiao@g01 mydata]$ basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5
{}
seq_len: 600, filter_size: 10, pad_width: 9
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> output]
(1): nn.SpatialConvolution(4 -> 10, 10x1, 1,1, 4,0)
(2): nn.SpatialBatchNormalization (4D) (10)
(3): nn.ReLU
(4): nn.Reshape(6000)
(5): nn.Linear(6000 -> 500)
(6): nn.BatchNormalization (2D) (500)
(7): nn.ReLU
(8): nn.Linear(500 -> 1)
(9): nn.Sigmoid
}
/home/liuqiao/torch/install/bin/luajit: /home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:67:
In 4 module of nn.Sequential:
/home/liuqiao/torch/install/share/lua/5.1/torch/Tensor.lua:466: Wrong size for view. Input size: 128x10x1x599. Output size: 128x6000
stack traceback:
[C]: in function 'error'
/home/liuqiao/torch/install/share/lua/5.1/torch/Tensor.lua:466: in function 'view'
/home/liuqiao/torch/install/share/lua/5.1/nn/Reshape.lua:46: in function </home/liuqiao/torch/install/share/lua/5.1/nn/Reshape.lua:31>
[C]: in function 'xpcall'
/home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/liuqiao/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/liuqiao/Basset/src/convnet.lua:1030: in function 'opfunc'
/home/liuqiao/torch/install/share/lua/5.1/optim/rmsprop.lua:35: in function 'rmsprop'
/home/liuqiao/Basset/src/convnet.lua:1063: in function 'train_epoch'
/home/liuqiao/Basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...qiao/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/liuqiao/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/liuqiao/Basset/src/convnet.lua:1030: in function 'opfunc'
/home/liuqiao/torch/install/share/lua/5.1/optim/rmsprop.lua:35: in function 'rmsprop'
/home/liuqiao/Basset/src/convnet.lua:1063: in function 'train_epoch'
/home/liuqiao/Basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...qiao/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0

It reported "Wrong size for view". I've tried to read the source code in basset_train.lua, but I'm still not sure what's wrong. I have to say I'm not very familiar with Torch and Lua. I guess maybe some command-line parameters are wrong, or I have to revise some source code? I really hope you can help me figure it out or give me some suggestions.
Thanks a lot!
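
For reference, the 599 in the error matches the standard convolution output-width arithmetic; a minimal sketch (not Basset's internals), using the sizes from the model dump above:

    # out = (in + 2 * pad - filter) / stride + 1, the usual conv formula
    seq_len, filter_size, pad, stride = 600, 10, 4, 1   # per the dump: 10x1, pad 4,0
    conv_out = (seq_len + 2 * pad - filter_size) // stride + 1
    print(conv_out)        # 599: an even filter width with pad=4 loses one position
    print(10 * conv_out)   # 5990 elements per sample, while nn.Reshape expects 6000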

Training takes too long

Hi!
I am trying to train Basset with my own data, but I have been running into several issues with the GPUs, so I decided to try the normal nodes in our cluster. However, it's been running for 4 days now, and I have far fewer sequences than the amount used in the paper. I don't know whether something went wrong without killing the process and it's just halted, or whether it really takes that long to run.

All the best,
Laura

Use in non-human species

I am wondering whether the software can be used on non-human species. If not, can we train a model for other species? Many thanks!

import error quantile

Hi,
In PlotRoc.py you imported

    from stats import quantile

but I got an error. Where does the stats package come from?
Thanks

trouble with saturated mutagenesis

Hi there,

Sorry, feels like I'm ploughing through the functions and hitting everything that can go wrong ;]

Trying to use the saturated mutagenesis, I ran into problems and tried to fall back on the tutorial's chr7 hox ctcf site example location:

~/toolsBasset/$ basset_sat.py -t 3 -n 200 -o satmut_hox cnn_full_with_prim_erythroid/dnase_with_prim_erythroid_cnn_best.decuda.th ./hox_test_seq.fa basset_sat_predict.lua -center_nt 200 cnn_full_with_prim_erythroid/dnase_with_prim_erythroid_cnn_best.decuda.th satmut_hox/model_in.h5 satmut_hox/model_out.h5

Predicting sequence 1 variants
Traceback (most recent call last):
File "/home/ron/tools/Basset/src/basset_sat.py", line 367, in
main()
File "/home/ron/tools/Basset/src/basset_sat.py", line 206, in main
g = sns.heatmap(preds_heat, vmin=0, vmax=1, linewidths=0, xticklabels=target_labels, yticklabels=False, cbar_kws={"orientation": "horizontal"})
File "/home/ron/anaconda2/lib/python2.7/site-packages/seaborn/matrix.py", line 452, in heatmap
mask)
File "/home/ron/anaconda2/lib/python2.7/site-packages/seaborn/matrix.py", line 140, in init
self.xticklabels = xticklabels[xstart:xend:xstep]
TypeError: 'NoneType' object has no attribute 'getitem'

~/toolsBasset/$ ll satmut_hox/
total 1092
drwxrwxr-x 2 ron ron 4096 Jun 20 12:59 ./
drwxrwxr-x 11 ron ron 12288 Jun 20 13:04 ../
-rw-rw-r-- 1 ron ron 6944 Jun 20 13:06 model_in.h5
-rw-rw-r-- 1 ron ron 1058144 Jun 20 13:06 model_out.h5
-rw-rw-r-- 1 ron ron 0 Jun 20 13:06 preds.txt
-rw-rw-r-- 1 ron ron 0 Jun 20 13:06 table.txt

I was trying to run with my own trained model (your 164 selected tissues + a test set of our own). Trying the same with your provided pre-trained model results in exactly the same error output, independent of the selected cell type to predict.

Also noteworthy: randomly selecting and mutating sequences from the already extracted sequences (as described in the sat_mut tutorial) seems to work fine.

Cheers,
Ron

Error in running basset_sat_vcf.py

I am trying to run basset_sat_vcf.py, but it returns errors.
Here is the cmd:
software/Basset-master/src/basset_sat_vcf.py -t 6 -o /sat model_default_1k/dnacnn_best.th Test_run/rs13336428.vcf
And errors:

/opt/torch/install/bin/lua: /opt/torch/install/share/lua/5.2/nn/Container.lua:67:
In 13 module of nn.Sequential:
/opt/torch/install/share/lua/5.2/torch/Tensor.lua:466: Wrong size for view. Input size: 4x200x1x13. Output size: 4x4200
stack traceback:
        [C]: in function 'error'
        /opt/torch/install/share/lua/5.2/torch/Tensor.lua:466: in function 'view'
        /opt/torch/install/share/lua/5.2/nn/Reshape.lua:46: in function </opt/torch/install/share/lua/5.2/nn/Reshape.lua:31>
        [C]: in function 'xpcall'
        /opt/torch/install/share/lua/5.2/nn/Container.lua:63: in function 'rethrowErrors'
        /opt/torch/install/share/lua/5.2/nn/Sequential.lua:44: in function </opt/torch/install/share/lua/5.2/nn/Sequential.lua:41>
        (...tail calls...)
        /opt/Basset/src/convnet.lua:509: in function 'predict'
        ...software/Basset-master/src/../src/basset_sat_predict.lua:75: in main chunk
        [C]: in function 'dofile'
        /opt/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /opt/torch/install/share/lua/5.2/nn/Container.lua:67: in function 'rethrowErrors'
        /opt/torch/install/share/lua/5.2/nn/Sequential.lua:44: in function </opt/torch/install/share/lua/5.2/nn/Sequential.lua:41>
        (...tail calls...)
        /opt/Basset/src/convnet.lua:509: in function 'predict'
        ...software/Basset-master/src/../src/basset_sat_predict.lua:75: in main chunk
        [C]: in function 'dofile'
        /opt/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?
Traceback (most recent call last):
  File "/data/gpei/software/Basset-master/src/basset_sat_vcf2.py", line 206, in <module>
    main()
  File "/data/gpei/software/Basset-master/src/basset_sat_vcf2.py", line 88, in main
    hdf5_in = h5py.File(options.model_hdf5_file, 'r')
  File "/data/gpei/software/python2.7_module/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/data/gpei/software/python2.7_module/h5py/_hl/files.py", line 142, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = '/data/gpei/19_1.CNN/2.model_default_1k/4.sad/sat/model_out.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Can anyone help me?
Thanks!

Error on preprocess_features.py

Hi
when I run preprocess_features.py,
I get an index-out-of-range error at line 76:

['>chr1']
Traceback (most recent call last):
File "./preprocess_features.py", line 420, in
main()
File "./preprocess_features.py", line 76, in main
chrom_lengths[a[0]] = int(a[1])
IndexError: list index out of range

So I printed a, and the result is
['>chr1']

Is there anything I missed?
Thanks!

Exporting lua path

So, I know that you've already bumped into the issue with exporting the LUA path aka:
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
I'm still trying to figure out why it refuses to export when placed in the startup script.

But this is the last stopping point in making an executable Jupyter notebook.

First Epoch Loss is Bigger than Expected

Hi David,

Thanks for developing the powerful Basset. I've tried to apply it on my splicing motif study and the result is quite positive. However, I have a little confusion about the math behind:

My train and valid loss at the first epoch is about 1.9. As I have three classes in this study, and I think binary cross entropy is used as the default loss function, this should give me a starting loss of no more than ln(3), which is around 1.1. So I don't know why all my models start at 1.9, which is way higher than the expected maximum of 1.1.

Did I miss something or misunderstand some concepts? Thank you very much in advance.

Best,
Dadi
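
For reference, a sketch of the arithmetic, assuming the reported loss is summed over the three sigmoid/BCE output units rather than averaged:

    import math

    per_output = -math.log(0.5)   # BCE of one sigmoid output at chance: ~0.693
    print(3 * per_output)         # ~2.08, so a starting loss near 1.9 is plausible
    print(math.log(3))            # ~1.10: the ln(3) bound applies to a single
                                  # softmax, not to three independent BCE outputs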

additional dependencies not given in installation directions/fail with nil ConvNet

In addition to the dependencies noted in previous issues, we need to install hdf5 for Lua, and we need to install and configure Intel MKL for Torch to work 'out of the box'. Without MKL you can probably set some config flags for Torch, but I did not investigate this. However, we come a cropper without 'batcher':

%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/torch/install/share/lua/5.1/trepl/init.lua:384: module 'batcher' not found:No LuaRocks module found for batcher

Hunting around, we find repos for learning_torch, deepmind-atari, and ConvNet-torch that seem to include relevant modules to allow basset_train.lua to proceed further. However, we fail at line 99 of the training script:

%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: attempt to index global 'ConvNet' (a nil value)

should check whether wget is adequate for retrievals in install_data.py

install_data.py moves along but eventually ...

https://www.dropbox.com/s/h1cqokbr8vjj5wc/encode_roadmap.bed.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap.bed.gz (encode_roadmap.bed.gz.gz): No such file or directory
https://www.dropbox.com/s/8g3kc0ai9ir5d15/encode_roadmap_act.txt.gz: HTTPS support not compiled in.

One way to solve this is to build wget from source with OpenSSL; links and details here:
https://coolestguidesontheplanet.com/install-and-configure-wget-on-os-x/

install_data.py requires more than 30 GiB of memory

It seems like install_data.py -- in particular

    cmd = 'seq_hdf5.py -c -r -t 71886 -v 70000 encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5'

(see https://github.com/davek44/Basset/blob/master/src/seq_hdf5.py#L73) -- requires a lot of memory. It deterministically runs OOM on a GCE instance with 30 GiB of memory. After dropping the -r flag, i.e. changing the command to

    seq_hdf5.py \
      -c \
      -t 71886 \
      -v 70000 \
      encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5

install_data.py passes on the same machine. If the "-r" option is not required, then perhaps disable it by default and make it an option at the top. Otherwise, please document the memory requirements. Thank you!

Additional details:

  • GCE machine type: n1-standard-8 instance
  • GCE image: c1-deeplearning-common-cu100-20200422
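
For context, a rough sketch (hypothetical sizes, not seq_hdf5.py itself) of why the -r shuffle is memory-hungry: permuting the one-hot array by fancy indexing materializes a second full copy.

    import numpy as np

    # Hypothetical sizes; the .h5 appears to store 2-byte floats.
    n, seq_len, bytes_per_val = 2000000, 600, 2
    print('%.1f GiB per copy' % (n * 4 * seq_len * bytes_per_val / 2.0**30))

    # Same pattern on a small array: both copies are alive during the indexing.
    small = np.zeros((1000, 4, 1, seq_len), dtype='float16')
    small = small[np.random.permutation(len(small))]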

install_dependencies.py fails on macosx

there is a stream of errors of this type.

%vjcair> ./install_dependencies.py
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/torch/rocks/master - Failed downloading https://raw.githubusercontent.com/torch/rocks/master/manifest - /Users/stvjc/.cache/luarocks/https___raw.githubusercontent.com_torch_rocks_master/manifest
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master - Failed downloading https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/manifest - /Users/stvjc/.cache/luarocks/https___raw.githubusercontent.com_rocks-moonscript-org_moonrocks-mirror_master/manifest

Such an error can be found on the web, and the proposed solution is to remove ~/.cache/luarocks, but this does not fix things in this case. We get errors such as

Error: No results matching query were found.
fatal: destination path 'torch-hdf5' already exists and is not an empty directory.

Missing dependencies for hdf5:
totem

Error: Could not satisfy dependency: totem

To attempt to deal with totem, I used

curl https://raw.githubusercontent.com/deepmind/torch-totem/master/rocks/totem-0-0.rockspec -o totem-0-0.rockspec
sudo luarocks install totem-0-0.rockspec

This seems to have worked.

Prediction tasks specification

Hi,

Thank you for making the code available!
The Basset model has an output of size 164, representing the predicted probability of accessibility in each cell type.
Perhaps I missed it while reading the paper, but I was wondering whether the 164 cell types are listed somewhere, as I'd like to extract/use task-specific representations.

Thanks!
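
A hedged sketch, assuming the activity table installed by install_data.py (encode_roadmap_act.txt) carries the target names in its header row; if so, the 164 cell-type labels can be read directly:

    # Assumption: header row holds target names; first column is the sequence ID.
    with open('encode_roadmap_act.txt') as f:
        header = f.readline().rstrip('\n').split('\t')
    labels = header[1:]
    print(len(labels), labels[:5])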

bedtools is a dependency for install_data.py

it may seem strange for a user in this space to lack bedtools, but it can happen...

https://www.dropbox.com/s/h1cqokbr8vjj5wc/encode_roadmap.bed.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap.bed.gz (encode_roadmap.bed.gz.gz): No such file or directory
https://www.dropbox.com/s/8g3kc0ai9ir5d15/encode_roadmap_act.txt.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap_act.txt.gz (encode_roadmap_act.txt.gz.gz): No such file or directory
/bin/sh: bedtools: command not found

can't find the file

Hi
When I run this command
!cd ../data; preprocess_features.py -y -m 200 -s 600 -o er -c genomes/human.hg19.genome sample_beds.txt

I got an error because there is no human.hg19.genome under the genomes directory; there are hg19.fa and hg19.fa.fai files. How can I get the human.hg19.genome file, or did I miss something?

Docker image Lua and Python setups wrong

The Docker image lzamparo/basset is great, but it is missing a few things.

Re Lua: it needs to have

    export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"

Re Python: it does not have an associated Python environment with Python 2 and Basset's Python dependencies, like h5py.

Same seed doesn't give the same results

I am trying to run basset_train with a fixed seed and I am not getting the same results when I run the same command multiple times.

basset_train.lua -rand 123 -seed cnn_3_best.th -cudnn -job pretrained_params.txt -stagnant_t 10 -save seed_test1 learn.h5

Epoch #  1   train loss =   1.022, valid loss =   1.448, AUC = 0.8550, time = 246s best!

basset_train.lua -rand 123 -seed cnn_3_best.th -cudnn -job pretrained_params.txt -stagnant_t 10 -save seed_test2 learn.h5

Epoch #  1   train loss =   1.021, valid loss =   1.517, AUC = 0.8501, time = 246s best!

Edit: fixing the seed works when running on CPU or with -cuda, but not with -cudnn. Similar behaviour was seen here: soumith/cudnn.torch#92

Convolutional layers - padding and bias

Hi Dave,

I have some technical questions about your original Basset model published in 2016.

Do the convolutions also use a bias parameter or not? Also, the convolutions are same-padded, right?

cheers

importance of motifs for specific classes

Hello,
How can we assess the importance of motifs for specific classes? In the paper, you discuss that Basset is able to pull out motifs specific to cell types. However, when we look at the filters and their motifs, there is no way to tell which motifs are important for which classes, especially in a multi-class problem.

lua error

I followed and ran new_data_iso.ipynb, and got stuck on this error:

91317 learn_cd4.bed
85317 training sequences
3000 test sequences
3000 validation sequences
/root/torch/install/bin/lua: /root/torch/install/share/lua/5.2/trepl/init.lua:389: /root/torch/install/share/lua/5.2/hdf5/ffi.lua:56: expected align(#) on line 579
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
/usr/local/basset/basset_train.lua:3: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?

handle compressed h5 files

Not sure how difficult this is (or if it's already possible), but it would be nice if Basset could handle compressed h5 files for training, prediction, etc. I tried decompressing on the fly and passing the input to basset_train.lua, but it didn't work. It would be a handy feature if it were possible, as some of the h5 files take up a large amount of disk space.
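
One hedged workaround sketch: HDF5's internal gzip filter compresses each dataset inside the file, and h5py readers decompress transparently; whether Basset's torch-hdf5 reader handles filtered datasets is untested here. File names are placeholders.

    import h5py

    # Re-write an existing file with per-dataset gzip compression.
    with h5py.File('learn.h5', 'r') as src, h5py.File('learn_gz.h5', 'w') as dst:
        for name in src:
            dst.create_dataset(name, data=src[name][...], compression='gzip')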

basset_sad fails with wrong matching to reference genome

When using basset_sad.py with an input VCF file containing wrong mappings of the reference allele to the reference genome, the script at first prints a warning message:

WARNING: skipping 10100 because reference allele does not match reference genome: G vs A

but after a while the calculation fails with this exception:

Traceback (most recent call last):
  File "/software/basset/src/basset_sad.py", line 216, in <module>
    main()
  File "/software/basset/src/basset_sad.py", line 111, in main
    ref_preds = seq_preds[pi,:]
IndexError: index 1998 is out of bounds for axis 0 with size 1998

When I remove the wrong line from the input file, the calculation completes successfully.
I don't know if this is the desired behavior, but I suggest terminating the calculation as soon as possible (if it is not possible to recover from this error).
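
In the meantime, a hedged sketch of a pre-filter that drops records whose REF allele disagrees with the genome before running basset_sad.py (file names are placeholders; VCF positions are 1-based, pysam coordinates 0-based):

    import pysam

    genome = pysam.FastaFile('hg19.fa')
    with open('in.vcf') as vcf_in, open('filtered.vcf', 'w') as vcf_out:
        for line in vcf_in:
            if line.startswith('#'):
                vcf_out.write(line)
                continue
            chrom, pos, _, ref = line.split('\t')[:4]
            start = int(pos) - 1
            # Keep the record only if REF matches the reference genome.
            if genome.fetch(chrom, start, start + len(ref)).upper() == ref.upper():
                vcf_out.write(line)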

bedtools getfasta skipping problem

Hi,
I am trying to implement the model following the tutorials. When I try to convert the bed file to FASTA format using getfasta, I get the error "Feature xxx beyond the length of xxx size (8 bp). Skipping." I read that one solution is to use a contig file, which I do not think is provided.

Any tips/guidance how to solve this?

Best,
Mahmoud
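
A hedged sketch of one common workaround: clamp the BED intervals to chromosome bounds before getfasta, so features running past a contig end are trimmed instead of skipped. chrom_sizes.txt stands in for a two-column <chrom> <length> file (e.g. built from the .fai index).

    # Build a chromosome-length lookup from the placeholder sizes file.
    chrom_len = {}
    for line in open('chrom_sizes.txt'):
        chrom, length = line.split()[:2]
        chrom_len[chrom] = int(length)

    # Clamp each interval to [0, chromosome length] and drop empty ones.
    with open('in.bed') as bed_in, open('clamped.bed', 'w') as bed_out:
        for line in bed_in:
            a = line.rstrip('\n').split('\t')
            start = max(0, int(a[1]))
            end = min(int(a[2]), chrom_len[a[0]])
            if start < end:
                a[1], a[2] = str(start), str(end)
                bed_out.write('\t'.join(a) + '\n')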

Motif presence at genomic location

In my use case of Basset I've bumped into something that would be super useful. This may already be doable (but I'm not sure how): a method that takes in a genomic region and a target cell type, and outputs which motifs are associated with that region in that cell type using Tomtom. I know that right now you can feed in the entire model and sequences, but I'm not sure how to hook all that together.

Output name of basset_sat_vcf.py

Hey,
So the output of basset_sat_vcf.py oftentimes includes ">" in the output file name, which breaks it. Just need to add replace(">","_").
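
A sketch of the suggested fix (the identifier is hypothetical):

    snp_id = 'chr1:123:A>G'              # hypothetical identifier
    safe_id = snp_id.replace('>', '_')   # 'chr1:123:A_G' is safe in a file name
    print(safe_id)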

preprocess_features.py file is generating empty output bed files

Hi All,
Thanks for the tool. I am a beginner in deep learning. I was trying to reproduce the tutorials with the provided datasets. I started with https://github.com/davek44/Basset/blob/master/tutorials/prepare_compendium.ipynb for preparing the data. The first Python script (preprocess_features.py) just generates empty bed files with the given default settings. I am running the script in Python 3.

While running, the script throws the following warning message for every chromosome:

<_io.TextIOWrapper name='err_1_+.bed' mode='w' encoding='UTF-8'> 1	39761285.5	39761286.5	ENSG00000084072	

Any pointers would be of great help.
Thank you!

how to install 'convnet' module

/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/trepl/init.lua:383: module 'convnet' not found:No LuaRocks module found for convnet

when running test.ipynb

Hi
I trained your model and am trying to test it.
In test.ipynb,

model_file = '../data/models/pretrained_model.th'
seqs_file = '../data/encode_roadmap.h5'

but in my data directory there is no encode_roadmap.h5;
there are
encode_roadmap.txt
encode_roadmap_act.txt
encode_roadmap.bed
Did I miss a step to generate the .h5? If so, where is the encoding part that makes the .h5?

last layer

Hi,
I am trying to find out what you used for the last-layer function. Is it softmax or some other specific function? I tried to go through the code, but I couldn't find how you implemented the last layer.
Can you please help me with that, and also tell me where I can find the detailed code implementation?
Thanks in advance!

Use docstrings for function comments

You've done a great job documenting functions, but by putting the documentation in comments rather than docstrings, it's not available to users interactively.
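
For illustration (hypothetical function, not Basset's actual code):

    def one_hot_comment(seq):
        # Encode a DNA sequence as a one-hot matrix. (Invisible to help().)
        pass

    def one_hot_docstring(seq):
        """Encode a DNA sequence as a one-hot matrix."""
        pass

    help(one_hot_docstring)   # the description now shows up interactively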

Not python 3 friendly

Hi,

Love the package, I'm keen to try it out for myself. Just wanted to point out that it doesn't seem to play well with Python 3. Some errors are easily fixed by running 2to3 on the relevant .py files, but some are not. For example, try running install_data.py using Anaconda Python 3. I get the following:

[zamparol@gpu-1-14 Basset]$ python install_data.py -r
[edited for brevity]
Traceback (most recent call last):
  File "/cbio/cllab/nobackup/zamparol/Basset/src/seq_hdf5.py", line 130, in <module>
    main()
  File "/cbio/cllab/nobackup/zamparol/Basset/src/seq_hdf5.py", line 46, in main
    seqs, targets = dna_io.load_data_1hot(fasta_file, targets_file, extend_len=options.extend_length, mean_norm=False, whiten=False, permute=False, sort=False)
  File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 293, in load_data_1hot
    seq_vecs = hash_sequences_1hot(fasta_file, extend_len)
  File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 267, in hash_sequences_1hot
    seq_vecs[header] = dna_one_hot(seq, seq_len)
  File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 137, in dna_one_hot
    seq = seq[seq_trim:seq_trim+seq_len]
TypeError: slice indices must be integers or None or have an __index__ method

The same script seems to succeed using Anaconda Python 2.7.1 (though I can't be sure; the seq_hdf5.py step takes a while to complete). I'll use that for my purposes, but maybe you should update the readme to explicitly say Python 2 is required?
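
The failing slice suggests seq_trim became a float, likely from a / 2 under Python 3's true division; a hedged, standalone sketch of the kind of fix 2to3 won't make automatically:

    seq, seq_len = 'ACGT' * 200, 600
    seq_trim = (len(seq) - seq_len) // 2    # '//' keeps the offset an int in Python 3
    seq = seq[seq_trim:seq_trim + seq_len]
    print(len(seq))                         # 600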

Data type Error while running basset_train.lua

Hi Dave,
I followed the instructions mentioned in the tutorial file: new_data_iso.ipynb
Then ran this:
$ basset_train.lua -job params.txt -save mg_dnase_cnn learn_mg_dnase.h5
(previously, learn_mg_dnase.bed, learn_mg_dnase_act.txt, learn_mg_dnase.fa and learn_mg_dnase.h5 were created as per the instructions with your scripts)
I get this error message: "ffi.lua:332: Cannot support reading float data with size = 2 bytes"
Question: Is this error related to the datatype in the .h5 file?
Could you please suggest something?
Thanks!

(Please see the error details below)

{}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> output]
(1): nn.SpatialConvolution(4 -> 10, 10x1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.Reshape(5910)
(5): nn.Linear(5910 -> 500)
(6): nn.BatchNormalization
(7): nn.ReLU
(8): nn.Linear(500 -> 9)
(9): nn.Sigmoid
}
/home/sam/softwares/torch/install/bin/luajit: ...e/sam/softwares/torch/install/share/lua/5.1/hdf5/ffi.lua:332: Cannot support reading float data with size = 2 bytes
stack traceback:
[C]: in function 'error'
...e/sam/softwares/torch/install/share/lua/5.1/hdf5/ffi.lua:332: in function '_getTorchType'
...m/softwares/torch/install/share/lua/5.1/hdf5/dataset.lua:88: in function 'getTensorFactory'
...m/softwares/torch/install/share/lua/5.1/hdf5/dataset.lua:138: in function 'partial'
/home/sam/softwares/basset/src/batcher.lua:31: in function 'next'
/home/sam/softwares/basset/src/convnet.lua:884: in function 'train_epoch'
/home/sam/softwares/basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...ares/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50

Epoch # 1

trouble with basset_motifs.py

Hi again,

After happily finishing the training procedure (and with model testing running fine), I am running into trouble with basset_motifs.py:

I basically get a couple of errors like this:

/home/ron/anaconda2/lib/python2.7/site-packages/weblogolib/logomath.py:114: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
if self._mean ==None:

before the script stops working, throwing this (after 25 motif steps):

Traceback (most recent call last):
File "/home/ron/tools/Basset/src/basset_motifs.py", line 659, in
main()
File "/home/ron/tools/Basset/src/basset_motifs.py", line 125, in main
filter_possum(filter_weights[f,:,:], 'filter%d'%f, '%s/filter%d_possum.txt'%(options.out_dir,f), options.trim_filters)
File "/home/ron/tools/Basset/src/basset_motifs.py", line 558, in filter_possum
while np.max(param_matrix[:,trim_start]) - np.min(param_matrix[:,trim_start]) < trim_t:
IndexError: index 19 is out of bounds for axis 1 with size 19

My learned model files are still pretty small (68 MB before, 129 MB after decuda). Not sure if that might be a problem or a hint of a faulty training run.

Cheers,
Ron

input and target size mismatch when running test.lua

Hi Dave!
Sorry to interrupt you. I wanted to run this impressive pipeline but got stuck on an error. I haven't tried to train a new network yet, and all data were downloaded from your Dropbox directly. But when I run basset_test.lua, this occurs:

[myb@localhost ~]$ cd ./Basset/src
[myb@localhost src]$ basset_test.lua ../data/models/pretrained_model.th ../data/encode_roadmap.h5 ../data
/home/myb/torch/install/bin/lua: /home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:24: input and target size mismatch
stack traceback:
[C]: in function 'assert'
/home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:24: in function </home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:22>
(...tail calls...)
./convnet.lua:1156: in function 'test'
/home/myb/Basset/src/basset_test.lua:75: in main chunk
[C]: in function 'dofile'
.../myb/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?

Would you mind telling me what may lead to this and what I should do? Thanks a lot!

wrong directory position

When I run get_dnase.sh, the "rearrange" step is

    mv egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak roadmap
    rm -r egg2.wustl.edu
    rmdir roadmap/hammock

The script does rmdir roadmap/hammock, but hammock is under roadmap/narrowPeak/. Is the script wrong, or did I do something wrong?

basset_sad throws an error for variants near the chromosome start

  File "/Users/arahuja/src/Basset/src/basset_sad.py", line 198, in <module>
    main()
  File "/Users/arahuja/src/Basset/src/basset_sad.py", line 54, in main
    seq_vecs, seqs, seq_headers = vcf.snps_seq1(snps, options.genome_fasta, options.seq_len)
  File "/Users/arahuja/src/Basset/src/vcf.py", line 57, in snps_seq1
    seq = genome.fetch(snp.chrom, seq_start-1, seq_end).upper()
  File "pysam/cfaidx.pyx", line 238, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3991)
  File "pysam/cutils.pyx", line 202, in pysam.cutils.parse_region (pysam/cutils.c:3378)
ValueError: start out of range (-228)

This is due to an issue in vcf.py where seq_start is not checked for being < 0 when the variant start is less than the window size.
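
A minimal sketch of a possible guard (illustrative names, not the actual vcf.py fix): clamp the window at the chromosome start and left-pad with N to keep the intended length.

    import pysam

    def fetch_window(genome, chrom, pos, seq_len):
        # Clamp at 0 so pysam never sees a negative start; pad back to length.
        start = pos - seq_len // 2
        seq = genome.fetch(chrom, max(0, start), start + seq_len).upper()
        if start < 0:
            seq = 'N' * -start + seq
        return seq

    # genome = pysam.FastaFile('hg19.fa')                  # placeholder path
    # print(len(fetch_window(genome, 'chr1', 100, 600)))   # 600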

ENUM error when running basset_train.lua

Hi Dave,

When I run basset_train.lua -job pretrained_params.txt -stagnant_t 10 er.h5, I end up getting an ENUM(50331977) error. I have tried diagnosing it, but haven't had any luck. Below are the error logs:

{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
seq_len: 600, filter_size: 19, pad_width: 18
seq_len: 200, filter_size: 11, pad_width: 10
seq_len: 50, filter_size: 7, pad_width: 6
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> output]
(1): nn.SpatialConvolution(4 -> 300, 19x1, 1,1, 9,0)
(2): nn.SpatialBatchNormalization (4D) (300)
(3): nn.ReLU
(4): nn.SpatialMaxPooling(3x1, 3,1)
(5): nn.SpatialConvolution(300 -> 200, 11x1, 1,1, 5,0)
(6): nn.SpatialBatchNormalization (4D) (200)
(7): nn.ReLU
(8): nn.SpatialMaxPooling(4x1, 4,1)
(9): nn.SpatialConvolution(200 -> 200, 7x1, 1,1, 3,0)
(10): nn.SpatialBatchNormalization (4D) (200)
(11): nn.ReLU
(12): nn.SpatialMaxPooling(4x1, 4,1)
(13): nn.Reshape(2600)
(14): nn.Linear(2600 -> 1000)
(15): nn.BatchNormalization (2D) (1000)
(16): nn.ReLU
(17): nn.Dropout(0.300000)
(18): nn.Linear(1000 -> 1000)
(19): nn.BatchNormalization (2D) (1000)
(20): nn.ReLU
(21): nn.Dropout(0.300000)
(22): nn.Linear(1000 -> 164)
(23): nn.Sigmoid
}
/home/hugheslab2/zainmunirpatel/torch/install/bin/luajit: .../zainmunirpatel/torch/install/share/lua/5.1/hdf5/ffi.lua:335: Reading data of class ENUM(50331977) is unsupported
stack traceback:
[C]: in function 'error'
.../zainmunirpatel/torch/install/share/lua/5.1/hdf5/ffi.lua:335: in function '_getTorchType'
...nmunirpatel/torch/install/share/lua/5.1/hdf5/dataset.lua:88: in function 'getTensorFactory'
...nmunirpatel/torch/install/share/lua/5.1/hdf5/dataset.lua:138: in function 'partial'
/home/hugheslab2/zainmunirpatel/Basset/src/batcher.lua:39: in function 'next'
/home/hugheslab2/zainmunirpatel/Basset/src/convnet.lua:1009: in function 'train_epoch'
...e/hugheslab2/zainmunirpatel/Basset//src/basset_train.lua:156: in main chunk
[C]: in function 'dofile'
...atel/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406460
Epoch # 1 [zainmunirpatel@bc2 data]$

Thanks for the help!

Best,
Zain

Pre-computed model output as HDF5

I am trying to run basset_motifs.py, but it tries to access a file that doesn't exist. If I don't specify the -d parameter:
-d | model_hdf5_file | Pre-computed model output as HDF5

then it tries to read a file: model_out.h5

Can you clarify what this file is, and what I should pass to the -d flag?

Thanks,
Gabriel

Citation on README

Hi,

First of all, I've just read the paper and I found it a really astonishing piece of work!

I just wanted to comment that it would be helpful to have the citation to the publication in the README, so people can locate it faster to cite it or to read it.

Torch-HDF5 Failure during Writing Output by basset_motifs_predict.lua

Hi David,

It seems torch-hdf5 is not very compatible with Basset. basset_motifs_predict.lua fails to write output. I tested the script line by line in the th interactive mode and encountered the failure at local hdf_out = hdf5.open(opt.out_file, 'w'). Strangely, it didn't fail when reading in the hdf5 file (the validation sequences) in the previous steps. The error message is as follows:

/usr/dgXXX/torch/install/bin/luajit: /usr/dgXXX/torch/install/share/lua/5.1/hdf5/file.lua:10: HDF5File.__init() requires a fileID - perhaps you want HDF5File.create()?
stack traceback:
	[C]: in function 'assert'
	/usr/dgXXX/torch/install/share/lua/5.1/hdf5/file.lua:10: in function '__init'
	/usr/dgXXX/torch/install/share/lua/5.1/torch/init.lua:91: in function </PHShome/dg520/torch/install/share/lua/5.1/torch/init.lua:87>
	[C]: in function 'open'
	./basset_motifs_predict.lua:57: in main chunk
	[C]: in function 'dofile'
	...gXXX/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405810

Actually, I met the same error message at the training stage but worked around it by reading Deepmind issue #81 and Basset issue #25. I made the following changes:

  1. Inside the src folder of Basset, I downloaded https://github.com/davek44/torch-hdf5.git and ran luarocks make on the library inside the torch-hdf5 folder.
  2. Changed lua/5.1/hdf5/ffi.lua by replacing if maj[0] ~= 1 or min[0] ~= 8 then with if maj[0] ~= 1 or min[0] ~= 10 then, to make it work with HDF5 version 1.10.
  3. Changed lua/5.1/hdf5/file.lua by adding fileID=tonumber(fileID) in openFunc, and changed lua/5.1/hdf5/group.lua by replacing .. self._groupID .. with .. tostring(self._groupID) .. in HDF5Group:__tostring(). These two changes overcome the error message about the missing fileID at the training stage.

I've also tried to install the hdf5-1.10 version, but then encountered the problem of 2-byte float incompatibility.

I would really appreciate it if you could help me out in this case. Thanks a lot!

Best,
Dadi

Nucleotide Order in Motif Heatmap Reversed

Hi David,

It seems the nucleotide order in the motif heat map is reversed. On my end, the labels are printed as T, G, C, A from top to bottom. I've compared the weight matrix and the heat-map matrix; both suggest the correct order should be A, C, G, T from top to bottom. A, C, G, T is also the order in your tutorial outputs.

This involves ax.set_yticklabels('TGCA', rotation='horizontal') in basset_motifs.py, and ax_heat.yaxis.set_ticklabels('TGCA', rotation='horizontal') in basset_sat.py and basset_sat_vcf.py.

Best,
Dadi

Original dataset Basset is trained on

I am trying to recreate the Basset dataset. I want to confirm: is prepare_compendium.ipynb the notebook that recreates that dataset? If so, when I run the following code:

!cd ../data; preprocess_features.py -y -m 200 -s 600 -o er -c genomes/human.hg19.genome sample_beds.txt

This gives me empty er files. Can you explain why that might be? I am simply trying to recreate the original Basset dataset. Insights would be appreciated.
