davek44 / Basset
Convolutional neural network analysis for predicting DNA sequence activity.
License: MIT License
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4937568/
In the above article, Basset is trained on ENCODE's DNase-seq dataset. Are the pre-trained weights available somewhere, in case we want to replicate the analysis without the training phase?
When using basset_sad.py with an input VCF file that contains incorrect mappings of the reference allele to the reference genome, the script first prints a warning message:
WARNING: skipping 10100 because reference allele does not match reference genome: G vs A
but after a while the calculation fails with this exception:
Traceback (most recent call last):
File "/software/basset/src/basset_sad.py", line 216, in <module>
main()
File "/software/basset/src/basset_sad.py", line 111, in main
ref_preds = seq_preds[pi,:]
IndexError: index 1998 is out of bounds for axis 0 with size 1998
When I remove the offending line from the input file, the calculation completes successfully.
I don't know whether this is the desired behavior, but I suggest terminating the calculation as soon as possible (if it is not possible to recover from this error).
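A minimal sketch of the suggested early filtering, dropping mismatched records before prediction so downstream indices stay aligned. A simple (chrom, pos) → base dict stands in for the reference genome here; basset_sad.py itself uses pysam, so this is illustrative only:

```python
# Sketch: drop VCF records whose REF allele disagrees with the reference
# genome *before* prediction, so prediction indices stay aligned.
# The genome lookup is mocked with a dict; the real script would query
# a FASTA index instead (an assumption, not the project's actual fix).

def filter_matching_snps(vcf_records, genome):
    """Keep only records whose REF matches the reference base."""
    kept = []
    for chrom, pos, snp_id, ref, alt in vcf_records:
        actual = genome.get((chrom, pos))
        if actual != ref:
            print('WARNING: skipping %s because reference allele does not '
                  'match reference genome: %s vs %s' % (snp_id, ref, actual))
            continue
        kept.append((chrom, pos, snp_id, ref, alt))
    return kept

genome = {('chr1', 100): 'A', ('chr1', 200): 'G'}
records = [('chr1', 100, 'rs1', 'A', 'T'),
           ('chr1', 200, 'rs2', 'C', 'T')]   # mismatched REF: filtered out
print(len(filter_matching_snps(records, genome)))  # 1
```

Filtering once up front also makes the later "terminate early" suggestion unnecessary, since no mismatched record ever reaches the prediction step.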
Hi
When I run this command
!cd ../data; preprocess_features.py -y -m 200 -s 600 -o er -c genomes/human.hg19.genome sample_beds.txt
I got an error because there is no human.hg19.genome under the genomes directory; there are only
hg19.fa and hg19.fa.fai files. How can I get the human.hg19.genome file, or did I miss something?
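For reference, the .genome file that bedtools-style tools expect is just a two-column chromosome/length table, and the hg19.fa.fai index already in the directory contains those lengths in its first two columns. A sketch, with file contents assumed from this issue:

```python
# Sketch: derive a bedtools-style .genome table ("<chrom>\t<length>")
# from the first two columns of a samtools .fai index.

def fai_to_genome(fai_lines):
    """Convert .fai index lines to '<chrom>\t<length>' genome-table lines."""
    out = []
    for line in fai_lines:
        fields = line.rstrip('\n').split('\t')
        out.append('%s\t%s' % (fields[0], fields[1]))  # name, length
    return out

# hg19.fa.fai columns: name, length, offset, line bases, line width
fai = ['chr1\t249250621\t52\t60\t61',
       'chr2\t243199373\t253404903\t60\t61']
print('\n'.join(fai_to_genome(fai)))
```

The equivalent shell one-liner would be `cut -f1,2 hg19.fa.fai > human.hg19.genome`.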
Hi Dave!
Sorry to interrupt you. I wanted to run this impressive pipeline but got stuck on an error. I haven't tried to train a new network yet, and all data were downloaded directly from your Dropbox. But when I run basset_test.lua, this occurred:
[myb@localhost ~]$ cd ./Basset/src
[myb@localhost src]$ basset_test.lua ../data/models/pretrained_model.th ../data/encode_roadmap.h5 ../data
/home/myb/torch/install/bin/lua: /home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:24: input and target size mismatch
stack traceback:
[C]: in function 'assert'
/home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:24: in function </home/myb/torch/install/share/lua/5.2/nn/BCECriterion.lua:22>
(...tail calls...)
./convnet.lua:1156: in function 'test'
/home/myb/Basset/src/basset_test.lua:75: in main chunk
[C]: in function 'dofile'
.../myb/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Would you mind telling me what might cause this and what I should do? Thanks a lot!
Hi David,
Thanks for developing the powerful Basset. I've tried to apply it to my splicing motif study and the results are quite positive. However, I have a little confusion about the math behind it:
My train and valid loss at the first epoch is about 1.9. Since I have three classes in this study, and I think binary cross-entropy is used as the default loss function, this should give me a starting loss of no more than ln(3), which is around 1.1. So I don't know why all my models started at 1.9, way higher than the maximal expectation of 1.1.
Did I miss something or misunderstand some concepts? Thank you very much in advance.
Best,
Dadi
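A quick arithmetic check of the numbers above, assuming the criterion sums binary cross-entropy over the three independent sigmoid outputs (whether Basset sums or averages per output is an assumption here): an uninformative model predicting 0.5 everywhere starts near 3·ln(2) ≈ 2.08, while ln(3) ≈ 1.10 is the baseline for softmax/categorical cross-entropy, not for per-output BCE.

```python
import math

# Starting loss for an uninformative model under summed BCE over
# 3 independent sigmoid outputs vs. the categorical ln(3) baseline.
def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

targets = [1, 0, 0]                        # one active class of three
start_loss = sum(bce(y, 0.5) for y in targets)
print(round(start_loss, 3))                # 2.079  (3 * ln 2)
print(round(math.log(3), 3))               # 1.099  (softmax baseline)
```

So a first-epoch loss near 1.9 is consistent with summed BCE over three outputs; ln(3) would only bound the loss if the last layer were a softmax over mutually exclusive classes.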
I am trying to run basset_motifs.py, but it tries to access a file that doesn't exist. If I don't specify the -d parameter:
-d | model_hdf5_file | Pre-computed model output as HDF5
then it tries to read a file: model_out.h5
Can you clarify what this file is and what I should pass to the -d flag?
Thanks,
Gabriel
basset_motifs_infl.lua requires dp/torchx, which is not currently listed as a dependency.
Hi,
Thank you for making the code available!
The Basset model has an output of size 164, representing the predicted probability of accessibility in each cell type.
Perhaps I missed it while reading the paper, but I was wondering if the 164 cell types are listed somewhere, as I'd like to extract/use task-specific representations.
Thanks!
I enjoyed your talk today at NIPS! Our lab has been working with deep learning tools in biology as well, though we're working with Keras/Theano and amino acid sequences.
I'm giving Basset a spin and will file a few housekeeping issues. Please don't consider these issues to be demands or criticisms--they're more like notes to myself on ways I or our other lab members might contribute.
Hi Dave,
I followed the instructions mentioned in the tutorial file: new_data_iso.ipynb
In my dataset I have only one positive BED file and one negative BED file. After preprocessing I got:
68534 training sequences
6000 test sequences
6000 validation sequences
saved in learn_cd4.h5 (length equals 600, as you suggested).
Then I created a blank params.txt to hold the training parameters.
In the last step I tried to run "basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5":
[liuqiao@g01 mydata]$ basset_train.lua -job params.txt -save cd4_cnn learn_cd4.h5
{}
seq_len: 600, filter_size: 10, pad_width: 9
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> output]
(1): nn.SpatialConvolution(4 -> 10, 10x1, 1,1, 4,0)
(2): nn.SpatialBatchNormalization (4D) (10)
(3): nn.ReLU
(4): nn.Reshape(6000)
(5): nn.Linear(6000 -> 500)
(6): nn.BatchNormalization (2D) (500)
(7): nn.ReLU
(8): nn.Linear(500 -> 1)
(9): nn.Sigmoid
}
/home/liuqiao/torch/install/bin/luajit: /home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:67:
In 4 module of nn.Sequential:
/home/liuqiao/torch/install/share/lua/5.1/torch/Tensor.lua:466: Wrong size for view. Input size: 128x10x1x599. Output size: 128x6000
stack traceback:
[C]: in function 'error'
/home/liuqiao/torch/install/share/lua/5.1/torch/Tensor.lua:466: in function 'view'
/home/liuqiao/torch/install/share/lua/5.1/nn/Reshape.lua:46: in function </home/liuqiao/torch/install/share/lua/5.1/nn/Reshape.lua:31>
[C]: in function 'xpcall'
/home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/liuqiao/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/liuqiao/Basset/src/convnet.lua:1030: in function 'opfunc'
/home/liuqiao/torch/install/share/lua/5.1/optim/rmsprop.lua:35: in function 'rmsprop'
/home/liuqiao/Basset/src/convnet.lua:1063: in function 'train_epoch'
/home/liuqiao/Basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...qiao/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/liuqiao/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/liuqiao/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
/home/liuqiao/Basset/src/convnet.lua:1030: in function 'opfunc'
/home/liuqiao/torch/install/share/lua/5.1/optim/rmsprop.lua:35: in function 'rmsprop'
/home/liuqiao/Basset/src/convnet.lua:1063: in function 'train_epoch'
/home/liuqiao/Basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...qiao/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
It reported "Wrong size for view". I've tried to read your source code in basset_train.lua, but I'm still not sure what's wrong. I have to say I'm not very familiar with Torch and Lua. I guess maybe some command-line parameters are wrong, or I have to revise some source code? I really hope you can help me figure it out or give me some suggestions.
Thanks a lot!
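One plausible reading of the trace, sketched as shape arithmetic: with the even filter size 10 that the blank params.txt produced, the symmetric padding of 4 yields 599 convolution output positions, while nn.Reshape expected 10 × 600 = 6000. An odd filter size (like the default 19 with padding 9) keeps the sequence length unchanged. This is an inference from the printed model, not a confirmed diagnosis:

```python
# SpatialConvolution output length: (L + 2*pad - filter) / stride + 1.
# An even filter (10) needs an odd total pad (9), which a symmetric pad
# of 4 cannot provide, so one position is lost: 599 instead of 600.
def conv_out_len(seq_len, filter_size, pad, stride=1):
    return (seq_len + 2 * pad - filter_size) // stride + 1

L = conv_out_len(600, 10, 4)
print(L, 10 * L, 10 * 600)   # 599 5990 6000  -> the "Wrong size for view"
```

If this reading is right, setting an odd conv_filter_size in params.txt would make the Reshape arithmetic consistent.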
File "/Users/arahuja/src/Basset/src/basset_sad.py", line 198, in <module>
main()
File "/Users/arahuja/src/Basset/src/basset_sad.py", line 54, in main
seq_vecs, seqs, seq_headers = vcf.snps_seq1(snps, options.genome_fasta, options.seq_len)
File "/Users/arahuja/src/Basset/src/vcf.py", line 57, in snps_seq1
seq = genome.fetch(snp.chrom, seq_start-1, seq_end).upper()
File "pysam/cfaidx.pyx", line 238, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3991)
File "pysam/cutils.pyx", line 202, in pysam.cutils.parse_region (pysam/cutils.c:3378)
ValueError: start out of range (-228)
This is due to an issue in vcf.py, where seq_start is not checked for being < 0 when the variant start is smaller than the window size.
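A sketch of the suggested guard: clamp the window at the chromosome start instead of passing a negative coordinate to fetch(), and left-pad with N so the sequence keeps its full length. The padding behavior is an assumption about the desired fix, not the project's actual patch:

```python
# Sketch of a guard for vcf.py's window extraction, using 0-based
# half-open coordinates for simplicity. Bases that fall off the left
# edge of the chromosome are replaced with N.
def snp_window(fetch, chrom, snp_pos, seq_len):
    seq_start = snp_pos - seq_len // 2
    seq_end = seq_start + seq_len
    left_pad = max(0, -seq_start)            # bases missing off the left edge
    seq = fetch(chrom, max(seq_start, 0), seq_end).upper()
    return 'N' * left_pad + seq

ref = 'acgt' * 8                             # toy 32 bp "chromosome"
fake_fetch = lambda chrom, start, end: ref[start:end]
print(len(snp_window(fake_fetch, 'chr1', snp_pos=3, seq_len=20)))  # 20
```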
Hi
when I run preprocess_features.py, I get an IndexError from line 76:
['>chr1']
Traceback (most recent call last):
File "./preprocess_features.py", line 420, in
main()
File "./preprocess_features.py", line 76, in main
chrom_lengths[a[0]] = int(a[1])
IndexError: list index out of range
so I printed a, and the result is
['>chr1']
Is there anything I missed?
Thanks
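The ['>chr1'] element suggests a FASTA file was passed to -c where preprocess_features.py expects a two-column chromosome/length table. A sketch of a friendlier check (a hypothetical helper, not the script's actual code):

```python
# Sketch: read a "<chrom>\t<length>" genome table, failing loudly when
# the input looks like FASTA (a leading '>'), which matches the ['>chr1']
# printed in this issue.
def read_chrom_lengths(lines):
    lengths = {}
    for line in lines:
        a = line.split()
        if a and a[0].startswith('>'):
            raise ValueError('looks like FASTA, not a .genome table: %r' % a[0])
        if len(a) >= 2:
            lengths[a[0]] = int(a[1])
    return lengths

print(read_chrom_lengths(['chr1\t249250621']))
```

In other words, the fix on the user side is to pass a chromosome-sizes file (e.g. derived from the FASTA's .fai index), not the FASTA itself.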
Hi
I trained your model and am trying to test it.
In test.ipynb,
model_file = '../data/models/pretrained_model.th'
seqs_file = '../data/encode_roadmap.h5'
but in my data directory there is no encode_roadmap.h5; there are
encode_roadmap.txt
encode_roadmap_act.txt
encode_roadmap.bed
Am I missing a step to generate the .h5 file? If so, where is the encoding part that makes the .h5?
In my use case of Basset I've bumped into something that would be super useful. This may already be doable (but I'm not sure how): a method that takes in a genomic region and a target cell type, and outputs the motifs associated with that region in that cell type using Tomtom. I know that right now you can feed in the entire model and sequences, but I'm not sure how to hook all that together.
Hi Dave,
I have some technical questions about your original Basset model published in 2016.
Do the convolutions also use a bias parameter or not? Also, the convolutions are same-padded, right?
cheers
I am trying to recreate the Basset dataset. I want to make sure: is prepare_compendium.ipynb the notebook that recreates that dataset? And if so, when I run the following code:
!cd ../data; preprocess_features.py -y -m 200 -s 600 -o er -c genomes/human.hg19.genome sample_beds.txt
This gives me empty er files. Can you explain why that might be? I am simply trying to recreate the original Basset dataset. Insights would be appreciated.
Hi,
First of all, I've just read the paper and I found it a really astonishing piece of work!
I just wanted to comment that it would be helpful to have a citation to the publication in the README, so people can locate it faster to cite or read it.
Hi there,
Sorry, it feels like I'm ploughing through the functions and hitting everything that can go wrong ;]
Trying to use saturation mutagenesis I ran into problems, so I tried to fall back on the tutorial's chr7 HOX CTCF site example location:
~/toolsBasset/$ basset_sat.py -t 3 -n 200 -o satmut_hox cnn_full_with_prim_erythroid/dnase_with_prim_erythroid_cnn_best.decuda.th ./hox_test_seq.fa basset_sat_predict.lua -center_nt 200 cnn_full_with_prim_erythroid/dnase_with_prim_erythroid_cnn_best.decuda.th satmut_hox/model_in.h5 satmut_hox/model_out.h5
Predicting sequence 1 variants
Traceback (most recent call last):
File "/home/ron/tools/Basset/src/basset_sat.py", line 367, in
main()
File "/home/ron/tools/Basset/src/basset_sat.py", line 206, in main
g = sns.heatmap(preds_heat, vmin=0, vmax=1, linewidths=0, xticklabels=target_labels, yticklabels=False, cbar_kws={"orientation": "horizontal"})
File "/home/ron/anaconda2/lib/python2.7/site-packages/seaborn/matrix.py", line 452, in heatmap
mask)
File "/home/ron/anaconda2/lib/python2.7/site-packages/seaborn/matrix.py", line 140, in init
self.xticklabels = xticklabels[xstart:xend:xstep]
TypeError: 'NoneType' object has no attribute 'getitem'
~/toolsBasset/$ ll satmut_hox/
total 1092
drwxrwxr-x 2 ron ron 4096 Jun 20 12:59 ./
drwxrwxr-x 11 ron ron 12288 Jun 20 13:04 ../
-rw-rw-r-- 1 ron ron 6944 Jun 20 13:06 model_in.h5
-rw-rw-r-- 1 ron ron 1058144 Jun 20 13:06 model_out.h5
-rw-rw-r-- 1 ron ron 0 Jun 20 13:06 preds.txt
-rw-rw-r-- 1 ron ron 0 Jun 20 13:06 table.txt
I was trying to run this with my own trained model (your 164 selected tissues + a test set of our own). Trying the same with your provided pre-trained model results in exactly the same error output, independent of the selected cell type to predict.
Also noteworthy: randomly selecting and mutating sequences from the already extracted sequences (as described in the sat_mut tutorial) seems to work fine.
Cheers,
Ron
The Docker image lzamparo/basset is great, but it is missing a few things.
Re Lua, it needs to have:
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
Re Python, it does not have an associated Python environment with Python 2 and Basset's Python dependencies like h5py.
Do you have any suggestions for exploring the hyper-parameter space in terms of learning rate, etc.? My concern is that with Basset the prediction error on the validation set always converges after 2 or 3 epochs. My guess is that selecting better hyper-parameter values would improve the prediction error. Have you encountered this issue?
Thanks,
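One common approach to the question above is log-uniform random search over the learning rate and a few other knobs; the ranges below are illustrative assumptions, not recommendations from the Basset authors. Each sampled trial would be written to its own params.txt for basset_train.lua -job:

```python
# Sketch: log-uniform random search over a few training hyper-parameters.
import random

def sample_params(rng):
    return {
        'learning_rate': 10 ** rng.uniform(-4, -2),   # 1e-4 .. 1e-2
        'momentum': rng.uniform(0.9, 0.999),
        'hidden_dropouts': rng.choice([0.1, 0.3, 0.5]),
    }

rng = random.Random(0)                 # fixed seed for reproducible trials
trials = [sample_params(rng) for _ in range(5)]
for t in trials:
    print(t)
```

Sampling the learning rate on a log scale matters because its useful values span orders of magnitude, unlike momentum or dropout.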
You've done a great job documenting functions, but by putting the documentation in comments rather than docstrings, it's not available to users interactively.
Hi All,
Thanks for the tool. I am a beginner in deep learning. I was trying to reproduce the tutorials with the provided datasets. I started with https://github.com/davek44/Basset/blob/master/tutorials/prepare_compendium.ipynb for preparing the data. The first Python script for that (preprocess_features.py) just generates empty BED files with the default settings. I am running the script in Python 3.
While running the script, it throws the following warning message for every chromosome:
<_io.TextIOWrapper name='err_1_+.bed' mode='w' encoding='UTF-8'> 1 39761285.5 39761286.5 ENSG00000084072
Any pointers would be of great help.
Thank you!
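The <_io.TextIOWrapper ...> prefix in that message is consistent with a Python 2 style `print >> fh, ...` being mechanically converted so the file handle is printed instead of written to, which would also leave the BED files empty. That is an inference from the message format, not a confirmed diagnosis. A minimal illustration of the correct Python 3 form:

```python
# In Python 3, output is directed to a file handle with the file= keyword;
# print(fh, *fields) would instead emit the handle's repr, reproducing
# the "<_io.TextIOWrapper ...>" prefix seen in the warning above.
import io

fh = io.StringIO()                      # stands in for the open BED file
fields = ('chr1', 100, 200, 'ENSG00000084072')
print('\t'.join(map(str, fields)), file=fh)
print(fh.getvalue().strip())
```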
Where is get_dnase.sh?
It may seem strange for a user in this space to lack bedtools, but it can happen...
https://www.dropbox.com/s/h1cqokbr8vjj5wc/encode_roadmap.bed.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap.bed.gz (encode_roadmap.bed.gz.gz): No such file or directory
https://www.dropbox.com/s/8g3kc0ai9ir5d15/encode_roadmap_act.txt.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap_act.txt.gz (encode_roadmap_act.txt.gz.gz): No such file or directory
/bin/sh: bedtools: command not found
install_data.py moves along but eventually ...
https://www.dropbox.com/s/h1cqokbr8vjj5wc/encode_roadmap.bed.gz: HTTPS support not compiled in.
gunzip: can't stat: encode_roadmap.bed.gz (encode_roadmap.bed.gz.gz): No such file or directory
https://www.dropbox.com/s/8g3kc0ai9ir5d15/encode_roadmap_act.txt.gz: HTTPS support not compiled in.
One way to solve this is to build wget from source with OpenSSL; links and details here:
https://coolestguidesontheplanet.com/install-and-configure-wget-on-os-x/
Hi!
I am trying to train Basset with my own data, but I have been running into several issues with the GPUs, so I decided to try the normal nodes on our cluster. However, it has been running for 4 days now, and I have far fewer sequences than the amount used in the paper. So I don't know whether something went wrong without killing the process and it is now just halted, or whether it really takes that long to run.
All the best,
Laura
I am trying to run basset_train with a fixed seed and I am not getting the same results when I run the same command multiple times.
basset_train.lua -rand 123 -seed cnn_3_best.th -cudnn -job pretrained_params.txt -stagnant_t 10 -save seed_test1 learn.h5
Epoch # 1 train loss = 1.022, valid loss = 1.448, AUC = 0.8550, time = 246s best!
basset_train.lua -rand 123 -seed cnn_3_best.th -cudnn -job pretrained_params.txt -stagnant_t 10 -save seed_test2 learn.h5
Epoch # 1 train loss = 1.021, valid loss = 1.517, AUC = 0.8501, time = 246s best!
Edit: fixing the seed works when running on CPU or with -cuda, but not with -cudnn. Similar behaviour was seen here: soumith/cudnn.torch#92
It seems like install_data.py -- in particular
Line 113 in 6ae86b8
seq_hdf5.py \
-c \
-t 71886 \
-v 70000 \
encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5
install_data.py passes on the same machine. If the "-r" option is not required, then perhaps disable it by default and make it an option at the top. Otherwise, please document the memory requirements. Thank you
Additional details:
I am wondering if the software can be used in nonhuman species? If not, can we train a model for other species? Many thanks!
So, I know that you've already bumped into the issue with exporting the LUA path aka:
export LUA_PATH="$BASSETDIR/src/?.lua;$LUA_PATH"
I'm still trying to figure out why it refuses to export it when placed in the startup script.
But this is the last stopping point in making an executable Jupyter notebook.
Hi again,
After happily finishing the training procedure (and with model testing running fine) ... I am running into trouble with basset_motifs.py:
I basically get a couple of these errors:
/home/ron/anaconda2/lib/python2.7/site-packages/weblogolib/logomath.py:114: FutureWarning: comparison to None
will result in an elementwise object comparison in the future.
if self._mean ==None:
before the script stops working, throwing (after 25 motif steps):
Traceback (most recent call last):
File "/home/ron/tools/Basset/src/basset_motifs.py", line 659, in
main()
File "/home/ron/tools/Basset/src/basset_motifs.py", line 125, in main
filter_possum(filter_weights[f,:,:], 'filter%d'%f, '%s/filter%d_possum.txt'%(options.out_dir,f), options.trim_filters)
File "/home/ron/tools/Basset/src/basset_motifs.py", line 558, in filter_possum
while np.max(param_matrix[:,trim_start]) - np.min(param_matrix[:,trim_start]) < trim_t:
IndexError: index 19 is out of bounds for axis 1 with size 19
My learned model files are still pretty small (68 MB before, 129 MB after decuda ...). Not sure whether that might be a problem or a hint at a faulty training run ...
Cheers,
Ron
Hi Dave,
I followed the instructions mentioned in the tutorial file: new_data_iso.ipynb
Then ran this:
$ basset_train.lua -job params.txt -save mg_dnase_cnn learn_mg_dnase.h5
(previously, learn_mg_dnase.bed, learn_mg_dnase_act.txt, learn_mg_dnase.fa and learn_mg_dnase.h5 were created as per the instructions with your scripts)
I get this error message : "ffi.lua:332: Cannot support reading float data with size = 2 bytes"
Question: Is this error related to the datatype in the .h5 file?
Could you please suggest something?
Thanks!
{}
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> output]
(1): nn.SpatialConvolution(4 -> 10, 10x1)
(2): nn.SpatialBatchNormalization
(3): nn.ReLU
(4): nn.Reshape(5910)
(5): nn.Linear(5910 -> 500)
(6): nn.BatchNormalization
(7): nn.ReLU
(8): nn.Linear(500 -> 9)
(9): nn.Sigmoid
}
/home/sam/softwares/torch/install/bin/luajit: ...e/sam/softwares/torch/install/share/lua/5.1/hdf5/ffi.lua:332: Cannot support reading float data with size = 2 bytes
stack traceback:
[C]: in function 'error'
...e/sam/softwares/torch/install/share/lua/5.1/hdf5/ffi.lua:332: in function '_getTorchType'
...m/softwares/torch/install/share/lua/5.1/hdf5/dataset.lua:88: in function 'getTensorFactory'
...m/softwares/torch/install/share/lua/5.1/hdf5/dataset.lua:138: in function 'partial'
/home/sam/softwares/basset/src/batcher.lua:31: in function 'next'
/home/sam/softwares/basset/src/convnet.lua:884: in function 'train_epoch'
/home/sam/softwares/basset/src/basset_train.lua:148: in main chunk
[C]: in function 'dofile'
...ares/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50
Hi David,
It seems the nucleotide order in the motif heat map is reversed. On my end they're printed as T, G, C, A from top to bottom. I've compared the weight matrix and the heat map matrix, and both suggest the correct order should be A, C, G, T from top to bottom. A, C, G, T is also the order in your tutorial outputs.
This involves ax.set_yticklabels('TGCA', rotation='horizontal') in basset_motifs.py, and ax_heat.yaxis.set_ticklabels('TGCA', rotation='horizontal') in basset_sat.py and basset_sat_vcf.py.
Best,
Dadi
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/trepl/init.lua:383: module 'convnet' not found:No LuaRocks module found for convnet
I followed and ran new_data_iso.ipynb, and was stuck on this error:
91317 learn_cd4.bed
85317 training sequences
3000 test sequences
3000 validation sequences
/root/torch/install/bin/lua: /root/torch/install/share/lua/5.2/trepl/init.lua:389: /root/torch/install/share/lua/5.2/hdf5/ffi.lua:56: expected align(#) on line 579
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
/usr/local/basset/basset_train.lua:3: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Hi,
I am trying to find what you used for the last-layer function: is it softmax or some other specific function? I tried to go through the code, but I couldn't find how you implemented the last layer.
Can you please help me with that, and also tell me where I can find the detailed implementation?
Thanks in advance
basset_motifs_predict.lua expects a CPU-trained model, making basset_motifs.py non-functional for GPU-trained models.
Converting all model tensors to double solves the problem; you may want to add this functionality.
Hi
in PlotRoc.py you imported
from stats import quantile
but I got an error. Where is the stats package coming from?
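The stats module appears to be a local utility module rather than a PyPI package. A drop-in quantile with linear interpolation, written under the assumption that this matches the original's behavior:

```python
# Sketch: a quantile function with linear interpolation between order
# statistics, intended as a stand-in for the missing stats.quantile.
def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) of values."""
    s = sorted(values)
    idx = q * (len(s) - 1)          # fractional position in sorted order
    lo = int(idx)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

print(quantile([1, 2, 3, 4], 0.5))   # 2.5
```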
I am trying to run basset_sat_vcf.py, but it returns errors.
Here is the command:
software/Basset-master/src/basset_sat_vcf.py -t 6 -o /sat model_default_1k/dnacnn_best.th Test_run/rs13336428.vcf
And the errors:
/opt/torch/install/bin/lua: /opt/torch/install/share/lua/5.2/nn/Container.lua:67:
In 13 module of nn.Sequential:
/opt/torch/install/share/lua/5.2/torch/Tensor.lua:466: Wrong size for view. Input size: 4x200x1x13. Output size: 4x4200
stack traceback:
[C]: in function 'error'
/opt/torch/install/share/lua/5.2/torch/Tensor.lua:466: in function 'view'
/opt/torch/install/share/lua/5.2/nn/Reshape.lua:46: in function </opt/torch/install/share/lua/5.2/nn/Reshape.lua:31>
[C]: in function 'xpcall'
/opt/torch/install/share/lua/5.2/nn/Container.lua:63: in function 'rethrowErrors'
/opt/torch/install/share/lua/5.2/nn/Sequential.lua:44: in function </opt/torch/install/share/lua/5.2/nn/Sequential.lua:41>
(...tail calls...)
/opt/Basset/src/convnet.lua:509: in function 'predict'
...software/Basset-master/src/../src/basset_sat_predict.lua:75: in main chunk
[C]: in function 'dofile'
/opt/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/opt/torch/install/share/lua/5.2/nn/Container.lua:67: in function 'rethrowErrors'
/opt/torch/install/share/lua/5.2/nn/Sequential.lua:44: in function </opt/torch/install/share/lua/5.2/nn/Sequential.lua:41>
(...tail calls...)
/opt/Basset/src/convnet.lua:509: in function 'predict'
...software/Basset-master/src/../src/basset_sat_predict.lua:75: in main chunk
[C]: in function 'dofile'
/opt/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
Traceback (most recent call last):
File "/data/gpei/software/Basset-master/src/basset_sat_vcf2.py", line 206, in <module>
main()
File "/data/gpei/software/Basset-master/src/basset_sat_vcf2.py", line 88, in main
hdf5_in = h5py.File(options.model_hdf5_file, 'r')
File "/data/gpei/software/python2.7_module/h5py/_hl/files.py", line 312, in __init__
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/data/gpei/software/python2.7_module/h5py/_hl/files.py", line 142, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = '/data/gpei/19_1.CNN/2.model_default_1k/4.sad/sat/model_out.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Can anyone help me?
Thanks!
Hello,
How can we assess the importance of motifs for specific classes? In the paper, you discuss that Basset is able to pull out motifs specific to cell types. However, when we look at the filters and their motifs, there is no way to tell which motifs are important for which classes, especially in a multi-class problem.
Hi David,
It seems torch-hdf5 is not very compatible with Basset. basset_motifs_predict.lua fails to write output. I tested the script line by line inside th interactive mode and hit the failure at local hdf_out = hdf5.open(opt.out_file, 'w'). Strangely, it didn't fail when reading in the hdf5 file (the validation sequences) in the previous steps. The error message is as follows:
/usr/dgXXX/torch/install/bin/luajit: /usr/dgXXX/torch/install/share/lua/5.1/hdf5/file.lua:10: HDF5File.__init() requires a fileID - perhaps you want HDF5File.create()?
stack traceback:
[C]: in function 'assert'
/usr/dgXXX/torch/install/share/lua/5.1/hdf5/file.lua:10: in function '__init'
/usr/dgXXX/torch/install/share/lua/5.1/torch/init.lua:91: in function </PHShome/dg520/torch/install/share/lua/5.1/torch/init.lua:87>
[C]: in function 'open'
./basset_motifs_predict.lua:57: in main chunk
[C]: in function 'dofile'
...gXXX/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405810
Actually I met the same error message in the training stage, but worked around it by reading Deepmind issue #81 and Basset issue #25. I made the following changes:
1. In the src folder of Basset, I downloaded https://github.com/davek44/torch-hdf5.git and ran luarocks make inside the torch-hdf5 folder.
2. I changed lua/5.1/hdf5/ffi.lua by replacing if maj[0] ~= 1 or min[0] ~= 8 then with if maj[0] ~= 1 or min[0] ~= 10 then, to make it work with HDF5 version 1.10.
3. I changed lua/5.1/hdf5/file.lua by adding fileID=tonumber(fileID) in openFunc, and changed lua/5.1/hdf5/group.lua by replacing .. self._groupID .. with .. tostring(self._groupID) .. in HDF5Group:__tostring(). These two changes overcame the missing-fileID error at the training stage.
I've also tried to install the hdf5-1.10 version, but then encountered the problem of 2-byte float incompatibility.
I would really appreciate it if you could help me out here. Thanks a lot!
Best,
Dadi
When I run get_dnase.sh:
mv egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak roadmap
rm -r egg2.wustl.edu
rmdir roadmap/hammock
The script runs rmdir roadmap/hammock, but hammock is under roadmap/narrowPeak/.
Is the script wrong, or did I do something wrong?
In addition to the dependencies noted in previous issues, we need to install hdf5 for Lua, and we need to install and configure Intel MKL for Torch to work 'out of the box'. Without MKL you can probably set some config flags for Torch, but I did not investigate this. However, we come a cropper without 'batcher':
%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/torch/install/share/lua/5.1/trepl/init.lua:384: module 'batcher' not found:No LuaRocks module found for batcher
Hunting around, we find repos for learning_torch, deepmind-atari, and ConvNet-torch that seem to include relevant modules that allow basset_train.lua to proceed further. However, we fail at line 99 of the training script:
%vjcair> basset_train.lua -job models/pretrained_params.txt -stagnant_t 10 encode_roadmap.h5
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
/Users/stvjc/torch/install/bin/luajit: /Users/stvjc/Research/BASSET/Basset/src/basset_train.lua:99: attempt to index global 'ConvNet' (a nil value)
Not sure how difficult this is (or if it's already possible), but it would be nice if Basset could handle compressed h5 files for training, prediction, etc. I tried decompressing on the fly and passing the input to basset_train.lua, but it didn't work. It would be a handy feature if possible, as some of the h5 files take up a large amount of disk space.
There is a stream of errors of this type:
%vjcair> ./install_dependencies.py
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/torch/rocks/master - Failed downloading https://raw.githubusercontent.com/torch/rocks/master/manifest - /Users/stvjc/.cache/luarocks/https___raw.githubusercontent.com_torch_rocks_master/manifest
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master - Failed downloading https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/manifest - /Users/stvjc/.cache/luarocks/https___raw.githubusercontent.com_rocks-moonscript-org_moonrocks-mirror_master/manifest
Such an error can be found on the web, and the proposed solution is to remove ~/.cache/luarocks, but this does not fix things in this case. We have errors such as:
Error: No results matching query were found.
fatal: destination path 'torch-hdf5' already exists and is not an empty directory.
Missing dependencies for hdf5:
totem
Error: Could not satisfy dependency: totem
To attempt to deal with totem, I used:
curl https://raw.githubusercontent.com/deepmind/torch-totem/master/rocks/totem-0-0.rockspec -o totem-0-0.rockspec
sudo luarocks install totem-0-0.rockspec
This seems to have worked.
Hi Dave,
When I run basset_train.lua -job pretrained_params.txt -stagnant_t 10 er.h5, I end up getting an ENUM(50331977) error. I have tried diagnosing it, but haven't had any luck. Below are the error logs:
{
conv_filter_sizes :
{
1 : 19
2 : 11
3 : 7
}
weight_norm : 7
momentum : 0.98
learning_rate : 0.002
hidden_units :
{
1 : 1000
2 : 1000
}
conv_filters :
{
1 : 300
2 : 200
3 : 200
}
hidden_dropouts :
{
1 : 0.3
2 : 0.3
}
pool_width :
{
1 : 3
2 : 4
3 : 4
}
}
seq_len: 600, filter_size: 19, pad_width: 18
seq_len: 200, filter_size: 11, pad_width: 10
seq_len: 50, filter_size: 7, pad_width: 6
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> output]
(1): nn.SpatialConvolution(4 -> 300, 19x1, 1,1, 9,0)
(2): nn.SpatialBatchNormalization (4D) (300)
(3): nn.ReLU
(4): nn.SpatialMaxPooling(3x1, 3,1)
(5): nn.SpatialConvolution(300 -> 200, 11x1, 1,1, 5,0)
(6): nn.SpatialBatchNormalization (4D) (200)
(7): nn.ReLU
(8): nn.SpatialMaxPooling(4x1, 4,1)
(9): nn.SpatialConvolution(200 -> 200, 7x1, 1,1, 3,0)
(10): nn.SpatialBatchNormalization (4D) (200)
(11): nn.ReLU
(12): nn.SpatialMaxPooling(4x1, 4,1)
(13): nn.Reshape(2600)
(14): nn.Linear(2600 -> 1000)
(15): nn.BatchNormalization (2D) (1000)
(16): nn.ReLU
(17): nn.Dropout(0.300000)
(18): nn.Linear(1000 -> 1000)
(19): nn.BatchNormalization (2D) (1000)
(20): nn.ReLU
(21): nn.Dropout(0.300000)
(22): nn.Linear(1000 -> 164)
(23): nn.Sigmoid
}
/home/hugheslab2/zainmunirpatel/torch/install/bin/luajit: .../zainmunirpatel/torch/install/share/lua/5.1/hdf5/ffi.lua:335: Reading data of class ENUM(50331977) is unsupported
stack traceback:
[C]: in function 'error'
.../zainmunirpatel/torch/install/share/lua/5.1/hdf5/ffi.lua:335: in function '_getTorchType'
...nmunirpatel/torch/install/share/lua/5.1/hdf5/dataset.lua:88: in function 'getTensorFactory'
...nmunirpatel/torch/install/share/lua/5.1/hdf5/dataset.lua:138: in function 'partial'
/home/hugheslab2/zainmunirpatel/Basset/src/batcher.lua:39: in function 'next'
/home/hugheslab2/zainmunirpatel/Basset/src/convnet.lua:1009: in function 'train_epoch'
...e/hugheslab2/zainmunirpatel/Basset//src/basset_train.lua:156: in main chunk
[C]: in function 'dofile'
...atel/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00406460
Epoch # 1 [zainmunirpatel@bc2 data]$
Thanks for the help!
Best,
Zain
Hey,
So the output file names of basset_sat_vcf often include ">", which breaks things. Just gotta add replace(">","_").
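A sketch of that fix, sanitizing header-derived names before they become file names; the exact set of characters to allow is an assumption:

```python
# Sketch: replace anything outside [A-Za-z0-9_.-] so FASTA-header
# characters like '>' or ':' cannot break file creation.
import re

def safe_filename(name):
    return re.sub(r'[^\w.-]', '_', name)

print(safe_filename('>rs123_chr7:1000'))   # _rs123_chr7_1000
```

This is slightly more general than replace(">","_"), since ':' and other shell-unfriendly characters in headers get the same treatment.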
Hi,
I am trying to implement the model following the tutorials. When I try to convert the BED file to FASTA format using getfasta, I get the error "Feature xxx beyond the length of xxx size (8 bp). Skipping." I read that one solution is to make use of a contig (chromosome sizes) file, which I do not think is provided.
Any tips/guidance how to solve this?
Best,
Mahmoud
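One workaround, sketched under the assumption that the out-of-range features can simply be clamped: trim each BED interval to its chromosome length (obtainable from the FASTA's .fai index) before running getfasta:

```python
# Sketch: clamp BED intervals to [0, chrom_length) and drop any interval
# that becomes empty, so getfasta never sees out-of-range features.
def clamp_bed(intervals, chrom_sizes):
    kept = []
    for chrom, start, end in intervals:
        size = chrom_sizes[chrom]
        start, end = max(0, start), min(end, size)
        if start < end:
            kept.append((chrom, start, end))
    return kept

sizes = {'chrM': 16571}
print(clamp_bed([('chrM', 16400, 16900), ('chrM', 16600, 16900)], sizes))
```

The first interval is trimmed to the chromosome end; the second lies entirely beyond it and is dropped, matching what getfasta's "Skipping" would otherwise do silently.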
Hi,
Love the package; I'm keen to try it out for myself. Just wanted to point out that it doesn't seem to play well with Python 3. Some errors are easily fixed by running 2to3 on the relevant .py files, but some are not. For example, try running install_data.py using Anaconda Python 3. I get the following:
[zamparol@gpu-1-14 Basset]$ python install_data.py -r
[edited for brevity]
Traceback (most recent call last):
File "/cbio/cllab/nobackup/zamparol/Basset/src/seq_hdf5.py", line 130, in <module>
main()
File "/cbio/cllab/nobackup/zamparol/Basset/src/seq_hdf5.py", line 46, in main
seqs, targets = dna_io.load_data_1hot(fasta_file, targets_file, extend_len=options.extend_length, mean_norm=False, whiten=False, permute=False, sort=False)
File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 293, in load_data_1hot
seq_vecs = hash_sequences_1hot(fasta_file, extend_len)
File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 267, in hash_sequences_1hot
seq_vecs[header] = dna_one_hot(seq, seq_len)
File "/cbio/cllab/nobackup/zamparol/Basset/src/dna_io.py", line 137, in dna_one_hot
seq = seq[seq_trim:seq_trim+seq_len]
TypeError: slice indices must be integers or None or have an __index__ method
The same script seems to succeed using Anaconda Python 2.7.1 (though I can't be sure; the seq_hdf5.py step takes a while to complete). I'll use that for my purposes, but maybe you should update the README to explicitly say Python 2 is required?
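That TypeError most likely comes from Python 3's true division: an expression like (len(seq) - seq_len) / 2 now yields a float, which cannot be used as a slice index. A sketch of the floor-division fix for the dna_io.py pattern (simplified, not the file's full code):

```python
# Sketch: '//' keeps the trim offset an integer, restoring the Python 2
# behavior that dna_one_hot's centered slice relies on.
def center_trim(seq, seq_len):
    seq_trim = (len(seq) - seq_len) // 2
    return seq[seq_trim:seq_trim + seq_len]

print(center_trim('AAACGTAAA', 3))   # CGT
```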