pnpnpn / dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
License: MIT License
We are trying to run dna2vec against a large database (the NCBI nt dataset), which has ~47M sequences in it. Do you know of any issues with doing this (aside from it taking a really long time)?
The PROGRESS messages report that we are on sentence 105M, but we are still on epoch #1. Based on the progress messages, I think we should be on epoch 3.
Do you have any thoughts on why this might be the case?
Is there already an implemented function to encode longer sequences (such as sequencing reads) using their k-mer embeddings?
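A minimal sketch of one common approach, not a function confirmed to exist in dna2vec: split the read into overlapping k-mers and average their vectors. The names `kmers` and `embed_read` and the `lookup` dictionary are hypothetical; in practice the mapping would come from whatever trained k-mer vectors are loaded.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_read(seq, k, lookup):
    """Average the vectors of the read's k-mers (hypothetical helper).

    lookup: dict mapping k-mer string -> list of floats. This is a toy
    stand-in for trained embeddings (an assumption, not dna2vec's API).
    K-mers missing from lookup are skipped.
    Returns None if no k-mer of the read has a vector.
    """
    vectors = [lookup[km] for km in kmers(seq.upper(), k) if km in lookup]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


# Toy 2-dimensional vectors, for illustration only.
toy = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
read_vector = embed_read("ACGT", 3, toy)
```

Averaging discards the order of the k-mers, so it is only a baseline; whether it is good enough depends on the downstream task.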
Description:
python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
Then:
File "./scripts/train_dna2vec.py", line 55, in write_vec
self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'
Exception:
Traceback (most recent call last):
File "./scripts/train_dna2vec.py", line 142, in <module>
main()
File "./scripts/train_dna2vec.py", line 139, in main
run_main(args, inputs, out_fileroot)
File "./scripts/train_dna2vec.py", line 88, in run_main
learner.write_vec()
File "./scripts/train_dna2vec.py", line 55, in write_vec
self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'
Environment (installed using pip install -r requirements.txt):
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
_openmp_mutex 4.5 2_gnu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
arrow 0.8.0 pypi_0 pypi
biopython 1.68 pypi_0 pypi
boto 2.46.1 pypi_0 pypi
bz2file 0.98 pypi_0 pypi
bzip2 1.0.8 h7f98852_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates 2022.6.15 ha878542_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi 2022.6.15 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
configargparse 0.11.0 pypi_0 pypi
gensim 0.13.2 pypi_0 pypi
idna 2.7 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi 3.4.2 h7f98852_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng 12.1.0 h8d9b700_16 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp 12.1.0 h8d9b700_16 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl 2.0.0 h7f98852_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libstdcxx-ng 12.1.0 ha89aaad_16 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid 2.32.1 h7f98852_1000 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib 1.2.12 h166bdaf_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
logbook 1.0.0 pypi_0 pypi
ncurses 6.3 h27087fc_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy 1.16.0 pypi_0 pypi
openssl 1.1.1p h166bdaf_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pep8 1.7.0 pypi_0 pypi
pip 21.2.4 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pluggy 0.4.0 pypi_0 pypi
py 1.4.33 pypi_0 pypi
pytest 3.0.7 pypi_0 pypi
python 3.6.15 hb7a2778_0_cpython https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil 2.6.0 pypi_0 pypi
readline 8.1.2 h0f457ee_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
requests 2.20.0 pypi_0 pypi
scipy 0.19.0 pypi_0 pypi
setuptools 36.4.0 py36_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
six 1.10.0 pypi_0 pypi
smart-open 1.5.1 pypi_0 pypi
sqlite 3.39.0 h4ff8645_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tk 8.6.12 h27826a3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tox 2.7.0 pypi_0 pypi
tox-pyenv 1.0.3 pypi_0 pypi
tzdata 2022a h191b570_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3 1.24.3 pypi_0 pypi
virtualenv 15.1.0 pypi_0 pypi
wheel 0.37.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz 5.2.5 h516909a_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
zlib 1.2.12 h166bdaf_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
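The error suggests that the installed gensim (0.13.2 in the list above) does not provide the `wv` attribute that the script expects. A minimal sketch of a version-tolerant save, assuming that on such versions the model object itself exposes `save_word2vec_format`; the helper name `save_vectors` is hypothetical and is not part of dna2vec:

```python
def save_vectors(model, out_filename):
    """Save vectors in word2vec text format, with or without `model.wv`.

    Assumption: when `model.wv` is missing, the model object itself
    provides save_word2vec_format with the same arguments.
    """
    # Use model.wv when it exists, otherwise fall back to the model.
    target = getattr(model, "wv", model)
    target.save_word2vec_format(out_filename, binary=False)
    return target
```

The alternative is to install a gensim version that provides `wv`, which avoids changing the script.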
Hi,
I would like to know what parameters I should use in order to get the vector representation of a string of length 45.
At the moment I cannot go beyond 25.
dna2vec/dna2vec/multi_k_model.py
Line 17 in 8d033e9
I want to use dna2vec for the E. coli genome.
When I set 2<=k<=8, I got (86479,100); when I set 3<=k<8, I got (86614,100), and the correct dimension should be (87360,100).
So I don't know why I got 2 different results.
I also checked every k-mer length from 2 to 8, and I found the dimension is correct from 2 to 7.
However, for k=8, the dimension is (64450,100) rather than (65536,100).
This is horrible! There is nowhere to match.
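For reference, the number of possible k-mers over A, C, G, T is the sum of 4^k over the chosen k range. The sketch below only reproduces that arithmetic; the function name is hypothetical. A trained vocabulary can be smaller than this upper bound if some k-mers never occur in the training sequences, which may explain the k=8 count, though that has not been checked against this genome.

```python
def expected_kmer_count(k_low, k_high, alphabet_size=4):
    """Count all possible k-mers for k_low <= k <= k_high (inclusive)."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))


count_3_to_8 = expected_kmer_count(3, 8)  # 87360
count_2_to_8 = expected_kmer_count(2, 8)  # 87376
count_8_only = expected_kmer_count(8, 8)  # 65536
```

Note that 87360 corresponds to 3<=k<=8, while 2<=k<=8 would give 87376 as the upper bound.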
$ python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
Traceback (most recent call last):
File "./scripts/train_dna2vec.py", line 12, in <module>
from attic_util.time_benchmark import Benchmark
ImportError: No module named 'attic_util'
This was executed from ~/dna2vec.
The reason is that, in train_dna2vec.py, the relative paths to attic_util and dna2vec are appended to sys.path. Python resolves the '../' relative to the folder the script was called from, not the folder the script lives in.
The workaround is easy: just call the script from within ./scripts.
For a cleaner implementation, though, it might be better to consider using an egg or some other setup that allows attic_util and dna2vec to be imported from elsewhere.
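A lighter-weight alternative to packaging is to resolve the path from the script's own location instead of the current working directory. This is a sketch under the assumption that attic_util and dna2vec sit one level above the scripts folder, as the '../' in the report implies; both helper names are hypothetical.

```python
import os
import sys


def repo_root_from(script_path):
    """Return the parent of the directory that contains script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.abspath(os.path.join(script_dir, os.pardir))


def add_repo_root_to_sys_path(script_path):
    """Prepend the repository root to sys.path if it is not already there."""
    root = repo_root_from(script_path)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling `add_repo_root_to_sys_path(__file__)` at the top of the script, before the imports that fail, would make the imports independent of the folder the script is launched from.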
scripts/train_dna2vec.py line 55
How can I train word embeddings with fewer than 100 dimensions?
Do you have pretrained vectors for mm10?
Thanks
Hi,
What genome/sequence was the pretraining set done on? Can you make this available? I am running some initial experiments and would rather not lose time to training dna2vec for my proof of concept.
Thank you!
Nothing happens. The script ends after just a moment, and there is no result file in the directory. I downloaded the .fa files and extracted them to the input folder, and I installed the environment as instructed. What is the most likely cause?
Hi,
Can you please make it explicit how to obtain a pre-image from a mapped vector?
Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?
Best wishes
Increasing the embedding vector from 100 to 200
Hello, is there a way to increase the vector size?
For example, by filling dimensions 100 to 200 with zeros?
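Zero-padding itself is straightforward; a minimal sketch with a hypothetical helper name, using plain Python lists rather than any dna2vec API:

```python
def zero_pad(vector, new_dim):
    """Return a copy of vector extended with zeros up to new_dim entries."""
    if new_dim < len(vector):
        raise ValueError("new_dim must be at least the current dimension")
    return list(vector) + [0.0] * (new_dim - len(vector))


# Toy example: pad a 3-dimensional vector to 6 dimensions.
padded = zero_pad([0.2, -1.0, 0.5], 6)
```

The padded entries carry no information: dot products and cosine similarities between padded vectors are the same as between the originals. Padding is therefore only useful for matching an input shape that something else expects; learning 200 informative dimensions would require retraining with a larger vector size.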