dna2vec's People

Contributors

aldro61, dependabot[bot], pnpnpn

dna2vec's Issues

dna2vec against large dataset

We are trying to run dna2vec against a large database (the NCBI nt dataset), which contains ~47M sequences. Do you know of any issues with doing this (aside from it taking a really long time)?

The PROGRESS messages report that we are on sentence 105M, yet we are still on epoch #1. Based on the progress messages, I think we should be on epoch 3.

Do you have any thoughts on why this might be the case?
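
The expectation of being on epoch 3 can be checked with a short calculation. This is only a sketch: it assumes one training sentence per input sequence, which the issue implies but which dna2vec may not actually do (if it emits several sentences per sequence, the epoch boundary moves accordingly). The variable names are illustrative.

```python
# ASSUMPTION (not confirmed by dna2vec): one "sentence" per input sequence,
# so a single epoch covers roughly 47 million sentences.
sentences_per_epoch = 47_000_000   # ~47M sequences in the dataset
sentences_done = 105_000_000       # value reported by the PROGRESS message

# Number of full passes already finished, and the pass currently running.
completed_epochs = sentences_done // sentences_per_epoch
current_epoch = completed_epochs + 1

print(completed_epochs, current_epoch)
```

Under that assumption two full passes are complete and the third is in progress, which matches the expectation stated in the issue.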

Encoding longer sequences

Is there an already-implemented function to encode longer sequences (such as sequencing reads) using their k-mer embeddings?
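
One common way to approach this, sketched below, is to slide a window over the read, look up each k-mer's vector, and average them. This is not confirmed to be part of dna2vec; the function names `kmers` and `embed_sequence` and the toy `toy_vectors` table are hypothetical and stand in for whatever lookup the trained model provides.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_sequence(seq, k, vectors):
    """Average the vectors of every k-mer of seq that is present in `vectors`.

    `vectors` is a plain dict mapping k-mer -> list of floats (hypothetical
    stand-in for a trained embedding table). Unknown k-mers are skipped.
    """
    found = [vectors[km] for km in kmers(seq, k) if km in vectors]
    if not found:
        raise ValueError("no k-mer of the sequence has an embedding")
    dim = len(found[0])
    return [sum(v[d] for v in found) / len(found) for d in range(dim)]


# Toy 2-dimensional embedding table, for illustration only.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
read_vector = embed_sequence("ACGT", 3, toy_vectors)
print(read_vector)
```

Averaging discards k-mer order, so it is a simple baseline rather than a faithful encoding of the read.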

AttributeError: 'Word2Vec' object has no attribute 'wv'

Description:

python3 ./scripts/train_dna2vec.py -c configs/small_example.yml

Then:

File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

Exception:

Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 142, in <module>
    main()
  File "./scripts/train_dna2vec.py", line 139, in main
    run_main(args, inputs, out_fileroot)
  File "./scripts/train_dna2vec.py", line 88, in run_main
    learner.write_vec()
  File "./scripts/train_dna2vec.py", line 55, in write_vec
    self.model.wv.save_word2vec_format(out_filename, binary=False)
AttributeError: 'Word2Vec' object has no attribute 'wv'

env: using pip install -r requirements.txt

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
_openmp_mutex             4.5                       2_gnu    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
arrow                     0.8.0                    pypi_0    pypi
biopython                 1.68                     pypi_0    pypi
boto                      2.46.1                   pypi_0    pypi
bz2file                   0.98                     pypi_0    pypi
bzip2                     1.0.8                h7f98852_4    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates           2022.6.15            ha878542_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi                   2022.6.15                pypi_0    pypi
chardet                   3.0.4                    pypi_0    pypi
configargparse            0.11.0                   pypi_0    pypi
gensim                    0.13.2                   pypi_0    pypi
idna                      2.7                      pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi                    3.4.2                h7f98852_5    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp                   12.1.0              h8d9b700_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl                    2.0.0                h7f98852_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid                   2.32.1            h7f98852_1000    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib                   1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
logbook                   1.0.0                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.16.0                   pypi_0    pypi
openssl                   1.1.1p               h166bdaf_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pep8                      1.7.0                    pypi_0    pypi
pip                       21.2.4             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pluggy                    0.4.0                    pypi_0    pypi
py                        1.4.33                   pypi_0    pypi
pytest                    3.0.7                    pypi_0    pypi
python                    3.6.15          hb7a2778_0_cpython    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil           2.6.0                    pypi_0    pypi
readline                  8.1.2                h0f457ee_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
requests                  2.20.0                   pypi_0    pypi
scipy                     0.19.0                   pypi_0    pypi
setuptools                36.4.0                   py36_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
six                       1.10.0                   pypi_0    pypi
smart-open                1.5.1                    pypi_0    pypi
sqlite                    3.39.0               h4ff8645_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tk                        8.6.12               h27826a3_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tox                       2.7.0                    pypi_0    pypi
tox-pyenv                 1.0.3                    pypi_0    pypi
tzdata                    2022a                h191b570_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3                   1.24.3                   pypi_0    pypi
virtualenv                15.1.0                   pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xz                        5.2.5                h516909a_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
zlib                      1.2.12               h166bdaf_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
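
The environment above lists gensim 0.13.2, and the traceback shows that its Word2Vec object has no `wv` attribute. A possible workaround, sketched here under the assumption that this older release exposes `save_word2vec_format` directly on the model, is to fall back to the model itself when `wv` is missing. The helper `save_vectors` and the stub classes are hypothetical and exist only to make the example runnable without gensim.

```python
def save_vectors(model, out_filename):
    """Write vectors in word2vec text format on both old and new gensim.

    Newer gensim keeps the vectors on `model.wv`; the traceback shows the
    installed version has no such attribute, so fall back to the model.
    ASSUMPTION: the older model object has save_word2vec_format itself.
    """
    target = getattr(model, "wv", model)
    target.save_word2vec_format(out_filename, binary=False)
    return target


# Hypothetical stand-ins for the two API shapes (not real gensim classes).
class OldStyleModel:
    def __init__(self):
        self.calls = []

    def save_word2vec_format(self, fname, binary=False):
        self.calls.append((fname, binary))


class NewStyleModel:
    def __init__(self):
        self.wv = OldStyleModel()


old_model = OldStyleModel()
new_model = NewStyleModel()
used_old = save_vectors(old_model, "old.w2v")
used_new = save_vectors(new_model, "new.w2v")
```

Upgrading gensim to a release that provides `wv` would be the alternative, though other pinned versions in requirements.txt may then need attention.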

Length of the longest string I can encode

Hi,
I would like to know what parameters I should use in order to get the vector representation of a string of length 45.
At the moment I can't go beyond 25.

Incorrect embedding dimension after training

I want to use dna2vec for the E. coli genome.
When I set 2<=k<=8, I got (86479,100);
when I set 3<=k<=8, I got (86614,100), whereas the correct dimension should be (87360,100), since $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$.
So I don't know why I got 2 different results.
I also checked every k-mer length from 2 to 8 separately, and found that the dimension is correct from 2 to 7.
However, for k=8 the dimension is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$.
This is horrible! The missing counts do not match anywhere.
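
The expected sizes in this issue follow from counting all 4^k possible k-mers for each k in the range. The sketch below recomputes them and the gap to each reported row count; `expected_vocab_size` is a hypothetical helper, and the observed numbers are copied from the report above.

```python
def expected_vocab_size(k_low, k_high):
    """Number of distinct DNA k-mers over {A, C, G, T} for k_low <= k <= k_high."""
    return sum(4 ** k for k in range(k_low, k_high + 1))


# Row counts reported in the issue, keyed by (k_low, k_high).
observed = {(2, 8): 86479, (3, 8): 86614, (8, 8): 64450}

shortfall = {}
for (k_low, k_high), rows in observed.items():
    expected = expected_vocab_size(k_low, k_high)
    shortfall[(k_low, k_high)] = expected - rows
    print(k_low, k_high, expected, rows, expected - rows)
```

The three runs are short by 897, 746 and 1086 rows respectively, so the gaps indeed differ between runs. One possible but unconfirmed explanation is that some k-mers never occur, or occur too rarely, in the particular training sample and are dropped from the vocabulary.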

installation/training fails unless run from scripts folder

$ python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
Traceback (most recent call last):
  File "./scripts/train_dna2vec.py", line 12, in <module>
    from attic_util.time_benchmark import Benchmark
ImportError: No module named 'attic_util'

This is executed from ~/dna2vec.

The reason for this is that in train_dna2vec.py the relative paths to attic_util and dna2vec are appended to sys.path. Python resolves the '../' relative to the folder the script was called from, not the folder the script lives in.

The workaround is easy: just call the script from within ./scripts.

For a cleaner implementation, though, it might be better to consider using an egg or some other setup that allows attic_util and dna2vec to be imported from elsewhere.
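
Short of packaging, a lighter fix is to build the path from the script's own location instead of the current working directory. The sketch below assumes attic_util and dna2vec sit one level above the scripts folder, as the '../' in the issue suggests; `repo_root_for` and `add_repo_root_to_path` are hypothetical helper names, not code from the repository.

```python
import os
import sys


def repo_root_for(script_path):
    """Return the directory one level above the folder containing script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.dirname(script_dir)


def add_repo_root_to_path(script_path):
    """Append the repository root to sys.path once and return it.

    ASSUMPTION: the importable packages live in the parent of the
    script's folder, independent of where the script is launched from.
    """
    root = repo_root_for(script_path)
    if root not in sys.path:
        sys.path.append(root)
    return root


# Example with a POSIX-style absolute path; inside the script itself one
# would pass __file__ instead.
example_root = repo_root_for("/home/user/dna2vec/scripts/train_dna2vec.py")
print(example_root)
```

Calling `add_repo_root_to_path(__file__)` at the top of the script, before the imports that fail, would make the script runnable from any working directory.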

mm10

Do you have pretrained vectors for mm10?

Thanks

Pretrained set?

Hi,
What genome/sequence was the pretrained set trained on? Can you make it available? I am running some initial experiments and would rather not lose time training dna2vec for my proof of concept.

Thank you!

Pre-image / component mapping?

Hi,
Can you please make it explicit how to obtain a pre-image from a mapped vector?
Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?

Best wishes
